Data Types
Lesson 1

Storing numbers and text

In this module we will discuss how a computer stores the numbers and text that your programs work with.
After completing the module you will have the skills and knowledge necessary to:
  1. Explain how numbers are stored in binary and hexadecimal form
  2. Convert numbers between binary and decimal form
  3. Convert numbers between binary and hexadecimal form
  4. Explain how signed integers are stored in two's complement form
  5. Convert signed integers between decimal and two's complement form
  6. Explain how real numbers are stored in floating-point form
  7. Explain how characters are stored using ASCII and Unicode

When do we use bits, bytes, and words in computer science?

In computer science, bits, bytes, and words are commonly used units of information, each serving different purposes based on the context:
  1. Bit:
    • A bit is the smallest unit of data in computing and digital communications, representing a binary digit, either 0 or 1. Bits are fundamental in computer processing, as they are the basis of all computer data.
    • Bits are used extensively in areas such as digital signal processing, data transmission, and storage where data needs to be represented efficiently at the most granular level. Encryption and error detection algorithms also often work directly with bits.
  2. Byte:
    • A byte is a unit of digital information that commonly consists of eight bits. A byte can represent a single character, such as a letter, digit, or punctuation mark in text data.
    • Bytes are the standard unit for measuring file sizes and data storage in computers. Operating systems and applications typically refer to file sizes and space in bytes, with larger units like kilobytes (KB), megabytes (MB), gigabytes (GB), and terabytes (TB) being multiples of bytes.
  3. Word:
    • In computer architecture, a word is a group of bytes that a processor is designed to handle as a unit. The size of a word typically depends on the architecture, commonly being 16, 32, or 64 bits.
    • Words are used in the context of processor operations and memory management. Instructions, addressing, and the organization of memory are often word-oriented. This means that a processor retrieves and processes data in word-sized chunks, which impacts performance and efficiency.

Each of these units plays a crucial role in defining how data is processed, stored, and transmitted in computer systems.
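The relationship between the three units can be sketched in a few lines of Python. This is only an illustration: the helper names `bits_of` and `word_to_bytes` are hypothetical, and the 16-bit default word size is an assumption chosen to keep the example small.

```python
# Illustrative sketch; helper names and the 16-bit default are assumptions.
BITS_PER_BYTE = 8


def bits_of(value, width):
    """Return value as a string of binary digits, zero-padded to width bits."""
    return format(value, f"0{width}b")


def word_to_bytes(word, word_bits=16):
    """Split a word into its bytes, most significant byte first."""
    count = word_bits // BITS_PER_BYTE
    return [(word >> (BITS_PER_BYTE * i)) & 0xFF for i in reversed(range(count))]


one_byte = 0b01000001   # eight bits, decimal 65
one_word = 0x1234       # a 16-bit word made of two bytes

print(bits_of(one_byte, 8))      # 01000001
print(word_to_bytes(one_word))   # [18, 52]
```

Passing `word_bits=32` or `word_bits=64` to `word_to_bytes` shows how the same value would be divided on a machine with a larger word.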


How does a Computer store Numbers and Text?

An early convention for representing text was ASCII (American Standard Code for Information Interchange), which assigned each character on a standard typewriter a number that fits in 7 bits (that is, a value from 0 to 127). Capital 'A' is 65, capital 'B' is 66, lowercase 'a' is 97, lowercase 'b' is 98, and so on. This system works well for English text, but it does not include the accented characters needed in other European languages, and it certainly does not include the thousands of characters and symbols found in Chinese, Korean, and many other languages. To cover them, Unicode was created in the late 1980s as a much larger character set, and it still includes the original 128 ASCII characters as a very small subset. A capital 'A' is therefore still represented by the number 65, just as it has been for more than 50 years.

In a more technical sense, computers do not really store numbers either. A computer is a collection of billions of transistors, each at either a high or a low voltage. We group these two-state elements into units of 8, 16, 32, 64, or more bits and interpret each group as a binary number with that many bits. Chaining groups together lets a program represent values and sequences of nearly any size, limited only by available memory, which is why computers can store text of any length, sequences of colors (pictures and videos), sequences of audio samples (music), and so on.
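The character-to-number mapping described above can be checked directly in Python. This is a minimal sketch: `code_points` and `utf8_length` are hypothetical helper names, and UTF-8 is used here as one common way of storing Unicode characters as bytes, which is an assumption rather than something this lesson prescribes.

```python
# Illustrative sketch; helper names and the choice of UTF-8 are assumptions.
def code_points(text):
    """Return the number assigned to each character in text."""
    return [ord(ch) for ch in text]


def utf8_length(ch):
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


print(code_points("ABab"))   # [65, 66, 97, 98]

# ASCII characters keep the same single-byte value in UTF-8.
print("A".encode("ascii") == "A".encode("utf-8"))   # True

# Characters outside ASCII need more than one byte:
# "\u00e9" is an accented e, "\u4e2d" is a Chinese character.
print(utf8_length("A"), utf8_length("\u00e9"), utf8_length("\u4e2d"))   # 1 2 3
```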
  • Data Storage in Computers: Numbers and Text in the Context of the PDP-11 Mini-computer: Computers fundamentally operate on binary data, using a series of electrical switches that can be either on (1) or off (0). The PDP-11 mini-computer, a marvel of its time, adheres to this basic principle. When programming on the PDP-11 using assembly language, it's essential to grasp how this system stores numbers and text.
    1. Binary Representation: At the heart of any computer, including the PDP-11, lies the binary system. All data, whether numbers, text, or anything else, is represented in binary form using bits (binary digits). A bit can hold one of two values: 0 or 1.
    2. Data Word and Byte:
      1. Word: The PDP-11 primarily operates on 16-bit words. That means each word consists of 16 individual bits.
      2. Byte: A byte consists of 8 bits. Given the PDP-11's 16-bit word design, a word can be split into two contiguous bytes.
    3. Storing Numbers:
      1. Integers: On the PDP-11, integers are stored using a binary representation. A 16-bit word can represent integers ranging from 0 to 65,535 in unsigned form. For signed integers, the PDP-11 uses Two's Complement notation, allowing representation of numbers from -32,768 to 32,767.
      2. Floating-Point Numbers: Floating-point arithmetic on the PDP-11 is more involved than integer arithmetic. Depending on the model, it relies on optional floating-point hardware instructions or on software routines.
    4. Storing Text:
      1. ASCII Encoding: Text is typically stored using the ASCII (American Standard Code for Information Interchange) encoding on the PDP-11. Each character is represented by a unique 7-bit binary number. Because a byte on the PDP-11 can store 8 bits, ASCII characters occupy 7 of those bits, often with the 8th bit set to 0.
      2. Strings of text are generally sequences of these ASCII-encoded characters stored in contiguous memory locations.
    5. Addressing Modes:
      1. The PDP-11 boasts a rich set of addressing modes, enabling flexible ways to reference memory locations and registers. When your assembly program operates on numbers or text, understanding these modes is crucial as they determine how the operand's address is computed.
      2. For instance, the "Register" mode uses the contents of a register as the operand, while the "Autoincrement" mode uses the content of a register as a pointer, fetches the operand from that memory address, and then increments the register.
    6. Memory and Registers:
      1. Main Memory: The PDP-11's main memory is where your program's data resides during execution. Depending on the specific PDP-11 model and configuration, the amount of available memory can vary.
      2. Registers: The PDP-11 features eight general-purpose registers (R0 to R7), each of which can hold data or an address. By convention R6 serves as the stack pointer and R7 as the program counter. Registers play a pivotal role in the execution of assembly programs because they offer much faster access to data than main memory.
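The 16-bit ranges and the word-to-byte split described in the list above can be sketched in Python. This is not PDP-11 code: the function names are hypothetical, and Python integers are masked to 16 bits to imitate a fixed-size word.

```python
# Illustrative sketch of a 16-bit word; all names here are assumptions.
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1   # 0xFFFF, the largest unsigned value (65,535)
SIGN_BIT = 1 << (WORD_BITS - 1)    # 0x8000, set for negative values


def to_twos_complement(value):
    """Encode a signed integer (-32768..32767) as a 16-bit pattern."""
    if not -SIGN_BIT <= value < SIGN_BIT:
        raise ValueError("value does not fit in a signed 16-bit word")
    return value & WORD_MASK


def from_twos_complement(pattern):
    """Decode a 16-bit pattern as a signed integer."""
    pattern &= WORD_MASK
    return pattern - (1 << WORD_BITS) if pattern & SIGN_BIT else pattern


def split_word(word):
    """Return the (high byte, low byte) pair that makes up a 16-bit word."""
    return (word >> 8) & 0xFF, word & 0xFF


print(to_twos_complement(-1))             # 65535
print(hex(to_twos_complement(-32768)))    # 0x8000
print(from_twos_complement(0xFFFE))       # -2
print(split_word(0x4142))                 # (65, 66)
```

The last line shows that a single word can hold two ASCII characters, here the codes for 'A' and 'B'.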
    In essence, the PDP-11, like all computers, utilizes a binary system to represent and store data. However, its specific architecture – encompassing 16-bit words, versatile addressing modes, and a blend of memory and registers – dictates the nuances of data storage and manipulation when programming in PDP-11 assembly language. Mastery of these fundamentals ensures efficient and effective programming on this iconic mini-computer.
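The Autoincrement mode and the storage of text in contiguous bytes can be imitated with a small simulation. This is a sketch under stated assumptions, not PDP-11 assembly: the 64-byte `memory`, the register dictionary, the start address 16, and the zero byte that marks the end of the string are all hypothetical choices made for illustration.

```python
# Illustrative simulation; memory size, addresses, and the zero terminator
# are assumptions, not details taken from the lesson.
memory = bytearray(64)                 # tiny byte-addressed memory
text = b"HI\x00"                       # ASCII bytes followed by a zero byte
memory[16:16 + len(text)] = text       # store the string at address 16


def fetch_byte_autoincrement(memory, registers, reg):
    """Use the register as a pointer, fetch one byte, then advance the pointer."""
    address = registers[reg]
    operand = memory[address]
    registers[reg] = address + 1       # byte operand: the pointer moves by one
    return operand


def read_string(memory, start):
    """Walk memory from start until a zero byte; return the text and final pointer."""
    registers = {"R0": start}
    chars = []
    while True:
        byte = fetch_byte_autoincrement(memory, registers, "R0")
        if byte == 0:
            break
        chars.append(chr(byte))
    return "".join(chars), registers["R0"]


print(read_string(memory, 16))   # ('HI', 19)
```

The final pointer value is 19 because the register is advanced after every fetch, including the fetch of the terminating zero byte.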
    In the next lesson we will examine how a computer stores non-negative integers such as 0, 23, and 318.


Elements of Machine Language needed by Low Level Programmers

Low-level programmers, such as those working with x86 assembly language, need to understand several aspects of machine language that high-level programmers, using languages like Java or C#, typically do not:
  1. Instruction Set Architecture (ISA): Low-level programmers must understand the specific instructions that the CPU can execute, including arithmetic operations, data movement, control flow, and system instructions. They need to know how to use these instructions to perform tasks.
  2. Registers: Low-level programmers must manage CPU registers directly. They need to know the purpose of different registers (e.g., general-purpose, segment, instruction pointer) and how to use them effectively to store temporary data and manage execution flow.
  3. Memory Addressing: Low-level programmers must understand different memory addressing modes (e.g., immediate, direct, indirect, indexed) and how to use them to access data stored in memory. They also need to manage stack and heap memory manually.
  4. Bitwise Operations: Low-level programming often requires manipulating data at the bit level, using operations like AND, OR, XOR, NOT, shifts, and rotates. These are crucial for tasks such as setting or clearing specific bits in a register or implementing certain algorithms.
  5. Instruction Execution Flow: Low-level programmers need to have a detailed understanding of how instructions are fetched, decoded, and executed by the CPU. This includes knowledge of pipelining, branching, and instruction timing.
  6. Interrupts and Exception Handling: Low-level programmers must understand how interrupts and exceptions are handled by the CPU, including how to write interrupt service routines (ISRs) and how to manage hardware interrupts.
  7. Assembly Language Syntax: Low-level programmers need to be proficient in the syntax and semantics of assembly language, which includes knowing the mnemonics for instructions, how to structure assembly programs, and how to use assemblers and linkers to create executable code.
  8. Hardware Interaction: Low-level programming often involves direct interaction with hardware components, such as I/O ports, timers, and other peripherals. This requires a detailed understanding of the hardware architecture and how to communicate with hardware devices.
  9. Optimization Techniques: Low-level programmers need to be adept at optimizing code for performance and size, understanding how different instructions and memory access patterns can affect the efficiency of the program.
  10. Debugging and Profiling: Debugging low-level code often requires specialized tools and techniques, such as using a debugger to step through assembly instructions, setting breakpoints at the machine code level, and analyzing core dumps.

In contrast, high-level programmers typically rely on abstracted, platform-independent constructs provided by the programming language and runtime environment, which shield them from the underlying machine-specific details.
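The bitwise operations in item 4 can be tried out without writing assembly. The sketch below uses Python on an assumed 8-bit value; the function names are hypothetical, and the explicit mask stands in for the fixed register width that real hardware enforces automatically.

```python
# Illustrative sketch on an 8-bit value; names and width are assumptions.
MASK8 = 0xFF


def set_bit(value, n):
    """Turn bit n on with OR."""
    return (value | (1 << n)) & MASK8


def clear_bit(value, n):
    """Turn bit n off with AND and an inverted mask."""
    return value & ~(1 << n) & MASK8


def toggle_bit(value, n):
    """Flip bit n with XOR."""
    return (value ^ (1 << n)) & MASK8


def is_bit_set(value, n):
    """Report whether bit n is on, using a shift and AND."""
    return (value >> n) & 1 == 1


def rotate_left8(value, n):
    """Rotate an 8-bit value left by n positions; bits leaving the top re-enter at the bottom."""
    n %= 8
    return ((value << n) | (value >> (8 - n))) & MASK8


flags = 0b00001010
print(set_bit(flags, 0))              # 11
print(clear_bit(flags, 1))            # 8
print(toggle_bit(flags, 3))           # 2
print(is_bit_set(flags, 3))           # True
print(rotate_left8(0b10010110, 1))    # 45
```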
