Lesson 7: Floating-point numbers
Objective: Explain what a floating-point number is.
Representation of Floating-Point Numbers in Computers: The IEEE Standard 754
In modern computing, the representation of real numbers, especially floating-point numbers, demands both precision and efficiency. To address this, the Institute of Electrical and Electronics Engineers (IEEE) introduced the IEEE Standard 754 for Floating-Point Arithmetic. This standard has been widely adopted by the computing industry and serves as the benchmark for floating-point computation in computer hardware, languages, and operating systems.
- Overview of IEEE Standard 754: The IEEE Standard 754 provides a comprehensive methodology for representing and computing floating-point numbers. It defines:
- Formats for representing floating-point numbers.
- Rounding rules and operations.
- Exception handling (e.g., handling of overflow, underflow, and NaN (Not a Number) situations).
- Representation of Numbers: The standard primarily defines two basic formats:
- Single Precision (32 bits): Comprising 1 bit for sign, 8 bits for the exponent, and 23 bits for the fraction.
- Double Precision (64 bits): Comprising 1 bit for sign, 11 bits for the exponent, and 52 bits for the fraction.
There are also extended formats, but single and double precision are the most commonly used.
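To make the single-precision layout concrete, here is a minimal Python sketch (the helper name `float_bits` is our own) that uses the standard `struct` module to reinterpret a float's four bytes as an integer and print its three fields:

```python
import struct

def float_bits(x: float) -> str:
    """Return the 32-bit single-precision bit pattern of x,
    split into the sign | exponent | fraction fields."""
    # Pack x as a big-endian 32-bit float, then read those 4 bytes
    # back as an unsigned integer so we can format them in binary.
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{n:032b}"
    return f"{bits[0]} | {bits[1:9]} | {bits[9:]}"

# Note: real hardware rounds to nearest, so the last fraction bit of 1/3
# comes out as 1 rather than the truncated 0.
print(float_bits(1/3))  # 0 | 01111101 | 01010101010101010101011
```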
- Application to Given Numbers:
- 1/3: This is a rational number but cannot be exactly represented in binary. IEEE 754 will provide an approximation.
- PI: An irrational number, it also cannot be exactly represented. In practice, a truncated or rounded version of its binary form is used.
- -1.23 x 10^35: This number would be represented using the sign bit, an exponent adjusted by a bias value, and a fraction derived from the number's mantissa.
- -2.6 x 10^-28: Similarly, this number would use the sign bit for its negative value, an appropriate (negative) exponent, and a fraction.
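To see that values such as 1/3 and PI are only approximated, Python's standard `decimal` module can print the exact binary value that a 64-bit double actually stores (a small sketch; any language that exposes the raw double would show the same thing):

```python
from decimal import Decimal
import math

# Decimal(x) converts the stored binary value exactly, making the
# approximation error visible in decimal form.
print(Decimal(1/3))      # exact stored value, slightly below 1/3
print(Decimal(math.pi))  # pi rounded to 53 significant bits
```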
Indeed, most contemporary computers and computing systems utilize the IEEE Standard 754 for representing and manipulating floating-point numbers. Its widespread adoption ensures consistency and predictability across platforms, making it a cornerstone in the realm of numerical computation.
To represent real numbers such as 1/3, PI, -1.23 * 10^35, and -2.6 * 10^-28, most computers use IEEE Standard 754 floating-point numbers. Using this representation, a real number is expressed as the product of a binary number greater than or equal to 1 and less than 2 (called the mantissa) multiplied by 2 raised to a binary-number exponent.
In practice, it is very unlikely that you will ever need to look at the binary form for the floating-point representation of a real number,
so we will just take a quick look at one example to give you the general idea. Single precision floating-point representation uses 32 bits.
Floating-point number: Used to represent a real number on a computer.

1 bit is used for the sign bit, 8 bits are used for the exponent, and 23 bits are used for the mantissa. Here's the 32-bit floating-point representation of the real number 1/3:
| Sign bit | Exponent | Mantissa |
|---|---|---|
| 0 | 01111101 | 01010101010101010101010 |
The sign bit is 0, indicating that this is a positive number.
The exponent field holds the binary representation of the decimal number 125. To obtain the actual exponent, we subtract the bias of 127, giving -2.
This bias allows actual exponents from -126 to +127 for normalized numbers (the extreme field values 0 and 255 are reserved for special cases such as zero, infinity, and NaN). Finally, the mantissa represents the binary number
1.01010101010101010101010.
Note that the leading 1 of the mantissa is implied, to provide an additional bit of precision.
The decimal value of the mantissa is:
2^0 + 2^-2 + 2^-4 + 2^-6 + ... + 2^-22
= 1 + 1/4 + 1/16 + 1/64 + ... + 1/4194304
This is approximately 1.3333333, and thus 1/3 is represented, using 32-bit floating-point representation, as approximately 1.3333333 * 2^-2, or 1.3333333 * 1/4.
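The decoding steps above can be sketched in Python (the function name `decode_float32` and the use of the standard `struct` module are our own choices, not part of the lesson):

```python
import struct

def decode_float32(x: float):
    """Split a single-precision float into its IEEE 754 fields and
    rebuild the value as (-1)^sign * mantissa * 2^exponent."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    sign = n >> 31                  # top bit
    biased_exp = (n >> 23) & 0xFF   # next 8 bits
    frac = n & 0x7FFFFF             # low 23 bits
    exponent = biased_exp - 127     # remove the exponent bias
    mantissa = 1 + frac / 2**23     # restore the implied leading 1
    value = (-1) ** sign * mantissa * 2.0 ** exponent
    return sign, exponent, mantissa, value

# For 1/3: sign 0, exponent -2, mantissa about 1.3333334
print(decode_float32(1/3))
```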
What's most important to remember about floating-point representation is that it allows you to represent a tremendous range of real numbers, but with limited precision.
Single precision (32-bit) floating-point numbers are accurate to about 7 decimal digits, and double precision (64-bit) floating-point numbers are accurate to about 15 decimal digits.
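The limited precision is easy to demonstrate by rounding a double to single precision and back (a sketch using the standard `struct` module; the helper name `to_float32` is our own):

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest single-precision (32-bit) value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

x = 1/3
print(f"{to_float32(x):.10f}")  # single precision: only ~7 digits match 1/3
print(f"{x:.20f}")              # double precision: ~15 digits match 1/3
```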
We have covered how a computer stores numbers. Next we will consider how text is stored.
Note that the 23 mantissa bits 01010101010101010101010 are not read as an ordinary binary integer (interpreted that way, the pattern would equal 2^21 + 2^19 + ... + 2^1 = 2,796,202). Instead, they are the fraction bits that follow the implied leading 1, so together they encode the binary number 1.01010101010101010101010, which is approximately 1.3333333.