Because computers can represent only a finite set of real numbers, when we ask a computer to store x, it actually stores the floating point representation, fl(x). When we ask for x+y, we get fl(x) ⊕ fl(y), where ⊕ is the computer's approximation of addition. As an example, consider the IEEE 754 floating point standard, which shows how computer arithmetic can keep mathematicians on their toes. (This standard is implemented by most systems.)
Bit representation Single precision uses a sign bit, 8 bits of exponent, and 23 bits of mantissa. This is called 24 bits of precision because the mantissa is assumed to start with an implied 1, except in special cases (such as zero).
To represent a number, write it in binary in scientific form:
(−1)^s × 1.mmmm… × 2^e. If −126 ≤ e ≤ 127, then
store s in the sign bit,
store e+127 in binary in the exponent,
and store mmmm, rounded to the 23 bits behind the point, in the
mantissa.
Special cases: Exponent fields of all ones with a zero mantissa represent ±∞ (depending on sign); non-zero mantissas represent NaNs, or ``Not-a-Number''s, which are used for error conditions. Exponent fields of all zeros are even stranger. A mantissa of zero represents ±0 (depending on sign); a non-zero mantissa m represents a denormal number ±0.mmmm… × 2^(−126), so that there are a lot of small numbers between 0 and 2^(−126) which are not included in other floating point representations. Denormal numbers are typically implemented in software, so they are slower by a couple of orders of magnitude.
Double precision uses a sign bit, 11 bits of exponent with excess 1023, and 52 bits of mantissa.
No other numbers are represented. Notice that floating-point numbers do not have uniform density. Half lie between −1 and 1; there are as many numbers between 1 and 2 as there are between 512 and 1024.
Rounding IEEE Floating point has four rounding modes. The default mode rounds to the nearest number, and, in case of ties, rounds to make the last bit 0. In gory detail, if the 24th bit is 0, chop after bit 23. If the 24th bit is 1 and some 1 bit follows, then add 1 to the 23rd bit. If the 24th bit is 1 and no following bit is 1, then add 1 to the 23rd bit iff the 23rd bit is 1. You can also select rounding toward +∞, −∞, or 0 (i.e., chopping).
Operations IEEE 754 sets a stringent requirement for mathematical operations: the result of a computed operation shall be the same as the correctly rounded result of the real operation on the input. E.g., fl(x) ⊕ fl(y) = fl(fl(x) + fl(y)).
W. Kahan of UC Berkeley was awarded the Turing Award for his work on floating point. For an interesting interview about the adoption of the standard, see http://www.cs.unc.edu/~snoeyink/c/c205/754story.html.