





Binary Fractional Numbers

Fractions present another number representation problem. How can one represent fractional quantities using bits?  

Representing fractions can be solved in the same way that positive powers of two represent integers: negative powers of two can be added up to approximate fractional quantities. The small negative powers of two are:

2^-1 = 0.5
2^-2 = 0.25
2^-3 = 0.125
2^-4 = 0.0625
2^-5 = 0.03125
2^-6 = 0.015625

Thus you can pick some part of a 32-bit computer word and decide where the binary point should be. Let's say we pick the middle of the word somewhere. This gives 1 bit of sign, 15 bits of integer, and 16 bits of fraction. Thus one can have numbers from about -32,768 to +32,767 with about 4½ decimal digits of fractional precision.
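
Here is a minimal sketch of this layout in C. It assumes the 16.16 split just described; the type name fix16, FRAC_BITS, and the helper functions are invented for this illustration, not part of any standard library.

#include <stdio.h>
#include <stdint.h>

/* A 16.16 fixed-point number: 1 sign bit, 15 integer bits, and 16 fraction
   bits, all held in an ordinary signed 32-bit two's-complement integer. */
typedef int32_t fix16;

#define FRAC_BITS 16

static fix16  to_fix16(double x)  { return (fix16)(x * (1 << FRAC_BITS)); }
static double from_fix16(fix16 f) { return (double)f / (1 << FRAC_BITS); }

int main(void)
{
    fix16 a = to_fix16(3.25);   /* 3.25 is 11.01 in binary              */
    fix16 b = to_fix16(-1.5);   /* negative values use two's complement */

    /* Because both binary points sit in the same place, ordinary integer
       addition adds the values correctly. */
    fix16 sum = a + b;

    printf("a + b = %g\n", from_fix16(sum));   /* prints 1.75 */
    return 0;
}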

If you use this technique you must convert numbers between the bit representation and the decimal representation correctly. Pascal and C do not provide this flexibility in conversion, but some other programming languages do. PL/I, for example, allows you to declare numeric variables with the number of bits and where the binary point should be located. PL/I then builds code to do the shifting and adjusting for you, and its input/output routines also work properly. However, one can always write conversion subroutines in other languages, so that fractional quantities can be read in and printed out correctly.
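
As a sketch of the kind of conversion subroutine mentioned above, the routine below prints a 16.16 fixed-point value in decimal using only integer arithmetic. The name print_fix16 and the 16.16 layout are assumptions carried over from the previous sketch.

#include <stdio.h>
#include <stdint.h>

/* Print a 16.16 fixed-point value with a given number of decimal places,
   using only integer arithmetic.  This is only a sketch: it truncates
   rather than rounds the last digit. */
static void print_fix16(int32_t f, int places)
{
    uint32_t frac;

    if (f < 0) { putchar('-'); f = -f; }
    printf("%d.", (int)(f >> 16));          /* integer part                */
    frac = (uint32_t)f & 0xFFFF;            /* the 16 fraction bits        */
    while (places-- > 0) {
        frac *= 10;                         /* push the next decimal digit */
        putchar('0' + (int)(frac >> 16));   /* above the binary point      */
        frac &= 0xFFFF;
    }
    putchar('\n');
}

int main(void)
{
    print_fix16((3 << 16) + 0x4000, 4);     /* 3.25 prints as 3.2500 */
    return 0;
}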

Note that the rules for binary arithmetic on these fractional numbers are the same as they are for integers, so no change in the hardware is needed to deal with binary fractions. That is, this fractional representation is isomorphic to the binary two's-complement integer representation that the machine uses. It only requires that you think of the numbers as fractions and that you understand and treat the binary point (the binary equivalent of the decimal point) properly. For example, you can only meaningfully add two numbers if the binary points are aligned. Of course, the computer will add them no matter where you think the binary point is. This is just like decimal arithmetic, where you must add two numbers with the decimal points aligned if you want the right answer.

One can shift the binary point of a number by multiplying or dividing by the proper power of two, just as one shifts the decimal point by multiplying or dividing by a power of ten. Most machines also provide arithmetic shift operations that shift the bit representation of the number right or left more quickly than a multiply instruction would. An arithmetic right shift of 3 is a binary division by 2^3, or 8; a left shift is a binary multiplication. Arithmetic right shifts usually copy the sign bit so that negative numbers stay negative. Arithmetic left shifts introduce zeros from the right side of the number. If the sign bit changes in an arithmetic left shift then the number has overflowed.
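
A short C illustration of shifts standing in for multiplication and division. Note that what C guarantees here is weaker than what most hardware does: the result of >> on a negative value is left to the implementation.

#include <stdio.h>

int main(void)
{
    int x = 13;
    int y = -40;

    printf("%d\n", x << 3);   /* left shift by 3 multiplies by 8: 104 */
    printf("%d\n", x >> 2);   /* right shift by 2 divides by 4: 3     */

    /* On most machines a right shift of a negative integer is an arithmetic
       shift that copies the sign bit, so this prints -5; the C standard,
       however, leaves the behaviour implementation-defined. */
    printf("%d\n", y >> 3);
    return 0;
}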

The binary point can be represented by counting the number of binary fractional digits in the number. Multiplication of a number with a binary point of 5 and one with a binary point of 3 will give a number with a binary point of 8. That is, binary numbers can be represented in general as having p binary digits, q of which are fraction digits. The result of binary arithmetic for A with (p1, q1) and B with (p2, q2) gives, in the worst case:

A + B or A - B:   q = max(q1, q2),   p = max(p1 - q1, p2 - q2) + max(q1, q2) + 1
A × B:            q = q1 + q2,       p = p1 + p2
A ÷ B:            the quotient can need arbitrarily many fraction digits, so the programmer must choose q

One reason that this scaled integer arithmetic is not popular is that unless the program scales numbers, using multiplication and division by powers of the radix, so that the two operands of addition and subtraction have the same q, the answers will be wrong. The programming language PL/I does this scaling automatically, but in other languages you have to do it yourself.

Note that you can do this in Pascal or C because it only involves the way you think about the numbers you are using. Pascal doesn't have to know what you are doing, as long as you write the code properly. A sketch of such hand scaling appears below.
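
Here is a C sketch of adding two scaled integers after aligning their binary points. The function add_scaled and the way q is passed around are inventions for this example; the alignment rule itself is the one described above.

#include <stdio.h>

/* Add two scaled integers: a has qa fraction bits, b has qb fraction bits.
   The operand with the smaller q is shifted left until the binary points
   line up; the result has q = max(qa, qb).  No overflow checking is done. */
static long add_scaled(long a, int qa, long b, int qb, int *q_result)
{
    while (qa < qb) { a <<= 1; qa++; }
    while (qb < qa) { b <<= 1; qb++; }
    *q_result = qa;
    return a + b;
}

int main(void)
{
    int  q;
    /* 2.5 with q=1 is the integer 5; 1.25 with q=2 is also the integer 5. */
    long r = add_scaled(5, 1, 5, 2, &q);

    printf("result = %ld with q = %d\n", r, q);   /* 15 with q=2, i.e. 3.75 */
    return 0;
}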

One is not restricted to scaling in the binary representation; one can also scale in the decimal representation. The rules for decimal scaling are the same as in the table above. For example, suppose you read in an integer but you know that it contains a number with two `implied' decimal digits, i.e. the number 500 represents 5 with a q=2. You treat it that way in the machine: multiply it by 2.1 (21 with a q=1) and the result is 10500 with q=3. Now you can correctly output the result with the decimal point properly placed, because you know that q=3: 10.500. Again you must be careful to only add and subtract numbers with the same decimal scale, and to scale the numbers carefully around multiplication and division to preserve the high-order digits while maintaining the required precision after the decimal point.
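
The same worked example, written as a small C sketch; the variable names and the hand-placed decimal point are purely illustrative.

#include <stdio.h>

int main(void)
{
    long amount   = 500;   /* 5.00 held with two implied decimal places, q=2 */
    long rate     = 21;    /* 2.1 held with one implied decimal place,  q=1 */

    long result   = amount * rate;   /* 10500                               */
    int  q_result = 2 + 1;           /* the q's add under multiplication: 3 */

    /* Place the decimal point by hand when printing: 10.500 */
    long scale = 1;
    for (int i = 0; i < q_result; i++) scale *= 10;
    printf("%ld.%03ld\n", result / scale, result % scale);  /* the field width 3 matches q */
    return 0;
}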


Fractional Quantities as Rational Numbers

One can represent any rational number as the ratio of two integers. To do this with bits, rational fractional quantities are represented as two integers, one representing the numerator and the other the denominator of a fraction that equals the number.

Multiplication and division of these numbers are easy. Addition and subtraction require finding the common denominator (the denominator of the result) and then adding or subtracting the adjusted numerators.

The major problem with this representation of numbers is that it requires twice as much space, and comparison is costly because it requires two divisions. Non-rational numbers must be approximated by a rational. Rationals whose denominator gets too large for the integer size available must also be approximated. Error analysis is non-trivial for algorithms represented this way.

Few computers use it as a hardware method of representing numbers, but the algorithms for doing rational arithmetic are not difficult to code in any language using standard binary integers and structures; a sketch is given below.
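
One possible sketch in C (not taken from any particular library): a rational kept as a pair of longs, with multiplication and addition following the rules above.

#include <stdio.h>

/* A rational number held as two integers.  Sketch only: there is no
   overflow checking and results are not reduced to lowest terms. */
typedef struct { long num, den; } rational;

static rational rat_mul(rational a, rational b)
{
    rational r = { a.num * b.num, a.den * b.den };
    return r;
}

static rational rat_add(rational a, rational b)
{
    /* Use the product of the denominators as the common denominator,
       then add the adjusted numerators. */
    rational r = { a.num * b.den + b.num * a.den, a.den * b.den };
    return r;
}

int main(void)
{
    rational third = { 1, 3 }, half = { 1, 2 };
    rational s = rat_add(third, half);   /* 5/6 */
    rational p = rat_mul(third, half);   /* 1/6 */

    printf("sum = %ld/%ld   product = %ld/%ld\n", s.num, s.den, p.num, p.den);
    return 0;
}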


Floating Point Numbers

For computer languages that do not do fixed scaling of integers, like Pascal and C, the easiest way of handling fractions is with floating point numbers, sometimes called real numbers. Using this technique a number is represented in bits by three parts: sign, exponent, and fraction. This is similar to the scientific notation used to represent large or small numbers (e.g. -0.35 × 10^8): the sign is negative, the exponent is 8, and the fraction is 0.35. Inside a computer these are kept in binary and put into separate fields in the floating point number. Most modern computers allow more than one `size' of floating point number.

Take the floating point used on the IBM PC and similar clones. Like many other machines, they use a hardware chip set that follows the rules defined by the IEEE Standard Floating Point conventions. Internal to the machine all arithmetic is done in an 80-bit floating point format.


This internal number can be kept in memory in either a 32-bit (short) or 64-bit (long) format.
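
A quick way to see these sizes from C is shown below. The mapping of float, double, and long double onto the 32-, 64-, and 80-bit formats is typical of PC compilers but is not guaranteed by the C language.

#include <stdio.h>

int main(void)
{
    /* Field widths in the IEEE formats (sign / exponent / fraction bits):
         32-bit short     1 /  8 / 23
         64-bit long      1 / 11 / 52
         80-bit internal  1 / 15 / 64  (the top significand bit is explicit)
       On a typical PC compiler float and double map onto the short and long
       formats; long double is often the 80-bit format, padded in memory. */
    printf("float       %zu bytes\n", sizeof(float));
    printf("double      %zu bytes\n", sizeof(double));
    printf("long double %zu bytes\n", sizeof(long double));
    return 0;
}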


The value of these IEEE floating point numbers is (-1)^sign × significand × 2^exponent. In the short form the exponent is kept in excess-127 notation. In the long form the exponent is kept in excess-1023 notation. In the 80-bit temporary form the exponent is kept in excess-16383 notation. The sign and significand in all forms create a sign-magnitude number. For example the short floating point number:

0 10000010 1101 0000 0000 0000 0000 0000


can be interpreted as the following binary expression. Note that the binary 10 below, as the base of the exponent, represents a decimal 2, just as the binary exponent 10000010 represents the decimal value 130.

1.101 × 10^10000010


Conceptually this machine representation can be rewritten in decimal. The binary exponent is 10000010, which is 130 in decimal; 130 in excess-127 notation represents 130 - 127, because it is 127 more than it should be, that is, it represents +3. The binary significand 1.101 is 1 + 1/2 + 1/8, or 1.625 in decimal, so the number is:

1.625 × 2^3


Computing 1.625 × 8 gives 13, the value of the number. In the same way -13 would be represented as:

1 10000010 11010...0

The floating point result is normalized after each arithmetic operation. Normalizing means shifting the significand left to eliminate high-order zeros and adjusting the exponent to match. This means that in every normalized floating point number the top bit of the significand should be one. For example, let's subtract 10 from 13.



13.0 is 0 10000010 11010...0

10.0 is 0 10000010 10100...0

They both have the same exponent, so we can subtract the significands as they are. If they were not the same, the number with the smaller exponent (and therefore the smaller number) would be shifted right, introducing zeros at the top end of the significand and adding 1 to the exponent for every shift, until the two exponents are equal. (Addition and subtraction can only be done with equal exponents, that is, with the numbers aligned.)
result 3.0 is 0 10000010 00110...0

The result has leading zeros in the significand, so it must be normalized.
normalize 0 10000001 01100...0
normalize 0 10000000 11000...0

The exponent of this number is 128 - 127 = 1; so the number is 1.5 × 2^1, or 3.

If one always knows the top bit should be a one, is there any reason to store it? Of course not. So in the short and long IEEE floating point forms there is an implied extra one bit before the most significant digit of the significand. Thus 3.0 in memory is represented as 1.5 × 2^1:


+3.0 0 10000000 10000...0 0x40400000



A 2.0 in memory is represented as 1 × 2^1:

+2.0 0 10000000 00000...0 0x40000000


And a 1.0 in memory is represented as 1 × 2^0:

+1.0 0 01111111 00000...0 0x3f800000
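
These encodings are easy to check from C. The helper below simply copies the bits of a float into an integer so they can be printed in hexadecimal; the function name float_bits is invented for this sketch.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Copy the bits of a 32-bit float into an integer so they can be printed. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

int main(void)
{
    printf("3.0f = 0x%08x\n", (unsigned)float_bits(3.0f));   /* 0x40400000 */
    printf("2.0f = 0x%08x\n", (unsigned)float_bits(2.0f));   /* 0x40000000 */
    printf("1.0f = 0x%08x\n", (unsigned)float_bits(1.0f));   /* 0x3f800000 */
    return 0;
}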


The internal temporary floating-point form has the extra top bit inserted, so that it is easy to do arithmetic and normalization. During arithmetic zero bits may occur at the top end of the significand, for example in a subtraction, as we have seen. After every arithmetic operation, a normalization operation shifts the top one bit back to the top of the significand and adjusts the exponent.

How does one represent zero, if there is always an implied bit in the significand?  



Zero is an obvious special case.

+zero 0 00000000 00000...0 0x00000000
-zero 1 00000000 00000...0 0x80000000


Zero is represented as an exponent of zero and a significand of zero.

There are two special values for the IEEE exponents: zero and all ones. A zero exponent is used for zero and for very small denormalized numbers. Because the significand is kept in sign-magnitude form there is both a positive and a negative zero in IEEE floating point.

The all-one exponent (the number 255 in the short floating point format) represents special values like infinity, and values that are not-a-number, or NaN. There are two infinities, negative and positive: positive infinity is greater than any other floating point number, and negative infinity is less than any other floating point number. Infinity is represented by an all-zero significand with an all-one exponent. Arithmetic with infinity works as you would expect: adding, subtracting, or multiplying by infinity produces ±infinity; division by infinity generally produces 0.



The two infinities are represented as:

+ inf 0 11111111 00000...0 0x7f800000
- inf 1 11111111 00000...0 0xff800000


Infinity divided by infinity produces a value that is not a number, or on some computers gives an error.

Things that are `Not a Number', or NaNs, can be used by a programmer to represent uninitialized values; if such a floating value is ever used in a computation an error can be generated.



+ NaN 0 11111111 10000...0 0x7fc00000
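
A small C sketch of how these special values arise and behave. The particular expressions used to manufacture infinity and NaN here are just one convenient way; isinf and isnan are the tests provided by newer C libraries.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double big    = 1e300;
    double inf    = big * big;    /* overflows to +infinity           */
    double nanval = inf - inf;    /* infinity minus infinity is a NaN */

    printf("inf = %g   isinf(inf) = %d\n", inf, isinf(inf));
    printf("nan = %g   isnan(nan) = %d\n", nanval, isnan(nanval));
    printf("1.0 / inf = %g\n", 1.0 / inf);           /* division by infinity gives 0  */
    printf("nan == nan is %d\n", nanval == nanval);  /* a NaN is not equal even to itself */
    return 0;
}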


Floating point on some computers comes in a 96-bit format. Occasionally one finds machines with 128-bit or 256-bit floating point numbers. Why this massive precision? The reason is found in the scaling, especially when multiplication is combined with subtraction of numbers that can be close to the same value. A series of apparently harmless computations can turn the floating point result into nonsense. Users are usually unaware of the arithmetic problems they are about to encounter, and the machine, for its part, has no way to determine that the precision of the computation has ebbed away. One way to put off incorrect answers is to use a lot of precision. But that only puts the problem off, it doesn't solve it. The only good solution is for the programmer to understand the program he has written and the floating point computational issues it involves.

The classic example of this problem is the equation for the sample standard deviation. Given a set of values xi, for n numbers, the most common single-pass way to compute the sample standard deviation, sigma, is to evaluate:

sigma = sqrt( (sum of the xi^2)/n  -  mean^2 ),   where mean = (sum of the xi)/n

The problem with computing this quantity in floating point is that the average sum of squares and the square of the mean can be large numbers that are close. For example, if the mean you computed is 50,000, which is not a particularly large number, then the square of the mean is 2,500,000,000. The average sum of the squares will be a similar number, say 2,500,000,125. The difference is only 125, and it lies in the three low-order digits.

IEEE single precision floating point numbers have 23 bits of significand. The number of bits, 23, divided by our magic number 3.32, gives 6.9 digits, or just about 7 digits of precision. So the answer after the floating point subtraction will be close to zero. All the significance in the formula comes from the low-order digits, and the subtraction throws away the high-order digits. This produces garbage numbers to feed to the square root routine. If you are lucky this will result in a domain error from trying to take the square root of a negative number. If you are not lucky you will just get a garbage answer. The result of the square root has half the precision of its argument, so even if you had two digits of precision to start with, by the time you finish there may be only a single digit of precision left.
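
To watch the cancellation happen, here is a C sketch of the naive single-precision computation on made-up data centred near 50,000. The exact garbage it prints will vary from machine to machine; the point is that the subtraction can leave few or no correct digits, even though the true answer is about 3.16.

#include <stdio.h>
#include <math.h>

int main(void)
{
    float sum = 0.0f, sumsq = 0.0f;
    int   n   = 11000;

    /* Made-up data: the values 49,995 .. 50,005, each appearing 1000 times,
       so the true standard deviation is sqrt(10), about 3.16. */
    for (int i = 0; i < n; i++) {
        float x = 50000.0f + (float)(i % 11) - 5.0f;
        sum   += x;
        sumsq += x * x;
    }

    float mean = sum / n;
    float var  = sumsq / n - mean * mean;   /* two huge, nearly equal terms */

    printf("naive variance = %g\n", var);
    printf("naive std dev  = %g\n", var > 0.0f ? sqrtf(var) : 0.0f);
    return 0;
}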

What is a computationally safe way to compute the standard deviation?   One way is to compute the mean first. Then subtract the mean from each item and square the difference and sum across the items. The square root of the average sum of the squares is the sample standard deviation. This requires two passes across the data, one to compute the mean, the second one to find the sum of the squares of the differences. Computationally it can be done in one pass, if you memorize and use the following algorithm for standard deviation:

mean := 0;
S := 0;

for i := 1 to N do
begin
  D := data[i] - mean;            { deviation from the old mean }
  mean := mean + D / i;           { update the running mean }
  S := S + D * (data[i] - mean)   { accumulate the sum of squared deviations }
end;

if ( N > 1 )
then std_dev := sqrt( S / ( N - 1 ) )
else std_dev := 0;
{ standard deviation undefined for fewer than two values }


Computationally Accurate Sample Mean and SD Algorithm

Although this algorithm is somewhat slower to compute it gives computationally accurate answers. If you don't believe it, then do a little proof to show it is mathematically equivalent to the other formula, and then examine the precision loss in the two methods.

Floating point should always be used cautiously. It doesn't just work for a while and then break, the way fixed point computations do; with floating point, computational errors can sneak up on you, reducing your calculations to mush. With floating point calculations it is not always true, for example, that x + 1 > x; when x becomes large enough, adding one just yields x. That's because x + 1 ≈ x for large x. So if you think about floating point arithmetic as approximate arithmetic you will be closer to the mark.
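
A two-line check of this claim in C, using single precision; the value 1e8 is an arbitrary choice, any value above 2^24 will do.

#include <stdio.h>

int main(void)
{
    float x = 100000000.0f;   /* 1e8, well above 2^24 = 16,777,216    */
    float y = x + 1.0f;       /* the nearest float to x + 1 is x itself */

    if (y > x)
        printf("x + 1 > x\n");
    else
        printf("x + 1 == x   (the added 1 was lost)\n");   /* this branch runs */
    return 0;
}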

Because of this approximate nature of floating point computation, comparing two computed quantities for equality seldom yields equal, even if mathematically the quantities should be equal. Care should be taken to write algorithms that you can be sure will work. One good rule of thumb: once computations have been done on a floating point number, check for a range of values, or check that something has exceeded some boundary. Don't rely on equal! The only time an equality test can be guaranteed to work for floating point values is when you assign a value to a floating point number and check to see if that same value is still there.
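
A small C illustration of why the range test is the better habit; the tolerance 1e-9 is an arbitrary choice for this example.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = 0.1 + 0.2;   /* neither 0.1 nor 0.2 is exact in binary */
    double b = 0.3;

    printf("a == b         : %d\n", a == b);              /* usually 0 (false)   */
    printf("|a - b| < 1e-9 : %d\n", fabs(a - b) < 1e-9);  /* a range test works  */
    return 0;
}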

These floating point precision problems also have a nasty habit of cropping up in production software that has passed all the debugging tests. One reason for this is that programmers tend to be lazy, so they write small test cases in which the answers are easy to compute. These small cases seldom generate products or sums that challenge the precision limitations of the hardware. When the real data comes along, the program doesn't work.

You may feel that all these problems are too awful to have to think about; therefore they must have been fixed by modern technology. From this you conclude that your Pascal system can't possibly exhibit these horrible problems. Don't be fooled! All these problems are here in Think Pascal, Think C, and other languages. The problems run deeper than the machines that exhibit them; they have their origin in the unmathematical, finite nature of our world. Even more subtle `errors' than the ones demonstrated are possible. Programmers, beware!










Copyright ©1996 Robert Uzgalis. All Rights Reserved.
contact: buz@cs.aukuni.ac.nz