Introduction to Floating Point Numbers in Java

In this tutorial we introduce the technology that Java uses to store floating point numbers. Java implements the 1985 IEEE 754 floating point format.

Two floating point types are supported: single precision (float) and double precision (double). Here is a summary of each:

 Java     Size     Size       Approximate          Precision
 Type    (bytes)  (bits)      range                (decimal digits)
 ===================================================================
 float      4       32        +/- 3.4 * 10^38      6-7
 double     8       64        +/- 1.8 * 10^308     15
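
Java makes the limits in the table available as constants on the wrapper classes, so they are easy to confirm at run time; a quick sketch:

// Largest finite values and bit widths of the two floating point types
System.out.println(Float.MAX_VALUE);     // 3.4028235E38
System.out.println(Double.MAX_VALUE);    // 1.7976931348623157E308
System.out.println(Float.SIZE);          // 32 (bits)
System.out.println(Double.SIZE);         // 64 (bits)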



As an example, consider the decimal (base 10) value 2.0. Humans prefer base 10, but computers prefer base 2.

When stored in IEEE 754 single precision format it looks like this in binary (base 2):

seee eeee emmm mmmm mmmm mmmm mmmm mmmm
0100 0000 0000 0000 0000 0000 0000 0000

Where s is the sign, e is the exponent and m is the mantissa, or fraction.

The exponent is 'biased' by +127. In other words, the exponent value stored in the number has 127 added to it. Referring to the decimal 2.0 example, above, the exponent works out to be 128, but after subtracting 127 the true exponent is actually 1.

The mantissa has an assumed 1 as the leftmost digit. This slick trick provides an additional bit of precision. 24 bits of precision fit into 23 bits!
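
To make the bias and the hidden bit concrete, here is a small sketch that pulls the three fields out of the bit pattern shown above (0x40000000 for 2.0) with plain shifts and masks:

int bits = 0x40000000;                   // single precision pattern for 2.0
int sign      = bits >>> 31;             // 0, meaning positive
int biasedExp = (bits >>> 23) & 0xFF;    // 128 as stored
int trueExp   = biasedExp - 127;         // 1 after removing the bias
int mantissa  = bits & 0x7FFFFF;         // 0; the leading 1 is assumed, not stored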

The radix point of the mantissa is assumed to be to the right of the assumed "1". Referring to our decimal 2.0 example again, the mantissa looks like this:

1.00000000000000000000000 (base 2)

That's 1. followed by 23 zeroes. Remember we are still working with base 2.

Next we apply the exponent. In our example the exponent worked out to be 1. Therefore we will shift the radix point 1 place to the right. In other words, we are multiplying by 2. The result is this:

10.0000000000000000000000 (base 2)

This number translates to 2.0 in base 10.
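
Scaling by a power of two is exactly what applying the exponent means, and Java exposes the operation directly as Math.scalb; a one-line sketch:

float value = Math.scalb(1.0f, 1);   // 1.0 * 2^1, i.e. shift the radix point one place right
System.out.println(value);           // prints 2.0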

Figure 01 illustrates a useful technique for retrieving the internal storage format of a floating point number.

// Reinterpret the float's IEEE 754 bit pattern as a 32-bit int
int x = Float.floatToIntBits(2.0f);
System.out.printf("\n hex format of 2.0 = %x", x);   // prints: hex format of 2.0 = 40000000
 


Figure 01 - Java code to display the bitwise format of a floating point number. The floatToIntBits method is a member of the Float class.

The code snippet in Figure 01 generates the output displayed in Figure 02. It illustrates how to obtain the internal storage format of a floating point number. The method used, floatToIntBits(), is static, so we don't need to instantiate a Float object. A companion method, doubleToLongBits(), exists in the Double class (sketched below).

Figure 02 - The output of the code snippet in Figure 01: the bit pattern of a single precision IEEE 754 floating point number.
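
For completeness, a sketch of the double precision companion mentioned above; doubleToLongBits() is also static and returns the 64-bit pattern as a long:

long y = Double.doubleToLongBits(2.0);
System.out.printf("\n hex format of 2.0 (double) = %x", y);   // 4000000000000000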

We can easily cobble up a snippet of code that builds properly but carries a potential problem. Figure 03 illustrates the precision problem.

We start with a number that looks mostly harmless: 1236.0007. We store the number in a float data item (IEEE 754 Single Precision format).

Next, we subtract the integer part of the number. Intuitively the result of the subtraction should be .0007. It's not.

Finally, we perform a test for equality to verify that we still have the .0007.
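
The experiment described above boils down to something like this sketch (Figure 03 shows the article's actual code; the variable name f1 matches the discussion below, while f2 and the printed messages are assumptions):

float f1 = 1236.0007f;        // stored as the nearest representable float
float f2 = f1 - 1236;         // intuitively this should leave .0007
System.out.println(f2);       // the stored result is 0.000732421875, not .0007
if (f2 == 0.0007f) {
    System.out.println("still .0007");
} else {
    System.out.println("not .0007 any more");   // this branch runs
}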

Figure 03 - Nefarious floating point precision problem in a simple code snippet

Figure 04 is the output of the code snippet in Figure 03.

Figure 04 - The output of the code snippet in Figure 03. An illustration of floating point error.

OK, we have seen that the result of 1236.0007 minus 1236 is not .0007.

What went wrong? Is Java broken?

Not at all. When we store 1236.0007 into f1 (remember that f1 is a float data item), the bit pattern that is stored is 0x449a8006. This bit pattern actually represents 1236.000732421875, which is the closest we can get to 1236.0007 using 24 bits of precision. We simply cannot store it any better.
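
A quick way to check both facts (the bit pattern and the exact stored value); passing the float to the BigDecimal(double) constructor shows the value the float actually holds:

float f1 = 1236.0007f;
System.out.printf("%x%n", Float.floatToIntBits(f1));    // prints 449a8006
System.out.println(new java.math.BigDecimal(f1));       // prints 1236.000732421875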

Figure 05 - Force a floating point number to infinity

On a side note, Figure 05 illustrates how to reach infinity. The IEEE 754 format provides a special value for representing infinity. The code is a little kludgy, but it does illustrate how to overflow a floating point data item.
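
One way to produce the overflow (a sketch; the exact code in Figure 05 may differ):

float f = Float.MAX_VALUE;                          // largest finite float, about 3.4 * 10^38
f = f * 2.0f;                                       // overflow: the result is positive infinity
System.out.println(f);                              // prints Infinity
System.out.println(f == Float.POSITIVE_INFINITY);   // prints true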

Figure 06 is the output of the code in Figure 05.

Figure 06 - Output of the code in Figure 05: to infinity and beyond.

We have seen that floating point numbers can come with precision errors under some circumstances. Keep this in mind whenever you compare floating point values for equality.
