ArtsAutosBooksBusinessEducationEntertainmentFamilyFashionFoodGamesGenderHealthHolidaysHomeHubPagesPersonal FinancePetsPoliticsReligionSportsTechnologyTravel

A Framework of Exact Floating Point Arithmetic

Updated on July 10, 2012


Existing methods of numeric computations have limits. For example, consider the number of permutations of n objects (written as: n!) which occurs in the context of number of combinations of selecting k objects from n objects where order does not matter (written as: c(n, k); k ≤ n) . The n! is number of ways n objects can be ordered. It turns out that

c(n, k) = n! / (k! * (n - k)!)

The simple minded algorithm implied by that formula produces an overflow error for even small n. This simple minded algorithm works as follows:

  1. Calculate n!
  2. Calculate k!
  3. Calculate (n - k)!
  4. Calculate c(n, k) using above formula

A computer program implementing above algorithm which uses the computer's in built floating point arithmetic mechanism will generate an overflow error. n! simply becomes too big to be stored in space provided for storing a floating point value. This is problem number 1.

Another problem is of truncation errors. The accuracy and stability of numeric calculations are threatened by the fact that the significand in floating point number representation in use in today's computers are required to be truncated so that it can fit into space provided to store it in a floating point value.

When two floating point values are multiplied the storage required to store resulting significand is sum of storages required to store the significands of two values being multiplied.

The floating point number representation in use in today's computers use 2 or a power of 2 as radix for representing the value in significand. The dominant system that humans use to write numbers uses arbitrarily the decimal system of writing numbers in a positional notation. In decimal system one requires 10 distinct symbols to represent 10 digits including a symbol reserved to represent digit zero. The number of symbols in a numeral system is called the radix of that system.

Charles Babbage (an English mathematician) had proposed the construction of a calculating machine to calcualte astronomical and mathematical tables. In the proposed machine decimal number system was to be used. With advent of electronic computers it was easier to use 2 or a power of 2 as radix for internal representation purpose. In theory, electronic computers too can use the decimal 10 as radix by grouping consecutive digits in radix 2 system. Each group of 4 radix 2 symbols can represent 16 different combinations and since 16 exceeds we can use such groups to represent decimal digits. But in this case extra 6 combinations would be never used for number representation purpose. Early electronic computers were prone to frequent malfunctions and extra combinations when not expected (such as in a number) would signal a malfunction. And because of this such computer system would be of interest to early builders of computers. But construction of arithmetic unit of processors was easier if radix 2 system were to be used exclusively. Improvement in reliability of electronic components and invention of solid state electronics and rise of integrated circuit technology gave final push towards wider adoption of radix 2 for internal representation of numbers.

Regardless of which radix is used, there are fractions which terminate with recurring series of one or more of digits.

There is no radix in which ALL fractions will not terminate with recurring series of one or more digits.

In other words:

No matter what radix you use, there will be always at least one fraction which terminates with recurring series of one or more of digits.

In fact, when you prove this, you will also know that for all radixes, there are infinite number of fractions which terminate with recurring series of one or more of digits.

And this is problem number 2. No matter what radix you use any system of floating point representation comparable to IEEE 754-2008 will have some values losing precision because of truncation caused by recurring digits that cannot be stored.

First Steps Toward Solution

Absolute first and simple idea that comes to us is to use two integers to represent a floating point value, one for numerator and the other for denominator. The motivation for scientific notation was an ability to write numbers that are too large or too small to be conveniently written in standard decimal notation. This it does by including in the notation a power of 10. A number written in this notation has format as follows:

a * 10b (that is, "a" times 10 raised to the power of "b")

Here "a" is any real number called the significand or mantissa, "b" is called the exponent and it is an integer. The term "mantissa" was changed to "significand" because of confusion it may cause because of its common use to refer to the fractional part of the common logarithm. Both "a" and "b" can be negative numbers.

If we decide to use two integers then there is no need to make special provision for "b". When "b" is positive we can multiply 10b into numerator, when it is negative we can multiply 10-b into denominator. In QBasic using built-in integer types (INTEGER or LONG) we can work with exact fractional values but neither in-built floating point types (SINGLE or DOUBLE) nor this way of storing fractional values as a pair of integers will solve the problem of overflow or underflow errors.

How to Prevent Overflow?

In QBasic we can use a pair of integers "m" and "n", say, for storing numerator and denominator. If we store these as two integers of type LONG, then biggest non-zero number we can represent is: 2147483647/1 or 2.147483647 * 10+09. Compare this with maximum positive that DOUBLE can store: 1.79769313486231D+308. For more information on system limits for various number representation formats in QBasic see my article on hubpages titled: QBasic Numeric Limits.

As you can see in that article even single-precision floating point numbers can represent numbers with exponent values much larger than pair of LONG integers as numerator and denominator are capable of representing. Clearly we cannot just get rid of exponents. We need them. In non-scientific notation for writing numbers we would have written a very large number with many zeroes following digits (significant digits). These zeroes are really not zeroes but they are there because a physical quantity which the number represents cannot be measured with enough precision by available instrument technology and number is very large. For very small numbers there would be abrupt termination of digit sequence (significant digits) following many real zeroes indicating that digits which would follow are all zeroes. These zeroes too are really not zeroes but they are there because a physical quantity which the number represents cannot be measured with enough precision by available instrument technology. That is to say, in both the cases, the case of large number as well as the case of small number, the zeroes which precede significant digits are real and zeroes following significant digits are all not really zeroes but are unknown digits which imprecision of our instrument technology cannot reach.

To clarify the ideas in preceding paragraph I am reproducing examples of two numbers written in scientific notation from a wikipedia article on scientific notation. Following two examples are taken from wikipedia article on scientific notation. To reach the article click this link.

An electron's mass is about 0.00000000000000000000000000000091093822 kg. In scientific notation, this is written 9.1093822×10−31 kg.

The Earth's mass is about 5973600000000000000000000 kg. In scientific notation, this is written 5.9736×1024 kg.

First is the example of very small number. It has 30 zeroes after decimal point before non-zero digit sequence (significant digits) begins. These are real zeroes. However after 8 decimal significant digits are written strictly as a mathematical notation, what follows can be thought of as exact and real zeroes, but from measurement perspective these are unknown digits. If number 9.1093822×10−31 is one result of a measuring instrument then not only do we not know what unwritten trailing digits are after the 8 digits written, but some of the least significant digits may also be subject to inherent variability of nature or measuring instrument. That is, taken as mass of an electron we can hypothesize that in nature varieties of electrons are available with varying masses ("variability of nature"), or, that the imprecision of measuring is producing all the variations in measured quantity ("variability of instrument"). This number as given in wikipedia article is not one result of a measuring instrument, but it is average of many measurements performed using very high fidelity instrument. When we measure a quantity large number of times we can calculate an average from these measurements and resulting average is assumed to be very precise, with variability reducing by a factor of 2 for every quadrupling of number of measurements.

Second is the example of very large number. It has 20 zeroes after last of the non-zero digits written. Some of these may be construed as real and these will be most significant digits among all these 20 zeroes and rest will be unknown digits, "unknown" because of imprecision of measuring instrument. This idea that a trailing zero could be "significant" is expressed in the wikipedia article mentioned with following words:

A significant figure is a digit in a number that adds to its precision. This includes all nonzero numbers, zeroes between significant digits, and zeroes indicated to be significant. (Source: Wikipedia article on Scientific notation)

In other words, if instead of writing 5.9736×1024 if we were to write 5.97360×1024 we would be then making a statement about quality of average or single observation which the number represents. We are in effect saying that zero after digit 6 is "known to be" zero.

But a measuring instrument provisioned with displaying six decimal digits may show 5.97360×1024 and in that case the zero after 6 might not be significant.

Reinstating Exponent

Scientists are in habit of discarding trailing digits when they judge them to be not accurate. But scientists also retain some of these digits when it is part of plan to take many observations of same quantity and then calculate average from these observations.

In case of mathematical calculations, say, involving addition of infinite number of terms or in situation such as seen in calculation of Binomial Probability for large n, we do not have any digits that can be termed "not accurate", and so above mentioned "scientist's judgement" does not apply.

We want to serve both uses of fractional numbers. Scientists want numbers which are very large or very small and any size in between. Exact mathematical calculations demands no digits be discarded if possible. The adoption of floating point format similar to IEEE 754-2008 by mathematicians is because it is there. Most mathematics programmers try to go back to basics and when they do use available number formats they always bear in mind their deficiencies. So the affair of adoption of IEEE type format is largely an affair of scientists and engineers and mathematicians simply tagged along the effort.

But scientists and engineers are sometimes rudely awakened to need of exact arithmetic when large number of calculations are made in their problem solving.

Upshot of all this is that we need a system that will prevent loss of recurring digits as well as permit representation of very large and very small numbers. We reinstate exponent. In this article I propose a simple system to do that. I reserve to a future article the description of more complex system. So keep a watch on my articles on hubpages.

An Introduction to the Theory of Numbers
An Introduction to the Theory of Numbers

by G. H. Hardy and E. M. Wright


Let us think through this. We can replace the significand part with a pair of integer values representing numerator and denominator of significand value. And we use one more integer value which will be exponent of a base. In IEEE format base is 2, in scientific notation, since the notation is meant for humans, the base is 10.

The significand of double precision floating point number in IEEE 754 has 53 bits and when significand is stored in normalized form first binary digit is always "1", hence it is not actually stored but assumed to be present. Since 53 exceeds 31 we cannot use QBasic type LONG to store numerator or denominator. If we use 2 as base in exponent and for a value it happens to be positive then strictly speaking numerator is the numerator (separately stored as planned in this framework) multiplied by 2 raised to exponent. That is in such cases:

Actual numerator = numerator of significand * 2exponent

This gives us an idea. Why not use another exponentiated integer? Say, 3? So, our number will be represented by:

  1. Exponent of 2 (e2)
  2. Exponent of 3 (e3)
  3. A numerator (a)
  4. A denominator (b)

And the value of this number would be:

2e2 * 3e3 * a / b

And this is true even when e2, e3 are not positive. A fraction is of type: sign * a / b, where a and b are integer values. We know from number theory, that any integer greater than 1 can be written as a unique product of prime numbers. Therefore, we can state following:


If a and b are positive integers we can always write following:

a / b = p1e1 * p2e2 * ... * pnen

where p1, p2, ... pn, are prime numbers with p1 < p2 < ... < pn

and e1, e2, ... en, are integers which can be positive or negative but not zero

And furthermore this "factorization" is unique

There is no guarantee that p1 is 2 and p2 is is 3. In fact there is no guarantee that our number has more than 1 prime in this unique factorization. It could be an integer valued number: a / 1, or it could be 1 / a, where "a" is a power of single prime. This is ok. IEEE type floating point format just a electronic version of scientific notation. In taking inspiration from scientific notation we gained some, lost some. We gained a fixed base for our exponent storage and with it many simplifications of hardware design and ease of arithmetic. We lost precision control. When we have decided to find a way to completely banish any possibility whatsoever of overflow and underflow and in-exactness we need not hang on to fixed base. We can be as creative as total freedom allows. We are clear that our FLOP rate will be abysmally low. All we want is: Exact Arithmetic. So, we abandon fixed base.

Abandoning Base

Not only are we going to abandon fixed base but we can have anything for our base or nothing or several things for our base. We need not have 2 and 3 for our bases but nothing prevents us from having them. We can use p1 and p2. Or, we can completely forgo having any base at all. We will decide, we are the masters. When you are designing something you are creating something. You are the creator. And you can create a thing in whatever way you like. You are the masters.

Version Zero

A floating point value shall be represented by zero or more bases each having an exponent and zero or more of numerator factors and zero and more denominator factors.

That is it will be represented thus:

sign * b1e1 * b2e2 * ... * biei * n1 * n2 * ... * nj / (d1 * d2 * ... * dk)

Or, since the numerator factors (n's) are all integers which are factors here raised to 1 and denominator factors (d's) are all factors raised to -1, we can just write a fraction as following:

sign * b1e1 * b2e2 * ... * biei

Please note that although we say that numerators and denominators are same as bases (b's), there is fundamental difference: numerators and denominators are stored as it is, the bases need not be stored if they are well known to system. For well known bases we just store exponents. A typical storage medium for exponents could be QBasic type INTEGER. For storing numerators and denominators ideal storage medium is QBasic type LONG. We may receive and send back values in SINGLE or DOUBLE. But they are not good medium to store numerators or denominators because they do not report truncations as errors but a similar incident in LONG is reported as overflow error. It is important that we detect truncations (or overflow which is its whole number equivalent). If we fail to detect overflow then our system will not work. Since all natively implemented number representation systems sooner or later will run out of space no native number representation system is suitable for representing any component of our floating point numbers. We need a system capable of representing integers of unrestricted size. I have written an article on hubpages which describes two such systems: How to Represent Integer Values of Arbitrary Size in QBasic?

An Illustration

An illustration can clarify fresh ideas introduced in last paragraph.

In this illustration we propose that we use a byte to hold sign and 7 flags. These 7 flags stand for presence or absence of 7 primes: 2, 3, 5, 7, 11, 13 and 17. We shall call these "well known bases". If over and above numerator factors (n's) or denominator factors (d's) if any of these well known bases occur in our representation of a floating point value then the flag will be "set" and the bit corresponding to that base shall be set to "1".

For every well known base present in a floating point value we just need to provide its exponent. Since exponents tend to be small number we can use unicode inspired integer representation system described in How to Represent Integer Values of Arbitrary Size in QBasic? If you have read that article you know that I have proposed at least two variations: one, which calls for using every single bit to store a bit. Such unicode inspired numbers use leading bits indicate how many bytes are there in the value represented. That is, these bits indicate the length of integer in bytes. But these bits may leave some bits in the last byte unused. These remaining bits may or may not be used. I propose that we use the remaining bits. (Note: If reader is not able to understand what I am talking about I strongly urge him or her to read the section titled Unicode in my article: How to Represent Integer Values of Arbitrary Size in QBasic? I am also providing basic information on this variant of Unicode inspired number representation in the Appendix section at the end of this article.)

For other known prime factors we shall provide pair of values: the prime number's value and its exponent. The number of such pairs is provided as unicode inspired integer from the article I just mentioned.

As a matter of fact all integers that we need to store in our floating point system can be stored as unicode inspired integers. Reason for this is that the entire complex of sign and base exponent pairs shall be assembled in a string. And each component of floating point value can be very small or very large we are not going to be able to store these component integers in native number representation systems which call for fixed sizes.

Byte Order

Since we are proposing that we store all component integers in unicode inspired system in a string the byte order becomes an important issue. For implementing this framework in QBasic we must use same byte order as that is required in QBasic types INTEGER and LONG.

With this understanding of our system we show how the number 22/7 is represented in our system:

Example: 22/7

We represent 22/7 in a string
Since we want byte order to be
consistent with QBasic's internal
storage of INTEGER and LONG values
we show the bytes at LEN, LEN-1,
LEN-2, and so on until ... 2, 1.
Here LEN is the length of string.
LEN   00000000 (LS bit zero means sign positive)
               No well known bases used
               therefore no list of exponent
               values required
LEN-1 00000000 (Zero pairs of base-exponents
               because we are using only
               numerator factors (n's) and
               denominator factors (d's)
LEN-2 00000001 (Number of n's: 1)
LEN-3 00011010 (n1: 22)
LEN-4 00000001 (Number of d's: 1)
LEN-5 00000111 (d1: 7)
LEN-5=1, LEN=6
String shall be 6 bytes long. Please note we
have completely avoided using any bases.
We could have used 3 base-exponent pairs with
bases: 2, 7 and 11.
Alternatively, we could use these as well known
Examples of both alternatives follow.
In such representation there would be
zero n's and zero d's.
In case of first variant our string will have
following content:
LEN   00000000 (as before)
LEN-1 00000011 (3 base-exponent pairs)
LEN-2 00000010 (base=2)
LEN-3 00000001 (exponent of 2=1)
LEN-4 00000111 (base=7)
LEN-5 01111111 (exponent of 7=-1, please note
               that leading zero bit indicates
               length=1 because we are using
               Unicode inspired system. Please
               read my article: "How to
               Represent Integer Values of
               Arbitrary Size in QBasic?" on
LEN-6 00001011 (base=11)
LEN-7 00000001 (exponent=1)
LEN-8 00000000 (zero n's)
LEN-9 00000000 (zero d's)
Now using 2, 7 and 11 as well known
LEN   00110010 (flags show positive sign
               and presence of 2, 7 and 11)
LEN-1 00000001 (exponent=1)
LEN-2 01111111 (exponent=-1)
LEN-3 00000001 (exponent-1)
LEN-4 00000000 (zero n's)
LEN-5 00000000 (zero d's)


Example -223/71

We are still using same system
as used in previous example
LEN   00000001 (LS bit 1 means sign negative)
               Note that both 223 and 71 are
LEN-1 00000010 (Two pairs of base exponent pairs)
LEN-2 10000000 223 represented in this and following byte.
LEN-3 11011111 In binary it is 11011111. Since in Unicode
               inspired system first bits represent length
               we must prefix this with a byte to
               accommodate length bits. The length bits
               are: "0" for 1 byte, "10" for 2 bytes,
               "110" for 3 bytes and so on.
LEN-4 00000001 (exponent for 223=1)
LEN-5 01000111 71 needs only 1 byte in Unicode inspired system
LEN-6 01111111 (exponent for 71=-1)
LEN-7 00000000 (zero n's)
LEN-8 00000000 (zero d's)


For well known bases, we are using first primes only. But we as humans also work with decimal numbers and in past number systems were implemented which use 16 as base. Also 8 as base is popular in some programming languages. So we can have 2, 8, 10 and 16 as the first 4 well known bases. Also we need not have fixed number of well known bases. In illustration we had a fixed number 7 of flags available to indicate presence or absence of a well known base. We may instead use the Unicode inspired system to represent an unsigned number to be interpreted as flags. In this way if there are too many small primes as factors of numerator or denominator factors then it may save us some space to indicate presence of these primes in these flags bytes rather than provide primes as part of list of base exponent pairs.

Providing numerator factors (n's) and denominator factors (d's) simplifies our algorithms because we need not calculate prime factors every time a new intermediate floating point value is calculated. When these become too many or too big we may calculate prime factors and their exponents. For additive computations such as vector product values using base exponent pairs only burden computations with unnecessary multiplications. But if we never calculate prime factors and their exponents then additive computations will yield progressively larger numerators and denominators. In such computations size of numerators and denominators can be reduced because some primes may be common to both and can be cancelled out thereby reducing the sizes of these numbers and improving multiplication speeds which in its turn positively impacts vector product type computations.

Another reason for using numerator and denominator factors is input-output. We may receive these numbers from outside where we may have numbers which use native floating point formats. Also input from file may be as strings like: "22/7" or "3.14". In such cases it is much easier to first calculate floating point value which is closest to the format which input uses. As computation evolves the programmer may cause prime factors to be computed if he or she perceives that it will be of advantage from performance point of view.

In Conclusion

A versatile framework for exact floating point computation necessarily requires a system capable of representing integers of unrestricted size. I have written an article on hubpages on the topic of How to Represent Integer Values of Arbitrary Size in QBasic? This article uses the framework of arbitrary size integers described in that article. For floating point arithmetic we need a rich set of functions which will permit us to interface with existing formats and files. Among other features that such a system must have we must include ability to inter-convert from our formats and also output to files. The framework should permit enough flexibility to permit a programmer to control memory and speed requirements of numerical problem being attacked.

In coming set of articles I shall be developing QBasic programs to implement functions suggested in How to Represent Integer Values of Arbitrary Size in QBasic? and in this article. Also I shall be illustrating the use of these functions for problem solving from fields like Physics, Electronics, Number Theory, etc.

Please keep a watch on my profile page for more articles.

Appendix On Unicode Inspired Integer Representation System

In this appendix I am giving basic information on the Unicode inspired integer representation system that is used in this article. Readers who are interested in other systems proposed in How to Represent Integer Values of Arbitrary Size in QBasic? must read that article.

All fractions can be represented as: sign * a / b, where a and b are integer values. We know from number theory, that any integer greater than 1 can be written as a unique product of prime numbers. Therefore, we can state following:

If a and b are positive integers we can always write following:

a / b = p(1)e(1) * p(2)e(2) * ... * p(n)e(n)

where p(1), p(2), ... p(n), are prime numbers with p(1) < p(2) < ... < p(n)

and e(1), e(2), ... e(n), are integers which can be positive or negative but not zero

And furthermore this "factorization" is unique

This means that we need to represent following pieces of informations in a system designed to represent a fraction:

  1. Sign (this can be represented as a flag theoretically requiring just 1 bit to represent its value)
  2. n (this is a non-zero integer)
  3. p(i), i=1, 2, ... n-1, n. (these are all positive integers)
  4. e(i), i=1, 2, ... n-1, n. (these are all integers which can be positive or negative)

From this it is clear that when an integer is required to be stored we know a priori that it is going to be non-zero or possibly signed.

In the Unicode inspired system used in this article we shall need to represent both types of integers: unsigned (non-zero) and signed. Since we use 2's complement format to store integer values, for unsigned integers we do not care what is the value of leading bit because regardless of what it is the number will be notionally assumed to have sign bit of zero bit preceding the leading bit implying that the number is non-zero with leading bit as part of the number even if that bit is "1". For signed numbers value of leading bit has specific meaning therefore when a positive integer arises in course of of computation we need to prefix an extra zero bit to number to indicate that the number is positive and when a negative integer arises we need to fill extra space available with leading "1" bits. This is called sign extension. You can read more on sign extension in the article: Complement Systems of Integer Representation or in the article: How to Represent Integer Values of Arbitrary Size in QBasic?.

With this preliminary I now describe the Unicode inspired integer representation that I use in this article:

In this Unicode inspired format of integer representation we have zero or more leading "1" bits followed by a "0" (zero) bit. This sequence of bits are used to indicate length of integer in bytes including this leading byte as given in following table:

Byte 1    Number of
0xxxxxxx      1
10xxxxxx      2
110xxxxx      3
1110xxxx      4
and so on ...

If after these bits there are any lower order bits left in the byte where first "0" bit occurs then those bits are considered as leading bits of integer value these bytes represent.


01111111          (-1, minus 1)
10111111 11111111 (another way to represent -1
                  Please note that we can store
                  in fewer bytes too)
00001010          (+10 or unsigned 10)
01111111          (unsigned 127. If we know a priori
                  that the integer is going to be a non-zero
                  integer then this will be processed as
                  127 else it will be processed as -1)
01110110          (-10 or 118 depending on a priori
                  information available)


    0 of 8192 characters used
    Post Comment

    No comments yet.

    Click to Rate This Article