[Next] [Up/Previous]

Gradual Underflow (and then some)

One of the features of the original IEEE 754-1985 standard, based on the floating-point capabilities offered by Intel's 8087 coprocessor for the 8086 and 8088 microprocessors, was gradual underflow.

This was controversial at the time, as it added some complexity to the way computers handled floating-point numbers.

A typical computer floating-point format may look like this:

A sign: one bit, that is 1 for positive numbers (and also for zero), and 0 for negative numbers.

An exponent: a power, to which some base, perhaps 2 or 8 or 16, is raised, with the result being multiplied by the next part of the floating-point number to give its value.

A mantissa (or significand or characteristic), which is a string of binary bits treated as a fraction, a number in the range [0,1), which thus contains most of the information about the number's value.

For a floating-point number to be zero, the mantissa has to be zero. Usually, the exponent will also be at the lowest possible value in a floating-point zero, and the sign will be positive; this gives a valid floating-point zero a unique representation.

If a floating-point number is not equal to zero, usually computers will require that number to be normalized, or, at least, they will always produce normalized numbers as the result of calculations yielding nonzero results.


In the IBM System/360 computer, the base of the floating-point exponent was 16. This meant that the first four bits in the mantissa of a nonzero floating-point number that was normalized could be anything from 0001 to 1111, but they couldn't be 0000. This was a slight inefficiency in the coding of floating-point numbers, but it was not a major concern.

The same would apply to those computers which had an exponent base of eight.

However, many computers, such as the IBM 704 and its descendants, like the IBM 7090 and IBM 7094, the Univac 1108, and the PDP-10, used an exponent base of two. This meant that the first bit of the mantissa of any normalized nonzero floating-point number was always one. That was a more obvious waste.

The PDP-11 addressed that issue in the following way: zero was represented by a floating-point number with 0 as the sign bit, the minimum possible exponent value of all zeroes, and a mantissa that consisted entirely of zeroes.

Any non-zero floating-point number had to have a mantissa that did not have the minimum possible value. So whether a number was zero or not was indicated by the exponent, not the mantissa.

And that allowed the first bit of the mantissa, the one that always had to be zero if the number wasn't zero, to simply be omitted. So the mantissa could be considered to contain all but the first bit of a fraction in the range [0.5, 1); the mantissa itself would be in the range [0, 0.5), and 0.5 would always be added to it.


The IEEE 754-1985 standard was based on the floating-point format of the PDP-11 computer, but it added gradual underflow as a feature.

When the exponent was all zeroes, the mantissa had to be capable of being zero. But it was no longer bound to only be zero.

Instead, the mantissa now just wouldn't have 0.5 added to it. If it was considered to be a fraction in the range [0, 0.5), though, to avoid having a range of missing numbers that couldn't be represented, the all-zeroes exponent would have to cause the mantissa to be multiplied by the same power of two as an exponent of one, instead of a power of two that was one less.


Gradual underflow meant that more of the possible bit patterns that could occupy the storage for a floating-point number were useful, and represented a valid floating-point value.

But that, in itself, would be such a tiny benefit that it couldn't have been the real reason that it was proposed, and adopted, as a useful feature.

The important reason behind gradual underflow instead was the following, which I will illustrate using numbers in decimal form, such as they might appear in the display of a pocket calculator:

A pocket calculator with scientific notation might let you use zero, and numbers ranging in magnitude from 9.999999999 times 10^99 right down to 1.0 times 10^-99. So the exponent, which is a power of 10, can vary from +99 to -99, and the number must be normalized.

The difference between the two smallest possible numbers (other than zero) is the difference between 1.000000001 times 10^-99 and 1.0 times 10^-99. That difference is ten to the minus 109th power. But the difference between 1.0 times 10^-99 and zero is ten to the minus 99th power, which is ten to the tenth power times larger!

So in the system of normalized floating-point numbers, as Dr. William Kahan pointed out, there is a "hole" around zero. That reduces the relevance and usability of a lot of the floating-point numbers that can be represented in actual calculations - in the example here, only the full precision of numbers down to 1.0 times 10^-89 is really relevant; smaller numbers are distinguished by amounts that are smaller than the size of the gap between zero and the next numbers close to it.

If a pocket calculator allowed numbers that weren't normalized, at least for the smallest possible exponent, so that it could display 0.000000001 times 10^-99 without an error, then this issue would not be present; the available precision of the smallest numbers would match the distance from zero to the next smallest number.

Something Fancier: Posits

I didn't like the fact that changing from a hidden first bit to a visible first bit meant an extra complication in the rules around the exponent in a floating-point number. I wondered if there was a way to efficiently encode floating-point numbers without this inelegant attribute.

And so, I came up with something I called Extremely Gradual Underflow.

Here, there was a sign, and there was a power-of-two exponent.

But if the number was unnormalized, its value would be interpreted in a special way.

So the floating-point numbers from 1 1111 11111111 to 1 0000 10000000 would represent values ranging from 127 1/2 down to 2^-8, in a small example format with four bits for the exponent and eight bits for the mantissa.

But the floating-point number 1 1111 01111111 would represent a number just slightly smaller than 2^-8; the number of zeroes prior to the first one in the mantissa would be the most significant part of the exponent. And so 1 0000 01000000 would be 2^-24, and so on and so forth.

Some time after I came up with this, though, I realized that it wasn't really a new invention on my part. The digital signal encoding method of A-law encoding was, in fact, equivalent to a variation of this technique; I discuss that here.


Enter John L. Gustafson

Dr. William Kahan, working at Intel, developed the 8087 in the way that he did because he was very concerned with how the way floating-point was implemented on most computers up to that time had contributed to errors in floating-point calculations.

In addition to gradual underflow, the 8087 included a considerably more important feature which I have not mentioned on this page. For addition, subtraction, multiplication, division, and even square root, it guaranteed that it would always produce the most accurate result that could possibly be represented as a floating-point number in the format used.

The technique the 8087 used to achieve this involved augmenting floating-point numbers, when they were within the arithmetic-logic unit (ALU) of the coprocessor, with three bits coming after the least significant bit of the mantissa. These bits were called "guard", "round", and "sticky".

As a result of this, in addition to gradual underflow, the IEEE 754-1985 standard represented a significant advance in the quality of computer arithmetic. However, this did not mean that computer programs doing floating-point calculations stopped yielding incorrect results at times. Far from it; while this standard improved how floating-point numbers behaved, most of the fundamental problems of floating-point arithmetic on computers weren't affected at all by it.

Dr. John L. Gustafson was also concerned with the problem of error in floating-point calculation. He wanted to do more to reduce it, and because his goals were more ambitious, he was willing to go much further than Dr. Kahan did in devising the IEEE 754 standard.

He devised a variable-length format, where real numbers were represented as what he called "unums", which addressed one of the biggest causes of floating-point error head-on.

To simplify things by using an example in decimal:

On a pocket calculator, if you take 3, add 2 times 10^-15 to it, and then subtract 3, you will get zero, not 2 times 10^-15. That's because a pocket calculator displays 10 digits, and it may calculate internally to 11 digits or even 13 digits... but not sixteen.

With unums, though, because they're variable length, 3 plus 2 times 10^-15 just becomes 3.000000000000002 - the mantissa grows as much as it has to.

After the 8087 came out, IBM first devised a High Accuracy Arithmetic feature for some of its System/370 computers that provided better arithmetic, based on the guard, round, and sticky principle, to its existing floating-point format, and later added IEEE 754 format numbers to its ESA/390 computers, calling them Binary Floating-Point (BFP) as distinct from the original System/360 floating-point format, which they now referred to as Hexadecimal Floating-Point (HFP).

And nearly every other maker of computers with floating-point used IEEE 754 in their new designs; so it isn't just found on the 486 DX and later Intel processors, it was also used in Motorola's 68881 floating-point co-processor for their 68000 processor, it's used for PowerPC chips, SPARC chips, and pretty much everything else.

In the case of unums, though, the reaction of the world of computing was basically to pat Dr. Gustafson on the head and say "That's nice", and go back to business as usual.

Of course, BASIC interpreters let programmers handle variable-length strings, and even put them in arrays, for quite some time, and so a variable-length floating-point format would not be technically infeasible to implement.

It's just that it would still have been slower than a conventional fixed-length format; a lot. And most of the people who did work with floating-point numbers on computers wanted the maximum possible performance that hardware could possibly give them - at least, at a price they could afford.

Dr. Gustafson, however, did not give up. While he continued to advocate for unums, he also developed another floating-point format which was fixed-length, which would at least provide some degree of improvement over IEEE 754. He called this format "posits", and he presented it in a paper with the subtitle "Beating floating-point at its own game".


On a personal note, I was a bit chagrined when I first heard of posits. This format sounded to me as something essentially equivalent to a modification I made to Extremely Gradual Underflow when I worked out the design of the Concertina computer architecture. It is described here... but this doesn't really detract from Dr. Gustafson's status as the inventor of posits, since I presented them only as a way to efficiently encode a wide range of numbers, without grasping that extending the representable range of real numbers in floating-point might have advantages in making it easier to perform floating-point calculations accurately, which is what potentially makes them of genuine significance as opposed to a mere irrelevancy, which is all they were when I mentioned them on an obscure part of my web site.

Hyper-Gradual Underflow and Overflow

Also, I came up with an idea for extending the range of floating-point numbers even further.

After one goes through the complete range of the exponent field, if the floating-point number is organized so that the exponent is placed after the first 1 in the mantissa as a way to allow numbers to be ordered more easily, then the length of the exponent field can be dependent on the position of that first 1.

So after the regular run of numbers has the form (sign) 1 (regular-size exponent) (rest of mantissa), then the numbers of the form (sign) 01 (exponent) (rest of mantissa) could have an exponent that is one bit longer, and those with 001 preceding the exponent could have an exponent that is two bits longer.

That means that the precision of the mantissa declines in steps of two bits at a time, which may be considered unfortunate.

Mixed EGU/EGO

Given that posits have not been widely adopted, I have thought about creating a floating-point format which combines them with the IEEE 754 format.

Since Dr. Gustafson's posits are 64 bits long, I will put my examples in terms of the double-precision IEEE 754 format.

Numbers in this range:

S 11011111111 1111111111111111111111111111111111111111111111111111  (just under 2^768)
S 00100000000 0000000000000000000000000000000000000000000000000000  (2^-767)

will have the same value and the same representation as in IEEE 754, so basically the inner 75% of real numbers around 1.0 (and -1.0) are left unchanged.

Outside this range, Extremely Gradual Overflow and Extremely Gradual Underflow are used, just as in posits.

So we have numbers in the form:

S 000 1 11111111 111111111111111111111111111111111111111111111111111 (just under 2^-767)
S 000 1 00000000 000000000000000000000000000000000000000000000000000 (2^-1023) 
S 000 01 11111111 11111111111111111111111111111111111111111111111111 (just under 2^-1023)
S 000 01 00000000 00000000000000000000000000000000000000000000000000 (2^-1279)

and so on; the sign field, followed by 000, and then what are referred to as "regime bits" for the posit format, having the form 1, 01, 001, 0001, and so on, and then an eight-bit exponent field, and finally the significand bits, of which the first bit is hidden - although I tend to view the last of the regime bits as the first bit of the significand, following my original conception of Extremely Gradual Underflow.

For Extremely Gradual Overflow, similarly, numbers take the form:

S 111 10 11111111 11111111111111111111111111111111111111111111111111 (just under 2^1280)
S 111 10 00000000 00000000000000000000000000000000000000000000000000 (2^1024)
S 111 0 11111111 111111111111111111111111111111111111111111111111111 (just under 2^1024)
S 111 0 00000000 000000000000000000000000000000000000000000000000000 (2^768)

here, the regime bits take the form 0, 10, 110, 1110, 11110, and so on, so that they sort higher for larger numbers.


[Next] [Up/Previous]