[Next] [Up/Previous]

Chen-Ho Encoding and Densely Packed Decimal

In addition to speed, another objection to the use of decimal arithmetic on computers has been storage efficiency; this can be addressed through using Chen-Ho encoding, which we will now discuss.

One form of Chen-Ho encoding allows three decimal digits to be represented in ten binary bits, which may have up to 1,024 possible different values, and which therefore can encode the 1,000 possibilities for three digits with only a little waste.

Of course, one can do this simply by converting a three-digit number to binary; but Chen-Ho encoding avoids the need for performing multiplication or even addition.

Most of the time, a decimal digit can fit into three bits, when its value is from 0 through 7. Only the two digits 8 and 9 would not fit. And three times three is nine, which is one less than ten. Could there be a way to use that extra bit to indicate that one is handling a three-digit number containing some 8s and 9s as a special case?

Let us take a three-digit number and represent its digits either as aaa, bbb, and ccc respectively, if they are from 0 through 7, or as A, B, or C respectively, if they are either 8 and 9, according to the following code:

       aaa/bbb/ccc  A/B/C
  0        000
  1        001
  2        010
  3        011
  4        100
  5        101
  6        110
  7        111
  8                   0
  9                   1

then, the following chart indicates a possible form of Chen-Ho encoding that would successfully encode all the possibilities for three digits in ten bits:

0aaabbbccc (3 digits with 3 bits - 1 way)
100aaabbbC (2 digits with 3 bits, 1 digit with 1 bit - 3 ways)
101aaaBccc
110Abbbccc
11100aaaBC (1 digit with 3 bits, 2 digits with 1 bit - 3 ways)
11101AbbbC
11110ABccc
1111100ABC (3 digits with 1 bit - 1 way)

with 24 unused combinations remaining.

This method of encoding was patented by IBM, in U.S. patent 3,842,414. This patent was filed on June 18, 1973, and granted on October 15, 1974. Also, it was described in a paper in the January 1975 issue of Communications of the Association for Computing Machinery. (I thought I first encountered this method in a paper in the IBM Systems Journal, but my memory must be playing tricks on me.)

It was also in 1973 that a 7070/7074 emulation feature for the 370/165 and 168 used a form of Chen-Ho encoding to convert decimal addresses to binary addresses, thus minimizing wasted storage.

The form shown preserves the order of digits, so as to be easier to understand. In practice, to cut down on the circuitry required by minimizing the changes in how a bit is used, a scheme like this might be used instead:

0aaabbbccc
100Abbbccc
101Baaaccc
110Caaabbb
11100ABccc
11101ACbbb
11110BCaaa
1111100ABC

In fact, the actual coding used by IBM simplifies the required circuitry further, at the cost of making the format somewhat more difficult to understand. The format given above is similar, although not identical, to that described by Dr. T. C. Chen in his 1971 memo first describing the basic technique.

IBM has since proposed a modification of Chen-Ho encoding which allows truncating the leading three bits from the 10-bit code to make a 7-bit code for two digits, and truncating the leading six bits from the 10-bit code to make a 4-bit code for one digit, to allow taking an arbitrary number of the least-significant digits from an encoded number without re-encoding. This scheme is called Densely Packed Decimal, and it will be more fully explained below.

Here is a table showing IBM's official formats for Densely Packed Decimal and Chen-Ho encoding:

BCD digits         Chen-Ho encoding    Densely Packed
                                       Decimal encoding

0abc 0pqr 0uvw     0abpquvcrw          abcpqr0uvw
0abc 0pqr 100W     110pqabcrW          abcpqr100W
0abc 100R 0uvw     101abuvcRw          abcuvR101w
100C 0pqr 0uvw     100pquvCrw          uvCpqr110w
0abc 100R 100W     11110abcRW          abc10R111W
100C 0pqr 100W     11101pqCrW          pqC01r111W
100C 100R 0uvw     11100uvCRw          uvC00R111w
100C 100R 100W     1111100CRW          00C11R111W

     0pqr 0uvw     0pqruvw             pqr0uvw
     0pqr 100W     111rpqW             pqr100W
     100R 0uvw     100Ruvw             uvR101w
     100R 100W     110R00W             10R111W

Densely Packed Decimal

As noted above, a modification of Chen-Ho encoding has been produced at IBM which is regarded as an improvement; this improved coding is the subject of U.S. patent 6,525,579 which is currently in force. Its inventor, Michael Cowlishaw, an IBM Fellow, is responsible for an interesting web page with quite a bit of information about decimal arithmetic.

To explain the principle behind Densely Packed Decimal encoding, a different formulation than the official one shown in the table above, but which is equivalent in terms of the desired properties, is used for illustrative purposes.

Single digits can, of course, be represented in binary using BCD encoding:

0 0000    5 0101
1 0001    6 0110
2 0010    7 0111
3 0011    8 1000
4 0100    9 1001

The original paper describing Chen-Ho encoding noted that the same principle can be used to encode two digits in 100 of the 128 possible combinations of seven bits:

Once again, let ooo or ppp represent a digit from 0 to 7 represented by the following code:

000  0
001  1
010  2
011  3
100  4
101  5
110  6
111  7

and let d or e represent a digit from 8 to 9 in the following code:

0    8
1    9

Then, two digits can be represented in seven bits (in Chen-Ho encoding, not in the Densely Packed Decimal coding yet to be described) as:

0oooppp
100dppp
101oooe
11000de

In the Densely Packed Decimal coding, the format of the Chen-Ho codes are modified so that the prefix bits indicating the coding are no longer located at the front. In this way, the last four bits of the ten-bit coding for a number of the form 00x are the BCD coding of the digit x, with all preceding bits being zero, and the last seven bits of the ten-bit coding for a number of the form 0xy are a seven-bit coding (also modified from the seven-bit Chen-Ho encoding shown above to permit nesting within it) for the digits xy, with the preceding three bits equal to zero.

Given a three digit number, whose digits are ooo/d, ppp/e, and qqq/f, a coding of the Densely Packed Decimal type (not precisely the one specified by IBM, as noted above) can be arranged as follows:

oooppp0qqq
oooppp100f
oooqqq101e
pppqqq110d
ooo00e111f
ppp01d111f
qqq10d111e
00d11e111f

Note that despite this coding expanding gradually to the left, it still does not preserve the collating sequence of the decimal digits it encodes.

The advantage of this coding is that decimal numbers which do not fill the full length of a field can always be copied to a shorter field in which they would fit by truncation while still in coded form, without requiring reconversion.

Thus, two digits, represented by ppp/e and qqq/f, are encoded to a seven bit sequence using the following coding:

ppp0qqq
ppp100f
qqq101e
00e111f

which is included in the 10-bit coding above on the right side, and the following coding for a single bit, included in the two codings above,

0qqq
100f

produces the normal BCD code.

Of course, this applies to positive numbers only, so it assumes that negative decimal numbers are always indicated using sign-magnitude format, never by ten's complement notation.

Modified Densely Packed Decimal

Can the Densely Packed Decimal encoding be modified so as to be applicable to decimal numbers in a complement notation? The same encoding should work for both nine's complement and ten's complement decimal numbers, so the design will be derived as if I had nine's complement numbers in mind, although in practice it would be ten's complement decimal numbers which would be the application of such a coding.

Given the type of symmetry that would be needed by such a modified encoding, it seems reasonable to encode a single four-bit digit at the beginning of an encoded number, when the remainder of the number is entirely composed of complete groups of three digits, as follows:

0 0000     5 1011
1 0001     6 1100
2 0010     7 1101
3 0011     8 1110
4 0100     9 1111

To allow truncation of short negative numbers as well as short positive numbers, it is intended that, where the digit x encodes to the four bits yyyy using the code above, numbers of the form 0x shall encode to 000yyyy in binary, and numbers of the form 9x shall encode to 111yyyy in binary, when two digits are encoded to seven bits, and, similarly, when three digits are encoded to ten bits, numbers of the form 0xx shall encode to 000zzzzzzz in binary and numbers of the form 9xx shall encode to 111zzzzzzz in binary, where zzzzzzz is the seven-bit encoding of the two digits xx.

This suggests that the basis encoding of decimal digits should be such that the first, second, and third digit will encode to fff, ggg, or hhh respectively from this table,

0 000
1 001
2 010
3 011
6 100
7 101
8 110
9 111

or to p, q, or r respectively from this table:

4 0
5 1

First, let us develop the encoding of two digits to seven bits with the desired property.

This encoding can be described as follows:

ggg00hh
ggg0100 (?4)
qhh0101
q000111 (?4)
q111000 (?5)
qHH1010
ggg1011 (?5)
ggg11HH

where hh and HH refer to two halves of the table for hhh:

  hh       HH
0 00     6 00
1 01     7 01
2 10     8 10
3 11     9 11

and the items in parentheses show the value for the last digit which is in four cases represented by a zero-length code, being determined instead from the prefix bits.

This clearly enough has the desired property, since it precedes four bit codes that correspond to the four-bit code for digits given above with ggg in the four applicable cases of the six given.

The encoding of three digits to ten bits can then be derived as an expansion of the seven bit encoding, so that both properties are obtained:

fffggg00hh          [0..3,6..9][0..3,6..9][0..3]
fffggg0100 (??4)    [0..3,6..9][0..3,6..9][4]
fffqhh0101          [0..3,6..9][4..5]     [0..3]
pggghh0110          [4..5]     [0..3,6..9][0..3]
fffq000111 (??4)    [0..3,6..9][4..5]     [4]
pggg010111 (??4)    [4..5]     [0..3,6..9][4]
pqhh100111          [4..5]     [4..5]     [0..3]
pq00101111 (??4)    [4..5]     [4..5]     [4]
pq11010000 (??5)    [4..5]     [4..5]     [5]
pqHH011000          [4..5]     [4..5]     [6..9]
pggg101000 (??5)    [4..5]     [0..3,6..9][5]
fffq111000 (??5)    [0..3,6..9][4..5]     [5]
pgggHH1001          [4..5]     [0..3,6..9][6..9]
fffqHH1010          [0..3,6..9][4..5]     [6..9]
fffggg1011 (??5)    [0..3,6..9][0..3,6..9][5]
fffggg11HH          [0..3,6..9][0..3,6..9][6..9]

in which table, fff, which is 000 for 0, and 111 for 9, precedes all the combinations in the table for the 7 bit code, and p precedes code combinations which are, in their last 7 bits, distinctive from anything defined in the 7-bit code.

Sign-Magnitude Numbers

The modified form of Densely Packed Decimal shown above was an attempt to deal with one problem in the original Densely Packed Decimal encoding, the fact that it made no provision for the sign of a number. In a computer the storage cells of which contain three decimal digits, being ten bits in length, if an additional bit is needed to encode sign, then one sign bit would need to be provided for the smallest possible signed numeric data format and extra sign bits would be wasted for signed numbers with a higher precision.

Storage efficiency, however, was only one of the reasons for using Chen-Ho encoding, or a related method of storing numbers, in a computer. The reason for using decimal arithmetic, instead of binary arithmetic, is so that data stored in a computer remains readily and directly comprehensible. The use of ten's complement form for negative numbers (or nine's complement form) tends to compromise the achievement of this goal.

Given that sign-magnitude encoding of signed numbers, because it most closely resembles how numbers are displayed in printed form, is the desired representation, can we produce an encoding which has the truncation property of Densely Packed Decimal, but which is applicable to sign-magnitude numbers that are wholly contained in raw decimal digits, without an auxilliary sign bit?

The basic coding for individual digits to be used suggests itself immediately as the following:

0 +0 000
1 +1 001
2 +2 010
3 +3 011
4 +4 0
5 -0 100
6 -1 101
7 -2 110
8 -3 111
9 -4 1

based on our experience with the encodings above, where the three-bit codes include the digits to be truncated, and the one-bit codes are at the opposite end of the series.

Our desired truncation behavior is that +0 0 2 would truncate to +0 2 and then to +2, while -0 0 2 would truncate to -0 2 and then to -2. Thus, when we are performing truncation, the overall sign stays the same, but the next digit after the area of truncation may change to incorporate within itself the sign of the number.

The coding I am going to describe below will work by elision instead of truncation. If it was rotated one bit to the left, it would have the desired truncation property, but without the rotation, one can either remove the three or six bits following the first bit of the code to shorten a signed number, or remove the first three or six bits to shorten an unsigned number.

First, the four-bit code for single digits would look like this:

0000 0 +0
0001 1 +1
0010 2 +2
0011 3 +3
0100 4 +4
1000 5 -0
1001 6 -1
1010 7 -2
1011 8 -3
1100 9 -4

The seven-bit code, being three bits longer, is to reduce to the four-bit code by removal of the second, third, and fourth bits for a signed number, and, because the sign is kept in the first place, it should also reduce to the four-bit code by removal of the first three digits for an unsigned number. This is not too challenging a requirement, because what we are aiming at is that the final digit will change by being given the sign bit of the first digit during signed truncation, and only digits from 0 through 4 can do this in a valid manner.

Again using ggg and hhh for three-bit codes, and q and r for one-bit codes, for the two digits, the arrangement:

gggh0hh
gggr100
qhhh101
q00r111

would allow any combination of two decimal digits to be distinguished, and, for the top two lines, where ggg is 000 or 100, allow, by the removal of the first three bits, the second digit to be revealed in the four-bit code, and, by removal of the three bits after the first bit, allow the sign of the first digit to be affixed to the second digit, which, provided that digit is in the range from 0 to 4, produces a useful result.

Since only the first two lines are critical for the desired property, there is some flexibility in how the code is designed.

The full ten-bit code, where the three digits can either be fff, ggg, or hhh, or p, q, or r respectively, then becomes:

fffgggh0hh
fffgggr100
fffqhhh101
pggghhh110
fffq00r111
pggg01r111
pqhh10h111
p00q11r111

Some examples of how this works are shown below:

1000000011 -003       1000111010 -037       1000000100 -004
   1000011  -03          1111010  -37          1000100  -04
      1011   -3                                   1100   -4


1000011100 -029       1000101101 -046
   1011100  -29          1101101  -46

Replacing the leading 1 by a leading 0 gives examples for positive numbers.

And What About Floating-Point?

If a computer that uses decimal digits internally is to be a fully general-purpose computer, in addition to working with integers, it will work with floating-point numbers.

Floating-point numbers are generally thought of as being in one of a limited number of fixed formats, and so the need to shorten a floating-point number by removing three bits or six bits, instead of by removing a complete ten-bit group, is clearly less severe.

However, if floating-point numbers are to be so shortened, they would be shortened by removal of the least significant digits. And this suggests making room for a sign bit by forcing the last digit of the mantissa portion of a floating-point number to be even. Thus, the following coding for digits suggests itself:

0 0+ 000
1 0- 001
2 2+ 010
3 2- 011
4 4+ 100
5 4- 101
6 6+ 110
7 6- 111
8 8+ 0
9 8- 1

One could develop a code in which trailing zeroes could be truncated by reversing the pattern of the code discussed immediately above. However, unlike an integer where removing the first few digits only produces the same result if the digits are removed are zero, what is desired when a floating-point number is shortened is to be able to remove any nonzero digits.

This cannot be achieved through truncation or elision of bits, as with Densely Packed Decimal and the modification of it shown above, and instead requires explicit conversion to decimal digits, followed by re-encoding. Thus, a special coding for the mantissa portions of floating-point numbers, even if it would still be applicable to lengthening short floating-point numbers by padding them with trailing zeroes, does not seem to be required.

In addition, as now noted above in the discussion of IBM's decimal floating-point format, the use of decimal digits is being taken advantage of for other purposes besides avoiding the time it takes to convert to and from binary, and, therefore, unnormalized values are put to use in this format.

But here the coding is anyways:

0fffggghhh
100pggghhh
101fffqhhh
110fffgggr
11100pqhhh
11101pgggr
11110fffqr

Yes, as it turns out, an encoding that lends itself to truncation (or elision for the sign-bit at the end case) from the right is very similar to the classic Chen-Ho encoding.

And the seven-bit code would be:

0fffggg
100pggg
101fffq
11100pq

with the four-bit code being:

0000 0 0+
0001 1 0-
0010 2 2+
0011 3 2-
0100 4 4+
0101 5 4-
0110 6 6+
0111 7 6-
1000 8 8+
1001 9 8-

thus, 0100000001 at the end of a signed mantissa stands for 400-, and by elision becomes 0100001, standing for 40-, and by elision again becomes 0101, standing for 4-. Just as this fails if a digit being truncated is not zero, it fails if the last bit of the preceding digit, which is also truncated, due to where elision takes place, is not zero. Once again, the second line of the format is important to ensure it works properly even for the trailing digit 8: thus, 1000000001 is 800-, 1000001 is 80-, and 1001 is 8-.

A Single Code?

Is it possible to have a code which allows padding with zeroes on the right as well as truncation of zeroes on the left?

Perhaps even if it is not quite possible, some simple bit manipulation, simpler than translating the code into decimal digits, and then back again to the binary representation, could accomplish shortening from both ends.

My first attempt to construct a bidirectional code:

Digit  Leftmost  Middle  Rightmost

  0    +0 000     0000   000 0+
  1    +1 001     0001   001 0-
  2    +2 010     0110   010 2+
  3    +3 011     0111   011 2-
  4               0010   100 4+
  5    -0 100     1001   101 4-
  6    -1 101     1000   110 6+
  7    -2 110     1111   111 6-
  8    -3 111     1110
  9               1101

was not successful; attempting to allow truncation and elision from both directions was simply too complicated, however I might try to attempt it.

However, elision from the right is only needed if one is going to keep the sign of the mantissa with the mantissa. Since it would be desirable for the exponent to occupy a single three-digit code group, rather than being only two digits long, while an exponent range from -49 to +49 might be acceptable, one from -499 to +499 is clearly more than satisfactory, so placing the sign of the number together with the sign of the exponent, and making the exponent range -249 to +249 would be entirely acceptable.

Then, a compression format ideally should support left truncation, left elision, and right truncation. Since 10 is not divisible by 4, there seems no way to permit left double elision, to permit exponents to be shortened, but left single elision, which truncates the exponent, would work if the exponent were, as is the commonest convention, in excess-250 format, which would shrink to excess-25 format after left elision which would preserve the sign of the number.

Let us try to construct such a code, beginning with the following table of substitutions:

Digit  Leftmost  Middle  Rightmost

  0    +0 000     0000   000
  1    +1 001     0010   001
  2    +2 010     0100   010
  3    +3 011     0110   011
  4               0111
  5    -0 100     1000   100
  6    -1 101     1010   111
  7    -2 110     1100   110
  8    -3 111     1110   111
  9               1111

To allow truncation from the left, the code, in the middle digit, for 0, must be 0000; and that is the same value as is required for truncation from the right. For elision from the left, 5 must be represented by 1000 in the middle digit. So far, so good.

In the case where the rightmost digit is a 4 or a 9, to allow elision and truncation from the left, we will need to reserve the codes 0001 and 1001; this can be done.

In the case where the leftmost digit is a 4 or a 9, to allow truncation from the right without special logic, the code 1000 will have to be reserved, and that is not possible. So it appears that a bidirectional code may be possible, but for truncation only. Since the versatility of handling floating-point numbers requires the ability to handle signed numbers as a precondition, this does not appear useful.

But here it is anyways:

Digit  Leftmost  Middle  Rightmost

  0       000     0000     000
  1       001     0010     001
  2       010     0011     010
  3       011     0100     011
  4       100     0101     100
  5       101     1001     101
  6       110     1010     110
  7       111     1011     111
  8               1100
  9               1101

  0       000     0001 00
  1       001     0001 01
  2       010     0001 10
  3       011     0001 11
  4       100     0110 00
  5       101     0110 01
  6       110     0110 10
  7       111     0110 11
  8               0111 00    0              
  9               0111 01    1

  0            00 1000     000
  1            01 1000     001
  2            10 1000     010
  3            11 1000     011
  4            00 1110     100
  5            01 1110     101
  6            10 1110     110
  7            11 1110     111
  8       0    00 1111
  9       1    01 1111

  0            00 0111 10
  1            01 0111 10
  2            10 0111 10
  3            11 0111 10
  4            00 0111 11
  5            10 1111 00
  6            10 1111 01
  7            10 1111 10
  8       0    10 1111 11    0
  9       1    11 1111 00    1

With this code, I think one can get unique 7-bit codes for each two-digit group, and unique 4-bit codes for each digit, trimming off one or two zeroes respectively, where each zero is represented by 000, from the right or the left. But the shorter codes will be different for each direction.

In that case, though, a code suitable to a modified form of elision, where the positions of the bits to be removed when truncating or eliding from the right are not the rightmost three and six bits necessarily, should be possible, by shrinking each digit through sequestering one bit from each digit, somewhat the same basic technique as was used to create the code that allowed truncating a ten's complement number from the left. In this regard, it might be noted that the official form of Densely Packed Decimal sequesters the least significant bit of each digit; if the digits were recoded so that their bit sequence is reversed, a scheme of handling signed numbers with it may be possible.

Of course, storing the digits of the mantissa in reverse order, in a code already supporting left truncation, would be much simpler than using a bidirectional code that was significantly more complex than the unidirectional code supporting both truncation and elision given above. And complexities and exceptions appear unavoidable even when supporting right truncation only in addition to left truncation.


[Next] [Up/Previous]