[Next] [Up] [Previous] [Index]

Semi-Arithmetic Coding

It was noted that a method of compression called arithmetic coding allows an additional amount of compression to be achieved over and above that provided by Huffman coding, since the effective length of the symbols used may now be in arbitrary fractions of a bit.

A simplified compression algorithm might be considered, where some fractions of a whole bit are allowed, but not arbitrary ones, so that the coding of symbols might still proceed in fixed blocks.

For example, one could design a coding tree such that at each level, either two or three alternatives existed, so that only base-2 or base-3 digits are used. An example of such a tree for the 26 letters of the English alphabet might be, based on the letter frequencies used previously as given by Jim Gillogly, without making any claim for optimality:

e 000    g 01B10   b 1AC10    k 1B11110    i 1CB
t 001    y 01B11   p 1AC11    x 1B111110   n 1CC
c 01AA   o 01C     a 1B0      j 1B111111A
m 01AB   r 1AA     d 1B10     q 1B111111B
w 01AC   s 1AB     f 1B110    z 1B111111C
l 01B0   u 1AC0    v 1B1110   h 1CA

This code, as it is patterned after arithmetic coding, has the additional prefix property that at any stage one knows whether the next symbol is a two-way branch (0 or 1) or a three-way branch (A, B, or C).

As 3 to the 5th power is 243, a simple way of encoding five base-3 digits in 8 bits can be used, following this scheme:

Let us designate our five base-3 digits as pqrst. Then, in the resulting binary code, we can use QQ, RR, SS, and TT to represent the last four digits, as coded by the scheme:

A 00
B 01
C 10

Then, we might code groups of five base-3 digits to groups of eight bits as follows:

Aqrst   QQRRSSTT
BArst   11RRSSTT
BBrst   RR11SSTT
BCrst   RRSS11TT
CArst   RRSSTT11
CBAst   1111SSTT
CBBst   11SS11TT
CBCst   11SSTT11
CCAst   SS1111TT
CCBst   SS11TT11
CCCst   SSTT1111

Thus, the positions of the inadmissible 11 value are used to encode first the first base-3 digit, and then others for which no room is left to directly code. The later section From 47 bits to 10 letters discusses the principle behind this form of coding, and notes that it was originated by IBM as an efficient way of storing decimal digits on a binary medium.

Here we are, now: we have a code to convert a sequence of letters into base-2 and base-3 digits; the codes always begin with a base-2 digit, and we always know what kind of digit is to come next, and we have a reasonably efficient way of converting base-3 digits into bits.

One way to handle a message compressed in this way would be to decompress it by taking bytes starting from the beginning of our message as our source of bits, and bytes starting from the end of the message as our source of base-3 digits. Then, when these two streams met in the middle, we would know our message had concluded.

An example of coding a message in this way would be:

NOW IS THE TIME

becomes

1CC01C01AC1CB1AB0011CA0000011CB01AB000

which, when separated into bit and base-3 digit streams, becomes

10101110011000001101000 and
CCCACCBABCACBAB

thus, our message begins with the bytes

10101110 01100000 1101000x

where the x indicates a bit to be filled with padding, which, for this example, will be assumed to be zero, and the base-3 digits, which are converted to bytes which start from the end of the message, convert to bytes as follows:

CCCAC 00101111
CBABC 11110110
ACBAB 10010001

and so the message becomes, in its final form,

10101110 01100000 11010000 10010001 11110110 00101111

[Next] [Up] [Previous] [Index]

Next
Table of Contents
Home Page