[Next] [Up] [Previous] [Index]

# Semi-Arithmetic Coding

It was noted that a method of compression called arithmetic coding allows an additional amount of compression to be achieved over and above that provided by Huffman coding, since the effective length of the symbols used may now be in arbitrary fractions of a bit.

A simplified compression algorithm might be considered, where some fractions of a whole bit are allowed, but not arbitrary ones, so that the coding of symbols might still proceed in fixed blocks.

For example, one could design a coding tree such that at each level, either two or three alternatives existed, so that only base-2 or base-3 digits are used. An example of such a tree for the 26 letters of the English alphabet might be, based on the letter frequencies used previously as given by Jim Gillogly, without making any claim for optimality:

```e 000    g 01B10   b 1AC10    k 1B11110    i 1CB
t 001    y 01B11   p 1AC11    x 1B111110   n 1CC
c 01AA   o 01C     a 1B0      j 1B111111A
m 01AB   r 1AA     d 1B10     q 1B111111B
w 01AC   s 1AB     f 1B110    z 1B111111C
l 01B0   u 1AC0    v 1B1110   h 1CA
```

This code, as it is patterned after arithmetic coding, has the additional prefix property that at any stage one knows whether the next symbol is a two-way branch (0 or 1) or a three-way branch (A, B, or C).

As 3 to the 5th power is 243, a simple way of encoding five base-3 digits in 8 bits can be used, following this scheme:

Let us designate our five base-3 digits as pqrst. Then, in the resulting binary code, we can use QQ, RR, SS, and TT to represent the last four digits, as coded by the scheme:

```A 00
B 01
C 10
```

Then, we might code groups of five base-3 digits to groups of eight bits as follows:

```Aqrst   QQRRSSTT
BBrst   RR11SSTT
CBAst   1111SSTT
CBBst   11SS11TT
CBCst   11SSTT11
CCAst   SS1111TT
CCBst   SS11TT11
CCCst   SSTT1111
```

Thus, the positions of the inadmissible 11 value are used to encode first the first base-3 digit, and then others for which no room is left to directly code. The later section From 47 bits to 10 letters discusses the principle behind this form of coding, and notes that it was originated by IBM as an efficient way of storing decimal digits on a binary medium.

Here we are, now: we have a code to convert a sequence of letters into base-2 and base-3 digits; the codes always begin with a base-2 digit, and we always know what kind of digit is to come next, and we have a reasonably efficient way of converting base-3 digits into bits.

One way to handle a message compressed in this way would be to decompress it by taking bytes starting from the beginning of our message as our source of bits, and bytes starting from the end of the message as our source of base-3 digits. Then, when these two streams met in the middle, we would know our message had concluded.

An example of coding a message in this way would be:

```NOW IS THE TIME
```

becomes

```1CC01C01AC1CB1AB0011CA0000011CB01AB000
```

which, when separated into bit and base-3 digit streams, becomes

```10101110011000001101000 and
CCCACCBABCACBAB
```

thus, our message begins with the bytes

```10101110 01100000 1101000x
```

where the x indicates a bit to be filled with padding, which, for this example, will be assumed to be zero, and the base-3 digits, which are converted to bytes which start from the end of the message, convert to bytes as follows:

```CCCAC 00101111
CBABC 11110110
ACBAB 10010001
```

and so the message becomes, in its final form,

```10101110 01100000 11010000 10010001 11110110 00101111
```

[Next] [Up] [Previous] [Index]