English-language text uses only a fraction of the possible strings of the 26 letters. For example, no one would mistake

xwllygrZPXll mRRtQzryWx

for a portion of normal text.

It is possible to make use of the fact that certain sequences of letters are more common than others to fit a text document, or a data file of another type, into less than the usual amount of space.

Graphics and audio files can be compressed by specialized
techniques. Often, they have long stretches where values do not
change at all, which can be dealt with by *run-length encoding*,
or stretches where changes are small, which can be handled by
*delta modulation*: replacing, for example, a 16-bit number by an
8-bit value giving the difference between the current number and the
previous one belonging to the same family. (An image file might contain
alternating R, G, and B pixels, and the three values would be
independent, so one compares an R value to the corresponding value in
the previous pixel, that is, the previous R value, and not to, say, that
pixel's B value.)
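The two ideas just described can be sketched in a few lines of code. This is an illustrative sketch, not taken from any particular file format; the function names are my own.

```python
# Illustrative sketches (not from any particular file format) of the two
# ideas above: run-length encoding for unchanging stretches, and
# per-channel delta encoding that compares each R, G, or B value to the
# previous value of the same channel.

def run_length_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def delta_encode(samples, channels=3):
    """Replace each sample by its difference from the previous sample of
    the same channel (R compared with R, G with G, B with B)."""
    out = []
    for i, s in enumerate(samples):
        prev = samples[i - channels] if i >= channels else 0
        out.append(s - prev)
    return out

print(run_length_encode([7, 7, 7, 7, 2, 2, 9]))  # [(7, 4), (2, 2), (9, 1)]
print(delta_encode([100, 50, 25, 102, 51, 27]))  # [100, 50, 25, 2, 1, 2]
```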

Other, more specialized techniques of compression for such files involve Fourier transforms. The most popular such technique is the Discrete Cosine Transform, used to produce JPEG files. Wavelet compression is another related technique.

A very popular data compression technique is Lempel-Ziv compression. This is used by general-purpose file compression programs on computers. Its basic principle is that, as a file is read, the part of the file that has already been processed is used as a dictionary of sequences of bytes likely to occur in that file. When a repeated sequence of bytes is long enough, a pointer back to the earlier part of the file is shorter than repeating the bytes. While the original concept of this kind of file compression is not patented, patents cover Lempel-Ziv-Welch compression and many other variants which are more practical or efficient than the original form.
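The dictionary idea can be illustrated with a toy program. The sketch below follows the original LZ77 scheme of literals and back-pointers, not the patented LZW variant; the token format and the limits chosen are arbitrary choices made for illustration.

```python
# Toy LZ77-style compressor (an illustrative sketch, not a practical
# implementation): the already-processed part of the data serves as the
# dictionary, and a long enough repeat becomes a (distance, length)
# pointer back into it.

def lz77_compress(data, window=255, min_match=3):
    out = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the already-processed window for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and length < i - j
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append(("ptr", best_dist, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz77_decompress(tokens):
    data = []
    for t in tokens:
        if t[0] == "lit":
            data.append(t[1])
        else:
            _, dist, length = t
            for _ in range(length):
                data.append(data[-dist])
    return "".join(data)

print(lz77_compress("abcabcabcx"))
```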

Huffman coding is the other major data compression technique applicable to text files.

In the chapter on paper-and-pencil methods of encryption, we met the straddling checkerboard method of converting letters to digits for easier manipulation:

  A T   O N E   S I R
 -------------------
  0 1 2 3 4 5 6 7 8 9
 -------------------
2 B C D F G H J K L M
6 P Q U V W X Y Z

in which the eight most common letters are represented by a single digit each, and the two leftover digits begin the two-digit codes for all the other, less common letters.
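For illustration, here is a small program applying such a checkerboard. It assumes the two skipped single digits are 2 and 6, matching the row labels in the table above; the function and table names are my own.

```python
# An illustrative sketch of the straddling checkerboard above: the
# letters of ATONESIR get single digits (with 2 and 6 skipped), and 2
# and 6 begin two-digit codes for the remaining letters.

SINGLE = {"A": "0", "T": "1", "O": "3", "N": "4",
          "E": "5", "S": "7", "I": "8", "R": "9"}
ROW2 = "BCDFGHJKLM"   # codes 20 through 29
ROW6 = "PQUVWXYZ"     # codes 60 through 67

TO_DIGITS = dict(SINGLE)
TO_DIGITS.update({c: "2" + str(i) for i, c in enumerate(ROW2)})
TO_DIGITS.update({c: "6" + str(i) for i, c in enumerate(ROW6)})

def checkerboard(text):
    """Convert letters to digits; anything else is dropped."""
    return "".join(TO_DIGITS[c] for c in text.upper() if c in TO_DIGITS)

print(checkerboard("ATTACK"))  # 01102127
```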

A Huffman code is also a code with variable length equivalents that can
be distinguished unambiguously from the front (which is known as having the
*prefix property*). However, the term 'Huffman code' has a specific
technical meaning that goes beyond this. Even if some effort is made to
assign the shorter codes to the more frequent symbols, one does not necessarily
have a Huffman code.

The distinguishing feature of a Huffman code is how the binary codes are assigned to the symbols being encoded. Let us say that one wishes to encode the 26 letters of the alphabet. Then, one takes the two least frequent letters of the alphabet, which happen to be Q and Z, and combines them into a single symbol for the next step, recording that the code for Q will be that symbol's code with a 0 tacked on the end, and the code for Z will be that symbol's code with a 1 tacked on the end, thus distinguishing Q from Z.

Repeat that process until the alphabet is reduced to a single symbol, which needs no code, and you have produced a Huffman code. One might wish to change the binary codes around without changing their lengths, so as to have all the symbols with four-bit codes in alphabetical order, and one can still call the code a Huffman code.
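The pairing rule just described can be sketched as a short program. This is an illustrative sketch using a priority queue, not a definitive implementation; at each merge, the bit distinguishing the two halves is placed in front of the bits recorded by earlier (deeper) merges.

```python
import heapq

# A sketch of the pairing rule above: repeatedly merge the two least
# frequent symbols; the merge adds one bit at the front of each
# symbol's partially built code.

def huffman_code(freqs):
    """Build a Huffman code from a {symbol: frequency} mapping."""
    # Each heap entry: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f0, _, codes0 = heapq.heappop(heap)  # least frequent branch: bit 0
        f1, _, codes1 = heapq.heappop(heap)  # next least frequent: bit 1
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (f0 + f1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1})
print(sorted(code.items()))  # [('A', '0'), ('B', '10'), ('C', '111'), ('D', '110')]
```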

Working from a set of letter frequencies due to Jim Gillogly, based on a famous body of literary work, which are as follows:

E 12.32   S  6.28   C  2.48   K  0.80
T  9.05   R  5.72   Y  2.11   X  0.15
A  8.17   D  4.31   F  2.09   J  0.10
O  7.81   L  3.97   G  1.82   Q  0.09
I  6.89   U  3.04   P  1.56   Z  0.05
H  6.68   M  2.77   B  1.45
N  6.62   W  2.64   V  1.02

we can link the first few letters together with no uncertainties, since the symbol totalled together from all the previous letters is either the one with the lowest or the second-lowest frequency remaining:

V 1.02 ------------------------ 2.21
K 0.80 ---------------- 1.19 -
X 0.15 --------- 0.39 -
J 0.10 -- 0.24 -
Q 0.09 - 0.14 -
Z 0.05 -

Then the frequencies stop changing so rapidly, and we make some independent compounds:

P 1.56 - 3.01
B 1.45 -

F 2.09 - 3.91
G 1.82 -

Then Y tacks on to the long sequence we made initially:

Y 2.11 -- 4.32
... 2.21 -

and we continue combining one more pair of letters separately:

W 2.64 - 5.12
C 2.48 -

then two letters tack on to what we did above:

M 2.77 -------- 5.78
P 1.56 - 3.01 -
B 1.45 -

U 3.04 -------- 6.95
F 2.09 - 3.91 -
G 1.82 -

and then two letters combine directly:

D 4.31 - 8.28
L 3.97 -

and then two chains combine

Y 2.11 -- 4.32 --- 9.44
... 2.21 -      |
                |
W 2.64 - 5.12 --
C 2.48 -

The next letter is added on to a chain:

R 5.72 ---------------- 11.50
M 2.77 -------- 5.78 -
P 1.56 - 3.01 -
B 1.45 -

Then two more pairs of letters are combined separately:

N 6.62 - 12.90
S 6.28 -

I 6.89 - 13.57
H 6.68 -

The next three letters tack on to existing chains:

O 7.81 ---------------- 14.76
U 3.04 -------- 6.95 -
F 2.09 - 3.91 -
G 1.82 -

A 8.17 -------- 16.45
D 4.31 - 8.28 -
L 3.97 -

T 9.05 ------------------ 18.49
Y 2.11 -- 4.32 --- 9.44 -
... 2.21 -      |
                |
W 2.64 - 5.12 --
C 2.48 -

and finally, the most frequent letter of the alphabet, E, joins up with a chain:

E 12.32 ------------------------ 23.82
R  5.72 ---------------- 11.50 -
M  2.77 -------- 5.78 -
P  1.56 - 3.01 -
B  1.45 -

At this point, we now have only compound symbols in the running, and our compound symbols have the following probabilities:

23.82
18.49
16.45
14.76
13.57
12.90

which total to 99.99%, due to rounding error in calculating the individual letter frequencies independently. So, we haven't forgotten any letters!

Applying the Huffman code rule to these six pseudo-symbols ends up with the following result:

13.57 - 26.47 --- 57.68 ---
12.90 -        |         |
               |         |
16.45 - 31.21 -          |
14.76 -                  |
                         |
23.82 - 42.31 -----------
18.49 -

and that means we can now put our trees of letters together, and arrive at our Huffman code for the alphabet in the English language. Here is the complete tree:

I  6.89 - 13.57 --- 26.47 --- 57.68 ---
H  6.68 -        |         |         |
                 |         |         |
N  6.62 - 12.90 -          |         |
S  6.28 -                  |         |
                           |         |
A  8.17 --------- 16.45 --- 31.21 -  |
D  4.31 - 8.28 -        |            |
L  3.97 -               |            |
                        |            |
O  7.81 -------- 14.76 -             |
U  3.04 -- 6.95 -                    |
F  2.09 - 3.91 -                     |
G  1.82 -                            |
                                     |
E 12.32 ---------------- 23.82 --- 42.31 -
R  5.72 -------- 11.50 -        |
M  2.77 - 5.78 -                |
P  1.56 - 3.01 -                |
B  1.45 -                       |
                                |
T  9.05 ---------- 18.49 -------
Y  2.11 -- 4.32 --- 9.44 -
V  1.02 - 2.21 -        |
K  0.80 - 1.19 -        |
X  0.15 - 0.39 -        |
J  0.10 - 0.24 -        |
Q  0.09 - 0.14 -        |
Z  0.05 -               |
                        |
W  2.64 -- 5.12 --------
C  2.48 -

and we make a code from it by using the two bits 0 and 1 to distinguish between the two branches coming from every fork on the tree, and so we get our Huffman code as follows:

- 0 --- 0 --- 0 - 0 : I 0000
      |     |   - 1 : H 0001
      |     - 1 - 0 : N 0010
      |         - 1 : S 0011
      - 1 --- 0 - 0 : A 0100
            |   - 1 - 0 : D 01010
            |       - 1 : L 01011
            - 1 - 0 : O 0110
                - 1 - 0 : U 01110
                    - 1 - 0 : F 011110
                        - 1 : G 011111
- 1 --- 0 - 0 : E 100
      |   - 1 - 0 : R 1010
      |       - 1 - 0 : M 10110
      |           - 1 - 0 : P 101110
      |               - 1 : B 101111
      - 1 - 0 : T 110
          - 1 --- 0 - 0 : Y 11100
                |   - 1 - 0 : V 111010
                |       - 1 - 0 : K 1110110
                |           - 1 - 0 : X 11101110
                |               - 1 - 0 : J 111011110
                |                   - 1 - 0 : Q 1110111110
                |                       - 1 : Z 1110111111
                - 1 - 0 : W 11110
                    - 1 : C 11111
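To illustrate the prefix property in action, here is a sketch that encodes and decodes with the code just derived. Since no codeword is a prefix of another, the decoder can simply accumulate bits until they exactly match a codeword.

```python
# An illustrative sketch of encoding and decoding with the Huffman code
# for English letters derived above. The prefix property guarantees that
# the first exact match while reading bits is always the right symbol.

CODE = {
    "I": "0000", "H": "0001", "N": "0010", "S": "0011",
    "A": "0100", "D": "01010", "L": "01011", "O": "0110",
    "U": "01110", "F": "011110", "G": "011111",
    "E": "100", "R": "1010", "M": "10110", "P": "101110", "B": "101111",
    "T": "110", "Y": "11100", "V": "111010", "K": "1110110",
    "X": "11101110", "J": "111011110", "Q": "1110111110", "Z": "1110111111",
    "W": "11110", "C": "11111",
}
DECODE = {bits: letter for letter, bits in CODE.items()}

def encode(text):
    return "".join(CODE[c] for c in text.upper())

def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:   # prefix property: first match is the symbol
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)

print(encode("THE"))  # 1100001100
```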

Of course, only the lengths of the codes matter, so we can re-assign them as we like, as long as each symbol keeps a code of the same length. To ensure that every letter can be assigned a code, this is done by assigning codes to the letters with the shortest codes first, then converting all the unused codes to longer codes by appending both a 0 and a 1 to each one, creating two codes for each code that was present before, and then assigning these codes to the letters whose codes are one bit longer. This method can even be used in a computer program to create randomized Huffman codes.

An algorithm for applying this process might work like this:

- Have available a prepared list of information, giving for each required code length in bits that length and the number of symbols with that length of code, in order from shorter to longer.
- Start with the shortest length in the list, and call it N.
- Fill a list with all possible combinations of N bits.
- For the number of characters that have N bit codes, select for each one an element from the list at random, then remove that element from the list, making it shorter.
- For each of the remaining code lengths in that list, do
the following, starting from the second shortest length and ending
with the longest:
- Call the length currently being considered M.
- Make 2 to the (M minus N) power copies of each element in the list, and append each of the possible combinations of M minus N bits to one of the copies of each element.
- For the number of characters that have M bit codes, select for each one an element from the list at random, then remove that element from the list, making it shorter.
- Change the value of N by placing the current value of M in it.
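The algorithm just listed might be sketched as follows; the code lengths used in the example are those of the code derived above (two 3-bit codes, seven 4-bit codes, and so on), and the function name is my own.

```python
import random

# A sketch of the algorithm above: given (length, count) pairs in
# increasing order of length, hand out codes of each length at random,
# then split every unused code into two longer ones before moving on.

def randomized_code(length_counts, rng=random):
    """length_counts: list of (bits, number_of_symbols) pairs, shortest
    first. Returns randomly chosen codewords with the prefix property."""
    assigned = []
    n = length_counts[0][0]
    pool = [format(i, "0{}b".format(n)) for i in range(2 ** n)]
    for bits, count in length_counts:
        # Extend every unused code from length n up to length `bits` by
        # appending all possible combinations of (bits - n) bits.
        for _ in range(bits - n):
            pool = [c + b for c in pool for b in "01"]
        n = bits
        for _ in range(count):
            assigned.append(pool.pop(rng.randrange(len(pool))))
    return assigned

lengths = [(3, 2), (4, 7), (5, 7), (6, 5), (7, 1), (8, 1), (9, 1), (10, 2)]
print(randomized_code(lengths, rng=random.Random(1)))
```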

This has been proven to be the way to produce the most effective possible binary code for symbols, if you are not making use of any information except the frequencies of the symbols in isolation. An earlier code of this type, the Shannon-Fano code, worked from the top down instead of the bottom up. This led to some letters starting out placed with other letters whose frequencies were too far apart, producing slightly clumsy code assignments.

Often, in practice, a Huffman code for one kind of symbol will be mixed with other binary codes that are assigned based on guesses instead of actual frequencies. Sometimes this is hard to avoid. A Huffman code for the 26 letters of the alphabet is (well, almost) the best one can do to compress text, after it has been encrypted using a transposition cipher. Unencrypted text, though, still has other properties besides letter frequencies to exploit.

Fax machines use a clever modification of the Huffman principle. One Huffman code is used to represent stretches of black on the paper. Usually, these stretches are very short, the width of a line within a printed or typed character. Another code represents stretches of white. Their lengths vary more widely, and so a different code represents them.

With text, a multi-state code is useful also. Digraph frequencies can be exploited by preparing a separate Huffman code for the alphabet based on every possible value of the preceding letter in the message. Or one could make things simpler, by just having a code for letters after vowels, another for letters after consonants, and (if spaces are also represented somehow) another for letters at the beginning of words.

If text with spaces and punctuation marks is being reproduced, one could add these symbols to a Huffman code. But more efficiency, and more security as well, would be obtained by eliminating the frequent, almost regularly spaced repetition of the code for a space character: prefix each word, encoded with a Huffman code for letters only, with an indicator of the word's length, itself Huffman-coded based on the frequency of words of different lengths.

With this kind of an arrangement, the provision for punctuation marks and digits will probably be ad hoc rather than based on exact frequency counts. Presumably, the codes for words of different lengths would be accompanied by a code for switching to punctuation, and the common case of just one punctuation mark followed by a return to normal text would have a relatively short code.

One can do slightly better than a Huffman code if one uses *arithmetic
coding*. This allows symbols to be assigned a fractional number of bits, to
better fit the frequencies of symbols. There are patents affecting this
technique as well, and it involves quite a bit of extra trouble for a very
small improvement.

If we, for simplicity, consider symbols that are all the same length, it is simple enough to see that they do not have to be an integer number of bits in length. For example, ten bits can take on 1,024 possible values. So three digits, which have 1,000 possible values, can be encoded into ten bits, even though more than three bits are required to represent a digit.

This simple principle has been used for practical purposes. Many computer systems allowed file and variable names to be composed of uppercase letters, digits, and perhaps one or two special characters. Three characters, from a set of 40, allow 64,000 possible combinations, which fit into 16 bits, having 65,536 possible values.
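As a sketch of this kind of packing, the following program squeezes three characters from a 40-character set into one 16-bit value. The particular character ordering is modelled on the PDP-11 RADIX-50 set described below, with '?' standing in for the undefined code; the function names are my own.

```python
# An illustrative sketch of packing three characters from a set of 40
# into 16 bits: 40 * 40 * 40 = 64,000 combinations, which fit in the
# 65,536 values a 16-bit word can hold.

CHARSET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.?0123456789"  # '?' fills the undefined slot

def pack3(s):
    """Pack exactly three characters into one 16-bit value (base 40)."""
    a, b, c = (CHARSET.index(ch) for ch in s.upper())
    return (a * 40 + b) * 40 + c   # at most 63,999

def unpack3(word):
    c = CHARSET[word % 40]; word //= 40
    b = CHARSET[word % 40]; word //= 40
    return CHARSET[word % 40] + b + c

w = pack3("FOO")
print(w)            # 10215
print(unpack3(w))   # FOO
```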

On the PDP-10 computer, the Digital Equipment Corporation used the following coding, called RADIX 50, to allow the six characters of a variable name to be placed in only 32 bits, so that an extra four bits in a 36-bit word could be used for other purposes:

Characters     Codes

(filler)        0
0 - 9           1 - 10
A - Z          11 - 36
.              37
$              38
%              39

This, at least, is the coding mentioned in the section of the PDP-10 timesharing manual that described the assembler language.

And, on the PDP-11 computer, to allow file names to be stored on disk drives without waste, a slightly different code, called RADIX-50, was used where the set of 40 characters from which a 16-bit representation of three characters was built was:

Characters     Codes

space           0
A - Z           1 - 26
$              27
.              28
(undefined)    29
0 - 9          30 - 39

at least according to a FORTRAN manual for the PDP-11. For file names, the period may have been dropped from the set of allowed characters (although periods did occur in file names, they did not need to be coded, as the position of the period was fixed) and one or both of minus and underscore may have been added. (In fact, another source indicates that those two characters were added, and also the tilde; I suspect that the codes are tilde: zero, minus: 28, and underscore: 29, but I haven't been able to find a definitive source for a later version of RADIX 50 used for files.)

In both cases, the name contains "50" instead of "40" because 50 is the way 40 is written in octal notation.

But what we are considering here is a system of coding which attempts to improve on Huffman coding. Thus, as in Huffman coding, the symbols will not all take up the same amount of space; but the amount of space they do take up will no longer be restricted to a whole number of bits.

The principle behind arithmetic coding is illustrated by this example: supposing we have five symbols, with the following probabilities:

A 1/4     C 1/6
B 1/4     D 1/6
          E 1/6

While representing A by 00 and B by 01 would be optimal, one would like to switch, when representing C, D, and E, from starting with the base-2 digit 1 to a base-3 digit. This could be done, since if one's message were thought of as a number, then the rule might be:

0          - 0.24999...   multiply by 4, discarding the integer part, and write A.
0.25       - 0.49999...   multiply by 4, discarding the integer part, and write B.
0.5        - 0.66666...   multiply by 6, discarding the integer part, and write C.
0.66666... - 0.83333...   multiply by 6, discarding the integer part, and write D.
0.83333... - 0.99999...   multiply by 6, discarding the integer part, and write E.

This is obviously optimal for this particular alphabet. There are algorithms to allow arithmetic coding to be performed without having to perform arithmetic on one's entire compressed message for every symbol being added to it.
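The rule table above can be sketched directly, using exact fractions: decoding treats the compressed message as a number in [0, 1) and repeatedly applies the rule, and encoding is its inverse, folding the symbols in from the last to the first. This is only an illustration of the principle, not a practical arithmetic coder.

```python
from fractions import Fraction

# An illustrative sketch of the rule table above. Decoding reads symbols
# out of a number in [0, 1); encoding builds such a number back up.

RULES = [  # (low, high, multiplier, symbol)
    (Fraction(0),    Fraction(1, 4), 4, "A"),
    (Fraction(1, 4), Fraction(1, 2), 4, "B"),
    (Fraction(1, 2), Fraction(2, 3), 6, "C"),
    (Fraction(2, 3), Fraction(5, 6), 6, "D"),
    (Fraction(5, 6), Fraction(1),    6, "E"),
]

def decode(x, n):
    """Read n symbols out of the number x in [0, 1)."""
    out = []
    for _ in range(n):
        for low, high, mult, sym in RULES:
            if low <= x < high:
                out.append(sym)
                x = x * mult - int(x * mult)  # multiply, drop integer part
                break
    return "".join(out)

def encode(symbols):
    """Inverse of decode: fold symbols in, last first."""
    table = {sym: (low, mult) for low, high, mult, sym in RULES}
    x = Fraction(0)
    for sym in reversed(symbols):
        low, mult = table[sym]
        x = low + x / mult
    return x

print(encode("CAB"))            # 49/96
print(decode(encode("CAB"), 3)) # CAB
```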

A nomenclator might give two or three digit codes to every item it encodes, whether a single uncommon letter like Z, or a common three-letter sequence like ENT or ING. This idea, using the unequal probabilities of a source stream to map variable-length chunks of the source to fixed-length (and therefore convenient to handle) codes, was explored in a thesis by B. P. Tunstall, in which he showed that the following simple algorithm:

- Assign a code to each source symbol.
- Take the code with the highest probability, and remove the source string assigned to it from the list, and instead assign a code to that source string followed by each possible source symbol.
- Repeat the preceding step until the required number of codes has been assigned.

is optimal if only individual symbol probabilities are taken into account. When other probabilities are taken into account, by using bigram and trigram frequencies, one has a more accurate value for the probability of each code; but one also can see that it makes no sense to assign a code to THZ just because one wishes to assign a code to THE: thus, the proper way of assigning codes is more complicated. (If one has assigned codes for all common cases of a letter following TH, should one retain a code for TH, or just retain codes for T and H?) The problem of performing this type of coding optimally for sources with memory is still a current subject of research.
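The three-step procedure listed above can be sketched in a few lines; this version, like the original algorithm, uses individual symbol probabilities only, and the function name is my own.

```python
import heapq

# A sketch of the Tunstall procedure above: repeatedly expand the most
# probable codeword's source string into one string per source symbol,
# until the required number of codes has been assigned.

def tunstall(probs, num_codes):
    """probs: {symbol: probability}. Returns the source strings that
    receive fixed-length codes, as (string, probability) pairs."""
    # Max-heap via negated probabilities.
    heap = [(-p, s) for s, p in probs.items()]
    heapq.heapify(heap)
    # Each expansion removes one entry and adds len(probs) entries.
    while len(heap) + len(probs) - 1 <= num_codes:
        neg_p, s = heapq.heappop(heap)   # most probable string
        for sym, p in probs.items():     # replace it by its extensions
            heapq.heappush(heap, (neg_p * p, s + sym))
    return sorted((s, -np) for np, s in heap)

print(tunstall({"A": 0.7, "B": 0.3}, 4))
```

With four codes (two bits each) and a source where A has probability 0.7, the strings AAA, AAB, AB, and B each receive a fixed-length code.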

David A. Huffman passed away on October 7, 1999.
