A Proposal For Efficient Text Representation

An English-language text document is mostly composed of lower-case letters. There are 26 letters in the alphabet.

Thus, it would seem that it ought to be possible to use five bits, rather than eight bits, to represent a character on average.

Of course, since more than 32 characters are used in text documents, this means that the same sequence of five bits would have to have more than one meaning. This makes things more complicated. And, if one is willing to make things more complicated, why not use, say, a Huffman code? Or even arithmetic coding?

But shifting individual bits is somewhat slow and cumbersome on computers. Allowing the same 8-bit byte to have multiple meanings, which does make things somewhat more complex, on the other hand, is used in the popular UTF-8 code which allows normal ASCII characters to be represented in a single 8-bit byte, while using codes with two or three bytes to represent other 16-bit Unicode characters. That code, however, does not use persistent shift states, and begins each character with unique prefix bytes; although it sacrifices efficiency, it has many other desirable properties.

A five-bit code for characters has long been used for teletypewriters. Since it has to use a whole character to indicate a shift from letters to figures, and another character to shift back again, however, and fairly elaborate schemes were later added to allow the use of lower-case letters with it, perhaps there is room for further improvement.

I am thinking in terms of a code arrangement of the following type:


00000 space
00001  A   a   !   @
00010  B   b   "   °
00011  C   c   #
00100  D   d   %
00101  E   e   $   ¢
00110  F   f   &
00111  G   g   '   `
01000  H   h   (   [
01001  I   i   )   ]
01010  J   j   *   ^
01011  K   k   +   |
01100  L   l   ;   <
01101  M   m   =   ~
01110  N   n   :   >
01111  O   o   /   
10000  P   p   0
10001  Q   q   1   ¼
10010  R   r   2   ½
10011  S   s   3   ¾
10100  T   t   4
10101  U   u   5
10110  V   v   6
10111  W   w   7
11000  X   x   8   {
11001  Y   y   9   }
11010  Z   z   ?   \
11011 . (period)
11100 , (comma)
11101 - (hyphen)
11110 newline
11111 control

Instead of CR and LF characters, there is a single newline character; all other control functions are prefixed by the control character. This allows the period, comma, and hyphen to join the space as characters available in all shift states.

But doesn't this mean that shifting from figures to letters now requires at least a two-character code indicating a shift combination?

No, because I plan to use a different method entirely of indicating shifts.

A control code will indicate which coding method is to be used.

At the lowest level, shifts can be represented by the use of a disguised six-bit code or a disguised six and two-thirds bit code: one character having 32 possible values will indicate whether five succeeding characters having codes from 1 to 26 are uppercase letters, or figures from the main figures set, or one character having 27 possible values will indicate whether three succeeding characters having codes from 1 to 26 are uppercase letters, lowercase letters, or figures from the main figures set.

As ASCII requires seven bits per character to allow the use of lowercase letters, six and two-thirds bits per character is already an improvement. The flexibility of switching between a six-bit code for uppercase only and one including lowercase is also a benefit.

But this falls far short of the original goal of approaching five bits per character. The ability to use a six-bit code 'in disguise' does promote efficiency when one is dealing with languages such as Armenian or Thai, when the basic alphabet or syllabary has more than 32 (or, for this specific scheme, 29) characters, so that an excessive number of shifts would be required with a five-bit code. This is why the scheme of providing a printable character set of 30, 56, or 82 characters by means of regularly recurring prefix characters was described as operating on the lowest level; it would merely provide access to the basic alphabet. Occasional shifts would instead be indicated outside that by a more sophisticated scheme.

Note that a space, period, comma, or hyphen would not consume one of the sixth bits encoded in a prefix character, causing a prefix to occur less often. A prefix character would occur as soon as the prefix information in the previous prefix is used up, so that it could be recognized. Also, a newline causes the prefix information to be cleared, so that it can be used to end a transmission and allow the other side to reply. This is not a half-duplex versus full-duplex issue; the problem is that when one is finished typing, what one has typed cannot otherwise be sent unless the prefixes of the next thing one will type can be predicted.

It only takes one more character to encode

4THIS2IS1A8SENTENCE

than it does to encode

THIS IS A SENTENCE

and, since the numbers indicate where each word will end, one can use the same codes that indicate different letters to indicate different numbers; a larger set of characters is not needed. Elsewhere, I describe applying this principle to a Huffman coding.

So we have prefixes with meanings like "word of one lowercase letter", "word of two lowercase letters", ... "word of one uppercase letter", "word of two uppercase letters", ..., "word of two letters starting with an uppercase letter", "word of three letters starting with an uppercase letter", and so on.

There would also be prefixes for words followed by one figures character but no space and words followed by one figures character and a space. Allowing period, comma, and hyphen to stand outside the system is more complicated here, and, since 44 different prefix codes are not available, some less common categories of prefix code might be two characters long.

It is intended that this scheme would also be applicable to languages like Chinese. Thus, while 26 * 26 is 676, not a large enough number of characters for ordinary use, five times that many is 3,380. About 3,000 characters are what is required to handle almost all of an ordinary Chinese text; thus, a disguised twelve and one-half bit code, in which one 5-bit prefix character contains prefix information for two Chinese characters, each contained in two five-bit characters would be used as the base coding for Chinese, with occasional control characters indicating when something else was required. Open quote, close quote, and period could be indicated by the codes used for period, comma, and hyphen.

If that is dropped, of course, then the basic unit would be 29*29 or 841 characters. Even if three codes were reserved, though, since only the first code of a two-character unit needs to be compared to anything, the basic unit could be 26*29 (754) or even 26*32 (832) characters. Spaces between words are also not applicable to Chinese, and so a basic unit of 30*32, or 1,080, characters is entirely possible. In this case, a 5-bit prefix character could then be used to supply the required prefix information for three characters rather than two.

Note that while a 5-bit prefix character can multiply the number of meanings of five succeeding characters by two (2^5=32), three succeeding characters by three (3^3=27), and two succeeding characters by five (5^2=25), the flexibility is not continuous; after five, there is then a big jump of 32 produced by adding another character.

Still, it seems that the flexibility that is available is well-matched to the size of character repertoires in existing languages. As well, if one really needs to deal efficiently with a syllabary with, say, 260 characters, one could use a prefix consisting of two five-bit characters, having 1,024 possible values, to multiply the number of meanings of three succeeding characters by 10 (10^3=1,000), or, for that matter, five succeeding characters by 4 (4^5=2^10=1,024).

Thus, I envisage the two kinds of prefixes as being as given in the following table:


@ 00000 1,1,1,1,1    1,1,1   1,1   a_          exit to one rail
A 00001 1,1,1,1,2    1,1,2   1,2   aa_         exit to two rail
B 00010 1,1,1,2,1    1,1,3   1,3   aaa_        exit to three rail
C 00011 1,1,1,2,2    1,2,1   1,4   aaaa_       exit to five rail
D 00100 1,1,2,1,1    1,2,2   1,5   aaaaa_      exit to four rail
E 00101 1,1,2,1,2    1,2,3   2,1   aaaaaa_     exit to ten rail
F 00110 1,1,2,2,1    1,3,1   2,2   aaaaaaa_
G 00111 1,1,2,2,2    1,3,2   2,3   A_          change to one rail
H 01000 1,2,1,1,1    1,3,3   2,4   Aa_         change to two rail
I 01001 1,2,1,1,2    2,1,1   2,5   Aaa_        change to three rail
J 01010 1,2,1,2,1    2,1,2   3,1   Aaaa_       change to five rail
K 01011 1,2,1,2,2    2,1,3   3,2   Aaaaa_      change to four rail
L 01100 1,2,2,1,1    2,2,1   3,3   Aaaaaa_     change to ten rail
M 01101 1,2,2,1,2    2,2,2   3,4   Aaaaaaa_
N 01110 1,2,2,2,1    2,2,3   3,5   a._         A._
O 01111 1,2,2,2,2    2,3,1   4,1   aa._        Aa._
P 10000 2,1,1,1,1    2,3,2   4,2   aaa._       Aaa._
Q 10001 2,1,1,1,2    2,3,3   4,3   aaaa._      Aaaa._
R 10010 2,1,1,2,1    3,1,1   4,4   aaaaa._     Aaaaa._
S 10011 2,1,1,2,2    3,1,2   4,5   aaaaaa._    Aaaaaa._
T 10100 2,1,2,1,1    3,1,3   5,1   aaaaaaa._   Aaaaaaa._
U 10101 2,1,2,1,2    3,2,1   5,2   a'          A'
V 10110 2,1,2,2,1    3,2,2   5,3   aa'         Aa'
W 10111 2,1,2,2,2    3,2,3   5,4   aaa'        Aaa'
X 11000 2,2,1,1,1    3,3,1   5,5   aaaa'       Aaaa'
Y 11001 2,2,1,1,2    3,3,2         aaaaa'      Aaaaa'
Z 11010 2,2,1,2,1    3,3,3         aaaaaa'     Aaaaaa'
[ 11011 2,2,1,2,2                  aaaaaaa'    Aaaaaaa'
\ 11100 2,2,2,1,1                  aaaaaaa     (from all caps)
] 11101 2,2,2,1,2                  Aaaaaaa     (to all caps)
^ 11110 2,2,2,2,1                  newline     change language
_ 11111 2,2,2,2,2                  control ----->

In the fourth column, for example, aaa means a word of three lowercase letters, followed by a space; Aaaa means a word four letters in length that begins with a capital letter. Also, aaaaa._ means a five-letter word that is followed by a punctuation mark, and then a space; this punctuation mark can be one that this code causes to be shifted from a letter (such as ?) or it can be one of the always-available ones, such as the period and comma, and aaa' means three letters followed by a punctuation mark that are not followed by a space, which instead form part of a longer word; here, always-available punctuation marks are not included, since they lie outside the system and do not need to be indicated.

Using the letters in the first column to represent the codes, here is an example of how this works:

=Could you find my left-han=ded wrench for me, because I can'=t?
KCOULDBYOUCFINDAMY\LEFT]HANBDEDDWRENCHBFOROME\FBECAUSEGIWCANGNTZ

The first line shows the text, with = used as a padding character to align it with the five-bit characters representing it in the line below. Note how the word left-handed is treated as a normal ten-letter word, consisting of seven prepended letters followed by a three-letter word, while the word can't is decomposed into three letters followed by a punctuation mark preceding a one-letter word followed by a punctuation mark and a space.

As these prefix codes begin words, newline (which eats a preceding space) needs to be among them. If the 5-bit code indicating control precedes a code from the aaa._ or aaa' groups, it simply modifies them to begin with a capital letter.

The code 'exit to three rail' means that the prefix codes of this type stop being used, and a prefix character of the type described first every three applicable characters is used to choose between capital letters, small letters, and special characters; the code 'change to two rail' means that the prefix codes of this type remain in use, but in addition, a prefix character every five applicable characters switches between two possible interpretations for the 26 letters; this would be useful for languages using longer alphabets than that of English. This could be particularly useful with Russian, for example, which has 32 letters, and which might require either one rail or two given that some letters have a low frequency.

[Home]