
The Representation of Speech

Historically, the primary use of encryption has been, of course, to protect messages in text form. Advancing technology has allowed images and audio to be stored and communicated in digital form. A particularly effective method of compressing images is the Discrete Cosine Transform, which is used in the JPEG (Joint Photographic Experts Group) file format.

When sound is converted to an analogue electrical signal by an appropriate transducer (a device that converts changing levels of one quantity into changing levels of another), such as a microphone, the resulting electrical signal has a value that changes over time, oscillating between positive and negative.

A Compact Disc stores stereo musical recordings in the form of two digital audio channels, each one containing 44,100 16-bit signed integers for every second of sound. This leads to a total data rate of 176,400 bytes per second.

For transmitting a telephone conversation digitally, the same level of fidelity is not required. Only a single audio channel is used, and only frequencies of up to 3000 cycles per second (3000 Hertz) need to be reproduced. Because of a mathematical law called the Nyquist theorem, this requires 6000 samples of the level of the audio signal to be taken each second, after the signal has been bandlimited to the range of frequencies to be reproduced; otherwise, aliasing may result.
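As a quick illustration of why bandlimiting is needed first, here is a minimal sketch in Python (the sample rate and tone frequencies are chosen purely for illustration): a 5000 Hz tone, which is above the 3000 Hz limit for a 6000-sample-per-second system, yields exactly the same samples as a 1000 Hz tone.

import math

RATE = 6000  # samples per second, suitable for a signal bandlimited to 3000 Hz

def sample_tone(freq_hz, n_samples, rate=RATE):
    # Return n_samples of a cosine tone at freq_hz, sampled at rate.
    return [math.cos(2 * math.pi * freq_hz * n / rate) for n in range(n_samples)]

# The samples of the 5000 Hz tone are indistinguishable from those of a
# 1000 Hz tone: the higher frequency has "aliased" down into the passband.
alias = sample_tone(5000, 12)
real  = sample_tone(1000, 12)
assert all(abs(a - b) < 1e-9 for a, b in zip(alias, real))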

For many communications applications, samples of audio waveforms are one byte in length, and they are represented by a type of floating-point notation to allow one byte to represent an adequate range of levels.

Simple floating-point notation, for an eight-bit byte, might look like this:

S EE MMMMM
0 11 11111  1111.1  
0 11 10000  1000.0
0 10 11111   111.11
0 10 10000   100.00
0 01 11111    11.111
0 01 10000    10.000
0 00 11111     1.1111
0 00 10000     1.0000

The sign bit is always shown as 0, which indicates a positive number. Negative numbers are often indicated in floating-point notation by making the sign bit a 1 without changing any other part of the number, although other conventions are used as well. For comparison purposes, the floating-point notations shown have all been scaled so that 1 represents the smallest nonzero number that can be indicated.
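To make the format concrete, here is a minimal sketch in Python of decoding such a byte, using the field layout and scaling shown in the table above (the function name is mine):

def decode_simple(byte):
    # Decode an 8-bit S/EE/MMMMM float, scaled so that 0 00 10000 -> 1.0.
    sign     = -1 if byte & 0x80 else 1
    exponent = (byte >> 5) & 0x03   # two exponent bits
    mantissa = byte & 0x1F          # five explicit mantissa bits
    return sign * (mantissa / 16.0) * (2 ** exponent)

assert decode_simple(0b0_00_10000) == 1.0    # the 1.0000 row
assert decode_simple(0b0_11_11111) == 15.5   # the 1111.1 row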

One way to extend the range of values that can be represented is to allow gradual underflow, where an unnormalized mantissa is permitted for the smallest exponent value.

S EE MMMMM
0 11 11111  11111000  
0 11 10000  10000000
0 10 11111   1111100
0 10 10000   1000000
0 01 11111    111110
0 01 10000    100000
0 00 11111     11111
0 00 10000     10000
0 00 01111      1111
0 00 01000      1000
0 00 00111       111
0 00 00100       100
0 00 00011        11
0 00 00010        10
0 00 00001         1
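
With the scaling used in this table, where the smallest representable value is 1, decoding reduces to a simple shift; a sketch under the same assumptions as before:

def decode_gradual(byte):
    # Decode the S/EE/MMMMM format with gradual underflow, scaled so
    # that 0 00 00001 -> 1; an unnormalized mantissa is meaningful
    # (rather than redundant) only for the smallest exponent.
    sign     = -1 if byte & 0x80 else 1
    exponent = (byte >> 5) & 0x03
    mantissa = byte & 0x1F
    return sign * (mantissa << exponent)

assert decode_gradual(0b0_11_11111) == 0b11111000   # largest value, 248
assert decode_gradual(0b0_00_00001) == 1            # smallest value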

Another way of making a floating-point representation more efficient involves noting that, in the first case, the first mantissa bit is always one. (The field of a floating-point number that represents the actual number directly is called the mantissa because it corresponds to the fractional part of the number's logarithm to the base used for the exponent.) With gradual underflow, that bit is only allowed to be zero for the smallest exponent value. Instead of using gradual underflow, one could use the basic floating-point representation we started with, but simply omit the bit that is always equal to one.

This could produce a result like this:

S EEE MMMM
0 111 aaaa  1aaaa000
0 110 aaaa   1aaaa00
0 101 aaaa    1aaaa0
0 100 aaaa     1aaaa
0 011 aaaa      1aaa.a
0 010 aaaa       1aa.aa
0 001 aaaa        1a.aaa
0 000 aaaa         1.aaaa

Here, the variable bits of the mantissa are denoted by aaaa, instead of being represented as all ones in one line and all zeroes in a following line, for both compactness and clarity.
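Decoding this format means restoring the omitted leading one bit; a sketch, again using the scaling from the table:

def decode_hidden_bit(byte):
    # Decode the S/EEE/MMMM format with the leading 1 bit suppressed,
    # scaled so that 0 000 0000 -> 1.0000 (binary).
    sign     = -1 if byte & 0x80 else 1
    exponent = (byte >> 4) & 0x07     # three exponent bits
    mantissa = 16 + (byte & 0x0F)     # restore the implicit leading 1
    return sign * (mantissa / 16.0) * (2 ** exponent)

assert decode_hidden_bit(0b0_000_0000) == 1.0      # 1.0000
assert decode_hidden_bit(0b0_111_1111) == 248.0    # 11111000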

Today's personal computers use a standard floating-point format that combines gradual underflow with suppressing the first one bit in the mantissa. This is achieved by reserving a special exponent value, the lowest one, to behave differently from the others. That exponent value is required to multiply the mantissa by the same amount as the next higher exponent value (instead of a power of the radix that is one less), and the mantissa, for that exponent value, does not have its first one bit suppressed.
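A sketch of how that special lowest exponent value behaves, applied to the same 8-bit S/EEE/MMMM layout as above (the layout and scaling are my own choices for illustration; real standard formats such as IEEE 754 use more bits and a biased exponent):

def decode_ieee_style(byte):
    # Combine the hidden leading bit with gradual underflow: exponent 0
    # is "denormal", scaling the mantissa by the same power of two as
    # exponent 1, and its mantissa has no suppressed leading bit.
    sign     = -1 if byte & 0x80 else 1
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    if exponent == 0:
        return sign * (mantissa / 16.0)   # denormal: 0.MMMM times 2^0
    return sign * ((16 + mantissa) / 16.0) * 2 ** (exponent - 1)

assert decode_ieee_style(0b0_001_0000) == 1.0   # smallest normal number
assert decode_ieee_style(0b0_000_1000) == 0.5   # a denormal, 0.1000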

Another method of representing floating point quantities efficiently is something I call extremely gradual underflow. This retains the first one bit in the mantissa, but treats the degree of unnormalization of the mantissa as the most significant part of the exponent field. It works like this (the third column shows an alternate version of this format, to be explained below):

S EE MMMMM                         S M EE MMMM
0 11 1aaaa  1aaaa000000000000000   0 1 11 aaaa
0 10 1aaaa   1aaaa00000000000000   0 1 10 aaaa
0 01 1aaaa    1aaaa0000000000000   0 1 01 aaaa
0 00 1aaaa     1aaaa000000000000   0 1 00 aaaa

                                   S MM EE MMM
0 11 01aaa      1aaa000000000000   0 01 11 aaa
0 10 01aaa       1aaa00000000000   0 01 10 aaa
0 01 01aaa        1aaa0000000000   0 01 01 aaa
0 00 01aaa         1aaa000000000   0 01 00 aaa

                                   S MMM EE MM
0 11 001aa          1aa000000000   0 001 11 aa
0 10 001aa           1aa00000000   0 001 10 aa
0 01 001aa            1aa0000000   0 001 01 aa
0 00 001aa             1aa000000   0 001 00 aa

                                   S MMMM EE M
0 11 0001a              1a000000   0 0001 11 a
0 10 0001a               1a00000   0 0001 10 a
0 01 0001a                1a0000   0 0001 01 a
0 00 0001a                 1a000   0 0001 00 a

                                   S MMMMM EE
0 11 00001                  1000   0 00001 11
0 10 00001                   100   0 00001 10
0 01 00001                    10   0 00001 01
0 00 00001                     1   0 00001 00
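
A sketch of decoding the left-hand form of this format; the key observation is that each position the leading one bit of the mantissa slides downwards adds a full cycle of the two-bit exponent, that is, a shift of three more bit positions (the function name, and the treatment of an all-zero mantissa, are my own assumptions):

def decode_egu(byte):
    # Decode the S/EE/MMMMM "extremely gradual underflow" format, in
    # which the count of leading zeroes in the mantissa acts as the
    # most significant part of the exponent.  Scaled as in the table.
    sign     = -1 if byte & 0x80 else 1
    exponent = (byte >> 5) & 0x03
    mantissa = byte & 0x1F
    if mantissa == 0:
        return 0   # not defined by the table; treated here as zero
    shift = exponent + 3 * (mantissa.bit_length() - 1)
    return sign * (mantissa << shift)

assert decode_egu(0b0_11_10000) == 0b10000 << 15   # the top row, aaaa = 0000
assert decode_egu(0b0_00_00001) == 1               # the bottom row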

Although usually a negative number is indicated simply by setting the sign bit to 1, another possibility is to also invert all the other bits in the number. In this way, for some of the simpler floating-point formats, an integer comparison instruction can also be used to test if one floating-point number is larger than another.
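A small demonstration of that trick, using the simple S/EE/MMMMM format from the beginning of this section (the function names are mine):

def encode_negative(positive_byte):
    # Form the negative of a float byte by setting the sign bit and
    # inverting all the other bits, as described above.
    return (positive_byte | 0x80) ^ 0x7F

def as_signed(byte):
    # Reinterpret a byte as a signed (two's complement) integer.
    return byte - 256 if byte & 0x80 else byte

# With this convention, an ordinary signed integer comparison orders
# the encoded values correctly: -15.5 < -1.0 < 1.0 < 15.5.
small, big = 0b0_00_10000, 0b0_11_11111   # 1.0 and 15.5
codes = [encode_negative(big), encode_negative(small), small, big]
assert [as_signed(c) for c in codes] == sorted(as_signed(c) for c in codes)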

This definitely will not work for the complicated extremely gradual underflow format as it is shown here. However, that format can be coded so as to allow this to work, as follows: the exponent field can be made movable, and it can be placed after the first 1 bit in the mantissa field. This is the format shown in the third column above.

When this is done, for very small numbers the idea of allowing the exponent field to shrink suggests itself.

Thus, if the table above is continued, we obtain:

S EE MMMMM                              S MMMMM EE
0 11 00001                  1000        0 00001 11
0 10 00001                   100        0 00001 10
0 01 00001                    10        0 00001 01
0 00 00001                     1        0 00001 00

                                        S MMMMMM E
N/A                            0.1      0 000001 1
N/A                            0.01     0 000001 0

                                        S MMMMMMM
N/A                            0.001    0 0000001

Something very similar is used to represent sound signals in 8-bit form using the A-law, which is the standard for European microwave telephone transmission, and which is also sometimes used for satellite audio transmissions. However, the convention for representing the sign of numbers is different.

Mu-law encoding, used in the United States and Japan (and, I would suspect, Canada as well), instead operates as a conventional floating-point format, with the first bit of the mantissa suppressed; since the base of the exponent is two, that bit is always a 1 for a normalized number. The following table illustrates these formats, with capital letters indicating bits that are complemented:

Linear value                 Extremely Gradual  A-Law (1)    Suppressed Bit  Mu-Law      Suppressed Bit       A-Law (2)
                             Underflow                       Floating-Point              Floating-Point with
                             Floating-Point                                              Gradual Underflow

+1aaaa000000000000000000     0 1 11 aaaa        1111aaaa     0 111 aaaa      1000AAAA    0 111 aaaa           1111aaaa
+01aaaa00000000000000000     0 1 10 aaaa        1110aaaa     0 110 aaaa      1001AAAA    0 110 aaaa           1110aaaa
+001aaaa0000000000000000     0 1 01 aaaa        1101aaaa     0 101 aaaa      1010AAAA    0 101 aaaa           1101aaaa
+0001aaaa000000000000000     0 1 00 aaaa        1100aaaa     0 100 aaaa      1011AAAA    0 100 aaaa           1100aaaa
+00001aaab00000000000000     0 01 11 aaa        10111aaa     0 011 aaab      1100AAAB    0 011 aaab           1011aaab
+000001aaab0000000000000     0 01 10 aaa        10110aaa     0 010 aaab      1101AAAB    0 010 aaab           1010aaab
+0000001aaab000000000000     0 01 01 aaa        10101aaa     0 001 aaab      1110AAAB    0 001 aaab           1001aaab
+00000001aaab00000000000     0 01 00 aaa        10100aaa     0 000 aaab      1111AAAB    0 000 1aaa           10001aaa
+000000001aa000000000000     0 001 11 aa        100111aa                                 0 000 01aa           100001aa
+0000000001ab00000000000     0 001 10 ab        100110ab                                 0 000 001a           1000001a
+00000000001aa0000000000     0 001 01 aa        100101aa                                 0 000 0001           10000001
+000000000001aa000000000     0 001 00 aa        100100aa
+0000000000001a000000000     0 0001 11 a        1000111a
+00000000000001a00000000     0 0001 10 a        1000110a
+000000000000001a0000000     0 0001 01 a        1000101a
+0000000000000001a000000     0 0001 00 a        1000100a
+00000000000000001000000     0 00001 11         10000111
+00000000000000000100000     0 00001 10         10000110
+00000000000000000010000     0 00001 01         10000101
+00000000000000000001000     0 00001 00         10000100
+00000000000000000000100     0 000001 1         10000011
+00000000000000000000010     0 000001 0         10000010
+00000000000000000000001     0 0000001          10000001

+0                                              10000000                                                      10000000
-0                                              01111111                                                      01111111

-00000001aaab00000000000     1 01 00 aaa        00100aaa     1 000 aaab      0111AAAB    1 000 1aaa           00001aaa
-1aaaa000000000000000000     1 1 11 aaaa        0111aaaa     1 111 aaaa      0000AAAA    1 111 aaaa           0111aaaa

Most descriptions of A-Law and Mu-Law encoding state that it is Mu-Law encoding that has the greater dynamic range, acting on 14-bit values while A-Law encoding acts on 13-bit values. It appears to me, as shown in the diagram, that Mu-Law encoding acts on 13-bit values, and A-Law encoding acts on 24-bit values. It may be that the floating-point encoding used with Mu-Law encoding is applied not to the input signal value, but to its logarithm; or it may be that my original source for information on A-Law encoding was not accurate, or that I misconstrued it. The latter seems likely, as using 24-bit digitization as the first step in digitizing a telephone conversation appears bizarre in comparison to the standards for high-quality digital audio. The last column, labeled A-Law (2), indicates what other sources appear to give for A-Law encoding, and this does cause it to act on 12-bit values (including the sign bit), which is at least one bit less than for Mu-Law encoding, even if there is a one-bit discrepancy in both cases.
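For concreteness, here is a sketch of the Mu-Law segment encoder as it is conventionally described for 14-bit input samples: a bias of 33 is added to the magnitude, the position of the leading one bit selects one of eight segments, and every bit of the output is complemented. This follows the usual textbook formulation of G.711 Mu-Law rather than the tables above:

def mulaw_encode(sample):
    # Encode a 14-bit signed linear sample (-8192..8191) as one Mu-Law byte.
    BIAS, CLIP = 33, 0x1FFF
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample) + BIAS, CLIP)   # the bias guarantees a 1 in bit 5
    exponent = magnitude.bit_length() - 6       # segment number, 0..7
    mantissa = (magnitude >> (exponent + 1)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

assert mulaw_encode(0) == 0xFF      # +0 encodes as all ones
assert mulaw_encode(8158) == 0x80   # the loudest positive value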

Also, if extremely gradual underflow, with a two-bit exponent, were used for encoding audio signals with 16 bits per sample, the result, for the loudest signals, would have the same precision as a 14-bit signed integer: 13 bits of mantissa plus a sign bit. Many early digital audio systems used 14 bits per sample rather than 16 bits. But the dynamic range, the difference between the softest and loudest signals possible, would be that of a 56-bit integer.

One problem with using floating-point representations of signals for digital high-fidelity audio - although this particular format seems precise enough to largely make that problem minor - is that the human ear can still hear relatively faint sounds while another sound is present, if the two sounds are in different parts of the frequency spectrum. This is why some methods of music compression, such as those used with Sony's MiniDisc format, Philips' DCC (Digital Compact Cassette), and today's popular MP3 audio format, work by dividing the audio spectrum up into "critical bands", which are to some extent processed separately.

Transmitting 6000 bytes per second is an improvement over 176,400 bytes per second, but it is still a fairly high data rate, requiring a transmission rate of 48,000 baud.

Other techniques for compressing audio waveforms include delta modulation, where the difference between consecutive samples, rather than the samples themselves, is transmitted. A technique called ADPCM, adaptive differential pulse code modulation, works by such methods as extrapolating the previous two samples in a straight line, and assigning the available codes for levels for the current sample symmetrically around the extrapolated point.
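A sketch of the straight-line extrapolation idea just mentioned; the predictor is as described, but the step size and the four-bit code assignment are my own illustrative choices rather than those of any particular ADPCM standard:

def adpcm_sketch(samples, step=16):
    # Toy ADPCM: predict each sample by extrapolating the previous two
    # in a straight line, then transmit a small code for the difference
    # between the prediction and the actual sample.
    prev2, prev1 = 0, 0
    codes, decoded = [], []
    for s in samples:
        prediction = 2 * prev1 - prev2               # straight-line extrapolation
        code = max(-8, min(7, round((s - prediction) / step)))   # 4-bit code
        codes.append(code)
        reconstructed = prediction + code * step     # what the receiver computes
        decoded.append(reconstructed)
        # Predict from reconstructed, not true, samples, so that the
        # transmitter and receiver stay in step with each other.
        prev2, prev1 = prev1, reconstructed
    return codes, decoded

A real ADPCM coder also adapts the step size to the recent behaviour of the signal, which is where the word "adaptive" comes from.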

The term LPC, which stands for linear predictive coding, does not, as it might seem, refer to this kind of technique; instead, it refers to a method that can very effectively reduce the amount of data required to transmit a speech signal, because it is based on the way the human vocal tract forms speech sounds.
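As a rough illustration of the predictive part only, the following sketch computes linear-prediction coefficients for one frame of samples by solving the autocorrelation equations with the Levinson-Durbin recursion; a real LPC codec would go on to transmit these coefficients, plus pitch and gain information, instead of the samples themselves. It assumes a frame of nonzero samples:

def lpc_coefficients(frame, order=10):
    # Return `order` linear-prediction coefficients for one frame of
    # samples, via autocorrelation and the Levinson-Durbin recursion.
    n = len(frame)
    # Autocorrelation of the frame at lags 0..order.
    r = [sum(frame[i] * frame[i - lag] for i in range(lag, n))
         for lag in range(order + 1)]
    a = [0.0] * order        # prediction coefficients, built up in stages
    error = r[0]             # prediction error power so far
    for i in range(order):
        # Reflection coefficient for this stage of the recursion.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        error *= 1 - k * k
    return a   # predicted sample x[n] is the sum of a[j] * x[n - 1 - j]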

There was a good page about Linear Predictive Coding at

http://asylum.sf.ca.us/pub/u/howitt/lpc.tutorial.html

but that URL is no longer valid.

In the latter part of World War II, the United States developed a highly secure speech scrambling system which used the vocoder principle to convert speech to a digital format. This format was then enciphered by means of a one-time-pad, and the result was transmitted using the spread-spectrum technique.

The one-time-pad was in the form of a phonograph record, containing a signal which had six distinct levels. The records used by the two communicating stations were kept synchronized by the use of quartz crystal oscillators, with the quartz crystals kept at a controlled temperature. The system was called SIGSALY, and it was described in an article by David Kahn in the September 1984 issue of IEEE Spectrum.

Speech was converted for transmission as follows:

The loudness of the portion of the sound in each of ten frequency bands, on average 280 Hz in width (ranging from 150 Hz to 2950 Hz), was determined for periods of one fiftieth of a second. This loudness was represented by one of six levels.

The fundamental frequency of the speaking voice was represented by 35 codes; a 36th code indicated that a white noise source should be used instead in reconstructing the voice. This was also sampled fifty times a second.

The intensities of sound in the bands indicated both the loudness of the fundamental signal and the resonance of the vocal tract with respect to those harmonics of the fundamental signal that fell within each band. Either a waveform with the frequency of the fundamental and a full set of harmonics, or white noise, was used as the source of the reconstructed sound in the receiver; it was then filtered in the ten bands to match the observed intensities in those bands.

This involved the transmission of twelve base-6 digits, 50 times a second.

Since 6 to the 12th power is 2,176,782,336, which is just over 2^31, which is 2,147,483,648, this roughly corresponds to transmitting 200 bytes a second. This uses only two-thirds of the capacity of a 2,400-baud modem, and is quite a moderate data rate.
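A sketch of the packing and the arithmetic this implies; the order of the fields within a frame is my own guess, since only the counts are given above:

import math

def sigsaly_frame(band_levels, pitch_code):
    # Pack ten band loudness levels (each 0..5) and one pitch code
    # (0..35, i.e. two more base-6 digits) into twelve base-6 digits.
    assert len(band_levels) == 10 and all(0 <= b < 6 for b in band_levels)
    assert 0 <= pitch_code < 36
    return band_levels + [pitch_code // 6, pitch_code % 6]

bits_per_frame = math.log2(6 ** 12)      # about 31.02 bits per frame
print(round(bits_per_frame * 50 / 8))    # about 194 bytes per second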

The sound quality SIGSALY provided, however, was mediocre. A standard for linear predictive coding, known as CELP, comes in two versions, which convert the human voice to a 2,400-baud signal or to a 4,800-baud signal.

