Historically, the primary use of encryption has been, of course, to protect messages in text form. Advancing technology has allowed images and audio to be stored and communicated in digital form. A particularly effective method of compressing images is the Discrete Cosine Transform, which is used in the JPEG (Joint Photographic Experts Group) file format.
When sound is converted to an analogue electrical signal by an appropriate transducer (a device for converting changing levels of one quantity to changing levels of another) such as a microphone, the resulting electrical signal has a value that changes over time, oscillating between positive and negative.
A Compact Disc stores stereo musical recordings in the form of two digital audio channels, each one containing 44,100 16-bit signed integers for every second of sound. This leads to a total data rate of 176,400 bytes per second.
For transmitting a telephone conversation digitally, the same level of fidelity is not required. Only a single audio channel is used, and only frequencies of up to 3000 cycles per second (3000 Hertz) need to be reproduced. By a mathematical law called the Nyquist theorem, this requires 6000 samples of the level of the audio signal to be taken each second. The signal must first be bandlimited to the range of frequencies to be reproduced; otherwise aliasing may result.
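The arithmetic behind these two figures can be checked with a few lines of Python. This is only a sketch: the function names are invented for illustration, and the Nyquist figure is taken as the bare minimum of twice the highest frequency to be reproduced.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) data rate of a sampled signal, in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


def nyquist_sample_rate(max_frequency_hz):
    """Lowest sampling rate for a signal bandlimited to max_frequency_hz."""
    return 2 * max_frequency_hz


cd_rate = pcm_bytes_per_second(44_100, 16, 2)   # Compact Disc: stereo, 16 bits
phone_rate = nyquist_sample_rate(3000)          # telephone: 3000 Hz bandwidth

print(cd_rate)     # 176400
print(phone_rate)  # 6000
```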
For many communications applications, samples of audio waveforms are one byte in length, and they are represented by a type of floating-point notation to allow one byte to represent an adequate range of levels.
Simple floating-point notation, for an eight-bit byte, might look like this:
S EE MMMMM
0 11 11111   1111.1
0 11 10000   1000.0
0 10 11111    111.11
0 10 10000    100.00
0 01 11111     11.111
0 01 10000     10.000
0 00 11111      1.1111
0 00 10000      1.0000
The sign bit is always shown as 0, which indicates a positive number. Negative numbers are often indicated in floating-point notation by making the sign bit a 1 without changing any other part of the number, although other conventions are used as well. For comparison purposes, the floating-point notations shown have all been scaled so that 1 represents the smallest nonzero number that can be indicated.
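A minimal sketch of how a byte in this simple format could be decoded, assuming the S EE MMMMM layout of the table and the sign convention in which only the sign bit changes for a negative number. The function name is hypothetical.

```python
def decode_simple(byte):
    """Decode S EE MMMMM, scaled so that 0 00 10000 is 1.0 as in the table."""
    sign = (byte >> 7) & 1
    exponent = (byte >> 5) & 0b11
    mantissa = byte & 0b11111          # first mantissa bit is stored explicitly
    value = (mantissa << exponent) / 16
    return -value if sign else value


print(decode_simple(0b0_11_11111))  # 15.5, binary 1111.1
print(decode_simple(0b0_00_10000))  # 1.0
```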
One way the range of values that can be represented can be extended is by allowing gradual underflow, where an unnormalized mantissa is permitted for the smallest exponent value.
S EE MMMMM
0 11 11111   11111000
0 11 10000   10000000
0 10 11111    1111100
0 10 10000    1000000
0 01 11111     111110
0 01 10000     100000
0 00 11111      11111
0 00 10000      10000
0 00 01111       1111
0 00 01000       1000
0 00 00111        111
0 00 00100        100
0 00 00011         11
0 00 00010         10
0 00 00001          1
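With gradual underflow the decoding rule itself does not change; the format simply accepts mantissas with leading zeroes when the exponent is 00. The following sketch assumes the same S EE MMMMM layout, scaled as in the table so that the smallest nonzero value is 1. Both function names are hypothetical.

```python
def decode_gradual(byte):
    """Decode S EE MMMMM as an integer; smallest nonzero magnitude is 1."""
    sign = (byte >> 7) & 1
    exponent = (byte >> 5) & 0b11
    mantissa = byte & 0b11111
    value = mantissa << exponent
    return -value if sign else value


def is_canonical(byte):
    """Unnormalized mantissas are only allowed for the smallest exponent."""
    exponent = (byte >> 5) & 0b11
    normalized = bool(byte & 0b10000)
    return exponent == 0 or normalized


print(decode_gradual(0b0_11_11111))  # 248, binary 11111000
print(decode_gradual(0b0_00_00001))  # 1
```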
Another way of making a floating-point representation more efficient involves noting that, in the first case, the first mantissa bit is always one. (The field of a floating-point number that represents the actual number directly is called the mantissa because it would correspond to the fractional part of the number's logarithm to the base used for the exponent.) With gradual underflow, that bit is only allowed to be zero for one exponent value. Instead of using gradual underflow, one could use the basic floating-point representation we started with, but simply omit the bit that is always equal to one.
This could produce a result like this:
S EEE MMMM
0 111 aaaa   1aaaa000
0 110 aaaa    1aaaa00
0 101 aaaa     1aaaa0
0 100 aaaa      1aaaa
0 011 aaaa       1aaa.a
0 010 aaaa        1aa.aa
0 001 aaaa         1a.aaa
0 000 aaaa          1.aaaa
Here, the variable bits of the mantissa are noted by aaaa, instead of being represented as all ones in one line, and all zeroes in a following line, for both compactness and clarity.
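A sketch of decoding the suppressed-bit layout, assuming the S EEE MMMM fields of the table: the omitted leading one is put back before the exponent is applied. The function name is made up for this example.

```python
def decode_hidden_bit(byte):
    """Decode S EEE MMMM, scaled so that 0 000 0000 is 1.0 as in the table."""
    sign = (byte >> 7) & 1
    exponent = (byte >> 4) & 0b111
    fraction = byte & 0b1111
    significand = 0b10000 | fraction   # restore the suppressed leading 1
    value = (significand << exponent) / 16
    return -value if sign else value


print(decode_hidden_bit(0b0_111_1111))  # 248.0, binary 11111000
print(decode_hidden_bit(0b0_000_0000))  # 1.0
```

Note that in this form no bit pattern decodes to zero, which is one reason practical formats reserve a special exponent value.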
Today's personal computers use a standard floating-point format (IEEE 754) that combines gradual underflow with suppressing the first one bit in the mantissa. This is achieved by reserving a special exponent value, the lowest one, to behave differently from the others. That exponent value is required to multiply the mantissa by the same amount as the next higher exponent value (instead of a power of the radix that is one less), and the mantissa, for that exponent value, does not have its first one bit suppressed.
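The same rule can be sketched for a toy 8-bit layout, S EEE MMMM. This is an illustration of the principle only, not the actual 32-bit or 64-bit format, and the function name is hypothetical. Values are scaled so that the smallest nonzero magnitude is 1.

```python
def decode_hidden_bit_gradual(byte):
    """Toy S EEE MMMM format: suppressed leading 1 plus gradual underflow."""
    sign = (byte >> 7) & 1
    exponent = (byte >> 4) & 0b111
    fraction = byte & 0b1111
    if exponent == 0:
        # Lowest exponent: no suppressed bit, same scale as exponent 1.
        value = fraction
    else:
        value = (0b10000 | fraction) << (exponent - 1)
    return -value if sign else value


print(decode_hidden_bit_gradual(0b0_000_1111))  # 15
print(decode_hidden_bit_gradual(0b0_001_0000))  # 16
print(decode_hidden_bit_gradual(0b0_111_1111))  # 1984
```

The values 15 and 16 are adjacent, showing that the unnormalized range joins the normalized range without a gap, and the all-zero byte now decodes to zero.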
Another method of representing floating point quantities efficiently is something I call extremely gradual underflow. This retains the first one bit in the mantissa, but treats the degree of unnormalization of the mantissa as the most significant part of the exponent field. It works like this (the third column shows an alternate version of this format, to be explained below):
S EE MMMMM                          S M EE MMMM
0 11 1aaaa   1aaaa000000000000000   0 1 11 aaaa
0 10 1aaaa    1aaaa00000000000000   0 1 10 aaaa
0 01 1aaaa     1aaaa0000000000000   0 1 01 aaaa
0 00 1aaaa      1aaaa000000000000   0 1 00 aaaa
                                    S MM EE MMM
0 11 01aaa       1aaa000000000000   0 01 11 aaa
0 10 01aaa        1aaa00000000000   0 01 10 aaa
0 01 01aaa         1aaa0000000000   0 01 01 aaa
0 00 01aaa          1aaa000000000   0 01 00 aaa
                                    S MMM EE MM
0 11 001aa           1aa000000000   0 001 11 aa
0 10 001aa            1aa00000000   0 001 10 aa
0 01 001aa             1aa0000000   0 001 01 aa
0 00 001aa              1aa000000   0 001 00 aa
                                    S MMMM EE M
0 11 0001a               1a000000   0 0001 11 a
0 10 0001a                1a00000   0 0001 10 a
0 01 0001a                 1a0000   0 0001 01 a
0 00 0001a                  1a000   0 0001 00 a
                                    S MMMMM EE
0 11 00001                   1000   0 00001 11
0 10 00001                    100   0 00001 10
0 01 00001                     10   0 00001 01
0 00 00001                      1   0 00001 00
Although usually a negative number is indicated simply by setting the sign bit to 1, another possibility is to also invert all the other bits in the number. In this way, for some of the simpler floating-point formats, an integer comparison instruction can also be used to test if one floating-point number is larger than another.
This definitely will not work for the complicated extremely gradual underflow format as it is shown here. However, that format can be coded so as to allow this to work, as follows: the exponent field can be made movable, and it can be placed after the first 1 bit in the mantissa field. This is the format shown in the third column above.
When this is done, for very small numbers the idea of allowing the exponent field to shrink suggests itself.
Thus, if the table above is continued, we obtain:
S EE MMMMM              S MMMMM EE
0 11 00001   1000       0 00001 11
0 10 00001    100       0 00001 10
0 01 00001     10       0 00001 01
0 00 00001      1       0 00001 00
                        S MMMMMM E
N/A             0.1     0 000001 1
N/A             0.01    0 000001 0
                        S MMMMMMM
N/A             0.001   0 0000001
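A sketch of a decoder for the movable-exponent form, including the shrinking exponent field, may make the layout easier to follow. It assumes an 8-bit code and works in integers, so every value is eight times the one shown in the tables (the table's 0.001 becomes 1). The function name and the treatment of an all-zero field as zero are assumptions of this example.

```python
def decode_movable_exponent(byte):
    """Decode one byte of the movable-exponent layout to a signed integer."""
    sign = (byte >> 7) & 1
    body = byte & 0x7F
    if body == 0:
        return 0
    rest_bits = body.bit_length() - 1       # bits that follow the leading 1
    rest = body & ((1 << rest_bits) - 1)
    if rest_bits >= 2:
        fraction_bits = rest_bits - 2       # the two-bit exponent comes first
        exponent = rest >> fraction_bits
        fraction = rest & ((1 << fraction_bits) - 1)
        significand = (1 << fraction_bits) | fraction
        magnitude = significand << (3 + 3 * fraction_bits + exponent)
    elif rest_bits == 1:
        magnitude = 2 << rest               # exponent field shrunk to one bit
    else:
        magnitude = 1                       # no room left for an exponent
    return -magnitude if sign else magnitude


values = [decode_movable_exponent(code) for code in range(128)]
print(values == sorted(values))                # True
print(decode_movable_exponent(0b0_00001_00))   # 8, shown as 1 in the table
```

The first line of output confirms the property that motivated moving the exponent field: for positive numbers, comparing the codes as integers gives the same order as comparing the values they stand for.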
Something very similar is used to represent sound signals in 8-bit form using the A-law, which is the standard for European microwave telephone transmission, and which is also sometimes used for satellite audio transmissions. However, the convention for representing the sign of numbers is different.
Mu-law encoding, used in the United States and Japan (and, I would suspect, Canada as well), instead operates as a conventional floating-point format, with the first bit of the mantissa suppressed, since that bit is always a 1 for a normalized number when the exponent represents a power of two. The following table illustrates these formats, with capital letters indicating bits that are complemented:
Linear value              Extremely Gradual  A-Law (1)  Suppressed Bit  Mu-Law    Suppressed Bit       A-Law (2)
                          Underflow                     Floating-Point            Floating-Point with
                                                                                  Gradual Underflow
+1aaaa000000000000000000  0 1 11 aaaa        1111aaaa   0 111 aaaa      1000AAAA  0 111 aaaa           1111aaaa
+01aaaa00000000000000000  0 1 10 aaaa        1110aaaa   0 110 aaaa      1001AAAA  0 110 aaaa           1110aaaa
+001aaaa0000000000000000  0 1 01 aaaa        1101aaaa   0 101 aaaa      1010AAAA  0 101 aaaa           1101aaaa
+0001aaaa000000000000000  0 1 00 aaaa        1100aaaa   0 100 aaaa      1011AAAA  0 100 aaaa           1100aaaa
+00001aaab00000000000000  0 01 11 aaa        10111aaa   0 011 aaab      1100AAAB  0 011 aaab           1011aaab
+000001aaab0000000000000  0 01 10 aaa        10110aaa   0 010 aaab      1101AAAB  0 010 aaab           1010aaab
+0000001aaab000000000000  0 01 01 aaa        10101aaa   0 001 aaab      1110AAAB  0 001 aaab           1001aaab
+00000001aaab00000000000  0 01 00 aaa        10100aaa   0 000 aaab      1111AAAB  0 000 1aaa           10001aaa
+000000001aa000000000000  0 001 11 aa        100111aa                             0 000 01aa           100001aa
+0000000001ab00000000000  0 001 10 ab        100110ab                             0 000 001a           1000001a
+00000000001aa0000000000  0 001 01 aa        100101aa                             0 000 0001           10000001
+000000000001aa000000000  0 001 00 aa        100100aa
+0000000000001a000000000  0 0001 11 a        1000111a
+00000000000001a00000000  0 0001 10 a        1000110a
+000000000000001a0000000  0 0001 01 a        1000101a
+0000000000000001a000000  0 0001 00 a        1000100a
+00000000000000001000000  0 00001 11         10000111
+00000000000000000100000  0 00001 10         10000110
+00000000000000000010000  0 00001 01         10000101
+00000000000000000001000  0 00001 00         10000100
+00000000000000000000100  0 000001 1         10000011
+00000000000000000000010  0 000001 0         10000010
+00000000000000000000001  0 0000001          10000001
+0                                           10000000                                                  10000000
-0                                           01111111                                                  01111111
-00000001aaab00000000000  1 01 00 aaa        00100aaa   1 000 aaab      0111AAAB  1 000 1aaa           00001aaa
-1aaaa000000000000000000  1 1 11 aaaa        0111aaaa   1 111 aaaa      0000AAAA  1 111 aaaa           0111aaaa
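Read row by row, the nonzero entries of the table relate each transmitted code to its underlying floating-point byte by a simple bit inversion: the A-Law columns differ from their floating-point form only in the sign bit, while the Mu-Law column has every bit complemented. The sketch below reproduces just that relationship as it appears in the table; the function names are hypothetical, and it makes no attempt to follow the complete procedure of the telephone standards.

```python
def a_law_code_from_float_byte(byte):
    """A-Law columns of the table: only the sign bit is inverted."""
    return byte ^ 0x80


def mu_law_code_from_float_byte(byte):
    """Mu-Law column of the table: all eight bits are inverted."""
    return byte ^ 0xFF


# First row of the table, with aaaa = 1010.
print(format(a_law_code_from_float_byte(0b0_1_11_1010), "08b"))   # 11111010
print(format(mu_law_code_from_float_byte(0b0_111_1010), "08b"))   # 10000101
```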
Most descriptions of A-Law encoding and Mu-Law encoding state that it is Mu-Law encoding that has the greater dynamic range, acting on 14-bit values while A-Law encoding acts on 13-bit values. It appears to me, as shown in the table, that Mu-Law encoding acts on 13-bit values, and A-Law encoding acts on 24-bit values. It may be that the floating-point encoding used with Mu-Law encoding is applied not to the input signal value, but to its logarithm. Or it may be that my original source for information on A-Law encoding was not accurate, or that I had misconstrued it; this seems likely, as using 24-bit digitization as the first step in digitizing a telephone conversation appears bizarre in comparison to standards for high-quality digital audio. The last column, A-Law (2), indicates what other sources appear to give for A-Law encoding, and this does cause it to act on 12-bit values (including the sign bit), which is at least one less bit than for Mu-Law encoding, even if there is a one-bit discrepancy in both cases.
Also, if this method, with a two-bit exponent, were used for encoding audio signals with 16 bits per sample, the result, for the loudest signals, would have the same precision as a 14-bit signed integer (13 bits of mantissa plus the sign). Many early digital audio systems used 14 bits per sample rather than 16 bits. But the dynamic range, the difference between the softest and loudest signals possible, would be that of a 56-bit integer.
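These figures can be checked by decoding the largest code of the movable-exponent form of extremely gradual underflow and measuring how many bits its value occupies. The sketch below assumes that form, with the exponent field shrinking for the smallest values, and an integer scale in which the smallest nonzero value is 1; the function names are invented for this example.

```python
def magnitude(body):
    """Value of the bits after the sign; works for any sample width."""
    if body == 0:
        return 0
    rest_bits = body.bit_length() - 1       # bits that follow the leading 1
    rest = body & ((1 << rest_bits) - 1)
    if rest_bits >= 2:
        fraction_bits = rest_bits - 2       # two exponent bits, then fraction
        exponent = rest >> fraction_bits
        fraction = rest & ((1 << fraction_bits) - 1)
        significand = (1 << fraction_bits) | fraction
        return significand << (3 + 3 * fraction_bits + exponent)
    return 2 << rest if rest_bits == 1 else 1


def summarize(sample_bits):
    """Return (mantissa bits for the loudest signals, equivalent integer width)."""
    body_bits = sample_bits - 1             # everything except the sign
    largest = magnitude((1 << body_bits) - 1)
    mantissa_bits = body_bits - 2           # leading 1 plus the fraction bits
    return mantissa_bits, largest.bit_length() + 1   # +1 for the sign


print(summarize(16))  # (13, 56)
print(summarize(8))   # (5, 24)
```

The 8-bit case gives 24 bits, matching the width of the linear values in the table above.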
One problem with using floating-point representations of signals for digital high-fidelity audio - although this particular format seems precise enough to largely make that problem minor - is that the human ear can still hear relatively faint sounds while another sound is present, if the two sounds are in different parts of the frequency spectrum. This is why some methods of music compression, such as those used with Sony's MiniDisc format, Philips' DCC (Digital Compact Cassette), and today's popular MP3 audio format, work by dividing the audio spectrum up into "critical bands", which are to some extent processed separately.
Transmitting 6000 bytes per second is an improvement over 176,400 bytes per second, but it is still a fairly high data rate, requiring a transmission rate of 48,000 bits per second.
Other techniques of compressing audio waveforms include delta modulation, where the difference between consecutive samples, rather than the samples themselves, is transmitted. A technique called ADPCM, adaptive differential pulse code modulation, works by such methods as extrapolating the previous two samples in a straight line, and assigning the available codes for levels for the current sample symmetrically around the extrapolated point.
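Both ideas can be sketched briefly. The first pair of functions transmits differences between consecutive samples; the third computes what is left over after extrapolating the previous two samples in a straight line. This is a simplified illustration with hypothetical function names: it leaves out the quantization of the differences and the adaptation of the step size that a practical ADPCM coder also performs.

```python
def delta_encode(samples):
    """Replace each sample by its difference from the previous one."""
    previous = 0
    deltas = []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas


def delta_decode(deltas):
    """Rebuild the samples by accumulating the differences."""
    total = 0
    samples = []
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


def extrapolation_residuals(samples):
    """Difference between each sample and a straight-line prediction."""
    residuals = []
    for n, sample in enumerate(samples):
        if n >= 2:
            predicted = 2 * samples[n - 1] - samples[n - 2]
        elif n == 1:
            predicted = samples[0]
        else:
            predicted = 0
        residuals.append(sample - predicted)
    return residuals


wave = [0, 10, 20, 30, 38, 44]
print(delta_encode(wave))             # [0, 10, 10, 10, 8, 6]
print(extrapolation_residuals(wave))  # [0, 10, 0, 0, -2, -2]
```

For a smoothly changing waveform the residuals cluster near zero, which is why the available codes can be spent on a narrow range around the extrapolated point.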
The term LPC, which means linear predictive coding, does not, as it might seem, refer to this kind of technique, but instead to a method that can very effectively reduce the amount of data required to transmit a speech signal, because it is based on the way the human vocal tract forms speech sounds.
There was a good tutorial on Linear Predictive Coding at
http://asylum.sf.ca.us/pub/u/howitt/lpc.tutorial.html
but that URL is no longer valid.
In the latter part of World War II, the United States developed a highly secure speech scrambling system which used the vocoder principle to convert speech to a digital format. This format was then enciphered by means of a one-time-pad, and the result was transmitted using the spread-spectrum technique.
The one-time-pad was in the form of a phonograph record, containing a signal which had six distinct levels. The records used by the two stations communicating were kept synchronized by the use of quartz crystal oscillators where the quartz crystals were kept at a controlled temperature. The system was called SIGSALY, and an article by David Kahn in the September 1984 issue of IEEE Spectrum described it.
Speech was converted for transmission as follows:
The loudness of the portion of the sound in each of ten frequency bands, on average 280 Hz in width (ranging from 150 Hz to 2950 Hz), was determined for periods of one fiftieth of a second. This loudness was represented by one of six levels.
The fundamental frequency of the speaking voice was represented by 35 codes; a 36th code indicated that a white noise source should be used instead in reconstructing the voice. This was also sampled fifty times a second.
The intensities of sound in the bands indicated both the loudness of the fundamental signal, and the resonance of the vocal tract with respect to those harmonics of the fundamental signal that fell within the band. Either a waveform with the frequency of the fundamental, and a full set of harmonics, or white noise, was used as the source of the reconstructed sound in the receiver, and it was then filtered in the ten bands to match the observed intensities in these bands.
This involved the transmission of twelve base-6 digits, 50 times a second.
Since 6 to the 12th power is 2,176,782,336, which is just over 2^31 (2,147,483,648), each set of twelve digits needs 32 bits, or four bytes, so this roughly corresponds to transmitting 200 bytes a second. This uses only two-thirds of the capacity of a 2,400 bit-per-second modem, and is quite a moderate data rate.
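The figures in this estimate can be reproduced as follows. The variable names are invented for the example, and rounding each frame up to a whole number of bytes is an assumption made to match the estimate above.

```python
import math

LEVELS = 6               # each digit is base 6
DIGITS_PER_FRAME = 12    # ten band loudness values plus two digits for pitch
FRAMES_PER_SECOND = 50

combinations = LEVELS ** DIGITS_PER_FRAME
bits_per_frame = math.log2(combinations)            # a little over 31
bytes_per_frame = math.ceil(bits_per_frame / 8)     # rounded up to 4 bytes
bytes_per_second = bytes_per_frame * FRAMES_PER_SECOND
modem_fraction = bytes_per_second * 8 / 2400

print(combinations)        # 2176782336
print(bytes_per_second)    # 200
```

Without the rounding, the information rate is slightly lower, about 1,551 bits per second.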
The sound quality this provided, however, was mediocre. Later standards based on linear predictive coding convert the human voice to a 2,400 bit-per-second signal (LPC-10) or, with the variant known as CELP, to a 4,800 bit-per-second signal.