Indication of Parallelism

The bit labelled Explicit Indication of Parallelism will have the effects described here in all the currently defined normal alignment instruction modes except stack mode and register stack mode. In stack mode and register stack mode, indication of parallelism is not usable, as successive instructions are, by design, almost always logically dependent on one another, and multiple instructions can be found within a single 16-bit halfword.

As for the variant alignment modes:

In aligned instruction mode, parallelism is indicated in the individual instructions themselves in a different manner.

In addition, in the variant alignment mode general register mode, as all instructions are aligned on 32-bit boundaries, only one of the three possible formats of parallelism indication described here is available, the one prefixed with the bits 10. Each pair of bits now refers to a 32-bit word, and so one 32-bit parallelism prefix precedes fifteen 32-bit words, forming a block of sixteen 32-bit words, twice the length of that used with the normal address modes.

In the case of the variant alignment modes where instructions are aligned on 8-bit boundaries, it is still possible to use the explicit indication of parallelism feature, but then the parallelism information would be contained in the first or second 16-bit halfwords of a block of only eight 16-bit halfwords, so that the fifteen or fourteen locations normally controlled by the prefix would now refer to the fourteen or twelve remaining bytes in the block, instead of being insufficient to refer to thirty or twenty-eight bytes in the standard block of sixteen halfwords.

Normally, programs in memory consume only enough memory to describe the operations which they carry out. Instructions are decoded serially, through fetching one 16-bit halfword at a time from an internal buffer, even though the data path to memory will likely be wider (typically, 256 bits to permit a short vector to be loaded with a single memory access, or even 1024 bits to permit a long vector composed of 32-bit floating-point floating-point values to be loaded in two memory accesses).

Instructions belonging to the same thread are generally assumed to have dependencies which require that the result of one instruction be available before the execution of the next instruction can begin. Some simple tests may be carried out to permit some cases where no dependency exists to be detected and utilized to permit faster execution, but in general the architecture relies upon the ability to switch between different threads, each with their own set of registers, to achieve continuous utilization of the arithmetic units.

If it is necessary to utilize a processor embodying the architecture described here in an application where there is only a single thread of execution, or where one thread of execution occupies the majority of the machine's running time, it may be desirable to enhance the extent to which operations are carried out in parallel. This can be done by increasing the complexity of the analysis and decoding of instructions, or, as some architectures have done, an explicit indication of which instructions may be executed in parallel with which other instructions may be provided.

Intel's Itanium chip encodes three 41-bit instructions in a 128-bit unit which also contains a five-bit code at the beginning indicating which of the three instructions may be executed in parallel with each other.

The Texas Instruments 320C6000 series of digital signal processing chips fetch eight 32-bit instructions from memory at once; the last bit of an instruction, if a one, indicates it may be executed in parallel with the following instruction.

Here, it is desired to add an explicit indication of parallelism to an instruction set not designed around the presence of such an indication. This is done as follows: when the bit indicating that parallelism is explicitly indicated is set in the Program Status Quadword, programs consist of blocks of 16 halfwords of 16 bits.

Usually the first halfword in the block will begin with the two bits 10, and this indicates that the first two halfwords contain a description of the parallelism status of the remaining 14 halfwords in the block.

These two halfwords are considered to be a single 32-bit value. The last 28 bits of this value consist of two bits describing each of the other halfwords in the block, as follows:

00 Not the first halfword of an instruction
01 The first halfword of an instruction
   completely independent of preceding instructions
10 The first halfword of an instruction
   logically independent of preceding instructions
   but requiring the use of the same ALU facilities
   as one of them
11 The first halfword of an instruction logically
   dependent on a preceding instruction

After a 11 code, for subsequent instructions, determination of the presence of a logical dependency involves only previous instructions since that code; after a 10 code or a 11 code, determination of the need to use the same ALU portion involves only previous instructions since that code.

A stretch of instructions containing only 01 and 00 codes may be executed in parallel. A stretch of instructions not containing any 11 codes may be executed in pipeline fashion without delays.

The third bit of the 32-bit parallelism indication block is a 1 if the last instruction in the block continues into the next block. It is not expected that blocks will be padded with no-operation codes to make them come out evenly; instead, some implementations may fetch multiple blocks at one time.

The fourth bit of the 32-bit parallelism indication block indicates that the availability of the 64 ALUs in the long vector unit for performing operations other than long vector operations is assumed for purpose of determining ALU usage collisions. If this bit is set, and the use of the long vector ALU is denied to the current process by the Program Status Quadword, the absence of a valid indication of parallelism is recognized, and instructions are executed as though the indication is not present (but the first two halfwords in every sixteen are still skipped).

This can also happen where a process has access to all the long vector instructions and registers, but these instructions are implemented, as was the case on the original Cray-1, by the use of a single pipelined arithmetic-logic unit.

Although the indication that instructions can be executed in parallel will not be correct in that case, since it will be based on the assumption of more ALU facilities than actually exist, and so parallelism will be determined as if the indication is not present, not that the explicit indication that instructions are not logically dependent can still be utilized in this case.

The setting of this bit is applied starting with the first 10 or 11 code in the last 28 bits of the word. Setting this bit to zero when the long vector unit is available, therefore, allows an addition, a multiplication, and a division in the regular ALU to be combined with a long vector operation without having to wait for the cycle of 65 halfwords to return to the regular ALU again. Thus, this bit is normally only set to one when the long vector ALU bank is available and no long vector operations are being performed.

Parallelism and pipelining possibilities are to be indicated on the assumption that an indefinitely long stretch of program code may be fetched at one time; actual implementations would fetch only a fixed number of blocks, quite possibly one, and perform operations in parallel only if the instructions specifying them are in those blocks; or they may even not support parallelism except to the extent of skipping the appropriate number of halfwords at the beginning of each block.

Two additional formats for the indication of parallelism are provided which, in rare circumstances, allow only a single halfword, rather than two, to be used for indicating parallelism within a block:

If the first two bits of the first halfword in a block of sixteen halfwords containing executable code are ones, a different format is used, in which only the first halfword indicates parallelism, and the remaining 15 halfwords contain executable code.

In this format, it is required that the 15 halfwords consist of instructions not crossing block boundaries, and that all the instructions are at least two halfwords in length, except that the last instruction can be only one halfword long, if it can be included as part of the group of instructions that can be executed in parallel which precede it.

This, among other things, allows a block to be padded with a single-halfword no-operation instruction.

Initially, a 1 bit indicates the start of a new instruction; if one additional 1 bit follows, there is an ALU conflict, and if two additional 1 bits follow, there is a logical dependency. Instructions must be long enough that their indications contain at least one zero for this format to be used.

Since the first of the 15 halfwords in the block used for instructions must begin an instruction, the 15 bits corresponding to the 15 halfwords must begin with a 1, and thus that 1 can serve the purpose of defining the format.

If the first bit of the first halfword in a block of sixteen halfwords containing executable code is a zero, another different format is used, in which only the first halfword indicates parallelism, and the remaining 15 halfwords contain executable code.

In this format, it is required that the 15 halfwords consist entirely of instructions that are one halfword in length.

A single 1 bit indicates an ALU conflict, and a 1 bit followed by another 1 bit indicates a logical dependency. Only when logical dependencies and ALU conflicts are suitably separated to permit this indication to be unambiguous and valid may this format be used.

Note that neither of these formats includes a bit indicating if the long vector arithmetic/logic units are assumed to be available for use in determining whether or not an ALU conflict is indicated in the code as presented; thus, if it is desired to make use of them to allow more instructions to be executed in parallel, only the format using two halfwords to indicate available opportunities for parallelism can be used.

These formats may be seen in the following diagram:

Thus, in this diagram, the green blocks represent instructions.

The blue blocks represent groups of instructions that can execute in parallel, because they have neither any ALU conflicts nor any logical dependencies. Thus, an ALU conflict (or a logical dependency) separates one blue block from the next.

The red blocks represent groups of parallel groups of instructions that can execute at maximum pipeline speeds, because they have no logical dependencies. Thus, a logical dependency separates one red block from the next.

For purposes of determining whether two instructions require the same ALU facilities, the following rules apply:

The short vector ALU may only be used for short vector operations.

There is one fixed-point ALU for regular fixed-point operations, and one floating-point ALU for regular floating-point operations, and there are also 64 of each kind of ALU for performing long vector operations.

For the regular fixed-point and floating-point types, an ALU contains separate components which may execute one operation from each of the following groups in parallel:

Addition, subtraction, comparison
Multiplication
Division
AND, OR, Exclusive-OR

Type conversion instructions involve the entire ALU for both the source and destination operands.

A multiply and accumulate instruction uses both multiplication and addition; the logical dependency within the instruction only affects parallelism by possibly delaying the next block of instructions separated by a logical dependency or ALU conflict.

All string and packed decimal instructions make use of a single ALU dedicated to this type of operation.

All bit field instructions use a single ALU.

There is only one ALU for the purpose of carrying out an Execute Extended Translate instruction, but it is separate from the ALU for string and packed decimal instructions.

Because there is no field indicating explicitly which ALU is used for each instruction, the following simplistic algorithm is used for allocation of scalar integer and floating-point operations to the main ALU and the sixty-four ALUs used for long vector operations: instructions are treated as belonging to blocks of 65 halfwords, and the instruction beginning with the first halfword of the block uses the main ALU, the instruction beginning with the second halfword of the block uses the ALU associated with scratchpad register 0, and so on, up to the instruction beginning with the sixty-fifth halfword of the block which uses the ALU associated with scratchpad register 63. This restriction should work no hardship, as instructions that can be executed in parallel can also be reordered by a compiler.

An instruction following a conditional jump instruction is logically dependent on the result of that jump, and the conditional jump instruction is logically dependent on the instruction that set the status it tested.

Any instruction which accesses main memory is treated as if logically dependent on any other instruction which accesses main memory. Since the time required for a memory access is of the order of the time required to completely process an instruction rather than the time required to traverse the execution stage of the instruction, this is not merely marked as an ALU collision.

Changing the bit indicating whether parallelism is explicitly indicated affects the processing of the subsequent block of sixteen halfwords in memory, or any block fetched as the result of a branch instruction; thus, it may safely be done in the middle of a program. As well, subsequent means subsequent to the last halfword of the instruction making the change, so such a change can be made by an instruction that crosses block boundaries. These blocks are aligned on 256-bit boundaries; that is, they begin with bytes having five binary zeroes as the least significant digits in their addresses.

Since the indication of parallelism can only indicate whether or not an instruction can be executed in parallel with the preceding instruction, or immediately following the preceding instruction without pipeline delays, it can supply no information about what that instruction may do in conjunction with a branch instruction which has that instruction as its branch target. As a result, no parallelism is assumed to be possible when a jump is performed. This also applies when the program counter in use is changed if the computer is also in the bisequential operation mode.

While a single block can contain up to fifteen instructions, each 16 bits long, it is not necessary that every implementation of this architecture be capable of 15-way superscalar operation. It is more likely that the actual capability of reasonable implementations will be either 4-way or possibly 8-way superscalar operation. However, compilers are to generate instructions using this feature as though the superscalar capability of the underlying computer is indefinite; in this manner, superscalar issue can begin at arbitrary points in the instruction stream, and can cross block boundaries.

Also note that incorrect designation of instructions as being capable of parallel execution may cause errors in the results of a user procedure, but cannot cause escalation of privilege or any disruption to other process threads.

An Alternate Mode: Heads-and-Tails

In her Master's thesis, Heidi Pan described a novel method of efficiently encoding variable-length instructions for rapid processing. Assuming that there is a minimum length for the instructions in a variable-length instruction encoding, instructions would be grouped into bundles, and at the beginning of each bundle, the first parts of all the instructions in the bundle, having that minimum length, would be placed one after the other. This would simplify finding all the instructions in a bundle at once, and would speed decoding them in parallel.

The mode described above for explicit indication of parallelism indicates the beginning of each instruction directly, and thus can achieve the same result with the addition of some specialized decoding logic. However, the option of modifying the format of the indication of parallelism described above to use this encoding method has also been included in this architecture; this is done by setting the two bits in the Program Status Block marked Individual Postfix Supplementary Bit Fields and Dual Postfix Supplementary Bit Fields.

The implementation described here, however, still differs somewhat from the model proposed in Heidi Pan's thesis. The two differences are:

The Heads-and-Tails model was proposed for modified versions of fixed-length encoded RISC instruction sets; here, a CISC instruction set, without any re-encoding, is used; and
A condition of that model is that the head portion of the instruction indicate the length of the instruction; here, while in most cases the first sixteen bits of an instruction do indicate its length, there are occasional exceptions.

The second discrepancy is not particularly serious, as any practical bundle format still has to indicate where each instruction tail begins and/or ends. The first discrepancy leads to a modification of the proposed format; here, since an existing CISC instruction set is used, which is decoded in a forwards direction, instead of beginning the bundle with the successive heads in normal order, and ending it with the tails in reverse order, the 16-bit fixed length heads are placed at the end of the bundle in reverse order, and the tails, of variable length, are in forwards order at the beginning of the bundle; in this way, both the first head and the beginning of the first tail are constant.

In the specific format proposed here, the same 256-bit length used for indicating parallelism will serve as the instruction bundle. The first 32 bits are used to indicate both parallelism and the position of heads and tails.

This 32 bit field is organized as follows:

The first four bits indicate the number of instructions, and hence the number of tails, in the bundle.

Corresponding to each 16 bits of the tails, two bits in this 32-bit field have the following interpretation:

00 Not the first halfword of the tail of an instruction
01 The first halfword of the tail of an instruction
   completely independent of preceding instructions
10 The first halfword of the tail of an instruction
   logically independent of preceding instructions
   but requiring the use of the same ALU facilities
   as one of them
11 The first halfword of the tail of an instruction
   logically dependent on a preceding instruction

Corresponding to each head, two bits in this 32-bit field have the following interpretation:

00 The head of an instruction which has a tail
01 A 16-bit instruction completely independent of
   preceding instructions
10 A 16-bit instruction logically independent of
   preceding instructions but requiring the use
   of the same ALU facilities as one of them
11 A 16-bit instruction logically dependent on a
   preceding instruction

The unfortunate necessity of fetching parellism information for multi-halfword instructions from the bits corresponding to its tail results from the need to keep the length of the field indicating both parallelism and instruction limits both small and constant, whether a bundle contains a few long instructions or many short ones.

Note that the need for a four-bit field indicating the number of instructions in a bundle means that there is no room for an S bit. Thus, as with the formats with a 16-bit prefix field above, ALU dependencies are indicated on the basis that there is not a bank of 64 ALUs for vector operation available for superscalar use by this process.

Note that because a unit of 16 halfwords is processed, to some extent, as a unit, architectures embodying schemes such as this have sometimes been advertised as VLIW architectures. Since the block is composed of identical subunits, it appears to me that such architectures can only be termed pseudo-VLIW architectures at best. In the case of this architecture, since instructions can cross over from one block to the next, the indication of parallelism not being based on an instruction set of the RISC type, such a claim would be even more clearly inapplicable.

Advanced Indication of Parallelism

For more rapid and efficient instruction handling, if the bits in the Program Status Block indicating Explicit Indication of Parallelism and Fast Decode Assistance are both set, a mode is entered that is oriented to simplified instruction decoding, but which imposes some restrictions on programs.

In this mode, only instructions that are 16 or 32 bits long are allowed, not any that are longer.

The first 16 bits of a 256-bit block contain a bit that may be 0 or 1, followed by fifteen bits which indicate which instructions depend on one another.

In this mode, these fifteen bits constitute a stream of bits that continues from one block to the next. A sequence of 1 bits, bounded by 0 bits on both sides, indicates that the instruction in the position corresponding to the last 1 bit is an instruction on which a later instruction depends, and that later instruction is indicated by the first 1 bit in the next sequence of bits.

Because such sequences must be separated by at least one zero bit, it is not possible for an instruction to depend on the immediately preceding instruction. If there is no other way to rearrange instructions to avoid this, a no-op will have to be inserted.

Normally, the first bit of the block will be 0. In that case, the only supplementary bits called for will be those that indicate dependencies.

The need for a new ALU issue will not be explicitly indicated in this mode; unlike dependencies, the need for one should be relatively easy to derive from the opcodes of instructions.

The fifteen bits correspond in order to the remaining 16-bit units in the 256-bit block.

When an instruction is 32 bits long, its second half does not directly follow its first half. Instead, it is in the position in the following block that its first half occupied. This allows the fifteen 16-bit units in successive 256-bit blocks containing instruction code to be fed to fifteen instruction decoders operating in parallel, independently of each other.

However, it can potentially cause a problem. What if one branches to code in this mode from the outside? Not having decoded the preceding blocks, how does one know which 16-bit units contain second halves of instructions from the previous block, and need to be skipped over?

When a block contains a potential branch destination, then the first bit of the block is a 1 instead of a 0.

The next 15 bits are still used for the dependency bit stream.

The immediately following 16 bits start with two zero bits, and contain 1 bits corresponding to each 16-bit unit that is the beginning of an instruction, whether 16 bits or 32 bits long.

This means that the second 16-bit unit of the preceding block may not be the first half of a 32-bit instruction since this would conflict with that 16-bit unit being available for this use in the current block.