This is now my seventh attempt to propose a successor to my original Concertina architecture.
I hope that this time I have found a way to achieve the goals I have set for myself while avoiding excessive complexity. The basic instruction formats for this architecture have the form:
All instructions are, or at least start out as, 32 bits in length, and the way in which they are processed is organized to be suitable for an implementation in which instructions are fetched eight at a time, in blocks of 256 bits.
The intent of this design is that any portion of a program consisting only of 32-bit instructions may be executed without any overhead caused by the possibility of instructions of other lengths being present. The principle used to achieve this is as follows: when, and only when, instructions of other lengths are needed, one or more instructions in the preceding block indicate how many of the first few 32-bit instruction words in the next block are to be skipped over, as they contain data supplementing normal 32-bit instructions; and the longer instructions appear in the instruction stream as normal 32-bit instructions, but contain a pointer to the additional information they require within that skipped-over part of the block.
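As a minimal sketch of the skip principle, assume a fetched block is modelled as a list of eight 32-bit words and that its skip count (0 to 7, from a three-bit field) has already been taken from the preceding block. The helper name `split_block` is hypothetical and not part of the architecture.

```python
from typing import List, Tuple

WORDS_PER_BLOCK = 8  # one 256-bit block = eight 32-bit words

def split_block(block: List[int], skip: int) -> Tuple[List[int], List[int]]:
    """Split a fetched block into its skipped-over data area and the
    32-bit words that are decoded as instructions. `skip` is the count
    set up by the preceding block (hypothetical model)."""
    if len(block) != WORDS_PER_BLOCK:
        raise ValueError("a block holds exactly eight 32-bit words")
    if not 0 <= skip <= 7:
        raise ValueError("the skip field is three bits wide")
    return block[:skip], block[skip:]

# With skip == 0, the common case, every word is an ordinary instruction,
# so no decoding cost is paid for the mere possibility of longer ones.
```

Because the skip count is known before the block is decoded, a decoder can start at the right word directly instead of scanning for instruction boundaries.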
The first two instruction formats are similar to those of many RISC architectures. There are memory-reference instructions, and register-to-register operate instructions that work with a bank of 32 registers.
The bit marked B in the register-to-register instructions indicates, if it is zero, that the instruction is guaranteed not to be dependent on the preceding instruction. This allows instructions to be processed more rapidly; they are considered to be grouped in blocks (not to be confused with the 256-bit fetch blocks), where the first instruction of a block either has the B bit set or is of a type without a B bit. Every instruction within such a block must be one that can safely be executed in parallel.
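The grouping rule can be sketched as follows. This assumes each instruction's B bit is given as 1 or 0, or as `None` for instruction types that have no B bit; it also conservatively assumes that any instruction without a B bit starts a new group. The function name is hypothetical.

```python
from typing import List, Optional

def parallel_groups(b_bits: List[Optional[int]]) -> List[List[int]]:
    """Return lists of instruction indices that may issue together.
    A group starts at an instruction whose B bit is set, or which has
    no B bit; each following instruction with B == 0 joins that group."""
    groups: List[List[int]] = []
    for i, b in enumerate(b_bits):
        if b == 0 and groups:
            groups[-1].append(i)   # guaranteed independent: same group
        else:
            groups.append([i])     # B set, or no B bit: start a new group
    return groups
```

For example, B bits of `[1, 0, 0, None, 1, 0]` yield the groups `[[0, 1, 2], [3], [4, 5]]`.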
The register-to-register instructions, as well as the augmented memory-reference instructions (which may perform calculations instead of merely loading or storing data), also have a bit marked C, which must be set to allow the instruction to change the condition codes.
Unlike most RISC architectures, but like the System/360, memory-reference instructions offer full base-index addressing.
There are 32 integer data registers, each one 64 bits wide, but only 8 address registers, also each 64 bits wide. As well, there are 32 floating-point data registers, each one 128 bits wide.
The base register field refers to one of the eight address registers, except that address register zero is not used as a base register.
The index register field indicates that indexing is not taking place if it contains all zeroes. Otherwise, it indicates the register that is to be used as an index register as follows:
001  Integer data register 1
010  Integer data register 2
011  Integer data register 3
100  Address register 0
101  Address register 1
110  Address register 2
111  Address register 3
Thus, some of the index registers are among the integer data registers, to allow index values to be the result of complex calculations. Since many programs will not require seven different base register values, however, some of the address registers may also be used as index registers; and, specifically, address register 0, which would otherwise not be useful, is allowed to serve as an index register.
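A minimal sketch of base-index address generation under this mapping follows. It assumes that a base field of zero simply contributes no base (the text only says address register 0 is not used as a base), that the displacement has already been extracted and sign-extended, and that addresses wrap at 64 bits; all three are assumptions, and the function names are hypothetical.

```python
from typing import Sequence

MASK64 = (1 << 64) - 1

def index_value(field: int, int_regs: Sequence[int], addr_regs: Sequence[int]) -> int:
    """Value selected by the 3-bit index register field."""
    if field == 0:
        return 0                      # 000: no indexing
    if field <= 0b011:
        return int_regs[field]        # 001-011: integer data registers 1-3
    return addr_regs[field - 0b100]   # 100-111: address registers 0-3

def effective_address(base: int, index: int, disp: int,
                      int_regs: Sequence[int], addr_regs: Sequence[int]) -> int:
    """Base + index + displacement (assumption: base field 0 means no base)."""
    base_value = addr_regs[base] if base != 0 else 0
    ea = base_value + index_value(index, int_regs, addr_regs) + disp
    return ea & MASK64                # assumed 64-bit wraparound
```

Note how field value `100` lets address register 0, which cannot be a base, still earn its keep as an index.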
The fourth line of the diagram shows a modified form of the register-to-register instruction in which only a source and destination register are specified. This frees up space to allow the instruction to include three bits which indicate, for the following block of 256 bits of instructions, a number of 32-bit instruction spaces at the beginning of that block to be skipped over.
This supports the other instruction formats described in subsequent lines of the diagram, which take the additional information they require, either immediate values or supplementary bits that create an instruction format longer than 32 bits, from this skipped-over area.
It is recommended that this two-address form should always be used in preference to a three-address form where the destination register and the operand register are the same. As this may mean that there is more than one instruction in a block with a skip field, all the skip fields should contain the same value. This permits simple implementations that set up the skip for the next block based on every skip field encountered, so that they will not give inconsistent results depending on the order in which instructions complete.
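The consistency rule can be checked mechanically. In this sketch, each instruction slot in a block carries either its skip field value or `None`; the function (a hypothetical name) returns the skip count the next block will receive, which is zero when no instruction in the block has a skip field.

```python
from typing import List, Optional

def next_block_skip(skip_fields: List[Optional[int]]) -> int:
    """Skip count for the next block, given the skip fields found in this
    block. Since all copies must agree, a simple implementation can apply
    each one as it completes, in any order, with the same result."""
    values = {s for s in skip_fields if s is not None}
    if len(values) > 1:
        raise ValueError(f"skip fields disagree within one block: {sorted(values)}")
    return values.pop() if values else 0
```

If the values were allowed to differ, the skip seen by the next block would depend on which instruction happened to complete last, which is exactly what the rule prevents.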
The P bit is used to indicate that the first 32 bits of the following 256-bit block of instructions will be used to provide predication information for the instructions in that block. In this case, the skip count must be 1 or greater, as those 32 bits are still included in the count of 32-bit instruction words to be skipped over.
In this case, the first 32 bits of the next block will have one of the forms shown in the following diagram:
In the first format, up to seven predicated instructions may be present in the block. If the C bit is zero, the three remaining bits in the field select one of the first eight flag bits, numbered from 0 to 7, and that flag bit determines whether the instruction is executed. If the C bit is one, the three remaining bits indicate the state of the condition codes for which the instruction is executed. Values from 1 through 7 correspond to those in the conditional branch instructions; 000 indicates that the instruction is executed if there is an overflow (rather than that it is never executed).
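A sketch of how one 4-bit field of this first header format might be evaluated. The layout is an assumption (C taken as the high bit of the field), as is the model of the condition codes as exactly one of low, equal, or high plus a separate overflow indication; the masks 1, 2, and 4 follow from the Jump on Condition table, where 3 is low-or-equal and 5 is not-equal.

```python
from typing import Sequence

LOW, EQUAL, HIGH = 0b001, 0b010, 0b100  # masks implied by the JL/JE/JH encodings

def predicate_format1(field: int, flags: Sequence[bool], cc: int, overflow: bool) -> bool:
    """Should the instruction run? `field` is a 4-bit predication field
    (assumed layout: C bit high, then three bits); `cc` is LOW, EQUAL or HIGH."""
    c = (field >> 3) & 1
    low3 = field & 0b111
    if c == 0:
        return flags[low3]        # one of flag bits 0-7
    if low3 == 0:
        return overflow           # 000 means "on overflow", not "never"
    return bool(low3 & cc)        # 1-7: same meaning as in the branch instructions
```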
In the second format, which allows for up to four predicated instructions, if the C bit is zero, the following bit, if 1, indicates the instruction is to be executed when the flag bit selected by the last five bits of the field is cleared instead of set. If the C bit is one, the last five bits of the field correspond to condition codes as used in the conditional branch instructions.
In the third format, which allows for up to six predicated instructions, sixteen of the flag bits, and the first sixteen of the possible condition code tests are available.
When this method of indicating predication is used, and a branch takes place into a block set up in this way, the predication information will be ignored, as there is nothing visible within the block itself to indicate that predication is taking place. This is the one deliberate exception to the general principle of this design: because the first part of a block is skipped over, and the material within it is always reached through pointers, branching into code does not cause problems, even though no fixed overhead is spent on a block type indicator at the start of each block.
It should be noted, however, that there is one other situation, inherent in the design and unavoidable, in which branching to the wrong location can cause problems.
It has been noted above that there may be more than one instruction with a three-bit skip field in a block, and, if so, all the skip fields should contain the same value. (The accompanying P bit should also be either set or cleared in the same fashion in all cases.) It should also be noted that if a block contains one or more instructions with a skip field, it is not necessarily the case that the last instruction in a block contains a skip field.
Thus, if a block contains one or more instructions with a skip field, and these skip fields contain a nonzero value, and a branch is made to that block after the last instruction with a skip field in that block, then the entire following block will be treated as composed of normal 32-bit instructions, with none skipped over.
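The hazard can be illustrated with a small model, assuming execution enters a block at slot `entry` and runs to its end without branching away; `skip_fields[i]` is the skip field of the instruction in slot `i`, or `None`. The function name is hypothetical.

```python
from typing import List, Optional

def skip_for_next_block(skip_fields: List[Optional[int]], entry: int) -> int:
    """Skip count the next block receives when this block is entered at
    slot `entry`. Only skip fields at or after the entry point execute."""
    executed = [s for s in skip_fields[entry:] if s is not None]
    return executed[-1] if executed else 0
```

With skip fields of 2 in slots 1 and 3, entering at slot 0 or 3 gives a skip of 2, but entering at slot 4 gives 0, so the next block's data words would wrongly be decoded as instructions.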
The third line of the diagram shows an instruction format in which the source operand is an immediate value. Instead of being in a fixed location within the instruction, however, this value is referenced by a pointer, as a memory operand would be, which is why this instruction format has been labelled "pseudo-immediate".
The pointer, however, points to one of the 32 bytes of the current 256 bit instruction block, so accessing the item it refers to, from what has already been fetched into the instruction buffer, should be at least as rapid as accessing an operand in a register, and thus in practice there should be no significant difference between operands of this form and conventional immediate operands.
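A minimal sketch of fetching a pseudo-immediate from the instruction buffer, assuming the block is available as 32 bytes, the pointer is a byte offset (0 to 31), the operand width is implied by the instruction, and the byte order is big-endian; the byte order and the function name are assumptions.

```python
def pseudo_immediate(block: bytes, pointer: int, size: int, signed: bool = True) -> int:
    """Read a `size`-byte immediate from the current 256-bit block at byte
    offset `pointer` (assumed big-endian)."""
    if len(block) != 32:
        raise ValueError("a block is 256 bits (32 bytes)")
    if not 0 <= pointer < 32 or pointer + size > 32:
        raise ValueError("the immediate must lie within the current block")
    return int.from_bytes(block[pointer:pointer + size], "big", signed=signed)
```

Since the block is already in the instruction buffer, this is a slice of data that has been fetched, not a memory access.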
Note also that this scheme requires no restrictions on what can be in a block that contains a branch target. Since the pseudo-immediates are pointed to, instead of being in locations deduced from what has happened in previous instructions in the block, things will simply work as long as one branches only to actual code and not to constant values: the interpretation of the instructions after the branch point does not depend on what portion of the block before the branch point consists of instructions, and what portion is skipped over to contain immediates or other data.
In the fifth and seventh lines, and in the nineteenth through twenty-fourth lines of the diagram are instruction formats that allow the instruction to be longer than 32 bits. Here, the pointer to additional material is four bits long, so that instructions may be lengthened in steps of 16 bits. How many of those are used depends on the particular instruction. Two different formats are provided, so that for a lengthened register-to-register instruction, two of the register operands can be in their usual position, and for a lengthened memory-reference instruction, the displacement field in the original 32-bit body of the instruction can be used.
Note that the 32-bit main body of the instruction is shown on the left of the diagram, and possible formats for the additional bits of the instruction, which actually will precede it in the location the pointer field indicates, are shown on the right of the diagram.
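The following sketch locates the supplementary bits of a lengthened instruction. It assumes the 4-bit pointer counts 16-bit units from the start of the block, that the number of units needed is known from the opcode, and that the extension must lie wholly before the instruction's own 32-bit word; the function name is hypothetical.

```python
def extension_bits(block: bytes, insn_slot: int, pointer: int, halfwords: int) -> bytes:
    """Supplementary bits for the instruction in word slot `insn_slot`.
    `pointer` is the 4-bit field in 16-bit units; `halfwords` is how many
    16-bit units this instruction needs (opcode-dependent)."""
    if len(block) != 32 or not 0 <= insn_slot < 8 or not 0 <= pointer < 16:
        raise ValueError("bad block, slot, or pointer")
    start, end = 2 * pointer, 2 * (pointer + halfwords)
    if end > 4 * insn_slot:           # the instruction word starts at byte 4*slot
        raise ValueError("extension must precede the instruction in the block")
    return block[start:end]
```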
The sixth and seventh lines of the diagram show instruction formats similar to those in the fourth and fifth lines, except that they are modified so that the source operand is immediate. Thus, an immediate operand may be combined with the formats shown in those lines.
The eighth, ninth, and tenth lines of the diagram, along with the fourteenth, fifteenth, and sixteenth lines, show formats for an alternate set of instructions which use additional banks of registers that have 128 registers in them instead of 32 registers. Not all programs will have those registers available to them, because an enlarged register bank is difficult to save and restore during context switches, particularly those caused by interrupts.
The purpose of the enlarged register bank is to allow a single program to use the processor at full speed without need for the technique of out-of-order execution. It is expected, however, to be difficult to arrange programs that solve real-world problems so that they have the lengthy segments performing multiple independent calculations within the registers that this would require.
Thus, in the eighth line, a register-to-register instruction format is shown with two seven-bit fields to specify the source and destination registers. In the ninth line, a format is shown with one seven-bit field for the destination register, with an immediate source operand. In the tenth line is a form of instruction that links the 128 registers of the enlarged register files with the 32 registers of the standard register files.
In the fourteenth line, an instruction format is shown that allows a destination, operand, and source register to all be specified. Only the destination register field is seven bits long. The operand and source register fields are each four bits long, and thus those registers must belong to the same group of sixteen registers as the destination register.
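The register numbers implied by this format can be computed as below: the destination's group of sixteen supplies the upper three bits of the operand and source register numbers. The function name is hypothetical.

```python
from typing import Tuple

def expand_grouped_regs(dest: int, operand: int, source: int) -> Tuple[int, int, int]:
    """Full register numbers (0-127) for the fourteenth-line format:
    `dest` is the 7-bit field, `operand` and `source` are 4-bit fields
    naming registers within the destination's group of sixteen."""
    if not (0 <= dest < 128 and 0 <= operand < 16 and 0 <= source < 16):
        raise ValueError("field out of range")
    group_base = dest & 0b111_0000    # 0, 16, 32, ..., 112
    return dest, group_base | operand, group_base | source
```

For example, destination 37 lies in the group 32 to 47, so operand field 2 and source field 15 name registers 34 and 47.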
In the fifteenth line, a two-address register-to-register instruction format is shown that also uses this division of the register banks into eight groups of sixteen registers to shorten the source register field. This allows an indication of the number of instruction slots to be skipped in the next block to be placed within code that primarily uses the enlarged register files.
In the sixteenth line, an instruction format is shown that makes use of additional instruction bits to allow full three-address instructions which can use any of the 128 registers in the enlarged register files in each of the three roles.
The additional data used for immediates or supplementary instruction bits normally precedes the instruction within a 256-bit block. The value 1110 points to the first 16 bits of the last 32 bits in the block, which are also not available for this purpose. This value is instead used, as shown in the twenty-fifth and twenty-sixth lines of the diagram, for an alternate means of indicating predication. As this method places the instructions to be predicated within the skipped part of the block, and accesses them through a pointer, a branch to a block in which it is used won't cause predication to be ignored for the instructions that are to be predicated. Of course, that branch may be to a point after the instruction of this type, in which case those instructions are bypassed completely.
This method is used when correct operation in the event of a branch to a block where predication is used is essential. The other method described above, as it allows the entire first 32 bits of a block to describe the predication, has less overhead and is more flexible where branching into the block is not an issue, which is why it is provided as well.
If there are from one to three predicated instructions, six bits are available to describe the predication for each one. If the first bit, marked C, in each six-bit field is zero, the five remaining bits indicate one of thirty-two flag bits, and the corresponding instruction is executed if the flag bit is set. If the C bit is 1, the five remaining bits indicate the condition code value, in the same manner as in a conditional branch instruction, under which the instruction is executed.
If there are four to six predicated instructions, three bits describe the predication for each one. If those three bits are all zero, the instruction is executed unconditionally; if they have a value from 1 to 7, the corresponding flag bit, 1 to 7, from among the thirty-two flag bits numbered 0 to 31, controls whether the instruction is executed.
In either case, for any number of instructions to be predicated from 1 to 6, the predication fields are used from left to right, with the contents of unused fields on the right being ignored. (The 32-bit header indicated by the P bit in instruction formats above follows the opposite convention.)
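The left-to-right use of the fields can be sketched as follows. The layout here is hypothetical: the predication fields are assumed to occupy an 18-bit value, holding three 6-bit fields for one to three instructions or six 3-bit fields for four to six, since both arrangements fill 18 bits; the actual bit positions depend on the diagram.

```python
from typing import List, Tuple

def alt_predication_fields(bits18: int, count: int) -> List[Tuple[str, int]]:
    """Decode the predication fields for `count` instructions, taking
    fields from the left and ignoring unused fields on the right.
    Hypothetical layout: an 18-bit value of 3x6-bit or 6x3-bit fields."""
    if not 1 <= count <= 6:
        raise ValueError("1 to 6 predicated instructions")
    width = 6 if count <= 3 else 3
    result: List[Tuple[str, int]] = []
    for i in range(count):
        f = (bits18 >> (18 - width * (i + 1))) & ((1 << width) - 1)
        if width == 6:
            # C bit first, then a flag number or a condition code value
            result.append(("cc" if f >> 5 else "flag", f & 0x1F))
        else:
            # 000: unconditional; 1-7: flag bits 1-7
            result.append(("always", 0) if f == 0 else ("flag", f))
    return result
```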
Not shown in the diagram are instructions which begin with 1101 1111, 1111 1110, or 1111 1111; these combinations are reserved for additional instruction formats which are unrelated to those shown here, and for special-purpose instructions such as privileged operations including input-output instructions.
The opcodes of the memory-reference instructions are:
00000  LB   Load Byte
00001  STB  Store Byte
00010  ULB  Unsigned Load Byte
00011  IB   Insert Byte
00100  LH   Load Halfword
00101  STH  Store Halfword
00110  ULH  Unsigned Load Halfword
00111  IH   Insert Halfword
01000  L    Load
01001  ST   Store
01010  UL   Unsigned Load
01011  I    Insert
01100  LL   Load Long
01101  STL  Store Long
01110  JC   Jump on Condition
01111  JS   Jump to Subroutine
10000  LM   Load Medium
10001  STM  Store Medium
10010  LF   Load Floating
10011  STF  Store Floating
10100  LD   Load Double
10101  STD  Store Double
10110  LQ   Load Quad
10111  STQ  Store Quad
Byte, Halfword, Word, and Long are integer formats 8, 16, 32, and 64 bits in length respectively; Medium, Floating, Double, and Quad are floating-point formats 48, 32, 64, and 128 bits in length respectively. Because their 48-bit length is not a power of two, Medium-format floating-point numbers are considered to be aligned when they are aligned to a 16-bit boundary.
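A table-driven sketch of opcode lookup and the alignment rule follows. It assumes natural alignment for every format other than Medium; the text does not say what happens on a misaligned access, so the check only reports it. The helper names are hypothetical.

```python
from typing import Optional

# 5-bit memory-reference opcodes, in numeric order 00000-10111.
MEM_OPCODES = [
    "LB", "STB", "ULB", "IB",    # byte
    "LH", "STH", "ULH", "IH",    # halfword
    "L", "ST", "UL", "I",        # word
    "LL", "STL", "JC", "JS",     # long; jumps
    "LM", "STM", "LF", "STF",    # medium; floating
    "LD", "STD", "LQ", "STQ",    # double; quad
]

# Operand size in bits for each load/store mnemonic.
OPERAND_BITS = {
    "LB": 8, "STB": 8, "ULB": 8, "IB": 8,
    "LH": 16, "STH": 16, "ULH": 16, "IH": 16,
    "L": 32, "ST": 32, "UL": 32, "I": 32,
    "LL": 64, "STL": 64,
    "LM": 48, "STM": 48, "LF": 32, "STF": 32,
    "LD": 64, "STD": 64, "LQ": 128, "STQ": 128,
}

def mnemonic(opcode: int) -> Optional[str]:
    """Mnemonic for a 5-bit opcode, or None if it is not in the table."""
    return MEM_OPCODES[opcode] if 0 <= opcode < len(MEM_OPCODES) else None

def is_aligned(address: int, mnem: str) -> bool:
    """Natural alignment, except that 48-bit Medium operands need only
    a 16-bit (2-byte) boundary."""
    bits = OPERAND_BITS[mnem]
    boundary = 2 if bits == 48 else bits // 8
    return address % boundary == 0
```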
In the case of the Jump on Condition instruction, the destination register field is used as part of the opcode, to indicate the condition under which branching takes place. The various forms of the Jump on Condition instruction are:
01110 00000  NOP  No-operation
01110 00001  JL   Jump if low
01110 00010  JE   Jump if equal
01110 00011  JLE  Jump if low or equal
01110 00100  JH   Jump if high
01110 00101  JNE  Jump if not equal
01110 00110  JHE  Jump if high or equal
01110 00111  J    Jump
01110 01000  JV   Jump if overflow
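The encodings 0 to 7 are consistent with reading the condition field as a mask, with low = 1, equal = 2, and high = 4: JLE is 3, JNE is 5, and J is 7. A sketch under that reading, assuming the condition codes record exactly one of low, equal, or high, with overflow tracked separately:

```python
LOW, EQUAL, HIGH = 0b001, 0b010, 0b100

def jump_taken(cond: int, cc: int, overflow: bool) -> bool:
    """cond: the 5-bit destination-register field of a JC instruction;
    cc: LOW, EQUAL or HIGH (assumed model of the condition codes)."""
    if cond == 0b01000:
        return overflow          # JV
    if 0 <= cond <= 0b00111:
        return bool(cond & cc)   # 0 (NOP) never matches; 7 (J) always does
    raise ValueError("condition value not listed")
```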
In the case of the Jump to Subroutine instruction, the destination register field indicates where the return address is to be placed: if it contains a value from 0 to 7, it is placed in the address register indicated by the number in that field; otherwise, it is placed in the integer data register indicated by the number in that field.
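The return-address rule, sketched as a lookup on the 5-bit field (the function name is hypothetical). One consequence worth noting is that integer data registers 0 to 7 can never receive a return address.

```python
from typing import Tuple

def return_register(field: int) -> Tuple[str, int]:
    """Where JS places the return address, given its 5-bit destination field."""
    if not 0 <= field <= 31:
        raise ValueError("the destination field is five bits wide")
    return ("address", field) if field <= 7 else ("integer", field)
```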
The basic floating-point formats supported by this architecture are patterned after those of IEEE 754, as used by many other computers, and are shown below:
In addition to the standard 32-bit and 64-bit types specified by IEEE 754, a similar type occupying 48 bits is defined. Its exponent field is the minimum size that allows numbers from 10^-99 to 10^99 to be represented, and with that exponent field, 11 digits of precision are provided. This format therefore matches the precision provided by many pocket calculators, and that used in mathematical tables and by mechanical calculators; historically, then, it appears to be a good fit to what many scientific problems require.
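The arithmetic behind these figures can be checked with a short sketch. It assumes an IEEE-754-style layout (one sign bit, a biased exponent with bias 2^(e-1) - 1, and a hidden leading significand bit); under that assumption a 9-bit exponent only reaches about 10^76, a 10-bit exponent is the smallest that covers 10^-99 to 10^99, and the remaining 37 stored bits plus the hidden bit give 38 bits, or about 11.4 decimal digits.

```python
import math

LOG10_2 = math.log10(2)

def min_exponent_bits(lo_exp10: int, hi_exp10: int) -> int:
    """Smallest exponent width whose normal numbers span
    10**lo_exp10 .. 10**hi_exp10 (IEEE-style bias assumed)."""
    e = 1
    while True:
        bias = 2 ** (e - 1) - 1
        # largest normal >= 2**bias; smallest normal = 2**(1 - bias)
        if bias * LOG10_2 >= hi_exp10 and (1 - bias) * LOG10_2 <= lo_exp10:
            return e
        e += 1

def decimal_digits(total_bits: int, exp_bits: int) -> int:
    """Whole decimal digits of precision: stored significand bits plus the hidden bit."""
    significand = total_bits - 1 - exp_bits + 1
    return math.floor(significand * LOG10_2)
```

The same function gives the familiar 7 and 15 digits for the 32-bit and 64-bit IEEE 754 formats.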
The extended-precision format of floating-point number is the one used in the registers for floating-point numbers of all precisions.
Thus, when a register-to-register operation involving a shorter precision is performed, some bits of the register are ignored when operands are taken, but all bits are filled to provide a valid extended-precision number with the proper value when results are returned.
This has the positive consequence that denormals do not require any additional overhead. It also means that single-precision, double-precision, and intermediate-precision numbers, in internal form, have a slightly greater numeric range than they do in external form.
This means that some computations may continue, and produce a correct result, which would otherwise fail if the numbers were kept in external form all the way through. However, this still tends to be viewed as a drawback, as it means that computations are less consistent in their results. Also, this means that instructions to store floating-point numbers other than extended precision floating point numbers may fail with an overflow or underflow error.
A later page will discuss the formats of Decimal Floating-Point numbers. One of those formats involves representing decimal digits using a modified form of Chen-Ho encoding. These numbers will also be converted to an internal form to speed computation; in the internal form, ten-bit fields representing three decimal digits will be converted to normal four-bit BCD digits. This will expand a 128-bit Decimal Floating-Point number to more than 128 bits. As a result, while there will still be 128 Decimal Floating-Point registers, a Decimal Floating-Point register will consist of both the corresponding 128-bit floating-point register and the corresponding 64-bit integer register.
The architecture provides vector instructions similar to those provided by many processors today. A vector as used by a short vector instruction is 256 bits long, and may contain four 64-bit or eight 32-bit floating point numbers, or four 64-bit, eight 32-bit, sixteen 16-bit, or thirty-two 8-bit integers.
There are sixteen short vector registers. Each short vector register occupies two consecutive 128-bit floating-point registers; thus, short vector register 0 is floating-point registers 0 and 1, short vector register 1 is floating-point registers 2 and 3, and so on.
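The register pairing and the element counts from the previous paragraph can be expressed directly (helper names are hypothetical):

```python
from typing import Tuple

SHORT_VECTOR_BITS = 256

def fp_registers_of(svr: int) -> Tuple[int, int]:
    """The two 128-bit floating-point registers forming short vector register `svr`."""
    if not 0 <= svr < 16:
        raise ValueError("there are sixteen short vector registers")
    return 2 * svr, 2 * svr + 1

def lanes(element_bits: int) -> int:
    """Number of elements of the given width in one 256-bit short vector."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported element width")
    return SHORT_VECTOR_BITS // element_bits
```

So short vector register 15 occupies floating-point registers 30 and 31, and holds thirty-two 8-bit or four 64-bit elements.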
The formats of the short vector instructions are shown in the eleventh, twelfth, and thirteenth lines of the diagram.
The component elements of a short vector are stored in the short vector registers in the same format as they are stored in memory. This is unlike the case with scalar calculations, where floating-point numbers are converted to an internal form to speed computation. Therefore, a status bit indicating that denormals are to be treated as zeroes is provided in the program status word, but it only affects short vector calculations.
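What a denormals-are-zero setting does to one 32-bit lane can be sketched with the standard `struct` module. This assumes the common convention that a denormal input is replaced by a zero of the same sign; the source does not specify the sign handling.

```python
import struct

def load_lane_f32(bits: int, daz: bool) -> float:
    """Interpret a 32-bit IEEE 754 single-precision lane, optionally
    treating denormal inputs as (signed) zero."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if daz and exponent == 0 and mantissa != 0:
        bits &= 0x8000_0000      # keep only the sign bit: +0.0 or -0.0
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

Normal numbers such as `0x3F800000` (1.0) pass through unchanged either way; only the smallest, denormal encodings are affected.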
The seventeenth and eighteenth lines of the diagram show the formats of the long vector instructions.
These use sets of sixty-four vector registers, each of which contains space for sixty-four elements in the vector. Thus, the floating-point long vector registers consist of sixty-four 128-bit elements each, and the integer long vector registers consist of sixty-four 64-bit elements each.
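To give a sense of scale, the storage represented by these register sets follows from simple arithmetic: 64 registers of 64 elements of 128 bits is 64 KiB for the floating-point set, and half that for the integer set.

```python
def register_set_bytes(num_regs: int, elements: int, element_bits: int) -> int:
    """Total storage, in bytes, of a set of vector registers."""
    return num_regs * elements * element_bits // 8

FP_LONG_VECTOR_BYTES = register_set_bytes(64, 64, 128)   # 65536 bytes
INT_LONG_VECTOR_BYTES = register_set_bytes(64, 64, 64)   # 32768 bytes
```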
Numbers are converted to internal form for long vector instructions as for scalar instructions and unlike short vector instructions, and so the same considerations with respect to overflow and underflow apply.
Hybrid vector instructions use the floating-point long vector registers only, and also use the same instruction formats as long vector instructions.
These instructions, however, put data in those registers in external format, and pack the data within those registers; thus, a vector register would contain 128 rather than 64 floating-point numbers if they are in 64-bit double precision.
Stride is not supported with hybrid vector instructions.
The length field in the instruction indicates the number of 256-bit short vectors, rather than the number of individual elements, in the vector.
Like short vector instructions, these instructions are affected by the denormals-are-zero status bit.
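The capacity of a floating-point long vector register used in hybrid mode, and the number of elements covered by a given length field, can be computed as follows. The upper limit of 32 short vectors per register is derived from the register size (64 elements of 128 bits is 8192 bits); the actual width of the length field is not given here, so that limit is an inference.

```python
LONG_VECTOR_REG_BITS = 64 * 128   # 64 elements of 128 bits = 8192 bits
SHORT_VECTOR_BITS = 256

def hybrid_capacity(element_bits: int) -> int:
    """Packed external-format elements that fit in one register."""
    return LONG_VECTOR_REG_BITS // element_bits

def hybrid_elements(length_field: int, element_bits: int) -> int:
    """Elements covered when the length counts 256-bit short vectors."""
    if not 0 <= length_field <= LONG_VECTOR_REG_BITS // SHORT_VECTOR_BITS:
        raise ValueError("more short vectors than fit in one register")
    return length_field * (SHORT_VECTOR_BITS // element_bits)
```

This reproduces the figure given above: 128 double-precision numbers per register, reached with a length of 32 short vectors.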