
The Concertina II Architecture

This is now my ninth attempt to propose a successor to my original Concertina architecture.

The instruction set is designed to be used within blocks that are 256 bits in length, according to a scheme which allows the instructions in a block to be decoded in parallel.

The instruction set is divided into two independent opcode spaces. One, the 16-bit opcode space, is used for 16-bit instructions. The other, the 32-bit opcode space, is used both for 32-bit instructions and for instructions with a 32-bit main portion which make use of the short pointer mechanisms to be effectively longer. Where both 16-bit and 32-bit opcode space instructions are used, the distinction between them is indicated out-of-band, either in the header field in modes 2 and 3, or in instruction prefixes in mode 1.

The blocks are divided into 32-bit instruction slots, and a header within the block indicates which instruction slots contain a 32-bit instruction. These instructions may contain short pointers, which point within the block, to additional data allowing the instruction to be effectively longer than 32 bits.

While the instruction set is designed to be used within such blocks, in order to provide the maximum of flexibility in adapting it to different requirements, four modes are provided in which the instruction stream may be organized, and only two of those modes involve the 256-bit instruction blocks around which the instruction set is designed.

Because Mode 1, Mode 2, and Mode 3 are specifically suited to different categories of implementations of the architecture, it is not required that a conforming implementation support all of these modes. Instead, all that is required is that there is support for Mode 0, essential for some specialized applications, and one of the other three modes.

However, support for all four modes is strongly encouraged, particularly for any processor that might be used in a development system, so that it can be confirmed that code works properly when converted to other modes.

Supporting Mode 1+ instead of Mode 1, on the other hand, may, in some implementations, make Mode 1 operate significantly more slowly than it otherwise would; support for Mode 1+ is only encouraged if it can be implemented with reasonable efficiency and without at all compromising the speed of programs in Mode 1 that do not use Mode 1+ features.

Thus, it is intended that the format of object code and executables for operating systems designed for that architecture will be such as to allow that code to be converted to any of the modes Mode 1, Mode 1+, Mode 2, and Mode 3, as these are the possible modes in which programs containing instructions drawn from the full instruction set of the architecture may be expressed. In this way, when a processor that either only supports one of these modes, or is most efficient at executing programs in one of these modes, is presented with a program, it may load it in as a program in the mode which will work, or work best, with it.


In brief, these modes, and the types of processor designs with which it is presumed that they will be most effective are:


Mode 0: RISC mode

In this mode, only those instructions that are exactly 32 bits long are supported. This portion of the instruction set has been ensured to be sufficient to write programs using nearly all of the features of a CPU.

This mode is intended to be supported by all implementations; its purpose is to avoid the need for high-speed interpretive programs making use of the technique of just-in-time (JIT) compilation to deal with the complexities of organizing instructions into blocks, as Mode 2 and Mode 3 require.


Mode 1: Prefix mode

Although the instruction set has been designed around using 256-bit blocks to contain instructions, in such a way as to permit instructions to be decoded in parallel as if they were all the same length despite instructions actually being of different lengths, this alternative format, involving applying 16-bit prefixes to some instructions, has been provided which avoids the need for the block structure while allowing the full instruction set to be used.

This mode is primarily suitable for minimal implementations of the architecture that decode and execute instructions one at a time, and one after the other. (Note that the early form of pipelining, where Fetch, Decode, and Execute are carried on simultaneously for different instructions, but only one instruction is being decoded at a given time, and only one instruction is executing at a given time, such as was done on the IBM 7094 II computer, would still qualify as this type of implementation for the purpose of deciding which instruction formats are workable.) However, a sufficiently sophisticated large-scale implementation of the architecture may not be presented with a significant problem in decoding the indications of the lengths of instructions without undue delay.


Mode 1+: Enhanced Prefix mode

This mode adds an additional form of instructions longer than 32 bits which avoids the need for adding a 16-bit prefix to them. This allows programs to be more compact. However, the initial 32 bits of an instruction need to be fully decoded in order to access the length information inside that part of the instruction.

This mode would also be suitable for minimal implementations of the architecture, but it appears, at least to me, that it would be a challenge for any large-scale implementation to handle it efficiently.


Mode 2: Simple block mode

This mode organizes program code into 256-bit blocks with minimal overhead.

This makes use of the short-range pointers within instructions to allow the computer to quickly, and in parallel, fetch the additional parts of each instruction as instructions are decoded and executed in parallel.

No additional information is provided, however, to indicate whether instructions have dependencies or other characteristics that will interfere with executing them in parallel.

Therefore, this mode appears to be best suited to a superscalar processor which also has a full out-of-order (OoO) instruction execution capability.


Mode 3: VLIW mode

This mode also organizes program code into 256-bit blocks that facilitate decoding instructions in parallel. As this implies executing instructions in parallel, all block formats in this mode include information about dependencies between instructions, and also about resource conflicts between instructions.

This mode is best suited to a superscalar processor which attempts to provide high performance without full out-of-order instruction execution capability, through a very long instruction word (VLIW) design similar to that which is used by many digital signal processor (DSP) designs.


These modes, in detail, are as follows:

Mode 0: RISC Mode

A program to be run in Mode 0 may contain only those instructions which belong to the 32-bit opcode space and which are exactly 32 bits long, not using block-internal pointers either for immediate values or supplementary portions of the instruction itself.

Mode 0 provides for running unblocked code, where instructions stand on their own in a continuous stream, as in a conventional non-VLIW computer.

As noted, a subset of the instruction set is available for use in unblocked code. 16-bit instructions may not be used, since they're indicated in the block header. Instructions with pointers to additional portions of the instruction within a block may not be used, since there are no blocks.

Unblocked code may be useful, for example, in avoiding the need for excess complexity in interpretive programming systems making use of Just-in-Time compilation. In some implementations, unblocked code, since it lacks explicit indication of dependencies, may run much more slowly than code organized into blocks of 256 bits as described above. In larger, more elaborate implementations, that include full provision for out-of-order execution, the performance gap may be considerably smaller.

Mode 1: Prefix Mode

This mode allows for code that is not organized into 256-bit blocks, but which can still employ the full capabilities of the architecture. This is done by having variable-length instructions with the prefix property, as is common in many CISC systems.

Note that the way in which a prefix is used to indicate a string of 16-bit instructions means that it is not possible to branch to any of the 16-bit instructions within that string. Only the prefix of such a string may be a branch target, which may necessitate breaking such strings into pieces.

The prefixes used in this mode have the form:

Note that they use the portion of 32-bit opcode space that immediately precedes the cross-mode subroutine call instruction.

The first line in the illustration shows the prefix used to indicate a sequence of up to 255 instructions in 16-bit form.

The length field contains the number of instructions; a value of 0 is invalid.
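The two constraints just described (a string holds from 1 to 255 instructions, and only its prefix can be branched to) can be sketched as a small routine that an assembler might use to break up a run of 16-bit instructions. This is only an illustration: the function name and the representation of branch targets as zero-based positions within the run are assumptions, not part of the architecture.

```python
MAX_RUN = 255  # the length field holds 1..255; a value of 0 is invalid


def split_short_run(count, branch_targets=()):
    """Split a run of `count` 16-bit instructions into prefixed strings.

    `branch_targets` holds zero-based positions within the run that must be
    reachable by a branch, so each of them has to begin a new string (the
    branch then lands on that string's prefix). Returns the length of each
    resulting string, in order.
    """
    if count < 1:
        raise ValueError("a run must contain at least one instruction")
    # Every string boundary: the start of the run plus each branch target.
    starts = {0} | {t for t in branch_targets if 0 < t < count}
    ordered = sorted(starts) + [count]
    lengths = []
    for begin, end in zip(ordered, ordered[1:]):
        remaining = end - begin
        # A piece longer than 255 instructions needs more than one prefix.
        while remaining > MAX_RUN:
            lengths.append(MAX_RUN)
            remaining -= MAX_RUN
        lengths.append(remaining)
    return lengths
```

For example, a run of 300 instructions with a branch target at position 100 would need two prefixes, for strings of 100 and 200 instructions.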

The second line in the illustration shows the prefix used to indicate a general instruction from 32-bit opcode space which uses either or both immediate pointers or a supplementary pointer to include additional material in the instruction.

The lSupp field may contain zero if a supplemental portion for the instruction is not present, and it may contain a number from 1 to 4 (or up to 7 in the case of future extensions to the instruction set) indicating the length of the supplementary portion of the instruction in units of 16 bits.

The lIm1 field is used for the first immediate used by the instruction, whether it is an operand immediate or a source immediate; it is always the only one used if the instruction uses only one immediate operand. It indicates the length of the immediate as a power of two, from single-byte immediates to 256-bit immediates, with 256 bits being the length of a short vector and 128 bits being the length of the longest possible floating-point values.

The lIm2 field similarly indicates the length of the second immediate operand used by the instruction, if any.

If, and only if, both lIm1 and lIm2 are 1, a single 16-bit portion within the prefixed instruction will contain both 8-bit immediates, with the first immediate in the high-order bits at the lower address, and the second immediate in the low-order bits at the higher address.

In any other case where an 8-bit immediate is present, it adds 16 bits to the length of the instruction, which will contain an unused 8-bit field which should be zero followed by the immediate.

The third line in the illustration shows a special prefix format for instructions which contain no supplementary field, and only one immediate which is 8 bits long. In this case, the 8-bit immediate value is contained within the 16-bit prefix itself.

The form of prefixed instructions is this:

First, the 16-bit prefix.

Second, the base 32-bit instruction. All pSupp and pImm fields in the instruction are to contain zeroes, but bits indicating that a pImm field is used are to be set normally.

Third, the supplementary portion of the instruction, if present.

Fourth, the first immediate value used by the instruction, if present.

Fifth, the second immediate value used by the instruction, if present.
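As a rough sketch, the total length of an instruction using the general prefix (the second line of the illustration) can be worked out from the three length fields. The text only states that a value of 1 denotes an 8-bit immediate; the rest of the encoding used here (0 for an absent immediate, and 2 through 6 for 16 through 256 bits) is an assumption made for illustration, as are the function names.

```python
def immediate_bits(code):
    """Size in bits of an immediate for a given lIm1/lIm2 value.

    Assumed encoding: 0 = no immediate, 1 = 8 bits, 2 = 16 bits, ...,
    6 = 256 bits. Only the value 1 is confirmed by the text.
    """
    if code == 0:
        return 0
    if not 1 <= code <= 6:
        raise ValueError("immediate length code out of range")
    return 8 << (code - 1)


def prefixed_length_bits(l_supp, l_im1, l_im2):
    """Total length in bits of a prefixed instruction (general prefix form)."""
    if not 0 <= l_supp <= 7:
        raise ValueError("lSupp out of range")
    # 16-bit prefix, 32-bit base instruction, supplement in 16-bit units.
    total = 16 + 32 + 16 * l_supp
    if l_im1 == 1 and l_im2 == 1:
        # Two 8-bit immediates share a single 16-bit portion.
        return total + 16
    for code in (l_im1, l_im2):
        bits = immediate_bits(code)
        # A lone 8-bit immediate still occupies 16 bits (zero byte, then value).
        total += 16 if bits == 8 else bits
    return total
```

Note that this does not cover the special prefix in the third line of the illustration, where a single 8-bit immediate is carried inside the prefix itself and so adds nothing to the length.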


Incidentally, note that a valid Mode 0 program is also a valid Mode 1 program. However, Mode 0 still has a reason to exist, because in Mode 0, a valid Mode 0 program can perhaps be executed more quickly, as the computer can assume it will not encounter any prefixes, and all instructions will be 32 bits long and on 32-bit boundaries.

Mode 1+: Extended Prefix Mode

In order to permit more compact code, optionally, when in Mode 1, systems may also support an additional form for instructions longer than 32 bits.

In this form, the instruction begins with the same 32 bits of the instruction that would appear in the instruction slot in sequence in a block mode, without any prefix.

However, if a supplementary portion of the instruction is present, the pSupp field will not be zero, and any pImm field corresponding to an immediate that is used will also not be zero.

Instead, the pSupp field will contain the length, in units of 16 bits, of the supplementary portion of the instruction, and the pImm fields will contain a byte pointer to the end of the immediate value to which they refer, relative to the first byte of the instruction having 0 as its position. Thus, the lowest value which may be contained in a pImm field in this instruction format is 4, for a one-byte immediate value immediately following an instruction without a supplementary portion.

The elements of the instruction, as in the case of a prefixed instruction, must be in the following order:

First, the base 32-bit instruction.

Second, any supplementary portion of the instruction, if present.

Third, any immediate used for an operand argument to an operate instruction, if present.

Fourth, any immediate used for a source argument to an instruction, if present.

In this form, the first 32 bits of the instruction contain all the information needed to determine the total length of the instruction, but decoding is considerably more complicated than when a 16-bit prefix is used.
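To make the pointer convention concrete, the following sketch computes the pImm values for an instruction in this form. It assumes that the immediates are packed directly after the supplementary portion with no padding between them, which the text does not state; the function name and the use of byte sizes as input are likewise only illustrative.

```python
def mode1plus_layout(supp_units, imm_sizes):
    """Compute pImm byte pointers for a Mode 1+ instruction.

    supp_units: length of the supplementary portion in 16-bit units,
                which is the value placed in the pSupp field.
    imm_sizes:  sizes in bytes of the immediates, in instruction order.

    Returns (pointers, end), where each pointer is the position of the last
    byte of its immediate (byte 0 being the first byte of the instruction)
    and `end` is the position just past the last immediate.
    Assumes no padding between the parts (an assumption, not from the text).
    """
    position = 4 + 2 * supp_units  # bytes taken by base instruction + supplement
    pointers = []
    for size in imm_sizes:
        position += size
        pointers.append(position - 1)  # pointer to the end of this immediate
    return pointers, position
```

With no supplementary portion and a single one-byte immediate, this yields a pointer of 4, matching the lowest value mentioned above.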

Note that just as executing Mode 1+ code in Mode 1 is optional, an implementation may also optionally allow code using the features of Mode 1 or Mode 1+ to execute in Mode 0 without giving an error, as a way of simplifying the implementation. Hence, a conformant implementation could lack any actual support for switching between modes at all, provided that the only mode it supports is either Mode 1 or Mode 1+, that it allows mode switches between Mode 0 and Mode 1 to be executed even though they are effectively ignored, and that it signals an error if an attempt is made to switch to either Mode 2 or Mode 3.

Mode 2: Simple Block Mode

This mode permits the full use of the facility within instructions to contain short pointers within a 256-bit instruction block to allow long instructions, without including the other features of Mode 3 that require a longer block header.

This mode would be optimal for implementations where full OoO circuitry is present, and so the explicit indication of instruction dependencies provided for in the headers of Mode 3 blocks is not required.

The headers for a 256-bit block in this mode have these forms:

The first line shows the simplest form of header, with a length of only 16 bits.

The seven bits in the "short" field, if 0, indicate that the corresponding 32-bit instruction slot contains one 32-bit instruction, and if 1, indicate that the corresponding 32-bit instruction slot contains two 16-bit instructions.

The first instruction slot, as it contains a 16-bit header, must perforce contain a 16-bit instruction only in what remains, so there is no bit in the "short" field corresponding to that slot.

The eight bits in the "decode" field indicate which instruction slots contain instructions as opposed to containing additional data for instructions or being unused.

This field may contain one of these possible values:

11111111
11111110
11111100
11111000
11110000
11100000
11000000
10000000
01111111
01111110
01111100
01111000
01110000
01100000
01000000

The first bit may be zero to indicate that a 16-bit instruction in the second half of the first instruction slot is not used; for some implementations, it is expected that this may avoid the expenditure of some cycles that placing a 16-bit NOP in that location as the only means of indicating this would entail.
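The list of permitted values follows a simple rule, which the sketch below reproduces, together with a count of the instructions in a block with the 16-bit header. The fields are handled as strings of bits so that nothing is assumed about where they sit in the header; it is assumed, however, that the bits of the "short" field are in the same slot order as the last seven bits of the "decode" field. The function names are hypothetical.

```python
def valid_decode_fields():
    """Return the 15 permitted values of the 8-bit decode field."""
    fields = []
    for first in "10":
        # With the first bit zero, at least one following slot must be used.
        fewest = 0 if first == "1" else 1
        for ones in range(7, fewest - 1, -1):
            fields.append(first + "1" * ones + "0" * (7 - ones))
    return fields


def count_instructions(decode, short):
    """Count the instructions in a Mode 2 block with a 16-bit header.

    decode: 8-bit string; its first bit covers the 16-bit instruction that
            shares the first slot with the header.
    short:  7-bit string for the remaining seven slots; 1 means the slot
            holds two 16-bit instructions.
    """
    if decode not in valid_decode_fields():
        raise ValueError("decode field is not one of the permitted values")
    if len(short) != 7 or set(short) - set("01"):
        raise ValueError("short field must be seven bits")
    count = int(decode[0])
    for used, is_short in zip(decode[1:], short):
        if used == "1":
            count += 2 if is_short == "1" else 1
    return count
```

A block in which every slot is used for 16-bit instructions would thus hold fifteen instructions, and one using only 32-bit instructions in the remaining slots would hold eight.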

The second line shows a 32-bit header, which allows predication for blocks that do not contain any 16-bit instructions.

Here, the decode field may contain one of these possible values:

1111111
1111110
1111100
1111000
1110000
1100000
1000000

indicating that the block contains respectively from seven down to one instructions.

The P bit is a one if the instruction in its corresponding position is predicated; that is, if its execution is conditional.

The flag field indicates which of four predication flags controls whether or not the instruction is executed: the architecture provides for eight predication flags, all of which are available in Mode 3, but only the first four of which are usable in Mode 2.

The S bit controls the interpretation of the flag field. If it is a 0, then the instruction is only to be executed if the flag is set to 1; if it is a 1, then the instruction is only to be executed if the flag is reset to 0. For the 32-bit header, one S bit governs all predicated instructions; for the 64-bit header to be described next, each instruction has its own S bit.

The P and S bits are the Predication and Sense bits respectively.
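The rule given for the P, S, and flag fields amounts to a short test, sketched here. The representation of the predication flags as a list of 0 and 1 values, and the function and parameter names, are assumptions for illustration only.

```python
def should_execute(p, s, flag_index, flags, usable_flags=4):
    """Decide whether an instruction executes under predication.

    p:            Predication bit; 0 means the instruction is unconditional.
    s:            Sense bit; 0 = execute when the flag is 1,
                  1 = execute when the flag is 0.
    flag_index:   which predication flag governs the instruction.
    flags:        current values (0 or 1) of the predication flags.
    usable_flags: 4 in Mode 2; Mode 3 makes all eight flags available.
    """
    if not p:
        return True
    if not 0 <= flag_index < usable_flags:
        raise ValueError("predication flag not usable in this mode")
    wanted = 0 if s else 1
    return flags[flag_index] == wanted
```

With the 32-bit header the same value of `s` would be passed for every predicated instruction in the block, since one S bit governs them all.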

The third line shows a 64-bit header, which allows predication for blocks which are free to contain 16-bit instructions.

Here, the decode field may contain one of these possible values:

0111111
0111110
0111100
0111000
0110000
0100000

indicating that from six down to one instruction slots in the block are in use, either for a single 32-bit instruction or two 16-bit instructions.

Here, each pair of entries, each consisting of a P bit and flag bits, corresponds to a single 32-bit instruction slot; only the first entry of the pair is used where the slot contains a 32-bit instruction.

Note that this is the opposite of the way the U, D, and B bits are handled in the header mode shown in the fifth line of the diagram for Mode 3, where the first half relates to 32-bit instructions or the first 16-bit instruction in each slot, and the second half relates to the second 16-bit instruction in each slot.

Mode 3: Full VLIW Mode

This architecture is a VLIW architecture, where the instruction stream is composed of 256-bit instruction blocks, each of which may contain up to seven instructions which are 32 bits in length.

The intent of the design is to permit instructions to be decoded in parallel and, wherever possible, executed in parallel, to increase the speed of execution.

An instruction may include more than 32 bits of information. In that case, the instruction will contain one or more short pointers to the location of that information within the 256-bit instruction block. This allows, for example, instructions to use immediate values, which, unlike constants stored normally as data, do not require an additional fetch from memory, at an address not related to the successive addresses from which code is fetched.

This format is intended to allow instructions to be decoded in parallel, by having all instructions be the same length, while also allowing some instructions to be longer. It thus offers the advantage of the "heads-and-tails" instruction format developed by Heidi Pan but, by using pointers to the additional data, additionally avoids imposing sequential processing on the decoding of longer instructions.

Not all instructions adhere to the RISC philosophy; the architecture is intended to be very general, and, thus, includes instructions which may need to be implemented by microcode as they are useful for some application areas.

The first 32 or 64 bits of a 256-bit instruction block have the following formats:

The seven bits labelled "decode" indicate which of the subsequent 32-bit instruction fields within the block contain instructions to be decoded and possibly executed. For the formats shown in the first two lines of the diagram, these bits may have one of the following seven possible values:

1111111
1111110
1111100
1111000
1110000
1100000
1000000

Thus, from one to seven instructions may be present within a block; a 1 bit corresponds to an instruction field that contains an instruction to be decoded.

Each 32-bit instruction has three bits associated with it that appear in the first 32 bits of the block.

The U bit, if 1, marks the instruction as one on the results of which a subsequent instruction depends.

The D bit, if 1, marks the instruction as one which is dependent on the results of a previous instruction.

The B bit, if 1, indicates that, even in the absence of a dependency, this instruction may not be executed in parallel with the immediately preceding instruction, but instead must begin on the next cycle.

These three bits are the Upon, Dependency, and Break bits.

The appropriate value for the B bit is model-dependent. This can be dealt with by having the object format indicate the type of machine for which the provided code was generated and, where that differs from the machine which is to execute the code, having the loader determine the correct values for the B bits from the machine code.

In general, dependency is not model-dependent. However, the latency of instructions in cycles is model-dependent, and this affects whether a dependency is relevant. Also, the B bit only indicates the need to wait one cycle before issuing an instruction, so it is useful for indicating conflicts only in fully-pipelined execution units.

Where an execution unit is not pipelined, as might be the case for a microprogrammed functional unit handling the string and packed decimal memory-to-memory instructions, the instructions would have to be marked as dependent, rather than using the B bit to indicate a resource conflict, to produce a sufficient delay.

The offset field indicates the value of the offset on entry into an instruction block.

The offset is used to find the set U bit which corresponds to a set D bit: it is the number of set U bits, in instructions previous to the instruction with the D bit set, that are to be skipped before reaching the set U bit of the instruction on which that instruction depends.

The offset is only meaningful when there has been at least one U bit encountered for which the corresponding D bit has not been encountered. Setting the offset to 7 in the case when it is not meaningful for this reason allows the offset to be incremented whenever any U bit is encountered, and decremented whenever any D bit is encountered; the first U bit will properly take the offset to zero.
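One possible reading of this bookkeeping is sketched below: the offset is kept as a three-bit counter, and each set D bit is matched with a set U bit by walking back over the number of set U bits given by the offset. This is an interpretation rather than a definition. In particular, the choice to examine an instruction's D bit before its own U bit, the function name, and the use of `None` for a producer lying in an earlier block are all assumptions.

```python
def pair_dependencies(bits, offset=7):
    """Match each set D bit with the set U bit it refers to.

    bits:   list of (U, D) pairs, one per instruction, in program order.
    offset: the offset on entry; 7 means no U bit is awaiting its D bit.

    Returns (pairs, offset), where each pair is
    (index of dependent instruction, index of the instruction depended on),
    the second being None if that instruction precedes `bits`.
    """
    pairs = []
    producers = []  # indices of instructions seen so far with U set
    for index, (u, d) in enumerate(bits):
        if d:
            if offset == 7:
                raise ValueError("D bit set with no outstanding U bit")
            # Skip `offset` set U bits, counting back from the latest one.
            if len(producers) > offset:
                producer = producers[-1 - offset]
            else:
                producer = None  # lies in an earlier block
            pairs.append((index, producer))
            offset = (offset - 1) % 8  # decrement as a 3-bit value
        if u:
            producers.append(index)
            offset = (offset + 1) % 8  # the first U bit takes 7 to 0
    return pairs, offset
```

Under this reading, two instructions with U set followed by two with D set are matched in order, the first dependent instruction with the first producer, and the offset returns to 7 once both have been matched.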

The first bit of the block may be 0 or 1.

A block which does not contain predicated instructions, and does not contain branch instructions begins with a 0 bit.

A block which does not contain predicated instructions, but does contain one or more branch instructions begins with a 1 bit, with the format shown on the second line of the diagram.

For such a block, the block format is modified. The offset field is replaced by a branch field. This will contain a number from 1 to 7, indicating where the first branch instruction in the block is located. (If there are other branch instructions, there will be sufficient time to prepare for handling them after detecting them through normal instruction decoding.)

If there are six or fewer instructions in the block, the three bits that would have been the U, D, and B bits for the seventh instruction will contain the offset value. The offset value is optional, as it can be updated from what was present at the end of the preceding block, but it is helpful.


The third and fourth lines of the diagram show the format of the first 64 bits of a block which contains predicated instructions.

These formats are indicated by having the first bit of the block contain 1, but with the field that would normally give the location of the first branch instruction containing all zeroes.

If the block does not contain any branch instructions, the first bit of the second 32 bits of the block is a 0, and the format of the first 64 bits of the block is shown in the third line. If it does contain one or more branch instructions, the first bit of the second 32 bits of the block is a 1, and the format of the first 64 bits of the block is shown in the fourth line.

Here, the decode field may contain one of these possible values:

0111111
0111110
0111100
0111000
0110000
0100000

indicating that the block contains respectively from six down to one instructions.
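Both lists of permitted seven-bit decode values, the one for blocks with a 32-bit header and this one for blocks whose header occupies 64 bits, can be checked with a pattern match, as in this sketch. The function name and the `header_words` parameter are assumptions made for illustration; the field is again treated as a string of bits.

```python
import re


def instruction_slots_used(decode, header_words=1):
    """Return how many instruction slots a 7-bit decode field marks as used.

    header_words: 1 for a 32-bit header (one to seven slots available),
                  2 for a 64-bit header (first bit zero, one to six slots).
    Raises ValueError for any value outside the permitted lists.
    """
    if header_words == 1:
        pattern = r"(1{1,7})0*"
    elif header_words == 2:
        pattern = r"0(1{1,6})0*"
    else:
        raise ValueError("header_words must be 1 or 2")
    match = re.fullmatch(pattern, decode)
    if len(decode) != 7 or match is None:
        raise ValueError("decode field is not one of the permitted values")
    return len(match.group(1))
```

The ones must be contiguous and begin at the first available slot, so a value such as 1011000 is rejected.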

As the prefix portion of the block now consumes a second instruction slot, the three bits that would have contained the U, D, and B bits for that slot are now used for the offset.

In the case where there is no branch instruction in the block, corresponding to each of the six instruction slots which may potentially be used is an S bit, a P bit, and a three-bit flag field.

The P bit is a one if the instruction in its corresponding position is predicated; that is, if its execution is conditional.

The flag field indicates which of eight predication flags controls whether or not the instruction is executed.

The S bit controls the interpretation of the flag field. If it is a 0, then the instruction is only to be executed if the flag is set to 1; if it is a 1, then the instruction is only to be executed if the flag is reset to 0.

The P and S bits are the Predication and Sense bits respectively.

If a branch instruction is present in the block, as indicated by the first bit of the second 32 bits of the block being a 1, then there is only one S bit, and all predicated instructions in the block must have the same sense; if the S bit is 0, a predicated instruction must have its corresponding flag set (equal to 1) to execute; if the S bit is 1, a predicated instruction must have its corresponding flag reset (equal to 0) to execute.

As with the case where a branch is present, but predication is not, the branch field contains the number of the instruction slot containing the first branch instruction in the block, in this case from 2 to 7.


The fifth line of the diagram shows the header format for a block in which some of the 32-bit instruction slots contain a pair of 16-bit instructions instead of a single 32-bit instruction. Note that when 16-bit instructions are present in a block, predication is not available for that block.

The format of the decode field is the same as for the third and fourth lines of the diagram; its first bit must be zero, followed by from one to six ones, with the rest zero.

The sets of U, D, and B bits in the first word of the header refer either to an entire 32-bit instruction, or to the first 16-bit instruction of a pair; those in the second word of the header refer to the second 16-bit instruction of a pair.

The branch field may contain a zero to indicate the block contains no branches. If it contains a number from 2 to 7, this indicates which instruction slot contains the first branch in the block. If that instruction slot contains two 16-bit instructions, the H bit indicates which of those instructions is the first branch in the block: if 0, the first one, if 1, the second one.

Each of the six bits in the short field corresponds to one of the remaining six 32-bit instruction slots in the block, after the header, and the bit is a 1 for those slots which are to be decoded as instructions, but contain a pair of 16-bit instructions instead of a single 32-bit instruction.

Registers and Data Formats

The complement of registers included with this architecture is as follows:

There are 32 integer registers, each of which is 64 bits in length, numbered from 0 to 31.

Registers 1 through 7 may be used as index registers.

Registers 25 through 31 may be used as base registers, each of which points to an area of 65,536 bytes in length.

Register 16 may be used as a base register pointing to an area of 32,768 bytes in length.

Registers 19 through 23 may be used as base registers, each of which points to an area of 4,096 bytes in length.

The area of 4,096 bytes in length pointed to by register 18 will normally be used to contain up to 512 pointers, each 64 bits in length, to large arrays for use in Array Mode addressing.

Also, registers 8 through 15 may be used as base registers each pointing to an area 1,048,576 bytes in length for extended memory-reference instructions.
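The base register assignments above can be summarized as a table, from which the number of bits needed to address every byte of each area follows. This is only a summary sketch: the names are hypothetical, and the bit counts are derived from the area sizes rather than taken from any instruction format.

```python
# Area size in bytes addressed through each base register, per the text above.
BASE_REGISTER_AREAS = {}
for reg in range(25, 32):
    BASE_REGISTER_AREAS[reg] = 65_536
BASE_REGISTER_AREAS[16] = 32_768
for reg in range(19, 24):
    BASE_REGISTER_AREAS[reg] = 4_096
for reg in range(8, 16):
    BASE_REGISTER_AREAS[reg] = 1_048_576  # extended memory-reference instructions

# Register 18 points to a 4,096-byte table of 64-bit array pointers.
ARRAY_TABLE_BYTES = 4_096
ARRAY_POINTER_BYTES = 8
ARRAY_POINTERS = ARRAY_TABLE_BYTES // ARRAY_POINTER_BYTES  # 512


def address_bits(register):
    """Bits needed to select any byte within a base register's area."""
    area = BASE_REGISTER_AREAS[register]
    return area.bit_length() - 1  # every area size is a power of two
```

So the areas of 65,536, 32,768, 4,096, and 1,048,576 bytes require 16, 15, 12, and 20 bits respectively to span.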

There are 32 floating-point registers, each of which is 128 bits in length, numbered from 0 to 31.

There are 32 type extension registers, each of which is 32 bits in length, and each of which is associated with a floating-point register.

Floating point numbers in IEEE 754 format have exponent fields of different length, depending on the size of the number. For faster computation, floating-point numbers are stored in floating-point registers in an internal form which corresponds to the format in which extended precision floating-point numbers are stored in memory: with a 15-bit exponent field, and without a hidden first bit in the significand.

As 128-bit extended floating-point numbers are already in this format in memory, all floating-point numbers will fit in a 128-bit register, although shorter floating-point numbers are expanded.

However, the 32 floating-point registers may also be used for Decimal Floating-Point (DFP) numbers. These numbers will also be expanded into an internal form for faster computation, but that internal form may take more than 128 bits. The type extension registers allow the floating-point registers to behave as registers which are 160 bits in length for programs which use such data types.

There are 16 short vector registers, each of which is 256 bits in length.

Each of these registers may contain:

As well, they may contain sixteen 16-bit short floating-point numbers in one of two formats.

These numbers all remain in these registers in the same format as that in which they appear in memory.

Also, there is a primary register group of eight long vector registers, and a scratchpad of sixty-four long vector registers, where each long vector register is composed of sixty-four floating-point registers, each 128 bits in length.
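For a sense of scale, the storage implied by the register complement described in this section can be tallied as follows. The figures come directly from the counts and widths given above; the table and function names are only illustrative.

```python
# (number of registers, width of each in bits), as described in this section.
REGISTER_FILES = {
    "integer": (32, 64),
    "floating-point": (32, 128),
    "type extension": (32, 32),
    "short vector": (16, 256),
    "long vector (primary)": (8, 64 * 128),
    "long vector (scratchpad)": (64, 64 * 128),
}


def file_bytes(name):
    """Total storage in bytes of one register file."""
    count, width_bits = REGISTER_FILES[name]
    return count * width_bits // 8


# A floating-point register plus its type extension register acts as 160 bits.
EXTENDED_FP_BITS = REGISTER_FILES["floating-point"][1] + REGISTER_FILES["type extension"][1]
```

Each long vector register holds 1,024 bytes, so the scratchpad of sixty-four of them accounts for 65,536 bytes, far more than all the other register files combined.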

