This is the current draft of an attempt on my part to propose a successor to my original Concertina architecture. Once again, it builds on previous attempts; a major goal is to keep overhead to a minimum, and ensure that program code is compact. In addition, the structure of the instruction set has been greatly simplified over that in previous iterations in one important respect.
The design attempts to combine many of the benefits of RISC, CISC, and VLIW architectures.
There are 32 integer general registers and 32 floating-point registers, and those instructions that perform arithmetic or logical operations include a bit for enabling changes to the condition codes as a result of those instructions. These are characteristics found in RISC architectures.
Having register banks of 32 registers allows different calculations to be intertwined in the code, and being able to control whether instructions affect the condition codes allows more intervening instructions between an instruction that sets the condition codes and a branch instruction that makes use of those results. Both of these features allow code to be written so as to obtain some of the same benefits as out-of-order execution, without the hardware overhead. At the microprocessor clock rates in use today, these measures alone are normally not enough to be effective; however, if code written this way is combined with simultaneous multi-threading (SMT), then there is still the potential for competing with out-of-order execution.
Instructions are organized into 256-bit blocks which contain eight 32-bit instruction slots.
Instructions may cross block boundaries.
The instruction set is organized so that the computer is able to fetch a 256-bit block of instructions, and immediately begin decoding each 32-bit instruction slot independently of the others in the block. But special processing may instead be indicated by a block header within the block.
If the block begins with the bits 11011, then the first 32 bits of the block are a block header, which will be in the form shown in the diagram.
The header, after the five bits which indicate that it is a header, is divided into three sections. The first section is three bits long, the second is fifteen bits long, and the final section is nine bits long.
The three-bit section consists of a 3-bit decode field. This field may contain 7, indicating that all the remaining 32-bit instruction slots in the block are to be decoded, or it may contain a lesser number, in which case its difference from 7 indicates how many instruction slots at the end of the block are to be ignored in decoding, so that they can be used to contain pseudo-immediate values.
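The header layout described so far (a five-bit marker, a three-bit decode field, a fifteen-bit section, and a nine-bit section) can be illustrated with a small sketch. The function name `split_header` and the dictionary keys are hypothetical, and the sketch assumes that the first bit of the block is the most significant bit of the 32-bit word.

```python
HEADER_TAG = 0b11011  # the five leading bits that mark this kind of block header


def split_header(word):
    """Split a 32-bit word into header fields, or return None if it is not a header.

    Assumption: the first bit of the block is the most significant bit of `word`.
    """
    if (word >> 27) != HEADER_TAG:
        return None
    decode = (word >> 24) & 0b111   # three-bit decode field
    middle = (word >> 9) & 0x7FFF   # fifteen-bit section
    short = word & 0x1FF            # nine-bit section
    return {
        "decode": decode,
        "middle": middle,
        "short": short,
        # Slots at the end of the block left undecoded for pseudo-immediates.
        "pseudo_immediate_slots": 7 - decode,
    }
```

With a decode field of 5, for instance, the last two instruction slots of the block would be left available for pseudo-immediate values.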
One of the things that the fifteen-bit section may contain is a target field, consisting of fourteen bits, each one corresponding to one of the remaining 16-bit portions of the block. If a bit in this field is a 1, this indicates, where this feature is enabled, that only those locations so indicated may be the target of a branch.
Another thing it may contain is a sequence of thirteen bits, each of which is indicated by a B. These bits correspond, in order, to the 16-bit areas within the remainder of the block, not including the first one, as a break between blocks is implicit. They mark the beginning of a group of instructions which may all be executed in parallel: a B bit equal to 1 marks a break, across which the instructions cannot execute in parallel.
When a field consisting of B bits is present in the block header in the first 32-bit instruction slot of a block, it indicates that the instructions in that block may, normally, begin execution independently of the instructions in the block which precede them. The fourteen last bits of the first instruction slot, when equal to 1, indicate when this is not the case; each 1 bit shows the start of the first instruction of a group of one or more instructions, each of which may execute independently of the others in that group, but which must wait for the completion of the instructions which precede the group. Thus, the 1 bits split the instructions in the block into multiple groups of independently executing instructions, where the groups must still execute in sequence.
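As a rough illustration of how the B bits partition a block, the following sketch takes the thirteen-bit form of the field, with one bit for each 16-bit area after the first. The function name `parallel_groups` and the numbering of the areas from 0 to 13 are assumptions made for this example only.

```python
def parallel_groups(b_bits):
    """Group the fourteen 16-bit areas that follow the header.

    b_bits holds thirteen 0/1 values, one for each area after the first;
    a 1 marks a break, so the area it belongs to starts a new group.
    """
    if len(b_bits) != 13:
        raise ValueError("expected thirteen B bits")
    groups = [[0]]  # area 0 always starts a group: the block break is implicit
    for area, bit in enumerate(b_bits, start=1):
        if bit:
            groups.append([area])
        else:
            groups[-1].append(area)
    return groups
```

The areas within one returned group may execute in parallel, while the groups themselves must execute in sequence.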
In order that groups of instructions indicated as executing in parallel will work properly on all implementations, certain restrictions are to be observed for the instructions within a single such group.
The most basic such restriction is that a given register may be accessed by more than one instruction in a group only if all of those accesses are reads. Combining one or more reads of a register with even a single write to it by another instruction in the group is not permitted.
This is necessary because not all implementations will actually have a microarchitecture designed around the VLIW philosophy, and so instructions specified as executing in parallel may still execute serially.
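The basic restriction can be expressed as a check that an assembler or a code-verification tool might perform. This is only a sketch: the function name `group_is_safe` and the representation of an instruction as a pair of register sets are hypothetical, and registers from different register files would need distinct identifiers.

```python
def group_is_safe(instructions):
    """Check the basic register restriction for one parallel group.

    Each instruction is a (reads, writes) pair of sets of register identifiers.
    A register written by any instruction must not be accessed, in any way,
    by a different instruction in the same group.
    """
    readers, writers = {}, {}
    for index, (reads, writes) in enumerate(instructions):
        for reg in reads:
            readers.setdefault(reg, set()).add(index)
        for reg in writes:
            writers.setdefault(reg, set()).add(index)
    for reg, writing in writers.items():
        accessing = writing | readers.get(reg, set())
        if len(accessing) > 1:
            return False
    return True
```

Note that a single instruction which both reads and writes the same register passes this check, since only one instruction is involved.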
Depending on the technologies used in producing implementations, more severe limitations may be required for conformant and portable code. One possibility is that even multiple read accesses to a single register are not allowed, because while register files need multiple ports, individual registers in them might run into fan-out considerations.
An even more extreme scenario is that the register files are divided into groups of registers, and two different simultaneous instructions can't access the same group of registers. 32-register files would be divided into groups of four registers, and 128-register files would be divided into groups of eight registers. To be clear, this would not prevent a single instruction from referencing multiple registers in the same group of registers; indeed, that is the normal and expected case, but an instruction would have to have entirely to itself all the groups of registers that it accesses in any way.
It may turn out that there is no reason for an implementation to require this restriction, and if experience shows that to be the case, it would be dropped as the criterion for portable code. However, while this restriction may seem severe, it does not interfere with what is the intended normal use of the ability to execute multiple instructions in parallel.
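The most severe form of the restriction, in which each instruction must have to itself every register group it touches, could be checked along the following lines. The sketch assumes that groups consist of consecutive, aligned register numbers (registers 0 through 3, 4 through 7, and so on), which is not stated above; the function names are likewise hypothetical.

```python
def register_group(reg, file_size=32):
    """Return the group number of a register (assumes consecutive, aligned groups)."""
    group_size = {32: 4, 128: 8}[file_size]
    return reg // group_size


def groups_are_exclusive(instructions, file_size=32):
    """Each instruction is a set of register numbers it accesses in any way.

    Returns True if no register group is touched by two different instructions.
    """
    owner = {}
    for index, regs in enumerate(instructions):
        for group in {register_group(r, file_size) for r in regs}:
            if owner.setdefault(group, index) != index:
                return False
    return True
```

A single instruction using several registers from one group is accepted, matching the normal and expected case described above.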
The fifteen-bit section may instead contain a predication clause, or one of the three possible short header clauses.
A predication clause begins with a bit marked S. If that bit is 0, the instruction (or instructions) in the 32-bit instruction slot corresponding to a bit in the predicated field that is a 1 will execute only if the flag bit indicated in the flag field is set (that is, equal to 1); if the S bit is 1 instead, the indicated instruction slots will execute when the flag bit is cleared.
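The effect of the S bit can be summarized in a few lines. In this sketch, the function name `slot_executes` is hypothetical, and it is assumed that a slot whose bit in the predicated field is 0 executes unconditionally.

```python
def slot_executes(predicated, s_bit, flag_value):
    """Decide whether an instruction slot executes under a predication clause.

    predicated: the slot's bit in the predicated field (1 means predicated)
    s_bit:      0 means execute when the flag is set, 1 means when it is cleared
    flag_value: current value (0 or 1) of the flag named by the flag field
    """
    if not predicated:
        return True  # assumption: slots not marked are unaffected by the clause
    required = 0 if s_bit else 1
    return flag_value == required
```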
Finally, there is a 9-bit section, which may only contain one of the three possible short header clauses.
The first type of short header clause indicates instruction slots which are to be decoded as 32-bit instructions in an alternate format. This allows more general memory load and store instructions, and more general load and store multiple instructions, to be present with limited additional overhead.
A bit that is 1 indicates that the alternate form of decoding is to be used for the corresponding instruction slot.
The second type of short header clause indicates instruction slots which can only contain a pair of 16-bit instructions, never a 32-bit instruction, never a setup directive, never the start of a 48-bit or 80-bit instruction. In these slots, the first bit of each 16-bit half is used to indicate, if it is a 1, that the instruction is permitted to modify the condition codes.
The third type of short header clause is a modified form of the extended opcode specification; this can only modify 32-bit instructions, and provides only one additional opcode bit for each one, thus only doubling, rather than quadrupling, the possible opcode values for them.
A 32-bit instruction slot within a block will normally contain either a 32-bit instruction which starts with 1, a pair of 16-bit instructions, both of which start with 0, or a 32-bit instruction which starts with 0 and the second half of which starts with 1 to distinguish it from a pair of 16-bit instructions. This latter form of 32-bit instruction is primarily used for the operate instructions, but it is also used for an abbreviated form of a complete instruction set, including memory-reference instructions, where three bits are used for a decode field, to allow one to be specified with minimal overhead.
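The three normal cases can be told apart from just two bits of the slot, which is what allows each slot to be decoded independently. The sketch below illustrates this; the function name and the returned labels are hypothetical, the first bit of the slot is assumed to be its most significant bit, and block headers and slots altered by header clauses are not considered.

```python
def classify_slot(word):
    """Classify an ordinary 32-bit instruction slot from its two leading half-bits."""
    first_bit = (word >> 31) & 1          # first bit of the slot
    second_half_bit = (word >> 15) & 1    # first bit of the second 16-bit half
    if first_bit == 1:
        return "32-bit instruction"
    if second_half_bit == 0:
        return "pair of 16-bit instructions"
    return "32-bit instruction, second form"
```

A full decoder would first have to check whether the first slot of the block is a header, since headers also begin with a 1 bit.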
What if it is desired to specify that some of the instructions in a block are branch targets, and that some of them are to be executed in parallel, and to predicate some of them?
It is possible to begin a block with more than one 32-bit header.
The following rules apply:
Only the decode field in the first 32-bit header is valid.
Unused header clauses are to be those which contain alternate fields, which are to consist of all zeroes.
If a block begins with 1111, then the first 32 bits of the block are also a block header, in the format shown below:
This type of block header indicates an alternate block format, significantly different from that of blocks without headers, or with the type of header described previously. This is the only type of block in which instructions longer than 32 bits may appear; 48-bit, 64-bit, and 80-bit instructions are available. (While mechanisms are possible to allow instructions longer than 32 bits to appear in ordinary blocks without preventing the independent decoding of 32-bit instruction slots, the overhead of these methods would mean that the forms of longer instructions would have to be drastically different in the two cases, and thus two different sets of longer instructions have not been created.)
Instructions may cross block boundaries between blocks of this type, but not boundaries between blocks of this type and ordinary blocks which do not begin with 1111.
Each of the two-bit fields in the type of block header shown above corresponds to a 16-bit instruction slot in the remainder of the block. The contents of those fields are interpreted as follows:
Note that this permits the free intermixing of instructions of all the available lengths.
If a block begins with 1110, then the first 32 bits of the block are also a block header, in the format shown below:
This causes the block to consist of 36-bit instructions, allowing restrictions on the available addressing modes in the block to be removed.
Unlike the header for free-format blocks, this header may be combined with conventional block headers. As it changes the format of instructions, starting at its location, it must appear immediately after all the conventional block headers at the beginning of the block, since after its position they could no longer be recognized.
The complement of registers included with this architecture is as follows:
There are 32 integer registers, each of which is 64 bits in length, numbered from 0 to 31.
Registers 1 through 7 may be used as index registers.
Registers 25 through 31 may be used as base registers, each of which points to an area of 65,536 bytes in length.
Register 24 serves as a base register which points to an area 32,768 bytes in length.
Registers 9 through 15 may be used as base registers, each of which points to an area of 4,096 bytes in length.
At least part of the area of 4,096 bytes in length pointed to by register 8 will normally be used to contain up to 512 pointers, each 64 bits in length, for use in either Array Mode addressing or Address Table addressing.
Registers 17 through 23 may be used as base registers, each of which points to an area of 1,048,576 bytes in length. This addressing format is used for 48-bit extended memory-reference instructions.
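The base register assignments above can be collected into a small lookup, which also confirms the figure of 512 pointers for the area addressed through register 8. The function name `base_area_bytes` is hypothetical, and registers for which no base-register role is described above simply return None.

```python
def base_area_bytes(reg):
    """Return the size in bytes of the area a base register points to, or None."""
    if 25 <= reg <= 31:
        return 65_536
    if reg == 24:
        return 32_768
    if 17 <= reg <= 23:
        return 1_048_576   # used by 48-bit extended memory-reference instructions
    if 8 <= reg <= 15:
        return 4_096       # register 8 normally addresses the pointer table
    return None


# The 4,096-byte area at register 8 holds at most this many 64-bit pointers.
POINTER_TABLE_ENTRIES = base_area_bytes(8) // 8
```

Each area size is a power of two, so the number of displacement bits each one implies can be found with `(size - 1).bit_length()`, giving 16, 15, 20, and 12 bits respectively.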
There are 32 floating-point registers, each of which is 128 bits in length, numbered from 0 to 31.
Floating point numbers in IEEE 754 format have exponent fields of different length, depending on the size of the number. For faster computation, floating-point numbers are stored in floating-point registers in an internal form which corresponds to the format in which extended precision floating-point numbers are stored in memory: with a 15-bit exponent field, and without a hidden first bit in the significand.
As 128-bit extended floating-point numbers are already in this format in memory, all floating-point numbers will fit in a 128-bit register, although shorter floating-point numbers are expanded.
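To make the expansion of a shorter number concrete, the sketch below converts a 64-bit IEEE 754 value to a sign, a 15-bit exponent, and a significand with an explicit leading bit. Several details are assumptions rather than part of the description above: an exponent bias of 16,383, a 112-bit significand field (so that 1 + 15 + 112 = 128), and the handling of zero. Subnormals, infinities, and NaNs are deliberately left out.

```python
import struct


def expand_double(x):
    """Expand a normal 64-bit float to (sign, 15-bit exponent, 112-bit significand).

    Assumptions: exponent bias 16383, and a 112-bit significand whose top bit
    is the explicit (no longer hidden) leading 1.
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp11 = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if exp11 == 0 and frac == 0:
        return sign, 0, 0  # zero: assumed to keep an all-zero exponent and significand
    if exp11 == 0 or exp11 == 0x7FF:
        raise ValueError("subnormals, infinities and NaNs are outside this sketch")
    exp15 = exp11 - 1023 + 16383                     # re-bias the exponent
    significand = ((1 << 52) | frac) << (112 - 53)   # make the leading 1 explicit
    return sign, exp15, significand
```

Because every number then has the same exponent width and an explicit leading bit, arithmetic on register contents need not distinguish between the original sizes.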
However, the 32 floating-point registers may also be used for Decimal Floating-Point (DFP) numbers. These numbers will also be expanded into an internal form for faster computation, but that internal form may take more than 128 bits.
This is dealt with as follows: Only 24 DFP numbers that are 128 bits in length may be stored in the 32 floating-point registers. When such a DFP number is stored in an even-numbered register, it is stored in that register, and the first 32 bits of the following register. When it is stored in a register the number of which is of the form 4n + 1 for integer n, the first 80 bits of the internal form of that number are stored in the last 80 bits of that register, and the remainder of the internal form of that number is stored in the last 80 bits of the second register after that register.
In this way, the same principle that governs storing double-length numbers in two adjacent registers is respected: numbers too long to be stored in a given register are stored in that register, and in another register of the same register file that is nearby. But the method is extended to allow more efficient use of the available space.
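The placement rule for 128-bit DFP numbers implies an internal form of 160 bits (128 + 32, or 80 + 80). The following sketch maps a register number to the bit ranges it would occupy; the function name `dfp128_placement` is hypothetical, and bits within a register are assumed to be numbered from 0 at the first (most significant) bit.

```python
def dfp128_placement(reg):
    """Return (register, first_bit, length) spans holding a 128-bit DFP number.

    Assumption: bit 0 is the first bit of a 128-bit register.
    """
    if not 0 <= reg <= 31:
        raise ValueError("no such floating-point register")
    if reg % 2 == 0:
        # The whole register, plus the first 32 bits of the following one.
        return [(reg, 0, 128), (reg + 1, 0, 32)]
    if reg % 4 == 1:
        # The last 80 bits of this register and of the second register after it.
        return [(reg, 48, 80), (reg + 2, 48, 80)]
    raise ValueError("registers of the form 4n + 3 cannot hold a 128-bit DFP number")
```

Counting the register numbers this function accepts gives 16 even registers plus 8 of the form 4n + 1, for the total of 24 stated above, and none of the resulting spans overlap.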
There are 16 short vector registers, each of which is 256 bits in length.
Each of these registers may contain:
As well, they may contain sixteen 16-bit short floating-point numbers in one of two formats.
These numbers all remain in these registers in the same format as that in which they appear in memory.
As for how data values are stored:
Signed integer values are stored in binary two's complement format.
Floating-point numbers are stored in IEEE 754 format.
The architecture is big-endian: the most significant bits of a value are stored in the byte at the lowest numbered address.
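As a brief illustration of the integer storage conventions, the following sketch lays out a 64-bit signed integer as it would appear in memory, in two's complement form with the most significant byte first. The function name `store_int64` is hypothetical.

```python
def store_int64(value):
    """Return the eight bytes of a signed 64-bit integer, lowest address first."""
    return value.to_bytes(8, byteorder="big", signed=True)
```

For example, `store_int64(-2)` yields seven bytes of 0xFF followed by 0xFE, and `store_int64(0x0102030405060708)` places the byte 0x01 at the lowest address.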