
The Concertina II Architecture

This is now my ninth attempt to propose a successor to my original Concertina architecture.

The primary instruction set is designed to be used within blocks that are 256 bits in length, according to a scheme which allows the instructions in a block to be decoded in parallel. However, the architecture also provides modes of operation in which the instructions are not organized into blocks.

The instruction set is divided into two independent opcode spaces. The 16-bit opcode space is used for 16-bit instructions. The 32-bit opcode space is used both for 32-bit instructions and for instructions with a 32-bit main portion which make use of the short pointer mechanisms to be effectively longer. Where both 16-bit and 32-bit opcode space instructions are used, the distinction between them is indicated out-of-band, either in the header field in modes 12 through 15, or in instruction prefixes in modes 2 and 3.

The blocks are divided into 32-bit instruction slots, and a header within the block indicates which instruction slots contain a 32-bit instruction. These instructions may contain short pointers, which point within the block, to additional data allowing the instruction to be effectively longer than 32 bits.

While the instruction set is designed to be used within such blocks, in order to provide the maximum of flexibility in adapting it to different requirements, several modes are provided in which the instruction stream may be organized, of which four of those modes involve the 256-bit instruction blocks around which the instruction set is designed.


Organizing instructions into blocks is a central feature of this design because the method used has some important advantages.

The blocks have a fixed length, and the header at the beginning of the block quickly indicates which of the 32-bit instruction slots in a block contains an instruction. So the instructions can all be decoded in parallel.

Some instructions are longer than 32 bits in length; these instructions contain short pointers which point somewhere within the 256-bit block to the additional information. Thus, decoding the longer instructions is still independent of decoding any other instruction in the block, and so it is not forced to be serial in this case either. The pointers do take up space in order to provide this advantage, however, and this does mean that code in the available non-block modes is less compact than it could otherwise be.


Because the modes other than Mode 0 are specifically suited to different categories of implementations of the architecture, it is not required that an implementation support all of these modes.

A special-purpose implementation, designed for such things as embedded applications, may support as little as any one mode.

A general-purpose implementation, intended to be able to run any software offered for use on the architecture, must meet a more complicated set of requirements.

In addition to the primary instruction set, built around instructions in the 32-bit opcode space, there are two other instruction sets: the 16-bit instruction set, and the hybrid instruction set.

As well, some of the modes which make use of a given instruction set only support a subset of that instruction set.

A general-purpose implementation of the architecture is required to support:

However, support for as many of the available modes as possible is strongly encouraged, even if some of them will not be supported particularly well. This applies especially to any processor that might be used in a development system, so that it can be confirmed that code works properly when converted to other modes.

Thus, it is intended that the format of object code and executables for operating systems designed for that architecture will be such as to allow that code to be converted to any of several modes in which programs containing instructions drawn from the instruction set in which that program is written may be expressed. In this way, when a processor that either only supports one of these modes, or is most efficient at executing programs in one of these modes, is presented with a program, it may load it in as a program in the mode which will work, or work best, with it.

Thus, a program may be written to run in Mode 5, which supports a subset of the hybrid instruction set. As well as allowing the program to be converted to other modes, the object format would also indicate which subset of the hybrid instruction set is used, so that the recipient computer is aware that it can run it in any of modes 5, 6, 7, 8 or 9.

Thus, for example, a program containing instructions in the normal instruction set obviously cannot be converted to Mode 0, in which only pure 32-bit instructions are available.

Note that while this scheme allows software to be distributed in any one of the three instruction sets, and still be usable on all conforming general-purpose implementations, there are still some features more narrowly dependent on a smaller group of modes, such as the use of predicated instructions.

Thus, it is envisaged that in addition to general-format object modules, programs may be distributed with alternate versions of some subroutines that execute in only one particular mode for higher performance on machines on which that mode is implemented in an efficient manner.


In brief, these modes, and the types of processor designs with which it is presumed that they will be most effective are:


Mode 0: RISC mode

In this mode, only those instructions that are exactly 32 bits long are supported. This portion of the instruction set has been ensured to be sufficient to write programs using nearly all of the features of a CPU.

This mode is intended to be supported by all implementations; its purpose is to avoid the need for high-speed interpretive programs making use of the technique of just-in-time (JIT) compilation to deal with the complexities of organizing instructions into blocks, as Mode 6 and Mode 7 require.


Mode 2: Normal Prefix mode.

Although the instruction set has been designed around using 256-bit blocks to contain instructions, in such a way as to permit instructions to be decoded in parallel as if they were all the same length despite instructions actually being of different lengths, this alternative format, involving applying 16-bit prefixes to some instructions, has been provided which avoids the need for the block structure while allowing the full instruction set to be used.

This mode is primarily suitable for minimal implementations of the architecture that decode and execute instructions one at a time, and one after the other. (Note that the early form of pipelining, where Fetch, Decode, and Execute are carried on simultaneously for different instructions, but only one instruction is being decoded at a given time, and only one instruction is executing at a given time, such as was done on the IBM 7094 II computer, would still qualify as this type of implementation for the purpose of deciding which instruction formats are workable.) However, a sufficiently sophisticated large-scale implementation of the architecture may not be presented with a significant problem in decoding the indications of the lengths of instructions without undue delay.


Mode 3: Enhanced Prefix mode

This mode adds an additional form of instructions longer than 32 bits which avoids the need for adding a 16-bit prefix to them. This allows programs to be more compact. However, the initial 32 bits of an instruction need to be fully decoded in order to access the length information inside that part of the instruction.

This mode would also be suitable for minimal implementations of the architecture, but it appears, at least to me, that it would be a challenge for any large-scale implementation to handle it efficiently.


Mode 4: 16-bit mode

This mode is closely related to Prefix Mode, but here 16-bit instructions are the default, and prefixes within the 16-bit opcode space are used to indicate instructions from within the 32-bit opcode space, as well as for other purposes.


Mode 5: Fast hybrid mode

In this mode, a modified instruction set is used. A limited number of 16-bit instructions, and a limited number of 32-bit instructions, are modified so that each of those two groups of instructions now takes up only half of the possible opcode space, and thus denser program code can be achieved, but with fewer limitations than imposed by 16-bit mode.


Mode 6: Hybrid CISC mode

This mode uses the same 16-bit and 32-bit instructions as Fast Hybrid Mode, but in addition supports most of the instructions in 32-bit opcode space that are longer than 32 bits by means of using the pSupp field to directly indicate the instruction length, allowing decoding to be relatively fast and efficient, while more closely approximating the feature set of a classical CISC architecture.


Mode 7: Extended hybrid mode

In addition to offering the features of Hybrid CISC Mode, in this mode the 32-bit opcode space prefixes may be used unchanged, and some of the 16-bit opcode space prefixes may be used by altering their first bit from 1 to 0.

Thus, the features of Normal Prefix Mode are added in this mode. Instructions longer than 32 bits in 32-bit opcode space, however, continue to be decoded as in Hybrid CISC Mode, and not as in Enhanced Prefix Mode.


Mode 8: Hybrid block mode

This mode organizes program code written in the Hybrid instruction set into blocks of 256 bits in length. It allows all instructions with pImm and pSupp fields, even the ones with no pSupp field for which pImm fields do not work in Mode 6, to work normally, with pointers in those fields, in the same manner as in Mode 12 and Mode 14 for the same instructions in the primary instruction set.


Mode 9: Hybrid VLIW mode

This mode also organizes program code written in the Hybrid instruction set into 256-bit blocks; here, the blocks also contain information about dependencies and resource conflicts to enable faster execution even in the absence of full out-of-order support.


Mode 10: Aligned hybrid block mode.

This mode is a block mode which uses the hybrid instruction set, but with the restriction that 16-bit instructions can only be used in pairs, so that 32-bit instructions are always aligned on 32-bit boundaries, in order to simplify decoding. It does not include provision for bits that indicate dependencies and resource conflicts, so it doesn't attempt to mitigate the effects of the absence of out-of-order execution support. It uses headers that are very similar to, and derived from, those for modes 12 and 13.


Mode 11: Aligned hybrid VLIW mode.

This mode is a block mode which uses the hybrid instruction set, but with the restriction that 16-bit instructions can only be used in pairs, so that 32-bit instructions are always aligned on 32-bit boundaries, in order to simplify decoding. This mode does provide for the U, D, and B bits as a substitute for elaborate out-of-order logic. It uses headers that are very similar to, and derived from, those for mode 9.


Mode 12: Normal simple block mode

This mode organizes program code into 256-bit blocks with minimal overhead.

This makes use of the short-range pointers within instructions to allow the computer to quickly, and in parallel, fetch the additional parts of each instruction as instructions are decoded and executed in parallel.

No additional information is provided, however, to indicate whether instructions have dependencies or other characteristics that will interfere with executing them in parallel.

Therefore, this mode appears to be best suited to a superscalar processor which also has a full out-of-order (OoO) instruction execution capability.


Mode 13: Basic simple block mode

Basic simple block mode is identical to Normal simple block mode, except that instructions with pSupp fields indicating a supplementary portion are not supported.

While pseudo-immediates are valuable in allowing programs to obtain constants from within the instruction stream without the overhead of additionally fetching from a distant location within data memory, the longer instructions made possible by providing instructions with a supplementary portion are often of limited usefulness.

Dropping both those instructions and the additional complexity this feature adds to instruction decoding may, for some types of implementation, enhance performance.


Mode 14: Normal VLIW mode

This mode also organizes program code into 256-bit blocks that facilitate decoding instructions in parallel. As this implies executing instructions in parallel, all block formats in this mode include information about dependencies between instructions, and also about resource conflicts between instructions.

This mode is best suited to a superscalar processor which attempts to provide high performance without full out-of-order instruction execution capability, through a very long instruction word (VLIW) design similar to that which is used by many digital signal processor (DSP) designs.


Mode 15: Basic VLIW mode

Basic VLIW mode is identical to Normal VLIW mode, except that support for instructions with pSupp fields is dropped, the rationale being the same as for Mode 13, Basic simple block mode.


These modes, in detail, are as follows:

Mode 0: RISC Mode

A program to be run in Mode 0 may contain only those instructions which belong to the 32-bit opcode space and which are exactly 32 bits long, not using block-internal pointers either for immediate values or supplementary portions of the instruction itself.

Mode 0 makes provision for running unblocked code, where instructions stand on their own in a continuous stream, as in a conventional non-VLIW computer.

As noted, a subset of the instruction set is available for use in unblocked code. 16-bit instructions may not be used, since they're indicated in the block header. Instructions with pointers to additional portions of the instruction within a block may not be used, since there are no blocks.

Unblocked code may be useful, for example, in avoiding the need for excess complexity in interpretive programming systems making use of Just-in-Time compilation. In some implementations, unblocked code, since it lacks explicit indication of dependencies, may run much more slowly than code organized into blocks of 256 bits as described above. In larger, more elaborate implementations, that include full provision for out-of-order execution, the performance gap may be considerably smaller.

Mode 2: Normal Prefix Mode

This mode allows for code that is not organized into 256-bit blocks, but which can still employ the full capabilities of the architecture. This is done by having variable-length instructions with the prefix property, as is common in many CISC systems.

Note that the way in which a prefix is used to indicate a string of 16 bit instructions means that it is not possible to branch to any of the 16-bit instructions within that string. Only the prefix of such a string may be a branch target, which may necessitate breaking such strings into pieces.

The prefixes used in this mode have the form:

Note that they use the portion of 32-bit opcode space that immediately follows the 32-bit cross-mode subroutine jump instruction, and that immediately precedes the three-address memory-to-memory instructions.


The first line in the diagram allows prefixes to also be used either as a way to access an instruction from a different Instruction Mode within Prefix Mode, or to change the instruction mode or the prefix mode. These prefixes are available in any mode in which instructions from 32-bit opcode space are available, not just in Prefix Mode.

16410x SM    Set Mode
16414x MP    Mode Prefix

The second line in the illustration shows a prefix which is also available in all modes in which instructions from 32-bit opcode space are available.

This prefix is intended to be placed at any point in code to which control may be transferred when control flow guidance is enabled. Its values are:

164130 164170  Permitted target for a jump or branch instruction
164131 164171  Permitted target for a return from subroutine
164132 164172  Permitted target for a subroutine call instruction
164133 164173  Permitted target for a non-advancing subroutine call instruction

A non-advancing subroutine call instruction is a subroutine call instruction which is not intended to call a subroutine, but only to perform a branch, and place the return address in a register for another purpose, such as initializing a base register at a subroutine entry point. It is required to make this distinction in order to have a shadow stack function properly, as in this architecture an actual subroutine call is used for that purpose, and not a non-branching register-to-register version of the subroutine call, such as the BALR instruction of the IBM System/360.

In order for the availability of this prefix in other modes, such as Mode 0, not to interfere with instruction decoding, it is followed by either 16 unused bits or an instruction from 16-bit opcode space, as indicated by the single opcode bit at bit 10. Thus, in the pairs of opcodes shown above, the first opcode is that of the prefix which precedes a 16-bit instruction, the second that of the prefix which precedes 16 unused bits.
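
As a rough illustration of how the eight opcode values listed above relate to one another, the following Python sketch decodes one of them. It assumes that bit 10 is counted with bit 0 as the most significant bit of the 16-bit prefix (which is consistent with the two values in each pair differing by octal 40), and that the last two bits select the kind of permitted target. The names decode_cfg_prefix, CFG_BASE, UNUSED_BIT and TARGET_KINDS are hypothetical, chosen only for this example.

```python
# Sketch only: these names are illustrative, not defined by the architecture.
CFG_BASE = 0o164130           # lowest of the eight prefix values listed above
UNUSED_BIT = 1 << (15 - 10)   # bit 10, assuming bit 0 is the most significant bit
KIND_MASK = 0b11              # last two bits: kind of permitted target (assumption)

TARGET_KINDS = (
    "jump or branch",
    "return from subroutine",
    "subroutine call",
    "non-advancing subroutine call",
)


def decode_cfg_prefix(word):
    """Return (target kind, followed_by_16_bit_instruction) for one of the
    eight listed prefix values, or None for any other 16-bit value."""
    if not 0 <= word <= 0xFFFF:
        return None
    if (word & ~(UNUSED_BIT | KIND_MASK)) != CFG_BASE:
        return None
    kind = TARGET_KINDS[word & KIND_MASK]
    # Bit clear: a 16-bit instruction follows. Bit set: 16 unused bits follow.
    followed_by_instruction = (word & UNUSED_BIT) == 0
    return kind, followed_by_instruction
```

For instance, decode_cfg_prefix(0o164171) reports a permitted target for a return from subroutine, followed by 16 unused bits.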

Note that, even when Control Flow Guidance is enabled, a Jump to Subroutine Cross-Mode Unknown (JSRXMU) instruction will still be able to branch directly to a Short Subroutine Jump instruction, both as it needs to be able to do so in order to access its source mode (sM) field and as that, as a distinctive instruction required as a branch target, performs the needed functionality for control flow enforcement.


The third line in the illustration shows another prefix available in all modes where instructions from 32-bit opcode space are available; it allows a single instruction from 16-bit opcode space to be used by embedding it in the last 16 bits of a 32-bit instruction.


The fourth line in the illustration shows the prefix used to indicate a sequence of up to 255 instructions in 16-bit form.

The length field contains the number of instructions; a value of 0 indicates that all subsequent instructions will be in 16-bit form, unless something is encountered in the sequence of those instructions that causes a switch back to conventional instructions from the 32-bit opcode space.

In the case of a length field containing a number from 1 to 255, it is not permissible to branch from outside to any of the 16-bit instructions in the area so prefixed, because there would be no way to determine when to switch back from 16-bit instructions to 32-bit instructions. The block of 16-bit instructions, however, can contain 16-bit branch instructions that perform branches within the block, since the block position counter can be adjusted as part of performing such branches.


Also, such a sequence of 16-bit instructions is interruptible; a return from interrupt is allowed to branch to an instruction within the sequence from outside, since the count of 16-bit instructions remaining in the sequence will be among the status bits restored by the return from interrupt.


In the case of the 16-bit instructions which follow the same prefix, but with a length code of 0, this specific restriction does not apply. However, as there are no branch instructions which change the instruction mode, only a prefix within 16-bit opcode space, described in the page concerning 16-bit instructions, not here, changes the instruction mode (there are sixteen instruction modes available, but only within Mode 1 of the four instruction wrapping modes), any branch to a 16-bit instruction that is part of a sequence of 16-bit instructions being executed in instruction mode 1, where all instructions are assumed to be 16 bits long without any additional indication, must be made from either the same sequence, or another sequence, of 16-bit instructions being fetched and interpreted, as well as running, in this mode.

In both cases, for a length field from 1 to 255 and for a length field of 0, these restrictions on branching, of course, do not apply to a return from an interrupt, as the instructions used for returning from an interrupt, unlike normal branch instructions, also set additional status bits of the computer (or process) including the bits relevant to the correct interpretation of instructions, in addition to specifying a program counter value.


The fifth line in the illustration shows a special prefix format for instructions which contain only one immediate which is 8 bits long. In this case, the 8-bit immediate value is contained within the 16-bit prefix itself. A single bit is used to indicate if a supplementary field is present; if it is zero, one is not present, if it is one, there is a 16-bit supplementary field in the instruction; this prefix cannot be used for instructions with a longer supplementary field.


The sixth line in the illustration shows the prefix used to indicate a general instruction from 32-bit opcode space which uses either or both immediate pointers or a supplementary pointer to include additional material in the instruction.

The P bit, if set, indicates that the normal supplementary field of the instruction is extended by an additional 16 bits at the end which supply additional opcode bits; this allows the use, in prefix mode, of the same additional instructions which the prefix bits in header fields allow for the block modes.

The lSupp field may contain zero if a supplemental portion for the instruction is not present, and it may contain a number from 1 to 4 (or up to 7 in the case of future extensions to the instruction set) indicating the length of the supplementary portion of the instruction in units of 16 bits. This is the total length, and it therefore is increased by one when the P bit is set.

The lIm1 field is used for the first immediate used by the instruction, whether it is an operand immediate or a source immediate; it is always the only one used if the instruction uses only one immediate operand. It indicates the length of the immediate as a power of two, from single-byte immediates to 256-bit immediates, with 256 bits being the length of a short vector and 128 bits being the length of the longest possible floating-point values.

The lIm2 field similarly indicates the length of the second immediate operand used by the instruction, if any.

If, and only if, both lIm1 and lIm2 are 1, a single 16-bit portion within the instruction prefixed will contain both 8-bit immediates, with the first immediate in the high-order bits at the lower address, and the second immediate in the low-order bits at the higher address.

In any other case where an 8-bit immediate is present, it adds 16 bits to the length of the instruction, which will contain an unused 8-bit field which should be zero followed by the immediate.


The form of prefixed instructions is this:

First, the 16-bit prefix.

Second, the base 32-bit instruction. All pSupp and pImm fields in the instruction are to contain zeroes, but bits indicating that a pImm field is used are to be set normally.

Third, the supplementary portion of the instruction, if present.

Fourth, the first immediate value used by the instruction, if present.

Fifth, the second immediate value used by the instruction, if present.
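
The rules above for the general prefix can be summarized by a short length calculation. The Python sketch below is only an illustration under stated assumptions: it assumes that a value of 0 in lIm1 or lIm2 means the immediate is absent, and that a value k from 1 to 6 means an immediate of 2^(k-1) bytes (so that 1 is an 8-bit immediate and 6 is a 256-bit immediate). That encoding is inferred from the text, not confirmed by it. The function names are hypothetical.

```python
def immediate_bits(code):
    """Size in bits of an immediate for a length code.
    Assumption: 0 = absent, k = 2**(k-1) bytes for k in 1..6."""
    if not 0 <= code <= 6:
        raise ValueError("length code out of range")
    return 0 if code == 0 else 8 << (code - 1)


def prefixed_length_bits(l_supp, l_im1, l_im2):
    """Total length in bits of a prefixed instruction: 16-bit prefix,
    32-bit base instruction, supplementary portion, then immediates."""
    if not 0 <= l_supp <= 7:
        raise ValueError("lSupp out of range")
    total = 16 + 32 + 16 * l_supp   # lSupp is already the total, P bit included
    if l_im1 == 1 and l_im2 == 1:
        return total + 16           # two 8-bit immediates share one 16-bit unit
    for code in (l_im1, l_im2):
        bits = immediate_bits(code)
        # A lone 8-bit immediate is padded with an unused 8-bit field.
        total += 16 if bits == 8 else bits
    return total
```

Under these assumptions, an instruction with no supplementary portion and a single 8-bit immediate occupies 64 bits, the same as one with two 8-bit immediates.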


Incidentally, note that a valid Mode 0 program is also a valid Normal Prefix Mode or Enhanced Prefix Mode program. However, Mode 0 still has a reason to exist, because in Mode 0, a valid Mode 0 program can perhaps be executed more quickly, as the computer can assume it will not encounter any prefixes, and all instructions will be 32 bits long and on 32-bit boundaries.

Mode 3: Enhanced Prefix Mode

In order to permit more compact code, the feature is added with this mode that an additional form for instructions longer than 32 bits is supported.

In this form, the instruction begins with the same 32 bits of the instruction that would appear in the instruction slot in sequence in a block mode, without any prefix.

However, if a supplementary portion of the instruction is present, the pSupp field will not be zero, and any pImm field corresponding to an immediate that is used will also not be zero.

Instead, the pSupp field will contain the length, in units of 16 bits, of the supplementary portion of the instruction, and the pImm fields will contain a byte pointer to the end of the immediate value to which they refer, relative to the first byte of the instruction having 0 as its position. Thus, the lowest value which may be contained in a pImm field in this instruction format is 4, for a one-byte immediate value immediately following an instruction without a supplementary portion.

The elements of the instruction, as in the case of a prefixed instruction, must be in the following order:

First, the base 32-bit instruction.

Second, any supplementary portion of the instruction, if present.

Third, any immediate used for an operand argument to an operate instruction, if present.

Fourth, any immediate used for a source argument to an instruction, if present.

In this form, the first 32 bits of the instruction contain all the information needed to determine the total length of the instruction, but decoding is considerably more complicated than when a 16-bit prefix is used.
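
To make the pointer arithmetic concrete, here is a minimal Python sketch that computes the pImm values for an instruction in this form. It assumes that the immediates follow the supplementary portion directly, with no padding between them, which the text does not state explicitly; the function name is hypothetical.

```python
def enhanced_prefix_pointers(p_supp, immediate_sizes):
    """Return the pImm value for each immediate, in order.

    p_supp: length of the supplementary portion in units of 16 bits.
    immediate_sizes: size in bytes of each immediate, in instruction order.
    Assumption: immediates are packed with no padding between them.
    """
    position = 4 + 2 * p_supp          # first byte after base + supplementary
    pointers = []
    for size in immediate_sizes:
        position += size
        pointers.append(position - 1)  # byte position of the immediate's last byte
    return pointers
```

With no supplementary portion and a single one-byte immediate, this gives a pointer value of 4, matching the lowest value mentioned above.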

It is optional for the computer to actually turn off the additional features of Mode 3 when within Mode 2. As well, it is also optional for Mode 0 to be fully enforced.

Therefore, it is possible for a conformant implementation of the hardware to "support" at least two modes without doing any actual mode switching; it could always operate as if it were in either Mode 2 or Mode 3, as long as it treated switches between supported modes as NOPs, and signalled an error for an attempt to switch to an unsupported mode.

Mode 4: 16-bit Mode

In this mode, instructions from the 16-bit opcode space are the default, and 16-bit prefixes are used to execute instructions from the 32-bit opcode space.

It is possible to enter this mode using a cross-mode jump instruction. It is also possible to enter this mode with a normal cross-mode subroutine call instruction, but not with the version of the cross-mode subroutine call instruction that does not require the mode of the called program to be specified.

Further details about this mode are available on the page describing the 16-bit instructions, as after they are described, the 16-bit headers which are used in this mode can be more easily described.

Mode 5: Fast Hybrid Mode, Mode 6: Hybrid CISC Mode, and Mode 7: Extended Hybrid Mode

As these modes use an instruction set derived from the normal instruction set in 32-bit opcode space and from the instruction set in 16-bit opcode space, to properly describe these modes, the instruction formats for that instruction set must be exhibited.

Therefore, a full description of the various forms of Hybrid Mode is deferred until the instruction formats of both the normal 32-bit instruction set and the 16-bit instruction set have been described.

Mode 8: Hybrid Block Mode

Hybrid Block Mode uses the merged 16-bit and 32-bit instruction set of Mode 5, Fast Hybrid Mode, as the basis for a mode which organizes instructions into blocks.

The illustration below shows the format of the header for a 256-bit block, which may be 16 or 64 bits in length:

The Decode field contains a 1 bit corresponding to each 16-bit instruction slot which contains the first part of an instruction. In the first header format, it will always begin with 1, and in the second header format it will always begin with 0001.

While this simple format of the header field doesn't indicate if the last instruction in a block is 16 bits or 32 bits in length, that information will be derived from the first bit of the instruction instead.
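
As a small illustration of why this header permits parallel decoding, the Python sketch below lists the 16-bit slots in which instructions begin, using nothing but the Decode field. Which slot the first Decode bit corresponds to is not stated here, so it is passed in as a parameter; the function name and the representation of the field as a string of 0 and 1 characters are choices made only for this example.

```python
def instruction_start_slots(decode_bits, first_slot):
    """Return the slot numbers in which an instruction begins.

    decode_bits: string of '0' and '1', one character per 16-bit slot, in order.
    first_slot: slot number the first Decode bit refers to (an assumption
                supplied by the caller, not taken from the text).
    """
    if set(decode_bits) - {"0", "1"}:
        raise ValueError("decode_bits must contain only '0' and '1'")
    return [first_slot + i for i, bit in enumerate(decode_bits) if bit == "1"]
```

Every start position is known before any instruction is examined, so each decoder can begin work at once. The sketch does not derive instruction lengths: a slot with a 0 bit need not be the second half of a 32-bit instruction, and the length of the last instruction has to come from the instruction itself, as noted above.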

In addition to the instruction formats shown in the illustration of instruction formats for Hybrid Mode, except for the instructions in 32-bit opcode space that were modified to a different form so that all the 32-bit instructions would fit in only half the opcode space, which remain in their modified form instead of reverting to their original form, all the instructions in 32-bit opcode space are included in Hybrid Block Mode.

In this mode, since it is a block mode, as might be expected, the pSupp field works normally as the pointer to a 16-bit slot in the block, and the pImm fields work normally as pointers to an 8-bit byte in the block.

The header format in the second line provides prefix bits; for any of these bits that are 1, if there is an instruction beginning in that 16-bit instruction slot, and it is a 32-bit instruction with a pSupp field, the supplementary bits are extended by 16 bits (at the end, not the beginning, despite the name of the prefix field), those 16 bits forming an additional (and more significant, and so in that respect the name is appropriate) part to the opcode bits of the instruction, thus allowing for extensions to the instruction set which will usually remain within the existing instruction formats.

Predication is optional. A P bit indicates whether its corresponding 16-bit slot begins a predicated instruction; the flag identified by the following flag bit then determines whether it is executed. If the corresponding S bit is zero, which is the normal case, a true flag (set to 1) leads to execution; if the S bit is 1, the instruction executes if the flag is not set but is instead zero.
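
The predication rule just described amounts to a simple truth table, sketched below in Python. The function name and argument names are hypothetical; only the behaviour of the P bit, the S bit and the selected flag is taken from the text.

```python
def instruction_executes(p_bit, s_bit, flag):
    """Decide whether an instruction executes under the predication rule.

    p_bit: 1 if the instruction is predicated, 0 otherwise.
    s_bit: 0 for the normal sense (execute when the flag is 1),
           1 for the inverted sense (execute when the flag is 0).
    flag:  current value (0 or 1) of the flag selected for this instruction.
    """
    if not p_bit:
        return True                   # unpredicated instructions always execute
    return bool(flag) != bool(s_bit)  # execute when flag and sense bit differ
```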

Some header formats (those in the second and third lines of the diagram) include a bit labelled X. This bit, if set, indicates that the 16-bit instruction set, including the 32-bit instructions created with a 16-bit opcode space prefix so as to add memory-reference instructions to that instruction set to complete it, is used for the instructions in that block instead of the hybrid instruction set. This does mean the original full 16-bit instruction set that fills opcode space, where 16-bit operate instructions may set the condition codes.

Mode 9: Hybrid VLIW Mode

In this mode, the format of the header of a block is as shown below:

Once again, a 1 bit in the decode field, which must begin with 1, corresponds to a 16-bit slot in which an instruction begins.

The triplets of Upon, Depend, and Break bits each correspond to one 16-bit slot in order, and the offset field indicates how many additional Depend bits set to 1 are between a Depend bit set to 1 and its corresponding Upon bit.

The fourth line shows the form of a block header which changes the size of the block to 512 bits. The second 256-bit half of the block can only contain pseudo-immediate values, and a bit set in the appendage field indicates that the pImm fields of the instruction beginning in the corresponding 16-bit instruction slot point instead to the second half of the block. If the Z bit is set, the block remains 256 bits in size, and appendage bits cause the immediates to be in (presumably) unused space in the next block.

If a 16-bit instruction begins in the last instruction slot in the first 256 bits of the block, the last UDB triplet is not available to be replaced with an offset value, so the reminder is omitted, and the offset is kept track of based on the placement of U and D bits continuing on from the previous block.

The fifth line leaves the size of the block at 256 bits, replacing the appendage bits with a prefix field. If an instruction from 32-bit opcode space that has a pSupp field starts in the corresponding 16-bit instruction slot, a 1 bit in that word indicates that the supplementary bits will have, appended to them at the end, despite the name of the field, 16 additional bits, serving as part of the opcode of the instruction to extend the instruction set.

The third and sixth lines in the diagram show how a minimal form of predication is supported in this mode. In the third line, there is no branch field to point to the first branch instruction in the block, so the block can contain no branch instructions.

If a P bit is present, the instruction is predicated based on flag 0, the only flag that can be used, as there is no space to indicate a flag.

In the format shown in the third line, there is at least an S bit, and if that bit is set, predicated instructions execute if the flag is 0 instead of if it is 1 as is the normal case.

Also note that the format in the fifth line does allow branch instructions in the block but only if no instruction begins in either of the last two instruction slots, so that the last two UDB triplets can be replaced with a branch field.

The header format in the second line allows individual blocks consisting of instructions from 32-bit opcode space instead of hybrid mode instructions to be included in the instruction stream. This allows predication with flags 0 through 7 as well as a more general form of the 32-bit memory-reference instruction, and so may be occasionally useful.

Some header formats (those in the first and fifth lines of the diagram) include a bit labelled X. This bit, if set, indicates that the 16-bit instruction set, including the 32-bit instructions created with a 16-bit opcode space prefix so as to add memory-reference instructions to that instruction set to complete it, is used for the instructions in that block instead of the hybrid instruction set. This does mean the original full 16-bit instruction set that fills opcode space, where 16-bit operate instructions may set the condition codes.

Mode 10: Aligned Hybrid Block Mode

This mode uses the combined instruction set with both 16-bit and 32-bit instructions sharing a common opcode space that is used in modes 5, 6, and 7, but with the restriction that 16-bit instructions must come in pairs, so that each 32-bit instruction slot aligned on a 32-bit boundary contains either a 32-bit instruction or two 16-bit instructions.
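The pairing restriction can be checked mechanically. The following is a minimal sketch in Python, assuming that the instruction stream is given simply as a list of instruction lengths in bits, starting on a 32-bit boundary. The function name is hypothetical, and the single unpaired 16-bit instruction that may follow a 16-bit header is not modelled.

```python
def pairs_aligned(lengths):
    """Return True if every 16-bit instruction is part of an adjacent pair.

    lengths: instruction lengths in bits (16 or 32), in program order,
    with the first instruction starting on a 32-bit boundary (assumed).
    """
    i = 0
    while i < len(lengths):
        if lengths[i] == 32:
            # A 32-bit instruction fills one aligned slot by itself.
            i += 1
        elif lengths[i] == 16:
            # A 16-bit instruction must be followed by another one,
            # so that the two together fill one aligned slot.
            if i + 1 >= len(lengths) or lengths[i + 1] != 16:
                return False
            i += 2
        else:
            raise ValueError("instruction lengths must be 16 or 32")
    return True
```

Because each step consumes exactly 32 bits, every 32-bit instruction in an accepted sequence begins on a 32-bit boundary.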

The header format for this mode is shown below:

It is almost identical to the format for Mode 12, Simple Block Mode, and Mode 13, Basic Simple Block Mode, except that:

In the first format, as in Mode 12 and Mode 13, if the first bit of the decode field is zero, then the 16 bits following the header, which can only contain a single 16-bit instruction (in this mode, the only case where one can be present that is not part of a pair) are not decoded but instead skipped.

Mode 11: Aligned Hybrid VLIW Mode

This mode uses the combined instruction set with both 16-bit and 32-bit instructions sharing a common opcode space that is used in modes 5, 6, and 7, but with the restriction that 16-bit instructions must come in pairs, so that each 32-bit instruction slot aligned on a 32-bit boundary contains either a 32-bit instruction or two 16-bit instructions.

The header format for this mode is shown below:

Because each 32-bit instruction slot may contain two 16-bit instructions instead of a single 32-bit instruction, the header formats for this mode were derived from those of Mode 9 rather than those of Mode 14 and Mode 15. But while using the hybrid mode instructions requires a U, D, B bit triplet for each 16-bit field, requiring 16-bit instructions to come in pairs so that all 32-bit instructions are aligned on 32-bit boundaries means that the decode field only needs to contain one bit for every usable 32-bit instruction slot in the block. Therefore, there is now room to also include a six-bit prefix field in each header format as well, allowing the use of the extended instructions with sixteen additional opcode bits more often.

Mode 12: Simple Block Mode, and Mode 13: Basic Simple Block Mode

As noted above, Basic Simple Block Mode is a streamlined version of Simple Block Mode which drops support for instructions containing pSupp fields.

This mode permits the full use of the facility within instructions to contain short pointers within a 256-bit instruction block to allow long instructions, without including the other features of Mode 14 that require a longer block header.

This mode would be optimal for implementations where full OoO circuitry is present, and so the explicit indication of instruction dependencies provided for in the headers of Mode 14 blocks is not required.

The headers for a 256-bit block in this mode have these forms:

The first line shows the simplest form of header, with a length of only 16 bits.

The seven bits in the "short" field, if 0, indicate that the corresponding 32-bit instruction slot contains one 32-bit instruction, and if 1, indicate that the corresponding 32-bit instruction slot contains two 16-bit instructions.

The first instruction slot, as it contains a 16-bit header, must perforce contain a 16-bit instruction only in what remains, so there is no bit in the "short" field corresponding to that slot.

The eight bits in the "decode" field indicate which instruction slots contain instructions as opposed to containing additional data for instructions or being unused.

This field will normally have one of these values:

11111111
11111110
11111100
11111000
11110000
11100000
11000000
10000000
01111111
01111110
01111100
01111000
01110000
01100000
01000000

The first bit may be zero to indicate that a 16-bit instruction in the second half of the first instruction slot is not used; for some implementations, it is expected that this may avoid the expenditure of some cycles that placing a 16-bit NOP in that location as the only means of indicating this would entail.

Note that this is an exception to the usual rule, as will be described below, that encountering a zero in the decode field in the normal sequence of instructions causes fall-through to the next block.
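The interpretation of the two fields of the 16-bit header can be sketched as follows. This is only an illustration in Python: it assumes the fields have already been extracted from the header as strings of 0 and 1 characters, that the leftmost bit of the decode field corresponds to the first instruction slot, and that the leftmost bit of the short field corresponds to the second instruction slot. The function name and the descriptive strings are hypothetical.

```python
def describe_block(decode, short):
    """Describe the contents of the eight 32-bit slots of a block.

    decode: 8 characters, one per instruction slot.
    short:  7 characters, one per slot after the first.
    Bit ordering within the fields is an assumption of this sketch.
    """
    if len(decode) != 8 or len(short) != 7:
        raise ValueError("decode must have 8 bits and short must have 7")
    slots = []
    # The first slot always holds the 16-bit header; what remains is either
    # a 16-bit instruction or, if the first decode bit is 0, skipped.
    if decode[0] == "1":
        slots.append("header + 16-bit instruction")
    else:
        slots.append("header + skipped 16 bits")
    for i in range(1, 8):
        if decode[i] == "0":
            slots.append("not decoded (data or unused)")
        elif short[i - 1] == "1":
            slots.append("two 16-bit instructions")
        else:
            slots.append("one 32-bit instruction")
    return slots
```

For example, describe_block("01100000", "1000000") reports a skipped half-slot after the header, then a pair of 16-bit instructions, then one 32-bit instruction, with the remaining five slots not decoded.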


The second line shows a 32-bit header, which allows predication for blocks that do not contain any 16-bit instructions.

Here, the decode field will normally contain one of these values:

1111111
1111110
1111100
1111000
1110000
1100000
1000000

indicating that the block contains respectively from seven down to one instructions.

The P bit is a one if the instruction in its corresponding position is predicated; that is, if its execution is conditional.

The flag field indicates which of four predication flags controls whether or not the instruction is executed: the architecture provides for eight predication flags, all of which are available in Mode 14, but only the first four of which are usable in this mode.

The S bit controls the interpretation of the flag field. If it is a 0, then the instruction is only to be executed if the flag is set to 1; if it is a 1, then the instruction is only to be executed if the flag is reset to 0. For the 32-bit header, one S bit governs all predicated instructions; for the 64-bit header to be described next, each instruction has its own S bit.

The P and S bits are the Predication and Sense bits respectively.
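The effect of the P and S bits can be summarized in a few lines of Python. This is a minimal sketch, assuming the predication flags are available as a list of 0 and 1 values indexed by the flag field; the function and parameter names are hypothetical.

```python
def instruction_executes(p, s, flag_field, flags):
    """Decide whether an instruction executes under predication.

    p:          1 if the instruction is predicated, 0 otherwise.
    s:          sense bit; 0 means "execute if the flag is 1",
                1 means "execute if the flag is 0".
    flag_field: index of the predication flag selected by the header.
    flags:      current flag values (a list of 0/1), assumed for this sketch.
    """
    if not p:
        # An unpredicated instruction always executes.
        return True
    required = 0 if s else 1
    return flags[flag_field] == required
```

With flags = [1, 0, 0, 0], an instruction predicated on flag 0 executes when S is 0 and is suppressed when S is 1.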


The third line shows a 32-bit header for a block which is 512 bits in length instead of 256 bits.

However, the number of bits in the decode and short fields has not increased. Instructions can only be within the first 256-bit half of the longer block. This also applies to the supplementary portion of an instruction indicated by a pSupp field within the instruction.

What the second half of the block may contain is immediates, and the bits in the appendage field indicate if the immediates for the instruction in the corresponding 32-bit instruction slot are in the second half, if those bits are 1. If an instruction has two immediates, they must both be in the same half of the block.

Incidentally, it may be noted that although three-operand register-to-register instructions often allow both the source and operand registers to be replaced by immediates, this is done primarily to allow either one of them to be an immediate, as an operate instruction with only immediates as input is not particularly useful, as its result would be constant.

This block format makes it possible for short vector register-to-register instructions to replace register fields with pointers to 256-bit immediates.

Also note that it is not permitted to place an immediate value so that it straddles the boundary between the first 256 bits and the last 256 bits of a double-size block of this type.

If the bit marked Z is 1 instead of 0, however, instead of the block being 512 bits long, it will remain 256 bits long; in this case, the effect of a 1 bit in the appendage field will be to cause immediates to be located within the next block (presumably in unused space) rather than the current block. This option may sometimes be useful to allow for somewhat denser code.

Also present is a prefix field. The bits in this field, if 1, affect instructions beginning in corresponding instruction slots which have a nonzero pSupp field, and thus supplementary bits. If a bit is set, the supplementary bits are lengthened by 16 bits; these additional bits (at the end, not the beginning, of the supplementary bits, despite the name of the field) are additional opcode bits, which permit extending the instruction set.


The fourth line shows a 64-bit header, which allows predication for blocks which are free to contain 16-bit instructions.

Here, the decode field will normally contain one of these values:

0111111
0111110
0111100
0111000
0110000
0100000

indicating that from six down to one instruction slots in the block are in use, either for a single 32-bit instruction or two 16-bit instructions.

Here, each pair of entries, each entry consisting of a P bit and flag bits, corresponds to a single 32-bit instruction slot, and only the first entry of the pair is used where the slot contains a 32-bit instruction.

Note that this is the opposite of the way the U, D, and B bits are handled in the header format shown in the sixth line of the diagram for Mode 14, where the first half relates to 32-bit instructions or the first 16-bit instruction in each slot, and the second half relates to the second 16-bit instruction in each slot.


The fifth and sixth lines show a 64-bit header, for blocks which may contain 16-bit instructions, and which also allows predication, but only 32-bit instructions may be predicated. All sixteen flag bits may be used to control instructions with this header, and there is an S bit for each 32-bit instruction as well. The format in the fifth line applies to a regular 256-bit block, and that in the sixth line, like that in the third line, is for a 512-bit block where the second 256 bits may only contain immediate values. Note that in either case this format of header has several unused bits, which should be zero. Also, the Z bit, once again, if set, changes the 512-bit block to a 256-bit block that can place some of its immediates in space shared by the next block.

In the fifth line, instead of appendage bits, prefix bits are present in the header, with the function described above.

Mode 14: VLIW Mode, and Mode 15: Basic VLIW Mode

As noted above, Basic VLIW Mode is a streamlined version of VLIW Mode which drops support for instructions containing pSupp fields.

This architecture is a VLIW architecture, where the instruction stream is composed of 256-bit instruction blocks, each of which may contain up to seven instructions which are 32 bits in length.

The intent of the design is to permit instructions to be decoded in parallel and, wherever possible, executed in parallel, to increase the speed of execution.

An instruction may include more than 32 bits of information. In that case, the instruction will contain one or more short pointers to the location of that information within the 256-bit instruction block. This allows, for example, instructions to use immediate values, which, unlike constants stored normally as data, do not require an additional fetch from memory, at an address not related to the successive addresses from which code is fetched.

This format is intended to allow parallel decoding of instructions, by having all instructions the same length, while also allowing some instructions to be longer, thus offering the advantage of the "heads-and-tails" instruction format developed by Heidi Pan. In addition, by using pointers to the additional data, it avoids imposing sequential processing on the decoding of longer instructions.

Not all instructions adhere to the RISC philosophy; the architecture is intended to be very general, and, thus, includes instructions which may need to be implemented by microcode as they are useful for some application areas.

The first 32 or 64 bits of a 256-bit instruction block have the following formats:

The seven bits labelled "decode" indicate which of the subsequent 32-bit instruction fields within the block contain instructions to be decoded and possibly executed. For the formats shown in the first two lines of the diagram, these bits will normally have one of the following seven values:

1111111
1111110
1111100
1111000
1110000
1100000
1000000

Thus, from one to seven instructions may be present within a block; a 1 bit corresponds to an instruction field that contains an instruction to be decoded.
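The normal values of the decode field all have the same shape: a run of ones followed by a run of zeroes. A short Python sketch, with hypothetical function names and the field represented as a string of 0 and 1 characters, generates the list above and recognizes this shape:

```python
def normal_decode_values(width=7):
    """List the normal decode values, from 'width' instructions down to one."""
    return ["1" * n + "0" * (width - n) for n in range(width, 0, -1)]


def instruction_count(decode):
    """Return the number of instructions if 'decode' has the normal form
    (one or more ones followed only by zeroes); otherwise return None."""
    ones = decode.count("1")
    if ones >= 1 and decode == "1" * ones + "0" * (len(decode) - ones):
        return ones
    return None
```

A value such as 1100110 is not rejected by the architecture; it is merely not one of the normal values, and instruction_count returns None for it.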

Each 32-bit instruction has three bits associated with it that appear in the first 32 bits of the block.

The U bit, if 1, marks the instruction as one on the results of which a subsequent instruction depends.

The D bit, if 1, marks the instruction as one which is dependent on the results of a previous instruction.

The B bit, if 1, indicates that, even in the absence of a dependency, this instruction may not be executed in parallel with the immediately preceding instruction, but instead must begin on the next cycle.

These three bits are the Upon, Dependency, and Break bits.

The appropriate value for the B bit is model-dependent. This can be dealt with by having the object format indicate the type of machine for which the provided code was generated, and, where that differs from that of the machine to execute the code, having the loader determine from the machine code the correct values for the B bits.

In general, dependency is not model-dependent. However, the latency of instructions in cycles is model-dependent, and this affects whether a dependency is relevant. Also, the B bit only indicates the need to wait one cycle before issuing an instruction, so it is useful for indicating conflicts only in fully-pipelined execution units.

Where an execution unit is not pipelined, as might be the case for a microprogrammed functional unit handling the string and packed decimal memory-to-memory instructions, the instructions would have to be marked as dependent, rather than using the B bit to indicate a resource conflict, to produce a sufficient delay.

The offset field indicates the value of the offset on entry into an instruction block.

To find the U bit that is set which corresponds to a D bit that is set, the offset is the number of set U bits in instructions previous to the instruction with the D bit set that are to be skipped before the U bit set in the instruction on which that instruction depends is found.

The offset is only meaningful when there has been at least one U bit encountered for which the corresponding D bit has not been encountered. Setting the offset to 7 in the case when it is not meaningful for this reason allows the offset to be incremented whenever any U bit is encountered, and decremented whenever any D bit is encountered; the first U bit will properly take the offset to zero.
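One way to read the description of the offset is sketched below in Python. It is only an interpretation, under several assumptions that are not stated above: the offset is treated as a three-bit counter that wraps around modulo 8, the D bit of an instruction is handled before its own U bit, and the search for the matching U bit is confined to a single block. The function name is hypothetical.

```python
def resolve_dependencies(ud_bits, offset=7):
    """Match each set D bit with the set U bit it depends upon.

    ud_bits: list of (U, D) pairs, one per instruction, in program order.
    offset:  offset on entry; 7 is used when no U bit is outstanding.
    Returns (producers, final_offset), where producers maps the index of
    each dependent instruction to the index of the instruction it waits on.
    """
    producers = {}
    for i, (u, d) in enumerate(ud_bits):
        if d:
            # Scan backwards, skipping 'offset' set U bits; the next
            # set U bit found marks the instruction depended upon.
            skip = offset
            for j in range(i - 1, -1, -1):
                if ud_bits[j][0]:
                    if skip == 0:
                        producers[i] = j
                        break
                    skip -= 1
            offset = (offset - 1) % 8  # assumed three-bit wraparound
        if u:
            offset = (offset + 1) % 8  # the first U bit takes 7 to 0
    return producers, offset
```

For the sequence U, U, D, U, D, D this pairs the first D with the first U, the second D with the second U, and the third D with the third U, and the offset returns to 7 at the end, since no U bit remains outstanding.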

The first bit of the block may be 0 or 1.

A block which does not contain predicated instructions, and does not contain branch instructions, begins with a 0 bit.

A block which does not contain predicated instructions, but does contain one or more branch instructions begins with a 1 bit, with the format shown on the second line of the diagram.

For such a block, the block format is modified. The offset field is replaced by a branch field. This will contain a number from 1 to 7, indicating where the first branch instruction in the block is located. (If there are other branch instructions, there will be sufficient time to prepare for handling them after detecting them through normal instruction decoding.)

If there are six or fewer instructions in the block, the three bits that would have been the U, D, and B bits for the seventh instruction will contain the offset value. The offset value is optional, as it can be updated from what was present at the end of the preceding block, but it is helpful.


The third, fourth, fifth, seventh, and eighth lines of the diagram show formats for the first 64 bits of a block which contains predicated instructions.

A 64-bit header is indicated by the first bit of the header being 1, but the three bits in the first 32 bits of the header that would contain the location of the first branch instruction in the block are zero.

Here, the decode field will normally contain one of these values:

111111
111110
111100
111000
110000
100000

indicating that the block contains respectively from six down to one instructions.

As the prefix portion of the block now consumes a second instruction slot, the three bits that would have contained the U, D, and B bits for that slot are now used for the offset.

As shown in the third line of the diagram, when there is no branch instruction in the block, and the first bit of the second 32 bits of the header is 0, there is only one S bit, and so all sixteen flag bits are available for use in predication.


The fourth line of the diagram shows the case where there is no branch instruction in the block, the first two bits of the block are 10, and the first two bits of the second 32 bits of the header are 10. In this case, corresponding to each of the six instruction slots which may potentially be used, there is an S bit, a P bit, and a three-bit flag field.

The P bit is a one if the instruction in its corresponding position is predicated; that is, if its execution is conditional.

The flag field indicates which of eight predication flags controls whether or not the instruction is executed.

The S bit controls the interpretation of the flag field. If it is a 0, then the instruction is only to be executed if the flag is set to 1; if it is a 1, then the instruction is only to be executed if the flag is reset to 0.

The P and S bits are the Predication and Sense bits respectively.


The eighth line of the diagram shows a similar case. Here, only four, rather than eight, of the flag bits are available for use, as the field selecting a flag bit for use in predication for each instruction slot is now two bits long instead of three. This is because this header is for a 512-bit block, and so space is provided for an appendage field, as explained in the description below of the seventh line of the diagram. Also, there is a Z bit present, which, if 1, changes the block to a 256-bit block which simply places some of its immediate values in unused space in the next block.


In the fifth line of the diagram, we see a case where a branch instruction is present in the block, and predication is present, as indicated by the first two bits of the header being 11 and the first four bits of the second 32 bits of the block being 0000. In this case, there is only one S bit, and all predicated instructions in the block must have the same sense: if the S bit is 0, a predicated instruction must have its corresponding flag set (equal to 1) to execute; if the S bit is 1, a predicated instruction must have its corresponding flag reset (equal to 0) to execute.

As with the case where a branch is present, but predication is not, the branch field contains the number of the instruction slot containing the first branch instruction in the block, in this case from 2 to 7.


The sixth line of the diagram shows the header format for a block in which some of the 32-bit instruction slots contain a pair of 16-bit instructions instead of a single 32-bit instruction. Note that when 16-bit instructions are present in a block, predication is not available for that block.

The format of the decode field is the same as for the third and fourth lines of the diagram; it must contain from one to six ones, with the rest zero.

The sets of U, D, and B bits in the first word of the header refer either to an entire 32-bit instruction, or to the first 16-bit instruction of a pair; those in the second word of the header refer to the second 16-bit instruction of a pair.

The branch field may contain a zero to indicate the block contains no branches. If it contains a number from 2 to 7, this indicates which instruction slot contains the first branch in the block. If that instruction slot contains two 16-bit instructions, the H bit indicates which of those instructions is the first branch in the block: if 0, the first one, if 1, the second one.
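The combined meaning of the branch field, the H bit, and the short field can be sketched in Python as follows. The sketch assumes that the instruction slots after the 64-bit header are numbered 2 through 7, that the leftmost bit of the six-bit short field (described in the next paragraph) belongs to slot 2, and that the fields have already been extracted from the header. The function name and the returned labels are hypothetical.

```python
def first_branch_location(branch, h, short):
    """Locate the first branch instruction in a block.

    branch: value of the branch field (0 means no branch in the block).
    h:      the H bit.
    short:  six characters, '1' where a slot holds two 16-bit instructions.
    Returns None, or a (slot, part) pair.
    """
    if branch == 0:
        return None
    if not 2 <= branch <= 7:
        raise ValueError("branch field must be 0 or from 2 to 7")
    if short[branch - 2] == "1":
        # The slot holds a pair of 16-bit instructions; H selects one.
        return (branch, "second 16-bit instruction" if h else "first 16-bit instruction")
    # The slot holds a single 32-bit instruction; H is not needed.
    return (branch, "32-bit instruction")
```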

Each of the six bits in the short field corresponds to one of the remaining six 32-bit instruction slots in the block, after the header, and the bit is a 1 for those slots which are to be decoded as instructions, but contain a pair of 16-bit instructions instead of a single 32-bit instruction.


The seventh line of the diagram, like the ninth line of the diagram, shows the header format for a block that is 512 bits long instead of 256 bits long.

However, the number of decode bits has not been increased. In this type of block, both the primary 32-bit portion of an instruction, and the supplementary bits as indicated by the pSupp field within the instruction, must be in the first 256 bits of the block.

Only immediate values may be located in the second half of the block. The bits in the appendage field indicate, if 1, that the instruction in the corresponding 32-bit instruction slot has its immediate value, or all of its immediate values, in the second half of the block.

Also, in this header format, enough space was present to permit instructions to be predicated, but only with flag bits 0 to 3.

It is not permissible to locate an immediate value so that it crosses the boundary between the two 256-bit halves of a long block.

The Z bit in the header, if 1, cancels skipping over the next block in sequence, so instead of the block being a 512-bit block, it becomes a 256-bit block with some instructions that have placed their immediate values in (presumably unused) space in the following block. This may be useful in making code somewhat more compact under some circumstances.

The eighth line of the diagram shows a header format offering predication with four flag bits which is similar in layout to the one on the seventh line. Instead of appendage bits allowing immediates to be located in the next block, however, this format contains prefix bits. If a bit is set, for an instruction with a pSupp field, the pSupp field is lengthened by 16 bits, the additional 16 bits being additional opcode bits, thus extending the instruction set. Those bits are placed at the end of the supplementary bits, not the beginning, despite the name of the field.

Instruction Sets and Modes

The basic instruction set of this architecture, with pSupp fields and pImm fields designed to contain pointers, and with instructions coming primarily from a 32-bit opcode space, but with additional instructions from a 16-bit opcode space, was designed around the needs of the block modes 12, 13, 14 and 15.

This same instruction set, with adaptations to work without a block structure, instead using prefixes to instructions as the "wrapper" which allows instructions to be longer than 32 bits, is used in Mode 2 and Mode 3.

Also, Mode 0 uses the subset of that instruction set consisting of only the pure 32-bit instructions.

As noted above, Mode 14 and 15 are suited to a medium-size implementation which attempts to use information in the instruction stream to take the place of out-of-order execution support, while Mode 12 and 13 would need out-of-order execution support to obtain similar efficiencies. Mode 2 and 3 are suited to smaller implementations that decode and execute instructions one at a time, although a sufficiently large implementation may be able to handle mode 2 well since it attempts to keep decoding the length of instructions reasonably fast.

Given, therefore, that different implementations of the architecture are likely to have a different favored mode, it has been suggested that a format for executables and object modules be used that allows suitable code to be converted into whichever one of these modes is best suited to the particular implementation of the architecture in use.

This, however, brings to light one potential limitation of the architecture. In a "big.LITTLE" design, where both small cores and large cores are used, with the intention of allowing power to be conserved by normally using the small cores, but switching to the large cores occasionally for computationally intensive work to give higher performance, it will be important to choose cores that have a mode in common which they both handle efficiently. While code can be converted to different modes from an object format, this conversion would not be possible, or at least not efficient, if applied to code already loaded into memory.

In addition to this basic instruction set, there is also the subset of it consisting only of pure 32-bit instructions, which is used in Mode 0.

But there are also two other instruction sets, although they are derived from the primary instruction set and are very closely related to it.

There is the 16-bit instruction set, used in Mode 4, consisting of the instructions in 16-bit instruction space, plus some additional 32-bit instructions constructed with instruction prefixes to make the instruction set complete by allowing it to have memory-reference instructions.

And there is the hybrid instruction set, used in somewhat differing forms in modes 5, 6, 7, 8 and 10. Mode 5 uses the basic hybrid instruction set, produced by combining the 16-bit and 32-bit instructions to fit in a single opcode space. Modes 6 and 7 include more of the instructions from the 32-bit opcode space within the hybrid instruction set. Mode 7 provides access to the original instructions from 32-bit instruction space with prefixes. Modes 8 and 10, as they are block modes, include all the instructions from 32-bit opcode space in the hybrid instruction set except for the ones modified to restrict instructions from 32-bit opcode space to half of the hybrid opcode space, thus, they include all the instructions included in Mode 6 as well as additional ones.

This progressive subsetting makes the situation with regard to object code somewhat more complicated than with programs written in the original instruction set; a valid Mode 5 program can also be run in Mode 6 or 7, and can be wrapped differently to run in Mode 8 or 10, and a valid Mode 6 program can be run in Mode 7, and can be wrapped to run in Mode 8 or 10.

With both the hybrid instruction set, and the original instruction set, for it to be possible for a program to be wrapped to run in either a block mode or a non-block mode, the program must not contain explicit dependencies on how data is placed in memory. Thus, as noted elsewhere, one common technique, following a subroutine call by data, such as pointers to parameters or a parameter list, and fetching that data while returning correctly to executable code through incrementing the return pointer, is not applicable to the block modes - although in the section below we will see how a related technique can still be used in them.

Thus, while programs can, with certain limitations, be converted from running within one mode to another, that is only true for modes that share the same instruction set.

Orphan Code

On some computers, particularly those without general registers, or those where the registers are not wide enough to hold an address, a common programming technique is to place the parameter list for a subroutine call immediately after the subroutine call instruction. The subroutine can then use the return address to access the parameters, incrementing it after each one and after the end of list marker, so that it will end up pointing to the correct address for the jump back to the calling routine.

For obvious reasons, this programming technique is not applicable to programs in any of the block modes, 8, 10, 12, 13, 14, and 15. Programs normally consist of groups of contiguous instructions that are at regular intervals interrupted by the area at the end of each block used for the supplementary bits of instructions and for immediate values, and then the header of the next block.

Because each of the formats for a block header includes individual decode bits, one for each instruction slot in the block, however, a related technique is possible, even in code divided into blocks.

In a block with a seven-bit decode field, if that field contains, for example, 1100110, the result will be that the instructions in the first two available instruction slots will be decoded and executed, and after the instruction in the second instruction slot, execution will fall through to the next block.

The other two bits that are 1 indicate that the contents of two other instruction slots will be decoded, but this appears to be useless, as they won't be executed.

However, the fact that they are decoded means that they can be executed, specifically if either of them is the target of a branch.

And, so, it could happen that the second instruction in the block branches to the fifth available instruction slot, which would allow all four decoded instructions to be executed, with two instruction slots skipped over.

While this allows data to be placed following some instructions, it is still limited because of block boundaries. However, this is sufficient in one particular case: allowing the initial instructions of a subroutine to skip over an area containing the name of the subroutine as text. This technique is used in the calling conventions of some mainframe computers to permit identifying where a subroutine that failed with an error was called from.

So the rule for the decode field is: a 1 bit indicates an instruction slot the contents of which are decoded, and may be executed; if execution would reach an instruction slot that is not decoded, due to advance to the following instruction slot, fall-through is to the beginning of the next block instead, regardless of whether any later instruction slots may also be decoded. Of course, fall-through does not take place when a branch takes place instead.
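The rule can be illustrated with a small simulation in Python. This is a sketch under simplifying assumptions: the decode field is given as a string of 0 and 1 characters, execution starts at the first available instruction slot, branches are taken unconditionally, and only forward branches to decoded slots within the same block are modelled. The function name and the representation of branches are hypothetical.

```python
def executed_slots(decode, branches=None):
    """Return the 0-based positions of the slots executed in one block.

    decode:   decode field as a string, leftmost bit = first available slot.
    branches: optional dict mapping a slot position to the position that
              the instruction in that slot branches to.
    """
    branches = branches or {}
    for source, target in branches.items():
        if target <= source or target >= len(decode) or decode[target] != "1":
            raise ValueError("sketch only models forward branches to decoded slots")
    executed = []
    pos = 0
    # Advancing into a slot that is not decoded causes fall-through to the
    # next block, even if later slots are marked as decoded.
    while pos < len(decode) and decode[pos] == "1":
        executed.append(pos)
        pos = branches.get(pos, pos + 1)
    return executed
```

With the decode field 1100110 from the example above, executed_slots("1100110") gives [0, 1], while executed_slots("1100110", {1: 4}), in which the second instruction branches to the fifth slot, gives [0, 1, 4, 5].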

Of course, in the case of the block modes with hybrid instructions, having zeroes and ones intermixed is in any case inevitable.

Registers and Data Formats

The complement of registers included with this architecture is as follows:

There are 32 integer registers, each of which is 64 bits in length, numbered from 0 to 31.

Registers 1 through 7 may be used as index registers.

Registers 25 through 31 may be used as base registers, each of which points to an area of 65,536 bytes in length.

Register 16 may be used as a base register pointing to an area of 32,768 bytes in length.

Registers 18 through 23 may be used as base registers, each of which points to an area of 4,096 bytes in length.

At least part of the area of 4,096 bytes in length pointed to by register 18 will normally be used to contain up to 512 pointers, each 64 bits in length, for use in either Array Mode addressing or Address Table addressing.

Also, registers 8 through 15 may be used as base registers each pointing to an area 1,048,576 bytes in length for extended memory-reference instructions.
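The roles of the integer registers listed above can be collected into a small table, from which the number of bits needed to address any byte within each area follows by arithmetic. The Python sketch below only restates the figures given above; the names are hypothetical, and whether the displacement fields of the instructions are exactly this wide is not something the list above states.

```python
# Size in bytes of the area addressed through each base register,
# as listed above.
BASE_AREA_BYTES = {}
for reg in range(25, 32):
    BASE_AREA_BYTES[reg] = 65536
BASE_AREA_BYTES[16] = 32768
for reg in range(18, 24):
    BASE_AREA_BYTES[reg] = 4096

# Areas used by the extended memory-reference instructions.
EXTENDED_AREA_BYTES = {reg: 1048576 for reg in range(8, 16)}

INDEX_REGISTERS = list(range(1, 8))


def offset_bits(area_bytes):
    """Number of bits needed to address every byte in an area of this size."""
    return (area_bytes - 1).bit_length()


# 512 pointers of 64 bits (8 bytes) each exactly fill a 4,096-byte area.
POINTER_TABLE_BYTES = 512 * 8
```

Here offset_bits gives 16, 15, 12, and 20 bits for the areas of 65,536, 32,768, 4,096, and 1,048,576 bytes respectively.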

There are 32 floating-point registers, each of which is 128 bits in length, numbered from 0 to 31.

There are 32 type extension registers, each of which is 32 bits in length, and each of which is associated with a floating-point register.

Floating point numbers in IEEE 754 format have exponent fields of different length, depending on the size of the number. For faster computation, floating-point numbers are stored in floating-point registers in an internal form which corresponds to the format in which extended precision floating-point numbers are stored in memory: with a 15-bit exponent field, and without a hidden first bit in the significand.

As 128-bit extended floating-point numbers are already in this format in memory, all floating-point numbers will fit in a 128-bit register, although shorter floating-point numbers are expanded.
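As an illustration of what expanding a shorter number involves, the following sketch converts an IEEE 754 64-bit value into fields of the kind described: a sign, a 15-bit exponent, and a significand whose first bit is explicit. Several things here are assumptions made for the example and are not stated above: that one bit of the 128 is a sign bit, leaving 112 bits for the significand; that the exponent bias is 16383; and the function names. Subnormals, infinities and NaNs are left out to keep it short.

```python
import struct


def expand_double(x):
    """Split a 64-bit float into (sign, 15-bit exponent, 112-bit significand
    with an explicit leading bit). Assumed layout, see the text above."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp11 = (bits >> 52) & 0x7FF
    frac52 = bits & ((1 << 52) - 1)
    if exp11 == 0 and frac52 == 0:
        return sign, 0, 0                      # zero
    if exp11 == 0 or exp11 == 0x7FF:
        raise ValueError("subnormals, infinities and NaNs are outside this sketch")
    exp15 = exp11 - 1023 + 16383               # rebias (assumed bias of 16383)
    # Make the hidden bit explicit at bit 111; the 52 fraction bits follow it.
    significand = (1 << 111) | (frac52 << 59)
    return sign, exp15, significand


def pack128(sign, exp15, significand):
    """Assemble the three fields into one 128-bit integer."""
    return (sign << 127) | (exp15 << 112) | significand


sign, exponent, significand = expand_double(-2.5)
print(sign, exponent, hex(significand))
print(hex(pack128(sign, exponent, significand)))
```

Since no bits of the shorter number are discarded, the conversion is exact in this direction; rounding only becomes a question when a register is stored back to memory in a shorter format.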

However, the 32 floating-point registers may also be used for Decimal Floating-Point (DFP) numbers. These numbers will also be expanded into an internal form for faster computation, but that internal form may take more than 128 bits. The type extension registers allow the floating-point registers to behave as registers which are 160 bits in length for programs which use such data types.

There are 16 short vector registers, each of which is 256 bits in length.

Each of these registers may contain:

As well, they may contain sixteen 16-bit short floating-point numbers in one of two formats.

These numbers all remain in these registers in the same format as that in which they appear in memory.

Also, there is a primary register group of eight long vector registers, and a scratchpad of sixty-four long vector registers, where each long vector register is composed of sixty-four floating-point registers, each 128 bits in length.

Control Flow Guidance

Recently, many classic computer architectures have included, as an augmentation to improve security against malicious software, mechanisms to prevent the alteration of memory containing executable code, and to prevent transfer of control to memory containing data that a user program can alter.

While this stops many forms of malicious software from working, it has not led those who write such software to give up. By such techniques as tampering with subroutine return addresses to execute carefully chosen portions of legitimate code, attackers retain the ability to achieve results.

Following techniques that are being used on existing architectures, I have added features to combat this type of attack to the Concertina II architecture.

Three techniques are employed:

The first technique modifies the subroutine call instruction so that it uses a random value, generated once when the computer is powered on, together with the return address to generate a hash of which a 12-bit digest is included as the most significant 12 bits of the return address.

This requires shrinking the virtual address space from 64 bits to 52 bits, so that the most significant 12 bits of all addresses can be ignored for locating data in memory.

The hash is checked whenever a subroutine return is performed, and if it is not correct, the process is halted with an error.
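The first technique can be sketched as follows. The hash function to be used is not specified here, so keyed BLAKE2b from the Python standard library is substituted purely as a stand-in; the names BOOT_SECRET, digest12, tag_return_address and check_return_address are likewise invented for the illustration.

```python
import hashlib
import secrets

ADDRESS_BITS = 52                       # virtual address bits that remain
DIGEST_BITS = 12                        # bits of digest placed above them
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1

# Stand-in for the random value generated once at power-on.
BOOT_SECRET = secrets.token_bytes(16)


def digest12(address, secret=BOOT_SECRET):
    """12-bit digest of a return address under the secret (stand-in hash)."""
    h = hashlib.blake2b(address.to_bytes(8, "little"), key=secret, digest_size=2)
    return int.from_bytes(h.digest(), "little") & ((1 << DIGEST_BITS) - 1)


def tag_return_address(address, secret=BOOT_SECRET):
    """What the subroutine call does: put the digest in the top 12 bits."""
    address &= ADDRESS_MASK
    return (digest12(address, secret) << ADDRESS_BITS) | address


def check_return_address(tagged, secret=BOOT_SECRET):
    """What the subroutine return does: verify the digest, then strip it."""
    address = tagged & ADDRESS_MASK
    if tagged >> ADDRESS_BITS != digest12(address, secret):
        raise RuntimeError("return address digest mismatch: process halted")
    return address


tagged = tag_return_address(0x0000_1234_5678)
print(hex(check_return_address(tagged)))  # 0x12345678
```

With only 12 bits of digest, an altered return address chosen without knowledge of the secret still passes the check about once in 4,096 attempts, so this raises the cost of an attack rather than excluding it.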

The second technique keeps track of subroutine calls and returns on a hidden stack operating in parallel with the normal subroutine call mechanism; if the return addresses obtained in both ways do not match, the process is halted with an error.
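A minimal model of the second technique is shown below. It is a sketch only: the visible return addresses are kept in an ordinary list that the program is free to alter, the hidden stack in a second list that it is not, and the class and method names are hypothetical.

```python
class ShadowStackError(Exception):
    """Raised where the hardware would halt the process with an error."""


class CallTracker:
    def __init__(self):
        self.visible = []   # return addresses as the program sees them
        self._hidden = []   # parallel copy, not accessible to the program

    def call(self, return_address):
        """A subroutine call records the return address in both places."""
        self.visible.append(return_address)
        self._hidden.append(return_address)

    def ret(self):
        """A subroutine return compares the two copies before proceeding."""
        seen = self.visible.pop()
        expected = self._hidden.pop()
        if seen != expected:
            raise ShadowStackError(
                f"return to {seen:#x}, but the hidden stack has {expected:#x}"
            )
        return seen


tracker = CallTracker()
tracker.call(0x4000)
tracker.call(0x4100)
print(hex(tracker.ret()))        # 0x4100
tracker.visible[-1] = 0x6666     # tampering with the visible return address
try:
    tracker.ret()
except ShadowStackError as error:
    print("halted:", error)
```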

The third technique prevents branching by any means (excepting, of course, return from interrupt) to any but explicitly selected locations within executable code.

The first technique has the drawback that it requires the virtual address space to be reduced in size, changing 64-bit addressing to 52-bit addressing.

The third technique has the drawback that it requires changes to existing code.

The third technique facilitates the other two techniques, as the branch targets also contain bits indicating information that jump instructions and jump to subroutine instructions do not indicate within themselves.

One of the reasons for requiring the third technique whenever either of the other two is used is removed if code is well-behaved in the following sense: the normal Jump to Subroutine instruction is always used to call a subroutine, so that a return address is pushed to the shadow stack, while the Short Subroutine Jump is only ever used to place the current address in a register while also performing a jump, so that it should not cause a push to the shadow stack.

The return from a subroutine is an ordinary jump instruction. However, it won't be a jump instruction using program-counter-relative addressing. Usually, it will have a displacement of zero, since it is using the register containing the return address as a base register. And that means it won't be using a base register that was set up normally as a pointer to code.

The shadow stack, in addition to keeping track of the correct value of the return address from a subroutine, could keep track of the register in which that address was initially stored by the subroutine call. This would seem to allow one way to identify subroutine returns in well-behaved code. Would it run into trouble with nested subroutines, all using the same standard register for their return addresses? Not really, because usually that is dealt with by saving and restoring that register, not moving the return address to a different register.
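The idea of also recording the register can be sketched like this. It is an exploration of the suggestion rather than a defined mechanism: each shadow stack entry holds the return address together with the number of the register that received it, and a register-based jump with zero displacement through that same register is treated as the matching return. The class name, method names and the register number 14 are arbitrary choices for the example.

```python
class TrackingShadowStack:
    def __init__(self):
        self._entries = []   # (return address, register that received it)

    def call(self, return_address, link_register):
        self._entries.append((return_address, link_register))

    def jump(self, base_register, base_value, displacement):
        """Classify a register-based jump; validate it if it looks like the
        return matching the most recent call."""
        if self._entries and displacement == 0:
            expected_address, link_register = self._entries[-1]
            if base_register == link_register:
                if base_value != expected_address:
                    raise RuntimeError("return address mismatch: process halted")
                self._entries.pop()
                return "return"
        return "jump"


# Nested subroutines both use register 14; the outer one saves and restores
# it around the inner call, so each return still goes through register 14.
shadow = TrackingShadowStack()
shadow.call(0x1000, 14)              # outer call
shadow.call(0x2000, 14)              # inner call, after register 14 was saved
print(shadow.jump(14, 0x2000, 0))    # return (from the inner subroutine)
print(shadow.jump(3, 0x5000, 0))     # jump (a different register)
print(shadow.jump(14, 0x1000, 0))    # return (register 14 restored)
```

This only works for code that is well-behaved in the sense discussed above: an ordinary jump that happened to go through the same register with zero displacement would be taken for a return and checked as one.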

The first technique doesn't offer space for keeping track of which base register is used. However, the first technique could still be employed independently of the third, either by applying validation only to branches with a zero displacement, or by using some mechanism to determine when a subroutine is performing a jump outside its own range, with the idea that only subroutine calls and subroutine returns, not ordinary direct jumps, may be from one program to another.

