Instruction Issue

It is envisaged that the chip will include, in hardware, perhaps eight or sixteen instruction issuer units: the number would be related to how deep the pipeline is. However, even having two instruction issuer units for a deep pipeline would, when two instruction streams are executed concurrently, would effectively cut the length of the pipeline in half as far as the penalty for a collision or misprediction is concerned. As well, having more instruction issuer units than pipeline stages avoids the overhead of context switching if some instruction streams are briefly delayed, for example, because of having to access an operand from main memory instead of the cache (this becomes particularly important when one remembers that a context switch would normally require several accesses to main memory).

Each instruction issuer unit will contain its own Program Status Block register, its own program counter, its own aux program counter.

Note that while a separate Program Status Block and a separate set of program counters is required for every thread, there is only one pipeline; thus, while there may be multiple logical instruction issuers, it is entirely possible that there will actually be only one set of circuitry to decode instructions, which will merely alternate between using the status information for the different processes in progress. And this means that it would be entirely practical to have as many as 256 logical instruction issuers, by providing storage for a Program Status Block, a Program Counter, and an Aux Program Counter for each one.

The instruction issuers that are active at any one time function in a round-robin fashion.

If all the instruction issuers are active, then by the time a given instruction from one process is started, the preceding instruction from that process has left the pipeline in almost all cases, and so logical dependency pipeline delays are seldom or never encountered. But if only one instruction issuer is active, the pipeline can still be fed with new instructions at full speed by that instruction issuer.

The instruction issuers translate references to registers in instructions to references to actual locations in the register bank; thus, they decode register renaming for each process. They also decode the instructions, in whatever addressing mode and instruction format is currently in effect, to a long form that is independent of the bits in the Program Status Block which specify the instruction mode.

Note that while latency can be very important in the execution phase of instruction handling, since one instruction may depend on the results of a previous one, it is not important in instruction decoding, only throughput determines how long a program takes to execute. The one exception to that would be the case of conditional branching; while speculative execution would interfere with the attempt to maximize processor utilzation more simply by allowing concurrent execution of multiple threads, speculative decoding of instructions, up to, but not including, the placement of micro-ops in the queue, presents no serious problems.

The Register Map ID field in the Program Status Block is not intended to specify an instruction issuer; it specifies a register renaming map. But two processes running concurrently at this level cannot be using the same registers, because they would interfere with each other. Processes running concurrently at a higher level, where switching between one process to another is done infrequently, using interrupts and context switching, can of course share the same registers, because computers had interrupts before they had register renaming. Therefore, for an implementation with 256 logical instruction issuers, the register map ID could also identify the instruction issuer. Note that in any case, the register maps for active processes must all be completely disjoint.

However, since there would be substantially less than 256 stages in the pipeline, pipeline collisions would not be created if one process had its instructions issued multiple times within a round-robin of that length. But since we are dealing with the same process in that case, the same Program Status Block and program counters would be used each time that process was selected for issue.

Since instructions belonging to different processes are fetched from separate portions of memory, despite the fact that they are unlikely to have dependencies, it would appear that they would normally have to be issued on separate cycles. Two or more instructions belonging to the same process, if, in addition to not having dependencies, also do not use the same ALU, can be issued together, and thus, where parallelism is explicitly specified, it can be indicated not only that successive instructions are logically independent, and thus can be issued in successive cycles, but also that they use different computational resources in addition, and thus can be issued on the same cycle.

Because different processes avoid the problem of dependency, it is attractive to look for cases in which ALU conflicts also do not exist across processes, so that the full benefits of any superscalar capabilities an implementation may have. This does, however, entail considerable additional complexity; for example, a long instruction word in the pipeline, specifying simultaneous operations for different arithmetic units, may belong to more than one process. This implies that instructions in pipeline format would need to be segmented, with each segment labelled as to its source process, so that each segment would be handled properly in the event of an interrupt of one of the source processes.

Exploiting the superscalar capabilities of the architecture is particularly attractive in the case when no long vector computations are taking place, since it is envisaged that the arithmetic-logic units used for these computations can then be employed to provide 65-way superscalar operation.

But, as noted above, instructions belonging to different processes don't come from the same memory locations. Where would instructions from many different processes come from during a single cycle?

The answer to this is: look at the width of a cache line. A cache line may be filled by means of sixteen memory accesses along a 256 bit wide bus in a typical implementation. That contains 4,096 bits, or 64 64-bit double-precision floating-point numbers, or 256 16-bit halfwords.

Fetch one line of the cache containing instructions, and decode, in parallel, all the instructions it contains. Perhaps the first two instructions might have no logical dependencies and use different ALU resources; in that case, they would be placed in position to enter the execution pipeline immediately. The third instruction might still have no logical dependencies, but use an ALU resource used by one of the first two, so it would be placed to enter the execution pipeline on the next cycle. Then the fourth instruction might have a logical dependency, and so it would be placed to enter the execution pipeline several cycles later. And so on and so forth, for perhaps a hundred instructions.

Then, on the next cycle, fetch a cache line containing instructions for another process. And place those instructions in available slots in the queue for entering the pipeline in the same manner.

The diagram below may perhaps help to make clearer the sort of process envisaged:

On the top, cache lines containing multiple instructions are decoded; the resulting operations are then placed within internal instruction words of the very long instruction word (VLIW) type as appropriate given their logical dependencies and use of resources; these words, when fully assembled, go into the execution pipeline. This technique, known as a decoupled microarchitecture, is currently used on most major microprocessors, although it is here illustrated in association with a level of superscalar operation not yet practical for current die sizes.

This technique first entered the world of microprocessors when it was used by NexGen for their Nx586 microprocessor, the design of which eventually formed the basis for the AMD K6 microprocessor. At about the same time, the AMD K5 and the Pentium Pro microprocessors also used this technique.

Contemporary accounts of this noted the technique as a way for a microprocessor to have a simple internal design like that of a chip which provided reduced instruction-set computing (RISC), even though it implemented a traditional complex instruction-set computing (CISC) instruction set. However, a decoupled microarchitecture can involve more than simple on-the-fly translation of instructions to a conventional RISC format.

The decoupled microarchitecture technique is recognized as having originated in the work of Yale Patt and his students Hwu Wen-Mei, Stephen Melvin, and Michael Shebanow at Berkeley on HPS, the High Performance Substrate, and an experimental implementation of the instruction set of the VAX computer from DEC by generating HPS instructions dynamically.

Incidentally, one report on the Nx586, AMD K5, and Pentium Pro noted that the companies involved were all using the decoupled microarchitecture as a simpler alternative to out-of-order instruction execution; the HPS project already combined the use of both measures to better exploit available parallelism in the instruction stream.

The Control Data Cyberplus computer, also known as their Advanced Flexible Processor, had an instruction word which was about 100 bits long. (A preliminary manual, available on the web, gives an exact format of 99 bits in length, but other accounts give a length of just over 100 bits, so there may have been changes since that manual in the actual design sold.) In each cycle, the instruction specified the routing of outputs from a variety of functional units to the inputs of these same functional units. Two memory banks were functional units, so was a pipelined multiplier, a divide unit, an adder, and so on. Thus, each instruction could direct the computer to perform one of each type of operation of which the computer was capable.

This computer executed programs from a small memory containing 1,024 of these long instruction words.

Using the 1,024-word memory to contain the microprogram for a conventional computer would be inefficient for such an architecture, as a conventional instruction will perform a multiplication, or an XOR, or an addition, or a division, but not all of them at the same time. What this machine was used for initially, in actual practice, was high-speed image processing, involving short, carefully-coded routines that executed entirely in the unit's limited program memory.

A microprocessor having an underlying architecture similar to that of the Cyberplus would benefit by having a decoupled microarchitecture, since instructions from different instruction streams could be combined in a single long instruction word. This would, however, require multiple register files to be present as separate functional units, available for parallel access, or some similar measure. (In the high-performance implementation of the architecture described on the previous page, some multiport access to the register file is combined with independent L1 caches for the functional units; also, multi-function ALUs are envisaged, not fully split into multiple single-function units as in the Cyberplus.) Absent that, a machine of this nature could be utilized as one with a dataflow architecture, but not efficiently for executing programs written for a conventional architecture.

There are some important complexities that are glossed over in this diagram. The principle shown above will fare reasonably well for specifying when operands for an instruction to be fetched from cache or registers, and the operation commenced. A separate level of control would be required to control fetches from memory, since memory, being external to the chip, is fundamentally unpredictable.

But two complications occur when it comes to storing the results of calculations.

Operations entered into the floating-point multiplication pipeline, for example, on successive cycles may not complete on successive cycles; one might be a double-precision operation, and the next a single-precision operation. The floating-point ALU, however, may not be able to present more than one result for storage in a given cycle. This can be dealt with by accompanying the pipeline stages with buffers, so that a result which cannot be stored immediately advances in the buffers alongside the pipeline. To limit the maximum number of buffers to that of the pipeline stages, the oldest result must be given the highest priority for storage.

This will complicate the scheduling of logically-dependent operations, since the additional delay this might create needs to be taken into account. The information to do this is available to the instruction scheduler, but it adds considerably to the complexity of scheduling micro-operations.

The second complication is that the length of time an instruction takes to complete sometimes depends not on things implicit in the operation to be performed, such as the length of operands, and hence known in advance, but also on the value of operands. Thus, an arithmetic operation involving a NaN (Not-a-Number) value, or a multiplication by zero, can execute more rapidly. When a large number of concurrent processes are active, there is no significant loss of efficiency from scheduling all micro-operations based on worst-case execution times, but this approach lacks the flexibility to take full advantage of early results when only a single process is active, as will be a common situation. Given the intent to achieve full utilization of the processor through scheduling multiple processes in advance, this does mean that preference will be given to fast fixed-latency algorithms over fast variable-latency algorithms. It is also entirely possible that the algorithms with the shortest latency, but requiring the most hardware, will only be used in the main ALU, with the banks of multiple ALUs still being able to begin one operation per cycle, but taking more cycles to return a completed result, as latency, as opposed to throughput, is less of an issue with vector operations: as long as the vector is wider than the hardware used to process it, the frequency of going from one operation to the next one is reduced.

Yet another complication comes about when a divide instruction begun on one cycle, and a multiply instruction begun on a later cycle, both finish, and both require storing their results in the same bank of registers. If storage of results is controlled autonomously, this could require a buffer whose size would not be easy to limit. This seems to indicate that micro-ops for storing results need to be scheduled in advance along with the fetch and operate micro-ops. This gives an opportunity to deal with the single-process case; usually, a micro-op for storing a result would be placed at a time corresponding to its worst-case availablility, but in the single-process case, it could be placed at the best-case time, with the entire sequence of incoming VLIW commands being paused until the result is ready.

Even without the added sources of complexity cited above, this process might be complex enough to require a tiny computer to handle it, and, since that computer would have to carry out many steps in a single cycle of the chip as a whole, it might well be the one place on the chip where the gates would be ECL instead of CMOS. A tiny computer could, after all, require fewer gates than even one double-precision floating-point unit. (Also, incidentally, this design permits out-of-order execution.)

The need to have such a large number of instruction decoders operating in parallel, however, rather than just having one instruction decoder for the chip's full instruction set (and RISC-style instruction decoders which are replicated 64 times for cache-internal parallel computing) is perhaps the biggest obstacle to the realization of an architecture such as the one outlined here, since it is a strong argument against having a design for which instruction decoding is complicated.

Note that not all the circuitry for instruction decoding has to be duplicated for every 16 bits of the cache line; while 4,096 bits can contain up to 256 instructions that are 16 bits long, they can contain at most 128 instructions that are 32 bits long, and at most 85 instructions that are 48 bits long (although the carry-over of incomplete instructions between blocks means that one will need to be able to decode 86 such instructions in parallel to absolutely ensure no delays related to instruction decoding), and at most 64 instructions that are 64 bits long, and so on. Some additional flexibility in routing instructions to decoders will be required to allow this, of course.

Since a single cache fetch might turn up as many as 256 instructions, and the instruction words being constructed on the pipeline have access to sixty-five floating-point ALUs, sixty-five fixed-point ALUs, which total to only 130 items (in addition, a few more specialized items are available: one short vector ALU, a packed decimal ALU, a string operation unit, the circuitry for the extended operate instructions, for initiating an extended translate microprogram, and for initiating cache-internal parallel operation), couldn't the number of instruction decoders be reduced by using a narrower cache for instructions than for data? As noted in this section, for maximum performance, the arithmetic-logic units are likely to be designed so that they consist of one add/subtract unit, one multiply unit, one divide unit, and, in the case of the fixed-point ALU, one unit for logical operations, that can operate in parallel; thus, potentially as many as 455 operations could be initiated in the standard ALUs per cycle.

Ideally, instruction decoding should take less than a full pipeline cycle, since if there are fewer than 64 instructions decoded per instruction cache fetch, it would not normally be possible to fill even one of the two halves (floating-point or integer) of the VLIW which specifies one cycle's worth of operations within the pipeline.

Another consideration is that the model of superscalar operation presented here appears to imply the need for a 64 by 256 crossbar switch connecting the 64 vector ALUs to the scalar register banks of every active process. And even this would not be enough, if logically independent operations on values in different registers of the same register bank of the same process are to be carried out simultaneously.

There is a solution, which has two elements.

First, the number of logical instruction issuers can be limited from 256 to a more realistic 64 almost without penalty; in practice for most applications, even having eight processes running at once will be difficult to achieve most of the time. Of course, in a multi-user mainframe environment, any number of users might be present, but this is a design for a processor chip; if it is possible to provide each user with a keyboard and a screen, it is possible to provide each user with a processor. There are still cases, where a database on one set of hard disks is accessed through a server by multiple users, that many users sharing the same CPU is unavoidable, though, and so having sixty-four logical instruction issuers is not a total waste. This makes it possible to use the circuitry shown on this page as being intended to allow vectors to be quickly shifted and providing interprocessor communications, and which can be thought of as consisting of two layers of 8 by 8 crossbar switches, to route scalar register banks to arithmetic-logic units.

Second, even after a floating-point divide is broken down to numerous pipelined stages, the time required for each pipeline stage will be significantly greater than a single gate delay. If multiple fetches from the register banks can be performed, and multiple routings of data through the interconnect network can be made, in a single major cycle, parallel operations on data in the same register bank can be made in the next cycle by fetching the data serially, but in rapid succession, in the preceding cycle, and it becomes possible to have four layers of 8 by 8 crossbars, alternately in rows and columns, which allows the flexibility of a 64 by 64 crossbar to be more closely approximated. Having four minor cycles to a major cycle would be sufficient to achieve this.

Note also that when a cache-internal parallel computation is in progress, the pipelines of the vector ALUs would not be continuously filled by that. Hence, those ALUs could still be used, but occasionally a constructed VLIW would end up taking two major cycles before all the operations in it could be started. To help reduce this, it would be useful if the operations specified in a VLIW could be assigned on the fly to different ALUs; given the complexity of using less than a full crossbar circuit to route register data to the ALUs, there could be limitations and restrictions on this (i.e., reassignment only within the same group of eight ALUs) that should, however, have minimal impact on performance.

The foregoing discussion is oversimplified in two ways. The first is that it assumes all the operands of instructions are in registers.

What if they are in memory? If they are in cache, the problem is limited; it means that the VLIW entries must include cache addresses, not merely register addresses. But there is also the consideration that only one access to a cache line may be made per cycle. Thus, the purpose of an L1 cache now becomes clear; it allows a single fetch from the cache to lead to the data in the entire cache line being available to be fetched in a serial fashion. Thus, the L1 cache would contain numerous units, available for access by the ALUs as banks of registers of widths from 128 bits down to 8 bits, but which are loaded in their entirety from an L2 cache line. (An L1 cache for instructions could permit a cache line to be accessed serially, 16 bits at a time; since multiple instruction decoders operating in parallel are required in any event, this might simplify their design.)

When an operand is known to be in memory, and not in cache, the instruction decoder unit would have to be able to determine this, and place on the pipeline an operation to request the operand from memory, and then, later, an operation to perform the calculation. Thus, the specific cache line to be used would have to be reserved at the time of instruction decoding.

Since memory is external to the processor, and therefore the arrival time of an operand found in memory cannot be strictly determined in advance, the operation to perform the calculation would have to be positioned in the execution sequence after the required data was actually received successfully.

An indexed instruction, logically dependent on an instruction that has not yet been executed, and which may refer to memory within or outside of the cache, would have to be treated as a kind of conditional branch situation. Thus, such an instruction would result in an operation being placed in the pipeline to calculate the address, and then, once the address is calculated, an instruction to fetch the operand from memory would then be placed in the pipeline if required, and finally the operation to perform the calculation would be placed in the pipeline. Each of these steps, it may be noted, would involve a separate machine instruction in a classic load-store RISC architecture. This is true in the Power PC, for example, although only placing the sum of the value in an address register and the value of an index register requires a separate instruction; load and store instructions still include the specification of an address register whose contents are added to the contents of the address field of the instruction. It appears likely that implementations of that architecture are, therefore, designed on the assumption that address register contents are changed infrequently, so that information on what areas of memory are currently cached can be made rapidly accessible in a base-relative form.

It may also be noted that since memory and even L2 cache accesses may be considered to be operations separate from arithmetic operations in the machine's internal architecture, the internal architecture can be said to be of the Decoupled Fetch-Execute type. This is doubtless also a universal characteristic of today's microprocessors with decoupled architectures, but only a limited number of machines had this type of architecture visible in their standard instruction set; one example was the Culler-7 supercomputer.

And the fact that conditional branches have not been considered is the second oversimplification. This is another argument for an L1 cache for instruction decoding; the decoder then could, without needing to fetch the same line from the cache again, start placing instructions for at least the branch not taken case on the cache line once it was determined, by the execution unit, that this was the correct case.

Since the cycle at which the ambiguity is resolved would be known in advance, instructions for the two cases could be placed on the pre-execution portion of the pipeline speculatively, with the unused branch to be erased at the appropriate time. When the pipeline is not filled by multiple processes, or when one process has a sufficiently higher priority than others, speculative execution is another alternative; in that case, results would have to be placed in a different portion of the register bank.

Of course, this once again raises the question of how sophisticated a process can be that has to take place during each cycle of the processor.

Also note that, for example, every one of the sixty-five floating-point micro-ops in a VLIW would have to contain a field giving the floating-point format applicable to it, as indicated by eight bits of the Program Status Block, and the integer micro-ops would similarly have to indicate their format information as well. (This sort of thing, however, may be covered by an existing patent, U.S. patent 5,574,927; while the wide variety of data formats are part of this machine's native architecture, it is still ultimately present to be in aid of emulation, and, in addition, the machine's native architecture is emulated by a RISC core in the case of a decoupled microarchitecture.)