Putting it all Together, Again

I have been inspired to revisit the possibilities of designing, with conventional memory parts, a computer capable of handling alternate lengths of data as the result of learning that while most makes of 72-bit memory modules, which support ECC in conventional computer systems using 32-bit and 64-bit words, command premium prices, at least one company, Kingston, in its Value Line series of products, produces not only 72-bit RAM modules, but even registered 72-bit RAM modules, at the modest premium in price one might expect for modules which contain 12.5% more memory.

This means that the simple solution of using memory modules in sets of three instead of sets of two, to support a 48-bit word and a 12-bit basic unit, is entirely feasible even if one insists on having memory protected by an error-correcting code, instead of doing without as the typical personal computer does.

Initially, I considered using only a limited number of 72-bit parts, still using mostly 64-bit parts, to construct a memory array built around a 45-bit word. Four such words equal 180 bits, which can be divided into five parts of 36 bits each, or three parts of 60 bits each, allowing single-precision and double-precision numbers of the ideal lengths to be accessed without the need for any provision for handling unaligned data.

But this involves division by three and division by five during addressing operations. While three is 11, and three times five is fifteen, 1111, and so one could multiply by the appropriate number and quickly approximate division by 255 by multiplying by 257, as was done on the Burroughs Scientific Processor, that still adds some latency to memory operations.

More importantly, twice 180 bits, or 360 bits, is equal to 45 times 8 bits... so such an architecture could handle 8-bit characters, but it would still be at a serious disadvantage in communicating with the rest of the world, in which the 32-bit word is dominant.

Therefore, instead, I decided to go with the 48-bit word for multiplicative access to 36-bit and 60-bit data sizes, instead of the 45-bit word for aliquot access to 36-bit and 60-bit data sizes. (An aliquot part of a number is another number that divides it evenly without a remainder.)

Since 45 is just three less than 48, one could, as an alternate mode of memory access, still store 36-bit and 60-bit data in a fully-aligned fashion with an overhead of only one part in sixteen, or a wastage of 6.25%, where this would be advantageous.

Using the multiplicative principle, it is necessary to provide for efficient access to unaligned data. This means the data bus needs to be split into two pieces. Using sets of three memory modules, one would be one module wide, and the other two modules wide. This would limit the form of ECC used to a simple SEC-DED code. If, instead, one uses no less than two modules at once, each with eight ECC bit available, DEC-TED codes and chipkill codes, which provide higher reliability become available.

If one uses, instead, sets of six memory modules, then, with two address buses, one could divide the data bus into two equal pieces, each 192 bits wide, plus 24 ECC bits.

However, the reason for moving to 48 bits from 45, and thus needing to split the data bus and double its width, was to facilitate handling some additional lengths of data besides 36 bits and 60 bits. Including providing efficient operation with the dominant 8, 16, 32 and 64 bit paradigm.

As previously examined in these pages, if one wishes to store things two units wide in memory lines that are three units wide, one way to do that without dividing by three at all would be to store sixteen such things in eleven memory lines, thus using 32 units out of every 33.

To permit efficient handling of unaligned 16-bit, 32-bit, or 64-bit quantities within memory allocated to use with that size of memory, since this would mean ten memory lines of 384 bits that are fully used, and one memory line of which only the first 256 bits would be used, so that unaligned items that straddle the boundary between two blocks of 32 memory units of 128 bits can be handled without incurring a penalty, one address bus should control the first 128 bits of each memory line of 384 bits, and the other address bus should control the last 256 bits of each memory line of 384 bits. That way, the first and last 128 bits of the final memory line of a block used for 32 memory units of 128 bits would be controlled by different address buses, so the provision for unaligned access would extend to that part of such a block as well.

32 memory units of 128 bits are 512 bytes. As it happens, 2,048 bytes are 128 memory units of 128 bits, also one less than a multiple of three, in this case, 129, which is 43 times 3, instead of 33, which is 11 times 3.

In addition to being more efficient, through being more elaborate, some of the 128 extra bits, now only associated with each block of 2K bytes, could serve a purpose in assisting with the emulation of one famous architecture built around power-of-two multiples of the 8-bit byte, since a four-bit storage key value was associated with each 2K bytes of memory in the IBM System/360 computer.

Looking back at the 45-bit aliquot option as opposed to the 48-bit multiplicative option, as these use bits in the ratio of 15 to 16, I noted that 15 times 17 is 255, one less than 16 times 16. But there was no reasonable way to make use of that for greater storage efficiency without making the data bus too wide by an impractical amount.

However, this started me thinking about the Burroughs Scientific Processor once again. It was famous for using memory that was interleaved 17-fold so that data could be accessed efficiently at any non-unit stride that was relatively prime to seventeen.

If I used the aliquot memory option, but with rows of 51 memory modules - or even 33 memory modules, or 21 memory modules, as avoiding multiples of 11 or 7 would also be not an unreasonable thing to ask of programmers, I could offer the same efficiency as the BSP did. I would need a wider bus, as current memory modules are internally interleaved, and don't admit of being used on a bus with a higher speed than their own maximum frequency.

So one could have a memory system with rows of 21 memory modules, and seven address buses, one for every three memory modules. Since 16 memory modules are connected to one CPU in the SX-Alpha supercomputer from NEC, perhaps this would not be totally impractical.

So for fast matrix multiplication, data would be stored in aliquot fashion, but for other operations, the denser multiplicative method would be used.

Is there a power of two which would fit nicely into such a memory system for a mode in which data of the conventional sizes could also be handled? 21 times 3 is one less than a power of two, which is a small difference in the wrong direction; 21 times 49 is four more than a power of two, which would be acceptable.

Alternatively, instead of trying to connect 1,344 data lines to a single CPU, with all the awkwardness and expense that would entail, one could stick to just connecting 384 data lines to a single CPU, which is bad enough, and, when high-speed matrix operations are desired, simply put seventeen CPUs, each on their own board, in the same system with some sort of high-speed interconnect between the CPUs.

But either way, one obstacle is present: in the Burroughs Scientific Processor, the width of the data bus was that of the floating-point numbers on which the machine would operate. Here, what we are having seven or seventeen of are not individual floating-point numbers, but aliquot batches of three to five of them (or, in fact, in the seventeen-CPU case, six to ten of them).

Wouldn't that mean that accessing memory with seven or seventeen different address buses would still not succeed in ensuring, at least for all strides relatively prime to one problem number, that one could fetch from all seven or seventeen columns of memory before returning to one's starting point?

A little thought allowed me to come up with the first step in a soluton.

Taking the simple case of seven memory cells, each containing three 60-bit floating-point numbers:

If, instead of ordering them in groups of three like this:

 1  2  3      4  5  6      7  8  9     10 11 12     13 14 15     16 17 18     19 20 21

they were ordered like this:

 1  8 15      2  9 16      3 10 17      4 11 18      5 12 19      6 13 20      7 14 21

then the problem is solved. If I have a stride that is relatively prime to 7, say a stride of 100, then I am guaranteed that I will access seven memory cells in different columns before coming back around to the one at which I started.

And then it struck me that my solution to the problem was still significantly inadequate. I was definitely not using memory bandwidth with 100% efficiency, as I was accessing memory cells that contained three floating-point numbers, and there was only one of them that I wanted!

However, further thought led me to the realization that things were not as hopeless as they seemed, and I was indeed on the right track.

Why am I accessing memory with a stride of 100 anyways?

The usual answer is, because I have two 100 by 100 square matrices, on which I wish to perform matrix multiplication. When one multiplies matrix A by matrix B to produce matrix C, element i,j of matrix C is the sum, over the possible values of k (here, from 1 to 100) of all the products of the form A(i,k)*B(k,j).

Ignoring the fact that arrays in FORTRAN are stored in row-major order, as the transformations to handle that confusing case are trivial, this means that while I am fetching A(1,1), A(1,2), A(1,3)... twenty-one at a time, since they're nicely in order, and concurrently fetching B(1,1), B(2,1), B(3,1)... which are stored at the locations B(1), B(101), B(201)... I could actually also use the other data that I am also fetching which I don't need to calulate C(1,1); since I'm also fetching B(8), B(108), B(208)... and B(15), B(115), B(215)... at the same time, I can use them by also multiplying those with the values of A in order and then summing those series of products so as to calculate C(1,8) and C(1,15) while I'm busy calculationg C(1,1).

And then I go on to calculate C(1,2), C(1,9), and C(1,16) together. After calculating C(1,7), C(1,14) and C(1,21) together, then I skip ahead to C(1,22), C(1,29) and C(1,36), and so on and so forth.

If I were using 36-bit floats instead of 60-bit floats, packed five to a memory unit of 180 bits, I would just calculate five elements of C at a time instead of three.

So, by adapting my algorithm to the hardware, at least for matrix multiplication, this type of skewed storage of data would permit a BSP-style memory arrangement to provide 100% memory bandwidth for data of different widths.

Note, too, that nothing's stopping me from using all 192 bits instead of just 180 bits of a memory cell, so in addition to this working with 36 bit and 60 bit floats, it could also work with 48 bit and 64 bit floats.

Oops! (Accumulating multple totals)

It then dawned upon me that if I can accumulate three to five totals at the same time, in order to make effective use of the data I am constrained in any event to fetch from memory in addition to that which I originally sought... why not extend that principle, and accumulate sixteen totals at once?

In that case, I can fetch a whole memory line of sixteen floating-point numbers from the matrix that needs to be read with a stride, and use all sixteen numbers that I get. Thus, the case of a stride which is a multiple of the memory width now becomes optimal rather than pessimal. And there is no need for BSP-style gyrations, no need to make the memory line seventeen floating-point numbers wide.

If I divide the memory bus into four pieces, each four floats wide, and I have two 171 by 171 matrices to multiply, then I would have the pessimal case of having to start at any arbitrary point, and so I could only accumulate thirteen totals on one pass through the matrices.

This seems to be complicated, though. It isn't suited to Cray-style vector instructions in their original form. Does that mean they're obsolete, and the way things are done now, using COTS (cheap off-the-shelf) microprocessors for HPC (high performance computing) is the way to go?

The problem is, though, that a modern memory bus runs at about 1 GHz, or faster, while a modern microprocessor runs at 4 GHz, or slower. Being able to do three or four instructions between memory fetches is not going to get sixteen pairs of floating-point numbers multiplied.

Fortunately, though, it seems to me that if one simply added a four-bit field to a Cray-like vector instruction, to specify that up to sixteen instances of that instruction - displaced in memory forwards by the width of one floating-point number, and using the next vector register from the one specified - are actually described, one could, in the available cycles, start the processor doing the pipelined operations required.

And so now one would have a completely conventional-looking chip. Four address buses, each controlling, say, 128 data lines. That would be still more than used with ordinary microprocessors, but it would involve connecting half as many memory modules to the chip as was done in the NEC SX-Alpha, so it would be doable.

And nothing stops the chip from internally taking in 256 bits and just using 252 of them for seven 36-bit floating-point numbers, or just using 240 of them for five 48-bit floating-point numbers.

Uh-oh. (Real world memory.)

But then I realized that I was overlooking a very important point about how memory chips work in the real world.

Accesses to DRAM are, normally, painfully slow. Having DRAM with a cycle time that is four times slower than that of the processor is not enough to account for, and thus is not the primary cause of, the painful nature of such accesses.

There is also a latency issue.

Modern DRAM modules only provide data to the processor at the full speed of the bus during the access of up to eight consecutive lines of memory.

Thus, if it is desired to use the memory bus as efficiently as possible, DRAM modules need to be regarded as being wider than they seem.

In that case, the first thing I would change about the design noted above is to keep the four address buses, but now have each address bus only control one memory module. That would make the processor package more reasonable in size, making it look more like a COTS part than an HPC one.

Normally, the address buses would be ganged in pairs, and so one could still use 16 ECC bits to protect 128 data bits, thus allowing the use of advanced forms of ECC.

In the case where all four address buses are operating independently, in order to retain at least 75% efficient use of the bus width when working with inconvenient strides, since fetches of data would normally involve fetching eight successive 64-bit data words from each memory module, one could save the results of fetches on two consecutive cycles before doing ECC processing.

This switches one from accumulating from eight totals (in the case of 64-bit floats) to sixteen totals (in the case of 32-bit floats) on a pass through the matrices in matrix multiplication to accumulating thirty-two to sixty-four totals at once.

When doing a matrix multiplication, how many vector registers does one require?

Suppose I am accumulating sixteen totals at once. Then for the matrix with stride, I fill sixteen vector registers by reading in, on each memory access, one number that is placed in each one. After I'm done with that, I multiply each of those vector registers, on an element-by-element basis, with one vector register containing a row (or, in FORTRAN, a column) of the other matrix. And I then sum the elements of each vector result, and put them into sixteen running totals to go into the final matrix. Those don't need to go into vector registers, as they're generated slowly. So I can get by with, say, eighteen vector registers to accumulate sixteen totals.

This means that if I have only sixty-four vector registers, while I would not be able to properly multiply matrices of 32-bit floats, I would be able to go all the way down to matrices of 36-bit floats, so I would not be restricted to 64-bit floating point numbers. Sixty-four vector registers are quite a few, and there would now have to be a six-bit field instead of a four-bit field in vector memory reference instructions to indicate their parallel operation, but it should be - with difficulty - within the realm of possibility.

DDR3 kept the eight-way nature of DDR2, but, of course, a new generation of DRAM could internally organize data in larger batches. But it would only be necessary to fully keep up with that if the overhead of charging a new word line also doubled; otherwise, if the overhead were low, incurring it twice as often as necessary would not impair efficiency greatly.