jmd@granite.dec.com (John Danskin) (03/05/88)
Hi. I have some thoughts about vector architectures on mainframe machines that I thought I would expose to the friendly breezes of the network. In this article, I make some fairly terrifying assumptions about 'vector architectures by mainstream computer manufacturers'. Feel free to correct me if any are wrong.

Recently there has been a trend for mainstream computer manufacturers to put vector units on their mainframes/workstations to try to edge into the scientific market now held by Cray/CDC etc. Often, the vector instruction sets are more or less copied from the Cray instruction set, and feature a small number (8/16) of relatively long (64 element) vector registers.

Some problems vectorize naturally into long vector operations. For another, larger class of problems, a different approach is helpful: if your problem is parallel enough that it could be profitably attacked by an N processor MIMD unit, then, given enough mask registers, you can simulate the N MIMD processors with one vector unit with registers of length N. Of course there is a performance hit due to masked out cycles on the vector unit, but for many applications where parallelism is significant, this is unimportant when held up against the cost advantages (better code density, one sequencer, deep pipelining) that a SIMD vector unit has over a comparable MIMD unit.

Of course, the number of processing elements is usually much less than the vector length. This stems both from the need for deep pipelines to achieve peak vector performance, and the need to conserve $$ when building these things.

Now, the problem with this kind of programming is that the memory traffic generated by the vector unit is proportional to the performance increase. So a vector unit with 16 64-element registers, operating as fast as 8 machines each with 16 registers, generates as much memory traffic as those 8 machines do.
This makes normal caches pretty much useless, and in fact it can't be done without going to highly interleaved memory ---> the Cray solution. Unfortunately for the aforementioned mainstream computer manufacturers, highly interleaved memory is expensive, very fast busses are expensive, and it costs money to design all of this stuff when it isn't really your design center.

So, what can be done without building a Cray? Well, most of these vector units are 'low performance' units. I.e., they don't really need to be 64 elements long to get their performance. So, why not use more, shorter vectors, say 64 vectors each with 16 elements? We know that machines with 64 registers generate far fewer memory references than machines with 16 registers, so for an inexpensive architecture (<$17M) it seems that shorter registers would be a win.

An alternative design is to allow unaligned vector instructions on a large homogeneous register file. Then the programmer/compiler can decide how best to allocate registers and minimize memory references. This appears to be the tack adopted by Ardent.

Can anybody discuss the tradeoffs I've mentioned in a more informed manner? I would really appreciate some deeper insights into these problems.
--
John Danskin                        | decwrl!jmd
DEC Workstation Systems Engineering | (415)853-6724
100 Hamilton Avenue                 | My comments are my own.
Palo Alto, CA 94306                 | I do not speak for DEC.