gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) (05/24/91)
After reading a tech report on i860 performance (available from uvacs.cs.virginia.edu: pub/techreports/ipc...), I was left with one question: the i860 has an 8k data cache, and both normal loads and pipelined loads that avoid the data cache. It seems to be very nasty to write a compiler that can do the right thing with the particular memory system that Intel put in the iPSC i860 machine (page-mode DRAM).

So, what if the i860 instead had a small data cache, and pipelined loads into vector registers? It is relatively easy to design a compiler that can do a good job with vector registers. And you would be able to do pipelined loads of data to be re-used. And you wouldn't end up starved for registers.

Did Intel simulate this, or did they guess? Did they guess wrong?
chased@rbbb.Eng.Sun.COM (David Chase) (05/25/91)
gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) writes:

>After reading a tech report on i860 performance (available from
>uvacs.cs.virginia.edu: pub/techreports/ipc...), I was left with one
>question: ...
>So, what if the i860 instead had a small data cache, and pipelined
>loads into vector registers? It is relatively easy to design a
>compiler that can do a good job with vector registers. And you would
>be able to do pipelined loads of data to be re-used. And you wouldn't
>end up starved for registers.

Can you be more explicit about what you mean by vector registers? I was about to say that using vector registers was about the same as calling library routines, but then I remembered the matrix multiplication example -- calling an "inner product" subroutine does not get the same performance as simultaneously computing three inner products, which requires that you know something about the algorithm surrounding the inner product.

One constraint on elementary row operations on the i860 (implemented in an elementary way -- i.e., without blocking the algorithm within which the EROs are used) is that the FP unit is capable of consuming/producing 64 bits of off-chip (single precision) operands per cycle. (Actually, the grouping rules preclude attaining this rate; the best you can do in that style is about 80% of speed-of-light.) I'm using the model:

    offchip[i..n] := offchip[i..n] + factor * onchip[i..n]

(row-major, i.e., C/Pascal, array layout). In a "naive" implementation of Gaussian elimination (and QR, too, I believe) the intent is to eliminate column "i" from a collection of off-chip rows by repeatedly subtracting a multiple of the on-chip row from them. Each cycle, the i860 can complete one "off := off + F * on" operation, which means that (on average) 32 bits must come on per cycle and 32 bits must be stored per cycle.
This is the "speed of light" for data on the i860 -- unless you use a wider bus, or a faster clocking rate, or something, vector registers won't get the data on and off chip any faster.

Note -- max pipelined load width is 64 bits, max cached load width is 128 bits. An N-unrolled loop to do the elementary row op has N pipelined loads and stores, plus N/4 cached loads, plus one branch instruction. In double-op mode, you pair this with N mpy-and-adds and N/4 + 1 fnops. Thus, in N + N/4 + 1 cycles you perform N mpy-and-adds at best, attaining a speed of N/(N + N/4 + 1) times C -- for N == 16, this is .76C.

Note, too, that something as simple as a quad-word (128-bit) pipelined load (i.e., double the width of the off-chip bus) solves the speed-of-light problem without adding vector registers or operations. This still leaves the compiler working hard, of course.

David Chase
Sun
gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (05/29/91)
In article <13990@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:

>>So, what if the i860 instead had a small data cache, and pipelined
>>loads into vector registers? It is relatively easy to design a
>>compiler that can do a good job with vector registers. And you would
>>be able to do pipelined loads of data to be re-used. And you wouldn't
>>end up starved for registers.
>
>Can you be more explicit about what you mean by vector registers?

I was thinking of something like Cray vector registers, or just explicitly-managed cache, with pipelined loads into cache instead of the current non-pipelined loads into cache. The ALU would then be cache-to-cache instead of register-to-register.

>One constraint on elementary row operations on the i860 (implemented
>in an elementary way -- i.e., without blocking the algorithm within
>which the EROs are used) is that the FP unit is capable of
>consuming/producing 64 bits of off-chip (single precision) operands
>per cycle.

This is what kills the iPSC i860 box -- my code is guaranteed to run at the rate of the memory system, with no useful caching effects. I read a lot of data. And the compilers can't even get up to memory speed, because they can't figure out very well how to optimize the loads and stores, because the problem is so complex and they run out of registers.

>Note -- max pipelined load width is 64 bits, max cached load width is
>128 bits. An N-unrolled loop to do the elementary row op has N
>pipelined loads and stores, plus N/4 cached loads, plus one branch
>instruction. In double op mode, you pair this with N mpy-and-adds and
>N/4 + 1 fnops. Thus, in N + N/4 + 1 cycles you perform N mpy-and-adds
>at best, attaining a speed of N/(N + N/4 + 1) times C -- for N == 16,
>this is .76C.

This analysis is a bit theoretical in my case, as the iPSC memory system provides (max pipelining) 2-cycle loads if you're on the same page, with a big penalty when you go to a different page.
Even when you block loads until you run out of registers, you're paying a nasty penalty and complicating your compiler. For example:

do i = 1, 128
  z(i) = a * x(i) + y(i)
enddo

With vector registers you can do a simple pipelined load of all of x and y while calculating the results, and then write the results back. 3 page-mode-DRAM miss penalties total, and a solved compiler problem.

With the current scheme you unroll by 4, load 4 x's, load 4 y's taking a page-miss penalty, while calculating the results, and then write the results back taking another page miss. That's 3 memory system page misses per cycle. (Note that I'm not talking about virtual memory page faults; for details on the memory system don't trust my memory, read the techreport.)

When you go to more complex expressions I think vector registers win bigger and bigger, as long as you're looking at vector-type expressions. For scalar expressions, I think the two methods would provide about the same speed.

So, to repeat: did Intel simulate this kind of alternative design, and would it provide better performance with a simpler compiler?
preston@ariel.rice.edu (Preston Briggs) (05/29/91)
gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:

>This is what kills the iPSC i860 box -- my code is guaranteed to run
>at the rate of the memory system, with no useful caching effects.

This problem kills most everything except vector machines (a la Cray). That's why BLAS-2 and BLAS-3 routines were invented (i.e., to provide higher-level operations with enough data reuse that the memory system can keep up with the FP unit). Try saxpys on an HP or IBM when the data is not in cache. They aren't going to get close to peak performance either.

--------------------

>For example:
>
>do i = 1, 128
>  z(i) = a * x(i) + y(i)
>enddo
>
>With the current scheme you unroll by 4, load 4 x's, load 4 y's taking
>a page-miss penalty, while calculating the results, and then write the
>results back taking another page miss. That's 3 memory system page
>misses per cycle.
            ^^^^^
            iteration

Generally, you won't be able to use the load/store-quad instructions because it's difficult to be assured of alignment. It's better to handle the above case by copying x into an aligned chunk of memory (called VR1, say) using pipelined loads (4 in a row) and store-quads. Then copy y into VR2. Finally, calculate the sum into VR3 (using quad loads and stores). Then copy VR3 back to z using quad loads and singleton stores. VR3 might be identical with either VR1 or VR2. Generally, the VRs will end up in cache. Naturally, we could use subroutines for the copying between memory and cache/VRs and for the vector add.

------------

Of course, Intel thought of all this. They do it, using a traditional vectorizing front end. The problem is that there's still insufficient memory bandwidth to support vector operations at 80 MFlops. That kind of performance requires hefty $$s.

------------

Note also that the "page misses" mentioned above and in Moyer's TR are on a particular board, built around 8 MBytes of dynamic RAM.
Other boards have been built with static RAM ($$'s) that don't have the problem of crossing page boundaries.

-------------

I think the difficulties with the i860 lie more in generating code for the exposed pipelines and in handling its unhelpful multiply-add instruction. The bandwidth questions are going to afflict all the micros for a while.

Preston Briggs