lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (03/21/89)
In all this talk about the Intel 860, there's been very little mention
of compilers.

It seems clear that pipeline mode is beyond the compiler technology that
Intel is bringing to bear. I expect them to ship a library of handcoded
routines, which do dot-product and the like. At some point, the
compilers will acquire directives ( "do a dot product" ) and/or pattern
matching ( "gosh, that looks like code for a dot product" ). The
compiler will then inline the library handcode, with appropriate
parameterization.

Part of the problem is the (poor) idea of having to catch the hot,
smoking data as it flies out of the pipe. Self-draining pipelines would
have been easier on both the compilers and the interrupt handling. The
pipeline mode's startup and shutdown transients, which trash things,
are also undesirable.

The data-chaining features sound intractable. Perhaps another
dual-instruction mode would have been better - one instruction to the
FADD unit, the other to FMUL.

The existing dual-instruction mode is OK, because compilers can largely
ignore it. A postprocessor (or an optimizing assembler) can do all the
code scheduling. It can do a good job, too, if the compiler emits
enough hints.

Perhaps Intel should have dropped the floating point unit, and put in a
second integer ALU instead. It would have goosed the Dhrystones, no?
--
Don   D.C.Lindsay   Carnegie Mellon School of Computer Science
--
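As a rough illustration - the loop shape and the routine name lib_sdot
are mine, not Intel's - this is the sort of source such a directive or
pattern matcher would have to map onto the hand-coded, pipelined
library:

/* A minimal sketch, assuming single-precision data.  A vectorizing
 * pass would have to recognize this reduction and substitute the
 * hand-scheduled library routine; lib_sdot() is a made-up name.
 */
float sdot(const float *x, const float *y, int n)
{
    float s = 0.0f;
    int i;

    for (i = 0; i < n; i++)
        s += x[i] * y[i];       /* the multiply-accumulate pattern */
    return s;
}

/* After recognition the compiler would, in effect, emit
 *     s = lib_sdot(x, y, n);
 * with pipeline startup and shutdown handled inside the library code.
 */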
mccalpin@loligo.uucp (John McCalpin) (03/21/89)
In article <4524@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>It seems clear that pipeline mode is beyond the compiler technology
>that Intel is bringing to bear. I expect them to ship a library of
>handcoded routines, which do dot-product and the like. At some point,
>the compilers will acquire directives ( "do a dot product" ) and/or
>pattern matching ( "gosh, that looks like code for a dot product" ).
>The compiler will then inline the library handcode, with appropriate
>parameterization.

In other words, you expect them to repeat the last 12 years of
supercomputer compiler/vectorizer development :-) ....

Of course, source-to-source vectorizers exist already - I expect that
the folks at Pacific Sierra or Kuck and Associates would be happy to
sell their products to Intel (for appropriate remuneration). But
vectorization is only part of the problem - see below....

>Part of the problem is the (poor) idea of having to catch the hot,
>smoking data as it flies out of the pipe. Self-draining pipelines would
>have been easier on both the compilers and the interrupt handling. The
>pipeline mode's startup and shutdown transients, which trash things,
>are also undesirable.

The real problem with the i860 (and the other micro-based vector
processors) is memory bandwidth. The Cray-1 is an excellent example of
what happens when the pipe cannot be kept full: you need 2 input
elements per clock and generate one output element, but the memory
channel can only transfer one word per cycle. The Cray-1 has 512 64-bit
words of vector register storage (eight registers of 64 elements each),
which can be considered as a (very) small cache. Only very special codes
have enough locality to get good performance on this machine. On the
X/MP, the memory bandwidth was upped to 3 words/clock, and all of a
sudden *lots* of code was running at a good fraction of full speed.

CDC/ETA is now hitting the same trouble on the ETA-10, only the "cache"
in this case is the 32 MB local CPU memory. This (SRAM) "cache" is
connected to the 512+ MB (DRAM) "shared memory" by a 1 word/cycle
interface, so code that overflows the cache takes a noticeable
performance hit. There are real live applications that suffer from this
bottleneck. (The same comments apply to the Cray SSD.)

The (ridiculously small) data cache on the i860 makes absolutely no
sense on a vector processor. Of course I don't expect an i860 system to
have more than 32 MB of cache (!), but the experience in the
vector-computing world is that memory bandwidth limitations are always
nearby, and they cannot always be beaten with compiler technology
alone. Just think about the cost of a memory subsystem that can feed a
100 MFLOPS vector unit (100 million * 4 bytes/word * (2 in + 1 out) =
1.2 GB/second) out of main memory!

>The data-chaining features sound intractable. Perhaps another
>dual-instruction mode would have been better - one instruction to the
>FADD unit, the other to FMUL.

Why is chaining intractable? Cray and CDC have been doing it for years.

>Perhaps Intel should have dropped the floating point unit, and put in a
>second integer ALU instead. It would have goosed the Dhrystones, no?

If the chip had been designed as a Dhrystone engine, the caches would be
large enough to be useful, too.

>Don   D.C.Lindsay   Carnegie Mellon School of Computer Science

----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu     mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------
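For what it's worth, the bandwidth arithmetic at the end can be made
concrete with a small sketch. The triad kernel in the comment is just
an illustrative example; the 100 MFLOPS rate, 4-byte words, and
2-in/1-out traffic are the figures quoted above:

#include <stdio.h>

int main(void)
{
    /* e.g. a[i] = b[i] * c[i]: two operands loaded and one result
     * stored for every floating-point operation.
     */
    double flops_per_sec  = 100e6;   /* target vector-unit rate      */
    double bytes_per_word = 4.0;     /* single precision             */
    double words_per_flop = 3.0;     /* 2 in + 1 out                 */

    double bytes_per_sec = flops_per_sec * bytes_per_word * words_per_flop;
    printf("required main-memory bandwidth: %.1f GB/second\n",
           bytes_per_sec / 1e9);     /* prints 1.2 GB/second */
    return 0;
}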
bcase@cup.portal.com (Brian bcase Case) (03/22/89)
>Part of the problem is the (poor) idea of having to catch the hot,
>smoking data as it flies out of the pipe. Self-draining pipelines would
>have been easier on both the compilers and the interrupt handling. The
>pipeline mode's startup and shutdown transients, which trash things,
>are also undesirable.

Yes, I thought the idea of having to catch "the hot, smoking data"
(I *love* that characterization!) was a dumb one too, until I realized
(i.e. was told) why it is done that way: this way, the FP register file
is much more useful. With "normal" interlocking, a full FP pipe would
lock nearly *all* the FP registers. With the way the i860 does it, the
registers that will be destinations can be used while the results that
will eventually go there are being computed. This makes sense, assuming
that one can have only 32 FP registers (16 dp registers). That might be
where the mistake was made....

>The data-chaining features sound intractable. Perhaps another
>dual-instruction mode would have been better - one instruction to the
>FADD unit, the other to FMUL.

Yep, it's a bitch. Again, the constraint was probably somewhat related
to implementation technology. Again, I think they should have given up
4K bytes of the data cache in order to permit a cleaner *architecture*
instead of sacrificing architecture to *implementation*. But then, what
do I know.

>Perhaps Intel should have dropped the floating point unit, and put in a
>second integer ALU instead. It would have goosed the Dhrystones, no?

I love it! *Now* you're talking! I mean, what's more important, FP or
having the longest Dhrystone histogram bar in your marketing
"literature"! :-) :-) :-)
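To make the non-interlocked pipe concrete, here is a conceptual model
in C rather than real i860 code; the three-stage depth, the struct, and
the names are made up for illustration. Each issue pushes a new
operation in and hands back the result of an operation issued DEPTH
issues earlier, which the program must catch into a register of its own
choosing - so the destination registers stay free for other uses while
results are still in flight:

#define DEPTH 3                      /* illustrative pipe depth only */

struct fadd_pipe {
    double stage[DEPTH];             /* results in flight            */
};

/* Issue a+b into the pipe and return the "hot, smoking" result that
 * is falling out of the far end (the operation issued DEPTH issues ago).
 */
double pfadd(struct fadd_pipe *p, double a, double b)
{
    double out = p->stage[DEPTH - 1];
    int i;

    for (i = DEPTH - 1; i > 0; i--)  /* advance the pipe one stage   */
        p->stage[i] = p->stage[i - 1];
    p->stage[0] = a + b;             /* the new operation enters     */
    return out;                      /* caller decides where it goes */
}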