elind@ircam.ircam.fr (Eric Lindemann) (03/13/89)
I'm interested in memory access statistics (cache hit rate, external accesses per second, etc.) for the i860 running "typical" scalar code. This is with an eye to evaluating its performance in a tightly coupled, shared-memory multiprocessor configuration. Does the high speed of the i860, combined with its smallish caches, imply that a single processor will consume so much of the external bus bandwidth that trying to put four or even two i860s on a shared memory bus would lead to unacceptable performance degradation? This problem is fairly easy to analyze on paper for specific vector routines but requires more elaborate simulation and profiling for the general computing case.

I'm interested here mainly in performance issues. Let's leave the issue of cache coherency, and the lack of i860 hardware support for it, out of the discussion for the time being.

Eric Lindemann
IRCAM (Institute for Research and Coordination of Acoustics and Music)
brooks@vette.llnl.gov (Eugene Brooks) (03/14/89)
In article <494@ircam.UUCP> elind@ircam.ircam.fr (Eric Lindemann) writes:
>Does the high speed of the i860, combined with its smallish caches, imply
>that a single processor will consume so much of the external bus bandwidth
>that trying to put four or even two i860s on a shared memory bus would
>lead to unacceptable performance degradation?

Yes, you would get into real trouble real quick. You would want a bus system hooked to N rather large coherent caches, with the N i860s hooked to those caches. For the current part, which does not have a coherence mechanism for the on-chip cache, you would have to mark the shared data as not cacheable as far as the on-chip cache is concerned. A good estimate of the external cache size desired is the total main memory divided by the number of processors you plan to configure. If the i860 had coherence support for the on-chip cache, you would be able to implement the off-chip cache with much slower memory than otherwise, leading to cheaper and larger off-chip caches.

Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks
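As a rough back-of-envelope sketch of the bus-saturation argument in the post above: the function below estimates what fraction of a shared bus N processors would demand. Every number in it (reference rate, miss rate, bus cycles per miss, bus clock) is an illustrative assumption, not an i860 measurement, and the model ignores contention and queueing effects.

```python
def bus_utilization(n_cpus, refs_per_sec, miss_rate, bus_cycles_per_miss, bus_hz):
    """Fraction of available bus cycles demanded by n_cpus processors.

    A result above 1.0 means the processors ask for more bandwidth
    than the bus can supply.
    """
    misses_per_sec = refs_per_sec * miss_rate          # per processor
    demanded_cycles = n_cpus * misses_per_sec * bus_cycles_per_miss
    return demanded_cycles / bus_hz


# Hypothetical figures, chosen only to illustrate the shape of the problem.
REFS_PER_SEC = 20e6         # assumed memory references per second per CPU
MISS_RATE = 0.10            # assumed on-chip cache miss rate
BUS_CYCLES_PER_MISS = 4     # assumed bus cycles to service one miss
BUS_HZ = 20e6               # assumed bus clock

if __name__ == "__main__":
    for n in (1, 2, 4):
        u = bus_utilization(n, REFS_PER_SEC, MISS_RATE, BUS_CYCLES_PER_MISS, BUS_HZ)
        print(f"{n} CPU(s): bus demand = {u:.0%}")
```

With these made-up inputs a single processor already asks for roughly 40% of the bus, so four of them would demand more than the bus can deliver. Real miss rates would have to come from measurement or simulation, which is exactly what the original question asks for.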
jeff@Alliant.COM (Jeff Collins) (03/14/89)
In article <494@ircam.UUCP> elind@ircam.ircam.fr (Eric Lindemann) writes:
>I'm interested here mainly in performance issues. Let's leave the issue
>of cache coherency, and the lack of i860 hardware support for it, out
>of the discussion for the time being.

The performance issues associated with bus timing will be interesting to know, but the i860 performance in a multiprocessor will also be affected by the cache coherency schemes that are employed. If the part does not have external bus watchers (I haven't heard this stated yet, only implied), then it seems to me that the performance will be severely hampered.

I may be missing something, but I have looked at a number of microprocessors with an aim to putting them in a multiprocessor. The conclusion that I, and my hardware friends, came to was that if a microprocessor has an internal cache and no external invalidate logic, then the only way to use the part in a symmetric multiprocessor is to disable the internal data cache. Internal I-caches have the same problems when you start to consider debuggers, but there are workarounds and performance isn't critical in these cases.

What really confuses me is all of the activity aimed at putting these parts in a multiprocessor. I admit that the part is amazingly fast, but is it really an appropriate part for a multiprocessor - or even for a general purpose processor? (I had to make some controversial statement :^)
robertb@june.cs.washington.edu (Robert Bedichek) (03/17/89)
In article <3032@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
> The performance issues associated with bus timing will be interesting
>to know, but the i860 performance in a multiprocessor will also be affected by
>the cache coherency schemes that are employed. If the part does not have
>external bus watchers (I haven't heard this stated yet, only implied), then
>it seems to me that the performance will be severely hampered.

It depends on the multiprocessor's workload. If you build a Sequent-style machine and are doing, say, parallel makes (often a useful thing to do), then there is little performance impact in maintaining cache/TLB coherency. The OS can be difficult to write, verify, and debug, of course, but this is the case with any general purpose multiprocessor OS.

If the workload has a lot of shared data, you might be right. But you might also be surprised at the performance of a software-intensive solution on a machine like the MC88K, and perhaps the i860, where you can interrupt the CPU in just a few cycles.

I've thought about this problem a little for 88Ks, where one CPU can tell any other to do something specific, like flush a cache line or TLB. It takes a single store on the sending CPU, some interrupt control logic, and a hand-coded interrupt handler on the receiving CPU that can do the flush without saving more than one or two registers. The whole operation might take only 20 or 30 clock cycles. (On the 88K you don't have to resort to this because a CMMU can be flushed by its own CPU or by anything on the MBUS (memory bus), which all other CPUs have access to in most designs. Why I was thinking of this other scheme for the 88K is a longer story.)

Btw, the 88K *does* have bus snooping, just as you would like, and it might be faster *not* to use it. Bus snooping slows the system down because for every snoop, the CMMU must do a cache lookup.
This will cause the CPU to stall sometimes when it goes to the CMMU (which is quite frequently). Build a dual-ported tag ram? Very expensive. There's no such thing as free cache coherency!

> I may be missing something, but I have looked at a number of
>microprocessors with an aim to putting them in a multiprocessor. The
>conclusion that I, and my hardware friends, came to was that if a
>microprocessor has an internal cache and no external invalidate logic, then
>the only way to use the part in a symmetric multiprocessor is to disable the
>internal data cache. Internal I-caches have the same problems when you start
>to consider debuggers, but there are work arounds and performance isn't
>critical in these cases.

I think you are making assumptions about how quickly it could be done with a "RISC approach"; see above.

> What really confuses me is all of the activity aimed at putting these
>parts in a multiprocessor. I admit that the part is amazingly fast, but is it
>really an appropriate part for a multiprocessor - or even for a general
>purpose processor? (I had to make some controversial statement :^)

Oh, it makes a lot of sense! It's designed to be a graphics processor, where the problems often contain a high degree of parallelism that is relatively easily exploited. Many graphics processors are multiprocessors. Also, people like Sequent have shown that shared memory multiprocessors can work well in general computing environments. Why not use the fastest, cheapest (overall) chip around as the element of a multiprocessor? (I'm not claiming that the i860 *is* these things, just that it is a possibility and therefore the discussion is reasonable.)

Btw, on all this Dhrystone stuff on the i860 and on Intel in general: listen to what they say about what speed part they will have, when, and for what price, but don't even bother looking at their performance analysis of their CPUs.
Read what MIPS has to say about their raw CPU performance, or measure it yourself, or flip a dime. Generally you have to guess/simulate/extrapolate performance for the system as a whole, so you have to do it yourself anyway.

Rob Bedichek (robertb@cs.washington.edu)

Disclaimer: I used to work for Intel, I think the MC 88K is great, the i860 might be really fine in many common applications, I've only read what MIPS has to say on the net.
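To put the "20 or 30 clock cycles" figure for a software-driven remote flush in the post above into perspective, here is a minimal sketch that converts a flush request rate into the fraction of the receiving CPU's time spent in the flush handler. The 30-cycle cost is the upper figure quoted in the post; the clock rate and the request rates are hypothetical, and the model ignores any pipeline or cache side effects of taking the interrupt.

```python
def flush_overhead(flushes_per_sec, cycles_per_flush, clock_hz):
    """Fraction of a CPU's cycles spent servicing remote flush interrupts."""
    return flushes_per_sec * cycles_per_flush / clock_hz


CYCLES_PER_FLUSH = 30   # upper estimate quoted in the post
CLOCK_HZ = 25e6         # assumed CPU clock, for illustration only

if __name__ == "__main__":
    # Hypothetical request rates: light, moderate, and heavy data sharing.
    for rate in (1e3, 1e4, 1e5):
        frac = flush_overhead(rate, CYCLES_PER_FLUSH, CLOCK_HZ)
        print(f"{rate:>9,.0f} flushes/s -> {frac:.2%} of CPU time")
```

Under these assumptions the overhead stays small until the sharing rate climbs toward a hundred thousand flushes per second, which is consistent with the point that the answer depends on the workload.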
brooks@vette.llnl.gov (Eugene Brooks) (03/17/89)
In article <7618@june.cs.washington.edu> robertb@uw-june.UUCP (Robert Bedichek) writes:
>Build a dual-ported tag ram? Very expensive.
>There's no such thing as free cache coherency!

If you are serious about multiprocessor performance you want a dual-ported tag ram. On a bus-based system you crank up the number of processors until the bus can't take it anymore, and therefore end up snooping the devil out of the cache. If you don't have dual-ported tags there are no access cycles left for the processor. Any bus-based shared memory multiprocessor worth its salt has dual-ported tags.

brooks@maddog.llnl.gov, brooks@maddog.uucp, .../uunet!maddog.llnl.gov!brooks
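The single-ported versus dual-ported argument above can be sketched numerically. The toy model below assumes that every bus transaction from another processor costs one tag lookup in each snooping cache, and that a single-ported tag store must take those lookups out of the cycles its own processor would otherwise use. The transaction rate and tag clock are hypothetical values picked only to show the trend.

```python
def tag_cycles_left(n_cpus, bus_txn_per_cpu_per_sec, tag_hz):
    """Fraction of single-ported tag cycles left for the local processor.

    Each of the other (n_cpus - 1) processors is assumed to generate
    bus_txn_per_cpu_per_sec transactions, each costing one snoop lookup.
    """
    snoops_per_sec = (n_cpus - 1) * bus_txn_per_cpu_per_sec
    return max(0.0, 1.0 - snoops_per_sec / tag_hz)


BUS_TXN_PER_CPU = 2e6   # assumed bus transactions per second per CPU
TAG_HZ = 20e6           # assumed tag lookups per second the tag store supports

if __name__ == "__main__":
    for n in (2, 4, 8, 11):
        left = tag_cycles_left(n, BUS_TXN_PER_CPU, TAG_HZ)
        print(f"{n:>2} CPUs: {left:.0%} of tag cycles left for the processor")
```

With dual-ported tags the snoop lookups would use the second port, so in this simplified picture the processor would keep all of its tag cycles regardless of the processor count.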
elind@ircam.ircam.fr (Eric Lindemann) (03/17/89)
In <21876@lll-winken.LLNL.GOV> brooks@vette.llnl.gov (Eugene Brooks) writes:
>A good estimate of the external cache size desired is the total main memory
>divided by the number of processors you plan to configure.

So if I wanted to have 4 processors with 16 Mbyte of main memory, this would imply a 4 Mbyte cache? That's a pretty big cache, don't you think?
brooks@vette.llnl.gov (Eugene Brooks) (03/18/89)
In article <496@ircam.UUCP> elind@ircam.ircam.fr (Eric Lindemann) writes:
>In <21876@lll-winken.LLNL.GOV> brooks@vette.llnl.gov (Eugene Brooks) writes:
>>A good estimate of the external cache size desired is the total main memory
>>divided by the number of processors you plan to configure.
>
>So if I wanted to have 4 processors with 16 Mbyte of main memory this would
>imply a 4 Mbyte cache? That's a pretty big cache don't you think?

I agree that it is pretty big as far as one's current notion of a cache goes, but it is pretty much the right size if your goal for the cache is to hold portions of the problem data set which are "local" for extended periods of time, then dynamically "shared" during communication between processors. The idea here is that the "second level cache" is a set of local memories for the shared data set to be distributed into.

The purpose of the second level cache is to keep the relatively high miss rate on the "small but fast" first level cache from saturating the memory bus. The second level cache need not be implemented with fast static ram; it can be made of relatively slow, and therefore cheap, dynamic memory. Essentially the same memory technology as is used in the main memory can be used. A per-processor second level cache size of one or several megabytes is entirely reasonable, and I can show specific problems as examples for which it would be used effectively.

brooks@maddog.llnl.gov, brooks@maddog.uucp, .../uunet!maddog.llnl.gov!brooks
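The sizing rule and the role of the second level cache discussed above can be written out as a short sketch. The first function is just the rule of thumb from the thread (main memory divided by processor count); the second shows how a second level cache filters first level misses before they reach the bus. Both miss rates and the reference rate are hypothetical placeholders, not measured figures.

```python
MBYTE = 2**20


def l2_size_bytes(main_memory_bytes, n_cpus):
    """Rule of thumb from the thread: total main memory / number of CPUs."""
    return main_memory_bytes // n_cpus


def bus_misses_per_sec(refs_per_sec, l1_miss_rate, l2_miss_rate=1.0):
    """Misses per second that reach the shared bus from one processor.

    l2_miss_rate is the local miss rate of the second level cache
    (fraction of first level misses that also miss in it); the default
    of 1.0 models a system with no second level cache at all.
    """
    return refs_per_sec * l1_miss_rate * l2_miss_rate


if __name__ == "__main__":
    # The example from the thread: 4 processors, 16 Mbyte of main memory.
    size = l2_size_bytes(16 * MBYTE, 4)
    print(f"second level cache per CPU: {size // MBYTE} Mbyte")

    # Assumed rates, for illustration only.
    refs, m1, m2 = 20e6, 0.10, 0.05
    print(f"bus misses/s without L2: {bus_misses_per_sec(refs, m1):,.0f}")
    print(f"bus misses/s with L2:    {bus_misses_per_sec(refs, m1, m2):,.0f}")
```

The point of the sketch is only the multiplication: whatever the first level miss rate is, the bus sees it scaled down by the second level cache's local miss rate, which is why a large but slow second level cache can still relieve the bus.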