[comp.arch] i860 Multiprocessing

elind@ircam.ircam.fr (Eric Lindemann) (03/13/89)

I'm interested in memory access statistics (cache hit rate, external accesses
per second, etc.) for the i860 running "typical" scalar code. This is with 
an eye to evaluating its performance in a tightly coupled, shared memory, 
multiprocessor configuration. 

Does the high speed of the i860, combined with its smallish caches, imply
that a single processor will consume so much of the external bus bandwidth 
that trying to put four or even two i860s on a shared memory bus would
lead to unacceptable performance degradation?

This problem is fairly easy to analyze on paper for specific vector routines
but requires more elaborate simulation and profiling for the general
computing case. 
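
As a rough illustration of the paper analysis, here is a minimal sketch. It
assumes each cache miss occupies the bus for a fixed time and ignores
arbitration and queueing delay. The function names and every number in it are
hypothetical placeholders, not measured i860 figures.

```python
def bus_utilization(n_cpus, refs_per_sec, miss_rate, bus_secs_per_miss):
    """Fraction of bus time demanded by n_cpus identical processors.

    A value above 1.0 means the bus is oversubscribed under this model.
    """
    return n_cpus * refs_per_sec * miss_rate * bus_secs_per_miss


def max_cpus_before_saturation(refs_per_sec, miss_rate, bus_secs_per_miss):
    """Largest processor count whose total demand does not exceed the bus."""
    per_cpu = bus_utilization(1, refs_per_sec, miss_rate, bus_secs_per_miss)
    return int(1.0 / per_cpu)


if __name__ == "__main__":
    # Hypothetical workload: 30 M references/s, 5% miss rate,
    # 200 ns of bus time per miss.
    for n in (1, 2, 4):
        demand = bus_utilization(n, 30e6, 0.05, 200e-9)
        print(n, "CPUs -> bus demand", round(demand, 2))
```

The general computing case is harder precisely because the miss rate is not a
single known constant, which is why measured statistics are needed.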

I'm interested here mainly in performance issues. Let's leave the issue
of cache coherency, and the lack of i860 hardware support for it, out 
of the discussion for the time being.

Eric Lindemann
IRCAM (Institute for Research and Coordination of Acoustics and Music)

brooks@vette.llnl.gov (Eugene Brooks) (03/14/89)

In article <494@ircam.UUCP> elind@ircam.ircam.fr (Eric Lindemann) writes:
>Does the high speed of the i860, combined with its smallish caches, imply
>that a single processor will consume so much of the external bus bandwidth 
>that trying to put four or even two i860s on a shared memory bus would
>lead to unacceptable performance degradation?
Yes, you would get into real trouble real quick.  You would want to have
a bus system hooked to N rather large coherent caches which the N i860s were
hooked to.  For the current part, which does not have a coherence mechanism
for the on chip cache, you would have to mark the shared data as not cacheable
as far as the on chip cache is concerned.  A good estimate of the external
cache size desired is the total main memory divided by the number of processors
you plan to configure.  If the i860 had coherence support for the on chip
cache, you would be able to implement the off chip cache with much slower
memory than otherwise, leading to cheaper and larger off chip caches.


Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks

jeff@Alliant.COM (Jeff Collins) (03/14/89)

In article <494@ircam.UUCP> elind@ircam.ircam.fr (Eric Lindemann) writes:
>I'm interested here mainly in performance issues. Let's leave the issue
>of cache coherency, and the lack of i860 hardware support for it, out 
>of the discussion for the time being.

	The performance issues associated with bus timing will be interesting
to know, but the i860 performance in a multiprocessor will also be affected by
the cache coherency schemes that are employed.  If the part does not have
external bus watchers (I haven't heard this stated yet, only implied), then
it seems to me that the performance will be severely hampered.

	I may be missing something, but I have looked at a number of
microprocessors with an aim to putting them in a multiprocessor.  The
conclusion that I, and my hardware friends, came to was that if a
microprocessor has an internal cache and no external invalidate logic, then
the only way to use the part in a symmetric multiprocessor is to disable the
internal data cache.  Internal I-caches have the same problems when you start
to consider debuggers, but there are workarounds and performance isn't
critical in these cases.

	What really confuses me is all of the activity aimed at putting these
parts in a multiprocessor.  I admit that the part is amazingly fast, but is it
really an appropriate part for a multiprocessor - or even for a general
purpose processor? (I had to make some controversial statement :^)

robertb@june.cs.washington.edu (Robert Bedichek) (03/17/89)

In article <3032@alliant.Alliant.COM> jeff@alliant.Alliant.COM (Jeff Collins) writes:
>	The performance issues associated with bus timing will be interesting
>to know, but the i860 performance in a multiprocessor will also be affected by
>the cache coherency schemes that are employed.  If the part does not have
>external bus watchers (I haven't heard this stated yet, only implied), then
>it seems to me that the performance will be severely hampered.

It depends on the multiprocessor's workload.  If you build a
Sequent-style machine and are doing, say, parallel makes (often a useful
thing to do), then there is little performance impact in maintaining
cache/TLB coherency.  The OS can be difficult to write, verify, and
debug, of course, but this is the case with any general purpose
multiprocessor OS.

If the workload has a lot of shared data, you might be right.  But you
might also be surprised at the performance of a software-intensive
solution on a machine like the MC88K, and perhaps the i860, where you
can interrupt the CPU in just a few cycles.  I've thought about this
problem a little for 88K's, where one CPU can tell any other to do
something specific, like flush a cache line or TLB.  It takes a single
store on the sending CPU, some interrupt control logic, and a
hand-coded interrupt handler on the receiving CPU that can do the flush
without saving more than one or two registers.  The whole operation
might take only 20 or 30 clock cycles.  (On the 88K you don't have to
resort to this because a CMMU can be flushed by its own CPU or by
anything on the MBUS (memory bus), which all other CPU's have access to
in most designs.  Why I was thinking of this other scheme for the 88K
is a longer story.)
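
To put the 20 or 30 cycle figure in perspective, here is a minimal sketch of
the overhead on the receiving CPU. The flush rate and the 40 MHz clock are
hypothetical values chosen only for illustration, and the function name is
made up.

```python
def flush_overhead_fraction(flushes_per_sec, cycles_per_flush, clock_hz):
    """Fraction of the receiving CPU's cycles spent servicing flush interrupts."""
    return flushes_per_sec * cycles_per_flush / clock_hz


if __name__ == "__main__":
    # Hypothetical: 10,000 remote flush requests per second, 30 cycles each,
    # on a CPU assumed to run at 40 MHz.
    fraction = flush_overhead_fraction(10_000, 30, 40e6)
    print("share of CPU time spent in the flush handler:", fraction)
```

Under these assumptions the handler costs well under one percent of the CPU.
The cost grows linearly with the flush rate, so a workload with heavy sharing
changes the picture.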

Btw, the 88K *does* have bus snooping, just as you would like, and it
might be faster *not* to use it.  Bus snooping slows the system down
because for every snoop, the CMMU must do a cache lookup.  This will
sometimes stall the CPU when it goes to the CMMU (which it does
quite frequently).  Build a dual-ported tag ram?  Very expensive.
There's no such thing as free cache coherency!

>	I may be missing something, but I have looked at a number of
>microprocessors with an aim to putting them in a multiprocessor.  The
>conclusion that I, and my hardware friends, came to was that if a
>microprocessor has an internal cache and no external invalidate logic, then
>the only way to use the part in a symmetric multiprocessor is to disable the
>internal data cache.  Internal I-caches have the same problems when you start
>to consider debuggers, but there are workarounds and performance isn't
>critical in these cases.

I think you are making assumptions about how quickly it could be done
with a "RISC approach," see above.

>	What really confuses me is all of the activity aimed at putting these
>parts in a multiprocessor.  I admit that the part is amazingly fast, but is it
>really an appropriate part for a multiprocessor - or even for a general
>purpose processor? (I had to make some controversial statement :^)

Oh, it makes a lot of sense!  It's designed to be a graphics processor,
where the problems often contain a high degree of parallelism that is
relatively easily exploited.  Many graphics processors are
multiprocessors.

Also, people like Sequent have shown that shared memory multiprocessors
can work well in general computing environments.  Why not use the
fastest, cheapest (overall) chip around as the element of a
multiprocessor?  (I'm not claiming that the i860 *is* these things,
just that it is a possibility and therefore the discussion is
reasonable.)

Btw, on all this Dhrystone stuff on the i860 and on Intel in general:
Listen to what they say about which speed parts they will have, when, and
at what price, but don't even bother looking at their performance analysis
of their CPU's.  Read what MIPS has to say about their raw CPU
performance, or measure it yourself, or flip a dime.  Generally you
have to guess/simulate/extrapolate performance for the system as a
whole, so you have to do it yourself anyway.

	Rob Bedichek  (robertb@cs.washington.edu)

Disclaimer: I used to work for Intel, I think the MC 88K is great, the
            i860 might be really fine in many common applications, I've
            only read what MIPS has to say on the net.

brooks@vette.llnl.gov (Eugene Brooks) (03/17/89)

In article <7618@june.cs.washington.edu> robertb@uw-june.UUCP (Robert Bedichek) writes:
>Build a dual-ported tag ram?  Very expensive.
>There's no such thing as free cache coherency!
If you are serious about multiprocessor performance you want a dual ported
tag ram.  On a bus based system you crank up the number of processors until
the bus can't take it anymore and therefore end up snooping the devil out of
the cache.  If you don't have dual ported tags there are no access cycles left
for the processor.  Any bus based shared memory multiprocessor worth its
salt has dual ported tags.
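
A minimal sketch of that argument, assuming every bus transaction issued by
another processor costs this cache exactly one tag lookup cycle when the tags
are single ported. The function name and the rates are hypothetical.

```python
def cpu_tag_cycles_left(n_cpus, bus_txns_per_cycle_per_cpu):
    """Fraction of single-ported tag cycles left for the local processor.

    Each of the other n_cpus - 1 processors is assumed to put
    bus_txns_per_cycle_per_cpu transactions on the bus per cycle, and
    each of those steals one tag cycle from this cache.
    """
    snoop_load = (n_cpus - 1) * bus_txns_per_cycle_per_cpu
    return max(0.0, 1.0 - snoop_load)


if __name__ == "__main__":
    # Hypothetical: each CPU puts one transaction on the bus every 8 cycles.
    for n in (2, 4, 8):
        print(n, "CPUs -> tag cycles left for the CPU:",
              cpu_tag_cycles_left(n, 0.125))
```

With dual ported tags the snoop lookups use the second port, so under this
model the processor would keep all of its tag cycles regardless of bus load.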


brooks@maddog.llnl.gov, brooks@maddog.uucp, .../uunet!maddog.llnl.gov!brooks

elind@ircam.ircam.fr (Eric Lindemann) (03/17/89)

In <21876@lll-winken.LLNL.GOV> brooks@vette.llnl.gov (Eugene Brooks) writes:
>A good estimate of the external cache size desired is the total main memory
>divided by the number of processors you plan to configure. 

So if I wanted to have 4 processors with 16 Mbyte of main memory this would
imply a 4 Mbyte cache? That's a pretty big cache, don't you think?
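
The arithmetic behind that figure, as a small sketch of the quoted rule of
thumb. The function name is illustrative only.

```python
MBYTE = 1 << 20  # bytes in one Mbyte


def external_cache_bytes(main_memory_bytes, n_cpus):
    """Per-processor external cache size: total main memory / processor count."""
    return main_memory_bytes // n_cpus


if __name__ == "__main__":
    size = external_cache_bytes(16 * MBYTE, 4)
    print(size // MBYTE, "Mbyte per processor")
```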

brooks@vette.llnl.gov (Eugene Brooks) (03/18/89)

In article <496@ircam.UUCP> elind@ircam.ircam.fr (Eric Lindemann) writes:
>In <21876@lll-winken.LLNL.GOV> brooks@vette.llnl.gov (Eugene Brooks) writes:
>>A good estimate of the external cache size desired is the total main memory
>>divided by the number of processors you plan to configure. 
>
>So if I wanted to have 4 processors with 16 Mbyte of main memory this would
>imply a 4 Mbyte cache? That's a pretty big cache, don't you think?
I agree that it is pretty big as far as one's current notion of a cache goes,
but it is pretty much the right size if your goal for the cache is to
hold portions of the problem data set which are "local" for extended periods
of time and then dynamically "shared" during communication between processors.
The idea here is that the "second level cache" is a set of local memories
for the shared data set to be distributed into.

The purpose of the second level cache is to keep the relatively high miss
rate on the "small but fast" first level cache from saturating the memory
bus.  The second level cache need not be implemented with fast static ram,
it can be made of relatively slow, and therefore cheap, dynamic memory.
Essentially the same memory technology as is used in the main memory can
be used.  A per-processor second level cache size of one or several megabytes
is an entirely reasonable size, and I can show specific problems as examples
for which it would be used effectively.
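
A minimal sketch of how the second level cache filters bus traffic, assuming
the two miss rates can simply be multiplied. Here the second level miss rate
means the fraction of first level misses that also miss in the second level.
The function name and all numbers are hypothetical.

```python
def bus_refs_per_sec(refs_per_sec, l1_miss_rate, l2_miss_rate):
    """References per second that miss both cache levels and reach the bus."""
    return refs_per_sec * l1_miss_rate * l2_miss_rate


if __name__ == "__main__":
    # Hypothetical: 30 M references/s and a 5% first level miss rate.
    without_l2 = bus_refs_per_sec(30e6, 0.05, 1.0)  # every L1 miss hits the bus
    with_l2 = bus_refs_per_sec(30e6, 0.05, 0.10)    # L2 catches 90% of L1 misses
    print("bus references/s without L2:", round(without_l2))
    print("bus references/s with L2:   ", round(with_l2))
```

Because the second level only sees first level misses, its access time
matters far less than its hit rate, which is the case for building it from
slow, cheap memory.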

brooks@maddog.llnl.gov, brooks@maddog.uucp, .../uunet!maddog.llnl.gov!brooks