[comp.sys.m88k] Multiple-CPU 88000 systems, is the M-bus often the bottleneck?

robertb@june.cs.washington.edu (Robert Bedichek) (10/29/90)

I have a question for people familiar with bus traffic on the M-bus of
multiple CPU 88k systems.  I've done some analysis and I'd like someone
to tell me if I've missed something.  The result of my analysis is that
for some problems, the M-bus is the bottleneck and that 1-CPU systems
will sometimes be as fast as 2 or 4-CPU systems.

I am doing a study of an 88100/88200 based multiprocessor.  From what
I've heard, an aggressive memory design for these chips can do a cache
line fill in 8 cycles.  At 25 MHz, this yields a maximum cache-memory
throughput of 50 MB/sec.  Expressed in double-precision numbers that
can be loaded or stored: 6.25 M/sec.
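The arithmetic above can be sketched directly.  The 16-byte cache line
is my assumption (four 32-bit words per line); the 8-cycle fill and
25 MHz clock are the figures quoted above:

```python
# Back-of-the-envelope M-bus throughput.  The 16-byte cache line is an
# assumption (four 32-bit words); the 8-cycle line fill and 25 MHz
# clock are the figures quoted in the text.
LINE_BYTES = 16
FILL_CYCLES = 8
CLOCK_HZ = 25e6

bytes_per_sec = LINE_BYTES * CLOCK_HZ / FILL_CYCLES   # 50 MB/sec
doubles_per_sec = bytes_per_sec / 8                   # 6.25 M doubles/sec
print(bytes_per_sec, doubles_per_sec)
```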

Consider algorithms that munge large data sets and do relatively simple
operations on the data.  Examples: text search, matrix multiply, matrix
transpose.  I believe that algorithms to solve these problems, when
executed on 88k's, will be limited by the bandwidth of the M-bus.
(This is not a criticism of the 88k; something has to limit the
performance.)

Consider the O(n**3) algorithm for matrix multiply: we dot the rows of
A with the columns of B to find the elements of C.

   C = A x B

A row of C is formed by dotting a row of A with all the columns of B.
Let's assume that a row of A can fit in cache, but that not all of the
columns of B can fit.  Let's also assume that the elements of these
matrices are plain double precision numbers and that the code is
hand-tweaked to overlap memory access and FP calculation.
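For concreteness, here is the algorithm in question as a plain triple
loop (a minimal sketch, with none of the blocking or hand-tweaking
assumed above):

```python
# The O(n**3) matrix multiply described above: C[i][j] is the dot
# product of row i of A with column j of B.  Plain, unblocked sketch.
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C
```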

Each calculation of an element of C then requires fetching a column of
B.  Since these values aren't in cache, they have to be fetched over
the M-bus, so we have a maximum of 6.25 million fetches of elements of
B per second.  The FP unit is pipelined and can do 12 million FP
operations/sec, so it can keep up with the 6.25 M fetches per second.
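The balance between the two units can be put in per-element terms.
Assuming two FP ops (one multiply, one add) per fetched element of B,
the 12 MFLOPS pipeline works out to roughly the bus's 6.25 M
elements/sec, so the two run neck and neck on one CPU and the bus has
no slack left for a second:

```python
# Rough per-element balance for the dot-product loop, using the quoted
# rates: 6.25 M bus fetches/sec and a 12 MFLOPS pipelined FP unit.
# Assumes two FP ops (one multiply, one add) per fetched element of B.
def elements_per_second(fetch_rate=6.25e6, flop_rate=12e6, flops_per_elem=2):
    bus_limit = fetch_rate
    fp_limit = flop_rate / flops_per_elem
    return bus_limit, fp_limit

bus, fp = elements_per_second()   # 6.25e6 vs 6.0e6 -- within a few percent
```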

This algorithm is trivial to run in parallel: give each processor a
chunk of rows of C to compute.  Motorola makes nice Hypermodules that
have 1, 2 or 4 88100 CPUs along with CMMUs in a small package.  But
they share the same M-bus (which makes sense), and in the scenario
above, adding CPUs would not speed up the algorithm.  A system with
more cache per CPU and fewer CPUs would usually run faster (on such an
algorithm).
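The shared-bus ceiling can be captured in a toy model: p CPUs, each
demanding d fetches/sec on a bus that supplies B fetches/sec, see a
speedup of min(p, B/d).  One CPU already saturating the bus gains
nothing from three more:

```python
# Toy model of the shared-bus ceiling: p CPUs each demanding
# demand_per_cpu fetches/sec on a bus supplying bus_capacity
# fetches/sec get an effective speedup of min(p, capacity/demand).
def shared_bus_speedup(p, demand_per_cpu, bus_capacity):
    return min(p, bus_capacity / demand_per_cpu)

print(shared_bus_speedup(4, 6.25e6, 6.25e6))   # bus-bound: no gain
print(shared_bus_speedup(2, 1e6, 6.25e6))      # bus has slack: linear
```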

Of course there are other problems where 4 CPUs sharing an M-bus give
an overall speedup.

My questions:

Do data-hungry/calculation-lean algorithms really saturate the M-bus
with just one CPU (or with just two)?

What kinds of problems benefit from 4-CPU 88100/88200 systems (over 2
or 1 CPU systems)?

What kind of M-bus utilizations have you seen on real 88k systems?

Notes:

1. If a row of A doesn't fit in cache, the dot products of segments
   of rows of A and columns of B can be computed and then summed.
   The segment size would be chosen so that the segments of rows
   of A would fit in cache.

2. I believe that other algorithms, such as matrix transpose and
   text search are more bus-hogging/calculation-lean.
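Note 1 above can be sketched as a segmented dot product, with the
partial sums accumulated at the end (the segment size here is
hypothetical, standing in for whatever fits in cache):

```python
# Note 1 sketched: a long dot product computed in cache-sized segments,
# with the partial dot products summed afterward.  The segment size is
# hypothetical, standing in for whatever fits in the 88200 cache.
def segmented_dot(a_row, b_col, seg=4):
    partials = []
    for i in range(0, len(a_row), seg):
        partials.append(sum(x * y for x, y in
                            zip(a_row[i:i + seg], b_col[i:i + seg])))
    return sum(partials)
```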


If there is enough interest, I'll post a summary of email responses.

	Rob Bedichek      robertb@cs.washington.edu