robertb@june.cs.washington.edu (Robert Bedichek) (10/29/90)
I have a question for people familiar with bus traffic on the M-bus of multiple-CPU 88k systems. I've done some analysis and I'd like someone to tell me if I've missed something. The result of my analysis is that for some problems, the M-bus is the bottleneck, and 1-CPU systems will sometimes be as fast as 2- or 4-CPU systems.

I am doing a study of an 88100/88200-based multiprocessor. From what I've heard, an aggressive memory design for these chips can do a cache line fill in 8 cycles. At 25 MHz, this yields a maximum cache-memory throughput of 50 MB/sec. Expressed in double-precision numbers that can be loaded or stored: 6.25 M/sec.

Consider algorithms that munge large data sets and do relatively simple operations on the data. Examples: text search, matrix multiply, matrix transpose. I believe that algorithms to solve these problems, when executed on 88k's, will be limited by the bandwidth of the M-bus. (This is not a criticism of the 88k; something has to limit the performance.)

Consider the O(n**3) algorithm for matrix multiply: we dot the rows of A with the columns of B to find the elements of C.

	C = A x B

A row of C is formed by dotting a row of A with all the columns of B. Let's assume that a row of A can fit in cache, but that not all of the columns of B can. Let's also assume that the elements of these matrices are plain double-precision numbers and that the code is hand-tweaked to overlap memory access and FP calculation. Each calculation of an element of C then requires fetching a column of B. Since these values aren't in cache, they have to be fetched over the M-bus, so we have a maximum of 6.25 million fetches of elements of B per second. The FP unit is pipelined and can do 12 million FP operations/sec, so it can easily keep up with the 6.25 M fetches per second.

This algorithm is trivial to run in parallel: give each processor a chunk of rows of C to compute. Motorola makes nice Hypermodules that have 1, 2, or 4 88100 CPUs along with CMMUs in a small package.
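The inner loop I have in mind can be sketched in C. This is a hypothetical, cache-oblivious version for illustration (the function name, the toy size N, and the layout are my own, not actual 88k code); the point is the stride-N walk down a column of B, which in the scenario above costs an M-bus line fill per element:

```c
#define N 3  /* toy size; the analysis assumes B is too big for the 88200 caches */

/* Naive O(n**3) matrix multiply, C = A x B, as dot products of rows of A
 * with columns of B.  b[k][j] walks down a column with stride N doubles,
 * so with B uncached each fetch is (at worst) a fresh M-bus line fill. */
void matmul(int n, double a[][N], double b[][N], double c[][N])
{
    for (int i = 0; i < n; i++)           /* each row of C */
        for (int j = 0; j < n; j++) {     /* each column of B */
            double sum = 0.0;
            for (int k = 0; k < n; k++)   /* n fetches of elements of B */
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
```

The arithmetic behind the 6.25 M figure, assuming a 16-byte line (consistent with the 8-cycle/50 MB/sec numbers above): 25 MHz / 8 cycles = 3.125 M line fills/sec, times 16 bytes = 50 MB/sec, divided by 8 bytes per double = 6.25 M doubles/sec.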
But they share the same M-bus (which makes sense), and in the scenario above, adding CPUs would not speed up the algorithm. A system with more cache per CPU and fewer CPUs would usually run faster (on such an algorithm). Of course there are other problems where 4 CPUs sharing an M-bus give an overall speedup.

My questions:

1. Do data-hungry/calculation-lean algorithms really saturate the M-bus with just one CPU (or with just two)?
2. What kinds of problems benefit from 4-CPU 88100/88200 systems (over 2- or 1-CPU systems)?
3. What kind of M-bus utilization have you seen on real 88k systems?

Notes:

1. If a row of A doesn't fit in cache, the dot products of segments of rows of A and columns of B can be computed and then summed. The segment size would be chosen so that the segments of rows of A fit in cache.
2. I believe that other algorithms, such as matrix transpose and text search, are more bus-hogging/calculation-lean.

If there is enough interest, I'll post a summary of email responses.

	Rob Bedichek
	robertb@cs.washington.edu