[comp.arch] Killer Micros and the TC2000

slackey@bbn.com (Stan Lackey) (03/20/90)

In article <45408@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>Speaking of architectural issues, how is the BBN TC 2000 working out?
>It should be a perfect example of Killer Micros in action.  But,
>I was rather surprised that the TC 2000 Butterfly switch is only 8 bits (!)
>wide and only supports a maximum memory bandwidth of 2.4 GBytes/sec
>for a 63 processor system.  A Cray Y-MP has about 40 GBytes/sec of total
>memory bandwidth, for reference.

The peak bandwidth of the 63-node TC2000 depends upon where you
measure it.  The memory has a 3-level hierarchy: 1) cache, 2) local
memory, and 3) global memory.  The Cray has no cache, but the 88000
chip set does; the appropriate place to measure would probably be at
the busses between the CPU chip and the cache chips.  The combined
instruction cache and data cache busses peak at 160 MB/s; times
63 processors, that is 10 GB/s.  Local memory speed is in the
neighborhood of 25 MB/s; times 63, that is about 1.5 GB/s.  Global
memory is 8 MB/s, for an aggregate of about 500 MB/s.  Your mileage
will be somewhere between 10 GB/s and 500 MB/s, depending upon cache
hit rate and the mixture of accesses between local and global memory.
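
As a rough check of the arithmetic above, here is a minimal Python sketch.
The per-node figures (160, 25 and 8 MB/s) and the node count come from the
post; everything else is an assumption made for illustration.  In
particular, the function names, the convention 1 GB = 1000 MB, the
traffic-weighted blend and the access mix in the example are all
hypothetical, not BBN's performance model.  The exact products are 10080,
1575 and 504 MB/s, which round to the figures quoted.

```python
# Per-node peak figures quoted in the post, in MB/s.
PEAK_MB_S = {"cache": 160.0, "local": 25.0, "global": 8.0}
NODES = 63  # node count from the post


def aggregate_peak(per_node_mb_s, nodes=NODES):
    """Aggregate peak in MB/s: per-node peak times the node count."""
    return per_node_mb_s * nodes


def blended_per_node(fractions, peaks=PEAK_MB_S):
    """Hypothetical blend, NOT from the post.

    Assumes the time to move one MB is the traffic-weighted sum of
    1/peak over the three levels, with no overlap between levels.
    `fractions` maps level name -> share of traffic and must sum to 1.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    time_per_mb = sum(share / peaks[level] for level, share in fractions.items())
    return 1.0 / time_per_mb


if __name__ == "__main__":
    for level, mb_s in PEAK_MB_S.items():
        # Assumes 1 GB = 1000 MB when converting.
        print(f"{level:>6}: {aggregate_peak(mb_s) / 1000:.2f} GB/s aggregate peak")

    # Made-up access mix, purely illustrative.
    mix = {"cache": 0.90, "local": 0.08, "global": 0.02}
    print(f" blend: {blended_per_node(mix) * NODES / 1000:.2f} GB/s aggregate")
```

Under this blend the slow levels dominate quickly, because time per byte
adds up rather than bandwidth; that is one way to read "your mileage will
be somewhere between" the two extremes.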

The 8-bit switch path clocks at 38 MHz, so the raw bandwidth of the
medium is 38 MB/s per path.  Times 63 paths, that is a peak media speed
of 2.4 GB/s.
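
The switch figure can be reproduced the same way.  This is a small sketch
assuming one 8-bit transfer per clock on each path and 1 GB = 1000 MB;
the function names are hypothetical.

```python
def path_bandwidth_mb_s(width_bits, clock_mhz):
    """Raw bandwidth of one switch path in MB/s.

    Assumes one transfer of `width_bits` per clock cycle.
    """
    return width_bits / 8 * clock_mhz


def peak_media_gb_s(width_bits, clock_mhz, paths):
    """Peak media speed over all paths in GB/s (1 GB = 1000 MB assumed)."""
    return path_bandwidth_mb_s(width_bits, clock_mhz) * paths / 1000


if __name__ == "__main__":
    per_path = path_bandwidth_mb_s(8, 38)    # 8-bit path at 38 MHz
    total = peak_media_gb_s(8, 38, 63)       # 63 paths
    print(f"per path: {per_path:.0f} MB/s, all paths: {total:.3f} GB/s")
```

The exact product is 38 x 63 = 2394 MB/s, which rounds to the 2.4 GB/s
quoted.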

To avoid misleading anyone: the above describes the performance model,
with its speed differential between local and global memory.  The
programming model is a single globally addressed memory space.
-Stan

slackey@bbn.com (Stan Lackey) (03/21/90)

In article <53795@bbn.COM> I (slackey@BBN.COM) responded to a posting
comparing TC2000 and Cray memory bandwidths:
>The peak bandwidth of the 63-node TC2000 depends upon where you
>measure it.  The memory has a 3-level hierarchy: 1) cache, 2)local
>memory, and 3)global memory.
I included a set of approximate peak bandwidths at the various levels,
commenting on what I felt was an apples-to-oranges comparison with the
Cray.  I erroneously left out the disclaimer: These are approximate
peak values given for comparison with other architectures only.
Although these values can be achieved under certain circumstances,
delivered averages will vary depending upon the application.
-Stan

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/21/90)

In article <53795@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <45408@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>Speaking of architectural issues, how is the BBN TC 2000 working out?

>The peak bandwidth of the 63-node TC2000 depends upon where you
>measure it.

I agree.  Of course, Crays have no caches, but some Crays have local memory
and all Crays have vector registers and fairly numerous scalar registers.
You could call registers "programmable caches" to compare bandwidths :-)

My question was intentionally brief, but to be more specific: the architecture
obviously depends on the ability to parallelize in such a way that global
memory bandwidth is not the bottleneck.  How well is this working out?
etc. etc. etc.


  Hugh LaMaster, M/S 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)604-6117       

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/21/90)

In article <45490@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>I agree.  Of course, Crays have no caches, but some Crays have local memory

I should have said that Crays have no DATA caches.
Pardon.

crowl@cs.rochester.edu (Lawrence Crowl) (03/23/90)

In article <45490@ames.arc.nasa.gov>
lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>My question was intentionally brief, but to be more specific: the [BBN TC
>2000 Multiprocessor] architecture obviously depends on the ability to
>parallelize in such a way that global memory bandwidth is not the bottleneck.
>How well is this working out?

My experience has been with the first Butterfly, based on the 68000.  On this
system, contention for the "inter-node" communication network was negligible.
Performance is far more likely to be limited by contention for a
specific memory module than by contention for the communication network.
I expect (but do not know) that the same is true for the TC 2000.
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
	  ...!{ames,rutgers}!rochester!crowl	Rochester, New York,  14627