[comp.parallel] Sparc-based torus multiprocessor AP1000

jdb@arp.anu.edu.au (John Barlow) (05/10/91)

OK - I currently have 5 requests for more information on the new
Fujitsu AP1000.  I will copy verbatim from a handout I have
(composed by David Hawking, of the local Department of Computer
Science).  It starts a bit slowly, as it was written as a general-purpose
information sheet (some sections deleted due to tedium ..)

Meet the CAP (otherwise known as AP1000)

The Fujitsu AP1000 ... is built up from a large number (up to 1,024) of
individual computers [called cells] ...  Each cell in the AP1000 has 16
megabytes of RAM.  [The cpu is a SPARC chip, but I believe it is not the
new chip with inbuilt MMU.]

... The cells of the AP1000 communicate with each other by means
of three separate high-performance networks.

...


The Fujitsu AP1000 in Detail

The AP1000 is a single-user computer which is connected as an I/O
device (back end) to a Sun 4/390.  To perform a computation on the CAP,
a programmer must write at least two programs: a host program, which
runs on the Sun front end and downloads one or more task programs into
the cells, and the task program(s) themselves.  At present, the only
I/O possible from the cells is to and from other cells and the host.
Consequently, the host program is responsible for all external I/O.

***diagram***

          |      --------            
          |     |        |  host      -------- --------------
          |     |  sun   | interface | buffer |              |
ethernet  |-----|        |-----------|        |  Cell Array  |
          |     | 4/390  | 3 Mbytes  | 32 MB  |              |
          |     |        | a second   -------- --------------
          |      --------             
                                     \__________  ___________/
                                                \/
                                        Fujitsu  AP1000
***end diagram***
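
To give a feel for this division of labour between the host and the
cells, here is a minimal sketch of a host program in C.  Every ap_*
name below is invented for illustration; the real host library
routines are not listed in this handout and will differ.

    /* host.c - runs on the Sun 4/390 front end.                       */
    /* All ap_* routines are hypothetical stand-ins for the real       */
    /* host library; only the overall structure is meant to be right.  */
    #include <stdio.h>

    extern void ap_open(int ncells);                     /* hypothetical */
    extern void ap_load_cells(const char *cell_binary);  /* hypothetical */
    extern void ap_send(int cell, const void *buf, long nbytes);
    extern void ap_recv(int cell, void *buf, long nbytes);
    extern void ap_close(void);

    int main(void)
    {
        enum { NCELLS = 64 };
        double input[NCELLS], result[NCELLS];

        ap_open(NCELLS);                  /* attach to the AP1000       */
        ap_load_cells("task_program");    /* download the cell program  */

        /* The host does all external I/O: it feeds data to the cells
           and collects and prints their answers.                       */
        for (int i = 0; i < NCELLS; i++) {
            input[i] = (double)i;
            ap_send(i, &input[i], sizeof(double));
        }
        for (int i = 0; i < NCELLS; i++) {
            ap_recv(i, &result[i], sizeof(double));
            printf("cell %d -> %g\n", i, result[i]);
        }

        ap_close();
        return 0;
    }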

Communication Between Cells

There are three separate, high-performance communication networks in
the AP1000, called the B-net (broadcast), S-net (status and
synchronization) and T-net (torus network).  Each is supported by
special-purpose hardware, designed to minimize cell involvement in
message receiving and forwarding.  The existence of the three separate
networks improves performance by avoiding interference between the
three types of message traffic.

***diagram difficult to reproduce in ASCII***

Imagine, at the bottom, a 2D grid (the torus network, T-net).
Above each node of this network is a processor cell.
Above the cells is the broadcast network (B-net), joining groups of
cells to nodes on a ring.
Above the ring is the S-net, joining the nodes on the ring to a common
point.

***end diagram, hope you have a good imagination***

B-net
The B-net is a 32-bit wide, 50 Mbyte per second broadcast network which
permits the host or any cell to transmit data to all the other cells.
It is used to download programs to the cells.  It is implemented as a
top-level ring whose nodes are each at the top of a tree structure
containing up to thirty-two cells.  Despite this multi-level
implementation, the behaviour mimics that of a single shared bus.  All
cells have equal rights to send messages on the B-net.
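
As a sketch only, a host-side broadcast over the B-net might look like
the fragment below.  The ap_broadcast() name is an assumption, not the
documented library call.

    /* Hypothetical host-side broadcast: one call reaches every cell,  */
    /* just as if the B-net were a single shared bus.                  */
    extern void ap_broadcast(const void *buf, long nbytes);  /* assumed name */

    void send_parameters_to_all_cells(const double *params, int n)
    {
        ap_broadcast(params, n * (long)sizeof(double));
    }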

S-net
The S-net allows cells or the host to test whether all cells are in a
particular state and permits an efficient implementation of barrier
synchronization, which is necessary in many parallel programming
applications.  For example, in a search application, each cell may be
given the job of searching part of the data for items which closely or
exactly match the requirements of the search.  Because of differences in the
data that each cell has to process and because the cells work
asynchronously, the individual cells will finish their tasks at
different times.  The next step of the computation may depend upon the
overall search results.  Consequently, cells which finish early may
have to wait until all others have finished.  This is called barrier
synchronization.

The S-net has a tree topology in which each node consists of AND gates
and buffers connected by two sets of signal lines, one fast and the
other slow.  All cells put their signals onto the S-net and receive the
overall result, which is the logical AND of all of the signals.
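
The way an AND of all signals yields a barrier can be modelled in a few
lines of C.  Both names below (snet_and_of_all_cells and ap_barrier)
are invented for the sketch; the real cell library presumably hides
this detail behind a single call.

    #include <stdbool.h>

    /* Hypothetical: assert my signal and return the hardware AND of   */
    /* every cell's signal.  The real S-net does this in hardware.     */
    extern bool snet_and_of_all_cells(bool my_signal);

    void ap_barrier(void)   /* invented name for the sketch */
    {
        /* Say "I have reached the barrier" and wait until the AND of  */
        /* all cells' signals is also true, i.e. everyone has arrived. */
        while (!snet_and_of_all_cells(true))
            ;  /* cells that finish early simply wait here */
    }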


T-net
The T-net is the means of point-to-point communication between
individual cells.  Each cell has T-net connections to four immediately
adjacent neighbours and the connections are arranged to form a
two-dimensional torus, like a grid drawn on the surface of a donut.

The T-net connections are 16 bits wide and have a peak transfer rate of
25 Mbyte/sec.  Each cell can simultaneously transmit a message and
receive one, giving an enormous total T-net bandwidth of 25
gigabytes/sec (peak) in a 1,024-cell machine.
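
A cell-to-cell exchange over the T-net might look like the sketch
below; t_send()/t_recv() and the coordinate scheme are assumptions, not
the real cell-library interface.  The aggregate figure quoted above is
simply 1,024 cells each transmitting at 25 Mbyte/sec, i.e. roughly 25
Gbyte/sec.

    /* Hypothetical cell-side point-to-point transfer over the T-net.  */
    extern void t_send(int dest_cell, const void *buf, long nbytes);
    extern void t_recv(int src_cell,  void *buf,       long nbytes);

    /* Exchange a block with the right-hand neighbour on an nx-wide    */
    /* torus row; the wrap-around is what makes the grid a torus.      */
    void exchange_with_right_neighbour(int my_x, int my_y, int nx,
                                       const double *out, double *in, int n)
    {
        int right = my_y * nx + (my_x + 1) % nx;
        int left  = my_y * nx + (my_x + nx - 1) % nx;

        t_send(right, out, n * (long)sizeof(double));  /* transmit ...    */
        t_recv(left,  in,  n * (long)sizeof(double));  /* ... and receive */
    }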

Inside a Cell

***diagram***


      IU SPARC      FPU weitek   128Kbyte Cache
          |              |              |
           -------------------------------------------
                                                      |
                                                     MSC
                                                      |
 -------------------------------------------------------------- (LBUS)
       |                             |                  |
     DRAMC                          BIF                RTC
       |                            / \               / /\ \
       |                           |   |             | |  | |
     DRAM (16 MBytes)          S-net   B-net          T-net 

***end diagram***

Integer Unit and Floating Point Unit  (IU & FPU)
The IU is a 25 MHz SPARC chip originally designed by Sun Microsystems.
Chips of the same type are found in hundreds of thousands of
workstation, laptop and mainframe systems around the world.  The
instruction set of the cells is the same as that of the Sun 4/390 front
end. The FPU is made by Weitek and is the one normally used with this
particular SPARC chip.

Use of common, off-the-shelf components for the IU and FPU benefits
both designers and users of the AP1000 by reducing the hardware design
time and, more significantly, by giving instant access to a range of
existing software such as compilers and debuggers.  The combination
chosen achieves the healthy performance of x SPECmarks (about 16 MIPS)
and 1.6 double precision LINPACK MFLOPS.

Note that the cell does not have a Memory Management Unit (MMU).  An
MMU would be superfluous in the absence of local disk storage because
there is no sensible way of implementing virtual memory.  There is,
however, a Memory Protection Table (MPT) which prevents programs within
a cell from interfering with each other's data.

Memory and Cache (DRAM, DRAMC & CM)
The 16 megabytes of DRAM per cell is four-way interleaved to improve
average access time and is capable of correcting a single-bit error in
any byte.

The IU and FPU are connected to a simple direct-mapped cache
implemented in general-purpose static RAM.  The cache capacity is 128
Kbytes, organized in lines of 16 bytes each.  The cache controller is
part of the MSC and implements a copy-back algorithm.
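
For concreteness, the address mapping implied by those figures (128
Kbytes, 16-byte lines, direct-mapped, hence 8,192 lines) can be written
down directly; the fragment below is only an illustration of the
arithmetic, not AP1000 software.

    /* 128 Kbyte direct-mapped cache with 16-byte lines:               */
    /* 8192 lines, so 4 offset bits and 13 index bits.                 */
    #include <stdint.h>

    #define LINE_BYTES   16u
    #define CACHE_BYTES  (128u * 1024u)
    #define NUM_LINES    (CACHE_BYTES / LINE_BYTES)   /* 8192 */

    static uint32_t cache_offset(uint32_t addr) { return addr % LINE_BYTES; }
    static uint32_t cache_index (uint32_t addr) { return (addr / LINE_BYTES) % NUM_LINES; }
    static uint32_t cache_tag   (uint32_t addr) { return addr / CACHE_BYTES; }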

Message Controller (MSC)
The MSC has direct access to memory and cache and enables a cell to
receive and send messages with a minimum of IU involvement.  The MSC
can transfer data to or from the T-net (via the RTC) or the B-net (via
the BIF) using one of several different methods.  It can write received
data into the appropriate ring buffer in memory or store it at an
address calculated from an index in the message header.

So-called stride DMA can be used for receiving or sending data which is
regularly dispersed in memory.  For example, it is possible to ask it to
send as a message every 8th byte starting from a particular address in
memory up to another address.  The MSC can also be given a list of
addresses of data items irregularly located in memory and asked to send
all the data as a single message.
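
A stride transfer of the kind just described is fully specified by a
start address, an end address and a stride.  The sketch below models in
plain C what the MSC does in hardware; the descriptor struct is
invented for illustration and is not the real MSC programming
interface.

    #include <stddef.h>

    /* Illustrative descriptor: not the actual MSC register layout.    */
    struct stride_dma {
        const unsigned char *start;    /* first byte to send           */
        const unsigned char *end;      /* one past the last byte       */
        size_t               stride;   /* e.g. 8 for every 8th byte    */
    };

    /* Gather every stride-th byte into a contiguous message buffer,   */
    /* which is what the MSC does without involving the IU.            */
    size_t stride_gather(struct stride_dma d, unsigned char *msg)
    {
        size_t n = 0;
        for (const unsigned char *p = d.start; p < d.end; p += d.stride)
            msg[n++] = *p;
        return n;                      /* bytes placed in the message  */
    }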

A function called line sending can transmit messages directly from
cache.  It may speed things up considerably because of the relatively
high probability that short data items are already stored in cache.

B-net interface (BIF)
The BIF consists of FIFO buffers for reading and writing and a
scatter/gather controller.  Its basic functions are sending and
receiving broadcasts.  It also supports the S-net functions.  In
combination with the DMA controller in the MSC, the cell can tune into
selected parts of a large block of data broadcast by the host (host
scatter, cell gather) or transmit parts of a block of data assembled in
a single operation by the host (host gather, cell scatter).  An example
of the use of scatter/gather is when the host transmits a large
two-dimensional matrix over the B-net and each cell receives the small
rectangular piece it has to process while ignoring the rest.
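
To make the matrix example concrete, the fragment below shows which
elements a cell at torus coordinates (cx, cy) would keep out of an
N x N matrix broadcast row by row.  On the real machine this selection
is programmed into the BIF/MSC DMA rather than done in a loop, and the
coordinate names are assumptions.

    /* Illustration only: copy this cell's bw x bh sub-block out of an */
    /* N x N row-major matrix broadcast over the B-net, for an nx x ny */
    /* cell array.  The real cell filters the broadcast in hardware.   */
    void keep_my_block(const double *matrix, int N,
                       int cx, int cy, int nx, int ny, double *block)
    {
        int bw = N / nx;                       /* block width          */
        int bh = N / ny;                       /* block height         */

        for (int i = 0; i < bh; i++)
            for (int j = 0; j < bw; j++)
                block[i * bw + j] =
                    matrix[(cy * bh + i) * N + (cx * bw + j)];
    }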

T-net interface (RTC)
Messages from cell A to cell B may be transferred in one step if the
two are adjacent but have to be routed via intermediate cells if they
are not.  In the routing scheme employed in the AP1000, a T-net message
acts like a worm passing along a temporary worm-hole between its source
and destination.  The head of a message may reach its destination with
its tail still at the source and its middle part spread across a number
of intermediate cells.

The RTC has two routing controllers, one for the x direction and the
other for the y.  It also has sufficient buffers to permit a maximum
network size of 32 x 32 cells.  The RTC can route messages at the rate
of 40 nanoseconds per byte.  In the absence of network contention, the
time to complete transmission of a message is 40 * (1 + distance +
message size) nanoseconds.
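
That figure turns into a one-line helper; the units (distance in hops,
message size in bytes) are my reading of the 40 ns/byte rate rather
than something the handout states explicitly.

    /* Contention-free T-net transmission time, straight from the      */
    /* formula 40 * (1 + distance + message size) nanoseconds.         */
    /* Units assumed: distance in hops, message size in bytes.         */
    double tnet_time_ns(int distance_hops, long message_bytes)
    {
        return 40.0 * (1 + distance_hops + message_bytes);
    }

    /* Example: a 1,000-byte message crossing 10 cells takes about     */
    /* 40 * (1 + 10 + 1000) = 40,440 ns, i.e. roughly 40 microseconds. */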

When cells transmit messages frequently to irregular and distant
destinations, the picture is quite complex, with messages crossing and
sometimes blocking each other.  The designers of the AP1000 had to
avoid many pitfalls, including deadlock (in which all "worms" are
blocked) and inefficiency.  They added the concept of the structured
buffer pool to the wormhole routing scheme to avoid both of these
evils.

Programming the AP1000

Programs for the AP1000 are written in C or Fortran.  Communication
between cells and between host and cells is achieved by calls to
routines in the host and cell libraries provided.  A cut-down version
of the Unix operating system called cell-os is downloaded into the
cells before cell program execution commences.

Programs are run under the control of caren, the Cellular Array Runtime
Environment, which provides facilities for monitoring the execution of
parallel programs and for symbolic debugging of the tasks running in
individual cells.  Output generated by the cells can be displayed on
the host.
-- 
jdb = John Barlow, Parallel Computing Research Facility,
Australian National University, I-Block, PO Box 4, Canberra, 2601, Australia.
email = jdb@arp.anu.edu.au
[International = +61 6, Australia = 06] [Phone = 2492930, Fax = 2490747]

-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell