jdb@arp.anu.edu.au (John Barlow) (05/10/91)
OK - I have currently 5 requests for more information on the new Fujitsu
AP1000. I will copy verbatim from a handout I have (composed by David
Hawking, of the local Department of Computer Science). It starts a bit
slow as it was written as a general purpose information sheet (some
sections deleted due to tedium ..)

Meet the CAP (otherwise known as AP1000)

The Fujitsu AP1000 ... is built up from a large number (up to 1,024) of
individual computers [called cells] ... Each cell in the AP1000 has 16
megabytes of RAM. [the cpu is a SPARC chip, but I believe it is not the
new chip with inbuilt MMU] ... The cells of the AP1000 communicate with
each other by means of three separate high-performance networks. ...

The Fujitsu AP1000 in Detail

The AP1000 is a single-user computer which is connected as an I/O device
(back end) to a Sun 4/390. To perform a computation on the CAP, a person
must write at least two programs: a host program, which runs on the Sun
front-end, and one or more task programs, which the host downloads into
the cells. At present, the only I/O possible from the cells is to and
from other cells and the host. Consequently, the host program is
responsible for all external I/O.

***diagram***

             --------      ---------------------      --------------
   host      |      |      | interface | buffer |     |            |
  ethernet --| Sun  |------| 3 Mbytes  | 32 MB  |-----| Cell Array |
             | 4/390|      | a second  |        |     |            |
             --------      ---------------------      --------------
                           \________________  ___________________/
                                            \/
                                      Fujitsu AP1000

***end diagram***

Communication Between Cells

There are three separate, high-performance communication networks in the
AP1000, called the B-net (broadcast), S-net (status and synchronization)
and T-net (torus network). Each is supported by special-purpose
hardware, designed to minimize cell involvement in message receiving and
forwarding. The existence of three separate networks improves
performance by avoiding interference between the three types of message
traffic.
***diagram difficult to reproduce in ASCII***
Imagine, at the bottom, a 2D grid (the torus network, T-net). Above each
node of this network is a processor cell. Above the cells is a broadcast
network (B-net) joining several cells to a node on a ring. Above the
ring is the S-net, joining nodes on the ring to a common point.
***end diagram, hope you have a good imagination***

B-net

The B-net is a 32-bit wide, 50 Mbyte per second broadcast network which
permits the host or any cell to transmit data to all the other cells. It
is used to download programs to the cells. It is implemented as a
top-level ring whose nodes are each at the top of a tree structure
containing up to thirty-two cells. Despite this multi-level
implementation, the behaviour mimics that of a single shared bus. All
cells have equal rights to send messages on the B-net.

S-net

The S-net allows cells or the host to test whether all cells are in a
particular state and permits an efficient implementation of barrier
synchronization, which is necessary in many parallel programming
applications. For example, in a searching task, each cell may be given
the job of searching part of the data for items which closely or exactly
match the requirements of the search. Because of differences in the data
that each cell has to process, and because the cells work
asynchronously, the individual cells will finish their tasks at
different times. The next step of the computation may depend upon the
overall search results. Consequently, cells which finish early may have
to wait until all others have finished. This is called barrier
synchronization.

The S-net has a tree topology in which each node consists of AND gates
and buffers connected by two sets of signal lines, one fast and the
other slow. All cells put their signals onto the S-net and receive the
overall result, which is the logical AND of all of the signals.

T-net

The T-net is the means of point-to-point communication between
individual cells.
Each cell has T-net connections to four immediately adjacent neighbours,
and the connections are arranged to form a two-dimensional torus, like a
grid drawn on the surface of a donut. The T-net connections are 16 bits
wide and have a peak transfer rate of 25 Mbyte/sec. Each cell can
simultaneously transmit one message and receive another, giving an
enormous total T-net bandwidth in a 1,024-cell machine of 25
gigabytes/sec (peak).

Inside a Cell

***diagram***

    IU (SPARC)      FPU (Weitek)      128 Kbyte Cache
        |                |                  |
   ---------------------------------------------------
   |                      MSC                         |
   ---------------------------------------------------
        |                |                  |            (LBUS)
      DRAMC             BIF                RTC
        |              /     \              |
  DRAM (16 MBytes)  S-net   B-net         T-net

***end diagram***

Integer Unit and Floating Point Unit (IU & FPU)

The IU is a 25 MHz SPARC chip originally designed by Sun Microsystems.
Chips of the same type are found in hundreds of thousands of
workstation, laptop and mainframe systems around the world. The
instruction set of the cells is the same as that of the Sun 4/390 front
end. The FPU is made by Weitek and is the one normally used with this
particular SPARC chip.

Use of common, off-the-shelf components for the IU and FPU benefits both
designers and users of the AP1000 by reducing the hardware design time
and, more significantly, by giving instant access to a range of existing
software such as compilers and debuggers. The combination chosen
achieves the healthy performance of x SPECmarks (about 16 MIPS) and 1.6
double precision LINPACK MFLOPS.

Note that the cell does not have a Memory Management Unit (MMU). An MMU
would be superfluous in the absence of local disk storage because there
is no sensible way of implementing virtual memory. There is, however, a
Memory Protection Table (MPT) which prevents programs within a cell from
interfering with each other's data.
Memory and Cache (DRAM, DRAMC & CM)

The 16 megabytes of DRAM per cell is four-way interleaved to improve
average access time and is capable of correcting a single-bit error in
any byte. The IU and FPU are connected to a simple direct-mapped cache
implemented in general-purpose static RAM. The cache capacity is 128
Kbytes, organized in lines of 16 bytes each. The cache controller is
part of the MSC and implements a copy-back algorithm.

Message Controller (MSC)

The MSC has direct access to memory and cache and enables a cell to
receive and send messages with a minimum of IU involvement. The MSC can
send data to or from the T-net (via the RTC) or the B-net (via the BIF)
using one of several different methods. It can write received data into
the appropriate ring buffer in memory or store it at an address
calculated from an index in the message header.

So-called stride DMA can be used for receiving or sending data which is
regularly dispersed in memory. For example, it is possible to ask the
MSC to send as a message every 8th byte starting from a particular
address in memory up to another address. The MSC can also be given a
list of addresses of data items irregularly located in memory and asked
to send all the data as a single message. A function called line sending
can transmit messages directly from cache. It may speed things up
considerably because of the relatively high probability that short data
items are already stored in cache.

B-net interface (BIF)

The BIF consists of FIFO buffers for reading and writing and a
scatter/gather controller. Its basic functions are sending and receiving
broadcasts. It also supports the S-net functions. In combination with
the DMA controller in the MSC, a cell can tune into selected parts of a
large block of data broadcast by the host (host scatter, cell gather) or
transmit parts of a block of data assembled in a single operation by the
host (host gather, cell scatter).
An example of the use of scatter/gather is when the host transmits a
large two-dimensional matrix over the B-net and each cell receives the
small rectangular piece it has to process while ignoring the rest.

T-net interface (RTC)

Messages from cell A to cell B may be transferred in one step if the two
are adjacent, but have to be routed via intermediate cells if they are
not. In the routing scheme employed in the AP1000, a T-net message acts
like a worm passing along a temporary worm-hole between its source and
destination. The head of a message may reach its destination with its
tail still at the source and its middle part spread across a number of
intermediate cells.

The RTC has two routing controllers, one for the x direction and the
other for the y. It also has sufficient buffers to permit a maximum
network size of 32 x 32 cells. The RTC can route messages at the rate of
40 nanoseconds per byte. In the absence of network contention, the time
to complete transmission of a message is 40 * (1 + distance + message
size) nanoseconds.

When cells transmit messages frequently to irregular and distant
destinations, the picture is quite complex, with messages crossing and
sometimes blocking each other. The designers of the AP1000 had to avoid
many pitfalls, including deadlock (in which all "worms" are blocked) and
inefficiency. They added the concept of the structured buffer pool to
the wormhole routing scheme to avoid both of these evils.

Programming the AP1000

Programs for the AP1000 are written in C or Fortran. Communication
between cells and between host and cells is achieved by calls to
routines in the host and cell libraries provided. A cut-down version of
the Unix operating system called cell-os is downloaded into the cells
before cell program execution commences.
Programs are run under the control of caren, the Cellular Array Runtime
Environment, which provides facilities for monitoring the execution of
parallel programs and for symbolic debugging of the tasks running in
individual cells. Output generated by the cells can be displayed on the
host.
--
jdb = John Barlow, Parallel Computing Research Facility,
Australian National University, I-Block, PO Box 4, Canberra, 2601,
Australia.  email = jdb@arp.anu.edu.au
[International = +61 6, Australia = 06] [Phone = 2492930, Fax = 2490747]
--
=========================== MODERATOR ==============================
Steve Stevenson                         {steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906     (803)656-5880.mabell