jsutton@iWarp.intel.com (Jim Sutton) (06/01/91)
Over the past few months, there have been sporadic articles and requests for
information concerning the iWarp project in the comp.arch and comp.parallel
newsgroups. It seems appropriate at this time to provide a brief description
of what an iWarp is, for those of you who may be feeling left out.

My credentials: I have been with iWarp since Q3 '86. My primary function was
component/systems architect, but I also had responsibility for the design of
two of the functional units.

This overview provides only a brief, global view of iWarp's architecture and
capabilities, and does not attempt to cover details (the architectural spec
used by our software development teams runs over 450 pages!). Pointers to
further information are provided at the end of this article for those who
would like additional details. All performance numbers shown are for 20 MHz
systems.

---------------------------------------
SYSTEM ARCHITECTURE

iWarp is both a component and a system design for scalable parallel computing
systems, aimed at supporting both systolic (fine-grained) and message-passing
(coarse-grained) applications that require high-speed communications.
Systolic applications involve few (typically <10) calculations per data
element, so a *balance* between I/O and CPU performance is as critical as raw
CPU MFLOPs.

iWarp's primary design focus was on integrating high-performance I/O into the
processor in a way that gives the computational units balanced access to both
the I/O channels and memory. We made a conscious decision to avoid expensive,
experimental (or simply difficult) CPU features found on current leading-edge
processors, so that we could focus time (and silicon) on these goals.

An iWarp cell (i.e., node) comprises a single iWarp component and a bank of
fast static RAM. iWarp cells are joined by directly connecting the iWarp
components through their four bidirectional pathways.
Typical configurations are 1D (linear) and 2D (mesh) arrays, although other
arrangements are possible.

iWarp is the first commercial product to implement virtual routing channels
in hardware, and it extends this concept to *long-lived* virtual channels
(which we call connections) that form a unidirectional path of reserved
resources from the terminal source node to the terminal destination node.
Once a connection has been established, one or more messages may be shipped
along this path. Connections may also be *shared*; that is, one or more cells
along the path may be senders, and one or more cells along the path may be
receivers. Connections use a form of street-sign addressing to choose a
routing path through the array; messages have names that identify the
destination cell(s) or process(es) within the connection. iWarp supports
20 connections, with independent buffering, control and status resources
for each.

---------------------------------------
COMPONENT ARCHITECTURE

The iWarp component consists of two essentially asynchronous agents: a
Computation Agent and a Communication Agent. The Computation Agent contains
the integer core hardware, floating-point hardware, memory interface and a
128-word Register File (RF). The Communication Agent handles all pathway
traffic and manipulation of the virtual channel buffers. Following is a
brief summary of the iWarp component features; unusual features are
discussed in the next section.

Communications
- 4 bidirectional pathways.
  * Built as unidirectional bus pairs, each 8 bits (+ control) wide.
  * 40 Mbytes/sec per bus (320 Mbytes/sec aggregate).
- 4-entry Address Match CAM handles cell street signs and message names.
- 20 channel buffers (called PCT records).
  * Allows 20 simultaneous connections through/to/from the cell. "Express"
    (through-cell) traffic is not blocked by "inbound" or "outbound"
    traffic to/from the cell.
  * Express traffic handled automatically, without SW intervention and
    without stealing CPU cycles.
  * 8-word data queue provides smoothing.
  * Programmable stop conditions allow a channel to be automatically
    converted from "express" (data flowing through the cell) to "inbound"
    (data examined/consumed by the cell), and vice versa.
- Independent round-robin scheduling for each outbound pathway.

128-word Register File (RF)
- 118 general-purpose registers + 10 special addresses.
- Accessible as 8/16/32/64-bit fields on natural boundaries.
- Heavily multi-ported:
  * Separate ports for the integer core, FP adder and FP multiplier allow
    simultaneous access. Each port supports 2 reads/1 write per clock.
  * Special "back-door" ports tied directly to specific registers:
    LM ports and stream gates (see "Special Features").

Floating Point
- Adder and multiplier units:
  * Non-pipelined 2-clock SP, 4-clock DP basic operations.
  * Each 10 MFLOPs SP, 5 MFLOPs DP.
- 32/64-bit IEEE 754.
  * Full trap support and fast mode; four rounding modes.
- 3-operand instructions (2 src, 1 dest).
- Result bypass between adder and multiplier.
- Multiplier supports divide, remainder and square-root instructions.
- Integer-to-float, float-to-integer, pack and unpack operations.

Integer Arithmetic
- 8/16/32-bit data operations.
- All 1-clock operations; 20 MIPs performance.
- Arithmetic, shift/rotate, logical and bit operations.
- 2-operand instructions (1 src, 1 src/dest).
- Most ops allow an 8-bit literal value as the 2nd operand.
- Result bypass to any other core (non-FP) instruction.

Memory Operations
- 8/16/32/64-bit load/store instructions.
  * Pre- and post-increment addressing.
  * Small-constant or variable address stride.
  * Big-endian/little-endian transformations for all data types.
- Internal 256-word instruction cache, 4-way set-associative.
- Internal 2K-word instruction ROM.

Support Operations/Features
- Stack: push/pop, call/return, allocate/release.
- Branch and call targets: direct, register-indirect, memory-indirect.
- System call to protected code through table lookup.
- Control/status register load/store.
- Register indirection (dereferences).
- Four separate sets of flags: FP adder, FP multiplier, integer and "other".

---------------------------------------
SPECIAL FEATURES

Hardware Loop Support

An ENTERLOOP instruction initializes dedicated hardware, including the
loop-start address (the next instruction), an iteration count, and an
optional condition code (e.g., carry flag false). Most instructions have an
ENDLOOP bit. If it is set, the loop count is decremented, and the loop is
repeated if the count is non-zero OR the specified condition (resulting from
a preceding instruction) is false. This all occurs in parallel with normal
instruction execution. Loops can be nested: the ENTERLOOP instruction saves
the loop controls on the stack before changing them, and exiting the loop
restores them. A mode in the BRANCH instruction allows early exit from the
loop.

Stream Gates

ANY instruction may read/write data directly from/to a channel buffer's data
queue by reading/writing special addresses in the Register File. These
special locations are called "stream gates", because they provide a gating
function that allows a stream of data to pass, a word or double-word at a
time, between the program and the array. The RF locations have no real
storage themselves, but pass data to/from the channel buffers over a
"back-door" port (bus).

There are two read gates (G0R, G1R) and two write gates (G0W, G1W). Each
gate has a programmable binding that specifies which channel buffer it is
connected to. Once bound, any instruction may pass data through the channel
as a side effect of an ordinary read/write of an RF location. If data (or
space) is not immediately available in the channel buffer's queue, the
instruction "spins" before execution until the data (space) is available.
This allows direct program-to-program transfers without the costly overhead
of checking queue status via instructions. Spin timeouts are available to
catch deadlock or error conditions.
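The blocking read/write behavior of stream gates can be illustrated with a
small Python sketch. This is purely a behavioral model, not iWarp code: the
class and method names are illustrative, and the spin loop stands in for the
hardware stalling an instruction until data (or space) appears in the
channel buffer's 8-word queue.

```python
from collections import deque

class ChannelBuffer:
    """Toy model of an iWarp channel buffer's 8-word data queue.

    A bound stream gate passes data to/from this queue as a side effect
    of an ordinary register access; an access "spins" until data (or
    space) is available. Names here are illustrative, not from the spec.
    """
    QUEUE_DEPTH = 8

    def __init__(self):
        self.queue = deque()

    def gate_read(self, spin_limit=1000):
        # Spin until a word is available, or time out (deadlock catch).
        spins = 0
        while not self.queue:
            spins += 1
            if spins >= spin_limit:
                raise TimeoutError("read gate spin timeout")
        return self.queue.popleft()

    def gate_write(self, word, spin_limit=1000):
        # Spin until queue space is available, or time out.
        spins = 0
        while len(self.queue) >= self.QUEUE_DEPTH:
            spins += 1
            if spins >= spin_limit:
                raise TimeoutError("write gate spin timeout")
        self.queue.append(word)

# A bound gate is just a reference to a channel buffer: any "instruction"
# (here, ordinary Python code) moves data as a side effect of access.
buf = ChannelBuffer()
G0W, G0R = buf.gate_write, buf.gate_read
for w in (1, 2, 3):
    G0W(w)
assert [G0R(), G0R(), G0R()] == [1, 2, 3]
```

In the real hardware the spin happens before instruction execution and costs
no explicit status-checking instructions; the timeout models the spin
timeouts used to catch deadlock.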
LM Ports

The other special addresses in the Register File are the "LM ports", which
are "special" only in the Compute & Access instruction (see below). There
are two read locations (LMR1, LMR2) and one write location (LMW). Each port
has real storage associated with it (and is an ordinary register for non-C&A
instructions), but also has a "back-door" port to the external memory bus.
The C&A instruction uses this back-door port to perform faster memory
accesses (1 clock each).

Compute & Access Instruction

This is the workhorse instruction of the iWarp component. It is the sole
*long* instruction (96 bits), and is horizontal in nature, specifying the
parallel operation of nearly all functional units on the chip. The C&A
instruction can initiate:
- 1 FP adder operation (2 clocks SP).
- 1 FP multiplier operation (2 clocks SP).
- 2 back-to-back memory ops (each 1 clock): a memory read into LMR1, and
  either a memory read into LMR2 or a memory write from LMW. Each memory op
  specifies an address register and an offset register/literal, allowing
  pre/post-incrementing addresses with a constant/variable stride. Each
  memory access can be single- or double-word.
- An end-of-loop test + repeat.

ANY of the source operands (FP add, FP mult, memory address calc) can
specify a stream gate read (G0R, G1R). ANY of the destination operands can
specify a stream gate write (G0W, G1W).

This produces a peak performance of:
    20 MFLOPs
  + 20 MIPs (+ loop test/branch)
  + 160 Mbytes/sec memory ops
  + 80 Mbytes/sec sends
  + 80 Mbytes/sec receives
  (+ 160 Mbytes/sec express (through-cell) traffic)

Spools

Message-passing communication involves transferring blocks of data from the
memory of a sending cell to the memory of a receiving cell. iWarp provides
8 independent, programmable DMA interfaces between memory and the channel
buffers, called "spools".
Each spool is programmed with the channel number, two Register File
locations (buffer address and buffer limit), stride, data type (for endian
transformation) and direction (to/from memory). Once programmed and enabled,
a spool operation steals cycles from normal CPU operations *only* when data
is available. The data transfer stops when the buffer limit is reached or,
in the case of spools to memory, when a "delimiter" (non-data word) reaches
the top of the queue. When stopped, the spool records an event, which may
invoke a service routine.

A single spool can "max out" its associated outbound or inbound pathway at
40 Mbytes/sec. Multiple spools (corresponding to concurrent messages to/from
a cell) are scheduled in round-robin fashion, and max out at the memory
bandwidth of 160 Mbytes/sec.

A channel buffer can be attached to a spool at one end and a stream gate at
the other. This allows re-use of a code block in either systolic mode
(streaming directly to/from a pathway) or message-passing mode (streaming
to/from spools to/from memory).

Events

Because iWarp integrates a sophisticated communication manager with DMA
transfers and normal CPU core and floating-point activity, there are over
230 synchronous and asynchronous conditions ("events") that may require
direct or indirect servicing. iWarp collects these events in a two-tier
recording/reporting hierarchy. Low-level events are recorded in "group"
event registers, and each group is reported as a single group event to the
top-level EVENT register. There are individual reporting enables at both the
group and EVENT levels. Enabled events in the EVENT register cause automatic
invocation of a service routine, vectored through a 64-entry service-routine
table.

Types of events include the following:
- Connection arrival.
- Message arrival with matched name.
- Express connection flow stopped due to a programmed condition.
- Debug breakpoints.
- Spool events.
- Floating-point extension and error traps.
- Protection violations.
- Timeouts.
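The two-tier recording/reporting scheme can be sketched behaviorally in a
few lines of Python. This is an illustrative model only, assuming nothing
beyond what the text above describes: low-level events accumulate in group
registers, each group reports a single summary bit upward, and per-event and
per-group enables gate what reaches the top-level EVENT register.

```python
class EventHierarchy:
    """Toy model of iWarp's two-tier event recording/reporting.

    Group registers record low-level events; each group contributes one
    summary bit to the top-level EVENT register, gated by per-event and
    per-group enables. All names are illustrative, not from the spec.
    """
    def __init__(self, groups):
        # groups: {group_name: [event_name, ...]}
        self.recorded = {g: set() for g in groups}
        self.event_enable = {g: set(evs) for g, evs in groups.items()}
        self.group_enable = set(groups)

    def record(self, group, event):
        # Recording always happens, regardless of enables.
        self.recorded[group].add(event)

    def pending_event_register(self):
        """Groups whose enabled, recorded events report upward."""
        return {g for g in self.group_enable
                if self.recorded[g] & self.event_enable[g]}

ev = EventHierarchy({"spool": ["limit_reached", "delimiter"],
                     "fp": ["overflow"]})
ev.record("spool", "delimiter")
assert ev.pending_event_register() == {"spool"}
# Disabling a group masks its report without losing the recorded event.
ev.group_enable.discard("spool")
assert ev.pending_event_register() == set()
```

In hardware, each set bit in the EVENT register (if enabled) vectors through
the 64-entry service-routine table; the model stops at the reporting step.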
---------------------------------------
SYSTEM CONFIGURATIONS

Quad Cell Board (QCB)
- 4 processors in a 2x2 array (max 80 MFLOPs/80 MIPs).
- 0.5/2.0 Mbytes per processor on board.
- Daughter-board expansion to 1.5/4.5/6.0 Mbytes per processor.

Card Cage Assembly (CCA)
- Up to 16 QCBs (max 64 nodes, 1280 MFLOPs/1280 MIPs).
- Clock board, fan and power supply.

System Cabinet
- Up to 4 CCAs (max 256 nodes, 5120 MFLOPs/5120 MIPs).

Multi-Cabinet Systems
- Connected with external cabling.
- Up to 4 System Cabinets (max 1024 nodes, 20480 MFLOPs/20480 MIPs).

SBA (Single Board Array)
- QCB with Sun form factor.
- 4 processors in a 2x2 array.
- 0.5/1.0/2.0/4.0 Mbytes per processor.

SIB (System Interface Board)
- Single processor node.
- 0.5/1 Mbyte main memory, 64/256 Kbytes dual-port RAM.
- VME interface to host.

SBA System
- Max supported by a single Sun workstation:
  1 SIB + up to 8 SBAs (2x16 array, 640 MFLOPs/640 MIPs).

Currently shipping 10 MHz systems, with 20 MHz systems expected early '92.

---------------------------------------
SOFTWARE

The following are available now:
- Pathlib - low-level interface for systolic communication.
- RTS - Run-Time System (basic kernel).
- C - standard K&R, with global optimizations, assembler inlining and
  iWarp comm extensions.
- Apply - image-processing parallel program generator.

The following are expected in Q4 '91:
- RTS enhancements.
- C enhancements (machine-dependent optimization, incl. SW pipelining).
- Fortran 77 with VMS extensions, C/Fortran cross-language function
  inlining and iWarp comm extensions.
- Symbolic debugger (based on GNU).

Pathlib, RTS and C are bundled with systems. Fortran and Apply are extra
cost. The symbolic debugger will be bundled with systems.

---------------------------------------
FURTHER INFORMATION

5-day training classes cover system and communications architecture, use of
the software tools, application development, and optimization tricks.
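The system-level peak figures quoted in the configuration list above all
follow from linear scaling of the 20 MFLOPs/20 MIPs per-node peak. A quick
arithmetic check (configuration labels are shorthand, not product names):

```python
# Per-node peak at 20 MHz, as stated in the component description.
PER_NODE_MFLOPS = 20

# Node counts for each configuration described above.
configs = {
    "QCB (2x2)":         4,
    "CCA (16 QCBs)":     64,
    "Cabinet (4 CCAs)":  256,
    "4 Cabinets":        1024,
    "SBA system (2x16)": 32,
}

for name, nodes in configs.items():
    print(f"{name}: {nodes * PER_NODE_MFLOPS} MFLOPs")
# Yields 80, 1280, 5120, 20480 and 640 MFLOPs respectively,
# matching the figures quoted in the configuration list.
```

The MIPs figures scale identically, since the per-node peak is also
20 MIPs.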
Marketing Contact:
    Paul Wiley, Marketing Manager
    5200 NE Elam Young Pkwy, CO4-02, Hillsboro, OR 97124-6497
    (503) 629-6350   fax: (503) 629-6367
    wiley@iwarp.intel.com

There have been numerous iWarp-related papers in journals and conference
proceedings over the last 2 years. A good starting list would include:

  iWarp: A 100-MOPS LIW Microprocessor for Multicomputers
    C. Peterson, J. Sutton, P. Wiley, IEEE Micro, June 1991

  iWarp: An Integrated Solution to High Speed Parallel Processing
    S. Borkar, et al., Proc. Supercomputing '88, IEEE CS Press, Nov 1988,
    pp 300-339

  Supporting Systolic and Memory Communication in iWarp
    S. Borkar, et al., Proc. 17th Intl Symposium on Computer Architecture,
    IEEE CS Press, May 1990, pp 70-81

  Communication in iWarp Systems
    T. Gross, Proc. Supercomputing '89, IEEE CS Press, Nov 1989, pp 436-435

  Apply: A Parallel Compiler for Image Processing Applications
    B. Baxter and B. Greer, Proc. 6th Distributed Memory Computing Conf,
    Apr 29-May 2 1991 <to be published, July?>

----------------------------------------------------------------------------
Jim Sutton, Sr Staff Engineer, intel/iWarp Program   jsutton@iWarp.intel.com
5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124   (503) 629-6345