jsutton@iWarp.intel.com (Jim Sutton) (06/01/91)
Over the past few months, sporadic articles and requests for information
concerning the iWarp project have appeared in the comp.arch and
comp.parallel newsgroups. It seems appropriate at this time to provide
a brief description of what an iWarp is, for those of you who may be feeling
left out.
My credentials: I have been with iWarp since Q3 86. My primary function
was component/systems architect, but I also had responsibility for the
design of two of the functional units.
This overview provides only a brief, global view of iWarp's architecture and
capabilities, and does not attempt to cover details (our architectural spec
used by the software development teams is over 450 pages!). Pointers to
further information are provided at the end of this article for those who
would like additional details.
All performance numbers shown are for 20 MHz systems.
---------------------------------------
SYSTEM ARCHITECTURE
iWarp is both a component and system design for scalable parallel computing
systems, aimed at supporting both systolic (fine-grained) and message-passing
(coarse-grained) applications requiring high-speed communications. Systolic
applications involve few (typically <10) calculations per data element, so a
*balance* between I/O and CPU performance is as critical as raw CPU MFLOPs.
iWarp's primary design focus was on integrating high-performance I/O into
the processor in a way that allows the computational units to have balanced
access to both the I/O channels and memory. We made a conscious decision to
avoid expensive, experimental (or simply difficult) CPU features found on
current leading-edge processors, to allow us to focus time (and silicon) on
these goals.
An iWarp cell (i.e., node) comprises a single iWarp component and a
bank of fast static RAM. iWarp cells are joined by directly connecting the
iWarp components through their four bidirectional pathways. Typical
configurations are 1D (linear) and 2D (mesh) arrays, although other
arrangements are possible.
iWarp is the first commercial product to implement virtual routing channels
in hardware, and extends this concept to *long-lived* virtual channels
(which we call connections) that form a unidirectional path of reserved
resources from the terminal source node to the terminal destination node.
Once a connection has been established, one or more messages may be shipped
along this path. Connections may also be *shared*; that is, one or more
cells along the path may be senders, and one or more cells along the path
may be receivers. Connections use a form of street-sign addressing to
choose a routing path through the array; messages have names that identify
the destination cell(s) or process(es) within the connection. iWarp supports
20 connections, with independent buffering, control and status resources for
each.
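To make the model concrete, here is a rough C sketch of the connection/
message distinction. All of the function and constant names below are
hypothetical illustrations, NOT the actual Pathlib interface:

    /* Sketch only: hypothetical names, not the real Pathlib API.    */
    /* A connection is opened once along a street-sign route, then   */
    /* carries any number of named messages before it is closed.     */
    void send_two(float *buf1, int n1, float *buf2, int n2)
    {
        int c = iw_open_connection("E2 N3");    /* 2 east, 3 north   */
        iw_send_message(c, ROW_DATA, buf1, n1); /* named message     */
        iw_send_message(c, COL_DATA, buf2, n2); /* reuses same path  */
        iw_close_connection(c);
    }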
---------------------------------------
COMPONENT ARCHITECTURE
The iWarp component consists of two essentially asynchronous agents: a
Computation Agent and a Communication Agent. The Comp Agent contains
integer core hardware, floating point hardware, memory interface and a
128 word Register File (RF). The Comm Agent handles all pathway traffic
and all manipulation of the virtual channel buffers.
Following is a brief summary of the iWarp component features.
Unusual features are discussed in the next section.
Communications
- 4 bidirectional pathways.
* Built as unidirectional bus pairs, each 8-bits (+control) wide.
* 40 Mbytes/sec per bus (320 Mbytes/sec aggregate).
- 4-entry Address Match CAM handles cell street signs and message names.
- 20 channel buffers (called PCT records)
* Allows 20 simultaneous connections through/to/from the cell.
"Express" (thru-cell) traffic is not blocked by "inbound" or
"outbound" traffic to/from the cell.
* Express traffic handled automatically, without SW intervention and
without stealing CPU cycles.
* 8-word data queue provides smoothing.
* Programmable stop conditions allow channel to be automatically
converted from "express" (data flowing through cell) to "inbound"
(data examined/consumed by cell), and vice-versa.
- Independent round-robin scheduling for each outbound pathway.
128 word Register File (RF)
- 118 general purpose registers + 10 special addresses.
- Accessible as 8/16/32/64-bit fields on natural boundaries.
- Heavily multi-ported:
* Separate ports for integer core, FP adder, FP multiplier allow
simultaneous access. Each port supports 2-reads/1-write per clock.
* Special "back-door" ports tied directly to specific registers:
LM ports and stream gates (see "Special Features").
Floating Point
- Adder and Multiplier units:
* Non-pipelined 2 clock SP, 4 clock DP basic operations.
* Each 10 MFLOPs SP, 5 MFLOPs DP.
- 32/64-bit IEEE P754.
* Full trap support and fast mode, four rounding modes.
- 3 operand instructions (2 src, 1 dest)
- Result bypass between adder/multiplier.
- Multiplier supports divide, remainder and square root instructions.
- Integer-to-float, float-to-integer, pack and unpack operations.
Integer Arithmetic
- 8/16/32-bit data operations.
- All 1-clock operations; 20 MIPs performance.
- Arithmetic, shift/rotate, logical and bit operations.
- 2 operand instructions (1 src, 1 src/dest)
- Most ops allow 8-bit literal value as 2nd operand.
- Result bypass to any other core (non-FP) instruction.
Memory Operations
- 8/16/32/64-bit load/store instructions.
* Pre- and post-increment addressing.
* Small-constant or variable address stride (see the C sketch below).
* Big-endian/little-endian transformations for all data types.
- Internal 256 word instruction cache, 4-way set associative.
- Internal 2K word instruction ROM.
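In C terms, the strided pre/post-increment addressing corresponds to loops
like the following, where each load plus address update can become a single
memory operation. This is a sketch of the access pattern, not compiler
output:

    /* Sum a strided vector (e.g., a matrix column).  The load and   */
    /* the "p += stride" update map onto one post-incrementing       */
    /* access with a register-specified stride.                      */
    double strided_sum(double *a, int n, int stride)
    {
        double sum = 0.0;
        double *p = a;          /* address register                  */
        int i;
        for (i = 0; i < n; i++) {
            sum += *p;          /* load ...                          */
            p += stride;        /* ... then post-increment           */
        }
        return sum;
    }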
Support Operations/Features
- Stack: push/pop, call/return, allocate/release.
- Branch and call targets: direct, register-indirect, memory-indirect.
- System call to protected code through table lookup.
- Control/status register load/store.
- Register indirection (dereferences).
- Four separate sets of flags: FP adder, FP mult, integer and "other".
---------------------------------------
SPECIAL FEATURES
Hardware Loop Support
An ENTERLOOP instruction initializes dedicated hardware, including the
loop-start address (the next instruction), an iteration count, and an
optional condition code (e.g., carry flag false). Most instructions
have an ENDLOOP bit. If set, the loop count is decremented, and
the loop is repeated if the count is non-zero OR the specified condition
(resulting from a preceding instruction) is false. This all occurs in
parallel with normal instruction execution.
Loops can be nested: the ENTERLOOP instruction saves the current loop
controls on the stack before loading the new ones, and exiting the loop
restores the saved controls. A mode in the BRANCH instruction allows
early exit from the loop.
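In compiler terms, the loop hardware removes the usual decrement/test/branch
overhead from counted loops. An ordinary C kernel like the one below (shown
only for orientation) can run with all iteration control in parallel with
the loop body:

    /* Counted inner loop: ENTERLOOP sets the count once, and the    */
    /* ENDLOOP bit on the last body instruction repeats the loop     */
    /* with no extra cycle cost per iteration.                       */
    void saxpy(float *y, float *x, float a, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
    }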
Stream Gates
ANY instruction may read/write data directly from/to a channel buffer's
data queue by reading/writing special addresses in the Register File.
These special locations are called "stream gates", because they provide
a gating function that allows a stream of data to pass, a word or
double-word at a time, between the program and the array. The RF
locations have no real storage themselves, but pass data to/from the
channel buffers over a "backdoor" port (bus).
There are two read gates (G0R,G1R) and two write gates (G0W,G1W). Each
gate has a programmable binding that specifies which channel buffer it
is connected to. Once bound, any instruction may pass data through the
channel as a side effect of an ordinary read/write of an RF location.
If data (or space) is not immediately available in the channel buffer's
queue, the instruction "spins" before execution, until the data (space)
is available. This allows direct program-to-program transfers without
the costly overhead of checking queue status via instructions. Spin
timeouts are available to catch deadlock or error conditions.
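Conceptually (using hypothetical intrinsics here, since the actual C comm
extensions are not described in this overview), a systolic stage then looks
like straight-line code with no explicit queue management at all:

    /* Hypothetical gate intrinsics; a sketch of the model only.     */
    /* Each gate access blocks ("spins") until data (or space) is    */
    /* available, so no queue-status checks appear in the code.      */
    void scale_stage(float a, float b, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            float x = iw_gate_read(G0R);    /* pop inbound queue     */
            iw_gate_write(G0W, a * x + b);  /* push downstream       */
        }
    }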
LM Ports
The other special addresses in the Register File are "LM Ports", and are
"special" only in the Compute & Access instruction (see below). There
are two read locations (LMR1,LMR2) and one write location (LMW). Each
port has real storage associated with it (and is an ordinary register
for non-C&A instructions), but also has a "backdoor" port to the external
memory bus. The C&A instruction uses this back-door port to perform
faster memory accesses (1-clock each).
Compute & Access Instruction
This is the workhorse instruction of the iWarp component. It is the sole
*long* instruction (96 bits), and is horizontal in nature, specifying the
parallel operation of nearly all functional units on the chip. The C&A
instruction can initiate:
1 FP adder operation (2-clocks SP)
1 FP multiplier operation (2-clocks SP)
2 back-to-back memory ops (each 1-clock):
A memory read into LMR1, and either a memory read into LMR2 or a
memory write from LMW. Each memory op specifies an address register
and an offset register/literal, allowing pre/post-incrementing
addresses, with a constant/variable stride. Each memory access
can be single- or double-word.
end-of-loop test + repeat
ANY of the source operands (FP add, FP mult, memory address calc) can
specify a stream gate read (G0R,G1R). ANY of the destination operands
can specify a stream gate write (G0W,G1W). A sketch of the resulting
inner-loop pattern appears after the peak-performance figures below.
This produces a peak performance of:
20 MFLOPs
+ 20 MIPs (+ loop_test/branch)
+ 160 Mbytes/sec memory ops
+ 80 Mbytes/sec sends
+ 80 Mbytes/sec receives
(+ 160 Mbytes/sec express (thru cell) traffic)
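As a concrete illustration (again with hypothetical intrinsic names),
consider a systolic multiply-accumulate stage. Each loop iteration below
corresponds to a single C&A instruction:

    /* Each iteration ~ one C&A instruction:                         */
    /*   gate read (G0R) + memory read of w[i] (via LMR1)            */
    /*   + FP multiply + FP add + gate write (G0W)                   */
    /*   + end-of-loop test and repeat.                              */
    /* iw_gate_read/iw_gate_write are hypothetical intrinsics.       */
    float mac_stage(float *w, int n)
    {
        float acc = 0.0f;
        int i;
        for (i = 0; i < n; i++) {
            float x = iw_gate_read(G0R);  /* stream in from pathway  */
            acc += w[i] * x;              /* multiply-accumulate     */
            iw_gate_write(G0W, x);        /* pass x to next cell     */
        }
        return acc;
    }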
Spools
Message-passing communication involves transferring blocks of data from
the memory of a sending cell to the memory of a receiving cell. iWarp
provides 8 independent, programmable DMA interfaces between memory and
the channel buffers, called "spools". Each spool is programmed with
the channel number, two Register File locations (buffer address and buffer
limit), stride, data-type (for endian-transformation) and direction
(to/from memory). Once programmed and enabled, a spool operation will
steal cycles from normal CPU operations *only* when data is available.
The data transfer stops when the buffer limit is reached, or, in the
case of spools-to-memory, when a "delimiter" (non-data word) reaches
the top of the queue. When stopped, the spool records an event, which
may invoke a service routine.
A single spool can "max-out" its associated outbound or inbound pathway,
at 40 Mbytes/sec. Multiple spools (corresponding to concurrent messages
to/from a cell) are scheduled in a round-robin fashion, and will max-out
at the memory bandwidth of 160 Mbytes/sec.
A channel buffer can be attached to a spool at one end, and a stream
gate at the other. This allows re-use of a code block in either systolic
mode (streaming directly to/from the pathway) or message-passing mode
(streaming to/from spools to/from memory).
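In rough C terms (with hypothetical names once more; the RTS presumably
wraps this level), setting up a message receive via a spool might look like:

    /* Hypothetical sketch of spool (DMA) programming; not a real    */
    /* API.  Cycles are stolen only as message data arrives.         */
    void start_receive(int sp, int chan, long *buf, int maxwords)
    {
        iw_spool_program(sp, chan,
                         buf,              /* RF: buffer address     */
                         buf + maxwords,   /* RF: buffer limit       */
                         1,                /* stride                 */
                         WORD32,           /* type (endian xform)    */
                         SPOOL_TO_MEMORY); /* direction              */
        iw_spool_enable(sp);
        /* ... compute freely; the spool stops at the buffer limit   */
        /* or at a delimiter word, then posts an event.              */
    }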
Events
Because iWarp integrates a sophisticated communication manager with DMA
transfers and normal CPU core and floating point activity, there are over
230 synchronous and asynchronous conditions ("events") that may require
direct or indirect servicing. iWarp collects these events in a two-tier
recording/reporting hierarchy.
Low-level events are recorded in "group" event registers, and reported as
a single group event to the top-level EVENT register. There are individual
reporting enables at both the group and EVENT levels. Enabled events in the
EVENT register cause automatic invocation of a service routine, vectored
through a 64-entry service-routine table (a sketch follows the list below).
Types of events include the following:
- Connection arrival.
- Message arrival with matched name.
- Express connection flow stopped due to programmed condition.
- Debug breakpoints.
- Spool events.
- Floating point extension and error traps.
- Protection violations.
- Timeouts.
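A service routine can then work down the hierarchy: read the EVENT register
to find the reporting group, then the group register to find the specific
condition. A sketch, with hypothetical control-register intrinsics:

    /* Hypothetical CSR intrinsics; illustrates only the two-tier    */
    /* record/report model.                                          */
    void event_handler(void)
    {
        unsigned ev = iw_read_csr(CSR_EVENT);  /* which group fired  */
        if (ev & EV_SPOOL_GROUP) {
            unsigned grp = iw_read_csr(CSR_SPOOL_EVENTS);
            /* decode individual spool bits, service, clear, resume  */
        }
    }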
---------------------------------------
SYSTEM CONFIGURATIONS
Quad Cell Board (QCB)
- 4 processors, in a 2x2 array (max 80 MFLOPs/80 MIPs).
- 0.5/2.0 Mbytes per processor on board.
- Daughter-board expansion to 1.5/4.5/6.0 Mbytes per processor.
Card Cage Assembly (CCA)
- Up to 16 QCBs (max 64 nodes, 1280 MFLOPs/1280 MIPs).
- Clock board, fan and power supply.
System Cabinet
- Up to 4 CCAs (max 256 nodes, 5120 MFLOPs/5120 MIPs).
Multi-Cabinet Systems
- Connected with external cabling.
- Up to 4 System Cabinets (max 1024 nodes, 20480 MFLOPs/20480 MIPs).
SBA (Single Board Array)
- QCB with Sun form factor.
- 4 processors, in a 2x2 array.
- 0.5/1.0/2.0/4.0 Mbytes per processor.
SIB (System Interface Board)
- Single-processor node.
- 0.5/1 Mbyte main memory, 64/256 Kbyte dual-port RAM.
- VME interface to host.
SBA System
- Max supported by single Sun workstation:
1 SIB + up to 8 SBAs (2x16 array, 640 MFLOPs/640 MIPs)
Currently shipping 10 MHz systems, with 20 MHz systems expected in early '92.
---------------------------------------
SOFTWARE
The following are available now:
Pathlib - Low-level interface for systolic communication.
RTS     - Run Time System (basic kernel).
C       - Standard K&R, with global optimizations, assembler inlining
          and iWarp comm extensions.
Apply   - Image processing parallel program generator.
The following are expected in Q4 '91:
RTS enhancements.
C enhancements (machine-dependent optimizations, including SW pipelining).
Fortran 77 with VMS extensions, C/Fortran cross-language function
inlining and iWarp comm extensions.
Symbolic debugger (based on GNU).
Pathlib, RTS, and C are bundled with systems. Fortran and Apply are
extra cost. The symbolic debugger will be bundled with systems.
---------------------------------------
FURTHER INFORMATION
5-day training classes cover system and communications architecture,
use of the software tools, application development, and optimization
tricks.
Marketing Contact:
Paul Wiley, Marketing Manager
5200 NE Elam Young Pkwy, CO4-02, Hillsboro, OR 97124-6497
(503) 629-6350 fax: (503) 629-6367 wiley@iwarp.intel.com
There have been numerous iWarp-related papers in journals and conference
proceedings over the last 2 years. A good starting list would include:
iWarp: A 100-MOP LIW Microprocessor for Multicomputers
C.Peterson, J.Sutton, P.Wiley, IEEE Micro, June 1991
iWarp: An Integrated Solution to High Speed Parallel Processing
S.Borkar, et al., Proc. Supercomputing '88, IEEE CS Press, Nov 1988,
pp 330-339
Supporting Systolic and Memory Communication in iWarp
S.Borkar, et al., Proc. 17th Intl Symposium on Computer Architecture
IEEE CS Press, May 1990, pp 70-81
Communication in iWarp Systems
T.Gross, Proc. Supercomputing '89, IEEE CS Press, Nov 1989, pp 436-445
Apply: A Parallel Compiler for Image Processing Applications
B.Baxter and B.Greer, Proc. 6th Distributed Memory Computing Conf,
Apr 29-May 2 1991 <to be published, July?>
--
----------------------------------------------------------------------------
Jim Sutton, Sr Staff Engineer, intel/iWarp Program  jsutton@iWarp.intel.com
5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124  (503)629-6345