[comp.parallel] iWarp Architecture Overview

jsutton@iWarp.intel.com (Jim Sutton) (06/01/91)

Over the past few months, there have been sporadic articles and requests
for information concerning the iWarp project appearing in the comp.arch
and comp.parallel newsgroups.  It seems appropriate at this time to provide
a brief description of what an iWarp is, for those of you who may be feeling
left out.

My credentials:  I have been with iWarp since Q3 86.  My primary function
was component/systems architect, but I also had responsibility for the
design of two of the functional units.

This overview provides only a brief, global view of iWarp's architecture and
capabilities, and does not attempt to cover details (our architectural spec
used by the software development teams is over 450 pages!).  Pointers to
further information are provided at the end of this article for those who
would like additional details.

All performance numbers shown are for 20 MHz systems.

---------------------------------------
SYSTEM ARCHITECTURE

iWarp is both a component and system design for scalable parallel computing
systems, aimed at supporting both systolic (fine-grained) and message-passing
(coarse-grained) applications requiring high-speed communications.  Systolic
applications involve few (typically <10) calculations per data element, so a
*balance* between I/O and CPU performance is as critical as raw CPU MFLOPs.

iWarp's primary design focus was on integrating high-performance I/O into
the processor in a way that allows the computational units to have balanced
access to both the I/O channels and memory.  We made a conscious decision to
avoid expensive, experimental (or simply difficult) CPU features found on
current leading-edge processors, to allow us to focus time (and silicon) on
these goals.

An iWarp cell (i.e., node) is comprised of a single iWarp component and a
bank of fast static RAM.  iWarp cells are joined by directly connecting the
iWarp components through the four bidirectional pathways.  Typical
configurations are 1D (linear) and 2D (mesh) arrays, although other
arrangements are possible.

iWarp is the first commercial product to implement virtual routing channels
in hardware, and extends this concept to *long-lived* virtual channels
(which we call connections) that form a unidirectional path of reserved
resources from the terminal source node to the terminal destination node.
Once a connection has been established, one or more messages may be shipped
along this path.  Connections may also be *shared*; that is, one or more
cells along the path may be senders, and one or more cells along the path
may be receivers.  Connections use a form of street-sign addressing to
choose a routing path through the array; messages have names that identify
the destination cell(s) or process(es) within the connection. iWarp supports
20 connections, with independent buffering, control and status resources for
each.
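The street-sign idea can be pictured with a small behavioral sketch in C.
The header encoding, the (corner, new-direction) pair representation, and the
`route_step` helper below are my own illustrative assumptions, not the actual
iWarp header format: the connection header carries a list of turns, and each
cell forwards the connection straight on unless it is the next listed corner.

```c
#include <assert.h>

enum dir { EAST, WEST, NORTH, SOUTH };

/* Hypothetical street-sign entry: "turn to new_dir at corner_cell". */
struct turn { int corner_cell; enum dir new_dir; };

/* Decide the outbound direction at cell `here`, consuming the next
 * turn entry when this cell is the listed corner; otherwise the
 * connection continues straight through in its current direction. */
enum dir route_step(int here, enum dir cur_dir,
                    const struct turn *turns, int nturns, int *next)
{
    if (*next < nturns && turns[*next].corner_cell == here) {
        enum dir d = turns[*next].new_dir;
        (*next)++;
        return d;
    }
    return cur_dir;   /* not a corner: keep going straight */
}
```

Cells that are not corners never consult software; only the listed corners
change the path, which is what keeps the routing state per-connection small.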

---------------------------------------
COMPONENT ARCHITECTURE

The iWarp component consists of two essentially asynchronous agents: a
Computation Agent and a Communication Agent.  The Comp Agent contains
integer core hardware, floating point hardware, memory interface and a
128 word Register File (RF).  The Comm Agent handles all pathway traffic, and
manipulation of the virtual channel buffers. 

Following is a brief summary of the iWarp component features.
Unusual features are discussed in the next section.

  Communications
  - 4 bidirectional pathways.
    * Built as unidirectional bus pairs, each 8-bits (+control) wide.
    * 40 Mbytes/sec per bus (320 Mbytes/sec aggregate).
  - 4-entry Address Match CAM handles cell streetsigns and message names.
  - 20 channel buffers (called PCT records)
    * Allows 20 simultaneous connections through/to/from the cell.
      "Express" (thru-cell) traffic is not blocked by "inbound" or
      "outbound" traffic to/from the cell.
    * Express traffic handled automatically, without SW intervention and
      without stealing CPU cycles.
    * 8-word data queue provides smoothing.
    * Programmable stop conditions allow channel to be automatically
      converted from "express" (data flowing through cell) to "inbound"
      (data examined/consumed by cell), and vice-versa.
  - Independent round-robin scheduling for each outbound pathway.

  128 word Register File (RF)
  - 118 general purpose registers + 10 special addresses.
  - Accessible as 8/16/32/64-bit fields on natural boundaries.
  - Heavily multi-ported:
    * Separate ports for integer core, FP adder, FP multiplier allow
      simultaneous access. Each port supports 2-reads/1-write per clock.
    * Special "back-door" ports tied directly to specific registers:
      LM ports and stream gates (see "Special Features").

  Floating Point
  - Adder and Multiplier units:
    * Non-pipelined 2 clock SP, 4 clock DP basic operations.
    * Each 10 MFLOPs SP, 5 MFLOPs DP.
  - 32/64-bit IEEE P754.
    * Full trap support and fast mode, four rounding modes.
  - 3 operand instructions (2 src, 1 dest)
  - Result bypass between adder/multiplier.
  - Multiplier supports divide, remainder and square root instructions.
  - Integer-to-float, float-to-integer, pack and unpack operations.

  Integer Arithmetic
  - 8/16/32-bit data operations.
  - All 1-clock operations; 20 MIPs performance.
  - Arithmetic, shift/rotate, logical and bit operations.
  - 2 operand instructions (1 src, 1 src/dest)
  - Most ops allow 8-bit literal value as 2nd operand.
  - Result bypass to any other core (non-FP) instruction.

  Memory Operations
  - 8/16/32/64-bit load/store instructions.
    * Pre- and post-increment addressing.
    * Small-constant or variable address stride.
    * Big-endian/little-endian transformations for all data types.
  - Internal 256 word instruction cache, 4-way set associative.
  - Internal 2K word instruction ROM.
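The endian transformations in the load/store path amount to per-type byte
swaps.  A minimal sketch in plain C (these helpers are mine for illustration,
not iWarp code):

```c
#include <stdint.h>

/* Reverse byte order of a 16-bit halfword. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverse byte order of a 32-bit word. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}
```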

  Support Operations/Features
  - Stack: push/pop, call/return, allocate/release.
  - Branch and call targets: direct, register-indirect, memory-indirect.
  - System call to protected code through table lookup.
  - Control/status register load/store.
  - Register indirection (dereferences).
  - Four separate sets of flags: FP adder, FP mult, integer and "other".

---------------------------------------
SPECIAL FEATURES

  Hardware Loop Support

     An ENTERLOOP instruction initializes dedicated hardware, including the
     loop-start address (the next instruction), an iteration count, and an
     optional conditional code (e.g., carry flag false).  Most instructions
     have an ENDLOOP bit.  If set, then the loop count is decremented, and
     the loop is repeated if the count is non-zero OR the specified condition
     (resulting from a preceding instruction) is false.  This all occurs in
     parallel with normal instruction execution.

     Loops can be nested.  The ENTERLOOP instruction saves the current loop
     controls on the stack before changing them; exiting the loop restores
     them.  A mode in the BRANCH instruction allows early exit from the loop.

  Stream Gates

     ANY instruction may read/write data directly from/to a channel buffer's
     data queue by reading/writing special addresses in the Register File.
     These special locations are called "stream gates", because they provide
     a gating function that allows a stream of data to pass, a word or double-
     word at a time, between the program and the array. The RF locations have
     no real storage themselves, but pass data to/from the channel buffers
     over a "backdoor" port (bus).

     There are two read gates (G0R,G1R) and two write gates (G0W,G1W).  Each
     gate has a programmable binding that specifies which channel buffer it
     is connected to.  Once bound, any instruction may pass data through the
     channel as a side effect of an ordinary read/write of an RF location.

     If data (or space) is not immediately available in the channel buffer's
     queue, the instruction "spins" before execution, until the data (space)
     is available.  This allows direct program-to-program transfers without
     the costly overhead of checking queue status via instructions.  Spin
     timeouts are available to catch deadlock or error conditions.
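A behavioral sketch of a gate read, assuming the 8-word queue described
above; the `chan_queue` structure, the software spin counter, and the return
convention are illustrative C, not a description of the hardware:

```c
#include <stdint.h>

#define QDEPTH 8   /* 8-word data queue per channel buffer */

struct chan_queue { uint32_t q[QDEPTH]; int head, tail, count; };

/* Model of a stream-gate read: spin until a word is available in
 * the bound channel's queue, then dequeue it.  Returns 0 on
 * success, -1 if the spin limit (timeout) is exceeded. */
int gate_read(struct chan_queue *c, uint32_t *out, int spin_limit)
{
    int spins = 0;
    while (c->count == 0)              /* spin until data arrives */
        if (++spins > spin_limit)
            return -1;                 /* timeout: deadlock/error */
    *out = c->q[c->head];
    c->head = (c->head + 1) % QDEPTH;
    c->count--;
    return 0;
}
```

The point of doing this in hardware is that the "spin" costs no explicit
status-checking instructions in the common case where data is already there.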

  LM Ports

     The other special addresses in the Register File are "LM Ports", and are
     "special" only in the Compute & Access instruction (see below).  There
     are two read locations (LMR1,LMR2) and one write location (LMW).  Each
     port has real storage associated with it (and is an ordinary register
     for non-C&A instructions), but also has a "backdoor" port to the external
     memory bus.  The C&A instruction uses this back-door port to perform
     faster memory accesses (1 clock each).

  Compute & Access Instruction

     This is the workhorse instruction of the iWarp component.  It is the sole
     *long* instruction (96 bits), and is horizontal in nature, specifying the
     parallel operation of nearly all functional units on the chip.  The C&A
     instruction can initiate:
        1 FP adder operation (2-clocks SP)
        1 FP multiplier operation (2-clocks SP)
        2 back-to-back memory ops (each 1-clock):
          A memory read into LMR1, and either a memory read into LMR2 or a
          memory write from LMW. Each memory op specifies an address register
          and an offset register/literal, allowing pre/post-incrementing
          addresses, with a constant/variable stride. Each memory access
          can be single- or double-word.
        end-of-loop test + repeat

     ANY of the source operands (FP add, FP mult, memory address calc) can
     specify a stream gate read (G0R,G1R).  ANY of the destination operands
     can specify a stream gate write (G0W,G1W).

     This produces a peak performance of:
           20 MFLOPs
        +  20 MIPs (+ loop_test/branch)
        + 160 Mbytes/sec memory ops
        +  80 Mbytes/sec sends
        +  80 Mbytes/sec receives
       (+ 160 Mbytes/sec express (thru cell) traffic)
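One plausible reading of the arithmetic behind these peaks, at a 20 MHz
clock with a C&A issuing every 2 clocks (SP): the adder and multiplier each
complete one operation per instruction, the two 1-clock memory ops can each
move a double word (8 bytes), and the gates move one word (4 bytes) per
clock in each direction.  A small check (function names are mine):

```c
/* Peak-rate arithmetic at a given clock, per the reading above. */
int peak_mflops(int mhz)   { return 2 * (mhz / 2); }  /* add+mult per 2-clock C&A */
int peak_mem_mbs(int mhz)  { return 8 * mhz; }        /* one 8-byte access/clock  */
int peak_send_mbs(int mhz) { return 4 * mhz; }        /* one 4-byte word/clock    */
```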

  Spools

     Message-passing communication involves transferring blocks of data from
     the memory of a sending cell to the memory of a receiving cell. iWarp
     provides 8 independent, programmable DMA interfaces between memory and
     the channel buffers, called "spools".  Each spool is programmed with
     the channel number, two Register File locations (buffer address and buffer
     limit), stride, data-type (for endian-transformation) and direction
     (to/from memory).  Once programmed and enabled, a spool operation will
     steal cycles from normal CPU operations *only* when data is available.

     The data transfer stops when the buffer limit is reached, or, in the
     case of spools-to-memory, when a "delimiter" (non-data word) reaches
     the top of the queue.  When stopped, the spool records an event, which
     may invoke a service routine.

     A single spool can "max-out" its associated outbound or inbound pathway,
     at 40 Mbytes/sec. Multiple spools (corresponding to concurrent messages
     to/from a cell) are scheduled in a round-robin fashion, and will max-out
     at the memory bandwidth of 160 Mbytes/sec.
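The round-robin cycle stealing can be sketched as a simple arbiter in C; the
`spool` structure and `spool_arbitrate` helper are illustrative assumptions,
not the hardware interface:

```c
#define NSPOOLS 8

struct spool { int enabled; int data_ready; };

/* Grant a stolen memory cycle to the next ready spool, starting the
 * search just after the most recently served one so concurrent
 * spools share bandwidth fairly.  Returns the spool index, or -1
 * if no spool is ready (the CPU keeps the cycle). */
int spool_arbitrate(const struct spool *s, int *last_served)
{
    int i;
    for (i = 1; i <= NSPOOLS; i++) {
        int idx = (*last_served + i) % NSPOOLS;
        if (s[idx].enabled && s[idx].data_ready) {
            *last_served = idx;
            return idx;
        }
    }
    return -1;
}
```

The -1 case is the common one whenever no message data is in flight, which
is why spools steal cycles "*only* when data is available".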

     A channel buffer can be attached to a spool at one end, and a stream
     gate at the other.  This allows re-use of a code block in either systolic
     mode (streaming directly to/from the pathway) or message-passing mode
     (streaming to/from spools to/from memory).

  Events

     Because iWarp integrates a sophisticated communication manager with DMA
     transfers and normal CPU core and floating point activity, there are over
     230 synchronous and asynchronous conditions ("events") that may require
     direct or indirect servicing.  iWarp collects these events in a two-tier
     recording/reporting hierarchy.

     Low-level events are recorded in "group" event registers, and reported as
     a single group event to the top-level EVENT register.  There are individual
     reporting enables at both group and EVENT levels.  Enabled events in the
     EVENT register cause automatic invocation of a service routine, vectored
     through a 64-entry service-routine table.
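The two-tier recording/reporting scheme can be sketched in C; the register
widths, the 16-group split, and all names here are illustrative assumptions
rather than the actual iWarp layout:

```c
#include <stdint.h>

#define NGROUPS 16

struct event_unit {
    uint32_t group[NGROUPS];        /* recorded low-level events   */
    uint32_t group_enable[NGROUPS]; /* per-group reporting enables */
    uint32_t event_enable;          /* top-level EVENT enables     */
};

/* Report enabled group events to the top level and return the
 * service-routine vector of the lowest pending enabled group, or
 * -1 if nothing needs service. */
int event_vector(const struct event_unit *eu)
{
    int g;
    for (g = 0; g < NGROUPS; g++) {
        uint32_t pending = eu->group[g] & eu->group_enable[g];
        if (pending && (eu->event_enable & (1u << g)))
            return g;   /* index into the service-routine table */
    }
    return -1;
}
```

Masking at both tiers lets software defer whole classes of events without
losing the low-level record of what occurred.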

     Types of events include the following:
     - Connection arrival.
     - Message arrival with matched name.
     - Express connection flow stopped due to programmed condition.
     - Debug breakpoints.
     - Spool events.
     - Floating point extension and error traps.
     - Protection violations.
     - Timeouts.

---------------------------------------
SYSTEM CONFIGURATIONS

  Quad Cell Board (QCB)
  - 4 processors, in a 2x2 array (max 80 MFLOPs/80 MIPs).
  - .5/2.0 Mbyte per processor on board.
  - daughter board expansion to: 1.5/4.5/6.0 Mbytes per processor.
  Card Cage Assembly (CCA)
  - Up to 16 QCBs (max 64 nodes, 1280 MFLOPs/1280 MIPs).
  - Clock board, fan and power supply.
  System Cabinet
  - Up to 4 CCAs (max 256 nodes, 5120 MFLOPs/5120 MIPs).
  Multi-Cabinet Systems
  - Connected with external cabling.
  - Up to 4 System Cabinets (max 1024 nodes, 20480 MFLOPs/20480 MIPs).

  SBA (Single Board Array)
  - QCB with Sun form factor
  - 4 processors, in a 2x2 array
  - .5/1.0/2.0/4.0 Mbyte per processor
  SIB (System Interface Board)
  - Single processor node
  - .5/1 Mbyte main memory, 64/256 Kbyte dual-port RAM
  - VME interface to host.
  SBA System
  - Max supported by single Sun workstation:
    1 SIB + up to 8 SBAs (2x16 array, 640 MFLOPs/640 MIPs)

  Currently shipping 10 MHz systems, with 20 MHz systems expected in early '92.

---------------------------------------
SOFTWARE

  The following are available now:
     Pathlib- Low level interface for systolic communication.
     RTS-     Run Time System (basic kernel).
     C-       Standard K&R, with global optimizations, assembler inlining
              and iWarp comm extensions.
     Apply-   Image processing parallel program generator.

  The following are expected in Q4 91.
     RTS enhancements.
     C enhancements (machine dependent optimization, incl SW pipelining).
     Fortran 77 with VMS extensions, C/Fortran cross-language function
        inlining and iWarp comm extensions.
     Symbolic debugger (based on GNU).

  Pathlib, RTS, and C are bundled with systems.  Fortran and Apply are
  extra cost.  Symbolic debugger will be bundled with systems.

---------------------------------------
FURTHER INFORMATION

  5-day training classes cover system and communications architecture,
  use of the software tools, application development, and optimization
  tricks.

  Marketing Contact:
    Paul Wiley, Marketing Manager
    5200 NE Elam Young Pkwy, CO4-02, Hillsboro, OR  97124-6497
    (503) 629-6350    fax: (503) 629-6367       wiley@iwarp.intel.com

  There have been numerous iWarp-related papers in journals and conference
  proceedings over the last 2 years.  A good starting list would include:

    iWarp: A 100-MOP LIW Microprocessor for Multicomputers
    C.Peterson, J.Sutton, P.Wiley, IEEE Micro, June 1991

    iWarp: An Integrated Solution to High Speed Parallel Processing
    S.Borkar, et al., Proc. Supercomputing '88, IEEE CS Press, Nov 1988,
    pp 300-339

    Supporting Systolic and Memory Communication in iWarp
    S.Borkar, et al., Proc. 17th Intl Symposium on Computer Architecture
    IEEE CS Press, May 1990, pp 70-81

    Communication in iWarp Systems
    T.Gross, Proc. Supercomputing '89, IEEE CS Press, Nov 1989, pp 436-445

    Apply: A Parallel Compiler for Image Processing Applications
    B.Baxter and B.Greer, Proc. 6th Distributed Memory Computing Conf,
    Apr 29-May 2 1991 <to be published, July?>

 ----------------------------------------------------------------------------
 Jim Sutton, Sr Staff Engineer, intel/iWarp Program   jsutton@iWarp.intel.com
 5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124             (503)629-6345
