[comp.arch] iWARP notes... it's pretty neat

lethin@ai.mit.edu (Richard A. Lethin) (05/25/91)

I had the opportunity to attend a one-week iWARP training course at
Intel's facility in Hillsboro, Oregon a few weeks ago.  They've put a
great deal of effort into the course, so it was very informative. This was
the first week that they offered the course to prospective users, and
the iWARP system is still partly in development, so things weren't
completely smooth.

iWARP is very interesting, and worth people's while to examine.  While
the software is still a bit rough, it's progressing rapidly.  The
hardware works, is solid and definitely for-sale.

What follows is a sanitized version of my trip report.  Disclaimer:
it's subjective and probably contains errors.  I'm posting it to whet
your appetite for more iWARP information, and help the people who
sponsored my trip get their money's worth (perhaps encourage
researchers to build on their work instead of repeating it), and start
new comp.arch discussion threads.

Hardware
--------

iWARP is a single-chip microprocessor with on-chip floating point and
communication facilities.  The chip has no data cache or RAM, but does
have a small instruction cache. The silicon area is about (500 mil)^2,
and it was constructed from hand-placed and connected standard cells
(similar to the MDP, except that the MDP used some auto-place and
route for irregular cells).  The iWARP team consisted of about 60
people for 6 years, so the development cost was probably around $36M
(assuming $100k/person).  This was split between Intel and DARPA.

The clock speed is 20 MHz.  Chips are currently running a bit below
speed, but they're working to clean up the slow paths.

Intel doesn't sell iWARP chips, it sells iWARP systems.  CMU has one.
They've made at least one recent multimillion dollar sale of a large
system.

The regular iWARP card (Single-Board Array or SBA) includes four
processor chips; each node has 500 kbytes of ~50ns SRAM memory.  These
can plug into a custom backplane, or into a mothercard to adapt the
card to the VME bus.  The systems that we experimented with were 8
processor systems in a SPARC/VME box.  These systems also require a
System Interface Board (SIB).  The wiring of the inter-SBA connections
is very conservative - thick, long, shielded multiconductor cables.
Presumably, in the Intel-cabinet systems, the wiring is much more
efficient.

These boards are expensive: around $30,000 for the SBA and $15,000 for
the SIB.  I suppose that in terms of $/flops, one might consider them
a good deal, since the incremental cost for adding a few more flops is
another microprocessor and 500k of static RAM.

The instruction set is fairly conventional. Memory access is through
load and store operations, and simple integer operations bypass
results back and can execute in a single clock.  Floating point
operations are not pipelined.

In addition to the standard repertoire of "RISC" operations, the chip
also has pre- and post-increment on index registers for memory access,
nested looping instructions, push/pop/allocate/call, and special
communication control instructions.

The distinguishing feature of the iWARP instruction set is a VLIW-mode
96-bit long Compute & Access instruction (C&A).  An FP multiply, an FP
add, two memory operations, and a loop test can be issued and executed
in parallel.  A team of compiler people is working to make their
single-chip compiler produce this instruction.  Currently, it does
not.  However, the assembly language inlining is particularly
well-implemented and should allow one to hand-code an inner loop
seamlessly, painlessly, and efficiently.  The FP units are not
pipelined, so C&A instructions may take a couple of clocks to
complete.  Since the communication agent access is a register access,
C&A could be computing with two input streams and sending results to
two output streams.

The register file is 128 registers deep.  Most of the registers are
general purpose, but a few are special-purpose.  For example: special
addressing registers used for the C&A instruction and registers used
for streaming to/from the communication unit.

An alternate register space, "Control Space" of 1024 32-bit registers,
exists that allows the program to examine and modify control bits,
with special control-space-move instructions.  Some chip test points
can be examined in this space for diagnostic purposes.




Communication

A really interesting and innovative part of the processor design is
the communication architecture.  A lot of work and silicon area went
into it - it consumes 1/3 of the chip area in a clean, bit-sliced
design.  It's a set of very interesting low-level building blocks from
which one can construct a communication protocol.

Each processor has 4 "physical pathways" (4 incoming and 4 outgoing)
so it connects easily into a 2-D mesh.  However, aside from the
restriction that X-channels cannot connect to Y channels (they are a
half-clock out of phase) they could be connected in any topology.

Processors communicate with channels by streaming, i.e., reading and
writing to register-file-mapped gates, or via spooling - setting up a
state machine to transfer directly between main memory and the
communication channel.

The channels are very high bandwidth.  They claim 40 Mbyte/sec on
each channel at 20 MHz (i.e., 2 bytes per clock); there are 8
channels.  (We did some benchmarking, normalizing for the slow clock,
and even in the tightest loop we could construct we could only get
one processor to send to the other at half of peak.  I speculated
that there's some synchronization overhead, or perhaps one needs to
use spooling to hit the peak rate.)

Options are provided for allocating portions of channels, setting up
connections through the array, interrupting on various message
conditions, breaking channels and inserting data, etc.  These features
can be accessed from C programs using a library of macros and function
calls named PATHLIB.  PATHLIB's macros expand into efficient inline
assembly-language communication operations.

Routing is "signpost" routing - the processor initiating the
connection specifies the path to the receiver.  If resources (buffers)
along the path are available, the connection is established.
Otherwise, the message header blocks until resources become available.
Once a connection is established, special delimiter tokens can be
interspersed with the data to delimit separate messages or message
subfields.  The connection can be closed by sending a special "destroy
channel" token. The communication method that we saw most applications
use was static channels - they get established at the beginning of the
program, are used to send multiple messages, and are closed at the end
of the program.

Each processor has 20 buffers (this is the unit of replication in the
bit-sliced communication unit design) that can be allocated however
you wish - incoming, outgoing, etc.  Two are consumed by the run time
system. 

It's not obvious whether the communication architecture is "universal"
but it certainly is extensive.  It would be interesting to fool around
trying to construct a few different protocols and routing strategies.



Other Hardware Notes

How fast is it?  On some simple numerical integration problems that we
benchmarked, (that didn't use C&A) we would be generous in saying that
a single 20 MHz iWARP chip was 1/2 the speed of the SPARC host.  This
certainly isn't a complete characterization of the performance, but it
does serve as a ballpark figure. 

What's lacking architecturally?  Good facilities for doing memory
translation and protection: Implementing a shared address space would
be difficult on this architecture.



Software
--------

So, you've got this great parallel processor, how do you program it?
Intel supplies single-node C, an intermediate language "parallel C",
or "Apply", which is touted as a "signal processing language".  They
have a hacked-together-for-development run time system, with a much
spiffier run time system to be available soon.

C compiler

The single-node C-compiler is a regular-old C compiler.  It's solid,
with decent optimization.  As noted earlier, it doesn't produce the
VLIW-mode C&A instruction yet.  They've done a nice job with
assembly-language inlining.  Among other things, this allows
C-language macros to insert primitive communication operations
seamlessly.

Intermediate Parallel C

The Intermediate Parallel C is given with the disclaimer that it is
only an intermediate language for higher-level parallel languages.
But since the high level languages we saw didn't look particularly
useful (yet), this is the next-best thing.  For example, the class
exercises for writing a gaussian elimination program were hand-written
parallel C.

It's pretty shaky and has lots of bugs in the parser (silly bugs which
indicate that it hasn't been used a great deal yet, like the one that
causes it to mess up if you include more than one subroutine in the
same input file).  The program is translated into two output programs
(a master and a slave) in single-node C that make calls to PATHLIB for
communication.  One node in the array runs the master program, the
rest run identical copies of the slave program.  The master maintains
the synchronization of the slaves, sending messages to them
instructing them to proceed from one execution phase to the next.

Intermediate Parallel C is like regular C, with the addition of a
"parallel for" loop and some notion of local and distributed
variables.  However, the management of the distributed and local
variables is up to the user, using special copy operations that
distribute and copy variables between master and slaves.  These are
translated into calls to Pathlib for communication.

It's not clear whether you would want to really use this language for
performance-critical applications.  I suspect one would most-likely be
forced to resort to hand-rolling it in regular C, explicitly managing
the parallelism.

They warned that since this is an intermediate language and not a
supported product, the definition could change at any time.  So it's
not fair to criticize this too much, because it's not a true product.


Apply

Apply is a language for specifying image pixel transformations.  For
example, in the class, people wrote a simple edge detector in a few
lines of code.  You specify in Apply how each pixel is a function of
each of its neighbors, and the compiler emits a parallel C program to
do the operation.

It has a hybrid syntax that's not quite C and not quite Ada.  It's
still a bit fuzzy exactly how this gets woven into a complete
application, but presumably this will firm up with the new run-time
release (below).  Extensions to Apply apparently fix some of the
problems with the language and are forthcoming.

A library of pixel transformations, "Weblib", is available which
provides lots of common image transformations.



Run Time System

The first run time system that Intel shipped used far too much of the
local memory on each node.  So, they did some slash and trash, and got
it much smaller (about 50 kbyte/node).

Running programs (with the development runtime) is a glacial process.
The machine gets reset, tested, and rebooted with every run, so that
starting a program takes about 40 seconds.  This is an interim OS.

A much spiffier run time system is planned with each node running a
very lightweight run time system that provides a nearly complete
System V UNIX interface (minus some unmanageable things like fork) to
the host's facilities (i.e., file system, etc.).  The goal is for this to
use 30 kbytes/node.  All of these calls pretty much just get forwarded to
the host for handling.  There's talk about supporting a memory mapped
high-performance disk interface on each node.


Observations
------------

If you can afford it, they have a working parallel-processor building
block chip and system.  One can probably expect the hardware to be
solid. The software is definitely "under construction, proceed at own
risk".  But they're working on it, and it's really just a matter of
time before they get some spiffy stuff cranked out.  People expressed
admiration for the rate at which CMU could crank out code.

There are some questions about the scalability of this system.
Per-node price is still very high ($9000, including RAM).  It's not
clear why the price is so high.

Intel now has a formidable, *experienced* parallel processor design
team.  

It's an interesting system, and the software questions are wide open.
How do you program such a machine?  What communication strategies are
practical?  What's it good for?  If we had one, we could certainly
fiddle with it to try to get something running on it.

It would be interesting to try constructing some simulations of other
architectures and protocols.  As mentioned earlier, the lack of
translation operations makes simulating a shared global address space
difficult.

Their Apply language, while being somewhat clumsy as a production
tool, is interesting because it raises issues of data placement and
communication optimization from high level operations, and proposes a
solution.  


Summary
-------

Again, the purpose of this memo has been to whet your appetite for
iWARP information.  I think it's a neat, solid, interesting system.


Richard Lethin
lethin@ai.mit.edu

uh311ae@sunmanager.lrz-muenchen.de (Henrik Klagges) (05/31/91)

Hello,

Thanks to Richard Lethin (lethin@ai.mit.edu) for his iWARP summary.
I would like to comment on some statements, because I disagree that
the iWARP is 'a neat, solid, interesting system'.

>Each processor has 4 "physical pathways" (4 incoming and 4 outgoing)
>so it connects easily into a 2-D mesh.  However, aside from the
>restriction that X-channels cannot connect to Y channels (they are a
>half-clock out of phase) they could be connected in any topology.

Basically, 4 bidirectionals isn't bad.  I would prefer the 8 links of the
new transputers, especially given the fact that they support a virtual
channel concept - i.e., giving you as many software channels as you want.
The XX, YY only restriction, however, is severe and sounds like an
engineering joke.

> They claim 40 Mbyte/sec (at 20MHz) on each channel; there are 8 channels. 
> (We did some benchmarking, normalizing for the slow clock, and even in 
> the tightest loop we could construct we could only get one processor to 
> send to the other at half of peak.

Thus proving that the claim must be wrong in any real-world system.

>The distinguishing feature of the iWARP instruction set is a VLIW-mode
>96-bit long Compute & Access instruction (C&A).  An FP multiply, an FP
>add, two memory operations, and a loop test can be issued and executed
>parallel.  A team of compiler people is working to make their
>single-chip compiler produce this instruction.  Currently, it does
>not.  However, the assembly language inlining is particularly
>well-implemented and should allow one to hand-code an inner loop
>seamlessly, painlessly, and efficiently. 

A 'single-chip compiler' which 'currently does not' for a selling
parallel system?
This means that 'the distinguishing feature' essentially doesn't work,
except if you hand-code.  This sounds like the story of optimizing
compilers and i860 performance.

> These boards are expensive: around $30,000 for the SBA and $15,000 for
> the SIB. 
...
> a single 20 MHz iWARP chip was 1/2 the speed of the SPARC host
...
> There are some questions about the scalability of this system.
> Per-node price is still very high ($9000, including RAM).  It's not
> clear why the price is so high.

At $9K half-Sparc performance?  I'd rather buy a full Sparc (including
color monitor, 16 Megs & HDD).  For large parallelism, there is still a
Connection Machine (SIMD), a BBN Butterfly, a Meiko Computing Surface,
Paracom ... at less money.  The fact that the iWARP has no cache and no
DRAM support (e.g. as transputers do) makes it very vulnerable to high
speed SRAM prices - and very unlikely to zoom much higher than 20 MHz
in clock frequency.

The iWARP was from the very beginning designed to be a building block
for 2D-mesh dataflow computers.  Given the right problems, dataflow
can be very fast; given the wrong ones, it's useless.  At half-Sparc
speed the iWARP is slow even on this very specialized home turf, so I
say, forget it.

Cheers ! Rick@vee.lrz-muenchen.de

Henrik Klagges, U of Munich, Physics Dep.
#include "std_disclaimer.h" 
 

ruehl@iis.ethz.ch (Roland Ruehl) (05/31/91)

In article <uh311ae.675674229@sunmanager> uh311ae@sunmanager.lrz-muenchen.de (Henrik Klagges) writes:

>				:
>Basically, 4 bidirectionals isn't bad. I would prefer the 8 ones of the
>new transputers, especially given the fact that they support a virtual
>channel concept - i.e., giving you as many software channels as you want.
>The XX, YY only restriction, however, is severe and sounds like an engi-
>neering joke.
>				:

iWARP supports 20 logical communication channels which
are implemented in hardware by multiplexing 4 bidirectional
hardware busses. With these logical channels it is possible
to emulate for instance a 2^6 Hypercube on an 8x8 iWARP system
without introducing software overhead.

>		:
>At $9K half-Sparc performance ? I'd rather buy a full Sparc (including
>color monitor, 16Megs & HDD). For large parallelism, there is still a
>connection machine (SIMD), a BBN Butterfly, a Meiko Computing Surface,
>		:

The CM is a SIMD machine and the BBN a shared memory parallel
processor, with the associated drawbacks.  A competitive
Computing Surface either uses the i860, with the compiler problems
you mentioned, or an H1 (=T9000), which has not been officially
released yet (how about an optimizing H1 C compiler?)

>		:
>The iWARP was from the very beginning designed to be a building block
>for 2D-mesh dataflow computers.
>		:

A dataflow computer (see for instance "Monsoon: ..." by Papadopoulos
and Culler in ISCA 90) is designed to efficiently execute dataflow
graphs, typically expressed in a functional language (for instance Id).
iWARP is programmed in C or W2 using low latency communication
primitives.  Standard numerical applications (SOR, dense linear algebra,
signal processing, ...) can be parallelized efficiently provided enough
local memory.  Although the current C compiler release does not support
LIW optimization, iWARP has a good communication speed / local
computation performance ratio compared to other distributed memory
parallel processors (MIMD) commercially available at the moment.

---------

Roland Ruehl				uucp:  uunet!mcsun!ethz!ruehl
Tel: (01) 256 5146 (Switzerland)	eunet: ruehl@iis.ethz.ch
     +411 256 5146 (International)

Integrated Systems Laboratory
ETH-Zentrum
8092 Zurich

frazier@oahu.cs.ucla.edu (Greg Frazier) (06/01/91)

ruehl@iis.ethz.ch (Roland Ruehl) writes:

+In article <uh311ae.675674229@sunmanager> uh311ae@sunmanager.lrz-muenchen.de (Henrik Klagges) writes:

+>The iWARP was from the very beginning designed to be a building block
+>for 2D-mesh dataflow computers.

+A dataflow computer (see for instance "Monsoon: ..." by Papadopoulos
+and Culler in ISCA 90) is designed to execute efficiently dataflow
+graphs typically expressed in a functional language (for instance ID).
+iWARP is programmed in C or W2 using low latency communication
+primitives. 

I think what he meant to say was that the iWARP was designed as a
building block for systolic arrays.  Which it was/is.
-- 


Greg Frazier	frazier@CS.UCLA.EDU	!{ucbvax,rutgers}!ucla-cs!frazier

rfrench@neon.Stanford.EDU (Robert S. French) (06/01/91)

For people who want more information on the iWarp, here are some
recent papers that are relevant:

Shekhar Borkar, et al.  "Supporting Systolic and Memory Communication
in iWarp".  ISCA '90, pp 70-81.

Robert Cohn, et al.  "Architecture and Compiler Tradeoffs for a Long
Instruction Word Microprocessor".  ASPLOS '89, pp 2-14.

Ping-Sheng Tseng.  "Compiling Programs for a Linear Systolic Array".
PLDI '90, pp 311-321.  (Really talks about Warp, but the techniques
are probably applicable to iWarp.)

If you take the last two papers together, you get a compiler that does
instruction scheduling, software pipelining, and automatic breakup of
tasks across a linear systolic array.  Now if only Intel could do
that...

BTW, I will be working with an iWarp system soon, and would enjoy
getting in contact with any of y'all who are currently using one.

			Rob