lethin@ai.mit.edu (Richard A. Lethin) (05/25/91)
I had the opportunity to attend a one-week iWARP training course at
Intel's facility in Hillsboro, Oregon a few weeks ago. They've put a
great deal of effort into the course, so it was very informative. This
was the first week they offered the course to prospective users, and
the iWARP system is still partly in development, so things weren't
completely smooth. iWARP is very interesting and worth people's while
to examine. While the software is still a bit rough, it's progressing
rapidly. The hardware works, is solid, and is definitely for sale.

What follows is a sanitized version of my trip report. Disclaimer:
it's subjective and probably contains errors. I'm posting it to whet
your appetite for more iWARP information, to help the people who
sponsored my trip get their money's worth (perhaps by encouraging
researchers to build on their work instead of repeating it), and to
start new comp.arch discussion threads.

Hardware
--------

iWARP is a single-chip microprocessor with on-chip floating point and
communication facilities. The chip has no data cache or RAM, but does
have a small instruction cache. The silicon area is about (500 mil)^2,
and it was constructed from hand-placed and connected standard cells
(similar to the MDP, except that the MDP used some auto-place and
route for irregular cells).

The iWARP team consisted of about 60 people for 6 years, so the
development cost was probably around $36M (assuming $100k per
person-year). This was split between Intel and DARPA. The clock speed
is 20 MHz. Chips are currently running a bit below speed, but they're
working to clean up the slow paths.

Intel doesn't sell iWARP chips, it sells iWARP systems. CMU has one.
They've made at least one recent multimillion-dollar sale of a large
system. The regular iWARP card (Single-Board Array or SBA) includes
four processor chips; each node has 500 kbytes of ~50ns SRAM memory.
These can plug into a custom backplane, or into a mothercard that
adapts the card to the VME bus.
The systems that we experimented with were 8-processor systems in a
SPARC/VME box. These systems also require a System Interface Board
(SIB). The wiring of the inter-SBA connections is very conservative:
thick, long, shielded multiconductor cables. Presumably, in the
Intel-cabinet systems, the wiring is much more efficient. These boards
are expensive: around $30,000 for the SBA and $15,000 for the SIB. I
suppose that in terms of $/flops, one might consider them a good deal,
since the incremental cost of adding a few more flops is another
microprocessor and 500k of static RAM.

The instruction set is fairly conventional. Memory access is through
load and store operations, and simple integer operations bypass
results back and can execute in a single clock. Floating point
operations are not pipelined. In addition to the standard repertoire
of "RISC" operations, the chip also has pre- and post-increment on
index registers for memory access, nested looping instructions,
push/pop/allocate/call, and special communication control
instructions.

The distinguishing feature of the iWARP instruction set is a
VLIW-mode, 96-bit-long Compute & Access instruction (C&A). An FP
multiply, an FP add, two memory operations, and a loop test can be
issued and executed in parallel. A team of compiler people is working
to make their single-chip compiler produce this instruction.
Currently, it does not. However, the assembly-language inlining is
particularly well implemented and should allow one to hand-code an
inner loop seamlessly, painlessly, and efficiently. The FP units are
not pipelined, so C&A instructions may take a couple of clocks to
complete. Since communication-agent access is a register access, a
C&A could be computing with two input streams and sending results to
two output streams.

The register file is 128 registers deep. Most of the registers are
general purpose, but a few are special-purpose: for example, special
addressing registers used for the C&A instruction and registers used
for streaming to/from the communication unit. An alternate register
space, "Control Space", of 1024 32-bit registers allows the program
to examine and modify control bits with special control-space-move
instructions. Some chip test points can be examined in this space for
diagnostic purposes.

Communication

A really interesting and innovative part of the processor design is
the communication architecture. A lot of work and silicon area went
into it: it consumes 1/3 of the chip area in a clean, bit-sliced
design. It's a set of very interesting low-level building blocks from
which one can construct a communication protocol.

Each processor has 4 "physical pathways" (4 incoming and 4 outgoing),
so it connects easily into a 2-D mesh. However, aside from the
restriction that X channels cannot connect to Y channels (they are a
half-clock out of phase), they could be connected in any topology.
Processors communicate with channels by streaming, i.e., reading and
writing to register-file-mapped gates, or via spooling: setting up a
state machine to transfer directly between main memory and the
communication channel.

The channels are very high bandwidth. They claim 40 Mbyte/sec (at 20
MHz) on each channel; there are 8 channels. (We did some benchmarking,
normalizing for the slow clock, and even in the tightest loop we could
construct we could only get one processor to send to the other at half
of peak. I speculated that there's some synchronization overhead, or
perhaps one needs to use spooling to hit the peak rate.)

Options are provided for allocating portions of channels, setting up
connections through the array, interrupting on various message
conditions, breaking channels and inserting data, etc. These features
can be accessed from C programs using a library of macros and function
calls named PATHLIB.
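To make the streaming-versus-spooling distinction concrete, here is a
toy C model (all names invented for illustration; this is not PATHLIB
and not the real register-mapped gate interface): a channel is modeled
as a bounded word FIFO, streaming is per-word program access, and
spooling drains a whole memory block into the channel.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one iWARP channel as a bounded word FIFO.
   All names here are invented -- this is not PATHLIB and not the
   real hardware gate interface. */
#define QCAP 64
typedef struct { uint32_t buf[QCAP]; int head, tail, count; } channel_t;

/* "Streaming": the program touches the channel one word at a time,
   the way a register-file-mapped gate access would. */
int chan_put(channel_t *c, uint32_t w) {
    if (c->count == QCAP) return 0;      /* full: would block on hardware */
    c->buf[c->tail] = w;
    c->tail = (c->tail + 1) % QCAP;
    c->count++;
    return 1;
}

int chan_get(channel_t *c, uint32_t *w) {
    if (c->count == 0) return 0;         /* empty: would block on hardware */
    *w = c->buf[c->head];
    c->head = (c->head + 1) % QCAP;
    c->count--;
    return 1;
}

/* "Spooling": a state machine moves a whole block between main memory
   and the channel without per-word program intervention. */
void spool_out(channel_t *c, const uint32_t *mem, int n) {
    for (int i = 0; i < n; i++)
        while (!chan_put(c, mem[i]))
            ;                            /* spin until buffer space */
}
```

On the real chip the streaming case is just a register access inside a
C&A instruction, which is what lets an inner loop consume two input
streams and produce results with no explicit load/store traffic.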
PATHLIB can use macros to generate efficient assembly-language
communication operations.

Routing is "signpost" routing: the processor initiating the connection
specifies the path to the receiver. If resources (buffers) along the
path are available, the connection is established. Otherwise, the
message header blocks until resources become available. Once a
connection is established, special delimiter tokens can be
interspersed with the data to delimit separate messages or message
subfields. The connection can be closed by sending a special "destroy
channel" token. The communication method that we saw most applications
use was static channels: they get established at the beginning of the
program, are used to send multiple messages, and are closed at the end
of the program.

Each processor has 20 buffers (this is the unit of replication in the
bit-sliced communication unit design) that can be allocated however
you wish: incoming, outgoing, etc. Two are consumed by the run-time
system. It's not obvious whether the communication architecture is
"universal", but it certainly is extensive. It would be interesting to
fool around trying to construct a few different protocols and routing
strategies.

Other Hardware Notes

How fast is it? On some simple numerical integration problems that we
benchmarked (which didn't use C&A), we would be generous in saying
that a single 20 MHz iWARP chip was 1/2 the speed of the SPARC host.
This certainly isn't a complete characterization of the performance,
but it does serve as a ballpark figure.

What's lacking architecturally? Good facilities for doing memory
translation and protection: implementing a shared address space would
be difficult on this architecture.

Software
--------

So, you've got this great parallel processor; how do you program it?
Intel supplies single-node C, an intermediate language "parallel C",
and "Apply", which is touted as a "signal processing language".
They have a hacked-together-for-development run-time system, with a
much spiffier run-time system to be available soon.

C compiler

The single-node C compiler is a regular old C compiler. It's solid,
with decent optimization. As noted earlier, it doesn't produce the
VLIW-mode C&A instruction yet. They've done a nice job with
assembly-language inlining. Among other things, this allows C-language
macros to insert primitive communication operations seamlessly.

Intermediate Parallel C

The Intermediate Parallel C is given with the disclaimer that it is
only an intermediate language for higher-level parallel languages. But
since the high-level languages we saw didn't look particularly useful
(yet), this is the next-best thing. For example, the class exercises
for writing a Gaussian elimination program were hand-written parallel
C. It's pretty shaky and has lots of bugs in the parser (silly bugs
which indicate that it hasn't been used a great deal yet, like the one
that causes it to mess up if you include more than one subroutine in
the same input file).

The program is translated into two output programs (a master and a
slave) in single-node C that make calls to PATHLIB for communication.
One node in the array runs the master program; the rest run identical
copies of the slave program. The master maintains the synchronization
of the slaves, sending messages instructing them to proceed from one
execution phase to the next.

Intermediate Parallel C is like regular C, with the addition of a
"parallel for" loop and some notion of local and distributed
variables. However, the management of the distributed and local
variables is up to the user, using special copy operations that
distribute and copy variables between master and slaves. These are
translated into calls to PATHLIB for communication.

It's not clear whether you would really want to use this language for
performance-critical applications. I suspect one would most likely be
forced to resort to hand-rolling it in regular C, explicitly managing
the parallelism. They warned that since this is an intermediate
language and not a supported product, the definition could change at
any time. So it's not fair to criticize it too much, because it's not
a true product.

Apply

Apply is a language for specifying image pixel transformations. For
example, in the class, people wrote a simple edge detector in a few
lines of code. You specify in Apply how each pixel is a function of
its neighbors, and the compiler emits a parallel C program to do the
operation. It has a hybrid syntax that's not quite C and not quite
Ada. It's still a bit fuzzy exactly how this gets woven into a
complete application, but presumably this will firm up with the new
run-time release (below). Extensions to Apply apparently fix some of
the problems with the language and are forthcoming. A library of
pixel transformations, "Weblib", is available which provides lots of
common image transformations.

Run Time System

The first run-time system that Intel shipped used far too much of the
local memory on each node. So they did some slash and trash, and got
it much smaller (about 50 kbyte/node). Running programs (with the
development runtime) is a glacial process: the machine gets reset,
tested, and rebooted with every run, so starting a program takes about
40 seconds. This is an interim OS.

A much spiffier run-time system is planned, with each node running a
very lightweight run-time system that provides a nearly complete
System V UNIX interface (minus some unmanageable things like fork) to
the host's facilities (i.e., file system, etc.). The goal is for this
to use 30 kbytes/node. All of these calls pretty much just get
forwarded to the host for handling. There's talk of supporting a
memory-mapped high-performance disk interface on each node.
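To give a flavor of the kind of per-pixel transformation Apply
expresses, here is a plain-C sketch of a simple gradient edge detector
(invented code; this is not Apply syntax and not Weblib): each output
pixel is a function of its 3x3 neighborhood, which is exactly the form
Apply knows how to parallelize across the array.

```c
#include <assert.h>
#include <stdlib.h>

/* Plain-C sketch of an Apply-style per-pixel transformation.
   Invented code -- not Apply syntax and not part of Weblib. */
#define W 8
#define H 8

int at(const int *img, int x, int y) { return img[y * W + x]; }

/* Simple gradient-magnitude edge measure, |dx| + |dy|, computed on
   the interior pixels; the border is left untouched. */
void edge(const int *in, int *out) {
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++) {
            int dx = at(in, x + 1, y) - at(in, x - 1, y);
            int dy = at(in, x, y + 1) - at(in, x, y - 1);
            out[y * W + x] = abs(dx) + abs(dy);
        }
}
```

The point of Apply is that you write only the body of the inner loop
(the per-pixel function); the compiler supplies the loops, the data
distribution across nodes, and the boundary communication.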
Observations
------------

If you can afford it, they have a working parallel-processor building
block chip and system. One can probably expect the hardware to be
solid. The software is definitely "under construction, proceed at own
risk". But they're working on it, and it's really just a matter of
time before they get some spiffy stuff cranked out. People expressed
admiration for the rate at which CMU could crank out code.

There are some questions about the scalability of this system.
Per-node price is still very high ($9000, including RAM). It's not
clear why the price is so high.

Intel now has a formidable, *experienced* parallel processor design
team.

It's an interesting system, and the software questions are wide open.
How do you program such a machine? What communication strategies are
practical? What's it good for? If we had one, we could certainly
fiddle with it to try to get something running on it. It would be
interesting to try constructing some simulations of other
architectures and protocols. As mentioned earlier, the lack of
translation operations makes simulating a shared global address space
difficult.

Their Apply language, while being somewhat clumsy as a production
tool, is interesting because it raises issues of data placement and
communication optimization from high level operations, and proposes a
solution.

Summary
-------

Again, the purpose of this memo has been to whet your appetite for
iWARP information. I think it's a neat, solid, interesting system.

Richard Lethin
lethin@ai.mit.edu
uh311ae@sunmanager.lrz-muenchen.de (Henrik Klagges) (05/31/91)
Hello,

Thanks to Richard Lethin (lethin@ai.mit.edu) for his iWARP summary. I
would like to comment on some statements, because I disagree that the
iWARP is 'a neat, solid, interesting system'.

>Each processor has 4 "physical pathways" (4 incoming and 4 outgoing)
>so it connects easily into a 2-D mesh. However, aside from the
>restriction that X-channels cannot connect to Y channels (they are a
>half-clock out of phase) they could be connected in any topology.

Basically, 4 bidirectionals isn't bad. I would prefer the 8 of the
new transputers, especially given the fact that they support a
virtual channel concept - i.e., giving you as many software channels
as you want. The XX, YY only restriction, however, is severe and
sounds like an engineering joke.

> They claim 40 Mbyte/sec (at 20 MHz) on each channel; there are 8 channels.
> (We did some benchmarking, normalizing for the slow clock, and even in
> the tightest loop we could construct we could only get one processor to
> send to the other at half of peak.

Thus proving that the claim must be wrong in any real-world system.

>The distinguishing feature of the iWARP instruction set is a VLIW-mode
>96-bit long Compute & Access instruction (C&A). An FP multiply, an FP
>add, two memory operations, and a loop test can be issued and executed
>parallel. A team of compiler people is working to make their
>single-chip compiler produce this instruction. Currently, it does
>not. However, the assembly language inlining is particularly
>well-implemented and should allow one to hand-code an inner loop
>seamlessly, painlessly, and efficiently.

A 'single-chip compiler' which 'currently does not' for a selling
parallel system? This means that 'the distinguishing feature'
essentially doesn't work, except if you hand-code. This sounds like
the story of optimizing compilers and i860 performance.

> These boards are expensive: around $30,000 for the SBA and $15,000 for
> the SIB. ...
> a single 20 MHz iWARP chip was 1/2 the speed of the SPARC host ...
> There are some questions about the scalability of this system.
> Per-node price is still very high ($9000, including RAM). It's not
> clear why the price is so high.

At $9K half-SPARC performance? I'd rather buy a full SPARC (including
color monitor, 16 Megs & HDD). For large parallelism, there is still
a Connection Machine (SIMD), a BBN Butterfly, a Meiko Computing
Surface, Paracom ... at less money.

The fact that the iWARP has no cache and no DRAM support (e.g. as
transputers have) makes it very vulnerable to high-speed SRAM prices
- and very unlikely to zoom much higher than 20 MHz in clock
frequency.

The iWARP was from the very beginning designed to be a building block
for 2D-mesh dataflow computers. Given the right problems, dataflow
can be very fast; given the wrong ones, it's useless. At half-SPARC
speed the iWARP is slow even on this very specialized home turf, so I
say, forget it.

Cheers!

Rick@vee.lrz-muenchen.de
Henrik Klagges, U of Munich, Physics Dep.
#include "std_disclaimer.h"
ruehl@iis.ethz.ch (Roland Ruehl) (05/31/91)
In article <uh311ae.675674229@sunmanager> uh311ae@sunmanager.lrz-muenchen.de (Henrik Klagges) writes:

>Basically, 4 bidirectionals isn't bad. I would prefer the 8 ones of the
>new transputers, especially given the fact that they support a virtual
>channel concept - i.e., giving you as many software channels as you want.
>The XX, YY only restriction, however, is severe and sounds like an
>engineering joke.

iWARP supports 20 logical communication channels which are
implemented in hardware by multiplexing 4 bidirectional hardware
busses. With these logical channels it is possible to emulate, for
instance, a 2^6 hypercube on an 8x8 iWARP system without introducing
software overhead.

>At $9K half-Sparc performance ? I'd rather buy a full Sparc (including
>color monitor, 16Megs & HDD). For large parallelism, there is still a
>connection machine (SIMD), a BBN Butterfly, a Meiko Computing Surface,

The CM is a SIMD machine and the BBN a shared-memory parallel
processor, with the associated drawbacks. A competitive Computing
Surface either uses the i860, with the compiler problems you
mentioned, or an H1 (=T9000), which has not been officially released
yet (how about an optimizing H1 C compiler?).

>The iWARP was from the very beginning designed to be a building block
>for 2D-mesh dataflow computers.

A dataflow computer (see for instance "Monsoon: ..." by Papadopoulos
and Culler in ISCA 90) is designed to execute dataflow graphs
efficiently, typically expressed in a functional language (for
instance Id). iWARP is programmed in C or W2 using low-latency
communication primitives. Standard numerical applications (SOR, dense
linear algebra, signal processing, ...) can be parallelized
efficiently, provided enough local memory.
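The hypercube emulation above is easy to sanity-check with a short
sketch (invented code, not an iWARP program): split the 6-bit node id
into 3 bits of x and 3 bits of y on the 8x8 mesh. Every hypercube link
then becomes a straight mesh path along one axis, and the longest such
path is only 4 hops, which the logical channels can multiplex over the
physical X/Y busses.

```c
#include <assert.h>

/* Sanity check: embed a 2^6 hypercube in an 8x8 mesh by taking the
   low 3 bits of the node id as x and the high 3 bits as y.
   Invented illustration, not an iWARP program. */
int mesh_dist(int a, int b) {
    int ax = a & 7, ay = a >> 3;
    int bx = b & 7, by = b >> 3;
    int dx = ax > bx ? ax - bx : bx - ax;
    int dy = ay > by ? ay - by : by - ay;
    return dx + dy;
}

/* Longest mesh path any single hypercube link has to cover:
   flipping one id bit moves 1, 2, or 4 hops along one axis. */
int longest_hypercube_link(void) {
    int max = 0;
    for (int node = 0; node < 64; node++)
        for (int bit = 0; bit < 6; bit++) {
            int d = mesh_dist(node, node ^ (1 << bit));
            if (d > max)
                max = d;
        }
    return max;
}
```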
Although the current C compiler release does not support LIW
optimization, iWARP has a good communication speed / local computation
performance ratio compared to other distributed-memory parallel
processors (MIMD) commercially available at the moment.

---------
Roland Ruehl                        uucp: uunet!mcsun!ethz!ruehl
Tel: (01) 256 5146 (Switzerland)    eunet: ruehl@iis.ethz.ch
     +411 256 5146 (International)
Integrated Systems Laboratory, ETH-Zentrum, 8092 Zurich
frazier@oahu.cs.ucla.edu (Greg Frazier) (06/01/91)
ruehl@iis.ethz.ch (Roland Ruehl) writes:

+In article <uh311ae.675674229@sunmanager> uh311ae@sunmanager.lrz-muenchen.de (Henrik Klagges) writes:
+>The iWARP was from the very beginning designed to be a building block
+>for 2D-mesh dataflow computers.

+A dataflow computer (see for instance "Monsoon: ..." by Papadopoulos
+and Culler in ISCA 90) is designed to execute efficiently dataflow
+graphs typically expressed in a functional language (for instance ID).
+iWARP is programmed in C or W2 using low latency communication
+primitives.

I think what he meant to say was that the iWARP was designed as a
building block for systolic arrays. Which it was/is.
--
Greg Frazier    frazier@CS.UCLA.EDU    !{ucbvax,rutgers}!ucla-cs!frazier
rfrench@neon.Stanford.EDU (Robert S. French) (06/01/91)
For people who want more information on the iWarp, here are some
recent papers that are relevant:

Shekhar Borkar, et al. "Supporting Systolic and Memory Communication
in iWarp". ISCA '90, pp. 70-81.

Robert Cohn, et al. "Architecture and Compiler Tradeoffs for a Long
Instruction Word Microprocessor". ASPLOS '89, pp. 2-14.

Ping-Sheng Tseng. "Compiling Programs for a Linear Systolic Array".
PLDI '90, pp. 311-321. (Really talks about Warp, but the techniques
are probably applicable to iWarp.)

If you take the last two papers together, you get a compiler that does
instruction scheduling, software pipelining, and automatic breakup of
tasks across a linear systolic array. Now if only Intel could do
that...

BTW, I will be working with an iWarp system soon, and would enjoy
getting in contact with any of y'all who are currently using one.

			Rob