jsutton@iWarp.intel.com (Jim Sutton) (06/04/91)
Over the past few months, there have been sporadic articles and requests for information concerning the iWarp project appearing in the comp.arch and comp.parallel newsgroups. It seems appropriate at this time to provide a brief description of what an iWarp is, for those of you who may be feeling left out.

My credentials: I have been with iWarp since Q3 86. My primary function was component/systems architect, but I also had responsibility for the design of two of the functional units.

This overview provides only a brief, global view of iWarp's architecture and capabilities, and does not attempt to cover details (our architectural spec used by the software development teams is over 450 pages!). Pointers to further information are provided at the end of this article for those who would like additional details. All performance numbers shown are for 20 MHz systems.

---------------------------------------
SYSTEM ARCHITECTURE

iWarp is both a component and system design for scalable parallel computing systems, aimed at supporting both systolic (fine-grained) and message-passing (coarse-grained) applications requiring high-speed communications. Systolic applications involve few (typically <10) calculations per data element, so a *balance* between I/O and CPU performance is as critical as raw CPU MFLOPs.

iWarp's primary design focus was on integrating high-performance I/O into the processor in a way that allows the computational units to have balanced access to both the I/O channels and memory. We made a conscious decision to avoid expensive, experimental (or simply difficult) CPU features found on current leading-edge processors, to allow us to focus time (and silicon) on these goals.

An iWarp cell (i.e., node) consists of a single iWarp component and a bank of fast static RAM. iWarp cells are joined by directly connecting the iWarp components through the four bidirectional pathways.
Typical configurations are 1D (linear) and 2D (mesh) arrays, although other arrangements are possible.

iWarp is the first commercial product to implement virtual routing channels in hardware, and extends this concept to *long-lived* virtual channels (which we call connections) that form a unidirectional path of reserved resources from the terminal source node to the terminal destination node. Once a connection has been established, one or more messages may be shipped along this path. Connections may also be *shared*; that is, one or more cells along the path may be senders, and one or more cells along the path may be receivers. Connections use a form of street-sign addressing to choose a routing path through the array; messages have names that identify the destination cell(s) or process(es) within the connection. iWarp supports 20 connections, with independent buffering, control and status resources for each.

---------------------------------------
COMPONENT ARCHITECTURE

The iWarp component consists of two essentially asynchronous agents: a Computation Agent and a Communication Agent. The Comp Agent contains integer core hardware, floating point hardware, memory interface and a 128 word Register File (RF). The Comm Agent handles all pathway traffic, and manipulation of the virtual channel buffers. Following is a brief summary of the iWarp component features. Unusual features are discussed in the next section.

Communications
- 4 bidirectional pathways.
  * Built as unidirectional bus pairs, each 8-bits (+control) wide.
  * 40 Mbytes/sec per bus (320 Mbytes/sec aggregate).
- 4-entry Address Match CAM handles cell streetsigns and message names.
- 20 channel buffers (called PCT records)
  * Allows 20 simultaneous connections through/to/from the cell.
    "Express" (thru-cell) traffic is not blocked by "inbound" or
    "outbound" traffic to/from the cell.
  * Express traffic handled automatically, without SW intervention and
    without stealing CPU cycles.
  * 8-word data queue provides smoothing.
  * Programmable stop conditions allow channel to be automatically
    converted from "express" (data flowing through cell) to "inbound"
    (data examined/consumed by cell), and vice-versa.
- Independent round-robin scheduling for each outbound pathway.

128 word Register File (RF)
- 118 general purpose registers + 10 special addresses.
- Accessible as 8/16/32/64-bit fields on natural boundaries.
- Heavily multi-ported:
  * Separate ports for integer core, FP adder, FP multiplier allow
    simultaneous access. Each port supports 2-reads/1-write per clock.
  * Special "back-door" ports tied directly to specific registers:
    LM ports and stream gates (see "Special Features").

Floating Point
- Adder and Multiplier units:
  * Non-pipelined 2 clock SP, 4 clock DP basic operations.
  * Each 10 MFLOPs SP, 5 MFLOPs DP.
- 32/64-bit IEEE P754.
  * Full trap support and fast mode, four rounding modes.
- 3 operand instructions (2 src, 1 dest).
- Result bypass between adder/multiplier.
- Multiplier supports divide, remainder and square root instructions.
- Integer-to-float, float-to-integer, pack and unpack operations.

Integer Arithmetic
- 8/16/32-bit data operations.
- All 1-clock operations; 20 MIPs performance.
- Arithmetic, shift/rotate, logical and bit operations.
- 2 operand instructions (1 src, 1 src/dest).
- Most ops allow 8-bit literal value as 2nd operand.
- Result bypass to any other core (non-FP) instruction.

Memory Operations
- 8/16/32/64-bit load/store instructions.
  * Pre- and post-increment addressing.
  * Small-constant or variable address stride.
  * Big-endian/little-endian transformations for all data types.
- Internal 256 word instruction cache, 4-way set associative.
- Internal 2K word instruction ROM.

Support Operations/Features
- Stack: push/pop, call/return, allocate/release.
- Branch and call targets: direct, register-indirect, memory-indirect.
- System call to protected code through table lookup.
- Control/status register load/store.
- Register indirection (dereferences).
- Four separate sets of flags: FP adder, FP mult, integer and "other".

---------------------------------------
SPECIAL FEATURES

Hardware Loop Support

An ENTERLOOP instruction initializes dedicated hardware, including the loop-start address (the next instruction), an iteration count, and an optional condition code (e.g., carry flag false). Most instructions have an ENDLOOP bit. If set, then the loop count is decremented, and the loop is repeated if the count is non-zero OR the specified condition (resulting from a preceding instruction) is false. This all occurs in parallel with normal instruction execution.

Loops can be nested. The ENTERLOOP instruction saves the loop controls on the stack before changing them. Exiting the loop retrieves the controls. A mode in the BRANCH instruction allows early exit from the loop.

Stream Gates

ANY instruction may read/write data directly from/to a channel buffer's data queue by reading/writing special addresses in the Register File. These special locations are called "stream gates", because they provide a gating function that allows a stream of data to pass, a word or double-word at a time, between the program and the array. The RF locations have no real storage themselves, but pass data to/from the channel buffers over a "backdoor" port (bus).

There are two read gates (G0R,G1R) and two write gates (G0W,G1W). Each gate has a programmable binding that specifies which channel buffer it is connected to. Once bound, any instruction may pass data through the channel as a side effect of an ordinary read/write of an RF location. If data (or space) is not immediately available in the channel buffer's queue, the instruction "spins" before execution, until the data (space) is available. This allows direct program-to-program transfers without the costly overhead of checking queue status via instructions. Spin timeouts are available to catch deadlock or error conditions.
LM Ports

The other special addresses in the Register File are "LM Ports", and are "special" only in the Compute & Access instruction (see below). There are two read locations (LMR1,LMR2) and one write location (LMW). Each port has real storage associated with it (and is an ordinary register for non-C&A instructions), but also has a "backdoor" port to the external memory bus. The C&A instruction uses this back-door port to perform faster memory accesses (1-clock each).

Compute & Access Instruction

This is the workhorse instruction of the iWarp component. It is the sole *long* instruction (96-bits), and is horizontal in nature, specifying the parallel operation of nearly all functional units on the chip. The C&A instruction can initiate:

  1 FP adder operation (2-clocks SP)
  1 FP multiplier operation (2-clocks SP)
  2 back-to-back memory ops (each 1-clock): a memory read into LMR1, and
    either a memory read into LMR2 or a memory write from LMW. Each
    memory op specifies an address register and an offset register/
    literal, allowing pre/post-incrementing addresses, with a constant/
    variable stride. Each memory access can be single- or double-word.
  end-of-loop test + repeat

ANY of the source operands (FP add, FP mult, memory address calc) can specify a stream gate read (G0R,G1R). ANY of the destination operands can specify a stream gate write (G0W,G1W).

This produces a peak performance of:

  20 MFLOPs
  + 20 MIPs (+ loop_test/branch)
  + 160 Mbytes/sec memory ops
  + 80 Mbytes/sec sends
  + 80 Mbytes/sec receives
  (+ 160 Mbytes/sec express (thru cell) traffic)

Spools

Message-passing communication involves transferring blocks of data from the memory of a sending cell to the memory of a receiving cell. iWarp provides 8 independent, programmable DMA interfaces between memory and the channel buffers, called "spools".
Each spool is programmed with the channel number, two Register File locations (buffer address and buffer limit), stride, data-type (for endian-transformation) and direction (to/from memory). Once programmed and enabled, a spool operation will steal cycles from normal CPU operations *only* when data is available. The data transfer stops when the buffer limit is reached, or, in the case of spools-to-memory, when a "delimiter" (non-data word) reaches the top of the queue. When stopped, the spool records an event, which may invoke a service routine.

A single spool can "max-out" its associated outbound or inbound pathway, at 40 Mbytes/sec. Multiple spools (corresponding to concurrent messages to/from a cell) are scheduled in a round-robin fashion, and will max-out at the memory bandwidth of 160 Mbytes/sec.

A channel buffer can be attached to a spool at one end, and a stream gate at the other. This allows re-use of a code block in either systolic mode (streaming directly to/from the pathway) or message-passing mode (streaming to/from spools to/from memory).

Events

Because iWarp integrates a sophisticated communication manager with DMA transfers and normal CPU core and floating point activity, there are over 230 synchronous and asynchronous conditions ("events") that may require direct or indirect servicing. iWarp collects these events in a two-tier recording/reporting hierarchy. Low-level events are recorded in "group" event registers, and reported as a single group-event to a top-level EVENT register. There are individual reporting enables at both group and EVENT levels. Enabled events in the EVENT register cause automatic invocation of a service routine, vectored through a 64-entry service-routine table.

Types of events include the following:
- Connection arrival.
- Message arrival with matched name.
- Express connection flow stopped due to programmed condition.
- Debug breakpoints.
- Spool events.
- Floating point extension and error traps.
- Protection violations.
- Timeouts.

---------------------------------------
SYSTEM CONFIGURATIONS

Quad Cell Board (QCB)
- 4 processors, in a 2x2 array (max 80 MFLOPs/80 MIPs).
- .5/2.0 Mbyte per processor on board.
- Daughter board expansion to: 1.5/4.5/6.0 Mbytes per processor.

Card Cage Assembly (CCA)
- Up to 16 QCBs (max 64 nodes, 1280 MFLOPs/1280 MIPs).
- Clock board, fan and power supply.

System Cabinet
- Up to 4 CCAs (max 256 nodes, 5120 MFLOPs/5120 MIPs).

Multi-Cabinet Systems
- Connected with external cabling.
- Up to 4 System Cabinets (max 1024 nodes, 20480 MFLOPs/20480 MIPs).

SBA (Single Board Array)
- QCB with Sun form factor.
- 4 processors, in a 2x2 array.
- .5/1.0/2.0/4.0 Mbyte per processor.

SIB (System Interface Board)
- Single processor node.
- .5/1 Mbyte main memory, 64/256 Kbyte dual-port RAM.
- VME interface to host.

SBA System
- Max supported by a single Sun workstation:
  1 SIB + up to 8 SBAs (2x16 array, 640 MFLOPs/640 MIPs).

Currently shipping 10 MHz pre-production systems, which will be upgraded to 20 MHz in early 92.

---------------------------------------
SOFTWARE

The following are available now:
  Pathlib - Low level interface for systolic communication.
  RTS     - Run Time System (basic kernel).
  C       - Standard K&R, with global optimizations, assembler inlining
            and iWarp comm extensions.
  Apply   - Image processing parallel program generator.

The following are expected in Q4 91:
  RTS enhancements.
  C enhancements (machine dependent optimization, incl SW pipelining).
  Fortran 77 with VMS extensions, C/Fortran cross-language function
  inlining and iWarp comm extensions.
  Symbolic debugger (based on GNU).

Pathlib, RTS, and C are bundled with systems. Fortran and Apply are extra cost. The symbolic debugger will be bundled with systems.

---------------------------------------
FURTHER INFORMATION

5-day training classes cover system and communications architecture, use of the software tools, application development, and optimization tricks.
Marketing Contact:
  Paul Wiley, Marketing Manager
  5200 NE Elam Young Pkwy, CO4-02, Hillsboro, OR 97124-6497
  (503) 629-6350   fax: (503) 629-6367
  wiley@iwarp.intel.com

There have been numerous iWarp-related papers in journals and conference proceedings over the last 2 years. A good starting list would include:

  iWarp: A 100-MOP LIW Microprocessor for Multicomputers
    C. Peterson, J. Sutton, P. Wiley, IEEE Micro, June 1991

  iWarp: An Integrated Solution to High Speed Parallel Processing
    S. Borkar, et al., Proc. Supercomputing '88, IEEE CS Press,
    Nov 1988, pp 300-339

  Supporting Systolic and Memory Communication in iWarp
    S. Borkar, et al., Proc. 17th Intl Symposium on Computer
    Architecture, IEEE CS Press, May 1990, pp 70-81

  Communication in iWarp Systems
    T. Gross, Proc. Supercomputing '89, IEEE CS Press, Nov 1989,
    pp 436-435

  Apply: A Parallel Compiler for Image Processing Applications
    B. Baxter and B. Greer, Proc. 6th Distributed Memory Computing
    Conf, Apr 29-May 2 1991 <to be published, July?>

----------------------------------------------------------------------------
Jim Sutton, Sr Staff Engineer, intel/iWarp Program   jsutton@iWarp.intel.com
5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124   (503)629-6345
rfrench@neon.Stanford.EDU (Robert S. French) (06/04/91)
First of all, let me thank Jim for his (long) overview of the iWarp architecture. I'm sure it will help many people who don't know what in the heck we're talking about :-)

There are some questions I have about the iWarp component, though, specifically about performance. The iWarp was designed as a high-powered systolic processor, and thus provides all sorts of neat communications capabilities. However, it also needs good integer and FP support in order to sustain processing rates. There are a number of oddities that I noticed in the iWarp specs:

The FP adder takes 2 cycles (SP) or 4 cycles (DP) for all operations and isn't pipelined, which is pretty much OK considering the short cycle times. The FP multiplier takes the same for multiplication, but performance isn't nearly as impressive on operations such as division. For example, an SP division takes 15-16 clocks, and a DP division takes 31 clocks. If you'll forgive me for comparing apples and oranges, a MIPS R3010 can do the same in 12 and 19 cycles, respectively, and can maintain a higher clock rate. Likewise, an SP remainder takes "no more than 162 clocks", and a DP remainder takes "no more than 1,087 clocks", an incredibly long time, although I must admit I've never personally seen an application that uses FP remainder. In addition, considering that throughput is a major goal, it seems unfortunate that the FP multiplier isn't pipelined.

The arithmetic unit does most operations in 1 cycle, except that it doesn't support integer multiply or divide. You have to use the FP multiplier for integer multiply (3 cycles), and there doesn't appear to be any way to do an integer divide at all (convert to FP, divide, convert back?). This has the added problem that you can't do an integer multiply (such as for a multi-dimensional array access) and an FP multiply or divide at the same time, which I think severely limits the applicability of the compute&access instruction.
The iWarp has more support for byte and bit-level operations than any processor I've seen in a long time. For example, you can reference the individual bytes of a register as the source or destination for any arithmetic operation, and you can count bits, set/reset bits, find the first set bit, etc. These operations seem odd in a processor designed for high-powered floating point performance (this is, after all, why the C&A instruction can do one FPA and one FPM instruction and two memory ops).

It seems to me that the effort and chip area devoted to these functions would have been better spent on an integer multiplier, an integer divider, and a pipelined FPM unit.

Just some thoughts...

		Rob
jsutton@iWarp.intel.com (Jim Sutton) (06/05/91)
rfrench@neon.Stanford.EDU (Robert S. French) writes:

> ... In addition, considering that throughput is a major goal, it seems
> unfortunate that the FP multiplier isn't pipelined.

At the time the decision was made (mid '87?), pipelining the FP units presented some severe challenges:

(1) Prior pipelined FP architectures (and ongoing work at that time) emphasized vector performance, but usually at the expense of scalar performance. We wanted high scalar performance as well.

(2) Providing seamless send/receive constructs with full *invisible* synchronization was an imposing challenge even in scalar instructions. Meshing that into a pipelined FP architecture would have added massive complications.

(3) The compiler development required to handle the integrated I/O would be enough of a challenge, without adding the complexity of pipelined manipulation. Note that in early '87 the entire iWarp FP design team was only 3 engineers!

Given the knowledge and experience we have *today*, and given the proven send/receive interface mechanisms we have *today*, I would *now* be comfortable in specifying a pipelined FP. But that was not a viable choice at the time.

> The arithmetic unit does most operations in 1 cycle, except that it
> doesn't support integer multiply or divide. You have to use the FP
> multiplier for integer multiply (3 cycles), and there doesn't appear
> to be any way to do an integer divide at all (convert to FP, divide,
> convert back?). This has the added problem that you can't do an
> integer multiply (such as for a multi-dimensional array access) and an
> FP multiply or divide at the same time, which I think severely limits
> the applicability of the compute&access instruction.

We found that virtually all of the multiplies required for multi-dimensional array accesses occur outside the innermost loop(s).
As a consequence, for large data sets (which is iWarp's target), integer multiplies occur infrequently enough that the cost of adding a dedicated integer multiply unit could not be justified. Instead, we added a small amount of hardware to the FP multiplier to allow direct multiplication of integers.

Integer divide is indeed implemented by converting to FP. Integer divide was found to occur so infrequently in our target applications that no special hardware cost could be justified.

One point to keep in mind when examining the iWarp architecture is that all tradeoffs and optimizations center around the following target:

  * A tight loop (frequently a single C&A instruction) performing SP floating
  * point adds and multiplies, with 1-2 memory accesses, 1-2 sends and 1-2
  * receives per iteration.

> The iWarp has more support for byte and bit-level operations than any
> processor I've seen in a long time. For example, ...
> ... It seems to me that the effort and chip area
> devoted to these functions would have been better used building an
> integer multiplier, integer divider, and pipelining the FPM unit.

The only bit-level instructions (other than ordinary logical operations) are bit-test/set/clear instructions. These were included to reduce the cycles required in manipulating the communication control and status registers. This helps (slightly) improve the software overhead associated with communications, at minimal silicon cost.

The byte and half-word operations were provided to allow efficient support of C. Without these operations, we face two unpleasant alternatives:

(1) Software must "promote" operands to 32-bit fields, perform the desired function, then "demote" the result. This adds substantial additional cycles, particularly if exact results are to be maintained.

(2) Define the char/short/int data types as 32-bit fields. This consumes substantially more memory.
In ordinary systems with large amounts of DRAM, this may not be an issue, but iWarp's design goals required a very fast (and expensive) all-SRAM memory system, which means that efficient memory utilization is essential.

----------------------------------------------------------------------------
Jim Sutton, Sr Staff Engineer, intel/iWarp Program   jsutton@iWarp.intel.com
5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124   (503)629-6345
mshute@cs.man.ac.uk (Malcolm Shute) (06/05/91)
Am I correct in assuming that this is Intel's answer to Inmos' Transputer?

I would like both Intel and Inmos representatives to take this bait... it has often been pointed out in this group that "Manufacturer-X versus Manufacturer-Y" wars often throw up quite a lot of useful information in all their smoke and fury, and I've been disappointed that no-one at Inmos has yet responded to this long, but interesting, description of the iWARP.
--
Malcolm SHUTE. (The AM Mollusc: v_@_ ) Disclaimer: all
carroll@ssc-vax (Jeff Carroll) (06/06/91)
In article <2622@m1.cs.man.ac.uk> mshute@cs.man.ac.uk (Malcolm Shute) writes:
>Am I correct in assuming that this is Intel's answer to Inmos' Transputer?

No, you're not. Professor Kung has been pushing the Warp project at CMU for a decade or so now, and Intel has been at work on iWarp for several years, funded in part by the Department of Defense.

Nobody I know who works for Intel has ever been of the opinion that Inmos has ever done anything that merited an answer. (I'm a transputer user too, and I think that there are some very nice things about the xputer architecture, but I also think that Inmos has done some things very wrong through the years.)

Now, there are some interesting parallels between the iWarp architecture and the T9000 (nee H1) xputer, but there are also a couple of important differences.

a) Intel has working silicon, now. I have seen it with my own eyes. I think nearly everyone will agree that Inmos is nowhere close to having a working (prototype, even) T9000. You can buy iWarp systems (in those wonderful gray cabinets) NOW. TODAY.

b) Availability notwithstanding, my contacts at Intel have not convinced me that Galactic Intel is interested in marketing iWarp to the world at large. You may never see iWarp silicon available in quantity.

>I would like both Intel and Inmos representatives take this bait... it has often been
>pointed out in this group that "Manufacture-X versus Manufacturer-Y" wars
>often through up quite a lot of useful information in all their smoke and fury,
>and I've been disappointed that no-one at Inmos has yet responded to this
>long, but interesting description of the iWARP.

Perhaps the silence speaks for itself. I'm personally of the opinion that Inmos has already said far more than was really necessary about a chip that doesn't exist yet.

I am fascinated by the geographically differing perceptions of the microprocessor market.
I get the idea that the two most-talked-about micro architectures in the UK are the xputer and the ARM, both of which are practically unknown in the USA. Consequently I suppose that Inmos is seen in the UK as a major competitor of Intel. But then, they run funny network protocols over there, and drive on the wrong side of the road :^). -- Jeff Carroll carroll@ssc-vax.boeing.com "...and of their daughters it is written, 'Cursed be he who lies with any manner of animal.'" - Talmud
sysmgr@KING.ENG.UMD.EDU (Doug Mohney) (06/06/91)
In article <4077@ssc-bee.ssc-vax.UUCP>, carroll@ssc-vax (Jeff Carroll) writes:
>b) Availability notwithstanding, my contacts at Intel have not
>   convinced me that Galactic Intel is interested in marketing iWarp
>   to the world at large. You may never see iWarp silicon available
>   in quantity.

And what's the current demand for lots of iWarp machines? We're talking big bucks here for a system based upon Intel silicon (which some would argue is cursed from the moment it enters the factory grounds ;-) from a company whose primary business is microprocessors, not large systems. Large multiprocessor systems are made by other folks, like BBN and Thinking Machines.

Signature envy: quality of some people to put 24+ lines in their .sigs
-- > SYSMGR@CADLAB.ENG.UMD.EDU < --
dc@dcs.qmw.ac.uk (Daniel Cohen;E303) (06/07/91)
In <4077@ssc-bee.ssc-vax.UUCP> carroll@ssc-vax (Jeff Carroll) writes:
>I am fascinated about the geographically differing perceptions of the
>microprocessor market. I get the idea that the two most-talked-about
>micro architectures in the UK are the xputer and the ARM, both of which
>are practically unknown in the USA. Consequently I suppose that Inmos
>is seen in the UK as a major competitor of Intel.
>But then, they run funny network protocols over there, and drive on the
>wrong side of the road :^).

Well, I can't argue about the arseways ( tr. assways ) addressing and driving on the "wrong" side of the road, but your earlier point is nonsense! The transputer is talked about a lot because it's interesting, not because we think it's a serious threat to the 386! We don't see Inmos "as a major competitor of Intel". On the contrary, most British people are so used to the commercial failure of UK innovations that we're surprised Inmos has lasted this long :-) As for the ARM, I've no idea why you think it's talked about a lot here.

All credit to Inmos for surviving in difficult circumstances; just maybe they will seriously threaten Inmos some day! Until then we're under no illusions.
--
Daniel Cohen                    Department of Computer Science
Email: dc@dcs.qmw.ac.uk         Queen Mary and Westfield College
Tel: +44 71 975 5249/4/5        Mile End Road, London E1 4NS, UK
Fax: +44 81 980 6533
*** Glory, Glory, Hallelujah ***
davidb@inmos.co.uk (David Boreham) (06/07/91)
In article <2622@m1.cs.man.ac.uk> mshute@cs.man.ac.uk (Malcolm Shute) writes:
>Am I correct in assuming that this is Intel's answer to Inmos' Transputer?

Hey, I've never been able to understand what's wrong with a 68K and four UARTs -:)

Although comparisons are clearly going to be made, I don't believe that the iWARP and Transputer are trying to solve the same problems. Similarly for the TMS320C40 (perhaps more interesting than the iWARP, save the lack of routing).

The more guys there are out there with silicon supporting message-based MIMD systems, the better as far as I'm concerned. The field is so devoid of useful solutions that there's plenty of scope for offering different features which fit different application areas: e.g. different communications speed/pins used, different CPU types (vector, scalar, FP, large, small...), different routing capabilities (good for <10 CPUs, good for >1000 CPUs...), different virtual channel capabilities (none, a few, many...).

Oh, Jeff---check out which side of the road they drive on in Japan :^).

David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol, England             |     (us): uunet!inmos.com!davidb
+44 454 616616 ex 547        |  Internet: davidb@inmos.com
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (06/09/91)
In article <16477@ganymede.inmos.co.uk> davidb@inmos.co.uk (David Boreham) writes:
>Although comparisons are clearly going to be made, I don't believe
>that the iWARP and Transputer are trying to solve the same
>problems.
>The more guys there are out there with silicon supporting
>message-based MIMD systems, the better as far as I'm concerned.

Agreed, but note that iWarp isn't message-based. It comes out of the "systolic" idea, originally popularized by (oddly enough) Kung. A torus of iWarp nodes *can* pass messages - and fairly well. However, that's not the main idea. The intention is that raw untagged *values* can be streamed around the interconnect. The node design caters to tight loops which read from registers that happen to be in-queues, and write to registers that happen to be out-queues. This programmable systolic behavior is sometimes incorrectly called dataflow.

It works very well for some problem domains, notably image processing. The point to notice is that the streamed data may avoid burning memory cycles. This worked well for the Columbia IQCD, which was reporting a sustained 6 GFLOPS more than a year ago. (Their problem domain is *really* limited: they just do quark physics.)
--
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute
glew@pdx007.intel.com (Andy Glew) (06/10/91)
I couldn't resist:

>[Daniel Cohen:]
>All credit to Inmos for surviving in difficult circumstances; just
>maybe they will seriously threaten Inmos some day!

Inmos threatening Inmos! Now isn't that a typical British computer story!

(Look, I'm allowed to say this: (1) I'm a British national; (2) I'm a Canadian national as well, and I could almost as easily have said "Canadian computer story"; (3) It's funny. Well, maybe my post isn't funny, but the original typo is. Go on, laugh already.)
--
Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway,
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions of Intel Corp.
dc@dcs.qmw.ac.uk (Daniel Cohen;E303) (06/11/91)
In <GLEW.91Jun9111517@pdx007.intel.com> glew@pdx007.intel.com (Andy Glew) writes:
>>[Daniel Cohen:]
>>All credit to Inmos for surviving in difficult circumstances; just
>>maybe they will seriously threaten Inmos some day!
>Inmos threatening Inmos! Now isn't that a typical British computer story!
>(Look, I'm allowed to say this: (1) I'm a British national; (2) I'm a
>Canadian national as well, and I could almost as easily has said
>"Canadian computer story"; (3) It's funny. Well, maybe my post isn't
>funny, but the original typo is. Go on, laugh already.)
>--
>Andy Glew, glew@ichips.intel.com
>Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway,
>Hillsboro, Oregon 97124-6497

Yup, I'm laughing. Perhaps it was a Freudian slip :-) Inmos, Intel, what's a few billion between friends?

PS. What's happened to iSC with the iPSC/860? Have they merged with the Touchstone project, or do they have independent plans?
--
Daniel Cohen                    Department of Computer Science
Email: dc@dcs.qmw.ac.uk         Queen Mary and Westfield College
Tel: +44 71 975 5249/4/5        Mile End Road, London E1 4NS, UK
Fax: +44 81 980 6533
*** Glory, Glory, Hallelujah ***
steved@lion.inmos.co.uk (Stephen Doyle) (06/13/91)
glew@pdx007.intel.com (Andy Glew) writes:

>>[Daniel Cohen:]
>>All credit to Inmos for surviving in difficult circumstances; just
>>maybe they will seriously threaten Inmos some day!

>Inmos threatening Inmos! Now isn't that a typical British computer story!

>(Look, I'm allowed to say this: (1) I'm a British national; (2) I'm a
>Canadian national as well, and I could almost as easily has said
>"Canadian computer story"; (3) It's funny. Well, maybe my post isn't
>funny, but the original typo is. Go on, laugh already.)

Quoting from the RISC Management Newsletter 1/1/91:

RISC Processor Unit Shipments (000s)

Processor      1989    1990    cum.

transputer      190     240     540
SPARC            83     185     275
Rx000            35      90     130
Am29000          12      85      97
i960              8      65      73
i860         samples     65      65
ARM              40      55     145
M88100           10      50      60
CLIPPER          33      23      68

Totals          411     858    1453

So, INMOS are only surviving are we? Looks to me like we are holding the
lead in RISC processor sales.

Steve Doyle,               | Tel +44 454 616616
INMOS Ltd, 1000 Aztec West | Fax +44 454 617910
Almondsbury                | UK: steved@inmos.co.uk
Bristol BS12 4SQ, UK       | INET: steved@inmos.com
colin@array.UUCP (Colin Plumb) (06/14/91)
In article <16586@ganymede.inmos.co.uk> steved@inmos.co.uk (Stephen Doyle) writes:

>So, INMOS are only surviving are we? Looks to me like we are holding the
>lead in RISC processor sales.

Well, except for the minor detail that the Transputer, focused as it is
on minimising the semantic gap, is actually quite a CISC. You can draw
parallels with the Intel 432, although the thrust of the development is
quite different (CSP rather than capabilities).

The point still stands, however, that the Transputer outsells the i860
and i960 combined.
--
	-Colin
carroll@ssc-vax (Jeff Carroll) (06/14/91)
In article <16586@ganymede.inmos.co.uk> steved@inmos.co.uk (Stephen Doyle) writes:

>RISC Processor Unit Shipments (000s)
>
>Processor      1989    1990    cum.
>
>transputer      190     240     540
>SPARC            83     185     275

(the other guys' numbers elided - you get the idea)

>So, INMOS are only surviving are we? Looks to me like we are holding the
>lead in RISC processor sales.

Let's compare apples and apples here. How many of those xputers are T8s?
It seems to me that comparing T2s and T4s with SPARCs and i960s and
M88xxxs is engaging in creatively wishful thinking.

If you want to call the xputer a RISC architecture, fine. But don't
pretend that it's competing with the SPARC, the MIPS line, or the i960.
The transputer is a niche product. The other semiconductor houses have
(until now) been content to leave that niche to Inmos. With increased
use of multiprocessor systems on the horizon, however, the other guys
will be looking to increase their market share, and some of them have
demonstrated better ability to rapidly bring silicon to market than
Inmos.

Now, I like xputers for what they are, but I'm glad I don't have to
program my PC in Occam...
--
Jeff Carroll		carroll@ssc-vax.boeing.com
"...and of their daughters it is written, 'Cursed be he who lies with
any manner of animal.'" - Talmud
pcg@aber.ac.uk (Piercarlo Grandi) (06/15/91)
On 13 Jun 91 09:37:51 GMT, steved@lion.inmos.co.uk (Stephen Doyle) said:

steved> Quoting from the RISC Management Newsletter 1/1/91

steved> RISC Processor Unit Shipments (000s)

steved> Processor      1989    1990    cum.

steved> transputer      190     240     540
steved> [ ... smaller figures for SPARC, Rx000, Am29K, ix60, ARM, M88K,
steved> CLIPPER omitted ... ]

Frankly, this newsletter is crazy. The Transputer is not by any stretch
of the imagination a RISC system (unless you define RISC == non CISC),
and, even more importantly, it is not a general purpose CPU like the
other systems listed. To put it into that league table is ridiculous.

On the other hand the figures confirm that the ARM is deservedly one of
the more popular RISC designs, and this is about as good for the
prestige of the UK design companies as anything.

The transputer *is* a success story, but not as a general purpose
processor, just as a high-end embedded systems engine. In that market it
is not by any means the leader, but it is still significant.

So maybe it would be more interesting to see the figures for sales of
CPUs in the 32 bit embedded systems market, and to see it split between
CISC sales (32k, 68k, x86) and non-CISC (which is not by any means the
same as RISC) sales.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk