[comp.arch] iWarp Architecture Overview

jsutton@iWarp.intel.com (Jim Sutton) (06/04/91)

Over the past few months, there have been sporadic articles and requests
for information concerning the iWarp project appearing in the comp.arch
and comp.parallel newsgroups.  It seems appropriate at this time to provide
a brief description of what an iWarp is, for those of you who may be feeling
left out.

My credentials:  I have been with iWarp since Q3 86.  My primary function
was component/systems architect, but I also had responsibility for the
design of two of the functional units.

This overview provides only a brief, global view of iWarp's architecture and
capabilities, and does not attempt to cover details (our architectural spec
used by the software development teams is over 450 pages!).  Pointers to
further information are provided at the end of this article for those who
would like additional details.

All performance numbers shown are for 20 MHz systems.

---------------------------------------
SYSTEM ARCHITECTURE

iWarp is both a component and system design for scalable parallel computing
systems, aimed at supporting both systolic (fine-grained) and message-passing
(coarse-grained) applications requiring high-speed communications.  Systolic
applications involve few (typically <10) calculations per data element, so a
*balance* between I/O and CPU performance is as critical as raw CPU MFLOPs.

iWarp's primary design focus was on integrating high-performance I/O into
the processor in a way that allows the computational units to have balanced
access to both the I/O channels and memory.  We made a conscious decision to
avoid expensive, experimental (or simply difficult) CPU features found on
current leading-edge processors, to allow us to focus time (and silicon) on
these goals.

An iWarp cell (i.e., node) consists of a single iWarp component and a
bank of fast static RAM.  iWarp cells are joined by directly connecting the
iWarp components through the four bidirectional pathways.  Typical
configurations are 1D (linear) and 2D (mesh) arrays, although other
arrangements are possible.

iWarp is the first commercial product to implement virtual routing channels
in hardware, and extends this concept to *long-lived* virtual channels
(which we call connections) that form a unidirectional path of reserved
resources from the terminal source node to the terminal destination node.
Once a connection has been established, one or more messages may be shipped
along this path.  Connections may also be *shared*; that is, one or more
cells along the path may be senders, and one or more cells along the path
may be receivers.  Connections use a form of street-sign addressing to
choose a routing path through the array; messages have names that identify
the destination cell(s) or process(es) within the connection.  iWarp supports
20 simultaneous connections per cell, with independent buffering, control and
status resources for each.

---------------------------------------
COMPONENT ARCHITECTURE

The iWarp component consists of two essentially asynchronous agents: a
Computation Agent and a Communication Agent.  The Comp Agent contains
integer core hardware, floating point hardware, memory interface and a
128 word Register File (RF).  The Comm Agent handles all pathway traffic, and
manipulation of the virtual channel buffers. 

Following is a brief summary of the iWarp component features.
Unusual features are discussed in the next section.

  Communications
  - 4 bidirectional pathways.
    * Built as unidirectional bus pairs, each 8 bits (+ control) wide.
    * 40 Mbytes/sec per bus (320 Mbytes/sec aggregate).
  - 4-entry Address Match CAM handles cell street-signs and message names.
  - 20 channel buffers (called PCT records)
    * Allows 20 simultaneous connections through/to/from the cell.
      "Express" (thru-cell) traffic is not blocked by "inbound" or
      "outbound" traffic to/from the cell.
    * Express traffic handled automatically, without SW intervention and
      without stealing CPU cycles.
    * 8-word data queue provides smoothing.
    * Programmable stop conditions allow channel to be automatically
      converted from "express" (data flowing through cell) to "inbound"
      (data examined/consumed by cell), and vice-versa.
  - Independent round-robin scheduling for each outbound pathway.

  128 word Register File (RF)
  - 118 general purpose registers + 10 special addresses.
  - Accessible as 8/16/32/64-bit fields on natural boundaries.
  - Heavily multi-ported:
    * Separate ports for integer core, FP adder, FP multiplier allow
      simultaneous access. Each port supports 2-reads/1-write per clock.
    * Special "back-door" ports tied directly to specific registers:
      LM ports and stream gates (see "Special Features").

  Floating Point
  - Adder and Multiplier units:
    * Non-pipelined 2 clock SP, 4 clock DP basic operations.
    * Each 10 MFLOPs SP, 5 MFLOPs DP.
  - 32/64-bit IEEE P754.
    * Full trap support and fast mode, four rounding modes.
  - 3 operand instructions (2 src, 1 dest)
  - Result bypass between adder/multiplier.
  - Multiplier supports divide, remainder and square root instructions.
  - Integer-to-float, float-to-integer, pack and unpack operations.

  Integer Arithmetic
  - 8/16/32-bit data operations.
  - All 1-clock operations; 20 MIPs performance.
  - Arithmetic, shift/rotate, logical and bit operations.
  - 2 operand instructions (1 src, 1 src/dest)
  - Most ops allow 8-bit literal value as 2nd operand.
  - Result bypass to any other core (non-FP) instruction.

  Memory Operations
  - 8/16/32/64-bit load/store instructions.
    * Pre- and post-increment addressing.
    * Small-constant or variable address stride.
    * Big-endian/little-endian transformations for all data types.
  - Internal 256 word instruction cache, 4-way set associative.
  - Internal 2K word instruction ROM.

  Support Operations/Features
  - Stack: push/pop, call/return, allocate/release.
  - Branch and call targets: direct, register-indirect, memory-indirect.
  - System call to protected code through table lookup.
  - Control/status register load/store.
  - Register indirection (dereferences).
  - Four separate sets of flags: FP adder, FP mult, integer and "other".

---------------------------------------
SPECIAL FEATURES

  Hardware Loop Support

     An ENTERLOOP instruction initializes dedicated hardware, including the
     loop-start address (the next instruction), an iteration count, and an
     optional condition code (e.g., carry flag false).  Most instructions
     have an ENDLOOP bit.  If set, then the loop count is decremented, and
     the loop is repeated if the count is non-zero OR the specified condition
     (resulting from a preceding instruction) is false.  This all occurs in
     parallel with normal instruction execution.

     Loops can be nested.  The ENTERLOOP instruction saves the loop controls
     on the stack before changing.  Exiting the loop retrieves the controls.
     A mode in the BRANCH instruction allows early exit from the loop.
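
     As a rough C model of these semantics (illustrative only; this is
     not iWarp assembler syntax), a loop whose instructions carry the
     ENDLOOP bit behaves like:

        int count = n;    /* iteration count loaded by ENTERLOOP       */
        int cond  = 0;    /* optional condition, set by the loop body  */
        do {
            /* ... loop body: ordinary instructions ... */
        } while (--count != 0 || !cond);   /* decrement and test done
                                              by dedicated hardware,
                                              at no added cycle cost   */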

  Stream Gates

     ANY instruction may read/write data directly from/to a channel buffer's
     data queue by reading/writing special addresses in the Register File.
     These special locations are called "stream gates", because they provide
     a gating function that allows a stream of data to pass, a word or double-
     word at a time, between the program and the array. The RF locations have
     no real storage themselves, but pass data to/from the channel buffers
     over a "backdoor" port (bus).

     There are two read gates (G0R,G1R) and two write gates (G0W,G1W).  Each
     gate has a programmable binding that specifies which channel buffer it
     is connected to.  Once bound, any instruction may pass data through the
     channel as a side effect of an ordinary read/write of an RF location.

     If data (or space) is not immediately available in the channel buffer's
     queue, the instruction "spins" before execution, until the data (space)
     is available.  This allows direct program-to-program transfers without
     the costly overhead of checking queue status via instructions.  Spin
     timeouts are available to catch deadlock or error conditions.
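
     For illustration, a systolic stage written in C might look like the
     sketch below.  The gate bindings are shown as hypothetical volatile
     externs; the actual syntax of the iWarp C comm extensions is not
     covered in this article.

        extern volatile float g0r;    /* read gate: a read pops the
                                         bound channel's data queue    */
        extern volatile float g0w;    /* write gate: a write pushes
                                         into the bound queue          */

        void scale_stage(n, k)        /* scale a passing stream by k   */
        int n;
        float k;
        {
            int i;
            for (i = 0; i < n; i++)
                g0w = k * g0r;   /* spins until data/space available   */
        }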

  LM Ports

     The other special addresses in the Register File are "LM Ports", and are
     "special" only in the Compute & Access instruction (see below).  There
     are two read locations (LMR1,LMR2) and one write location (LMW).  Each
     port has real storage associated with it (and is an ordinary register
     for non-C&A instructions), but also has a "backdoor" port to the external
     memory bus.  The C&A instruction uses this back-door port to perform
     faster memory accesses (1-clock each).

  Compute & Access Instruction

     This is the workhorse instruction of the iWarp component.  It is the sole
     *long* instruction (96-bits), and is horizontal in nature, specifying the
     parallel operation of nearly all functional units on the chip.  The C&A
     instruction can initiate:
        1 FP adder operation (2-clocks SP)
        1 FP multiplier operation (2-clocks SP)
        2 back-to-back memory ops (each 1-clock):
          A memory read into LMR1, and either a memory read into LMR2 or a
          memory write from LMW. Each memory op specifies an address register
          and an offset register/literal, allowing pre/post-incrementing
          addresses, with a constant/variable stride. Each memory access
          can be single- or double-word.
        end-of-loop test + repeat

     ANY of the source operands (FP add, FP mult, memory address calc) can
     specify a stream gate read (G0R,G1R).  ANY of the destination operands
     can specify a stream gate write (G0W,G1W).

     This produces a peak performance of:
           20 MFLOPs
        +  20 MIPs (+ loop_test/branch)
        + 160 Mbytes/sec memory ops
        +  80 Mbytes/sec sends
        +  80 Mbytes/sec receives
       (+ 160 Mbytes/sec express (thru cell) traffic)
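
     For example, the following C loop (reusing the hypothetical gate
     bindings from the stream gate sketch above) compiles naturally onto
     one C&A instruction per iteration: an FP multiply, an FP add, a
     memory read, a receive, a send, and the end-of-loop test:

        void axpy_stage(a, c, n)   /* stream out a[i] * input + c      */
        float *a, c;
        int n;
        {
            int i;
            for (i = 0; i < n; i++)
                g0w = a[i] * g0r + c;  /* mult + add + load
                                          + receive + send             */
        }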

  Spools

     Message-passing communication involves transferring blocks of data from
     the memory of a sending cell to the memory of a receiving cell. iWarp
     provides 8 independent, programmable DMA interfaces between memory and
     the channel buffers, called "spools".  Each spool is programmed with
     the channel number, two Register File locations (buffer address and buffer
     limit), stride, data-type (for endian-transformation) and direction
     (to/from memory).  Once programmed and enabled, a spool operation will
     steal cycles from normal CPU operations *only* when data is available.

     The data transfer stops when the buffer limit is reached, or, in the
     case of spools-to-memory, when a "delimiter" (non-data word) reaches
     the top of the queue.  When stopped, the spool records an event, which
     may invoke a service routine.

     A single spool can "max-out" its associated outbound or inbound pathway,
     at 40 Mbytes/sec. Multiple spools (corresponding to concurrent messages
     to/from a cell) are scheduled in a round-robin fashion, and will max-out
     at the memory bandwidth of 160 Mbytes/sec.

     A channel buffer can be attached to a spool at one end, and a stream
     gate at the other.  This allows re-use of a code block in either systolic
     mode (streaming directly to/from the pathway) or message-passing mode
     (streaming to/from spools to/from memory).
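
     As a sketch of what programming one spool involves (the struct and
     function below are invented for illustration; this is not the actual
     RTS interface), a descriptor carries exactly the parameters listed
     above:

        struct spool {
            int   channel;     /* which of the 20 channel buffers      */
            char *buf;         /* buffer address (held in an RF word)  */
            char *limit;       /* buffer limit   (held in an RF word)  */
            int   stride;      /* address stride between transfers     */
            int   dtype;       /* data type, for endian transformation */
            int   to_mem;      /* direction: 1 = pathway-to-memory     */
        };

        /* Describe an inbound (pathway-to-memory) block transfer.
           Enabling the spool is a control-register write not shown.   */
        void describe_inbound(s, chan, buf, nbytes)
        struct spool *s;
        int chan;
        char *buf;
        int nbytes;
        {
            s->channel = chan;
            s->buf     = buf;
            s->limit   = buf + nbytes;
            s->stride  = 4;            /* word-at-a-time               */
            s->dtype   = 0;            /* say, 32-bit words            */
            s->to_mem  = 1;
        }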

  Events

     Because iWarp integrates a sophisticated communication manager with DMA
     transfers and normal CPU core and floating point activity, there are over
     230 synchronous and asynchronous conditions ("events") that may require
     direct or indirect servicing.  iWarp collects these events in a two-tier
     recording/reporting hierarchy.

     Low-level events are recorded in "group" event registers, and reported as
     a single group-event to a top-level EVENT register.  There are individual
     reporting enables at both group and EVENT levels.  Enabled events in the
     EVENT register cause automatic invocation of a service routine, vectored
     through a 64-entry service-routine table.
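
     A small C model of the two-tier reporting (all register, table and
     routine names here are invented; the true event-to-vector mapping
     is richer than the group index used below):

        #define NGROUPS 8                /* illustrative group count   */

        unsigned group_evt[NGROUPS];     /* low-level event bits       */
        unsigned group_en[NGROUPS];      /* per-group report enables   */
        unsigned event_reg, event_en;    /* top-level EVENT + enables  */
        void (*vector[64])();            /* service-routine table,
                                            installed elsewhere        */

        void post_event(g, bit)          /* record a low-level event   */
        int g;
        unsigned bit;
        {
            group_evt[g] |= bit;
            if (group_evt[g] & group_en[g]) /* enabled in its group?   */
                event_reg |= 1 << g;        /* report group-event up   */
            if (event_reg & event_en)       /* enabled in EVENT?       */
                (*vector[g])();             /* vectored service call   */
        }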

     Types of events include the following:
     - Connection arrival.
     - Message arrival with matched name.
     - Express connection flow stopped due to programmed condition.
     - Debug breakpoints.
     - Spool events.
     - Floating point extension and error traps.
     - Protection violations.
     - Timeouts.

---------------------------------------
SYSTEM CONFIGURATIONS

  Quad Cell Board (QCB)
  - 4 processors, in a 2x2 array (max 80 MFLOPs/80 MIPs).
  - .5/2.0 Mbyte per processor on board.
  - daughter board expansion to: 1.5/4.5/6.0 Mbytes per processor.
  Card Cage Assembly (CCA)
  - Up to 16 QCBs (max 64 nodes, 1280 MFLOPs/1280 MIPs).
  - Clock board, fan and power supply.
  System Cabinet
  - Up to 4 CCAs (max 256 nodes, 5120 MFLOPs/5120 MIPs).
  Multi-Cabinet Systems
  - Connected with external cabling.
  - Up to 4 System Cabinets (max 1024 nodes, 20480 MFLOPs/20480 MIPs).

  SBA (Single Board Array)
  - QCB with Sun form factor
  - 4 processors, in a 2x2 array
  - .5/1.0/2.0/4.0 Mbyte per processor
  SIB (System Interface Board)
  - Single processor node
  - .5/1 Mbyte main memory, 64/256 Kbyte dual-port RAM
  - VME interface to host.
  SBA System
  - Max supported by single Sun workstation:
    1 SIB + up to 8 SBAs (2x16 array, 640 MFLOPs/640 MIPs)

  Currently shipping 10 MHz pre-production systems, which will be upgraded
  to 20 MHz in early '92.

---------------------------------------
SOFTWARE

  The following are available now:
     Pathlib- Low level interface for systolic communication.
     RTS-     Run Time System (basic kernel).
     C-       Standard K&R, with global optimizations, assembler inlining
              and iWarp comm extensions.
     Apply-   Image processing parallel program generator.

  The following are expected in Q4 91.
     RTS enhancements.
     C enhancements (machine dependent optimization, incl SW pipelining).
     Fortran 77 with VMS extensions, C/Fortran cross-language function
        inlining and iWarp comm extensions.
     Symbolic debugger (based on GNU).

  Pathlib, RTS, and C are bundled with systems.  Fortran and Apply are
  extra cost.  Symbolic debugger will be bundled with systems.

---------------------------------------
FURTHER INFORMATION

  5-day training classes cover system and communications architecture,
  use of the software tools, application development, and optimization
  tricks.

  Marketing Contact:
    Paul Wiley, Marketing Manager
    5200 NE Elam Young Pkwy, CO4-02, Hillsboro, OR  97124-6497
    (503) 629-6350    fax: (503) 629-6367       wiley@iwarp.intel.com

  There have been numerous iWarp-related papers in journals and conference
  proceedings over the last 2 years.  A good starting list would include:

    iWarp: A 100-MOP LIW Microprocessor for Multicomputers
    C.Peterson, J.Sutton, P.Wiley, IEEE Micro, June 1991

    iWarp: An Integrated Solution to High Speed Parallel Processing
    S.Borkar, et al., Proc. Supercomputing '88, IEEE CS Press, Nov 1988,
    pp 330-339

    Supporting Systolic and Memory Communication in iWarp
    S.Borkar, et al., Proc. 17th Intl Symposium on Computer Architecture
    IEEE CS Press, May 1990, pp 70-81

    Communication in iWarp Systems
    T.Gross, Proc. Supercomputing '89, IEEE CS Press, Nov 1989, pp 436-445

    Apply: A Parallel Compiler for Image Processing Applications
    B.Baxter and B.Greer, Proc. 6th Distributed Memory Computing Conf,
    Apr 29-May 2 1991 <to be published, July?>

 ----------------------------------------------------------------------------
 Jim Sutton, Sr Staff Engineer, intel/iWarp Program   jsutton@iwarp.intel.com
 5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124             (503)629-6345


rfrench@neon.Stanford.EDU (Robert S. French) (06/04/91)

First of all, let me thank Jim for his (long) overview of the iWarp
architecture.  I'm sure it will help many people who don't know what
in the heck we're talking about :-)

There are some questions I have about the iWarp component, though,
specifically about performance.  The iWarp was designed as a
high-powered systolic processor, and thus provides all sorts of neat
communications capabilities.  However, it also needs good integer and
FP support in order to sustain processing rates.  There are a number
of oddities that I noticed in the iWarp specs:

The FP adder takes 2 cycles (SP) or 4 cycles (DP) for all operations
and isn't pipelined, which is pretty much OK considering the short
cycle times.

The FP multiplier takes the same time for multiplication, but performance
isn't nearly as impressive on operations such as division.  For
example, an SP division takes 15-16 clocks, and a DP division takes 31
clocks.  If you'll forgive me for comparing apples and oranges, a MIPS
R3010 can do the same in 12 and 19 cycles, respectively, and can
maintain a higher clock rate.  Likewise, a SP remainder takes "no more
than 162 clocks", and a DP remainder takes "no more than 1,087
clocks", an incredibly long time, although I must admit I've never
personally seen an application that uses FP remainder.  In addition,
considering that throughput is a major goal, it seems unfortunate that
the FP multiplier isn't pipelined.

The arithmetic unit does most operations in 1 cycle, except that it
doesn't support integer multiply or divide.  You have to use the FP
multiplier for integer multiply (3 cycles), and there doesn't appear
to be any way to do an integer divide at all (convert to FP, divide,
convert back?).  This has the added problem that you can't do an
integer multiply (such as for a multi-dimensional array access) and an
FP multiply or divide at the same time, which I think severely limits
the applicability of the compute&access instruction.

The iWarp has more support for byte and bit-level operations than any
processor I've seen in a long time.  For example, you can reference
the individual bytes of a register as the source or destination for
any arithmetic operation, and you can count bits, set/reset bits, find
the first set bit, etc.  These operations seem odd in a processor
designed for high-powered floating point performance (this is, after
all, why the C&A instruction can do one FPA and one FPM instruction
and two memory ops).  It seems to me that the effort and chip area
devoted to these functions would have been better used building an
integer multiplier, integer divider, and pipelining the FPM unit.

Just some thoughts...

			Rob

jsutton@iWarp.intel.com (Jim Sutton) (06/05/91)

rfrench@neon.Stanford.EDU (Robert S. French) writes:
>                                                    ...  In addition,
> considering that throughput is a major goal, it seems unfortunate that
> the FP multiplier isn't pipelined.

At the time the decision was made (mid '87?), pipelining the FP units
presented some severe challenges:
(1) Prior pipelined FP architectures (and ongoing work at that time)
    emphasized vector performance, but usually at the expense of scalar
    performance.  We wanted high scalar performance as well.
(2) Providing seamless send/receive constructs with full *invisible*
    synchronization was an imposing challenge even in scalar instructions.
    Meshing that into a pipelined FP architecture would have added
    massive complications.
(3) The compiler development required to handle the integrated I/O would
    be enough of a challenge, without adding the complexity of pipelined
    manipulation.
Note that in early '87 the entire iWarp FP design team was only 3 engineers!

Given the knowledge and experience we have *today*, and given the proven
send/receive interface mechanisms we have *today*, I would *now* be
comfortable in specifying a pipelined FP.  But that was not a viable choice
at the time.


> The arithmetic unit does most operations in 1 cycle, except that it
> doesn't support integer multiply or divide.  You have to use the FP
> multiplier for integer multiply (3 cycles), and there doesn't appear
> to be any way to do an integer divide at all (convert to FP, divide,
> convert back?).  This has the added problem that you can't do an
> integer multiply (such as for a multi-dimensional array access) and an
> FP multiply or divide at the same time, which I think severely limits
> the applicability of the compute&access instruction.

We found that virtually all of the multiplies required for multi-dimensional
array accesses occur outside the innermost loop(s).  As a consequence, for
large data sets (which is iWarp's target), integer multiplies occur
infrequently enough that the cost of adding a dedicated integer multiply
unit could not be justified.  Instead, we added a small amount of hardware
to the FP multiplier to allow direct multiplication of integers.
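
For instance, in the C fragment below, the only integer multiply is the
row-address computation, and it executes once per row, outside the
innermost loop:

    void add_rows(a, rows, cols)
    float *a;
    int rows, cols;
    {
        int i, j;
        float *row;
        for (i = 0; i < rows; i++) {
            row = a + i * cols;     /* the only integer multiply       */
            for (j = 0; j < cols; j++)
                row[j] += 1.0;      /* inner loop: no multiplies, just
                                       strided accesses and FP adds    */
        }
    }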

Integer divide is indeed implemented by converting to FP.  Integer divide
was found to occur so infrequently in our target applications that no special
hardware cost could be justified.

One point to keep in mind when examining the iWarp architecture is that
all tradeoffs and optimizations center around the following target:
* A tight loop (frequently a single C&A instruction) performing SP floating
* point adds and multiplies, with 1-2 memory accesses, 1-2 sends and 1-2
* receives per iteration.


> The iWarp has more support for byte and bit-level operations than any
> processor I've seen in a long time.  For example, ...
>                  ...  It seems to me that the effort and chip area
> devoted to these functions would have been better used building an
> integer multiplier, integer divider, and pipelining the FPM unit.

The only bit-level instructions (other than ordinary logical operations)
are bit-test/set/clear instructions. These were included to reduce the
cycles required in manipulating the communication control and status
registers.  This (slightly) reduces the software overhead associated
with communications, at minimal silicon cost.

The byte and half-word operations were provided to allow efficient support
of C.  Without these operations, we face two unpleasant alternatives:
(1) Software must "promote" operands to 32-bit fields, perform the desired
    function, then "demote" the result.  This adds substantial extra cycles,
    particularly if exact results are to be maintained (see the sketch
    after this list).
(2) Define the char/short/int data types as 32-bit fields.
    This consumes substantially more memory.  In ordinary systems with large
    amounts of DRAM, this may not be an issue, but iWarp's design goals
    required a very-fast (and expensive) all-SRAM memory system, which means
    that efficient memory utilization is essential.
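
As a small C illustration of alternative (1): on a machine with only
32-bit arithmetic, an exact unsigned-char add compiles into a widen,
add, and mask sequence, where iWarp's 8-bit operations complete the
same source line in a single 1-clock instruction:

    unsigned char add_u8(x, y)
    unsigned char x, y;
    {
        /* without byte hardware: promote to int, add, then demote
           (mask) so the stored result is exact modulo 256            */
        return (x + y) & 0xff;
    }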

 ----------------------------------------------------------------------------
 Jim Sutton, Sr Staff Engineer, intel/iWarp Program   jsutton@iWarp.intel.com
 5200 NE Elam Young Pky CO4-03, Hillsboro, OR 97124             (503)629-6345


mshute@cs.man.ac.uk (Malcolm Shute) (06/05/91)

Am I correct in assuming that this is Intel's answer to Inmos' Transputer?

I would like both Intel and Inmos representatives to take this bait... it has
often been pointed out in this group that "Manufacturer-X versus
Manufacturer-Y" wars often throw up quite a lot of useful information in all
their smoke and fury, and I've been disappointed that no-one at Inmos has yet
responded to this long, but interesting, description of the iWarp.
--

Malcolm SHUTE.         (The AM Mollusc:   v_@_ )        Disclaimer: all

carroll@ssc-vax (Jeff Carroll) (06/06/91)

In article <2622@m1.cs.man.ac.uk> mshute@cs.man.ac.uk (Malcolm Shute) writes:
>Am I correct in assuming that this is Intel's answer to Inmos' Transputer?

No, you're not. Professor Kung has been pushing the Warp project at CMU for
a decade or so now, and Intel has been at work on iWarp for several years
funded in part by the Department of Defense. Nobody I know who works for
Intel has ever been of the opinion that Inmos has ever done anything that
merited an answer. (I'm a transputer user too, and I think that there are
some very nice things about the xputer architecture, but I also think
that Inmos has done some things very wrong through the years.)

Now, there are some interesting parallels between the iWarp architecture and
the T9000 (nee H1) xputer, but there are also a couple of important
differences.

a)	Intel has working silicon, now. I have seen it with my own eyes.
	I think nearly everyone will agree that Inmos is nowhere close to
	having a working (prototype, even) T9000. You can buy iWarp
	systems (in those wonderful gray cabinets) NOW. TODAY.

b)	Availability notwithstanding, my contacts at Intel have not 
	convinced me that Galactic Intel is interested in marketing iWarp
	to the world at large. You may never see iWarp silicon available
	in quantity.

>I would like both Intel and Inmos representatives to take this bait... it has
>often been pointed out in this group that "Manufacturer-X versus
>Manufacturer-Y" wars often throw up quite a lot of useful information in all
>their smoke and fury, and I've been disappointed that no-one at Inmos has yet
>responded to this long, but interesting, description of the iWarp.


Perhaps the silence speaks for itself. I'm personally of the opinion that
Inmos has already said far more than was really necessary about a chip that
doesn't exist yet.

I am fascinated by the geographically differing perceptions of the
microprocessor market. I get the idea that the two most-talked-about
micro architectures in the UK are the xputer and the ARM, both of which
are practically unknown in the USA. Consequently I suppose that Inmos
is seen in the UK as a major competitor of Intel.

But then, they run funny network protocols over there, and drive on the
wrong side of the road :^).



-- 
Jeff Carroll		carroll@ssc-vax.boeing.com

"...and of their daughters it is written, 'Cursed be he who lies with 
any manner of animal.'" - Talmud

sysmgr@KING.ENG.UMD.EDU (Doug Mohney) (06/06/91)

In article <4077@ssc-bee.ssc-vax.UUCP>, carroll@ssc-vax (Jeff Carroll) writes:

>b)	Availability notwithstanding, my contacts at Intel have not 
>	convinced me that Galactic Intel is interested in marketing iWarp
>	to the world at large. You may never see iWarp silicon available
>	in quantity.

And what's the current demand for lots of iWarp machines? We're talking big
bucks here for a system based upon Intel silicon (which some would argue is
cursed from the moment it enters the factory grounds ;-) from a company who's
primary business is microprocessors, not large systems. Large multiprocessor
systems are made by other folks, like BBN and Thinking Machines. 

     Signature envy: quality of some people to put 24+ lines in their .sigs
  -- >                  SYSMGR@CADLAB.ENG.UMD.EDU                        < --

dc@dcs.qmw.ac.uk (Daniel Cohen;E303) (06/07/91)

In <4077@ssc-bee.ssc-vax.UUCP> carroll@ssc-vax (Jeff Carroll) writes:

>I am fascinated by the geographically differing perceptions of the
>microprocessor market. I get the idea that the two most-talked-about
>micro architectures in the UK are the xputer and the ARM, both of which
>are practically unknown in the USA. Consequently I suppose that Inmos
>is seen in the UK as a major competitor of Intel.

>But then, they run funny network protocols over there, and drive on the
>wrong side of the road :^).

Well, I can't argue about the arseways ( tr. assways ) addressing and
driving on the "wrong" side of the road, but your earlier point is
nonsense! The transputer is talked about a lot because it's interesting,
not because we think it's a serious threat to the 386! We don't see Inmos
"as a major competitor of Intel". On the contrary, most British people
are so used to the commercial failure of UK innovations that we're
surprised Inmos has lasted this long :-) As for the ARM, I've no idea
why you think it's talked about a lot here.

All credit to Inmos for surviving in difficult circumstances; just maybe
they will seriously threaten Inmos some day! Until then we're under no
illusions.

--
Daniel Cohen              Department of Computer Science 
Email: dc@dcs.qmw.ac.uk   Queen Mary and Westfield College
Tel: +44 71 975 5249/4/5  Mile End Road, London E1 4NS, UK
Fax: +44 81 980 6533      *** Glory, Glory, Hallelujah ***

davidb@inmos.co.uk (David Boreham) (06/07/91)

In article <2622@m1.cs.man.ac.uk> mshute@cs.man.ac.uk (Malcolm Shute) writes:
>Am I correct in assuming that this is Intel's answer to Inmos' Transputer?  

Hey I've never been able to understand what's wrong with 
a 68K and four UARTs -:)

Although comparisons are clearly going to be made, I don't believe
that the iWARP and Transputer are trying to solve the same
problems. Similarly for the TMS320C40 (perhaps more interesting
than the iWARP, save the lack of routing).                        

The more guys there are out there with silicon supporting 
message-based MIMD systems, the better as far as I'm concerned.
The field is so devoid of useful solutions that there's plenty
of scope for offering different features which fit different 
application areas. eg different communications speed/pins used,
different CPU types (vector, scalar, FP, large, small...),
different routing capabilities (good for <10 CPU, good for >1000 CPU...),
different virtual channel capabilities (none, a few, many...).


Oh, Jeff---check out which side of the road they drive on in Japan :^).


David Boreham, INMOS Limited | mail(uk): davidb@inmos.co.uk or ukc!inmos!davidb
Bristol,  England            |     (us): uunet!inmos.com!davidb
+44 454 616616 ex 547        | Internet: davidb@inmos.com

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (06/09/91)

In article <16477@ganymede.inmos.co.uk> 
	davidb@inmos.co.uk (David Boreham) writes:
>Although comparisons are clearly going to be made, I don't believe
>that the iWARP and Transputer are trying to solve the same
>problems.
>The more guys there are out there with silicon supporting 
>message-based MIMD systems, the better as far as I'm concerned.

Agreed, but note that iWarp isn't message-based.  It comes out of the
"systolic" idea, originally popularized by (oddly enough) Kung. A
torus of iWarp nodes *can* pass messages - and fairly well. However,
that's not the main idea. The intention is that raw untagged *values*
can be streamed around the interconnect.  The node design caters to
tight loops which read from registers that happen to be in-queues,
and write to registers that happen to be out-queues.

This programmable systolic behavior is sometimes incorrectly called
dataflow. It works very well for some problem domains, notably image
processing. The point to notice is that the streamed data may avoid
burning memory cycles. This worked well for the Columbia IQCD, which
was reporting a sustained 6 GFLOPS, more than a year ago. (Their
problem domain is *really* limited: they just do quark physics.)
-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute

glew@pdx007.intel.com (Andy Glew) (06/10/91)

I couldn't resist:

>[Daniel Cohen:]
>All credit to Inmos for surviving in difficult circumstances; just
>maybe they will seriously threaten Inmos some day!

Inmos threatening Inmos!  Now isn't that a typical British computer story!




(Look, I'm allowed to say this: (1) I'm a British national; (2) I'm a
Canadian national as well, and I could almost as easily has said
"Canadian computer story"; (3) It's funny. Well, maybe my post isn't
funny, but the original typo is.  Go on, laugh already.)
--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

dc@dcs.qmw.ac.uk (Daniel Cohen;E303) (06/11/91)

In <GLEW.91Jun9111517@pdx007.intel.com> glew@pdx007.intel.com (Andy Glew) writes:

>>[Daniel Cohen:]
>>All credit to Inmos for surviving in difficult circumstances; just
>>maybe they will seriously threaten Inmos some day!

>Inmos threatening Inmos!  Now isn't that a typical British computer story!

>(Look, I'm allowed to say this: (1) I'm a British national; (2) I'm a
>Canadian national as well, and I could almost as easily have said
>"Canadian computer story"; (3) It's funny. Well, maybe my post isn't
>funny, but the original typo is.  Go on, laugh already.)
>--
>Andy Glew, glew@ichips.intel.com
>Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
>Hillsboro, Oregon 97124-6497

Yup, I'm laughing. Perhaps it was a Freudian slip :-)
Inmos, Intel, what's a few billion between friends?

PS. what's happened to iSC with the iPSC/860? Have they merged with the
Touchstone project, or do they have independent plans?

--
Daniel Cohen              Department of Computer Science 
Email: dc@dcs.qmw.ac.uk   Queen Mary and Westfield College
Tel: +44 71 975 5249/4/5  Mile End Road, London E1 4NS, UK
Fax: +44 81 980 6533      *** Glory, Glory, Hallelujah ***

steved@lion.inmos.co.uk (Stephen Doyle) (06/13/91)

glew@pdx007.intel.com (Andy Glew) writes:

>>[Daniel Cohen:]
>>All credit to Inmos for surviving in difficult circumstances; just
>>maybe they will seriously threaten Inmos some day!

>Inmos threatening Inmos!  Now isn't that a typical British computer story!

>(Look, I'm allowed to say this: (1) I'm a British national; (2) I'm a
>Canadian national as well, and I could almost as easily have said
>"Canadian computer story"; (3) It's funny. Well, maybe my post isn't
>funny, but the original typo is.  Go on, laugh already.)
>--
>Andy Glew, glew@ichips.intel.com
>Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
>Hillsboro, Oregon 97124-6497

Quoting from the RISC Management Newsletter 1/1/91

RISC Processor Unit Shipments (000s)

Processor             1989     1990     cum.

transputer             190      240      540
SPARC                   83      185      275
Rx000                   35       90      130
Am29000                 12       85       97
i960                     8       65       73
i860               samples       65       65
ARM                     40       55      145
M88100                  10       50       60
CLIPPER                 33       23       68

Totals                 411      858     1453

So, INMOS are only surviving, are we? Looks to me like we are holding the
lead in RISC processor sales.

Steve Doyle,                             | Tel +44 454 616616
INMOS Ltd, 1000 Aztec West               | Fax +44 454 617910
Almondsbury                              | UK: steved@inmos.co.uk
Bristol BS12 4SQ, UK                     | INET: steved@inmos.com

colin@array.UUCP (Colin Plumb) (06/14/91)

In article <16586@ganymede.inmos.co.uk> steved@inmos.co.uk (Stephen Doyle) writes:
>So, INMOS are only surviving, are we? Looks to me like we are holding the
>lead in RISC processor sales.

Well, except for the minor detail that the Transputer, focused as it is
on minimising the semantic gap, is actually quite a CISC.  You can draw
parallels with the Intel 432, although the thrust of the development
is quite different (CSP rather than capabilities).

The point still stands, however, that the Transputer outsells the i860 and
i960 combined.
-- 
	-Colin

carroll@ssc-vax (Jeff Carroll) (06/14/91)

In article <16586@ganymede.inmos.co.uk> steved@inmos.co.uk (Stephen Doyle) writes:
>RISC Processor Unit Shipments (000s)
>
>Processor             1989     1990     cum.
>
>transputer             190      240      540
>SPARC                   83      185      275

(the other guys' numbers elided - you get the idea)

>
>So, INMOS are only surviving, are we? Looks to me like we are holding the
>lead in RISC processor sales.

	Let's compare apples and apples here. How many of those xputers are
T8s?

	It seems to me that comparing T2s and T4s with SPARCs and i960s and
M88xxxs is engaging in creatively wishful thinking. If you want to call the
xputer a RISC architecture, fine. But don't pretend that it's competing with
the SPARC, the MIPS line, or the i960.

	The transputer is a niche product. The other semiconductor houses
have (until now) been content to leave that niche to Inmos. With
increased use of multiprocessor systems on the horizon, however, the
other guys will be looking to increase their market share, and some of
them have demonstrated better ability to rapidly bring silicon to market
than Inmos.

	Now, I like xputers for what they are, but I'm glad I don't have
to program my PC in Occam...


-- 
Jeff Carroll		carroll@ssc-vax.boeing.com

"...and of their daughters it is written, 'Cursed be he who lies with 
any manner of animal.'" - Talmud

pcg@aber.ac.uk (Piercarlo Grandi) (06/15/91)

On 13 Jun 91 09:37:51 GMT, steved@lion.inmos.co.uk (Stephen Doyle) said:

steved> Quoting from the RISC Management Newsletter 1/1/91

steved> RISC Processor Unit Shipments (000s)
steved> Processor             1989     1990     cum.
steved> transputer             190      240      540
steved> [ ... smaller figures for SPARC, Rx000, Am29K, ix60, ARM, M88K,
steved> CLIPPER omitted ... ]

Frankly, this newsletter is crazy.

The Transputer is not by any stretch of the imagination a RISC system
(unless you define RISC == non CISC), and it is not, even more
importantly, a general purpose CPU like the other systems listed. To put
it into that league table is ridiculous.

On the other hand the figures confirm that the ARM is deservedly one of
the more popular RISC designs, and this is about as good for the
prestige of the UK design companies as anything.

The transputer *is* a success story, but not as a general purpose
processor, just as a high-end embedded systems engine.  In that market
it is not by any means the leader, but it is still significant.

So maybe it would be more interesting to see the figures for sales of
CPUs in the 32 bit embedded systems market, and to see it split between
CISC sales (32k, 68k, x86) and non-CISC (which is not by any means the
same as RISC) sales.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk