[comp.arch] Late, Lamented E&S-1 -- what's it look like?

baum@Apple.COM (Allen J. Baum) (11/21/89)

[]
Does anyone have details about the E&S-1? What makes it different/special?
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

lamaster@athena.arc.nasa.gov (Hugh LaMaster) (11/21/89)

According to the preliminary product information, the key "interesting" 
system component was a crossbar switch which permitted 16 processors 
to access up to 256 MBytes of (interleaved) memory.  A second level of
processors/memory was also possible.

I will bet that someone will put a similar crossbar together with an
R6000, 88k, or SPARC system and make multitasking cheap and commercial.
At least the E&S design had some bandwidth in it, unlike most of the
bus-based systems you see out there :-).



  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

wen-king@cit-vax.Caltech.Edu (King Su) (11/21/89)

In article <36652@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>Does anyone have details about the E&S-1? What makes it different/special?

We have one of their prototype systems sitting upstairs.  I
understand it is on a short-term loan, and I think you can look at it
if you care to drop by.  I can't describe how it works because I have
never seen it fully operational myself.  From the sleek look of its
exterior and interior, however, it is quite obvious that a great deal
of engineering work has been put into this machine.

Although the individual processors are 32-bit custom-built, single-chip
micros, the front end is a generic SUN workstation.  The machine consists
of two enormous boxes: one hides the cage for the SUN and a bank of disks,
the other houses the CPUs and memory.  The system runs Mach, which means
you can rlogin to it and run a shell on it as you would on a regular
workstation.  This is a system aimed at competing at the level of
Sequent and BBN rather than the level of the iPSC.  The last time I
checked, the machine contained 16 CPUs, although it has room for more.
-- 
/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/

andy@svcs1.UUCP (Andy Piziali) (11/22/89)

In article <36652@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:

    Does anyone have details about the E&S-1?

The ES-1 is a shared-memory MIMD supercomputer running the Mach operating
system.  The high-end configuration has 8 parallel processors, where each
processor contains 16 computational units (CUs) and 256 MB of memory.  The
CUs are connected to the memory system by a full 8x8 crossbar.  Processors
are connected to one another through a secondary crossbar, which also serves
as the path to the I/O system.
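
The basic crossbar idea is easy to sketch in C: each CU names the memory
port it wants in a given cycle, each port grants at most one CU, and the
losers retry.  The counts and the round-robin tie-break below are
illustrative assumptions only, not a description of the actual ES-1
arbitration hardware:

    #include <stdio.h>

    #define N_REQ   8           /* requesters (CUs) on one side of the crossbar */
    #define N_PORT  8           /* memory ports on the other side               */

    /*
     * One arbitration cycle: want[r] is the port requester r is asking for
     * (-1 if idle).  Each port grants at most one requester per cycle; ties
     * are broken by a rotating priority pointer so nobody starves.
     */
    static void arbitrate(const int want[N_REQ], int grant[N_REQ], int *rr)
    {
        int port, i, r;

        for (r = 0; r < N_REQ; r++)
            grant[r] = 0;

        for (port = 0; port < N_PORT; port++) {
            for (i = 0; i < N_REQ; i++) {
                r = (*rr + i) % N_REQ;
                if (want[r] == port) {
                    grant[r] = 1;   /* this CU owns the port for the cycle */
                    break;          /* any other CU wanting it must retry  */
                }
            }
        }
        *rr = (*rr + 1) % N_REQ;    /* rotate priority for fairness */
    }

    int main(void)
    {
        int want[N_REQ], grant[N_REQ], rr = 0, r;

        for (r = 0; r < N_REQ; r++)
            want[r] = r / 2;        /* pairs of CUs contend for ports 0..3 */

        arbitrate(want, grant, &rr);
        for (r = 0; r < N_REQ; r++)
            printf("CU %d -> port %d : %s\n", r, want[r],
                   grant[r] ? "granted" : "retry");
        return 0;
    }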

Each CU has three independent pipelines: integer, floating point add, and
floating point multiply.  There are 32 integer registers and 64 floating point
registers.  The CU is heavily pipelined.
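
To see why three pipelines per CU are useful, consider the classic DAXPY
inner loop: the index and address arithmetic keeps the integer pipe busy
while the product and the accumulation feed the FP-multiply and FP-add
pipes.  The per-operation comments are only an illustration of the idea,
not generated ES-1 code:

    /* Pipeline notes in the comments are illustrative only. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++) {   /* integer pipe: i++, compare, addressing */
            double p = a * x[i];    /* FP-multiply pipe                       */
            y[i] = y[i] + p;        /* FP-add pipe                            */
        }
    }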

The memory system is 64-way interleaved and fully pipelined.
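
Concretely, 64-way low-order interleaving just means consecutive words land
in consecutive banks, so a pipelined stream of unit-stride references touches
all 64 banks before it revisits one.  The 8-byte word size and low-order bank
selection below are assumptions for the sake of the example:

    #include <stdio.h>

    #define N_BANKS   64        /* 64-way interleave                      */
    #define WORD_SIZE  8        /* bytes per word -- an assumption here   */

    /* Low-order interleave: consecutive words go to consecutive banks. */
    static unsigned bank_of(unsigned long addr)
    {
        return (addr / WORD_SIZE) % N_BANKS;
    }

    static unsigned long offset_in_bank(unsigned long addr)
    {
        return (addr / WORD_SIZE) / N_BANKS;
    }

    int main(void)
    {
        unsigned long a;

        /* A unit-stride sweep cycles through the banks, so a fully
           pipelined memory system can overlap the accesses. */
        for (a = 0; a < 16 * WORD_SIZE; a += WORD_SIZE)
            printf("addr %3lu -> bank %2u, offset %lu\n",
                   a, bank_of(a), offset_in_bank(a));
        return 0;
    }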

    What makes it different/special?

The ES-1 is different in that it delivers Cray-class supercomputer performance
without resorting to vector facilities, by providing a moderate number of CUs
for use by a like number of threads in a single parallel program.  The CUs may
also be time-shared in a multi-user environment like a classical mainframe.

These comments are my opinion, as a computer architect, and do not necessarily
represent those of my employer, Evans and Sutherland Computer Corporation.

baum@Apple.COM (Allen J. Baum) (11/23/89)

[]
>In article <324@svcs1.UUCP> andy@svcs1.UUCP (Andy Piziali) writes:
>The ES-1 is different in that it delivers Cray-class supercomputer performance
>without resorting to vector facilities, by providing a moderate number of CUs
>for use by a like number of threads in a single parallel program.  The CUs may
>also be time-shared in a multi-user environment like a classical mainframe.

This is no different from many other multiprocessors with lotsa processors.
Many of the issues cited were implementation issues (crossbars, heavy
pipelining, separate integer, FP add, and FP multiply units) rather than
architectural ones.  The problem with multiprocessors is to get my dusty
decks running on them.  And, if you have no architectural support for this,
just software, then anyone could use the same technique and slap together a
bunch of micros to achieve the same end.  So, what are the architectural
features that permitted this system to be used effectively?

Note that I'm not saying that a system such as you describe is trivial to
build.  High-performance crossbars can be an incredible mess; in fact, they
are extremely difficult to build without sacrificing latency.  What was the
latency through the crossbar (i.e. how many delay slots were there after a
load)?

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

andy@svcs1.UUCP (Andy Piziali) (11/25/89)

In article <36725@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:

    The problem with multiprocessors is to get my dusty decks running on them.
    And, if you have no architectural support for this, just software, then
    anyone could use the same technique and slap together a bunch of micros to
    achieve the same end.  So, what are the architectural features that
    permitted this system to be used effectively?

The issue of running "dusty decks" on new computers has always been addressed
with software technology: compilers which translate old programs into the new
machine's instruction set and, in the case of parallel machines, map data
parallelism onto multiple processors.  In the case of the ES-1, the compiler
has an intimate knowledge of the CU pipeline for use in scheduling optimized
code, and source-language preprocessors are used to detect parallel operations
and create multiple threads within the original, single-threaded dusty deck.
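
As a cartoon of the transformation such a preprocessor makes (not our actual
tool chain; POSIX threads are used here only so the sketch runs on an
ordinary machine), a serial loop with independent iterations gets carved
into per-thread chunks:

    #include <pthread.h>
    #include <stdio.h>

    #define N        1024
    #define NTHREADS 4              /* stand-in for the number of CUs */

    static double a[N], b[N], c[N];

    struct chunk { int lo, hi; };

    /* The body of the original serial loop, now run over a sub-range. */
    static void *worker(void *arg)
    {
        struct chunk *ck = arg;
        int i;
        for (i = ck->lo; i < ck->hi; i++)
            c[i] = a[i] + b[i];
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct chunk ck[NTHREADS];
        int t, i;

        for (i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        /* The "preprocessor" step: carve the iteration space into
           independent chunks, one per thread. */
        for (t = 0; t < NTHREADS; t++) {
            ck[t].lo = t * (N / NTHREADS);
            ck[t].hi = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, worker, &ck[t]);
        }
        for (t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        printf("c[N-1] = %g\n", c[N - 1]);
        return 0;
    }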

On top of the necessary compiler technology, there must then be architectural
support for coordinating the multiple threads of control created by the
compiler.  In the ES-1, there are three mechanisms for inter-thread
synchronization: atomic memory accesses, signals, and interrupts.

The atomic memory accesses are your typical test-and-set operations: read and
set, read and reset, and reset first.
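
For example, a lock falls straight out of the "read and set" operation:
spin until the old value you read back is zero.  read_and_set() below is a
stand-in for the hardware primitive; the GCC __sync builtins are used only
so the sketch compiles on an ordinary machine:

    typedef volatile int es1_lock_t;        /* 0 = free, 1 = held */

    static int read_and_set(es1_lock_t *w)
    {
        /* Atomically: old = *w; *w = 1; return old. */
        return __sync_lock_test_and_set(w, 1);
    }

    static void lock_acquire(es1_lock_t *l)
    {
        while (read_and_set(l) != 0)
            /* spin: someone else already read 0 and set the flag */ ;
    }

    static void lock_release(es1_lock_t *l)
    {
        __sync_lock_release(l);     /* atomically clear the flag */
    }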

The signal mechanism is a means for threads to asynchronously communicate.  A
hardware control block is constructed by the thread (A) specifying what signals
the thread is expecting.  When another thread (B) sends thread A a signal, the
receipt of the signal is recorded in the control block, and if the thread is
not currently active (running on a CU), a processor running a lower-priority
thread is interrupted.
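
A purely hypothetical software model of that control block and the send path
might look like the following; the field names, widths, and the interrupt
helper are guesses for illustration, not the ES-1's real hardware format:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical view of the per-thread signal control block. */
    struct signal_block {
        uint32_t expected;      /* bit i set: thread is waiting for signal i */
        uint32_t received;      /* bit i set: signal i has arrived           */
        int      active;        /* nonzero while the thread runs on a CU     */
        int      priority;      /* scheduling priority of the thread         */
    };

    /* Stand-in for the hardware action of interrupting a CU that is
       currently running a lower-priority thread. */
    static void interrupt_lower_priority_cu(int priority)
    {
        printf("interrupt a CU running below priority %d\n", priority);
    }

    /* Thread B sends signal `sig` to thread A, the owner of `blk`. */
    static void send_signal(struct signal_block *blk, int sig)
    {
        blk->received |= (1u << sig);                    /* record receipt */

        if ((blk->expected & (1u << sig)) && !blk->active)
            interrupt_lower_priority_cu(blk->priority);  /* get A running */
    }

    int main(void)
    {
        struct signal_block a = { .expected = 1u << 3, .received = 0,
                                  .active = 0, .priority = 5 };
        send_signal(&a, 3);     /* A is idle and expecting signal 3 */
        return 0;
    }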

The CUs in an ES-1 may send interrupts to one another for use in asynchronous
event signalling.

    What was the latency through the crossbar (ie. how many delay slots were
    there after a load?)

I feel more comfortable answering how load latency is hidden in general in the
ES-1 than citing specific machine parameters.  The integer and floating point
registers are independently scoreboarded and are always non-blocking.  There is
no fixed number of delay slots after loads.  Instruction issue is not stalled
until a load destination register is specified as a subsequent source register.
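
Here is that issue rule in miniature: a load marks its destination register
pending, and issue stalls only when a later instruction names a still-pending
register as a source.  This is a toy model of scoreboarding in general, not
the ES-1's actual interlock logic:

    #include <stdio.h>

    #define N_REGS 32

    /* pending[r] is nonzero while a load to register r is in flight. */
    static int pending[N_REGS];

    static void issue_load(int dest)
    {
        pending[dest] = 1;      /* result outstanding; keep issuing */
    }

    static void load_completes(int dest)
    {
        pending[dest] = 0;      /* data returned from memory */
    }

    /* Returns 1 if an instruction reading src1/src2 can issue this cycle. */
    static int can_issue(int src1, int src2)
    {
        return !pending[src1] && !pending[src2];
    }

    int main(void)
    {
        issue_load(4);                          /* load r4 <- memory      */
        printf("add r6 = r1 + r2 : %s\n",       /* independent: no stall  */
               can_issue(1, 2) ? "issues" : "stalls");
        printf("add r7 = r4 + r2 : %s\n",       /* reads r4: must wait    */
               can_issue(4, 2) ? "issues" : "stalls");
        load_completes(4);
        printf("add r7 = r4 + r2 : %s\n",
               can_issue(4, 2) ? "issues" : "stalls");
        return 0;
    }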

baum@Apple.COM (Allen J. Baum) (11/28/89)

[]
In article <329@svcs1.UUCP> andy@svcs1.UUCP (Andy Piziali) writes:
>In article <36725@apple.Apple.COM> Allen Baum asked:
>> What are the architectural features that permitted this system to be used
>> effectively?
>
>In the case of the ES-1, the compiler has an intimate knowledge of the CU
>pipeline

Sorry, but an optimizing compiler that knows about the pipeline is not an
architectural feature; it's a necessity (these days) for good performance.

>On top of the necessary compiler technology, there must then be architectural
>support for coordinating the multiple threads of control created by the
>compiler.  In the ES-1, there are three mechanisms for inter-thread
>synchronization: atomic memory accesses, signals, and interrupts.
>

>The signal mechanism is a means for threads to asynchronously communicate.  A
>hardware control block is constructed by the thread (A) specifying what
>signals the thread is expecting.  When another thread (B) sends thread A a
>signal, the receipt of the signal is recorded in the control block, and if
>the thread is not currently active (running on a CU), a processor running a
>lower-priority thread is interrupted.
>
>The CUs in an ES-1 may send interrupts to one another for use in asynchronous
>event signalling.

OK, those are architectural features.

>>    What was the latency through the crossbar (ie. how many delay slots were
>>    there after a load?)

>I feel more comfortable answering how load latency is hidden in general in the
>ES-1 than citing specific machine parameters.  The integer and floating point
>registers are independently scoreboarded and are always non-blocking. There is
>no fixed number of delay slots after loads.  Instruction issue is not stalled
>until a load destination register is specified as a subsequent source register.

Well, I understand that you may not be comfortable answering the question that
I asked, but I asked for a reason.  Crossbars, as nice as they are, exact a
penalty in access latency.  This penalty is sometimes great enough to cancel
the benefit of having the crossbar in the first place.  If the penalty is great
enough, you may as well have a local-remote architecture.  So.... what is the
penalty?

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum