[net.arch] ELXSI System 6400 .... Information needed

paul@cybavax.UUCP (Paul Middlehurst) (05/29/86)

Anyone out there had experience with the ELXSI System 6400?

	. How fast is it really?
	. What of the SysV/4.2 ports?
	. What support is available?
	. ... anything else you think I should know?

I'd really appreciate any of the above information (and more) and give my
thanks/appreciation in advance.


From the terminal of:	{U.K.}!ukc!reading!cybavax!paul

A.K.A.	Paul Middlehurst
	Dept. Computer Science
	University College Swansea
	Singleton Park
	Swansea SA2 8PP
	United Kingdom

rfc@calmasd.CALMA.UUCP (Robert Clayton) (06/16/86)

In article <203@cybavax.UUCP>, paul@cybavax.UUCP (Paul Middlehurst) writes:
> Anyone out there had experience with the ELXSI System 6400?
> 
> 	. How fast is it really?

Initially, they claimed 4X a VAX 780 for FORTRAN.  They improved their
compiler and claimed 6X.  They expanded their cache and claimed 7X.  In
a 10 processor test at Sandia Labs they got 10.1X the power of a single
processor.  Potential 80 MIP machine with a full complement of 12
processors.  64 bit processors.  6 processors fit in a cabinet the size
of a 780.  12 processors require two such cabinets with a necessarily
short cable joining the bus.  25 nsec bus, 50 nsec processors.

> 	. What of the SysV/4.2 ports?

Their Unix rides on top of their Message Based OS kernel.  It's System V.
I don't know of any 4.2 work, but by now, it's possible.

> 	. What support is available?

> 	. ... anything else you think I should know?

Their Gigabus is attractive with 200-300 MB/Sec capacity, and they
indicate a willingness to support special devices on it such as 100
MFLOP array processors.  64 MB/Sec I/O capacity and more if you need
it.

Price/MIP is comparable to VAX 8600.  Entry level price about 15%
above the 8600.

Gene Amdahl's Trilogy Corporation bought out Elxsi about a year ago.
Amdahl is now the head of the company.

The only machines I can think of that compare with the ELXSI in terms
of capacity are the IBM 3090, at 2-3X the price, and the Crays.  Their
market appears to be the "almost-super" computer market.

Bob Clayton
Calma, San Diego
(619) 587-3147

jel@portal.UUcp (John Little) (06/18/86)

In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
> a 10 processor test at Sandia Labs they got 10.1X the power of a single
> processor.  

This is an interesting trick. Does anyone have a clue about how they
got a greater than linear speedup?  Was this a cpu benchmark or did
it include i/o?  Can I program my single processor to emulate a
multiprocessor configuration and get increased performance :-) ?

John Little
{sun,atari}!portal!jel

rfc@calmasd.CALMA.UUCP (Robert Clayton) (06/20/86)

In article <120@portal.UUcp>, jel@portal.UUcp (John Little) writes:
> In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
> > a 10 processor test at Sandia Labs they got 10.1X the power of a single
> > processor.  
> 
> This is an interesting trick. Does anyone have a clue about how they
> got a greater than linear speedup?  Was this a cpu benchmark or did
> it include i/o?

Since I wrote this, I've been told that the problem involved much context
switching and that this overhead (measured on a per processor basis)
was reduced when the problem was spread out over several processors.

rfc@calmasd.CALMA.UUCP (Robert Clayton)

mat@amdahl.UUCP (06/21/86)

In article <120@portal.UUcp>, jel@portal.UUcp (John Little) writes:
> In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
> > a 10 processor test at Sandia Labs they got 10.1X the power of a single
> > processor.  
> 
> This is an interesting trick. Does anyone have a clue about how they
> got a greater than linear speedup?  Was this a cpu benchmark or did
> it include i/o?  

It was a fixed workload benchmark, running a variety of jobs as I recall.
It included I/O, but didn't measure anything but relative CPU throughput
capacity.

The greater than linear speedup occurs as a result of improved locality
of reference, reduced process switching, and better cache performance.
The ELXSI machine uses a message based architecture, and has cached
process context (registers, etc.) for 16 processes per processor. It
is very cheap to switch to a process that has a process slot, and very
expensive to switch to one that doesn't (process 0, the scheduler, must
be woken up to purge one process from its slot and set up the slot for
the new one before the new one can run).  A microcode dispatcher handles
the dispatching of "hot" processes that have a slot. Anyway, the existence
of more process slots reduces the number of very costly swaps, and, as a
byproduct, reduces the cache miss rate, etc. Net result is that these savings
more than offset any interprocessor interference losses. Since there is no
memory sharing, this interference is small.

It should be pointed out that the message based architecture induces a very
high process switch rate, which makes these effects quite different from
what would be observed in more traditional systems.

In a sense, the superlinear speedup is observed because of reducing overheads
which make the uniprocessor system run "slower than it should."
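
As a rough illustration of the process-slot argument above, here is a toy
model in Python.  Only the figure of 16 slots per processor comes from the
description; the switch costs, the work per run, the round-robin order and
the LRU eviction policy are all assumptions made for the sketch, and
interprocessor interference is ignored entirely.  It is not a model of the
actual ELXSI scheduler or of the Sandia benchmark.

```python
from collections import OrderedDict

SLOTS_PER_CPU = 16   # from the post: cached context for 16 processes per processor
CHEAP_SWITCH = 1     # ASSUMED cost of switching to a process that holds a slot
COSTLY_SWITCH = 50   # ASSUMED cost when the scheduler must set up (or purge) a slot
WORK_PER_RUN = 20    # ASSUMED useful work done each time a process is dispatched


def cpu_time(process_ids, runs_per_process):
    """Elapsed time for one processor running its processes round-robin.

    Slots are modelled as an LRU set (an assumption, not the real policy).
    """
    slots = OrderedDict()
    elapsed = 0
    for _ in range(runs_per_process):
        for pid in process_ids:
            if pid in slots:
                slots.move_to_end(pid)          # "hot" process: cheap dispatch
                elapsed += CHEAP_SWITCH
            else:
                if len(slots) >= SLOTS_PER_CPU:
                    slots.popitem(last=False)   # purge least recently used slot
                slots[pid] = True
                elapsed += COSTLY_SWITCH
            elapsed += WORK_PER_RUN
    return elapsed


def speedup(n_processes, n_cpus, runs_per_process=100):
    """Ratio of single-processor time to multiprocessor time for a fixed workload."""
    all_pids = list(range(n_processes))
    single = cpu_time(all_pids, runs_per_process)
    # Spread processes evenly; processors run in parallel, so take the slowest.
    shares = [all_pids[i::n_cpus] for i in range(n_cpus)]
    multi = max(cpu_time(share, runs_per_process) for share in shares)
    return single / multi


if __name__ == "__main__":
    print("10 processes on 10 CPUs:", speedup(10, 10))
    print("40 processes on 10 CPUs:", speedup(40, 10))
```

With these made-up costs, a workload that fits in one processor's 16 slots
speeds up linearly, while a workload of 40 processes (which thrashes the
slots of a single processor) comes out well above 10X on 10 processors.
The size of the effect depends entirely on the assumed costs.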

-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

cmt@myrias.UUCP (Chris Thomson) (06/22/86)

> > a 10 processor test at Sandia Labs they got 10.1X the power of a single
> > processor.  
> 
> This is an interesting trick. Does anyone have a clue about how they
> got a greater than linear speedup?

This was a cache effect.  The program being run was almost perfectly
parallelizable (how's that for a word?), with almost no synchronization
overhead (a linear algebra problem, I think).  Thus its total data motion
on 10 processors was very nearly the same as on 1 processor, but there was
10 times as much cache available, hence the 10.1 times speedup.  The result
is hardly a general one, but does speak well of the machine's overall
ability to multiprocess.
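
The cache argument can be sketched with a very simple model in Python.
Everything numeric here is an assumption chosen for illustration (the hit
and miss times, and the rule that the hit fraction is the cache size divided
by the working set); none of it is a measurement of the ELXSI or of the
Sandia run.

```python
def avg_access_time(working_set, cache_size, t_hit=1.0, t_miss=10.0):
    """Average memory access time under an ASSUMED linear hit-rate model."""
    hit_fraction = min(1.0, cache_size / working_set)
    return hit_fraction * t_hit + (1.0 - hit_fraction) * t_miss


def cache_speedup(n_cpus, working_set, cache_size):
    """Speedup of a perfectly parallel job whose data is split evenly.

    Each processor keeps its own cache of cache_size but sees only
    working_set / n_cpus of the data, so its hit rate can improve.
    """
    t1 = avg_access_time(working_set, cache_size)
    tn = avg_access_time(working_set / n_cpus, cache_size)
    return n_cpus * t1 / tn


if __name__ == "__main__":
    # Working set already fits in one cache: plain linear speedup.
    print(cache_speedup(10, working_set=8, cache_size=16))
    # Working set larger than one cache: better than linear.
    print(cache_speedup(10, working_set=64, cache_size=16))
```

When the whole problem already fits in one cache the model gives exactly
linear speedup; once the single-processor working set overflows the cache,
splitting it across processors raises the per-processor hit rate and the
speedup exceeds the processor count.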
-- 
Chris Thomson, Myrias Research Corporation	   ihnp4!alberta!myrias!cmt
200 10328 81 Ave, Edmonton Alberta, Canada	   403 432 1616

josh@polaris.UUCP (Josh Knight) (06/22/86)

In article <120@portal.UUcp> jel@portal.UUcp (John Little) writes:
>In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
>> a 10 processor test at Sandia Labs they got 10.1X the power of a single
>> processor.  
>
>This is an interesting trick. Does anyone have a clue about how they
>got a greater than linear speedup?  Was this a cpu benchmark or did
>it include i/o?  Can I program my single processor to emulate a
>multiprocessor configuration and get increased performance :-) ?
>

We certainly don't have any of these things here, but I should think
that more processors might mean more memory.  On a time sharing workload,
more memory could mean better performance, enough to hide whatever extra
(if any) software cost was involved.  If you have 10 people editing
and 10 CPUs you may do many fewer context switches, with a concomitant
reduction in software costs, not to mention (perhaps) fewer cache misses.
There are lots of ways it COULD happen; however, like John, I'm a Little
(sorry John) skeptical.

Of course I don't speak for IBM, only me.
-- 

	Josh Knight, IBM T.J. Watson Research
 josh@ibm.com, josh@yktvmh.bitnet,  ...!philabs!polaris!josh

rb@cci632.UUCP (Rex Ballard) (06/26/86)

In article <120@portal.UUcp> jel@portal.UUcp (John Little) writes:
>In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
>> a 10 processor test at Sandia Labs they got 10.1X the power of a single
>> processor.  
>
>This is an interesting trick. Does anyone have a clue about how they
>got a greater than linear speedup?

Sure, I've seen it several times in several different situations.
The secret is to not count anything other than "CPU" instruction speed.
In reality, there are probably DMA, MMU and related controllers that are
not included in the MIPS figures.  Caching, asynchronous processing, and
CPU time normally spent doing other things can also be contributing factors.

One really old trick is to use the MMU to do "string moves"; this is
especially useful for "pipes" or their equivalents, where you know that
the original is no longer needed.

>Was this a cpu benchmark or did it include i/o? 

Any multi-processor benchmark requires at least some I/O even if it is
just inter-process "pipes".  If the single CPU timings were based on
Dhrystones, but the multi was Prolog LIPS or some similar arrangement,
the CPU ratings may have actually been too low.

Even if the exact same algorithm was used (DMA controllers,...), the
bus contention of DMA to/from the same processor vs two different
processors would still lead to a small (1%) increase for two.

From my own experience, I'm surprised they only got 1% on 10 processors;
it should have been .3%/processor.

Sequent, CCI, and several others have often found performance increases
on certain applications (esp. the ones they were designed for).

>Can I program my single processor to emulate a
>multiprocessor configuration and get increased performance :-) ?

In a way, yes!  By using an ACRTC rather than a "Bit mapped" graphics
display, an X.25 serial link instead of an RS-232 link ('rupts every
block instead of every character), and about 20 other "tricks", you
could actually get 200 times the performance of an equivalent "CPU
only system".  It wouldn't show up in the Dhrystones or Whetstones,
but it would be noticeable to the user.

A number of 68020 and 68010 boxes have "Comm boards" that contain
additional processors, including 68008s, 80186s and others, along
with DMA, local memory (for buffering), and individual lines.  These
are usually not taken into consideration when Dhrystones are compared.
A 5/30 is nominally rated at 2 MIPS, but there are a minimum of 2
additional 1 MIPS processors "hidden" in the controller boards.

A Sun workstation isn't blindingly fast in Dhrystones, but for graphics,
it would beat a VAX 8600 (if the VAX ran bit-mapped).

A Cray X-MP will beat a 6/32 in number crunching any day, but a 6/32
does data bases and file servers extremely well.  It's simply a matter
of planning your system architecture for the type of work you intend
to do.

Just to be fair, what benchmarks did they use?