paul@cybavax.UUCP (Paul Middlehurst) (05/29/86)
Anyone out there had experience with the ELXSI System 6400?

. How fast is it really?
. What of the SysV/4.2 ports?
. What support is available?
. ... anything else you think I should know?

I'd really appreciate any of the above information (and more) and give my thanks/appreciation in advance.

From the terminal of: {U.K.}!ukc!reading!cybavax!paul
A.K.A. Paul Middlehurst
Dept. Computer Science
University College Swansea
Singleton Park
Swansea SA2 8PP
United Kingdom
rfc@calmasd.CALMA.UUCP (Robert Clayton) (06/16/86)
In article <203@cybavax.UUCP>, paul@cybavax.UUCP (Paul Middlehurst) writes:
> Anyone out there had experience with the ELXSI System 6400?
>
> . How fast is it really?

Initially, they claimed 4X a VAX 780 for FORTRAN. They improved their compiler and claimed 6X. They expanded their cache and claimed 7X. In a 10 processor test at Sandia Labs they got 10.1X the power of a single processor. Potentially an 80 MIP machine with a full complement of 12 processors. 64 bit processors. 6 processors fit in a cabinet the size of a 780; 12 processors require two such cabinets with a necessarily short cable joining the bus. 25 nsec bus, 50 nsec processors.

> . What of the SysV/4.2 ports?

Their Unix rides on top of their message-based OS kernel. It's System V. I don't know of any 4.2 work, but by now it's possible.

> . What support is available?
> . ... anything else you think I should know?

Their Gigabus is attractive with 200-300 MB/sec capacity, and they indicate a willingness to support special devices on it such as 100 MFLOP array processors. 64 MB/sec I/O capacity, and more if you need it. Price/MIP is comparable to the VAX 8600; entry level price is about 15% above the 8600. Gene Amdahl's Trilogy Corporation bought out Elxsi about a year ago, and Amdahl is now the head of the company. The only machines I can think of that compare with the ELXSI in terms of capacity are the IBM 3090, at 2-3X the price, and Crays. Their market appears to be the "almost-super" computer market.

Bob Clayton
Calma, San Diego
(619) 587-3147
jel@portal.UUcp (John Little) (06/18/86)
In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
> a 10 processor test at Sandia Labs they got 10.1X the power of a single
> processor.

This is an interesting trick. Does anyone have a clue about how they got a greater than linear speedup? Was this a cpu benchmark or did it include i/o? Can I program my single processor to emulate a multiprocessor configuration and get increased performance :-) ?

John Little
{sun,atari}!portal!jel
rfc@calmasd.CALMA.UUCP (Robert Clayton) (06/20/86)
In article <120@portal.UUcp>, jel@portal.UUcp (John Little) writes:
> In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
> > a 10 processor test at Sandia Labs they got 10.1X the power of a single
> > processor.
>
> This is an interesting trick. Does anyone have a clue about how they
> got a greater than linear speedup? Was this a cpu benchmark or did
> it include i/o?

Since I wrote this, I've been told that the problem involved much context switching and that this overhead (measured on a per processor basis) was reduced when the problem was spread out over several processors.

rfc@calmasd.CALMA.UUCP (Robert Clayton)
mat@amdahl.UUCP (06/21/86)
In article <120@portal.UUcp>, jel@portal.UUcp (John Little) writes:
> In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
> > a 10 processor test at Sandia Labs they got 10.1X the power of a single
> > processor.
>
> This is an interesting trick. Does anyone have a clue about how they
> got a greater than linear speedup? Was this a cpu benchmark or did
> it include i/o?

It was a fixed workload benchmark, running a variety of jobs as I recall. It included I/O, but didn't measure anything but relative CPU throughput capacity. The greater than linear speedup occurs as a result of improved locality of reference, reduced process switching, and better cache performance.

The ELXSI machine uses a message based architecture, and has cached process context (registers, etc.) for 16 processes per processor. It is very cheap to switch to a process that has a process slot, and very expensive to switch to one that doesn't (process 0, the scheduler, must be woken up to purge one process from its slot and set up the slot for the new one before the new one can run). A microcode dispatcher handles the dispatching of "hot" processes that have a slot. Anyway, the existence of more process slots reduces the number of very costly swaps, and, as a byproduct, reduces the cache miss rate, etc. The net result is that these savings more than offset any interprocessor interference losses. Since there is no memory sharing, this interference is small.

It should be pointed out that the message based architecture induces a very high process switch rate, which makes these effects quite different than would be observed in more traditional systems. In a sense, the superlinear speedup is observed because of reducing overheads which make the uniprocessor system run "slower than it should."
--
Mike Taylor ...!{ihnp4,hplabs,amd,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
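[Moderator's note: the process-slot mechanism above can be put into a back-of-envelope model. The sketch below is purely illustrative; every constant (slot counts aside, which come from the post) is invented, not an ELXSI measurement. With more processors the pool of cached process slots grows, fewer switches need the expensive process-0 path, and a fixed workload finishes in better than 1/N the time.]

```python
# Toy model of superlinear speedup from cached process slots.
# SLOTS_PER_CPU = 16 comes from the post; all other numbers are
# hypothetical, chosen only to show the mechanism.

SLOTS_PER_CPU = 16      # cached process contexts per processor
PROCESSES     = 40      # fixed workload: more processes than one CPU's slots
HOT_SWITCH    = 1.0     # relative cost of switching to a "hot" (slotted) process
COLD_SWITCH   = 10.0    # relative cost when process 0 must refill a slot
WORK          = 2000.0  # useful work per process, arbitrary units
SWITCHES      = 100     # switches per process (message-based OS: frequent)

def run_time(cpus):
    """Total elapsed time for the fixed workload on `cpus` processors."""
    slots = SLOTS_PER_CPU * cpus
    hot_fraction = min(1.0, slots / PROCESSES)   # switches that find a slot
    avg_switch = hot_fraction * HOT_SWITCH + (1 - hot_fraction) * COLD_SWITCH
    total = PROCESSES * (WORK + SWITCHES * avg_switch)
    return total / cpus                          # work divides evenly

speedup = run_time(1) / run_time(10)
print(f"10-CPU speedup over 1 CPU: {speedup:.2f}x")  # greater than 10x
```

On one CPU only 16 of the 40 processes ever hold a slot, so 60% of switches pay the cold price; on ten CPUs every process has a slot, so the per-switch overhead collapses and the speedup exceeds the processor count.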
cmt@myrias.UUCP (Chris Thomson) (06/22/86)
> > a 10 processor test at Sandia Labs they got 10.1X the power of a single
> > processor.
>
> This is an interesting trick. Does anyone have a clue about how they
> got a greater than linear speedup?

This was a cache effect. The program being run was almost perfectly parallelizable (how's that for a word?), with almost no synchronization overhead (a linear algebra problem, I think). Thus its total data motion on 10 processors was very nearly the same as on 1 processor, but there was 10 times as much cache available, hence the 10.1 times speedup. The result is hardly a general one, but does speak well of the machine's overall ability to multiprocess.
--
Chris Thomson, Myrias Research Corporation    ihnp4!alberta!myrias!cmt
200 10328 81 Ave, Edmonton Alberta, Canada    403 432 1616
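[Moderator's note: the cache-capacity effect above also reduces to simple arithmetic. The sketch below is illustrative only; working-set size, cache size, and access costs are all invented. The point is that a fixed working set which overflows one processor's cache can fit entirely in ten processors' aggregate cache, so the average access cost drops at the same time the work is divided.]

```python
# Toy model of the cache effect: a perfectly parallel job with a
# fixed working set. All parameters are hypothetical.

WORKING_SET   = 1.25   # MB touched by the whole problem
CACHE_PER_CPU = 1.0    # MB of cache per processor
HIT_TIME      = 1.0    # relative cost of a cache hit
MISS_TIME     = 2.0    # relative cost of a miss
ACCESSES      = 1_000_000

def time_on(cpus):
    """Elapsed time with accesses split evenly and no sync overhead."""
    covered = min(1.0, CACHE_PER_CPU * cpus / WORKING_SET)  # crude hit rate
    avg = covered * HIT_TIME + (1 - covered) * MISS_TIME
    return ACCESSES * avg / cpus

speedup = time_on(1) / time_on(10)
print(f"speedup on 10 CPUs: {speedup:.1f}x")  # above 10x: superlinear
```

One CPU covers 80% of the working set and pays for misses on the rest; ten CPUs cover all of it, so the speedup comes out above 10 even though the model has zero parallelization overhead.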
josh@polaris.UUCP (Josh Knight) (06/22/86)
In article <120@portal.UUcp> jel@portal.UUcp (John Little) writes:
>In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
>> a 10 processor test at Sandia Labs they got 10.1X the power of a single
>> processor.
>
>This is an interesting trick. Does anyone have a clue about how they
>got a greater than linear speedup? Was this a cpu benchmark or did
>it include i/o? Can I program my single processor to emulate a
>multiprocessor configuration and get increased performance :-) ?

We certainly don't have any of these things here, but I should think that more processors might mean more memory. On a time sharing workload, more memory could mean better performance, enough to hide whatever extra (if any) software cost was involved. If you have 10 people editing and 10 CPUs, you may do many fewer context switches, with a concomitant reduction in software costs, not to mention (perhaps) fewer cache misses. There are lots of ways it COULD happen; however, like John, I'm a Little (sorry John) skeptical.

Of course I don't speak for IBM, only me.
--
Josh Knight, IBM T.J. Watson Research
josh@ibm.com, josh@yktvmh.bitnet, ...!philabs!polaris!josh
rb@cci632.UUCP (Rex Ballard) (06/26/86)
In article <120@portal.UUcp> jel@portal.UUcp (John Little) writes:
>In article <1946@calmasd.CALMA.UUCP>, rfc@calmasd.CALMA.UUCP (Robert Clayton) writes:
>> a 10 processor test at Sandia Labs they got 10.1X the power of a single
>> processor.
>
>This is an interesting trick. Does anyone have a clue about how they
>got a greater than linear speedup?

Sure, I've seen it several times in several different situations. The secret is to not count anything other than "CPU" instruction speed. In reality, there are probably DMA, MMU and related controllers that are not included in the MIPS figures. Caching, asynchronous processing, and CPU time normally spent doing other things can also be contributing factors. One really old trick is to use the MMU to do "string moves"; this is especially useful for "pipes" or their equivalents, where you know that the original is no longer needed.

>Was this a cpu benchmark or did it include i/o?

Any multi-processor benchmark requires at least some I/O, even if it is just inter-process "pipes". If the single CPU timings were based on Dhrystones, but the multi was Prolog LIPS or some similar arrangement, the CPU ratings may have actually been too low. Even if the exact same algorithm was used (DMA controllers, ...), the bus contention of DMA to/from the same processor vs. two different processors would still lead to a small (1%) increase for two. From my own experience, I'm surprised they only got 1% on 10 processors; it should have been .3%/processor. Sequent, CCI, and several others have often found performance increases on certain applications (esp. the ones they were designed for).

>Can I program my single processor to emulate a
>multiprocessor configuration and get increased performance :-) ?

In a way, yes! By using an ACRTC rather than a "bit mapped" graphics display, an X.25 serial link instead of an RS-232 link ('rupts every block instead of every character), and about 20 other "tricks", you could actually get 200 times the performance of an equivalent "CPU only" system. It wouldn't show up in the Dhrystones or Whetstones, but it would be noticeable to the user.

A number of 68020 and 68010 boxes have "comm boards" that contain additional processors, including 68008s, 80186s and others, along with DMA, local memory (for buffering), and individual lines. These are usually not taken into consideration when Dhrystones are compared. A 5/30 is nominally rated at 2 MIPS, but there are a minimum of 2 additional 1 MIPS processors "hidden" in the controller boards. A Sun workstation isn't blindingly fast in Dhrystones, but for graphics it would beat a VAX 8600 (if the VAX ran bit-mapped). A Cray X-MP will beat a 6/32 in number crunching any day, but a 6/32 does databases and file servers extremely well. It's simply a matter of planning your system architecture for the type of work you intend to do.

Just to be fair, what benchmarks did they use?
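[Moderator's note: the comm-board point above can be demonstrated in a few lines. This is a sketch, not anything from the thread: a worker thread stands in for a dedicated I/O processor, and sleep() stands in for transfer and compute time. The main "CPU" executes the same instructions either way, so a CPU-only benchmark would rate both runs identically, yet wall-clock time improves when the transfer is offloaded.]

```python
import threading
import time

def io_transfer():
    """Stands in for a DMA / comm-board block transfer."""
    time.sleep(0.2)

def compute():
    """Stands in for the work a Dhrystone-style benchmark measures."""
    time.sleep(0.2)

# Serial: the CPU performs the transfer itself, then computes.
t0 = time.perf_counter()
io_transfer()
compute()
serial = time.perf_counter() - t0

# Offloaded: the transfer proceeds on the "comm board" (worker thread)
# while the CPU computes; the CPU's own instruction count is unchanged.
t0 = time.perf_counter()
worker = threading.Thread(target=io_transfer)
worker.start()
compute()
worker.join()
overlapped = time.perf_counter() - t0

print(f"serial: {serial:.2f}s  overlapped: {overlapped:.2f}s")
```

The overlapped run finishes in roughly half the serial time, which is exactly the kind of gain that is "noticeable to the user" but invisible to Dhrystones or Whetstones.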