[comp.hypercube] Hypercube Survey response

lls@mimsy.UUCP (Lauren L. Smith) (07/08/87)

These are the responses I received to my request for information
on the hypercube machines available commercially.
------------------------------------------------------------------
From: wunder@hpcea.HP.COM (Walter Underwood)
Date: 20 Jun 87 00:38:22 GMT
Organization: HP Corporate Engineering - Palo Alto, CA

When looking at hypercube-type systems, don't forget Meiko
in Bristol, England.  Theirs is based on the Transputer, and has been
shipping for a little while.

---------------------------------------------------------------------
Date: Wed, 17 Jun 87 15:36:29 PDT
From: seismo!gatech!tektronix!ogcvax!pase (Douglas M. Pase)
Organization: Oregon Graduate Center, Beaverton, OR

Of the four you mentioned, here are my experiences:

1)  Intel's iPSC

   I am most familiar with this system.  I have written several programs of
various sizes for this machine, and am currently working on a moderately large
language implementation for it.

   It uses the Intel 80286/287 processor pair and connects 16/32/64/128
machines together using Ethernet chips.  Each node has 512K bytes of memory,
but that can be expanded to 4 1/2 M bytes (I think that's right) by removing
alternate processor boards (cutting the number of nodes per chassis in half).
Each processor does about 30K floating-point operations per second.

   The message size is limited to 16K bytes.  Transfer time varies
from about .0025 sec for an empty message to about .035 sec for a 16K-byte
message, both traveling 5 hops.  The same messages take about .0001 sec and
.030 sec going only one hop.
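
For a rough sense of what those timings imply, the little calculation below
(my own back-of-the-envelope arithmetic, not a figure from Intel) converts the
quoted 16K-byte transfer times into effective bandwidths, assuming 16K = 16384
bytes:

    #include <stdio.h>

    /* Effective bandwidth implied by the 16K-byte transfer times quoted
     * above.  My own arithmetic, not an Intel figure. */
    int main(void)
    {
        double bytes       = 16384.0;   /* "16K-byte" message            */
        double t_five_hops = 0.035;     /* seconds, 16K message, 5 hops  */
        double t_one_hop   = 0.030;     /* seconds, 16K message, 1 hop   */

        printf("5 hops: about %.0f Kbytes/sec\n", bytes / t_five_hops / 1024.0);
        printf("1 hop:  about %.0f Kbytes/sec\n", bytes / t_one_hop / 1024.0);
        return 0;
    }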

   Software development is encumbered somewhat by the different memory models
the compiler(s) and architecture must support (small, medium, and several
"large" memory models), but for me that has been more of an annoyance than
a restriction.  My only real problem has been figuring out what set of flags
to use when I compile the programs.  They're all documented; I've just been
lazy about looking them up.

   Their communication utilities are reasonable, but the nomenclature they use
seems a bit strange and misleading -- a channel is not really a communication
channel as much as it is a "FILE" descriptor.  A process ID is not really an
identifier of a process as much as it is an identifier of the channel
descriptor.  This has taken time to get used to, but once I was used to it,
no problem.

   Messages are addressed with what they call a type, a process ID, and a
node, which together identify the intended recipient.  The node always
specifies the hardware node ID, which is both expected and reasonable.

   In summary, I like the machine, but could suggest some important
improvements in both the hardware and software.  I've also heard some comments
that it's a hard machine to use, too slow, etc.  To those I respond: the
distributed-memory multiprocessor model is, by nature, the most difficult model
to use.  I see nothing in the Intel design which makes it more difficult than
any other machine, and I see some "conveniences" which do simplify my tasks.
Some good hardware speedups are in the works, too.  It's a good platform for
distributed processing research.


2)  NCUBE

   I don't know much about NCUBE, except that it uses 680x0 technology, which I
personally prefer over the 80x86 line.  I have also heard it has only 128K bytes
of memory per node, which would be inadequate for my purposes; I'm having a
hard time getting by with 512K.  I've heard favorable rumors about its price
tag, but that's it -- just rumors.


3)  FPS T-Series

   The T-Series is a hefty machine family.  The price tag is steep even at 2
or 4 processors, but you get some good array processor technology along with
it.  Unless your inclination is towards weather or hypersonic aerodynamic
simulation, I personally would stay away from this beast.  It is sort of like
having a micro-Cray at each node.  I have heard stories (I worked at FPS, and
as such was entertained by some of the "war stories", most of which appeared
in local newspapers) about hardware and software "anomalies" which could make
the original design quite painful to use.  But in all fairness, the new
management at FPS is going to great lengths to correct the problems.  I also
personally know most of the software team assigned to the T-Series, and I have
a great deal of confidence in them.

----------------------------------------------------------------------
Date: Thu, 18 Jun 87 13:00:07 CDT
From: grunwald@m.cs.uiuc.edu (Dirk Grunwald)

Hullo,
From: fosterm@ogc.edu
Subject: iPSC performance measurements


The following report contains performance measurements we have made on
the Intel hypercube (iPSC) for the 2.1 Release and the Beta 3.0 Release.
Ditroff source for the report and sources for the test programs are 
available on request.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

       Comparative Performance of Two Intel iPSC Node
                     Operating Systems


                       Mark Foster

              Computer Science and Engineering
                   Oregon Graduate Center
                    Beaverton, OR 97006

                 Revision 1.2 of 86/11/07


          A program has been constructed to  measure  band-
     width and latency of node-to-node communications in the
     Intel iPSC.  The purpose of the  measurements  was  to
     establish a comparison between the iPSC Release 2.1 Node
     Operating System ("k286") and the iPSC Beta  3.0  Node
     Operating System ("nx").  This report compares the per-
     formance of the two systems and shows the  improvements
     realized in the Beta 3.0 release.



1.  Test Environment.

The tests were run on a D-5 (32-node) system.  The entire
suite of tests was run under both the Release 2.1 node
operating system and the Beta 3.0 node operating system.

2.  Test Algorithm.

The test involves measurements of communication between
exactly two nodes.  One node is designated as a base node
and another as a relay node.  The base node sends a message
of a particular size to the relay node; the relay node
collects the entire message, then sends it back to the base
node.  The base node uses its system clock to measure the
elapsed time between sending of the message and receipt of
the returned message.  This send-relay-receive loop is
repeated 2500 times to create a composite sum of elapsed
time.

Initial synchronization  of  the  base  and  relay  node  is
ensured  to  prevent timing measurement of any period during
which the relay is not yet ready to begin the test.
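
To make the procedure concrete, here is a minimal sketch of the base node's
side of such an echo loop.  It is not the original test program: send_msg(),
recv_msg(), and read_clock_msec() are hypothetical placeholders for the iPSC
node library's send, receive, and clock routines, declared here only to show
the shape of the timing loop.

    /* Sketch of the base node's send-relay-receive timing loop.
     * send_msg(), recv_msg(), and read_clock_msec() are hypothetical
     * stand-ins for the actual iPSC node library routines. */

    #define LOOPS     2500
    #define MAX_BYTES 8192

    extern void send_msg(int dest_node, char *buf, int len);    /* hypothetical */
    extern int  recv_msg(int src_node, char *buf, int maxlen);  /* hypothetical */
    extern long read_clock_msec(void);                          /* hypothetical */

    /* Send a message of 'len' bytes to the relay node, wait for it to be
     * echoed back, repeat LOOPS times, and return the total elapsed time
     * in milliseconds as read from the base node's clock. */
    long time_echo(int relay_node, int len)
    {
        static char buf[MAX_BYTES];
        long start, i;

        start = read_clock_msec();
        for (i = 0; i < LOOPS; i++) {
            send_msg(relay_node, buf, len);
            recv_msg(relay_node, buf, MAX_BYTES);
        }
        return read_clock_msec() - start;
    }

The relay node's side is the mirror image: it receives each message in full
and immediately sends it back to the base node.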

3.  Results.

Three main trials were run for each type of communication:
(i) adjacent neighbor, (ii) one-hop neighbor, and (iii)
two-hop neighbor.  For each trial, the message size was
varied from 0 bytes to 8K bytes.

The bandwidth statistics were calculated with the formulas:

    k286bpmsec = total_bytes / k286msec
    nxbpmsec = total_bytes / nxmsec


where k286msec and nxmsec are the average elapsed times  for
a given test, in milliseconds, and where

    total_bytes     = total_messages * aggregate_length

    total_messages  = 2500 * 2

                    This value reflects the number of messages
                    passed between the two nodes.

    aggregate_length= (test parameter: User Message Size)  +
                    overhead

                    In calculation  of  the  statistics,  an
                    additional  20  bytes  per 1K packet was
                    added to account  for  per-packet  over-
                    head.



Nodes   User Message   k286msec   nxmsec       k286            nx      speedup
(base relay)  Size (bytes)                (bytes/msec)   (bytes/msec)    ratio

0   1          0         20309     11044        4.92           9.05       1.84
0   1         10         21611      7888        6.94          19.02       2.74
0   1        500         25414     11634      102.31         223.48       2.18
0   1       1024         29842     16035      174.92         325.54       1.86
0   1       4096         83822     64265      249.10         324.90       1.30
0   1       8192        158768    128011      263.03         326.22       1.24
0   3          0         29896      9577        3.34          10.44       3.12
0   3         10         31757     10335        4.72          14.51       3.07
0   3        500         38430     17211       67.66         151.07       2.23
0   3       1024         45016     23993      115.96         217.56       1.88
0   3       4096        265514     82553       78.64         252.93       3.22
0   3       8192        418403    161532       99.81         258.52       2.59
0   7          0         40341     12607        2.48           7.93       3.20
0   7         10         42222     13450        3.55          11.15       3.14
0   7        500         51816     22809       50.18         113.99       2.27
0   7       1024         61278     32757       85.19         159.36       1.87
0   7       4095        367382     96894       56.82         215.44       3.79
0   7       8192        612427    185931       68.19         224.60       3.29
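
As a worked check on the formulas above, the sketch below (my own
reconstruction, not part of the report or its test programs) recomputes the
bytes/msec columns for the 8192-byte, adjacent-node row of the table; the
packets() helper is my reading of "an additional 20 bytes per 1K packet".

    #include <stdio.h>

    /* Reconstruction of the report's bandwidth arithmetic.  Names follow
     * the formulas above; packets() is my interpretation of the
     * 20-bytes-per-1K-packet overhead. */

    #define TOTAL_MESSAGES  (2500L * 2L)  /* messages passed between the two nodes */
    #define PACKET_OVERHEAD 20L           /* overhead bytes per 1K packet          */
    #define PACKET_SIZE     1024L         /* user bytes carried per packet         */

    /* Number of 1K packets for a user message of 'size' bytes; an empty
     * message still occupies one packet. */
    static long packets(long size)
    {
        long n = (size + PACKET_SIZE - 1) / PACKET_SIZE;
        return (n > 0) ? n : 1;
    }

    /* bytes/msec for one trial, given user message size and elapsed msec */
    static double bandwidth(long size, double msec)
    {
        long aggregate_length = size + packets(size) * PACKET_OVERHEAD;
        long total_bytes      = TOTAL_MESSAGES * aggregate_length;
        return (double) total_bytes / msec;
    }

    int main(void)
    {
        /* 8192-byte messages between adjacent nodes 0 and 1 */
        printf("k286: %.2f bytes/msec\n", bandwidth(8192L, 158768.0));
        printf("nx:   %.2f bytes/msec\n", bandwidth(8192L, 128011.0));
        return 0;
    }

This reproduces the 263.03 and 326.22 bytes/msec entries in that row, which
suggests the per-packet overhead is being applied as described.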






3.1.  Anomalies.

Two anomalies were noted in this examination.  Of particular
note are the bandwidth values for n-hop (n > 0) messages
larger than 1024 bytes under the k286 kernel: the data rate
actually decreases for message sizes of 4K and 8K.  This
problem appears to have been corrected in the Beta 3.0 ker-
nel.  Another, perhaps less significant, anomaly was
detected when sending zero-size messages to an adjacent
node.  We found that the time taken to transmit empty mes-
sages typically increased by 40 percent over one-byte mes-
sages in the Beta 3.0 kernel.  This problem only occurs for
communication between adjacent nodes, and only for messages
of length zero (timings for message lengths greater than
zero are consistent with the characteristic performance
curve).

3.2.  Summary.

We found that the message-passing performance  of  the  Beta
3.0  system  was  improved  by a maximum of almost 3.8 times
over the  2.1  system.   On  the  average,  the  performance
increased 2.5 times.

Our maximum measured communication bandwidth for the Beta 3.0
system, then, is approximately 1/3 Megabyte per second.
The effective minimum node-to-node latency is  approximately
1.5   milliseconds.   Our  measurements  were  found  to  be
equivalent to measurements made by Intel Scientific  Comput-
ers  (iSC)  for  the Beta 3.0 release.  iSC reports that the
3.0 production version, to be released in November, has  1/2
Megabyte  per  second  bandwidth and .893 milliseconds null-
message latency.
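
The summary figures can be recovered from the table with a little arithmetic;
the sketch below is my own reading of the numbers, not part of the report.

    #include <stdio.h>

    /* Rough derivation of the summary figures from the nx columns of the
     * table; my own arithmetic and interpretation, not the report's. */
    int main(void)
    {
        /* Peak nx bandwidth: 326.22 bytes/msec (8192-byte, adjacent-node row). */
        printf("peak bandwidth ~= %.2f Mbytes/sec\n", 326.22 * 1000.0 / 1.0e6);

        /* Minimum latency: the 10-byte adjacent-node nx trial took 7888 msec
         * for 2500 round trips, i.e. 5000 one-way messages.  (The zero-byte
         * trial is skewed by the anomaly noted in section 3.1.) */
        printf("min latency   ~= %.2f msec per message\n", 7888.0 / 5000.0);
        return 0;
    }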



