lls@mimsy.UUCP (Lauren L. Smith) (07/08/87)
These are the responses I received to my request for information
on the hypercube machines available commercially.
------------------------------------------------------------------
From: wunder@hpcea.HP.COM (Walter Underwood)
Date: 20 Jun 87 00:38:22 GMT
Organization: HP Corporate Engineering - Palo Alto, CA
When looking at hypercube-type systems, don't forget Meiko
in Bristol, England. Theirs is based on the Transputer, and has been
shipping for a little while.
---------------------------------------------------------------------
Date: Wed, 17 Jun 87 15:36:29 PDT
From: seismo!gatech!tektronix!ogcvax!pase (Douglas M. Pase)
Organization: Oregon Graduate Center, Beaverton, OR
Of the four you mentioned, here are my experiences:
1) Intel's iPSC
I am most familiar with this system. I have written several programs of
various sizes for this machine, and am currently working on a moderately
large language implementation for it.
It uses the Intel 80286/287 processor pair and connects 16/32/64/128
machines together using ethernet chips. Each node has 512K bytes of memory,
but that can be expanded to 4 1/2 M bytes (I think that's right) by removing
alternate processor boards (cutting the number of nodes per chassis in half).
Each processor does about 30K flops.
The message size is limited to 16K bytes or less. Transfer rate varies
from about 0.0025 sec for an empty message to about 0.035 sec for a
16K-byte message, both traveling 5 hops; about 0.0001 sec and 0.030 sec
for the same messages going only one hop.
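Taking those quoted one-hop times at face value as one-way transfer
times (my assumption; the posting does not say), the implied per-byte
cost and effective bandwidth work out as follows, sketched in Python:

```python
# Back-of-the-envelope check on the one-hop figures quoted above,
# assuming a simple linear cost model: time(n) = latency + n / bandwidth.
empty_time = 0.0001      # sec, empty message, one hop (quoted above)
full_time = 0.030        # sec, 16K-byte message, one hop (quoted above)
msg_bytes = 16 * 1024

per_byte = (full_time - empty_time) / msg_bytes   # marginal cost per byte
bandwidth = 1.0 / per_byte                        # bytes per second

print(round(per_byte * 1e6, 2), "microseconds/byte")   # 1.82
print(round(bandwidth / 1024), "Kbytes/sec")           # 535
```

So the large-message figures imply roughly half a megabyte per second
per link, with the fixed per-message latency dominating small messages.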
Software development is encumbered somewhat by the different memory models
the compiler(s) and architecture must support (small, medium, and several
"large" memory models), but for me that has been more of an annoyance than
a restriction. My only real problem has been figuring out what set of flags
to use when I compile the programs. They're all documented; I've just been
lazy about looking them up.
Their communication utilities are reasonable, but the nomenclature they use
seems a bit strange and misleading -- a channel is not really a communication
channel as much as it is a "FILE" descriptor, and a process ID is not really
the ID of a process as much as it is an identifier of the channel descriptor.
This took time to get used to, but once I was used to it, no problem.
Messages are addressed by what they call a type, a process ID, and a node,
which together determine which recipient a message is intended for; the node
always specifies the hardware node ID, which is both expected and reasonable.
In summary, I like the machine, but could suggest some important
improvements in both the hardware and software. I've also heard some comments
that it's a hard machine to use, too slow, etc. To those I respond: the
distributed-memory multiprocessor model is, by nature, the most difficult model
to use. I see nothing in the Intel design which makes it more difficult than
any other machine, and I see some "conveniences" which do simplify my tasks.
Some good hardware speedups are in the works, too. It's a good platform for
distributed processing research.
2) NCUBE
I don't know much about NCUBE, except that it uses 680x0 technology, which I
personally prefer over the 80x86 line. I have also heard it has only 128K
bytes of memory per node, which would be inadequate for my purposes -- I'm
having a hard time getting by with 512K. I've heard favorable rumors about
its price tag, but that's it - just rumors.
3) FPS T-Series
The T-Series is a hefty machine family. The price tag is steep even at 2
or 4 processors, but you get some good array processor technology along with
it. Unless your inclination was towards weather or hypersonic aerodynamic
simulation, I personally would stay away from this beast. It is sort of like
having a micro-cray at each node. I have heard stories (I worked at FPS, and
as such was entertained by some of the "war stories", most of which appeared
in local newspapers) about hardware and software "anomalies" which could make
the original design quite painful to use. But in all fairness, the new
management at FPS is going to great lengths to correct the problems. I also
personally know most of the software team assigned to the T-Series, and I have
a great deal of confidence in them.
----------------------------------------------------------------------
Date: Thu, 18 Jun 87 13:00:07 CDT
From: grunwald@m.cs.uiuc.edu (Dirk Grunwald)
Hullo,
From: fosterm@ogc.edu
Subject: iPSC performance measurements
The following report contains performance measurements we have made on
the Intel hypercube (iPSC) for the 2.1 Release and the Beta 3.0 Release.
Ditroff source for the report and sources for the test programs are
available on request.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Comparative Performance of Two Intel iPSC Node
Operating Systems
Mark Foster
Computer Science and Engineering
Oregon Graduate Center
Beaverton, OR 97006
Revision 1.2 of 86/11/07
A program has been constructed to measure bandwidth
and latency of node-to-node communications in the
Intel iPSC. The purpose of the measurements was to
establish a comparison between the iPSC Release 2.1
Node Operating System ("k286") and the iPSC Beta 3.0
Node Operating System ("nx"). This report compares
the performance of the two systems and shows the
improvements realized in the Beta 3.0 release.
1. Test Environment.
The tests were run on a D-5 (32-node) system. The entire
suite of tests was run under both the Release 2.1 node
operating system and the Beta 3.0 node operating system.
2. Test Algorithm.
The test involves measurements of communication between
exactly two nodes. One node is designated as a base node
and another as a relay node. The base node sends a message
of a particular size to the relay node; the relay node
collects the entire message, then sends it back to the base
node. The base node uses its system clock to measure the
elapsed time between sending of the message and receipt of
the returned message. This send-relay-receive loop is
repeated 2500 times to create a composite sum of elapsed
time.
Initial synchronization of the base and relay node is
ensured to prevent timing measurement of any period during
which the relay is not yet ready to begin the test.
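As an illustration only, the send-relay-receive loop can be mimicked on
a single machine with standard Python queues standing in for the iPSC
message primitives (the names and the reduced repetition count are mine,
not part of the original test program):

```python
import threading, queue, time

REPS = 100           # the real test used 2500 round trips
MSG = bytes(1024)    # one of the tested message sizes

to_relay = queue.Queue()   # stand-in for the base -> relay link
to_base = queue.Queue()    # stand-in for the relay -> base link

def relay():
    # The relay collects the entire message, then sends it straight back.
    for _ in range(REPS):
        to_base.put(to_relay.get())

t = threading.Thread(target=relay)
t.start()                      # relay is ready before timing begins

start = time.perf_counter()
for _ in range(REPS):
    to_relay.put(MSG)          # base node sends the message...
    echoed = to_base.get()     # ...and waits for the returned copy
elapsed = time.perf_counter() - start
t.join()

print(f"{REPS} round trips, {elapsed * 1000:.1f} ms total")
```

Starting the relay thread before the timed loop plays the role of the
initial synchronization step described above.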
3. Results.
Three main trials were run for each type of communication:
(i) adjacent neighbor, (ii) one-hop neighbor, and (iii)
two-hop neighbor. For each trial, the message size was
varied from 0 bytes to 8K bytes.
The bandwidth statistics were calculated with the formulas:
k286bpmsec = total_bytes / k286msec
nxbpmsec = total_bytes / nxmsec
where k286msec and nxmsec are the average elapsed times for
a given test, in milliseconds, and where
total_bytes      = total_messages * aggregate_length
total_messages   = 2500 * 2
                   (the number of messages passed between
                   the two nodes)
aggregate_length = (test parameter: User Message Size) + overhead
                   (in calculating the statistics, an additional
                   20 bytes per 1K packet was added to account
                   for per-packet overhead)
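Plugging a row of the measurements into these formulas reproduces the
published bytes/msec figures. The one detail not spelled out is how the
overhead scales; treating it as 20 bytes per 1K packet with a minimum of
one packet (an assumption on my part) is consistent with every row. A
quick check in Python:

```python
import math

def bytes_per_msec(user_size, avg_msec):
    # total_messages = 2500 round trips * 2 messages each
    total_messages = 2500 * 2
    # Assumption: 20 bytes of per-packet overhead per 1K packet,
    # with a minimum of one packet even for an empty message.
    packets = max(1, math.ceil(user_size / 1024))
    aggregate_length = user_size + 20 * packets
    total_bytes = total_messages * aggregate_length
    return total_bytes / avg_msec

# Row "0 1, 1024 bytes": k286msec = 29842, nxmsec = 16035
print(round(bytes_per_msec(1024, 29842), 2))  # k286: 174.92
print(round(bytes_per_msec(1024, 16035), 2))  # nx:   325.54
```

Both values match the corresponding table entries exactly.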
Nodes   User Message   k286msec   nxmsec   k286           nx             speedup
        Size (bytes)                       (bytes/msec)   (bytes/msec)   ratio
0 1            0         20309     11044       4.92           9.05        1.84
0 1           10         21611      7888       6.94          19.02        2.74
0 1          500         25414     11634     102.31         223.48        2.18
0 1         1024         29842     16035     174.92         325.54        1.86
0 1         4096         83822     64265     249.10         324.90        1.30
0 1         8192        158768    128011     263.03         326.22        1.24
0 3            0         29896      9577       3.34          10.44        3.12
0 3           10         31757     10335       4.72          14.51        3.07
0 3          500         38430     17211      67.66         151.07        2.23
0 3         1024         45016     23993     115.96         217.56        1.88
0 3         4096        265514     82553      78.64         252.93        3.22
0 3         8192        418403    161532      99.81         258.52        2.59
0 7            0         40341     12607       2.48           7.93        3.20
0 7           10         42222     13450       3.55          11.15        3.14
0 7          500         51816     22809      50.18         113.99        2.27
0 7         1024         61278     32757      85.19         159.36        1.87
0 7         4095        367382     96894      56.82         215.44        3.79
0 7         8192        612427    185931      68.19         224.60        3.29
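The speedup ratio column is simply k286msec divided by nxmsec;
recomputing a few rows is a quick consistency check on the table above
(Python, elapsed times copied from the table):

```python
# speedup ratio = k286 elapsed time / nx elapsed time
rows = [
    (20309, 11044),   # nodes 0-1, 0-byte messages
    (29842, 16035),   # nodes 0-1, 1024-byte messages
    (367382, 96894),  # nodes 0-7, 4K messages (the best case)
]
for k286, nx in rows:
    print(round(k286 / nx, 2))   # 1.84, 1.86, 3.79
```

The recomputed ratios agree with the published column, including the
3.79 maximum cited in the summary.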
3.1. Anomalies.
Two anomalies were noted in this examination. Of particular
note are the bandwidth values for n-hop (n > 0) message
sizes greater than 1024 using the k286 kernel: the data rate
actually decreases for message sizes of 4K and 8K. This
problem appears to have been corrected in the Beta 3.0
kernel. Another, perhaps less significant, anomaly was
detected when sending zero-size messages to an adjacent
node. We found that the time taken to transmit empty
messages typically increased by 40 percent over one-byte
messages in the Beta 3.0 kernel. This problem occurs only
for communication between adjacent nodes, and only for
messages of length zero (timings for message lengths greater
than zero are consistent with the characteristic performance
curve).
3.2. Summary.
We found that the message-passing performance of the Beta
3.0 system was improved by a maximum of almost 3.8 times
over the 2.1 system. On the average, the performance
increased 2.5 times.
Our maximum measured communication bandwidth of the Beta 3.0
system, then, is slightly more than 1/3 Megabyte per second.
The effective minimum node-to-node latency is approximately
1.5 milliseconds. Our measurements were found to be
equivalent to measurements made by Intel Scientific
Computers (iSC) for the Beta 3.0 release. iSC reports that
the 3.0 production version, to be released in November, has
1/2 Megabyte per second bandwidth and 0.893 milliseconds
null-message latency.
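The ~1.5 ms figure can be recovered from the table: under nx, 2500
round trips of a 10-byte message between adjacent nodes took 7888 ms
in total, i.e. about 1.58 ms per one-way message. The arithmetic, for
the record:

```python
# Minimum one-way latency implied by the nx numbers in the table:
# 2500 round trips of a 10-byte adjacent-node message took 7888 ms.
total_msec = 7888
round_trips = 2500
one_way = total_msec / round_trips / 2   # two messages per round trip
print(round(one_way, 2), "ms")           # 1.58
```

which rounds to the "approximately 1.5 milliseconds" quoted above.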