[comp.parallel] Variable Speed iPSC/860

tom@icase.edu (Tom Crockett) (07/18/90)

[ I got this through the iPSC group mailing out of Cornell. Since it deals
  with the subject matter of this group, I decided to post it. Looks like
  the old non-determinism problem to me. Steve]



I've recently noticed some funny behavior from an application on our
iPSC/860 system which is heavily communication bound.  It seems to run
at one of four different speeds (maybe more, I've only seen four).  I
stumbled on this because I was comparing some performance measurements
between our system (bluecrab.icase.edu, 32 nodes) and the one at NAS
(lagrange.nas.nasa.gov, 128 nodes).  There was good agreement at 1-16
nodes, but when I got to 32 nodes, lagrange gave me results that were
noticeably slower, by nearly a factor of two for some of my data sets. 
There were no other users logged in and no other cubes active when I
made the runs, so interference from CFS traffic and the SRM seemed
unlikely.  In addition, my program is very careful not to include any
I/O operations to either the CFS or the SRM in the timing measurements. 
All timing measurements were made with dclock(), and the interval being
timed is bracketed with gsync() calls to be sure that the processors all
start and finish together.
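
For the record, the timed section on each node looks roughly like this
(kernel() is just a stand-in for one repetition of the application, and the
extern declarations stand in for whatever node-library header declares
dclock() and gsync() on your system):

    extern double dclock();
    extern void   gsync();

    double t0, t1, elapsed;

    gsync();                 /* make sure all nodes start together      */
    t0 = dclock();
    kernel();                /* one repetition -- no CFS or SRM I/O here */
    gsync();                 /* make sure all nodes finish together     */
    t1 = dclock();
    elapsed = t1 - t0;       /* this is the interval I report           */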

In an effort to pinpoint the problem, I re-ran the program on lagrange
with four different 32-node cubes, rooted at processors 0, 32, 64, and
96.  I also modified the input data to repeat the same computation 10
times, in order to check for repeatability.  To my surprise, the average
of 10 repetitions for each of the 4 runs was the same speed as my
original run on bluecrab, and the repeatability between repetitions was
very good, with variations of at most 10%, and typically much less.  I
also re-ran the program locally to check for repeatability, and was even
more surprised to find that it suddenly was running at half speed on
bluecrab.

I have since made several more runs on bluecrab, and identified at least
four different speed ranges.  Within a single run, each of the 10
repetitions shows good repeatability, but between runs the performance
can vary by more than a factor of 2, always seeming to land on one of
several distinct speeds.  The program appears more likely to run at
full speed immediately following a rebootcube, but I don't have a large
enough sample size to say with any certainty that rebooting makes a
difference.

I've checked with a few other users locally, and some of them report
seeing wide variations in performance from run to run as well.  I've
even had a couple of reports of significant performance variations
running on a single node (no communication).  I would tend to discount
these, except that I've seen this once myself on lagrange.  Has anybody
else seen this phenomenon?  Can anyone suggest an explanation?  Shahid
Bokhari has been doing extensive experiments here with message passing
performance, and has demonstrated that link contention can cause wide
swings in communication time (easily a factor of two or more), but this
seems to be too simple an explanation for the application I describe
here, because (1) the program generates 5-10 thousand large messages
(~1600 bytes) for each repetition, and you'd think that contention
anomalies would average out over such a large sample size, and (2) the
repeatability within a single job is so good.  I should also add that
the algorithm is very asynchronous, so messages will be in flight in
the system pretty much all of the time.
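
To give a feel for the communication pattern, each node does something like
the following many times per repetition (the message type, buffer names, and
"partner" node are purely illustrative, not the actual code):

    #define MSGTYPE 100                  /* illustrative message type    */
    #define MSGLEN  1600                 /* ~1600-byte messages          */

    char inbuf[MSGLEN], outbuf[MSGLEN];
    long mid;

    mid = irecv(MSGTYPE, inbuf, (long) MSGLEN);          /* post receive, no block */
    csend(MSGTYPE, outbuf, (long) MSGLEN, partner, 0L);  /* send to a partner node */
    /* ... keep computing while the message is in flight ... */
    msgwait(mid);                                        /* receive completes here */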

My gut feeling is that there's something going on in the bowels of NX (system
messages, maybe?), or else the DCMs get into some weird state, but this is
purely conjecture.

Tom Crockett

ICASE
Institute for Computer Applications in Science and Engineering

M.S. 132C				e-mail:  tom@icase.edu
NASA Langley Research Center		phone:  (804) 864-2182
Hampton,  VA  23665-5225