tom@icase.edu (Tom Crockett) (07/18/90)
[ I got this through the iPSC group mailing out of Cornell. Since it deals with the subject matter of this group, I decided to post it. Looks like the old non-determinism problem to me. Steve]

I've recently noticed some funny behavior from an application on our iPSC/860 system which is heavily communication bound. It seems to run at one of four different speeds (maybe more, I've only seen four).

I stumbled on this because I was comparing some performance measurements between our system (bluecrab.icase.edu, 32 nodes) and the one at NAS (lagrange.nas.nasa.gov, 128 nodes). There was good agreement at 1-16 nodes, but when I got to 32 nodes, lagrange gave me results that were noticeably slower, by nearly a factor of two for some of my data sets. There were no other users logged in and no other cubes active when I made the runs, so interference from CFS traffic and the SRM seemed unlikely. In addition, my program is very careful not to include any I/O operations to either the CFS or the SRM in the timing measurements. All timing measurements were made with dclock(), and the interval being timed is bracketed with gsync() calls to be sure that the processors all start and finish together.

In an effort to pinpoint the problem, I re-ran the program on lagrange with four different 32-node cubes, rooted at processors 0, 32, 64, and 96. I also modified the input data to repeat the same computation 10 times, in order to check for repeatability. To my surprise, the average of 10 repetitions for each of the 4 runs was the same speed as my original run on bluecrab, and the repeatability between repetitions was very good, with variations of at most 10%, and typically much less.

I also re-ran the program locally to check for repeatability, and was even more surprised to find that it suddenly was running at half speed on bluecrab. I have since made several more runs on bluecrab, and identified at least four different speed ranges.
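For readers unfamiliar with the NX calls, the bracketing described above looks roughly like the sketch below. gsync() and dclock() are the real iPSC/860 NX routines (a cube-wide barrier and a double-precision wall-clock timer); the stub definitions and the repetition() kernel here are stand-ins of my own so the pattern can be shown self-contained on an ordinary host.

```c
#include <time.h>

/* Stand-ins for the iPSC/860 NX calls named above.  On the real machine
 * gsync() barriers every node in the cube and dclock() returns wall-clock
 * time in seconds as a double; these stubs only illustrate the pattern. */
static void gsync(void) { /* cube-wide barrier: no-op in this sketch */ }
static double dclock(void) { return (double)clock() / CLOCKS_PER_SEC; }

/* Placeholder for one repetition of the communication-bound kernel. */
static void repetition(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 200000; i++)
        x += 1.0;
}

/* Time `reps` repetitions, bracketing the interval with gsync() so that
 * all nodes start and finish together, excluding any CFS/SRM I/O. */
static void time_repetitions(double *elapsed, int reps)
{
    for (int r = 0; r < reps; r++) {
        gsync();                    /* all nodes start together  */
        double t0 = dclock();
        repetition();
        gsync();                    /* all nodes finish together */
        elapsed[r] = dclock() - t0;
    }
}
```

Because the interval sits between two barriers, the measurement reflects the slowest node's progress, which is exactly what I want for a communication-bound code.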
Within a single run, each of the 10 repetitions shows good repeatability, but between runs the performance can vary by more than a factor of 2, yet it always seems to land on one of several distinct speeds. It appears that it's more likely to obtain full speed immediately following a rebootcube, but I don't have a large enough sample size to say with any certainty that rebooting makes a difference.

I've checked with a few other users locally, and some of them report seeing wide variations in performance from run to run as well. I've even had a couple of reports of significant performance variations running on a single node (no communication). I would tend to discount these, except that I've seen this once myself on lagrange.

Has anybody else seen this phenomenon? Can anyone suggest an explanation?

Shahid Bokhari has been doing extensive experiments here with message passing performance, and has demonstrated that link contention can cause wide swings in communication time (easily a factor of two or more). But this seems to be too simple an explanation for the application I describe here, because (1) the program generates 5-10 thousand large messages (~1600 bytes) for each repetition, and you'd think that contention anomalies would average out over such a large sample size, and (2) the repeatability within a single job is so good. I should also add that the algorithm is very asynchronous, so messages will be in flight in the system pretty much all of the time.

My gut feeling is there's something going on in the bowels of NX (system messages, maybe?), or else the DCMs get in some weird state, but this is purely conjecture.

Tom Crockett
ICASE, Institute for Computer Applications in Science and Engineering
M.S. 132C, NASA Langley Research Center
Hampton, VA 23665-5225
e-mail: tom@icase.edu
phone: (804) 864-2182