[comp.hypercube] How to time on the NCUBE?

lls@hubcap.UUCP (11/19/87)

All of you NCUBE users out there - This one is for you!

I am finally getting some programs to run on our NCUBE (in C).  I want
to do some timing - and have some questions:

1) How does one time on the host?  I am currently using the time() function,
but that only gives me accuracy to the nearest second - I need to get to
clock ticks or at least to msecs.  My programs only take 1 or 2 seconds,
and I cannot tell the difference between using 1 or 8 processors when
measuring overall time....

2) I am using the ntime() function on each node.  This at least gives me
clock ticks and lets me measure the amount of time spent on each node -
However, I have noticed that my times linearly increase from node 0 to
node 7 (I have an 8 node system) - Why would this be?  All nodes have
the same amount of work (totally balanced problem).  Is it because of
the communication - node 0 gets its data first and processes it before
the others get their data - hence the linearity?

Any suggestions on how to measure the performance on the NCUBE would
be appreciated.  I don't want to add so many timing monitors that they
affect the performance.

Thanks,

- Lauren Smith

pase@ogcvax.UUCP (Douglas M. Pase) (11/24/87)

In article <hubcap.693> lls@mimsy.UUCP (Lauren L. Smith) writes:

>1) How does one time on the host? [...]
>
>2) I am using the ntime() function on each node.  This at least gives me
>clock ticks and lets me measure the amount of time spent on each node -
[...]

	(We do not have an NCUBE, but these are general techniques.)

Since each node has a sufficiently accurate clock, one thing you can do is
synchronize the beginning and the end of the computation on a root node and
do the timing there.  Synchronization for the start consists of a collect
followed by a broadcast.  The collect is the equivalent of each node telling
the root that it is ready to compute.  The broadcast is the equivalent of a
starter's pistol, or a begin signal.  A second collect is the last step
before stopping the clock; it signals the root that every node has
finished.  Both the broadcast and the collect should communicate only
through node-to-node links over a minimum-height spanning tree.  Any other
approach could cause race conditions, depending on the exact nature of the
routine.

The cost of the synchronization is twice the time it takes to send a small
message across the diameter of the cube.  This should be small compared to
the computation or your timings won't mean much anyway.

I'm trying to strike a balance between being too terse and too obvious so
I'll stop here.  If you want a more detailed explanation I'd be happy to
give it, along with code for broadcast and collect, plus an example or two.
If enough people are interested I'll post it.
--
Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase  or  pase@cse.ogc.edu.csnet