[comp.simulation] SIMULATION DIGEST V8 N8

simulation@uflorida.cis.ufl.edu (Moderator: Paul Fishwick) (04/28/89)

Volume: 8, Issue: 8, Fri Apr 28 10:43:38 EDT 1989

+----------------+
| TODAY'S TOPICS |
+----------------+

(1) Compile Time Load Balancing
(2) Special Section: Distributed Simulation Issues

* Moderator: Paul Fishwick, Univ. of Florida
* Send topical mail to: simulation@uflorida.cis.ufl.edu
* Archives available via FTP to bikini.cis.ufl.edu, login as
  'anonymous', use your last name as the password, change
  directory to pub/simdigest.
* Simulation Tools available by doing above and changing the
  directory to pub/simdigest/tools.



-----------------------------------------------------------------------------

To: hp4nl!comp-simulation@uunet.UU.NET
Path: nikhefh!stil
From: stil@nikhefh.hep.nl (Gertjan Stil)
Newsgroups: comp.parallel,comp.simulation
Subject: Wanted References
Keywords: compile time load-balancing
Date: 28 Apr 89 11:59:53 GMT
Organization: Nikhef-H, Amsterdam (the Netherlands).


I am searching for articles that deal with compile-time load balancing.
We are designing an object-oriented language with a static number of objects,
i.e. the number of objects is known at compile time.  The problem is how to
distribute these objects over a grid of processors so as to get close to
optimal parallel execution, while also taking the communication between
objects into account.  One option is to place heavily communicating objects
on the same node (processor) in order to reduce network delay, etc.
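
[[ Illustration: the placement problem described above is essentially a
   graph-partitioning problem.  The sketch below shows one naive greedy
   heuristic, assuming per-object cost and communication estimates are
   available at compile time; all names and the weighting factor are
   assumptions for the sketch, not part of the posting. ]]

# Greedy placement sketch: assign objects to processors so as to balance
# computational load while keeping heavily-communicating objects together.
def place_objects(costs, comm, n_procs, alpha=1.0):
    """costs[i]   = estimated compute cost of object i
       comm[i][j] = estimated message volume between objects i and j
       Returns a list mapping object index -> processor index."""
    n = len(costs)
    load = [0.0] * n_procs          # accumulated compute load per processor
    placement = [None] * n
    # Place the most expensive objects first.
    for i in sorted(range(n), key=lambda k: -costs[k]):
        best_proc, best_score = None, None
        for p in range(n_procs):
            # Communication that would cross the network if i goes on p.
            remote = sum(comm[i][j] for j in range(n)
                         if placement[j] is not None and placement[j] != p)
            score = alpha * (load[p] + costs[i]) + remote
            if best_score is None or score < best_score:
                best_proc, best_score = p, score
        placement[i] = best_proc
        load[best_proc] += costs[i]
    return placement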

I have been searching for articles on this subject for a month or so, but so
far I have found nothing.  So EVERY reference which touches on or deals with
this subject is welcome.  PLEASE, PLEASE e-mail me your contributions, and I
will eventually post a compilation of the references to the newsgroups.


				Thanks in advance,

					Gertjan.



------------------------------


[[ The following articles were taken from comp.parallel - they are of
   special interest to the simulation community -PAF ]]


>From uflorida!gatech!hubcap!matloff%mole.Berkeley.EDU Thu Apr 27 09:50:11 EDT 1989
Article 619 of comp.parallel:
Path: uflorida!gatech!hubcap!matloff%mole.Berkeley.EDU
>From: matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation
Date: 26 Apr 89 20:31:51 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5117@hubcap.clemson.edu> usha%FSU.BITNET@CUNYVM.CUNY.EDU writes:

>The following is a list of references on distributed simulation.  But
>this has not been updated for the past two years.

I'm glad you brought up the subject, because I would like to raise a
question.

Here is the setting.  It's discrete-event simulation, which many or most 
of your references are concerned with, on multiple-processor systems
(tightly or loosely coupled).  Let P be the number of processors and T
be the amount of run time needed to get the desired statistical accuracy.

The usual approach in these papers is to have different processors handle
different parts of the simulation program.  E.g. one processor might manage
the event list, another might do the statistics collection, etc.  Or, in
process-oriented simulators like Simula or csim (or, for that matter,
Modula-2 applied to simulation), each processor might handle a different
process.

It has always seemed to me that these methods are inherently wasteful.
Each processor is going to have a substantial amount of idle time.

Moreover, it seems to me that the best method is also the simplest  --
just have all P processors run the uniprocessor version of the program,
but for T/P amount of run time, instead of T.  [Of course, each processor 
needs to use a different random number seed.]

This method gives 100% utilization for all processors, and except for
start-up transients, should give the same statistical accuracy as the
other methods.  In fact it should give BETTER accuracy, since in the other
methods the useful work done in time T is less than a full T's worth, due
to idle periods.
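
[[ Illustration: a minimal sketch of the replicated-runs scheme just
   described, assuming a Python-style harness.  run_simulation() is a
   placeholder for the actual uniprocessor model; the seeds and the way
   results are combined are assumptions for the sketch. ]]

# Sketch of the "P independent replicates" scheme: every processor runs the
# ordinary uniprocessor simulation for T/P units of simulated time, each
# with a distinct random seed, and the replicate estimates are averaged.
from multiprocessing import Pool
import random

def run_simulation(seed, run_length):
    rng = random.Random(seed)
    # ... ordinary sequential discrete-event simulation goes here ...
    return rng.random()          # placeholder for the collected statistic

def replicated_run(P, T, base_seed=12345):
    with Pool(P) as pool:
        stats = pool.starmap(run_simulation,
                             [(base_seed + p, T / P) for p in range(P)])
    return sum(stats) / P        # combine replicate estimates

if __name__ == "__main__":
    print(replicated_run(P=4, T=100000.0))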

Is anyone aware of any thorough work on this question, i.e. the question
of whether the complex methods are better or worse than the simple one?
I think there was a brief paper by Heidelberger on the effect of 
transients in this context a few years ago in one of the Winter 
Simulation Conferences, but it didn't go very far.

Some of the complex methods give a really elegant theory (e.g. see the
references to Chandy), but it still seems to me that the simple method
will give better performance.

   Norm





------------------------------

>From uflorida!gatech!hubcap!wen-king Fri Apr 28 10:31:30 EDT 1989
Article 620 of comp.parallel:
Path: uflorida!gatech!hubcap!wen-king
>From: wen-king@csvax.caltech.edu (Wen-King Su)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation (Discrete Event)
Date: 27 Apr 89 20:03:37 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5276@hubcap.clemson.edu> matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff) writes:
>
<It has always seemed to me that [distributed] methods are inherently wasteful.
>Each processor is going to have a substantial amount of idle time.
<
>Moreover, it seems to me that the best method is also the simplest  --
<just have all P processors run the uniprocessor version of the program...
>
<This method gives 100% utilization for all processors...

Since you are not worried about the elapsed time of the simulation runs,
which may itself be the most important consideration in some simulations,
I will assume that throughput is the single most important consideration
here.  Even with this assumption, however, it still depends on how you
compute the cost of hardware.

If the cost is proportional to the number of processors, running one
sequential simulation on each processor is clearly optimal.  Since 100%
of the processors are in use at all times, as the number of processors
doubles, the throughput doubles as well.  Nothing can do better than that.

However, if the cost is proportional to the amount of hardware involved,
distributed simulation can be more efficient.  For the purpose of this
discussion, let us first assume that:

	1) Hardware used == the amount of memory involved.
	2) Simulation requires the same amount of memory
	   no matter how many processors are used.

Now, if we further assume that the simulation shows no speedup, it will
still be as efficient as the sequential simulation.  However, if we assume
that the speedup is linear, every time we double the processors and halve
the memory per processor, we also halve the elapsed time.  We have doubled
the throughput using essentially the same amount of hardware.  Although the
truth is somewhere in between, if the distributed simulation shows any
speedup at all, it will be more efficient than the sequential simulation.

In reality, neither 1) nor 2) is exactly true, but the truth lies not far
away, and the distributed simulation has to show some degree of speedup just
to break even.  But if it does better than break even, it will be more
efficient than the sequential simulation.
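
[[ Illustration: the arithmetic behind the argument above, under
   assumptions 1) and 2); the numbers and function names are only for the
   sketch. ]]

# Throughput per unit of "hardware" (= memory, by assumption 1), comparing
# the sequential-replicates scheme with a distributed run that achieves a
# given speedup S on P processors.
def throughput_per_hardware(P, S, T=1.0, M=1.0):
    # Sequential replicates: P full copies of the model, memory P*M,
    # P runs completed every T time units.
    seq = (P / T) / (P * M)
    # Distributed: one copy of the model spread over P processors
    # (memory M total, assumption 2), one run every T/S time units.
    dist = (S / T) / M
    return seq, dist

for speedup in (1, 2, 8):
    seq, dist = throughput_per_hardware(P=8, S=speedup)
    print(f"speedup {speedup}: sequential {seq:.2f}, distributed {dist:.2f}")
# Break-even at S = 1: any speedup at all favors the distributed run.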

One may argue that in real life, the processor size can't change as we
use more and more processors.  In real life, however, neither can a
simulation that tightly fits a 100 processor multicomputer be run on
just one processor of the multicomputer.

/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/



------------------------------

Date: Fri, 28 Apr 89 10:34:20 -0400
From: Paul Fishwick <fishwick@fish.cis.ufl.edu>
To: simulation@ufl.edu

>From uflorida!gatech!hubcap!matloff Fri Apr 28 10:32:58 EDT 1989
Article 624 of comp.parallel:
Path: uflorida!gatech!hubcap!matloff
>From: matloff@iris.ucdavis.edu (Norm Matloff)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation (Discrete Event)
Date: 28 Apr 89 12:11:34 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5311@hubcap.clemson.edu> wen-king@csvax.caltech.edu (Wen-King Su) writes:
>In article <5276@hubcap.clemson.edu> matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff) writes:
>>
><It has always seemed to me that [distributed] methods are inherently wasteful.
>>Each processor is going to have a substantial amount of idle time.
><
>>Moreover, it seems to me that the best method is also the simplest  --
><just have all P processors run the uniprocessor version of the program...

><This method gives 100% utilization for all processors...

>Since you are not worried about the elapsed time of the simulation runs,
>which may itself be the most important consideration in some simulations,

I'm surprised to see you say this.  Am I missing something, maybe a
subtle distinction between elapsed time and some other measure of time?

I *am* worried about the time.  In fact, I even specified a time T
for a uniprocessor version, which it is hoped to reduce to T/P in a
P-processor system.

What has to be held constant in any comparison is accuracy.  I defined
the accuracy level to be whatever comes from running the simulation on
a uniprocessor for time T.  The problem with both kinds of distributed
simulations  --  the task-partitioned and replicated types  --  is that
in the former, there is less than 100% utilization, while in the latter,
there is initialization bias.  So in both cases, the desired accuracy 
needs more than T/P time, i.e. there is less than a factor-P speedup.
The only question is which is better.
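
[[ Illustration: a back-of-the-envelope version of this comparison.  The
   utilization u and warm-up time w below are assumed quantities, not taken
   from either posting. ]]

# Rough effective speedups for the two schemes, holding accuracy fixed.
# u = fraction of time a processor is busy in the task-partitioned scheme,
# w = warm-up (transient) time each replicate must discard.
def partitioned_speedup(P, u):
    return u * P                       # idle time scales the speedup down

def replicated_speedup(P, T, w):
    return T / (T / P + w)             # each replicate pays the warm-up once

print(partitioned_speedup(P=16, u=0.6))            # 9.6
print(replicated_speedup(P=16, T=1000.0, w=20.0))  # ~12.1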

>distributed simulation can be more efficient.  For the purpose of this
>discussion, let us first assume that:

>	1) Hardware used == the amount of memory involved.
>	2) Simulation requires the same amount of memory
>	   no matter how many processors are used.

As far as I know, memory is NOT a problem.  The major data structure is 
the event queue, and in typical applications this is not very long at all, 
maybe a dozen items.

  Norm


------------------------------

Date: Fri, 28 Apr 89 10:34:27 -0400
From: Paul Fishwick <fishwick@fish.cis.ufl.edu>
To: simulation@ufl.edu

>From uflorida!gatech!hubcap!matloff Fri Apr 28 10:33:16 EDT 1989
Article 625 of comp.parallel:
Path: uflorida!gatech!hubcap!matloff
>From: matloff@iris.ucdavis.edu (Norm Matloff)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation
Date: 28 Apr 89 12:11:47 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5312@hubcap.clemson.edu> wagner@june.cs.washington.edu (Bullwinkle J. Moose) writes:

>While it's true that running parallel independent replications has a
>lot of appeal, there are some mitigating considerations.  The first and
>most obvious of these is that beyond a certain point, the marginal
>effect on confidence interval width of increasing the number of
>replications becomes insignificant.

This cannot be literally true.  The width of the confidence interval will
be roughly inversely proportional to the square root of the number of
replications, no matter how many replications there are.  If you quadruple
the number of replications, you should always halve the interval width,
i.e. double the accuracy (except for init. bias).

Perhaps you meant that after a certain point, any further decrease in
CI width is of marginal value to the USER?
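
[[ Illustration: the 1/sqrt(K) relationship in a few lines, using the usual
   normal-approximation half-width; the numbers are only for the sketch. ]]

# Half-width of a (normal-approximation) confidence interval shrinks like
# 1/sqrt(K): quadrupling the number of replications K halves the width.
import math

def ci_halfwidth(sample_std, K, z=1.96):
    return z * sample_std / math.sqrt(K)

for K in (16, 64, 256):
    print(K, round(ci_halfwidth(sample_std=10.0, K=K), 3))
# 16 -> 4.9,  64 -> 2.45,  256 -> 1.225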

>Second, because of the presence of an initialization bias in any
>stochastic simulation, it is important to run each replication long
>enough to ensure that the confidence intervals that one obtains for
>some estimator actually contain the true value of what is being
>estimated.

Yes, as I recall, that was the theme of Heidelberger's paper.  But
it seems to me that it's not so simple.

For example, in typical applications, a number of runs will be made
of the same simulation program, with different parameters, e.g.
different arrival rates.  After doing the first run, the user has
a pretty good idea of the steady-state queue lengths, and can
initialize the other runs with nonempty queues of those approximate
sizes.  This should reduce the initialization bias quite substantially.
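
[[ Illustration: a sketch of the warm-start idea on a single M/M/1-style
   replicate.  Initializing the system near its steady-state occupancy
   (rho/(1-rho) = 4 here) leaves much less of the run in the transient.
   The model and its parameters are assumptions for the sketch, not from
   the posting. ]]

# Warm-starting a replicate: start the system near its steady-state
# occupancy (estimated from an earlier run) instead of empty, so that
# less of the run is spent in the initial transient.  Simple M/M/1 sketch.
import random

def mm1_time_avg_number(lam, mu, run_time, initial_n=0, seed=1):
    rng = random.Random(seed)
    t, n = 0.0, initial_n              # n = number in system
    area = 0.0                         # integral of n over time
    next_arrival = rng.expovariate(lam)
    next_departure = rng.expovariate(mu) if n > 0 else float("inf")
    while t < run_time:
        t_next = min(next_arrival, next_departure, run_time)
        area += n * (t_next - t)
        t = t_next
        if t >= run_time:
            break
        if next_arrival <= next_departure:        # arrival
            n += 1
            next_arrival = t + rng.expovariate(lam)
            if n == 1:
                next_departure = t + rng.expovariate(mu)
        else:                                     # departure
            n -= 1
            next_departure = (t + rng.expovariate(mu)
                              if n > 0 else float("inf"))
    return area / run_time

# Steady-state mean number in system for rho = 0.8 is rho/(1-rho) = 4.
print("empty start:", mm1_time_avg_number(0.8, 1.0, 2000.0, initial_n=0))
print("warm  start:", mm1_time_avg_number(0.8, 1.0, 2000.0, initial_n=4))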

Or, better yet, in systems which have regeneration points and don't
have huge (in the notation from my posting yesterday, >> T/P)
regeneration cycle times, one can simply start each replicate at
the regeneration point  --  thus no initialization bias at all.

Of course, you do run into the problem you mentioned, i.e. the
problem of "waiting for the slowest one," which gave you your
P/log P figure.  However, analogous kinds of problems should plague
the task-partitioning approach too, e.g. as P increases, the
problem of idle time should get worse and worse.

The task-partitioning approach also has a scaling problem.  For
a fixed application, say a 10-server queue, there will be a very
small limit on the usable value of P.  After a certain point,
you just can't find anything for the processors to do.  On the
other hand, the replicates approach has this too, i.e. for a
fixed problem, a large value of P exacerbates either the
initialization bias problem or the "wait for the slowest one"
problem.

So I don't think that either approach is a clear win.  But the
replicates approach certainly is much simpler.

>Finally, a pragmatic consideration against running multiple independent
>replications in parallel is that some simulations have enormous memory
>requirements,

In most simulations of queueing systems in practice, memory should not
be a problem.  The length of the event list should be on the order of 
magnitude of the number of servers, which is generally not large.

   Norm





------------------------------

Date: Fri, 28 Apr 89 10:34:16 -0400
From: Paul Fishwick <fishwick@fish.cis.ufl.edu>
To: simulation@ufl.edu

>From uflorida!gatech!hubcap!wagner Fri Apr 28 10:32:24 EDT 1989
Article 621 of comp.parallel:
Path: uflorida!gatech!hubcap!wagner
>From: wagner@june.cs.washington.edu (Bullwinkle J. Moose)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation
Summary: Statistical considerations for parallel simulation
Date: 27 Apr 89 20:03:51 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5276@hubcap.clemson.edu>, matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff) writes:
> In article <5117@hubcap.clemson.edu> usha%FSU.BITNET@CUNYVM.CUNY.EDU writes:
> 
> Here is the setting.  It's discrete-event simulation, which many or most 
> of your references are concerned with, on multiple-processor systems
> (tightly or loosely coupled).  Let P be the number of processors and T
> be the amount of run time needed to get the desired statistical accuracy.
> ...
> Moreover, it seems to me that the best method is also the simplest  --
> just have all P processors run the uniprocessor version of the program,
> but for T/P amount of run time, instead of T.  [Of course, each processor 
> needs to use a different random number seed.]
> 
> Is anyone aware of any thorough work on this question, i.e. the question
> of whether the complex methods are better or worse than the simple one?
> I think there was a brief paper by Heidelberger on the effect of 
> transients in this context a few years ago in one of the Winter 
> Simulation Conferences, but it didn't go very far.

Well, funny you should ask, because I have just finished writing up the
chapter of my Ph.D. dissertation dealing with this question.

In fact, Heidelberger is the only one I know of who has done any analysis
of this problem.  The complete references are:

    @InProceedings{ Heidelberger86,
	     Author= "P. Heidelberger",
	     Title= "{Statistical Analysis of Parallel Simulations}",
	     Booktitle= "Proc. 1986 Winter Simulation Conference",
	     Year = 1986,
	     Pages= "290-295",
	     Organization= SCS,
	     Abstract= "Comparison of independent replications with a single
			long parallel simulation."
	    }

    @TechReport{ Heidelberger87,
	     Author= "P. Heidelberger",
	     Title= "{Discrete Event Simulations and Parallel Processing: Statistical Properties}",
	     Institution= "IBM Thomas J. Watson Research Center",
	     Year = 1987,
	     Number= "RC 12733 (\#57302)",
	     Address= "Yorktown Heights, NY",
	     Month= May,
	     Abstract= "Proves that certain statistical estimators 
			obtained by running separate replications
			in parallel are biased, and that some even converge
			to the wrong value!"
	    }

Heidelberger86 examined the tradeoff between intra-replication
parallelism M and inter-replication parallelism K when the total number
of processors is some constant (hence P = MK).  In essence, the results
of Heidelberger86 are that intra-replication parallelism should be
favored over inter-replication parallelism under the following
circumstances:  for systems with a strong initial transient;  for short
runs;  or if a large number of processors are available.  While I have
very little to add to this in an analytical sense, I've spent quite
some time evaluating the practical impact of his results.
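
[[ Illustration: a toy mean-square-error model of the M-versus-K split for
   fixed P = MK.  The speedup, bias, and variance formulas below are
   invented for the sketch and are NOT Heidelberger's analysis; they only
   show how a strong transient pushes the optimum toward intra-replication
   parallelism. ]]

# Toy mse model for splitting P = M*K processors into M-way parallelism
# within each replicate and K replicates.  For a fixed wall-clock budget,
# each replicate covers simulated time proportional to its internal speedup
# s(M); bias shrinks with replicate length, variance shrinks with replicate
# length and with K.  All formulas are assumptions of the sketch.
import math

def mse(M, K, wall_clock=100.0, bias0=50.0, var0=50.0):
    s = math.sqrt(M)                   # assumed sub-linear internal speedup
    length = s * wall_clock            # simulated time per replicate
    bias = bias0 / length
    variance = var0 / (K * length)
    return bias * bias + variance

P = 64
for M in (1, 2, 4, 8, 16, 32, 64):
    K = P // M
    print(f"M={M:2d} K={K:2d}  mse={mse(M, K):.4f}")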

While it's true that running parallel independent replications has a
lot of appeal, there are some mitigating considerations.  The first and
most obvious of these is that beyond a certain point, the marginal
effect on confidence interval width of increasing the number of
replications becomes insignificant.  Thus, given enough processors, it
still makes sense to parallelize each replication.

Second, because of the presence of an initialization bias in any
stochastic simulation, it is important to run each replication long
enough to ensure that the confidence intervals that one obtains for
some estimator actually contain the true value of what is being
estimated.  (I.e., an extremely large number of very short replications
will probably result in very small confidence intervals that do not
contain the true value!) What this means in practical terms is that if
elapsed wall clock time is the overriding consideration, you may not be
able to achieve accurate results unless you parallelize each replication.
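
[[ Illustration: a small numerical demonstration of this point, using an
   artificial process with an initial transient; the process and its
   parameters are assumptions for the sketch. ]]

# Many very short replicates of a process with an initial transient: the
# across-replicate confidence interval is tiny, but it is centred on a
# biased value and misses the true steady-state mean entirely.
import math, random

TRUE_MEAN, START, DECAY = 10.0, 0.0, 0.9
rng = random.Random(0)

def replicate_average(n):
    level, total = START, 0.0
    for _ in range(n):
        level = TRUE_MEAN + DECAY * (level - TRUE_MEAN) + rng.gauss(0, 1)
        total += level
    return total / n

def ci(K, n):
    xs = [replicate_average(n) for _ in range(K)]
    mean = sum(xs) / K
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (K - 1))
    half = 1.96 * sd / math.sqrt(K)
    return mean - half, mean + half

print("true mean:", TRUE_MEAN)
print("1000 replicates of length 5:  ", ci(1000, 5))   # narrow, far below 10
print("  20 replicates of length 500:", ci(20, 500))   # wider, centred near 10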

Third, Heidelberger87 pointed out that as the number of replications
run in parallel increases, statistical considerations require stopping
rules that lead to poor processor utilization near the end of the
simulation run.  Grossly oversimplifying, it turns out that if the
completion time of individual replications is a random variable, then
in order to obtain unbiased estimators of model outputs it is necessary
to wait for all replications to finish.  (The randomness of completion
times may be a significant factor when the simulation is estimating
transient behavior of a model, or when the regenerative method for
obtaining confidence intervals is being used within each replication.)
In fact, under certain assumptions it turns out that as the number of
processors P increases, the speedup obtained tends toward P/log P rather
than P.  (This is similar to a result by Kruskal and Weiss regarding
the completion time of a group of self-scheduling processors taking
tasks off of a central task queue.)  Therefore, your assertion that

> This method gives 100% utilization for all processors, ...

is statistically untrue.
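
[[ Illustration: a quick Monte Carlo check of the "wait for the slowest
   replicate" effect.  The exponential completion-time model is an
   assumption of the sketch, not Wagner's analysis; it just shows the
   maximum of P completion times growing like log P. ]]

# Expected time until ALL of P replicates finish, when individual
# completion times are i.i.d. exponential with mean 1: the maximum equals
# the P-th harmonic number ~ ln P in expectation, so end-of-run idle time
# grows and the speedup degrades toward P/log P.
import random

rng = random.Random(42)

def mean_makespan(P, trials=5000):
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(P))
    return total / trials

for P in (4, 16, 64, 256):
    harmonic = sum(1.0 / k for k in range(1, P + 1))
    print(f"P={P:3d}  simulated {mean_makespan(P):.2f}   "
          f"harmonic H_P {harmonic:.2f}")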

Lastly, I have taken the analysis of Heidelberger86 and tweaked it just
a bit for purposes of illustration.  Rather than fixing a running time
and comparing the mean square error (mse)
of the alternatives obtained by assigning various values to M and K, I
have fixed the mse and time to completion of the sequential case (M=1,
K=P) and calculated what speedup the parallel simulation would need to
achieve (as a function of M) in order to better the accuracy of the
purely sequential method.  I call this the break-even speedup.  It can
be interpreted in one of two ways:  if the parallel simulation is
capable of exceeding the break-even speedup, then either (a) the
parallel strategy can produce the same level of statistical confidence
as the purely sequential strategy in less elapsed time, or (b) the
parallel strategy can produce a higher level of statistical confidence
than the purely sequential strategy in the same amount of elapsed
time.  The advantage of this analysis is that it does not require any
assumptions about the speedup behavior of the parallel simulation.

The general behavior of the break-even speedup is that it is almost
linear in M, but the slope of the line decreases rapidly as the total
number of processors P increases.  For really large P (e.g., 1024) the
break-even speedup is truly paltry even for large M, so that any
reasonable implementation of a parallel simulation ought to be able to
produce a win.  (Note:  I'm not suggesting that it's realistic to
expect a parallel simulator to show speedups on 1024 processors (in
fact, that wouldn't give you anything on which to base confidence
intervals); rather, given 1024 processors, one ought to run an M=32,
K=32 parallel simulation, or, if your simulation starts to wheeze and
die at 32 processors, maybe only M=16, K=64.)

Finally, a pragmatic consideration against running multiple independent
replications in parallel is that some simulations have enormous memory
requirements; thus, it actually may not be possible to run the
replications simultaneously.  In fact, Jefferson et al. at JPL
are running into situations in which they can't even calculate speedup
because they can't run 1-processor versions of the problems they are
simulating (they are running on a distributed memory architecture).

		Dave Wagner
		University of Washington Comp Sci Department
		wagner@cs.washington.edu
		{cornell,harvard,ssc-vax,tektronix}!uw-beaver!wagner

P.S.  My complete dissertation will be available from the University of
Washington sometime this summer.  The title is "Conservative Parallel
Discrete-Event Simulation: Principles and Practice".





------------------------------



+--------------------------+
| END OF SIMULATION DIGEST |
+--------------------------+