simulation@uflorida.cis.ufl.edu (Moderator: Paul Fishwick) (04/28/89)

Volume: 8, Issue: 8, Fri Apr 28 10:43:38 EDT 1989

                         +----------------+
                         | TODAY'S TOPICS |
                         +----------------+

(1) Compile Time Load Balancing
(2) Special Section: Distributed Simulation Issues

* Moderator: Paul Fishwick, Univ. of Florida
* Send topical mail to: simulation@uflorida.cis.ufl.edu
* Archives available via FTP to bikini.cis.ufl.edu, login as 'anonymous',
  use your last name as the password, and change directory to pub/simdigest.
* Simulation tools available by doing the above and changing the directory
  to pub/simdigest/tools.

-----------------------------------------------------------------------------

To: hp4nl!comp-simulation@uunet.UU.NET
Path: nikhefh!stil
From: stil@nikhefh.hep.nl (Gertjan Stil)
Newsgroups: comp.parallel,comp.simulation
Subject: Wanted References
Keywords: compile time load-balancing
Date: 28 Apr 89 11:59:53 GMT
Organization: Nikhef-H, Amsterdam (the Netherlands).

I am looking for articles that deal with compile-time load balancing. We
are designing an object-oriented language with a static number of objects,
i.e. the number of objects is known at compile time. The problem is how to
distribute these objects over a grid of processors so as to get the best
parallel execution, while also taking the communication between objects
into account. One option is to place heavily communicating objects on the
same node (processor) in order to reduce network delay, etc.

I have been searching for about a month now, but so far I have found
nothing, so EVERY reference that touches on or deals with this subject is
welcome. PLEASE e-mail me your contributions, and I will eventually post a
compilation of the references to the newsgroups.

Thanks in advance, Gertjan.
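
A minimal sketch, in Python, of the kind of greedy co-location heuristic
described above. Everything concrete here is an assumption added for
illustration (the communication matrix, the per-object load estimates, and
the 25% imbalance cap); the posting itself only states the problem.

    # Greedy compile-time placement: objects with a known communication
    # matrix are assigned to processors so that heavily communicating
    # objects tend to share a node, subject to a simple load cap.

    def place_objects(comm, work, num_procs):
        """comm[i][j] - communication volume between objects i and j
           work[i]    - estimated computation load of object i
           Returns a list mapping object index -> processor index."""
        n = len(work)
        cap = sum(work) / num_procs * 1.25        # tolerate 25% imbalance
        order = sorted(range(n), key=lambda i: -sum(comm[i]))  # talkative first
        assignment = [None] * n
        load = [0.0] * num_procs
        for i in order:
            best_proc, best_score = None, None
            for p in range(num_procs):
                if load[p] > 0 and load[p] + work[i] > cap:
                    continue                      # would overload this node
                # communication avoided by co-locating i with objects on p
                saved = sum(comm[i][j] for j in range(n) if assignment[j] == p)
                score = (saved, -load[p])         # prefer savings, then light load
                if best_score is None or score > best_score:
                    best_proc, best_score = p, score
            if best_proc is None:                 # all over cap: take least loaded
                best_proc = min(range(num_procs), key=lambda p: load[p])
            assignment[i] = best_proc
            load[best_proc] += work[i]
        return assignment

    if __name__ == "__main__":
        comm = [[0, 9, 0, 1],
                [9, 0, 1, 0],
                [0, 1, 0, 8],
                [1, 0, 8, 0]]
        work = [1.0, 1.0, 1.0, 1.0]
        print(place_objects(comm, work, 2))       # -> [0, 0, 1, 1]

Real compile-time mappers usually refine such an initial placement (e.g. by
pairwise exchanges), but the sketch shows the basic cost trade-off.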

------------------------------

[[ The following articles were taken from comp.parallel - they are of
   special interest to the simulation community -PAF ]]

From uflorida!gatech!hubcap!matloff%mole.Berkeley.EDU Thu Apr 27 09:50:11 EDT 1989
Article 619 of comp.parallel:
Path: uflorida!gatech!hubcap!matloff%mole.Berkeley.EDU
From: matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation
Date: 26 Apr 89 20:31:51 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5117@hubcap.clemson.edu> usha%FSU.BITNET@CUNYVM.CUNY.EDU writes:

>The following is a list of references on distributed simulation. But
>this has not been updated for the past two years.

I'm glad you brought up the subject, because I would like to raise a
question.

Here is the setting. It's discrete-event simulation, which many or most of
your references are concerned with, on multiple-processor systems (tightly
or loosely coupled). Let P be the number of processors and T be the amount
of run time needed to get the desired statistical accuracy.

The usual approach in these papers is to have different processors handle
different parts of the program. E.g. one processor might manage the event
list, another might do the statistics collection, etc. Or, in
process-oriented simulators like Simula or csim (or, for that matter,
Modula II applied to simulation), each processor might handle a different
process.

It has always seemed to me that these methods are inherently wasteful.
Each processor is going to have a substantial amount of idle time.

Moreover, it seems to me that the best method is also the simplest -- just
have all P processors run the uniprocessor version of the program, but for
T/P amount of run time, instead of T. [Of course, each processor needs to
use a different random number seed.]

This method gives 100% utilization for all processors, and, except for
start-up transients, should give the same statistical accuracy as the other
methods -- in fact BETTER accuracy, since in the other methods the
effective run time achieved is less than T, due to idle periods.

Is anyone aware of any thorough work on this question, i.e. the question of
whether the complex methods are better or worse than the simple one? I
think there was a brief paper by Heidelberger on the effect of transients
in this context a few years ago in one of the Winter Simulation
Conferences, but it didn't go very far.

Some of the complex methods give a really elegant theory (e.g. see the
references to Chandy), but it still seems to me that the simple method will
give better performance.

   Norm
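
A minimal sketch of the replication method proposed above: P worker
processes each run the same uniprocessor simulation for T/P units of
simulated time, each with its own random number seed, and the results are
averaged. The M/M/1 queue and all parameter values are assumptions added
for illustration; the posting does not specify a model.

    import random
    from multiprocessing import Pool

    def mm1_mean_in_system(args):
        """One uniprocessor replication: time-average number in an M/M/1
           queue over a horizon of simulated time, starting empty."""
        seed, horizon, lam, mu = args
        rng = random.Random(seed)
        t = q = area = 0.0
        next_arrival = rng.expovariate(lam)
        next_departure = float("inf")
        while t < horizon:
            t_next = min(next_arrival, next_departure, horizon)
            area += q * (t_next - t)            # accumulate integral of q(t)
            t = t_next
            if t == horizon:
                break
            if next_arrival <= next_departure:  # arrival event
                q += 1
                next_arrival = t + rng.expovariate(lam)
                if q == 1:
                    next_departure = t + rng.expovariate(mu)
            else:                               # departure event
                q -= 1
                next_departure = t + rng.expovariate(mu) if q else float("inf")
        return area / horizon

    if __name__ == "__main__":
        P, T = 4, 400_000.0        # processors, total simulated-time budget
        lam, mu = 0.5, 1.0         # arrival and service rates (rho = 0.5)
        jobs = [(seed, T / P, lam, mu) for seed in range(P)]  # distinct seeds
        with Pool(P) as pool:
            estimates = pool.map(mm1_mean_in_system, jobs)
        print("per-replication estimates:", estimates)
        print("combined estimate:", sum(estimates) / P)
        # Exact steady-state value is rho/(1 - rho) = 1.0 for these rates.

Because each replication starts from an empty queue, the combined estimate
carries P copies of the start-up transient -- exactly the initialization
bias discussed in the follow-ups below.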

------------------------------

From uflorida!gatech!hubcap!wen-king Fri Apr 28 10:31:30 EDT 1989
Article 620 of comp.parallel:
Path: uflorida!gatech!hubcap!wen-king
From: wen-king@csvax.caltech.edu (Wen-King Su)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation (Discrete Event)
Date: 27 Apr 89 20:03:37 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5276@hubcap.clemson.edu>
matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff) writes:

>It has always seemed to me that [distributed] methods are inherently wasteful.
>Each processor is going to have a substantial amount of idle time.
>
>Moreover, it seems to me that the best method is also the simplest --
>just have all P processors run the uniprocessor version of the program...
>
>This method gives 100% utilization for all processors...

Since you are not worried about the elapsed time of the simulation runs,
which is itself the most important consideration in some simulations, I
will assume that throughput is the single most important consideration
here. Even with this assumption, however, it still depends on how you
compute the cost of hardware.

If the cost is proportional to the number of processors, running one
sequential simulation on each processor clearly is optimal. Since 100% of
the processors are in use at all times, as the number of processors
doubles, the throughput doubles as well. Nothing can do better than that.

However, if the cost is proportional to the amount of hardware involved,
distributed simulation can be more efficient. For the purpose of this
discussion, let us first assume that:

  1) Hardware used == the amount of memory involved.
  2) Simulation requires the same amount of memory no matter how many
     processors are used.

Now, if we further assume that the simulation shows no speedup, it will
still be as efficient as the sequential simulation. However, if we assume
that the speedup is linear, every time we double the processors and halve
the memory per processor, we also halve the elapsed time. We have doubled
the throughput using essentially the same amount of hardware. Although the
truth is somewhere in between, if the distributed simulation shows any
speedup at all, it will be more efficient than the sequential simulation.

In reality, neither 1) nor 2) is true, but the truth lies not far away, and
the distributed simulation has to show some degree of speedup just to break
even. But if it does better than break even, it will be more efficient than
the sequential simulation.

One may argue that in real life, the processor size can't change as we use
more and more processors. In real life, however, neither can a simulation
that tightly fits a 100-processor multicomputer be run on just one
processor of the multicomputer.

/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/
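
The argument above can be compressed into a back-of-the-envelope formula.
The symbols are introduced here only for illustration: M_tot is the (fixed)
total memory, T the sequential elapsed time, and S(P) the speedup of the
distributed run on P processors.

    \[
      \text{throughput per unit hardware:}\qquad
      \text{sequential } \frac{1}{T\,M_{\mathrm{tot}}}
      \qquad\text{vs.}\qquad
      \text{distributed } \frac{S(P)}{T\,M_{\mathrm{tot}}}
    \]

Under assumptions 1) and 2) the distributed simulation therefore breaks
even at S(P) = 1 and is more hardware-efficient whenever it shows any
speedup at all, which is the point being made above.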

------------------------------

Date: Fri, 28 Apr 89 10:34:20 -0400
From: Paul Fishwick <fishwick@fish.cis.ufl.edu>
To: simulation@ufl.edu

From uflorida!gatech!hubcap!matloff Fri Apr 28 10:32:58 EDT 1989
Article 624 of comp.parallel:
Path: uflorida!gatech!hubcap!matloff
From: matloff@iris.ucdavis.edu (Norm Matloff)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation (Discrete Event)
Date: 28 Apr 89 12:11:34 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5311@hubcap.clemson.edu> wen-king@csvax.caltech.edu
(Wen-King Su) writes:

>>It has always seemed to me that [distributed] methods are inherently
>>wasteful. Each processor is going to have a substantial amount of idle
>>time.
>>
>>Moreover, it seems to me that the best method is also the simplest --
>>just have all P processors run the uniprocessor version of the program...
>>This method gives 100% utilization for all processors...
>
>Since you are not worried about the elapsed time of the simulation runs,
>which is itself the most important consideration in some simulations,

I'm surprised to see you say this. Am I missing something, maybe a subtle
distinction between elapsed time and some other measure of time? I *am*
worried about the time. In fact, I even specified a time T for a
uniprocessor version, which it is hoped to reduce to T/P in a P-processor
system.

What has to be held constant in any comparison is accuracy. I defined the
accuracy level to be whatever comes from running the simulation on a
uniprocessor for time T. The problem with both kinds of distributed
simulations -- the task-partitioned and replicated types -- is that in the
former there is less than 100% utilization, while in the latter there is
initialization bias. So in both cases, the desired accuracy needs more than
T/P time, i.e. there is less than a factor-P speedup. The only question is
which is better.

>distributed simulation can be more efficient. For the purpose of this
>discussion, let us first assume that:
>  1) Hardware used == the amount of memory involved.
>  2) Simulation requires the same amount of memory no matter how many
>     processors are used.

As far as I know, memory is NOT a problem. The major data structure is the
event queue, and in typical applications this is not very long at all,
maybe a dozen items.

   Norm

------------------------------

Date: Fri, 28 Apr 89 10:34:27 -0400
From: Paul Fishwick <fishwick@fish.cis.ufl.edu>
To: simulation@ufl.edu

From uflorida!gatech!hubcap!matloff Fri Apr 28 10:33:16 EDT 1989
Article 625 of comp.parallel:
Path: uflorida!gatech!hubcap!matloff
From: matloff@iris.ucdavis.edu (Norm Matloff)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation
Date: 28 Apr 89 12:11:47 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5312@hubcap.clemson.edu> wagner@june.cs.washington.edu
(Bullwinkle J. Moose) writes:

>While it's true that running parallel independent replications has a
>lot of appeal, there are some mitigating considerations. The first and
>most obvious of these is that beyond a certain point, the marginal
>effect on confidence interval width of increasing the number of
>replications becomes insignificant.

This cannot be literally true. The accuracy will be roughly inversely
proportional to the square root of the number of replications, no matter
how many replications there are. If you quadruple the number of
replications, you should always get double the accuracy (except for
initialization bias).

Perhaps you meant that after a certain point, any further decrease in CI
width is of marginal value to the USER?

>Second, because of the presence of an initialization bias in any
>stochastic simulation, it is important to run each replication long
>enough to ensure that the confidence intervals that one obtains for
>some estimator actually contain the true value of what is being
>estimated.

Yes, as I recall, that was the theme of Heidelberger's paper.

But it seems to me that it's not so simple. For example, in typical
applications, a number of runs will be made of the same simulation program
with different parameters, e.g. different arrival rates. After doing the
first run, the user has a pretty good idea of the steady-state queue
lengths, and can initialize the other runs with nonempty queues of those
approximate sizes. This should reduce the initialization bias quite
substantially.

Or, better yet, in systems which have regeneration points and don't have
huge (in the notation from my posting yesterday, >> T/P) regeneration cycle
times, one can simply start each replicate at the regeneration point --
thus no initialization bias at all.

Of course, you do run into the problem you mentioned, i.e. the problem of
"waiting for the slowest one," which gave you your P/log P figure. However,
analogous kinds of problems should plague the task-partitioning approach
too, e.g. as P increases, the problem of idle time should get worse and
worse.

The task-partitioning approach also has a scaling problem. For a fixed
application, say a 10-server queue, there will be a very small limit on the
usable value of P. After a certain point, you just can't find anything for
the processors to do. On the other hand, the replicates approach has this
too, i.e. for a fixed problem, a large value of P exacerbates either the
initialization bias problem or the "wait for the slowest one" problem.

So I don't think that either approach is a clear win. But the replicates
approach certainly is much simpler.

>Finally, a pragmatic consideration against running multiple independent
>replications in parallel is that some simulations have enormous memory
>requirements,

In most simulations of queueing systems in practice, memory should not be a
problem. The length of the event list should be on the order of magnitude
of the number of servers, which is generally not large.

   Norm
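
For K independent replications of equal length, the standard confidence
interval arithmetic makes both sides of this exchange precise. Here sigma
denotes the per-replication standard deviation of the estimator; the
notation is added only for illustration.

    \[
      \text{CI half-width} \;\approx\; z_{\alpha/2}\,\frac{\sigma}{\sqrt{K}},
      \qquad
      \frac{d}{dK}\Bigl(\frac{\sigma}{\sqrt{K}}\Bigr)
      \;=\; -\,\frac{\sigma}{2K^{3/2}} .
    \]

So quadrupling K does always halve the half-width, as argued above, while
the absolute gain from each additional replication shrinks like K^(-3/2),
which is presumably the sense in which the original "diminishing returns"
claim was meant.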

------------------------------

Date: Fri, 28 Apr 89 10:34:16 -0400
From: Paul Fishwick <fishwick@fish.cis.ufl.edu>
To: simulation@ufl.edu

From uflorida!gatech!hubcap!wagner Fri Apr 28 10:32:24 EDT 1989
Article 621 of comp.parallel:
Path: uflorida!gatech!hubcap!wagner
From: wagner@june.cs.washington.edu (Bullwinkle J. Moose)
Newsgroups: comp.parallel
Subject: Re: Distributed simulation
Summary: Statistical considerations for parallel simulation
Date: 27 Apr 89 20:03:51 GMT
Sender: fpst@hubcap.clemson.edu
Approved: parallel@hubcap.clemson.edu

In article <5276@hubcap.clemson.edu>,
matloff%mole.Berkeley.EDU@ucbvax.Berkeley.EDU (Norman Matloff) writes:

> In article <5117@hubcap.clemson.edu> usha%FSU.BITNET@CUNYVM.CUNY.EDU writes:
>
> Here is the setting. It's discrete-event simulation, which many or most
> of your references are concerned with, on multiple-processor systems
> (tightly or loosely coupled). Let P be the number of processors and T
> be the amount of run time needed to get the desired statistical accuracy.
> ...
> Moreover, it seems to me that the best method is also the simplest --
> just have all P processors run the uniprocessor version of the program,
> but for T/P amount of run time, instead of T. [Of course, each processor
> needs to use a different random number seed.]
>
> Is anyone aware of any thorough work on this question, i.e. the question
> of whether the complex methods are better or worse than the simple one?
> I think there was a brief paper by Heidelberger on the effect of
> transients in this context a few years ago in one of the Winter
> Simulation Conferences, but it didn't go very far.

Well, funny you should ask, because I have just finished writing up the
chapter of my Ph.D. dissertation dealing with this question. In fact,
Heidelberger is the only one I know of who has done any analysis of this
problem. The complete references are:

@InProceedings{ Heidelberger86,
  Author       = "P. Heidelberger",
  Title        = "{Statistical Analysis of Parallel Simulations}",
  Booktitle    = "Proc. 1986 Winter Simulation Conference",
  Year         = 1986,
  Pages        = "290-295",
  Organization = SCS,
  Abstract     = "Comparison of independent replications with a single
                  long parallel simulation."
}

@TechReport{ Heidelberger87,
  Author       = "P. Heidelberger",
  Title        = "{Discrete Event Simulations and Parallel Processing:
                   Statistical Properties}",
  Institution  = "IBM Thomas J. Watson Research Center",
  Year         = 1987,
  Number       = "RC 12733 (\#57302)",
  Address      = "Yorktown Heights, NY",
  Month        = May,
  Abstract     = "Proves that certain statistical estimators obtained by
                  running separate replications in parallel are biased, and
                  that some even converge to the wrong value!"
}

Heidelberger86 examined the tradeoff between intra-replication parallelism
M and inter-replication parallelism K when the total number of processors
is some constant (hence P = MK). In essence, the results of Heidelberger86
are that intra-replication parallelism should be favored over
inter-replication parallelism under the following circumstances: for
systems with a strong initial transient; for short runs; or if a large
number of processors are available. While I have very little to add to this
in an analytical sense, I've spent quite some time evaluating the practical
impact of his results.

While it's true that running parallel independent replications has a lot of
appeal, there are some mitigating considerations. The first and most
obvious of these is that beyond a certain point, the marginal effect on
confidence interval width of increasing the number of replications becomes
insignificant. Thus, given enough processors, it still makes sense to
parallelize each replication.
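
A toy numerical illustration of the M x K = P tradeoff just described. The
model below is an assumption made purely for illustration (it is not
Heidelberger's actual analysis): a fixed wall-clock budget W, a crude
sublinear intra-replication speedup, a first-order initialization bias c/t
for a replication covering simulated time t, and a per-replication variance
s2/t. Averaging K independent replications divides the variance by K but
leaves the bias alone.

    def speedup(M, a=0.05):
        # assumed sublinear intra-replication speedup on M processors
        return M / (1.0 + a * (M - 1))

    def mse(M, K, W=1.0, c=1.0, s2=1.0):
        t = speedup(M) * W              # simulated time per replication
        bias = c / t
        return bias * bias + (s2 / t) / K

    P = 64
    for M in (1, 2, 4, 8, 16, 32, 64):
        K = P // M
        print(f"M={M:2d}  K={K:2d}  mse={mse(M, K):.4f}")

With these made-up constants the minimum mean square error falls at an
intermediate split (around M=16, K=4), which at least qualitatively mirrors
the stated conclusion that transients, run length, and processor count
determine how far to push intra-replication parallelism.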

Second, because of the presence of an initialization bias in any stochastic
simulation, it is important to run each replication long enough to ensure
that the confidence intervals that one obtains for some estimator actually
contain the true value of what is being estimated. (I.e., an extremely
large number of very short replications will probably result in very small
confidence intervals that do not contain the true value!) What this means
in practical terms is that if elapsed wall-clock time is the overriding
consideration, you may not be able to achieve accurate results unless you
parallelize each replication.

Third, Heidelberger87 pointed out that as the number of replications run in
parallel increases, statistical considerations require stopping rules that
lead to poor processor utilization near the end of the simulation run.
Grossly oversimplifying, it turns out that if the completion time of
individual replications is a random variable, then in order to obtain
unbiased estimators of model outputs it is necessary to wait for all
replications to finish. (The randomness of completion times may be a
significant factor when the simulation is estimating transient behavior of
a model, or when the regenerative method for obtaining confidence intervals
is being used within each replication.) In fact, under certain assumptions
it turns out that as the number of processors P increases, the speedup
obtained tends toward P/log P rather than P. (This is similar to a result
by Kruskal and Weiss regarding the completion time of a group of
self-scheduling processors taking tasks off of a central task queue.)

Therefore, your assertion that

> This method gives 100% utilization for all processors, ...

is statistically untrue.
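
A quick numerical illustration of this "wait for the slowest replication"
effect. The exponential completion-time model is an assumption made here
purely for illustration; it is one simple case in which the speedup
degrades roughly to K/log K.

    import random

    def mean_makespan(K, trials=5000, rng=random.Random(1)):
        """Average time until the last of K replications finishes,
           with i.i.d. exponential(mean 1) completion times."""
        total = 0.0
        for _ in range(trials):
            total += max(rng.expovariate(1.0) for _ in range(K))
        return total / trials

    for K in (4, 16, 64, 256):
        H_K = sum(1.0 / i for i in range(1, K + 1))   # E[max] = H_K ~ ln K
        m = mean_makespan(K)
        print(f"K={K:3d}  E[max] ~ {m:5.2f} (H_K={H_K:5.2f})  "
              f"effective speedup ~ {K / m:5.1f} of an ideal {K}")

Since every replication has mean running time 1 but the batch only finishes
when the last one does, K-fold replication buys roughly K/H_K, not K.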

Lastly, I have taken the analysis of Heidelberger86 and tweaked it just a
bit for purposes of illustration. Rather than fixing a running time and
comparing the mean square error (mse) of the alternatives obtained by
assigning various values to M and K, I have fixed the mse and time to
completion of the sequential case (M=1, K=P) and calculated what speedup
the parallel simulation would need to achieve (as a function of M) in order
to better the accuracy of the purely sequential method. I call this the
break-even speedup. It can be interpreted in one of two ways: if the
parallel simulation is capable of exceeding the break-even speedup, then
either (a) the parallel strategy can produce the same level of statistical
confidence as the purely sequential strategy in less elapsed time, or (b)
the parallel strategy can produce a higher level of statistical confidence
than the purely sequential strategy in the same amount of elapsed time. The
advantage of this analysis is that it does not require any assumptions
about the speedup behavior of the parallel simulation.

The general behavior of the break-even speedup is that it is almost linear
in M, but the slope of the line decreases rapidly as the total number of
processors P increases. For really large P (e.g., 1024) the break-even
speedup is truly paltry even for large M, so that any reasonable
implementation of a parallel simulation ought to be able to produce a win.
(Note: I'm not suggesting that it's realistic to expect a parallel
simulator to show speedups on 1024 processors (in fact, that wouldn't give
you anything on which to base confidence intervals); rather, given 1024
processors, one ought to run an M=32, K=32 parallel simulation, or, if your
simulation starts to wheeze and die at 32 processors, maybe only M=16,
K=64.)

Finally, a pragmatic consideration against running multiple independent
replications in parallel is that some simulations have enormous memory
requirements; thus, it actually may not be possible to run the replications
simultaneously. In fact, Jefferson et al. at JPL are running into
situations in which they can't even calculate speedup, because they can't
run 1-processor versions of the problems they are simulating (they are
running on a distributed-memory architecture).

Dave Wagner
University of Washington Comp Sci Department
wagner@cs.washington.edu
{cornell,harvard,ssc-vax,tektronix}!uw-beaver!wagner

P.S. My complete dissertation will be available from the University of
Washington sometime this summer. The title is "Conservative Parallel
Discrete-Event Simulation: Principles and Practice".

------------------------------

+--------------------------+
| END OF SIMULATION DIGEST |
+--------------------------+