rjc@cs.ucla.edu (Robert Collins) (03/16/90)
In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
>
>	Since my friendly neighborhood "experienced CM hacker" is not
>currently available for consultation, I'm asking the net at large for a
>clarification of something that puzzles me. (I'm not a CM hacker, but we
>did take a close look at a CM when we went computer shopping some time
>ago.)
>	I've read a couple of statements now to the effect that
>efficiency improves as virtual processor ratio increases. This seems
>counterintuitive, as I would expect each virtual processor to add
>(superlinearly) to the overhead of "processor management".
>	Would someone please explain?

If your front end (scalar) processing time exceeds the parallel
processing time during a given sequence, then going to a higher VP ratio
(more work for the CM for a given amount of front end work) will get you
a better than linear speedup.  Unless you have a very fast front end,
you can get into a situation like this at a VP ratio of 1.  I have long
code sequences that will keep our CM only .75 busy even when driven by a
16 MIP Sun4/330 (and only cranking out Paris calls--no scalar
computation), although typically this is not a big problem with a fast
front end.  Things were really ugly when we had a VAX 8350 front end
(somewhere around 1 MIP for a CM process).  In general, you want the
bottleneck of the computation to be the CM.  Get the fastest front end
you can.

Now, assuming that you have the CM running full out, you can get better
than linear speedup by increasing the VP ratio for some operations--in
particular, floating point (with the floating point accelerators).  The
more operations you push through the pipeline, the less overhead you
have per op for filling and emptying the pipeline.  Other operations
scale worse than linearly (for example, routing at certain VP ratios).

As far as I know, there is no real "processor management" overhead that
increases with VP ratio.  I think you just loop that many more times in
the sequencer (or wherever the VP stuff is handled).

Hope this helps clear up the confusion,
rob
-------------------------------------------------------------------------------
rjc@cs.ucla.edu		C++/Paris on the CM2:  The only way to fly.
-------------------------------------------------------------------------------
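[Editor's note: rob's front-end-bottleneck argument can be sketched with a toy model. All constants below are hypothetical, chosen only to illustrate the shape of the effect, not measured CM figures.]

```python
# Toy model of front-end vs. CM overlap (all timings hypothetical).
# The front end spends FE_ISSUE per Paris call it emits; the CM then
# executes each instruction for a time proportional to the VP ratio.
# The CM only becomes the bottleneck once its per-instruction work
# exceeds the front end's issue cost.

FE_ISSUE = 2.0      # front-end time to emit one Paris call (assumed units)
CM_PER_VP = 0.5     # CM execution time per instruction per VP (assumed)

def cm_utilization(vp_ratio, n_instructions=1000):
    fe_time = n_instructions * FE_ISSUE
    cm_time = n_instructions * CM_PER_VP * vp_ratio
    # Instructions stream through a FIFO, so wall time is governed by
    # whichever side is slower.
    wall = max(fe_time, cm_time)
    return cm_time / wall

for vpr in (1, 2, 4, 8):
    print(vpr, round(cm_utilization(vpr), 3))
```

With these (made-up) constants the CM sits 75% idle at VP ratio 1 and saturates by VP ratio 4 -- the same qualitative picture as rob's slow-front-end anecdote.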
barmar@Think.COM (Barry Margolin) (03/16/90)
In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
>	I've read a couple of statements now to the effect that
>efficiency improves as virtual processor ratio increases. This seems
>counterintuitive, as I would expect each virtual processor to add
>(superlinearly) to the overhead of "processor management".
>	Would someone please explain?

The efficiency gain is because there is some per-instruction overhead
that is amortized across the virtual processors.  The comparison must be
made between ways of implementing a program that operates on more data
elements than there are PEs.  If the size of your application is larger
than the number of physical processors, there are two ways you can
choose to implement it: 1) iterate in the application code, or 2) use
virtual processors (I'm ignoring CMIS for this discussion).  In the
former case, each iteration involves executing front-end code, pushing
PARIS instructions down the FIFO, and decoding the PARIS into CM
microcode.  With VPs, however, these steps are done once; the iteration
occurs in the CM microcontroller after the instruction has been decoded.
At high VP ratios this overhead amortization can really add up.
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar
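[Editor's note: Barry's two implementation strategies can be compared directly in a sketch. The per-step costs are assumptions for illustration, not measurements.]

```python
# Sketch of the two strategies above, with hypothetical per-step costs.
# Strategy 1 (front-end loop) pays front-end + FIFO + decode overhead on
# every iteration; strategy 2 (virtual processors) pays it once, and the
# microcontroller iterates after decoding.

OVERHEAD = 3.0    # front-end code + FIFO push + decode, per Paris op (assumed)
EXECUTE  = 1.0    # microcode execution per repetition (assumed)

def front_end_loop(n_elements, n_pes):
    iters = -(-n_elements // n_pes)          # ceiling division
    return iters * (OVERHEAD + EXECUTE)      # overhead paid every iteration

def virtual_processors(n_elements, n_pes):
    vpr = -(-n_elements // n_pes)            # VP ratio
    return OVERHEAD + vpr * EXECUTE          # overhead paid once

# 64K data elements on 8K physical PEs: VP ratio 8.
print(front_end_loop(64 * 1024, 8192))
print(virtual_processors(64 * 1024, 8192))
```

At VP ratio 8 the front-end loop pays the overhead eight times versus once, and the gap widens linearly with the ratio.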
rrt@romeo.cs.duke.edu (Russell R. Tuck) (03/16/90)
In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
>	I've read a couple of statements now to the effect that
>efficiency improves as virtual processor ratio increases. This seems
>counterintuitive, as I would expect each virtual processor to add
>(superlinearly) to the overhead of "processor management".

There are two unavoidable costs to virtual processors that I know of.
(1) Physical PE memory is shared by virtual processors, so memory
    available per virtual processor is inversely proportional to VPR.
(2) Each operation has to be done VPR times, so it takes VPR times as
    long.
There are also two kinds of processing overhead that I know of.
(a) Any PE addresses have to be adjusted for each repetition of the
    operation.  This is just adding a constant, I think.
(b) Every operation, regardless of VPR, has to be sent from host to
    sequencer.
Note that all the overhead is linear (1, 2, a) or constant (b) in VPR.

Here's the explanation for improved efficiency at high VPR as I
understand it.  Host-to-sequencer overhead dominates execution time for
operations with VPR=1.  The sequencer handles (2) and (a), so (b) is
invariant (or nearly so) with respect to VPR.  (1) affects whether
programs run out of PE memory, but not the speed of those that don't.
So larger VPRs get to amortize (b) over more work, and so make better
use of the available PE cycles.

PS - I over-generalized when I said recently that MIMD folks blithely
assume constant-time parallel memory access.  Some do, but many others
don't, as a friend working on the BBN Butterfly here at Duke reminded
me.  Sorry.

Russ Tuck                           rrt@cs.duke.edu
Computer Science Department         ...!{decvax|mcnc}!duke!rrt
Duke University                     (919) 660-6527
Durham, NC 27706, USA
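[Editor's note: Russ's cost breakdown translates directly into a small efficiency formula. The constants are hypothetical; only the linear-plus-constant structure comes from the post.]

```python
# Cost model following the breakdown above (constants are assumptions):
#   (b) host-to-sequencer cost, paid once per operation,
#   (2) useful per-repetition work and (a) address adjustment, each
#       paid VPR times inside the sequencer.
# Efficiency = useful PE work / total time; it climbs toward its
# asymptote as the fixed cost (b) is amortized over more repetitions.

HOST_TO_SEQ = 10.0   # (b), per operation (assumed units)
PER_REP     = 1.0    # (2), useful work per repetition (assumed)
ADDR_ADJUST = 0.1    # (a), address-base add per repetition (assumed)

def efficiency(vpr):
    useful = vpr * PER_REP
    total = HOST_TO_SEQ + vpr * (PER_REP + ADDR_ADJUST)
    return useful / total

for vpr in (1, 4, 16, 64):
    print(vpr, round(efficiency(vpr), 3))
```

Because the overhead is only linear (2, a) or constant (b) in VPR, efficiency rises monotonically with the ratio rather than collapsing superlinearly, which is the answer to Jeff's original question.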
art@cs.bu.edu (Al Thompson) (03/16/90)
In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
|
|	Since my friendly neighborhood "experienced CM hacker" is not
|currently available for consultation, I'm asking the net at large for a
|clarification of something that puzzles me. (I'm not a CM hacker, but we
|did take a close look at a CM when we went computer shopping some time
|ago.)
|	I've read a couple of statements now to the effect that
|efficiency improves as virtual processor ratio increases. This seems
|counterintuitive, as I would expect each virtual processor to add
|(superlinearly) to the overhead of "processor management".
|	Would someone please explain?

That is in fact a correct statement.  What happens is that the vp's are
really abstractions.  Each physical pe has its memory segmented in such
a way that each vp uses a different address base.  Thus each physical pe
behaves as if it were doing the work of several pe's.  The gain in
efficiency comes when you realize that the instructions need only be
broadcast once and decoded once for the physical processors.  That means
that when you are using vp's you only get charged once for this
overhead.  From that comes the statement that vp's are "more efficient"
than physical processors.  So, the more vp's you have, the better the
overall gain.
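[Editor's note: the segmented-memory/address-base scheme Al describes can be sketched as follows. The layout and memory size are illustrative assumptions, not the actual CM microcode.]

```python
# Sketch of per-VP memory segmentation (layout hypothetical).  Each
# physical PE's memory is divided into VPR equal segments; virtual
# processor v on a PE addresses a field at a per-VP base offset.

PE_MEMORY_BITS = 64 * 1024   # bits per physical PE (assumed; actual
                             # CM configurations varied)

def vp_address(vp_index, field_offset, vpr):
    """Bit address of `field_offset` inside VP `vp_index`'s segment,
    at VP ratio `vpr`.  Per-VP memory shrinks as 1/VPR."""
    segment_size = PE_MEMORY_BITS // vpr
    assert field_offset < segment_size, "field does not fit at this VPR"
    return vp_index * segment_size + field_offset

# The sequencer repeats each broadcast instruction once per VP, adding
# the next segment base each time -- "just adding a constant", so the
# instruction is broadcast and decoded only once.
print([vp_address(v, 128, vpr=4) for v in range(4)])
```

Note the flip side of this scheme, raised elsewhere in the thread: since segments divide a fixed physical memory, the memory available per VP is inversely proportional to the VP ratio.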
billo@nova.npac.syr.edu (Bill O) (03/17/90)
In article <8374@hubcap.clemson.edu> bcsaic!carroll@beaver.cs.washington.edu (Jeff Carroll) writes:
>
>	I've read a couple of statements now to the effect that
>efficiency improves as virtual processor ratio increases. This seems
>counterintuitive, as I would expect each virtual processor to add
>(superlinearly) to the overhead of "processor management".

Other respondents have mentioned that higher v/p ratios usually lead to
higher CM utilization figures and reduced CM instruction overhead.  It
should also be mentioned that for many CM instructions, especially
floating point, higher v/p ratios mean more opportunities to keep CM
internal pipelines full.  Fuller pipelines mean more computation per
second.
--
Bill O'Farrell, Northeast Parallel Architectures Center at Syracuse University
(billo@nova.npac.syr.edu)
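[Editor's note: the pipeline-filling point can be quantified with the standard fill/drain model. The pipeline depth below is a hypothetical figure, not the actual floating point accelerator's.]

```python
# Pipeline-fill amortization (depth hypothetical).  A pipelined FPU
# spends FILL cycles filling and draining, then delivers one result per
# cycle.  A higher v/p ratio supplies more operands per instruction, so
# the fixed fill cost is spread over more results.

FILL = 8    # cycles to fill + drain the pipeline (assumed depth)

def throughput(n_operands):
    """Results per cycle for one instruction over n_operands operands."""
    return n_operands / (FILL + n_operands)

for vpr in (1, 8, 64):
    print(vpr, round(throughput(vpr), 3))
```

At v/p ratio 1 the pipeline spends most of its time filling and draining; by ratio 64 it runs near its peak rate, which is the "fuller pipelines" gain Bill describes.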
slim@cis.ohio-state.edu (Scott Whitman) (03/19/90)
In regards to virtual processor comments, right now I am working on
theorizing the nearly optimal ratio of virtual processors to real
processors for a particular application.  In this case, I am working on
a BBN Butterfly, and the virtual processors are really just tasks.  In
any case, I am trying to come up with a theory for determining a priori
what a nearly optimal ratio would be, based on the size of the input
data and the number of processors one is working with (i.e., the best
ratio may change depending on how many processors are assigned to the
problem).

In general, I believe that an increase in this ratio is worthwhile in
order to reduce the end-effects problem.  However, a point of no return
is hit when the overhead of starting a certain number of tasks exceeds
the gains one receives in reducing the end effects.  At this point
(actually, just prior to it), we have hit the nearly optimal ratio.
Anyway, we are trying to determine a good heuristic for obtaining this
ratio.  Right now I am basing it more or less on the input data size,
the overhead to start a new task, and the fact that the dataset somewhat
increases in size due to duplication as the ratio increases.

The problem is that I don't believe this model is very accurate.  In
fact, for two inputs of the same size, it is not necessarily true that
the times will be nearly the same.  It is roughly true, though, that a
larger input size results in a longer overall time, and so our model
bases its heuristic on this fact.  Obviously the machine parameters also
make a great deal of difference, but for the moment we are ignoring
these.

What I'd like to know is if anyone has any comments regarding this
strategy, and also if there are any papers on this subject matter.

Advthanksance,
-Scott Whitman
--
Scott Whitman, Department of Computer and Information Science
slim@cis.ohio-state.edu or ...!{pyramid,killer}!osu-cis!slim
The Ohio State University; 2036 Neil Ave. Columbus OH USA 43210-1277
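[Editor's note: the trade-off Scott is trying to formalize -- per-task startup overhead growing with the ratio while end-effects shrink -- can be sketched as a minimization. All costs below are hypothetical and the end-effect term is a crude 1/ratio approximation, not his model.]

```python
# Toy model of the task-ratio trade-off (all costs hypothetical).
# More tasks per processor shrink end-effects (idle time at the tail of
# the computation) but add per-task startup overhead; the optimum sits
# where the two marginal effects balance.

TASK_STARTUP = 2.0      # cost to start one task (assumed)
TOTAL_WORK   = 10000.0  # total work in the problem (assumed)

def run_time(ratio, n_procs):
    n_tasks = ratio * n_procs
    startup = n_tasks * TASK_STARTUP
    # Crude end-effect term: each processor may idle for up to about one
    # task's worth of work at the end, so imbalance shrinks as 1/ratio.
    imbalance = TOTAL_WORK / n_tasks
    return TOTAL_WORK / n_procs + startup + imbalance

def best_ratio(n_procs, max_ratio=64):
    return min(range(1, max_ratio + 1), key=lambda r: run_time(r, n_procs))

print(best_ratio(16))
```

Analytically this model puts the optimum near sqrt(imbalance cost / startup cost) tasks per processor, which also shows why the best ratio shifts with the number of processors assigned -- one of the dependencies Scott's heuristic is after.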