ejnorman@dogie.macc.wisc.edu (Eric Norman) (02/15/88)
Assume that we have some data somewhere, a program somewhere, and that we want the program to chew on the data. What principles are there that we can use to decide whether to move the program to the data or to move the data to the program? Note: this is not asking whether we should move programs to data or vicey-versey; it is asking for principles that we can use to make those decisions. Eric Norman <ejnorman@unix2.macc.wisc.edu>
fouts@orville.nas.nasa.gov (Marty Fouts) (02/19/88)
In article <4614@sdcsvax.UCSD.EDU> ejnorman@dogie.macc.wisc.edu (Eric Norman) writes:
>Assume that we have some data somewhere, a program somewhere, and that we
>want the program to chew on the data. What principles are there that we
>can use to decide whether to move the program to the data or to move the
>data to the program?

This one is easy. Move whichever one costs the least to move (;-) Now all you have to do is calculate cost :-(.

If you are using identical processing elements and you assume an infinite supply of memory, the cost is trivially calculated as the size of the object to be moved, so you copy the smaller of the two items and use it at the other site. Of course, reality soon intrudes in the form of limited memory, disparate processing elements, and the need for multiple access to the same data.

If you have a *lot* of memory, you can create the program everywhere you might need it. By creating a copy for each of the different kinds of processing elements, you can hide that problem. The program doesn't have to be created in CPU memory, but can be cached on disk and referenced when needed. This adds yet another cost complexity: when is it cheaper to move the data object someplace where the program is resident than it is to bring the program in from disk? It also introduces the problem of which elements a program can run on.

When to move the program is also complicated by load balancing. Microeconomics might indicate moving the program, but the destination may be overloaded, so the overall cost might be lower to move the data object.

Then there is the data item. Hidden in the cost of moving it is the way in which it will be used. If a data item is used only as an input to a computation, then it can probably be safely copied to the destination with the original left intact; but if it is being used as both input and output, there is a data consistency problem, since some later computation will have its choice of items.
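A toy sketch of the "move whichever costs the least" rule under the idealized assumptions above (identical processing elements, unlimited memory, cost equal to the size of the object moved). The names here (Item, cheapest_move) and the sizes are illustrative, not from the post:

```python
# Sketch: under "cost == size", copy the smaller of the two items and
# use it at the other site.  Assumes identical processors and plenty
# of memory, as in the idealized case above.
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    size: int       # bytes to transfer if this item is moved
    location: str   # site where the item currently resides

def cheapest_move(program: Item, data: Item) -> Item:
    """Return the item that should be moved: the smaller one."""
    return program if program.size <= data.size else data

prog = Item("filter", size=200_000, location="siteA")
datum = Item("dataset", size=5_000_000, location="siteB")
print(cheapest_move(prog, datum).name)   # the program is smaller, so it moves
```

Everything reality adds later in the post (limited memory, disparate processors, shared writable data) shows up as extra terms in this cost, which is why size alone stops being enough.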
The simplest way around this is to move writable data rather than copy it, so that the original reference goes away. This becomes very difficult when the data contains pointers to other data or is heavily referenced.

There is also a programming paradigm issue. Some distributed systems implement the assumption of data as objects. In the full-blown approach, some program is always kept with the data, and this program is the only way to access and modify elements of the data. Monitors have the same structure. This can influence the choice.

Future behavior is also important. If the program is going to create a lot of output data, where that data is going to go becomes important. This is a recursive predictive problem, since that data is going to become input to further computation, etc.

So, as a starting point for further discussion, I would propose the following principles. View programs as multiple-input/multiple-output filters. View data as persistent objects. The interesting properties for a given choice of which to move are (not in order of importance):

1) Number of currently active references by other programs
2) Number of currently active references to/from other data
3) Size

For a given program, a datum can be input, output, or both. To determine which should move, calculate the set of all current locations for any of the input or output data, and the current location of the program. From this set, compute:

1) The set of all places where the program can run. (If the processors are not identical, there may be some reason to restrict the places where the program can execute.) This set is not empty, because the initial location of the program is in it. If it has only one member, then the problem is trivial: move the data. Otherwise go to step 2. (Note: in steps 2 and 3, the set of locations where the program can run is restricted to the set calculated in 1.)
2) If the set of sources and destinations has only two locations, calculate the relative cost of moving from each location to the other, and then make the cheaper move. Calculating the cost is the fun part of the algorithm. Otherwise go to step 3.

3) In the extreme, this step is very intense, because it means finding an optimum by examining the power set of the initial set of locations and making the series of moves which is the cheapest.

My engineer's mind says to restrict the problem set so that multiple source/destination locations are prohibited, but I bet someone will come up with an example to justify multiple sources and destinations. Obvious restrictions are to require the output data to all reside at the same location and to force the program to be runnable at all locations. Neither of these is satisfactory in the environment I'm working in, because we do distributed computations between machines precisely because they are different.

The cost of moving data must be calculated in terms of current references, relationship to other data, and need for consistency of multiple copies, but I don't have a handle on what the function should look like. Replicating the program is cheaper, since you can enforce the rule that programs shouldn't be self-modifying, so there is little need for copy consistency.

Marty
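The three steps above can be sketched as one decision procedure. Everything here is illustrative: the function names, the cost-function signatures, and the step-3 simplification (scoring each runnable site as a single gathering point, rather than searching the full power set) are my assumptions, and the real cost functions (references, consistency needs, load) are exactly what the post leaves open:

```python
def decide(program_site, runnable_at, data_sites,
           program_move_cost, data_move_cost):
    """Sketch of the three-step choice.  cost functions take (src, dst)
    and return the price of shipping that object between two sites."""
    # Step 1: the set of places the program can run; never empty,
    # because the program's current site is always in it.
    sites = {program_site} | set(runnable_at)
    if len(sites) == 1:
        return ("move_data", program_site)       # trivial: bring the data here
    # Step 2: exactly two locations in play -- make the cheaper move.
    involved = {program_site, *data_sites}
    if len(involved) == 2:
        other = (involved - {program_site}).pop()
        data_cost = sum(data_move_cost(s, program_site)
                        for s in data_sites if s != program_site)
        if other in sites and program_move_cost(program_site, other) <= data_cost:
            return ("move_program", other)
        return ("move_data", program_site)
    # Step 3 (simplified): score every runnable site as a gathering point
    # instead of examining the power set of locations.
    def total(dst):
        move_prog = 0 if dst == program_site else program_move_cost(program_site, dst)
        return move_prog + sum(data_move_cost(s, dst)
                               for s in data_sites if s != dst)
    return ("gather_at", min(sites, key=total))

# Step 1 fires: the program can only run where it is.
print(decide("A", [], ["B"], lambda s, d: 1, lambda s, d: 1))
```

Even this toy version shows why step 2 is "the fun part": the two lambdas stand in for the whole unsolved cost model.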