[comp.os.research] Problem #1

ejnorman@dogie.macc.wisc.edu (Eric Norman) (02/15/88)

Assume that we have some data somewhere, a program somewhere, and that we
want the program to chew on the data.  What principles are there that we
can use to decide whether to move the program to the data or to move the
data to the program?

Note: this is not asking whether we should move programs to data or
vicey-versey; it is asking for principles that we can use to make those
decisions.

Eric Norman <ejnorman@unix2.macc.wisc.edu>

fouts@orville.nas.nasa.gov (Marty Fouts) (02/19/88)

In article <4614@sdcsvax.UCSD.EDU> ejnorman@dogie.macc.wisc.edu (Eric Norman) writes:
>Assume that we have some data somewhere, a program somewhere, and that we
>want the program to chew on the data.  What principles are there that we
>can use to decide whether to move the program to the data or to move the
>data to the program?
>

This one is easy.  Move whichever one costs the least to move (;-)
Now all you have to do is calculate cost :-(.

If you are using identical processing elements and you assume an
infinite supply of memory, the cost is trivially calculated as the
size of the object to be moved, so you copy the smaller of the two
items and use it at the other site.  Of course, reality soon intrudes
in the form of limited memory, disparate processing elements, and the
need for multiple access to the same data.
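Under those idealized assumptions the rule collapses to a size comparison.  A minimal sketch (the function and its names are mine, not part of any real system):

```python
# Idealized rule: with identical processors and unbounded memory,
# the cost of a move is just the size of the thing moved, so we
# copy the smaller of the two objects to the other site.

def cheapest_move(program_size, data_size):
    """Return which object to move under the idealized cost model."""
    return "program" if program_size <= data_size else "data"

print(cheapest_move(program_size=64_000, data_size=2_000_000))   # -> program
print(cheapest_move(program_size=512_000, data_size=100_000))    # -> data
```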

If you have a *lot* of memory, you can create the program everywhere
you might need it.  By creating a copy for each of the different kinds
of processing elements, you can hide that problem.  The program
doesn't have to be created in cpu memory, but can be cached on disk
and referenced when needed.  This adds yet another cost complexity:
when is it cheaper to move the data object someplace where the program
is resident than it is to bring the program in from disk.  It also
introduces the problem of which elements a program can run on.

When to move the program is also complicated by load balancing.
Microeconomics might indicate moving the program, but the destination
may be overloaded, so the overall cost might be lower to move the data
object.

Then there is the data item.  Hidden in the cost for moving it is the
way in which it will be used.  If a data item is only used as an input
to a computation, then it can probably be safely copied to the
destination with the original left intact, but if it is being used as
both input and output, there is a data consistency problem, since some
later computation will have its choice of items.  The simplest way
around this is to move writable data rather than copy it, so that the
original reference goes away.  This becomes very difficult when the
data contains pointers to other data or is heavily referenced.
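The copy-versus-move distinction can be sketched as follows.  This is a toy model; the per-site reference table and the names are assumptions of mine, not anyone's actual design:

```python
# Toy model of copy-vs-move semantics for a datum tracked in a
# per-site reference table.  Read-only data is copied (both sites
# keep a reference); writable data is moved, so the original
# reference goes away and no consistency problem can arise.

def transfer(sites, datum, src, dst, writable):
    sites.setdefault(dst, set()).add(datum)
    if writable:
        sites[src].discard(datum)   # move: the original reference disappears

sites = {"A": {"d1", "d2"}, "B": set()}
transfer(sites, "d1", "A", "B", writable=False)  # input only: copy it
transfer(sites, "d2", "A", "B", writable=True)   # input and output: move it
print(sites)   # d1 now on both sites, d2 only on B
```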

There is also a programming paradigm issue.  Some distributed systems
treat data as objects.  In the full-blown
approach, some program is always kept with the data and this program
is the only way to access and modify elements of the data.  Monitors
have the same structure.  This can influence the choice.

Future behavior is also important.  If the program is going to create
a lot of output data, where that data is going to go becomes
important.  This is a recursive predictive problem, since that data is
going to become input to further computation, etc.

So, (as a starting point for further discussion) I would propose the
following principles:

View programs as multiple input / multiple output filters.

View data as persistent objects.  The interesting properties for a
given choice of which to move are (not in order of importance):
  1) Number of currently active references by other programs
  2) Number of currently active references to/from other data
  3) Size

For a given program, a datum can be input, output or both.  To
determine which should move, calculate the set of all current
locations for any of the input or output data, and the current
location of the program.  From this set, compute:

1) The set of all places where the program can run.  (If the
   processors are not identical, there may be some reason to restrict
   the places where the program can execute.)  This set is not empty,
   because the initial location of the program is in it.  If it has
   only one member, then the problem is trivial.  Move the data.
   Otherwise go to step 2.

(Note: in steps 2 and 3, the set of locations where the program can run
 is restricted to the set calculated in 1.)

2) If the set of sources and destinations has only two locations,
   calculate the relative cost of moving from each location to the
   other, and then make the cheapest move.  Calculating the cost is
   the fun part of the algorithm.  Otherwise go to step 3.

3) In the extreme, this step is very intense, because it means finding
   an optimum by examining the power set of the initial set of
   locations and making the series of moves which are the cheapest.
   My engineer's mind says restrict the problem set so that multiple
   source/destination locations are prohibited, but I bet someone
   will come up with an example to justify multiple sources and
   destinations.
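The three steps above can be sketched as a procedure.  The cost model here is the naive size-based one; as noted above, computing a realistic cost is the fun part, and everything below the step-1 check is my own simplification of steps 2 and 3:

```python
# Sketch of the three-step decision procedure.  Cost is modeled
# naively as bytes moved; a real system would fold in references,
# consistency, and load as discussed in the article.

def decide(program_site, data_site, program_size, data_size, runnable_sites):
    # Step 1: the set of places the program can run.  It is never
    # empty, because the program's current site is always in it.
    candidates = {program_site} | set(runnable_sites)
    if candidates == {program_site}:
        return "move data"            # only one legal site: trivial case
    # Steps 2/3 (collapsed to the two-location case): compare the
    # cost of each direction and make the cheaper move.
    if data_site in candidates and program_size < data_size:
        return "move program"
    return "move data"

# Program is small and may run where the data lives: move the program.
print(decide("A", "B", 64_000, 2_000_000, runnable_sites={"A", "B"}))
# The data's site can't run the program: forced to move the data.
print(decide("A", "B", 64_000, 2_000_000, runnable_sites={"A"}))
```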

Obvious restrictions are to require the output data to all reside at
the same location and to force the program to be runnable at all
locations.  Neither of these is satisfactory in the environment I'm
working in, because we do distributed computations between machines
precisely because they are different.

The cost of moving data must be calculated in terms of current
references, relationship to other data, and need for consistency of
multiple copies, but I don't have a handle on what the function should
look like.  Replicating the program is cheaper, since you can enforce
the rule that programs shouldn't be self modifying, so there is little
need for copy consistency.
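One plausible shape for such a function, purely as a discussion starter (the linear form and every weight are invented, not measured):

```python
# A guessed-at cost function for moving a datum: raw size plus
# penalties for active references from programs and from other
# data, plus a surcharge when copies must be kept consistent.

def data_move_cost(size, program_refs, data_refs, needs_consistency,
                   w_prog=10_000, w_data=5_000, w_consistency=50_000):
    cost = size + w_prog * program_refs + w_data * data_refs
    if needs_consistency:
        cost += w_consistency
    return cost

# A heavily referenced small datum can cost more to move than a
# larger free-standing one.
print(data_move_cost(1_000, program_refs=8, data_refs=4,
                     needs_consistency=True))                    # -> 151000
print(data_move_cost(100_000, program_refs=0, data_refs=0,
                     needs_consistency=False))                   # -> 100000
```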

Marty