[comp.hypercube] Reply to Ed Segall's article replying to mine

cik@l.cc.purdue.edu (Herman Rubin) (01/06/88)

I have posted the following article to comp.arch.  Since the article to
which I am replying appeared in comp.hypercube and requested clarification of
certain points, I request that this article, which attempts that
clarification, also appear in comp.hypercube.




I would like to clarify my claim that acceptance-rejection methods for
generating random variables may be more suited for some vector machines than
for parallel machines.  On a vector machine, after checking that the input
random vectors U and T (usually U will be uniform, preferably not in
floating point, and T will be exponential) are sufficiently long, the
general idea of the procedure is

	X = F(U)
	D = G(U)
	T = H(T,D)
	
	If T[i] satisfies an appropriate condition, accept X[i] and keep
T[i]; otherwise replace X[i] with something else (or possibly discard the
value of X[i]) and discard the value of T[i].  There are more sophisticated
possibilities, but this gives the general idea.
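
To make the shape of the computation concrete, here is a minimal scalar sketch
in C of the step just described.  The transformations f, g, h and the
acceptance test are placeholders of my own choosing, not any particular
generator; a real vector machine would run each line over whole vectors at
once.

    #include <stddef.h>

    static double f(double u) { return 2.0 * u - 1.0; }    /* X = F(U): placeholder */
    static double g(double u) { return u * u; }             /* D = G(U): placeholder */
    static double h(double t, double d) { return t - d; }   /* T = H(T,D): placeholder */
    static int accept_ok(double t) { return t > 0.0; }      /* acceptance test: placeholder */

    /* Process n candidates; accepted X values and their surviving T values
       are packed to the front of xout and tout.  The return value, the
       number accepted, cannot be predicted in advance. */
    static size_t accept_reject(const double *u, const double *t, size_t n,
                                double *xout, double *tout)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++) {
            double x  = f(u[i]);
            double d  = g(u[i]);
            double tt = h(t[i], d);
            if (accept_ok(tt)) {      /* accept X[i] and keep T[i] */
                xout[k] = x;
                tout[k] = tt;
                k++;
            }                         /* otherwise discard X[i] and T[i] */
        }
        return k;
    }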

Since we are dealing with random input, it is impossible to predict the length
of the resulting vectors.  Also this algorithm is vectorized; with a sequential
processor it would be better to use a T value until a rejection occurs.  There
is a considerable difference between vector processors.  On the CYBER 205, the
operations described use only simple machine instructions.  On the CRAY 1,
things are quite difficult to vectorize, and since the calculation of F and G
may involve table lookup, scatter-gather procedures must be available in many
cases.  On the CRAYs which have scatter-gather, if the procedure is highly
efficient (small rejection probability), things are not too bad.  Rejection
probabilities may be small (<.01) or somewhat larger, say between .1 and .4.
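
To show why table lookup drags scatter-gather into the picture, here is a
hedged sketch; the table and the index map are invented for illustration.
Each element of U selects a data-dependent table slot, so keeping the
operation in vector mode requires an indexed (gather) load.

    #include <stddef.h>

    /* Compute F(U) by table lookup.  Each element needs its own,
       data-dependent table slot, which is exactly a gather. */
    static void lookup_f(const double *u, size_t n,
                         const double *table, size_t tsize, double *x)
    {
        for (size_t i = 0; i < n; i++) {
            size_t j = (size_t)(u[i] * (double)tsize);  /* data-dependent index */
            if (j >= tsize) j = tsize - 1;              /* guard against u[i] == 1 */
            x[i] = table[j];                            /* indexed load per element */
        }
    }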

On a parallel processor, one can either make sure that each of the separate
computing units produces exactly the same number of items or merge the various
output vectors.  We also have to write the surviving T values back, eliding
the discarded elements.  Also, one or more of the units may find that it needs
to call a refill subroutine.  This may lead to unsuspected failures of
reproducibility; this is not _necessarily_ bad.  It can also cause the various
processors to take greatly different times, which is likely to be bad.
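
For concreteness, here is a rough sketch of the bookkeeping this implies; the
per-unit buffer layout is an assumption made only for illustration.  Each of p
units produces a variable-length block of accepted values, and the blocks must
be merged into one stream whose length is not known until the run is over.

    #include <stddef.h>
    #include <string.h>

    /* Merge the variable-length outputs of p units into one stream;
       returns the (data-dependent) total length. */
    static size_t merge_outputs(double *const out[], const size_t len[],
                                size_t p, double *merged)
    {
        size_t total = 0;
        for (size_t k = 0; k < p; k++) {
            memcpy(merged + total, out[k], len[k] * sizeof(double));
            total += len[k];
        }
        return total;
    }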

There is a _possible_ way out if some good vector instructions are added to
parallel processors.  What is needed are the vector test, merge, and compress
instructions such as are found on the CYBER 205.  (An article in which I raised
the possibility of this was previously posted.)  If these could be added to
parallel processors, possibly as an independent computing unit, then even SIMD
machines might be able to do a reasonable job in these situations, provided
that the slightly less efficient vector version of the algorithm is used.
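
For readers unfamiliar with these instructions, here is a sketch of their
effect written as plain C loops; on the CYBER 205 each one is a single
hardware instruction operating under a bit mask, and the test condition below
is only illustrative.

    #include <stddef.h>

    /* Test: build a mask from an elementwise condition. */
    static void vtest(const double *t, size_t n, int mask[])
    {
        for (size_t i = 0; i < n; i++)
            mask[i] = (t[i] > 0.0);               /* placeholder condition */
    }

    /* Compress: pack the elements where the mask is set; returns new length. */
    static size_t vcompress(const double *v, const int mask[], size_t n,
                            double *out)
    {
        size_t k = 0;
        for (size_t i = 0; i < n; i++)
            if (mask[i]) out[k++] = v[i];
        return k;
    }

    /* Merge: take a[i] where the mask is set, b[i] otherwise. */
    static void vmerge(const double *a, const double *b, const int mask[],
                       size_t n, double *out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = mask[i] ? a[i] : b[i];
    }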

	Another situation in which similar problems might arise is that of
evaluating a function where different algorithms are required in different
parts of the domain.  Breaking up the domain, evaluating each subset with its
appropriate routine, and merging the results is certainly the best way _if it
can be done_.
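
As a concrete (invented) example of such a breakup, consider computing
exp(x) - 1, which wants a series near zero and the direct formula elsewhere;
the split point and the two routines below are illustrative choices, not
anything from the article being answered.  On a vector machine the branch
becomes: test to form a mask, compress each subset, run its routine, and
merge back into the original order.

    #include <stddef.h>
    #include <math.h>

    static double near_zero(double x)   /* truncated series, accurate for small |x| */
    {
        return x * (1.0 + x * (0.5 + x / 6.0));
    }

    static double elsewhere(double x)   /* direct formula, fine for larger |x| */
    {
        return exp(x) - 1.0;
    }

    static void eval_split(const double *x, size_t n, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            if (fabs(x[i]) < 1.0e-3)
                y[i] = near_zero(x[i]);
            else
                y[i] = elsewhere(x[i]);
        }
    }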

Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet