[comp.parallel] permutation summary

walker@cs.uiuc.edu (William Walker) (12/13/90)
Last month I asked for information about generating permutations in
parallel. Below is a copy of some responses I have received so far.
I haven't edited them, so there is some repetition.
I now have many good ideas for what to do. I would like to thank
all the respondents for their timely and excellent advice!

thanks,
Bill Walker     walker@m.cs.uiuc.edu       (217) 344-1312
 Department of Computer Science, University of Illinois
1304 West Springfield Avenue, Urbana, Illinois 61801-2987

-------------------------------------------------------------------------------



***** From bright@server.cs.jhu.edu Mon Nov 26 09:51:37 1990

Assuming that you are dealing with a hypercube, something 
like the following might work.

Let data[i] and random[i] be stored in processor i for all i.

1-  data[i] = i 
2-  random[i] = random number between 1 and N
3-  for i = 0 to log(N)-1 do
3a-   Let X and Y be two processors such that X and Y differ
  only in the i'th bit.  Assume X > Y.
3b-   if the i'th bit of random[Y] = 1, then swap data[X] and data[Y]
4-  random[i] = random number between 1 and N
5-  for i = log(N)-1 downto 0 do
5a-   Let X and Y be two processors such that X and Y differ
  only in the i'th bit.  Assume X > Y.
5b-   if the i'th bit of random[Y] = 1, then swap data[X] and data[Y]

I'm hoping that this procedure would have the affect of performing
a random routing of the elements in data using the network
of Abraham Waksman (A Permutation Network, Journal of the ACM, Vol 15,
January 1968).

I would imagine that this would take the same time as performing
two bitonic merges.

Note that this is completely off the top of my head so maybe it
is completely wrong.

Also I'm a little confused by what you mean by saying that we can
assume a general router but no shared memory.  What is the difference?

***** From abrodnik@watdragon.waterloo.edu Mon Nov 26 10:07:19 1990


I'd appreciate to receive the summary. On the other hand I tried something
similar, got a O(n log(n)) serial algorithm, but I don't know how to
parallelize it completley (part of it). There was an article in last FOCS by
Fich, Munro, and Poblete about permuting as well. But it does not seem to be 
parallelized simply. I am sure that permuting can be parallelized (otherwise 
AKS sorting algorithm wouldn't work in O(log(N)) time), but perhaps it has 
such a huge constant as AKS.

Regards

Andrej
 

***** From hammonds@riacs.edu Mon Nov 26 10:47:48 1990


On the Connection Machine can do the following.

Let each processors (or virtual processor) choose
a random number (real or integer).  Then, sort
the values.  In *lisp this operation is called
rank!!.  This gives you a "send address" and
permutes the values in parallel.  What machine
are you working on and what language?  Perhaps
I can be more specific...

   Steve Hammond

    hammonds@riacs.edu

***** From bromley@Think.COM Mon Nov 26 11:16:01 1990


A brief sketch follows.  For convenience assume N is a power of 2.  It
requires a source of pseudo random bits.  

Each processor starts out with its own index.  

loop for i from 0 upto log2(N)

  Each processor generates a random bit and exchanges that bit and its
  current value with processor self XOR 2^i.  
  If the bit from the processor with higher address is a 1 then each
  processor takes its partners value otherwise it keeps its own value.

***** From aka@cs.brown.edu Mon Nov 26 12:02:40 1990


One of the ways I use is the following (suited for hypercubes but can 
be simulated by a general router):

  Let processor i have the value i.
  Loop over the dimension j = 1 through logN
    For each pair of nodes whose addresses differ only in the $j$th
    bit, 
       Swap their values with probability .5
  endloop

It is a simple logN parallel step loop. Processor i is mapped to a 
processor that differs with probability .5 in each dimension. Of
course the permutations of different processors are not independent
in the strong sense, but is a good heuristic.

Good luck.

ajit agrawal

***** From jacquemin-michel@CS.YALE.EDU Mon Nov 26 14:52:44 1990

Here is a communication-intensive (but simple) solutio, that I used on
the Connection Machine:
   In each processor i, generate locally a random number r(i) (the range
   should be big compared to N).  Globally sort these number: the permutation
   to use is the "rank" of r(i) in the sorted sequence.

I don't claim this is perfect, but I imagine it should be satisfying,
as long as the range of the initial random numbers you generate is
large enough compared to N (to avoid biasing the distribution too
much, when two processors generate the same initial random number).

Michel Jacquemin


***** From bromley@Think.COM Mon Nov 26 16:58:02 1990

#include <cm/paris.h>

void
make_random_permutation (field)
     CM_field_id_t field;
{
  CM_vp_set_id_t v = CM_field_vp_set (field);
  CM_geometry_id_t g = CM_vp_set_geometry (v);
  int send_address_length = CM_geometry_send_address_length (g);
  int bits = 1 + send_address_length;
  CM_field_id_t f = CM_allocate_stack_field (bits);
  CM_field_id_t rand_bit = CM_add_offset_to_field_id (f, bits-1);
  CM_field_id_t partner_f = CM_allocate_stack_field (bits);
  CM_field_id_t partner_rand_bit = CM_add_offset_to_field_id (partner_f, bits-1);
  CM_field_id_t self = CM_allocate_stack_field (send_address_length);
  CM_field_id_t partner = CM_allocate_stack_field (send_address_length);
  CM_field_id_t do_exchange = CM_allocate_stack_field (1);
  
  int i;

  CM_set_context();

  CM_my_send_address_1L (f);
  CM_my_send_address_1L (self);

  for (i = 0; i < send_address_length; i++)
    {
      CM_u_random_1L (rand_bit, 1, 2);

      CM_logxor_constant_3_1L (partner, self, 1 << i, send_address_length);

      CM_send_1L (partner_f, partner, f, bits, CM_no_field);

      CM_logand_context (CM_add_offset_to_field_id (self, i));
      CM_u_move_1L (do_exchange, rand_bit, 1);
      CM_invert_context();
      CM_u_move_1L (do_exchange, partner_rand_bit, 1);
      CM_set_context();

      CM_logand_context (do_exchange);
      CM_u_move_1L (f, partner_f, send_address_length);
      CM_set_context();
    }

  CM_u_move_always_1L (field, f, send_address_length);
  CM_deallocate_stack_through (f);
}

  

***** From gkj@doc.imperial.ac.uk Tue Nov 27 03:37:32 1990

There is a paper by S. Tomboulian, "Indirect Addressing and Load Balancing
for Faster Solutions to Mandelbrot Set on Simd Architectures."  Application
note AN101, MasPar Computer Corporation, 749 North Mary Avenue, Sunnyvale,
CA 94086, June, 1989.  Although the paper isn't specifically about this
problem, at the end of it she describes how different ways of allocating
pixels to processing elements (random permutation being tried among others)
works out in efficiency terms.  I think the main conclusion was that a
psuedo-random scheme involving a series of shifts across the PE plane was
best in terms of time required to generate the "permutation" and resulting
"randomness".  Cheers!

Guido...


- Never eat yellow snow + Translated humor isn't + Lazyness is for weaklings -

        |||||||              Guido K. Jouret
        | ` ' |              gkj@doc.ic.ac.uk
       (| o o |)	     Functional Programming Section
        |  ^  |		     Imperial College of Science and Technology
	\ "~" / 	     London SW7 2BZ
	 \___/		     U.K.


***** From mrg@Princeton.EDU Tue Nov 27 10:06:19 1990

Bill,
Here's a simple algorithm.

    1.  Each processor creates a record with two fields:
	    key:  A random integer between 1 and 1*N (for M >> N).
	    src:  The location of the processor.
    2.  The SIMD sorts these records based on the "key" field.
        The probability of two records having the same key is N(N-1)/(2*M);
	so for M sufficiently large, I'll assume no collisions occur (more
	on this later).  Sorting can be done in O((log N)^2) time (e.g.
	Batcher's bitonic sort).  At the end of the sorting step,
	processor k has the record with the k'th smallest key.  The
	"src" field of this record is the integer (in 1..N) assigned
	to this processor.

So what about conflicts in the sorting step.  There are several obvious
choices:
    1.  Try again.  This happens with low enough probability that the
	expected run time of the algorithm remains good.
    2.  Break ties arbitrarily.  This results in a slightly non-uniform
	choice of permutations, but for M sufficiently large, no one will
	ever notice.
    3.  Break ties with your centralized algorithm.  Again, since ties
	are rare, and large-way ties even rarer, this is practical.

I'd be interested in a summary of approaches suggested.  You can send me
e-mail, or post to the net.
--Mark

***** From leo@wk48.nas.nasa.gov Tue Nov 27 12:37:54 1990

This may be faster, you'll have to try it out I imagine.  Begin
by assigning each node its self-address as the element of the permutation.
The idea now is to randomize this permutation, which now reads 1,2,...,N.
To do this, let each node generate a random number in the range [1,N], we'll
use this to tag the permutation value.  So now we have a structure that
looks something like this:

Node   Permutation  Tag
----   -----------  ---
1	1	    7
2	2	    3
3	3	    4
4	4	    3
5	5	    1
6	6	    5
7	7	    1
8	8	    6


Now sort the sequence such that the tags are in non-descending order, and
move the permutation value to the sorted position.  So our example above becomes


Node   Permutation  Tag
----   -----------  ---
1	5	    1
2	7	    1
3	2	    3
4	4	    3
5	3	    4
6	6	    5
7	8	    6
8	1	    7

Voila!  You have your random permutation.  For large enough N this may be
faster than doing it on the front end.  Integer sorting can be easily 
implemented with complexity O(N log N) in parallel, (sequentially, you can
do a bucket sort in time O(N), but in parallel this is somewhat harder to
achieve).  

If you do try this out, I'd be interested in learning the results (i.e. is
it any faster?)  I'm curious how large is N, and what machine are you working
on?  Good luck

Leo Dagum
(415) 604-4779

***** From kornkven@uxh.cso.uiuc.edu Tue Nov 27 14:50:22 1990

Hi Bill,

In CM Fortran on the Connection Machine, there is a function that
ranks the elements of a CM array.  So it seems that a possible
approach (on this machine anyway) would be to generate an array
of N random numbers, then to rank those random numbers, the ranks
being your permutation of 1..N.

Ed Kornkven
kornkven@cs.uiuc.edu


***** From leban@par3.cs.umass.edu Tue Nov 27 15:24:36 1990

Here's my solution to the parallel permutation problem.  I'm curious to know
if there's anything better.  I'd be interested in seeing a summary of
responses.

        --- Bruce
        Leban@cs.umass.edu  @amherst.mass.usa.earth
-----------------------------------------------------------------------------
The simplest sequential algorithm is
    for i = 1 to n
        swap a[i] with a[random]
My parallel algorithm is a variation of this.
-----------------------------------------------------------------------------
    /* i, j, k, a, b, c, swapped and ID are local to the PEs */
    /* this randomly permutes the values in a */
    a = ID;
    swapped = FALSE;
    while (!global-and(swapped))
        {
        j = -1;
        if (random(TRUE))              /* potential sender */
            {
            if (!swapped)
                {
                k = random(NPROCS);
                if (k != ID)
                    put ID in i at k with COLLISIONS=c;
                    /* c is the count of messages received */
                else if (c > 0 && random(1))    /* note 1 */
                    swapped = TRUE;
                }
            }
        else if (c == 1)                /* receiver (one incoming put) */
            {
            put ID in j at i with NOCOLLISIONS;
            put a in b at i with NOCOLLISIONS;
            }
        if (j >= 0)                     /* confirmed sender */
            {
            put a in a at j with NOCOLLISIONS;
            a = b;
            swapped = TRUE;     /* note 2 */
            }
        }

At each stage in the algorithm, each processor is either a sender or a
receiver.  The senders each select a random processor to swap with.  If two
select the same receiver, neither one swaps.  If a sender selects another
sender, it's ignored.  Once a swap is completed, the sender becomes a
permanent receiver. 

On the first step, of the non-colliding pairs, it's expected that half of
them are sender-receiver pairs.  The odds of having a large number of the
puts collide is very small (it's analogous to hashing k items into 2k
buckets).  Thus almost half of the potential senders perform a swap on the
first step.  Thus it's expected to take O(log n) steps to terminate.
-----
Note 1: since each sender selects a receiver with probability 1/2, we need to
make sure that if it selects itself, it has the same probability of treating
itself as a receiver.  Otherwise, there will be a bias towards self swaps.
We also don't do the self-swap if there's an incoming message.  This avoids a
similar bias caused by no collisions on self-swaps.

Note 2: the variable swapped is set in only one processor of the two.
Otherwise it would be producing a random set of 2-cycles, rather than a
random permutation.


***** From Marco.Zagha@ENQUIRER.SCANDAL.CS.CMU.EDU Tue Nov 27 22:59:47 1990

A simple approach is to generate random numbers in parallel and rank
them.  Rank is a sort procedure that returns n to the nth highest key.
You can build a rank out of a sort by appending the starting index of
each key to the low order bits of the key.  Something like rank is
often provided as a library routine, e.g. on the Connection Machine and
on the Cray.

Be careful though if you want a truly random permutation.  The
problem with the random+rank algorithm is that if you start
out with any duplicate keys, the result is not completely random.
(Not to mention the quality of the parallel random number generation).
The chance of getting duplicate keys is higher than you might
think -- for example, for large permutations, 32-bit keys might
not be enough.  I seem to recall seeing formulas for this in
statistics books along with problems like "How many people
do you need until you expect that more than one will have the
same birthday".

Another approach would be to look up random mating algorithms
and do a swap based approach.  I'm sure that theoreticians have
beaten this problem to death.

Please post your summary on comp.parallel.

== Marco Zagha
Internet: marcoz@cs.cmu.edu            Uucp:   ...!seismo!cs.cmu.edu!marcoz  
Bitnet:   marcoz%cs.cmu.edu@cmuccvma   CSnet:  marcoz%cs.cmu.edu@relay.cs.net

***** From ney@camelback.Princeton.EDU Wed Nov 28 09:19:38 1990


I'm interested in problems like this, but my background is weak.  Can
you suggest some references summarizing known standard algorithms for
parallel architectures?  (I have a march 1988 tech report #UCB/CSD
88/408 "A Survey of Parallel Algorithms for Shared Memory Machines" by
Karp and Ramachandran.)  In exchange I'll give you my (most likely
naive) ideas for your problem.

First, if there is a quick way to assign an arbitrary permutation (for
instance, if the sites are numbered and know their numbers, a
reasonable assumption since I would guess the sites must be named)
then one should be able to quickly shuffle this permutation by
performing random swaps in parallel.  For instance we can do

define shuffle(low, high)
  if (low >= high)
    return
  else
    mid = (low + high)/2
    in parallel
      shuffle(low, mid), shuffle(mid, high)
    in parallel for each i=0...mid-low-1
      if (random() mod 2 == 1)
	swapNumbersOf(low + i, mid + i)

Then shuffle(1,N) (where N is a power of two) randomly permutes the 
numbers of sites 1..N in log(n) rounds, with each round requiring 
2N messages with each site sending and receiving one message.

Alternatively, if there is no preexisting permutation, start by having
each site have an random real number in the interval [0,1), then use a
fast parallel sorting algorithm to compute the rank of the number at
each site.  Of course one can't generate random reals, instead
generate the bits of the reals lazily, that is start out knowing 0
bits of each number, and whenever two numbers are compared, if the
known bits of the numbers are not sufficient to decide their order,
randomly extend each one until their order is determined.  Instead one
could start by generating numbers in the range (1..N^3) say, in which
case the probability that any two are equal vanishes.  This also has
the advantage of allowing a radix sort.

I don't know how fast parallel sorting algorithms really work, though,
and it seems like there should be an easier way.

  Neal Young	(ney@cs.princeton.edu)


***** From prins@cs.unc.edu Wed Nov 28 15:40:57 1990


I had to do this recently on a MasPar MP-1.  I just followed Knuth's
algorithm for generating a random permutation of 0..N-1: start with P[i] = i
(for i in 0..N-1) and independent random values R[0]..R[N-1] where 0 <= R[i]
< N.  Now swap each P[i] with P[R[i]], in some serializable order.  To do
this quickly place P[i] and R[i] on processor i, and let the router
determine the serialization (the details of this depend heavily on the
construction of the router, of course, and will be different on a CM for
example).  The MPL code below generates a random permutation of 0..nproc-1
on an MP-1 in around 2.5ms.  ..jfp

/* generate a random permutation using router */
   {
      register plural short T,P,R;
      register plural unsigned char C;

      R = (p_random() >> 2) & (nproc-1);
      P = iproc;
      C = 1;
      while (C) {
        if (connected(R)) {
          T = router[R].P;
          router[R].P = P;
          P = T;
          C = 0;
        }
      }
    }

--\--  Jan Prins  (prins@cs.unc.edu)       
  /    Computer Science Dept.                
--\--  UNC Chapel Hill  

***** From thy@adesign.uucp Thu Nov 29 04:34:37 1990


I am not sure this can help you, but i can give it as is anyway.

#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 11/29/1990 09:59 UTC by thy@dsgn42
# Source directory /user1/thy/spl/tmp
#
# existing files will NOT be overwritten unless -c is specified
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#   4667 -rw-rw-r-- 1
#   2506 -rw-r--r-- par.h
#    952 -rw-rw-r-- ref.txt
#    833 -rw-rw-r-- rnd.c
#
# ============= 1 ==============
if test -f '1' -a X"$1" != X"-c"; then
	echo 'x - skipping 1 (File already exists)'
else
echo 'x - extracting 1 (Text)'
sed 's/^X//' << 'SHAR_EOF' > '1' &&
#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 11/29/1990 09:54 UTC by thy@dsgn42
# Source directory /user1/thy/spl/tmp
#
# existing files will NOT be overwritten unless -c is specified
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#   2506 -rw-r--r-- par.h
#    833 -rw-rw-r-- rnd.c
#
# ============= par.h ==============
if test -f 'par.h' -a X"$1" != X"-c"; then
X	echo 'x - skipping par.h (File already exists)'
else
echo 'x - extracting par.h (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'par.h' &&
#include <stdio.h>
#include <malloc.h>
XX
#define EVER 1
#define TIME 2
#define ATOM 16
XX
#define MAX_VAL 99
#define MIN_VAL -1
#define DEF_VAL 0
XX
Atom *_univ;
int global, _val;
int n_atom, _atom, _time, _wrap_flag;
int THIS, _trace;
XX
#define FOR_EVER for(;;) {
#define END_FOR_EVER }
#define REPEAT(x) for (THIS = 0; THIS < x; ++THIS) {
#define END_REPEAT }
#define LOOP REPEAT(n_atom)
#define END_LOOP }
XX
#define here 0
#define left(x) ((x) - 1)
#define right(x) ((x) + 1)
XX
#define NULL_LEFT (*(_univ + _time * (n_atom + 2)) = null_atom)
#define NULL_RIGHT (*(_univ + n_atom + _time * (n_atom + 2) + 1) = null_atom)
#define WRAP_ON (_wrap_flag = 1)
#define WRAP_OFF (_wrap_flag = 0)
#define WRAP_TWIST (_wrap_flag = !_wrap_flag)
#define PAR_INIT(x) (n_atom = (x), _univ = (Atom *)malloc(sizeof(Atom) * (n_atom + 2) * TIME), WRAP_ON)
#define _NEXT_TIME ((_time + 1) % TIME)
#define _ALL_ALL_ATOM for (_atom = -1; _atom <= n_atom; ++_atom)
#define _ALL_ATOM for (_atom = !_wrap_flag ? -1 : 0; _atom < (!_wrap_flag ? n_atom + 1 : n_atom); ++_atom)
#define _ATOM(x) (_wrap_flag ? _WRAP_ATOM(x) : _NO_WRAP_ATOM(x))
#define _NO_WRAP_ATOM(x) (_atom + (x) < -1 ? -1 : (_atom + (x) > n_atom + 1 ? n_atom + 1 : _atom + (x)))
#define _WRAP_ATOM(x) ((_atom + (x) + n_atom) % n_atom)
#define _NOW_ATOM(x) (_univ + 1 + _time * (n_atom + 2) + _ATOM(x))
#define _NEXT_ATOM(x) (_univ + 1 + _NEXT_TIME * (n_atom + 2) + _ATOM(x))
XX
#define _SYNC _ALL_ALL_ATOM { *_NEXT_ATOM(here) = *_NOW_ATOM(here); }
#define DO _SYNC _ALL_ATOM {
#define DONE } _time = _NEXT_TIME;
XX
#define SELECT DO if (0) {;
#define ALT(x) } else if (x) {
#define END_SELECT } DONE
XX
#define IF(x) DO if(x) {
#define END_IF } DONE
XX
#define TRACE(x, y) _trace = _wrap_flag; WRAP_OFF; DO (void)fprintf(stderr, x, get(y, here)); DONE _wrap_flag = _trace; (void)fprintf(stderr, "\n");
XX
#define get_left(x) ((_univ + _time * (n_atom + 2))->x)
#define set_left(x, y) ((_univ + _time * (n_atom + 2))->x = (y))
#define get_right(x) ((_univ + n_atom + _time * (n_atom + 2) + 1)->x)
#define set_right(x, y) ((_univ + n_atom + _time * (n_atom + 2) + 1)->x = (y))
XX
#define get(x, y) (_NOW_ATOM(y)->x)
#define set(x, y, z) (_NEXT_ATOM(y)->x = (z))
XX
#define GLOBAL(x) global = 0; _ALL_ATOM { global |= get(x, here); } global = !global;
XX
#define INPUT(x) _val = get_val(stdin); set_left(x, _val);
#define OUTPUT(x) (void)printf("%2d ", get_right(x));
XX
#define SHIFTR(x) DO set(x, here, get(x, left(here))); DONE
#define SHIFTL(x) DO set(x, here, get(x, right(here))); DONE
SHAR_EOF
chmod 0644 par.h ||
echo 'restore of par.h failed'
Wc_c="`wc -c < 'par.h'`"
test 2506 -eq "$Wc_c" ||
X	echo 'par.h: original size 2506, current size' "$Wc_c"
fi
# ============= rnd.c ==============
if test -f 'rnd.c' -a X"$1" != X"-c"; then
X	echo 'x - skipping rnd.c (File already exists)'
else
echo 'x - extracting rnd.c (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'rnd.c' &&
typedef struct { int dir, val; } Atom;
XX
Atom null_atom = { 0, 0 };
XX
#include "par.h"
XX
#define LEFT 0
#define RIGHT 1
XX
main(ac, av)
int ac;
char **av;
{
XX	extern void srand();
XX
XX	PAR_INIT(ac == 2 ? atoi(av[1]) : ATOM);
XX
XX	srand((unsigned)getpid());
XX
XX	/* input */
XX	WRAP_OFF;
XX	LOOP
XX		set_left(val, THIS);
XX		TRACE("%02d ", val);
XX		DO set(val, here, get(val, left(here))); DONE
XX	END_LOOP
XX
XX	/* mix */
XX	WRAP_ON;
XX	LOOP
XX		DO set(dir, here, rand() % 2); DONE
XX		SELECT
XX			ALT (get(dir, here) == LEFT && get(dir, left(here)) == RIGHT)
XX				set(val, here, get(val, left(here)));
XX			ALT (get(dir, here) == RIGHT && get(dir, right(here)) == LEFT)
XX				set(val, here, get(val, right(here)));
XX		END_SELECT
XX		TRACE("%02d ", val);
XX	END_LOOP
XX
XX	/* output */
XX	WRAP_OFF;
XX	LOOP
XX		DO set(val, here, get(val, left(here))); DONE
XX		TRACE("%02d ", val);
XX	END_LOOP
}
XX
SHAR_EOF
chmod 0664 rnd.c ||
echo 'restore of rnd.c failed'
Wc_c="`wc -c < 'rnd.c'`"
test 833 -eq "$Wc_c" ||
X	echo 'rnd.c: original size 833, current size' "$Wc_c"
fi
Xexit 0
SHAR_EOF
chmod 0664 1 ||
echo 'restore of 1 failed'
Wc_c="`wc -c < '1'`"
test 4667 -eq "$Wc_c" ||
	echo '1: original size 4667, current size' "$Wc_c"
fi
# ============= par.h ==============
if test -f 'par.h' -a X"$1" != X"-c"; then
	echo 'x - skipping par.h (File already exists)'
else
echo 'x - extracting par.h (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'par.h' &&
#include <stdio.h>
#include <malloc.h>
X
#define EVER 1
#define TIME 2
#define ATOM 16
X
#define MAX_VAL 99
#define MIN_VAL -1
#define DEF_VAL 0
X
Atom *_univ;
int global, _val;
int n_atom, _atom, _time, _wrap_flag;
int THIS, _trace;
X
#define FOR_EVER for(;;) {
#define END_FOR_EVER }
#define REPEAT(x) for (THIS = 0; THIS < x; ++THIS) {
#define END_REPEAT }
#define LOOP REPEAT(n_atom)
#define END_LOOP }
X
#define here 0
#define left(x) ((x) - 1)
#define right(x) ((x) + 1)
X
#define NULL_LEFT (*(_univ + _time * (n_atom + 2)) = null_atom)
#define NULL_RIGHT (*(_univ + n_atom + _time * (n_atom + 2) + 1) = null_atom)
#define WRAP_ON (_wrap_flag = 1)
#define WRAP_OFF (_wrap_flag = 0)
#define WRAP_TWIST (_wrap_flag = !_wrap_flag)
#define PAR_INIT(x) (n_atom = (x), _univ = (Atom *)malloc(sizeof(Atom) * (n_atom + 2) * TIME), WRAP_ON)
#define _NEXT_TIME ((_time + 1) % TIME)
#define _ALL_ALL_ATOM for (_atom = -1; _atom <= n_atom; ++_atom)
#define _ALL_ATOM for (_atom = !_wrap_flag ? -1 : 0; _atom < (!_wrap_flag ? n_atom + 1 : n_atom); ++_atom)
#define _ATOM(x) (_wrap_flag ? _WRAP_ATOM(x) : _NO_WRAP_ATOM(x))
#define _NO_WRAP_ATOM(x) (_atom + (x) < -1 ? -1 : (_atom + (x) > n_atom + 1 ? n_atom + 1 : _atom + (x)))
#define _WRAP_ATOM(x) ((_atom + (x) + n_atom) % n_atom)
#define _NOW_ATOM(x) (_univ + 1 + _time * (n_atom + 2) + _ATOM(x))
#define _NEXT_ATOM(x) (_univ + 1 + _NEXT_TIME * (n_atom + 2) + _ATOM(x))
X
#define _SYNC _ALL_ALL_ATOM { *_NEXT_ATOM(here) = *_NOW_ATOM(here); }
#define DO _SYNC _ALL_ATOM {
#define DONE } _time = _NEXT_TIME;
X
#define SELECT DO if (0) {;
#define ALT(x) } else if (x) {
#define END_SELECT } DONE
X
#define IF(x) DO if(x) {
#define END_IF } DONE
X
#define TRACE(x, y) _trace = _wrap_flag; WRAP_OFF; DO (void)fprintf(stderr, x, get(y, here)); DONE _wrap_flag = _trace; (void)fprintf(stderr, "\n");
X
#define get_left(x) ((_univ + _time * (n_atom + 2))->x)
#define set_left(x, y) ((_univ + _time * (n_atom + 2))->x = (y))
#define get_right(x) ((_univ + n_atom + _time * (n_atom + 2) + 1)->x)
#define set_right(x, y) ((_univ + n_atom + _time * (n_atom + 2) + 1)->x = (y))
X
#define get(x, y) (_NOW_ATOM(y)->x)
#define set(x, y, z) (_NEXT_ATOM(y)->x = (z))
X
#define GLOBAL(x) global = 0; _ALL_ATOM { global |= get(x, here); } global = !global;
X
#define INPUT(x) _val = get_val(stdin); set_left(x, _val);
#define OUTPUT(x) (void)printf("%2d ", get_right(x));
X
#define SHIFTR(x) DO set(x, here, get(x, left(here))); DONE
#define SHIFTL(x) DO set(x, here, get(x, right(here))); DONE
SHAR_EOF
chmod 0644 par.h ||
echo 'restore of par.h failed'
Wc_c="`wc -c < 'par.h'`"
test 2506 -eq "$Wc_c" ||
	echo 'par.h: original size 2506, current size' "$Wc_c"
fi
# ============= ref.txt ==============
if test -f 'ref.txt' -a X"$1" != X"-c"; then
	echo 'x - skipping ref.txt (File already exists)'
else
echo 'x - extracting ref.txt (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'ref.txt' &&

***** From: mips!mips.com!mark@decwrl.dec.com (Mark G. Johnson)

There's an article in a recent journal that might be of interest:
X
X   "Cellular Automata-Based Pseudorandom Number Generators for
X    Built-In Self-Test", IEEE Transactions on Computer-Aided Design,
X    Vol. 8 No. 8, August 1989, pp. 842-859.
X
The technique produces a parallel *word* of random bits once per clock
tick, unlike the Linear-Feedback Shift Register which produces a serial
bitstream, one bit per clock tick.  A major drawback of the LFSR is the
rather high auto- and cross-correlation in the stream of pseudorandoms;
the Cellular Automaton approach seems to improve this dramatically.
X
Knuth's tests for randomness are applied and the results are favorable.
-- 
X -- Mark Johnson	
X 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
X	...!decwrl!mips!mark	(408) 991-0208
X
SHAR_EOF
chmod 0664 ref.txt ||
echo 'restore of ref.txt failed'
Wc_c="`wc -c < 'ref.txt'`"
test 952 -eq "$Wc_c" ||
	echo 'ref.txt: original size 952, current size' "$Wc_c"
fi
# ============= rnd.c ==============
if test -f 'rnd.c' -a X"$1" != X"-c"; then
	echo 'x - skipping rnd.c (File already exists)'
else
echo 'x - extracting rnd.c (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'rnd.c' &&
typedef struct { int dir, val; } Atom;
X
Atom null_atom = { 0, 0 };
X
#include "par.h"
X
#define LEFT 0
#define RIGHT 1
X
main(ac, av)
int ac;
char **av;
{
X	extern void srand();
X
X	PAR_INIT(ac == 2 ? atoi(av[1]) : ATOM);
X
X	srand((unsigned)getpid());
X
X	/* input */
X	WRAP_OFF;
X	LOOP
X		set_left(val, THIS);
X		TRACE("%02d ", val);
X		DO set(val, here, get(val, left(here))); DONE
X	END_LOOP
X
X	/* mix */
X	WRAP_ON;
X	LOOP
X		DO set(dir, here, rand() % 2); DONE
X		SELECT
X			ALT (get(dir, here) == LEFT && get(dir, left(here)) == RIGHT)
X				set(val, here, get(val, left(here)));
X			ALT (get(dir, here) == RIGHT && get(dir, right(here)) == LEFT)
X				set(val, here, get(val, right(here)));
X		END_SELECT
X		TRACE("%02d ", val);
X	END_LOOP
X
X	/* output */
X	WRAP_OFF;
X	LOOP
X		DO set(val, here, get(val, left(here))); DONE
X		TRACE("%02d ", val);
X	END_LOOP
}
X
SHAR_EOF
chmod 0664 rnd.c ||
echo 'restore of rnd.c failed'
Wc_c="`wc -c < 'rnd.c'`"
test 833 -eq "$Wc_c" ||
	echo 'rnd.c: original size 833, current size' "$Wc_c"
fi
exit 0

***** From alanr@chopin.media.mit.edu Thu Nov 29 17:31:29 1990

Assuming a power of 2 number of processors.

One way to do it is by doing a bunch of pairwise data swaps.
Initially each processor contains one of the elements to be permuted.
In each step, consider pairs of processors which are 2^step apart.
Have the one of the pair generate a random bit. If the bit is 1, then
swap the data in the two processors via a send. After Log2(n) steps
you will have somewhat randomly shuffled.

I say somewhat because I realized after writing this that

a) Not all permutations are generated by this method. The permutation
1324567..... for instance 

b) It isn't obvious to me what the distribution of permutations
generated by this method.  For instance if you think about where a
single element can go, you realize that the larger the number of bits
which differ between its original and destination address, the less
likely it will land up there. (since each send corresponds to changing
one bit in the original address of the object)

Both these problems might be alleviated if you repeated the process a
number of times. First, you can reach more states. For the second
point, as you increase the number of steps, the number of address bits
which can change becomes higher and so less unlikely that an will be
some bit difference away from it's original position.

Anyways, hope this helps. If I think of more I'll send mail.  By the
way, how do you generate the permutation on the serial machine?

  -alan




***** From alk@Think.COM Thu Nov 29 20:16:03 1990


My solutions are developed with CM-2 in mind.  Although the large N
send considerations mentioned below are valid, they should not be of
practical importance for reasonable N.

---------------
second attempt:
  
generate random number with range >> N.  sort self-addresses with this
random key.  Probability of collisions in random number approaches 0 as 
range/N approaches inifinity.  If you want a really paranoid algorithm,
add here a step to sort subsequences with identical neighboring keys by
a new random number of large range.  Very few iterations required to reduce
longest such subsequence to length 1. (Depends on range/N, but in
practical terms, I can't imagine needing more than 2 with N < 2^20, range
0 to 2^32.)

With fixed iteration count, this is of the same order as a sort.
Disregarding the fact that on a CM-2, for large enough N, the send
time is linear in N, that is.  With a variable iteration count,
my mind boggles at the statistics.  I'm not _that_ interested.

--------------
first attempt:

0) set MIN to 0, MAX to N, set pvar D to FALSE, pvar E to FALSE
1) generate self-addresses to pvar B
2) generate random values in {MIN...MAX} to pvar A
3) send-with-max to sites in pvar C which have a FALSE value in E,
   using processor addresses in A, the values those sites of B with a FALSE
   value in D, marking all sites which successfully sent with a TRUE value
   in D and all sites which recieved values with a TRUE value in E
4) set MIN to global minimum of B in all sites with E == FALSE
5) set MAX to global maximum of B likewise
4) narrow persistent context to processors with FALSE value in D
5) if any processors remain in context, goto 2

note on step 3:
  this is accomplished by getting to T1 from E[A], 
  setting T1 <- T1 | D, temporarily narrowing context to those sites
  in context with T1 == FALSE, using a notifying send with maximum
  which sets E <- TRUE in recieving sites, then with context temporarily 
  set to sites with D == FALSE, sending to D[A] from E.

I probably introduced an error at some point, but the idea should be clear:
send to random (unused) locations the self-address of all processors which
have not seen their self-addresses used.  Were the range of target sites
not narrowed to MIN...MAX it would become harder with each iteration to
hit a site which had not yet recieved a value, but since it narrowed, at each
iteration more and more sites are trying to send values to fewer and fewer
candidate sites, accelerating convergence.

I'd have to think to calculate average case O(N), sorry, but it is
certainly sublinear, unless I'm sorely mistaken (and again disregarding
communication time), and hence preferrable to serial calculation for large N.

--alk


***** From solworth@uicbert.eecs.uic.edu Fri Nov 30 13:24:35 1990

	Why not just generate a pair <random-number, processor number>,
sort by random number, and extract processor number.

Jon

***** From spencert@cs.rpi.edu Fri Nov 30 13:42:29 1990

Bill,
	How about the following :
Each processor generates a random number in the range 1..n^3 (or bigger).

Now sort (random #, proc #) pairs by random #, and compute
	(random #, proc #, rank).

Finally sort (proc #, rank) pairs by proc#.  The rank is each processor's
value of the permutation.  

Note that you need a range >> n^2 to avoid equal random #'s

I hope that this helps.

				-Tom Spencer