walker@cs.uiuc.edu (William Walker) (12/13/90)
Last month I asked for information about generating permutations in parallel. Below is a copy of some responses I have received so far. I haven't edited them, so there is some repetition. I now have many good ideas for what to do. I would like to thank all the respondents for their timely and excellent advice! thanks, Bill Walker walker@m.cs.uiuc.edu (217) 344-1312 Department of Computer Science, University of Illinois 1304 West Springfield Avenue, Urbana, Illinois 61801-2987 ------------------------------------------------------------------------------- ***** From bright@server.cs.jhu.edu Mon Nov 26 09:51:37 1990 Assuming that you are dealing with a hypercube, something like the following might work. Let data[i] and random[i] be stored in processor i for all i. 1- data[i] = i 2- random[i] = random number between 1 and N 3- for i = 0 to log(N)-1 do 3a- Let X and Y be two processors such that X and Y differ only in the i'th bit. Assume X > Y. 3b- if the i'th bit of random[Y] = 1, then swap data[X] and data[Y] 4- random[i] = random number between 1 and N 5- for i = log(N)-1 downto 0 do 5a- Let X and Y be two processors such that X and Y differ only in the i'th bit. Assume X > Y. 5b- if the i'th bit of random[Y] = 1, then swap data[X] and data[Y] I'm hoping that this procedure would have the affect of performing a random routing of the elements in data using the network of Abraham Waksman (A Permutation Network, Journal of the ACM, Vol 15, January 1968). I would imagine that this would take the same time as performing two bitonic merges. Note that this is completely off the top of my head so maybe it is completely wrong. Also I'm a little confused by what you mean by saying that we can assume a general router but no shared memory. What is the difference? ***** From abrodnik@watdragon.waterloo.edu Mon Nov 26 10:07:19 1990 I'd appreciate to receive the summary. On the other hand I tried something similar, got a O(n log(n)) serial algorithm, but I don't know how to parallelize it completley (part of it). There was an article in last FOCS by Fich, Munro, and Poblete about permuting as well. But it does not seem to be parallelized simply. I am sure that permuting can be parallelized (otherwise AKS sorting algorithm wouldn't work in O(log(N)) time), but perhaps it has such a huge constant as AKS. Regards Andrej ***** From hammonds@riacs.edu Mon Nov 26 10:47:48 1990 On the Connection Machine can do the following. Let each processors (or virtual processor) choose a random number (real or integer). Then, sort the values. In *lisp this operation is called rank!!. This gives you a "send address" and permutes the values in parallel. What machine are you working on and what language? Perhaps I can be more specific... Steve Hammond hammonds@riacs.edu ***** From bromley@Think.COM Mon Nov 26 11:16:01 1990 A brief sketch follows. For convenience assume N is a power of 2. It requires a source of pseudo random bits. Each processor starts out with its own index. loop for i from 0 upto log2(N) Each processor generates a random bit and exchanges that bit and its current value with processor self XOR 2^i. If the bit from the processor with higher address is a 1 then each processor takes its partners value otherwise it keeps its own value. ***** From aka@cs.brown.edu Mon Nov 26 12:02:40 1990 One of the ways I use is the following (suited for hypercubes but can be simulated by a general router): Let processor i have the value i. Loop over the dimension j = 1 through logN For each pair of nodes whose addresses differ only in the $j$th bit, Swap their values with probability .5 endloop It is a simple logN parallel step loop. Processor i is mapped to a processor that differs with probability .5 in each dimension. Of course the permutations of different processors are not independent in the strong sense, but is a good heuristic. Good luck. ajit agrawal ***** From jacquemin-michel@CS.YALE.EDU Mon Nov 26 14:52:44 1990 Here is a communication-intensive (but simple) solutio, that I used on the Connection Machine: In each processor i, generate locally a random number r(i) (the range should be big compared to N). Globally sort these number: the permutation to use is the "rank" of r(i) in the sorted sequence. I don't claim this is perfect, but I imagine it should be satisfying, as long as the range of the initial random numbers you generate is large enough compared to N (to avoid biasing the distribution too much, when two processors generate the same initial random number). Michel Jacquemin ***** From bromley@Think.COM Mon Nov 26 16:58:02 1990 #include <cm/paris.h> void make_random_permutation (field) CM_field_id_t field; { CM_vp_set_id_t v = CM_field_vp_set (field); CM_geometry_id_t g = CM_vp_set_geometry (v); int send_address_length = CM_geometry_send_address_length (g); int bits = 1 + send_address_length; CM_field_id_t f = CM_allocate_stack_field (bits); CM_field_id_t rand_bit = CM_add_offset_to_field_id (f, bits-1); CM_field_id_t partner_f = CM_allocate_stack_field (bits); CM_field_id_t partner_rand_bit = CM_add_offset_to_field_id (partner_f, bits-1); CM_field_id_t self = CM_allocate_stack_field (send_address_length); CM_field_id_t partner = CM_allocate_stack_field (send_address_length); CM_field_id_t do_exchange = CM_allocate_stack_field (1); int i; CM_set_context(); CM_my_send_address_1L (f); CM_my_send_address_1L (self); for (i = 0; i < send_address_length; i++) { CM_u_random_1L (rand_bit, 1, 2); CM_logxor_constant_3_1L (partner, self, 1 << i, send_address_length); CM_send_1L (partner_f, partner, f, bits, CM_no_field); CM_logand_context (CM_add_offset_to_field_id (self, i)); CM_u_move_1L (do_exchange, rand_bit, 1); CM_invert_context(); CM_u_move_1L (do_exchange, partner_rand_bit, 1); CM_set_context(); CM_logand_context (do_exchange); CM_u_move_1L (f, partner_f, send_address_length); CM_set_context(); } CM_u_move_always_1L (field, f, send_address_length); CM_deallocate_stack_through (f); } ***** From gkj@doc.imperial.ac.uk Tue Nov 27 03:37:32 1990 There is a paper by S. Tomboulian, "Indirect Addressing and Load Balancing for Faster Solutions to Mandelbrot Set on Simd Architectures." Application note AN101, MasPar Computer Corporation, 749 North Mary Avenue, Sunnyvale, CA 94086, June, 1989. Although the paper isn't specifically about this problem, at the end of it she describes how different ways of allocating pixels to processing elements (random permutation being tried among others) works out in efficiency terms. I think the main conclusion was that a psuedo-random scheme involving a series of shifts across the PE plane was best in terms of time required to generate the "permutation" and resulting "randomness". Cheers! Guido... - Never eat yellow snow + Translated humor isn't + Lazyness is for weaklings - ||||||| Guido K. Jouret | ` ' | gkj@doc.ic.ac.uk (| o o |) Functional Programming Section | ^ | Imperial College of Science and Technology \ "~" / London SW7 2BZ \___/ U.K. ***** From mrg@Princeton.EDU Tue Nov 27 10:06:19 1990 Bill, Here's a simple algorithm. 1. Each processor creates a record with two fields: key: A random integer between 1 and 1*N (for M >> N). src: The location of the processor. 2. The SIMD sorts these records based on the "key" field. The probability of two records having the same key is N(N-1)/(2*M); so for M sufficiently large, I'll assume no collisions occur (more on this later). Sorting can be done in O((log N)^2) time (e.g. Batcher's bitonic sort). At the end of the sorting step, processor k has the record with the k'th smallest key. The "src" field of this record is the integer (in 1..N) assigned to this processor. So what about conflicts in the sorting step. There are several obvious choices: 1. Try again. This happens with low enough probability that the expected run time of the algorithm remains good. 2. Break ties arbitrarily. This results in a slightly non-uniform choice of permutations, but for M sufficiently large, no one will ever notice. 3. Break ties with your centralized algorithm. Again, since ties are rare, and large-way ties even rarer, this is practical. I'd be interested in a summary of approaches suggested. You can send me e-mail, or post to the net. --Mark ***** From leo@wk48.nas.nasa.gov Tue Nov 27 12:37:54 1990 This may be faster, you'll have to try it out I imagine. Begin by assigning each node its self-address as the element of the permutation. The idea now is to randomize this permutation, which now reads 1,2,...,N. To do this, let each node generate a random number in the range [1,N], we'll use this to tag the permutation value. So now we have a structure that looks something like this: Node Permutation Tag ---- ----------- --- 1 1 7 2 2 3 3 3 4 4 4 3 5 5 1 6 6 5 7 7 1 8 8 6 Now sort the sequence such that the tags are in non-descending order, and move the permutation value to the sorted position. So our example above becomes Node Permutation Tag ---- ----------- --- 1 5 1 2 7 1 3 2 3 4 4 3 5 3 4 6 6 5 7 8 6 8 1 7 Voila! You have your random permutation. For large enough N this may be faster than doing it on the front end. Integer sorting can be easily implemented with complexity O(N log N) in parallel, (sequentially, you can do a bucket sort in time O(N), but in parallel this is somewhat harder to achieve). If you do try this out, I'd be interested in learning the results (i.e. is it any faster?) I'm curious how large is N, and what machine are you working on? Good luck Leo Dagum (415) 604-4779 ***** From kornkven@uxh.cso.uiuc.edu Tue Nov 27 14:50:22 1990 Hi Bill, In CM Fortran on the Connection Machine, there is a function that ranks the elements of a CM array. So it seems that a possible approach (on this machine anyway) would be to generate an array of N random numbers, then to rank those random numbers, the ranks being your permutation of 1..N. Ed Kornkven kornkven@cs.uiuc.edu ***** From leban@par3.cs.umass.edu Tue Nov 27 15:24:36 1990 Here's my solution to the parallel permutation problem. I'm curious to know if there's anything better. I'd be interested in seeing a summary of responses. --- Bruce Leban@cs.umass.edu @amherst.mass.usa.earth ----------------------------------------------------------------------------- The simplest sequential algorithm is for i = 1 to n swap a[i] with a[random] My parallel algorithm is a variation of this. ----------------------------------------------------------------------------- /* i, j, k, a, b, c, swapped and ID are local to the PEs */ /* this randomly permutes the values in a */ a = ID; swapped = FALSE; while (!global-and(swapped)) { j = -1; if (random(TRUE)) /* potential sender */ { if (!swapped) { k = random(NPROCS); if (k != ID) put ID in i at k with COLLISIONS=c; /* c is the count of messages received */ else if (c > 0 && random(1)) /* note 1 */ swapped = TRUE; } } else if (c == 1) /* receiver (one incoming put) */ { put ID in j at i with NOCOLLISIONS; put a in b at i with NOCOLLISIONS; } if (j >= 0) /* confirmed sender */ { put a in a at j with NOCOLLISIONS; a = b; swapped = TRUE; /* note 2 */ } } At each stage in the algorithm, each processor is either a sender or a receiver. The senders each select a random processor to swap with. If two select the same receiver, neither one swaps. If a sender selects another sender, it's ignored. Once a swap is completed, the sender becomes a permanent receiver. On the first step, of the non-colliding pairs, it's expected that half of them are sender-receiver pairs. The odds of having a large number of the puts collide is very small (it's analogous to hashing k items into 2k buckets). Thus almost half of the potential senders perform a swap on the first step. Thus it's expected to take O(log n) steps to terminate. ----- Note 1: since each sender selects a receiver with probability 1/2, we need to make sure that if it selects itself, it has the same probability of treating itself as a receiver. Otherwise, there will be a bias towards self swaps. We also don't do the self-swap if there's an incoming message. This avoids a similar bias caused by no collisions on self-swaps. Note 2: the variable swapped is set in only one processor of the two. Otherwise it would be producing a random set of 2-cycles, rather than a random permutation. ***** From Marco.Zagha@ENQUIRER.SCANDAL.CS.CMU.EDU Tue Nov 27 22:59:47 1990 A simple approach is to generate random numbers in parallel and rank them. Rank is a sort procedure that returns n to the nth highest key. You can build a rank out of a sort by appending the starting index of each key to the low order bits of the key. Something like rank is often provided as a library routine, e.g. on the Connection Machine and on the Cray. Be careful though if you want a truly random permutation. The problem with the random+rank algorithm is that if you start out with any duplicate keys, the result is not completely random. (Not to mention the quality of the parallel random number generation). The chance of getting duplicate keys is higher than you might think -- for example, for large permutations, 32-bit keys might not be enough. I seem to recall seeing formulas for this in statistics books along with problems like "How many people do you need until you expect that more than one will have the same birthday". Another approach would be to look up random mating algorithms and do a swap based approach. I'm sure that theoreticians have beaten this problem to death. Please post your summary on comp.parallel. == Marco Zagha Internet: marcoz@cs.cmu.edu Uucp: ...!seismo!cs.cmu.edu!marcoz Bitnet: marcoz%cs.cmu.edu@cmuccvma CSnet: marcoz%cs.cmu.edu@relay.cs.net ***** From ney@camelback.Princeton.EDU Wed Nov 28 09:19:38 1990 I'm interested in problems like this, but my background is weak. Can you suggest some references summarizing known standard algorithms for parallel architectures? (I have a march 1988 tech report #UCB/CSD 88/408 "A Survey of Parallel Algorithms for Shared Memory Machines" by Karp and Ramachandran.) In exchange I'll give you my (most likely naive) ideas for your problem. First, if there is a quick way to assign an arbitrary permutation (for instance, if the sites are numbered and know their numbers, a reasonable assumption since I would guess the sites must be named) then one should be able to quickly shuffle this permutation by performing random swaps in parallel. For instance we can do define shuffle(low, high) if (low >= high) return else mid = (low + high)/2 in parallel shuffle(low, mid), shuffle(mid, high) in parallel for each i=0...mid-low-1 if (random() mod 2 == 1) swapNumbersOf(low + i, mid + i) Then shuffle(1,N) (where N is a power of two) randomly permutes the numbers of sites 1..N in log(n) rounds, with each round requiring 2N messages with each site sending and receiving one message. Alternatively, if there is no preexisting permutation, start by having each site have an random real number in the interval [0,1), then use a fast parallel sorting algorithm to compute the rank of the number at each site. Of course one can't generate random reals, instead generate the bits of the reals lazily, that is start out knowing 0 bits of each number, and whenever two numbers are compared, if the known bits of the numbers are not sufficient to decide their order, randomly extend each one until their order is determined. Instead one could start by generating numbers in the range (1..N^3) say, in which case the probability that any two are equal vanishes. This also has the advantage of allowing a radix sort. I don't know how fast parallel sorting algorithms really work, though, and it seems like there should be an easier way. Neal Young (ney@cs.princeton.edu) ***** From prins@cs.unc.edu Wed Nov 28 15:40:57 1990 I had to do this recently on a MasPar MP-1. I just followed Knuth's algorithm for generating a random permutation of 0..N-1: start with P[i] = i (for i in 0..N-1) and independent random values R[0]..R[N-1] where 0 <= R[i] < N. Now swap each P[i] with P[R[i]], in some serializable order. To do this quickly place P[i] and R[i] on processor i, and let the router determine the serialization (the details of this depend heavily on the construction of the router, of course, and will be different on a CM for example). The MPL code below generates a random permutation of 0..nproc-1 on an MP-1 in around 2.5ms. ..jfp /* generate a random permutation using router */ { register plural short T,P,R; register plural unsigned char C; R = (p_random() >> 2) & (nproc-1); P = iproc; C = 1; while (C) { if (connected(R)) { T = router[R].P; router[R].P = P; P = T; C = 0; } } } --\-- Jan Prins (prins@cs.unc.edu) / Computer Science Dept. --\-- UNC Chapel Hill ***** From thy@adesign.uucp Thu Nov 29 04:34:37 1990 I am not sure this can help you, but i can give it as is anyway. #!/bin/sh # This is a shell archive (produced by shar 3.49) # To extract the files from this archive, save it to a file, remove # everything above the "!/bin/sh" line above, and type "sh file_name". # # made 11/29/1990 09:59 UTC by thy@dsgn42 # Source directory /user1/thy/spl/tmp # # existing files will NOT be overwritten unless -c is specified # # This shar contains: # length mode name # ------ ---------- ------------------------------------------ # 4667 -rw-rw-r-- 1 # 2506 -rw-r--r-- par.h # 952 -rw-rw-r-- ref.txt # 833 -rw-rw-r-- rnd.c # # ============= 1 ============== if test -f '1' -a X"$1" != X"-c"; then echo 'x - skipping 1 (File already exists)' else echo 'x - extracting 1 (Text)' sed 's/^X//' << 'SHAR_EOF' > '1' && #!/bin/sh # This is a shell archive (produced by shar 3.49) # To extract the files from this archive, save it to a file, remove # everything above the "!/bin/sh" line above, and type "sh file_name". # # made 11/29/1990 09:54 UTC by thy@dsgn42 # Source directory /user1/thy/spl/tmp # # existing files will NOT be overwritten unless -c is specified # # This shar contains: # length mode name # ------ ---------- ------------------------------------------ # 2506 -rw-r--r-- par.h # 833 -rw-rw-r-- rnd.c # # ============= par.h ============== if test -f 'par.h' -a X"$1" != X"-c"; then X echo 'x - skipping par.h (File already exists)' else echo 'x - extracting par.h (Text)' sed 's/^X//' << 'SHAR_EOF' > 'par.h' && #include <stdio.h> #include <malloc.h> XX #define EVER 1 #define TIME 2 #define ATOM 16 XX #define MAX_VAL 99 #define MIN_VAL -1 #define DEF_VAL 0 XX Atom *_univ; int global, _val; int n_atom, _atom, _time, _wrap_flag; int THIS, _trace; XX #define FOR_EVER for(;;) { #define END_FOR_EVER } #define REPEAT(x) for (THIS = 0; THIS < x; ++THIS) { #define END_REPEAT } #define LOOP REPEAT(n_atom) #define END_LOOP } XX #define here 0 #define left(x) ((x) - 1) #define right(x) ((x) + 1) XX #define NULL_LEFT (*(_univ + _time * (n_atom + 2)) = null_atom) #define NULL_RIGHT (*(_univ + n_atom + _time * (n_atom + 2) + 1) = null_atom) #define WRAP_ON (_wrap_flag = 1) #define WRAP_OFF (_wrap_flag = 0) #define WRAP_TWIST (_wrap_flag = !_wrap_flag) #define PAR_INIT(x) (n_atom = (x), _univ = (Atom *)malloc(sizeof(Atom) * (n_atom + 2) * TIME), WRAP_ON) #define _NEXT_TIME ((_time + 1) % TIME) #define _ALL_ALL_ATOM for (_atom = -1; _atom <= n_atom; ++_atom) #define _ALL_ATOM for (_atom = !_wrap_flag ? -1 : 0; _atom < (!_wrap_flag ? n_atom + 1 : n_atom); ++_atom) #define _ATOM(x) (_wrap_flag ? _WRAP_ATOM(x) : _NO_WRAP_ATOM(x)) #define _NO_WRAP_ATOM(x) (_atom + (x) < -1 ? -1 : (_atom + (x) > n_atom + 1 ? n_atom + 1 : _atom + (x))) #define _WRAP_ATOM(x) ((_atom + (x) + n_atom) % n_atom) #define _NOW_ATOM(x) (_univ + 1 + _time * (n_atom + 2) + _ATOM(x)) #define _NEXT_ATOM(x) (_univ + 1 + _NEXT_TIME * (n_atom + 2) + _ATOM(x)) XX #define _SYNC _ALL_ALL_ATOM { *_NEXT_ATOM(here) = *_NOW_ATOM(here); } #define DO _SYNC _ALL_ATOM { #define DONE } _time = _NEXT_TIME; XX #define SELECT DO if (0) {; #define ALT(x) } else if (x) { #define END_SELECT } DONE XX #define IF(x) DO if(x) { #define END_IF } DONE XX #define TRACE(x, y) _trace = _wrap_flag; WRAP_OFF; DO (void)fprintf(stderr, x, get(y, here)); DONE _wrap_flag = _trace; (void)fprintf(stderr, "\n"); XX #define get_left(x) ((_univ + _time * (n_atom + 2))->x) #define set_left(x, y) ((_univ + _time * (n_atom + 2))->x = (y)) #define get_right(x) ((_univ + n_atom + _time * (n_atom + 2) + 1)->x) #define set_right(x, y) ((_univ + n_atom + _time * (n_atom + 2) + 1)->x = (y)) XX #define get(x, y) (_NOW_ATOM(y)->x) #define set(x, y, z) (_NEXT_ATOM(y)->x = (z)) XX #define GLOBAL(x) global = 0; _ALL_ATOM { global |= get(x, here); } global = !global; XX #define INPUT(x) _val = get_val(stdin); set_left(x, _val); #define OUTPUT(x) (void)printf("%2d ", get_right(x)); XX #define SHIFTR(x) DO set(x, here, get(x, left(here))); DONE #define SHIFTL(x) DO set(x, here, get(x, right(here))); DONE SHAR_EOF chmod 0644 par.h || echo 'restore of par.h failed' Wc_c="`wc -c < 'par.h'`" test 2506 -eq "$Wc_c" || X echo 'par.h: original size 2506, current size' "$Wc_c" fi # ============= rnd.c ============== if test -f 'rnd.c' -a X"$1" != X"-c"; then X echo 'x - skipping rnd.c (File already exists)' else echo 'x - extracting rnd.c (Text)' sed 's/^X//' << 'SHAR_EOF' > 'rnd.c' && typedef struct { int dir, val; } Atom; XX Atom null_atom = { 0, 0 }; XX #include "par.h" XX #define LEFT 0 #define RIGHT 1 XX main(ac, av) int ac; char **av; { XX extern void srand(); XX XX PAR_INIT(ac == 2 ? atoi(av[1]) : ATOM); XX XX srand((unsigned)getpid()); XX XX /* input */ XX WRAP_OFF; XX LOOP XX set_left(val, THIS); XX TRACE("%02d ", val); XX DO set(val, here, get(val, left(here))); DONE XX END_LOOP XX XX /* mix */ XX WRAP_ON; XX LOOP XX DO set(dir, here, rand() % 2); DONE XX SELECT XX ALT (get(dir, here) == LEFT && get(dir, left(here)) == RIGHT) XX set(val, here, get(val, left(here))); XX ALT (get(dir, here) == RIGHT && get(dir, right(here)) == LEFT) XX set(val, here, get(val, right(here))); XX END_SELECT XX TRACE("%02d ", val); XX END_LOOP XX XX /* output */ XX WRAP_OFF; XX LOOP XX DO set(val, here, get(val, left(here))); DONE XX TRACE("%02d ", val); XX END_LOOP } XX SHAR_EOF chmod 0664 rnd.c || echo 'restore of rnd.c failed' Wc_c="`wc -c < 'rnd.c'`" test 833 -eq "$Wc_c" || X echo 'rnd.c: original size 833, current size' "$Wc_c" fi Xexit 0 SHAR_EOF chmod 0664 1 || echo 'restore of 1 failed' Wc_c="`wc -c < '1'`" test 4667 -eq "$Wc_c" || echo '1: original size 4667, current size' "$Wc_c" fi # ============= par.h ============== if test -f 'par.h' -a X"$1" != X"-c"; then echo 'x - skipping par.h (File already exists)' else echo 'x - extracting par.h (Text)' sed 's/^X//' << 'SHAR_EOF' > 'par.h' && #include <stdio.h> #include <malloc.h> X #define EVER 1 #define TIME 2 #define ATOM 16 X #define MAX_VAL 99 #define MIN_VAL -1 #define DEF_VAL 0 X Atom *_univ; int global, _val; int n_atom, _atom, _time, _wrap_flag; int THIS, _trace; X #define FOR_EVER for(;;) { #define END_FOR_EVER } #define REPEAT(x) for (THIS = 0; THIS < x; ++THIS) { #define END_REPEAT } #define LOOP REPEAT(n_atom) #define END_LOOP } X #define here 0 #define left(x) ((x) - 1) #define right(x) ((x) + 1) X #define NULL_LEFT (*(_univ + _time * (n_atom + 2)) = null_atom) #define NULL_RIGHT (*(_univ + n_atom + _time * (n_atom + 2) + 1) = null_atom) #define WRAP_ON (_wrap_flag = 1) #define WRAP_OFF (_wrap_flag = 0) #define WRAP_TWIST (_wrap_flag = !_wrap_flag) #define PAR_INIT(x) (n_atom = (x), _univ = (Atom *)malloc(sizeof(Atom) * (n_atom + 2) * TIME), WRAP_ON) #define _NEXT_TIME ((_time + 1) % TIME) #define _ALL_ALL_ATOM for (_atom = -1; _atom <= n_atom; ++_atom) #define _ALL_ATOM for (_atom = !_wrap_flag ? -1 : 0; _atom < (!_wrap_flag ? n_atom + 1 : n_atom); ++_atom) #define _ATOM(x) (_wrap_flag ? _WRAP_ATOM(x) : _NO_WRAP_ATOM(x)) #define _NO_WRAP_ATOM(x) (_atom + (x) < -1 ? -1 : (_atom + (x) > n_atom + 1 ? n_atom + 1 : _atom + (x))) #define _WRAP_ATOM(x) ((_atom + (x) + n_atom) % n_atom) #define _NOW_ATOM(x) (_univ + 1 + _time * (n_atom + 2) + _ATOM(x)) #define _NEXT_ATOM(x) (_univ + 1 + _NEXT_TIME * (n_atom + 2) + _ATOM(x)) X #define _SYNC _ALL_ALL_ATOM { *_NEXT_ATOM(here) = *_NOW_ATOM(here); } #define DO _SYNC _ALL_ATOM { #define DONE } _time = _NEXT_TIME; X #define SELECT DO if (0) {; #define ALT(x) } else if (x) { #define END_SELECT } DONE X #define IF(x) DO if(x) { #define END_IF } DONE X #define TRACE(x, y) _trace = _wrap_flag; WRAP_OFF; DO (void)fprintf(stderr, x, get(y, here)); DONE _wrap_flag = _trace; (void)fprintf(stderr, "\n"); X #define get_left(x) ((_univ + _time * (n_atom + 2))->x) #define set_left(x, y) ((_univ + _time * (n_atom + 2))->x = (y)) #define get_right(x) ((_univ + n_atom + _time * (n_atom + 2) + 1)->x) #define set_right(x, y) ((_univ + n_atom + _time * (n_atom + 2) + 1)->x = (y)) X #define get(x, y) (_NOW_ATOM(y)->x) #define set(x, y, z) (_NEXT_ATOM(y)->x = (z)) X #define GLOBAL(x) global = 0; _ALL_ATOM { global |= get(x, here); } global = !global; X #define INPUT(x) _val = get_val(stdin); set_left(x, _val); #define OUTPUT(x) (void)printf("%2d ", get_right(x)); X #define SHIFTR(x) DO set(x, here, get(x, left(here))); DONE #define SHIFTL(x) DO set(x, here, get(x, right(here))); DONE SHAR_EOF chmod 0644 par.h || echo 'restore of par.h failed' Wc_c="`wc -c < 'par.h'`" test 2506 -eq "$Wc_c" || echo 'par.h: original size 2506, current size' "$Wc_c" fi # ============= ref.txt ============== if test -f 'ref.txt' -a X"$1" != X"-c"; then echo 'x - skipping ref.txt (File already exists)' else echo 'x - extracting ref.txt (Text)' sed 's/^X//' << 'SHAR_EOF' > 'ref.txt' && ***** From: mips!mips.com!mark@decwrl.dec.com (Mark G. Johnson) There's an article in a recent journal that might be of interest: X X "Cellular Automata-Based Pseudorandom Number Generators for X Built-In Self-Test", IEEE Transactions on Computer-Aided Design, X Vol. 8 No. 8, August 1989, pp. 842-859. X The technique produces a parallel *word* of random bits once per clock tick, unlike the Linear-Feedback Shift Register which produces a serial bitstream, one bit per clock tick. A major drawback of the LFSR is the rather high auto- and cross-correlation in the stream of pseudorandoms; the Cellular Automaton approach seems to improve this dramatically. X Knuth's tests for randomness are applied and the results are favorable. -- X -- Mark Johnson X MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086 X ...!decwrl!mips!mark (408) 991-0208 X SHAR_EOF chmod 0664 ref.txt || echo 'restore of ref.txt failed' Wc_c="`wc -c < 'ref.txt'`" test 952 -eq "$Wc_c" || echo 'ref.txt: original size 952, current size' "$Wc_c" fi # ============= rnd.c ============== if test -f 'rnd.c' -a X"$1" != X"-c"; then echo 'x - skipping rnd.c (File already exists)' else echo 'x - extracting rnd.c (Text)' sed 's/^X//' << 'SHAR_EOF' > 'rnd.c' && typedef struct { int dir, val; } Atom; X Atom null_atom = { 0, 0 }; X #include "par.h" X #define LEFT 0 #define RIGHT 1 X main(ac, av) int ac; char **av; { X extern void srand(); X X PAR_INIT(ac == 2 ? atoi(av[1]) : ATOM); X X srand((unsigned)getpid()); X X /* input */ X WRAP_OFF; X LOOP X set_left(val, THIS); X TRACE("%02d ", val); X DO set(val, here, get(val, left(here))); DONE X END_LOOP X X /* mix */ X WRAP_ON; X LOOP X DO set(dir, here, rand() % 2); DONE X SELECT X ALT (get(dir, here) == LEFT && get(dir, left(here)) == RIGHT) X set(val, here, get(val, left(here))); X ALT (get(dir, here) == RIGHT && get(dir, right(here)) == LEFT) X set(val, here, get(val, right(here))); X END_SELECT X TRACE("%02d ", val); X END_LOOP X X /* output */ X WRAP_OFF; X LOOP X DO set(val, here, get(val, left(here))); DONE X TRACE("%02d ", val); X END_LOOP } X SHAR_EOF chmod 0664 rnd.c || echo 'restore of rnd.c failed' Wc_c="`wc -c < 'rnd.c'`" test 833 -eq "$Wc_c" || echo 'rnd.c: original size 833, current size' "$Wc_c" fi exit 0 ***** From alanr@chopin.media.mit.edu Thu Nov 29 17:31:29 1990 Assuming a power of 2 number of processors. One way to do it is by doing a bunch of pairwise data swaps. Initially each processor contains one of the elements to be permuted. In each step, consider pairs of processors which are 2^step apart. Have the one of the pair generate a random bit. If the bit is 1, then swap the data in the two processors via a send. After Log2(n) steps you will have somewhat randomly shuffled. I say somewhat because I realized after writing this that a) Not all permutations are generated by this method. The permutation 1324567..... for instance b) It isn't obvious to me what the distribution of permutations generated by this method. For instance if you think about where a single element can go, you realize that the larger the number of bits which differ between its original and destination address, the less likely it will land up there. (since each send corresponds to changing one bit in the original address of the object) Both these problems might be alleviated if you repeated the process a number of times. First, you can reach more states. For the second point, as you increase the number of steps, the number of address bits which can change becomes higher and so less unlikely that an will be some bit difference away from it's original position. Anyways, hope this helps. If I think of more I'll send mail. By the way, how do you generate the permutation on the serial machine? -alan ***** From alk@Think.COM Thu Nov 29 20:16:03 1990 My solutions are developed with CM-2 in mind. Although the large N send considerations mentioned below are valid, they should not be of practical importance for reasonable N. --------------- second attempt: generate random number with range >> N. sort self-addresses with this random key. Probability of collisions in random number approaches 0 as range/N approaches inifinity. If you want a really paranoid algorithm, add here a step to sort subsequences with identical neighboring keys by a new random number of large range. Very few iterations required to reduce longest such subsequence to length 1. (Depends on range/N, but in practical terms, I can't imagine needing more than 2 with N < 2^20, range 0 to 2^32.) With fixed iteration count, this is of the same order as a sort. Disregarding the fact that on a CM-2, for large enough N, the send time is linear in N, that is. With a variable iteration count, my mind boggles at the statistics. I'm not _that_ interested. -------------- first attempt: 0) set MIN to 0, MAX to N, set pvar D to FALSE, pvar E to FALSE 1) generate self-addresses to pvar B 2) generate random values in {MIN...MAX} to pvar A 3) send-with-max to sites in pvar C which have a FALSE value in E, using processor addresses in A, the values those sites of B with a FALSE value in D, marking all sites which successfully sent with a TRUE value in D and all sites which recieved values with a TRUE value in E 4) set MIN to global minimum of B in all sites with E == FALSE 5) set MAX to global maximum of B likewise 4) narrow persistent context to processors with FALSE value in D 5) if any processors remain in context, goto 2 note on step 3: this is accomplished by getting to T1 from E[A], setting T1 <- T1 | D, temporarily narrowing context to those sites in context with T1 == FALSE, using a notifying send with maximum which sets E <- TRUE in recieving sites, then with context temporarily set to sites with D == FALSE, sending to D[A] from E. I probably introduced an error at some point, but the idea should be clear: send to random (unused) locations the self-address of all processors which have not seen their self-addresses used. Were the range of target sites not narrowed to MIN...MAX it would become harder with each iteration to hit a site which had not yet recieved a value, but since it narrowed, at each iteration more and more sites are trying to send values to fewer and fewer candidate sites, accelerating convergence. I'd have to think to calculate average case O(N), sorry, but it is certainly sublinear, unless I'm sorely mistaken (and again disregarding communication time), and hence preferrable to serial calculation for large N. --alk ***** From solworth@uicbert.eecs.uic.edu Fri Nov 30 13:24:35 1990 Why not just generate a pair <random-number, processor number>, sort by random number, and extract processor number. Jon ***** From spencert@cs.rpi.edu Fri Nov 30 13:42:29 1990 Bill, How about the following : Each processor generates a random number in the range 1..n^3 (or bigger). Now sort (random #, proc #) pairs by random #, and compute (random #, proc #, rank). Finally sort (proc #, rank) pairs by proc#. The rank is each processor's value of the permutation. Note that you need a range >> n^2 to avoid equal random #'s I hope that this helps. -Tom Spencer