Andrew.Chin@prg.oxford.ac.uk (Andrew Chin) (04/16/91)
I am working on realistic PRAM models. The PRAM is usually defined
without specifying what basic instruction set is used. Two criteria for
deciding whether an instruction is "fair" to use are linear (sequential)
bit complexity and membership in AC^0. (These criteria exclude
multiplication, for example.) Even if an instruction is "fair," however,
it is best to be able to justify supporting it in hardware or low-level
code.

It will help me greatly if I can have a basic instruction which takes
n n-bit words and produces n n-bit words such that the first word
consists of the highest-order bits, the second word consists of the
second-highest-order bits, etc. The effect is that of transposing an
n-by-n 0-1-valued matrix. Clearly this can be done with linear bit
complexity and in AC^0.

What I want to know is:

(1) Does this instruction have a name?
(2) Are there references to it in the literature?
(3) Is it practical? (Both research results and hunches are invited.)

______________________________________________________________________
Andrew Chin                                 chin@prg.oxford.ac.uk
Oxford University Computing Laboratory
8-11 Keble Road                             +44 865 513661
Oxford OX1 3QD                              (0865) 513661
England                                     at Texas A&M 1991-93
______________________________________________________________________
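
To pin down the semantics being requested, here is a minimal C sketch
with n fixed at 32. The name bit_transpose is illustrative only, and
the doubly nested loop is a reference model of the instruction's
effect, not a suggestion for how hardware would implement it.

    #include <stdint.h>

    /* Reference semantics for the proposed instruction, n = 32:
       output word i collects bit i (counted from the high-order end)
       of every input word, so out[0] holds the highest-order bits,
       out[1] the second-highest, and so on.  Bit j (from the high
       end) of out[i] comes from input word j. */
    void bit_transpose(const uint32_t in[32], uint32_t out[32])
    {
        for (int i = 0; i < 32; i++) {
            uint32_t w = 0;
            for (int j = 0; j < 32; j++)
                w |= ((in[j] >> (31 - i)) & 1u) << (31 - j);
            out[i] = w;
        }
    }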
rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) (04/17/91)
In article <1991Apr16.164534.1530@hubcap.clemson.edu> Andrew Chin
<Andrew.Chin@prg.oxford.ac.uk> writes:
>multiplication, for example.) Even if an instruction is "fair,"

Pardon my ignorance, but what is AC^0?

>It will help me greatly if I can have a basic instruction which takes
>n n-bit words and produces n n-bit words such that the first word
>consists of the highest-order bits, the second word consists of the
>second-highest-order bits, etc. The effect is that of transposing
>an n-by-n 0-1-valued matrix.
>
>(1) Does this instruction have a name?
>(2) Are there references to it in the literature?
>(3) Is it practical? (Both research results and hunches are invited.)

(1) Yes, it does have a name. If I were being pedantic, I'd stop
    there. In APL and J, we call it "transpose".

(2) Yes. See any of many books on APL, or see Hui, Iverson, et al.,
    "APL\?", ACM SIGAPL Quote Quad, Vol. 20, No. 4, July 1990 (the
    APL90 Conference proceedings), for an early paper on J.

(3) Is it practical? You betcha! Suppose you only have a reduction
    primitive for summing the rows of an array, and you want to sum
    the columns. Merely do "sum reduce transpose x". Very handy. Why,
    I'll bet it would even have applications with non-Boolean
    arguments 8^}.

With tongue removed from cheek: Boolean operations are an area in
which APL knocks the socks off other languages in terms of
performance. Having architectural support for transpose and other
operations (indexing in and out of bit arrays with vector arguments
comes to mind -- contiguous row/column stuff) would help even more.

J and older dialects of APL do not have the Flatlander prejudices
that seem rife in other languages: transpose (denoted |:) of an array
x results in an array in which the ordering of the axes is reversed.
That is, using shape (denoted $):

   $x
3 4 5                   (3 planes, 4 rows, 5 columns)
   $|:x                 (shape of the transpose of x)
5 4 3

If you want to transpose each plane instead, use "rank" (denoted ")
to apply transpose to the rank-2 objects:

   $|:"2 x
3 5 4

There's a thread on this going on in comp.lang.apl. J, by the way, is
a new dialect of APL in which we have abandoned the APL character set
and adopted ASCII, and also rationalized 25 years of design, in which
there were some inevitable warts.

So, yes, Boolean transpose is important. But be sure that you allow
for non-square arrays! I often transpose funny-shaped Boolean arrays.

Robert Bernecky       rbe@yrloc.ipsa.reuter.com  bernecky@itrchq.itrc.on.ca
Snake Island Research Inc       (416) 368-6944   FAX: (416) 360-4694
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9 Canada
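
Bernecky's answer to (3) holds even without special hardware: on a
conventional machine, a 32x32 bit matrix can be transposed in O(n log n)
word operations rather than n^2 single-bit moves, by a standard
divide-and-conquer trick (swap the 16x16 off-diagonal blocks, then the
8x8 blocks within each half, and so on down to single bits). A sketch
in C, assuming rows are stored one per 32-bit word with the most
significant bit leftmost; the name transpose32 is ours, not an existing
primitive.

    #include <stdint.h>

    /* In-place transpose of a 32x32 bit matrix, one row per word.
       Each pass exchanges the two off-diagonal j x j blocks of every
       2j x 2j tile; five passes (j = 16, 8, 4, 2, 1) finish the job. */
    void transpose32(uint32_t a[32])
    {
        uint32_t m = 0x0000FFFFu;              /* low-half column mask */
        for (int j = 16; j != 0; j >>= 1, m ^= m << j) {
            for (int k = 0; k < 32; k = (k + j + 1) & ~j) {
                uint32_t t = (a[k] ^ (a[k + j] >> j)) & m;
                a[k]     ^= t;
                a[k + j] ^= t << j;
            }
        }
    }

Five passes over 16 word pairs apiece is cheap enough that the
column-summing idiom above is practical in software too: transpose,
then take the population count of each result word.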
Andrew.Chin@prg.oxford.ac.uk (Andrew Chin) (04/22/91)
Here is a summary of the many helpful replies I received to my query
about the PRAM/RAM operation

> ... which takes n n-bit words and produces n n-bit words such
> that the first word consists of the highest-order bits, the
> second word consists of the second-highest-order bits, etc.
> The effect is that of transposing an n-by-n 0-1-valued matrix.

(1) Does this instruction have a name?
(2) Are there references to it in the literature?
(3) Is it practical? (Both research results and hunches are invited.)

Many thanks to everyone who took the time to consider the questions,
and especially to those who sent me these very good answers. My
responses to their responses are prefixed by "<". Further responses
are invited, and I will continue to summarize any information received
that seems to be of general interest.

-------------------HARDWARE---------------------

pjn@edu.UMD.cs (P. J. Narayanan):

The primitive you described is often called "corner turning". It is an
interesting operation indeed. The Connection Machine has a chip --
called the SPRINT chip -- that interfaces the bit-serial processor
memories with the floating-point processor (there is only one
floating-point unit per 32 bit-serial processors) and that implements
the above primitive (and a lot more, I think). Each of the 32 PEs
writes 32 bits from its memory onto the SPRINT chip in 32 cycles. The
FPU reads them as 32-bit words, corner-turned, and processes them.

Years back, I had a simple hardware design involving bipolar ROMs and
fast shift circuitry that gave corner-turning capability to memory.
Thus an 8-byte block [i.e., an 8x8 block of bits] (which could be
extended to any size, with a corresponding increase in the extra
hardware) could be read either way -- as bytes of rows or as bytes of
columns.

It is definitely practical to design corner-turning hardware (as
demonstrated by the SPRINT chip); I believe it is practical to give
this capability to a DRAM memory with "reasonable" extra hardware.

----------------------------------------

Mark Greenstreet <mrg@edu.Princeton>:

> (3) Is it practical?

Yes (if you're a theoretician). Take a register file of n n-bit
registers. Fold it along the main diagonal. Now all cells that need to
communicate for a bit-transpose operation are neighbors. Folding an
array preserves the area of the array (ignoring small constant
factors). So it looks good.

Maybe (if you're an engineer). To justify a hardware implementation of
the operation, you have to show that its cost (the small constant
factor mentioned above, plus the architectural constraint relating the
number of registers to the word size, etc.) is justified by the
benefit of having the instruction. Does it speed up enough programs
sufficiently to justify the longer cycle time due to the longer wires
in the register file caused by the folding?

-----------------------------------------

Pete Hugo (hugo@sunapee.dartmouth.edu) also noted that the Connection
Machine supports corner turning, and adds:

So, I guess this hardware does a transpose of a 32 by n matrix of
bits, rather than an n by n matrix. The problem I see with transposing
an N by N matrix is that it would require a lot of global memory
access. The problem I have always had with the PRAM is this idea that
you can build a machine with infinite bandwidth to a shared memory. It
just isn't going to happen. I think to make the PRAM realistic, we
have to add some metric of "locality" to the model.

< Comment: The Block PRAM of Aggarwal, Chandra and Snir is a good
< model for introducing these sorts of locality issues.
< I will have a thesis on this which will be available in late summer.
< Send requests to <library@prg.oxford.ac.uk> after Sept. 1.

-----------------------------------------------

Arnold Rosenberg <rsnbrg@edu.umass.cs.elys> reported that Bojana
Obrenic was lead co-author on a paper about corner turning, kindly
asked Bojana to send me a copy, and writes:

The operation is very important for certain real applications.

----------------COMPLEXITY THEORY---------------

<srt@edu.duke.cs> Steve Tate:

> The PRAM is usually defined without specifying what basic
> instruction set is used.

Well, this is not really true. There are quite a few papers dealing
with various instruction sets for the PRAM. The standard PRAM
typically has only addition, subtraction, division by two (shift
right), test and branch, and indirect addressing. The MPRAM model
allows multiplication also, and this has also been studied. There are
some recent papers by Trahan, Ramachandran, and (someone else) that
address this for both deterministic and nondeterministic PRAMs. Also,
there is a classic paper by Hartmanis (I think) on MPRAMs (maybe this
was only on MRAMs).

< Comment: The references sound helpful. I should have said:
< Despite the existence of specific studies on the power of
< instruction sets, the PRAM is usually defined without specifying
< what basic instruction set is used.

------------------------------------------------

Ken Regan <regan%edu.buffalo.cs@edu.buffalo.cs.castor>:

The NLIN /= DLIN proof of Pippenger, Paul, Szemeredi and Trotter
depends heavily on the one-dimensional structure of the tapes. In
particular, if you have three blocks B1, B2, B3 on the tape, and you
move from B1 to B3, then you can't refer to what you stored in B1
without going through B2 again. Figuratively, the time spent going
through B2 is "soft time", and the NLIN /= DLIN proof sort-of says
that nondeterminism can squeeze out soft time that determinism cannot.
(But only improving by a wimpy log* n factor.) In the plane, however,
there are more ways to locate memory blocks "contiguously", and these
frustrate the graph-theoretic techniques (so far) which PPST used to
prove their result. So maybe these results only say that our notion of
running time is wrong, too "one-dimensional". But, I answer (and
hope), maybe this kind of one-dimensionality only reflects the fact
that our formal theory of languages is string-based to begin with --
we don't regard "2-D strings" as fundamental units.

Matrix transpose is an important topic in TM theory as well. The best
I can do is refer you to several articles by Ming Li and Paul Vitanyi
on Kolmogorov complexity, where they talk about the K-c significance
of matrix transpose, and lower bounds for TMs. Their best article is
in a volume entitled "Complexity Theory Retrospective", Alan Selman,
editor, Springer-Verlag, 1990. Li and Vitanyi are writing a book, and
I believe OUP is the publisher (?).

< Ken also mentioned related work by Jie Wang (Wilkes University)
< and himself (different topics).

----------------------------------------------

Robert Bernecky <rbe@yrloc.ipsa.reuter.COM> (see also PROGRAMMING
below):

Pardon my ignorance, but what is AC^0?

< AC^0 is a Boolean function complexity measure related to the class
< NC. A function is in NC if it can be computed by Boolean circuits
< with polynomial size and polylogarithmic depth. Alternatively, a
< function is in NC if it can be computed by a PRAM with polynomially
< many processors in polylogarithmic time.
< A function is in AC^0 if it can be computed by Boolean circuits
< whose gates can be unary negations, or ANDs or ORs of arbitrary
< arity, with polynomial size and constant depth.
<
< It follows from a result of Furst, Saxe and Sipser (1981) that
< multiplication is not in AC^0. The use of AC^0 as a criterion for
< PRAM instructions was introduced by Stockmeyer and Vishkin (1984).

-----------------PROGRAMMING-------------------

Jeremy Gibbons <Jeremy.Gibbons@prg.oxford.ac.uk>:

In Squigol (or Orwell) it's "transpose":

   tran = (zipwith (++))/ . [.]**

where

   *     = map    (so [.]** applies [.] to each element of each row)
   /     = reduce (or fold)
   [.] a = [a]
   zipwith f ([],[])   = []
   zipwith f (a:x,b:y) = (f a b) : (zipwith f (x,y))

-----------------------------------------

rbe@yrloc.ipsa.reuter.COM (Robert Bernecky):

In APL and J, we call it "transpose".

--------------------MISC.----------------------

< Eric Ristad <ristad@edu.Princeton> and Alan Jeffrey
< <jeffrey@se.chalmers.cs> also responded.

-----------------------------------------

A parting shot from Frank Silbermann <fs@edu.tulane.cs.rex>:

> I am working on realistic PRAM models.

I had always thought "realistic PRAM model" was an oxymoron.

-----------------------------------------

--
=========================== MODERATOR ==============================
Steve Stevenson                      {steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906      (803)656-5880.mabell