[comp.parallel] Name this primitive

Andrew.Chin@prg.oxford.ac.uk (Andrew Chin) (04/16/91)

I am working on realistic PRAM models.  The PRAM is usually defined
without specifying what basic instruction set is used.  Two criteria for
deciding whether an instruction is "fair" to use are linear (sequential)
bit complexity and membership in AC0.  (These criteria exclude
multiplication, for example.)  Even if an instruction is "fair,"
however, one should still justify supporting it in hardware or
low-level code.

It will help me greatly if I can have a basic instruction which takes
n n-bit words and produces n n-bit words such that the first word 
consists of the highest-order bits, the second word consists of the
second-highest-order bits, etc.  The effect is that of transposing
an n-by-n 0-1-valued matrix.
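For concreteness, here is a software sketch of the desired operation in
Python (the function name is mine; the point, of course, is to have this
as a single instruction rather than a bit loop):

```python
def bit_transpose(words, n):
    # Transpose n n-bit words: output word i collects the i-th
    # highest-order bit of every input word, with the bit taken
    # from input word j landing in the j-th highest-order position.
    out = []
    for i in range(n):
        w = 0
        for j in range(n):
            bit = (words[j] >> (n - 1 - i)) & 1
            w |= bit << (n - 1 - j)
        out.append(w)
    return out

bit_transpose([0b1100, 0b1010, 0b0000, 0b0001], 4)
# [0b1100, 0b1000, 0b0100, 0b0001]
```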

Clearly this can be done with linear bit complexity and in AC0.
What I want to know is:

(1)  Does this instruction have a name?
(2)  Are there references to it in the literature?
(3)  Is it practical?  (Both research results and hunches are invited.)

______________________________________________________________________
Andrew Chin                                chin@prg.oxford.ac.uk
Oxford University Computing Laboratory
8-11 Keble Road                            +44 865 513661
Oxford  OX1 3QD                            (0865) 513661
England                                    at Texas A&M 1991-93
______________________________________________________________________

-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

rbe@yrloc.ipsa.reuter.COM (Robert Bernecky) (04/17/91)

In article <1991Apr16.164534.1530@hubcap.clemson.edu> Andrew Chin <Andrew.Chin@prg.oxford.ac.uk> writes:
>
>multiplication, for example.)  Even if an instruction is "fair,"

Pardon my ignorance, but what is AC0?

>It will help me greatly if I can have a basic instruction which takes
>n n-bit words and produces n n-bit words such that the first word 
>consists of the highest-order bits, the second word consists of the
>second-highest-order bits, etc.  The effect is that of transposing
>an n-by-n 0-1-valued matrix.
>
>(1)  Does this instruction have a name?
>(2)  Are there references to it in the literature?
>(3)  Is it practical?  (Both research results and hunches are invited.)

(1) Yes, it does have a name. If I were being pedantic, I'd stop there.
    In APL and J, we call it "transpose".

(2) Yes, see any of many books on APL, or see Hui, Iverson, et al.,
    "APL\?", in ACM SIGAPL Quote Quad, Vol. 20, No. 4, July 1990 --
    the APL90 Conference proceedings -- for an early paper on J.

(3) Is it practical? You betcha! Suppose you only have a reduction
    primitive for summing rows of an array, and want to sum the columns.
    Merely do "sum reduce transpose x". Very handy. Why, I'll bet
    it would even have applications with non-Boolean arguments 8^}.
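For readers without an APL handy, the same trick can be sketched in
Python (the function names here are mine, not APL's):

```python
def transpose(x):
    # plain list-of-lists transpose; works for non-square arrays too
    return [list(col) for col in zip(*x)]

def sum_rows(x):
    # the only reduction we assume: sum along each row
    return [sum(row) for row in x]

x = [[1, 2, 3],
     [4, 5, 6]]

sum_rows(x)             # row sums:    [6, 15]
sum_rows(transpose(x))  # column sums: [5, 7, 9]
```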

With tongue removed from cheek: Boolean operations are an area in which
APL knocks the socks off other languages in terms of performance.
Architectural support for transpose and other operations (indexing
in and out of bit arrays with vector arguments comes to mind, for
contiguous row/column access) would help even more.

J and older dialects of APL do not have the Flatlander prejudices that
seem rife in other languages: transpose (denoted |:) of an array x
yields an array in which the ordering of the axes is reversed
(shape is denoted $).  That is,
     $x 
3 4 5       (3 planes, 4 rows, 5 columns)
     $|:x   (shape of the transpose of x)
5 4 3

If you want to transpose each plane, use "rank" (denoted ") to apply it
to the rank-2 objects:
     $|:"2 x
3 5 4
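The rank-2 behaviour can be mimicked with nested lists in Python (a
rough sketch with helper names of my choosing; J's rank conjunction is
far more general):

```python
def transpose(plane):
    # transpose one rank-2 cell (a list of rows)
    return [list(row) for row in zip(*plane)]

def shape(x):
    # lengths along each axis of a nested-list "array"
    s = []
    while isinstance(x, list):
        s.append(len(x))
        x = x[0]
    return s

x = [[[0] * 5 for _ in range(4)] for _ in range(3)]  # shape 3 4 5

shape([transpose(plane) for plane in x])  # [3, 5, 4], like $|:"2 x
```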

There's a thread on this going on in comp.lang.apl.  J, by the way, is
a new dialect of APL in which we have abandoned the APL character set,
adopted ASCII, and rationalized 25 years of design, in which there
were some inevitable warts.

So, yes, Boolean transpose is important. But, be sure that you allow
for non-square arrays! I often transpose funny-shaped Boolean arrays.



Robert Bernecky      rbe@yrloc.ipsa.reuter.com  bernecky@itrchq.itrc.on.ca 
Snake Island Research Inc  (416) 368-6944   FAX: (416) 360-4694 
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9 
Canada


Andrew.Chin@prg.oxford.ac.uk (Andrew Chin) (04/22/91)

Here is a summary of the many helpful replies I received to my
query about the PRAM/RAM operation

> ... which takes n n-bit words and produces n n-bit words such
> that the first word consists of the highest-order bits, the
> second word consists of the second-highest-order bits, etc. 
> The effect is that of transposing an n-by-n 0-1-valued matrix.

(1)  Does this instruction have a name?
(2)  Are there references to it in the literature?
(3)  Is it practical?  (Both research results and hunches are invited.)

Many thanks to everyone who took time to consider the questions,
and especially those who sent me these very good answers.
My responses to their responses are prefixed by "<".

Further responses are invited and I will continue to summarize
any information received that seems to be of general interest.

-------------------HARDWARE---------------------

pjn@edu.UMD.cs (P. J. Narayanan):

The primitive you described is often called "corner turning". It is
an interesting operation indeed.

The Connection Machine has a chip -- called the SPRINT chip -- that
interfaces the bit-serial processors' memories with the floating-point
processor (there is only one floating-point unit per 32 bit-serial
processors) and that implements the above primitive (and a lot more,
I think).  Each of the 32 PEs writes 32 bits from its memory onto the
SPRINT chip in 32 cycles.  The FPU reads them as 32-bit words, corner
turned, and processes them.

Years back, I had a simple hardware design involving bipolar ROMs and
fast shift circuitry that gave corner-turning capability to memory.
Thus an 8-byte block [i.e., an 8x8 block of bits] (this could be
extended to any size, with a corresponding increase in the extra
hardware) could be read either way -- as bytes of rows or as bytes of
columns.  It is definitely practical to design corner-turning hardware
(as demonstrated by the SPRINT chip); I believe it is practical to give
this capability to a DRAM memory with "reasonable" extra hardware.
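< For concreteness, an 8x8 corner turn is also cheap in software.  Here
< is a sketch in Python of the classic block-swap method -- my
< illustration, not Narayanan's ROM/shifter design:

```python
def transpose8(rows):
    # Corner-turn an 8x8 bit block held as 8 one-byte rows.
    x = 0
    for r in rows:                     # pack; rows[0] becomes the high byte
        x = (x << 8) | (r & 0xFF)
    # delta-swap 1x1, 2x2, then 4x4 sub-blocks across the main diagonal
    for shift, mask in ((7,  0x00AA00AA00AA00AA),
                        (14, 0x0000CCCC0000CCCC),
                        (28, 0x00000000F0F0F0F0)):
        t = (x ^ (x >> shift)) & mask
        x ^= t ^ (t << shift)
    return [(x >> (8 * (7 - i))) & 0xFF for i in range(8)]

transpose8([0xFF, 0, 0, 0, 0, 0, 0, 0])   # row of ones becomes a
# column of ones: [0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80]
```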
----------------------------------------
Mark Greenstreet <mrg@edu.Princeton>:

> (3) Is it practical?

Yes (if you're a theoretician).  Take a register file of n n-bit registers.
Fold it along the main diagonal.  Now, all cells that need to communicate
for a bit-transpose operation are neighbors.  Folding an array preserves
the area of the array (ignoring small constant factors).  So it looks good.

Maybe (if you're an engineer).  To justify hardware implementation of the
operation, you have to show that its cost (the small constant factor
mentioned above, plus the architectural constraint relating the number of
registers to the word size, etc.) is justified by the benefit of having
the instruction.  Does it speed up enough programs sufficiently to
justify the longer cycle time caused by the longer wires the folding
introduces into the register file?
-----------------------------------------
Pete Hugo (hugo@sunapee.dartmouth.edu) also noted that the Connection
Machine supports corner turning, and adds:

So, I guess this hardware does a transpose of a 32 by n matrix of
bits, rather than an n by n matrix.  The problem I see with
transposing an N by N matrix is that it would require a lot of global
memory access.  The problem I have always had with PRAM is this idea
that you can build a machine with infinite bandwidth to a shared
memory.  It just isn't going to happen.  I think to make PRAM
realistic, we have to add some metric of 'locality' to the model.

< Comment:  The Block PRAM of Aggarwal, Chandra and Snir is a good
< model for introducing these sorts of locality issues.  I will
< have a thesis on this which will be available in late summer.
< Send requests to <library@prg.oxford.ac.uk> after Sept. 1.
-----------------------------------------------
Arnold Rosenberg < rsnbrg@edu.umass.cs.elys > reported that Bojana
Obrenic was lead co-author with him on a paper about corner turning,
kindly asked Bojana to send me a copy, and writes:

The operation is very important for certain real applications.

----------------COMPLEXITY THEORY---------------

 <srt@edu.duke.cs> Steve Tate:

> The PRAM is usually defined without specifying what basic 
> instruction set is used.

Well, this is not really true.  There are quite a few papers dealing with
various instruction sets for the PRAM.  The standard PRAM typically
has only addition, subtraction, division by two (shift right), test
and branch, and indirect addressing.  The MPRAM model allows multiplication
also, and this has also been studied.  There are some recent papers
by Trahan, Ramachandran, and (Someone else) that address this for
both deterministic and nondeterministic PRAMs.  Also there is a classic
paper by Hartmanis (I think) on MPRAMs (maybe this was only on MRAMs).

< Comment:  The references sound helpful.  I should have said:
< Despite the existence of specific studies on the power of
< instruction sets, the PRAM is usually defined without specifying
< what basic instruction set is used.
------------------------------------------------
Ken Regan <regan%edu.buffalo.cs@edu.buffalo.cs.castor>:

The NLIN not-equal-to DLIN proof of Paul, Pippenger, Szemeredi and
Trotter depends heavily on the 1-dimensional structure of the tapes.
In particular, if you have three blocks B1,B2,B3 on the tape, and you
move from B1 to B3, then you can't refer to what you stored in B1
without going through B2 again.  Figuratively, the time spent going
through B2 is "soft time", and the NLIN /= DLIN proof sort-of says
that nondeterminism can squeeze out soft time that determinism cannot.
(But only improving by a wimpy log*n factor.)

In the plane, however, there are more ways to locate memory blocks
"contiguously", and these frustrate the graph theoretic techniques (so far)
which PPST used to prove their result.

So maybe these results only say that our notion of running time is wrong,
too "one-dimensional."  But, I answer (and hope), maybe this
kind of one-dimensionality only reflects the fact that our formal theory
of languages is string-based to begin with--we don't regard "2-D strings"
as fundamental units.

Matrix-transpose is an important topic in TM theory as well.  The best I
can do is refer you to several articles by Ming Li and Paul Vitanyi on
Kolmogorov complexity, where they talk about the K-c significance of
matrix transpose, and lower bounds for TM's.  Their best article is in a
volume entitled "Complexity Theory Retrospective", Alan Selman, editor,
Springer Verlag, 1990.  Li and Vitanyi are writing a book, and I believe
OUP is the publisher (?).

< Ken also mentioned related work by Jie Wang (Wilkes University)
< and himself (different topics).
----------------------------------------------
Robert Bernecky< rbe@yrloc.ipsa.reuter.COM > (see also PROGRAMMING below):

Pardon my ignorance, but what is AC0?

< AC^0 is a Boolean function complexity measure related to the class NC.
< A function is in NC if it can be computed by Boolean circuits with
< polynomial size and polylogarithmic depth.  Alternatively, a function is
< in NC if it can be computed by a PRAM with polynomially many processors
< in polylogarithmic time.

< A function is in AC^0 if it can be computed by Boolean circuits
< whose gates can be unary negations, or ANDs or ORs of arbitrary arity,
< with polynomial size and constant depth.

< It follows from a result of Furst, Saxe and Sipser (1981) that
< multiplication is not in AC^0.  The use of AC^0 as a criterion for
< PRAM instructions was introduced by Stockmeyer and Vishkin (1984).

-----------------PROGRAMMING-------------------

(Jeremy Gibbons) Jeremy.Gibbons@prg.oxford.ac.uk

In Squigol (or Orwell) it's "transpose."

  tran = (zipwith (++))/ . [.]**

** / = reduce (or fold)
   [.] a = [a]
   zipwith f ([],[]) = []
   zipwith f (a:x,b:y) = (f a b):(zipwith f (x,y))
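< A direct Python rendering of this definition (the name tran and the
< helper are my choices):

```python
from functools import reduce

def tran(xss):
    # [.]** : replace each element of each row by a singleton list
    singletons = [[[a] for a in row] for row in xss]
    # (zipwith (++))/ : fold the rows together, concatenating pointwise
    zipwith_cat = lambda xs, ys: [x + y for x, y in zip(xs, ys)]
    return reduce(zipwith_cat, singletons)

tran([[1, 2, 3], [4, 5, 6]])   # [[1, 4], [2, 5], [3, 6]]
```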
-----------------------------------------
rbe@yrloc.ipsa.reuter.COM (Robert Bernecky):

In APL and J, we call it "transpose".

--------------------MISC. -----------------
< Eric Ristad <ristad@edu.Princeton> and Alan Jeffrey 
< <jeffrey@se.chalmers.cs> also responded.
-----------------------------------------
A parting shot from Frank Silbermann <fs@edu.tulane.cs.rex>:

>	I am working on realistic PRAM models.

I had always thought "realistic PRAM model" was an oxymoron.
-----------------------------------------
