[comp.arch] RISC is a nasty no-no!

terry@wsccs.UUCP (terry) (02/23/88)

	We have been porting software at the company I work for for a number
of years... consider: our primary product runs on over 130 unix machines,
all of the 286/386 UNIX cloning OS's (UNIX/Xenix/Venix/etc), Berk, SysV,
SysIII, VMS, CTOS, BTOS, MS-DOS, CP/M, MP/M, Turbodos, Whatever Apple calls
their normal MAC OS, and some nasty things you never heard of.  We also run
on one purportedly RISC system, the RT (IBM).

	Further, it is communications software which EMULATES 9 terminals as
EXACTLY (no, it is NOT the simplistic 'cu', and NO, we do not do 132 columns
on a VT50) as possible given the limitations of the hardware.  This means it
has to talk to some pretty bizarre serial/ethernet devices.  It does this
correctly.  At 9600+ baud.

	[Lest this be considered an ad, let me point out that 1) I have not
	 mentioned the product name and 2) the reader is supposed to glean
	 an idea of portability.  It is the same code on all machines except
	 CP/M and MP/M, and they don't count]

	We have been able to port to all machines we attempted, except one
(no, not VMS... although that's a horse of a different wheelbase).  That one
was SUN's new (not so new any more) RISC machine, erroneously labeled via
an increment of a prior product, thus fooling the user into believing his
code will run.

	During our many-houred visit in this strange dimension (a usual port
takes no longer than 2 hours... the Harris HCX9 took 15 minutes,
including writing 5 tapes), we attempted to port 5 products, not one of
which runs on less than 60 machines.  The only one that didn't grunk (read
Segmentation fault, core dumped) was the one I have described above in great
gory detail, and that took tweaking.  We currently do not distribute this
due to an insane fear of technical support calls, although, again, it does
run.

THE REASON:  Type-casting.  You can't.  Small programs seem to, but it doesn't
work.  Bytes tend to be word aligned.  Other messy stuff.  It was not a
pretty sight (site?).  I am sure there are other problems, but geez, this is
demonstrably portable code.

	I am all for RISC machines when reasonably implemented.  My idea of
RISC is an instruction set that is sufficiently small to allow the manufacturer
to call it RISC and not get sued, but sufficiently varied to allow me to go
off and have the assembler implement enough macros that my compiler thinks
it's running on a 680x0.  ----Oh rats!  If I can't have that, at LEAST my
portable C compiler should be.  Sun must have some good people to be able
to have ported a semblance of UNIX to this thing.

	I shudder when I hear people say "Won't it be neat when we can buy a
RISC workstation based on the Sun chip!".  EEAaauuUGGhah!


| Terry Lambert           UUCP: ...!decvax!utah-cs!century!terry              |
| @ Century Software       or : ...utah-cs!uplherc!sp7040!obie!wsccs!terry    |
| SLC, Utah                                                                   |
|                   These opinions are not my companies, but if you find them |
|                   useful, send a $20.00 donation to Brisbane Australia...   |
| 'There are monkey boys in the facility.  Do not be alarmed; you are secure' |

steve@nuchat.UUCP (Steve Nuchia) (02/28/88)

From article <179@wsccs.UUCP>, by terry@wsccs.UUCP (terry):

[ lots of self-congratulation about how portable his code is, followed
  by complaints that it isn't portable to the SPARC ]

> THE REASON:  Type-casting.  You can't.  Small programs seem to, but it doesn't
> work.  Bytes tend to be word aligned.  Other messy stuff.  It was not a
> pretty sight (site?).  I am sure there are other problems, but geez, this is
> demonstrably portable code.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^

FLAME ON!    ( I love this! )

WRONG.  It is demonstrably NON-portable code - it failed to port
to a working compiler on a reasonable machine.  If the bloody
unix kernel runs (and it does) your silly application should, too.

FLAME OFF.

How about some code fragments and a discussion of how they
meet the standards (de facto or otherwise) and how the SPARC
compiler fails to properly implement them.

Get a clue - portable doesn't mean "runs on X processors", it
means "conforms to standards".
-- 
Steve Nuchia	    | [...] but the machine would probably be allowed no mercy.
uunet!nuchat!steve  | In other words then, if a machine is expected to be
(713) 334 6720	    | infallible, it cannot be intelligent.  - Alan Turing, 1947

cruff@scdpyr.UUCP (Craig Ruff) (02/28/88)

In article <696@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
>From article <179@wsccs.UUCP>, by terry@wsccs.UUCP (terry):
>[ lots of self-congratulation about how portable his code is, followed
>  by complaints that it isn't portable to the SPARC ]
>
>> THE REASON:  Type-casting.  You can't.
>
>FLAME ON!    ( I love this! )
>
>WRONG.  It is demonstrably NON-portable code - it failed to port
>to a working compiler on a reasonable machine.  If the bloody
>unix kernel runs (and it does) your silly application should, too.
>
>FLAME OFF.

I've just ported several thousand lines of code to the Sun-4.  The only
problem I had was in the use of the ndbm library.  I was doing a
structure copy with the dptr result from dbm_fetch, which turned out to
be non-word aligned.  Of course, the documentation does not state that
the dptr will be word aligned; in fact, it doesn't say anything at
all about this.  Anyway, type casts do exist in this code, though in 
small numbers.  Since I knew this code would eventually reside on the Sun-4,
I was careful to write it with portability in mind.  I also ported many
more thousand lines of both C (written with Unix portability in mind) and
Fortran (gak! written with many-vendor machine portability in mind).  Not
one single problem related to data alignment or casting.
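
For what it's worth, the fix is small.  Here is a minimal sketch of the
alignment-safe way to pull a record out of dbm_fetch()'s result (the
"struct rec" layout and the fetch_rec() name are invented for the example):

    #include <ndbm.h>
    #include <string.h>

    struct rec { long count; char name[28]; };   /* made-up record layout */

    /* Fetch a record without assuming d.dptr is word aligned. */
    int fetch_rec(DBM *db, datum key, struct rec *out)
    {
            datum d = dbm_fetch(db, key);

            if (d.dptr == NULL || d.dsize < (int) sizeof *out)
                    return -1;
            /* memcpy makes no alignment assumptions about its arguments,
             * unlike "*out = *(struct rec *) d.dptr", which can trap. */
            memcpy((char *) out, d.dptr, sizeof *out);
            return 0;
    }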

In all, I'd say that the Sun-4 does not present any unsolvable problems
in porting code.  If your code is really portable that is. :-)
-- 
Craig Ruff      NCAR                         INTERNET: cruff@scdpyr.UCAR.EDU
(303) 497-1211  P.O. Box 3000                   CSNET: cruff@ncar.CSNET
		Boulder, CO  80307               UUCP: cruff@scdpyr.UUCP

elg@killer.UUCP (Eric Green) (02/28/88)

in article <179@wsccs.UUCP>, terry@wsccs.UUCP (terry) says:
[describes a program which runs on systems with a minimum of porting, e.g. 2
hours for a Harris minicomputer: then details hours trying to port to a Sun 4,
and failing: ]

> THE REASON:  Type-casting.  You can't.  Small programs seem to, but it doesn't
> work.  Bytes tend to be word aligned.  Other messy stuff.  It was not a
> pretty sight (site?).  I am sure there are other problems, but geez, this is
> demonstrably portable code.
> 
> 	I am all for RISC machines when reasonably implimented.  My idea of
> RISC is an instruction set that is sufficently small to allow the manufacturer
> to call it RISC and not get sued, but sufficiently varied to allow me to go
> off and have the assembler impliment enough macro's that my compiler thinks
> it's running on a 680x0.  ----Oh rats!  If I can't have that, at LEAST my
> portable C compiler should be.  Sun must have some good people to be able
> to have ported a semblance of UNIX to this thing.

There are a couple of possible problems that may be bugging you:

1) Sun's Unix isn't. That is, it's a huge superset of Unix, with features of
both BSD4.x and Sys V all mashed together.

2) Sun's "C" compiler apparently isn't very good at handling a lot of 
things, from your description. Or maybe there's a flag you didn't set, or
something similar. 

I've used a Pyramid 90x and an IBM RT. I've read papers on the AMD29000 and
the MIPSco chip. I see no inherent reason for portable programs to not run on
any of them, except possibly the 29000 (which lacks byte addressing in its
native rendition).

--
Eric Lee Green  elg@usl.CSNET     Snail Mail P.O. Box 92191      
{cbosgd,ihnp4}!killer!elg         Lafayette, LA 70509            

Come on, girl, I ain't looking for no fight/You ain't no beauty, but hey
you're alright/And I just need someone to love/tonight

daveb@geac.UUCP (David Collier-Brown) (02/29/88)

In article <696@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
>WRONG.  It is demonstrably NON-portable code - it failed to port
>to a working compiler on a reasonable machine.  If the bloody
>unix kernel runs (and it does) your silly application should, too.

  Agreed. However....

>Get a clue - portable doesn't mean "runs on X processors", it
>means "conforms to standards".

  The purpose of the standard is to allow portability, by making
machines similar enough that code `ports' (an oversimplification, of
course).  Therefore a claim that XXX runs on N Unix boxes of at
least 2 different OS families is a stronger claim for portability
than meeting a standard that describes (and I'm thinking of the SVID
here) only one family.

  Much more info required, Terry. (terry@wsccs.UUCP)


--dave
-- 
 David Collier-Brown.                 {mnetor yunexus utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind) 
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

bs@linus.UUCP (Robert D. Silverman) (02/29/88)

In article <284@scdpyr.UUCP: cruff@scdpyr.UUCP (Craig Ruff) writes:
:In article <696@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
:>From article <179@wsccs.UUCP>, by terry@wsccs.UUCP (terry):
:>[ lots of self-congratulation about how portable his code is, followed
:>  by complaints that it isn't portable to the SPARC ]
:>
:>> THE REASON:  Type-casting.  You can't.
:>
:>FLAME ON!    ( I love this! )
:>
:>WRONG.  It is demonstrably NON-portable code - it failed to port
:>to a working compiler on a reasonable machine.  If the bloody
:>unix kernel runs (and it does) your silly application should, too.
 
There's something about RISC architectures in general that I find 
confusing. Since they (read SPARC or equivalent) have no integer multiply
instructions, any code which has a fair number of these is going to
be slow. This would include any program which had access to 2-D arrays
since one must do multiplications (unless the array sizes are a convenient
power of 2) to get the array indices right. Any code that accesses a[i][j]
should run like a pig on such machines. I've seen some benchmarks that
suggest SUN-4's are in fact slower than SUN-3's on programs that do a 
large amount of integer multiplies/divides. What good is a computer that
can't multiply?

Anyone remember the CADET? It was, I believe, the IBM 1630 and stood for
'Can't Add, Doesn't Even Try'. It did its addition by table lookup.

For people who want to use computers to COMPUTE, it appears that SPARC
is a step in the wrong direction.

Bob Silverman

dfk@duke.cs.duke.edu (David Kotz) (03/01/88)

> There's something about RISC architectures in general that I find 
> confusing. Since they (read SPARC or equivalent) have no integer multiply
> instructions, any code which has a fair number of these is going to
> be slow. This would include any program which had access to 2-D arrays
> since one must do multiplications (unless the array sizes are a convenient
> power of 2) to get the array indices right. ...
> Bob Silverman

Multiplication is not necessary to access 2-D arrays if the array is
set up like most arrays in C, where each row is a typical vector and
the 2-D array is just a vector of pointers to each row vector. Then
double-indirection is necessary, rather than multiplication. I won't
say that's any better, but you don't *need* multiplication. (It might
not be so bad once the first two pointers are cached, and a good
programmer puts them in registers anyway). 
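
A sketch of that layout in C (make_matrix() is a made-up helper; error
handling is trimmed to keep it short):

    #include <stdlib.h>

    /* Build an n-by-m matrix as a vector of row pointers.  Each access
     * a[i][j] is then two indirections plus power-of-two scaling -- no
     * general multiply.  The rows need not even be contiguous. */
    double **make_matrix(int n, int m)
    {
            double **a = (double **) malloc(n * sizeof *a);
            int i;

            for (i = 0; i < n; i++)
                    a[i] = (double *) calloc(m, sizeof **a);
            return a;
    }

    /* usage:  double **a = make_matrix(100, 37);  a[5][7] = 3.14;  */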

David Kotz
-- 
ARPA:	dfk@cs.duke.edu
CSNET:	dfk@duke        
UUCP:	{ihnp4!decvax}!duke!dfk

tim@amdcad.AMD.COM (Tim Olson) (03/01/88)

In article <3530@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
| I've used a Pyramid 90x and an IBM RT. I've read papers on the AMD29000 and
| the MIPSco chip. I see no inherent reason for portable programs to not run on
| any of them, except possibly the 29000 (which lacks byte addressing in its
| native rendition).

Correction: the Am29000 has byte and halfword (16-bit) extract and insert
instructions to allow you to interface easily with word-only
addressable memories.

It also has explicit byte and halfword loads and stores (encoded in the
control field) to interface with memory that performs those accesses
directly. 
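
In C terms, what the extract and insert buy you is roughly this (a model
only; byte 0 is taken as the most significant byte here, and the function
names are made up):

    /* Get byte n of a 32-bit word fetched from word-only memory. */
    unsigned long get_byte(unsigned long word, int n)
    {
            return (word >> (8 * (3 - n))) & 0xff;
    }

    /* Replace byte n of a word, for a read-modify-write store. */
    unsigned long put_byte(unsigned long word, int n, unsigned long b)
    {
            int shift = 8 * (3 - n);

            return (word & ~((unsigned long) 0xff << shift))
                 | ((b & 0xff) << shift);
    }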

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

tim@amdcad.AMD.COM (Tim Olson) (03/01/88)

In article <25699@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
| There's something about RISC architectures in general that I find 
| confusing. Since they (read SPARC or equivalent) have no integer multiply
| instructions, any code which has a fair number of these is going to
| be slow.

What makes you think that a single integer multiply instruction would be
faster?  On CISC machines, these are normally expanded to a series of
multiply-step iterations (which most RISC machines have as an
instruction).  Only if you have architectural support for fast
multiplication (i.e. a large multiplier array) is a single multiply
instruction beneficial.

| This would include any program which had access to 2-D arrays
| since one must do multiplications (unless the array sizes are a convenient
| power of 2) to get the array indices right. Any code that accesses a[i][j]
| should run like a pig on such machines.

A multiply by a constant (as in the case of a 2-dimensional array
access in C) is almost always performed faster with an inline series of
shifts and adds than with a multiply.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

bs@linus.UUCP (Robert D. Silverman) (03/01/88)

In article <11199@duke.cs.duke.edu: dfk@duke.cs.duke.edu (David Kotz) writes:
:> There's something about RISC architectures in general that I find 
:> confusing. Since they (read SPARC or equivalent) have no integer multiply
:> instructions, any code which has a fair number of these is going to
:> be slow. This would include any program which had access to 2-D arrays
:> since one must do multiplications (unless the array sizes are a convenient
:> power of 2) to get the array indices right. ...
:> Bob Silverman
:
:Multiplication is not necessary to access 2-D arrays if the array is
:set up like most arrays in C, where each row is a typical vector and
:the 2-D array is just a vector of pointers to each row vector. Then

etc.


This isn't intended as a flame but your response is too 'language dependent'.
Not all code is written in a language that has pointers. There are many
numerical packages out there, written in FORTRAN (yech). It not only
doesn't have pointers, but it also stores things in column-major rather than
row-major order.

Please. We are discussing hardware, not software. Let's not introduce
language dependencies. Your solution also fails to address those numerical
problems that have multiplies in them.

Bob

edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (03/01/88)

> 
> Multiplication is not necessary to access 2-D arrays if the array is
> set up like most arrays in C, where each row is a typical vector and
> the 2-D array is just a vector of pointers to each row vector.

EXCUSE ME, but when I declare an array to be an n by m matrix in
C (float foo[n][m]) I get a contiguous block of memory.  The
representation IS NOT row-vector or column-vector.  So when I access
index number i,j someone has to perform the calculation 
base + sizeof(type)*(i*m+j).



-- 

Eddie Wyatt 				e-mail: edw@ius1.cs.cmu.edu

bcase@Apple.COM (Brian Case) (03/01/88)

In article <3530@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
>I've used a Pyramid 90x and an IBM RT. I've read papers on the AMD29000 and
>the MIPSco chip. I see no inherent reason for portable programs to not run on
>any of them, except possibly the 29000 (which lacks byte addressing in its
>native rendition).

Wait a minute, I beg to differ with "lacks byte addressing in its native
rendition."  There are byte and halfword (16-bit) insert and extract
operations.  These give you everything you need (yes, yes, I know with
arguable efficiency).  Now, if you meant arbitrary byte *alignment,*
then yeah, but the 29K isn't unique there.  Note that the Pyramid machines
are little endian (aren't they?).  This helps non-portable code become
portable.

hansen@mips.COM (Craig Hansen) (03/01/88)

In article <25699@linus.UUCP>, bs@linus.UUCP (Robert D. Silverman) writes:
> In article <284@scdpyr.UUCP: cruff@scdpyr.UUCP (Craig Ruff) writes:
> :In article <696@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
> :>From article <179@wsccs.UUCP>, by terry@wsccs.UUCP (terry):
> :>[ lots of self-congratulation about how portable his code is, followed
> :>  by complaints that it isn't portable to the SPARC ]
> :>
> :>> THE REASON:  Type-casting.  You can't.
> :>
> :>FLAME ON!    ( I love this! )
> :>
> :>WRONG.  It is demonstrably NON-portable code - it failed to port
> :>to a working compiler on a reasonable machine.  If the bloody
> :>unix kernel runs (and it does) your silly application should, too.

This of course pre-supposes that the SPARC architecture yields
reasonable machines. While Sun claims that the Sun4 is source-code
compatible with the Sun3, what that really means is that if it ports
to the Sun4, it was portable, and if it doesn't port, it wasn't
portable. It's ridiculous to claim the Sun4 machine is source-code
compatible when not _all_ software written for the Sun3 ports,
as "portable" code written for the Sun3 would port to many machines
besides the Sun4 anyway.

In fact, the SPARC architecture has a real problem with source-code
compatibility with the Sun3 machines - the alignment rules are
different between the 68020 and SPARC, and code that depends on
misaligned data is hard to port to SPARC. The MIPS architecture and
compiler system is in a better position to port such code because
efficient instructions are available to handle unaligned (32-bit)
words, and our compiler system can be set to use them in such code. We
also provide utilities that can help to pinpoint where in the program
these problems occur, and can also fix up references to such unaligned
pointers within an exception handler, as an aid to porting the code
quickly and then going back to tune the code later.

> There's something about RISC architectures in general that I find 
> confusing. Since they (read SPARC or equivalent) have no integer multiply
> instructions, any code which has a fair number of these is going to
> be slow. This would include any program which had access to 2-D arrays
> since one must do multiplications (unless the array sizes are a convenient
> power of 2) to get the array indices right. Any code that accesses a[i][j]
> should run like a pig on such machines. I've seen some benchmarks that
> suggest SUN-4's are in fact slower than SUN-3's on programs that do a 
> large amount of integer multiplies/divides. What good is a computer that
> can't multiply?

All RISC architectures are NOT created equal, particularly with respect
to integer multiply instructions. The MIPS R-Series processors have
explicit signed and unsigned integer multiply and divide instructions,
that are executed in special-purpose hardware. A 32-bit multiply takes
12 cycles, with up to 10 instructions that can be executed in parallel
with the multiply. We considered that to be a much superior solution
to multiply-step, which would have been slower and harder to
implement. (Multiply-step has too many operands and too many results.)

In many cases, 2-D arrays have sizes that are known at compile-time,
and so become multiplications by constants.  Multiplications by
constants can be handled efficiently by most RISC machines, but are
generally a little faster in cycle count on the HP "Precision"
architecture (I still think of it as Spectrum, but then I'm getting
old and set in my ways....), which has single-cycle shift-and-add
operations that are good for doing multiplies by constants that aren't
powers of 2.

The MIPS compiler picks either an explicit multiply operation or
software shift-and-add sequences, depending on the value and form
(variable vs constant) of the operands. The end result is that
multiplies are most often faster than the 12 cycle worst-case figure.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com   408-991-0234

hansen@mips.COM (Craig Hansen) (03/01/88)

In article <7507@apple.Apple.Com>, bcase@Apple.COM (Brian Case) writes:
> In article <3530@killer.UUCP> elg@killer.UUCP (Eric Green) writes:
> >I've used a Pyramid 90x and an IBM RT. I've read papers on the AMD29000 and
> >the MIPSco chip. I see no inherent reason for portable programs to not run on
> >any of them, except possibly the 29000 (which lacks byte addressing in its
> >native rendition).
> 
> Wait a minute, I beg to differ with "lacks byte addressing in its native
> rendition."  There are byte and halfword (16-bit) insert and extract
> operations.  These give you everything you need (yes, yes, I know with
> argueable efficiency).  Now, if you meant arbitrary byte *alignment,*
> then yeah, but the 29K isn't unique there.  Note that the Pyramid machines
> are little endian (aren't they?).  This helps non-portable code become
> portable.

Tim says that you can use either the insert and extract operations,
or, in a machine that has external hardware to provide byte addressing
in the memory system itself, you can use direct load and store
byte/halfword operations.  It would seem that if you wished, you could
provide unaligned load and store halfword/word operations with a great
deal of additional external hardware, except that there's no way to
handle items that cross page boundaries. [Not an issue if the MMU isn't
in use, of course.]

In any case, I have two questions:

1) In AMD's performance models, which memory model is used?
	For full clock rate systems, I don't see how you could
	reasonably build the direct partial-word addressed machine.
	[The MIPS processors perform the required shifting/extracting
	at full clock rate on the processor chip, and so directly
	handles partial-word operands without additional hardware.]

2) Which memory model does the compiler generate?
	What effect does this lack of architectural specificity
	have on software compatibility? I presume that I can't
	do anything to make code that uses direct partial-word
	addressing work on a machine that has only full-word addressing.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com   408-991-0234

jbs@eddie.MIT.EDU (Jeff Siegal) (03/01/88)

In article <25723@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>In article <11199@duke.cs.duke.edu: dfk@duke.cs.duke.edu (David Kotz) writes:
>:Multiplication is not necessary to access 2-D arrays if the array is
>:set up like most arrays in C, where each row is a typical vector and
>:the 2-D array is just a vector of pointers to each row vector. Then
>
>This isn't intended as a flame but your response is too 'language dependent'.
>Not all code is written in a language that has pointers. 

There is nothing language dependent about row vector representation of
2-d arrays.  It is an implementation tool (i.e. the compiler writer,
library writer, or software architecture designer decides such a
thing).  Once implemented in the system innards, the programmer need
not be concerned with it (i.e. you still reference a[x][y] or a[x,y]
or whatever).  Of course, C gives you the ability to represent your
"arrays" this way even when the built-in array representation doesn't.

Even without doing any pointer dereferencing, it is easy to avoid
doing any multiplication by having the compiler set up a table giving
the address offset of each row of the array.  So a[x][y] is found at
address:

a + offset[x] + y

>Please. We are discussing hardware, not software. Let's not introduce
>language dependencies. Your solution also fails to address those numerical
>problems that have multiplies in them.

True, there is no answer that is going to address these problems short
of specialized hardware.  I think it's pretty obvious that for
applications which primarily multiply, you want fast hardware support
for multiplication.  Machines without such support are compromising
multiply performance for better performance at (hopefully) more common
operations like register loads and stores, comparisons, adds, etc. 

Jeff Siegal

hankd@pur-ee.UUCP (Hank Dietz) (03/01/88)

In article <11199@duke.cs.duke.edu>, dfk@duke.cs.duke.edu (David Kotz) writes:
> > confusing. Since they (read SPARC or equivalent) have no integer multiply
> > instructions, any code which has a fair number of these is going to
> > be slow. This would include any program which had access to 2-D arrays
[stuff]
> Multiplication is not necessary to access 2-D arrays if the array is
> set up like most arrays in C, where each row is a typical vector and
> the 2-D array is just a vector of pointers to each row vector. Then
> double-indirection is necessary, rather than multiplication. I won't

Indirection isn't pretty in a lot of ways (because it is random
addressing unless you're very clever), but that's not the point.
The multiply for indexing is, first of all, often converted into
bumping pointers if you have a good optimizing compiler and,
secondly, unless you have a VERY FAST multiply, multiplying by
VIRTUALLY ANY compile-time CONSTANT (e.g., sizeof an element)
is faster using shift and add/subtract instructions.  For example, a
multiply by 7 (not a nice power of 2) is really shift to multiply
by 8 and then subtract the original number.  This sort of thing is
widely known by compiler writers -- references are available --
and the big disadvantage to it is that it may require the use of
more registers for temporaries.
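
In C, the by-7 case comes out to something like this (the compiler emits
the equivalent two or three instructions inline; the function wrapper is
just to show the expression):

    /* x * 7 rewritten as (x << 3) - x : a shift and a subtract. */
    int times7(int x)
    {
            return (x << 3) - x;
    }

    /* likewise  x * 10 == (x << 3) + (x << 1),
     *           x * 12 == (x + (x << 1)) << 2,  and so on. */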

jesup@pawl19.pawl.rpi.edu (Randell E. Jesup) (03/01/88)

In article <25699@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>There's something about RISC architectures in general that I find 
>confusing. Since they (read SPARC or equivalent) have no integer multiply
>instructions, any code which has a fair number of these is going to
>be slow.

	Ever noticed how long a multiply takes on a 68000?  And that's
only a 16x16=32 multiply!  The fact that it takes multiple instructions
to do a multiply doesn't mean it isn't a FAST multiply.  Almost all (or
maybe all) RISC machines have some multiply support in hardware to make
multiplies take a reasonably small amount of time.  The only way to do one
fast (on any chip) is to throw a giant array-multiplier at it, which takes
up an ungodly amount of chip area.  The result is RISC chips have things
like MSTEP instructions, that do 1 or 2 bits of a multiply, so they can
do a multiply in 16 or 32 cycles (approx).  The 68000 requires up to
70+ cycles to do a 16x16 multiply.
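
Modelled in C, one step per bit of the multiplier, the whole sequence is
just this loop (the real multiply-step does each iteration as a single
instruction, with the shifting done by the hardware):

    /* Unsigned 32x32 -> low 32 bits, one partial product per step. */
    unsigned long mstep_mul(unsigned long a, unsigned long b)
    {
            unsigned long prod = 0;
            int i;

            for (i = 0; i < 32; i++) {      /* one "step" per bit of b */
                    if (b & 1)
                            prod += a;
                    a <<= 1;
                    b >>= 1;
            }
            return prod;
    }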

>For people who want to use computers to COMPUTE, it appears that SPARC
>is a step in the wrong direction.

	SPARC may not be perfect, and a maxed out 68030 (or maybe even 020)
might be able to beat it on a benchmark or two, but in general it seems
like a win for most applications.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

tim@amdcad.AMD.COM (Tim Olson) (03/01/88)

In article <1721@mips.mips.COM> hansen@mips.COM (Craig Hansen) writes:
| 1) In AMD's performance models, which memory model is used?
| 	For full clock rate systems, I don't see how you could
| 	resonably build the direct partial-word addressed machine.
| 	[The MIPS processors perform the required shifting/extracting
| 	at full clock rate on the processor chip, and so directly
| 	handles partial-word operands without additional hardware.]

We use the "full-word load/store with extract/insert instructions"
model.

| 2) Which memory model does the compiler generate?

All of our compilers default to the above-mentioned model.  Pragmas in
the dpANS-tracking compiler (can't really say ANSI yet, can we? ;-)
allow code for the second model to be generated.

| 	What effect does this lack of architectural specificity
| 	have on software compatibility? I presume that I can't
| 	do anything to make code that uses direct partial-word
| 	addressing work on a machine that has only full-word addressing.

We have stated many times that one of the main design philosophies
behind the 29000 was to provide usable mechanisms, not dictate policies.
If you wish to run at the current 25MHz and add some external logic to
directly perform byte and half-word accesses, fine.  We provide those
hooks for you.  However, we felt that it was much more important to
speed up *all* memory accesses (by providing direct load-forwarding)
rather than slow all of them down through shift-mux, sign-or-zero extend
logic which most of the time isn't required.

To answer your question directly -- if you want to be portable, you use
the full-word, extract/insert model, since it works on all systems.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

tpmsph@ecsvax.UUCP (Thomas P. Morris) (03/01/88)

  In response to  a posting about RISC having "bad" performance due to
array indexing which might require heavy integer multiplication, David
Kotz points out:
> Multiplication is not necessary to access 2-D arrays if the array is
> set up like most arrays in C, where each row is a typical vector and
> the 2-D array is just a vector of pointers to each row vector. Then
> double-indirection is necessary, rather than multiplication. I won't
> say that's any better, but you don't *need* multiplication. (It might
> not be so bad once the first two pointers are cached, and a good
> programmer puts them in registers anyway). 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Two points: (1) a good programmer shouldn't have to put those pointers
into registers. That's what good compilers are for! ;-)  (2) If your
array elements are being accessed sequentially, a good _optimizing_
compiler ought to be replacing those array-index computations with
additively computed offsets from a base. Strength reduction, code
hoisting, and elimination or reduction of invariant computations can
do wonders for code!

Of course, a good programmer _would_ indicate to his/her "C" compiler that 
the indices or the pointers ought to go in registers, too!


-----------------------------------------------------------------------------
Tom Morris                                 BITNET: TOM@UNCSPHVX
UNC School of Public Health                UUCP  : ...!mcnc!ecsvax!tpmsph

        "It's a joke, boy! I say, it's a joke! Don't ya get it?"
                                                -Foghorn Leghorn
-----------------------------------------------------------------------------

firth@sei.cmu.edu (Robert Firth) (03/01/88)

In article <998@PT.CS.CMU.EDU> edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:

>EXCUSE ME, but when I declare an array to be an n by m matrix in
>C (float foo[n][m]) I get a contiguous block of memory.  The
>representation IS NOT row-vector or column-vector.  So when I access
>index number i,j someone has to perform the calculation 
>base + sizeof(type)*(i*m+j).

However, going by Knuth's statistics on array usage, more than 95%
of those multiplications occur in loops where the index is the
induction variable.  The optimisation called "strength reduction"
can then remove them.  If your target machine has a slow multiply,
such an optimisation is probably necessary.  How to do it is
explained in detail in many books, including eg Wulf's 'The Design
of an Optimising Compiler'.  The first Fortran compiler included
this optimisation, so it's been around for quite a while.
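
A small before-and-after in C shows the shape of it (the source stays in
the first form; the second is what the optimiser effectively executes):

    /* As written: one multiply per element to index a[i*m + j]. */
    long sum_plain(int *a, int n, int m)
    {
            long s = 0;
            int i, j;

            for (i = 0; i < n; i++)
                    for (j = 0; j < m; j++)
                            s += a[i * m + j];
            return s;
    }

    /* Strength-reduced: the i*m product becomes a running offset that
     * is bumped by m once per outer iteration -- additions only. */
    long sum_reduced(int *a, int n, int m)
    {
            long s = 0;
            int i, j, row;

            for (i = 0, row = 0; i < n; i++, row += m)
                    for (j = 0; j < m; j++)
                            s += a[row + j];
            return s;
    }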

firth@sei.cmu.edu (Robert Firth) (03/01/88)

In article <8332@eddie.MIT.EDU> jbs@eddie.MIT.EDU (Jeff Siegal) writes:

>There is nothing language dependant about row vector representation of
>2-d arrays.  It is an implementation tool (i.e. the compiler writer,
>library writer, or software archetecture designer deside such a
>thing).  Once implemented in the system innards, the programmer need
>not be concerned with it...

Jeff, I find this an interesting assertion, and one I'd dearly like
to believe.

However, my reading of ANSI X3.9-1978, especially Section 5, on arrays,
leads me to conclude that array representation in column-major form,
and array access by chain multiplication-&-addition of subscripts, is
the only feasible implementation choice.

If you have a design that does everything right, and allows access by
chained indirection-&-offsetting, then I'd be most interested in seeing it.

scottg@hpiacla.HP.COM (Scott Gulland) (03/02/88)

> / hpiacla:comp.arch / terry@wsccs.UUCP (terry) /  8:08 pm  Feb 22, 1988 /
> 
> THE REASON:  Type-casting.  You can't.  Small programs seem to, but it doesn't
> work.  Bytes tend to be word aligned.  Other messy stuff.  . . .
> 
>                         . . .   , but sufficiently varied to allow me to go
> off and have the assembler impliment enough macro's that my compiler thinks
> it's running on a 680x0.  ----Oh rats!  If I can't have that, at LEAST my
> portable C compiler should be.  Sun must have some good people to be able
> to have ported a semblance of UNIX to this thing.

I think that what you have run into is porting from a machine with very
loose data alignment requirements to one with very strict data alignment
requirements, probably in combination with a marginal compiler.  The 680x0 
and 286/386 class of machines have very loose data alignment requirements. 
Integers, floats, pointers can be placed on any byte address boundary and 
still work.  RISC machines, however, may require integers, floats, pointers,
etc. to be placed on a 4-byte address, doubles on an 8-byte address, etc.
If you then take a short and type cast it to a float or int, there is a
50% chance that the data does not sit on the correct byte address boundary,
and so your program aborts.  This can also happen if the C compiler does not
ensure correct data alignment within structures and unions.
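
A minimal example of the kind of cast that gets away with it on a 680x0 or
386 but traps on a strict-alignment machine (buf and the function names
are invented; a 32-bit long is assumed):

    char buf[64];      /* filled by a read(); records start at any byte */

    /* Fine wherever any byte boundary is legal; if buf+off is not
     * 4-byte aligned, a strict-alignment machine faults right here. */
    long get_word(int off)
    {
            return *(long *) (buf + off);
    }

    /* Alignment-safe version: gather the bytes explicitly (this one
     * also pins down the byte order -- big-endian here). */
    long get_word_safe(int off)
    {
            unsigned char *p = (unsigned char *) buf + off;

            return ((long) p[0] << 24) | ((long) p[1] << 16)
                 | ((long) p[2] <<  8) |  (long) p[3];
    }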

**************************************************************************
* Scott Gulland	            | {ucbvax,hplabs}!hpda!hpiacla!scottg [UUCP] *
* Indus. Appl. Center (IAC) | scottg@hpiacla                      [SMTP] *
* Hewlett-Packard Co.       | (408) 746-5498                      [AT&T] *
* 1266 Kifer Road           | 1-746-5498                     [HP-TELNET] *
* Sunnyvale, CA  94086  USA | "What If..."                [HP-TELEPATHY] *
**************************************************************************

baum@apple.UUCP (Allen J. Baum) (03/02/88)

--------
[]
RE: needing multiplies in array accesses
Accessing arbitrary array elements is not extremely common. Most of the
array references are inside loops, and use the loop induction variables as
part of the subscript expression. A good compiler will strength reduce these
multiply operations and turn them into add/subtract immediates. Even if this
is not possible, the multiply operation is generally reg*imm, which reduces
to a shift in the common cases (power of two array sizes are pretty common!),
or a SMALL series of shifts and adds.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

csg@pyramid.pyramid.com (Carl S. Gutekunst) (03/02/88)

In article <7507@apple.Apple.Com> bcase@apple.UUCP (Brian Case) writes:
>Now, if you meant arbitrary byte *alignment,* then yeah, but the 29K isn't
>unique there. Note that the Pyramid machines are little endian (aren't they?).

No, the Pyramid CPU is big-endian, like the 68020. Alignment is restricted to
natural boundaries, except that doubles need only be 32-bit aligned. Binary
operations are always performed on 32-bit or 64-bit objects, but the load and
store instructions can reference 8-bit, 16-bit, or 32-bit words; loads can be
with or without sign-extension. An unaligned version of the Pyramid was done
as a special for HHB Systems, many eons ago; it needed an extra clock tick per
memory reference to rotate the word into place. Some odd timings happened too
if part of the load was in cache and part caused a cache miss.

I find all the flamage about aligned operations in the SPARC amusing; Pyramid
went through the same flamage some years ago. Having grown up on machines that
forced alignment, I thought the objections were pretty silly then. Still think
they're silly now.

<csg>

bcase@Apple.COM (Brian Case) (03/02/88)

In article <7649@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>The multiply for indexing is, first of all, often converted into
>bumping pointers if you have a good optimizing compiler and,
>secondly, unless you have a VERY FAST multiply, multiplying by
>VIRTUALLY ANY compile-time CONSTANT (e.g., sizeof an element)
>is faster using shift and add/subtract instructions.  For example, a
>multiply by 7 (not a nice power of 2) is really shift to multiply
>by 8 and then subtract the original number.  This sort of thing is
>widely known by compiler writers -- references are available --
>and the big disadvantage to it is that it may require the use of
>more registers for temporaries.

Yes, you might need more registers.  Plus, now that you have computed
the multiply by 8, it is possible to reuse this computation later if
it is needed.  This is one of the easiest ways to see that lots of
registers, three-address operations, and a smart compiler are good.
I've said it before and I'll say it again:  RISC is both "hardware with
a very short cycle time" and a philosophy:  "reuse rather than recompute."
Note that on a RISC machine, a memory reference can be a significant
computation.

rbbb@acornrc.UUCP (David Chase) (03/02/88)

It seems that people out there have invented a new meaning for RISC.
You can read about the old meaning in (for example) "Reduced Instruction
Set Computers", an IEEE Tutorial by William Stallings.

Comments like 

> Please. We are discussing hardware, not software.

and 

>         I am all for RISC machines when reasonably implimented.  My idea of
> RISC is an instruction set that is sufficently small to allow the
> manufacturer
> to call it RISC and not get sued, but sufficiently varied to allow me to go
> off and have the assembler impliment enough macro's that my compiler thinks
> it's running on a 680x0.  ----Oh rats!  If I can't have that, at LEAST my
> portable C compiler should be.

demonstrate how misunderstood RISC is.  To quote from the above reference,

  A vital element in achieving high performance on a RISC system is an 
  optimizing compiler.

When you have a high-quality register allocator, instruction scheduler, loop
invariant code mover, strength reducer, and constant propagater 
implemented in assembler macros, let me know (I'm morbidly curious).  It may
be the case that these optimizations aren't common in C compilers, but so it
goes.  You are certainly free to learn about them and implement them in a
compiler and sell it to the world.

Understand also that (as has been noted by other posters) complaining about
the slowness of a rare operation is silly; RISC designers HAVE done 
instruction mix profiles, and try to make their compiler+machine combinations
fast for those mixes.  Of course, with a stupid compiler there will be a
lot more multiply operations.  That's why you can't exclude software from
this discussion.

On an unrelated note, the Acorn RISC Machine appears to do its shift-and-OP
instructions without overhead provided that the shift is constant (that is,
not obtained from a register).  A second pair of odd features (that I'm
sure will be the subject of furious discussion for the next week or so) is
(1) a SET-CC bit--if 0, then the condition code is left unmodified and (2)
a condition flag for every instruction (not just branches).  (The reference
for this is the A.R.M CPU Software Manual, Version 1.00.  Since it is over
two years old things have undoubtedly changed slightly.)  When you combine
all these you get very compact loops for (general) multiplies, byte-swapping,
and division (division is not as compact as the other two).

David Chase
Olivetti Research Center, Menlo Park

rgh@mtund.ATT.COM (Ronald Hiller) (03/02/88)

In article <25699@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>There's something about RISC architectures in general that I find 
>confusing. Since they (read SPARC or equivalent) have no integer multiply
>instructions, any code which has a fair number of these is going to
>be slow.
more complaints about lack of multiply instruction and effect on accessing
2-D arrays.

It turns out that shift/add (or shift/add/subtract) sequences are really
quite fast when multiplication by a constant is required.  As an example, 
on an 8086 (I know...I shouldn't compare it with SPARC!!), you can nearly
always do a multiply by a constant faster using the shift add sequence then
using the multiply instruction.

Ron

dik@cwi.nl (Dik T. Winter) (03/02/88)

In article <7482@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
 > Accessing arbitrary array elements is non extremely common. Most of the
 > array references are inside loops, and use the loop induction variables as
 > part of the subscript expression. A good compiler will strength reduce these
 > multiply operations and turn them into add/subtract immediates. Even if this
 > is not possible, the multiply operation is generally reg*imm, which reduces
 > to a shift in the common cases (power of two array sizes are pretty common!),
 > or a SMALL series of shifts and adds.
 > 
I agree with most (CDC Cyber does not have a true integer multiply either).
I disagree with the last: when you start programming on supers you will soon
learn that power-of-two array sizes are the worst choice you can make.
Even if nominally your array size ought to be a power of two, make it one
more in the declaration to prevent degradation of performance.
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

mcdonald@uxe.cso.uiuc.edu (03/02/88)

>Anyone remember the CADET? It was, I believe, the IBM 1630 and stood for
>'Can't Add, Doesn't Even Try'. It did its addition by table lookup.

IBM 1620

It also used table lookup for multiplies.

The table was at a fixed location in memory (100, I seem to 
remember) and had to be read in at boot time by the card reader.

This machine used 2N404 GERMANIUM!!! transistors.

It normally ran at 125 kHz, but it had a special circuit so that when the
(hardware) instruction to write to its typewriter occurred the clock
slowed down to 10 Hz - driven by a cam on the motor shaft!!!!!!!!!!!

It was fully, absolutely, DECIMAL.

It was variable "word" length, up to 20000 decimal digits!!!!!!!!!

Doug McDonald

alan@pdn.UUCP (Alan Lovejoy) (03/03/88)

In article <25699@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>There's something about RISC architectures in general that I find 
>confusing. Since they (read SPARC or equivalent) have no integer multiply
>instructions, any code which has a fair number of these is going to
>be slow. This would include any program which had access to 2-D arrays
>since one must do multiplications (unless the array sizes are a convenient
>power of 2) to get the array indices right. Any code that accesses a[i][j]
>should run like a pig on such machines. I've seen some benchmarks that
>suggest SUN-4's are in fact slower than SUN-3's on programs that do a 
>large amount of integer multiplies/divides. What good is a computer that
>can't multiply?

Given

  VAR

    a: ARRAY RowIndex OF ARRAY ColumnIndex OF Element;

If the compiler's code generator (or the assembly hack) stores the
ADDRESS of each row at a[row], so that a[row][column] is actually
a[row]^[column] (but the rows are stored on the stack, not the heap),
then a[row][column] not only does not require multiplication, but
will be FASTER than multiplication.                                 

Of course, a[MIN(RowIndex)..MAX(RowIndex)] would need to be initialized
(possibly using multiplication) once each time the block in which "a" is 
declared is entered.         

This scheme takes more memory.  Good, fast, cheap:  choose any two.

On the other hand, fine-tuning a computer architecture so that it gets
the best possible performance executing "average" code is guaranteed to
result in poor performance in some cases when executing decidedly
unaverage code.  The question is, to what degree should an architecture
cater to pathological cases at the expense of normal ones (or vice
versa)?  The answer, of course, depends on who will use the architecture
to do what.  And people won't agree on the answer, which is why there 
are so many different computer architectures in the world.

Which brings me to my main point:  don't judge RISC on the
characteristics of SPARC.  There are other ways to skin a cat,
and other cats to skin.  Some RISCs do have hardware multiply.

--alan@pdn

barmar@think.COM (Barry Margolin) (03/03/88)

In article <7514@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <7482@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
> > (power of two array sizes are pretty common!),

For scientific programming (the major application of supercomputers),
is this really true?  What scientific applications naturally map onto
power-of-two arrays?  I suspect that making arrays a power of two in
size is a habit mostly of systems programmers, who know to make things
fit neatly into memory pages in order to reduce paging.

>I agree with most (CDC Cyber does not have a true integer multiply either).
>when you start programming on supers you will soon
>learn that power of two array sizes are the worse choice you can make.

Could you explain why this is so?  Maybe there are architectures where
it doesn't make a difference, but I can't imagine why power of two
would be WORSE than other sizes.

Barry Margolin
Thinking Machines Corp.

barmar@think.com
uunet!think!barmar

daveb@geac.UUCP (David Collier-Brown) (03/03/88)

In article <4400@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes:
[discussion about array representation, and the hardware support
appropriate to multiplicative addressing]

>However, my reading of ANSI X3.9-1978, especially Section 5, on arrays,
>leads me to conclude that array representation in column-major form,
>and array access by chain multiplication-&-addition of subscripts, is
>the only feasible implementation choice.

  My reading of the ANSI proposal leads me to the same conclusion.
This is both good (it eliminates an ambiguity) and bad (it removes
the freedom of a compiler-writer to do what is best applicable to
her architecture).
  Could someone who is more of a language-lawyer comment on this
interpretation?  Was it in fact the intention of the committee?

--dave (redirected to comp.lang.c) c-b
-- 
 David Collier-Brown.                 {mnetor yunexus utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind) 
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

bs@linus.UUCP (Robert D. Silverman) (03/03/88)

In article <17415@think.UUCP: barmar@fafnir.think.com.UUCP (Barry Margolin) writes:
:In article <7514@boring.cwi.nl: dik@cwi.nl (Dik T. Winter) writes:
::In article <7482@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
:: > (power of two array sizes are pretty common!),
:
:For scientific programming (the major application of supercomputers),
:is this really true?  What scientific applications naturally map onto
:power-of-two arrays?  I suspect that making arrays a power of two in
:size is a habit mostly of systems programmers, who know to make things
:fit neatly into memory pages in order to reduce paging.
:
::I agree with most (CDC Cyber does not have a true integer multiply either).
::when you start programming on supers you will soon
::learn that power of two array sizes are the worse choice you can make.
:
:Could you explain why this is so?  Maybe there are architectures where
:it doesn't make a difference, but I can't imagine why power of two
:would be WORSE than other sizes.
:
:Barry Margolin
:Thinking Machines Corp.
:
:barmar@think.com
:uunet!think!barmar
 
Somewhat simplified explanation:

This is relatively easy to explain. When doing vectorization on arrays
whose length is a power of two, memory interleaving conflicts occur.
You find that instead of grabbing different elements of an array from
different banks you try to grab multiple elements from the SAME banks.
This defeats the purpose of interleaving and slows down vectorization.
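
A toy illustration (16 banks assumed, one word per bank slot, row-major
storage; not any particular vendor's memory system):

    #include <stdio.h>

    #define NBANKS 16

    /* Print which bank each of the first few elements of a column hits,
     * for a row-major array with the given row length. */
    void show_banks(int rowlen)
    {
            int i;

            printf("row length %3d:", rowlen);
            for (i = 0; i < 8; i++)
                    printf(" %2d", (i * rowlen) % NBANKS);
            printf(" ...\n");
    }

    int main(void)
    {
            show_banks(512);   /* 0  0  0  0 ...  every reference, same bank */
            show_banks(513);   /* 0  1  2  3 ...  references spread over all */
            return 0;
    }

Which is exactly why the pad-the-dimension-by-one trick mentioned earlier
in this thread works.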

Bob Silverman

lisper@yale.UUCP (Bjorn Lisper) (03/04/88)

In article <17415@think.UUCP> barmar@fafnir.think.com.UUCP (Barry Margolin) writes:
>In article <7514@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>>In article <7482@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>> > (power of two array sizes are pretty common!),
>
>For scientific programming (the major application of supercomputers),
>is this really true?  What scientific applications naturally map onto
>power-of-two arrays?  I suspect that making arrays a power of two in
>size is a habit mostly of systems programmers, who know to make things
>fit neatly into memory pages in order to reduce paging.

FFT.

Bjorn Lisper

fouts@orville.nas.nasa.gov (Marty Fouts) (03/04/88)

In article <17415@think.UUCP> barmar@fafnir.think.com.UUCP (Barry Margolin) writes:
>>In article <7482@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>> > (power of two array sizes are pretty common!),
>
>For scientific programming (the major application of supercomputers),
>is this really true?  What scientific applications naturally map onto
>power-of-two arrays?  I suspect that making arrays a power of two in
>size is a habit mostly of systems programmers, who know to make things
>fit neatly into memory pages in order to reduce paging.
>

Much scientific programming takes the form of solving partial
differential equations using finite difference approximations of the
initial equation.  By selecting the differencing delta, the programmer
controls the problem size, so there is really a range of sizes which
work.   The lower limit is the minimum necessary to achieve some kind
of numerical stability in the result.  The upper limit is the amount
of computer time available/desirable for solving the problem.

On many vector machines, the use of multiple memory banks to increase
apparent memory bandwidth dictates that arrays not be powers of two in
all dimensions, in order to avoid bank conflict when marching through
the array.  Some computations are done with arrays in which bounds are
arbitrarily set to relatively prime values.

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/05/88)

An article by daveb@geac.UUCP (David Collier-Brown) says:
] In article <4400@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes:
] [discussion about array representation, and the hardware support
] appropriate to multiplicative addressing]
] 
] >However, my reading of ANSI X3.9-1978, especially Section 5, on arrays,
] >leads me to conclude that array representation in column-major form,
] >and array access by chain multiplication-&-addition of subscripts, is
] >the only feasible implementation choice.
] 
]   My reading of the ANSI proposal leads me to the same conclusion.
] This is both good (it eliminates an ambiguity) and bad (it removes
] the freedom of an compiler-writer to do what is best applicable to
] her architecture).
] -- 
]  David Collier-Brown.                 {mnetor yunexus utgpu}!geac!daveb

Why is it we have to reinvent the wheel so often ? N-dimensional
arrays, whether column or row major order, can always be accessed
without run-time multiplies. And can still be in a contiguous block of
memory. The method for doing this is at least 14 years old : I heard
about it in 1974.

An example will illustrate. BTW, this is NOT the most optimized
algorithm, but they are simple permutations of this.

Given an array that is N1 by N2 by N3 et cetera. First, allocate
your block of memory for the N1*N2*N3*etc elements of the array. Then,
for each dimension EXCEPT the minor one (in this example, Nn, but it
could be N1 or any Nx ) allocate space for a vector of words with a number
of elements equal to the cardinality of that dimension. Then simply
fill in these vectors with (index - 1) * (whatever multiplier is used
for this dimension in the multiplication-&-addition method), for each
index in 1..Cardinality(Nx).  Note that this is either done at compile
time, or ONCE when the array is allocated, and uses very little
multiplication.

Now, when you need to access ARRAY[a,b,c,...,n] you get the address
by using ARRAY_BASE + N1VEC[a] + N2VEC[b] + N3VEC[c] ... + n.
No multiplies.

This, I understand, is the ORIGINAL meaning of the term "VECTORIZED
ARRAY". Here's a concrete example :

  type FAST_ARRAY is array[1..6,1..4,1..7] of WHATEVER ;

produces :

 type MEMBLOCK is array [ 6*4*7 ] of WHATEVER ;
 N1VEC : constant array [1..6] of INTEGER := ( 0, 28, 56, 84, 112, 140 ) ;
 N2VEC : constant array [1..4] of INTEGER := ( 0, 7, 14, 21 ) ;

 PRAGMA INLINE( GET_FAST_ARRAY, SET_FAST_ARRAY ) ;

 function GET_FAST_ARRAY( M : MEMBLOCK; A,B,C : INTEGER ) return WHATEVER is
 begin
    return M( N1VEC( A ) + N2VEC( B ) + C ) ;
 end GET_FAST_ARRAY ;

 procedure SET_FAST_ARRAY( M : in out MEMBLOCK; a,b,c : INTEGER;
			   INPUT : WHATEVER ) is
 begin
   M( N1VEC( A ) + N2VEC( B ) + C ) := INPUT ;
 end SET_FAST_ARRAY ;

This is just an algorithmic description; I haven't compiled it.
But you get the idea. And this will work for column-major,
row-major, 3rd-dimension-major, whatever.

It's essentially just having a look-up table for frequently used multiplies.
--
    Dennis O'Connor			      oconnor%sungod@steinmetz.UUCP
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

jbuck@epimass.EPI.COM (Joe Buck) (03/05/88)

In article <998@PT.CS.CMU.EDU> edw@IUS1.CS.CMU.EDU (Eddie Wyatt) writes:
>> Multiplication is not necessary to access 2-D arrays if the array is
>> set up like most arrays in C, where each row is a typical vector and
>> the 2-D array is just a vector of pointers to each row vector.
>EXCUSE ME, but when I declare an array to be an n by m matrix in
>C (float foo[n][m]) I get a contiguous block of memory.  The
>representation IS NOT row-vector or column-vector.  So when I access
>index number i,j someone has to perform the calculation 
>base + sizeof(type)*(i*m+j).

No, the compiler doesn't have to do things that way.  It can create
an extra row of pointers to the beginning of each row:  p_foo[n],
where p_foo[i] contains &foo[i][0].  Then foo[i][j] is at *p_foo[i] +
j.  No multiply required, but there might be a shift or two to
accommodate two-byte or four-byte array elements.  The old Fortran
compiler for PDP-11's running RT-11 had an option to do this:  you're
trading off speed for space.  They called it "array vectoring".  The
array itself is still allocated the way you think it is, but there
are extra pointers.

-- 
- Joe Buck  {uunet,ucbvax,sun,<smart-site>}!epimass.epi.com!jbuck
	    Old Internet mailers: jbuck%epimass.epi.com@uunet.uu.net

edw@IUS1.CS.CMU.EDU (Eddie Wyatt) (03/06/88)

   First comment about induction on loop variables.  Even if 95% of
array usage fits into this category, not all those applications
may use it (or at least not without some serious dataflow analysis).
As soon as a procedure or function call appears within the body
of a loop, one can no longer assume nice behaviour of induction
variables or base address.  I'm referring to the problems of aliasing.

> 
> No, the compiler doesn't have to do things that way.  It can create
> an extra row of pointers to the beginning of each row:  p_foo[n],
> where p_foo[i] contains &foo[i][0].  Then foo[i][j] is at *p_foo[i] +
> j.  No multiply required, but there might be a shift or two to
> accomodate two-byte or four-byte array elements.  The old Fortran
> compiler for PDP-11's running RT-11 had an option to do this:  you're
> trading off speed for space.  They called it "array vectoring".  The
> array itself is still allocated the way you think it is, but there
> are extra pointers.

   Interesting thought there - replace the representation.  It may 
have been easy to do for Fortran, but it seems to me that C may
have some features that wouldn't allow it (or at least make it
a pain in the ass).  One is the interchangeability of pointer and
array.  Currently this is permissible (and easily done) because
the reps are the same - pointers to the data blocks.  Therefore
an array may be passed to a function that expects a pointer (and still
have reasonable things happen).  If an optional rep is used,
the problem of parameter passing becomes determining which rep is
intended to be used.  Since function descriptions currently
are not required to be within scope of the call, a correct choice
cannot be made.


-- 

Eddie Wyatt 				e-mail: edw@ius1.cs.cmu.edu

crick@bnr-rsc.UUCP (Bill Crick) (03/12/88)

The comment about good programmers putting the array pointers in 
registers is probably moot. Don't most modern(?) compilers ignore
the register directive and do their own allocation?

bpendlet@esunix.UUCP (Bob Pendleton) (03/17/88)

From article <9788@steinmetz.steinmetz.UUCP>, by oconnor@sungoddess.steinmetz (Dennis M. O'Connor):
> An article by daveb@geac.UUCP (David Collier-Brown) says:

> Why is it we have to reinvent the wheel so often ? N-dimensional
> arrays, whether column or row major order, can always be accessed
> without run-time multiplies. And can still be in a continuous block of
> memory. The method for doing this is at least 14 years old : I heard
> about it in 1974.

If memory serves, the ancient FORTRAN-5 compiler on the Univac 1108
did arrays this way. The old beast had hardware support for subscripting
arrays using index vectors.

This means that the idea goes back into the late '50s at least.

-- 
Bob Pendleton @ Evans & Sutherland
UUCP Address:  {decvax,ucbvax,ihnp4,allegra}!decwrl!esunix!bpendlet
Alternate:     {ihnp4,seismo}!utah-cs!utah-gr!uplherc!esunix!bpendlet
        I am solely responsible for what I say.

chasm@killer.UUCP (Charles Marslett) (03/17/88)

In article <641@bnr-rsc.UUCP>, crick@bnr-rsc.UUCP (Bill Crick) writes:
> The comment aboutgood programmers putting the array pointers in 
> registers is probably moot. Don't most modern(?) compilers ignore
> the register directive and do their own allocation?


If my experience is any indication, register usage in modern micro C compilers
is drifting toward the golden ideal of 1970-era Fortran compilers (IBM's
Fortran H for example) -- Microsoft C 5.1 seems to completely ignore register
declarations except for generating some warnings if they are used incorrectly
and earlier beta copies of 5.1 seemed to have real problems with having more
than 4 or 5 register declarations (couldn't juggle its own register allocation
and the programmer's too!).  At least one of the 68000 compilers I used a
couple of years ago also had the characteristic that if it needed a register
(for its optimization) it would take it from your set of register variables
and might (sometimes) let you use the register elsewhere in the code.  This
is a much harder task for the 68000 since there are more candidates for
register variables and optimization sequences.

I guess my point is, if the demand for good code generation from compilers
wins out over fast debugging cycles for compiler company resources, we
may reach the stage where the author of the program need not worry about
the architecture of the underlying machine when writing code because the
compiler writer is optimizing generic-CPU code well enough that he can
ignore the application writer's optimizations.  I hope we get there before
everyone gets sidetracked to more "interesting" or "marketable" features
in compilers -- whatever they might be in 1992.

                                         ========================
Charles Marslett                         |Said the iconoclast:  |
chasm@killer.UUCP                        | This is better?      |
                                          ========================

fouts@orville.nas.nasa.gov (Marty Fouts) (03/18/88)

In article <3723@killer.UUCP> chasm@killer.UUCP (Charles Marslett) writes:
>
>I guess my point is, if the demand for good code generation from compilers
>wins out over fast debugging cycles for compiler company resources, we
>may reach the stage where the author of the program need not worry about
>the architecture of the underlying machine when writing code because the
>compiler writer is optimizing generic-CPU code well enough that he can
>ignore the application writer's optimizations.  I hope we get there before
>everyone gets sidetracked to more "interesting" or "marketable" features
>in compilers -- whatever they might be in 1992.
>

I used to hope this too, until I started writing code which runs on a
wide variety of architectures.  I suspect that the best the compiler
will be able to do (which is actually quite good) is fairly local
optimization in the sense of changes which implement an equivalent
algorithm to that expressed in the original program, rather than
solving an equivalent problem, but perhaps with a different algorithm.

For example, there are two basic algorithms for finding all of the
primes less than some integer.  One can apply the sieve algorithm
explicitly by marking all of the multiples of each prime as it becomes
known, or one can divide each candidate integer by all primes less
than the square root of the integer, looking for an exact divisor.  On
most of the machines I've coded these algorithms on, the explicit
sieve is a major win because counting is much faster than division.
However, on the Connection Machine, the division algorithm is faster.
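
For concreteness, the two algorithms in miniature (C; the fixed limit, and
dividing by every smaller number rather than only by the primes already
found, are simplifications):

    #include <string.h>

    #define LIMIT 1000

    /* Sieve: cross off the multiples of each prime -- mostly counting. */
    int count_primes_sieve(void)
    {
            static char is_prime[LIMIT];
            int i, j, count = 0;

            memset(is_prime, 1, sizeof is_prime);
            is_prime[0] = is_prime[1] = 0;
            for (i = 2; i < LIMIT; i++)
                    if (is_prime[i]) {
                            count++;
                            for (j = i + i; j < LIMIT; j += i)
                                    is_prime[j] = 0;
                    }
            return count;
    }

    /* Trial division: divide each candidate by the smaller numbers. */
    int count_primes_division(void)
    {
            int n, d, count = 0;

            for (n = 2; n < LIMIT; n++) {
                    for (d = 2; d * d <= n; d++)
                            if (n % d == 0)
                                    break;
                    if (d * d > n)
                            count++;
            }
            return count;
    }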

Although I expect various compilers to optimize the loop invariant
constructs out of my sieve, I would be surprised if any of them
converted it into a division finder, even on the machines where that
was the better approach.

I don't think we'll reach the point where authors never need to worry,
but I do think we'll reach the point where once the machine is chosen,
the author can depend on the compiler to get the best performance out
of a particular choice of algorithm.  It then becomes the author's
problem to figure out which algorithm has the best chance of winning
on the class of machines the program is intended for.