[comp.sys.m88k] Information wanted on m88000 Risc workstations

amull@Morgan.COM (Andrew P. Mullhaupt) (01/05/90)

Please send me any information, experience, sources, tips
you may have regarding the m88000/i80386 combination systems
such as the Opus Personal Mainframe 8120, Everex Step 8825,
or other similar systems. 

Thanks in Advance
Andrew Mullhaupt
Morgan Stanley & Co., Inc.
1251 Ave. Americas
New York, NY 10020

(USA) (212)-703-6948

rfg@ics.uci.edu (Ron Guilmette) (01/07/90)

In article <641@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>Please send me any information, experience, sources, tips
>you may have regarding the m88000/i80386 combination systems
>such as the Opus Personal Mainframe 8120, Everex Step 8825,
>or other similar systems. 

Coincidentally, there is a write-up about such systems in the January 1990
issue of "MIPS" magazine (soon to be "Personal Workstation" magazine?).

I haven't read it in detail yet, but there is also a separate article
on page 56 ("Great Performers") where some benchmarks of different
types of current hardware offerings are given, along with price/
performance evaluations.

The bottom line?  Three of the five "Best Performers" in the category
called "UNIX Workstations" are based on the 88000 (including the top
two slots).  In the "Best Price/Performance" list, the top 4 entries
are all based on the 88000.

Two items worthy of note from the "Best Price/Performance" list:

	The least expensive item on the list is the Data General
	AViiON workstation (even less expensive than the 386 add-ins).

	The DG AViiON has far and away the best single-precision
	Whetstone performance, and it has much better double-precision
	performance than any of the 386 add-ins.  This fact could be
	critical if you plan on doing any graphics or other numerically
	intensive computation.

Anybody who is now considering buying a "hot-box" would be well advised
to have a look at this article before making a final choice.

// rfg

amull@Morgan.COM (Andrew P. Mullhaupt) (01/07/90)

In article <25A64468.11498@paris.ics.uci.edu>, rfg@ics.uci.edu (Ron Guilmette) writes:
> Coincidently, there is a write up about such systems in the January 1990
> issue of "MIPS" magazine (soon to be "Personal Workstation" magazine?).

No coincidence. I got interested in these boxes entirely due to that
article (and the Byte magazine 'first look' at the portable Opus).

Now the performance of these toys is really wild. The only real
difficulties I have are:

1. We have extensive need for Berkeley extensions in our software.
We also use Sun's memory mapped files a whole lot (a sketch of the
sort of usage we depend on follows after point 2). The System V
alternative (shared memory) is OK, but we're pretty leery of any
System V that isn't practically Release 4. Can I get close enough to
SunOS with an Aviion (Everex 8825, Opus 8120, etc.)? If I can,
I may very well get one.

2. That ratio of Megaflops to MIPS sucks. Let me rephrase this: given
that the 88000 is the only RISC chip with on-chip floating-point support,
you've got to wonder why it ends up being (relatively) so
slow. Can you get an FPA for it? On the systems with the combined
88000/80386 CPUs, can you hang a quick Cyrix off the 80386, or a Weitek
3167? Or can you put a 4167 on the 88000? Does Motorola have some
kind of remedy for those of us who like the looks of those soon-to-be-
announced 486/860 systems, which will scream for floating point?
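
Back to point 1 for a second, to be concrete about what I mean by
"memory mapped files": the usage we depend on is roughly the sketch
below. The helper name is made up, and SunOS 4 actually declares
mmap() with caddr_t and returns (caddr_t)-1 on failure, so take the
details with a grain of salt.

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map an entire file read-only and return a pointer to it,
     * or 0 on failure.  Hypothetical helper, for illustration only. */
    char *
    map_file(const char *path, off_t *lenp)
    {
        int fd;
        struct stat st;
        char *p;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            return 0;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return 0;
        }
        *lenp = st.st_size;
        p = (char *) mmap(0, (size_t) st.st_size,
                          PROT_READ, MAP_SHARED, fd, (off_t) 0);
        close(fd);              /* the mapping survives the close */
        return (p == (char *) -1) ? 0 : p;
    }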

Let me make one sobering point here for those who yet fail to
apprehend the need for double precision arithmetic outside of pure
engineering and scientific work: it turns out that you cannot make
the obvious split adjustments to stock prices in a portfolio in
single precision, once a position in a stock can be over a hundred
thousand dollars, without getting some unfortunate round-off effects
which could, if you didn't catch them first, lead you to violate
exchange rules or misfile taxes. The cost of such a mistake (which
would be measured in your job) is large compared to whatever that
factor of two in speed turns out to be worth in machines.
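
To see the effect without any market data, here is a toy illustration
(the numbers are made up; the point is just that once a position is
much over a hundred thousand dollars, single precision cannot even
hold the cents):

    #include <stdio.h>

    int main(void)
    {
        /* A made-up position value, a bit over $100,000, to the cent.
         * A 24-bit float mantissa cannot hold it; the nearest float
         * is 150000.015625. */
        float  pos_f = 150000.01f;
        double pos_d = 150000.01;

        /* A (hypothetical) 3-for-2 split adjustment and its reversal. */
        printf("single: %.2f\n", pos_f / 1.5f * 1.5f);  /* 150000.02 */
        printf("double: %.2f\n", pos_d / 1.5  * 1.5);   /* 150000.01 */
        return 0;
    }

An off-by-a-cent position is exactly the kind of thing that gets you
in trouble with the exchange or the tax people.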

> The bottom line?  Three of the five "Best Performers" in the category
> called "UNIX Workstations" are based on the 88000 (including the top
> two slots).  In the "Best Price/Performance" list, the top 4 entries
> are all based on the 88000.
> 
> Two items worthy of note from the "Best Price/Performance" list:
> 
> 	The least expensive item on the list is the Data General
> 	AViiON workstation (even less expensive than the 386 add-ins).
> 
> 	The DG AViiON has far and away the best single-precision
> 	Whetstone performance, and it has much better double-precision
> 	performance than any of the 386 add-ins.  This fact could be
> 	critical if you plan on doing any graphics or other numerically
> 	intensive computation.
> 
> Anybody who is now considering buying a "hot-box" would be well advised
> to have a look at this article before making a final choice.
> 
> // rfg


Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
double precision. And the application benchmarks in that issue show
just how nasty the threat is from the 486 (e.g. the Cheetah Gold is
in the same class as these other machines, and Weitek IS working on
a floating point coprocessor for the 486. Also, the Cheetah costs
about $10,000 for the tested configuration.) It's not really clear
how the price/performance figure is arrived at, and the Dhrystone
just doesn't represent what I need a box for. Right now I'm of a
mind to get the 88000 if I can get good UNIX and some kind of 
floating point help. Otherwise, it's back to square one. Oh well.

Please keep this subject alive - I think the 88000 is finally
emerging beyond its established user base - and I think discussion
could only help its chances.

Later,
Andrew Mullhaupt

wood@dg-rtp.dg.com (Tom Wood) (01/09/90)

In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:

>2. That ratio of Megaflops to MIPS sucks. Let me rephrase this. Given
>that the 88000 is the only RISC chip with onboard floating support,
>you've got to wonder why since it ends up being (relatively) so
>slow. Can you get an FPA for it? On the systems with the combined
>88000/80386 CPUs can you hang a quick Cyrix of the 80386, or a Weitek
>3167? or can you put a 4167 on the 88000? Does Motorola have some
>kind of remedy for those of us who like the looks of those soon to
>be announced 486/860 systems which will scream for floating point?

and later:

>Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
>double precision. And the application benchmarks in that issue show
>just how nasty the threat is from the 486 (e.g. the Cheetah Gold is
>in the same class as these other machines, and Weitek IS working on
>a floating point coprocessor for the 486. Also the Cheetah costs
>about 10,000 for the tested configuration.) It's not really clear
>how the price performance benchmark is arrived at, and the Dhrystone
>just doesn't represent what I need a box for. Right now I'm of a
>mind to get the 88000 if I can get good UNIX and some kind of 
>floating point help. Otherwise, it's back to square one. Oh well.

I'd like to entertain a discussion on the FP performance of the 88k.
I have yet to see a compiler that takes advantage of the pipeline
on this machine to any extent.  Theoretically, you can have 5 FP adds
and 6 FP multiplies going on at once (if I understand correctly, the total
here is not 11 but 9: at most 5 FP adds and at most 6 FP multiplies, with
no more than 9 in flight in total).  So how would you feel if someone were able to
boost Mflops by a factor of say 3 (or better) by improving the compiler 
technology?

Here's a sample of what I'm talking about.  These are computed values
for the Matrix multiply inner loop:

	DO 10 J = 1,N
    10	    A(I,J) = A(I,J) + B(I,K)*C(K,J)

Code Generation Technique      Cycles/iteration      Mflops

    Naive code                      19                 2.10
    Naive code, 2 unrolls          35/2                2.28
    Sophisticated, 4 unrolls       28/4                5.71
    Sophisticated, 8 unrolls       48/8                6.67

Well, how 'bout it!?
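
For anyone who wants to picture what "sophisticated, 4 unrolls" means
at the source level, here is one possible C rendering of that inner
loop (a sketch only: zero-origin, column-major to match the Fortran
indexing, with made-up leading-dimension arguments; the payoff comes
from the scheduler interleaving the four independent multiply/add
chains):

    /* One sweep of A(I,J) = A(I,J) + B(I,K)*C(K,J) over J, unrolled
     * by 4.  a, b, c are column-major with leading dimensions lda,
     * ldb, ldc; i and k are fixed for the whole loop. */
    void
    inner(double *a, double *b, double *c,
          int i, int k, int n, int lda, int ldb, int ldc)
    {
        double bik = b[i + k * ldb];
        int j;

        for (j = 0; j + 3 < n; j += 4) {
            a[i + (j + 0) * lda] += bik * c[k + (j + 0) * ldc];
            a[i + (j + 1) * lda] += bik * c[k + (j + 1) * ldc];
            a[i + (j + 2) * lda] += bik * c[k + (j + 2) * ldc];
            a[i + (j + 3) * lda] += bik * c[k + (j + 3) * ldc];
        }
        for (; j < n; j++)          /* leftover iterations */
            a[i + j * lda] += bik * c[k + j * ldc];
    }
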
---
			Tom Wood	(919) 248-6067
			Data General, Research Triangle Park, NC
			{the known world}!rti!xyzzy!wood

manson@sphere.eng.ohio-state.edu (Robert Manson) (01/09/90)

In article <1879@xyzzy.UUCP> wood@gen-rtx.dg.com (Tom Wood) writes:
>
>I'd like to entertain a discussion on the FP performance of the 88k.
>I have yet to see a compiler that takes advantage of the pipeline
>on this machine to any extent.
[...]
>So how would you feel if someone were able to
>boost Mflops by a factor of say 3 (or better) by improving the compiler 
>technology?

Seems to me that everybody'd be better off with a smarter assembler.
After all, the changes that we're talking about should be possible
by assembly-code analysis, although I doubt that the level of
optimization achieved would be as good as a smart compiler. I'm
currently working on such an assembler, but I would hope that
such a beastie would be available commercially. The advantage of doing
it in the assembler is that then every compiler gets a performance
boost, and it also benefits any crazed humans that still like/need to
program in assembly.

>	DO 10 J = 1,N
>    10	    A(I,J) = A(I,J) + B(I,K)*C(K,J)
>

Depending on the loop construction, I could see this happening in such
a smart assembler, although it would be easier in the compiler.

I would have to agree that the lack of a good optimizing compiler for
the 88k is a major lack - the big gain in FP code on the 88k is the
parallelization that can occur.

>			Tom Wood	(919) 248-6067
>			Data General, Research Triangle Park, NC
>			{the known world}!rti!xyzzy!wood

						Bob
manson@cis.ohio-state.edu

alan@oz.nm.paradyne.com (Alan Lovejoy) (01/09/90)

In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
<Now the performance of these toys is really wild. The only real
<difficulties I have are:

<1. We have extensive need for Berkeley extensions in our software.
<We also use Sun's memory mapped files a whole lot. The System V
<alternative (shared memory) is OK, but we're pretty leery of any
<System V that isn't practically Release 4. Can I get close enough to
<Sun OS with an Aviion (Everex 8825, Opus 8120, etc.) If I can
<I may very well get one.

Motorola has been making a rather big noise for the past year and a half
about the "fact" that SVR4 with BCS would be available "first" for the 88k.
I have no idea whether they have or will deliver on this claim.  I suggest
you ask Moto/DG/Tektronix/Opus/Everex/AT&T/UNIX International.

<2. That ratio of Megaflops to MIPS sucks. Let me rephrase this. Given
<that the 88000 is the only RISC chip with onboard floating support,
<you've got to wonder why since it ends up being (relatively) so
<slow. Can you get an FPA for it? On the systems with the combined
<88000/80386 CPUs can you hang a quick Cyrix of the 80386, or a Weitek
<3167? or can you put a 4167 on the 88000? Does Motorola have some
<kind of remedy for those of us who like the looks of those soon to
<be announced 486/860 systems which will scream for floating point?

Double precision FP is slow primarily because the 88k does not have 64-bit data
paths internally.  That is the price Moto paid for putting the FPU on the same 
chip as  the IPU.  The benefits they get are: 1) no need to shuffle data between
integer and fp registers; 2) standardized fp instruction set; 3) assurance for 
SW developers that all 88k systems will have HW FP, and 4) you can buy 1 88100 
and 2 88200's at 16MHz for $499 (in lots of 1000, of course); try matching that
price/performance ratio with ANY other CPU.  Also, they started out several 
years behind MIPS with the Rx000 and 9 months behind SPARC.  They are now in 
production with 33MHz CMOS parts; MIPS and the SPARC gang are not.

Moto is obviously aiming the 88k at the mass market as a direct replacement
of the 68k.  MIPS is aiming at the very high end (for example, with the
R6000). The next generation of the 88k will be aimed at the high end, while
the current generation will be priced to capture the low and medium market
segments.  There is nothing in the 88k architecture to prevent Motorola from 
using 64 (or even 128) bit data paths and superscalar pipelining in the
next generation 88k.  Should happen within the next year and a half, probably
sooner rather than later (the current generation is almost two years old now,
after all.)  I don't think that the competition will be able to match Moto's 
prices on the current generation.  But who knows?

<Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
<double precision. And the application benchmarks in that issue show
<just how nasty the threat is from the 486 (e.g. the Cheetah Gold is
<in the same class as these other machines, and Weitek IS working on
<a floating point coprocessor for the 486. Also the Cheetah costs
<about 10,000 for the tested configuration.) It's not really clear
<how the price performance benchmark is arrived at, and the Dhrystone
<just doesn't represent what I need a box for. Right now I'm of a
<mind to get the 88000 if I can get good UNIX and some kind of 
<floating point help. Otherwise, it's back to square one. Oh well.

Buy a machine and you're buying into an architecture for quite some time,
as many purchasers of IBM/MS-DOS systems have found out.  Of course UNIX
helps in this regard, but only so much.  These machines are all within 
a factor of two in performance, WHICH COULD BE DUE TO SOFTWARE FACTORS
SUCH AS CODE GENERATORS.  You should consider all factors, not just
performance differences no greater than 2x.  How fast can each architecture's
performance be increased?  Which one has the best staying power in the
market?  Which one has achieved (or will achieve) the strongest market
position and standardization?

No matter which vendor you buy an 88k box from, you'll get the same
COMPATIBLE FPU.  And it won't cost extra.


____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>

soper@maxzilla.encore.com (Pete Soper) (01/09/90)

From article <75406@tut.cis.ohio-state.edu>, by manson@sphere.eng.ohio-state.edu (Robert Manson):

> such a beastie would be available commercially. The advantage of doing
> it in the assembler is that then every compiler gets a performance
> boost, and it also benefits any crazed humans that still like/need to
> program in assembly.

  You mean every compiler that generates assembler output. Many do not do
this by default or even at all. 

> 
> I would have to agree that lack of a good optimizing compiler for
> the 88k is a major lack-the big gain in FP code on the 88k is the
> parallelization that can occur.

  Both GNU C and Green Hills C/C++/F77/Pascal are optimizing compilers that
have 88k code generators available. Surely both have to do instruction 
scheduling of some sort to support the 88k. Perhaps this area needs more 
work? Is the 860 so much faster because of raw performance or does it 
have the same pipeline issues and a compiler that more effectively supports 
them?
  Sort of on this subject, is GNU C the only C compiler shipped with the
DG box, or is it an alternative to Green Hills? Assuming GNU C is "it",
does it play well with Green Hills Fortran, which I'm assuming is still
the official Fortran product? Has DG extended gdb to cover both languages
or is another debugger used with their Fortran product? 
----------------------------------------------------------------------
Pete Soper                                             +1 919 481 3730
internet: soper@encore.com     uucp: {bu-cs,decvax,gould}!encore!soper 
Encore Computer Corp, 901 Kildaire Farm Rd, bldg D, Cary, NC 27511 USA

tom@ssd.csd.harris.com (Tom Horsley) (01/09/90)

>I'd like to entertain a discussion on the FP performance of the 88k.
>I have yet to see a compiler that takes advantage of the pipeline
>on this machine to any extent.  Theoretically, you can have 5 FP adds
>and 6 FP multiplies going on at once (if I understand correctly, the total
>here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies and
>no more than 9 total).  So how would you feel if someone were able to
>boost Mflops by a factor of say 3 (or better) by improving the compiler 
>technology?

This may be true for single precision, but it is hard to see how you can get
the pipe full for double precision. Any instruction with a double precision
source operand requires two (count'em 2) cycles before the 88k will even
bother looking at the next instruction. Then for double precision float
instructions there are two cycles required in the first FP1 pipe stage
(although one of these FP1 cycles can overlap with the last of the two
decode cycles, so perhaps this is not so bad).

>Code Generation Technique      Cycles/iteration      Mflops
>
>    Naive code                      19                 2.10
>    Naive code, 2 unrolls          35/2                2.28
>    Sophisticated, 4 unrolls       28/4                5.71
>    Sophisticated, 8 unrolls       48/8                6.67
>
>Well, how 'bout it!?

In your example, even if everything is pipelined, the minimum number of
instructions that seem to be required just to do the computation is:

instruction   number   cycles

       addu        2        2   loop overhead
        bb1        1        1
        cmp        1        1

   fadd.ddd        8       16   loop body
   fmul.ddd        8       16
       ld.d       16       16
       st.d        8       16
-----------------------------
                           68

As near as I can tell 68 is not equal to 48. Do you have actual assembler
code that does this inner loop in 48 cycles? Could you post it?

As near as I can tell, this example does not work out as well as the
original poster implied.  Couple this with the real world fact (known even
by Cray users with heavy duty vectorizing compilers) that an awful lot of
real world algorithms have dependencies on previous results. No matter how
good your compiler is, it cannot pipeline these algorithms, because the next
thing depends on the last thing.  (Obviously it is worth the trouble to
pipeline when you can, I am just saying it is not always possible).

Another note said something about doing these sorts of optimizations at the
assembly level. This is also likely to turn out to be very hard.  The code
generated by the compiler is very likely to have the st.d instruction right
after the fadd.ddd instruction and right before the next set of ld.d
instructions. Unless the assembler is equipped to do enough symbolic
execution to prove that there is no aliasing it is going to have to leave
the st.d in front of the next set of ld.d instructions. This effectively
serializes the code since the thing being stored is the result of the fadd,
and there are very few things that can be reordered to fill pipeline slots.

For highest performance in all cases, give me the float unit with the
highest raw speed: pipelining only works if my algorithm is suitable; raw
speed always works.

Note: If the sample code had a divide instruction in it, it would be orders
of magnitude worse. Divides are *really* awful (they can't even be
pipelined).

Note Note: I am not fundamentally against the 88k. In fact, I like it. I
just wish the double precision performance were better. The main reason to
buy an 88k box over and above a MIPS or a 486 hot box is the existence of
the BCS standard. DEC has effectively shot MIPS in the foot by deciding to
run their boxes with the bytes backward. This makes it nearly impossible to
imagine a useful BCS ever happening across the full line of MIPS based
boxes.
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com  USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley            511 Kingbird Circle
      or  ...!uunet!hcx1!tahorsley            Delray Beach, FL  33444
======================== Aging: Just say no! ========================

amull@Morgan.COM (Andrew P. Mullhaupt) (01/10/90)

In article <1879@xyzzy.UUCP>, wood@dg-rtp.dg.com (Tom Wood) writes:
> In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
> 
> >2. That ratio of Megaflops to MIPS sucks. Let me rephrase this. Given
> >that the 88000 is the only RISC chip with onboard floating support,
> >you've got to wonder why since it ends up being (relatively) so
> >slow. 
> 
> and later:
> 
> >		...Right now I'm of a
> >mind to get the 88000 if I can get good UNIX and some kind of 
> >floating point help. Otherwise, it's back to square one. Oh well.
> 
> I'd like to entertain a discussion on the FP performance of the 88k.
> I have yet to see a compiler that takes advantage of the pipeline
> on this machine to any extent.  Theoretically, you can have 5 FP adds
> and 6 FP multiplies going on at once (if I understand correctly, the total
> here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies and
> no more than 9 total).  So how would you feel if someone were able to
> boost Mflops by a factor of say 3 (or better) by improving the compiler 
> technology?
> 
> Here's a sample of what I'm talking about.  These are computed values
> for the Matrix multiply inner loop:
> 
> 	DO 10 J = 1,N
>     10	    A(I,J) = A(I,J) + B(I,K)*C(K,J)
> 
> Code Generation Technique      Cycles/iteration      Mflops
> 
>     Naive code                      19                 2.10
>     Naive code, 2 unrolls          35/2                2.28
>     Sophisticated, 4 unrolls       28/4                5.71
>     Sophisticated, 8 unrolls       48/8                6.67
> 
> Well, how 'bout it!?

A man after my own heart! I just finished bitching and moaning at
the local C experts because the Sun 4 cc compiler produces the
most stupid code I've (or after they saw it, they've) ever seen for
the loop unrolling you've described. You actually give up a factor
of three for no known reason! On the same hardware, gcc will take
advantage of unrolled loops (e.g. Duff's device) to full effect.
Too bad that there are situations which go the other way 'round. 
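
(For anyone who hasn't run into it, Duff's device is the hand-unrolled
copy loop below, shown in its classic form - "to" is a memory-mapped
output register and so is deliberately not incremented, and count is
assumed to be positive.)

    void
    send(register short *to, register short *from, register int count)
    {
        register int n = (count + 7) / 8;

        switch (count % 8) {
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while (--n > 0);
        }
    }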

Another case of local optimization where RISC is often vulnerable
is the question of inlining memcpy (strncpy, etc.). You want to
'unroll' this guy into int or even double
transfers, but you've got to walk on eggs for alignment to support
the full semantics. The 386/486 boxes are pretty good at this, and
the SCO UNIX compiler (cc) for the 386 inlines a handful of standard
library functions, and then generates some pretty smart assembler code.
(It is necessary to point out that the behavior can be switched in 
and out by command line argument and preprocessor pragma - so if you
depend on your own memcpy, etc., then you won't get hurt by an
overzealous optimizer...)
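
For concreteness, the kind of "unrolled" copy I have in mind looks
roughly like this (a sketch only, assuming a 32-bit int as on the
machines being discussed; a real inlined memcpy would also unroll the
word loop and might use double transfers):

    #include <stddef.h>

    /* Copy n bytes, switching to 32-bit moves only when both pointers
     * can be brought to the same 4-byte alignment.  This is exactly
     * the aliasing/alignment tightrope the compiler has to walk. */
    void *
    copy_words(void *dst, const void *src, size_t n)
    {
        char *d = dst;
        const char *s = src;

        if ((((unsigned long) d ^ (unsigned long) s) & 3) == 0) {
            while (((unsigned long) d & 3) && n > 0) {  /* align */
                *d++ = *s++;
                n--;
            }
            while (n >= 4) {                            /* word moves */
                *(unsigned int *) d = *(const unsigned int *) s;
                d += 4;
                s += 4;
                n -= 4;
            }
        }
        while (n > 0) {                    /* tail, or the unaligned case */
            *d++ = *s++;
            n--;
        }
        return dst;
    }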

Now consider this code running on the 486. It's well known that the
486 can run all the 386 code (well if you've got a non-broken step
6 486 at least), but it is also almost as well known that the code
sequences which are optimal for the 386 and 486 are sometimes different.
There is even the question of code generation for the Cyrix replacement
for the 80387 chip. It runs all the 80387 code unmodified, but there are
ways to get the Cyrix to go another factor of two faster by generating
different code. There are compilers, and libraries to take advantage of
these situations, but I know of none for the 88000. 

On the other hand, I have heard that the 88000 is going to someday have
a wider data path to its floating point pipelines. Sounds like a good
idea to me.

So have you got a compiler which generates optimal code to get the
other factor of two or three out of my code? Remember - I've already
unrolled my loops, aligned my structures, and taken advantage of the
FORTRAN calling sequence. Just like the Linpack benchmarks.


Later,
Andrew Mullhaupt

rfg@ics.uci.edu (Ron Guilmette) (01/10/90)

In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>
>1. We have extensive need for Berkeley extensions in our software.
>We also use Sun's memory mapped files a whole lot. The System V
>alternative (shared memory) is OK, but we're pretty leery of any
>System V that isn't practically Release 4. Can I get close enough to
>Sun OS with an Aviion (Everex 8825, Opus 8120, etc.) If I can
>I may very well get one.

DG/UX on the AViiON has lots of popular BSD extensions like long filenames,
symbolic links, memory mapped files, and probably many others I don't know
about.

>Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
>double precision. And the application benchmarks in that issue show
>just how nasty the threat is from the 486 (e.g. the Cheetah Gold is

I don't know where you are getting your numbers.  The 3100 didn't even
make either of the "Best Performance" or "Best Price/Performance" lists
in that article, so the numbers for the 3100 were not even shown.

What was shown however were the single and double precision Whetstone
numbers for MIPS's own MIPS-based R2030 system (which I would think
should be quite similar to the DEC product in terms of performance).
These independently published numbers clearly show that the AViiON
beats the hell out of MIPS-based systems on single precision Whetstones
and loses by only about 10% on double precision.  I would hardly call
that 10% "stomping".  You probably would never even notice the difference
in practice.

Also, please correct me if I'm wrong, but doesn't the 3100 cost about
twice as much?

Finally, note that the application benchmark numbers shown in that article
were possibly somewhat misleading because they were probably done with
DG/UX 4.10 which came with a horrible implementation of malloc() in libc.a.
Most good sized C applications rely heavily on a good fast malloc() and can
suffer dramatically if they are linked with a malloc which has poor
performance.

The malloc implementation has been totally replaced in DG/UX 4.20.  It's
light-years better now.

// rfg

rfg@ics.uci.edu (Ron Guilmette) (01/10/90)

In article <10825@encore.Encore.COM> soper@maxzilla.encore.com (Pete Soper) writes:
>From article <75406@tut.cis.ohio-state.edu>, by manson@sphere.eng.ohio-state.edu (Robert Manson):
>> 
>> I would have to agree that lack of a good optimizing compiler for
>> the 88k is a major lack-the big gain in FP code on the 88k is the
>> parallelization that can occur.
>
>  Both GNU C and Green Hills C/C++/F77/Pascal are optimizing compilers that
>have 88k code generators available. Surely both have to do instruction 
>scheduling of some sort to suport the 88k.

Yes.  You could call it "instruction scheduling" I suppose.  A better term
might be "naive instruction scheduling".  Attempts to do "sophisticated"
instruction scheduling for these sorts of machines are still mostly
research projects (with the notable exception of the Multiflow systems).

>Perhaps this area needs more work?

Gee! No kidding?

>Is the 860 so much faster because of raw performance or does it 
>have the same pipeline issues and a compiler that more effectively supports 
>them?

It has many of the same pipelining opportunities and pitfalls.  As far as
I know it does not have good "sophisticated" compilers yet.

It is not even clear to me that the performance (scaled to clock
frequency) is that much better than the 88k's.  I have yet to see any
performance numbers for the i860 except those published by Intel.
Does any other source have published numbers?

>  Sort of on this subject, is GNU C the only C compiler shipped with the
>DG box, or is it an alternative to Green Hills? Assuming GNU C is "it",
>does it play well with Green Hills Fortran, which I'm assuming is still
>the official Fortran product? Has DG extended gdb to cover both languages
>or is another debugger used with their Fortran product? 

Why would you think that GDB would have to be extended?  A breakpoint on
a line is a breakpoint on a line, no?  A "list" command lists some source
lines, yes?  What's the difference if it's FORTRAN or C?

// rfg

mash@mips.COM (John Mashey) (01/10/90)

I don't usually comment in this newsgroup, but there was enough
(mis)information in the following that I had to comment:

In article <6915@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>Double precision FP is slow primarily because the 88k does not have 64-bit data
>paths internally.  That is the price Moto paid for putting the FPU on the same 
>chip as  the IPU.  The benefits they get are: 1) no need to shuffle data between
>integer and fp registers; 2) standardized fp instruction set; 3) assurance for 
>SW developers that all 88k systems will have HW FP, and 4) you can buy 1 88100 
>and 2 88200's at 16MHz for $499 (in lots of 1000, of course); try matching that
>price/performance ratio with ANY other CPU.  Also, they started out several 
>years behind MIPS with the Rx000 and 9 months behind SPARC.  They are now in 
>production with 33MHz CMOS parts; MIPS and the SPARC gang are not.
They paid a price to put it on the same chip, and it's a legitimate
choice, however:
	2) MIPS and SPARC certainly have standardized instruction sets.
	3) MIPS (at least) supports complete IEEE emulation in the UNIX
	kernel, so one does not need extra flavors of binaries.
	4) Try matching that price/performance:
	You can put together the core of
	a system of similar performance, including CPU, FPU, MMU, 128K caches,
	glue, for about $400, or less [I've heard of one, which was in large
	quantities, as low as $250, although that might have been a little
	slower].  Things like the IDT 3001 reduce the cost even more,
	and having less cache helps too.

	a) For some kinds of algorithms, you'd really like more FP regs,
	which MIPS, SPARC, HP PA have.
	b) As it stands, with the natural 32-bit organization of the
	register file, you either:
		1) Need more cycles to read/write operands [what 88K did]
		OR
		2) Need more read and write ports, especially to accommodate
		multiple-cycle operations whose results come back later.
	One of the reasons MOST people have separate FP and integer registers
	is to:
		1) organize FP as 64-bit.
		2) Have more read and write ports to accommodate
		heavily-overlapped FP operations.  Ports cost, sooner or later.
	
Sun & Solbourne are shipping (SPARC) systems at 33Mhz; Stardent shipping (MIPS)
at 32Mhz (to be fair, all just recently).  I have no idea how many there
are of these things, as I haven't tried to buy them lately.  If there are
lots of 33Mhz 88Ks actually shipping out there in systems, we haven't seen
them, although they certainly exist, and have been benchmarked.
Needless to say, always measure performance on real programs, not
clock rate: if only clock rate counted, 50Mhz 68030s would be ahead of
all other chips mentioned (and they're not).
>
>Moto is obviously aiming the 88k at the mass market as a direct replacement
>of the 68k.  MIPS is aiming at the very high end (for example, with the
I'm not privy to Moto's plans and aims, but this is a clear statement of
MIPS' direction...which is 100% wrong, and why I posted this.
Many MIPS chips go into embedded control (laser printers, telephone switches,
avionics, autos, etc.), for example, and if we can't get to
the lowest part of that, we're sure interested in the high-
performance part, as well as workstations and small servers.
We do high-end (R6000) besides, but why does that make anybody think we're
disinterested in the low-end? MIPS partners who do a lot of embedded say
they fight all of the time with Intel 960s, sometimes with 29Ks, and seldom with
the 88K.... (now, that's anecdote, and hard to check, but...)
>R6000). The next generation of the 88k will be aimed at the high end, while
>the current generation will be priced to capture the low and medium market
>segments.  There is nothing in the 88k architecture to prevent Motorola from 
Well, they'll have to get the prices down to beat what you can do with
standard SRAMs and on-chip cache control, and it's going to be real tough for
the part of the embedded market that doesn't care about FP, because
people can sell equivalent-performance chipsets for about half the price.
>using 64 (or even 128) bit data paths and superscalar pipelining in the
>next generation 88k.  Should happen within the next year and a half, probably
>sooner rather than later (the current generation is almost two years old now,
>after all.)  I don't think that the competition will be able to match Moto's 
Where does 2 years come from? Do you count announcements? It is
well-known that until about 6-8 months ago, nobody could even ship production
systems, due to the crippling FP bugs that only got fixed then.
>prices on the current generation.  But who knows?
The combined register set, as described above, will not help superscalaring
much....  "But who knows?": lots of people.

Note that there's an awful lot of misinformation and speculation floating
around here, presented authoritatively.

Of use, when they appear, will be the next set of SPEC numbers,
which help give a more realistic assessment of performance than
the (increasingly unreliable/gimmickable) *stones.

Anyway, the 88K is a credible and respectable chipset, but claims
like "it costs $X and nothing else is close" (without giving any data
from anything else), and claims of "wonderful things will happen soon,
and no one else could match them" are marketing claims, not technical ones.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (01/10/90)

In article <25AAE835.16940@paris.ics.uci.edu> rfg@ics.uci.edu (Ron Guilmette) writes:

>>Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
>>double precision. And the application benchmarks in that issue show
>>just how nasty the threat is from the 486 (e.g. the Cheetah Gold is
>
>I don't know where you are getting your numbers.  The 3100 didn't even
>make either of the "Best Performance" or "Best Price/Performance" lists
>in that article, so the numbers for the 3100 were not even shown.
But they have been shown, as on page 36 of the January 1990 issue, and
earlier ones.  I suggest that of the various FP benchmarks, the most
representative is the Livermore Loops, where the DS3100 yielded 1.928
MFLOPS (DP) versus 1.48 MFLOPS for the Step 8825 (25Mhz).
>
>What was shown however were the single and double precision Whetstone
>numbers for MIPS's own MIPS-based R2030 system (which I would think
>should be quite similar to the DEC product in terms of performance).
>These independently published numbers clearly show that the AViiON
>beats the hell out of MIPS-based systems on single precision Whetstones
>and looses by only about 10% on double precision.  I would hardly call
>that 10% "stomping".  You probably would never even notice the difference
>in practice.
Q: Do Whetstones correlate with real performance on real floating-point
programs?
A: Not very well.
>
>Also, please correct me if I'm wrong, but doesn't the 3100 cost about
>twice as much?
No; the Everex described in the article cost $21,995, and the Opus
$15,075.  I don't have the numbers handy for the DS3100, but it's in
the same ballpark; of course the Everex/Opus have a 386 & MSDOS as well,
but also don't have as big a screen, I think, so there's generally
some apples/oranges comparisons in both directions, depending on what
you want.
>
>Finally, note that the application benchmark numbers shown in that article
>were possibly somewhat misleading because they were probably done with
>DG/UX 4.10 which came with a horrible implementation of malloc() in libc.a.
>Most good sized C applications rely heavily on a good fast malloc() and can
>suffer dramatically if they are linked with a malloc which has poor
>performance.
>
>The malloc implementation has been totally replaced in DG/UX 4.20.  It's
>light-years better now.
Malloc only appears explicitly once in the whole set of benchmarks, as part of
initialization....Also, understand that these benchmarks are mixtures of
small synthetics that try to model different environments: they are NOT
applications.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

andrew@frip.WV.TEK.COM (Andrew Klossner) (01/10/90)

[]

	"Why would you think that GDB would have to be extended?  A
	breakpoint on a line is a breakpoint on a line, no?  A "list"
	command lists some source lines, yes?  What the difference if
	it's FORTRAN or C?"

We've had to do considerable work extending gdb to play with Green
Hills Fortran.  One big difference is that Fortran arrays are stored in
column-major order.  Another is that Fortran parameters are passed by
reference.  Still another is that Fortran passes string parameters as
(address, length).  To support Fortran application programmers, it's
inadequate to let GDB display this information as though the C language
model were in effect.
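
As a concrete (and deliberately simplified) illustration of the
mismatch: under a typical Unix f77-style convention, a routine
declared as SUBROUTINE SCALE(X, N, NAME) is seen from C roughly as
below - everything passed by address, the character argument carried
as an (address, length) pair, and the symbol name decorated.  The
details vary from compiler to compiler, so treat this as a sketch
rather than Green Hills' actual convention.

    /* Hypothetical C view of CALL SCALE(X, N, 'PORT') under an
     * f77-style calling convention: arguments by reference, string
     * length appended as a hidden value argument, trailing
     * underscore on the external name. */
    extern void scale_(double *x, int *n, char *name, int name_len);

    void
    call_scale(double *x, int n)
    {
        scale_(x, &n, "PORT", 4);
    }

A debugger that assumes the C model sees three pointers and a
mysterious integer, not an array, a count, and a string.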

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

guy@auspex.auspex.com (Guy Harris) (01/11/90)

>	3) MIPS (at least) supports complete IEEE emulation in the UNIX
>	kernel, so one does not need extra flavors of binaries.

As does SunOS for SPARC machines.

marvin@oakhill.UUCP (Marvin Denman) (01/11/90)

In article <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
>>  ...
>>  Discussion by Tom Wood of Data General about the possibility of boosting
>>  Mflops by a factor of 3 or better with improved compiler technology.
>>  ...
>This may be true for single precision, but it is hard to see how you can get
>the pipe full for double precision. Any instruction with a double precision
>source operand requires two (count'em 2) cycles before the 88k will even
>bother looking at the next instruction. Then for double precision float
>instructions there are two cycles required in the first FP1 pipe stage
>(although the one of these FP1 cycles can overlap with the last of the two
>decode cycles, so perhaps this is not so bad).

Two cycles to issue a double precision operation is an artifact of the 
88100 implementation.  The penalty is only two cycles for initiating
and terminating instructions though.  The pipes generally compress out
bubbles, so any stalls at the end of the pipe are usually hidden unless
the pipe is full for some reason.

>
>>Code Generation Technique      Cycles/iteration      Mflops
>>    Naive code                      19                 2.10
>>    Naive code, 2 unrolls          35/2                2.28
>>    Sophisticated, 4 unrolls       28/4                5.71
>>    Sophisticated, 8 unrolls       48/8                6.67
 
>In your example, even if everything is pipelined, the minimum number of
>instructions that seem to be required just to do the computation is:
 
>instruction   number   cycles
>       addu        2        2   loop overhead
>        bb1        1        1
>        cmp        1        1
>   fadd.ddd        8       16   loop body
>   fmul.ddd        8       16
>       ld.d       16       16
>       st.d        8       16
>-----------------------------
>                           68

>As near as I can tell, this example does not work out as well as the
>original poster implied.  Couple this with the real world fact (known even
>by Cray users with heavy duty vectorizing compilers) that an awful lot of
>real world algorithms have dependencies on previous results. No matter how
>good your compiler is, it cannot pipeline these algorithms, because the next
>thing depends on the last thing.  (Obviously it is worth the trouble to
>pipeline when you can, I am just saying it is not always possible).

The example in question was obviously for single precision.  The 68-cycle
figure appears to be approximately correct for best-case double precision,
assuming the loop in double precision can be unrolled 8 times before running out of
registers.  One clock could probably be saved in this case by optimizing the
loop to use bcnd instead of the compare and branch sequence.

I haven't coded this loop, but I have unrolled similar loops such as Linpack.  
Comparing 68 cycles to 19 cycles is not an apples to apples comparison.
The naive code would also be slowed somewhat by using double precision.  
As a first guess I would say that the ratio of unrolled code to naive code will 
still be close to 3.  Compilers have much room for improvement particularly
in floating point numerical code.  The current compilers do very little
scheduling and no unrolling of loops that I am aware of.  Just scheduling
operations with latencies greater than 1 will improve performance significantly.
Unrolling loops will make a large difference in this type of code.

Data dependencies between iterations of a loop are a very significant problem
with unrolling loops.  Hopefully the compiler will recognize the nondependencies
well enough to unroll most loops that can be unrolled.  I agree that on some
loops there are dependencies that hinder unrolling.  If these can be identified
though the compiler may even be able to remove redundant loads.  There is so
much room for improvement that I find it difficult to be pessimistic about 
the amount of improvement that is possible.

>For highest performance in all cases, give me the float unit with the
>highest raw speed, pipelining only works if my algorithm is suitable, raw
>speed always works.

I disagree.  I think that unless the latency is very short (2 or maybe 3 cycles),
pipelining will pay off on a normal application mix.  The longer the
latency, the more likely it is that you will want to unroll or reschedule code.
It will be interesting to see if MIPS goes to pipelining floating point 
instructions in future parts.
-- 

Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin

amull@Morgan.COM (Andrew P. Mullhaupt) (01/11/90)

In article <25AAE835.16940@paris.ics.uci.edu>, rfg@ics.uci.edu (Ron Guilmette) writes:
> In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
> >
> >Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
> >double precision. And the application benchmarks in that issue show
> >just how nasty the threat is from the 486 (e.g. the Cheetah Gold is
> 
> I don't know where you are getting your numbers.  The 3100 didn't even
> make either of the "Best Performance" or "Best Price/Performance" lists
> in that article, so the numbers for the 3100 were not even shown.
> 
> What was shown however were the single and double precision Whetstone
> numbers for MIPS's own MIPS-based R2030 system (which I would think
> should be quite similar to the DEC product in terms of performance).
> These independently published numbers clearly show that the AViiON
> beats the hell out of MIPS-based systems on single precision Whetstones
> and looses by only about 10% on double precision.  I would hardly call
> that 10% "stomping".  You probably would never even notice the difference
> in practice.
> 

Right you are, and then wrong again. I was looking at the comparison
between the DecStation 3100 and the Opus 8120 and Everex 8825 on page
36, where the  somewhat more powerful Aviion is not present. I see a
consistent 30% victory in the Double precision Linpack and Livermore,
which is what I care about. It would appear that the Aviion is less
vulnerable than the Opus and Everex systems. On the other hand you're
wrong if you think I'd miss that 30% in practice. I have runs planned
for machines which take 50 hours of 3090 CPU and I have to find out
if they can go on workstations. Given that you lose a lot of the
vector and large scale cache advantage, plus the large scale use of hand
coded assembly (ESSL) on the 3090, I'm looking at thousands of hours of
CPU on a workstation class machine. Nearly all of it is double precision
floating point. Even if it's as little as 10%, I think a hundred hours
of CPU is noticeable.
> Finally, note that the application benchmark numbers shown in that article
> were possibly somewhat misleading because they were probably done with
> DG/UX 4.10 which came with a horrible implementation of malloc() in libc.a.
> Most good sized C applications rely heavily on a good fast malloc() and can
> suffer dramatically if they are linked with a malloc which has poor
> performance.
> 
> The malloc implementation has been totally replaced in DG/UX 4.20.  It's
> light-years better now.
Well, I wasn't paying much attention to them, but I think the way
they come up with the overall ranking is weird - because the Aviion
beats the 'Best Performers' in all categories but two: Financial
and Dhrystone 2. I think they're putting too much weight on the 
Dhrystone, and this lands the Aviion in 4th place after the MIPS
RS2030. (Go Figure).

Also: you may have the impression that I am primarily concerned with
FORTRAN applications. This is sort of correct, in that you generally
use Linpack and Eispack instead of recoding them - even if your 
application is in C. Here, you end up with algorithms which are
seldom as fast as N log N; often you have N^3 time and N^2 space.
It would have to be the world's dumbest malloc before I would notice
it in scientific computing. Indeed, one of the standard 'jokes' is
to use an N^2 sort routine to order the eigenvalues/singular values
of a matrix, (which has just cost you a fairly large constant times
N^3), the idea being that you'll never be able to do an N big enough
to wear out the N^2 sort. (Being a real puritan, I use shellsort....)
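
(For the record, the shellsort in question is only a dozen lines or
so - a sketch, sorting eigenvalues into descending order; even at its
worse-than-N-log-N behavior it vanishes next to the N^3 work that
produced them.)

    /* Plain Shell sort of n values into descending order. */
    void
    shellsort(double *v, int n)
    {
        int gap, i, j;
        double t;

        for (gap = n / 2; gap > 0; gap /= 2)
            for (i = gap; i < n; i++)
                for (j = i - gap; j >= 0 && v[j] < v[j + gap]; j -= gap) {
                    t = v[j];
                    v[j] = v[j + gap];
                    v[j + gap] = t;
                }
    }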

Later,
Andrew Mullhaupt

meissner@osf.org (Michael Meissner) (01/12/90)

In article <10825@encore.Encore.COM> soper@maxzilla.encore.com (Pete Soper) writes:

|   Both GNU C and Green Hills C/C++/F77/Pascal are optimizing compilers that
| have 88k code generators available. Surely both have to do instruction 
| scheduling of some sort to suport the 88k. Perhaps this area needs more 
| work? Is the 860 so much faster because of raw performance or does it 
| have the same pipeline issues and a compiler that more effectively supports 
| them?

First of all, to support the 88k, you don't have to do any instruction
scheduling, since the hardware will stall if the data is not
available.  Obviously, there is an advantage to doing scheduling (the
numbers I saw were in the 5-10% range).  I don't know about extremely
recent Greenhills releases, but Greenhills did some limited amount of
instruction scheduling, and GNU C did not (unless you count filling
the delay slots of branches/calls as instruction scheduling).
Instruction scheduling seems to help floating point the most on the
88k (my gut-level reaction is that the compiler often does not
have anything else useful to do to cover the two stalls needed for
loads).  This tended to show up in the numbers: in integer/system
benchmarks, GNU and Greenhills were neck and neck, whereas Greenhills
had an advantage in floating point.

Bias alert:  I spent a year working on the GNU C compiler for the 88k,
so I'm not a disinterested observer.

|   Sort of on this subject, is GNU C the only C compiler shipped with the
| DG box, or is it an alternative to Green Hills? Assuming GNU C is "it",
| does it play well with Green Hills Fortran, which I'm assuming is still
| the official Fortran product? Has DG extended gdb to cover both languages
| or is another debugger used with their Fortran product? 

The only compiler that is included with the DG/UX 88k operating system
is GNU C.  You can purchase Greenhills C, Fortran, and Pascal if you
desire.  I believe that Absoft also sells a Fortran compiler for
DG/UX.  All languages on the 88k are expected to meet the 88open
Object Compatibility Standard (OCS) with regard to calling sequence,
so that you can mix and match (though there is one minor detail that
both GNU C and Greenhills fail in the same way).  DG does supply its
own debugger to cover all of the languages (mxdb), but you could get
by with gdb (I think there are problems with multidimensioned
arrays, and of course describe uses C syntax).  Part of the problem
is that COFF is so C-specific, but COFF is required by the standards.

I believe that the 88k OCS standard makes it impossible for C to call
a Fortran function that returns double complex, since C is required to
treat it as a function which returns a struct -- which goes in memory,
whereas Fortran returns the value in registers.  I don't have a copy of
the standard anymore, so I can't verify this.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

amull@Morgan.COM (Andrew P. Mullhaupt) (01/12/90)

In article <2811@yogi.oakhill.UUCP>, marvin@oakhill.UUCP (Marvin Denman) writes:
> In article <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
|||  ...
|||  Discussion by Tom Wood of Data General about the possibility of boosting
|||  Mflops by a factor of 3 or better with improved compiler technology.
|||  ...
||This may be true for single precision, but it is hard to see how you can get
||the pipe full for double precision. Any instruction with a double precision
||source operand requires two (count'em 2) cycles before the 88k will even
||bother looking at the next instruction. Then for double precision float
||instructions there are two cycles required in the first FP1 pipe stage
||(although the one of these FP1 cycles can overlap with the last of the two
||decode cycles, so perhaps this is not so bad).
| 
| 
| The example in question was obviously for single precision.  The 68 cycle
| appears to be approximately correct for best case double precision assuming 
| the loop in double precision can be unrolled 8 times before running out of 
| registers.  One clock could probably be saved in this case by optimizing the
| loop to use bcnd instead of the compare and branch sequence.
| 
| I haven't coded this loop, but I have unrolled similar loops such as Linpack.  
| Comparing 68 cycles to 19 cycles is not an apples to apples comparison.
| The naive code would also be slowed somewhat by using double precision.  
| As a first guess I would say that the ratio of unrolled code to naive code will 
| still be close to 3.  Compilers have much room for improvement particularly
| in floating point numerical code.  The current compilers do very little
| scheduling and no unrolling of loops that I am aware of.  Just scheduling
| operations with latencies greater than 1 will improve performance significantly.
| Unrolling loops will make a large difference in this type of code.

God help us, I hope not. Unless we're reading off different pages, the
BLAS (Basic Linear Algebra Subprograms) have loops which are unrolled
in many places just for this reason. If the compiler insists on rolling
them back up - that's its affair. Now you can't win by unrolling every
loop, because some loops are big enough that unrolling them pops you
out of cache, etc., so don't expect unrolling loops to be the winner
every time. You can sometimes get nailed by inlining stuff for the same
reason. Now what to unroll is a harder question than it used to be 
because you've got so many different sizes of cache and stuff across the
different machines, but looking at the old CDC 6600 architecture
and its multiply scheduled pipelines will likely show that scientific
computing has been around this block once before. The tricks are often
worthwhile, and I would expect every self-respecting compiler to be
aware of the available weapons. 
| 
| Data dependencies between iterations of a loop are a very significant problem
| with unrolling loops.  Hopefully the compiler will recognize the nondependencies
| well enough to unroll most loops that can be unrolled.  I agree that on some
| loops there are dependencies that hinder unrolling.  If these can be identified
| though the compiler may even be able to remove redundant loads.  There is so
| much room for improvement that I find it difficult to be pessimistic about 
| the amount of improvement that is possible.
| 
||For highest performance in all cases, give me the float unit with the
||highest raw speed, pipelining only works if my algorithm is suitable, raw
||speed always works.
| 
| I disagree.  I think that unless the latency is very short (2 or maybe 3 cycles)
| that pipelining will pay off on a normal application mix.  The longer the
| latency, the more likely it is that you will want to unroll or reschedule code.
| It will be interesting to see if MIPS goes to pipelining floating point 
| instructions in future parts.

It seems to me that 68 versus 48 clocks is about a 40% penalty for
double precision. That's too much for my taste (I can tolerate about
a 20% differential). If you think about exercising your bus, etc.,
double precision probably gets higher efficiency but hurts your cache
hit ratio as compared to single precision. 

It is quite likely that there are two user communities here - the
single precision fans and the double precision fans. We will most
likely end up preferring different machines. I come from the double
precision school of thought, and almost ignore single precision
benchmarks. I would expect the other camp does the reverse. It should
be pretty obvious that unrolling, (a control overhead reduction
technique) will be more efficacious when the amount of real work
done on each pass of the loop is smaller. All other things being
equal, we should expect single precision code to benefit more by the
application of unrolling. (It really helps integer code no end). 

But back to my original confusion - am I the only one with a BLAS
which unrolls its own loops? Given that I'm talking about double
precision arithmetic, should I really expect the compiler to find 
yet another factor of two? I'll believe it when I see it.

Later,
Andrew Mullhaupt

tom@ssd.csd.harris.com (Tom Horsley) (01/12/90)

In article <2811@yogi.oakhill.UUCP> marvin@oakhill.UUCP (Marvin Denman) writes:

   >The example in question was obviously for single precision.

The original article specifically stated that the example was double precision;
that is why I wondered where the numbers came from.

   >            One clock could probably be saved in this case by optimizing the
   >loop to use bcnd instead of the compare and branch sequence.

Maybe, but I got this code by assuming that I could do induction variable
elimination and test replacement. In order to use bcnd, I need to count
down to zero, which probably means adding in an extra subu, thus eating
the cycle I just saved. Perhaps a sufficiently clever compiler could get
around this. In any event neither 67 nor 68 is close to 48.

   >Data dependencies between iterations of a loop are a very significant
   >problem with unrolling loops.  Hopefully the compiler will recognize the
   >nondependencies well enough to unroll most loops that can be unrolled.
   >I agree that on some loops there are dependencies that hinder unrolling.
   >If these can be identified though the compiler may even be able to
   >remove redundant loads.  There is so much room for improvement that I
   >find it difficult to be pessimistic about the amount of improvement that
   >is possible.

There is no question that compilers can generate better code than they do
now. We are currently at the stage of doing a detailed examination of the
code quality of our own 88k compilers here at Harris Computer Systems, and
we are often horrified by some of the truly rotten code we produce. We ARE
fixing these problems. (And occasionally we are uplifted by the terrific
code we produce).

However, there is a real problem with loop unrolling that depends on language
semantics. In FORTRAN compilers it may well be possible to profitably unroll
many loops, due to some of the aliasing restrictions that the FORTRAN standard
imposes on arguments. In the long term in Ada, it is also possible because
Ada requires a global program database which could someday be used to do the
sorts of interprocedural analysis required to determine that no aliasing
occurs. But on U**x systems most code is written in C - increasingly even
numerical code - and C pointers can point pretty much anywhere.
Compilers generally have to make worst-case assumptions. This means that
in any loop like the one in the original example where there is a load
through a pointer on the right of the statement and a store through a pointer
on the left, the compiler will be forced to assume that the store must
take place before the next loop iteration does a load. Even if you unroll
the loop, this data dependence will still be in place.

Unfortunately, the only way you can get the example loop fully pipelined is
to do several multiplies and adds before actually storing the result.  In
this case, if the algorithm were coded in C, you could take almost no
advantage of pipelining; the only thing unrolling would get you is a slight
improvement in the loop overhead of incrementing and testing the induction
variable.
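To make the aliasing problem concrete, here is a sketch (not the original
example) of the kind of loop I mean:

  /* The compiler cannot prove that a and b refer to disjoint storage,
   * so any store through a may feed the next load through b.  The
   * store therefore has to complete before the following iteration's
   * load, and no amount of unrolling removes that dependence. */
  void accumulate(double *a, double *b, double s, int n)
  {
      int i;
      for (i = 0; i < n; i++)
          a[i] = a[i] + b[i] * s;   /* the a[i] store may alias b[i+1] */
  }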

   >I disagree.  I think that unless the latency is very short (2 or maybe 3
   >cycles) that pipelining will pay off on a normal application mix.

   >Marvin Denman
   >Motorola 88000 Design
   >cs.utexas.edu!oakhill!marvin

Of course you disagree, you work for Motorola :-)

Actually I didn't mean to imply that I thought pipelining was a bad idea, I
am all in favor of it, because when you can take advantage of it it does a
super job. I just wish that it didn't take so many clocks to get through the
pipe, because when it does not work out so well you just have to eat the
cycles and like it. In those cases I would prefer to eat as few cycles as
possible. To paraphrase your comment about MIPS, it will be interesting to
see if Motorola goes to fewer clocks for float instructions in the next
generation chips.

I still maintain that a large amount of real code (not artificial
benchmarks) contains data dependencies that force serial computation. I
would like this code to run fast as well.
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com  USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley            511 Kingbird Circle
      or  ...!uunet!hcx1!tahorsley            Delray Beach, FL  33444
======================== Aging: Just say no! ========================

slackey@bbn.com (Stan Lackey) (01/13/90)

In article <671@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:

>But back to my original confusion - am I the only one with a BLAS
>which unrolls its own loops? Given that I'm talking about double
>precision arithmetic, should I really expect the compiler to find 
>yet another factor of two? I'll believe it when I see it.

In older machines (those without any scalar pipeline) the only
advantage of unrolling loops was to reduce loop overhead.  Now, with
scalar pipelines, a good instruction scheduler can likewise take
advantage of unrolling; that is, in a loop like a(i) = b(i)*s which is
unrolled say 4 times, b(i:i+3) can be fetched at the start of the loop
and put into 4 registers by 4 sequential loads (I assume using
displacement addressing).  Then the four muls can be started
sequentially, followed by 4 stores.  The time to do 4 loop iterations
in this case should be only slightly more than the time to do one
(with all cache hits).

Note that with more in the loop (like maybe two fetched vectors instead of
one) and maybe an add, you use up all the registers real fast, especially
with double precision.
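In C the unrolled-and-scheduled version looks roughly like this (a sketch;
a real scheduler does the grouping itself, and n is assumed to be a multiple
of 4 to keep it short):

  /* a[i] = b[i]*s unrolled by 4 and hand-scheduled: 4 loads, then
   * 4 multiplies started back to back, then 4 stores, so the multiply
   * latency overlaps with the neighboring loads. */
  void scale4(double *a, const double *b, double s, int n)
  {
      int i;
      for (i = 0; i < n; i += 4) {
          double t0 = b[i];          /* 4 sequential loads */
          double t1 = b[i + 1];
          double t2 = b[i + 2];
          double t3 = b[i + 3];
          t0 *= s;                   /* 4 multiplies */
          t1 *= s;
          t2 *= s;
          t3 *= s;
          a[i]     = t0;             /* 4 stores */
          a[i + 1] = t1;
          a[i + 2] = t2;
          a[i + 3] = t3;
      }
  }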

-Stan

earl@wright.mips.com (Earl Killian) (01/13/90)

In article <2811@yogi.oakhill.UUCP>, marvin@oakhill (Marvin Denman) writes:
>I disagree.  I think that unless the latency is very short (2 or
>maybe 3 cycles) that pipelining will pay off on a normal application
>mix.  The longer the latency, the more likely it is that you will
>want to unroll or reschedule code.  It will be interesting to see if
>MIPS goes to pipelining floating point instructions in future parts.

Pipelining makes less than 1% difference on the non-vector
applications that I've looked at.  Even on vector applications it is
unimportant if your latencies are short enough.  2- or 3-cycle adds
are doable, for example.  Consider the application being discussed,
matrix multiply, which is highly vectorizable.  If the original poster
is correct that the 88100, with its pipelined floating-point units,
tops out at 6.7 mflop/s in single-precision matrix multiplies, it
really proves this point.  The MIPS R3000, with non-pipelined
floating-point units, can do matrix multiplies at
			   25MHz	   33MHz
	single		11.8 mflop/s	15.7 mflop/s
	double		 7.8 mflop/s	10.4 mflop/s
This is an example of why MIPS prefers low latency to pipelined fp.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

rfg@ics.uci.edu (Ron Guilmette) (01/14/90)

In article <TOM.90Jan12072511@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
>
>However, there is a real problem with loop unrolling that depends on language
>semantics. In FORTRAN compilers it may well be possible to profitably unroll
>many loops, due to some of the aliasing restrictions that the FORTRAN standard
>imposes on arguments. In the long term in Ada, it is also possible because
>Ada requires a global program database which could someday be used to do the
>sorts of interprocedural analysis required to determine that no aliasing
>occurs. But on U**x systems, most code is written in C, increasingly even
>numerical code is written in C. But C pointers can point pretty much anywhere.
>Compilers generally have to make worst case assumptions...

NOT true!

I have been arguing over this very issue with a professor here recently.
He is particularly interested in issues of instruction scheduling for
VLIW machines and I have been telling him (repeatedly) that you will
never achieve enough parallelism to make it worth your while on machines
like that (or even on the lowly 88k) if you are compiling from C source
code and if you do not do some hefty (but plausible) alias analysis
based on as much information as can be gleaned from the source code.

For instance, although pointers can in theory point almost anywhere,
there are in fact many cases where it is obvious that the set of things
that could in fact be pointed to is some identifiable subset of the
entire address space.  For example:

	{
	  char array[100];
	  char *end = &array[99];
	  char *p = array;

	  while (p <= end)
	    *p++ = ' ';
	}

Guess where p always points to!  Now guess where end always points to.
You can work your way up to significantly more complex examples from here.
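A slightly more involved case where the same reasoning still goes through
(again purely illustrative):

	{
	  char src[100], dst[100];
	  char *p = src;
	  char *q = dst;

	  /* p can only point into src, and q only into dst, so the store
	     through q can never change what the next load through p sees.
	     A compiler that tracks this is free to reorder or unroll. */
	  while (p < &src[100])
	    *q++ = *p++;
	}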

Some particularly good work was done on alias analysis for C at Hewlett
Packard (for the PA) and was written up as "Retargetable High Level Alias
Analysis" in the 1986 ACM POPL Proceedings.  Even though that's the best
paper I have seen on the subject yet, I think that they may have missed
a few possible additional tricks which might allow them to infer additional
limitations on the set of things that a pointer can point to, but it is hard
to tell.  They did do a pretty detailed analysis, but I guess that my own
ego makes me want to believe that (if given enough time) I could do better.

With a really robust alias analysis mechanism, you start to get into some
interesting questions regarding storing aliasing information for "library"
modules as well as for the code you are currently compiling.  How much do you
store?  How do you represent it?  If anybody has ideas about such things,
I (for one) am all ears.

// rfg

tom@ssd.csd.harris.com (Tom Horsley) (01/15/90)

In article <25AFDC1A.11327@paris.ics.uci.edu> rfg@ics.uci.edu (Ron Guilmette) writes:
>entire address space.  For example:

>        {
>          char array[100];
>          char *end = &array[99];
>          char *p = array;
>
>          while (p <= end)
>            *p++ = ' ';
>        }
>
>Guess where p always points to!  Now guess where end always points to.
>You can work your way up to significantly more complex examples from here.

This is absolutely true, and symbolic execution is a well-known (if
high-overhead) way of squeezing information like this out of the source code.
(We use a form of it in our Ada compilers to eliminate constraint checks we
can prove are not required).  But I can make up examples too:

    void
    matmul(a,b,c,n,m)
       double * a;
       double * b;
       double * c;
       int      n,m;
    {
       ...
    }

Guess where a, b, and c always point to! Unless I have a global program
database and can compile the matmul routine knowing everything about
every point from which it is called, the only guess I can make is "somewhere within
the address space of the machine".

The ultimate fanatic compiler might consider generating two routine bodies,
one with aliasing between arguments and one without, then throw in some
runtime checks for array overlap, and just jump to the "best" body. But be
careful, you are liable to introduce more overhead with the runtime checks
than you save by picking the correct body (particularly if the arguments
always overlap). Deciding which optimizations are profitable is the hardest
problem in engineering a good compiler.
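In outline, the two-body scheme might look like the following sketch (the
overlap test shown handles only flat, contiguous operands, and comparing
unrelated pointers is something a compiler can do below the level of strict
C rules; all the names here are made up):

    static void scale_noalias(double *a, double *b, double s, int n)
    {
        int i;
        for (i = 0; i < n; i++)     /* body the compiler may unroll and pipeline freely */
            a[i] = b[i] * s;
    }

    static void scale_general(double *a, double *b, double s, int n)
    {
        int i;
        for (i = 0; i < n; i++)     /* worst-case body: stores kept ordered before loads */
            a[i] = b[i] * s;
    }

    void scale(double *a, double *b, double s, int n)
    {
        if (a + n <= b || b + n <= a)   /* operands provably disjoint */
            scale_noalias(a, b, s, n);
        else
            scale_general(a, b, s, n);
    }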

Both of these are legitimate example programs: in one case a smart compiler
can do a good job; in the other it is out of luck. I am interested in
getting good performance out of *all* compiled code, not just examples which
happen to work well for the machine. I don't want to argue about which
example is "more realistic"; the fact is, they are both "realistic". I don't
want anyone to claim that I said you can't generate good code for the 88k -
sometimes you can generate fantastically good code for it, but sometimes you
can't.
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com  USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley            511 Kingbird Circle
      or  ...!uunet!hcx1!tahorsley            Delray Beach, FL  33444
======================== Aging: Just say no! ========================

amull@Morgan.COM (Andrew P. Mullhaupt) (01/16/90)

In article <50855@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> In article <671@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
> 
> >But back to my original confusion - am I the only one with a BLAS
> >which unrolls its own loops? Given that I'm talking about double
> >precision arithmetic, should I really expect the compiler to find 
> >yet another factor of two? I'll believe it when I see it.
> 
> In older machines (those without any scalar pipeline) the only
> advantage of unrolling loops was to reduce loop overhead.  Now, with
> scalar pipelines, a good instruction scheduler can likewise take
> advantage of unrolling; that is, in a loop like a(i) = b(i)*s which is
> unrolled say 4 times, b(i:i+3) can be fetched at the start of the loop
> and put into 4 registers by 4 sequential loads (I assume using
> displacement addressing).  Then the four muls can be started
> sequentially, followed by 4 stores.  The time to do 4 loop iterations
> in this case should be only slightly more than the time to do one
> (with all cache hits).
> 
> Note that with more in the loop (like maybe two fetched vectors instead of
> one) and maybe an add, you use up all the registers real fast, especially
> with double precision.
> 
As a matter of fact, the first time I had to worry about unrolling loops was
on a CDC 6600 (it was delivered in 1963 - and was the third one built). Not
that I was programming it then - but that's how old the machine was. Now this
'box' had (if memory serves) eight arithmetic pipelines which could all be
running simultaneously: something along the lines of integer and floating
add, subtract, multiply, and divide (it wasn't exactly that - the exact
specification for the Cyber, a descendant machine, can be found in Michael
Metcalf's interesting book _FORTRAN Optimization_). I'm not sure how old loop
unrolling is, but the FORTRAN compiler for the CDC 6600 had it by the time I
got around to that machine. In fact this is one of the machines where hand-coded
assembler was as likely to slow code down as to beat the FORTRAN compiler's
output, because the compiler took care to schedule the pipes. It could even
move code across loops or function calls in order to schedule better. Now this
is at least a 15 year old compiler and a 27 year old machine, so the role of
loop unrolling is hardly new - and I'm somewhat disappointed in at least one
of the compilers I've run across for RISC.

For example: the Sun 4 compiler willfully punishes you if you unroll your loops
in the source. It doesn't unroll them for you either. The gcc-1.35 compiler
for the same machine quite understands what you want and you get as much as a
factor of 10 speedup. This proves that the hardware is not the problem. 

Can anyone who has an m88k and a C compiler check out what happens if you unroll
loops at the source level and post a short summary? 


Later,
Andrew Mullhaupt

khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/17/90)

In article <681@terminus.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:

>For example: the Sun 4 compiler willfully punishes you if you unroll
>your  loops in the source. It doesn't unroll them for you either.

Not true. The compiler DOES unrolling. f77 users have been able to see
this for a year or two. Since C has been bundled with the OS, getting
unrolling from C would have required you to install the f77 components
alongside your C compiler (or use some of the less well-known options).
The next C compiler will be available soon,
and will not require an OS upgrade.


--
Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

"There is NO defense against the attack of the KILLER MICROS!"
			Eugene Brooks

marvin@oakhill.UUCP (Marvin Denman) (01/17/90)

In article <34446@mips.mips.COM> , earl@wright.mips.com (Earl Killian) writes:

>Consider the application being discussed,
>matrix multiply, which is highly vectorizable.  If the original poster
>is correct in that the 88100, with its pipelined floating-point units,
>tops out in 6.7 mflop/s in single precision matrix multiplies, it
>really proves this point.  The MIPS R3000, with non-pipelined
>floating-point units, can do matrix multiplies at
>      			   25MHz	   33MHz
>	single		11.8 mflop/s	15.7 mflop/s
>	double		 7.8 mflop/s	10.4 mflop/s
>This an example of why MIPS perfers low-latency to pipelined fp.
>--
>UUCP: {ames,decwrl,prls,pyramid}!mips!earl
>USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

It should be noted that the 88k numbers you repeated are apparently at 20Mhz
and for the specific code fragment posted:
   DO 10 J = I,N
10 A(I,J) = A(I,J) + B(I,K) * C(K,J)

The numbers you posted for the R3000 are PROBABLY for a slightly different
code fragment:   ( I am more conversant in C so I will translate)

  for (i=0 ; i<MAXI ; i++)
    for (j=0 ; j<MAXJ ; j++)
      for (k=0, a[i][j]=0.0 ; k<MAXK ; k++)
        a[i][j] = a[i][j] + (b[i][k] * c[k][j]);

Is that true?

The inner loop written in this style can accumulate a[i][j] into a register and
remove the stores from the inner loop.  (Note that the assumption that the arrays
do not overlap is necessary)
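Spelled out (extending the fragment above, with the same assumptions about
MAXI/MAXJ/MAXK and non-overlapping arrays), that looks like:

  for (i=0 ; i<MAXI ; i++)
    for (j=0 ; j<MAXJ ; j++) {
      double sum = 0.0;             /* accumulate in a register */
      for (k=0 ; k<MAXK ; k++)
        sum += b[i][k] * c[k][j];
      a[i][j] = sum;                /* a single store per (i,j) */
    }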

When I recoded this inner loop for the 88100 unrolling the loop 8 times, I got 
10.8 Mflops at 25 Mhz and 14.4 Mflops at 33 Mhz for single precision.  This loop 
had only 1 cycle of stalling out of 37 cycles so the floating point latencies had 
a negligible effect.  How much was the inner loop unrolled for the R3000?  By my 
rough calculation I would suspect it would have to be 16 or so to get the 
numbers quoted.  This is probably a legitimate difference, but I would be 
interested to know if the extra unrolling is the cause of this difference.

-- 
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin

earl@wright.mips.com (Earl Killian) (01/18/90)

In article <2831@yogi.oakhill.UUCP>, marvin@oakhill (Marvin Denman) writes:
>It should be noted that the 88k numbers you repeated are apparently
>at 20Mhz

I see.  I included both 25 and 33MHz numbers because I wasn't sure
what clock to compare to.  I didn't think of 20.

>and for the specific code fragment posted:
>   DO 10 J = I,N
>10 A(I,J) = A(I,J) + B(I,K) * C(K,J)
>
>The numbers you posted for the R3000 are PROBABLY for a slightly different
>code fragment:   ( I am more conversant in C so I will translate)
>  for (i=0 ; i<MAXI ; i++)
>    for (j=0 ; j<MAXJ ; j++)
>      for (k=0, a[i][j]=0.0 ; k<MAXK ; k++)
>        a[i][j] = a[i][j] + (b[i][k] * c[k][j]);
>Is that true?

Yes, the matrix multiply library routine quoted uses an algorithm
close to the above (the appropriate algorithm for matrix multiply does
depend on the machine).  One difference from the above is that it
appears you're assuming the array bounds are known at compile-time,
which is not true for the library subroutine I used (the stride is a
parameter).  This makes the address arithmetic more expensive (it adds
a whole instruction per flop).  The second difference is that unrolling was
done on the middle loop, not the inner loop.

>When I recoded this inner loop for the 88100 unrolling the loop 8
>times, I got 10.8 Mflops at 25 Mhz and 14.4 Mflops at 33 Mhz for
>single precision.

What about double? ;-)

>The inner loop written in this style can accumulate a[i][j] into a
>register and remove the stores from the inner loop.
>...
>This loop had only 1 cycle of stalling out of 37 cycles so the
>floating point latencies had a neglible effect.

But accumulates into a[i][j] are dependent, and I thought the fp add
was 5 cycles, so 8 dependent fp adds should take a minimum of 40
cycles, true?  Did you convert to multiple parallel accumulation
registers to get around the fp latency?
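(By "multiple parallel accumulation registers" I mean something like the
following sketch - two partial sums shown, though hiding a 5-cycle add
probably takes four or more; n is assumed even, and the names are made up.)

  /* Split the reduction into independent partial sums so consecutive
   * fp adds are not dependent on each other, then combine at the end. */
  double dot2(const double *b, const double *c, int n)
  {
      double s0 = 0.0, s1 = 0.0;
      int k;

      for (k = 0; k < n; k += 2) {
          s0 += b[k]     * c[k];       /* these two adds are independent, */
          s1 += b[k + 1] * c[k + 1];   /* so they can overlap in the pipe */
      }
      return s0 + s1;                  /* combine the partials at the end */
  }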

>How much was the inner loop unrolled for the R3000?

The middle loop was unrolled 8 times.

Anyway, the point of my response to the original
	It will be interesting to see if MIPS goes to pipelining
	floating point instructions in future parts.
is that we're not going to add pipelining at the expense of latency,
because low latency lets you do two things well (scalar and vector),
whereas pipelining lets you do only vector well.  I was surprised that
a high-latency, highly pipelined machine like the 88100 actually
appeared to be slower on a vector problem than the R3000, and you
correctly pointed out that this was only because the originally posted code was
somewhat sub-optimal for the 88100.  On a vector problem, both
machines should be instruction-issue limited.  The latency or
pipelining required to run at peak rate is a function of the
instruction to flop ratio.  We try to keep our latencies below that
ratio, whereas the 88100 keeps its pipelining below that rate.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (01/19/90)

In article <34780@mips.mips.COM> earl@wright.mips.com (Earl Killian) writes:
>In article <2831@yogi.oakhill.UUCP>, marvin@oakhill (Marvin Denman) writes:
:
>What about double? ;-)
Good question.  I would like to see both single and double numbers, when
these comparisons come up.
>Anyway, the point of my response to the original
>	It will be interesting to see if MIPS goes to pipelining
>	floating point instructions in future parts.
>is that we're not going to add pipelining at the expense of latency,

I wonder what f.p. op counts are common these days.  For example, has anyone
done an analysis of the SPEC benchmarks to see what the (dynamic) instruction 
counts on various RISC machines are 
(like the ones we used to see for the IBM 360 :-)
and, also, what the sequences look like for things like the matrix-vector
version of Linpack (i.e. the most floating-point-intensive vectorizable codes).

Conjecture:

I would guess that you will see some potential benefit to pipelining add, but
not multiply or divide (as long as you don't have vector instructions). 

Suggestion:

Consider pipelining add by duplication of the add unit.  I think this approach
has benefits:  

You can reuse the same already-optimized adder design, and you avoid the
extra latency that most real-estate-saving pipelining methods add.

I think MIPSCo is correct to reduce latency first, but
I think there will be usable speedup if add is pipelined, according to
instruction analyses that I have seen previously for other architectures. 

The Motorola and MIPSCo products each represent reasonable design choices.
It is interesting to have real competition in a product area which was
lunatic fringe three or four years ago.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117