amull@Morgan.COM (Andrew P. Mullhaupt) (01/05/90)
Please send me any information, experience, sources, or tips you may have regarding m88000/i80386 combination systems such as the Opus Personal Mainframe 8120, the Everex Step 8825, or other similar systems.

Thanks in advance,

Andrew Mullhaupt
Morgan Stanley & Co., Inc.
1251 Ave. Americas
New York, NY 10020 (USA)
(212)-703-6948
rfg@ics.uci.edu (Ron Guilmette) (01/07/90)
In article <641@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>Please send me any information, experience, sources, tips
>you may have regarding the m88000/i80386 combination systems
>such as the Opus Personal Mainframe 8120, Everex Step 8825,
>or other similar systems.

Coincidentally, there is a write-up about such systems in the January 1990 issue of "MIPS" magazine (soon to be "Personal Workstation" magazine?). I haven't read it in detail yet, but there is also a separate article on page 56 ("Great Performers") where benchmarks of several current hardware offerings are given, along with price/performance evaluations.

The bottom line? Three of the five "Best Performers" in the category called "UNIX Workstations" are based on the 88000 (including the top two slots). In the "Best Price/Performance" list, the top four entries are all based on the 88000.

Two items worthy of note from the "Best Price/Performance" list:

The least expensive item on the list is the Data General AViiON workstation (even less expensive than the 386 add-ins).

The DG AViiON has far and away the best single-precision Whetstone performance, and it has much better double-precision performance than any of the 386 add-ins. This fact could be critical if you plan on doing any graphics or other numerically intensive computation.

Anybody who is now considering buying a "hot-box" would be well advised to have a look at this article before making a final choice.

// rfg
amull@Morgan.COM (Andrew P. Mullhaupt) (01/07/90)
In article <25A64468.11498@paris.ics.uci.edu>, rfg@ics.uci.edu (Ron Guilmette) writes:
> Coincidently, there is a write up about such systems in the January 1990
> issue of "MIPS" magazine (soon to be "Personal Workstation" magazine?).

No coincidence. I got interested in these boxes entirely because of that article (and the Byte magazine "first look" at the portable Opus). The performance of these toys is really wild. The only real difficulties I have are:

1. We have an extensive need for Berkeley extensions in our software. We also use Sun's memory-mapped files a whole lot. The System V alternative (shared memory) is OK, but we're pretty leery of any System V that isn't practically Release 4. Can I get close enough to SunOS with an AViiON (Everex 8825, Opus 8120, etc.)? If I can, I may very well get one.

2. The ratio of megaflops to MIPS sucks. Let me rephrase this: given that the 88000 is the only RISC chip with on-board floating-point support, you've got to wonder why it ends up being (relatively) so slow. Can you get an FPA for it? On the systems with the combined 88000/80386 CPUs, can you hang a quick Cyrix off the 80386, or a Weitek 3167? Or can you put a 4167 on the 88000? Does Motorola have some kind of remedy for those of us who like the looks of those soon-to-be-announced 486/860 systems, which will scream for floating point?

Let me make one sobering point here for those who still fail to apprehend the need for double-precision arithmetic outside of pure engineering and scientific work: it turns out that you cannot make the obvious split adjustments to stock prices in a portfolio, when your position in a stock can be over a hundred thousand dollars, without getting some unfortunate round-off effects which could, if you didn't catch them first, lead you to violate exchange rules or misfile taxes. The cost of such a mistake (which would be measured in your job) is large compared to whatever benefit that factor of two turns out to be in machines.

> The bottom line?
> Three of the five "Best Performers" in the category
> called "UNIX Workstations" are based on the 88000 (including the top
> two slots). In the "Best Price/Performance" list, the top 4 entries
> are all based on the 88000.
>
> Two items worthy of note from the "Best Price/Performance" list:
>
> The least expensive item on the list is the Data General
> AViiON workstation (even less expensive than the 386 add-ins).
>
> The DG AViiON has far and away the best single-precision
> Whetstone performance, and it has much better double-precision
> performance than any of the 386 add-ins. This fact could be
> critical if you plan on doing any graphics or other numerically
> intensive computation.
>
> Anybody who is now considering buying a "hot-box" would be well advised
> to have a look at this article before making a final choice.
>
> // rfg

Yeah, well, that DECstation 3100 kind of stomps these 88000 boxes for double precision. And the application benchmarks in that issue show just how nasty the threat is from the 486 (e.g. the Cheetah Gold is in the same class as these other machines, Weitek IS working on a floating-point coprocessor for the 486, and the Cheetah costs about $10,000 for the tested configuration). It's not really clear how the price/performance benchmark is arrived at, and the Dhrystone just doesn't represent what I need a box for. Right now I'm of a mind to get the 88000 if I can get good UNIX and some kind of floating-point help. Otherwise, it's back to square one. Oh well.

Please keep this subject alive - I think the 88000 is finally emerging beyond its established user base - and I think discussion could only help its chances.

Later,
Andrew Mullhaupt
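[Andrew's round-off point is easy to check numerically. A minimal sketch in C (an editorial illustration, not from the thread; the function name is made up): above $131,072 the gap between adjacent single-precision values is 2^-6 dollars, so a dollars-and-cents position of that size cannot even be stored to the nearest cent.

```c
#include <math.h>

/* How many dollars are lost when a position is rounded to single
   precision?  Floats in [2^17, 2^18) are spaced 2^-6 = $0.015625
   apart -- coarser than one cent -- so positions just over $131,072
   cannot be represented to the cent. */
double single_precision_error(double dollars)
{
    float f = (float)dollars;          /* round to nearest float */
    return fabs((double)f - dollars);  /* dollars lost to rounding */
}
```

For example, single_precision_error(131072.01) is about $0.0056 -- more than half a cent on a single position, which is exactly the kind of error that breaks a split adjustment; the same amount held as a double is exact to within about 1e-11 dollars.]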
wood@dg-rtp.dg.com (Tom Wood) (01/09/90)
In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>2. That ratio of Megaflops to MIPS sucks. Let me rephrase this. Given
>that the 88000 is the only RISC chip with onboard floating support,
>you've got to wonder why since it ends up being (relatively) so
>slow. Can you get an FPA for it? On the systems with the combined
>88000/80386 CPUs can you hang a quick Cyrix of the 80386, or a Weitek
>3167? or can you put a 4167 on the 88000? Does Motorola have some
>kind of remedy for those of us who like the looks of those soon to
>be announced 486/860 systems which will scream for floating point?

and later:

>Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
>double precision. And the application benchmarks in that issue show
>just how nasty the threat is from the 486 (e.g. the Cheetah Gold is
>in the same class as these other machines, and Weitek IS working on
>a floating point coprocessor for the 486. Also the Cheetah costs
>about 10,000 for the tested configuration.) It's not really clear
>how the price performance benchmark is arrived at, and the Dhrystone
>just doesn't represent what I need a box for. Right now I'm of a
>mind to get the 88000 if I can get good UNIX and some kind of
>floating point help. Otherwise, it's back to square one. Oh well.

I'd like to entertain a discussion on the FP performance of the 88k. I have yet to see a compiler that takes advantage of the pipeline on this machine to any extent. Theoretically, you can have 5 FP adds and 6 FP multiplies going on at once (if I understand correctly, the total here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies, and no more than 9 total). So how would you feel if someone were able to boost Mflops by a factor of, say, 3 (or better) by improving the compiler technology?

Here's a sample of what I'm talking about.
These are computed values for the matrix multiply inner loop:

      DO 10 J = 1,N
   10 A(I,J) = A(I,J) + B(I,K)*C(K,J)

Code Generation Technique     Cycles/iteration    Mflops

  Naive code                  19                  2.10
  Naive code, 2 unrolls       35/2                2.28
  Sophisticated, 4 unrolls    28/4                5.71
  Sophisticated, 8 unrolls    48/8                6.67

Well, how 'bout it!?
---
Tom Wood	(919) 248-6067
Data General, Research Triangle Park, NC
{the known world}!rti!xyzzy!wood
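[As a rough source-level picture of what the "4 unrolls" row means (an editorial sketch, not DG's code generator: C with unit stride and a made-up name; the real loop strides through column-major Fortran arrays), the body is replicated four times so the compiler can interleave independent multiply/add/load/store chains in the FP pipelines:

```c
/* 4-way hand-unrolled form of the inner loop A(I,J) += B(I,K)*C(K,J),
   restated with unit stride for clarity.  Each of the four statements
   in the unrolled body is independent of the others, which is what
   lets the scheduler overlap them in the add and multiply pipes. */
void axpy_unrolled4(double *a, const double *c, double b_ik, int n)
{
    int j;
    for (j = 0; j + 3 < n; j += 4) {
        a[j]     += b_ik * c[j];
        a[j + 1] += b_ik * c[j + 1];
        a[j + 2] += b_ik * c[j + 2];
        a[j + 3] += b_ik * c[j + 3];
    }
    for (; j < n; j++)       /* cleanup when n is not a multiple of 4 */
        a[j] += b_ik * c[j];
}
```

The unrolling by itself only reduces loop overhead; the jump from 2.28 to 5.71 Mflops in the table comes from the "sophisticated" scheduling of the unrolled body.]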
manson@sphere.eng.ohio-state.edu (Robert Manson) (01/09/90)
In article <1879@xyzzy.UUCP> wood@gen-rtx.dg.com (Tom Wood) writes:
>
>I'd like to entertain a discussion on the FP performance of the 88k.
>I have yet to see a compiler that takes advantage of the pipeline
>on this machine to any extent. [...]
>So how would you feel if someone were able to
>boost Mflops by a factor of say 3 (or better) by improving the compiler
>technology?

Seems to me that everybody'd be better off with a smarter assembler. After all, the changes that we're talking about should be possible by assembly-code analysis, although I doubt that the level of optimization achieved would be as good as a smart compiler's. I'm currently working on such an assembler, but I would hope that such a beastie would be available commercially. The advantage of doing it in the assembler is that every compiler then gets a performance boost, and it also benefits any crazed humans that still like/need to program in assembly.

> DO 10 J = 1,N
> 10 A(I,J) = A(I,J) + B(I,K)*C(K,J)

Depending on the loop construction, I could see this happening in such a smart assembler, although it would be easier in the compiler. I would have to agree that the lack of a good optimizing compiler for the 88k is a major lack: the big gain in FP code on the 88k is the parallelization that can occur.

> Tom Wood	(919) 248-6067
> Data General, Research Triangle Park, NC
> {the known world}!rti!xyzzy!wood

Bob
manson@cis.ohio-state.edu
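[The kind of reordering such a scheduler would try to achieve can be sketched at the source level (an editorial C illustration, not from the thread; `dot` and its shape are made up): splitting a reduction across two accumulators breaks the add-to-add dependency chain, so independent multiplies and adds can overlap in the pipelines.

```c
/* Two independent accumulators expose the instruction-level
   parallelism a scheduling assembler or compiler wants: s0's and
   s1's multiply/add chains have no dependency on each other, so
   they can be interleaved in the FP add and multiply pipes. */
double dot(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i] * y[i];          /* chain 0 */
        s1 += x[i + 1] * y[i + 1];  /* chain 1, independent of chain 0 */
    }
    if (i < n)                      /* odd-length cleanup */
        s0 += x[i] * y[i];
    return s0 + s1;
}
```

Note that this reassociates the floating-point sum, which can change rounding slightly; that is one reason a compiler, which knows the language rules, has an easier time of it than a bare assembler.]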
alan@oz.nm.paradyne.com (Alan Lovejoy) (01/09/90)
In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>Now the performance of these toys is really wild. The only real
>difficulties I have are:
>1. We have extensive need for Berkeley extensions in our software.
>We also use Sun's memory mapped files a whole lot. The System V
>alternative (shared memory) is OK, but we're pretty leery of any
>System V that isn't practically Release 4. Can I get close enough to
>Sun OS with an Aviion (Everex 8825, Opus 8120, etc.) If I can
>I may very well get one.
Motorola has been making a rather big noise for the past year and a half
about the "fact" that SVR4 with BCS would be available "first" for the 88k.
I have no idea whether they have or will deliver on this claim. I suggest
you ask Moto/DG/Tektronix/Opus/Everex/AT&T/Unix International.
>2. That ratio of Megaflops to MIPS sucks. Let me rephrase this. Given
>that the 88000 is the only RISC chip with onboard floating support,
>you've got to wonder why since it ends up being (relatively) so
>slow. Can you get an FPA for it? On the systems with the combined
>88000/80386 CPUs can you hang a quick Cyrix of the 80386, or a Weitek
>3167? or can you put a 4167 on the 88000? Does Motorola have some
>kind of remedy for those of us who like the looks of those soon to
>be announced 486/860 systems which will scream for floating point?
Double precision FP is slow primarily because the 88k does not have 64-bit data
paths internally. That is the price Moto paid for putting the FPU on the same
chip as the IPU. The benefits they get are: 1) no need to shuffle data between
integer and fp registers; 2) standardized fp instruction set; 3) assurance for
SW developers that all 88k systems will have HW FP, and 4) you can buy 1 88100
and 2 88200's at 16MHz for $499 (in lots of 1000, of course); try matching that
price/performance ratio with ANY other CPU. Also, they started out several
years behind MIPS with the Rx000 and 9 months behind SPARC. They are now in
production with 33MHz CMOS parts; MIPS and the SPARC gang are not.
Moto is obviously aiming the 88k at the mass market as a direct replacement
of the 68k. MIPS is aiming at the very high end (for example, with the
R6000). The next generation of the 88k will be aimed at the high end, while
the current generation will be priced to capture the low and medium market
segments. There is nothing in the 88k architecture to prevent Motorola from
using 64 (or even 128) bit data paths and superscalar pipelining in the
next generation 88k. Should happen within the next year and a half, probably
sooner rather than later (the current generation is almost two years old now,
after all.) I don't think that the competition will be able to match Moto's
prices on the current generation. But who knows?
>Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
>double precision. And the application benchmarks in that issue show
>just how nasty the threat is from the 486 (e.g. the Cheetah Gold is
>in the same class as these other machines, and Weitek IS working on
>a floating point coprocessor for the 486. Also the Cheetah costs
>about 10,000 for the tested configuration.) It's not really clear
>how the price performance benchmark is arrived at, and the Dhrystone
>just doesn't represent what I need a box for. Right now I'm of a
>mind to get the 88000 if I can get good UNIX and some kind of
>floating point help. Otherwise, it's back to square one. Oh well.
Buy a machine and you're buying into an architecture for quite some time,
as many purchasers of IBM/MS-DOS systems have found out. Of course UNIX
helps in this regard, but only so much. These machines are all within
a factor of two in performance, WHICH COULD BE DUE TO SOFTWARE FACTORS
SUCH AS CODE GENERATORS. You should consider all factors, not just
performance differences no greater than 2x. How fast can each architecture's
performance be increased? Which one has the best staying power in the
market? Which one has achieved (or will achieve) the strongest market
position and standardization?
No matter which vendor you buy an 88k box from, you'll get the same
COMPATIBLE FPU. And it won't cost extra.
____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne. They do not speak for me.
Mottos: << Many are cold, but few are frozen. >> << Frigido, ergo sum. >>
soper@maxzilla.encore.com (Pete Soper) (01/09/90)
From article <75406@tut.cis.ohio-state.edu>, by manson@sphere.eng.ohio-state.edu (Robert Manson):
> such a beastie would be available commercially. The advantage of doing
> it in the assembler is that then every compiler gets a performance
> boost, and it also benefits any crazed humans that still like/need to
> program in assembly.

You mean every compiler that generates assembler output. Many do not do this by default, or even at all.

> I would have to agree that lack of a good optimizing compiler for
> the 88k is a major lack-the big gain in FP code on the 88k is the
> parallelization that can occur.

Both GNU C and Green Hills C/C++/F77/Pascal are optimizing compilers that have 88k code generators available. Surely both have to do instruction scheduling of some sort to support the 88k. Perhaps this area needs more work? Is the 860 so much faster because of raw performance, or does it have the same pipeline issues and a compiler that more effectively supports them?

Sort of on this subject: is GNU C the only C compiler shipped with the DG box, or is it an alternative to Green Hills? Assuming GNU C is "it", does it play well with Green Hills Fortran, which I'm assuming is still the official Fortran product? Has DG extended gdb to cover both languages, or is another debugger used with their Fortran product?
----------------------------------------------------------------------
Pete Soper                                   +1 919 481 3730
internet: soper@encore.com  uucp: {bu-cs,decvax,gould}!encore!soper
Encore Computer Corp, 901 Kildaire Farm Rd, bldg D, Cary, NC 27511 USA
tom@ssd.csd.harris.com (Tom Horsley) (01/09/90)
>I'd like to entertain a discussion on the FP performance of the 88k.
>I have yet to see a compiler that takes advantage of the pipeline
>on this machine to any extent. Theoretically, you can have 5 FP adds
>and 6 FP multiplies going on at once (if I understand correctly, the total
>here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies and
>no more than 9 total). So how would you feel if someone were able to
>boost Mflops by a factor of say 3 (or better) by improving the compiler
>technology?

This may be true for single precision, but it is hard to see how you can get the pipe full for double precision. Any instruction with a double-precision source operand requires two (count 'em, 2) cycles before the 88k will even bother looking at the next instruction. Then for double-precision float instructions there are two cycles required in the first FP1 pipe stage (although one of these FP1 cycles can overlap with the last of the two decode cycles, so perhaps this is not so bad).

>Code Generation Technique     Cycles/iteration    Mflops
>
>  Naive code                  19                  2.10
>  Naive code, 2 unrolls       35/2                2.28
>  Sophisticated, 4 unrolls    28/4                5.71
>  Sophisticated, 8 unrolls    48/8                6.67
>
>Well, how 'bout it!?

In your example, even if everything is pipelined, the minimum number of instructions that seem to be required just to do the computation is:

   instruction    number    cycles
   addu           2         2        loop overhead
   bb1            1         1
   cmp            1         1
   fadd.ddd       8         16       loop body
   fmul.ddd       8         16
   ld.d           16        16
   st.d           8         16
   -------------------------------
                            68

As near as I can tell, 68 is not equal to 48. Do you have actual assembler code that does this inner loop in 48 cycles? Could you post it? As near as I can tell, this example does not work out as well as the original poster implied.

Couple this with the real-world fact (known even by Cray users with heavy-duty vectorizing compilers) that an awful lot of real-world algorithms have dependencies on previous results.
No matter how good your compiler is, it cannot pipeline these algorithms, because the next thing depends on the last thing. (Obviously it is worth the trouble to pipeline when you can; I am just saying it is not always possible.)

Another note said something about doing these sorts of optimizations at the assembly level. This is also likely to turn out to be very hard. The code generated by the compiler is very likely to have the st.d instruction right after the fadd.ddd instruction and right before the next set of ld.d instructions. Unless the assembler is equipped to do enough symbolic execution to prove that there is no aliasing, it is going to have to leave the st.d in front of the next set of ld.d instructions. This effectively serializes the code, since the thing being stored is the result of the fadd, and there are very few things that can be reordered to fill pipeline slots.

For the highest performance in all cases, give me the float unit with the highest raw speed; pipelining only works if my algorithm is suitable, raw speed always works.

Note: if the sample code had a divide instruction in it, it would be orders of magnitude worse. Divides are *really* awful (they can't even be pipelined).

Note note: I am not fundamentally against the 88k. In fact, I like it. I just wish the double-precision performance were better. The main reason to buy an 88k box over and above a MIPS or 486 hot box is the existence of the BCS standard. DEC has effectively shot MIPS in the foot by deciding to run their boxes with the bytes backward. This makes it nearly impossible to imagine a useful BCS ever happening across the full line of MIPS-based boxes.
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com       USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley                 511 Kingbird Circle
    or  ...!uunet!hcx1!tahorsley                   Delray Beach, FL 33444
======================== Aging: Just say no! ========================
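[The aliasing hazard Tom describes is easy to see at the source level (an editorial C sketch; `scale_into` is a made-up name). If the assembler hoisted the next iteration's load above the store, it would compute the wrong answer whenever the two arrays overlap:

```c
/* If a and b overlap, the store to a[i] feeds the load of b[i+1]
   on the next iteration.  An assembler that cannot prove a and b
   are distinct must keep every store before the following load,
   serializing the loop exactly as described above. */
void scale_into(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 2.0;
}
```

For example, with double x[4] = {1, 2, 3, 0}, the call scale_into(x + 1, x, 3) must produce {1, 2, 4, 8}, because each store changes the next load's operand; hoisting all three loads ahead of the stores would have produced {1, 2, 4, 6}.]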
amull@Morgan.COM (Andrew P. Mullhaupt) (01/10/90)
In article <1879@xyzzy.UUCP>, wood@dg-rtp.dg.com (Tom Wood) writes:
> In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>
> >2. That ratio of Megaflops to MIPS sucks. Let me rephrase this. Given
> >that the 88000 is the only RISC chip with onboard floating support,
> >you've got to wonder why since it ends up being (relatively) so
> >slow.
>
> and later:
>
> > ...Right now I'm of a
> >mind to get the 88000 if I can get good UNIX and some kind of
> >floating point help. Otherwise, it's back to square one. Oh well.
>
> I'd like to entertain a discussion on the FP performance of the 88k.
> I have yet to see a compiler that takes advantage of the pipeline
> on this machine to any extent. Theoretically, you can have 5 FP adds
> and 6 FP multiplies going on at once (if I understand correctly, the total
> here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies and
> no more than 9 total). So how would you feel if someone were able to
> boost Mflops by a factor of say 3 (or better) by improving the compiler
> technology?
>
> Here's a sample of what I'm talking about. These are computed values
> for the Matrix multiply inner loop:
>
>       DO 10 J = 1,N
>    10 A(I,J) = A(I,J) + B(I,K)*C(K,J)
>
> Code Generation Technique     Cycles/iteration    Mflops
>
>   Naive code                  19                  2.10
>   Naive code, 2 unrolls       35/2                2.28
>   Sophisticated, 4 unrolls    28/4                5.71
>   Sophisticated, 8 unrolls    48/8                6.67
>
> Well, how 'bout it!?

A man after my own heart! I just finished bitching and moaning at the local C experts because the Sun 4 cc compiler produces the most stupid code I've (or, after they saw it, they've) ever seen for the loop unrolling you've described. You actually give up a factor of three for no known reason! On the same hardware, gcc will take advantage of unrolled loops (e.g. Duff's device) to full effect. Too bad that there are situations which go the other way 'round.
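[For reference, the "Duff's device" Andrew mentions is the classic 8-way unrolled copy whose switch jumps into the middle of the loop body to handle the remainder (restated here from memory as an ordinary memory-to-memory copy; Duff's original wrote to a fixed device register, and count must be greater than zero):

```c
/* Classic Duff's device: the switch dispatches on count % 8 to jump
   into the unrolled do-while body, so the remainder bytes and the
   full 8-byte groups share one piece of code.  Assumes count > 0. */
void duff_copy(char *to, const char *from, int count)
{
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```
]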
You will find that another case for local optimization where RISC is often vulnerable is the inlining of memcpy (strncpy, etc.). You want to 'unroll' this guy into int or even double transfers, but you've got to walk on eggs for alignment to support the full semantics. The 386/486 boxes are pretty good at this, and the SCO UNIX compiler (cc) for the 386 inlines a handful of standard library functions and then generates some pretty smart assembler code. (It is necessary to point out that this behavior can be switched in and out by command line argument and preprocessor pragma - so if you depend on your own memcpy, etc., then you won't get hurt by an overzealous optimizer...)

Now consider this code running on the 486. It's well known that the 486 can run all the 386 code (well, if you've got a non-broken step 6 486, at least), but it is also almost as well known that the code sequences which are optimal for the 386 and 486 are sometimes different. There is even the question of code generation for the Cyrix replacement for the 80387 chip. It runs all the 80387 code unmodified, but there are ways to get the Cyrix to go another factor of two faster by generating different code. There are compilers and libraries to take advantage of these situations, but I know of none for the 88000. On the other hand, I have heard that the 88000 is someday going to have a wider data path to its floating point pipelines. Sounds like a good idea to me.

So have you got a compiler which generates optimal code to get the other factor of two or three out of my code? Remember - I've already unrolled my loops, aligned my structures, and taken advantage of the FORTRAN calling sequence. Just like the Linpack benchmarks.

Later,
Andrew Mullhaupt
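[The alignment egg-walking Andrew describes might be sketched like this (an editorial, deliberately naive version in modern C, not SCO's inlined code): copy bytes until the destination is word-aligned, switch to 32-bit transfers only if the source is then aligned too, and finish any tail with bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* memcpy-style copy unrolled into word transfers, preserving the
   full byte-copy semantics for unaligned or short buffers. */
void *copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Byte-copy until dst is 4-byte aligned (or we run out). */
    while (n > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Word-copy only if src happens to be aligned as well. */
    if (((uintptr_t)s & 3) == 0) {
        while (n >= 4) {
            *(uint32_t *)d = *(const uint32_t *)s;
            d += 4;
            s += 4;
            n -= 4;
        }
    }
    /* Tail bytes. */
    while (n-- > 0)
        *d++ = *s++;
    return dst;
}
```

A real inlined version would also handle the mutually-misaligned case (src and dst aligned differently) with shifts rather than falling back to bytes; that is where most of the egg-walking lives.]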
rfg@ics.uci.edu (Ron Guilmette) (01/10/90)
In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>
>1. We have extensive need for Berkeley extensions in our software.
>We also use Sun's memory mapped files a whole lot. The System V
>alternative (shared memory) is OK, but we're pretty leery of any
>System V that isn't practically Release 4. Can I get close enough to
>Sun OS with an Aviion (Everex 8825, Opus 8120, etc.) If I can
>I may very well get one.

DG/UX on the AViiON has lots of popular BSD extensions like long filenames, symbolic links, memory mapped files, and probably many others I don't know about.

>Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
>double precision. And the application benchmarks in that issue show
>just how nasty the threat is from the 486 (e.g. the Cheetah Gold is

I don't know where you are getting your numbers. The 3100 didn't even make either of the "Best Performance" or "Best Price/Performance" lists in that article, so the numbers for the 3100 were not even shown.

What was shown, however, were the single and double precision Whetstone numbers for MIPS's own MIPS-based R2030 system (which I would think should be quite similar to the DEC product in terms of performance). These independently published numbers clearly show that the AViiON beats the hell out of MIPS-based systems on single-precision Whetstones and loses by only about 10% on double precision. I would hardly call that 10% "stomping". You probably would never even notice the difference in practice.

Also, please correct me if I'm wrong, but doesn't the 3100 cost about twice as much?

Finally, note that the application benchmark numbers shown in that article were possibly somewhat misleading because they were probably done with DG/UX 4.10, which came with a horrible implementation of malloc() in libc.a. Most good-sized C applications rely heavily on a good fast malloc() and can suffer dramatically if they are linked with a malloc which has poor performance.
The malloc implementation has been totally replaced in DG/UX 4.20. It's light-years better now.

// rfg
rfg@ics.uci.edu (Ron Guilmette) (01/10/90)
In article <10825@encore.Encore.COM> soper@maxzilla.encore.com (Pete Soper) writes:
>From article <75406@tut.cis.ohio-state.edu>, by manson@sphere.eng.ohio-state.edu (Robert Manson):
>>
>> I would have to agree that lack of a good optimizing compiler for
>> the 88k is a major lack-the big gain in FP code on the 88k is the
>> parallelization that can occur.
>
> Both GNU C and Green Hills C/C++/F77/Pascal are optimizing compilers that
>have 88k code generators available. Surely both have to do instruction
>scheduling of some sort to suport the 88k.

Yes. You could call it "instruction scheduling", I suppose. A better term might be "naive instruction scheduling". Attempts to do "sophisticated" instruction scheduling for these sorts of machines are still mostly research projects (with the notable exception of the MultiFlow systems).

>Perhaps this area needs more work?

Gee! No kidding?

>Is the 860 so much faster because of raw performance or does it
>have the same pipeline issues and a compiler that more effectively supports
>them?

It has many of the same pipelining opportunities and pitfalls. As far as I know, it does not have good "sophisticated" compilers yet. It is not even clear to me that its performance (scaled to clock frequency) is that much better than the 88k's. I have yet to see any performance numbers for the i860 except those published by Intel. Does any other source have published numbers?

> Sort of on this subject, is GNU C the only C compiler shipped with the
>DG box, or is it an alternative to Green Hills? Assuming GNU C is "it",
>does it play well with Green Hills Fortran, which I'm assuming is still
>the official Fortran product? Has DG extended gdb to cover both languages
>or is another debugger used with their Fortran product?

Why would you think that GDB would have to be extended? A breakpoint on a line is a breakpoint on a line, no? A "list" command lists some source lines, yes? What's the difference if it's FORTRAN or C?

// rfg
mash@mips.COM (John Mashey) (01/10/90)
I don't usually comment in this newsgroup, but there was enough (mis)information in the following that I had to comment:

In article <6915@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>Double precision FP is slow primarily because the 88k does not have 64-bit data
>paths internally. That is the price Moto paid for putting the FPU on the same
>chip as the IPU. The benefits they get are: 1) no need to shuffle data between
>integer and fp registers; 2) standardized fp instruction set; 3) assurance for
>SW developers that all 88k systems will have HW FP, and 4) you can buy 1 88100
>and 2 88200's at 16MHz for $499 (in lots of 1000, of course); try matching that
>price/performance ratio with ANY other CPU. Also, they started out several
>years behind MIPS with the Rx000 and 9 months behind SPARC. They are now in
>production with 33MHz CMOS parts; MIPS and the SPARC gang are not.

They paid a price to put it on the same chip, and it's a legitimate choice. However:

2) MIPS and SPARC certainly have standardized instruction sets.

3) MIPS (at least) supports complete IEEE emulation in the UNIX kernel, so one does not need extra flavors of binaries.

4) Try matching that price/performance: you can put together the core of a system of similar performance, including CPU, FPU, MMU, 128K caches, and glue, for about $400 or less [I've heard of one, in large quantities, as low as $250, although that might have been a little slower]. Things like the IDT 3001 reduce the cost even more, and having less cache helps too.

a) For some kinds of algorithms, you'd really like more FP regs, which MIPS, SPARC, and HP PA have.

b) As it stands, with the natural 32-bit organization of the register file, you either:
   1) need more cycles to read/write operands [what the 88K did], OR
   2) need more read and write ports, especially to accommodate multiple-cycle operations whose results come back later.
One of the reasons MOST people have separate FP and integer registers is to: 1) organize FP as 64-bit, and 2) have more read and write ports to accommodate heavily-overlapped FP operations. Ports cost, sooner or later.

Sun & Solbourne are shipping (SPARC) systems at 33MHz; Stardent is shipping (MIPS) at 32MHz (to be fair, all just recently). I have no idea how many there are of these things, as I haven't tried to buy them lately. If there are lots of 33MHz 88Ks actually shipping out there in systems, we haven't seen them, although they certainly exist, and have been benchmarked. Needless to say, always measure performance on real programs, not clock rate: if only clock rate counted, 50MHz 68030s would be ahead of all the other chips mentioned (and they're not).

>Moto is obviously aiming the 88k at the mass market as a direct replacement
>of the 68k. MIPS is aiming at the very high end (for example, with the

I'm not privy to Moto's plans and aims, but this is a clear statement of MIPS' direction... which is 100% wrong, and why I posted this. Many MIPS chips go into embedded control (laser printers, telephone switches, avionics, autos, etc.), for example, and if we can't get to the lowest part of that, we're sure interested in the high-performance part, as well as workstations and small servers. We do the high end (R6000) besides, but why does that make anybody think we're uninterested in the low end? MIPS partners who do a lot of embedded work say they fight all of the time with Intel 960s, sometimes with 29Ks, and seldom with the 88K.... (now, that's anecdote, and hard to check, but...)

>R6000). The next generation of the 88k will be aimed at the high end, while
>the current generation will be priced to capture the low and medium market
>segments. There is nothing in the 88k architecture to prevent Motorola from

Well, they'll have to get the prices down to beat what you can do with standard SRAMs and on-chip cache control.
And it's going to be real tough for the part of the embedded market that doesn't care about FP, because people can sell equivalent-performance chipsets for about half the price.

>using 64 (or even 128) bit data paths and superscalar pipelining in the
>next generation 88k. Should happen within the next year and a half, probably
>sooner rather than later (the current generation is almost two years old now,
>after all.) I don't think that the competition will be able to match Moto's

Where does 2 years come from? Do you count announcements? It is well known that until about 6-8 months ago, nobody could even ship production systems, due to the crippling FP bugs that only got fixed then.

>prices on the current generation. But who knows?

The combined register set, as described above, will not help superscalaring much.... "But who knows?": lots of people.

Note that there's an awful lot of misinformation and speculation floating around here, presented authoritatively. Of use, when they appear, will be the next set of SPEC numbers, which help give a more realistic assessment of performance than the (increasingly unreliable/gimmickable) *stones. Anyway, the 88K is a credible and respectable chipset, but claims like "it costs $X and nothing else is close" (without giving any data from anything else), and claims of "wonderful things will happen soon, and no one else could match them", are marketing claims, not technical ones.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (01/10/90)
In article <25AAE835.16940@paris.ics.uci.edu> rfg@ics.uci.edu (Ron Guilmette) writes:
>>Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for
>>double precision. And the application benchmarks in that issue show
>>just how nasty the threat is from the 486 (e.g. the Cheetah Gold is
>
>I don't know where you are getting your numbers. The 3100 didn't even
>make either of the "Best Performance" or "Best Price/Performance" lists
>in that article, so the numbers for the 3100 were not even shown.

But they have been shown, as on page 36 of the January 1990 issue, and earlier ones. I suggest that of the various FP benchmarks, the most representative is the Livermore Loops, where the DS3100 yielded 1.928 MFLOPS (DP) versus 1.48 MFLOPS for the Step 8825 (25MHz).

>What was shown however were the single and double precision Whetstone
>numbers for MIPS's own MIPS-based R2030 system (which I would think
>should be quite similar to the DEC product in terms of performance).
>These independently published numbers clearly show that the AViiON
>beats the hell out of MIPS-based systems on single precision Whetstones
>and loses by only about 10% on double precision. I would hardly call
>that 10% "stomping". You probably would never even notice the difference
>in practice.

Q: Do Whetstones correlate with real performance on real floating-point programs?
A: Not very well.

>Also, please correct me if I'm wrong, but doesn't the 3100 cost about
>twice as much?

No; the Everex described in the article cost $21,995, and the Opus $15,075. I don't have the numbers handy for the DS3100, but it's in the same ballpark; of course the Everex/Opus have a 386 & MS-DOS as well, but also don't have as big a screen, I think, so there are generally some apples/oranges comparisons in both directions, depending on what you want.
> >Finally, note that the application benchmark numbers shown in that article >were possibly somewhat misleading because they were probably done with >DG/UX 4.10 which came with a horrible implementation of malloc() in libc.a. >Most good sized C applications rely heavily on a good fast malloc() and can >suffer dramatically if they are linked with a malloc which has poor >performance. > >The malloc implementation has been totally replaced in DG/UX 4.20. It's >light-years better now. Malloc only appears explicitly once in the whole set of benchmarks, as part of initialization....Also, understand that these benchmarks are mixtures of small synthetics that try to model different environments: they are NOT applications. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
andrew@frip.WV.TEK.COM (Andrew Klossner) (01/10/90)
[] "Why would you think that GDB would have to be extended? A breakpoint on a line is a breakpoint on a line, no? A "list" command lists some source lines, yes? What's the difference if it's FORTRAN or C?"

We've had to do considerable work extending gdb to play with Green Hills Fortran. One big difference is that Fortran arrays are stored in column-major order. Another is that Fortran parameters are passed by reference. Still another is that Fortran passes string parameters as (address, length). To support Fortran application programmers, it's inadequate to let GDB display this information as though the C language model were in effect.

-=- Andrew Klossner (uunet!tektronix!frip.WV.TEK!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
guy@auspex.auspex.com (Guy Harris) (01/11/90)
> 3) MIPS (at least) supports complete IEEE emulation in the UNIX > kernel, so one does not need extra flavors of binaries. As does SunOS for SPARC machines.
marvin@oakhill.UUCP (Marvin Denman) (01/11/90)
In article <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
>> ...
>> Discussion by Tom Wood of Data General about the possibility of boosting
>> Mflops by a factor of 3 or better with improved compiler technology.
>> ...
>This may be true for single precision, but it is hard to see how you can get
>the pipe full for double precision. Any instruction with a double precision
>source operand requires two (count'em 2) cycles before the 88k will even
>bother looking at the next instruction. Then for double precision float
>instructions there are two cycles required in the first FP1 pipe stage
>(although one of these FP1 cycles can overlap with the last of the two
>decode cycles, so perhaps this is not so bad).

Two cycles to issue a double precision operation is an artifact of the 88100 implementation. The penalty is only two cycles for initiating and terminating instructions, though. The pipes generally compress out bubbles, so any stalls at the end of the pipe are usually hidden unless the pipe is full for some reason.

>>Code Generation Technique      Cycles/iteration   Mflops
>>  Naive code                         19            2.10
>>  Naive code, 2 unrolls              35/2          2.28
>>  Sophisticated, 4 unrolls           28/4          5.71
>>  Sophisticated, 8 unrolls           48/8          6.67
>
>In your example, even if everything is pipelined, the minimum number of
>instructions that seem to be required just to do the computation is:
>
>instruction   number   cycles
>  addu           2        2    loop overhead
>  bb1            1        1
>  cmp            1        1
>  fadd.ddd       8       16    loop body
>  fmul.ddd       8       16
>  ld.d          16       16
>  st.d           8       16
>-----------------------------
>                         68
>
>As near as I can tell, this example does not work out as well as the
>original poster implied. Couple this with the real world fact (known even
>by Cray users with heavy duty vectorizing compilers) that an awful lot of
>real world algorithms have dependencies on previous results. No matter how
>good your compiler is, it cannot pipeline these algorithms, because the next
>thing depends on the last thing.
>(Obviously it is worth the trouble to
>pipeline when you can, I am just saying it is not always possible).

The example in question was obviously for single precision. The 68-cycle figure appears to be approximately correct for best-case double precision, assuming the loop in double precision can be unrolled 8 times before running out of registers. One clock could probably be saved in this case by optimizing the loop to use bcnd instead of the compare and branch sequence.

I haven't coded this loop, but I have unrolled similar loops such as Linpack. Comparing 68 cycles to 19 cycles is not an apples to apples comparison. The naive code would also be slowed somewhat by using double precision. As a first guess I would say that the ratio of unrolled code to naive code will still be close to 3. Compilers have much room for improvement, particularly in floating point numerical code. The current compilers do very little scheduling and no unrolling of loops that I am aware of. Just scheduling operations with latencies greater than 1 will improve performance significantly. Unrolling loops will make a large difference in this type of code.

Data dependencies between iterations of a loop are a very significant problem with unrolling loops. Hopefully the compiler will recognize the nondependencies well enough to unroll most loops that can be unrolled. I agree that on some loops there are dependencies that hinder unrolling. If these can be identified, though, the compiler may even be able to remove redundant loads. There is so much room for improvement that I find it difficult to be pessimistic about the amount of improvement that is possible.

>For highest performance in all cases, give me the float unit with the
>highest raw speed, pipelining only works if my algorithm is suitable, raw
>speed always works.

I disagree. I think that unless the latency is very short (2 or maybe 3 cycles), pipelining will pay off on a normal application mix.
The longer the latency, the more likely it is that you will want to unroll or reschedule code. It will be interesting to see if MIPS goes to pipelining floating point instructions in future parts. -- Marvin Denman Motorola 88000 Design cs.utexas.edu!oakhill!marvin
amull@Morgan.COM (Andrew P. Mullhaupt) (01/11/90)
In article <25AAE835.16940@paris.ics.uci.edu>, rfg@ics.uci.edu (Ron Guilmette) writes: > In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes: > > > >Yeah, well that DecStation 3100 kind of stomps these 88000 boxes for > >double precision. And the application benchmarks in that issue show > >just how nasty the threat is from the 486 (e.g. the Cheetah Gold is > > I don't know where you are getting your numbers. The 3100 didn't even > make either of the "Best Performance" or "Best Price/Performance" lists > in that article, so the numbers for the 3100 were not even shown. > > What was shown however were the single and double precision Whetstone > numbers for MIPS's own MIPS-based R2030 system (which I would think > should be quite similar to the DEC product in terms of performance). > These independently published numbers clearly show that the AViiON > beats the hell out of MIPS-based systems on single precision Whetstones > and looses by only about 10% on double precision. I would hardly call > that 10% "stomping". You probably would never even notice the difference > in practice. > Right you are, and then wrong again. I was looking at the comparison between the DecStation 3100 and the Opus 8120 and Everex 8825 on page 36, where the somewhat more powerful Aviion is not present. I see a consistent 30% victory in the Double precision Linpack and Livermore, which is what I care about. It would appear that the Aviion is less vulnerable than the Opus and Everex systems. On the other hand you're wrong if you think I'd miss that 30% in practice. I have runs planned for machines which take 50 hours of 3090 CPU and I have to find out if they can go on workstations. Given that you lose a lot of the vector and large scale cache advantage, plus the large scale use of hand coded assembly (ESSL) on the 3090, I'm looking at thousands of hours of CPU on a workstation class machine. Nearly all of it is double precision floating point. 
Even if it's as little as 10%, I think a hundred hours of CPU is noticeable.

> Finally, note that the application benchmark numbers shown in that article
> were possibly somewhat misleading because they were probably done with
> DG/UX 4.10 which came with a horrible implementation of malloc() in libc.a.
> Most good sized C applications rely heavily on a good fast malloc() and can
> suffer dramatically if they are linked with a malloc which has poor
> performance.
>
> The malloc implementation has been totally replaced in DG/UX 4.20. It's
> light-years better now.

Well, I wasn't paying much attention to them, but I think the way they come up with the overall ranking is weird, because the AViiON beats the 'Best Performers' in all categories but two: Financial and Dhrystone 2. I think they're putting too much weight on the Dhrystone, and this lands the AViiON in 4th place after the MIPS RS2030. (Go figure.)

Also: you may have the impression that I am primarily concerned with FORTRAN applications. This is sort of correct, in that you generally use Linpack and Eispack instead of recoding them, even if your application is in C. Here, you end up with algorithms which are seldom as fast as N log N; often you have N^3 time and N^2 space. It would have to be the world's dumbest malloc before I would notice it in scientific computing. Indeed, one of the standard 'jokes' is to use an N^2 sort routine to order the eigenvalues/singular values of a matrix (which has just cost you a fairly large constant times N^3), the idea being that you'll never be able to do an N big enough to wear out the N^2 sort. (Being a real puritan, I use shellsort....)

Later, Andrew Mullhaupt
meissner@osf.org (Michael Meissner) (01/12/90)
In article <10825@encore.Encore.COM> soper@maxzilla.encore.com (Pete Soper) writes:
| Both GNU C and Green Hills C/C++/F77/Pascal are optimizing compilers that
| have 88k code generators available. Surely both have to do instruction
| scheduling of some sort to support the 88k. Perhaps this area needs more
| work? Is the 860 so much faster because of raw performance or does it
| have the same pipeline issues and a compiler that more effectively supports
| them?

First of all, to support the 88k, you don't have to do any instruction scheduling, since the hardware will stall if the data is not available. Obviously, there is an advantage to doing scheduling (the numbers I saw were in the 5-10% range). I don't know about extremely recent Greenhills releases, but Greenhills did some limited amount of instruction scheduling, and GNU C did not (unless you count filling the delay slots of branches/calls as instruction scheduling). Instruction scheduling seems to help floating point the most on the 88k (my gut-level reaction is that the compiler often does not have anything else useful to do to cover the two stalls needed for loads). This tended to show up in the benchmarks: on integer/system code, GNU and Greenhills were neck and neck, whereas Greenhills had an advantage in floating point. Bias alert: I spent a year working on the GNU C compiler for the 88k, so I'm not a disinterested observer.

| Sort of on this subject, is GNU C the only C compiler shipped with the
| DG box, or is it an alternative to Green Hills? Assuming GNU C is "it",
| does it play well with Green Hills Fortran, which I'm assuming is still
| the official Fortran product? Has DG extended gdb to cover both languages
| or is another debugger used with their Fortran product?

The only compiler that is included with the DG/UX 88k operating system is GNU C. You can purchase Greenhills C, Fortran, and Pascal if you desire. I believe that Absoft also sells a Fortran compiler for DG/UX.
All languages on the 88k are expected to meet the 88open Object Compatibility Standard (OCS) with regard to calling sequence, so that you can mix and match (though there is one minor detail that both GNU C and Greenhills fail in the same way). DG does supply its own debugger to cover all of the languages (mxdb), but you could get by using gdb (I think there are problems with multidimensioned arrays, and of course describe uses C syntax). Part of the problem is that COFF is so C-specific, but COFF is required by the standards. I believe that the 88k OCS standard makes it impossible for C to call a Fortran function that returns double complex, since C is required to treat it as a function which returns a struct -- which goes in memory, whereas Fortran returns the value in registers. I don't have a copy of the standard anymore, so I can't verify this.
-- 
Michael Meissner email: meissner@osf.org phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA
Catproof is an oxymoron, Childproof is nearly so
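[Editor's note] The double-complex issue is about return conventions. A hedged C illustration: the struct and function below are invented for this sketch, and the register-vs-memory detail is as Michael describes it (unverified here). The C code is perfectly legal; the problem is purely in how the ABI returns the 16-byte value.

```c
/* C's view of a Fortran DOUBLE COMPLEX: a struct of two doubles
 * returned by value.  Under an ABI where struct returns go through a
 * hidden memory pointer while Fortran returns the pair in registers,
 * a C caller and a Fortran callee disagree on where the result is. */
struct dcomplex { double re, im; };

static struct dcomplex dc_mul(struct dcomplex a, struct dcomplex b)
{
    struct dcomplex r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;   /* struct return: the ABI decides memory vs registers */
}
```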
amull@Morgan.COM (Andrew P. Mullhaupt) (01/12/90)
In article <2811@yogi.oakhill.UUCP>, marvin@oakhill.UUCP (Marvin Denman) writes: > In article <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes: ||| ... ||| Discussion by Tom Wood of Data General about the possibility of boosting ||| Mflops by a factor of 3 or better with improved compiler technology. ||| ... ||This may be true for single precision, but it is hard to see how you can get ||the pipe full for double precision. Any instruction with a double precision ||source operand requires two (count'em 2) cycles before the 88k will even ||bother looking at the next instruction. Then for double precision float ||instructions there are two cycles required in the first FP1 pipe stage ||(although the one of these FP1 cycles can overlap with the last of the two ||decode cycles, so perhaps this is not so bad). | | | The example in question was obviously for single precision. The 68 cycle | appears to be approximately correct for best case double precision assuming | the loop in double precision can be unrolled 8 times before running out of | registers. One clock could probably be saved in this case by optimizing the | loop to use bcnd instead of the compare and branch sequence. | | I haven't coded this loop, but I have unrolled similar loops such as Linpack. | Comparing 68 cycles to 19 cycles is not an apples to apples comparison. | The naive code would also be slowed somewhat by using double precision. | As a first guess I would say that the ratio of unrolled code to naive code will | still be close to 3. Compilers have much room for improvement particularly | in floating point numerical code. The current compilers do very little | scheduling and no unrolling of loops that I am aware of. Just scheduling | operations with latencies greater than 1 will improve performance significantly. | Unrolling loops will make a large difference in this type of code. God help us I hope not. 
Unless we're reading off different pages, the BLAS (Basic Linear Algebra Subroutines) have loops which are unrolled in many places just for this reason. If the compiler insists on rolling them back up - that's its affair. Now you can't win by unrolling every loop, because some loops are big enough that unrolling them pops you out of cache, etc., so don't expect unrolling loops to be the winner every time. You can sometimes get nailed by inlining stuff for the same reason. Now what to unroll is a harder question than it used to be because you've so many different sizes of cache and stuff across the different machines, but looking at the old CDC 6600 architecture and its multiply-scheduled pipelines will likely show that scientific computing has been around this block once before. The tricks are often worthwhile, and I would expect every self-respecting compiler to be aware of the available weapons.

| Data dependencies between iterations of a loop are a very significant problem
| with unrolling loops. Hopefully the compiler will recognize the nondependencies
| well enough to unroll most loops that can be unrolled. I agree that on some
| loops there are dependencies that hinder unrolling. If these can be identified
| though the compiler may even be able to remove redundant loads. There is so
| much room for improvement that I find it difficult to be pessimistic about
| the amount of improvement that is possible.
|
||For highest performance in all cases, give me the float unit with the
||highest raw speed, pipelining only works if my algorithm is suitable, raw
||speed always works.
|
| I disagree. I think that unless the latency is very short (2 or maybe 3 cycles)
| that pipelining will pay off on a normal application mix. The longer the
| latency, the more likely it is that you will want to unroll or reschedule code.
| It will be interesting to see if MIPS goes to pipelining floating point
| instructions in future parts.
It seems to me that 68 versus 48 clocks is about a 40% penalty for double precision. That's too much for my taste (I can tolerate about a 20% differential). If you think about exercising your bus, etc., double precision probably gets higher efficiency but hurts your cache hit ratio as compared to single precision. It is quite likely that there are two user communities here - the single precision fans and the double precision fans. We will most likely end up preferring different machines. I come from the double precision school of thought, and almost ignore single precision benchmarks. I would expect the other camp does the reverse.

It should be pretty obvious that unrolling (a control overhead reduction technique) will be more efficacious when the amount of real work done on each pass of the loop is smaller. All other things being equal, we should expect single precision code to benefit more from the application of unrolling. (It really helps integer code no end.) But back to my original confusion - am I the only one with a BLAS which unrolls its own loops? Given that I'm talking about double precision arithmetic, should I really expect the compiler to find yet another factor of two? I'll believe it when I see it.

Later, Andrew Mullhaupt
tom@ssd.csd.harris.com (Tom Horsley) (01/12/90)
In article <2811@yogi.oakhill.UUCP> marvin@oakhill.UUCP (Marvin Denman) writes: >The example in question was obviously for single precision. The original article specifically stated that the example was double precision, that is why I wondered where the numbers came from. > One clock could probably be saved in this case by optimizing the >loop to use bcnd instead of the compare and branch sequence. Maybe, but I got this code by assuming that I could do induction variable elimination and test replacement. In order to use bcnd, I need to count down to zero, which probably means adding in an extra subu, thus eating the cycle I just saved. Perhaps a sufficiently clever compiler could get around this. In any event neither 67 nor 68 is close to 48. >Data dependencies between iterations of a loop are a very significant >problem with unrolling loops. Hopefully the compiler will recognize the >nondependencies well enough to unroll most loops that can be unrolled. >I agree that on some loops there are dependencies that hinder unrolling. >If these can be identified though the compiler may even be able to >remove redundant loads. There is so much room for improvement that I >find it difficult to be pessimistic about the amount of improvement that >is possible. There is no question that compilers can generate better code than they do now. We are currently at the stage of doing a detailed examination of the code quality of our own 88k compilers here at Harris Computer Systems, and we are often horrified by some of the truly rotten code we produce. We ARE fixing these problems. (And occasionally we are uplifted by the terrific code we produce). However, there is a real problem with loop unrolling that depends on language semantics. In FORTRAN compilers it may well be possible to profitably unroll many loops, due to some of the aliasing restrictions that the FORTRAN standard imposes on arguments. 
In the long term in Ada, it is also possible because Ada requires a global program database which could someday be used to do the sorts of interprocedural analysis required to determine that no aliasing occurs. But on U**x systems, most code is written in C, increasingly even numerical code is written in C. But C pointers can point pretty much anywhere. Compilers generally have to make worst case assumptions. This means that in any loop like the one in the original example where there is a load through a pointer on the right of the statement and a store through a pointer on the left, the compiler will be forced to assume that the store must take place before the next loop iteration does a load. Even if you unroll the loop, this data dependence will still be in place. Unfortunately, the only way you can get the example loop fully pipelined is to do several multiplies and adds before actually storing the result. In this case, if the algorithm were coded in C, you could take almost no advantage of pipelining, the only thing unrolling would get you is a slight improvement in the loop overhead, incrementing and testing the induction variable. >I disagree. I think that unless the latency is very short (2 or maybe 3 >cycles) that pipelining will pay off on a normal application mix. >Marvin Denman >Motorola 88000 Design >cs.utexas.edu!oakhill!marvin Of course you disagree, you work for Motorola :-) Actually I didn't mean to imply that I thought pipelining was a bad idea, I am all in favor of it, because when you can take advantage of it it does a super job. I just wish that it didn't take so many clocks to get through the pipe, because when it does not work out so well you just have to eat the cycles and like it. In those cases I would prefer to eat as few cycles as possible. To paraphrase your comment about MIPS, it will be interesting to see if Motorola goes to fewer clocks for float instructions in the next generation chips. 
I still maintain that a large amount of real code (not artificial benchmarks) contains data dependencies that force serial computation. I would like this code to run fast as well. -- ===================================================================== domain: tahorsley@ssd.csd.harris.com USMail: Tom Horsley uucp: ...!novavax!hcx1!tahorsley 511 Kingbird Circle or ...!uunet!hcx1!tahorsley Delray Beach, FL 33444 ======================== Aging: Just say no! ========================
slackey@bbn.com (Stan Lackey) (01/13/90)
In article <671@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>But back to my original confusion - am I the only one with a BLAS
>which unrolls its own loops? Given that I'm talking about double
>precision arithmetic, should I really expect the compiler to find
>yet another factor of two? I'll believe it when I see it.

In older machines (those without any scalar pipeline) the only advantage of unrolling loops was to reduce loop overhead. Now, with scalar pipelines, a good instruction scheduler can likewise take advantage of unrolling; that is, in a loop like a(i) = b(i)*s which is unrolled say 4 times, b(i:i+3) can be fetched at the start of the loop and put into 4 registers by 4 sequential loads (I assume using displacement addressing). Then the four muls can be started sequentially, followed by 4 stores. The time to do 4 loop iterations in this case should be only slightly more than the time to do one (with all cache hits).

Note that with more in the loop (like maybe two fetched vectors instead of one) and maybe an add, you use up all the registers real fast, especially with double precision.

-Stan
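[Editor's note] Stan's a(i) = b(i)*s example, written out in C with the loads hoisted into locals the way a scheduler would want them. This is a sketch under the assumption that a and b do not overlap; the function name is made up.

```c
/* a[i] = b[i] * s, unrolled 4 ways.  The four loads into b0..b3 can
 * issue back to back, then the four (pipelined) multiplies, then the
 * four stores -- so four iterations cost little more than one, given
 * all cache hits.  A clean-up loop handles the leftover iterations. */
void scale4(int n, double s, const double *b, double *a)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        double b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];
        a[i]     = b0 * s;
        a[i + 1] = b1 * s;
        a[i + 2] = b2 * s;
        a[i + 3] = b3 * s;
    }
    for (; i < n; i++)
        a[i] = b[i] * s;
}
```

Note Stan's closing caveat in action: with two source vectors and double precision, each unrolled iteration ties up more registers, so the unroll factor is limited by the register file.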
earl@wright.mips.com (Earl Killian) (01/13/90)
In article <2811@yogi.oakhill.UUCP>, marvin@oakhill (Marvin Denman) writes:
>I disagree. I think that unless the latency is very short (2 or
>maybe 3 cycles) that pipelining will pay off on a normal application
>mix. The longer the latency, the more likely it is that you will
>want to unroll or reschedule code. It will be interesting to see if
>MIPS goes to pipelining floating point instructions in future parts.

Pipelining makes less than 1% difference on the non-vector applications that I've looked at. Even on vector applications it is unimportant if your latencies are short enough. 2 or 3 cycle adds are doable, for example.

Consider the application being discussed, matrix multiply, which is highly vectorizable. If the original poster is correct in that the 88100, with its pipelined floating-point units, tops out at 6.7 mflop/s in single precision matrix multiplies, it really proves this point. The MIPS R3000, with non-pipelined floating-point units, can do matrix multiplies at

           25MHz          33MHz
  single   11.8 mflop/s   15.7 mflop/s
  double    7.8 mflop/s   10.4 mflop/s

This is an example of why MIPS prefers low-latency to pipelined fp.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
rfg@ics.uci.edu (Ron Guilmette) (01/14/90)
In article <TOM.90Jan12072511@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
>
>However, there is a real problem with loop unrolling that depends on language
>semantics. In FORTRAN compilers it may well be possible to profitably unroll
>many loops, due to some of the aliasing restrictions that the FORTRAN standard
>imposes on arguments. In the long term in Ada, it is also possible because
>Ada requires a global program database which could someday be used to do the
>sorts of interprocedural analysis required to determine that no aliasing
>occurs. But on U**x systems, most code is written in C, increasingly even
>numerical code is written in C. But C pointers can point pretty much anywhere.
>Compilers generally have to make worst case assumptions...

NOT true! I have been arguing over this very issue with a professor here recently. He is particularly interested in issues of instruction scheduling for VLIW machines and I have been telling him (repeatedly) that you will never achieve enough parallelism to make it worth your while on machines like that (or even on the lowly 88k) if you are compiling from C source code and if you do not do some hefty (but plausible) alias analysis based on as much information as can be gleaned from the source code.

For instance, although pointers can in theory point almost anywhere, there are in fact many cases where it is obvious that the set of things that could in fact be pointed to is some identifiable subset of the entire address space. For example:

	{
	  char array[100];
	  char *end = &array[99];
	  char *p = array;

	  while (p <= end)
	    *p++ = ' ';
	}

Guess where p always points to! Now guess where end always points to. You can work your way up to significantly more complex examples from here. Some particularly good work was done on alias analysis for C at Hewlett-Packard (for the PA) and was written up as "Retargetable High Level Alias Analysis" in the 1986 ACM POPL Proceedings.
Even though that's the best paper I have seen on the subject yet, I think that they may have missed a few possible additional tricks which might allow them to infer additional limitations on the set of things that a pointer can point to, but it is hard to tell. They did do a pretty detailed analysis, but I guess that my own ego makes me want to believe that (if given enough time) I could do better. With a really robust alias analysis mechanism, you start to get into some interesting questions regarding storing aliasing information for "library" modules as well as for the code you are currently compiling. How much do you store? How do you represent it? If anybody has ideas about such things, I (for one) am all ears. // rfg
tom@ssd.csd.harris.com (Tom Horsley) (01/15/90)
In article <25AFDC1A.11327@paris.ics.uci.edu> rfg@ics.uci.edu (Ron Guilmette) writes:
>entire address space. For example:
>	{
>	  char array[100];
>	  char *end = &array[99];
>	  char *p = array;
>
>	  while (p <= end)
>	    *p++ = ' ';
>	}
>
>Guess where p always points to! Now guess where end always points to.
>You can work your way up to significantly more complex examples from here.

This is absolutely true, and symbolic execution is a well-known (if high overhead) way of squeezing information like this out of the source code. (We use a form of it in our Ada compilers to eliminate constraint checks we can prove are not required.) But I can make up examples too:

	void matmul(a, b, c, n, m)
	double *a;
	double *b;
	double *c;
	int n, m;
	{
	  ...
	}

Guess where a, b, and c always point to! Unless I have a global program database and I can compile the matmul routine knowing everything about every point at which it is called, the only guess I can make is "somewhere within the address space of the machine".

The ultimate fanatic compiler might consider generating two routine bodies, one with aliasing between arguments and one without, then throw in some runtime checks for array overlap, and just jump to the "best" body. But be careful: you are liable to introduce more overhead with the runtime checks than you save by picking the correct body (particularly if the arguments always overlap). Deciding which optimizations are profitable is the hardest problem in engineering a good compiler.

Both of these are legitimate example programs; in one case a smart compiler could do a good job, in the other case it is out of luck. I am interested in getting good performance out of *all* compiled code, not just examples which happen to work well for the machine. I don't want to argue about which example is "more realistic", the fact is, they are both "realistic".
I don't want anyone to claim that I said you can't generate good code for the 88k, sometimes you can generate fantastically good code for it, but sometimes you can't. -- ===================================================================== domain: tahorsley@ssd.csd.harris.com USMail: Tom Horsley uucp: ...!novavax!hcx1!tahorsley 511 Kingbird Circle or ...!uunet!hcx1!tahorsley Delray Beach, FL 33444 ======================== Aging: Just say no! ========================
amull@Morgan.COM (Andrew P. Mullhaupt) (01/16/90)
In article <50855@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> In article <671@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>
> >But back to my original confusion - am I the only one with a BLAS
> >which unrolls its own loops?  Given that I'm talking about double
> >precision arithmetic, should I really expect the compiler to find
> >yet another factor of two?  I'll believe it when I see it.
>
> In older machines (those without any scalar pipeline) the only
> advantage of unrolling loops was to reduce loop overhead.  Now, with
> scalar pipelines, a good instruction scheduler can likewise take
> advantage of unrolling; that is, in a loop like a(i) = b(i)*s which is
> unrolled say 4 times, b(i:i+3) can be fetched at the start of the loop
> and put into 4 registers by 4 sequential loads (I assume using
> displacement addressing).  Then the four muls can be started
> sequentially, followed by 4 stores.  The time to do 4 loop iterations
> in this case should be only slightly more than the time to do one
> (with all cache hits).
>
> Note that with more in the loop (like maybe two fetched vectors
> instead of one) and maybe an add, you use up all the registers real
> fast, especially with double precision.

As a matter of fact, the first time I had to worry about unrolling
loops was on a CDC 6600 (it was delivered in 1963, and was the third
one built).  Not that I was programming it then, but that's how old the
machine was.  Now this 'box' had (if memory serves) eight arithmetic
pipelines which could all be running simultaneously: something along
the lines of integer and floating add, subtract, multiply and divide
(but it wasn't exactly that; the exact specification for the Cyber, a
descendant machine, can be found in Michael Metcalf's interesting book
_FORTRAN Optimization_).  I'm not sure how old loop unrolling is, but
the FORTRAN compiler for the CDC 6600 had it by the time I got around
to that machine.
In fact this is one of the machines where hand-coded assembler was as
likely to slow down code as the FORTRAN compiler's code, because the
compiler took care to schedule the pipes.  It could even move code
across loops or function calls in order to schedule better.  Now this
is at least a 15 year old compiler and a 27 year old machine, so I
don't think the role of loop unrolling really stands in a new and
different light, and I'm somewhat disappointed in at least one of the
compilers I've run across for RISC.

For example: the Sun 4 compiler willfully punishes you if you unroll
your loops in the source.  It doesn't unroll them for you either.  The
gcc-1.35 compiler for the same machine quite understands what you want,
and you get as much as a factor of 10 speedup.  This proves that the
hardware is not the problem.

Can anyone who has an m88k and a C compiler check out what happens if
you unroll loops at the source level and post a short summary?

Later,
Andrew Mullhaupt
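Source-level unrolling of the a(i) = b(i)*s loop described above looks
like the following in C.  This is a generic sketch with illustrative
names, unrolled by four with a cleanup loop for the leftover
iterations; whether it helps or hurts depends entirely on the
compiler's scheduler, as the Sun 4 vs. gcc-1.35 comparison shows:

```c
/* Scale b[] by s into a[], unrolled by 4 at the source level. */
void scale4(double *a, const double *b, double s, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {
        /* Four independent loads, then four independent multiplies
         * and stores: a scalar pipeline can overlap all of these. */
        double t0 = b[i],   t1 = b[i+1];
        double t2 = b[i+2], t3 = b[i+3];
        a[i]   = t0 * s;
        a[i+1] = t1 * s;
        a[i+2] = t2 * s;
        a[i+3] = t3 * s;
    }
    for (; i < n; i++)      /* remainder when n is not a multiple of 4 */
        a[i] = b[i] * s;
}
```

Note the register pressure point made above: with double precision
temporaries, an unroll factor much beyond four or eight exhausts the
register file quickly on most of these machines.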
khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/17/90)
In article <681@terminus.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:

>For example: the Sun 4 compiler willfully punishes you if you unroll
>your loops in the source.  It doesn't unroll them for you either.

Not true.  The compiler DOES do unrolling.  f77 users have been able to
see this for a year or two.  Since C has been bundled with the OS, to
have any chance of getting unrolling from C you would have had to
install the f77 components into your C compiler (or use some of the
less well known options).  The next C compiler will be available soon,
and will not require an OS upgrade.
--
Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   | MTS --Only my work belongs to Sun*
I Voted for Bill &  | Advanced Languages/Floating Point Group
Opus                | "When the going gets Weird .. the Weird turn PRO"

"There is NO defense against the attack of the KILLER MICROS!"
			Eugene Brooks
marvin@oakhill.UUCP (Marvin Denman) (01/17/90)
In article <34446@mips.mips.COM>, earl@wright.mips.com (Earl Killian) writes:

>Consider the application being discussed, matrix multiply, which is
>highly vectorizable.  If the original poster is correct in that the
>88100, with its pipelined floating-point units, tops out at 6.7
>mflop/s in single precision matrix multiplies, it really proves this
>point.  The MIPS R3000, with non-pipelined floating-point units, can
>do matrix multiplies at
>
>		25MHz		33MHz
>	single	11.8 mflop/s	15.7 mflop/s
>	double	 7.8 mflop/s	10.4 mflop/s
>
>This is an example of why MIPS prefers low-latency to pipelined fp.
>--
>UUCP: {ames,decwrl,prls,pyramid}!mips!earl
>USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

It should be noted that the 88k numbers you repeated are apparently at
20MHz and for the specific code fragment posted:

	DO 10 J = I,N
10	A(I,J) = A(I,J) + B(I,K) * C(K,J)

The numbers you posted for the R3000 are PROBABLY for a slightly
different code fragment (I am more conversant in C, so I will
translate):

	for (i=0 ; i<MAXI ; i++)
	  for (j=0 ; j<MAXJ ; j++)
	    for (k=0, a[i][j]=0.0 ; k<MAXK ; k++)
	      a[i][j] = a[i][j] + (b[i][k] * c[k][j]);

Is that true?  The inner loop written in this style can accumulate
a[i][j] into a register and remove the stores from the inner loop.
(Note that the assumption that the arrays do not overlap is necessary.)
When I recoded this inner loop for the 88100, unrolling the loop 8
times, I got 10.8 Mflops at 25 MHz and 14.4 Mflops at 33 MHz for single
precision.  This loop had only 1 cycle of stalling out of 37 cycles, so
the floating point latencies had a negligible effect.

How much was the inner loop unrolled for the R3000?  By my rough
calculation I would suspect it would have to be 16 or so to get the
numbers quoted.  This is probably a legitimate difference, but I would
be interested to know if the extra unrolling is the cause of it.
--
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin
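The register-accumulation rewrite described above can be sketched in C.
The sizes and names here are illustrative only, and the version assumes
(as noted in the post) that the arrays do not overlap:

```c
#define N 4   /* illustrative size; the real routines take it as a parameter */

/* a = b * c, with a[i][j] kept in a local variable so the compiler can
 * hold it in a register: the store leaves the inner k loop entirely,
 * and only the b[][] and c[][] loads remain inside it. */
void matmul(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;          /* accumulator lives in a register */
            for (int k = 0; k < N; k++)
                sum += b[i][k] * c[k][j];
            a[i][j] = sum;             /* single store per result element */
        }
}
```

Compared with storing through a[i][j] every iteration, this halves the
memory traffic in the inner loop, which is where the unrolled 88100
version gets its headroom.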
earl@wright.mips.com (Earl Killian) (01/18/90)
In article <2831@yogi.oakhill.UUCP>, marvin@oakhill (Marvin Denman) writes:

>It should be noted that the 88k numbers you repeated are apparently
>at 20MHz

I see.  I included both 25 and 33MHz numbers because I wasn't sure what
clock to compare to.  I didn't think of 20.

>and for the specific code fragment posted:
>	DO 10 J = I,N
>10	A(I,J) = A(I,J) + B(I,K) * C(K,J)
>
>The numbers you posted for the R3000 are PROBABLY for a slightly
>different code fragment: (I am more conversant in C so I will translate)
>	for (i=0 ; i<MAXI ; i++)
>	  for (j=0 ; j<MAXJ ; j++)
>	    for (k=0, a[i][j]=0.0 ; k<MAXK ; k++)
>	      a[i][j] = a[i][j] + (b[i][k] * c[k][j]);
>Is that true?

Yes, the matrix multiply library routine quoted uses an algorithm close
to the above (the appropriate algorithm for matrix multiply does depend
on the machine).  One difference from the above is that it appears
you're assuming the array bounds are known at compile time, which is
not true for the library subroutine I used (the stride is a parameter).
This makes the address arithmetic more expensive (it adds a whole
instruction per flop).  The second difference is that the unrolling was
done on the middle loop, not the inner loop.

>When I recoded this inner loop for the 88100 unrolling the loop 8
>times, I got 10.8 Mflops at 25 Mhz and 14.4 Mflops at 33 Mhz for
>single precision.

What about double?  ;-)

>The inner loop written in this style can accumulate a[i][j] into a
>register and remove the stores from the inner loop.
>...
>This loop had only 1 cycle of stalling out of 37 cycles so the
>floating point latencies had a neglible effect.

But the accumulates into a[i][j] are dependent, and I thought the fp
add was 5 cycles, so 8 dependent fp adds should take a minimum of 40
cycles, true?  Did you convert to multiple parallel accumulation
registers to get around the fp latency?

>How much was the inner loop unrolled for the R3000?

The middle loop was unrolled 8 times.
Anyway, the point of my response to the original

	It will be interesting to see if MIPS goes to pipelining
	floating point instructions in future parts.

is that we're not going to add pipelining at the expense of latency,
because low latency lets you do two things well (scalar and vector),
whereas pipelining lets you do only vector well.  I was surprised that
a high-latency, highly-pipelined machine like the 88100 actually
appeared to be slower on a vector problem than the R3000, and you
correctly pointed out that this was only because the originally posted
code was somewhat sub-optimal for the 88100.

On a vector problem, both machines should be instruction-issue limited.
The latency or pipelining required to run at peak rate is a function of
the instruction-to-flop ratio.  We try to keep our latencies below that
ratio, whereas the 88100 keeps its pipelining below that ratio.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
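The multiple-parallel-accumulator question above can be illustrated
with a dot product in C.  The function name and the unroll factor of
four are made up for the example; the point is that the four partial
sums are independent, so the floating-point adds can overlap instead of
each waiting out the full add latency behind the previous one:

```c
/* Dot product with four independent accumulators.  With a single
 * accumulator, every add depends on the one before it, so an N-cycle
 * fp-add latency makes each iteration cost at least N cycles.  Here
 * the four adds per trip have no mutual dependences. */
double dot4(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i]   * y[i];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    double s = (s0 + s1) + (s2 + s3);   /* combine partial sums */
    for (; i < n; i++)                  /* remainder iterations */
        s += x[i] * y[i];
    return s;
}
```

One caveat: this changes the order of the additions, so the rounded
result can differ slightly from the strictly sequential sum, which is
why compilers of this era would not do the transformation for you.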
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (01/19/90)
In article <34780@mips.mips.COM> earl@wright.mips.com (Earl Killian) writes:
>In article <2831@yogi.oakhill.UUCP>, marvin@oakhill (Marvin Denman) writes:
:
>What about double?  ;-)

Good question.  I would like to see both single and double precision
numbers when these comparisons come up.

>Anyway, the point of my response to the original
>	It will be interesting to see if MIPS goes to pipelining
>	floating point instructions in future parts.
>is that we're not going to add pipelining at the expense of latency,

I wonder what f.p. op counts are common these days.  For example, has
anyone done an analysis of the SPEC benchmarks to see what the
(dynamic) instruction counts on various RISC machines are (like the
ones we used to see for the IBM 360 :-) and, also, what the sequences
look like for things like the matrix-vector version of Linpack (i.e.
the most floating-point-intensive vectorizable codes)?

Conjecture: I would guess that you will see some potential benefit to
pipelining add, but not multiply or divide (as long as you don't have
vector instructions).

Suggestion: Consider pipelining add by duplicating the add unit.  I
think this approach has benefits: you can reuse the same already
optimized adder design, and most real-estate-saving pipelining methods
add latency.  I think MIPSCo is correct to reduce latency first, but I
think there will be usable speedup if add is pipelined, according to
instruction analyses that I have seen previously for other
architectures.

The Motorola and MIPSCo products each represent reasonable design
choices.  It is interesting to have real competition in a product area
which was lunatic fringe three or four years ago.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    Phone:  (415)694-6117