rcd@ico.ISC.COM (Dick Dunn) (07/08/88)
[Better start off by emphasizing that these are my own opinions, not ISC's.]

"Neal Nelson and Associates" had a booth at the USENIX vendor's exhibition
whose sole purpose seemed to be RISC-bashing.  Although they purportedly
have developed a set of tests, which they call their "Business Benchmark
(R)" for helping people make realistic comparisons of machines under
realistic loads, in fact even basic descriptions of the tests cast serious
doubt on their real usefulness.  Some of the tests seem overly simplistic;
others contain obvious biases toward or against certain types of hardware
and/or I/O system software.

A couple of months ago, several trade rags ran articles reporting how Neal
Nelson & Associates (NNA) had shown that CISCs would beat RISCs.  It's hard
to tell just who botched what part of it--whether NNA did a bad job of
reporting it or the trade journals reported what they wanted to hear--but
the articles were abominable.

At USENIX, the NNA booth had great piles of reprints of several of these
RISC-bashing trade-rag articles--and NO OTHER reports of substantive
conclusions.  (Their other literature listed system configurations they had
tested and tried to explain the tests.)  I saw no attempt to present a
balanced picture, nor to compare a representative set of comparable
machines.

For example, the EE Times RISC-bash article (which I happen to have at
hand) shows a Sun 3/260 beating the pants off a Model 25 RT PC and slightly
winning over a MIPS M-500 on the test EE Times chose to show in detail.
(The Sun 3 was the only CISC design represented; there were other RISCs.
I'm singling out MIPS and the RT as examples.)  EE Times chose to report in
detail on only one test out of 18 in the benchmark!  The results as
reported show some obvious problems:

- Why were they using the fastest Sun 3 (25 MHz) for comparison against
  the slowest MIPS machine?
- If a 25 MHz CISC is only 12% faster than an 8 MHz RISC, how does that
  make the CISC faster?
- Although the older RT lost badly, it's not a full RISC design... and one
  ought perhaps to take into account that the Sun system costs more than 3
  times as much.
- Where are the other CISCs?  It seems very much like only the best CISC
  they could find was used as the basis for comparison--and certainly the
  best RISCs were NOT used.

At the point that EE Times published their report, it would have been easy
(for NNA in particular) to say that it was simply a case of EE Times doing
some very selective reporting.  But at USENIX, it was clear that NNA was
quite proud to show off the EE Times article...and it seems clear that NNA
has an ax to grind wrt RISCs even though it's not clear why.  They're
making some pretty strong statements:

"I'm beginning to believe RISC doesn't belong everywhere, or possibly even
anywhere..."

"...we still haven't seen any areas of study that say RISC has been
implemented and shown a marked improvement..."

...but they're not backing them up.  Both of these are attributed to Neal
Nelson.  Incidentally, the second statement may well be true--but it's
surprising that someone working on benchmarks could ignore a decade or so
of work!  (It's interesting that at ASPLOS last fall, the two camps seemed
to be holding viewpoints of "RISC has won" vs "the battle isn't over"--
quite a different story from what Nelson tells.)

Most folks familiar with the RISC-CISC debate have their biases, and in a
moment of passion they may get a little carried away in supporting their
viewpoint.  But what I see in this Neal Nelson stuff goes beyond a
momentary outburst--it's an unsubstantiated, unprofessional reaction to the
substantial real work (and very real, very fast machines) that have come
out of RISC development.  I think it added some heat and absolutely no
light...we need thought more than we need more emotion.
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd      (303)449-2870
   ...Are you making this up as you go along?
pardo@june.cs.washington.edu (David Keppel) (07/08/88)
[ RISC bashing at Usenix ]

Anybody out there at NNA care to respond?  Anybody from EE Times?

( We've seen this topic quite a bit recently.  Please use e-mail. )

	;-D on  ( RISCy Business )  Pardo
dwc@homxc.UUCP (Malaclypse the Elder) (07/08/88)
i can imagine that as a vendor of benchmarks, it is in neal nelson's
interest to fuel the debate of risc vs. cisc with results that are
contrary to popular belief.  they will probably sell more benchmarks to
people who want to see if their machine also has the same problem.

danny chen
homxc!dwc
davidsen@steinmetz.ge.com (William E. Davidsen Jr) (07/11/88)
In article <6888@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
| "Neal Nelson and Associates" had a booth at the USENIX vendor's exhibition
| whose sole purpose seemed to be RISC-bashing.  Although they purportedly
| have developed a set of tests, which they call their "Business Benchmark
| (R)" for helping people make realistic comparisons of machines under
| realistic loads, in fact even basic descriptions of the tests cast serious
| doubt on their real usefulness.  Some of the tests seem overly simplistic;
| others contain obvious biases toward or against certain types of hardware
| and/or I/O system software.

Neal Nelson presents the results of his benchmarks and the description of
each test.  You are free to interpret them any way you want.  That's not a
flame, just a reminder that if you choose to interpret his results as RISC
bashing, then you seem to have decided that RISC is better, and that a
benchmark which doesn't show that is either biased or useless.

We have looked at the NN benchmarks for a number of machines (I obviously
can't say which ones), and my personal reaction is that they are
reasonable and valid for business applications.  If your application is
something else, why not get a benchmark suite which tests that, rather
than blasting NN?  I don't consider ANY benchmark to be the whole story on
a machine, even my own.
-- 
	bill davidsen  (wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
rcd@ico.ISC.COM (Dick Dunn) (07/12/88)
In response to my griping about Neal Nelson at USENIX,
davidsen@steinmetz.ge.com (William E. Davidsen Jr) writes:
> Neal Nelson presents the results of his benchmarks and the description
> of each test.  You are free to interpret them any way you want.  That's
> not a flame, just a reminder that if you choose to interpret his results
> as RISC bashing, then you seem to have decided that RISC is better, and
> that a benchmark which doesn't show that is either biased or useless.

Good grief!  I did NOT interpret NN's results as RISC bashing...I
interpreted the *presentation* of the results as RISC bashing.  The
presentation at USENIX was flamboyantly anti-RISC--meaning that there are
statements about RISCs by NN which are vehemently anti-RISC and not backed
up by fact.  Yes, I am free to interpret the results any way I want--but
NN wants me to interpret them in a particular way, as strongly anti-RISC.
I don't see that they support that viewpoint at all, which is what I'm
complaining about.  There *might* be some useful results and good work
behind it all, but I sure-as-hell can't find it even after trying to peel
back the layers of hype and bad journalism, so I tend to doubt that
there's much there.

No, I haven't decided that RISC is better.  The biases in the benchmarks
(more to the point, in the reporting thereof) are evident regardless of
your own biases, if you just look at what's been said.  I pointed out some
of the more obvious problems...so if you think I'm off base, why don't you
take on the *substance* of what I said?  (I.e., if you're trying to say
*I'm* biased, show us how.  For example, do you dispute that they compared
the fastest Sun 3 against the slowest MIPS box?)

Most of my computing is done on CISCs, and they serve well.  But I had a
chance recently to run a couple of problems on a low-end RISC; they ran so
fast that I put in some debugging code to be sure something hadn't gotten
short-circuited!  (It hadn't.)  I'm not taking up the sword to defend
RISC, but I know the RISC guys aren't smoking rope--they're for real.

> We have looked at the NN benchmarks for a number of machines (I
> obviously can't say which ones), and my personal reaction is that they
> are reasonable and valid for business applications...

OK, so which benchmarks are the good ones?  Note that the one that EE
Times gave such prominent coverage was one of the simplest--a loop with
just 4 calculations (+-*/) on 16-bit integers, running 1 to 15 copies at a
time.  That has about 0.5 * dsq to do with any real business program.
And, as I said in the original article, I could have attributed it to EE
Times' sloppiness (the rest of the article was an
expository/stylistic/technical mess) but for the fact that NN was showing
it off.  NN has 17 other benchmarks, and they could have put together a
complete presentation of the benchmarks on comparable machines.  They
didn't.  Why not?

> ...why not get a benchmark suite which tests [what concerns you],
> rather than blasting NN?

Done!  I have tests of my own which I run when *I* want to get an idea of
how fast a processor is.  The reason I'm blasting NN is that I see them
misleading people--and using a lot of PR to mislead a lot of people.  It's
that aspect that bothers me--not that it's RISCs per se that they're
bashing, but that they're bashing, instead of testing and reporting
carefully.
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd      (303)449-2870
   ...Are you making this up as you go along?
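[Editorial aside: a loop of the shape Dunn describes--four 16-bit
arithmetic operations (+-*/) repeated in a tight loop--is easy to sketch
in C.  This is a hypothetical reconstruction for illustration only; the
constants and structure are guesses, not NN's actual benchmark code.]

```c
/* Hypothetical reconstruction of the kind of trivial kernel described:
 * four arithmetic operations (+ - * /) on 16-bit integers, repeated.
 * Not NN's code; values wrap for large iteration counts. */
short tiny_kernel(long iterations)
{
    short a = 7, b = 3;
    long i;

    for (i = 0; i < iterations; i++) {
        a = (short)(a + b);
        a = (short)(a - 1);
        a = (short)(a * 3);
        a = (short)(a / 2);
    }
    return a;
}
```

A kernel this small touches almost no memory and exercises only integer
arithmetic, which is part of why it says little about real business
workloads.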
walter@garth.UUCP (Walter Bays) (07/13/88)
Several articles recently have commented that Neal Nelson (benchmark
service) is challenging the widely held view that RISC is faster than
CISC, saying that CISC is faster than RISC.  NN's most famous comparison
shows that a Sun 3 is faster than a Sun 4.  The other side can cite
benchmarks showing that a Sun 4 is much faster than a Sun 3.

An article in the July issue of UNIX Review sheds some light on the issue.
It's by David Wilson of Workstation Laboratories (another benchmark
service).  The article shows "... a class of tasks for which the Sun 4/260
is two or three times faster than the Sun 3/260, a class for which
performance is about the same, and a class where the 4/260's performance
seems slightly lower..."  Wilson discusses the reasons for these results.

Any computer designer selects architectural features based on their
expected utility given the class of workloads to be run.  RISC arose from
the observation that many of the features of conventional computers did
not help (or hurt) performance _in_ _the_ _typical_ _case_.  Various CISCs
have always included several RISC-like features their designers found
helpful.  And most RISCs include a few CISC-like features their designers
found helpful.  For performance comparisons, there is no substitute for a
benchmark that represents your application(s).
-- 
------------------------------------------------------------------------------
My opinions are my own.  Objects in mirror are closer than they appear.
E-Mail route: ...!pyramid!garth!walter		(415) 852-2384
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
------------------------------------------------------------------------------
davidsen@steinmetz.ge.com (William E. Davidsen Jr) (07/13/88)
In article <6965@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
| > We have looked at the NN benchmarks for a number of machines (I
| > obviously can't say which ones), and my personal reaction is that they
| > are reasonable and valid for business applications...
| 
| OK, so which benchmarks are the good ones?  Note that the one that EE Times
| gave such prominent coverage was one of the simplest--a loop with just 4
| calculations (+-*/) on 16-bit integers, running 1 to 15 copies at a time.

The decision is yours... NN gives the result of the test and what it
measures.  I don't disagree that considering (any) one benchmark as an
indicator is probably a waste, but with a selection of results you can
compare two (or more) machines in those areas which apply to your
situation.

I have a UNIX benchmark suite which I have run on a number of machines for
my personal edification.  It measures some raw performance numbers such as
the speed of arithmetic for all data types, transcendental functions, test
and branch for int and float, disk access and transfer times for large and
small files, speed of bit fiddling such as Gray to binary, etc.  Then I
measure speed of compile, performance under multitasking load, speed of
pipes and system calls, and a few other things.  The *one* thing I measure
which consistently represents the overall performance of the machine is
the real time to run the entire benchmark suite.
-- 
	bill davidsen  (wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
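[Editorial aside: the "real time to run the entire suite" measurement
davidsen describes can be sketched with a trivial harness.  run_suite()
here is a stand-in for the actual tests, and using time(2) wall-clock
seconds is an assumption of this sketch, not a description of his suite.]

```c
#include <time.h>

/* Sketch of a whole-suite timer, after the "real time to run the entire
 * benchmark suite" measure.  run_suite() is a stand-in for the real
 * tests; volatile keeps the compiler from deleting the work. */
long run_suite(void)
{
    volatile long sum = 0;
    long i;

    for (i = 0; i < 1000000L; i++)
        sum += i & 0xFF;
    return sum;
}

/* Returns elapsed wall-clock (not CPU) seconds for one pass. */
long time_suite(void)
{
    time_t start = time(NULL);

    (void) run_suite();
    return (long)(time(NULL) - start);
}
```

Wall-clock time folds in I/O waits and multitasking overhead that
per-operation CPU timings miss, which is the appeal of the single number.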
landru@stan.UUCP (Mike Rosenlof) (07/13/88)
In article <936@garth.UUCP> walter@garth.UUCP (Walter Bays) writes:
>An article in the July issue of UNIX Review sheds some light on the issue.
>It's by David Wilson of Workstation Laboratories (another benchmark service).
>The article shows "... a class of tasks for which the Sun 4/260 is two or
>three times faster than the Sun 3/260, a class for which performance is about
>the same, and a class where the 4/260's performance seems slightly lower..."
>Wilson discusses the reasons for these results.

When I first brought up X on our color Sun 4/260, recently converted from
a Sun 3/260, I was amazed that the X server performance for simple things
like scrolling and moving windows around was no better.  This was just how
it looked; I didn't get out my stopwatch.  So I did a little comparison
with bit blt (bit block transfer: scrolling, moving a window, ...) timing.

The loop which does most of the work for a bit blt looks like this for the
common copy case:

	register long count;
	register long *src, *dst;

	while( --count )
	{
		*dst++ = *src++;
	}

The Sun 68K compiler, after optimizing, produces this code:

	LY00001:
		movl	a5@+,a4@+
	LY00000:
		subql	#1,d7
		jne	LY00001

According to the 68020 user's manual, this loop takes 10 clocks in the
best case and 15 clocks in its cache case.  With a 40 nsec clock, this is
400 and 600 nsec per loop.

The Sun SPARC compiler, after optimizing, produces:

	LY2:			! [internal]
		ld	[%o3],%o0
		dec	%o5
		tst	%o5
		st	%o0,[%o4]
		inc	4,%o3
		bne	LY2
		inc	4,%o4

which takes 9 clocks, and with a 60 nsec clock, this is 540 nsec.  Since
the 68K loop is so tight, I suspect we're seeing the best-case 68K timing,
with SPARC doing the setup work faster to make up some of the difference.

Which processor is going to get faster clocks sooner?  Or will newer
versions reduce the clock count?  I've heard of a 33 MHz 68020 being
available now or soon; other SPARC implementations are also said to be in
the works.

My point is that with a reduced instruction set, you're very likely to
find some applications that are slowed down by this reduction.  In this
case, I find that the Sun 4/260 makes a very nice compile or compute
server, but it's not a very impressive X server.
-- 
Mike Rosenlof                 SAE                         (303)447-2861
2190 Miller Drive             stan!landru@boulder.edu
Longmont Colorado             landru@stan.uucp
80501                         ...hao!boulder!stan!landru
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (07/14/88)
>[ RISC bashing at Usenix ]
Before you shed too many tears about poor RISC people being bashed,
just remember all the hype with which "RISC" was marketed to the
computing world 5 (?) years ago. And remember, back then, that
supposedly things like arithmetic speed and memory bandwidth didn't
matter, according to the RISC camp. Now that we have new processors
like the MIPS R3000/R3010 and the Motorola 88K series, to mention
a couple of recent designs, things have come a long way.
Rather than say that RISC or CISC won, I think a more fair way to sum
up what really happened is to say that performance won. Conventional
wisdom 15 years ago (at, say, IBM and CDC) was that there would always
be a very limited demand for high performance systems. What happened
was quite different. People found maximum performance useful in
systems of all price categories. This was not anticipated by a lot
of people. But whether that performance is provided by a RISC or
CISC system does not matter to the end user.
That aside, it warms my heart to see MIPS (R3000/R3010) and Motorola
(88K series) battling it out over performance on benchmarks like
Linpack and the Livermore Loops. Just two years ago, these companies
were leaving floating point/memory intensive job performance to the
mainframe bunch. The latest MIPS performance brief shows, among other
things, the MIPS M/2000 system weighing in at 3.6 64-bit Fortran MFLOPS--
faster than the CDC 7600. Now that is progress.
[By the way, my hat is off to MIPS for their latest performance brief
(3.4 - dated June 1988). Good job. I wish every company would
provide a report like this.]
--
Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035
Phone: (415)694-6117
pope@vatican.Sun.COM (John Pope) (07/14/88)
In article <202@baka.stan.UUCP> stan!landru@boulder.edu writes:
>When I first brought up X on our color sun 4/260, recently converted from
>a sun 3/260, I was amazed that the X server performance for simple things
>like scrolling and moving windows around was no better.
> [...]
>The loop which does most of the work for a bit blt looks like this for the
>common copy case:
>
>register long count;
>register long *src, *dst;
>
>	while( --count )
>	{
>		*dst++ = *src++;
>	}

*** Warning!  Brain damaged software alert! ***

This should be re-coded to use the bcopy() library routine, which does a
32 bit copy instead of a byte at a time.  You should see a *noticeable*
improvement.  Moral: use your libraries, that's what they're there for.

>My point is that with a reduced instruction set, you're very likely to
>find some applications that are slowed down by this reduction.  In this
>case, I find that the sun 4/260 makes a very nice compile or compute
>server, but it's not a very impressive X server.

Please be careful about making conclusions like this regarding a
particular machine or architecture.  Performance is a combination of a lot
of factors, most of them not as clear-cut as this case is...

	John Pope
	Sun Microsystems, Inc.
	pope@sun.COM
pope@vatican.Sun.COM (John Pope) (07/14/88)
In article <59798@sun.uucp> pope@sun.UUCP (John Pope) writes:
>*** Warning!  Brain damaged software alert! ***

Sorry, my brain damage.  After double-checking, I saw you were copying
longs instead of chars.  While using the libraries is almost always right,
you probably wouldn't see much difference here.  Oh well...

	John Pope
	Sun Microsystems, Inc.
	pope@sun.COM
landru@stan.UUCP (Mike Rosenlof) (07/14/88)
In article <59798@sun.uucp> pope@sun.UUCP (John Pope) writes:
>>register long count;
>>register long *src, *dst;
                ^^^^
>>	while( --count )
>>	{
>>		*dst++ = *src++;
>>	}
>*** Warning!  Brain damaged software alert! ***
>This should be re-coded to use the bcopy() library routine, which
>does a 32 bit copy instead of a byte at a time.  You should see a
>*noticeable* improvement.  Moral: use your libraries, that's what they're
>there for.

Last time I looked, 'long' on the Sun C compilers was 32 bits, and this
example still holds.  If the library function is optimized C or hand-coded
assembler, the machine code is going to come up nearly identical to my
examples.  (Assembler for 68020 and SPARC not quoted here.)

>>My point is that with a reduced instruction set, you're very likely to
>>find some applications that are slowed down by this reduction.  In this
>>case, I find that the sun 4/260 makes a very nice compile or compute
>>server, but it's not a very impressive X server.
>
>Please be careful about making conclusions like this regarding a
>particular machine or architecture.  Performance is a combination of a
>lot of factors, most of them not as clear-cut as this case is...

I think I was clear that this is a very specific example, and that in
other areas its performance is "very nice".  I don't know of many machines
that spend large portions of CPU cycles just doing bit blt (or bcopy).  In
this isolated case, SPARC (or at least this implementation by Fujitsu) is
not impressive, especially when comparing costs to a 68020 system.  This
is one of the difficulties of very simple graphics like this: lots of data
has to be moved around.  On more compute-intensive functions like shaded
surfaces, I'm sure the Sun 4 would be a tremendous improvement.
-- 
Mike Rosenlof                 SAE                         (303)447-2861
2190 Miller Drive             stan!landru@boulder.edu
Longmont Colorado             landru@stan.uucp
80501                         ...hao!boulder!stan!landru
chris@mimsy.UUCP (Chris Torek) (07/15/88)
>In article <202@baka.stan.UUCP> stan!landru@boulder.edu writes:
>>... for the common copy case:
>>register long *src, *dst;

In article <59798@sun.uucp> pope@vatican.Sun.COM (John Pope) writes:
>This should be re-coded to use the bcopy() library routine, which
>does a 32 bit copy instead of a byte at a time.

Reread the original.  It *does* do a 32 bit copy.

Still, one should use bcopy/memcopy/memmove/whatever-we-call-it-this-week.
I suspect it can be optimised a bit more (copy 64 bytes per trip around
the main loop, e.g.).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
radford@calgary.UUCP (Radford Neal) (07/15/88)
In article <202@baka.stan.UUCP>, landru@stan.UUCP (Mike Rosenlof) writes:
> When I first brought up X on our color sun 4/260, recently converted from
> a sun 3/260, I was amazed that the X server performance for simple things
> like scrolling and moving windows around was no better...
> The loop which does most of the work for a bit blt looks like this for the
> common copy case:
> 
> register long count;
> register long *src, *dst;
> 
> 	while( --count )
> 	{
> 		*dst++ = *src++;
> 	}
> 
> [ goes on to examine the code generated for 68020 and SPARC ]

Your problem is that the above C code is grossly non-optimal.  Assuming
that "count" is typically fairly large, the optimal C code is the
following:

	bcopy ((char*)src, (char*)dst, count*sizeof(long));

If for some bizarre reason your C compiler doesn't come with a "bcopy"
routine, I suggest something along the following lines:

	while (count>8)
	{
		*dst++ = *src++; *dst++ = *src++;
		*dst++ = *src++; *dst++ = *src++;
		*dst++ = *src++; *dst++ = *src++;
		*dst++ = *src++; *dst++ = *src++;
		count -= 8;
	}
	while (count>0)
	{
		*dst++ = *src++;
		count -= 1;
	}

There are, of course, many variations, and it's hard to tell which will
be best on any particular processor, which is why "bcopy" was invented.

    Radford Neal
alverson@decwrl.dec.com (Robert Alverson) (07/15/88)
In article <204@baka.stan.UUCP> stan!landru@boulder.edu writes:
>In article <59798@sun.uucp> pope@sun.UUCP (John Pope) writes:
>>>register long count;
>>>register long *src, *dst;
>                ^^^^
>>>	while( --count )
>>>	{
>>>		*dst++ = *src++;
>>>	}
>>*** Warning!  Brain damaged software alert! ***
>>This should be re-coded to use the bcopy() library routine, which
>>does a 32 bit copy instead of a byte at a time.  You should see a
>>*noticeable* improvement.  Moral: use your libraries, that's what
>>they're there for.

Despite the incorrectness of Pope's reasoning, I tend to agree that you
should use a library routine to perform such a low-level function as
copying memory.  In particular, a library routine might unroll the loop
many times, so that the cost per word approaches that of a single
load+store pair.  This would make the cost per byte nearly 5 cycles on
SPARC (I think), bringing it to 300 ns (?).  This is still rather high; it
seems like a RISC ought to do a load+store in 2 or 3 cycles (scheduled!).

Similarly, on a VAX, the library routine might just happen to correspond
directly to a VAX instruction, so that the loop could be executed in
microcode.  In any case, copying memory seems like such a fundamentally
useful operation that you can expect the library code to be at least as
good as what you can get out of the compiler.

Bob
pope@vatican (John Pope) (07/15/88)
In article <204@baka.stan.UUCP>, landru@stan (Mike Rosenlof) writes:
>Last time I looked, 'long' on the sun C compilers was 32 bits, and this
>example still holds.  If the library function is optimized C or hand
>coded assembler, the machine code is going to come up nearly identical
>to my examples.  (assembler for 68020 and SPARC not quoted here)

I again apologize for my case of caffeine-induced type-ahead.  I'd seen a
couple of cases just lately where the char copy loop had been written in
this way (not even using register variables, yet) and went overboard.  My
point was not to defend our machine (really), but to say that rewriting of
standard functions can often lead to performance loss regardless of
machine or architecture.
-- 
	John Pope
	Sun Microsystems, Inc.
	pope@sun.COM
roy@phri.UUCP (Roy Smith) (07/15/88)
alverson@decwrl.UUCP (Robert Alverson) writes:
> I tend to agree that you should use a library routine to perform such a
> low-level function as copying memory.  [...] copying memory seems like
> such a fundamentally useful operation that you can expect the library
> code to be at least as good as what you can get out of the compiler.

On the other hand, library routines can't make assumptions about their
arguments.  Bcopy(3) can handle copies of arbitrary length with arbitrary
alignment, and thus must perform assorted checks to see if it has to copy
the first few and/or last bytes "by hand".  For short blocks, the overhead
of these extra checks might be important enough that coding your own
block-copy code might be a big win if you know you're only going to be
copying blocks with "nice" alignments and lengths.

On the other hand, every line of code you write yourself is just another
bug waiting to happen (not to mention a waste of programming time).  If
you call a library routine you can be reasonably sure it works right.  For
example, of all the times I've been convinced that malloc(3) and/or
free(3) was screwing up, never once have I not been able to (eventually)
trace the fault to my own code.

Moral: use library routines, but before you release the code do some
serious profiling on it.  As long as you don't spend much time in library
code, don't worry about possible inefficiencies therein.  If you do find
you're calling a routine with some special-case arguments and where the
generality of the library routine is slowing you down, then go ahead and
write your own replacement, but make sure it works as well as the
original!
-- 
Roy Smith, System Administrator
Public Health Research Institute
{allegra,philabs,cmcl2,rutgers}!phri!roy -or- phri!roy@uunet.uu.net
"The connector is the network"
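[Editorial aside: the special-case copy Smith describes--one that simply
assumes "nice" arguments--might look like the sketch below.  The function
name is hypothetical; the point is that it is only valid under promises a
general bcopy()/memcpy() cannot make for itself.]

```c
#include <stddef.h>

/* Hypothetical special-case copy: the caller promises that src and dst
 * are word-aligned, the regions do not overlap, and the length is an
 * exact number of words.  A general bcopy()/memcpy() must instead check
 * alignment and copy any odd leading/trailing bytes "by hand". */
void copy_aligned_words(long *dst, const long *src, size_t nwords)
{
    while (nwords-- > 0)
        *dst++ = *src++;
}
```

Dropping the alignment prologue/epilogue is exactly the saving (and the
risk) being weighed against the library routine's generality.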
grunwald@m.cs.uiuc.edu (07/15/88)
Actually, I would be aghast if bcopy() didn't use an unrolled version of
the loop; by jumping into a table of MOV instructions, you can eliminate
the decrement and jump instructions, giving much higher data movement for
fewer instructions.  But this has been hashed out before.
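[Editorial aside: the jump-into-a-table-of-moves idea has a well-known C
rendering, Duff's device.  The sketch below is illustrative; it is not
necessarily how any vendor's bcopy is actually coded.]

```c
#include <stddef.h>

/* Sketch of Duff's device: the copy is unrolled 8x, and the switch
 * jumps into the middle of the unrolled body to handle the leftover
 * count % 8 words, trading one computed jump for per-word
 * decrement-and-branch overhead.  Regions must not overlap. */
void duff_copy(long *dst, const long *src, size_t count)
{
    size_t n = (count + 7) / 8;   /* number of trips through the body */

    if (count == 0)
        return;
    switch (count % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```

The case labels interleaving with the do-while is legal C precisely
because a switch is just a computed jump into the enclosed block.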
radford@calgary.UUCP (Radford Neal) (07/16/88)
In article <204@baka.stan.UUCP>, landru@stan.UUCP (Mike Rosenlof) writes:
> ...In this isolated case, SPARC ( or at least this implementation
> by Fujitsu ) is not impressive, especially when comparing costs to a
> 68020 system.  This is one of the difficulties of very simple graphics
> like this, lots of data has to be moved around...

It's possible that we're all missing the point here.  Could it be that the
time for simple graphics on both the 68020 system and the SPARC system was
dominated by the access time of the frame buffer?  That no processor of
any speed could have speeded the graphics up?  Was the frame buffer the
same in the two cases?

    Radford Neal
henry@utzoo.uucp (Henry Spencer) (07/19/88)
In article <1746@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes:
>Your problem is that the above C code is grossly non-optimal.  Assuming
>that "count" is typically fairly large, the optimal C code is the
>following:
>
>	bcopy ((char*)src, (char*)dst, count*sizeof(long));

No.  The optimal C code uses memcpy, not bcopy.  The difference is not
just six of one and half a dozen of the other: memcpy asserts that its
operands do not overlap, and smart compilers can often generate better
code with this knowledge.  Also, memcpy is ANSI C and bcopy is just a
Berkeleyism :-), so compilers are more likely to pay special attention
(e.g. inlining) to memcpy.
-- 
Anyone who buys Wisconsin cheese is|  Henry Spencer at U of Toronto Zoology
a traitor to mankind.  --Pournelle |uunet!mnetor!utzoo!henry @zoo.toronto.edu
anc@camcon.uucp (Adrian Cockcroft) (07/20/88)
In article <202@baka.stan.UUCP>, landru@stan.UUCP (Mike Rosenlof) writes:
> When I first brought up X on our color sun 4/260....
> ... I was amazed that the X server performance for simple things
> like scrolling and moving windows around was no better.
> 
> The loop which does most of the work for a bit blt looks like this for the
> common copy case:
> 
> register long count;
> register long *src, *dst;
> 
> 	while( --count )
> 	{
> 		*dst++ = *src++;
> 	}
> 
> according to the 68020 users manual, this loop takes 10 clocks in the
> best case and 15 clocks in its cache case.  With a 40 nsec clock, this
> is 400 and 600 nsec per loop.

Are you using the DBRA instruction for this?  Has anyone ever seen a
compiler generate a DBRA?  Maybe Sun's bcopy library routine is in
assembler and uses it.

> the sun SPARC compiler after optimizing, produces:
> ....
> which takes 9 clocks, and with a 60 nsec clock, this is 540 nsec.
> 
> My point is that with a reduced instruction set, you're very likely to
> find some applications that are slowed down by this reduction.  In this
> case, I find that the sun 4/260 makes a very nice compile or compute
> server, but it's not a very impressive X server.

The Inmos Transputer has a RISC core with microcode added to speed up
things that compilers can use, and to put operating system primitives in
microcode.  One of its useful extras is a block move instruction that
moves words as fast as memory bandwidth will allow:

	ldl	src		;load local onto register stack
	ldl	dst
	ldc	count		;load constant
	move			;blast those RAM chips

The move will take 100 ns per word for on-chip src and dst, or 300 ns per
word for off-chip src and dst.  The compiler I have (Pentasoft C) can be
told to watch out for strcpy(s,"string constant"), where it knows the
length of src, and also uses move for bcopy and structure assignment.

A 'wcopy' routine or macro would be needed to get the above code:

	#define wcopy(src,dst,count) __ABCregs(count,dst,src);asm(" move")

would do the trick with Pentasoft C.  For bitblt, the T800 also has a
2-dimensional block move instruction.

Inmos's attitude is that the RISC core made enough space on the chip for
RAM and interprocessor links, but as the chip shrinks they are adding more
microcode space and taking common code sequences into microcode for better
performance on certain applications.  If this is the hardest work for an X
server, then Transputers should be pretty good.  X is currently being
ported to the Transputer by a team at the University of Kent.  The Atari
Abaq (T800 based) will have X as standard, but it probably uses its
superfast blitter chip rather than the T800.
-- 
| Adrian Cockcroft  anc@camcon.uucp  ..!uunet!mcvax!ukc!camcon!anc
-[T]- Cambridge Consultants Ltd, Science Park, Cambridge CB4 4DW,
| England, UK  (0223) 358855
(You are in a maze of twisty little C004's, all alike...)
ok@quintus.uucp (Richard A. O'Keefe) (07/22/88)
In article <1681@gofast.camcon.uucp> anc@camcon.uucp (Adrian Cockcroft) writes:
>> The loop which does most of the work for a bit blt looks like this for the
>> common copy case:
>> register long count;
>> register long *src, *dst;
>> 	while( --count )
>> 	{
>> 		*dst++ = *src++;
>> 	}
>Are you using the DBRA instruction for this?  Has anyone ever seen a compiler
>generate a DBRA?  Maybe SUNs bcopy library routine is in assembler and uses it.

The SunOS 3.2 C compiler is perfectly happy to generate a DBRA; you just
have to write some surprising C to do it.

	void move(dst, src, len)
	    register long *dst, *src;
	    register short len;
	    {
		do *dst++ = *src++; while (--len != -1);
	    }

compiled with -O yielded

		movl	a6@(8),a5
		movl	a6@(12),a4
		movw	a6@(18),d7
	L16:
		movl	a4@+,a5@+
		dbra	d7,L16

Remember that dbra operates on *16-bit* registers (boo hiss).  A
reasonably good strcpy for 68010s can be done thus:

	void strmov(dst, src)
	    register char *dst, *src;
	    {
		register short len = -2;
		while ((*dst++ = *src++) && --len != -1);
	    }

which turns into

		movl	a6@(8),a5
		movl	a6@(12),a4
		moveq	#-2,d7
	L14:
		movb	a4@+,a5@+
		dbeq	d7,L14

Frankly, I'd rather write clearer C and put up with not getting DBcc.
(Oops, just forfeited my junior hacker's badge...)  That, of course, is
the snag with CISCy instructions: if they're not *exactly* what you want,
you might as well not have them.
ralphw@ius3.ius.cs.cmu.edu (Ralph Hyre) (07/29/88)
In article <12485@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>>In article <202@baka.stan.UUCP> stan!landru@boulder.edu writes:
>>>... for the common copy case:
>>>[some code]
>In article <59798@sun.uucp> pope@vatican.Sun.COM (John Pope) writes:
>>This should be re-coded to use the bcopy() library routine
>Reread the original.  It *does* do a 32 bit copy. ...
>I suspect it can be optimised a bit more (copy 64 bytes per trip around
>the main loop, e.g.).

On a Sun, at least, don't you have the additional options of
1) playing MMU games (for 'virtual', read-only copies)
2) using the RasterOp stuff for the real thing.
I thought this was why the Sun kernel's bcopy was so much faster than the
C library one.
-- 
- Ralph W. Hyre, Jr.
Internet: ralphw@ius2.cs.cmu.edu    Phone:(412)268-{2847,3275} CMU-{BUGS,DARK}
Amateur Packet Radio: N3FGW@W2XO, or c/o W3VC, CMU Radio Club, Pittsburgh, PA