[comp.arch] RISC bashing at USENIX

rcd@ico.ISC.COM (Dick Dunn) (07/08/88)

[Better start off by emphasizing that these are my own opinions, not ISC's.]

"Neal Nelson and Associates" had a booth at the USENIX vendor's exhibition
whose sole purpose seemed to be RISC-bashing.  Although they purportedly
have developed a set of tests, which they call their "Business Benchmark
(R)" for helping people make realistic comparisons of machines under
realistic loads, in fact even basic descriptions of the tests cast serious
doubt on their real usefulness.  Some of the tests seem overly simplistic;
others contain obvious biases toward or against certain types of hardware
and/or I/O system software.

A couple of months ago, several trade rags ran articles reporting how Neal
Nelson & Associates (NNA) had shown that CISCs would beat RISCs.  It's
hard to tell just who botched what part of it--whether NNA did a bad job of
reporting it or the trade journals reported what they wanted to hear--but
the articles were abominable.  At USENIX, the NNA booth had great piles of
reprints of several of these RISC-bashing trade-rag articles--and NO OTHER
reports of substantive conclusions.  (Their other literature listed system
configurations they had tested and tried to explain the tests.)  I saw no
attempt to present a balanced picture, nor to compare a representative set
of comparable machines.

For example, the EE Times RISC-bash article (which I happen to have at
hand) shows a Sun 3/260 beating the pants off a Model 25 RT PC and
slightly winning over a MIPS M-500 on the test EE Times chose to show
in detail.  (The Sun 3 was the only CISC design represented; there were
other RISCs.  I'm singling out MIPS and the RT as examples.)  EE Times
chose to report in detail on only one test out of 18 in the benchmark!
The results as reported show some obvious problems:
	- Why were they using the fastest Sun 3 (25 MHz) for comparison
	  against the slowest MIPS machine?
	- If a 25 MHz CISC is only 12% faster than an 8 MHz RISC, how does
	  that make the CISC faster?
	- Although the older RT lost badly, it's not a full RISC design...
	  and one ought perhaps to take into account that the Sun system
	  costs more than 3 times as much.
	- Where are the other CISCs?  It seems very much like only the best
	  CISC they could find was used as the basis for comparison--and
	  certainly the best RISCs were NOT used.
At the point that EE Times published their report, it would have been easy
(for NNA in particular) to say that it was simply a case of EE Times doing
some very selective reporting.  But at USENIX, it was clear that NNA was
quite proud to show off the EE Times article...and it seems clear that NNA
has an ax to grind wrt RISCs even though it's not clear why.  They're
making some pretty strong statements:
	"I'm beginning to believe RISC doesn't belong everywhere, or
	possibly even anywhere..."
	"...we still haven't seen any areas of study that say RISC has been
	implemented and shown a marked improvement..."
...but they're not backing them up.  Both of these are attributed to Neal
Nelson.  Incidentally, the second statement may well be true--but it's
surprising that someone working on benchmarks could ignore a decade or so
of work!

(It's interesting that at ASPLOS last fall, the two camps seemed to be
holding viewpoints of "RISC has won" vs "the battle isn't over"--quite a
different story from what Nelson tells.)

Most folks familiar with the RISC-CISC debate have their biases, and in a
moment of passion they may get a little carried away in supporting their
viewpoint.  But what I see in this Neal Nelson stuff goes beyond a
momentary outburst--it's an unsubstantiated, unprofessional reaction to
the substantial real work (and very real, very fast machines) that have
come out of RISC development.  I think it added some heat and absolutely no
light...we need thought more than we need more emotion.
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd           (303)449-2870
   ...Are you making this up as you go along?

pardo@june.cs.washington.edu (David Keppel) (07/08/88)

[ RISC bashing at Usenix ]

Anybody out there at NNA care to respond?  Anybody from EE times?

( We've seen this topic quite a bit recently.  Please use e-mail. )

	;-D on  ( RISCy Business )  Pardo

dwc@homxc.UUCP (Malaclypse the Elder) (07/08/88)

i can imagine that as a vendor of benchmarks,
it is in neal nelson's interest to fuel the
debate of risc vs. cisc with results that are
contrary to popular belief.  they will probably
sell more benchmarks to people who want to see
if their machine also has the same problem.

danny chen
homxc!dwc

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (07/11/88)

In article <6888@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:
| 
| "Neal Nelson and Associates" had a booth at the USENIX vendor's exhibition
| whose sole purpose seemed to be RISC-bashing.  Although they purportedly
| have developed a set of tests, which they call their "Business Benchmark
| (R)" for helping people make realistic comparisons of machines under
| realistic loads, in fact even basic descriptions of the tests cast serious
| doubt on their real usefulness.  Some of the tests seem overly simplistic;
| others contain obvious biases toward or against certain types of hardware
| and/or I/O system software.

  Neal Nelson presents the results of his benchmarks and the description
of each test. You are free to interpret them any way you want. That's
not a flame, just a reminder that if you choose to interpret his results
as RISC bashing, then you seem to have decided that RISC is better, and
that a benchmark which doesn't show that is either biased or useless.

  We have looked at the NN benchmarks for a number of machines (I
obviously can't say which ones), and my personal reaction is that they
are reasonable and valid for business applications. If your application
is something else, why not get a benchmark suite which tests that,
rather than blasting NN?

  I don't consider ANY benchmark to be the whole story on a machine,
even my own.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

rcd@ico.ISC.COM (Dick Dunn) (07/12/88)

In response to my griping about Neal Nelson at USENIX, davidsen@
steinmetz.ge.com (William E. Davidsen Jr) writes:
>   Neal Nelson presents the results of his benchmarks and the description
> of each test. You are free to interpret them any way you want. That's
> not a flame, just a reminder that if you choose to interpret his results
> as RISC bashing, then you seem to have decided that RISC is better, and
> that a benchmark which doesn't show that is either biased or useless.

Good grief!  I did NOT interpret NN's results as RISC bashing...I
interpreted the *presentation* of the results as RISC bashing.  The
presentation at USENIX was flamboyantly anti-RISC--meaning that there are
statements about RISCs by NN which are vehemently anti-RISC and not backed
up by fact.  Yes, I am free to interpret the results any way I want--but NN
wants me to interpret them in a particular way, as strongly anti-RISC.  I
don't see that they support that viewpoint at all, which is what I'm
complaining about.

There *might* be some useful results and good work behind it all, but I
sure-as-hell can't find it even after trying to peel back the layers of
hype and bad journalism, so I tend to doubt that there's much there.

No, I haven't decided that RISC is better.  The biases in the benchmarks
(more to the point, in the reporting thereof) are evident regardless of
your own biases, if you just look at what's been said.  I pointed out some
of the more obvious problems...so if you think I'm off base, why don't you
take on the *substance* of what I said?  (I.e., if you're trying to say *I'm*
biased, show us how.  For example, do you dispute that they compared the
fastest Sun 3 against the slowest MIPS box?)

Most of my computing is done on CISCs, and they serve well.  But I had a
chance recently to run a couple of problems on a low-end RISC; they ran so
fast that I put in some debugging code to be sure something hadn't gotten
short-circuited!  (It hadn't.)  I'm not taking up the sword to defend RISC,
but I know the RISC guys aren't smoking rope--they're for real.

>   We have looked at the NN benchmarks for a number of machines (I
> obviously can't say which ones), and my personal reaction is that they
> are reasonable and valid for business applications...

OK, so which benchmarks are the good ones?  Note that the one EE Times
gave such prominent coverage to was one of the simplest--a loop with just 4
calculations (+-*/) on 16-bit integers, running 1 to 15 copies at a time
(sketched below).  That has about 0.5 * dsq to do with any real business
program.  And, as I
said in the original article, I could have attributed it to EE Times'
sloppiness (the rest of the article was an expository/stylistic/technical
mess) but for the fact that NN was showing it off.  NN has 17 other
benchmarks, and they could have put together a complete presentation of the
benchmarks on comparable machines.  They didn't.  Why not?
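
(To be concrete: as described, the test boils down to something like the
sketch below.  This is my reconstruction from the published description--
the operand values and iteration count are guesses, not NN's code:

	/* hypothetical reconstruction of the NN arithmetic test;
	   NN runs 1 to 15 simultaneous copies of something like this */
	#include <stdio.h>

	main()
	{
	    register short a = 7, b = 3;	/* 16-bit operands */
	    register long i;

	    for (i = 0; i < 1000000L; i++)
	    {
		a = a + b;		/* one each of +, -, *, / */
		a = a - b;
		a = a * b;
		a = a / b;
	    }
	    printf("%d\n", a);	/* use the result so the loop can't
				   be optimized away */
	    return 0;
	}

Whatever that measures, it isn't a business workload.)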

> ...why not get a benchmark suite which tests [what concerns you],
> rather than blasting NN?

Done!  I have tests of my own which I run when *I* want to get an idea of
how fast a processor is.  The reason I'm blasting NN is that I see them
misleading people--and using a lot of PR to mislead a lot of people.  It's
that aspect that bothers me--not that it's RISCs per se that they're
bashing, but that they're bashing, instead of testing and reporting
carefully.
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd           (303)449-2870
   ...Are you making this up as you go along?

walter@garth.UUCP (Walter Bays) (07/13/88)

Several articles recently have commented that Neal Nelson (benchmark
service) is challenging the widely held view that RISC is faster than
CISC, saying that CISC is faster than RISC.  NN's most famous
comparison shows that a Sun 3 is faster than a Sun 4.  The other side
can cite benchmarks showing that a Sun 4 is much faster than a Sun 3.

An article in the July issue of UNIX Review sheds some light on the issue.
It's by David Wilson of Workstation Laboratories (another benchmark service).
The article shows "... a class of tasks for which the Sun 4/260 is two or
three times faster than the Sun 3/260, a class for which performance is about
the same, and a class where the 4/260's performance seems slightly lower..."
Wilson discusses the reasons for these results.

Any computer designer selects architectural features based on their
expected utility given the class of workloads to be run.  RISC arose
from the observation that many of the features of conventional
computers did not help (or hurt) performance _in_ _the_ _typical_
_case_.  Various CISC's have always included several RISC-like features
their designers found helpful.  And most RISC's include a few CISC-like
features their designers found helpful.  For performance comparisons,
there is no substitute for a benchmark that represents your application(s).
-- 
------------------------------------------------------------------------------
My opinions are my own.  Objects in mirror are closer than they appear.
E-Mail route: ...!pyramid!garth!walter		(415) 852-2384
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
------------------------------------------------------------------------------

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (07/13/88)

In article <6965@ico.ISC.COM> rcd@ico.ISC.COM (Dick Dunn) writes:

| >   We have looked at the NN benchmarks for a number of machines (I
| > obviously can't say which ones), and my personal reaction is that they
| > are reasonable and valid for business applications...
| 
| OK, so which benchmarks are the good ones?  Note that the one that EE Times
| gave such prominent coverage was one of the simplest--a loop with just 4
| calculations (+-*/) on 16-bit integers, running 1 to 15 copies at a time.

  The decision is yours... NN gives the result of the test and what it
measures. I don't disagree that considering (any) one benchmark as an
indicator is probably a waste, but with a selection of results you can
compare two (or more) machines in those areas which apply to your
situation.

  I have a UNIX benchmark suite which I have run on a number of machines
for my personal edification. It measures some raw performance numbers
such as the speed of arithmetic for all data types, transcendental
functions, test and branch for int and float, disk access and transfer
times for large and small files, speed of bit fiddling such as Gray to
binary (sketched below), etc. Then I measure speed of compile, performance under
multitasking load, speed of pipes and system calls, and a few other
things. The *one* thing I measure which consistently represents the
overall performance of the machine is the real time to run the entire
benchmark suite.
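
  As an example of the bit fiddling class, the Gray-to-binary test is
timing the classic fold-down trick.  This sketch is illustrative only,
not my actual test code:

	/* binary value = running XOR of the Gray-code bits from the
	   top bit down; shift-and-XOR folding does it in five steps
	   for a 32-bit long */
	unsigned long
	gray_to_binary(g)
	unsigned long g;
	{
		g ^= g >> 16;
		g ^= g >> 8;
		g ^= g >> 4;
		g ^= g >> 2;
		g ^= g >> 1;
		return g;
	}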
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

landru@stan.UUCP (Mike Rosenlof) (07/13/88)

In article <936@garth.UUCP> walter@garth.UUCP (Walter Bays) writes:
>
>An article in the July issue of UNIX Review sheds some light on the issue.
>It's by David Wilson of Workstation Laboratories (another benchmark service).
>The article shows "... a class of tasks for which the Sun 4/260 is two or
>three times faster than the Sun 3/260, a class for which performance is about
>the same, and a class where the 4/260's performance seems slightly lower..."
>Wilson discusses the reasons for these results.
>

When I first brought up X on our color sun 4/260, recently converted from
a sun 3/260, I was amazed that the X server performance for simple things
like scrolling and moving windows around was no better.  This was just how
it looked; I didn't get out my stopwatch.  So I did a little comparison
with bit blt ( bit block transfer, scrolling, moving a window, ... ) timing.

The loop which does most of the work for a bit blt looks like this for the
common copy case:

register long count;
register long *src, *dst;

   while( --count )	/* pre-decrement: copies count-1 longs */
   {
      *dst++ = *src++;
   }


The Sun 68k compiler, after optimizing, produces this code:

LY00001:
    movl    a5@+,a4@+
LY00000:
    subql   #1,d7
    jne LY00001

According to the 68020 user's manual, this loop takes 10 clocks in the
best case and 15 clocks in its cache case.  With a 40 nsec clock, this
is 400 and 600 nsec per loop.


The Sun SPARC compiler, after optimizing, produces:

LY2:                    ! [internal]
    ld  [%o3],%o0
    dec %o5
    tst %o5
    st  %o0,[%o4]
    inc 4,%o3
    bne LY2
    inc 4,%o4

which takes 9 clocks, and with a 60 nsec clock, this is 540 nsec.

Since the 68K loop is so tight, I suspect we're seeing the best case 68K
timing with SPARC doing the set up work faster to make up some of the
difference.  

Which processor is going to get faster clocks sooner?  Or will newer
versions reduce the clock count?  I've heard of a 33 MHz 68020 being
available now or soon; other SPARC implementations are also said to be
in the works.


My point is that with a reduced instruction set, you're very likely to 
find some applications that are slowed down by this reduction.  In this
case, I find that the sun 4/260 makes a very nice compile or compute
server, but it's not a very impressive X server.

-- 
Mike Rosenlof		SAE			(303)447-2861
2190 Miller Drive			stan!landru@boulder.edu
Longmont Colorado			landru@stan.uucp
80501					...hao!boulder!stan!landru

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (07/14/88)

>[ RISC bashing at Usenix ]


Before you shed too many tears about poor RISC people being bashed,
just remember all the hype with which "RISC" was marketed to the
computing world 5 (?) years ago.  And remember, back then, that
supposedly things like arithmetic speed and memory bandwidth didn't
matter, according to the RISC camp.  Now that we have new processors
like the MIPS R3000/R3010 and the Motorola 88K series, to mention
a couple of recent designs, things have come a long way.

Rather than say that RISC or CISC won, I think a more fair way to sum
up what really happened is to say that performance won.  Conventional
wisdom 15 years ago (at, say, IBM and CDC) was that there would always
be a very limited demand for high performance systems.  What happened
was quite different.  People found maximum performance useful in
systems of all price categories.  This was not anticipated by a lot
of people.  But whether that performance is provided by a RISC or
CISC system does not matter to the end user.

That aside, it warms my heart to see MIPS (R3000/R3010) and Motorola
(88K series) battling it out over performance on benchmarks like
Linpack and the Livermore Loops.  Just two years ago, these companies
were leaving floating point/memory intensive job performance to the
mainframe bunch.  The latest MIPS performance brief shows, among other
things, the MIPS M/2000 system weighing in at 3.6 64-bit Fortran
MFLOPS--faster than the CDC 7600.  Now that is progress.

[By the way, my hat is off to MIPS for their latest performance brief
(3.4 - dated June 1988).  Good job.  I wish every company would
provide a report like this.]

-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

pope@vatican.Sun.COM (John Pope) (07/14/88)

In article <202@baka.stan.UUCP> stan!landru@boulder.edu writes:
>
>When I first brought up X on our color sun 4/260, recently converted from
>a sun 3/260, I was amazed that the X server performance for simple things
>like scrolling and moving windows around was no better.
> [...]
>The loop which does most of the work for a bit blt looks like this for the
>common copy case:
>
>register long count;
>register long *src, *dst;
>
>   while( --count )
>   {
>      *dst++ = *src++;
>   }
>

*** Warning! Brain damaged software alert! ***
This should be re-coded to use the bcopy() library routine, which
does a 32 bit copy instead of a byte at a time. You should see a
*noticeable* improvement. Moral: use your libraries, that's what they're 
there for.

>My point is that with a reduced instruction set, you're very likely to
>find some applications that are slowed down by this reduction.  In this
>case, I find that the sun 4/260 makes a very nice compile or compute
>server, but it's not a very impressive X server.

Please be careful about drawing conclusions like this regarding a
particular machine or architecture. Performance is a combination of a
lot of factors, most of them not as clear-cut as this case is...

>Mike Rosenlof		SAE			(303)447-2861
>2190 Miller Drive			stan!landru@boulder.edu
>Longmont Colorado			landru@stan.uucp
>80501					...hao!boulder!stan!landru

John Pope
	Sun Microsystems, Inc. 
		pope@sun.COM

pope@vatican.Sun.COM (John Pope) (07/14/88)

In article <59798@sun.uucp> pope@sun.UUCP (John Pope) writes:
>
>*** Warning! Brain damaged software alert! ***

Sorry, my brain damage. After double-checking, I saw you were copying 
longs instead of chars. While using the libraries is almost always right,
you probably wouldn't see much difference here. Oh well...

John Pope
	Sun Microsystems, Inc. 
		pope@sun.COM

landru@stan.UUCP (Mike Rosenlof) (07/14/88)

In article <59798@sun.uucp> pope@sun.UUCP (John Pope) writes:
>>register long count;
>>register long *src, *dst;
           ^^^^
>>   while( --count )
>>   {
>>      *dst++ = *src++;
>>   }
>*** Warning! Brain damaged software alert! ***
>This should be re-coded to use the bcopy() library routine, which
>does a 32 bit copy instead of a byte at a time. You should see a
>*noticeable* improvement. Moral: use your libraries, that's what they're 
>there for.

Last time I looked, 'long' on the sun C compilers was 32 bits, and this
example still holds.  If the library function is optimized C or hand
coded assembler, the machine code is going to come up nearly identical
to my examples. (assembler for 68020 and SPARC not quoted here)

>>My point is that with a reduced instruction set, you're very likely to
>>find some applications that are slowed down by this reduction.  In this
>>case, I find that the sun 4/260 makes a very nice compile or compute
>>server, but it's not a very impressive X server.
>
>Please be careful about drawing conclusions like this regarding a
>particular machine or architecture. Performance is a combination of a
>lot of factors, most of them not as clear-cut as this case is...


I think I was clear that this is a very specific example, and that in
other areas its performance is "very nice".  I don't know of many
machines that spend large portions of CPU cycles just doing bit blt
(or bcopy).  In this isolated case, SPARC ( or at least this implementation
by Fujitsu ) is not impressive, especially when comparing costs to a
68020 system.  This is one of the difficulties of very simple graphics
like this: lots of data has to be moved around.  On more compute-intensive
functions like shaded surfaces, I'm sure the Sun 4 would be a tremendous
improvement.


-- 
Mike Rosenlof		SAE			(303)447-2861
2190 Miller Drive			stan!landru@boulder.edu
Longmont Colorado			landru@stan.uucp
80501					...hao!boulder!stan!landru

chris@mimsy.UUCP (Chris Torek) (07/15/88)

>In article <202@baka.stan.UUCP> stan!landru@boulder.edu writes:
>>... for the common copy case:
>>register long *src, *dst;

In article <59798@sun.uucp> pope@vatican.Sun.COM (John Pope) writes:
>This should be re-coded to use the bcopy() library routine, which
>does a 32 bit copy instead of a byte at a time.

Reread the original.  It *does* do a 32 bit copy.

Still, one should use bcopy/memcopy/memmove/whatever-we-call-it-this-week.
I suspect it can be optimised a bit more (copy 64 bytes per trip around
the main loop, e.g.).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

radford@calgary.UUCP (Radford Neal) (07/15/88)

In article <202@baka.stan.UUCP>, landru@stan.UUCP (Mike Rosenlof) writes:

> When I first brought up X on our color sun 4/260, recently converted from
> a sun 3/260, I was amazed that the X server performance for simple things
> like scrolling and moving windows around was no better...

> The loop which does most of the work for a bit blt looks like this for the
> common copy case:
> 
> register long count;
> register long *src, *dst;
> 
>    while( --count )
>    {
>       *dst++ = *src++;
>    }
> 
> [ goes on to examine the code generated for 68020 and SPARC ]


Your problem is that the above C code is grossly non-optimal. Assuming
that "count" is typically fairly large, the optimal C code is the
following:

     bcopy ((char*)src, (char*)dst, count*sizeof(long));

If for some bizarre reason your C compiler doesn't come with a "bcopy"
routine, I suggest something along the following lines:

     while (count>8)
     {
        *dst++ = *src++; *dst++ = *src++; *dst++ = *src++; *dst++ = *src++;
        *dst++ = *src++; *dst++ = *src++; *dst++ = *src++; *dst++ = *src++;
        count -= 8;
     }
     while (count>0)
     { 
       *dst++ = *src++;
       count -= 1;
     }

There are, of course, many variations, and it's hard to tell which will
be best on any particular processor, which is why "bcopy" was invented.

   Radford Neal

alverson@decwrl.dec.com (Robert Alverson) (07/15/88)

In article <204@baka.stan.UUCP> stan!landru@boulder.edu writes:
>In article <59798@sun.uucp> pope@sun.UUCP (John Pope) writes:
>>>register long count;
>>>register long *src, *dst;
>           ^^^^
>>>   while( --count )
>>>   {
>>>      *dst++ = *src++;
>>>   }
>>*** Warning! Brain damaged software alert! ***
>>This should be re-coded to use the bcopy() library routine, which
>>does a 32 bit copy instead of a byte at a time. You should see a
>>*noticeable* improvement. Moral: use your libraries, that's what they're 
>>there for.

Despite the incorrectness of Pope's reasoning, I tend to agree that
you should use a library routine to perform such a low-level function
as copying memory.  In particular, a library routine might unroll
the loop many times, so that the cost per word approaches that of a
single load+store pair.  This would make the cost per word nearly 5
cycles on Sparc (I think), bringing it to 300ns (?).  This is still
rather high, it seems like a RISC ought to do a load+store in 2 or
3 cycles (scheduled!).

Similarly, on a VAX, the library routine might just happen to correspond
directly to a VAX instruction, so that the loop could be executed in
microcode.  In any case, copying memory seems like such a fundamentally
useful operation that you can expect the library code to be at least
as good as what you can get out of the compiler.

Bob

pope@vatican (John Pope) (07/15/88)

In article <204@baka.stan.UUCP>, landru@stan (Mike Rosenlof) writes:

>Last time I looked, 'long' on the sun C compilers was 32 bits, and this
>example still holds. If the library function is optimized C or hand
>coded assembler, the machine code is going to come up nearly identical
>to my examples. (assembler for 68020 and SPARC not quoted here)

I again apologize for my case of caffeine-induced type-ahead. I'd seen
a couple of cases just lately where the char copy loop had been written
in this way (not even using register variables, yet) and went overboard.
My point was not to defend our machine (really), but to say that rewriting
of standard functions can often lead to performance loss regardless of
machine or architecture.

>Mike Rosenlof		SAE			(303)447-2861
>2190 Miller Drive			stan!landru@boulder.edu
>Longmont Colorado			landru@stan.uucp
>80501					...hao!boulder!stan!landru
-- 
John Pope
	Sun Microsystems, Inc. 
		pope@sun.COM

roy@phri.UUCP (Roy Smith) (07/15/88)

alverson@decwrl.UUCP (Robert Alverson) writes:
> I tend to agree that you should use a library routine to perform such a
> low-level function as copying memory. [...] copying memory seems like
> such a fundamentally useful operation that you can expect the library
> code to be at least as good as what you can get out of the compiler.

	On the other hand, library routines can't make assumptions about
their arguments.  Bcopy(3) can handle copies of arbitrary length with
arbitrary alignment, and thus must perform assorted checks to see if it has
to copy the first few and/or last bytes "by hand".  For short blocks, the
overhead of these extra checks might be important enough that coding your
own block copy might be a big win if you know you're only going to be
copying blocks with "nice" alignments and lengths.
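
	For instance (purely hypothetical code, assuming the caller
guarantees long alignment and a byte count that is a multiple of
sizeof(long), so all of bcopy's checks disappear):

	/* special-case copy: both pointers long-aligned, nbytes a
	   multiple of sizeof(long)--the caller's responsibility */
	wordcopy(src, dst, nbytes)
	register long *src, *dst;
	register long nbytes;
	{
		register long n = nbytes / sizeof(long);

		while (n-- > 0)
			*dst++ = *src++;
	}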

	On the other hand, every line of code you write yourself is just
another bug waiting to happen (not to mention a waste of programming time).
If you call a library routine you can be reasonably sure it works right.  For
example, of all the times I've been convinced that malloc(3) and/or free(3)
was screwing up, never once have I not been able to (eventually) trace the
fault to my own code.

	Moral: use library routines, but before you release the code do some
serious profiling on it.  As long as you don't spend much time in library
code, don't worry about possible inefficiencies therein.  If you do find
you're calling a routine with some special-case arguments and where the
generality of the library routine is slowing you down, then go ahead and
write your own replacement, but make sure it works as well as the original!
-- 
Roy Smith, System Administrator
Public Health Research Institute
{allegra,philabs,cmcl2,rutgers}!phri!roy -or- phri!roy@uunet.uu.net
"The connector is the network"

grunwald@m.cs.uiuc.edu (07/15/88)

Actually, I would be aghast if bcopy() didn't use an unrolled version of
the loop; by jumping into a table of MOV instructions, you can eliminate the
decrement and jump instructions, giving much higher data movement for
fewer instructions.
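
In C, the jump-into-the-table trick is Duff's device.  A sketch for the
copy loop under discussion (same variable names as before; assumes
count > 0):

	register long n = (count + 7) / 8;	/* count, src, dst as before */

	switch (count % 8)	/* jump into the unrolled loop */
	{
	case 0:	do { *dst++ = *src++;
	case 7:	     *dst++ = *src++;
	case 6:	     *dst++ = *src++;
	case 5:	     *dst++ = *src++;
	case 4:	     *dst++ = *src++;
	case 3:	     *dst++ = *src++;
	case 2:	     *dst++ = *src++;
	case 1:	     *dst++ = *src++;
		} while (--n > 0);
	}

That's one decrement-and-branch per eight words moved instead of one
per word.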

But this has been hashed out before.

radford@calgary.UUCP (Radford Neal) (07/16/88)

In article <204@baka.stan.UUCP>, landru@stan.UUCP (Mike Rosenlof) writes:

> ...In this isolated case, SPARC ( or at least this implementation
> by Fujitsu ) is not impressive, especially when comparing costs to a
> 68020 system.  This is one of the difficulties of very simple graphics
> like this, lots of data has to be moved around...

It's possible that we're all missing the point here. Could it be that
the time for simple graphics on both the 68020 system and the SPARC system
was dominated by the access time of the frame buffer? That no processor
of any speed could have speeded the graphics up?

Was the frame buffer the same in the two cases?

   Radford Neal

henry@utzoo.uucp (Henry Spencer) (07/19/88)

In article <1746@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes:
>Your problem is that the above C code is grossly non-optimal. Assuming
>that "count" is typically fairly large, the optimal C code is the
>following:
>
>     bcopy ((char*)src, (char*)dst, count*sizeof(long));

No.  The optimal C code uses memcpy, not bcopy.  The difference is not
just six of one and half a dozen of the other:  memcpy asserts that its
operands do not overlap, and smart compilers can often generate better
code with this knowledge.  Also, memcpy is ANSI C and bcopy is just a
Berkeleyism :-), so compilers are more likely to pay special attention
(e.g. inlining) to memcpy.
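
Note in passing that the argument orders differ too, which bites people
porting between the two:

	bcopy(src, dst, len);		/* BSD: source first */
	memcpy(dst, src, len);		/* ANSI: destination first */
	memmove(dst, src, len);		/* ANSI: like memcpy, but defined
					   even when src and dst overlap */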
-- 
Anyone who buys Wisconsin cheese is|  Henry Spencer at U of Toronto Zoology
a traitor to mankind.  --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu

anc@camcon.uucp (Adrian Cockcroft) (07/20/88)

In article <202@baka.stan.UUCP>, landru@stan.UUCP (Mike Rosenlof) writes:
> 
> When I first brought up X on our color sun 4/260....
> ... I was amazed that the X server performance for simple things
> like scrolling and moving windows around was no better.  This was just how
...
> The loop which does most of the work for a bit blt looks like this for the
> common copy case:
> 
> register long count;
> register long *src, *dst;
> 
>    while( --count )
>    {
>       *dst++ = *src++;
>    }
...... 
> according to the 68020 users manual, this loop takes 10 clocks in the
> best case and 15 clocks in its cache case.  With a 40 nsec clock, this
> is 400 and 600 nsec per loop.
Are you using the DBRA instruction for this? Has anyone ever seen a compiler
generate a DBRA? Maybe Sun's bcopy library routine is in assembler and uses it.

> the sun SPARC compiler after optimizing, produces:
....
> which takes 9 clocks, and with a 60 nsec clock, this is 540 nsec.
> 
> My point is that with a reduced instruction set, you're very likely to 
> find some applications that are slowed down by this reduction.  In this
> case, I find that the sun 4/260 makes a very nice compile or compute
> server, but it's not a very impressive X server.

The Inmos Transputer has a RISC core with microcode added to speed up things
that compilers can use and to put operating system primitives in microcode.
One of its useful extras is a block move instruction that moves words as fast
as memory bandwidth will allow.

        ldl src        ;load local onto register stack
        ldl dst
        ldc count      ;load constant
        move           ;blast those RAM chips

The move will take 100 ns per word for on-chip src and dst or 300ns per word
for off-chip src and dst. The compiler I have (Pentasoft C) can be told to
watch out for strcpy(s,"string constant") where it knows the length of src
and also uses move for bcopy and structure assignment. A 'wcopy' routine
or macro would be needed to get the above code:

#define wcopy(src,dst,count) __ABCregs(count,dst,src);asm(" move")
would do the trick with Pentasoft C.

For bitblt the T800 also has a 2 dimensional block move instruction.  Inmos's
attitude is that the RISC core made enough space on the chip for RAM and
interprocessor links but as the chip shrinks they are adding more microcode
space and taking common code sequences into microcode for better performance on
certain applications.

If this is the hardest work for an X server then Transputers should be pretty
good.  X is currently being ported to the Transputer by a team at the
University of Kent.  The Atari Abaq (T800 based) will have X as standard but it
probably uses its superfast blitter chip rather than the T800.

-- 
  |   Adrian Cockcroft anc@camcon.uucp  ..!uunet!mcvax!ukc!camcon!anc
-[T]- Cambridge Consultants Ltd, Science Park, Cambridge CB4 4DW,
  |   England, UK                                        (0223) 358855
      (You are in a maze of twisty little C004's, all alike...)

ok@quintus.uucp (Richard A. O'Keefe) (07/22/88)

In article <1681@gofast.camcon.uucp> anc@camcon.uucp (Adrian Cockcroft) writes:
>> The loop which does most of the work for a bit blt looks like this for the
>> common copy case:
>> register long count;
>> register long *src, *dst;
>>    while( --count )
>>    {
>>       *dst++ = *src++;
>>    }
>Are you using the DBRA instruction for this? Has anyone ever seen a compiler
>generate a DBRA? Maybe SUNs bcopy library routine is in assembler and uses it.

The SunOS 3.2 C compiler is perfectly happy to generate a DBRA; you just
have to write some surprising C to do it.

void move(dst, src, len)
    register long *dst, *src;
    register short len;
    {
	do *dst++ = *src++;
	while (--len != -1);
    }

compiled with -O yielded

	movl	a6@(8),a5
	movl	a6@(12),a4
	movw	a6@(18),d7
L16:
	movl	a4@+,a5@+
	dbra	d7,L16

Remember that dbra operates on *16-bit* registers (boo hiss).
A reasonably good strcpy for 68010s can be done thus:

void strmov(dst, src)
    register char *dst, *src;
    {
	register short len = -2;
	while ((*dst++ = *src++) && --len != -1);
    }

which turns into 

	movl	a6@(8),a5
	movl	a6@(12),a4
	moveq	#-2,d7
L14:
	movb	a4@+,a5@+
	dbeq	d7,L14

Frankly, I'd rather write clearer C and put up with not getting DBcc.
(Oops, just forfeited my junior hacker's badge...)

That, of course, is the snag with CISCy instructions: if they're not
*exactly* what you want, you might as well not have them.

ralphw@ius3.ius.cs.cmu.edu (Ralph Hyre) (07/29/88)

In article <12485@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>>In article <202@baka.stan.UUCP> stan!landru@boulder.edu writes:
>>>... for the common copy case:
>>>[some code]
>In article <59798@sun.uucp> pope@vatican.Sun.COM (John Pope) writes:
>>This should be re-coded to use the bcopy() library routine
>Reread the original.  It *does* do a 32 bit copy.
...
>I suspect it can be optimised a bit more (copy 64 bytes per trip around
>the main loop, e.g.).

On a Sun, at least, don't you have the additional options of
1) playing MMU games (for 'virtual', read-only copies)
2) using the RasterOp stuff for the real thing?
I thought this was why the Sun kernel's bcopy was so much faster than the
C library one.
-- 
					- Ralph W. Hyre, Jr.

Internet: ralphw@ius2.cs.cmu.edu    Phone:(412)268-{2847,3275} CMU-{BUGS,DARK}
Amateur Packet Radio: N3FGW@W2XO, or c/o W3VC, CMU Radio Club, Pittsburgh, PA