[comp.lang.c] Tuning your libraries for your machine

gnu@hoptoad.UUCP (04/05/87)

In article <1537@husc6.UUCP>, reiter@endor.harvard.edu (Ehud Reiter) writes:
> 2) Simple routines like strcpy should be adjusted to perform well on a
> particular architecture (if the microVAX doesn't have a hardware locc
> instruction, then is it too much to ask that the run-time library supplied
> for the microVAX be changed not to use locc, at least in small and frequently
> used routines like strcpy?)

It only becomes reasonable to tailor a system for a particular piece of
hardware when there are only a small number of variants that run that
architecture.  In other words, this might have been fine when there was
the 780 and the 750 (nobody counted the 730 or MV-1 anyway) but once
you have a bunch of models, you just have to make the code
straightforward and not do anything that *really* breaks on some
machine.  I presume in the VAX case this means mostly avoiding the
unimplemented instructions.

I worked on an APL system for the IBM 360/370 and just finding out the
timings for the 15 or 20 models that could run the code was too much work,
let alone figuring out which combination would be best until IBM's next
release.  (No flames on 15..20, this was in 1973!)

(Of course, the same applies to an "architecture" like C/Unix -- write
code that's straightforward and doesn't do anything that really breaks
anywhere.  Super optimizing your C source is kinda hard these days --
are you *sure* it's better to code it this way on the Cray?  IBM?  DG?
DEC?  8080?)
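
To make "straightforward" concrete, here is a minimal sketch of a portable
string copy in plain C.  The name my_strcpy is hypothetical, chosen to avoid
clashing with the library routine; this is not the code of any vendor's
library, only an illustration of a loop with no machine-specific tricks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name.  A plain, portable copy loop: no special
   instructions, no assumptions about word size or alignment.
   Returns dst, as the library strcpy does. */
char *my_strcpy(char *dst, const char *src)
{
    char *d = dst;

    assert(dst != NULL && src != NULL);
    while ((*d++ = *src++) != '\0')
        ;   /* each byte, including the terminator, is copied in the test */
    return dst;
}
```

Whether this is the fastest form on any particular machine is exactly the
question the thread leaves open; the point is only that it should not break
anywhere.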

It's true that a tailored shared library could give some benefit, but
the general problem extends to what code to generate inline, not just
in library routines.
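
One way a tailored library might be arranged is to pick an implementation
once, at startup, and call through a pointer afterwards.  The sketch below
assumes a single yes/no hardware feature flag; every name in it
(copy_simple, copy_tuned, select_copy, has_fast_string_hw) is hypothetical,
and copy_tuned is only a stand-in (a loop unrolled by two), not real
machine-specific code.

```c
#include <assert.h>
#include <stddef.h>

typedef char *(*copy_fn)(char *dst, const char *src);

/* Baseline: the plain byte-at-a-time loop. */
static char *copy_simple(char *dst, const char *src)
{
    char *d = dst;

    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}

/* Stand-in for a tuned variant: same result, loop unrolled by two. */
static char *copy_tuned(char *dst, const char *src)
{
    char *d = dst;

    for (;;) {
        if ((*d++ = *src++) == '\0')
            break;
        if ((*d++ = *src++) == '\0')
            break;
    }
    return dst;
}

/* Choose once; the flag would come from whatever the system
   offers for detecting the feature (an assumption here). */
copy_fn select_copy(int has_fast_string_hw)
{
    return has_fast_string_hw ? copy_tuned : copy_simple;
}
```

This covers the library case only.  As noted above, it does nothing for
code the compiler generates inline.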

> 3) Simple routines like strcpy should be recoded in assembler, at least to
> the degree of having their procedure prologues simplified, and so that they
> use registers which don't have to be restored.
> 4) In-line expansion of common (and simple) library routines should be
> considered.

These should both be done automatically by a good compiler.  Compilers
that put in large procedure prologues/epilogues and don't simplify them when
possible have no excuses.  Those that won't use the scratch registers
for variables when possible have excuses but newer compilers are beating
them -- excuses don't benchmark very well.
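
For point 4, here is a sketch of what asking for in-line expansion looks
like in C source, assuming a compiler that accepts the inline keyword (it
was standardized in C99, well after this thread).  The name inline_strlen is
hypothetical.  The keyword is only a hint; whether the call is actually
expanded in place is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name.  A small leaf routine with no calls of its own,
   which is the kind of function a compiler can expand at the call
   site and for which it can skip most of the prologue/epilogue. */
static inline size_t inline_strlen(const char *s)
{
    const char *p = s;

    assert(s != NULL);
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```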
-- 
Copyright 1987 John Gilmore; you can redistribute only if your recipients can.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu	       gnu@ingres.berkeley.edu

nelson@ohlone.UUCP (04/06/87)

In article <1959@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> It only becomes reasonable to tailor a system for a particular piece of
> hardware when there are only a small number of variants that run that
> architecture.
> [...]
> (Of course, the same applies to an "architecture" like C/Unix -- write
> code that's straightforward and doesn't do anything that really breaks
> anywhere.  Super optimizing your C source is kinda hard these days --
> are you *sure* it's better to code it this way on the Cray?  IBM?  DG?
> DEC?  8080?)

Right on!  Things like a[i++] = *++p; are totally lost on a Cray.
Unfortunately, the current Cray C compiler is none too great ... we're
working on it!
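
For readers unfamiliar with the idiom, the two hypothetical functions below
perform the same copy: the first in the pointer-bumping style quoted above,
the second as a counted loop with plain indices.  That the indexed form is
easier for a vectorizing compiler to recognize is an assumption of this
sketch, not something the post states.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-bumping style: copies p[1..n] into a[0..n-1].
   p must point at an array of at least n + 1 elements. */
void copy_bump(int *a, const int *p, size_t n)
{
    size_t i = 0;

    assert(a != NULL && p != NULL);
    while (i < n)
        a[i++] = *++p;
}

/* Indexed style: same result, with the trip count and the
   access pattern spelled out in the loop header and body. */
void copy_indexed(int *a, const int *p, size_t n)
{
    size_t i;

    assert(a != NULL && p != NULL);
    for (i = 0; i < n; i++)
        a[i] = p[i + 1];
}
```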
-----------------------
Bron Nelson     {ihnp4, lll-lcc}!ohlone!nelson
Not the opinions of Cray Research

jda@mas1.UUCP (04/06/87)

In article <1959@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> In article <1537@husc6.UUCP>, reiter@endor.harvard.edu (Ehud Reiter) writes:
> > 2) Simple routines like strcpy should be adjusted to perform well on a
> > particular architecture....
> 
> It only becomes reasonable to tailor a system for a particular piece of
> hardware when there are only a small number of variants that run that
> architecture....
> I worked on an APL system for the IBM 360/370 and just finding out the
> timings for the 15 or 20 models that could run the code was too much work,
> let alone figuring out which combination would be best until IBM's next
> release.  (No flames on 15..20, this was in 1973!)

A simple but common logic flaw in my opinion.  Granted that it can require
up to 15 or 20 times the effort to support 15 or 20 models, but the issue is
whether any such model is worth added support.  I can understand a statement
like "I'm not going to optimize for the Lemon III Model B since Lemon Computer
Corporation hasn't even sold one yet."  But John Gilmore seems to be saying:
"IBM was selling thousands of machines a month so the only sensible thing
was to move my product to a company whose market was so small they wouldn't
confuse me with multiple models."

Apologies to John -- the problem is likely to be with budgeting misconceptions
rather than the technical staff.

> 
> It's true that a tailored shared library could give some benefit, but
> the general problem extends to what code to generate inline, not just
> in library routines.

The user doesn't demand a general solution.  He just doesn't like his
application running 20 times slower than necessary.  The plain fact is that
major savings can result from optimizing a few routines (strcpy, ldiv being
good examples).
 
> > 3) Simple routines like strcpy should be recoded in assembler, at least to
> > the degree of having their procedure prologues simplified, and so that they
> > use registers which don't have to be restored.
> 
> These should both be done automatically by a good compiler....

"should be" but not necessarily "is".  There are some *really pathetic*
compilers out there.  From a recent poison remedy pamphlet:
	Induce vomiting.  If necessary show the subject the output
	   of Whitesmiths 68k C compiler.

James D. Allen -- opinions not necessarily necessary.

guy@gorodish.UUCP (04/07/87)

>A simple but common logic flaw in my opinion.  Granted that it can require
>up to 15 or 20 times the effort to support 15 or 20 models, but the issue is
>whether any such model is worth added support.

Do you know of any major vendor who *does* provide that kind of
support?  Does DEC provide different versions of the VMS libraries
for different VAX models?  Does IBM provide different versions of the
MVS libraries for different 370 models?

>But John Gilmore seems to be saying: "IBM was selling thousands of machines
>a month so the only sensible thing was to move my product to a company whose
>market was so small they wouldn't confuse me with multiple models."

John is obviously NOT saying anything even remotely resembling that.
Show me where he said *anything* about *moving* his product from an
IBM machine.  He's merely saying that it wasn't worth the effort of
producing N versions of the APL system for N different IBM 370
models.

>The user doesn't demand a general solution.  He just doesn't like his
>application running 20 times slower than necessary.  The plain fact is that
>major savings can result from optimizing a few routines (strcpy, ldiv being
>good examples).

OK, show me a plain fact that indicates that, for anybody supplying
large volumes of some software package, you can get a 20X speed up by
tweaking the code for different models of a line of machines (not
"tweaking the code for machines with or without a given hardware
option", but "tweaking the code for different models of a line of
machines" - e.g., a VAX-11/780, VAX 8600, and VAX 8200, or a 370/168,
3033, 4381, and 3090).

>> These should both be done automatically by a good compiler....
>
>"should be" but not necessarily "is".  There are some *really pathetic*
>compilers out there.

OK, are the compilers in question "pathetic" or "good"?  They can't
be both....