[comp.arch] random notes from a John Mashey talk

henry@utzoo.uucp (Henry Spencer) (11/19/88)

John Mashey gave a talk here the other day.  Herewith some random notes
on it, of possible interest to others.  This is not an attempt to record
everything (or even everything significant) that he said.  All statements
are my recollection of what he said, i.e. if something's wrong, it might
be because it's his opinion and he's wrong, or because I didn't remember
it correctly.  "I don't make the news, I just try to report it."

"RISC:  there's less to be sorry about."

Reducing the instruction set is a means to an end, not an end in itself.
Their policy is that a new feature must either be structurally necessary
(e.g. Return From Exception) or must demonstrably improve performance by
1% on a significant number of real applications.  "Intuition is not allowed."
"Never argue about something you can simulate."  At midnight their entire
computer room suddenly becomes furiously active as the nightly batch of
simulations fires up.

Much criticism of tiny benchmarks.  Real programs are much better, but
even there one has to try to run a reasonable variety.  There is an effort
underway, jointly among several companies, to try to put together a set
of big, "real", freely-redistributable programs for useful benchmarking.

"If a vendor can't give you some reasonable backup for the viewgraphs,
don't believe the viewgraphs."

They try to collect "embarrassing" benchmarks, programs that do something
unusual, on the grounds that including these helps to keep the performance
numbers honest.

Statistics over a small number of programs can be very misleading.  Some
of Berkeley's commitment to register windows may have been accidental:
two of their major initial benchmarks, the C compiler and nroff, quite
by accident are unusually heavy register-save-and-restorers.  People
have noticed, with some surprise, that nroff does very few byte accesses
despite being a text formatter... but TeX is different.  Most programs
use char and int but make very little use of short... but some CAD stuff
is different, and so is TCP/IP.  Most programs are maybe 20% loads and
10% stores, but some are very different, e.g. diff does lots of loads
and few stores.  Compress is a good way to embarrass a machine that
relies very heavily on caching.

They are not competing in the low-end embedded-controller market.  "It's
hard for us to build anything less than 6 MIPS."

The 88000's floating-point hardware looks good in single precision but is
*spectacularly* slower in double precision.  (These numbers are very new;
the floating point basically didn't work until very recently.)  Basically,
Motorola sacrificed double precision to save silicon.  Okay for things
like graphics, not so okay for serious floating-point applications.

Making an IEEE double-precision floating-point add run in two cycles is
not easy.  The tricky part is exceptions, especially since the MIPSco CPU
has precise exceptions.  It turns out that a quick table lookup based on
some of the high bits of the exponents of the two operands can usually
assure you that no exception is possible.  If the table lookup says "maybe"
rather than "no", the FPU just stalls until it can be sure.  This almost
never happens in reality unless an exception is indeed imminent.
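
My own sketch of the flavor of the trick, not their actual table: coarsen
each operand's exponent to a few bits, and note that only the extreme bins
can yield overflow, underflow, or an invalid operand on an add (ignoring
the ever-present inexact flag).

    #include <stdint.h>
    #include <string.h>

    /* Sketch only -- not MIPS's real table.  Coarsen each operand's 11-bit
       exponent to its top 4 bits; an add can raise a trapping IEEE exception
       only if some operand sits in an extreme bin (denormal/tiny at one end,
       huge/Inf/NaN at the other). */
    static unsigned exp_bin(double d)
    {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);         /* IEEE representation   */
        return (unsigned)((bits >> 59) & 0xF);  /* top 4 exponent bits   */
    }

    static int add_may_trap(double x, double y)
    {
        unsigned bx = exp_bin(x), by = exp_bin(y);
        return bx == 0 || bx == 0xF || by == 0 || by == 0xF;  /* "maybe" */
    }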

Multiply and divide are asynchronous, with the CPU stalling if the result
is required before it's available.  In the current implementation this
stall usually does happen; the delay is too long to fill completely very often.
One side effect is that the divider doesn't check for a zero divisor, since
an explicit check for that can be the first thing in the delay slots!
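
In C terms the generated code looks roughly like this (my sketch; the
helper names are made up, and in real life the check is just instructions
scheduled into the divide's latency):

    #include <stdio.h>
    #include <stdlib.h>

    /* My sketch with made-up names.  In hardware, start_divide() would kick
       off the asynchronous divider and divide_result() would stall until the
       quotient is ready; here they are trivial stand-ins so the shape of the
       generated code is visible. */
    static long num, den;

    static void start_divide(long a, long b) { num = a; den = b; }
    static long divide_result(void)          { return num / den; }

    long checked_div(long a, long b)
    {
        start_divide(a, b);          /* divider starts grinding away       */
        if (b == 0) {                /* the "free" check: in real code these
                                        instructions sit in the divide's
                                        dead cycles                         */
            fprintf(stderr, "division by zero\n");
            abort();
        }
        return divide_result();      /* stall here only if not yet done    */
    }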

General experience is that filling one load or branch delay slot is easy
but filling two or more is much harder.
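
By way of illustration (mine, not his), filling a load delay slot just
means the compiler finds something independent to do in the cycle after
the load:

    /* My illustration: the loaded value is not used until one instruction
       after the load, so an unrelated add can occupy the delay slot. */
    int sum_plus_one(const int *p, int x)
    {
        int a = *p;         /* load issues here                        */
        x = x + 1;          /* independent add -- this fills the slot  */
        return a + x;       /* first use of 'a', one instruction later */
    }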

Roughly 50% of all function calls (dynamically, not statically) are calls
to leaf functions.  These need not save the return address out of the
register it gets parked in, and need not have a stack frame (all the
variables can just go in registers, barring funny situations), which makes
call and return of such functions one cycle each.  MIPSco passes the first
four parameters in registers, and has 13 registers which the caller must
save if he wants them, so local storage is usually all-register.
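
For instance (my example, not his), something like this is a leaf
function: the arguments arrive in registers, the locals fit in scratch
registers, and no stack frame ever gets built.

    /* Leaf function: it calls nothing, so it needs no stack frame and never
       saves the return address to memory. */
    int count_nonzero(const int *v, int n)
    {
        int i, count = 0;           /* locals live entirely in registers */
        for (i = 0; i < n; i++)
            if (v[i] != 0)
                count++;
        return count;
    }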

With 13 scratch registers, register save and restore is pretty uncommon,
averaging less than 2 registers per call.  SPARC, with its register
windows, actually does more saving and restoring to/from memory, because
its big fixed-size windows hurt it.

I asked him about languages with highly dynamic call patterns, like C++
and Smalltalk.  He said that for those languages, it would indeed be
mildly helpful to have a few register windows.

For a commercial machine, the ability to do unaligned accesses can be
significant.  (MIPSco does it via special instructions; their compilers
can be told to use those for all dubious accesses.)  Ugly things like
Cobol and PL/I tend to want them, and there exist major applications
in C and Fortran which cannot be fixed easily.  Vendors are much happier
about getting something running with a compiler option and then tuning
the result than about having to get alignment right from the start.
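
For comparison (my sketch), the portable workaround is a byte-wise copy;
the point of the MIPS compiler option is that the source doesn't have to
change at all.

    #include <stdint.h>
    #include <string.h>

    /* My sketch of the portable workaround: copy byte-by-byte instead of
       dereferencing a possibly-misaligned pointer.  The MIPS route described
       above instead has the compiler emit its special unaligned-access
       instructions for every suspect pointer. */
    uint32_t load32_any_alignment(const void *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof v);    /* safe regardless of p's alignment */
        return v;
    }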

It is hard to win the performance race with pure clock speed, because
you can't ship a part if you can't test it, and the test-equipment folks
are not going to build a new tester just for you.  So clock rates tend
to improve simultaneously across the industry.

He put up a slide listing lines of code in the various parts of their
various compilers.  Out of six compilers and assorted optimizers etc.,
half the total lines were for Ada.

Their compilers use the "signed" (overflow causes exception) instructions
for non-pointer arithmetic in C.  This constrains optimization in ANSI C,
but they already have to deal with the issue for Ada, so this isn't a big
software problem.
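
A small example of the constraint (mine, not his):

    /* My example: "scale * offset" is loop-invariant, and a compiler would
       like to hoist it out of the loop.  With trapping multiplies it may do
       so only if it knows the loop runs at least once (or that the product
       cannot overflow); otherwise the hoisted multiply could trap on a trip
       count of zero, where the original program performed no multiply at
       all.  With non-trapping hardware the speculative result would simply
       go unused. */
    void add_scaled(int *a, int n, int scale, int offset)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] += scale * offset;
    }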

MIPSco is mostly content to watch the AT&Sun/OSF/etc scuffle from the
sidelines.  "When elephants dance, the mice stay off the floor."

CPUs are almost free; what matters in cost is the support hardware.  Their
new big machine puts most of its money into things like fast I/O.  "Heavy
iron, with a CPU pasted on."

For most Unix utilities they don't bother with the heavier levels of
optimization, e.g. interprocedural.  However:  "We do have customers
who'll commit vile acts to get another 5-10%."

Optimizing the Unix kernel is a "thrilling adventure".  The new "volatile"
keyword in ANSI C helps, but figuring out where to use it is not trivial,
and implementing it in an existing optimizing compiler is lots of fun.
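
The classic case (my example) is a polling loop on a device register:

    /* My example of where "volatile" earns its keep in a kernel: without it,
       an optimizer may load the status register once and spin forever on the
       stale copy.  The address and bit are invented for the illustration. */
    #define UART_STATUS ((volatile unsigned long *)0xBFC00010) /* made up */
    #define TX_READY    0x1UL                                  /* made up */

    void wait_for_transmitter(void)
    {
        while ((*UART_STATUS & TX_READY) == 0)
            ;       /* volatile forces a real load on every iteration */
    }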

"Never debug a new CPU and a new silicon process simultaneously."  They
(or rather, their silicon-building partners) use essentially static-RAM
processes, and let the RAM market do the leading-edge process debugging.

The hard part of designing a CPU is not the instruction set, but the
exception handling.
-- 
Sendmail is a bug,             |     Henry Spencer at U of Toronto Zoology
not a feature.                 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu