[comp.sys.nsc.32k] "Unoptimizing" Dhrystone

davet@oakhill.UUCP (04/15/87)

In article <16151@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:

>In article <219@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
>>                               ...  Meanwhile, any advice on modifying
>>the Dhrystone for version 1.2 such that a global optimizer won't be
>>able to remove anything will be appreciated.
>
>One easy fix to Dhrystone is to package it as two (or more) separate c
>files which must be compiled separately, then linked together.  This
>will prevent global optimizers from looking at the entire program at one
>whack .....

Note that there is at least one company, Philon, Inc., which claims to
have an optimization pass that runs on the final linked object code and
would defeat the above tactics.  I myself may be working on such an
animal in a few months.

However, I still think it's vital to get the above changes put into
a new version of Dhrystone, because current results are becoming less
meaningful.

My info on Philon comes from the Uniforum 1986 Conference Proceedings,
pages 125 to 138.  I have no other knowledge of that company, its
products, or the efficacy of its compiler system.

 -- David Trissel  Motorola Semiconductor Inc., Austin, Texas
	       {siesmo,ihnp4}!ut-sally!im4u!oakhill!davet

sjc@mips.UUCP (04/16/87)

In article <864@oakhill.UUCP>, davet@oakhill.UUCP (Dave Trissel) writes:
> In article <16151@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:
> >In article <219@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
> >>                               ...  Meanwhile, any advice on modifying
> >>the Dhrystone for version 1.2 such that a global optimizer won't be
> >>able to remove anything will be appreciated.
> >
> >One easy fix to Dhrystone is to package it as two (or more) separate c
> >files which must be compiled separately, then linked together...
> 
> Note that there is at least one company, Philon, Inc. which claims to have
> an optimization pass which runs on the final linked object code which would
> defeat the above tactics.

In any benchmark, if you compute some function (e.g. a hash) of all of
the variables which are set but not otherwise used, and then print the
result, you not only prevent optimizers from eliminating the code which
sets those variables, but you also provide a check on whether the
emitted program is correct. It's human nature to concentrate on how
fast a benchmark runs and ignore whether it runs correctly, but the
latter is really a prerequisite, particularly when you're working with
a compiler or optimizer which is in a state of flux.

My plea to benchmark designers:  measure whatever time you want, but
afterward compare the results against known correct constant values,
and print a go/no-go message telling whether the results were correct
(or, for floating point, whether they were within an acceptable range).
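
A minimal sketch of both ideas in C.  The variable names and the
EXPECTED_HASH constant are invented for illustration; in practice you
fold in every variable the benchmark sets but never reads, and fill in
the constant from one trusted run:

#include <stdio.h>

#define EXPECTED_HASH 0L        /* hypothetical known-correct value */

long Int_Glob;                  /* stand-ins for set-but-unused variables */
long Arr_Glob[10];

long hash_results(void)
{
        long h = 0;
        int i;

        h = h * 31 + Int_Glob;          /* fold in each such variable */
        for (i = 0; i < 10; i++)
                h = h * 31 + Arr_Glob[i];
        return h;
}

int main(void)
{
        long h;

        /* ... timed benchmark code sets Int_Glob, Arr_Glob ... */

        h = hash_results();             /* keeps the stores live */
        printf("result hash = %ld\n", h);
        printf(h == EXPECTED_HASH ? "PASS\n" : "FAIL\n");
        return 0;
}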

Dhrystone has additional vulnerable spots: for example, the empty loops
used for subtracting out loop overhead.  But the loops don't have to
be empty; you could add some nontrivial computation (whose result is
later printed, compared against a constant, or used to compute a hash)
to both the empty loops and the real ones, so that the subtraction
removes both the effect of the looping and the effect of the computation.
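
Here is a minimal sketch of that scheme in C; LOOPS, Proc_0, and the
filler computation are invented stand-ins, not Dhrystone's own:

#include <stdio.h>
#include <time.h>

#define LOOPS 1000000L

unsigned long sink;             /* printed at the end so neither loop is dead */

void Proc_0(void) { }           /* stand-in for the real benchmark body */

int main(void)
{
        long i;
        clock_t t0, t1, t2;

        t0 = clock();
        for (i = 0; i < LOOPS; i++)     /* the "empty" loop, now nontrivial */
                sink += i ^ (sink << 1);
        t1 = clock();
        for (i = 0; i < LOOPS; i++) {   /* the real loop: same filler + body */
                sink += i ^ (sink << 1);
                Proc_0();
        }
        t2 = clock();

        /* the subtraction cancels the looping and the filler alike */
        printf("net time = %ld ticks, sink = %lu\n",
               (long)((t2 - t1) - (t1 - t0)), sink);
        return 0;
}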

-- 
...decwrl!mips!sjc						Steve Correll

earl@mips.UUCP (Earl Killian) (04/17/87)

>In article <219@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
>Meanwhile, any advice on modifying the Dhrystone for version 1.2 such
>that a global optimizer won't be able to remove anything will be
>appreciated.

I'd like to second Steve Correll's comments and go a bit farther.

I hate to praise whetstone, but if you're looking for ways to improve
dhrystone, then you ought to start by looking at whetstone.  The only
thing dhrystone appears to have borrowed from whetstone so far is the
name, and that is unfortunate.  whetstone:
  (1) does output, so that the program isn't simply thrown away;
  (2) does input, so that the entire computation cannot be done at
      compile time;
  (3) prints results, so that the correctness of the run can be
      verified;
  (4) uses a technique for subtracting out loop overhead that does not
      involve timing an empty loop that is likely to be deleted by a
      compiler;
  (5) ensures every computation's value is used later, so compilers
      don't simply delete the computation (often in conjunction with
      (3) above);
  (6) makes all computations in loops depend on variables that change
      on every loop iteration, thus preventing hoisting (see the
      sketch below).
The only optimization I know of that begins to trash whetstone is
inlining subroutines (you can also apply angle-addition formulas to
save a call or two to SIN/COS if you really want to tailor your
compiler to the benchmark).
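
To illustrate point (6), a minimal sketch in C (all names and constants
are invented): in the first loop the multiplication is loop-invariant,
so a good compiler may hoist it out; in the second it depends on the
loop index and must actually be repeated.

#include <stdio.h>

int main(void)
{
        double x = 1.0001, y = 0.9999, s = 0.0;
        int i;

        for (i = 0; i < 1000; i++)
                s += x * y;             /* loop-invariant: hoistable */

        for (i = 0; i < 1000; i++)
                s += x * (y + i);       /* changes each iteration: not hoistable */

        printf("s = %g\n", s);          /* and the result is used, per (5) */
        return 0;
}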

Dhrystone does none of these things.  It represents a step backward in
the art of benchmarking.  If you want to improve dhrystone, bring it
up to the standards of the twenty-year-old whetstone.

I started off saying I hated to praise whetstone.  The reason is that
I don't think it is particularly representative.  Neither is
dhrystone.  Both claim to be synthetic benchmarks created by looking
at statistics of source-language constructs.  This sounds reasonable
on the surface, but for some reason the approach seems to generate
very different low-level statistics from those of real programs.

For example, dhrystone averages 15 instructions per procedure call on
a VAX.  The call-intensive programs I've studied average 50
instructions per call (30 was the lowest I saw for a real program).
For instance, I know of a Pascal compiler that produces 8200 dhrystones
on a VAX/8600 when you use CALLS for the procedure calls and 14000
when you use a faster procedure call protocol.  That is 70% more
dhrystones!  And yet the compiler itself only speeds up by 20% when
you use the better procedure call protocol.

Another example: this forum has already noted the importance of strcpy
to the C version of dhrystone, and questioned whether it is as
important in real programs.

Another example: whetstone is half a test of your floating-point math
library, not of the compiler or hardware.  Some real floating-point
programs (e.g. spice) do spend a lot of time in SQRT, LOG, EXP, etc.,
but not as much as whetstone.  And other real programs spend almost no
time there.

On the other hand, perhaps it is a feature that these benchmarks do
make people produce good libraries.  :-)

Another problem with these benchmarks is that they're small.  They fit
trivially into caches, for example.  For real programs, caches serve
not to make your program fit, but to address the impedance mismatch
between main memory and the CPU.  You still need memory bandwidth
proportional to your processing power (although you wouldn't know it
from these small benchmarks), but raw memory bandwidth isn't well
matched to the appetite of a processor.  Thus caches.

Writing a big benchmark is not easy, of course, nor is porting it.
On the other hand, ask yourself why we're trying to build faster and
faster computers.  The reason is so that we can run bigger and bigger
programs.  Small benchmarks are becoming more and more misleading.
And one benchmark won't characterize all your applications either.
At MIPS, we use the following programs for benchmarking and
architectural study:
	ccom		-- C front-end
	uopt		-- global optimizer
	as1		-- assembler second pass
	nroff		-- text formatter
	TeX		-- another formatter with very different statistics
	compress	-- data compression
	espresso	-- PLA reduction
	hspice/spice2g6/spice3a7	-- circuit simulation
	timber wolf	-- routing
	doduc		-- Monte Carlo nuclear reactor simulation
	linpack		-- matrix reduction
(There are more that we want to study, but don't have yet.)  All of
these have very different statistics.  None fit well in 8Kb caches.
Some fit in 64Kb caches; others still do not.  Some use floating point
heavily; some don't.  Some want large cache blocks; others want small.
Some need very large virtual to physical translation caches; others
need only a few entries.  Linpack can use vector ops; the others
probably cannot.  Etc. etc.

reiter@endor.harvard.edu (Ehud Reiter) (04/21/87)

In article <312@gumby.UUCP> earl@mips.UUCP (Earl Killian) writes:
>At MIPS, we use the following programs for benchmarking and
>architectural study:
>	<list of 10 or so programs>
>All of these have very different statistics.

On the subject of benchmarks, I've noticed that many people who do use
multiple programs for benchmarking tend to average the results together.
Even MIPS, whose Performance Brief is the best of its kind that I've seen,
does this to some degree (e.g. their "UNIX Total", in comparing an M/500
to a VAX-11/780, averages together relative performance figures ranging
from 4.7 to 6.7 into an average figure of 5.6).  Since averaging can be
a statistically dubious process (especially if the programs being averaged
are not weighted by some kind of "frequency of use" parameter), perhaps it
would be better to report the full range instead of an average (so, in the
above example, the M/500 would be rated as 4.7-6.7, not 5.6).  Reporting the
range would also have the advantage of giving people an idea of the variances
involved.
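
As a minimal sketch of such reporting in C (the ratios below are
placeholders for illustration, not measurements):

#include <stdio.h>

int main(void)
{
        double r[] = { 4.7, 5.0, 5.6, 6.1, 6.7 };      /* illustrative */
        int i, n = sizeof r / sizeof r[0];
        double lo = r[0], hi = r[0], sum = 0.0;

        for (i = 0; i < n; i++) {
                if (r[i] < lo) lo = r[i];
                if (r[i] > hi) hi = r[i];
                sum += r[i];
        }
        printf("relative performance: %.1f-%.1f (mean %.1f)\n",
               lo, hi, sum / n);
        return 0;
}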

So, if the community does shift to using multiple programs for benchmarking
(which I strongly approve of), we should consider reporting "relative
performance" not as a single average number, but rather as a range.

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)

zrm@eddie.MIT.EDU (Zigurd R. Mednieks) (04/21/87)

Just what is optimization? Is Green Hills' compiler optimizing when it
ignores "register" storage class keywords and uses its own intelligent
register allocation? Are RISC compiler postpass optimizers within the
rules of the Dhrystone game?

While it is pretty easy to say that something that radically alters
the translation of C into machine code is optimizing, what separates
clever (or, in the case of RISC, almost required) hacks in code
generation from optimization?

-Zigurd

eugene@pioneer.arpa (Eugene Miya N.) (04/21/87)

In article <1706@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>In article <312@gumby.UUCP> earl@mips.UUCP (Earl Killian) writes:
>>At MIPS, we use the following programs for benchmarking and
>>architectural study:
>>	<list of 10 or so programs>
>>All of these have very different statistics.
>
> discussion about statistics.
>
>So, if the community does shift to using multiple programs for benchmarking
>(which I strongly approve of), we should consider reporting "relative
>performance" not as a single average number, but rather as a range.
>
>					Ehud Reiter
>					reiter@harvard	(ARPA,BITNET,UUCP)

Rather than prematurely throw statistical tools at the problem, why not
pick up a book on experiment design and measurement?  

Rather than use ranges or single numbers, consider other ways to
convey information.  First, you have to verify things: you can use a
has-or-has-not characterization.  Graphics are really popular now.
Consider the work done on Faces to display multidimensional data at
Los Alamos Natl. Lab (where I'm going next week).

Consider other analyses of dynamics: Muybridge's studies of motion.
Consider that hearts are characterized by a rate, but an EKG is a
better characterization of heart rhythm.  Cardiologists can get away
from describing hearts solely in terms of rate.  [Don't argue that
heart rate is a good overall characterizer; it isn't.]  We have to get
away from our infatuation with scalars; it will also stump the
marketing types.

I like Earl's elaboration on whetstones.  It reminds me of skiing two
weeks ago: there was no snow at the base of North Star, but higher up
the mountain a snow gauge registered 5.5 feet of snow in a flat,
unskied, tree-sheltered area.  So what's the coverage at this ski
resort?

No single figures of merit.  (A tee shirt I am making for our team).
I will try to have extras at SIGMETRICS 87 and PERFORMANCE 87.
Paper coming soon [I'm still late].  Stop some of this B.S.

From the Rock of Ages Home for Retired Hackers:

--eugene miya

  Not Ready for Prime Time Performer

  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

earl@mips.UUCP (Earl Killian) (04/23/87)

Ehud Reiter and Eugene Miya both make good points.  I recently saw a
paper on the VAX 8700 from DEC in which they presented their benchmark
results in a variety of ways, including something like:

TOP CAVEAT: I just typed in imaginary data points for illustration
-- THESE ARE NOT REAL RESULTS.

	4.0	ID
	4.5	IICFFDD
	5.0	IIICFFFDDD
	5.5	IIICCCFFFFDDDD
	6.0	IIIIICCCFFFFFDDD
	6.5	IIIIICCFFDD
	7.0	IIIICFFDD
	7.5	IIICFD
	8.0	I

BOTTOM CAVEAT: I just typed in imaginary data points for illustration
-- THESE ARE NOT REAL RESULTS.

They had something like 120 benchmarks, and presented them in the
above form, where "I" represented an integer benchmark, "C" Cobol,
"F" single-precision floating point, "D" double-precision floating
point, and so on.

I liked this presentation.  It really gives you a feel for how wide
the performance variation can be, even within members of the same
family.
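
Here is a sketch in C of how such a display might be generated; the
data points, bin width, and category codes are all invented for
illustration (a real version would also sort each row's letters by
category):

#include <stdio.h>

struct result { double perf; char cat; };       /* cat: 'I', 'C', 'F', 'D' */

int main(void)
{
        struct result r[] = {
                { 4.6, 'I' }, { 4.9, 'F' }, { 5.1, 'I' },
                { 5.4, 'D' }, { 5.6, 'C' }, { 6.2, 'F' },
        };
        int n = sizeof r / sizeof r[0], i, b;

        for (b = 8; b <= 16; b++) {             /* bins 4.0, 4.5, ..., 8.0 */
                printf("\t%.1f\t", b / 2.0);
                for (i = 0; i < n; i++)         /* one letter per benchmark */
                        if (r[i].perf >= b / 2.0 - 0.25 &&
                            r[i].perf <  b / 2.0 + 0.25)
                                putchar(r[i].cat);
                putchar('\n');
        }
        return 0;
}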

Perhaps someone else can post a reference to this paper.

earl@mips.UUCP (Earl Killian) (04/23/87)

David Hough pointed out to me that most of the nice benchmarking
features I attributed to whetstone are not generic whetstone.  The
original version was written in Algol, and lacked the features I
described.

The one I found lying around here runs the whole program twice, once
with the loop multiplier set to an input parameter, and once with that
parameter plus ten.  The difference between the times for the two runs
is a measure of the time for 10 loops, with various overhead canceled
(for example, the timing calls and the result printout outside the
loops).
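
A minimal sketch in C of that two-run scheme; whet() and its
placeholder workload are invented stand-ins for the whole benchmark:

#include <stdio.h>
#include <time.h>

void whet(long mult)            /* runs the whole benchmark mult times */
{
        long i;
        double x = 0.0;

        for (i = 0; i < mult * 100000L; i++)
                x += 1.0 / (double)(i + 1);     /* placeholder work */
        printf("x = %g\n", x);  /* fixed-cost printout, present in both runs */
}

int main(void)
{
        long n = 10;            /* the input parameter */
        clock_t t0, t1, t2, t3;

        t0 = clock(); whet(n);      t1 = clock();
        t2 = clock(); whet(n + 10); t3 = clock();

        /* timing calls and printout appear in both runs and cancel out */
        printf("time for 10 loops = %ld ticks\n",
               (long)((t3 - t2) - (t1 - t0)));
        return 0;
}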

I just sent away for and received a version from Argonne's netlib
service, and found that it does not do this.  There are probably a lot
of different whetstones out there.

I have heard that NBS is setting up a benchmark service.  Perhaps
they'll straighten up some of this.  Anyone know any more about it?

eugene@pioneer.arpa (Eugene Miya N.) (04/23/87)

Yes, it's called NBSLIB.  It's being tested.
I won't post the address, but I'll send it to Earl.  Just like netlib.
The problem is that some of the programs are out of date (like the
Livermore loops).  Frank McMahon (author of the LOOPS program, a.k.a.
the MFLOPS program [the original use of MFLOPS, BTW, in the 1950s])
has to update the NBS's 14-loop kernel to the more modern 24-loop
kernel [significantly better, in my opinion].

Faces:
Herman Chernoff, "The Use of Faces to Represent Points in k-Dimensional
Space Graphically," Journal of the American Statistical Association,
68, June 1973.  May I suggest the use of a Cray icon (the Cray-1),
because it has more dimensionality than most boxes out there.  The
Cray-2 does not have enough (it lost the seats).  This is why we need a
graphical (raster) data standard, to transfer pictures of data better.
The bar graph was not as good as I would have hoped.  Check out Tufte's
book on data (then watch one of those Apple Mac ads and see what Apple
does wrong).

Bye.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene