[comp.arch] Neal Nelson Benchmarks

gerry@zds-ux.UUCP (Gerry Gleason) (02/23/90)

I have just been going through a bunch of marketing hype for Neal
Nelson.  He claims that his "Business Benchmark" measures how
well machines perform on "tasks like word processing, spread sheets,
database management, accounting, programming and CAD," but I have
never seen anything that backs this up with analysis or real data.

Also in the package are quite a few reprints that prominently
feature these benchmarks, including several saying RISC is not
much of a win based on his benchmarks.  (Federal Computer Week,
"Tests Challenge Old RISC, CISC Notions; EE Times, "CISC beats RISC
in test"; Computerworld, "Unearthing RISC worms")  The EE Times
article has the results for his Test 5 (Short integer math) showing
the Sun-3 to be ~10% faster than a Sun-4, which leads me to believe
that the benchmark is bogus.  I thought EE Times was a pretty good
publication, but the article does not even ask the question of what
the benchmark is really measuring.

I was hoping that someone has already done some analysis of these
benchmarks, and can confirm my suspicion that these tests not only
are bogus, but don't even measure what they claim to.  Unfortunately,
at least some important fraction of the market uses these benchmarks
to evaluate products, so many of us must apply them to our products
even though we suspect them of being misleading.  If they really are
bogus, what can be done to publicly discredit them, so further harm
is not done?

Gerry Gleason

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/23/90)

In article <196@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
| I have just been going through a bunch of marketing hype for Neal
| Nelson.  He claims that his "Business Benchmark" measures how
| well machines perform on "tasks like word processing, spread sheets,
| database management, accounting, programming and CAD," but I have
| never seen anything that backs this up with analysis or real data.

  I've been doing benchmarks for years (about 25) and I will say that,
used carefully, I am pretty happy with the NN suite. I have run extensive
tests and live loads on machines he has tested, and my results are close
to his.

  The secret to any benchmark is using it to predict the future, which
depends both on how well the benchmarks represent your load and on how
well you *think* they represent it. I usually suggest that NN be used
to select a few final machines for additional testing, which is about
all I claim for my own suite.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

steves@conan.SanDiego.NCR.COM (Steve Schlesinger) (02/24/90)

In article <196@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>I have just been going through a bunch of marketing hype for Neal
>Nelson.  He claims that his "Business Benchmark" measures how
>well machines perform on "tasks like word processing, spread sheets,
>database management, accounting, programming and CAD," but I have
>never seen anything that backs this up with analysis or real data.
>
>  [ paragraph deleted ]
>
>I was hoping that someone has already done some analysis of these
>benchmarks, and can confirm my suspicion that these test not only
>are bogus, but don't even measure what they claim to.  Unfortunately,
>at least some important fraction of the market uses these benchmarks
>to evaluate products, so many of us must apply them to our products
>even though we suspect them of being misleading.  If they really are
>bogus, what can be done to publicly discredit them, so further harm
>is not done?
>
>Gerry Gleason

***************************************************************
	The following is only my opinion 
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 

My company is a licensee of the Neal Nelson Benchmarks.  I have
been involved in benchmarking and performance evaluation for many
years.  I do not have a very high opinion of the NN Benchmarks.

NN is covered by a very strict licensing agreement.  The results of
running the benchmarks cannot be published, i.e., a licensee cannot
publicly reveal the individual or composite results of the benchmarks.
You cannot say my system ran 70 gazillion Dhrystones, 80 gathousand
Linpacks and 22 on the NN suite.  (Heck, another division of my company
was also a licensee and they couldn't even tell us their raw results!!
Oh yes, the license agreement only permits the source to be on
a single machine at a single site.)

All you can say is that your system was Y times the performance of
Fasta Computer's Model A on the suite.  How do you know this, if Fasta
Computer didn't publish their numbers?  NN tells you this as part of
your license agreement.  You report your numbers back to them
and they give you the relative numbers of a specified number of other
systems.  If you want more data, you pay NN more $$.

On the technical side, the benchmarks are **VERY** simple.  I cannot
reveal the details under the license agreement.  The only good thing
is that you can run multiple copies of the suite in parallel fairly
easily.

The computational benchmarks are of two types: arithmetic for different
data types, and memory moves.  The arithmetic ones overemphasize the
frequency of multiply and divide relative to plus and minus in real programs.
This explains the results in the articles mentioned, where a CISC machine
(Sun-3 with 68020 and 68881 or other FP silicon) "beat" a RISC machine
(Sun-4 with SPARC, which has no integer multiply/divide instructions).
The memory move tests will show how the cache/memory perform FOR ONE
SPECIFIC TYPE OF MOVE.
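
To make this concrete without violating the license, here is a made-up
fragment of my own -- NOT NN code -- in the general flavor of a "short
integer math" test.  Nearly all of its time goes into the multiply
itself, so a CPU that lacks an integer multiply instruction (early
SPARCs trap to a software routine) looks bad regardless of how it runs
real programs:

    /* Hypothetical sketch, not the licensed NN source. */
    #include <stdio.h>
    #include <time.h>

    int main()
    {
        long i, n = 1000000L;
        short a = 7, b = 3;
        clock_t t0 = clock();

        for (i = 0; i < n; i++)
            a = (short)(a * b);      /* multiply dominates the loop */

        printf("result %d, %.2f sec\n", a,
               (double)(clock() - t0) / CLOCKS_PER_SEC);
        return 0;
    }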

The disk I/O tests are not as idiosyncratic as the processor memory tests,
but they are **VERY** simple.

What really bothered me about the tests was the "C" coding style.  Yes,
I know this doesn't necessarily mean the benchmarks are not meaningful.
The code looked like it had originally been written in Cobol, then
translated line for line into "C".  The program structure (what there
was of it) didn't look anything like any "C" program anywhere.
It said to me the author had little experience programming in "C".

One effect of the coding style was that it made the code difficult to
optimize.  This can be seen two ways.  One is that it makes the suite more
accurate, since it removes the variable of compiler optimization from system
comparisons (a "notorious" problem with Dhrystone - especially 1.1).  The
other is that it makes the suite less accurate, since the compiler's ability
to optimize real code is an important attribute of a system.

I tend to the second view.  But since NN promotes his suite as a measure of
system performance, it seems to me the code should test the optimizer.
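
For what it's worth, the Dhrystone failure mode looks like this (again
my own sketch, nothing from the NN source):

    #include <stdio.h>

    int main()
    {
        long i, sum = 0;

        /* A good optimizer deletes this whole loop if "sum" is never
         * used afterward -- the notorious Dhrystone 1.1 problem. */
        for (i = 0; i < 1000000L; i++)
            sum += i;

        /* Using the result forces the work to happen.  NN's code
         * sidesteps the issue by being too convoluted to optimize,
         * which means the optimizer is never tested at all. */
        printf("%ld\n", sum);
        return 0;
    }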

My recommendation is not to pay much attention to anything you see published
about results on the NN suite.  I place it slightly below Dhrystone in
overall usability.  You can get similar results by spending several hours
enhancing the public-domain Byte benchmarks.

I look forward to a future suite from SPEC that will include system I/O tests.
If these benchmarks are anything like SPEC Release 1.0, the NN suite will
be technically obsolete.

I admire NN's ability as a business person.  He saw a need and filled it.
Most technically naive system buyers don't understand the performance data
that floats around.  Observe the constant misunderstandings on comp.arch
about what a "mip" is.  We techies seldom agree on much.  NN created
a simple set of tests that could be explained to Joe Naiveuser.  Joe N.
could understand the marketing hype of the test and it gave him confidence
in his computer purchase.

Steve Schlesinger

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 
	The preceding is only my opinion
	It does not reflect the opinion of my employer or
	any other person or organization.
***************************************************************

wsd@cs.brown.edu (Wm. Scott `Spot' Draves) (02/24/90)

In article <196@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:

   I have just been going through a bunch of marketing hype for Neal
...
   The EE Times
   article has the results for his Test 5 (Short integer math) showing
   the Sun-3 to be ~10% faster than a Sun-4, which leads me to believe
   that the benchmark is bogus.
...
   Gerry Gleason


This may very well be accurate due to the SPARC's lack of integer
multiply and divide instructions.

I would, however, seriously question a benchmark that claims to
measure the performance of a certain class of applications
(business/personal productivity in this case) when one of the tests is
a very low-level, MIPS sort of rating.

Scott Draves			Space... The Final Frontier
wsd@cs.brown.edu
uunet!brunix!wsd
Box 2555 Brown U Prov RI 02912

amir@smsc.sony.com (Amir ) (02/24/90)

In article <196@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>I have just been going through a bunch of marketing hype for Neal
>Nelson.  He claims that his "Business Benchmark" measures how
>well machines perform on "tasks like word processing, spread sheets,
>database management, accounting, programming and CAD," but I have
>never seen anything that backs this up with analysis or real data.
 
Quite true.  I have seen a lot of bad benchmarks in my time but this is
one of the worst.

>Also in the package are quite a few reprints that prominently
>feature these benchmarks, including several saying RISC is not
>much of a win based on his benchmarks.  (Federal Computer Week,
>"Tests Challenge Old RISC, CISC Notions; EE Times, "CISC beats RISC
>in test"; Computerworld, "Unearthing RISC worms")  The EE Times
>article has the results for his Test 5 (Short integer math) showing
>the Sun-3 to be ~10% faster than a Sun-4, which leads me to believe
>that the benchmark is bogus.  I thought EE Times was a pretty good
>publication, but the article does not even ask the question of what
>the benchmark is really measuring.
 
I was quite surprised too but it sort of made sense.  There were a lot
of people who were trying hard for a reason to discredit RISC.

>I was hoping that someone has already done some analysis of these
>benchmarks, and can confirm my suspicion that these test not only
>are bogus, but don't even measure what they claim to.  Unfortunately,
>at least some important fraction of the market uses these benchmarks
>to evaluate products, so many of us must apply them to our products
>even though we suspect them of being misleading.  If they really are
>bogus, what can be done to publicly discredit them, so further harm
>is not done?
 
It is really bogus!  I had the opportunity to meet Mr. Nelson before he
went public with his benchmark.  The story that he gave goes as follows:

He had written/ported an accounting package to an old Unix box (don't remember
which now -- back in 82-84).  Then his client bought what he thought was
a faster machine, and to his surprise, Mr. Nelson's package actually ran
slower.  So he started to analyze the problem, and this led to his
infamous benchmark.

As for the contents of the package, I had pretty strong disagreements with
him.  He takes a simple benchmark that measures something very small (e.g.
the speed of the add operation) and runs multiple copies to simulate
"multi-user" response.  Then he does the same thing for another simple
operation (e.g. multiply) and so on.  So almost all of his arithmetic tests
show the same linear slowdown (unless you run out of memory).  The test is
showing the context-switch overhead, not add/multiply times...
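
In outline -- my own reconstruction, not his source -- each arithmetic
test amounts to something like this:

    /* Fork N copies of a trivial CPU-bound loop and run them
     * concurrently.  Time it with time(1) for N = 1, 2, 4, ... and
     * elapsed time grows almost exactly linearly: you are watching
     * the scheduler, not the add instruction. */
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int i, n = (argc > 1) ? atoi(argv[1]) : 4;  /* copies to run */
        long j, sum;

        for (i = 0; i < n; i++) {
            if (fork() == 0) {                      /* child: the "test" */
                for (sum = 0, j = 0; j < 5000000L; j++)
                    sum += j;
                _exit(sum == 0);                    /* keep sum live */
            }
        }
        while (wait(NULL) > 0)                      /* reap all children */
            ;
        return 0;
    }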

Then there is the "sync" test.  Apparently his package used to do a lot
of unneeded syncs, so he tests how fast you can do syncs.  First one copy,
then 2, then...  Well, you get the idea.  Even he agreed that this was
stupid (as sync on most systems returns immediately, before the data is
written to disk anyway).  But last time I looked, it was still in there.
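
The whole test is essentially this (again my paraphrase, not the
licensed code):

    /* On most Unixes sync() just schedules the dirty buffers for
     * writing and returns immediately, so this mostly measures
     * system-call overhead, not the disk. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main()
    {
        int i;
        time_t t0 = time(NULL);

        for (i = 0; i < 1000; i++)
            sync();          /* usually returns before the I/O completes */

        printf("%ld sec for 1000 syncs\n", (long)(time(NULL) - t0));
        return 0;
    }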

Also, once you look at the source of the benchmark, you'll realize that he
is not much of a programmer either.  It must have more "goto"s and labels
than any other C program that I have ever seen.  It looks more like a
decompilation of an assembler program than anything else.

There are numerous other flaws in there that I won't go into now.

>Gerry Gleason

To be fair, the benchmark, like any other, does generate a set of "data
points" that can be useful.  What angers me is that the results are
extrapolated to mean the performance of the system for "business"
applications.  Since there are no other programs that claim to do this,
the benchmark has become fairly popular....
-- 
Amir H. Majidimehr
Operating Systems Group
Sony Microsystems
amir@smsc.sony.com | ...!{uunet,mips}!sonyusa!amir

steves@ivory.SanDiego.NCR.COM (Steve Schlesinger x2150) (02/28/90)

I posted some comments on this subject last week.

In article <196@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>I have just been going through a bunch of marketing hype for Neal
>Nelson.  He claims that his "Business Benchmark" measures how
>well machines perform on "tasks like word processing, spread sheets,
>database management, accounting, programming and CAD," but I have

*****************************************************************
	The following is my opinion
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 

This is true in the sense that the named applications do add, subtract,
multiply, divide, move memory, compare, and do disk I/O, and the NN
benchmarks also do these operations.

Steve Schlesinger

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 
	The preceding is my opinion and does not reflect the 
	opinion of my employer or any other organization.
*****************************************************************

>never seen anything that backs this up with analysis or real data.
>
>Also in the package are quite a few reprints that prominently
>feature these benchmarks, including several saying RISC is not
>much of a win based on his benchmarks.  (Federal Computer Week,
>"Tests Challenge Old RISC, CISC Notions; EE Times, "CISC beats RISC
>in test"; Computerworld, "Unearthing RISC worms")  The EE Times
>article has the results for his Test 5 (Short integer math) showing
>the Sun-3 to be ~10% faster than a Sun-4, which leads me to believe
>that the benchmark is bogus.  I thought EE Times was a pretty good
>publication, but the article does not even ask the question of what
>the benchmark is really measuring.
>
>I was hoping that someone has already done some analysis of these
>benchmarks, and can confirm my suspicion that these test not only
>are bogus, but don't even measure what they claim to.  Unfortunately,
>at least some important fraction of the market uses these benchmarks
>to evaluate products, so many of us must apply them to our products
>even though we suspect them of being misleading.  If they really are
>bogus, what can be done to publicly discredit them, so further harm
>is not done?
>
>Gerry Gleason


*****************************************************************
	No disclaimer on this part
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 

A while ago, NCR posted the System Characterization Benchmark
on "net.sources."  It measures similar things to the NN benchmark,
but is better and is free.  As an added bonus, you can read the
source and decide for yourself what the results really mean. 
It is not perfect (useful comments will be forwarded to the author,
flames to /dev/null).

My advice to anyone looking for benchmarks is first to use the
applications you currently run, then use SPEC data for computational
results (with your own weightings) and the SCB.

Steve
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
steve schlesinger	steve.schlesinger@sandiego.ncr.com
619-485-2150		NCR - 4010, 16550 W Bernardo Dr, San Diego, CA 92127
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/01/90)

In article <231@iss-rb.SanDiego.NCR.COM> steves@ivory.SanDiego.NCR.COM (Steve Schlesinger x2150) writes:
>A while ago, NCR posted the System Characterization Benchmark
>on "net.sources."

I wonder if you could post the location of the benchmark on well known
archive sites such as uunet or wherever?

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)604-6117       

ps@fps.com (Patricia Shanahan) (03/01/90)

In article <231@iss-rb.SanDiego.NCR.COM> steves@ivory.SanDiego.NCR.COM (Steve Schlesinger x2150) writes:
...
>
>My advice to anyone looking for benchmarks is first to use the
>applications you currently run, then use SPEC data for computational
>results (with your own weightings) and the SCB.
>
>Steve
>::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>steve schlesinger	steve.schlesinger@sandiego.ncr.com
>619-485-2150		NCR - 4010, 16550 W Bernardo Dr, San Diego, CA 92127
>::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


I would reverse the order here. Use "standard" benchmarks to decide which
systems are likely to be useful enough to you to be worth really measuring,
then use multiple jobs of the types you are really going to run to measure
the value of each of those systems to you.

The main value I see in standard benchmarks is that you can get numbers for
them a lot quicker and cheaper than with a full-scale benchmarking exercise
using actual jobs.

I do think the SPEC approach is likely to be more robust in the face of
architecture changes than the more abstract benchmark approaches. The real
problem with abstract benchmarks is that future architecture changes can
make whatever was changed in doing the abstraction critically important.
For example, the Whetstone benchmark was designed to measure, among other
things, the performance of array references that were observed to be a
significant component of scientific computing.  At the time it was not
obvious that vector length mattered, so the Whetstone only tests arrays
of length 4.  By definition an abstract benchmark differs from the real
jobs that it models in some ways. Those differences, for a well-designed
benchmark, will not be significant for CURRENT architecture and compiler
technology. On the other hand, if you select real jobs, they have a better
chance of looking like real jobs in aspects that are unimportant to current
systems but that may be critical to performance on future systems.
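
To illustrate the Whetstone point, its array work boils down to
something like this paraphrase of the published algorithm.  Every
array it touches has exactly 4 elements, so a machine whose
performance depends on long vectors gets no credit for them:

    #include <stdio.h>

    #define T 0.499975       /* the constant Whetstone uses */

    int main()
    {
        double e[4] = { 1.0, -1.0, -1.0, -1.0 };
        long i;

        for (i = 0; i < 1000000L; i++) {
            e[0] = ( e[0] + e[1] + e[2] - e[3]) * T;
            e[1] = ( e[0] + e[1] - e[2] + e[3]) * T;
            e[2] = ( e[0] - e[1] + e[2] + e[3]) * T;
            e[3] = (-e[0] + e[1] + e[2] + e[3]) * T;
        }
        printf("%f\n", e[3]);    /* keep the result live */
        return 0;
    }
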
--
	Patricia Shanahan
	ps@fps.com
        uucp : {decvax!ucbvax || ihnp4 || philabs}!ucsd!celerity!ps
	phone: (619) 271-9940

kaul@icarus.eng.ohio-state.edu (Rich Kaul) (03/01/90)

In article <43902@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
   In article <231@iss-rb.SanDiego.NCR.COM> steves@ivory.SanDiego.NCR.COM (Steve Schlesinger x2150) writes:
   >A while ago, NCR posted the System Characterization Benchmark
   >on "net.sources."

   I wonder if you could post the location of the benchmark on well known
   archive sites such as uunet or wherever?

You can find it on cheops.cis.ohio-state.edu [128.146.8.62] in
pub/net.sources/ncrscb.[1-4].Z.

-rich
-=-
Rich Kaul                         | "Horse sense is what keeps horses from
kaul@icarus.eng.ohio-state.edu    |  betting on what people will do."
or ...!osu-cis!kaul		  |  			-Damon  Runyon

pb@idca.tds.PHILIPS.nl (P. Brouwer) (03/01/90)

In article <2557@ncr-sd.SanDiego.NCR.COM> steves@conan.SanDiego.NCR.COM (Steve Schlesinger) writes:
>In article <196@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>>I have just been going through a bunch of marketing hype for Neal
>>Nelson.  He claims that his "Business Benchmark" measures how
>>  [ paragraph deleted ]
>>
>>Gerry Gleason
>
>***************************************************************
>	The following is only my opinion 
>* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 

This is valid for me too!!!!!!!!

>
> I do not have a very high opinion of the NN Benchmarks.

I think so too, and to add another argument to the ones mentioned in
the previous postings:
All times in the benchmark are measured with a 1-second resolution.
This means that fast tests that take only a few seconds will have poor
accuracy.  So when you see comparisons between machines, take this into
account.  For instance, test X takes 5 seconds on machine A and 7 on
machine B.
The difference is (7 - 5 )    / 7    = 28.6 %
This might be     (7.99 - 5 ) / 7.99 = 37.4 % or
                  (7 - 5.99 ) / 7    = 14.4 %

Draw your own conclusions.
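
A trivial C fragment to check these bounds, assuming a reported time
of T seconds means a true time anywhere in [T, T+1):

    #include <stdio.h>

    int main()
    {
        double a = 5.0, b = 7.0;   /* reported times, machines A and B */

        printf("nominal: %4.1f %%\n", 100.0 * (b - a) / b);
        printf("maximum: %4.1f %%\n", 100.0 * ((b + 0.99) - a) / (b + 0.99));
        printf("minimum: %4.1f %%\n", 100.0 * (b - (a + 0.99)) / b);
        return 0;
    }
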
>
>
>* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 
>	The preceding is only my opinion
>	It does not reflect the opinion of my employer or
>	any other person or organization.
>***************************************************************

Again this is valid for me as well.



-- 
Peter Brouwer,                # Philips Telecommunications and Data Systems,
NET  : pb@idca.tds.philips.nl # Department SSP-P9000 Building V2,
UUCP : ....!mcvax!philapd!pb  # P.O.Box 245, 7300AE Apeldoorn, The Netherlands.
PHONE:ext [+31] [0]55 432523, #