[comp.benchmarks] TPC-B - is this really progress?

jonathan@cs.pitt.edu (Jonathan Eunice) (03/26/91)

I notice a number of vendors, such as Sun and (given a recent
comp.arch posting) DG, concentrating on TPC-B benchmarks, rather than
TPC-A.  With all due respect to the TPC folks, who do appear to be
trying to make the world safer for benchmarking, isn't this the same
deal we had before with TP1 (the TPC-B precursor) benchmarks?  The
deal whereby vendors get to quote absurdly high #s that don't reflect
real life?  Oughtn't we be concentrating on the more-fully-scaled
TPC-A runs, and strongly encouraging our vendors to do likewise?

Also, I notice that Sun used two machines to do their TPC-B runs --
either an SS2 and an SS1+ (then quoting this in their spec sheets as the
SS2 figure), or an SS490 and an SS2 (result quoted as the SS490 value).  I
don't know the precise TPC rules, but it seems a little un-kosher to
me.  (Kind of like Motorola when it slipped through the first draft of
the SPEC reporting rules, forcing the quick adoption of SPECthru
numbers.)  Am I wrong?  Is it legit?  Are other vendors doing it?  (I
know HP and IBM haven't.)

renglish@cello.hpl.hp.com (Bob English) (03/28/91)

jonathan@cs.pitt.edu (Jonathan Eunice) writes:
> I notice a number of vendors, such as Sun and (given a recent
> comp.arch posting) DG, concentrating on TPC-B benchmarks, rather than
> TPC-A.  With all due respect to the TPC folks, who do appear to be
> trying to make the world safer for benchmarking, isn't this the same
> deal we had before with TP1 (the TPC-B precursor) benchmarks?

The two measure different things.  TPC-A numbers measure the actual
performance you would get if you built a local automatic teller network
with thousands of terminals.  Its basic premise is that it doesn't make
sense to claim 10,000 TPS unless your system can support 100,000 users
and unless you price it with 100,000 terminals and terminal connections.
That is not necessarily what people use their systems for.
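
To put rough numbers on that premise, a minimal C sketch of the scaling
arithmetic.  The 10-users-per-TPS ratio is taken from the figures above;
the per-TPS database sizing (1 branch, 10 tellers, 100,000 accounts) is
my recollection of the spec, so treat those constants as assumptions and
check the actual TPC-A document.

    #include <stdio.h>

    int main(void)
    {
        double tps       = 10000.0;          /* claimed throughput          */
        double terminals = tps * 10.0;       /* 10 priced users per TPS     */
        double branches  = tps * 1.0;        /* assumed: 1 branch per TPS   */
        double accounts  = tps * 100000.0;   /* assumed: 100,000 per TPS    */

        printf("%.0f TPS => %.0f terminals, %.0f branches, %.0f accounts\n",
               tps, terminals, branches, accounts);
        return 0;
    }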

In addition, TPC-A numbers are difficult to get.  It takes many runs,
using lots of equipment, to get an accurate assessment.  For systems not
intended for the marketplace described above, this level of effort is
difficult to justify.

--bob--
renglish@hplabs.hp.com
The opinions expressed here may not reflect the views of the Hewlett-
Packard Co., its shareholders, its executives, or its more competent
engineers.

jeff@u02.svl.cdc.com (Jeff Needham) (03/28/91)

renglish@cello.hpl.hp.com (Bob English) writes:


>In addition, TPC-A numbers are difficult to get.  It takes many runs,
>using lots of equipment, to get an accurate assessment.  For systems not
>intended for the marketplace described above, this level of effort is
>difficult to justify.

Amen


--
Waiting for the Oracle port to the Oberheim OB-Xa
| Jeffrey Needham
| Yet Another Oracle Performance Group
| Control Data - Santa Clara, CA - INTERNET jeff@hawk.svl.cdc.com

sweiger@sequent.UUCP (Mark Sweiger) (03/30/91)

In article <JONATHAN.91Mar25175131@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>I notice a number of vendors, such as Sun and (given a recent
>comp.arch posting) DG, concentrating on TPC-B benchmarks, rather than
>TPC-A.  With all due respect to the TPC folks, who do appear to be
>trying to make the world safer for benchmarking, isn't this the same
>deal we had before with TP1 (the TPC-B precursor) benchmarks?  The
>deal whereby vendors get to quote absurdly high #s that don't reflect
>real life?  Oughtn't we be concentrating on the more-fully-scaled
>TPC-A runs, and strongly encouraging our vendors to do likewise?

It is very hard to get high rates of throughput in a TPC-A test.
It is even harder to get good $/TPS in a TPC-A test.  So there
is a tendency to retreat to easier ground, the TPC-B.  

However, TPC-B is a better test than TP1 in that it requires no single
point of failure (which implies a mirrored DBMS log), requires log
archiving while the test is running, and requires at least one checkpoint
during the test run.  It also has some strict full-disclosure rules.
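
For reference, the transaction profile itself -- essentially the same
for TP1, TPC-A, and TPC-B -- is tiny; the differences above are all in
the run and disclosure rules.  A sketch in C, where exec_sql() is a
made-up shorthand for an embedded-SQL call (not any vendor's real API)
and the table names follow the usual account/teller/branch/history
scheme:

    /* Made-up embedded-SQL shorthand; a real implementation would call
       the DBMS here.  Stubbed out so the sketch compiles. */
    void exec_sql(const char *stmt, ...) { (void)stmt; }

    /* One debit/credit transaction: adjust three balances, log it. */
    void debit_credit(long acct, long teller, long branch, long delta)
    {
        exec_sql("BEGIN WORK");
        exec_sql("UPDATE account SET balance = balance + ? WHERE id = ?",
                 delta, acct);
        exec_sql("UPDATE teller  SET balance = balance + ? WHERE id = ?",
                 delta, teller);
        exec_sql("UPDATE branch  SET balance = balance + ? WHERE id = ?",
                 delta, branch);
        exec_sql("INSERT INTO history VALUES (?, ?, ?, ?)",
                 acct, teller, branch, delta);
        exec_sql("COMMIT WORK");
    }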

>
>Also, I notice that Sun used two machines to do their TPC-B runs --
>either a SS2 and a SS1+ (then quoting this in their spec sheets as the
>SS2 figure), or a SS490 and a SS2 (result quoted as SS490 value).  I
>don't know the precise TPC rules, but it seems a little un-kosher to
>me.  (Kind of like Motorola when it slipped through the first draft of
>the SPEC reporting rules, forcing the quick adoption of SPECthru
>numbers.)  Am I wrong?  Is it legit?  Are other vendors doing it?  (I
>know HP and IBM haven't.)

It doesn't matter how many machines you use.  The Sun systems put
clients (tellers) on one machine and the RDBMS server (or servers) on
the other.  You cannot have more than one server machine unless the
software allows a database to be distributed in a consistent (ACID
property) fashion across nodes.  The notable example of this is the
recent Oracle 6.2/VAX cluster test, which reported 425 TPC-B TPS on a
4-node cluster.  In general, the fewer machines you use for the test,
the better your $/TPS.
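
The $/TPS arithmetic shows why: the priced configuration includes every
box in the test.  The figures below are invented for illustration, not
taken from any published result.

    #include <stdio.h>

    int main(void)
    {
        /* Invented prices: the second machine here adds cost faster
           than it adds throughput. */
        double one_node_price = 500000.0, one_node_tps = 100.0;
        double two_node_price = 900000.0, two_node_tps = 150.0;

        printf("one node:  $%.0f/TPS\n", one_node_price / one_node_tps);
        printf("two nodes: $%.0f/TPS\n", two_node_price / two_node_tps);
        return 0;
    }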


-- 
Mark Sweiger			Sequent Computer Systems
Database Software Engineer	15450 SW Koll Parkway
Office: (503)578-4329		Beaverton, Oregon  97006-6063
FAX: (503)578-7569		sweiger@sequent.com

jonathan@cs.pitt.edu (Jonathan Eunice) (04/02/91)

   renglish@cello.hpl.hp.com (Bob English) writes:

   >In addition, TPC-A numbers are difficult to get.  It takes many runs,
   >using lots of equipment, to get an accurate assessment.  For systems not
   >intended for the marketplace described above, this level of effort is
   >difficult to justify.

Many replies have mentioned the cost and difficulty of getting these
numbers.  Fair enough.  This seems to indicate that TPC did not do a
terribly good job of defining benchmarks that are both useful and doable.

But what about my original premise--that if all we have are TPC-B/TP1
numbers, we basically have no terribly valuable quantification of what
a system is capable of?  That we have numbers no more useful, and perhaps
less useful, than the MIPS/MFLOPS figures so abused in the technical
computing arena?  If all TPC has done is give the weight of "industry
standard" and a patina of respectability to empty metrics, ought we call
this progress?

jonathan@cs.pitt.edu (Jonathan Eunice) (04/02/91)

renglish@cello.hpl.hp.com (Bob English) writes:

   jonathan@cs.pitt.edu (Jonathan Eunice) writes:
   > I notice a number of vendors, such as Sun and (given a recent
   > comp.arch posting) DG, concentrating on TPC-B benchmarks, rather than
   > TPC-A.  With all due respect to the TPC folks, who do appear to be
   > trying to make the world safer for benchmarking, isn't this the same
   > deal we had before with TP1 (the TPC-B precursor) benchmarks?

   The two measure different things.  TPC-A numbers measure the actual
   performance you would get if you built a local automatic teller network
   with thousands of terminals.  Its basic premise is that it doesn't make
   sense to claim 10,000 TPS unless your system can support 100,000 users
   and unless you price it with 100,000 terminals and terminal connections.
   That is not necessarily what people use their systems for.

   In addition, TPC-A numbers are difficult to get.  It takes many runs,
   using lots of equipment, to get an accurate assessment.  For systems not
   intended for the marketplace described above, this level of effort is
   difficult to justify.

   --bob--

Yes, but what does TPC-B measure?  What useful thing, that is?  If
TPC-A demands a scaled-up system that quasi-accurately reflects a
realistic OLTP use ("the actual performance you would get if you built
a local automatic teller network"), TPC-B measures what?  A system
that is *not* configured like one that you'd find in a real OLTP
system?  This is my impression--that TPC-B reflects a highly
synthetic, highly unrealistic configuration unlikely to be found in
real OLTP situations.  Just like the TP1 rubbish we've had for some
time.

If I'm wrong on this point, tell me why.  Otherwise, tell me why TPC is
promoting an OLTP benchmark that does not reflect anything resembling
the reality of OLTP computing.  

jvm@hpfcso.FC.HP.COM (Jack McClurg) (04/03/91)

> / hpfcso:comp.benchmarks / jonathan@cs.pitt.edu (Jonathan Eunice) /  7:56 pm  Apr  1, 1991 /
> Yes, but what does TPC-B measure?  What useful thing, that is?
> 
> If I'm wrong on this point, tell me why.  Otherwise, tell me why TPC is
> promoting an OLTP benchmark that does not reflect anything resembling
> the reality of OLTP computing.  
> ----------

I think that TPC-B is intended to measure a batch, nighttime processing
workload roughly equivalent to TPC-A.  This would seem to be a reasonable
workload, but it should be measured on exactly the same configuration used to
generate the TPC-A results.  Instead, what we have seen are results for
TPC-B but not TPC-A.

Jack McClurg

news@sequent.com (News on Muncher) (04/03/91)

In article <JONATHAN.91Apr1215610@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes:

>Yes, but what does TPC-B measure?  What useful thing, that is?  If
>TPC-A demands a scaled-up system that quasi-accurately reflects a
>realistic OLTP use ("the actual performance you would get if you built
>a local automatic teller network"), TPC-B measures what?  A system
>that is *not* configured like one that you'd find in a real OLTP
>system?  This is my impression--that TPC-B reflects a highly
>synthetic, highly unrealistic configuration unlikely to be found in
>real OLTP situations.  Just like the TP1 rubbish we've had for some
>time.

TP1 and TPC-B measure the performance of the database backend (the
server side of the client/server model).  TPC-A measures *both* the
client and server sides of the problem, making it more realistic.
A previous post of mine to this newsgroup explains how TPC-B is more 
rigorous than TP1.  And in defense of TP1, I'd have to say that this
benchmark caused a lot of questionable backend database code (in terms
of the ACID properties) and a lot of poorly performing backend code to
get cleaned up throughout the database software industry.
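
The backend-versus-both-sides distinction shows up directly in the
driver loop: a TP1/TPC-B style driver hammers the backend with no think
time, while TPC-A's emulated terminals pause at human speed between
requests.  A minimal sketch; debit_credit() and think_time() are
placeholders, not real APIs.

    /* Placeholders for the real transaction and a timed pause. */
    void debit_credit(void);
    void think_time(double seconds);

    /* TP1/TPC-B-style backend driver: transactions back to back. */
    void batch_driver(long n)
    {
        long i;
        for (i = 0; i < n; i++)
            debit_credit();
    }

    /* TPC-A-style emulated terminal: human-speed pacing. */
    void terminal_driver(long n)
    {
        long i;
        for (i = 0; i < n; i++) {
            think_time(10.0);   /* roughly one request per 10 s per user */
            debit_credit();
        }
    }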

Sure, TPC-A is hard to do, but not impossible.  Many companies have done
TPC-A's;  Omri Serlin's "Fault Tolerant News" newsletter's most
recent issue gives a listing of all the TPC-A and TPC-B results
to date.  TPC-A results exist or are in progress for many platforms
including HP, Sun, Unisys, Sequent, and DEC.

If you really want realism, I'll bet you can hardly wait for TPC-C,
the Order-Entry Benchmark.  In this benchmark, the throughput is
measured in Orders Per Second (yes, OPS)!  This benchmark is an
adaptation of one developed at DEC, and is the third benchmark in
the TPC series.  TPC-C's are in progress at at least one hardware
vendor right now.

renglish@cello.hpl.hp.com (Bob English) (04/04/91)

jonathan@cs.pitt.edu (Jonathan Eunice) writes:
> But what about my original premise--that if all we have are TPC-B/TP1
> numbers, we basically have no terribly valuable quantification of what
> a system is capable of?  That we have numbers no more useful, and perhaps
> less useful, than the MIPS/MFLOPS figures so abused in the technical
> computing arena?  If all TPC has done is give the weight of "industry
> standard" and a patina of respectability to empty metrics, ought we call
> this progress?

First, let me say that I am not in any way involved in the defining of
TPC benchmarks, and have only been peripherally involved in TPC bench-
marking at HP.  If you really want to know what TPC thinks the
benchmarks are good for, you should ask them.

As I understand it, the TPC-A benchmark measures automatic teller
performance.  It assumes human users at terminals, some of whom are
active and some of whom are not.  The active users make requests at a
rate reflective of the speed at which humans work.  This means that in
order to get a large TPC-A number, the system must support a large
number of active and non-active connections.  These connections consume
resources on the host machine and reduce its throughput.

For an environment where the end-users are not humans but machines,
these constraints may not be appropriate.  A network of remote sensors,
for example, might generate the same number of transactions with far
fewer connections.  The same might be true for automated controllers on
a factory floor.  In such cases, the TPC-B number may give a better
picture of the capacity of the machine.
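
The arithmetic behind that connection burden is the standard
closed-system bound: each of N users completes at most one transaction
per think-plus-response cycle, so throughput tops out near N / (Z + R).
The numbers below are illustrative only.

    #include <stdio.h>

    int main(void)
    {
        double users = 1000.0;   /* emulated terminals (illustrative) */
        double think = 9.0;      /* think time per cycle, seconds     */
        double resp  = 1.0;      /* response time, seconds            */

        /* 1000 users on a 10-second cycle can drive at most 100 TPS,
           which is where the rough 10-users-per-TPS ratio comes from. */
        printf("max throughput = %.1f TPS\n", users / (think + resp));
        return 0;
    }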

--bob--
renglish@hplabs
Not a spokesman for anyone, including myself.

spuhler@hpcuhc.cup.hp.com (Tom Spuhler) (04/04/91)

I think that my previous post got out by accident (nasty old notes!!),
but what I wanted to comment on was:

TPC-A is a rigorous (as in well-defined) interactive database benchmark
(TPC-B is a rigorous batch DBMS benchmark).  They have never been thought
of as particularly indicative benchmarks; rather, they are intended to
replace Debit/Credit and TP1 as "the" DBMS benchmarks, in such a way that
at least everyone knows exactly what is meant when someone reports
"TPC-A" or "TPC-B" results.  While the applicability of the results to a
specific application may be weak, at least you have some assurance of the
conditions under which the results were obtained, and can compare
different machines' performance and cost of solution (for the TPC
workload).

In this, I think that TPC has done very well.  Only a few oddballs still
bother to report TP1 or Debit/Credit results (and they are scorned), and
there is significant peer review of results to ensure compliance.  The
specification is detailed enough, and the required reporting detailed
enough, that it is possible (IMHO) to make an apples-to-apples comparison
of TPC numbers (A to A, B to B, as defined by TPC) and feel relatively
comfortable with what you are seeing.  This has NOT BEEN POSSIBLE in the
past.  With the earlier "benchmarks," what was actually done was largely
a matter of the vendor's interpretation and ethics; with TPC, at least
you know what you got.  Now, anyone who is silly enough to report, say,
TP1 results instead of TPC-B results is seen as trying to hide something.

As to how useful these are for predicting customer application
performance: probably not all that much.  HOWEVER, they are certainly
better for anticipating the likely performance of an OLTP/DBMS
application than any other common benchmark out there (try it with
Dhrystones!).  There are various caveats, but now at least everyone is
playing on the same field.

Now, if that were the end of it, there might be a problem.  However,
things are just getting started in TPC land.  Whereas TPC-[AB] were
intended as stopgap efforts, TPC is finalizing its first application-
oriented benchmark, TPC-C.  I encourage everyone out there to get ahold
of the spec and read it.  And they have more planned.  [This is actually
a nuisance in some ways, as it means we will have to implement them, but
in the total scheme of things it is GOOD.]

-Tom

"TPC-A, TPC-B:  ask for them by name"

sdo@cbnewsl.att.com (scott.orshan) (04/09/91)

Is there any transaction processing benchmark that addresses cost per user
rather than cost per TPS?  The TPC results are for a given number of users
(about 10X the TPS rating).  I think a more useful rating would vary
the number of users up to some very large number, maybe ten times
the nominal TPC number of users.  The database size would be constant.
The reported costs would show the incremental cost per user, and the
ultimate limit on the number of users.  The think time could be
lengthened to accommodate the extra users.

Here's why this is useful.  Suppose I need to support 1000 users, generating
an equivalent of 50 TPC-A TPS.  Anyone reporting TPC numbers will only run with
500 users for that throughput level.  I want to know whether a 50 TPS
system can support 1000 users - I want to decouple the TPS rating from
the number of users.

Given two 50 TPS systems, it could be that one is at the upper limit
of its configurable resources, while the other could easily scale to
double the number of users.  Not counting the problems of database
replication, I want to know whether it is cheaper to buy two 500 user
systems or one 1000 user system.  Note that each of the 500 user systems
would only run at 25 TPS, since my hypothetical users don't deliver requests
as fast as TPC front ends.
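
This decoupling is easy to state numerically: for a fixed throughput,
the user population fixes the per-user cycle time (cycle = users / TPS).
An illustrative sketch, using the 10-users-per-TPS ratio mentioned
earlier in the thread:

    #include <stdio.h>

    int main(void)
    {
        double tps       = 50.0;
        double tpc_users = tps * 10.0;   /* what a TPC-A run would use  */
        double my_users  = 1000.0;       /* the population to be served */

        /* Same 50 TPS, very different per-user request rates. */
        printf("TPC-A run:   %4.0f users, one txn per %4.1f s each\n",
               tpc_users, tpc_users / tps);
        printf("my workload: %4.0f users, one txn per %4.1f s each\n",
               my_users, my_users / tps);
        return 0;
    }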

It's not too hard to figure out an approximate TPC-equivalent rating
for an application (at least within a factor of two or three), but
there's no way to know if another network connection can be made,
or another process attached to a DBMS unless the vendors report it.

In the absence of this information, I would have to start with TPC-B
numbers and price out a configuration for each vendor/DBMS combination,
and then see if the DBMS would work with that many users.


Any comments on this?


	Scott Orshan
	UNIX System Labs
	908-522-5063
	attunix!sdo
	sdo@attunix.att.com