[net.unix-wizards] IOCALL results and problems

stubbs@ncr-sd.UUCP (Jan Stubbs) (12/12/85)

	IOCALL, A UNIX SYSTEM PERFORMANCE BENCHMARK
The results so far are below. Thanks everybody.
Send your results to me directly. The benchmark is a "C" program
which measures Unix kernel performance. 

time iocall     Send all 3 times (user, system, real)
                I am reporting the system time only.
         
"The opinions expressed herein are those of the author. Your mileage may vary".
 
Problems:

1) As Jeff Makey kindly pointed out, IOCALL unfortunately does cross a
buffer boundary if your buffer size is 512. Older versions of Unix
(Version 7, System III) and their progeny were 512. Berkeley 4.2, 4.3,
System V and their progeny are 1024 or bigger, so no problem with those
numbers. But all the numbers sent to me for the 512-byte-buffer Unixes
are slower than they should be, because they did over 1000 disk writes,
which use lots of cpu cycles in the drivers. I don't know about
Version 8 or 2.9 BSD, can anyone help?

Jeff offered a solution, which adds a seek to keep everything in the
1st 512 bytes. This makes the kernel do a little extra work, but it did
not change the timing on our Pyramid. The new source is below, if you
have a 512 byte buffer version of Unix please rerun with this one.

2) Jeff and others also pointed out that the 2nd argument to lseek
should be a long, not an int. Shame on me! See what happens when you
don't lint your programs? The source below also fixes this. Reruns may
be required to get correct results on machines where longs aren't the
same size as ints (PDP's...).

3) I failed to mention that these timings should be run on an otherwise
idle machine. If you can, please run them that way; it does improve the
timings.

4) Since not everyone is a good sport about benchmarks, since I might be
a biased source, and since I don't have access to the latest NCR Unix
stuff anyhow (the M68020-based Tower/32), I won't publish any NCR
numbers unless they are offered to me by NCR E&M Columbia, which is
where the Tower line comes from. I encourage someone else to do so,
however.


Jan Stubbs ..sdcsvax!ncr-sd!stubbs
              
IOCALL RESULTS:

SYSTEM				UNIX VERSION		SYSTEM TIME SECONDS
-----------			----------------	-------------------
DEC Rainbow100 w/NECV20 	Venix			18.4 *a
DEC Pro-300			Venix 1.1		18.1 *a
MicroVax I			Ultrix V1.1		18.0
Onyx C8002s Z8000		SIII			13.7 *a	
Onyx C8002 Z8000		v7			13.3 *a
TIL NS32016 9MHz No Wait states	Local Port		12.2
ATT 3b2/300			SV			10.3
VAX 11/750			4.2 BSD			10.0
PDP 11/44			ISR 2.9 BSD		9.5
VAX 11/750			SV.2			9.4
VAX 11/750			4.3 BSD			9.0
Sun-2 10MHz 68010		4.2 BSD Rel 2.0		9.0
Sun-2 10MHz 68010		4.2 BSD Rel 3.0 	8.7
PE 3220				V7 Workbench		8.5 *a
VAX 11/750			research version 8	8.1
VAX 11/750			4.1 BSD			7.2
Radio Shack 16A			Xenix (v7)		7.2 *a
PC/AT 				Venix 5.2		6.8
ATT7300 Unix PC 10MHz 68010	SV.2			6.4
Bullet286(PC/XT)		Venix 2.0		6.0 *a
Pyramid 90x w/cache		OSx2.3			5.8
VAX 11/780			4.2 BSD			5.7
Plessey Mantra 12.5Mhz 68000	Uniplus SV Release 0	5.5
MicroVax II			Ultrix 1.1		5.2
HP9000-550 3cpu's		HP-UX 5.01		5.1 *c
PC/AT 7.5 Mhz			Venix286 SV.2		5.1
Convex C-1			4.2 BSD			4.6
VAX 11/785			SV.2			4.4
VAX 11/785			4.3 BSD			3.6
Sun-3/75 16.67Mhz 68020		4.2 BSD			3.6
Sun-3/160M-4 16.67Mhz 68020	4.2 BSD Rel 3.0 Alpha	3.6
GEC 63/40			S 5.1			2.7
Gould PN9080			UTX 1.2			2.5
Sperry 7000/40 (aka CCI 6/32)	4.2 BSD			1.9 *b
VAX 8600			4.3 BSD			1.3
VAX 8600			Ultrix 1.2-1		1.1
IBM 3083			UTS SV			1.0 *b
Amdahl 470/V8			UTS/V (SV Rel 2,3)V1.1+ .98 *b

Notes:

*a
This result was obtained with the original version of IOCALL, which crosses
the 512-byte buffer boundary, and this version of Unix has buffers of 512
bytes. This is believed to be the case with all Version 7 and SIII derived
OS's. It results in 1001 writes being done, which uses significantly more
cpu time and makes these results comparable only to others with the same
problem. See discussion above. 2.9 BSD????

*b
This result was obtained on a system which probably had other programs
running at the time. The submitter is requested to rerun, if possible, when
the system is idle. This will improve the result somewhat.

*c
Multi-cpu system. IOCALL was run single thread, which probably did not
utilize all cpu's. This system probably has considerably more power than
is reflected by the result.





-------cut----cut------cut-------------------------------

/* This benchmark tests speed of the Unix system call interface
   and speed of the cpu doing common Unix io system calls. */

char buf[512];
int fd, count, i, j;

main()
{
	fd = creat("/tmp/testfile", 0777);
	close(fd);
	fd = open("/tmp/testfile", 2);
	unlink("/tmp/testfile");
	for (i = 0; i <= 1000; i++) {
		lseek(fd, 0L, 0);	/* add this line! */
		count = write(fd, buf, 500);
		lseek(fd, 0L, 0);	/* second argument must be long */

		for (j = 0; j <= 3; j++)
			count = read(fd, buf, 100);
	}
}

dan@rna.UUCP (Dan Ts'o) (12/13/85)

In article <354@ncr-sd.UUCP> stubbs@ncr-sd.UUCP (0000-Jan Stubbs) writes:
>
>	IOCALL, A UNIX SYSTEM PERFORMANCE BENCHMARK
>The results so far are below. Thanks everybody.
>Send your results to me directly. The benchmark is a "C" program
>which measures Unix kernel performance. 
>
>
>/*This benchmark tests speed of Unix system call interface
>  and speed of cpu doing common Unix io system calls. */
>
>char buf[512];
>int fd,count,i,j;
>
>main()
>{
> fd = creat("/tmp/testfile",0777);
> close(fd);
>  fd = open("/tmp/testfile",2);
>  unlink("/tmp/testfile");
>for (i=0;i<=1000;i++) {
>  lseek(fd,0L,0);		/* add this line! */
>  count = write(fd,buf,500);
>  lseek(fd,0L,0);		/* second argument must be long */
>
>  for (j=0;j<=3;j++) 
>  	count = read(fd,buf,100);
>  }
>}

	Well I don't want to flame too much. Just a few comments.

	Basically, I find it difficult to take this benchmark and the presented
results too seriously.

	- I have trouble understanding the point of the benchmark program. It
just seems bizarre. For 1000 times, it writes 500 bytes at the beginning of the
file and reads 400 of them back, 100 at a time. Because of the buffer cache,
this whole routine just does user/kernel buffer copies, back and forth. If the
performance of the system call interface and of user/kernel memory copies is
what is being measured, then the results may be okay, although strangely
obtained. I don't believe it measures much else in the way of kernel
performance, or system performance.  It's not even something a normal user can
relate to, such as "copying files on an X is twice as fast as on a Y".

	- It is obviously a single point measurement. It can tell you very
little about how particular applications or the system in general will run.

	- The numbers are way too small to interpret with any substantial
significance (i.e. you should run the benchmark with, say, 10000 rather than
1000 in the loop). The difference between the various VAX 11/750 times is,
for example, 7.2 to 9.4. I could be convinced there is significance there,
but...

	- That a Radio Shack 16A performs 25% better than a VAX 11/750 is cute
but of little practical interest (read: ridiculous). A benchmark that tells
me that is probably not going to be very useful. Are we really to think that
an Amdahl 470/V8 is only 12% faster than a VAX 8600, or that a Pyramid is
slower than a VAX 11/780?

hammond@petrus.UUCP (Rich A. Hammond) (12/16/85)

> In article <354@ncr-sd.UUCP> stubbs@ncr-sd.UUCP (0000-Jan Stubbs) writes:
> >
> >	IOCALL, A UNIX SYSTEM PERFORMANCE BENCHMARK
> >... The benchmark is a "C" program which measures Unix kernel performance. 
> 
Dan Ts'o writes:
> 	Well I don't want to flame too much. Just a few comments.
> 
> 	Basically, I find it difficult to take this benchmark and the presented
> results too seriously.
> 
> 	- I have trouble understanding the point of the benchmark program.
> ...  It's not even something a normal user can relate to,
> such as "copying files on a X is twice as fast as Y".
> 
> 	- It is obviously a single point measurement. It can tell you very
> little about how particular applications or the system in general will run.
> 
> 	- The numbers are way to small to interpret with any substantial
> significance (i.e. you should run the benchmark with say 10000, rather than 1000
> in the the loop). The difference between the various VAX 11/750 times are,
> for example 7.2 to 9.4 . I could be convince there is significance there, but...
> 
> 	- That a Radio Shack 16A performs 25% better than a VAX 11/750 is cute
> but little practical interest (read ridiculous, a benchmark that tells me that
> is probably not going to be very useful, are we really to think that an
> Amdahl 470/V8 is only 12% faster than a VAX8600, that a Pyramid is slower than
> a VAX 11/780).

a) I agree it doesn't measure everything, but it does check three important
aspects that affect overall system performance: context switch costs,
copying costs, and the cost of finding the buffer in the buffer cache.
b) You want to avoid using the disks, since after all, an IBM PC with a
fast hard disk would probably outperform an 8600 with an RK05.
Thus, the statement "system A copies files twice as fast as system B"
is only useful if you know the I/O configuration (was it massbus or unibus
disks on a Vax? what type of disks? ...).
c) I agree, run the benchmark with more times through the loop on fast
machines.  1000 is probably enough on small machines.
d) The point about the benchmark results is not that they are ridiculous,
but that they might show up areas which need work.  For example, if
you simply port UNIX to a large machine and increase the number of buffers
without thinking about the way the buffer cache works, you are likely to
find that you have, say, 1024 buffers chained into 60 queues.  Whereas on a
pdp11 you had 60 buffers in 60 queues.  Which one will take less time to
find a buffer in?  Raw machine speed alone won't tell you the answer.
Further, let's suppose you built a machine with lots of registers and a
load/store architecture (i.e. RISC, Pyramid).  It turns out the cost of
doing a context switch is higher (save all registers) and the load/store
architecture is at its worst doing memory to memory copies.  Thus,
a Pyramid might very well do worse than a Vax 11/780.  I timed a
long-to-long copy on a Pyramid in user mode; it was only 1.15 * the 11/780.
Given that the Pyramid has a slow context switch....
e) The variation among machines of the same model is real; we have two
780's and one is consistently about 5% faster on benchmarks.  We have
two Pyramids and again, one is consistently faster on the same benchmarks.
One should always allow +/- 10% when using benchmarks to compare machines.

Rich Hammond, Bell Communications Research

larry@geowhiz.UUCP (Larry McVoy) (12/18/85)

In article <761@petrus.UUCP> hammond@petrus.UUCP (Rich A. Hammond) writes:
>> In article <354@ncr-sd.UUCP> stubbs@ncr-sd.UUCP (0000-Jan Stubbs) writes:
>> >
>> >	IOCALL, A UNIX SYSTEM PERFORMANCE BENCHMARK
>> >... The benchmark is a "C" program which measures Unix kernel performance. 
>> 
>Dan Ts'o writes:
>> 	Well I don't want to flame too much. Just a few comments.
>> 
>> 	Basically, I find it difficult to take this benchmark and the presented
>> results too seriously.

I tend to agree with Dan.  I think what people would like to see is a 
benchmark which measures how well Unix, running multiple users, performs
on each machine.  The benchmark would have to measure something that did
not vary widely (such as I/O devices), as those results would only reflect
how much one had spent on the bus & disk.  So, how about this:

The Dhrystone benchmarks are considered good tests of the CPU (at least by 
me they are), but don't really test Unix at all (in fact some people run 
them in standalone mode).  How about a version (called forkstone?) which
runs the Dhrystone as 1, 2, 8, and 64 concurrent processes?  This would
show 1) the speed of the CPU, 2) the first part of the curve, 8) a nice
single-user level, and 64) what happens when you have multiple users.

It would not test I/O, which is a hard thing to test fairly.  It would get
rid of those Z80 Dhrystones (flame, flame), as they're not multi-tasking...

I guess if there is any response and nobody wants to do it, I'll hack the
Dhrystones.  I think it would be better if the original author did it, as
{s}he probably can understand that bastardized {C}Ada source.

Please post your views to the net.  I don't want to discuss this via mail.

-- 
Larry McVoy
-----------
Arpa:  mcvoy@rsch.wisc.edu                              
Uucp:  {seismo, ihnp4}!uwvax!geowhiz!geophiz!larry      

"If you are undertaking anything substantial, C is the only reasonable 
 choice of programming language"   -  Brian W. Kernighan

gemini@homxb.UUCP (Rick Richardson) (12/19/85)

Larry McVoy writes:
>I tend to agree with Dan.  I think what people would like to see is a 
>benchmark which measures how well Unix, running multiple users, performs
>on each machine.  The benchmark would have to measure something that did
>not vary widely (such as I/O devices), as those results would only reflect
>how much one had spent on the bus & disk.  So, how about this:
>
>The dryhstone benchmarks are considered good tests of the CPU (at least by 
>me they are), but don't really test Unix at all (in fact some people run 
>them in standalone mode).  How about a version, (called forkstone?), which
>runs the dryhstone as 1, 2, 8, and 64 concurrent processes?  This would
>show 1) the speed of the CPU, 2) first part of the curve, 8) a nice single
>user level, and 64) what happens when you have multiple users.  
>
>It would not test I/O, which is a hard thing to test fairly.  It would get
>rid of those Z80 dryhstones (flame, flame) as they're not multi tasking...
>
>I guess if there is any response and nobody wants to do it, I'll hack the
>drystones.  I think it would be better if the original author did it, as
>{s}he probably can understand that bastardized {C}Ada source.

I don't think that running multiple dhrystones would measure anything more
than the cost of doing a context switch once every <scheduling granularity>.
Except on a multiple processor machine, the time will be N*1 dhrystone +
M context switches.  There are easier ways to measure the time to do a context
switch.  If you want to measure multi-user response, you've GOT to open the
IO can-of-worms, since they WILL be doing IO.

Rick Richardson, PC Research, Inc. (201) 922-1134
..!ihnp4!houxm!castor!{rer,pcrat!rer} <--Replies to here, not to homxb!!!

P.S. Reinhold Weicker is the author of Dhrystone.  I apologize for
creating the bastardized {C}Ada source from his original Ada!

larry@geowhiz.UUCP (Larry McVoy) (12/21/85)

>I wrote:
>>I tend to agree with Dan.  I think what people would like to see is a 
>>benchmark which measures how well Unix, running multiple users, performs
>>on each machine.  The benchmark would have to measure something that did
>>not vary widely (such as I/O devices), as those results would only reflect
>
Rick Richardson writes:
>I don't think that running multiple dhrystones would measure anything more
>than the cost of doing a context switch once every <scheduling granularity>.
>Except on a multiple processor machine, the time will be N*1 dhrystone +
>M context switches.  There are easier ways to measure the time to do a context
>switch.  If you want to measure multi-user response, you've GOT to open the
>IO can-of-worms, since they WILL be doing IO.
>
>P.S. Rheinhold Weicker is the author of Dhrystone.  I apologize for
>creating the bastardized {C}Ada source from his original Ada!

Well, ok, so you don't think multiple Dhrystones would be interesting.  Hmm...
I do - it would be interesting to know how well they do when there's lots of
them.  You say it's no more than testing context switches, implying that
all context switches are equal.  Uh-uh.  For example: I heard (from Guy
Harris, who I'm sure will correct any inaccuracies) that Sun-3 memory
management is done such that 8 memory-mapping context blocks are in memory
at all times.  This leads to fast-fast-fast response for active jobs <= 8,
but what happens when you go to 16? 32?

I think we both agree that testing I/O is a mess.  It is really hard to get
an objective and accurate reflection of a machine's performance.  I think we
also both agree that what people would like to see is some sort of 
measurement of a machine's multi-{user,tasking} capability.  So, I made a 
pass -- what have you to offer instead?

-larry

BTW - sorry about the {C}Ada crack, just my peevishness at not being able
      to decipher it...
-- 
Larry McVoy
-----------
Arpa:  mcvoy@rsch.wisc.edu                              
Uucp:  {seismo, ihnp4}!uwvax!geowhiz!geophiz!larry      

"If you are undertaking anything substantial, C is the only reasonable 
 choice of programming language"   -  Brian W. Kernighan

jph@whuxlm.UUCP (Holtman Jim) (12/22/85)

> Larry McVoy writes:
> >I tend to agree with Dan.  I think what people would like to see is a 
> >benchmark which measures how well Unix, running multiple users, performs
> >on each machine.  The benchmark would have to measure something that did
> >not vary widely (such as I/O devices), as those results would only reflect
> >how much one had spent on the bus & disk.  So, how about this:
> >
> >The dryhstone benchmarks are considered good tests of the CPU (at least by 
> >me they are), but don't really test Unix at all (in fact some people run 
> >them in standalone mode).  How about a version, (called forkstone?), which
> >runs the dryhstone as 1, 2, 8, and 64 concurrent processes?  This would
> >show 1) the speed of the CPU, 2) first part of the curve, 8) a nice single
> >user level, and 64) what happens when you have multiple users.  
> >
> >It would not test I/O, which is a hard thing to test fairly.  It would get
> >rid of those Z80 dryhstones (flame, flame) as they're not multi tasking...
> >
> >I guess if there is any response and nobody wants to do it, I'll hack the
> >drystones.  I think it would be better if the original author did it, as
> >{s}he probably can understand that bastardized {C}Ada source.
> 
> I don't think that running multiple dhrystones would measure anything more
> than the cost of doing a context switch once every <scheduling granularity>.
> Except on a multiple processor machine, the time will be N*1 dhrystone +
> M context switches.  There are easier ways to measure the time to do a context
> switch.  If you want to measure multi-user response, you've GOT to open the
> IO can-of-worms, since they WILL be doing IO.
> 
> Rick Richardson, PC Research, Inc. (201) 922-1134
> ..!ihnp4!houxm!castor!{rer,pcrat!rer} <--Replies to here, not to homxb!!!
> 
> P.S. Rheinhold Weicker is the author of Dhrystone.  I apologize for
> creating the bastardized {C}Ada source from his original Ada!

Results for VAX 8600 running SVR2

1.2   Real
1.1   System
0.0   User

stubbs@ncr-sd.UUCP (Jan Stubbs) (01/02/86)

In article <1035@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
> There are easier ways to measure the time to do a context
>switch.  If you want to measure multi-user response, you've GOT to open the
>IO can-of-worms, since they WILL be doing IO.

How about the following as a multiuser benchmark?

iocall&
dhrystone&
iocall&
dhrystone&
etc..... 

Putting the above in a shell file and getting stop watch times on a
dedicated system gives a reasonable approximation of a real system
workload.  If you want physical IO in there as well, add a few cc hello.c&.
If you want to simulate user think time, add a sleep between programs.
Vary the mix of these programs to simulate your prospective use of the machine.
If you really want to get fancy, have one shell file for each simulated user
and measure response time degradation as you add simulated users. IOCALL and
the cc invocations would have to be modified to use unique file names or they
will write on top of each other.

We have done this with some success; the problem is getting any two
performance people to agree on what is an appropriate mix.

The AIM benchmarks from AIM Technology (Santa Clara, CA.) attempt to do this
sort of thing, but more comprehensively, for a price, and they provide
results for many machines as well.

The above opinions are those of the author only.

Jan Stubbs


