[comp.sys.hp] Disk performance HP-UX 6.5

magnar@sfd.uit.no (Magnar Antonsen) (12/08/89)

While testing the CDC Wren IV disk some dramatic differences between HP and SUN
turned up.  Running the dd command illustrates the point. We read 2000 blocks
of 8K size from the disk with the command

     dd if=/dev/<block-special> of=/dev/null bs=8k count=2000

and the time (measured with /bin/time) it takes to do this is

Sun 3/80 with SunOS 4.03    real:  18.4   sys: 8.2
HP 9000/370 with HP-UX 6.5  real: 144.8   sys: 9.1.

The same physical disk was used on both computers. The interface on the HP is
2.7 MB/s synchronous SCSI. Other tests (e.g. omitting the dd command, using
other block sizes) give the same conclusion: the Sun is 7 to 9 times faster
than the HP.
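
Those timings correspond roughly to

     8K*2000/18.4  = ~890 k/sec   (Sun 3/80)
     8K*2000/144.8 = ~113 k/sec   (HP 9000/370)

i.e. a ratio of about 8.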

Could anyone comment on or explain this difference in performance? A first
conclusion of our tests shows that the answer may be found in differences
between the device drivers in HP-UX 6.5 and SunOs 4.03.


//// Magnar Antonsen                   // N-9001 TROMSOE, NORWAY              /
/// Computer Science Department       // Phone : + 47 83 44043               //
// University of Tromsoe             // Telefax: + 47 83 55418              ///
/ NORWAY                            // Email: magnar@sfd.uit.no            ////

kinsell@hpfcdj.HP.COM (Dave Kinsell) (12/12/89)

>While testing the CDC Wren IV disk some dramatic differences between HP and SUN
>turned up.  Running the dd command illustrates the point. We read 2000 blocks
>of 8K size from the disk with the command

>     dd if=/dev/<block-special> of=/dev/null bs=8k count=2000

>and the time (measured with /bin/time) it takes to do this is

>Sun 3/80 with SunOS 4.03    real:  18.4   sys: 8.2
>HP 9000/370 with HP-UX 6.5  real: 144.8   sys: 9.1.

Using dd with the block special device file isn't doing what you think
it's doing.  For reasons I don't understand, the physical I/O is broken
into 2 Kbyte chunks, at least on the HP system.  Note that the man page
for dd says that the blocksize declaration works only for raw I/O.
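
To see what the drive does with genuine 8K requests, the reads would have to
go through the character-special (raw) device instead.  A minimal sketch in C
of that test (the device name below is just a placeholder, in the spirit of
the original post, not a real HP-UX or SunOS path):

	/* Sketch only: read the raw device in 8K pieces, the way dd with
	 * bs=8k does on a character-special file. */
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		char	buf[8*1024];
		int	n = 0;
		int	fd = open("/dev/r<character-special>", O_RDONLY);

		if (fd < 0)
			return 1;
		while (n < 2000 && read(fd, buf, sizeof buf) > 0)
			n++;
		return 0;
	}

Timed with /bin/time, that is directly comparable with the dd runs above.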

The big factor in the performance difference is that the Wren IV has
readahead which must be specifically enabled.  Since it's not a
supported drive on the HP system, it doesn't get turned on.  It must be
getting turned on with the SUN system.

You're sending short, consecutive read requests to the disk, which is
exactly where readahead shows the biggest improvement.  However, it is
not at all representative of file system or swap activity.

Without readahead, it will take slightly more than one revolution to
get each 2K of data (one latency, plus skipping the previously read 2K):

   2K/18ms = 114 k/sec

Your results: 

   8K*2000/144.8 = 113 k/sec

The Wren IV is a ZBR technology drive, which means the data rate changes
significantly depending on what cylinder is being read.  This complicates
using it for file system benchmarking.


-Dave Kinsell
 use kinsell@hpfcmb.hp.com


DISCLAIMER:  This is not an official or officious policy statement of the
             Hewlett-Packard company.

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (12/19/89)

   [somebody has done this test:]
   >     dd if=/dev/<block-special> of=/dev/null bs=8k count=2000

   >and the time (measured with /bin/time) it takes to do this is

   >Sun 3/80 with SunOS 4.03    real:  18.4   sys: 8.2
   >HP 9000/370 with HP-UX 6.5  real: 144.8   sys: 9.1.

In article <17330009@hpfcdj.HP.COM> kinsell@hpfcdj.HP.COM (Dave
Kinsell) writes:

   Using dd with the block special device file isn't doing what you think
   it's doing.
	[ ...some comments about read ahead buffering... ]

Actually, what's happening here isn't what you think either. Read
ahead buffering in the device does not enter the picture at all.

SunOS 4 does ALL io (save raw devices!) via memory mapping of
files (including block devices as well I believe), and does not
read data in, if that's written to /dev/null immediately
thereafter (this because of copy-on-write).

Try doing 'time cp /vmunix /dev/null' under SunOS 4, SunOS 3, and
HP-UX, and you will see, especially if run under the C shell,
which gives you a count of IO operations.

Otherwise we would be seeing a miraculous 1 MByte a sec out of
SunOS, and I do not believe in miracles; in much the same
conditions, but with SunOS 3.5 on a 3/50, I get real: 151 and
sys: 17, which are compatible with the slower CPU speed of the
3/50 w.r.t. the 3/80; I do not think that SunOS 4 has drivers
and a filesystem that are 10 times as fast as SunOS 3, because
they are largely the same.
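
Assuming the same 2000 blocks of 8K as above, that 151 seconds corresponds to

   8K*2000/151 = ~109 k/sec

which is essentially the HP-UX rate.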

In other words, the numbers posted above by somebody are
completely bogus, and it is my general impression that HP-UX and
SunOS are really within the same league of IO bandwidth. The only
machines that seem to have decent or above average IO bandwidth
are the MIPSco ones, and they took a lot of care in that.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

guy@auspex.auspex.com (Guy Harris) (12/20/89)

>Actually, what's happening here isn't what you think either.

Nor is it what you think....

>SunOS 4 does ALL io (save raw devices!) via memory mapping of
>files (including block devices as well I believe), and does not
>read data in, if that's written to /dev/null immediately
>thereafter (this because of copy-on-write).

SunOS does, in fact, do UFS and NFS I/O by something that basically
amounts to memory mapping.  What a "read" of a UFS or NFS file, or, I
think, a block special file amounts to is "map the region being read
into the kernel's address space, and then copy from that mapped region
into the user's buffer".  The kernel obviously has no idea that the data
in question is going to be written to "/dev/null", so it copies it
anyway, which means it has to fault the data in from the file if it's
not already in memory.  (Yes, I've read the code; that's how it works.)

>Try doing 'time cp /vmunix /dev/null' under SunOS 4, SunOS 3, and
>HP-UX, and you will see, especially if run under the C shell,
>which gives you a count of IO operations.

That's different from "dd".  "cp" doesn't do "read"s from "/vmunix", it
"mmap"s "/vmunix" into *its* address space and then writes that mapped
region to "/dev/null".  *That* write, obviously, never touches the data,
so it doesn't have to be faulted in.
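
In code terms, that path looks roughly like the following sketch (not the
actual SunOS "cp" source, just the shape of the calls involved):

	/* Rough sketch of the cp-style path: map the input and write the
	 * mapping.  A write to /dev/null never looks at the data, so
	 * nothing gets faulted in.  (A read()-based copy, as dd does,
	 * would have to fault every page into the user buffer.) */
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		struct stat	st;
		caddr_t		p;
		int		in  = open("/vmunix", O_RDONLY);
		int		out = open("/dev/null", O_WRONLY);

		if (in < 0 || out < 0)
			return 1;
		fstat(in, &st);
		p = (caddr_t) mmap(0, st.st_size, PROT_READ, MAP_SHARED, in, 0);
		write(out, p, st.st_size);
		munmap(p, st.st_size);
		return 0;
	}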

"dd", on the other hand, actually does "read"s from its input file (as
proven by running "trace" on it), so "dd if=/vmunix of=/dev/null" does
actually read all the data in "/vmunix", as opposed to
"cp /vmunix /dev/null".

pcg@aber-cs.UUCP (Piercarlo Grandi) (12/22/89)

In article <2771@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
    >Actually, what's happening here isn't what you think either.
    
    Nor is it what you think....

Well, apparently you are right, but in a sense this is disappointing; Sun
could have been cleverer, but it has been clever enough.
    
    SunOS does, in fact, do UFS and NFS I/O by something that basically
    amounts to memory mapping.  What a "read" of a UFS or NFS file, or, I
    think, a block special file amounts to is "map the region being read
    into the kernel's address space, and then copy from that mapped region
					      ^^^^
    into the user's buffer".

If they do copy on write, and especially if the buffers are page aligned,
nothing need happen. In particular, the stdio library and even open(2),
read(2), ... are really layered onto memory mapping in SunOS 4. The old Unix
I/O system is nearly dead; open(2) maps the file, and read(2) accesses it
directly.

In other words, the traditional Unix I/O is done only for some character
devices (some use streams instead). SunOS 4 emulates a PDP on a Multics...

    The kernel obviously has no idea that the data in question is going to
    be written to "/dev/null", so it copies it anyway, which means it has to
    fault the data in from the file if it's not already in memory.

Not necessarily true. With read(2) implemented as copy-on-write, the data
pages would never actually be touched; they could just as well remain on disc.
This of course would more easily be true if Sun's dd had page-aligned
buffers, or if SunOS implemented unaligned copy-on-write (much more difficult).

    (Yes, I've read the code; that's how it works.)

This settles it.
    
    "dd", on the other hand, actually does "read"s from its input file (as
    proven by running "trace" on it),

Again, if dd buffers were page aligned (they should be) and copy-on-write
were used, this would not need to happen. Unfortunately SunOS does not do
this, so more investigation is needed.
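
If anybody wants to investigate, valloc(3) is the traditional way to get a
page-aligned buffer on BSD-derived systems; whether a read(2) into such a
buffer would actually go copy-on-write is exactly the open question.  A
minimal sketch:

	#include <stdlib.h>
	#include <unistd.h>

	int
	main(void)
	{
		/* valloc() returns page-aligned memory, so the kernel could
		 * in principle satisfy the read by remapping pages rather
		 * than copying them. */
		char	*buf = valloc(16*1024);

		if (buf == NULL)
			return 1;
		while (read(0, buf, 16*1024) > 0)
			;	/* never touch the data */
		return 0;
	}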

I have got some more data points that show something interesting, after
some simple tests under both SunOS 3 and 4.

The discs in both cases are broadly comparable; CPU speeds are not very
important here. To do the reads, and be sure that pages were faulted, I
used a trivial program like this:

	#include <unistd.h>

	int
	main(void)
	{
		char buf[16*1024];
		/* hope the optimizer does not do funny tricks */
		while (read(0, buf, sizeof buf) > 0)
			buf[1*1024] = buf[9*1024] = 'x';
		return 0;
	}

which reads two Sun pages at a time and modifies a byte in each, to make
sure that copy-on-write, if present, is exercised and that each page is
faulted in. The following figures are quite approximate, but representative:

	SunOS	Mbytes	Type	Seconds	KB/sec	I/Os	Machine

	3	10	block	95	100	5100	Sun 3/50
	3	10	raw	20	500	?	Sun 3/50
	4	24	-	45	500	530	Sun 3/280

What I read here is less optimistic results than some that have been posted,
but impressive nonetheless. Getting over 500KB/s out of a disc is no mean
feat.  SunOS manages to do that using raw device access under 3, and with
either device under 4. The fact that SunOS 4 gives with the block device the
same performance as SunOS 3 gives on the raw device means that mapping disc
blocks avoids the additional overhead associated with buffer cache
management.

Actually, the interesting column is the "I/Os" column, which tells us how
many I/O operations were scheduled. It is not available for SunOS 3 raw
devices, but it tells us (and I have other data that confirms this) that
even if we are reading two pages at a time, SunOS 4 actually fetches six
at a time, that is, it does heavy clustering of I/O requests (hopefully
only when it detects sequential access). This looks like a big win,
*for sequential access*.
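
Working that column backwards (rough arithmetic on the table above):

   24 MB / 530 I/Os  = ~46K per I/O = about 6 pages of 8K   (SunOS 4)
   10 MB / 5100 I/Os = ~2K per I/O                          (SunOS 3, block)

so the SunOS 3 block device is doing roughly 2K transfers, while SunOS 4 is
scheduling about six pages per request.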

As to the actual faulting and page copying, apparently it does not matter
a lot, given the CPU speed and the overlapping of I/O operations.

So the result here is that eliminating the buffer cache means that mapped
devices exploit the available bandwidth well, while using the buffer
cache and passing through the strategy function therein reduces effective
bandwidth to 20%.

It would be interesting to see the bandwidth reduction due to the filesystem
under both technologies. If anybody wants to have a go... Remember that
you must unmount and remount a filesystem before each test, to invalidate
any in-core pages.

It would be interesting to see similar numbers for HP-UX (one of whose
incarnations used to have an extent-based filesystem, BTW).
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk