aglew@crhc.uiuc.edu (Andy Glew) (11/20/90)
I just helped a friend look at a performance problem that was almost
certainly due to excessive disk seeking, and it made me ask: why
don't disks, disk controllers, and computer systems provide more meaningful
measures of disk seeking and other I/O activity?
Disk seeking is such an important bottleneck on many machines that it
sure would be nice to have a way of directly measuring it, instead of
inferring it from other observations.
For example, in our recent performance problem, the workload was both
CPU and I/O intensive (uncompression and analysis of large traces).
It was being run on an Encore with eight processors. Interactive
response slowed to a crawl, but the applications were not eating much
CPU. Paging was nil, I/O modest, but system idle time was up around
70%. It turned out that several large files were being read from disk,
uncompressed, and written back to the same disk, so a lot of time was
spent waiting for the disk heads to seek back and forth.
At least, seeking seems the likely culprit, but it sure would be
nice to have a way of *verifying* this, e.g. by looking at a disk
activity monitor and observing that the disk was 100% busy, spending
10% of its time on track, 5% transferring, and 85%
seeking...
It would be far better to see this disk activity directly
instead of guessing (no matter how informed the guess), especially when the
application will have to be recoded just to test the
hypothesis.
Why shouldn't disks have performance counters on board? Something
like a counter that ticks once for every millisecond the drive spends seeking,
which the OS samples periodically to build a usage profile.
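A minimal sketch of the OS side of such a scheme, in Python for brevity. It assumes a hypothetical drive exposing free-running 32-bit millisecond counters for four states: seeking, transferring, on track but not transferring (e.g. waiting for the sector to come around), and idle. No real drive interface is implied; `DiskCounters`, `delta`, `usage_profile`, and the counter width are all invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

COUNTER_MOD = 1 << 32  # assumed: free-running 32-bit counters that wrap around


@dataclass
class DiskCounters:
    """One OS sample of hypothetical per-state millisecond counters on a drive.

    The four states are assumed to be mutually exclusive and to cover all
    time, so their deltas over an interval add up to the interval length.
    """
    seek_ms: int
    transfer_ms: int
    on_track_ms: int  # heads on track but not transferring (e.g. rotational wait)
    idle_ms: int


def delta(new: int, old: int) -> int:
    """Ticks elapsed between two samples, tolerating one counter wraparound."""
    return (new - old) % COUNTER_MOD


def usage_profile(old: DiskCounters, new: DiskCounters) -> dict[str, float]:
    """Percentage of the sampling interval the drive spent in each state."""
    ticks = {
        "seek": delta(new.seek_ms, old.seek_ms),
        "transfer": delta(new.transfer_ms, old.transfer_ms),
        "on_track": delta(new.on_track_ms, old.on_track_ms),
        "idle": delta(new.idle_ms, old.idle_ms),
    }
    total = sum(ticks.values())
    if total == 0:
        return {state: 0.0 for state in ticks}
    return {state: 100.0 * n / total for state, n in ticks.items()}


# Two samples one second apart, matching the 85% / 5% / 10% picture above.
before = DiskCounters(seek_ms=10_000, transfer_ms=2_000, on_track_ms=3_000, idle_ms=50_000)
after = DiskCounters(seek_ms=10_850, transfer_ms=2_050, on_track_ms=3_100, idle_ms=50_000)
print(usage_profile(before, after))
# {'seek': 85.0, 'transfer': 5.0, 'on_track': 10.0, 'idle': 0.0}
```

Because only differences between samples matter, the drive never has to be told when to start or stop counting; the OS can sample at whatever rate it likes.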
On a more primitive level, at Gould I added "Kiviat profiling" to the
kernel. It took maybe 5 minutes to add, and it reported cross products
of system activity: instead of %user, %system, %waitio and %idle, it
gave %(user.io), %(user.!io), %(sys.io), %(sys.!io),
%(idle.io), %(idle.!io).
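The original code is gone, so what follows is only a rough sketch of the idea, under the assumption that "io" means at least one disk request was outstanding when the clock tick was taken. In a kernel this would amount to an array increment in the clock interrupt handler; here `tick()` stands in for that hypothetical hook, and `KiviatProfile`, `CpuState`, and `report` are invented names.

```python
from __future__ import annotations

from collections import Counter
from enum import Enum


class CpuState(Enum):
    USER = "user"
    SYS = "sys"
    IDLE = "idle"


class KiviatProfile:
    """Clock-tick histogram of CPU state crossed with 'disk I/O outstanding?'."""

    def __init__(self) -> None:
        self.bins: Counter[tuple[CpuState, bool]] = Counter()
        self.ticks = 0

    def tick(self, cpu: CpuState, io_pending: bool) -> None:
        """Stand-in for a hypothetical clock-interrupt hook: one call per tick."""
        self.bins[(cpu, io_pending)] += 1
        self.ticks += 1

    def report(self) -> dict[str, float]:
        """%(state.io) and %(state.!io) for each CPU state; the six values sum to 100."""
        out: dict[str, float] = {}
        for cpu in CpuState:
            for io in (True, False):
                label = f"{cpu.value}.{'io' if io else '!io'}"
                count = self.bins[(cpu, io)]
                out[label] = 100.0 * count / self.ticks if self.ticks else 0.0
        return out
```

In this reading, idle.io corresponds roughly to the traditional %waitio and idle.!io to %idle, while user.io and sys.io show how much computation actually overlapped I/O, which the usual four-way split cannot reveal.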
Actually, I took it a bit further: I had two processors, so I
looked at %(user1.user2.io), %((user1.sys2+user2.sys1).io), etc.
(any meaningful reduction), and similarly for multiple drives. This
was very useful in diagnosing problems with obtaining I/O overlap, etc.
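The same histogram extends to several CPUs and drives by keying it on the whole joint state and then summing whichever bins a question needs. Another hedged sketch along those lines; `JointKiviat`, `percent`, and the predicate functions are hypothetical names, and CPU states are plain strings for brevity.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable

CpuStates = tuple[str, ...]    # one of "user", "sys", "idle" per processor
DriveBusy = tuple[bool, ...]   # one busy flag per disk drive


class JointKiviat:
    """Clock-tick histogram over the joint state of all CPUs and all drives."""

    def __init__(self) -> None:
        self.bins: Counter[tuple[CpuStates, DriveBusy]] = Counter()
        self.ticks = 0

    def tick(self, cpus: CpuStates, drives: DriveBusy) -> None:
        """Record one clock tick's snapshot (hypothetical sampling hook)."""
        self.bins[(cpus, drives)] += 1
        self.ticks += 1

    def percent(self, pred: Callable[[CpuStates, DriveBusy], bool]) -> float:
        """Percentage of ticks whose joint state satisfies pred: one 'reduction'."""
        if self.ticks == 0:
            return 0.0
        hits = sum(n for (cpus, drives), n in self.bins.items() if pred(cpus, drives))
        return 100.0 * hits / self.ticks


# Example reductions, written for two CPUs:
def user1_user2_io(cpus: CpuStates, drives: DriveBusy) -> bool:
    """%(user1.user2.io): both CPUs in user mode while some drive is busy."""
    return cpus == ("user", "user") and any(drives)


def mixed_user_sys_io(cpus: CpuStates, drives: DriveBusy) -> bool:
    """%((user1.sys2 + user2.sys1).io): the two disjoint bins summed."""
    return sorted(cpus) == ["sys", "user"] and any(drives)


def all_drives_busy(cpus: CpuStates, drives: DriveBusy) -> bool:
    """Drive-level overlap: every drive busy on the same tick."""
    return all(drives)
```

Keeping the full joint histogram and choosing the reduction afterwards means a new question, such as how often both drives were busy at once, needs no new kernel instrumentation.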
Of course, due to the wonders of private enterprise, my old Kiviat
code is inaccessible and I no longer have UNIX source (or rather, our
vendors won't provide us with the source for OS updates until the
update is obsolete).
Please, systems people: if you're worried about performance, provide
things like Kiviat profiling and disk seek counters! Provide ways of
measuring the magnitude of real performance problems.
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]