aglew@crhc.uiuc.edu (Andy Glew) (11/20/90)
I just helped a friend look at a performance problem that was almost certainly due to excessive disk seeking, and it causes me to ask: Why don't disks/disk controllers/computer systems provide more meaningful measures of disk seeking and other I/O activity? Disk seeking is such an important bottleneck on many machines that it sure would be nice to have a way of directly measuring it, instead of inferring it from other observations.

For example, in our recent performance problem, the workload was both CPU and I/O intensive (uncompression and analysis of large traces). It was being run on an Encore with eight processors. Interactive response slowed to a crawl, but the applications were not eating much CPU. Paging was nil, I/O modest, but system idle time was up around 70%. Turns out that several large files were being read from disk, uncompressed, and written back to the same disk => lots of time spent waiting for disk heads to seek back and forth.

At least, seeking seems the likely culprit --- but it sure would be nice to have a way of *verifying* this, e.g. by looking at the disk activity monitor and observing that the disk was 100% busy, spending 10% of the time on track, 5% of the time transferring, and 85% of the time seeking... It sure would be nice to be able to see this disk activity, instead of guessing (no matter how informed) --- especially when the application is going to have to be recoded to test this hypothesis out.

Why should disks not have performance counters on board? Something like a counter that ticks every millisecond the drive is seeking, and is then sampled by the OS to get a usage profile.

On a more primitive level, at Gould I added "Kiviat profiling" to the kernel. This took maybe 5 minutes, and provided cross products of the system activity: instead of %user, %system, %waitio and %idle, it gave %(user.io), %(user.!io), %(sys.io), %(sys.!io), %(idle.io), %(idle.!io).
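
To make the cross-product idea concrete, here is a minimal sketch in Python. It is not the Gould kernel code, which I no longer have. It assumes a hypothetical sampler that records, on every clock tick, the CPU state and whether any disk I/O is outstanding; the function name `kiviat_profile` and the sample data are invented for illustration.

```python
from collections import Counter

# CPU states sampled on each clock tick (assumed set, matching the post).
CPU_STATES = ("user", "sys", "idle")


def kiviat_profile(samples):
    """Return the percentage of ticks spent in each (cpu_state, io) combination.

    `samples` is an iterable of (cpu_state, io_busy) pairs, one per tick,
    where io_busy is True if any disk I/O was outstanding at that tick.
    """
    counts = Counter()
    total = 0
    for cpu_state, io_busy in samples:
        if cpu_state not in CPU_STATES:
            raise ValueError(f"unknown CPU state: {cpu_state!r}")
        counts[(cpu_state, bool(io_busy))] += 1
        total += 1

    profile = {}
    for cpu_state in CPU_STATES:
        for io_busy in (True, False):
            label = f"{cpu_state}.{'io' if io_busy else '!io'}"
            ticks = counts[(cpu_state, io_busy)]
            profile[label] = 100.0 * ticks / total if total else 0.0
    return profile


# Made-up samples: mostly idle while I/O is pending, a little real work.
samples = (
    [("idle", True)] * 70
    + [("user", False)] * 20
    + [("sys", True)] * 10
)
profile = kiviat_profile(samples)

for label, percent in profile.items():
    print(f"%({label}) = {percent:.1f}")
```

With these invented samples, the plain marginals would only report 70% idle and 20% user. The cross products show that all of the idle time coincides with outstanding I/O, and none of the user time does, which is the signature of a job that is not overlapping computation with its disk traffic.
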
Actually, I took it a bit further: I had two processors, so I looked at %(user1.user2.io), %((user1.sys2+user2.sys1).io), etc. (any meaningful reduction), and similarly for multiple drives. This was very useful in diagnosing problems obtaining I/O overlap, etc.

Of course, due to the wonders of private enterprise, my old Kiviat code is inaccessible and I no longer have UNIX source (at least, our vendors won't provide us with the source for OS updates until the update is obsolete).

Please, systems people: if you're worried about performance, provide things like Kiviat profiling and disk seek counters! Provide ways of measuring the magnitude of real performance problems.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]