[comp.arch] Performance & Diagnosis

rdh@sli.com (Robert D. Houk) (08/11/89)

In article <5818@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>In article <559@halley.UUCP> tjd@foghorn.mpd.tandem.com (Tom Davidson) writes:
>>>Not that it makes much difference, but the ETA-10 has several extra registers
>>>to keep track of cycle counts for the vector and scalar units.
>>
>>As John mentions, some "registers" kept such goodies as a clock counter (in
>>whatever periods the particular cpu was running: 7, 10.5, 19ns etc), vector
>>unit busy.  It also had 5 programmable counters which could be set to track
>>such things as
>>	. number of in stack branches
>>	. number of branches NOT taken
>>	. number of times opcode xx was executed
>>and a whole host of other neat things.  All this could be accessed from a
>>fortran program.
>
>One thing that the ETA lacks is a count of the page table traffic
>generated by the memory management unit.
>
>When a programmer suspects thrashing, the average OS can help by
>reporting paging rates, task switch counts, interrupt load, ethernet
>packets, and so on. The OS typically is unable to report on cache
>traffic or on TLB traffic. To the serious performance tuner, this
>is a flaw. On rare occasions, it's even a serious flaw.
>-- 
>Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science

One of the things I really *HATE* about most "modern" computer systems is
their total black-box'edness. Some don't even have power lights. I really
miss "lights". They enable one to "at a glance" know amazingly well what
a machine is up to. They enable one to perform "miracles" of online diag-
nosis of ailing systems. If the cpu doesn't work, lights won't help all
that much, but for misbehaving systems they were godsends. 'fer instance,
I remember a time when the KI-10 (ye olde second-generation PDP-10) I was
on just hung. Stopped. Did not respond. A friend and I simultaneously
converged on the machine room to check it out. Not a light flashing. Hmm,
looks like it halted. Glance at console terminal to see halt reason. Odd,
no crash info. Oh, the RUN light is on. Odd. No PI channel active, and
they're all enabled. OH - look at that, PI system is not on. This is
wrong! Hey, how's about we toggle in a CONO PI,PION instruction and XCT
it from the switch register. Poof! Timesharing resumes. Disks resume
chattering. Terminals resume terminaling. And all that. No user lost any
data (except terminal typein of course - not a single typeout character
was lost, no disk files lost or corrupted, only a 3-minute pause in ser-
vice was noticed).

What is the point, all you single-user workstation users ask? Just reboot
the machine if you have a problem? Well, the point is that in less time
than it takes my Sun3/Unix to reboot (and it only has one little ole WrenV
disk), we were able to walk down the hall, look at a hung system, diagnose
it, and cure it. NO DATA/FILES LOST, OR CORRUPTED (Ha - just try that, you
UNIX workstation users, see how many files you lose when you forcibly re-
boot!)

Now I'm told that the reason for the demise of the lights was that they
accounted for 10% of the cost of the system (not an insignificant amount,
that). Further, I guess that with a front-panel "footprint" of about 2 inches
by 20 inches for modern workstations (e.g., my Sun3/60), there's not a lot
of room for lights. But boy do I get annoyed when I come in, the screen is
blanked out, and won't unblank. How helpless! I can't even tell if the damn
power is on or not! (well, OK, I can feel how hot it is and surmise that
it is on if it is hotter than ambient, or I can kill the AC/Stereo/Office
chatter and probably hear the little boxer fan whirring away, and I even
gotta admit that there is this little LED array in back beside/underneath
the Ethernet/keyboard connectors that, if flashing, would tell me that the
power is at least on - assuming I feel like clearing a path to the backside
of my machine).

The point is that today's systems simply do not seem to pay much attention
to providing system diagnosis or tuning info. I consider this to be a
serious flaw in their architecture. (Don't get me wrong, I am not a total
curmudgeon - I am truly impressed with the advances H/W technology has
provided: a multiple-MIPS many-MB system in a box smaller than ONE power
supply (out of several) in ONE memory box/rack (out of many) for a less-
powerful (raw-MIPS-wise) 10-year-old cpu, all at a cost less (in adjusted
dollars) than a 2400 baud 24-line x 80-character Video Display Terminal
of 20 years ago... It's just that in the race for H/W miracles I think one
important aspect of total system design has been lost in the name of saving
a few up-front dollars - at an untold cost of lost and reduced manhours at
some later point in the product life.)

As to "counters", the KL-10 (another of ye olde PDP-10 processors, although
of a more recent vintage, it's only 10 years old) while having no lights
(not true, it had a power light, and a fault light; also the PDP-11 front-
end had, let's see, 16 data lights, 18 address lights, and 4 (6?) cpu
status lights, all of which were more-or-less useless, not even being enter-
taining to watch), did have a "PERF" board. This board was a "peripheral"
device that just sorta sat on the cpu's most intimate inner busii and
watched stuff. You could program it to count in either event mode (number
of times "X" happened) or in duration mode (count of clock cycles "X" was
happening), for "X"es of PI activity, I/O activity, and half a dozen other
useful tidbit-wise things. You could use the PERF board to measure the cache
hit ratio. Etc. And so on. A very useful tool for measuring just what your
system was doing. System performing poorly? Just glance at the display,
and ask an obvious question like "Why is the system spending 40% of its time
at PI 4 (the network PI channel)?" and proceed from there to investigate who
is flooding the network with bogus packets. Etc. This sort of functionality
is readily adaptable to today's silicon miracles (and in fact pretty much
has to be tightly integrated into the cpu chips to even exist), but I don't
see it happening. At least, not in the mainstream cpus (Intel - anyone
home? Motorola? National? Anyone? Anywhere? Sigh.)
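
For anyone who hasn't met the two counting modes, here is a minimal sketch
of the difference. It is written in Python purely for illustration; the
function names, the sampled-signal representation, and the rising-edge
definition of "happened" are all my assumptions, not the actual PERF board
interface. Event mode counts how many times a condition starts, duration
mode counts how many clock cycles it lasts, and things like "percent of
time at PI 4" or a cache hit ratio fall out as ratios of such counts.

```python
# Hypothetical model of the two counting modes; NOT the real PERF board interface.
# A "signal" is a list with one boolean sample per clock cycle.

def count_events(signal):
    """Event mode: count the times the condition goes from false to true."""
    count = 0
    previous = False
    for asserted in signal:
        if asserted and not previous:
            count += 1
        previous = asserted
    return count


def count_duration(signal):
    """Duration mode: count the clock cycles during which the condition is true."""
    return sum(1 for asserted in signal if asserted)


def percent_busy(signal):
    """Share of sampled cycles in which the condition was true, as a percentage."""
    cycles = len(signal)
    return 100.0 * count_duration(signal) / cycles if cycles else 0.0


def hit_ratio(hits, references):
    """Ratio of two event counts, e.g. cache hits over memory references."""
    return hits / references if references else 0.0


# Made-up samples: True while (say) PI channel 4 is active.
pi4_active = [False, True, True, False, False, True, True, True, False, False]

print("activations:", count_events(pi4_active))
print("cycles active:", count_duration(pi4_active))
print("percent of time active:", percent_busy(pi4_active))
```

The same stream of samples gives two different numbers depending on the
mode, which is why you want both: many short activations and one long one
can look identical in duration mode and completely different in event mode.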

					-RDH

P.S.	I will admit that the H/W people have a good gripe with the S/W
	people about not using the nifty H/W provided. The only reason that
	TOPS-10 used the PERF board provided in every KL-10 system is that
	I "discovered" it one day whilst idly perusing the print set (Hey,
	what's this thingie? Looks neat, how do I use it? Wowie, neato
	stuff. You mean no one uses it? Well, can't have this, it's too neat
	not to use!), thought it was neat, and "slipped" it into the system
	one weekend when no one much was looking. Shipped it as an
	unsupported tool. H/W people get annoyed when S/W people ignore their
	nifty toys, so all you S/W types need to encourage your H/W types...