[comp.arch] Cycle Counter

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (08/10/89)

In the above-referenced message, Dirk Grunwald asks about cycle counters:

>I know something like this exists on the Cray X-MP; do other machines
>have cycle counters as well?

Not that it makes much difference, but the ETA-10 has several extra
registers to keep track of cycle counts for the vector and scalar units
separately.

On the FSU machine, these registers are publicly accessible, and we
have a utility which gives the fraction of vector cycles for each 
process in the system.

This makes for entertaining user's group meetings:

	Person A:"Why don't you get your !@#$^& scalar code off 
		  of our vector computer!"
	Person B:"Well, your graduate student ran a 10-hour job
		  yesterday at 0.007% vector utilization!"
	etc....

>Being able to set it would mean that you might not care if it was only
>32bits, since you set it to 0 to time routines.  With a 20 nanosecond
>clock, it would only be good for 86 seconds, but that might be enough.

The ETA-10 has 48 bits of usable integer (though it uses 64 bits of
storage).  At 142 MHz, this allows about 11.4 days before rollover,
and the machine tends to be rebooted more frequently than this....
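
The rollover figures above can be checked with a few lines of Python.
This is only a sketch: the function name is made up for illustration,
and the 7 ns period is an assumed stand-in for "142 MHz".  The 11.4-day
figure is what 2^47 counts give, as if the 48-bit integer were signed;
that reading is an assumption, not something stated in the post.

```python
SECONDS_PER_DAY = 86_400

def rollover_seconds(bits: int, period_ns: float) -> float:
    """Seconds until a free-running counter of the given width wraps."""
    return (2 ** bits) * period_ns * 1e-9

# 32-bit counter on a 20 ns clock, as in the quoted question.
print(rollover_seconds(32, 20))                      # roughly 86 seconds

# ETA-10 with an assumed 7 ns period (about 142 MHz).
print(rollover_seconds(47, 7) / SECONDS_PER_DAY)     # roughly 11.4 days
print(rollover_seconds(48, 7) / SECONDS_PER_DAY)     # roughly 22.8 days
```

If all 48 bits counted, rollover would take about twice as long as the
figure quoted.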
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu - mccalpin@nu.cs.fsu.edu
		   mccalpin@delocn.udel.edu

grunwald@flute.cs.uiuc.edu (Dirk Grunwald) (08/10/89)

Hi,

Another ``how much does this cost'' question.

When doing performance monitoring, benchmarking or profiling, you want
a high-resolution timer. Some systems have microsecond timers, and
those are considered pretty snazzy; I know I was overjoyed when I
found one on the Encore. Normal machines, e.g., a Sun, have about 5
millisecond resolution. That's pathetic.

How much would it cost to add an additional register that would be
incremented each cycle? It doesn't need to flow through the ALU, it
would be doing a single count-up. One could conjecture using a
mode-flag to say ``yeah, count using this register'' -- if you didn't
want to use the counter, you'd have one extra register to play with.

I know something like this exists on the Cray X-MP; do other machines
have cycle counters as well?

Using a register has some advantages; it's a normal part of the processor
state, reducing save/restore cost. Also, processes can have a virtual
cycle counter, reflecting the cycle counter for that process alone.

Being able to set it would mean that you might not care if it was only
32bits, since you set it to 0 to time routines.  With a 20 nanosecond
clock, it would only be good for 86 seconds, but that might be enough.
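
Resetting the counter to zero is one way to time a routine.  A common
alternative, if the counter cannot be written, is to subtract two reads
modulo the counter width.  The sketch below is not part of the proposal
above; the names are hypothetical, and it only gives the right answer
if the counter wrapped at most once between the two reads.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1
CYCLE_NS = 20  # clock period used in the post

def elapsed_cycles(start: int, end: int) -> int:
    """Cycles between two counter reads, allowing for a single wrap."""
    return (end - start) & COUNTER_MASK

def elapsed_ns(start: int, end: int) -> int:
    """Same interval expressed in nanoseconds."""
    return elapsed_cycles(start, end) * CYCLE_NS

# The second read is numerically smaller because the counter wrapped.
print(elapsed_cycles(0xFFFFFFF0, 0x00000010))
```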

ttl@astroatc.UUCP (Tony Laundrie) (08/10/89)

In article <GRUNWALD.89Aug9162836@flute.cs.uiuc.edu> grunwald@flute.cs.uiuc.edu writes:
>
>I know something like this exists on the Cray X-MP; do other machines
>have cycle counters as well?
>

The famous Astronautics ZS-1 has a 64-bit register that increments every
clock period of 45 nS, making it good for 263 centuries.

-----

tjd@foghorn.mpd.tandem.com (Tom Davidson) (08/10/89)

>Not that it makes much difference, but the ETA-10 has several extra registers
>to keep track of cycle counts for the vector and scalar units.

Actually, for performance analysis, the ETA10 had some rather useful hardware.
As John mentions, some "registers" kept such goodies as a clock counter (in
whatever period the particular CPU was running: 7, 10.5, 19 ns, etc.) and
vector unit busy.  It also had 5 programmable counters which could be set to
track such things as
	. number of in-stack branches
	. number of branches NOT taken
	. number of times opcode xx was executed
and a whole host of other neat things.  All this could be accessed from a
Fortran program.

These counters were kept on a per-process basis in a state area called an
"invisible package".

Performance analysis and code profiling were made a lot easier with this
type of hardware feature.  I hope h/w architects are doing the same....

Tom



Tom Davidson			internet: halley!foghorn!tjd@cs.utexas.edu 
Tandem Computers, Inc.		fax: (512) 244-8247 voice: (512) 244-8375
14231 Tandem Boulevard
Austin, TX 78728-6610

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/10/89)

In article <559@halley.UUCP> tjd@foghorn.mpd.tandem.com (Tom Davidson) writes:
>>Not that it makes much difference, but the ETA-10 has several extra registers
>>to keep track of cycle counts for the vector and scalar units.
>
>As John mentions, some "registers" kept such goodies as a clock counter (in
>whatever periods the particular cpu was running: 7, 10.5, 19ns etc), vector
>unit busy.  It also had 5 programmable counters which could be set to track
>such things as
>	. number of in stack branches
>	. number of branches NOT taken
>	. number of times opcode xx was executed
>and a whole host of other neat things.  All this could be accessed from a
>Fortran program.

One thing that the ETA lacks is a count of the page table traffic
generated by the memory management unit.

That's not too surprising, because I don't know of any production
machine that has this.  But they all should! 

When a programmer suspects thrashing, the average OS can help by
reporting paging rates, task switch counts, interrupt load, ethernet
packets, and so on. The OS typically is unable to report on cache
traffic or on TLB traffic. To the serious performance tuner, this
is a flaw. On rare occasions, it's even a serious flaw.
-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science

seanf@sco.COM (Sean Fagan) (08/11/89)

In article <MCCALPIN.89Aug9153545@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>In the above-referenced message, Dirk Grunwald asks about cycle counters:
>>I know something like this exists on the Cray X-MP; do other machines
>>have cycle counters as well?
>Not that it makes much difference, but the ETA-10 has several extra
>registers to keep track of cycle counts for the vector and scalar units
>separately.

I've been told that the Elxsi has a clock register available, with 25ns
resolution.  Since the cycle time is 25 ns, this makes it possible to time
any instruction (mov <clock>, r1; instruction; sub <clock>, r1; or whatever
the syntax would be).  Since it's also 64 bits, I believe its epoch is
something like 14000 years from now...
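
The read-execute-read pattern looks like this in Python, as a rough
sketch only.  time.perf_counter_ns() stands in for the hardware clock
register, the helper names are made up, and taking the minimum over
many runs is a choice made here to reduce noise, not something from the
post.

```python
import time

def time_once(fn) -> int:
    """Elapsed nanoseconds for one call, reading a clock before and after."""
    start = time.perf_counter_ns()   # stands in for: mov <clock>, r1
    fn()                             # the operation being timed
    end = time.perf_counter_ns()     # stands in for: sub <clock>, r1
    return end - start

def time_min(fn, repeats: int = 1000) -> int:
    """Smallest of several measurements, to filter out interruptions."""
    return min(time_once(fn) for _ in range(repeats))

def nothing() -> None:
    pass

# Measure the cost of the measurement itself, then subtract it.
overhead = time_min(nothing)
cost = time_min(lambda: sum(range(1000)))
print(cost - overhead, "ns (approximate)")
```

On a general-purpose OS the result is only approximate; a clock that
ticks once per cycle is what makes the single-instruction version
plausible.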

-- 
Sean Eric Fagan  |    "Uhm, excuse me..."
seanf@sco.UUCP   |      -- James T. Kirk (William Shatner), ST V: TFF
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

ram@shukra.Sun.COM (Renu Raman) (08/13/89)

In article <GRUNWALD.89Aug9162836@flute.cs.uiuc.edu> grunwald@flute.cs.uiuc.edu writes:
>
>Hi,
>
>Another ``how much does this cost'' question.
>
>When doing performance monitoring, benchmarking or profiling, you want
>a high-resolution timer. Some systems have microsecond timers, and
>those are considered pretty snazzy; I know I was overjoyed when I
>found one on the Encore. Normal machines, e.g., a Sun, have about 5
>millisecond resolution. That's pathetic.

   Depends on what kind of a "normal" Sun you have.  Anything since
   SPARCstation should have a micro-second timer (only 21 bits tho') - so
   2 seconds is all you have if you want to watch anything.

   renu raman

dwc@cbnewsh.ATT.COM (Malaclypse the Elder) (08/13/89)

In article <GRUNWALD.89Aug9162836@flute.cs.uiuc.edu>, grunwald@flute.cs.uiuc.edu (Dirk Grunwald) writes:
> 
> When doing performance monitoring, benchmarking or profiling, you want
> a high-resolution timer. Some systems have microsecond timers, and
> those are considered pretty snazzy; I know I was overjoyed when I
> found one on the Encore. Normal machines, e.g., a Sun, have about 5
> millisecond resolution. That's pathetic.
> 
in a paper that we presented in the 88 summer usenix, we describe
a high resolution timing and tracing package for unix system v (called
casper) that takes advantage of the fact that most systems now use
programmable interval timers to generate their clock interrupts.  these
interval timers are usually loaded with an initial value, count down
at a rate that is determined by an external clock signal, and generate
the clock interrupt when it hits zero.  they then reload their initial
value and start over again.  the nice thing about these things is that
they are usually driven at a fairly high rate.

using these interval timers, our package is able to deliver 10 microsecond
resolution on at&t's 3b2 computers and 1 microsecond resolution on
the at&t 6386s.  not too shabby and cheap too.  and yes, when we looked
at the suns, we found that they used some hardwired interrupt generator
so we could only get clock interrupt resolutions (10 milliseconds).

the other nice thing about doing things this way is that since there
is usually a kernel variable keeping count of the number of clock interrupts
since boot, we can combine the value of the interrupt counter with the
value in the countdown timer and not worry about wrap-around.  there are
problems introduced by the fact that looking at the kernel variable and
the countdown timer is not an atomic operation but i refer interested
parties to the paper for details.
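
one common way to deal with the non-atomic read is to read the
interrupt counter, read the countdown timer, then read the counter
again and retry if it changed.  the sketch below shows that generic
technique on a toy model; it is not taken from the paper, and RELOAD,
FakeClockHardware and read_time are hypothetical names.  it also
ignores the case where the timer has reloaded but the interrupt has not
been serviced yet.

```python
RELOAD = 10_000   # hypothetical: timer counts per clock interrupt

class FakeClockHardware:
    """Toy stand-in for the kernel tick variable plus the countdown timer.

    Every read advances simulated time, so two reads are never simultaneous.
    """

    def __init__(self, now: int, step: int = 3):
        self.now = now
        self.step = step

    def _advance(self) -> int:
        t = self.now
        self.now += self.step
        return t

    def read_ticks(self) -> int:
        # Number of clock interrupts since boot.
        return self._advance() // RELOAD

    def read_countdown(self) -> int:
        # Counts down from RELOAD; reloads when it would reach zero.
        return RELOAD - (self._advance() % RELOAD)

def read_time(hw: FakeClockHardware) -> int:
    """Timer counts since boot, retrying if a tick landed between reads."""
    while True:
        ticks = hw.read_ticks()
        countdown = hw.read_countdown()
        if hw.read_ticks() == ticks:      # no interrupt in between
            return ticks * RELOAD + (RELOAD - countdown)

# Start just before a tick boundary so the first attempt has to be retried.
print(read_time(FakeClockHardware(now=9_999)))
```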

danny chen
att!hocus!dwc

melvin@vangogh.Berkeley.EDU (Steve Melvin) (08/15/89)

In article <GRUNWALD.89Aug9162836@flute.cs.uiuc.edu> grunwald@flute.cs.uiuc.edu writes:
>
>Hi,
>
>Another ``how much does this cost'' question.
>
>When doing performance monitoring, benchmarking or profiling, you want
>a high-resolution timer. Some systems have microsecond timers, and
>those are considered pretty snazzy; I know I was overjoyed when I
>found one on the Encore. Normal machines, e.g., a Sun, have about 5
>millisecond resolution. That's pathetic.

First, some clarification.  All Sun 3's and Sun 4's lack a high
resolution timer.  Sun 2's had them and SPARCstations have them.  In the
Sun 3's and Sun 4's there is a 10ms hardware interrupt which in SunOS
is ignored every other time to generate a 20ms interrupt for use by the
operating system.

Fortunately, however, Sun put sockets in these machines for data encryption
chips (the whole encryption chip story is interesting in itself).  Another
grad student here at Berkeley (Peter Danzig) and I have designed a small board
which plugs into these machines and allows a timer chip (being clocked at
4 MHz) to look like the encryption chip.  Then, with an appropriate device
driver installed, the software has access to high resolution time measurements
just as though the feature was built in.

Apparently, Sun at one time had intended to sell a data encryption option for
these machines.  The encryption chip provided for was the AMD Am9518 (which
implements the official data encryption standard (DES)).  In the 3/50 and 3/60,
all that was needed was to plug the chip in (and move a jumper in the case of
the 3/50) but in later models the chips needed to drive the 9518 (one or two
PALs and a buffer) were not supplied on the motherboard.  The idea was
apparently dropped and as far as we know the DES option has never been made
available.  It probably had a lot to do with the Feds (the DES chip is
supposedly not allowed to be exported from the US).  Having the socket enabled
by a PAL may have had something to do with controlling the use of the DES
chip.

Anyway, I think the answer to your question is that these kinds of things
cost very little in hardware and don't slow anything down, but since they
don't tangibly affect the bottom line performance, they are often ignored by
hardware designers.

-------
Steve Melvin
...!ucbvax!melvin				melvin@polaris.Berkeley.EDU
-------

jmk@alice.UUCP (Jim McKie) (08/17/89)

The Crisp CPU has the following register:

	Timer
	The timer is a 28 bit internal register which can be incremented
	every cpu clock cycle or at the completion of every instruction.
	The timer can also, optionally, interrupt the cpu when the count
	overflows. When read, the least significant bit of the timer appears
	on bit 4 of the resultant data and the low-order four bits are
	always zero. The timer is both readable and writeable, and the 
	counting function is controlled by the low three bits, which are
	write only. These timer register bits are used to configure the
	timer:

	- bit 0. When clear, the timer counts cycles. When set, the timer
	  counts completed instructions (folded branches do not count).
	- bit 1. When clear, the timer is on all the time (with reference
	  to bit 0). When set, the timer only counts when the PSW indicates
	  User execution level.
	- bit 2. When set, the timer will generate a timeout exception, not
	  an interrupt, when it overflows (goes from 0 to non-zero). It has
	  precedence over all exceptions except zero divide. Interrupts
	  have precedence over time outs.
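
The bit layout described above can be sketched as a pair of helpers.
This is illustrative Python, not CRISP code: the constant and function
names are invented, and placing the count at bit 4 on writes is an
assumption that mirrors the documented read format.

```python
TIMER_SHIFT = 4                   # count's least significant bit is bit 4
TIMER_MASK = (1 << 28) - 1        # 28-bit count

COUNT_INSTRUCTIONS = 1 << 0       # clear: count cycles
USER_LEVEL_ONLY = 1 << 1          # clear: count at every execution level
TIMEOUT_EXCEPTION = 1 << 2        # set: timeout exception on overflow

def decode_timer_read(word: int) -> int:
    """Extract the 28-bit count from a 32-bit value read from the timer."""
    return (word >> TIMER_SHIFT) & TIMER_MASK

def encode_timer_write(count: int, flags: int) -> int:
    """Build a 32-bit value to write: count in bits 4..31, flags in bits 0..2.

    The position of the count on writes is assumed, not documented above.
    """
    if not 0 <= count <= TIMER_MASK:
        raise ValueError("count does not fit in 28 bits")
    if flags & ~0b111:
        raise ValueError("only bits 0..2 are configuration bits")
    return (count << TIMER_SHIFT) | flags

# Count completed instructions, user level only, starting from zero.
word = encode_timer_write(0, COUNT_INSTRUCTIONS | USER_LEVEL_ONLY)
print(bin(word))
```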

A similar register was added to the locally-developed 68020-based grey-scale
bitmap terminal and has been used as the basis for a debugger.

Jim McKie	research!jmk -or- jmk@research.att.com

cbcscmrs@csun.edu (08/18/89)

In article <121192@sun.Eng.Sun.COM> ram@sun.UUCP (Renu Raman) writes:
>In article <GRUNWALD.89Aug9162836@flute.cs.uiuc.edu> grunwald@flute.cs.uiuc.edu writes:
>>When doing performance monitoring, benchmarking or profiling, you want
>>a high-resolution timer. Some systems have microsecond timers, and
>>those are considered pretty snazzy; I know I was overjoyed when I
>>found one on the Encore. Normal machines, e.g., a Sun, have about 5
>>millisecond resolution. That's pathetic.
>
>   Depends on what kind of a "normal" Sun you have.  Anything since
>   SPARCstation should have a micro-second timer (only 21 bits tho') - so
>   2 seconds is all you have if you want to watch anything.

I like nanosecond timers, built into the instruction set!
You can tell how far the head on the disk moved if you hit a page
fault!

The elxsi has a 25 nanosecond resolution process timer (to measure CPU time)
and a CPU wide real time clock that also has 25 nanosecond resolution.

Oh, unlike the 21-bit SPARCstation timer, on the elxsi you have to wait
a little longer if you want to see the counter overflow.  About 7,311
years and 284 days.  Yes, that is 63 bits. :-)  I think they are signed,
and no, I don't know why...  (Does it really matter at that point?)
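
For anyone who wants to check the years-and-days figure, here is a
small sketch in exact integer arithmetic.  The function name is
hypothetical, and a 365-day year is assumed because that is what
reproduces the number above.

```python
NS_PER_DAY = 86_400 * 10**9

def rollover_years_days(bits: int, period_ns: int, days_per_year: int = 365):
    """Whole (years, days) until a counter of the given width wraps."""
    total_ns = (2 ** bits) * period_ns       # exact, no floating point
    total_days = total_ns // NS_PER_DAY
    return divmod(total_days, days_per_year)

print(rollover_years_days(63, 25))   # 63-bit count at 25 ns per tick
print(rollover_years_days(64, 25))   # the same clock with all 64 bits
```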

Syncing the thing up to a chimer takes on a new meaning... :-)
(NTP time servers...)