[mod.computers.vax] VAX instruction timing

ins_ajbh@jhunix.BITNET (James B Houser) (01/09/87)

How do you go about getting instruction timing numbers for VAXen? Things
like how long a register-to-register move takes, etc. The local DEC
people have no idea and the documentation (sic) is not helpful. I am
particularly looking for 785 specs, but others would be good too.
I did some informal timing tests, and some of the results do not make a lot
of sense. How are you supposed to write really fast code without
these numbers? Also, how much do the values change between different versions
of the microcode?

yerazuws@CSV.RPI.EDU (Crah) (01/09/87)

In article <8701090448.AA22342@ucbvax.Berkeley.EDU>, ins_ajbh@jhunix.BITNET (James B Houser) writes:
> 
> How do you go about getting instruction timing numbers for VAXen? Things
> like how long a register-to-register move takes, etc.

The problem is that such timing numbers depend on what instructions 
have recently gone through the machine.  As far as I know, ALL
VAXes (excepting the /730 and /725) have a pipeline of some sort.  
On the big fast ones (like the 8800s) this pipeline is three instructions
deep.  So, if you are on an 8800 and run a loop consisting of 
something like
    
	TOP:	Load R1 with R2
		Increment R3
		Branch if R3 .ne. 0 to TOP

the pipeline will flush at the end of each loop, artificially
making the machine appear slower than it is.  Even the trick of 
adding a second load of R1 with R2 and measuring the difference 
in times between the two loops may not help, because some big 
VAXes have "bypasses" built into the microcode.  The ucode "sees"
that the value of R2 is already on an internal bus, and grabs it there
without refetching it from the register file - artificially DEcreasing
the time needed.
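The subtraction trick Bill describes can be sketched in modern terms (Python here rather than MACRO, and the helper names are my own invention): time an empty loop, time the loop with the operation in it, and take the difference - bearing in mind that, as he says, pipelines, bypasses and caches make the answer only an estimate.

```python
import time

def time_loop(iters, body=None):
    """Time a loop of `iters` iterations, optionally running `body` each
    pass, and return elapsed CPU seconds.  Run once with body=None to
    measure the bare loop overhead."""
    t0 = time.process_time()
    for _ in range(iters):
        if body is not None:
            body()
    return time.process_time() - t0

def per_op_estimate(iters, body):
    """Rough per-operation cost: timed loop minus empty-loop overhead.
    The result depends on the operation's neighbours (pipeline state,
    bypasses, cache contents), so treat it as an estimate only."""
    base = time_loop(iters)
    full = time_loop(iters, body)
    return (full - base) / iters
```

In a noisy environment the estimate can even come out negative for cheap operations, which is itself a demonstration of the point being made above.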
	
	Then we have to worry about cache contention, flushing, etc.,
and what the cache algorithms are on the particular CPU you are on,
and what kind of memory the CPU has (different memories may behave
differently under different fetching parameters, and depending on whether
those fetches were aligned, or at least fell into the same memory
block - some VAXes fetch 128 bits at a time).
	
	Conclusion: you simply have to TRY your application
and tweak where you think it's slow.  Consider: A Sun 3 with 
the standard CPU (not the $30,000 option one) should be twice
as fast as my uVAXstation II - but I find that for what I
use them for, the uVAX II is faster.  Why?  I dunno.
	
	Hope this eases your brain.
	
		-Bill Yerazunis

SYSMGR%UK.AC.KCL.PH.IPG@AC.UK (01/10/87)

Re: How does one time VAX instructions, why don't results make much sense,
    how is one expected to write really fast code without these figures...

The following is a personal opinion that tends to arouse controversy, but here
goes... If you don't want to read 5 or so screenfuls, hit delete now...

It is a waste of time to attempt to time individual instructions on
a VAX cpu, or indeed to attempt any code tweaking which does not improve the
algorithm of which the code is a possible realisation.

Why? Well, to start with the only instructions which are likely to have a
consistent execution time are register-to-register operations. You can time
these, quite easily, by setting up a trivial subroutine containing a loop
such as:
		.entry	test, ^M<r2>	; save R2 via the entry mask
		movl	#100000, r2	; loop counter
10$:		xxx			; instruction(s) under test
		sobgtr	r2, 10$		; decrement, branch while > 0
		ret
		.blkb	100		; space to patch into with DEBUG
		.end

(xxx is the instruction(s) to time). You run it first with xxx deleted to get
a time for the loop overhead, and then with instruction(s) to be timed.
Note that DEBUG can be used to deposit xxx, rather than MACRO and LINK each
time. Timing is accomplished by bracketing this routine with a pair of calls
to SYS$GETJPI returning CPUTIM (if you believe the results), or SYS$GETTIM to
measure real time, which is reliable if you have elevated your process priority
above the swapper's (i.e. real-time priority) and provided the environment is
quiet enough that few if any interrupts are being handled. Note that this latter
approach will annoy other users (if any) and requires privilege (ALTPRI).

It is these considerations which lead to my opinion. A more complex piece of
code, such as a memory-to-memory copy of a buffer, has a time which depends both
on virtual memory management (unless you lock the buffers into your working
set, which is usually unrealistic) and on other system hardware activity, which
cannot be avoided.

VM management is tuned by a large number of SYSGEN parameters, 'correct' setting
of which is a subject of much debate amongst system managers. In fact there are
no right values, the 'best' values depend on your system configuration,
type of workload, management objectives, etc. The CPU time taken by the buffer
copy must include page fault overheads to be realistic, but these depend both on
your SYSGEN, and also on how busy the system is when you run your test. If there
are no other users, pages of memory lost from your process working set simply
sit around in semiconductor memory until they are next needed (unless you exceed
the physical memory size of your VAX), and are retrieved with a 'soft' page
fault which involves no disc IO. In contrast, if the system is busy, a page
lost by your process will probably be written back to disc and grabbed by
some other process, so you incur a second disc IO when you next fault it in.
'Hard' page faults take a lot longer (in both CPU and real time) than 'soft'
ones.
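A back-of-envelope model makes the quiet-versus-busy difference concrete. The fault costs below are hypothetical placeholders, not measured VAX figures; the point is only that hard faults, being disc IOs, are orders of magnitude dearer than soft ones:

```python
def run_time_ms(cpu_ms, soft_faults, hard_faults,
                soft_cost_ms=0.1, hard_cost_ms=10.0):
    """Model of observed run time for a job: pure CPU time plus fault
    service time.  A soft fault is a cheap in-memory reclaim; a hard
    fault includes disc IO.  Both cost figures are illustrative only."""
    return cpu_ms + soft_faults * soft_cost_ms + hard_faults * hard_cost_ms

# The same job faulting the same 1000 pages, on a quiet system (all
# faults soft) versus a busy one (the same pages now come back from disc):
quiet = run_time_ms(100.0, 1000, 0)   # 100 + 1000*0.1  = 200.0 ms
busy  = run_time_ms(100.0, 0, 1000)   # 100 + 1000*10.0 = 10100.0 ms
```

With these (invented) numbers the busy-system run is some fifty times slower for identical code, which is why the fault mix matters more than any instruction-level tweak.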

As for hardware activity, only one device can be a bus master at once. If
a DMA transfer to memory is in progress, the CPU may have to wait for one
or more bus cycles until it can become bus master and access memory. Whether
this is significant depends on the amount of DMA activity and on the
bandwidth of the main bus (SBI, BI, CMI) and its interaction with the device
bus (UNIBUS, QBUS); I have heard that it is particularly significant with
a UNIBUS adapter on a BI-bus VAX.

If this sounds rather theoretical, I know that on our VAX some jobs can take
twice as much CPU time on a busy system as on a lightly-loaded one. When users
complain, I point out that this effect is common in the real world; most
companies offer discounts to customers for off-peak resources or to shift
less popular products. I also noticed that when we doubled our physical
memory, many jobs got lots faster, and when we went to VMS V4, many jobs
slowed down, both of which make rather a nonsense of CPU MIP ratings.

Note that my opinion only applies to instruction tweaking that does not
improve the algorithm. A while ago (on VMS 1.5 I think!) I transferred a
two-dimensional FFT routine from a CDC 7600 to our VAX. On the 7600,
hand-coded assembler in the inner loops reduced runtime by 60%. On the
VAX, I achieved 10%, and not because the compiler was already optimising
registers well enough - I took a look at the compiler's code and
shuddered! In contrast, an algorithmic improvement that reduced the
number of complex multiplies involved in the computation saved much the
same percentage on the 7600 and on the VAX, and another that accessed
memory in a more sequential manner (thereby reducing pagefaults) paid
handsomely.
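The access-order point can be illustrated with a sketch (Python, with function names of my own; a Python list of lists is not one contiguous block the way a Fortran array is, but the traversal pattern is the same). Both functions compute the same sum, yet the row-order walk touches storage sequentially while the column-order walk jumps a full row per step - on a paged machine, exactly the difference that cost the FFT its page faults:

```python
def sum_row_major(a):
    """Walk a 2-D array in storage order: consecutive elements of each
    row are adjacent, so pages are touched sequentially."""
    total = 0.0
    for row in a:
        for x in row:
            total += x
    return total

def sum_col_major(a):
    """Walk the same data column-first: each step skips a whole row, so
    with a large array every access can land on a different page."""
    n_rows, n_cols = len(a), len(a[0])
    total = 0.0
    for j in range(n_cols):
        for i in range(n_rows):
            total += a[i][j]
    return total
```

The answers are identical; only the fault behaviour differs, which is the sense in which this is an algorithmic improvement rather than an instruction tweak.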

In summary, VAXes are high-level machines and a low-level approach to
optimisation can be left to DEC's compiler writers who are now doing a good
job. Where programmers can score is by improving their algorithms, and
by understanding the basic principles of VM management so as to work with
VMS rather than against it. System programmers likewise usually do better to
study the VMS internals manual than to worry about whether two
BBS instructions are better or worse than a BITL and a BEQL.

Incidentally this is probably equally true of any other virtual memory system.
MIPs are a waste of time for everybody except salesmen ... "Figures can't
lie, but liars sure can figure".

Nigel Arnot (Dept. of Physics, King's College, Univ. of London; U.K.)

Bitnet/NetNorth/Earn:   sysmgr@ipg.ph.kcl.ac.uk (or) sysmgr%kcl.ph.vaxa@ac.uk
       Arpa         :   sysmgr%ipg.ph.kcl.ac.uk@ucl-cs.arpa

dww@seismo.CSS.GOV@stl.stc.co.uk (01/31/87)

In article <8701100344.AA11320@ucbvax.Berkeley.EDU> SYSMGR%UK.AC.KCL.PH.IPG@AC.UK writes:
>...
>Where programmers can score is by improving their algorithms, ...
Very true.   I remember a programmer asking permission to use assembler
language when I ran a "No assembler except in emergency" rule, because his
1.5 ms rate interrupt program ran for 2 ms.  A better algorithm (still HLL)
took 150 µs.   For even better examples, read Knuth!

>Incidentally ... is probably equally true of any other virtual memory system.
>MIPs are a waste of time for everybody except salesmen 

A company we work with uses Ridge (UNIX) and VAX (VMS) computers.   The Ridge
(I don't know which model) is supposed to have twice the CPU performance of
the VAX (a 785).   For most of their CPU-heavy simulations this was so, but
for the biggest, the VAX was several times faster, simply because it could 
handle paging of a large virtual memory space with far fewer page faults.