ins_ajbh@jhunix.BITNET (James B Houser) (01/09/87)
How do you go about getting instruction time numbers for VAXEN? Things like how long a register to register move takes etc. The local DEC people have no idea and the documentation (sic) is not helpful. I am particularly looking for 785 specs, but others would be good too.

I did some informal timing tests and some of the results do not make a lot of sense. How are you supposed to write really fast code without these numbers? Also, how much do the values change for different versions of the microcode?
yerazuws@CSV.RPI.EDU (Crah) (01/09/87)
In article <8701090448.AA22342@ucbvax.Berkeley.EDU>, ins_ajbh@jhunix.BITNET (James B Houser) writes:
>
> How do you go about getting instruction time numbers for VAXEN? Things
> like how long a register to register move takes etc.

The problem is that such timing numbers depend on what instructions have recently gone through the machine. As far as I know, ALL VAXes (excepting the /730 and /725) have a pipeline of some sort. On the big fast ones (like the 8800s) this pipeline is three instructions deep. So, if you are on an 8800 and run a loop consisting of something like

        TOP:    Load R1 with R2
                Increment R3
                Branch if R3 .ne. 0 to TOP

the pipeline will flush at the end of each loop, artificially making the machine appear slower than it is.

Even the trick of adding another load of R1 with R2 and measuring the difference in times between the two loops (sketched below) may not help, because some big VAXes have "bypasses" built into the microcode. The ucode "sees" that R2 is already on an internal bus and grabs it without refetching it from R2 - artificially DEcreasing the time needed.

Then we have to worry about cache contention, flushing, etc.; what the cache algorithms are on the particular CPU you are on; and what kind of memory the CPU has (different memories may behave differently under different fetching parameters - whether the fetches were aligned, or at least fell into the same memory block; some VAXes fetch 128 bits at a time).

Conclusion: you simply have to TRY your application and tweak where you think it's slow.

Consider: a Sun 3 with the standard CPU (not the $30,000 option one) should be twice as fast as my uVAXstation II - but I find that for what I use them for, the uVAX II is faster. Why? I dunno.
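For what it's worth, the two-loop version of the trick looks something like this in MACRO-32 (the entry names and the iteration count are arbitrary, and you would bracket each routine with whatever timer you trust):

        ; Loop A: loop overhead plus one register-to-register move.
        .entry  loopa, ^M<r2>
        movl    #100000, r2
10$:    movl    r0, r1                  ; the instruction being "timed"
        sobgtr  r2, 10$
        ret

        ; Loop B: identical, but with the move doubled.  In principle
        ; (time B - time A) / 100000 is the cost of one MOVL; in practice
        ; a microcode bypass can make the second MOVL nearly free.
        .entry  loopb, ^M<r2>
        movl    #100000, r2
20$:    movl    r0, r1
        movl    r0, r1
        sobgtr  r2, 20$
        ret
        .end

Subtract the two times and divide by the iteration count to get a per-MOVL figure - on a machine with the bypass, that figure will be optimistic.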
Hope this eases your brain.

        -Bill Yerazunis

SYSMGR%UK.AC.KCL.PH.IPG@AC.UK (01/10/87)
Re: How does one time VAX instructions, why don't the results make much sense, and how is one expected to write really fast code without these figures...

The following is a personal opinion that tends to arouse controversy, but here goes... If you don't want to read 5 or so screenfuls, hit delete now...

It is a waste of time to attempt to time individual instructions on a VAX CPU, or indeed to attempt any code tweaking which does not improve the algorithm of which the code is a possible realisation.

Why? Well, to start with, the only instructions which are likely to have a consistent execution time are register-to-register operations. You can time these quite easily by setting up a trivial subroutine containing a loop such as:

        .entry  test, ^M<r2>
        movl    #100000, r2
10$:    xxx
        sobgtr  r2, 10$
        ret
        .blkb   100             ; space to patch into with DEBUG
        .end

(xxx is the instruction(s) to time.) You run it first with xxx deleted, to get a time for the loop overhead, and then with the instruction(s) to be timed. Note that DEBUG can be used to deposit xxx, rather than running MACRO and LINK each time.

Timing is accomplished by bracketing this routine with a pair of calls to SYS$GETJPI returning CPUTIM (if you believe the results), or SYS$GETTIM to measure realtime (a rough sketch of such a bracket appears a little further down). The realtime figure is reliable if you have elevated your process priority above the swapper's (ie realtime priority) and provided the environment is quiet enough that few if any interrupts are being handled. Note that this latter approach will annoy other users (if any) and requires privilege (ALTPRI).

It is these considerations which lead to my opinion. A more complex piece of code, such as a memory-to-memory copy of a buffer, has a time which depends both on virtual memory management (unless you lock the buffers into your working set, which is usually unrealistic) and on other system hardware activity, which cannot be avoided.

VM management is tuned by a large number of SYSGEN parameters, 'correct' setting of which is a subject of much debate amongst system managers. In fact there are no right values; the 'best' values depend on your system configuration, type of workload, management objectives, etc. The CPU time taken by the buffer copy must include page fault overheads to be realistic, but these depend both on your SYSGEN and on how busy the system is when you run your test. If there are no other users, pages of memory lost from your process working set simply sit around in semiconductor memory until they are next needed (unless you exceed the physical memory size of your VAX), and are retrieved with a 'soft' page fault which involves no disc IO. In contrast, if the system is busy, a page lost by your process will probably be written back to disc and grabbed by some other process, so you incur a second disc IO when you next fault it in. 'Hard' page faults take a lot longer (in both CPU and real time) than 'soft' ones.

As for hardware activity, only one device can be bus master at once. If a DMA transfer to memory is in progress, the CPU may have to wait for one or more bus cycles until it can become bus master and access memory. Whether this is significant depends on the amount of DMA activity, on the bandwidth of the main bus (SBI, BI, CMI), and on its interaction with the device bus (UNIBUS, QBUS); I have heard that it is particularly significant with a UNIBUS adapter on a BI-bus VAX.

If this sounds rather theoretical: I know that on our VAX some jobs can take twice as much CPU on a busy system as on a lightly-loaded one.
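As promised, here is roughly what the realtime bracket might look like. It is only a sketch - the psect layout and names are illustrative, the NOP is just a placeholder for the instruction(s) you patch in with DEBUG, and the quadword subtraction is left to the caller (LIB$SUBX, or a SUBL/SBWC pair, will do):

        .psect  data, wrt, noexe
t0:     .blkq   1                       ; system time before the loop
t1:     .blkq   1                       ; system time after the loop

        .psect  code, nowrt, exe
        .entry  bracket, ^M<r2>
        $gettim_s timadr=t0             ; 64-bit system time, units of 100 ns
        movl    #100000, r2
10$:    nop                             ; xxx - the instruction(s) under test
        sobgtr  r2, 10$
        $gettim_s timadr=t1
        ret                             ; caller computes t1 - t0
        .end

Divide the elapsed time by the iteration count, remember that the loop overhead (the SOBGTR) is included, and treat the answer with the suspicion outlined above.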
When users complain that jobs take longer when the system is busy, I point out that this effect is common in the real world; most companies offer discounts to customers for off-peak resources or to shift less popular products. I also noticed that when we doubled our physical memory many jobs got lots faster, and when we went to VMS V4 many jobs slowed down, both of which make rather a nonsense of CPU MIP ratings.

Note that my opinion only applies to instruction tweaking that does not improve the algorithm. A while ago (on VMS 1.5, I think!) I transferred a two-dimensional FFT routine from a CDC 7600 to our VAX. On the 7600, hand-coded assembler in the inner loops reduced runtime by 60%. On the VAX I achieved 10%, and it wasn't because the compiler was already optimising registers well enough - I took a look at the compiler's code and shuddered! In contrast, an algorithmic improvement that reduced the number of complex multiplies involved in the computation saved much the same percentage on the 7600 and on the VAX, and another that accessed memory in a more sequential manner (thereby reducing page faults) paid handsomely - the difference is roughly the one sketched below.
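To make that last point concrete (the array size, psect layout and entry names are made up purely for illustration; a real FFT does more than sum elements, but the paging behaviour is the same):

        .psect  data, wrt, noexe
array:  .blkl   1048576                 ; 1024 x 1024 longwords, about 4 MB

        .psect  code, nowrt, exe
        ; Sequential pass: walks ARRAY in storage order, so each 512-byte
        ; page is brought in once and never needed again.
        .entry  sumseq, ^M<r2,r3>
        clrl    r0
        moval   array, r2
        movl    #1048576, r3
10$:    addl2   (r2)+, r0
        sobgtr  r3, 10$
        ret

        ; "Column" pass: steps one 4096-byte row at a time, so every
        ; iteration lands on a different page.
        .entry  sumcol, ^M<r2,r3>
        clrl    r0
        moval   array, r2
        movl    #1024, r3
20$:    addl2   (r2), r0
        addl2   #4096, r2
        sobgtr  r3, 20$
        ret
        .end

SUMSEQ pulls each page in once and uses 128 longwords from it before moving on; SUMCOL uses one longword per page it touches, so unless a whole column's worth of pages stays in your working set, a column-by-column pass over the array spends most of its time faulting the same pages back in.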
In summary, VAXes are high-level machines, and a low-level approach to optimisation can be left to DEC's compiler writers, who are now doing a good job. Where programmers can score is by improving their algorithms, and by understanding the basic principles of VM management so as to work with VMS rather than against it. System programmers likewise usually do better to study the VMS internals manual than to worry about whether two BBS instructions are better or worse than a BITL and a BEQL. Incidentally, this is probably equally true of any other virtual memory system. MIPs are a waste of time for everybody except salesmen ... "Figures can't lie, but liars sure can figure".

        Nigel Arnot  (Dept. Physics, King's College, Univ. of London; U.K.)

        Bitnet/NetNorth/Earn:  sysmgr@ipg.ph.kcl.ac.uk  (or)  sysmgr%kcl.ph.vaxa@ac.uk
        Arpa:                  sysmgr%ipg.ph.kcl.ac.uk@ucl-cs.arpa

dww@seismo.CSS.GOV@stl.stc.co.uk (01/31/87)
In article <8701100344.AA11320@ucbvax.Berkeley.EDU> SYSMGR%UK.AC.KCL.PH.IPG@AC.UK writes:
>...
>Where programmers can score is by improving their algorithms, ...

Very true. I remember a programmer asking permission to use assembler language when I ran a "no assembler except in emergency" rule, because his 1.5 mSec rate interrupt program ran for 2 mSec. A better algorithm (still in an HLL) took 150 uSec. For even better examples, read Knuth!

>Incidentally ... is probably equally true of any other virtual memory system.
>MIPs are a waste of time for everybody except salesmen

A company we work with uses Ridge (UNIX) and VAX (VMS) computers. The Ridge (I don't know which model) is supposed to have twice the CPU performance of the VAX (a 785). For most of their CPU-heavy simulations this was so, but for the biggest one the VAX was several times faster, simply because it could handle paging of a large virtual memory space with far fewer page faults.