[comp.arch] Partial clock/compiler quality characterizer

eugene@pioneer.arpa (Eugene Miya N.) (07/03/87)

I am sorry I have taken so long.  I have really been bogged down with
meetings and other hacking on parallel programs.  I should have shar'ed
this source, but bear with me.  The file named comments (full of
comments) describes some of this test.

1) Do not attempt to add loop repetition to the individual measurements.
A later test will do that.  If you have a poor clock, it's important to
characterize it.  Systems with 60 Hz clocks might create interesting
results.

2) The test also checks out inline code substitution as a qualitative
optimization: no parameter passing, simplest possible routines.  I know
of two architectures which make effective use of this type of
optimization.  Other tests are on the way, which is part of what bogs me
down.

3) Note: all output comes out as comments or near comments and I use the
NBS measurement record (current version, we are suggesting some
additions).

Directions: save this as a file.  Break that file into the named files.
You need an editor and m4.  Check out the m4 defines and substitute the
most accurate system clock you have on your system [I use the integer
value of a Real Time Clock: a tick is 9.5 ns, one of the most accurate I
know].  The text is really simple; try to figure it out.  I typed the
basic stuff into an A*t after it did not get mailed to a system.
Fill out (change) the NBS measurement record for your machine
(machine.t).
The m4 output is not quite ready for compilation.  The code in the
file SYNC must be modified to have unique statement numbers (every copy
currently uses 1).
If you rewrite this in C, Pascal, or LISP, please send me your code and I
will check it out.  Keep the output style: use comments in those
respective languages.  Another useful test might be to reverse the order
of the calls from ONE to SIX to SIX to ONE.  Exercise for the reader
(don't post, just guess silently): why might this be useful?

Crufty stuff, but better than not getting it out.  This is research; no
endorsement is implied.  I can't issue a copyright as this was
"developed" on US Government time.  It's also dumb [as well as smart
when you consider how few machines optimize this dumb code out].

Why did I do this?  Following a query to comp.compilers on testing
optimization (basically no one really does test [maybe Tartan Labs]), a
co-worker remarked that an upgraded machine and compiler
(the manufacturer's, not the f77 distributed with Unix) did not optimize
an obvious block of code in one of his tests.  He was a little pissed
off.  (No, the machine was not an A*t).  Another thing was an article in
Spectrum.  Don't try to guess the machine; we have a lot of different
Unix machines here [to protect the innocent as well as the guilty.]

We give compilers too much credit.

You are welcome to send me output.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

C----------------------- this file is named CONTROL--------
include(SYNC)
      start_watch
      lap_watch
      stop_watch
C ----------------------- this file is named SYNC-----------
C
C SYNC includes code to:
C 1) perform any pre-cache, register or other flushing
C 2) synchronize the ticks of a clock if necessary
      ITIME = IRTC()
    1 IF(ITIME .EQ. IRTC()) GO TO 1
C
C This version of SYNC must be modified to increment SYNC's statement
C numbers.  I assume this is obvious.  Later versions will get rid of
C this.
C
C------------------------this file is the main.f--------------
C
C m4 words to clock by:
define(startw,IBEFOR)
define(stopw,IAFTER)
define(lapw,INTIM)
C
C m4 macros to call clocks by:
C
define(start_watch,startw = IRTC())
define(lap_watch,lapw = IRTC())
define(stop_watch,stopw = IRTC())
C
C
C
define(determine_time,$1 = stopw - startw)
define(determine_lap,$3 = $2 - $1)
C
C Use the integer arrays for integer clocks
C Use the real arrays for floating point clocks
C
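       INTEGER ITICKS(7,3), JTICKS(7,3)
       REAL TICKSI(7,3), TICKSJ(7,3)
C
C The lap variable must share blank COMMON with subroutine ONE,
C which is where the lap time is taken for the CALL cases.
C
       COMMON lapw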
C
C Subroutine call overhead testing
C
       DO 10 ISAMP = 1, 3
include(CONTROL)
          determine_lap(startw,lapw,ITICKS(1,ISAMP))
          determine_lap(lapw,stopw,JTICKS(1,ISAMP))
include(SYNC)
          start_watch
          CALL ONE
          stop_watch
          determine_lap(startw,lapw,ITICKS(2,ISAMP))
          determine_lap(lapw,stopw,JTICKS(2,ISAMP))
include(SYNC)
          start_watch
          CALL TWO
          stop_watch
          determine_lap(startw,lapw,ITICKS(3,ISAMP))
          determine_lap(lapw,stopw,JTICKS(3,ISAMP))
include(SYNC)
          start_watch
          CALL THREE
          stop_watch
          determine_lap(startw,lapw,ITICKS(4,ISAMP))
          determine_lap(lapw,stopw,JTICKS(4,ISAMP))
include(SYNC)
          start_watch
          CALL FOUR
          stop_watch
          determine_lap(startw,lapw,ITICKS(5,ISAMP))
          determine_lap(lapw,stopw,JTICKS(5,ISAMP))
include(SYNC)
          start_watch
          CALL FIVE
          stop_watch
          determine_lap(startw,lapw,ITICKS(6,ISAMP))
          determine_lap(lapw,stopw,JTICKS(6,ISAMP))
include(SYNC)
          start_watch
          CALL SIX
          stop_watch
          determine_lap(startw,lapw,ITICKS(7,ISAMP))
          determine_lap(lapw,stopw,JTICKS(7,ISAMP))
10    CONTINUE
C
C Print data
C
      DO 30 ISAMP = 1,3
          DO 20 IDEPTH = 1,7
             ID = IDEPTH - 1
             WRITE(*,*) ' C',ID,ITICKS(IDEPTH,ISAMP),
     &                  JTICKS(IDEPTH,ISAMP)
20        CONTINUE
30    CONTINUE
C
C Print measurement record
C
include(machine.t)
         STOP
         END
include(routines.f)
C
C----------------------routines.f---------------------
C
C the following should be compiled two ways if rewritten in a
C language with forward referencing like Pascal.
C The first way should be with all satisfied forward references.
C The second way should be (if possible) with separately compiled
C modules in reverse order prior to linking.
C This is important to determine what environmental information is
C passed.
C
C This test fails to check out parameter passing overheads.
C
      SUBROUTINE ONE
      COMMON lapw
      lap_watch
      RETURN
      END
      SUBROUTINE TWO
      CALL ONE
      RETURN
      END
      SUBROUTINE THREE
      CALL TWO
      RETURN
      END
      SUBROUTINE FOUR
      CALL THREE
      RETURN
      END
      SUBROUTINE FIVE
      CALL FOUR
      RETURN
      END
      SUBROUTINE SIX
      CALL FIVE
      RETURN
      END
C------------------------------file named machine.t------------
C V1.1 measrec.wp NBS COMPUTER MEASUREMENT RESEARCH FACILITY
C 
C                   BENCHMARK MEASUREMENT RECORD
C 
C Benchmark: Miya Clock tests call it version 0.0
      WRITE (*,*) ' C Benchmark: Miya Clock tests call it version 0.0'
C 
C Machine:
      WRITE(*,*) ' C Hardware: Cray X-MP/48'
C 
C Location:
      WRITE(*,*) ' C Measurement organization: NASA Ames Research Center'
      WRITE(*,*) ' C Measurement city: Moffett Field, CA 94035'
C 
C Date, Time:
      WRITE(*,*) ' C Measurement Date: NDEF '
C 
C                      Machine Environment
C 
C CPU type:
      WRITE(*,*) ' C CPU type: Cray X-MP/4'
      WRITE(*,*) ' C Serial #103'
C 
C Word size (specify in bits, give number of mantissa and exponent
C      bits for numeric benchmark):
      WRITE(*,*) ' C Single Precision FP Word size == 64 bit'
      WRITE(*,*) ' C FP Word mantissa size == 47 bit'
      WRITE(*,*) ' C FP Word exponent size == 15 bit'
      WRITE(*,*) ' C Integer Word size == 64 bit'
      WRITE(*,*) ' C Pointer size == 24 bit'
      WRITE(*,*) ' C Memory addressing == word-oriented'
C 
C Machine cycle time (major and minor cycles, specify units):
      WRITE(*,*) ' C CPU Cycle == 9.5 ns'
      WRITE(*,*) ' C Clock Cycle == 9.5 ns'
C 
C Memory Size (specify units, MBytes, GBytes, etc.):
      WRITE(*,*) ' C Memory size == 8 MW'
C 
C Amount of global or private memory (per processor):
      WRITE(*,*) ' C Memory size == 8 MW'
C 
C Number of mem. boards & size of each:
C 
C Memory interleaving (describe):
      WRITE(*,*) ' C Memory interleaving == 64-way'
C 
C Number of channels bet. local mem. and CPU/regs.:
      WRITE(*,*) ' C Local memory == NONE'
      WRITE(*,*) ' C CPU registers == 64 A regs'
      WRITE(*,*) ' C CPU registers == 64 B regs'
      WRITE(*,*) ' C CPU registers == 64 T regs'
      WRITE(*,*) ' C CPU registers == 8 V regs'
C 
C Bandwidth of mem. channels (specify units):
C 
C Number of processors (number in system and maximum number used):
      WRITE(*,*) ' C Number of processors == 4'
C 
C Type of interconnection between processors (e.g., shared memory,
C      network):
      WRITE(*,*) ' C Shared-memory'
C 
C Bandwidth of an interconnection path or paths (specify unit,
C      rate, data-carrying capacity and total size of packets if
C      used):
C 
C Other (everything else about machine and configuration that could
C      be relevant to duplicating timing results with the benchmark
C      used):
      WRITE(*,*) ' C Other: Chaining present'
      WRITE(*,*) ' C Other: Scatter-Gather present'
C 
C Cache and size:
      WRITE(*,*) ' C Cache and size: NONE 0'
C 
C Number of pipes:
      WRITE(*,*) ' C Number of pipes: None (technically)'
C 
C Special fast memory present and whether used:
      WRITE(*,*) ' C Special fast memory present and whether used:'
      WRITE(*,*) ' C Bipolar logic in main memory used'
C 
C Paging device (type and speed, specify units):
      WRITE(*,*) ' C Paging device: None'
C      WRITE(*,*) ' C Paging device: Cray SSD/128 MW'
C 
C I/O devices exercised in benchmark (type, speed, specify units):
      WRITE(*,*) ' C I/O devices exercised: NONE 0'
C 
C                       Software Environment
C 
C Operating System and version:
      WRITE(*,*) ' C Operating System: COS -- Cray Operating System'
      WRITE(*,*) ' C Version Number: 1.15'
C 
C Compiler (name and version):
      WRITE(*,*) ' C Compiler: CFT'
      WRITE(*,*) ' C Language: FORTRAN'
      WRITE(*,*) ' C Compiler version: 1.15'
C 
C Standard (if any) this compiler claims conformance to:
      WRITE(*,*) ' C Standard: Language FORTRAN 66'
C 
C                       Benchmark Conditions
C 
C Compiler options used (name them and describe their effects; give
C      compiler command used):
      WRITE(*,*) ' C Compiler options: SITE SYSTEM DEFAULT'
C 
C Compiler messages encountered:
C 
C Object module size (specify units):
C 
C Load module size (specify units):
C 
C Linker/loader options used (name them and describe their effects;
C      give linker/loader command used);
      WRITE(*,*) ' C Linker/loader options: SITE SYSTEM DEFAULT'
C 
C Number of lines of code required to be modified to compile
C      successfully:
      WRITE(*,*) ' C Modified lines(compile): 0'
C 
C Number of lines of code modified for successful run:
      WRITE(*,*) ' C Modified lines(execution): 0'
C 
C Tuning Level (as with NAS Kernels: 0, 1, 2, 3):
      WRITE(*,*) ' C Tuning Level: 0'
C 
C Run parameters (special execution time options required to run
C      benchmark, e.g., amount of storage pre-allocated, etc.,
C      often specified in job control statements; give run command
C      used):
      WRITE(*,*) ' C Run parameters: NONE'
C 
C Files required (input and output; give sizes, number of records,
C      how blocked):
      WRITE(*,*) ' C File required: NONE'
C 
C Other workload present/absent during run:
      WRITE(*,*) ' C Other workload present/absent during run:'
      WRITE(*,*) ' C    Normal fully loaded system'
C 
C Timer type (stopwatch, timers internal to benchmark -- give call
C      statement or function invocation; describe the value
C      returned, e.g., whether system time, elapsed job execution
C      time, etc. and give resolution of timing call):
      WRITE(*,*) ' C Function call: IRTC()'
      WRITE(*,*) ' C                value returned INTEGER'
      WRITE(*,*) ' C                value returned REAL TIME CLOCK'
      WRITE(*,*) ' C                value returned CLOCK TICKS'
      WRITE(*,*) ' C                resolution 9.5 ns minimum'
C                                      note: newer X-MP/4s have 8.5 ns clock
C 
C Number of runs with this benchmark:
      WRITE(*,*) ' C RUNS == 1, Samples == 3'
C 
C System messages encountered (specify whether from loader or run
C      time, e.g., "DROP FILE EXTENDED TO SIZE 496" on a CDC Cyber
C      205):
C 
C Page faults (if any) recorded:
      WRITE(*,*) ' C Page faults: 0'
      WRITE(*,*) ' C Page faults: comment no virtual memory'
C 
C Problem size (give range, specify units, e.g. "64-bit real arrays
C      of from 100 to 65535 elements", number of iterations, etc.):
      WRITE(*,*) ' C Tests of clock resolution and certain'
      WRITE(*,*) ' C simple operations.'
C ----------------------file named comments----------------------
C This is a check for subroutine call overhead, and the effects
C of possible INline code generation.
C The first time should just be sequential time, and perhaps overhead
C to do one clock call.
C The next calls should either show CALL overhead or inline code
C substitution (no difference).
C This test can be rewritten in any language, but it should be noted
C that languages like Pascal which require forward referencing
C should be run in a forward reference case (predefined routines one
C thru six must come first) and in a separate compilation case
C (important for transmission of environmental information).
C Other measurements include work to sync with a clock.
C If more than one clock is available, all clock calls should be tested
C with minimum overhead.  (We assume consistent data types.)
C (One test does consider type conversion costs of clock values.)
C (Another test considers simple array indexing costs, and hopefully
C separates I/O from the rest of the clock.)
C If a time FUNCTION is used, a sync like:
C      FTIME = CLOCK()
C    2 IF(FTIME .EQ. CLOCK()) GO TO 2
C      START = CLOCK() 
C spin waits well.
C If your environment is dumb (no function calls in IF evaluation),
C then please use something like:
C      FTIME = CLOCK()
C    2 START = CLOCK()
C      IF(FTIME .EQ. START) GO TO 2
C      START = CLOCK() 
C If a subroutine call is used please use something like:
C      CALL CLOCK(FTIME)
C    2 CALL CLOCK(START)
C      IF(FTIME .EQ. START) GO TO 2
C
C Yet another set of tests tries to determine a combination of loop
C overhead/optimizing compiler looking for unused loops or dead code.
C We have straight empty DO loops as well as equivalent IF-GOTO loops.
C A sync is used in all timing cases prior to execution, and a
C version of the loops is included for those systems with poor clocks
C which are trying to determine maximum work between any two ticks.
C
C A series of tests on resolution: A and nA
C When testing resolution it would also be useful to check for
C simple Constant Subexpression Elimination (CSE).  Another smart
C compiler artifact.
C
C
C Simple tests are those tests which have material which is assigned
C (no arithmetic or logical computation) at compile-time (Complexity 0).
C No checking value duplication or movement, etc.
C Complexity 1 might be a stage more complex: not assigned, but
C can be compile-time computed (simple arithmetic and logical).
C Simple checking for value movement might be possible, but the exact
C value might not exist in literal form (excepting checks at the end).
C Complexity 2 might involve more complex clock tick determination
C during run time which might have variable components (unresolved at
C compile-time).
C Complexity 3 must be input, so a machine cannot know a value
C in advance.  Only read statements.  Certain constants for repetition,
C for instance, are arbitrary and may be freely modified.
C
C Tests of simple resolution.
C Many machines with 60/100 Hz and even Microsecond clocks might fail
C this portion.  This section must not be modified to perform repetition.
C A really smart compiler might optimize these statements out since all
C the data can be determined by forward reference.
C A later version will be released which uses repetition.
C
C Good luck
C