eugene@eos.arc.nasa.gov (Eugene Miya) (12/21/90)
Fortran? C? Makes no sense to compare them if you don't have a basis. (I had hoped to write an ICPP paper on this; tough luck.) I need a break among the sandstone. Remember my minimal post? Seeing Pat McGehearty's post got me motivated again (Hi Pat!). FYI: Pat had what I regard as an interesting PhD thesis (yes, I read the whole thing a few years back, blue CMU cover). But see, we need a basis for comparison. The fact that Pat works for Convex got me interested in this: they are one of the few companies with an integrated language system using a common code-generating back end.

If you have the same program written in two languages, are they equivalent? What do we expect of equivalence? It ain't easy. The LLNL Loops in Fortran and C are nearly the same thing. I think C-versus-Fortran comparisons are relatively uninteresting: same imperative style of language. LISP, now we start to get interesting. I would really like to see VAL (Jack Dennis) or SISAL (McGraw et al.).

Anyway, how do we compare languages? Minimally. To quote Kernighan and Plauger (The Elements of Programming Style): do nothing gracefully. So I took an empty program, one which does nothing, and compiled it in each language:

Fortran:
	      PROGRAM EMPTY
	      STOP
	      END

C:
	main () {}

Pascal:
	program empty;
	begin
	end.

Consider this, on a C-2:

	-rwxr-xr-x  1 eugene       31302 Dec 20 14:06 ec
	-rwxr-xr-x  1 eugene      155098 Dec 20 14:06 ef

Look at the sizes of the executables. Should they be the same? They have the same common back end. Well, they do have different models of storage. What about on a Cray (Y-MP)?

	-rwxr-xr-x  1 eugene  npo  108112 Dec 20 14:45 ec
	-rwxr-xr-x  1 eugene  npo  717064 Dec 20 14:47 ef
	-rwxr-xr-x  1 eugene  npo  183584 Dec 20 14:57 ep

And execution time? On the C-2:

	% time ec
	0.00u 0.01s 0:00 100% 0+0k 0+0io 6pf+0w
	% time ef
	0.00u 0.01s 0:00 100% 0+0k 0+0io 13pf+0w

Identical functionality, different performance. And time(1) is an inadequate tool for timing at this resolution. On the Y?
	% time ec
	0.0003u 0.0064s 0:00 0%
	% time ef
	STOP (called by EMPTY )
	CP: 0.001s, Wallclock: 0.002s, 4.8% of 8-CPU Machine
	HWM mem: 97663, HWM stack: 2048, Stack overflows: 0
	0.0012u 0.0069s 0:00 0%
	% time ep
	0.0005u 0.0038s 0:00 0%

But this does not tell you enough. We need to count instructions (with precision, so we minimize the use of statistics). So I use the aforementioned hardware performance monitor:

	% hpm -g0 ef
	STOP (called by EMPTY )
	CP: 0.001s, Wallclock: 0.008s, 0.8% of 8-CPU Machine
	HWM mem: 97666, HWM stack: 2048, Stack overflows: 0

	Group 0:
	CPU seconds            :    0.00   CP executing     :   193415
	Million inst/sec (MIPS):   44.02   Instructions     :    51086
	Avg. clock periods/inst:    3.79
	% CP holding issue     :   42.74   CP holding issue :    82674
	Inst.buffer fetches/sec:    0.78M  Inst.buf. fetches:      904
	Floating adds/sec      :    0.21M  F.P. adds        :      246
	Floating multiplies/sec:    0.23M  F.P. multiplies  :      267
	Floating reciprocal/sec:    0.05M  F.P. reciprocals :       54
	I/O mem. references/sec:    0.00M  I/O references   :        0
	CPU mem. references/sec:   14.70M  CPU references   :    17058
	Floating ops/CPU second:    0.49M

We are doing a lot of work to "do nothing gracefully." Let's see the C and Pascal cases:

	% hpm -g0 ec
	Group 0:
	CPU seconds            :    0.00   CP executing     :    35247
	Million inst/sec (MIPS):   46.92   Instructions     :     9923
	Avg. clock periods/inst:    3.55
	% CP holding issue     :   43.26   CP holding issue :    15249
	Inst.buffer fetches/sec:    0.66M  Inst.buf. fetches:      140
	Floating adds/sec      :    0.00M  F.P. adds        :        1
	Floating multiplies/sec:    0.00M  F.P. multiplies  :        0
	Floating reciprocal/sec:    0.00M  F.P. reciprocals :        0
	I/O mem. references/sec:    0.00M  I/O references   :        0
	CPU mem. references/sec:   17.24M  CPU references   :     3645
	Floating ops/CPU second:    0.00M

	% hpm -g0 ep
	Group 0:
	CPU seconds            :    0.00   CP executing     :    61878
	Million inst/sec (MIPS):   46.97   Instructions     :    17439
	Avg. clock periods/inst:    3.55
	% CP holding issue     :   42.90   CP holding issue :    26545
	Inst.buffer fetches/sec:    0.89M  Inst.buf. fetches:      332
	Floating adds/sec      :    0.00M  F.P. adds        :        1
	Floating multiplies/sec:    0.00M  F.P. multiplies  :        0
	Floating reciprocal/sec:    0.00M  F.P. reciprocals :        0
	I/O mem. references/sec:    0.02M  I/O references   :        6
	CPU mem. references/sec:   18.26M  CPU references   :     6781
	Floating ops/CPU second:    0.00M

Now this is on a loaded system, but I assert you can get important information even on a loaded system. The important figure, BTW, is the rightmost column; this is the raw data. The middle column of figures is a rounded approximation. Other interesting information can be taken from the HPM as well; I used only the Fortran version of the code to describe this "universe."

	% hpm -g1 ef
	STOP (called by EMPTY )
	CP: 0.001s, Wallclock: 0.004s, 1.5% of 8-CPU Machine
	HWM mem: 97666, HWM stack: 2048, Stack overflows: 0

	Group 1:  CPU seconds: 0.00116   CP executing: 193018

	Hold issue condition                  % of all CPs   actual # of CPs
	Waiting on semaphores              :      0.13               249
	Waiting on shared registers        :      0.00                 0
	Waiting on A-registers/funct. units:      9.43             18200
	Waiting on S-registers/funct. units:     27.62             53304
	Waiting on V-registers             :      1.38              2668
	Waiting on vector functional units :      0.00                 9
	Waiting on scalar memory references:      0.57              1103
	Waiting on block memory references :      1.91              3677

	% hpm -g2 ef
	STOP (called by EMPTY )
	CP: 0.001s, Wallclock: 0.002s, 4.1% of 8-CPU Machine
	HWM mem: 97666, HWM stack: 2048, Stack overflows: 0

	Group 2:  CPU seconds: 0.00116   CP executing: 192818

	Inst. buffer fetches/sec:  0.78M   total fetches   :     904
	                                   fetch conflicts :    1396
	I/O memory refs/sec     :  0.00M   actual refs     :       0
	   avg conflict/ref 0.00           actual conflicts:      37
	Scalar memory refs/sec  :  5.59M   actual refs     :    6462
	Block memory refs/sec   :  9.16M   actual refs     :   10600
	CPU memory refs/sec     : 14.75M   actual refs     :   17062
	   avg conflict/ref 0.07           actual conflicts:    1161
	CPU memory writes/sec   :  8.99M   actual refs     :   10399
	CPU memory reads/sec    :  5.76M   actual refs     :    6663

	% hpm -g3 ef
	STOP (called by EMPTY )
	CP: 0.001s, Wallclock: 0.003s, 2.1% of 8-CPU Machine
	HWM mem: 97666, HWM stack: 2048, Stack overflows: 0

	Group 3:  CPU seconds: 0.00116   CP executing: 192990

	(octal) type of instruction         inst./CPUsec  actual inst.  % of all inst.
	(000-017) jump/special             :    5.35M         6190          12.10
	(020-077) scalar functional unit   :   33.12M        38350          74.96
	(100-137) scalar memory            :    5.58M         6462          12.63
	(140-157,175) vector integer/log.  :    0.01M           14           0.03
	(160-174) vector floating point    :    0.00M            2           0.00
	(176-177) vector load and store    :    0.12M          141           0.28

	type of operation                    ops/CPUsec    actual ops    avg. VL
	Vector integer&logical             :    0.12M          138          9.86
	Vector floating point              :    0.20M          232        116.00
	Scalar functional unit             :   33.12M        38350

That took four executions to learn all that, for a simple program which does nothing gracefully. That's quite a cost; you can't do that with some real programs. There is more to performance than execution time. If we want to design faster machines (worry about buying them later), we must potentially make observations in this much detail. This isn't possible on many machines. Future machines must have this kind of environment. I hope you can begin to see why we MUST come to some kind of consensus on equivalence, or we won't get anywhere with our comparisons. Do you really know what your programs are doing late in the evenings? 8^)

Next... should we deal with the problem of resolution? Adding things to these empty programs and seeing how influences such as optimization, etc. affect execution.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.
eugene@eos.arc.nasa.gov (Eugene Miya) (12/21/90)
BTW: In case you missed it, the thread I am trying to start included the summary of an unknown program (actually EMPTY) using HPM group 0. Let me give you a few hints of things to come:

1) Starting with this calibration (not a fair one, we shall see, but it appears fair), I'll add (really) simple work.
2) I'll try to show real examples of "over-optimization."
3) How to work around one or two of these.
4) Try to show hardware and software artifacts.
5) Consider sampling strategies: one or two proposals which might be radical.
6) Show how a few programs might have deceptive execution.
7) Consider interesting analogies for performance measurement. (Mine are from photography: Muybridge [Stanford], Edgerton [MIT & EG&G]. Rafael's is audio equipment. Others might use cars, etc.)
8) Cover real "hard" topics like parallelism, data flow languages and machines, equivalence, timing scope, synchronization. Etc.

But first, skiing and climbing.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.