dav2@sequent.cc.hull.ac.uk (Dominic "A." Varley) (06/03/91)
Dear Symmetry Users,

I am conducting research into the use of parallelism in digital cartography. I have available to me an eight-processor Sequent Symmetry S81 running DYNIX V3.0.17.9. Below I present what I believe to be a flaw in the performance-measuring tool gprof. This, I hope, is demonstrated by comparing gprof's results against timings taken using the on-board clock. The program in question performs a very large amount of input, typical of many of the tasks I am looking into.

I am having difficulty interpreting the profile results of my sequential programs as presented by gprof(1). I became suspicious of gprof's results when the observed elapsed time, as measured by the C shell command time (csh(1)), was over three times greater than the total elapsed time reported by gprof. Closer examination using the microsecond clock (getusclk(3)) reveals much more significant differences between the two approaches at the procedure level. Presented below is a summary of the important procedure calls:

START @ Tue May 28 18:42:33 WET DST 1991    6 users, load average: 0.05, 0.04, 0.00

    Phase 1 -->   48858659. microseconds =>   48.86 seconds
    Phase 2 --> 4136634372. microseconds => 4136.63 seconds
    Phase 3 -->    4759527. microseconds =>    4.76 seconds

    Elapsed time : 4190.25 seconds
    Elapsed time : 1 hour 9 minutes 50.25 seconds (excludes initialising proc)

STOP  @ Tue May 28 19:52:30 WET DST 1991    4 users, load average: 0.15, 0.12, 0.00

Elapsed system time : 1 hour 9 minutes and 57 seconds (includes initialising proc)
(from STOP @ - START @)

The times for the three individual phases (procedures) are calculated by taking snapshots from the on-board microsecond clock immediately before and after the appropriate procedure call, then taking the difference. The sum of these times is almost equal to the observed elapsed system time as calculated by subtracting the start and end times reported by date(1).
The slight difference (6 seconds) is accounted for by initialising and closing subroutines excluded from the on-board clock calculations.

Below is a summary of the gprof results produced from the same run as above.

GPROF RESULTS
=============

granularity: each sample hit covers 4 byte(s) for 0.00% of 1239.75 seconds

                                          <spontaneous>
[1]     73.1    0.00  614.97              start [1]
                0.00  614.97     1/1          _main [2]
-----------------------------------------------
                0.00  614.97     1/1          start [1]
[2]     73.1    0.00  614.97     1        _main [2]
                2.06  378.75     1/1          _phase2 [4]
                1.90  221.16     1/1          _phase3 [6]
                4.46    5.72     1/1          _phase1 [11]
                0.24    0.24     1/1          _init [15]
                0.03    0.40     1/1          _dbsave [16]
                0.00    0.00     3/3          _fflush [22]
                0.00    0.00     5/5          _fputs [49]
                0.00    0.00     3/4          _puts [50]
                0.00    0.00     3/13         _printf [48]

Do you see the problem?

    phase1 takes 4.46 (self) + 5.72 (descendants)   =  10.18 secs
    phase2 takes 2.06        + 378.75               = 380.81 secs
    phase3 takes 1.90        + 221.16               = 223.06 secs

Compare these to the readings taken from the microsecond clock:

    phase1 =   48.86 secs
    phase2 = 4136.63 secs
    phase3 =    4.76 secs

In "Gprof: A Call Graph Execution Profiler" (Graham et al., 1982) the authors discuss the use of statistical approximation, based on program counter samples, as the best method of providing execution times in a time-sharing system. This may be one possible reason for such a difference, or have I missed something in my interpretation?

Questions
---------

i)   Do you agree with my interpretation of the results? If not, why?
ii)  Have you experienced similar problems? What did you do?
iii) Is there any more literature on gprof/Sequent?
iv)  Has anybody developed alternative measuring tools?
v)   Sequent appear to have modified gprof - try

         sccs what `which gprof`

     What has been changed?

If this problem is not due to some misunderstanding on my part, then it is a major one. Based on the results of gprof I parallelised phase3, when in reality it accounts for only around 0.12% (as opposed to 36%) of the total execution time.
About the program
=================

The program is written in C and compiled using cc(1) with the -pg option. It is executed in the ucb universe. All three phases eventually call a single subroutine which fetches a very large amount of data.

Postscript
==========

In addition, has anybody had experience with gprof applied to parallel programs? The FORTRAN compiler produces separate profiles for each processor, but I can't make any sense of the results. Any comments about gprof or any other tool in this respect will be gratefully received.

Reference
=========

Graham S.L., Kessler P.B., McKusick M.K., 1982, "Gprof: A Call Graph Execution Profiler", Proceedings of the ACM SIGPLAN '82 Symposium on Compiler Construction, SIGPLAN Notices 17, pp 120-126. (Included in Volume II of the DYNIX Manual set.)

Thank you for your time,

Dominic A. Varley
Cartographic Information Systems Research Group (CISRG)
Department of Computer Science
University of Hull
Hull
North Humberside
England HU6 7RX

JANET        : dom@uk.ac.hull.cs
Deptl Office : (+44) (0) 482 465951
Fax          : (+44) (0) 482 466666