[comp.sys.sequent] gprof problem?

dav2@sequent.cc.hull.ac.uk (Dominic "A." Varley) (06/03/91)

Dear Symmetry Users,

I am conducting research into the use of parallelism in digital cartography.
I have available to me an eight processor Sequent Symmetry S81 running DYNIX
V3.0.17.9.  Below I have presented what I believe to be a flaw in the
performance measuring tool gprof.  This, I hope, is demonstrated by
comparing the results from gprof against timings taken using the on-board
clock.  The program in question involves a very large amount of input,
typical of many of the tasks I am looking into.

I am having difficulty interpreting the profile results of my sequential
programs as presented to me by gprof(1).  I became suspicious of gprof's
results when the observed elapsed time, as measured by the C shell
built-in time (csh(1)), was more than three times the total elapsed time
reported by gprof.  Closer examination using the microsecond clock
(getusclk(3)) reveals much more significant differences between the two
approaches at the procedural level.

Presented below is a summary of the important procedure calls:


START @ Tue May 28 18:42:33 WET DST 1991 6 users, load average: 0.05, 0.04, 0.00

	Phase 1 -->         48858659. microseconds =>   48.86 seconds
	Phase 2 -->       4136634372. microseconds => 4136.63 seconds
	Phase 3 -->          4759527. microseconds =>    4.76 seconds
	Lapsed time : 4190.25 seconds
	Lapsed time : 1 hour 9 minutes 50.25 seconds 
		      (excludes initialising proc)

STOP @ Tue May 28 19:52:30 WET DST 1991 4 users,  load average: 0.15, 0.12, 0.00

Lapsed system time : 1 hour 9 minutes and 57 seconds 
		     (includes initialising proc)
		     (from STOP@ - START@)


The times for the three individual phases (procedures) are calculated by
taking snapshots from the on-board microsecond clock immediately before and
after the appropriate procedure call, then taking the difference.  The
sum of these times is almost equal to the observed elapsed time
calculated by subtracting the START and STOP times reported by
date(1).  The slight difference (6.75 seconds) is accounted for by the
initialising and closing subroutines excluded from the on-board clock
calculations.
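
For reference, the harness around each phase is just a pair of clock
reads, roughly as in the sketch below.  The usclk_t type and the
usclk_init() initialiser are my reading of getusclk(3), so treat those
details as assumptions and check your local man page (including which
library to link against).

	#include <stdio.h>

	typedef unsigned long usclk_t;    /* assumed width of a clock reading */
	extern void usclk_init(void);     /* assumed clock initialiser        */
	extern usclk_t getusclk(void);    /* microsecond clock, getusclk(3)   */

	extern void phase2(void);         /* the procedure under test */

	int main(void)
	{
	    usclk_t before, after;

	    usclk_init();
	    before = getusclk();
	    phase2();
	    after = getusclk();
	    printf("Phase 2 --> %lu. microseconds\n",
	           (unsigned long)(after - before));
	    return 0;
	}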


Below is a summary of the gprof results produced from the same run as above.

GPROF RESULTS
=============

granularity: each sample hit covers 4 byte(s) for 0.00% of 1239.75 seconds

                                                     <spontaneous>
[1]     73.1    0.00      614.97                 start [1]
                0.00      614.97       1/1           _main [2]

-----------------------------------------------

                0.00      614.97       1/1           start [1]
[2]     73.1    0.00      614.97       1         _main [2]
                2.06      378.75       1/1           _phase2 [4]
                1.90      221.16       1/1           _phase3 [6]
                4.46        5.72       1/1           _phase1 [11]
                0.24        0.24       1/1           _init [15]
                0.03        0.40       1/1           _dbsave [16]
                0.00        0.00       3/3           _fflush [22]
                0.00        0.00       5/5           _fputs [49]
                0.00        0.00       3/4           _puts [50]
                0.00        0.00       3/13          _printf [48]


Do you see the problem? -> phase1 takes 4.46 (self) + 5.72 (desc) =  10.18 secs
			   phase2 takes 2.06 + 378.75 = 380.81 secs
			   phase3 takes 1.90 + 221.16 = 223.06 secs
Compare these to the readings taken from the microsecond clock.
	phase1 =   48.86 secs
	phase2 = 4136.63 secs
	phase3 =    4.76 secs
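
Put in relative terms: by the microsecond clock phase2 accounts for
4136.63/4190.25 = 98.7% of the run and phase3 for a mere 4.76/4190.25 =
0.11%, whereas gprof charges phase2 with 380.81/614.97 = 61.9% and
phase3 with 223.06/614.97 = 36.3%.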

In "Gprof : A Call Graph Execution Profiler" (Graham et al, 1982) they
discuss the use of statistical approximation, based on program counter
samples, as the best method of providing execution times in a time-sharing
systems.  This may be one possible reason for such a difference, or 
have I missed something in my interpretation?
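
To make the sampling point concrete, the sketch below shows the kind of
statistical PC sampling such timings rest on.  I have written it against
the BSD-style setitimer(2)/SIGPROF interface; whether DYNIX implements
profiling exactly this way is an assumption on my part.  The key
property is that the profiling timer only advances while the process is
consuming CPU, so a phase that sits blocked in read(2) accumulates
almost no samples however long it takes on the wall clock.

	#include <stdio.h>
	#include <signal.h>
	#include <sys/time.h>

	static volatile unsigned long samples; /* stands in for the PC histogram */

	static void tick(int sig)
	{
	    samples++;            /* gprof would bucket the interrupted PC here */
	}

	int main(void)
	{
	    struct itimerval it;

	    signal(SIGPROF, tick);
	    it.it_interval.tv_sec = 0;
	    it.it_interval.tv_usec = 10000;       /* 100 samples per CPU second */
	    it.it_value = it.it_interval;
	    setitimer(ITIMER_PROF, &it, (struct itimerval *)0);

	    /* ... code under test: ITIMER_PROF only counts down while the
	     * process is running, so time blocked on I/O goes unsampled ... */

	    printf("%lu samples => ~%.2f CPU seconds\n", samples, samples / 100.0);
	    return 0;
	}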

Questions
---------

	  i) Do you agree with my interpretation of the results?  If not, why?
	 ii) Have you experienced similar problems?  What did you do?
	iii) Is there any more literature on gprof/Sequent?
	 iv) Has anybody developed alternative measuring tools?
	  v) Sequent appear to have modified gprof - try sccs what `which gprof`
	     What has been changed?

If this problem is not due to some misunderstanding on my part, then it is a
major one: based on the results of gprof I parallelised phase3, when in
reality it accounts for only around 0.11% (as opposed to 36%) of the total
execution time.


About the program
=================

The program is written in C and compiled using cc(1) with the -pg option.
It is executed in the ucb universe.  All three phases eventually call a
single subroutine which fetches a very large amount of data.
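
That shared subroutine may matter more than I first thought.  The
Graham et al. paper says gprof propagates a child's time to its parents
in proportion to the number of calls, on the assumption that every call
costs about the same; if the phases make very different use of the
fetch routine, that assumption breaks down.  The sketch below (all
names made up, not my real program) shows the shape of the failure:

	#include <stdio.h>

	static double sink;

	static void fetch(long work)     /* the shared data-fetching routine */
	{
	    long i;
	    for (i = 0; i < work; i++)
	        sink += (double)i;
	}

	static void phase2(void)         /* 1 call, nearly all the work */
	{
	    fetch(100000000L);
	}

	static void phase3(void)         /* 100 calls, trivial work */
	{
	    long i;
	    for (i = 0; i < 100; i++)
	        fetch(1000L);
	}

	int main(void)
	{
	    phase2();
	    phase3();
	    printf("%g\n", sink);        /* keep the loops from being elided */
	    return 0;
	}

Here fetch() is called once by phase2, which does nearly all the work,
and 100 times by phase3, which does almost none; by call-count
apportionment gprof would nevertheless charge phase3 with roughly
100/101 of fetch()'s time.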


Postscript
==========

In addition, has anybody had experience with gprof applied to parallel
programs?  The FORTRAN compiler produces separate profiles for each
processor, but I can't make any sense of the results.  Any comments about
gprof or any other tool in this respect will be gratefully received.


Reference
=========

Graham, S.L., Kessler, P.B., and McKusick, M.K., 1982,
"gprof: A Call Graph Execution Profiler",
Proceedings of the ACM SIGPLAN '82 Symposium on Compiler Construction,
SIGPLAN Notices 17(6), pp. 120-126.
(Included in Volume II of the DYNIX manual set.)


Thank you for your time
Dominic A. Varley

Cartographic Information Systems Research Group (CISRG)
Department of Computer Science
University of Hull			
Hull				       JANET : dom@uk.ac.hull.cs
North Humberside		       
England
HU6 7RX
Deptl Office (+44) (0) 482 465951	 Fax (+44) (0) 482 466666