[comp.arch] Page size and linkers

moss@cs.umass.edu (Eliot Moss) (01/22/91)

MIPS systems have a tool that reorganizes code for better *cache* behavior.
The same idea would presumably work for paging (and TLB) stress reduction ...
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu

marcs@slc.com (Marc San Soucie) (01/23/91)

Richard Tobin writes:

> Piercarlo Grandi writes:

> > Incidentally, even a 4KB pagesize is still too large, even if it is
> > barely tolerable. 8KB was simply crazy, for a workstation/time sharing
> > server.

> Large page sizes seem likely to interact badly with programs
> containing a large number of small procedures, such as X clients.  An
> obvious improvement would be to have the linker order procedures so that
> procedures commonly used together were adjacent, and separate from
> rarely used procedures.
>
> Has this been done in any real systems?

I know of one instance, which may not qualify for the label 'real': a
workstation project undertaken at Wang Laboratories and discarded shortly
before its release date. The OS for the workstation was a home-built System
V.3 compatible, with home-cobbled tools and other home-made goodies.

One of the real goodies, which I have yet to see in a UNIX system, was a
service provided by the Kernel (and making liberal use of a nifty 68020/30
hardware feature) whereby one could run one's program under the auspices of the
debugger, and cause it to emit a trace log of all jumps and calls. I wrote a
set of spiffers which munged the trace log into a linker script which would
cause the program to be linked in an order most advantageous for initial
page-in and in many cases for least runtime memory use.

It took about three steps, but one could use a real instance of a program to
improve the program's linking. The same info, of course, could be turned into
performance analysis data, into branch testing data, etc. Very, very nice.
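
[To make the idea concrete: a rough sketch, in C, of the "munging" step.
The Wang trace format and linker-script syntax are not given above, so the
"CALL <function>" input lines below are invented; the program simply emits
function names in first-touch order, which is roughly what a linker that
accepts an ordering file wants to see.]

    /* Sketch only: turn a (hypothetical) trace log of "CALL <function>"
       lines into a link-order list, one symbol per line, in the order
       the functions were first reached. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXFUNCS 4096

    static char *seen[MAXFUNCS];
    static int nseen;

    static int already_seen(const char *name)
    {
        int i;
        for (i = 0; i < nseen; i++)
            if (strcmp(seen[i], name) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        char line[256], kind[32], func[200];

        while (fgets(line, sizeof line, stdin) != NULL) {
            /* expect lines like "CALL draw_widget" or "JMP main_loop" */
            if (sscanf(line, "%31s %199s", kind, func) != 2)
                continue;
            if (strcmp(kind, "CALL") != 0)
                continue;            /* only calls decide placement here */
            if (already_seen(func) || nseen >= MAXFUNCS)
                continue;
            seen[nseen++] = strdup(func);  /* remember first-touch order */
            puts(func);                    /* emit one symbol per line */
        }
        return 0;
    }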

Unfortunately, few chips seem to support the jump/call hardware interrupt, and
even fewer OSes support the trace log feature. Alas, another great service
tossed in the dumper...

    Marc San Soucie
    Portland, Oregon
    marcs@slc.com

sbs@ciara.Frame.COM (Steven Sargent) (01/23/91)

In article <3981@skye.ed.ac.uk>, richard@aiai.ed.ac.uk (Richard Tobin) writes:
|> In article <PCG.91Jan20191955@teacho.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|> >Incidentally, even a 4KB pagesize is still too large, even if it is
|> >barely tolerable. 8KB was simply crazy, for a workstation/time sharing
|> >server.
|> 
|> Large page sizes seem likely to interact badly with programs
|> containing a large number of small procedures, such as X clients.  An
|> obvious improvement would be to have the linker order procedures so that
|> procedures commonly used together were adjacent, and separate from
|> rarely used procedures.
|> 
|> Has this been done in any real systems?
|> 

NeXT's 2.0 linker provides a scheme for ordering procedures.
I've used it to separate out startup code from the rest of the
program and gotten good results (naturally, the numbers aren't
handy.)  Special flags to gprof cause it to present output in a
form useful to the linker.
-- 
Steven Sargent sbs@frame.com	"Frame's phone bill, my opinions."

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/23/91)

On 21 Jan 91 13:56:48 GMT, richard@aiai.ed.ac.uk (Richard Tobin) said:

richard> In article <PCG.91Jan20191955@teacho.cs.aber.ac.uk>
richard> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> Incidentally, even a 4KB pagesize is still too large, even if it is
pcg> barely tolerable. 8KB was simply crazy, for a workstation/time sharing
pcg> server.

richard> Large page sizes seem likely to interact badly with programs
richard> containing a large number of small procedures,

Same for data. Other things being equal, a smaller page size is better
than a larger page size, down to a fairly small limit, mostly for this
reason. Naturally other things are not equal -- smaller page sizes imply
larger page tables, and IO costs are largely independent of page size.
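
[The page-table half of that trade-off in one throwaway calculation; the
16MB address space and 4-byte entry size are illustrative figures, not
numbers from the posting.]

    /* Halving the page size doubles the number of page table entries
       needed to map the same address space. */
    #include <stdio.h>

    int main(void)
    {
        long address_space = 16L * 1024 * 1024;   /* a 16MB process image */
        long pte_bytes = 4;                       /* assume 4-byte entries */
        long page_sizes[] = { 1024, 4096, 8192 };
        int i;

        for (i = 0; i < 3; i++) {
            long entries = address_space / page_sizes[i];
            printf("%5ldB pages: %6ld PTEs, %7ld bytes of page table\n",
                   page_sizes[i], entries, entries * pte_bytes);
        }
        return 0;
    }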

richard> such as X clients.

Ah, also the X server. By private communication I have been told that
the X server typically spends 90% of its time in a 2KB stretch of code.
My reaction was disbelief, as this would imply that the X server's code
working set is around one page, which observation tells us it is not.

I was told that in getting to those 2KB, control touches a large number
of layers of abstraction realized as widely scattered procedures.  In
other words the path leading to the 2KB is fairly short in time but long
in jumps from one page to another... I would surmise that the same is
probably true for the X server's data. Arrrrggghhh.

richard> An obvious improvement would be to have the linker order
richard> procedures so that procedures commonly used together were
richard> adjacent, and separate from rarely used procedures.  Has this
richard> been done in any real systems?

Another lost art! Yes, it used to be popular before Unix. I have seen
several papers on the subject, late sixties to end of the seventies. You
would have tools for profiling and rearranging code in order to minimize
the working set. I have even heard of an Algol 68 compiler with an
"infrequently" pragma, used to tag infrequently executed sections of
code to offline.

As I have already remarked, programming for "locality" is a dark lost
secret nowadays -- we have got the worst aspects of Lisp programming in
languages like C without the benefits!

Memory is cheap, so instead of using it for more powerful applications
it is used for sloppier programming.

I once had a short e-mail discussion with the author about the appalling
inefficiencies intrinsic in the GNU Emacs design decisions. The usual
answer was "who cares! modern machines are so fast!".

The problem is that the GNU Emacs botches do not have a constant cost --
say, an overhead of 100KB or of 1 million needless instructions, which
would become percentage-wise insignificant with time. No, like many
sloppily written programs it has inefficiencies proportional to the size
and speed of the machine/application. And applications grow in size;
once one could complain about GNU Emacs being slow at editing 100KB
buffers and running 200-line Elisp functions on a 68010. Now that we
have machines that are 20 times as fast and large, if we still edit
100KB files and run 200-line Elisp functions GNU Emacs looks fast.

Unfortunately now we want to edit 2MB files and run 4000-line Elisp
functions routinely, and GNU Emacs still looks as slow as it did then,
or even slower, because the inefficiency overhead is more than linear.
So we have the choice of investing the 20-fold improvement in machine
power either in decent responsiveness running the same things we ran on
machines 20 times less powerful, or in running applications that are 20
times as large at the same old slow pace.

Another example is the abject System V expansion-swap technology; the
"cure" suggested by AT&T is to have about 10 times more memory than is
needed for the working sets of active programs, so that the misdesign is
never exercised. OK, memory is cheap -- but the *factor* of 10 remains
there, whatever the application size is, and it irks me for some stupid
reason that I have to devote 90% of my real memory to AT&T's swapping
inanity, instead of to running applications that are ten times as
large.

So, the figleaf "machines are getting faster" really does not cover
much...
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

dmk@craycos.com (David Keaton) (01/24/91)

In article <3981@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes:
>  An
>obvious improvement would be to have the linker order procedures so that
>procedures commonly used together were adjacent, and separate from
>rarely used procedures.
>
>Has this been done in any real systems?

     Multiflow's linker used the output of gprof to optimally scatter
frequently cooperating procedures in the I-cache.  Sometimes this
required placing them in different pages.  Typical speedup was about
10%, although I remember a couple of cases that achieved 20%+.

     It was a real system for a while.

					David Keaton
					dmk@craycos.com
					uunet!dmk3b1!dmk

davecb@yunexus.YorkU.CA (David Collier-Brown) (01/24/91)

| In article <3981@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes:
| An obvious improvement would be to have the linker order procedures so that
| procedures commonly used together were adjacent, and separate from
| rarely used procedures.
| 
| Has this been done in any real systems?

  This was done by the optional ``binder'' on Multics.  Binding was applied
to programs whose performance was important and whose reference patterns were
known.  Other programs linked dynamically as needed.

--dave
-- 
David Collier-Brown,  | davecb@Nexus.YorkU.CA | lethe!dave
72 Abitibi Ave.,      | 
Willowdale, Ontario,  | Even cannibals don't usually eat their
CANADA. 416-223-8968  | friends. 

schwartz@groucho.cs.psu.edu (Scott Schwartz) (01/24/91)

In article <3981@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes:
| An obvious improvement would be to have the linker order procedures ...
| Has this been done in any real systems?

Primos did (still does, I guess) this for a number of common programs,
like the line editor.

rro@debussy.cs.colostate.edu (Rod Oldehoeft) (01/24/91)

In article <9840@frame.UUCP> sbs@ciara.Frame.COM (Steven Sargent) writes:

>In article <3981@skye.ed.ac.uk>, richard@aiai.ed.ac.uk (Richard Tobin) writes:

>|> In article <PCG.91Jan20191955@teacho.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

    [A little deleted here.]

>|> 
>|> Large page sizes seem likely to interact badly with programs
>|> containing a large number of small procedures, such as X clients.  An
>|> obvious improvement would be to have the linker order procedures so that
>|> procedures commonly used together were adjacent, and separate from
>|> rarely used procedures.
>|> 
>|> Has this been done in any real systems?
>|> 
>
>NeXT's 2.0 linker provides a scheme for ordering procedures.
>I've used it to separate out startup code from the rest of the
>program and gotten good results (naturally, the numbers aren't
>handy.)  Special flags to gprof cause it to present output in a
>form useful to the linker.

Before we all get out chisels and find a roundish rock, here are some
references to the idea of restructuring programs for virtual memory:

   Brawn and Gustavson, Program behavior in a paging environment, Proc.
   AFIPS FJCC 33, 1968: 1019-1032.
   
   Comeau, L., A study of the effect of user program optimization in a
   paging system, Proc. 1st ACM Symp.  OS Principles, 1967.
   
   Ferrari, D. and M. Kobayashi, Program restructuring algorithms for
   global LRU environments, Proc. Int. Comp. Symp., 1977: 277-283.
   
   Hatfield, D. J. and J. Gerald, Program restructuring for virtual
   memory, IBM Sys. J. 10, 3, 1971: 168-192.
   
   Johnson, J. W., Program restructuring for virtual memory systems,
   Project MAC TR-148, MIT, 1975.
   
   McKellar, A. and E. G. Coffman, The organization of matrices and
   matrix operations in a paged multiprogramming environment, CACM 12, 3,
   1969: 153-165.

Notice how old these references are.  There was great concern that
automatic memory management via paging would not be as efficient as
its predecessor, manually designed and explicitly executed overlaying.

This research fell off because it was suggested that the advent of
electronic auxiliary memory would cause page sizes to decrease, and that
working sets would therefore more closely approximate the actually
needed portions of the program in real memory, so thinking about this
was unnecessary.  And of course, memory was getting to be nearly
free, wasn't it?

Of course this didn't happen.  Disks with rotational latency delays
didn't go away, so large page sizes continued.  Disk technology
continued to improve dramatically and unexpectedly, so the price/
performance advantage over electronic backing store continued.
Also, applications got bigger much faster than memory prices dropped.

When I co-authored "Operating Systems--Advanced Concepts" I included
a section covering Johnson's work (reference above), partly because
no one else seemed to think it was important any more, but I did.
In general, automatic restructuring based on execution trace data
can reduce page faults by 50% over a "bad" virtual memory ordering,
for a broad range of available real memory sizes.  Since orderings
are now random, how do you know yours aren't bad?

{{{   No need for you to rush out and buy this book, published by
Benjamin-Cummings, 1987.  Really.  First author is Maekawa.  Really
no need to buy it.   :->   }}}

On a more recent note than the references above, the MIPS f77
compiler, which has only the name in common with the original
AT&T f77, allows profiling of an execution (via pixie) and subsequent
recompilation that uses the profile data to reorganize the text
segment for better I-cache performance.  Same concept; maybe
pixie output could be input to Johnson's algorithms.


Rod Oldehoeft                    Email: rro@cs.colostate.edu
Computer Science Department      Voice: 303/491-5792
Colorado State University        Fax:   303/491-2293
Fort Collins, CO  80523

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/25/91)

In article <PCG.91Jan23155035@athene.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

| richard> An obvious improvement would be to have the linker order
| richard> procedures so that procedures commonly used together were
| richard> adjacent, and separate from rarely used procedures.  Has this
| richard> been done in any real systems?
| 
| Another lost art! Yes, it used to be popular before Unix. I have seen
| several papers on the subject, late sixties to end of the seventies. You

  Well, not totally lost. I just looked on a SCO machine and there are
two things in the compiler which allow the user to group code in the
address space. The first is an incremental link flag which allows
building large object files from small ones, and sounds hard to use; the
second is the ability to specify the segment name for the text and local
data. This would allow grouping of init, exception handling, and wrapup
code, or any other independent part of the program.

  The term segment here doesn't refer to 386 hardware segmentation;
the 386 still uses only one text and one data segment, so you're
limited to 4GB for text and 4GB for stack/data.
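
[For comparison only -- this is not the SCO compiler's syntax: GNU C can
attach a named text section to individual functions, which gives the same
kind of grouping of init/exception/wrapup code away from the hot path.  The
section names below are made up.]

    /* Rarely used routines go to their own sections, so the linker can
       keep them off the pages that hold the workhorses. */
    void setup_tables(void) __attribute__((section(".text.init")));
    void fatal_error(const char *msg) __attribute__((section(".text.rare")));

    void setup_tables(void)
    {
        /* one-shot initialization: touched once, then never again */
    }

    void fatal_error(const char *msg)
    {
        /* wrapup/exception code: almost never executed */
        (void)msg;
    }

    int hot_loop(int n)
    {
        /* the workhorse stays in the default .text section */
        int i, sum = 0;
        for (i = 0; i < n; i++)
            sum += i;
        return sum;
    }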
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "I'll come home in one of two ways, the big parade or in a body bag.
   I prefer the former but I'll take the latter" -Sgt Marco Rodrigez

bruce@cs.su.oz (Bruce Janson) (01/25/91)

Marc,

In article <azofdo.b1j@wang.com> marcs@slc.com (Marc San Soucie) writes:
>..
>One of the real goodies, which I have yet to see in a UNIX system, was a
>service provided by the Kernel (and making liberal use of a nifty 68020/30
>hardware feature) whereby one could run one's program under the auspices of the
>debugger..

On the newer RISC chips the instruction sets and their encodings
are fairly straightforward.
It's not too much trouble to write a simulator that runs your
program (albeit slowly), printing out the jump/call info and
thereby avoiding the need for OS support.
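
[Since no particular chip or trace format is named, here is only the shape
of such a simulator, for an invented four-instruction machine; a real one
would decode an actual RISC encoding, but the jump/call logging loop is the
same.]

    #include <stdio.h>

    enum op { OP_ADD, OP_JMP, OP_CALL, OP_RET, OP_HALT };

    struct insn {
        enum op op;
        int     arg;            /* jump/call target, or value to add */
    };

    int main(void)
    {
        /* hard-wired toy program: "main" at 0, a subroutine at 4 */
        static const struct insn prog[] = {
            { OP_ADD,  1 },     /* 0 */
            { OP_CALL, 4 },     /* 1 */
            { OP_JMP,  6 },     /* 2 */
            { OP_HALT, 0 },     /* 3: never reached */
            { OP_ADD,  2 },     /* 4: subroutine body */
            { OP_RET,  0 },     /* 5 */
            { OP_HALT, 0 },     /* 6 */
        };
        int pc = 0, acc = 0, link = -1;

        for (;;) {
            const struct insn *i = &prog[pc];
            switch (i->op) {
            case OP_ADD:
                acc += i->arg;
                pc++;
                break;
            case OP_JMP:
                fprintf(stderr, "JMP  %d -> %d\n", pc, i->arg);
                pc = i->arg;
                break;
            case OP_CALL:
                fprintf(stderr, "CALL %d -> %d\n", pc, i->arg);
                link = pc + 1;  /* single-depth link register */
                pc = i->arg;
                break;
            case OP_RET:
                fprintf(stderr, "RET  %d -> %d\n", pc, link);
                pc = link;
                break;
            case OP_HALT:
                printf("acc = %d\n", acc);
                return 0;
            }
        }
    }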

Cheers,
bruce.

Bruce Janson
Basser Department of Computer Science
University of Sydney
Sydney, N.S.W., 2006
AUSTRALIA

Internet:	bruce@cs.su.oz.au
Telephone:	+61-2-692-3272
Fax:		+61-2-692-3838

mash@mips.COM (John Mashey) (01/26/91)

In article <MOSS.91Jan21172107@ibis.cs.umass.edu> moss@cs.umass.edu writes:
>MIPS systems have a tool that reorganizes code for better *cache* behavior.
>The same idea would presumably work for paging (and TLB) stress reduction ...

Yes,  although as it stands, the tool only does it for code, not data;
for many programs, code impact on paging and TLB is much less than
data impacts .... or in some cases, the code impact is horrible, but
there's nothing in the world that can be done about it except to have
huge caches, and even better, fast cache refill.  (Consider the kind of
program that's >200MB of code, much of it in a giant single loop,
leading to a high I-cache miss rate.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

gdtltr@chopin.udel.edu (root@research.bdi.com (Systems Research Supervisor)) (01/26/91)

In article <45242@mips.mips.COM> mash@mips.COM (John Mashey) writes:
=>
=>Yes,  although as it stands, the tool only does it for code, not data;
=>for many programs, code impact on paging and TLB is much less than
=>data impacts .... or in some cases, the code impact is horrible, but
=>there's nothing in the world that can be done about it except to have
=>huge caches, and even better, fast cache refill.  (Consider the kind of
=>program that's >200MB of code, much of it in a giant single loop,
=>leading to a high I-cache miss rate.)

   I'm not an expert, so please pardon me if this is a stupid or obvious
question.
   Is any work being done in dynamically adjusting cache policies? In
other words, are there any caches out there that can detect this sort of
thing and adjust the line size to reduce refill overhead?

                                        Gary Duzan
                                        Time  Lord
                                    Third Regeneration



-- 
                            gdtltr@brahms.udel.edu
   _o_                      ----------------------                        _o_
 [|o o|]   Two CPU's are better than one; N CPU's would be real nice.   [|o o|]
  |_o_|           Disclaimer: I AM Brain Dead Innovations, Inc.          |_o_|

moss@cs.umass.edu (Eliot Moss) (01/27/91)

I doubt that dynamic policy adjustment is looked at very seriously, since
we're talking about algorithms that are committed to very fast hardware, and
at least partly because caches are frequently on a critical path (*please*, I
do not mean to start some kind of flame war over the detailed veracity of
this phrase).
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu

tbray@watsol.waterloo.edu (Tim Bray) (01/28/91)

mash@mips.COM (John Mashey) writes:
 (Consider the kind of
 program that's >200MB of code, much of it in a giant single loop,
 leading to a high I-cache miss rate.)

PLEASE tell us that's a thought experiment, John.  Does the world really
contain such appalling creations?  If so, I think a taxonomy would be of
general interest to the readers here...

(Mind you, in our group, we have occasionally executed large Deterministic
Finite Automata very fast by compiling them into a few Mb of branch
statements, which effectively trashes any I-cache with a relatively small
volume of code - but it's still the fastest way to run a big DFA).

Cheers, Tim Bray, Open Text Systems

marcs@slc.com (Marc San Soucie) (01/29/91)

Bruce Janson writes:

> Marc San Soucie writes:

> > One of the real goodies, which I have yet to see in a UNIX system, was a
> > service provided by the Kernel (and making liberal use of a nifty 68020/30
> > hardware feature) whereby one could run one's program under the auspices of
> > the debugger, and cause it to emit a trace log of all jumps and calls.

> On the newer RISC chips the instruction sets and their encodings
> are fairly straightforward.
> It's not too much trouble to write a simulator that runs your
> program (albeit slowly), printing out the jump/call info., and
> thereby avoiding the need for os support.
>
> Cheers,
> Bruce Janson
> Basser Department of Computer Science
> University of Sydney
> Sydney, N.S.W., 2006
> AUSTRALIA

Right, but this kind of technique is even less portable than one which relies
on hardware support that could be generic if more hardware provided it. Also,
the beauty of OS-based analysis support is that it can be performed on-site in
a customer situation. Since customer performance nightmares can seldom be
reproduced back in the manufacturer's lab, on-site analysis is extremely handy.

Also, the OS-based support, with hardware assist, is leagues faster - in our
case about 10% of original speed - than a software simulation would be.

    Marc San Soucie
    Servio Corporation
    Beaverton, Oregon
    marcs@slc.com

lm@slovax.Berkeley.EDU (Larry McVoy) (01/29/91)

In article <azofdo.b1j@wang.com>, marcs@slc.com (Marc San Soucie) writes:
|> Richard Tobin writes:
|> 
|> > Piercarlo Grandi writes:
|> 
|> > > Incidentally, even a 4KB pagesize is still too large, even if it is
|> > > barely tolerable. 8KB was simply crazy, for a workstation/time sharing
|> > > server.
|> 
|> > Large page sizes seem likely to interact badly with programs
|> > containing a large number of small procedures, such as X clients.  An
|> > obvious improvement would be to have the linker order procedures so that
|> > procedures commonly used together were adjacent, and separate from
|> > rarely used procedures.
|> >
|> > Has this been done in any real systems?
|> 

The SunOS libc has done this for years.  I believe that the X people are
also considering this sort of optimization.  I do not know of any system
that does it dynamically.  I'd like to hear of such a system if anyone
knows of one.

$ head -20 lorder-sparc
#
# @(#)lorder-sparc 1.2 88/02/08 SMI
#
# This list based upon the dynamic behavior of libc on a sun4.
# It is mostly accurate for other architectures too.
# The list was generated by a tool.
#
stret4.o
stret2.o
recvfrom.o
rpc_dtablesize.o
rpc_commondata.o
sendto.o
rpc_prot.o
bcopy.o
multiply.o
xdr.o
gettimeofday.o
auth_none.o
xdr_mem.o

---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

dmk@craycos.com (David Keaton) (01/29/91)

In article <1991Jan27.214522.24408@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
>mash@mips.COM (John Mashey) writes:
> (Consider the kind of
> program that's >200MB of code, much of it in a giant single loop,
> leading to a high I-cache miss rate.)
>
>PLEASE tell us that's a thought experiment, John.  Does the world really
>contain such appalling creations?  If so, I think a taxonomy would be of
>general interest to the readers here...

     I can imagine 200MB outer loops;  there are certainly at least
near megabyte inner loops in real code out there.  Back at Multiflow I
remember an actual user's code that simulated the aerodynamic
properties of a cyclical device.  (To protect the guilty, I won't say
what the device was.)  There was one enormous inner loop.  When the
device got past a certain point in its cyclical behavior, you could
actually watch the cache-miss lights come on in the back of the
machines as the loop extended way beyond the length of the I-cache.
That VLIW I-cache held the equivalent of about 56k RISC instructions
on Multiflow's smallest machine, with a memory representation of the
cache size taking about 512kB.

					David Keaton
					dmk@craycos.com
					uunet!dmk3b1!dmk

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/05/91)

On 28 Jan 91 21:13:20 GMT, lm@slovax.Berkeley.EDU (Larry McVoy) said:

Piercarlo Grandi writes:

pcg> Incidentally, even a 4KB pagesize is still too large, even if it is
pcg> barely tolerable. 8KB was simply crazy, for a workstation/time
pcg> sharing server.

Richard Tobin writes (attribution uncertain [pcg]):

tobin> Large page sizes seem likely to interact badly with programs
tobin> containing a large number of small procedures, such as X clients.
tobin> An obvious improvement would be to have the linker order
tobin> procedures so that procedures commonly used together were
tobin> adjacent, and separate from rarely used procedures.

tobin> Has this been done in any real systems?

lm> The SunOS libc has done this for years.

A well documented example is the System V shared libc_s. Reading the
relevant paper in the System V programmer's guide is highly
enlightening, especially the warning that shared libraries, precisely
because of locality of reference considerations on large page machines,
are not always a win.

    Explanation: if you fault in an 8KB page of a shared library to
    access a 200 byte function, you have probably lost more memory than
    by linking the 200 bytes inline with each executable that uses them,
    unless you have more than 40 different such executables running at
    any one time. Naturally if the page had already been faulted in you
    don't lose as badly. Decisions on physical clustering in shared
    memory environments are hard to make. I am horrified by the cavalier
    approach used by many builders of shared libraries on systems that
    do have them. At least Sun do something about libc.
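
[The arithmetic behind that "more than 40", using only the figures from the
paragraph above, as a throwaway snippet:]

    #include <stdio.h>

    int main(void)
    {
        int page_bytes = 8192;   /* page size assumed above */
        int func_bytes = 200;    /* size of the shared function */

        /* number of executables that must be carrying private copies
           before one faulted-in shared page is the cheaper option */
        printf("break-even: about %d executables\n",
               page_bytes / func_bytes);
        return 0;
    }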
 
lm> I believe that the X people are also considering this sort of
lm> optimization.

The X server has particularly, horrifyingly poor locality of reference
patterns; the X libraries are built with total disregard for physical
clustering. That explains why things like Xserver and xclock, which
should have a working set of maybe two or three pages, actually require
hundreds. I suspect that things like GNU CC, GNU Emacs, SunOS,
Ultrix, System V, X, NeWS, NeXTStep and so on are make-work schemes for
Toshiba, Hitachi and Fujitsu engineers.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

md@HQ.Ileaf.COM (Mark Dionne x5551) (02/05/91)

I have done some experiments with Interleaf (a large publishing program
written in C), and found that for many usage patterns, typically about 25%
of the code that is paged-in is actually touched. This can mean that up to
2 meg of memory is often being "wasted". Simply reordering the .o files 
helped improve things by about 10 to 20%, but structured code tends to put
boring initialization routines next to workhorses, etc., preventing one
from getting a lot of improvement.

One thing that would help out here would be a compiler switch that would
produce multiple .o files for a single .c file (one .o file per function). 
It would then be possible to set up an automated reordering scheme where
gprof output would be used to reorder the functions. (Static functions
would optionally be either coalesced or treated as global.) An ideal 
candidate would be gcc. Any volunteers?

(I've actually tried this by writing a program to split up compiler
assembler output. It worked, but is painful and not portable.)
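
[Mark's splitter isn't shown, but the idea is easy to sketch.  Assume an
assembler output in which every global function is introduced by a
".globl <name>" directive -- real compiler output also has data sections,
static functions and header directives that a usable splitter must handle,
which is exactly why he calls it painful and unportable.]

    /* Copy an assembler listing from stdin into one output file per
       global function, switching files at each ".globl <name>". */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[512], name[256], fname[300];
        FILE *out = NULL;

        while (fgets(line, sizeof line, stdin) != NULL) {
            if (sscanf(line, " .globl %255s", name) == 1) {
                if (out != NULL)
                    fclose(out);
                sprintf(fname, "%s.s", name);  /* one fragment per function */
                out = fopen(fname, "w");
                if (out == NULL) {
                    perror(fname);
                    return 1;
                }
            }
            if (out != NULL)      /* lines before the first .globl are
                                     dropped in this sketch */
                fputs(line, out);
        }
        if (out != NULL)
            fclose(out);
        return 0;
    }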

As someone hinted at, X servers would probably be excellent candidates
for this treatment. I've heard rumors that Sun has been doing this.
-- 
  md@HQ.Ileaf.COM (Internet)	  Mark Dionne, Interleaf
  uunet!leafusa!md (uucp)	  9 Hillside Ave, Cambridge, MA 02154
					(617) 290-4943 x5551

news@NeXT.COM (news) (02/05/91)

In article <PCG.91Feb4170042@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo  
Grandi) writes:
> A well documented example is the System V shared libc_s. Reading the
> relevant paper in the System V programmer's guide is highly
> enlightening, especially the warning that shared libraries, precisely
> because of locality of reference considerations on large page machines,
> are not always a win.
> 
>     Explanation: if you fault in an 8KB page of a shared library to
>     access a 200 byte function, you have probably lost more memory than
>     by linking the 200 bytes inline with each executable that uses them,
>     unless you have more than 40 different such executables running at
>     any one time. Naturally if the page had already been faulted in you
>     don't lose as badly. Decisions on physical clustering in shared
>     memory environments are hard to make. I am horrified by the cavalier
>     approach used by many builders of shared libraries on systems that
>     do have them. At least Sun do something about libc.
>  
NeXTStep relies on two shared libraries for most of the applications and user
programs in our system, libsys_s and libNeXT_s.  For the second release of
our system software, we made modifications to our linker that let us state
directly the load order within these shlibs.  We then wrote a number of tools
that allowed us to gather statistics from running apps to determine which
shared library functions were being used most frequently.  We were able to
reduce the shared library resident set size of our system shlib by about 25%
and of our application kit shlib by about 40%.  By applying these same tools
to our Workspace manager, windowserver, and all of the daemons that our
system needs, we were able to reclaim about another meg of physical memory.

The bottom line was that, compared to 1.0, this had a significant effect on
application launching speed and on the time that it takes to reactivate
applications.  In the market we are in, perceived user interface performance
is everything.

Morris Meyer		NeXT, Inc.
Software Engineer	900 Chesapeake Drive
NeXT OS Group		Redwood City, CA   94063
mmeyer@next.com	415-780-4683

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/09/91)

On 4 Feb 91 19:09:49 GMT, md@HQ.Ileaf.COM (Mark Dionne x5551) said:

md> I have done some experiments with Interleaf (a large publishing
md> program written in C), and found that for many usage patterns,
md> typically about 25% of the code that is paged-in is actually
md> touched.  This can mean that up to 2 meg of memory is often being
md> "wasted".

Not surprising; there are lots of statistics that say that on average
only 200 instructions are executed before a "long jump" is taken;
this means that small page sizes tend to reduce working sets
dramatically. A concise summary of the "armchair" evidence on this is in
Shaw's "Logical Design of Operating Systems", in the section on paging
performance (page 232 in the 2nd edition).

It would be amusing for you to repeat the same measurements on different
machines, e.g. a Gould PN1 with 16KB/32KB pages, a Sun 3 with 8KB pages,
and all the way down to a BSD VAX with 1KB pages. If you do, please send
a report to SIGOPS or SIGARCH! (post it here first :->). A large amount
of statistics already exists on this subject, but they are at times
twenty years old. It would be interesting to see them confirmed with
data on more recent applications/languages.

md> Simply reordering the .o files helped improve things about 10 to
md> 20%, but structured code tends to put boring initialization routines
md> next to workhorses, etc., preventing one from getting a lot of
md> improvement.

I am fond of mentioning another nice idea similar to 'register', and
with the same excellent benefit/cost ratio: I have been told that some
Algol 68 compiler had a 'rarely' (executed) PRAGMA so that the compiler
would generate the relevant code "offline", thus making the most
frequently executed path as streamlined as possible.

md> One thing that would help out here would be a compiler switch that would
md> produce multiple .o files for a single .c file (one .o file per function). 

Actually the better idea is probably to have smart linkers; but, lacking
those, this is not a bad idea. I used to be unhappy with the GNU C++
streams implementation, which was very monolithic, so I wrote an
implementation that had lots of small source files. Space occupied, and
presumably time, decreased dramatically.

md> As someone hinted at, X servers would probably be excellent candidates
md> for this treatment. I've heard rumors that Sun has been doing this.

Actually I think that MIT are doing it. I have been told by somebody
else that most of the time X is executing the same 2KB stretch of
code...  Another terribly bad offender is GNU Emacs. Actually most any
contemporary program. We do have 'time' profilers readily available,
some as sophisticated as pixie and gprof, but I am not aware of *any*
readily available 'locality' profiler. Control flow profilers like the
two mentioned above can be used to optimize placement of functions; MIPS
uses pixie to optimize for caching, for example.


As an overall comment, I find this posting and the earlier one by a
person from NeXT very refreshing: some people *do* understand and care
about engineering and cost/benefit ratios. It is encouraging.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

davecb@yunexus.YorkU.CA (David Collier-Brown) (02/11/91)

On 4 Feb 91 19:09:49 GMT, md@HQ.Ileaf.COM (Mark Dionne x5551) said:
md> I have done some experiments with Interleaf (a large publishing
md> program written in C), and found that for many usage patterns,
md> typically about 25% of the code that is paged-in is actually
md> touched.  This can mean that up to 2 meg of memory is often being
md> "wasted".

pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
| Not surprising; there are lots of statistics that say that on average
| only 200 instructions are executed before a "long jump" is taken;
| this means that small page sizes tend to reduce working sets
| dramatically. 
[...] 							A large amount
| of statistics already exists on this subject, but they are at times
| twenty years old. It would be interesting to see them confirmed with
| data on more recent applications/languages.

  Well, some of the twenty-year-old data came from a (pre)linker on
ICL's George III, which I remember fuzzily...
  Much more recently (5 years ago) I had the doubtful pleasure of linking C
code for a large, multi-module program for the IBM Poisonous Computer while
working on Xanaro's ``Ability''.  We were using a program which evolved from
an overlay scheme, but had been rewritten into a pager with the same
interface by one of our senior people, Mr. Andrew Forber.

  This didn't actually generate statistics, but did allow us to manually
reorder code by locality of reference. Since we were writing classical
``structured'' code and had a compiler which implemented static-to-file by
placing multiple functions in a single .o file, you understand we tended to
discover the dynamic call graph of the programs very quickly...

  The 200 instruction figure was approximately correct for all non-unrolled
code, and quite low for things which we found on the critical path and
expanded inline (such as file i/o, paging and the compiler/interpreter
engine).
  About 20% of the code was initializers of some sort or another, and could
usefully be moved to ``pages'' which were loaded at startup time, called by
the main() routine and then were never referenced again.  

  As the system was partitioned into several large modules, we found three
definite orders of locality of reference.  
  The weakest was within the module as a whole: the user interface used the
user-visible operations which used the primitives.
  The second weakest was within the primitives, many of which interacted in
interesting ways (eg, to preserve consistency), and tended to call each other.
  The strongest was between the user-level operations and the primitives. A
given user-level operation tended to call the same set of primitives over and
over again. This was true of all the modules, not just the ones which were
``similar''.
  Indeed, we noticed a counterintuitive effect: a few user-level operations
called primitives which we expected to be called only by entirely different
modules!  The text editor often found its file i/o operations calling
primitives bound with the spreadsheet, because of a reference from one file's
data to another.  We first discovered this when the speed of file i/o
suddenly fell to about 1 line/sec, accompanied by much thrashing as two
routines started fighting over the same ``page''.
   Breaking the modules up into smaller and smaller ``pages'' yielded
order-of-magnitude improvements in performance.  Hand-tuning generally gave
us two orders of magnitude (or more correctly, cost us two orders of
magnitude when I did it wrong (:-)).


  This was one of the ``small'' things referred to by Dick McMurray's dictum:

	First you make it right,
	then you make it fast,
	then you make it **small**.

--dave
  
-- 
David Collier-Brown,  | davecb@Nexus.YorkU.CA | lethe!dave
72 Abitibi Ave.,      | 
Willowdale, Ontario,  | Even cannibals don't usually eat their
CANADA. 416-223-8968  | friends. 

cprice@mips.COM (Charlie Price) (02/11/91)

In article <1991Feb4.190949.1190@HQ.Ileaf.COM> md@HQ.Ileaf.COM (Mark Dionne x5551) writes:
>
>I have done some experiments with Interleaf (a large publishing program
>written in C), and found that for many usage patterns, typically about 25%
>of the code that is paged-in is actually touched. This can mean that up to
>
>One thing that would help out here would be a compiler switch that would
>produce multiple .o files for a single .c file (one .o file per function). 
>It would then be possible to set up an automated reordering scheme where
>gprof output would be used to reorder the functions. (Static functions
>would optionally be either coalesced or treated as global.) An ideal 
>candidate would be gcc. Any volunteers?

MIPS has shipped this basic capability (in a slightly different form)
for years.  The MIPS tool is primarily used to rearrange routines
during linkage so that the most active routines don't conflict with
each other in the instruction cache.  It lays them out sequentially
from most active to least active (up to some limit), and this would
achieve the effect you talk about.

Here is how it works:

cc -O -o gleep gleep.c

	produces gleep.
	Create (or find) the executable you are interested in...

pixie gleep

	Create an instrumented version of the executable with pixie.
	The instrumented version, named OBJNAME.pixie by default, does the
	same function that the original program did, and also keeps statistics
	about what goes on inside the program -- in particular
	it keeps execution counts for basic blocks.
	Pixie also writes out a "map" file (OBJNAME.Addrs) that is needed to
	reconcile the basic block structure to program labels.

gleep.pixie program_arguments...

	Run the instrumented program producing a statistics file
	named OBJNAME.Counts (here gleep.Counts).
	Multiple runs (with different inputs, for example),
	can be handled, but you must move the gleep.Counts file
	"out of the way" before each run.

prof -pixie -feedback feedback_file_name gleep gleep.Addrs gleep.Counts...

	Process the gathered statistics file(s) (one or more X.Counts files)
	and produce a "feedback" file that can be used by the compiler
	system.  The basic computation is to figure out which
	relocatable objects (functions) are used the most from
	the basic block count information.

cc -O -o gleep.new -cord -feedback feedback_file_name gleep.c

	Recreate the executable, running the "cord" rearranger pass
	using the specified feedback file.


As an aside, pixie is really for program performance analysis;
getting information for rearrangement was not its primary purpose.
For instance, you can discover which lines in a source program
use up the largest number of cycles.

prof -pixie -heavy glorp

produces the following information:

Profile listing generated Sun Feb 10 15:35:20 1991 with:
   prof -pixie -heavy glorp

blkclr and bzero (bzero.s) synonymous: using latter
----------------------------------------------------------------------------
*  -h[eavy] using basic-block counts;                                      *
*  sorted in descending order by the number of cycles executed in each     *
*  line; unexecuted lines are excluded                                     *
----------------------------------------------------------------------------

procedure (file)                           line bytes     cycles      %  cum %

lwputstr (glorp.c)                         1016   216    1632974  11.67  11.67
pfile (glorp.c)                             754    76    1564005  11.17  22.84
pfile (glorp.c)                             766    88    1354976   9.68  32.52
pfile (glorp.c)                             753    44    1121371   8.01  40.53
pfile (glorp.c)                             853    80    1121311   8.01  48.54
pfile (glorp.c)                             762    40     741465   5.30  53.84
lwputstr (glorp.c)                         1014    24     648692   4.63  58.48
lwputstr (glorp.c)                         1019    20     538362   3.85  62.32
pfile (glorp.c)                             761    32     342904   2.45  64.77
pfile (glorp.c)                             849    72     336097   2.40  67.17
... cut off here...
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

suitti@ima.isc.com (Stephen Uitti) (02/13/91)

In article <PCG.91Feb8204700@teacho.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>On 4 Feb 91 19:09:49 GMT, md@HQ.Ileaf.COM (Mark Dionne x5551) said:
>md> Simply reordering the .o files helped improve things about 10 to
>md> 20%, but structured code tends to put boring initialization routines
>md> next to workhorses, etc., preventing one from getting a lot of
>md> improvement.

pg>I am fond of mentioning another nice idea similar to 'register', and
pg>with the same excellent benefits/costs ratio: I have been told that some
pg>Algol 68 compiler had a 'rarely' (executed) PRAGMA so that the compiler
pg>would generate the relevant code "offline", thus making the most
pg>frequently executed path as streamlined as possible.

>md> One thing that would help out here would be a compiler switch that would
>md> produce multiple .o files for a single .c file (one .o file per function). 

pg>Actually the better idea is probably to have smart linkers; but, lacking
pg>those, this is not a bad idea. I used to be unhappy with the GNU C++
pg>streams implementations, which was very monolithic, so I wrote an
pg>implementations that had lots of small source files. Space occupied, and
pg>presumably time, decreased dramatically.

If you have the compiler generate multiple .o's for .c's, you
need some way to tell the linker, "look here first" when
resolving symbol addresses.  In C, if you reference a function,
typically the linker looks for the symbol within the same .o first.

If the compiler produces a list saying which symbols in each .o
are global, static, or static but relevant to these other .o's,
then users might accidentally obtain more useful libraries.

Imagine you are writing a subroutine library.  You want function
x() to be available to other routines within the library, but you
don't want it to be directly available to the caller.  If you
declare it static, it is only available to other functions within
that source file.  If you copy it to each source that needs it,
you might end up with several copies.  If you make your library
from all one source, then you may get all sorts of functions
linked into the final object that are never referenced.

If you can build a library, and say "these are the library
external symbols", using a list that is perhaps outside of the C
sources, then you can build better libraries.  If it is easy to
do, maybe more libraries will be built.

For that matter, sometimes it is more convenient to stuff lots of
functions into a single source file.  It'd be real nice if the
functions that aren't referenced weren't linked into the user's
application.  Again, if it is easy (or free) to do it right, more
programmers will do it right.

A smart linker could do the reordering, or even dead function
elimination, but an optional external list would make library
writing easier to do right, and bullet proof libraries could be built.
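
[A small illustration of the visibility gap being described.  The file and
function names are invented; the point is that C's "static"/"extern"
distinction cannot express "external to the library's own files, but not to
its callers".]

    /* libmine/alpha.c -- the intended public entry point */
    extern int helper(int);      /* internal to the library in intent only */

    int alpha(int x)
    {
        return helper(x) + 1;
    }

    /* libmine/helper.c -- cannot be static, since alpha.c needs it,
       so every user of libmine.a can link to it (or collide with it) too */
    int helper(int x)
    {
        return x * 2;
    }

    /* What the posting asks for is a librarian step driven by an export
       list (say a file containing just the line "alpha") that demotes
       every unlisted symbol, such as helper, to a library-local one. */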

One of the first interesting debugging exercises I witnessed had
to do with a novice C user, who typed in the "index()" function
in the K&R C book.  He was amazed that I/O no longer worked
right.  The K&R index() took two string arguments.  The libc
version takes a string and a character.  Somewhere in stdio,
index() was invoked.  Since the user had already defined it, the
linker didn't load the libc version...

Imagine if the linker were smart enough to know that libc should
be hooked up to the libc version of "index()", and the user's
code should be hooked up to his/her own version of "index()".

I've heard that some big company, GM or someone, has done
something like this.  Whoever it was hasn't released it for
the world.  We need people who publish software to do it.

Then we could start working for world peace or something.

Stephen.
suitti@ima.isc.com
"We Americans want peace, and it is now evident that we must be
prepared to demand it.  For other peoples have wanted peace, and
the peace they received was the peace of death." - the Most Rev.
Francis J. Spellman, Archbishop of New York.  22 September, 1940

milton@en.ecn.purdue.edu (Milton D Miller) (02/13/91)

In an effort to reorder object code to provide smaller working sets,
md@HQ.Ileaf.COM (Mark Dionne x5551) said:
>>md> One thing that would help out here would be a compiler switch that would
>>md> produce multiple .o files for a single .c file (one .o file per function).

To which suitti@ima.isc.com (Stephen Uitti) responded in article
<1991Feb12.164056.8865@dirtydog.ima.isc.com> with:

>If you have the compiler generate multiple .o's for .c's, you
>need some way to tell the linker, "look here first" when
>resolving symbol addresses.  In C, if you reference a function,
>typically the linker looks for the symbol within the same .o first.
>
>If the compiler produces a list of what symbols that were in each
>.o were global, static, static but relevant to these other .o's,
>then user's might accidentally obtain more useful libraries.
>

[Talks about writing a subroutine library, and the want
to have some functions accessable only internally]

>
>If you can build a library, and say "these are the library
>external symbols", using a list that is perhaps outside of the C
>sources, then you can build better libraries.  If it is easy to
>do, maybe more libraries will be built.

[Also talks about eliminating unused functions from a source file]

>A smart linker could do the reordering, or even dead function
>elimination, but an optional external list would make library
>writing easier to do right, and bullet proof libraries could be built.
>

[Story of C novice who wrote index() from K&R breaking stdio, which
used a function with that name but different arguments]

>Imagine if the linker were smart enough to know that libc should
>be hooked up to the libc version of "index()", and the user's
>code should be hooked up to his/her own version of "index()".
>
>I've heard that some big company, GM or someone, has done
>something like this.  Whoever it was hasn't released it for
>the world.  We need people who publish software to do it.
>
>Stephen.
>suitti@ima.isc.com

I waited a few hours to post because of my lack of first-hand
knowledge, but since no one has brought it up, I will.

Although not quite all that is necessary, the shared libraries in
AIX 3.1 (which runs on the IBM RS/6000) do have the notion that
only some functions should be "exported" to other modules.  The
list of symbols to be exported is placed in a file provided at
link time.  There is also the concept of an import list, which
lists the names of functions that are available somewhere else
(for example, in the kernel) (I think).  The biggest complaint I
have seen in comp.unix.aix (besides the data space used by the
linker) is that you can't use your own malloc, because the library
already has one linked into it (actually you could use your own
malloc, if you are prepared not to use anything from libc).

Although this does not solve all the linking problems, it is a place
to start, and it does solve the problem with index in the example
given above.  I don't know whether the import files would allow a
single C file to be broken up and still preserve static variable
binding or not.  Maybe someone who has an RS/6000 could try?
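
[From memory, and so possibly misspelled: the export list is a plain text
file of symbol names, one per line, handed to the AIX link editor with
something like -bE:shr.exp (and -bI:file for imports).  Treat the option
spellings as assumptions; the file format is the interesting part.]

    # shr.exp -- symbols this shared library exports; everything else
    # (the private malloc, internal helpers, ...) stays hidden
    open_widget
    draw_widget
    close_widget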

milton

Disclaimer: I have only read about the linker; the only time I used
it (when I had access to it) was to compile a simple c program written
by someone else.

PS: Source licenses are supposed to be available about a year after
the machine started shipping...

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/14/91)

On 12 Feb 91 16:40:56 GMT, suitti@ima.isc.com (Stephen Uitti) said:

suitti> In article <PCG.91Feb8204700@teacho.cs.aber.ac.uk>
suitti> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

On 4 Feb 91 19:09:49 GMT, md@HQ.Ileaf.COM (Mark Dionne x5551) said:

md> One thing that would help out here would be a compiler switch that
md> would produce multiple .o files for a single .c file (one .o file
md> per function).

pg> Actually the better idea is probably to have smart linkers; but,
pg> lacking those, this is not a bad idea. I used to be unhappy with the
pg> GNU C++ streams implementations, which was very monolithic, so I
pg> wrote an implementations that had lots of small source files. Space
pg> occupied, and presumably time, decreased dramatically.

suitti> If you have the compiler generate multiple .o's for .c's, you
suitti> need some way to tell the linker, "look here first" when
suitti> resolving symbol addresses.  In C, if you reference a function,
suitti> typically the linker looks for the symbol within the same .o first.

suitti> If the compiler produces a list of what symbols that were in each
suitti> .o were global, static, static but relevant to these other .o's,
suitti> then user's might accidentally obtain more useful libraries.

This is true -- you need to add a new visibility distinction: whether an
entity is visible only within, or also outside, a library. This can be
hacked by tweaking appropriately the format of the symbol table in .a
files. Incidentally, I think that instead of having the assembler or
compiler generate multiple .o files, it would be nice for it to generate
a single .a file. It would not be difficult to modify GNU as to have a
suitable -a option, as in 'gas -a many-functions.a many-functions.s'.

But really the root problem is that we are discussing a language like C
and its three visibility levels, and other aspects which were designed
around linking technology straight out of the fifties. The inertia
of masses of C applications and of linkers designed exactly as they were
in the fifties is tremendous, as C++ users discover to their detriment.

It is deeply ironic that most hardware designers have read something at
least on Multics and have put hooks for modern linking technology and
languages in the interior decors of most modern systems, even if most
OSes and nearly all incarnations of Unix simply do not use them! Only
recently have SunOS and SVR4 acquired some hacky dynamic linking
technology, for example. This tragedy is almost as great as that of
millions of 286s running DOS when they have protected mode rings and
virtual memory and features straight out of Multics or TSS/370.

There are some languages that are designed under the impression that the
evolution of linker technology has not stopped with the 7090 or the 1100
or the 600, and that make use of linkers worthy of the early sixties, for
example Modula-2 and Ada.  There are even languages that assume linking
technology as it evolved in the seventies, but I know of no mainline
system that uses it. It could be said that many languages in the Scheme,
Lisp, Poplog family have linking technology out of the seventies, but
they are hardly as popular as they should be.

Ah architecture! Things are so bad that some recent hw/os architectures
have difficulty supporting dynamic linking and loading, and self-modifying
code, as discussed previously on these screens, and those
things are the essence of modern linking technology, from the sixties
onwards.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk