moss@cs.umass.edu (Eliot Moss) (01/22/91)
MIPS systems have a tool that reorganizes code for better *cache* behavior. The same idea would presumably work for paging (and TLB) stress reduction ... -- J. Eliot B. Moss, Assistant Professor Department of Computer and Information Science Lederle Graduate Research Center University of Massachusetts Amherst, MA 01003 (413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu
marcs@slc.com (Marc San Soucie) (01/23/91)
Richard Tobin writes:

> Piercarlo Grandi writes:
> > Incidentally, even a 4KB pagesize is still too large, even if it is
> > barely tolerable. 8KB was simply crazy, for a workstation/time sharing
> > server.
>
> Large page sizes seem likely to interact badly with programs
> containing a large number of small procedures, such as X clients. An
> obvious improvement would be to have the linker order procedures so that
> procedures commonly used together were adjacent, and separate from
> rarely used procedures.
>
> Has this been done in any real systems?

I know of one instance, which may not qualify for the label 'real', which
was a workstation project undertaken at Wang Laboratories and later
discarded shortly before its release date. The OS for the workstation was
a System 5.3 compatible home-built, with home-cobbled tools and other
home-made goodies.

One of the real goodies, which I have yet to see in a UNIX system, was a
service provided by the kernel (making liberal use of a nifty 68020/30
hardware feature) whereby one could run one's program under the auspices
of the debugger, and cause it to emit a trace log of all jumps and calls.
I wrote a set of spiffers which munged the trace log into a linker script
which would cause the program to be linked in an order most advantageous
for initial page-in and in many cases for least runtime memory use. It
took about three steps, but one could use a real instance of a program to
improve the program's linking. The same info, of course, could be turned
into performance analysis data, into branch testing data, etc. Very, very
nice.

Unfortunately, few chips seem to support the jump/call hardware
interrupt, and even fewer OS's support the trace log feature. Alas,
another great service tossed in the dumper...

Marc San Soucie
Portland, Oregon
marcs@slc.com
sbs@ciara.Frame.COM (Steven Sargent) (01/23/91)
In article <3981@skye.ed.ac.uk>, richard@aiai.ed.ac.uk (Richard Tobin) writes: |> In article <PCG.91Jan20191955@teacho.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: |> >Incidentally, even a 4KB pagesize is still too large, even if it is |> >barely tolerable. 8KB was simply crazy, for a workstation/time sharing |> >server. |> |> Large page sizes seem likely to interact badly with programs |> containing a large number of small procedures, such as X clients. An |> obvious improvement would be to have the linker order procedures so that |> procedures commonly used together were adjacent, and separate from |> rarely used procedures. |> |> Has this been done in any real systems? |> NeXT's 2.0 linker provides a scheme for ordering procedures. I've used it to separate out startup code from the rest of the program and gotten good results (naturally, the numbers aren't handy.) Special flags to gprof cause it to present output in a form useful to the linker. -- Steven Sargent sbs@frame.com "Frame's phone bill, my opinions."
pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/23/91)
On 21 Jan 91 13:56:48 GMT, richard@aiai.ed.ac.uk (Richard Tobin) said:
richard> In article <PCG.91Jan20191955@teacho.cs.aber.ac.uk>
richard> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
pcg> Incidentally, even a 4KB pagesize is still too large, even if it is
pcg> barely tolerable. 8KB was simply crazy, for a workstation/time sharing
pcg> server.
richard> Large page sizes seem likely to interact badly with programs
richard> containing a large number of small procedures,
Same for data. Other things being equal, a smaller page size is better
than a larger page size, down to a fairly small limit, mostly for this
reason. Naturally other things are not equal -- smaller page sizes imply
larger page tables, and IO costs are largely independent of page size.
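The tradeoff above can be sketched as a toy calculation (a hypothetical flat-mapped address space and made-up segment sizes, not any particular machine):

```python
# Toy page-size tradeoff: smaller pages waste less memory to internal
# fragmentation but need more page-table entries.  All numbers hypothetical.

def page_table_entries(address_space, page_size):
    """Number of PTEs needed to map a flat address space."""
    return address_space // page_size

def internal_fragmentation(segment_sizes, page_size):
    """Bytes wasted rounding each segment up to a whole page."""
    waste = 0
    for size in segment_sizes:
        waste += (page_size - size % page_size) % page_size
    return waste

# A 4 MB address space and three segments (text, data, stack) of odd sizes:
segments = [100_000, 30_000, 9_000]
for page in (1024, 4096, 8192):
    ptes = page_table_entries(4 * 1024 * 1024, page)
    frag = internal_fragmentation(segments, page)
    print(page, ptes, frag)
```

Smaller pages cost more page-table entries but waste fewer bytes rounding each segment up to a page boundary; the fragmentation term is only part of the story, since the working-set effects discussed in this thread are about reference locality, not rounding.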
richard> such as X clients.
Ah, also the X server. By private communication I have been told that
the X server typically spends 90% of its time in a 2KB stretch of code.
My reaction was disbelief, as this would imply that the X server's code
working set would be around one page, which observation tells us it is
not.
I was told that in getting to those 2KB, control touches a large number
of layers of abstraction realized as widely scattered procedures. In
other words the path leading to the 2KB is fairly short in time but long
in jumps from one page to another... I would surmise that the same is
probably true for the X server's data. Arrrrggghhh.
richard> An obvious improvement would be to have the linker order
richard> procedures so that procedures commonly used together were
richard> adjacent, and separate from rarely used procedures. Has this
richard> been done in any real systems?
Another lost art! Yes, it used to be popular before Unix. I have seen
several papers on the subject, late sixties to end of the seventies. You
would have tools for profiling and rearranging code in order to minimize
the working set. I have even heard of an Algol 68 compiler with an
"infrequently" pragma, used to tag infrequently executed sections of
code to offline.
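The profiling-and-rearranging tools described above can be sketched as a toy greedy scheme (procedure names, sizes, and call counts are all invented; the real tools of the era worked from trace data, not a flat profile):

```python
# Toy profile-driven procedure placement: put the most frequently called
# procedures together so that they land on the fewest pages.

def pages_touched(order, sizes, hot, page_size):
    """Pages containing at least one 'hot' procedure under a given layout."""
    touched, addr = set(), 0
    for proc in order:
        if proc in hot:
            for a in range(addr, addr + sizes[proc], page_size):
                touched.add(a // page_size)
            # a hot procedure may straddle a page boundary
            touched.add((addr + sizes[proc] - 1) // page_size)
        addr += sizes[proc]
    return len(touched)

sizes = {"main": 300, "parse": 400, "error": 3000, "emit": 500,
         "init": 3500, "lookup": 300}
counts = {"main": 1, "parse": 900, "error": 2, "emit": 800,
          "init": 1, "lookup": 1200}
hot = {p for p, n in counts.items() if n > 100}

naive = list(sizes)                                 # link order as written
profiled = sorted(sizes, key=lambda p: -counts[p])  # hottest first
print(pages_touched(naive, sizes, hot, 1024),
      pages_touched(profiled, sizes, hot, 1024))    # fewer pages after profiling
```

With the invented numbers, the source-order layout scatters the three hot procedures across four 1KB pages; sorting hottest-first packs them into two.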
As I have already remarked, programming for "locality" is a dark lost
secret nowadays -- we have got the worst aspects of Lisp programming in
languages like C without the benefits!
Memory is cheap, so instead of using it for more powerful applications
it is used for sloppier programming.
I once had a short e-mail discussion with the author about the appalling
inefficiencies intrinsic in the GNU Emacs design decisions. The usual
answer was "who cares! modern machines are so fast!".
The problem is that the GNU Emacs botches do not have a constant cost;
that is, say an overhead of 100KB or of 1 million needless instructions,
which would become percentage-wise insignificant with time. No, like
many sloppily written programs it has inefficiencies proportional to the
size and speed of the machine/application. And applications grow in
size; once one could complain about GNU Emacs being slow in editing
100KB buffers and running 200-line ELisp functions on a 68010. Now that
we have machines that are 20 times as fast and large, if we still edit
100KB files and run 200-line ELisp functions GNU Emacs looks fast.
Unfortunately now we want to edit 2MB files and run 4000-line ELisp
functions routinely, and GNU Emacs looks as slow as it did then, or
even slower, because the inefficiency overhead is more than linear. So we
have the choice of investing the 20 fold improvement in machine power to
have either decent responsiveness running the same things we did with
20 times less powerful machines or running applications that are 20
times as large at the same old slow pace.
Another example is the abject System V expansion swap technology; the
"cure" suggested by AT&T is to have about 10 times more memory than is
needed for the working sets of active programs, so that the misdesign
is never exercised. OK, memory is cheap -- but the *factor* of 10
remains there, whatever the application size is, and it irks me for some
stupid reason that I have to reserve 90% of my real memory to AT&T's
swapping inanity, instead of for running applications that are ten times
as large.
So, the figleaf "machines are getting faster" really does not cover
much...
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
dmk@craycos.com (David Keaton) (01/24/91)
In article <3981@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes: > An >obvious improvement would be to have the linker order procedures so that >procedures commonly used together were adjacent, and separate from >rarely used procedures. > >Has this been done in any real systems? Multiflow's linker used the output of gprof to optimally scatter frequently cooperating procedures in the I-cache. Sometimes this required placing them in different pages. Typical speedup was about 10%, although I remember a couple of cases that achieved 20%+. It was a real system for a while. David Keaton dmk@craycos.com uunet!dmk3b1!dmk
davecb@yunexus.YorkU.CA (David Collier-Brown) (01/24/91)
| In article <3981@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes: | An obvious improvement would be to have the linker order procedures so that | procedures commonly used together were adjacent, and separate from | rarely used procedures. | | Has this been done in any real systems? This was done by the optional ``binder'' on Multics. Binding was done to programs whose performance was important and reference patterns known. Other programs linked dynamically as needed. --dave -- David Collier-Brown, | davecb@Nexus.YorkU.CA | lethe!dave 72 Abitibi Ave., | Willowdale, Ontario, | Even cannibals don't usually eat their CANADA. 416-223-8968 | friends.
schwartz@groucho.cs.psu.edu (Scott Schwartz) (01/24/91)
In article <3981@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes: | An obvious improvement would be to have the linker order procedures ... | Has this been done in any real systems? Primos did (still does, I guess) this for a number of common programs, like the line editor.
rro@debussy.cs.colostate.edu (Rod Oldehoeft) (01/24/91)
In article <9840@frame.UUCP> sbs@ciara.Frame.COM (Steven Sargent) writes:
>In article <3981@skye.ed.ac.uk>, richard@aiai.ed.ac.uk (Richard Tobin) writes:
>|> In article <PCG.91Jan20191955@teacho.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
[A little deleted here.]
>|> Large page sizes seem likely to interact badly with programs
>|> containing a large number of small procedures, such as X clients. An
>|> obvious improvement would be to have the linker order procedures so that
>|> procedures commonly used together were adjacent, and separate from
>|> rarely used procedures.
>|>
>|> Has this been done in any real systems?
>
>NeXT's 2.0 linker provides a scheme for ordering procedures.
>I've used it to separate out startup code from the rest of the
>program and gotten good results (naturally, the numbers aren't
>handy.) Special flags to gprof cause it to present output in a
>form useful to the linker.

Before we all get out our chisels and find a roundish rock, here are
some references to the idea of restructuring programs for virtual memory:

Brawn and Gustavson, Program behavior in a paging environment, Proc.
AFIPS FJCC 33, 1968: 1019-1032.

Comeau, L., A study of the effect of user program optimization in a
paging system, Proc. 1st ACM Symp. OS Principles, 1967.

Ferrari, D. and M. Kobayashi, Program restructuring algorithms for
global LRU environments, Proc. Int. Comp. Symp., 1977: 277-283.

Hatfield, D. J. and J. Gerald, Program restructuring for virtual memory,
IBM Sys. J. 10, 3, 1971: 168-192.

Johnson, J. W., Program restructuring for virtual memory systems,
Project MAC TR-148, MIT, 1975.

McKellar, A. and E. G. Coffman, The organization of matrices and matrix
operations in a paged multiprogramming environment, CACM 12, 3, 1969:
153-165.

Notice how old these references are. There was great concern that
automatic memory management via paging would not be as efficient as its
predecessor, manually designed and explicitly executed overlaying.
This research fell off because it was suggested that the advent of
electronic auxiliary memory would cause page sizes to decrease, and
therefore working sets would more closely approximate the actually
needed portions of the program in real memory, so thinking about this
was unnecessary. And of course, memory was getting to be nearly free,
wasn't it?

Of course this didn't happen. Disks with rotational latency delays
didn't go away, so large page sizes continued. Disk technology continued
to improve dramatically and unexpectedly, so the price/performance
advantage over electronic backing store continued. Also, applications
got bigger much faster than memory prices dropped.

When I co-authored "Operating Systems--Advanced Concepts" I included a
section covering Johnson's work (reference above), partly because no one
else seemed to think it was important any more, but I did. In general,
automatic restructuring based on execution trace data can reduce page
faults by 50% over a "bad" virtual memory ordering, for a broad range of
available real memory sizes. Since orderings are now random, how do you
know yours aren't bad?

{{{ No need for you to rush out and buy this book, published by
Benjamin-Cummings, 1987. Really. First author is Maekawa. Really no need
to buy it. :-> }}}

On a more recent note than the references above, the MIPS f77 compiler,
which has only the name in common with the original ATT f77, allows
profiling of the execution (via pixie) and subsequent recompilation
which uses profile data to reorganize the text segment for better
I-cache performance. Same concept; maybe pixie output could be input for
Johnson's algorithms.

Rod Oldehoeft                  Email: rro@cs.colostate.edu
Computer Science Department    Voice: 303/491-5792
Colorado State University      Fax:   303/491-2293
Fort Collins, CO  80523
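The effect behind that 50% figure can be illustrated (not reproduced) with a toy LRU paging simulator; the trace and frame counts below are contrived to show the extreme case of two cooperating routines placed on different pages:

```python
# Toy LRU paging simulator: replay a page-reference trace under a fixed
# number of frames and count faults.  Illustrates, with made-up numbers,
# how a layout that scatters cooperating routines across pages can
# multiply the fault count.

from collections import OrderedDict

def lru_faults(trace, frames):
    mem, faults = OrderedDict(), 0
    for page in trace:
        if page in mem:
            mem.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(mem) >= frames:
                mem.popitem(last=False)  # evict least recently used
            mem[page] = True
    return faults

# Two routines that call each other in a tight loop, replayed 100 times:
good = [0, 0] * 100   # both routines placed on one page
bad  = [1, 2] * 100   # placed on two pages that fight over one frame
print(lru_faults(good, 1), lru_faults(bad, 1))
```

With a single frame, the co-located layout faults once and the split layout faults on every reference; with a second frame available, the split layout recovers, which is why restructuring matters most when memory is tight.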
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/25/91)
In article <PCG.91Jan23155035@athene.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
| richard> An obvious improvement would be to have the linker order
| richard> procedures so that procedures commonly used together were
| richard> adjacent, and separate from rarely used procedures. Has this
| richard> been done in any real systems?
|
| Another lost art! Yes, it used to be popular before Unix. I have seen
| several papers on the subject, late sixties to end of the seventies.

Well, not totally lost. I just looked on a SCO machine and there are two
things in the compiler which allow the user to group code in the address
space. The first is an incremental link flag which allows building large
object files from small ones, and sounds hard to use; but the second is
the ability to specify the segment name for the text and local data.
This would allow grouping of init, exception handling, and wrapup code,
or any other independent part of the program.

The term segment here doesn't refer to 386 hardware segmentation; the
386 still only uses one text and one data segment, so you're limited to
4GB for text and 4GB for stack/data.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"I'll come home in one of two ways, the big parade or in a body bag. I
prefer the former but I'll take the latter" -Sgt Marco Rodrigez
bruce@cs.su.oz (Bruce Janson) (01/25/91)
Marc, In article <azofdo.b1j@wang.com> marcs@slc.com (Marc San Soucie) writes: >.. >One of the real goodies, which I have yet to see in a UNIX system, was a >service provided by the Kernel (and making liberal use of a nifty 68020/30 >hardware feature) whereby one could run one's program under the auspices of the >debugger.. On the newer RISC chips the instruction sets and their encodings are fairly straightforward. It's not too much trouble to write a simulator that runs your program (albeit slowly), printing out the jump/call info., and thereby avoiding the need for os support. Cheers, bruce. Bruce Janson Basser Department of Computer Science University of Sydney Sydney, N.S.W., 2006 AUSTRALIA Internet: bruce@cs.su.oz.au Telephone: +61-2-692-3272 Fax: +61-2-692-3838
mash@mips.COM (John Mashey) (01/26/91)
In article <MOSS.91Jan21172107@ibis.cs.umass.edu> moss@cs.umass.edu writes: >MIPS systems have a tool that reorganizes code for better *cache* behavior. >The same idea would presumably work for paging (and TLB) stress reduction ... Yes, although as it stands, the tool only does it for code, not data; for many programs, code impact on paging and TLB is much less than data impacts .... or in some cases, the code impact is horrible, but there's nothing in the world that can be done about it except to have huge caches, and even better, fast cache refill. (Consider the kind of program that's >200MB of code, much of it in a giant single loop, leading to a high I-cache miss rate.) -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
gdtltr@chopin.udel.edu (root@research.bdi.com (Systems Research Supervisor)) (01/26/91)
In article <45242@mips.mips.COM> mash@mips.COM (John Mashey) writes: => =>Yes, although as it stands, the tool only does it for code, not data; =>for many programs, code impact on paging and TLB is much less than =>data impacts .... or in some cases, the code impact is horrible, but =>there's nothing in the world that can be done about it except to have =>huge caches, and even better, fast cache refill. (Consider the kind of =>program that's >200MB of code, much of it in a giant single loop, =>leading to a high I-cache miss rate.) I'm not an expert, so please pardon me if this is a stupid or obvious question. Is any work being done in dynamically adjusting cache policies? In other words, are there any caches out there that can detect this sort of thing and adjust the line size to reduce refill overhead? Gary Duzan Time Lord Third Regeneration -- gdtltr@brahms.udel.edu _o_ ---------------------- _o_ [|o o|] Two CPU's are better than one; N CPU's would be real nice. [|o o|] |_o_| Disclaimer: I AM Brain Dead Innovations, Inc. |_o_|
moss@cs.umass.edu (Eliot Moss) (01/27/91)
I doubt that dynamic policy adjustment is looked at very seriously,
since we're talking about algorithms that are committed to very fast
hardware, at least partly because caches are frequently on a critical
path (*please*, I do not mean to start some kind of flame war over the
detailed veracity of this phrase).
--
J. Eliot B. Moss, Assistant Professor
Department of Computer and Information Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA  01003
(413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu
tbray@watsol.waterloo.edu (Tim Bray) (01/28/91)
mash@mips.COM (John Mashey) writes:
(Consider the kind of
program that's >200MB of code, much of it in a giant single loop,
leading to a high I-cache miss rate.)
PLEASE tell us that's a thought experiment, John. Does the world really
contain such appalling creations? If so, I think a taxonomy would be of
general interest to the readers here...
(Mind you, in our group, we have occasionally executed large Deterministic
Finite Automata very fast by compiling them into a few Mb of branch
statements, which effectively trashes any I-cache with a relatively small
volume of code - but it's still the fastest way to run a big DFA).
Cheers, Tim Bray, Open Text Systems
marcs@slc.com (Marc San Soucie) (01/29/91)
Bruce Janson writes:

> Marc San Soucie writes:
> > One of the real goodies, which I have yet to see in a UNIX system, was a
> > service provided by the Kernel (and making liberal use of a nifty 68020/30
> > hardware feature) whereby one could run one's program under the auspices of
> > the debugger, and cause it to emit a trace log of all jumps and calls.
>
> On the newer RISC chips the instruction sets and their encodings
> are fairly straightforward.
> It's not too much trouble to write a simulator that runs your
> program (albeit slowly), printing out the jump/call info., and
> thereby avoiding the need for os support.

Right, but this kind of technique is even less portable than one which
relies on hardware support that could be generic if more hardware
provided it. Also, the beauty of OS-based analysis support is that it
can be performed on-site in a customer situation. Since customer
performance nightmares can seldom be reproduced back in the
manufacturer's lab, on-site analysis is extremely handy. Also, the
OS-based support, with hardware assist, is leagues faster - in our case
about 10% of original speed - than a software simulation would be.

Marc San Soucie
Servio Corporation
Beaverton, Oregon
marcs@slc.com
lm@slovax.Berkeley.EDU (Larry McVoy) (01/29/91)
In article <azofdo.b1j@wang.com>, marcs@slc.com (Marc San Soucie) writes:
|> Richard Tobin writes:
|> > Piercarlo Grandi writes:
|> > > Incidentally, even a 4KB pagesize is still too large, even if it is
|> > > barely tolerable. 8KB was simply crazy, for a workstation/time sharing
|> > > server.
|> > Large page sizes seem likely to interact badly with programs
|> > containing a large number of small procedures, such as X clients. An
|> > obvious improvement would be to have the linker order procedures so that
|> > procedures commonly used together were adjacent, and separate from
|> > rarely used procedures.
|> >
|> > Has this been done in any real systems?

The SunOS libc has done this for years. I believe that the X people are
also considering this sort of optimization. I do not know of any system
that does it dynamically. I'd like to hear of such a system if anyone
knows of one.

$ head -20 lorder-sparc
#
# @(#)lorder-sparc 1.2 88/02/08 SMI
#
# This list based upon the dynamic behavior of libc on a sun4.
# It is mostly accurate for other architectures too.
# The list was generated by a tool.
#
stret4.o
stret2.o
recvfrom.o
rpc_dtablesize.o
rpc_commondata.o
sendto.o
rpc_prot.o
bcopy.o
multiply.o
xdr.o
gettimeofday.o
auth_none.o
xdr_mem.o
---
Larry McVoy, Sun Microsystems     (415) 336-7627 ...!sun!lm or lm@sun.com
dmk@craycos.com (David Keaton) (01/29/91)
In article <1991Jan27.214522.24408@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes: >mash@mips.COM (John Mashey) writes: > (Consider the kind of > program that's >200MB of code, much of it in a giant single loop, > leading to a high I-cache miss rate.) > >PLEASE tell us that's a thought experiment, John. Does the world really >contain such appalling creations? If so, I think a taxonomy would be of >general interest to the readers here... I can imagine 200MB outer loops; there are certainly at least near megabyte inner loops in real code out there. Back at Multiflow I remember an actual user's code that simulated the aerodynamic properties of a cyclical device. (To protect the guilty, I won't say what the device was.) There was one enormous inner loop. When the device got past a certain point in its cyclical behavior, you could actually watch the cache-miss lights come on in the back of the machines as the loop extended way beyond the length of the I-cache. That VLIW I-cache held the equivalent of about 56k RISC instructions on Multiflow's smallest machine, with a memory representation of the cache size taking about 512kB. David Keaton dmk@craycos.com uunet!dmk3b1!dmk
pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/05/91)
On 28 Jan 91 21:13:20 GMT, lm@slovax.Berkeley.EDU (Larry McVoy) said:
Piercarlo Grandi writes:
pcg> Incidentally, even a 4KB pagesize is still too large, even if it is
pcg> barely tolerable. 8KB was simply crazy, for a workstation/time
pcg> sharing server.
Richard Tobin writes (attribution uncertain [pcg]):
tobin> Large page sizes seem likely to interact badly with programs
tobin> containing a large number of small procedures, such as X clients.
tobin> An obvious improvement would be to have the linker order
tobin> procedures so that procedures commonly used together were
tobin> adjacent, and separate from rarely used procedures.
tobin> Has this been done in any real systems?
lm> The SunOS libc has done this for years.
A well documented example is the System V shared libc_s. Reading the
relevant paper in the System V programmer's guide is highly
enlightening, especially the warning that shared libraries, precisely
because of locality of reference considerations on large page machines,
are not always a win.
Explanation: if you fault in an 8KB page of a shared library to
access a 200-byte function, you have probably lost more memory than
by linking the 200 bytes inline with each executable that uses them,
unless you have more than 40 different such executables running at
any one time. Naturally if the page had already been faulted in you
don't lose as badly. Decisions on physical clustering in shared
memory environments are hard to make. I am horrified by the cavalier
approach used by many builders of shared libraries on systems that
do have them. At least Sun do something about libc.
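The arithmetic here can be written down directly (the 8KB page and 200-byte function figures are from the posting; the helper name is ours):

```python
# Break-even point for sharing a small function via a shared library:
# faulting in a whole page to reach one small function only pays off once
# enough distinct executables would otherwise each carry a private copy.

def break_even_copies(page_size, function_size):
    """Private copies that fit in one page; beyond this many running
    executables, one shared page costs less memory than inline copies."""
    return page_size // function_size

print(break_even_copies(8192, 200))   # the figures from the posting
```

8192 // 200 gives 40, matching the posting's "more than 40 different such executables" before the shared page wins.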
lm> I believe that the X people are also considering this sort of
lm> optimization.
The X server has horrifyingly poor locality of reference
patterns; the X libraries are built with total disregard for physical
clustering. That explains why things like Xserver and xclock, which
should have a working set of maybe two or three pages, actually require
hundreds. I suspect that things like GNU CC, GNU Emacs, SunOS,
Ultrix, System V, X, NeWS, NeXTStep and so on are make-work schemes for
Toshiba, Hitachi and Fujitsu engineers.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
md@HQ.Ileaf.COM (Mark Dionne x5551) (02/05/91)
I have done some experiments with Interleaf (a large publishing program written in C), and found that for many usage patterns, typically about 25% of the code that is paged-in is actually touched. This can mean that up to 2 meg of memory is often being "wasted". Simply reordering the .o files helped improve things about 10 to 20%, but structured code tends to put boring initialization routines next to workhorses, etc., preventing one from getting a lot of improvement. One thing that would help out here would be a compiler switch that would produce multiple .o files for a single .c file (one .o file per function). It would then be possible to set up an automated reordering scheme where gprof output would be used to reorder the functions. (Static functions would optionally be either coalesced or treated as global.) An ideal candidate would be gcc. Any volunteers? (I've actually tried this by writing a program to split up compiler assembler output. It worked, but is painful and not portable.) As someone hinted at, X servers would probably be excellent candidates for this treatment. I've heard rumors that Sun has been doing this. -- md@HQ.Ileaf.COM (Internet) Mark Dionne, Interleaf uunet!leafusa!md (uucp) 9 Hillside Ave, Cambridge, MA 02154 (617) 290-4943 x5551
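The reordering pipeline Mark describes — per-function object files plus gprof counts driving link order — can be sketched as a toy (function names and counts are invented; this is not Interleaf's actual tool, just the shape of the idea):

```python
# Toy version of the automated reordering scheme: take a gprof-style flat
# profile and emit a link order, hottest functions first, cold
# initialization code sunk to the end of the text segment.

def link_order(profile, init_names):
    """profile: {function: call_count}.  Returns the proposed link order."""
    cold = [f for f in profile if f in init_names]
    hot = sorted((f for f in profile if f not in init_names),
                 key=lambda f: -profile[f])
    return hot + cold   # boring init routines no longer sit next to workhorses

profile = {"draw_page": 5000, "init_fonts": 1, "layout": 3200,
           "init_menus": 1, "spell": 40}
print(link_order(profile, {"init_fonts", "init_menus"}))
```

The output list would then drive the linker (or a script reordering the per-function .o files) so that the hot functions cluster at one end of the text segment and the initialization code pages out after startup.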
news@NeXT.COM (news) (02/05/91)
In article <PCG.91Feb4170042@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: > A well documented example is the System V shared libc_s. Reading the > relevant paper in the System V programmer's guide is highly > enlightening, especially the warning that shared libraries, precisely > because of locality of reference considerations on large page machines, > are not always a win. > > Explanation: if you fault is an 8KB page of a shared library to > access a 200 byte function, you probably have lost more memory than > by linking the 200 inline with each executable that uses them, > unless you have more than 40 different such executables running at > any one time. Naturally if the page had already been faulted in you > don't lose as badly. Decisions on physical clustering in shared > memory environments are hard to make. I am horrified by the cavalier > approach used by many builders of shared libraries on systems that > do have them. At least Sun do something about libc. > NeXTStep relies on two shared libraries for most of our applications and user programs in our system, libsys_s and libNeXT_s. For our second release of our system software, we made modifications to our linker to allow us to directly state our load order of these shlibs. We then wrote a number of tools that allowed us to take statistics of running apps to determine which shared library functions were being used most frequently. We were able to reduce our shared library resident set size for our system shlib by about 25% and our application kit shlib by about 40%. By applying these same tools to our Workspace manager, windowserver, and all of the daemons that our system needs we were able to reclaim about another meg of physical memory. The bottom line was that this had a significant effect compared to our 1.0 application launching speed and the time that it takes to reactivate applications. In the market we are in, perceived user interface performance is everything. 
Morris Meyer NeXT, Inc. Software Engineer 900 Chesapeake Drive NeXT OS Group Redwood City, CA 94063 mmeyer@next.com 415-780-4683
pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/09/91)
On 4 Feb 91 19:09:49 GMT, md@HQ.Ileaf.COM (Mark Dionne x5551) said:

md> I have done some experiments with Interleaf (a large publishing
md> program written in C), and found that for many usage patterns,
md> typically about 25% of the code that is paged-in is actually
md> touched. This can mean that up to 2 meg of memory is often being
md> "wasted".

Not surprising; there are lots of statistics that say that on average
only 200 instructions are executed before a "long jump" is executed;
this means that small page sizes tend to reduce working sets
dramatically. A concise summary of the "armchair" evidence on this is in
Shaw's "Logical design of operating systems", in the section on paging
performance (page 232 in the II edition).

It would be amusing for you to repeat the same measurements on different
machines, e.g. a Gould PN1 with 16KB/32KB pages, a Sun 3 with 8KB pages,
and all the way down to a BSD VAX with 1KB pages. If you do, please send
a report to SIGOPS or SIGARCH! (post it here first :->). A large amount
of statistics already exists on this subject, but they are at times
twenty years old. It would be interesting to see them confirmed with
data on more recent applications/languages.

md> Simply reordering the .o files helped improve things about 10 to
md> 20%, but structured code tends to put boring initialization routines
md> next to workhorses, etc., preventing one from getting a lot of
md> improvement.

I am fond of mentioning another nice idea similar to 'register', and
with the same excellent benefits/costs ratio: I have been told that some
Algol 68 compiler had a 'rarely' (executed) PRAGMA so that the compiler
would generate the relevant code "offline", thus making the most
frequently executed path as streamlined as possible.

md> One thing that would help out here would be a compiler switch that would
md> produce multiple .o files for a single .c file (one .o file per function).
Actually the better idea is probably to have smart linkers; but, lacking
those, this is not a bad idea. I used to be unhappy with the GNU C++
streams implementation, which was very monolithic, so I wrote an
implementation that had lots of small source files. Space occupied, and
presumably time, decreased dramatically.

md> As someone hinted at, X servers would probably be excellent candidates
md> for this treatment. I've heard rumors that Sun has been doing this.

Actually I think that MIT are doing it. I have been told by somebody
else that most of the time X is executing the same 2KB stretch of
code... Another terribly bad offender is GNU Emacs. Actually most any
contemporary program.

We do have 'time' profilers readily available, some as sophisticated as
pixie and gprof, but I am not aware of *any* readily available
'locality' profiler. Control flow profilers like the two mentioned above
can be used to optimize placement of functions; MIPS uses pixie to
optimize for caching, for example.

As an overall comment, I find this posting and the earlier one by a
person from NeXT very refreshing: some people *do* understand and care
about engineering and cost/benefit ratios. It is encouraging.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
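A minimal 'locality profiler' of the sort lamented above could be as simple as a working-set report over an address trace (a toy sketch; the traces and numbers are contrived):

```python
# Toy locality profiler: given a trace of virtual addresses, report the
# working-set size -- distinct pages touched -- over successive windows.

def working_set_sizes(trace, page_size, window):
    sizes = []
    for i in range(0, len(trace), window):
        pages = {addr // page_size for addr in trace[i:i + window]}
        sizes.append(len(pages))
    return sizes

# A loop that ping-pongs between two far-apart routines, vs. one that
# stays inside a single compact region:
scattered = [0, 9000, 100, 9100] * 5
compact   = [0, 200, 100, 300] * 5
print(working_set_sizes(scattered, 1024, 4),
      working_set_sizes(compact, 1024, 4))
```

The scattered layout holds two pages in its working set at every instant; the compact one holds one, which is exactly the difference that jump/call trace logs would let a restructuring tool eliminate.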
davecb@yunexus.YorkU.CA (David Collier-Brown) (02/11/91)
On 4 Feb 91 19:09:49 GMT, md@HQ.Ileaf.COM (Mark Dionne x5551) said:

md> I have done some experiments with Interleaf (a large publishing
md> program written in C), and found that for many usage patterns,
md> typically about 25% of the code that is paged-in is actually
md> touched. This can mean that up to 2 meg of memory is often being
md> "wasted".

pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
| Not surprising; there are lots of statistics that say that on average
| only 200 instructions are executed before a "long jump" is executed;
| this means that small page sizes tend to reduce working sets
| dramatically. [...] A large amount
| of statistics already exists on this subject, but they are at times
| twenty years old. It would be interesting to see them confirmed with
| data on more recent applications/languages.

Well, some of the twenty-year-old data came from a (pre)linker on ICL's
George III, which I remember fuzzily...

Much more recently (5 years ago) I had the doubtful pleasure of linking
C code for a large, multi-module program for the IBM Poisonous Computer
while working on Xanaro's ``Ability''. We were using a program which
evolved from an overlay scheme, but had been rewritten into a pager with
the same interface by one of our senior people, Mr. Andrew Forber. This
didn't actually generate statistics, but did allow us to manually
reorder code by locality of reference. Since we were writing classical
``structured'' code and had a compiler which implemented static-to-file
by placing multiple functions in a single .o file, you understand we
tended to discover the dynamic call graph of the programs very
quickly...

The 200-instruction figure was approximately correct for all
non-unrolled code, and quite low for things which we found on the
critical path and expanded inline (such as file i/o, paging and the
compiler/interpreter engine).
About 20% of the code was initializers of some sort or another, and
could usefully be moved to ``pages'' which were loaded at startup time,
called by the main() routine and then never referenced again.

As the system was partitioned into several large modules, we found three
definite orders of locality of reference. The weakest was within the
module as a whole: the user interface used the user-visible operations,
which used the primitives. The second weakest was within the primitives,
many of which interacted in interesting ways (eg, to preserve
consistency) and tended to call each other. The strongest was between
the user-level operations and the primitives: a given user-level
operation tended to call the same set of primitives over and over again.
This was true of all the modules, not just the ones which were
``similar''.

Indeed, we noticed a counterintuitive effect: some few user-level
operations called primitives which we expected only to be called by
other modules entirely! The text editor often found its file i/o
operations calling primitives bound with the spreadsheet, because of a
reference from one file's data to another. We first discovered this when
the speed of file i/o suddenly fell to about 1 line/sec, accompanied by
much thrashing, when two routines started fighting over the same
``page''.

Breaking the modules up into smaller and smaller ``pages'' yielded
order-of-magnitude improvements in performance. Hand-tuning generally
gave us two orders of magnitude (or more correctly, cost us two orders
of magnitude when I did it wrong (:-)). This was one of the ``small''
things referred to by Dick McMurray's dictum: first you make it right,
then you make it fast, then you make it **small**.

--dave
--
David Collier-Brown, | davecb@Nexus.YorkU.CA | lethe!dave
72 Abitibi Ave., | Willowdale, Ontario, | Even cannibals don't usually eat their
CANADA. 416-223-8968 | friends.
cprice@mips.COM (Charlie Price) (02/11/91)
In article <1991Feb4.190949.1190@HQ.Ileaf.COM> md@HQ.Ileaf.COM (Mark
Dionne x5551) writes:
>
>I have done some experiments with Interleaf (a large publishing program
>written in C), and found that for many usage patterns, typically about 25%
>of the code that is paged-in is actually touched. This can mean that up to
>
>One thing that would help out here would be a compiler switch that would
>produce multiple .o files for a single .c file (one .o file per function).
>It would then be possible to set up an automated reordering scheme where
>gprof output would be used to reorder the functions. (Static functions
>would optionally be either coalesced or treated as global.) An ideal
>candidate would be gcc. Any volunteers?

MIPS has shipped this basic capability (in a slightly different form)
for years. The MIPS tool is primarily used to rearrange routines during
linkage so that the most active routines don't conflict with each other
in the instruction cache. It lays them out sequentially from most active
to least active (up to some limit), and this would achieve the effect
you talk about. Here is how it works:

  cc -O -o gleep gleep.c

Produces gleep. Create (or find) the executable you are interested in...

  pixie gleep

Create an instrumented version of the executable with pixie. The
instrumented version, named OBJNAME.pixie by default, performs the same
function as the original program, and also keeps statistics about what
goes on inside the program -- in particular it keeps execution counts
for basic blocks. Pixie also writes out a "map" file (OBJNAME.Addrs)
that is needed to reconcile the basic-block structure to program labels.

  gleep.pixie program_arguments...

Run the instrumented program, producing a statistics file named
OBJNAME.Counts (here gleep.Counts). Multiple runs (with different
inputs, for example) can be handled, but you must move the gleep.Counts
file "out of the way" before each run.

  prof -pixie -feedback feedback_file_name gleep gleep.Addrs gleep.Counts...
Process the gathered statistics file(s) (one or more X.Counts files) and
produce a "feedback" file that can be used by the compiler system. The
basic computation is to figure out, from the basic-block count
information, which relocatable objects (functions) are used the most.

  cc -O -o gleep.new -cord -feedback feedback_file_name gleep.c

Recreate the executable, running the "cord" rearranger pass using the
specified feedback file.

As an aside, pixie is really for program performance analysis; getting
information for rearrangement was not its primary purpose. For instance,
you can discover which lines in a source program use up the largest
number of cycles.

  prof -pixie -heavy glorp

produces the following information:

Profile listing generated Sun Feb 10 15:35:20 1991 with:
   prof -pixie -heavy glorp
blkclr and bzero (bzero.s) synonymous: using latter
------------------------------------------------------------------------
*  -h[eavy] using basic-block counts;                                   *
*  sorted in descending order by the number of cycles executed in each  *
*  line; unexecuted lines are excluded                                  *
------------------------------------------------------------------------
procedure (file)        line  bytes    cycles      %  cum %

lwputstr (glorp.c)      1016    216   1632974  11.67  11.67
pfile (glorp.c)          754     76   1564005  11.17  22.84
pfile (glorp.c)          766     88   1354976   9.68  32.52
pfile (glorp.c)          753     44   1121371   8.01  40.53
pfile (glorp.c)          853     80   1121311   8.01  48.54
pfile (glorp.c)          762     40    741465   5.30  53.84
lwputstr (glorp.c)      1014     24    648692   4.63  58.48
lwputstr (glorp.c)      1019     20    538362   3.85  62.32
pfile (glorp.c)          761     32    342904   2.45  64.77
pfile (glorp.c)          849     72    336097   2.40  67.17
... cut off here...
--
Charlie Price    cprice@mips.mips.com    (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650
suitti@ima.isc.com (Stephen Uitti) (02/13/91)
In article <PCG.91Feb8204700@teacho.cs.aber.ac.uk> pcg@cs.aber.ac.uk
(Piercarlo Grandi) writes:
>On 4 Feb 91 19:09:49 GMT, md@HQ.Ileaf.COM (Mark Dionne x5551) said:
>md> Simply reordering the .o files helped improve things about 10 to
>md> 20%, but structured code tends to put boring initialization routines
>md> next to workhorses, etc., preventing one from getting a lot of
>md> improvement.
pg>I am fond of mentioning another nice idea similar to 'register', and
pg>with the same excellent benefits/costs ratio: I have been told that some
pg>Algol 68 compiler had a 'rarely' (executed) PRAGMA so that the compiler
pg>would generate the relevant code "offline", thus making the most
pg>frequently executed path as streamlined as possible.
>md> One thing that would help out here would be a compiler switch that would
>md> produce multiple .o files for a single .c file (one .o file per function).
pg>Actually the better idea is probably to have smart linkers; but, lacking
pg>those, this is not a bad idea. I used to be unhappy with the GNU C++
pg>streams implementation, which was very monolithic, so I wrote an
pg>implementation that had lots of small source files. Space occupied, and
pg>presumably time, decreased dramatically.

If you have the compiler generate multiple .o's per .c, you need some
way to tell the linker "look here first" when resolving symbol
addresses. In C, if you reference a function, typically the linker looks
for the symbol within the same .o first.

If the compiler produces a list of which symbols in each .o are global,
static, or static but relevant to these other .o's, then users might
almost accidentally obtain more useful libraries. Imagine you are
writing a subroutine library. You want function x() to be available to
other routines within the library, but you don't want it to be directly
available to the caller. If you declare it static, it is only available
to other functions within that source file.
If you copy it to each source that needs it, you might end up with
several copies. If you make your library from all one source, then you
may get all sorts of functions linked into the final object that are
never referenced.

If you can build a library and say "these are the library's external
symbols", using a list that is perhaps outside of the C sources, then
you can build better libraries. If it is easy to do, maybe more
libraries will be built. For that matter, sometimes it is more
convenient to stuff lots of functions into a single source file. It'd be
real nice if the functions that aren't referenced aren't linked into the
user's application. Again, if it is easy (or free) to do it right, more
programmers will do it right. A smart linker could do the reordering, or
even dead-function elimination, but an optional external list would make
library writing easier to do right, and bulletproof libraries could be
built.

One of the first interesting debugging exercises I witnessed had to do
with a novice C user, who typed in the "index()" function from the K&R C
book. He was amazed that I/O no longer worked right. The K&R index()
took two string arguments. The libc version takes a string and a
character. Somewhere in stdio, index() was invoked. Since the user had
already defined it, the linker didn't load the libc version...

Imagine if the linker were smart enough to know that libc should be
hooked up to the libc version of "index()", and the user's code should
be hooked up to his/her own version of "index()". I've heard that some
big company, GM or someone, has done something like this. Whoever it was
hasn't released it for the world. We need people who publish software to
do it. Then we could start working for world peace or something.

Stephen.
suitti@ima.isc.com
"We Americans want peace, and it is now evident that we must be prepared
to demand it. For other peoples have wanted peace, and the peace they
received was the peace of death." - the Most Rev. Francis J.
Spellman, Archbishop of New York. 22 September, 1940
milton@en.ecn.purdue.edu (Milton D Miller) (02/13/91)
In an effort to reorder object code to provide smaller working sets,
md@HQ.Ileaf.COM (Mark Dionne x5551) said:

>>md> One thing that would help out here would be a compiler switch that would
>>md> produce multiple .o files for a single .c file (one .o file per function).

To which suitti@ima.isc.com (Stephen Uitti) responded in article
<1991Feb12.164056.8865@dirtydog.ima.isc.com> with:

>If you have the compiler generate multiple .o's for .c's, you
>need some way to tell the linker, "look here first" when
>resolving symbol addresses. In C, if you reference a function,
>typically the linker looks for the symbol within the same .o first.
>
>If the compiler produces a list of what symbols that were in each
>.o were global, static, static but relevant to these other .o's,
>then users might accidentally obtain more useful libraries.
>
[Talks about writing a subroutine library, and the desire to have some
functions accessible only internally]
>
>If you can build a library, and say "these are the library
>external symbols", using a list that is perhaps outside of the C
>sources, then you can build better libraries. If it is easy to
>do, maybe more libraries will be built.
[Also talks about eliminating unused functions from a source file]
>A smart linker could do the reordering, or even dead function
>elimination, but an optional external list would make library
>writing easier to do right, and bulletproof libraries could be built.
>
[Story of a C novice whose K&R-style index() broke stdio, which used a
function with that name but different arguments]
>Imagine if the linker were smart enough to know that libc should
>be hooked up to the libc version of "index()", and the user's
>code should be hooked up to his/her own version of "index()".
>
>I've heard that some big company, GM or someone, has done
>something like this. Whoever it was hasn't released it for
>the world. We need people who publish software to do it.
>
>Stephen.
>suitti@ima.isc.com

I waited a few hours to post because of my lack of first-hand knowledge,
but since no one has brought it up, I will.

Although not quite all that is necessary, the shared libraries in AIX
3.1 (which runs on the IBM RS/6000) do have the notion that only some
functions should be "exported" to other modules. The list of symbols to
be exported is placed in a file provided at link time. There is also the
concept of an import list, which lists the names of functions which are
available somewhere else (for example, in the kernel) (I think).

The biggest complaint I have seen in comp.unix.aix (besides the data
space used by the linker) is that you can't use your own malloc, because
the library already has one linked into it (actually you could use your
own malloc, if you are prepared not to use anything from libc).

Although this does not solve all the linking problems, it is a place to
start, and it does solve the problem with index() in the example given
above. I don't know whether the import files would allow a single C file
to be broken up and still preserve static variable binding. Maybe
someone who has an RS/6000 could try?

milton

Disclaimer: I have only read about the linker; the only time I used it
(when I had access to it) was to compile a simple C program written by
someone else.

PS: Source licenses are supposed to be available about a year after the
machine started shipping...
pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/14/91)
On 12 Feb 91 16:40:56 GMT, suitti@ima.isc.com (Stephen Uitti) said:
suitti> In article <PCG.91Feb8204700@teacho.cs.aber.ac.uk>
suitti> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
On 4 Feb 91 19:09:49 GMT, md@HQ.Ileaf.COM (Mark Dionne x5551) said:
md> One thing that would help out here would be a compiler switch that
md> would produce multiple .o files for a single .c file (one .o file
md> per function).
pg> Actually the better idea is probably to have smart linkers; but,
pg> lacking those, this is not a bad idea. I used to be unhappy with the
pg> GNU C++ streams implementations, which was very monolithic, so I
pg> wrote an implementations that had lots of small source files. Space
pg> occupied, and presumably time, decreased dramatically.
suitti> If you have the compiler generate multiple .o's for .c's, you
suitti> need some way to tell the linker, "look here first" when
suitti> resolving symbol addresses. In C, if you reference a function,
suitti> typically the linker looks for the symbol within the same .o first.
suitti> If the compiler produces a list of what symbols that were in each
suitti> .o were global, static, static but relevant to these other .o's,
suitti> then user's might accidentally obtain more useful libraries.
This is true -- you need to add a new visibility distinction, whether an
entity is visible only within or also outside a library. This can be
hacked by tweaking appropriately the format of the symbol table in .a
files. Incidentally, I think that instead of having the assembler
generate multiple .o files, it would be nice for it to generate
a single .a file. It would not be difficult to modify GNU to have a
suitable -a option, as in 'gas -a many-functions.a many-functions.s'.
But really the root problem is that we are discussing a language like C
and its three visibility levels, and other aspects which were designed
around linking technology straight out of the fifties. The inertia
of masses of C applications and of linkers designed exactly as they were
in the fifties is tremendous, as C++ users discover to their detriment.
It is deeply ironic that most hardware designers have read something at
least on Multics and have put hooks for modern linking technology and
languages in the interior decors of most modern systems, even if most
OSes and nearly all incarnations of Unix simply do not use them! Only
recently SunOS and SVR4 have acquired some hacky dynamic linking
technology, for example. This tragedy is almost as great as that of
millions of 286s running DOS when they have protected mode rings and
virtual memory and features straight out of Multics or TSS/370.
There are some languages that are designed under the impression that the
evolution of linker technology has not stopped with the 7090 or the 1100
or the 600, and that make use of linkers worthy of the early sixties, for
example Modula2 and Ada. There are even languages that assume linking
technology as it evolved in the seventies, but I know of no mainline
system that uses it. It could be said that many languages in the Scheme,
Lisp, Poplog family have linking technology out of the seventies, but
they are hardly as popular as they should be.
Ah architecture! Things are so bad that some recent hw/os architectures
have difficulty supporting dynamic linking and loading, and self
modifying code, as discussed previously on these screens, and those
things are the essence of modern linking technology, from the sixties
onwards.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk