doug@cogent.UUCP (Doug Perlich) (10/11/89)
I have recently become interested in having an application program run as
fast as possible!  (Sound familiar?)

What I am interested in is how a program can get a higher priority at run
time.  More exactly, what other methods are there to get screaming
performance out of a UNIX machine?

As I understand it, only root can use a negative nice to make a program
run faster.  Are there ways of dedicating a processor (with or without a
multiprocessor)?

I am mainly interested in a multi-user system.

It seems to me the answer is no, because then every programmer would try
this in his/her code and the system would halt.

-Doug. . .
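To make the mechanism Doug is asking about concrete, here is a minimal
sketch assuming the 4.2BSD setpriority(2) interface; the -10 value is an
arbitrary choice for illustration.  Negative "nice" values mean more
favorable scheduling, and the call fails for anyone but root:

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Ask for a more favorable priority for this process.
         * Negative values are reserved for the superuser, so an
         * ordinary user will see this call fail. */
        if (setpriority(PRIO_PROCESS, 0, -10) == -1) {
            perror("setpriority");
            return 1;
        }
        printf("running at nice -10\n");
        return 0;
    }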
seibel@cgl.ucsf.edu (George Seibel) (10/11/89)
In article <593@cogent.UUCP> doug@cogent.UUCP (Doug Perlich) writes:
>I have recently become interested in having an application program run as
>fast as possible!  (Sound familiar?)

Yup.

>What I am interested in is how a program can get a higher priority at run
>time.  More exactly, what other methods are there to get screaming
>performance out of a UNIX machine?
>As I understand it, only root can use a negative nice to make a program
>run faster.  Are there ways of dedicating a processor (with or without a
>multiprocessor)?
>
>I am mainly interested in a multi-user system.
>
>It seems to me the answer is no, because then every programmer would try
>this in his/her code and the system would halt.

Even if you get 100% of the machine, you only go as fast as the machine
will run your program.  Here's what to do if you *really* want to go fast:

1) Choose the best algorithm.  e.g. Quicksort beats Bubble sort...

2) Profile the application with representative inputs.  The usual
   scenario is:

      cc -p applic.c -o applic    [could be f77 or pc instead of cc]
      applic                      [will produce file "mon.out"]
      prof applic > prof.out

   Now look in prof.out.  This should tell you where your program is
   spending its time.  Look at those parts of the code.  Are they doing
   unnecessary work?  Find a hacker and ask how to make it go faster.
   Bringing frequently-called functions inline is usually a win.  If
   you're doing a lot of I/O, can it be brought in-core?  Can you use
   binary files instead of formatted files?  Check out the options on
   your compiler.  Try the optimisation options.  Make sure you are not
   using runtime bounds checking.  Are you even using a compiler?  If
   the application is written in an interpreted language, there probably
   is no profiler or optimiser.  Consider rewriting.

   What if you aren't a programmer, or you don't have the source code?

3) Buy a faster computer.  (This is also a valid solution if you *are*
   a programmer.)

George Seibel, UCSF
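One of Seibel's points deserves a concrete illustration: binary versus
formatted files.  A hedged sketch follows (the array size and file names
are invented for the example); the fprintf loop pays for a text conversion
on every element, while fwrite moves the raw bytes in a single call:

    #include <stdio.h>

    #define N 100000

    int main(void)
    {
        static double a[N];
        FILE *fp;
        int i;

        for (i = 0; i < N; i++)
            a[i] = i * 0.5;

        /* Formatted: one text conversion per element. */
        fp = fopen("data.txt", "w");
        for (i = 0; i < N; i++)
            fprintf(fp, "%f\n", a[i]);
        fclose(fp);

        /* Binary: one bulk write, no conversions at all. */
        fp = fopen("data.bin", "w");
        fwrite((char *)a, sizeof(double), N, fp);
        fclose(fp);
        return 0;
    }

On top of skipping the conversions, the binary file can later be read
back into the same array with a single fread.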
madd@bu-cs.BU.EDU (Jim Frost) (10/12/89)
In article <12034@cgl.ucsf.EDU> seibel@cgl.ucsf.edu (George Seibel) writes:
|In article <593@cogent.UUCP> doug@cogent.UUCP (Doug Perlich) writes:
|>More exactly, what other methods are there to get screaming performance
|>out of a UNIX machine?
|
|Even if you get 100% of the machine, you only go as fast as the machine
|will run your program.  Here's what to do if you *really* want to go fast:

Very valid techniques.  There are others that may work depending on your
application which can squeeze even more performance out of your machine.

UNIX splits up the CPU amongst processes (and threads, if your UNIX
supports them).  A single application can thus get more CPU at the same
priority if you can break the job up between multiple processes or
threads.  It's generally easier to do this with threads than with
separate processes, but you can do pretty well with separate processes
and shared memory for a lot of tasks -- particularly sequential
independent calculations (often found inside loops).  A simple shell
script illustrates the principle:

#! /bin/csh
foreach i (/usr/*)
	find $i -name foo -print
end

versus:

#! /bin/csh
foreach i (/usr/*)
	find $i -name foo -print &
end

The latter will finish much faster (unless it thrashes the system), but
it has at least one problem -- output will become intermingled unless you
give each process its own output file.  The same sorts of problems will
have to be solved for a real application; see any operating systems book
for lots of solutions.

This method of parallelism is seen most often on multiprocessor machines,
since processes will tend to execute on separate processors and you get
incredible throughput improvements.  On the Encore Multimax this
technique is used for grep and make, for instance.  The technique still
works on single processors -- obviously so, since the whole idea behind
multitasking is to fully utilize the CPU (amongst other resources) -- but
usually not as well.

|>It seems to me the answer is no because then every programmer would try
|>this in his/her code and the system would halt.

Yes, this technique will definitely hurt the system if you run too many
parallel processes, but so will any technique that gives a single
application more than its "share" of CPU.  It's also a lot harder because
you have to coordinate processes, so many people won't bother unless they
really need it or the system makes such a thing easy (I have yet to see a
system where it was particularly easy to parallelize tasks effectively :-).

jim frost
software tool & die
madd@std.com
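The same divide-and-collect idea can be shown in C rather than csh.  The
following is a sketch, not production code: it forks one worker per slice
of an array and gathers partial sums through a pipe.  The worker count,
the array contents, and the choice of a pipe instead of shared memory are
all assumptions made for illustration:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define N      1000000
    #define NPROCS 4

    int main(void)
    {
        static double a[N];
        double total = 0.0, part;
        int fd[2], i, p;

        for (i = 0; i < N; i++)
            a[i] = 1.0;

        pipe(fd);                       /* workers report sums here */

        for (p = 0; p < NPROCS; p++) {
            if (fork() == 0) {          /* child: sum one slice of a[] */
                double s = 0.0;
                int lo = p * (N / NPROCS);
                int hi = (p == NPROCS - 1) ? N : lo + N / NPROCS;

                for (i = lo; i < hi; i++)
                    s += a[i];
                /* an 8-byte write is atomic on a pipe, so the sums
                 * can't intermingle the way raw text output would */
                write(fd[1], (char *)&s, sizeof s);
                _exit(0);
            }
        }

        for (p = 0; p < NPROCS; p++) {  /* parent: one sum per worker */
            read(fd[0], (char *)&part, sizeof part);
            total += part;
        }
        while (wait((int *)0) > 0)
            ;
        printf("total = %f\n", total);  /* prints 1000000.000000 */
        return 0;
    }

Note that the atomic pipe writes are doing the coordination work here;
this is exactly the intermingled-output problem Frost mentions, solved
the cheap way.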
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/12/89)
In article <40090@bu-cs.BU.EDU>, madd@bu-cs.BU.EDU (Jim Frost) writes:
| application more than its "share" of CPU.  It's also a lot harder because
| you have to coordinate processes, so many people won't bother unless they
| really need it or the system makes such a thing easy (I have yet to see a
| system where it was particularly easy to parallelize tasks effectively :-).

The Encore version of make looks at an environment variable and determines
how many copies of the C compiler to start.  On a machine with 8 CPUs you
get a blindingly fast make compared to doing the same thing (in serial) on
a faster machine.

Several companies claim they do parallelizing within a process, but I
haven't got the measurements here, and the guy who has them is on vacation.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools.  They blindly follow their so-called
'reason' in the face of the church and common sense.  Any fool can see
that the world is flat!" - anon
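A guess at roughly how such a make might work -- this is not Encore's
actual code, and the PARALLEL variable name is taken from a later post in
this thread: read the desired width from the environment and keep that
many children running at once.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        char *env = getenv("PARALLEL");    /* how wide may we go? */
        int width = env ? atoi(env) : 1;
        int running = 0, job;

        if (width < 1)
            width = 1;

        for (job = 0; job < 10; job++) {   /* ten stand-in "compile" jobs */
            if (running == width) {        /* pool full: reap one child */
                wait((int *)0);
                running--;
            }
            if (fork() == 0) {
                /* a real parallel make would exec the C compiler here */
                execlp("sleep", "sleep", "1", (char *)0);
                _exit(1);
            }
            running++;
        }
        while (wait((int *)0) > 0)         /* drain the stragglers */
            ;
        return 0;
    }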
larry@macom1.UUCP (Larry Taborek) (10/12/89)
From article <593@cogent.UUCP>, by doug@cogent.UUCP (Doug Perlich):
> I have recently become interested in having an application program run
> as fast as possible!  (Sound familiar?)
> 
> What I am interested in is how a program can get a higher priority at
> run time.  More exactly, what other methods are there to get screaming
> performance out of a UNIX machine?
> As I understand it, only root can use a negative nice to make a program
> run faster.  Are there ways of dedicating a processor (with or without
> a multiprocessor)?
> 
> I am mainly interested in a multi-user system.
> 
> It seems to me the answer is no, because then every programmer would
> try this in his/her code and the system would halt.
> 
> -Doug.

I used to be the system administrator of an older Unix system (BSD 4.2),
and I wanted to get my programs to run faster.  So I went through the
manual pages on the network daemons and removed (commented out) each
daemon from the startup scripts rc and rc.local that was used for network
services we weren't using.  Those that were started had a nice placed on
them so that they ran at a lower priority (we didn't do much networking
anyway).  This meant that a whole host of processes that usually competed
pretty evenly with my stuff for CPU time either weren't there or were
competing at a disadvantage.  I immediately noticed a speed increase in
'user' tasks.

Another technique is to run your program in single-user mode.  Naturally,
on a multiuser system running programs in single-user mode is not
advantageous, but for large database loads and the like it does make
sense.

If you do have access to the superuser login, the nice command can be
used to speed up your program.  If not, perhaps you can persuade your
co-workers to nice their programs down.

Programs can also be run at night, where they do not have to compete for
the machine's resources with other user processes.

Well, hope this helps...
-- 
Larry Taborek	..!uunet!grebyn!macom1!larry	Centel Federal Systems
		larry@macom1.UUCP		11400 Commerce Park Drive
My views do not reflect those of Centel		Reston, VA  22091-1506
						703-758-7000
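For what "nicing a program down" amounts to from inside a C program, a
minimal sketch (the increment of 10 is an arbitrary choice): any user may
lower their own priority with nice(2); only root may raise one.

    #include <stdio.h>
    #include <errno.h>
    #include <unistd.h>

    int main(void)
    {
        /* nice(2) may legitimately return -1 as the new value,
         * so check errno rather than the return value alone. */
        errno = 0;
        if (nice(10) == -1 && errno != 0)
            perror("nice");

        /* ... the long-running, off-hours work would go here ... */
        return 0;
    }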
chris@mimsy.UUCP (Chris Torek) (10/13/89)
In article <1029@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM (Wm E
Davidsen Jr) writes:
> The Encore version of make looks at an environment variable and
>determines how many copies of the C compiler to start.  On a machine with
>8 CPUs you get a blindingly fast make compared to doing the same thing
>(in serial) on a faster machine.

(Not if the serial machine is more than 8 times faster, or if there is
only one source file.)

Unfortunately, the Encore version of cc, which is apparently a Greenhills
C compiler, has all of its `phases' built in.  Thus, if you are compiling
a single file, you cannot preprocess on cpu 0, compile on cpu 1, and
assemble on cpu 2 all at the same time.

Given the standard edit/compile/debug cycle, this---combining
everything---seems to me to be a major mistake.

Well, not so major as all that, perhaps, since most of the time is spent
in the compilation part, not in preprocessing or assembly.  Still, the
potential was there, and would return if Encore used gcc as their
standard compiler.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept	(+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
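What Torek describes -- overlapping the preprocessor, compiler proper,
and assembler as separate processes -- is just a pipeline, and a compiler
driver can set one up with pipe/fork/exec.  The sketch below is a toy
under assumed tool names (cpp, ccom, and as are typical 4BSD phase names;
Encore's differ):

    #include <unistd.h>
    #include <sys/wait.h>

    /* Run cmds[0] | cmds[1] | ... | cmds[n-1], one process per stage,
     * so a multiprocessor is free to run the phases concurrently. */
    static void pipeline(char **cmds[], int n)
    {
        int in = 0, fd[2], i;

        for (i = 0; i < n; i++) {
            if (i < n - 1)
                pipe(fd);
            if (fork() == 0) {
                if (in != 0)
                    dup2(in, 0);         /* stdin from previous stage */
                if (i < n - 1) {
                    dup2(fd[1], 1);      /* stdout to next stage */
                    close(fd[0]);
                }
                execvp(cmds[i][0], cmds[i]);
                _exit(1);                /* exec failed */
            }
            if (in != 0)
                close(in);
            if (i < n - 1) {
                close(fd[1]);
                in = fd[0];
            }
        }
        while (wait((int *)0) > 0)
            ;
    }

    int main(void)
    {
        /* hypothetical phase commands; real names and flags vary */
        char *cpp[]  = { "cpp", "foo.c", (char *)0 };
        char *ccom[] = { "ccom", (char *)0 };
        char *as[]   = { "as", "-o", "foo.o", (char *)0 };
        char **stages[] = { cpp, ccom, as };

        pipeline(stages, 3);
        return 0;
    }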
david@quad1.quad.com (David A. Fox) (10/13/89)
In article <20140@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>
>Unfortunately, the Encore version of cc, which is apparently a Greenhills
>C compiler, has all of its `phases' built in.  Thus, if you are compiling
>a single file, you cannot preprocess on cpu 0, compile on cpu 1, and
>assemble on cpu 2 all at the same time.
>
>In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
>Domain: chris@cs.umd.edu Path: uunet!mimsy!chris

Tsk, tsk.  You are forgetting about the -X182 option to cc :-) :

.c.o:
	$(CC) $(CFLAGS) -S -c -Wc,-X182 $< | as -o $*.o

I saw this in some Encore makefiles once.  I sure don't bother, but now
that I've been reminded...

Also worth looking at, perhaps, are the (apparently obsolete) "-A" and
(apparently preferred) "-q nodirect_code" options; they seem also to
fork off processes like crazy.  After all, the more processes you use,
the sooner it'll be done.  Right?  :-)
-- 
-- David A. Fox			Quadratron Systems Inc.
Inet: david@quad.com, Postmaster@quad.com	UUCP: david@quad1.uucp
uunet!ccicpg!quad1!david
"Man, woman, child... All is up against the wall - of science."
bph@buengc.BU.EDU (Blair P. Houghton) (10/14/89)
In article <20140@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <1029@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM
>(Wm E Davidsen Jr) writes:
>> The Encore version of make looks at an environment variable and
>>determines how many copies of the C compiler to start.  On a machine with
>>8 CPUs you get a blindingly fast make compared to doing the same thing
>>(in serial) on a faster machine.

But not perfectly reliably.  It's messed up on some dependencies in large
makes we've run in the past.  Sometimes it gets ahead of itself and goes
on to the next step before some of the parallel cc'ing is completed in
this step...it hasn't done this in a long while, though.  It doesn't stop
me from setenv'ing PARALLEL to 12, or anything...it's just _one_ more
possible reason for having to rerun a "make World."  :-)

>Unfortunately, the Encore version of cc, which is apparently a Greenhills
>C compiler, has all of its `phases' built in.  Thus, if you are compiling
>a single file, you cannot preprocess on cpu 0, compile on cpu 1, and
>assemble on cpu 2 all at the same time.

Much harder to manage if you've got eight processors and PARALLEL==3.  It
might even cause a performance decrease if the ccom phase gets assigned
to the overlapping processor.  With the entire cc constrained to one
processor, you're assured of eight simultaneous compilations (for
whatever an assurance of regularity from a computer is worth; I mean,
aren't they supposed to manage the complexities and leave us to the iced
tea?).

>Given the standard edit/compile/debug cycle, this---combining
>everything---seems to me to be a major mistake.
>Well, not so major as all that, perhaps, since most of the time is
>spent in the compilation part, not in preprocessing or assembly.
>Still, the potential was there, and would return if Encore used gcc as
>their standard compiler.

Uh, who does? :-)  Seems to me you can usually snarf a _newer_ version of
gcc than any computer company is prepared to deliver with a packaged
system.  Then again, I haven't tried to build gcc on our Encore (it kicks
silicon butt on the uVAXen, tho').  Any reason I should or shouldn't?

--Blair
  "I just want to know when rn
   is going to take advantage
   of parallelism.  :-) :-) :-)"
meissner@twohot.rtp.dg.com (Michael Meissner) (10/16/89)
In article <20140@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
> In article <1029@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM
> (Wm E Davidsen Jr) writes:
> > The Encore version of make looks at an environment variable and
> >determines how many copies of the C compiler to start.  On a machine
> >with 8 CPUs you get a blindingly fast make compared to doing the same
> >thing (in serial) on a faster machine.
> 
> (Not if the serial machine is more than 8 times faster, or if there is
> only one source file.)
> 
> Unfortunately, the Encore version of cc, which is apparently a Greenhills
> C compiler, has all of its `phases' built in.  Thus, if you are compiling
> a single file, you cannot preprocess on cpu 0, compile on cpu 1, and
> assemble on cpu 2 all at the same time.
> 
> Given the standard edit/compile/debug cycle, this---combining
> everything---seems to me to be a major mistake.  Well, not so major as
> all that, perhaps, since most of the time is spent in the compilation
> part, not in preprocessing or assembly.  Still, the potential was
> there, and would return if Encore used gcc as their standard compiler.

Especially when you optimize with gcc, most of the time (80% or more) is
spent in cc1, which has the following passes over the RTL file:

  * The initial pass, creating the RTL file from the TREE file created
    by the parser;
  * A pass to copy any shared RTL structure that should not be shared;
  * The first jump optimization pass;
  * A pass to scan for registers, to prepare for common subexpression
    elimination;
  * The common subexpression elimination pass;
  * Another jump optimization pass;
  * Another register scan pass, for loop optimizations;
  * A loop optimization pass;
  * A flow analysis pass;
  * A combiner pass, to combine multiple RTL expressions into larger
    RTL expressions;
  * A pass to allocate registers that are live within a single basic
    block;
  * A pass to allocate registers whose lifetime spans multiple basic
    blocks;
  * A final jump optimization pass;
  * An optional delayed-branch recognition pass;
  * A final pass that expands peepholes and emits assembler code.

Also note, when using -pipe, that the preprocessor internally buffers the
entire text and writes it in one fell swoop at the end.  This means that
in general only the compiler proper (cc1) and the assembler run in
parallel.  This helps to some degree: a few months ago I measured how
much it helped on a dual-processor 88100, and the build time of the
entire compiler was about 14 minutes without -pipe and about 10-12
minutes with it.
--
Michael Meissner, Data General.			If compiles were much
Uucp: ...!mcnc!rti!xyzzy!meissner		faster, when would we
Internet: meissner@dg-rtp.DG.COM		have time for netnews?
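A toy demonstration of why that buffering matters, with everything (line
counts, the usleep calls standing in for real work, the program name)
invented for illustration: if the producer holds all its output until the
end, the consumer sits idle and the "pipeline" degenerates to serial
execution.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define LINES 200

    int main(int argc, char **argv)
    {
        /* "buffered" mimics gcc's -pipe preprocessor: produce all the
         * text, then write it in one fell swoop.  Without the flag the
         * writer streams, and reader and writer overlap. */
        int buffered = argc > 1 && strcmp(argv[1], "buffered") == 0;
        char buf[LINES * 32], line[32], c;
        char *p = buf;
        int fd[2], i;
        long n = 0;

        pipe(fd);
        if (fork() == 0) {                   /* writer process */
            for (i = 0; i < LINES; i++) {
                usleep(5000);                /* simulate per-line work */
                sprintf(line, "line %d\n", i);
                if (buffered) {
                    strcpy(p, line);         /* hoard the output... */
                    p += strlen(line);
                } else {
                    write(fd[1], line, strlen(line));
                }
            }
            if (buffered)
                write(fd[1], buf, p - buf);  /* ...and dump it at the end */
            _exit(0);
        }
        close(fd[1]);
        while (read(fd[0], &c, 1) == 1)      /* reader process */
            if (c == '\n') {
                usleep(5000);                /* simulate per-line work */
                n++;
            }
        printf("consumed %ld lines\n", n);
        wait((int *)0);
        return 0;
    }

Timing the two runs ("time demo" versus "time demo buffered") should show
the streaming version finishing in roughly half the wall-clock time on an
otherwise idle machine, which is about the shape of the 14-versus-10-12
minute numbers above.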