[comp.unix.questions] Getting the most for a process.

doug@cogent.UUCP (Doug Perlich) (10/11/89)

I have recently become interested in having an application program run as fast
as possible!  (Sound familiar?)

What I am interested in is how a program can get a higher priority at run time.
More exactly, what other methods are there to get screaming performance out of
a UNIX machine?
As I understand it, only root can use a negative nice to make a program run
faster.  Are there ways of dedicating a processor (w/ or w/o a multiprocessor)?

I am mainly interested in a multi-user system.

It seems to me the answer is no because then every programmer would try this
in his/her code and the system would halt.

-Doug.

seibel@cgl.ucsf.edu (George Seibel) (10/11/89)

In article <593@cogent.UUCP> doug@cogent.UUCP (Doug Perlich) writes:
>I have recently become interested in having an application program run as fast
>as possible!  (Sound familiar?)

Yup.

>What I am interested in is how a program can get a higher priority at run time.
>More exactly, what other methods are there to get screaming performance out of
>a UNIX machine?
>As I understand it, only root can use a negative nice to make a program run
>faster.  Are there ways of dedicating a processor (w/ or w/o a multiprocessor)?
>
>I am mainly interested in a multi-user system.
>
>It seems to me the answer is no because then every programmer would try this
>in his/her code and the system would halt.

Even if you get 100% of the machine, you only go as fast as the machine
will run your program.  Here's what to do if you *really* want to go fast:

1) Choose the best algorithm.  e.g. Quicksort beats Bubble sort...
2) profile the application with representative inputs.  The usual
   scenario is:

   cc -p applic.c -o applic  [could be f77 or pc instead of cc]
   applic                    [will produce file "mon.out"]
   prof applic > prof.out
   
   Now look in prof.out.  This should tell you where your program is
   spending its time.  Look at those parts of the code.  Are they doing
   unnecessary work?   Find a hacker and ask how to make it go faster.
   Bringing frequently-called functions inline is usually a win.
   If you're doing a lot of I/O, can it be brought in-core?  Can you
   use binary files instead of formatted files?   Check out the options
   on your compiler.  Try the optimisation options.  Make sure you are
   not using runtime bounds checking.   Are you even using a compiler?
   If the application is written in an interpreted language, there
   probably is no profiler or optimiser.  Consider rewriting.
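
   To make the binary-versus-formatted point concrete, here is a
   minimal C sketch (the array size and file names are invented for
   illustration).  Dumping an array of doubles with one fwrite() skips
   a printf-style conversion per number and is usually far cheaper:

	#include <stdio.h>
	#include <stdlib.h>

	#define N 100000		/* size picked out of the air */

	int main(void)
	{
		double *a = malloc(N * sizeof *a);
		FILE *fp;
		long i;

		for (i = 0; i < N; i++)		/* something to write out */
			a[i] = (double) i;

		/* formatted: one conversion per number */
		fp = fopen("data.txt", "w");
		for (i = 0; i < N; i++)
			fprintf(fp, "%g\n", a[i]);
		fclose(fp);

		/* binary: one bulk write, no conversions */
		fp = fopen("data.bin", "w");
		fwrite(a, sizeof *a, N, fp);
		fclose(fp);

		free(a);
		return 0;
	}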

What if you aren't a programmer, or you don't have the source code?

3) Buy a faster computer.  (This is also a valid solution if you *are*
   a programmer)

George Seibel, UCSF

madd@bu-cs.BU.EDU (Jim Frost) (10/12/89)

In article <12034@cgl.ucsf.EDU> seibel@cgl.ucsf.edu (George Seibel) writes:
|In article <593@cogent.UUCP> doug@cogent.UUCP (Doug Perlich) writes:
|>More exactly, what other methods are there to get screaming performance out of
|>a UNIX machine?

|
|Even if you get 100% of the machine, you only go as fast as the machine
|will run your program.  Here's what to do if you *really* want to go fast:

Very valid techniques.  Depending on your application, there are
others that can squeeze even more performance out of your machine.

UNIX splits up the CPU amongst processes (and threads if your UNIX
supports them).  A single application can thus get more CPU at the
same priority if you can break the job up between multiple processes
or threads.  It's generally easier to do this with threads than with
separate processes, but you can do pretty well with separate processes
and shared memory for a lot of tasks -- particularly sequential
independent calculations (often found inside loops).

A simple shell script illustrates the principle:

	#! /bin/csh
	foreach i (/usr/*)
	  find $i -name foo -print
	end

versus:

	#! /bin/csh
	foreach i (/usr/*)
	  find $i -name foo -print &
	end

The latter will finish much faster (unless it thrashes the system),
but it has at least one problem -- output will become intermingled
unless you give each process its own output file.  The same sorts of
problems will have to be solved for a real application; see any
operating systems book for lots of solutions.
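
For a real program in C, the same divide-and-conquer idea might look
something like the sketch below.  This is only an illustration, not
code from any particular system: the number of workers, the amount of
"work", and the work itself are all made up, and error checking is
omitted.  It uses fork() and System V shared memory so that each child
can hand its partial result back to the parent:

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/ipc.h>
	#include <sys/shm.h>
	#include <sys/wait.h>

	#define NPROC 4			/* number of worker processes */
	#define N     1000000L		/* total amount of "work"     */

	int main(void)
	{
		int shmid, i;
		double *partial, total = 0.0;
		long lo, hi, j;

		/* one shared slot per worker for its partial result */
		shmid = shmget(IPC_PRIVATE, NPROC * sizeof(double),
		    IPC_CREAT | 0600);
		partial = (double *) shmat(shmid, NULL, 0);

		for (i = 0; i < NPROC; i++) {
			if (fork() == 0) {	/* child: one slice of the work */
				lo = i * (N / NPROC);
				hi = (i == NPROC - 1) ? N : lo + N / NPROC;
				partial[i] = 0.0;
				for (j = lo; j < hi; j++)
					partial[i] += (double) j;  /* stand-in work */
				_exit(0);
			}
		}

		for (i = 0; i < NPROC; i++)	/* wait for all the children */
			wait(NULL);

		for (i = 0; i < NPROC; i++)	/* combine the partial results */
			total += partial[i];
		printf("total = %g\n", total);

		shmdt((char *) partial);
		shmctl(shmid, IPC_RMID, NULL);
		return 0;
	}

Each child writes only its own slot, so no locking is needed; the
parent's wait() loop is the synchronization point.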

This method of parallelism is seen most often on multiprocessor
machines since processes will tend to execute on separate processors
and you get incredible throughput improvements.  On the Encore
Multimax this technique is used for grep and make, for instance.  The
technique still works on single processors -- obviously so since the
whole idea behind multitasking is to fully utilize the CPU (amongst
other resources), but usually not as well.

|>It seems to me the answer is no because then every programmer would try this
|>in his/her code and the system would halt.

Yes, this technique will definitely hurt the system if you run too
many parallel processes, but so will any technique that gives a single
application more than its "share" of CPU.  It's also a lot harder
because you have to coordinate processes, so many people won't bother
unless they really need it or the system makes such a thing easy (I
have yet to see a system where it was particularly easy to parallelize
tasks effectively :-).

jim frost
software tool & die
madd@std.com

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/12/89)

In article <40090@bu-cs.BU.EDU>, madd@bu-cs.BU.EDU (Jim Frost) writes:
|  application more than its "share" of CPU.  It's also a lot harder
|  because you have to coordinate processes so many people won't bother
|  unless they really need it or the system makes such a thing easy (I
|  have yet to see a system where it was particularly easy to parallelize
|  tasks effectively :-).

  The Encore version of make looks at an environment variable and
determines how many copies of the C compiler to start. On a machine with
8 CPUs you get a blindingly fast make compared to doing the same thing
(in serial) on a faster machine.
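
  The general idea is easy to sketch in C.  This is not Encore's actual
code, just an illustration: read an environment variable (PARALLEL is
the name the Encore uses for it), and never keep more than that many
child jobs running at once.  The job itself is a stand-in, and error
checking is omitted:

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	static void run_one_job(int n)	/* stand-in for one compile */
	{
		printf("job %d running in pid %ld\n", n, (long) getpid());
		sleep(1);
	}

	int main(void)
	{
		char *p = getenv("PARALLEL");
		int limit = p ? atoi(p) : 1;	/* default to serial */
		int njobs = 10, started = 0, running = 0;

		if (limit < 1)
			limit = 1;

		while (started < njobs) {
			if (running < limit) {	/* room for another child */
				if (fork() == 0) {
					run_one_job(started);
					_exit(0);
				}
				started++;
				running++;
			} else {		/* full up: reap one first */
				wait(NULL);
				running--;
			}
		}
		while (running-- > 0)		/* collect the stragglers */
			wait(NULL);
		return 0;
	}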

  Several companies claim they do parallelizing within a process, but I
haven't got the measurements here and the guy who has them is on vacation.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

larry@macom1.UUCP (Larry Taborek) (10/12/89)

From article <593@cogent.UUCP>, by doug@cogent.UUCP (Doug Perlich):
> I have recently become interested in having an application program run as fast
> as possible!  (Sound familiar?)
> 
> What I am interested in is how a program can get a higher priority at run time.
> More exactly, what other methods are there to get screaming performance out of
> a UNIX machine?
> As I understand it, only root can use a negative nice to make a program run
> faster.  Are there ways of dedicating a processor (w/ or w/o a multiprocessor)?
> 
> I am mainly interested in a multi-user system.
> 
> It seems to me the answer is no because then every programmer would try this
> in his/her code and the system would halt.
> 
> -Doug.

I used to be the system administrator of an older Unix system (BSD
4.2) and I wanted to get my programs to run faster.  So I went
through the manual pages on the network daemons and removed (commented
out) each daemon in the startup scripts rc and rc.local that
was used for a network service we weren't using.  Those
that were started had a nice placed on them so that they ran at a
lower priority (we didn't do much networking anyway).  This meant
that a whole host of processes that usually competed pretty
evenly with my stuff for CPU time either weren't there or were
competing at a disadvantage.  I immediately noticed a speed
increase in 'user' tasks.

Another technique is to run your program in single-user mode.
Naturally, on a system other people need to use, dropping to
single-user mode is not practical, but for large database loads and
the like it does make sense.

If you do have access to the superuser login, the nice command
can be used with a negative value to speed up your program.  If not,
perhaps you can persuade your co-workers to nice their programs down.
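
The same thing can also be done from inside the program itself, for
whatever that is worth.  Here is a minimal sketch (the value -10 is
only an example): a negative argument to setpriority(2), like a
negative nice, raises the calling process's priority and fails for
ordinary users:

	#include <stdio.h>
	#include <sys/time.h>
	#include <sys/resource.h>

	int main(void)
	{
		/* 0 as the second argument means "this process" */
		if (setpriority(PRIO_PROCESS, 0, -10) == -1) {
			perror("setpriority");	/* probably not the superuser */
			return 1;
		}
		printf("now running at priority %d\n",
		    getpriority(PRIO_PROCESS, 0));

		/* ... the real work of the program goes here ... */
		return 0;
	}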

Programs can also be run at night, where they do not have to compete
with other user processes for the machine's resources.

Well, hope this helps...
-- 
Larry Taborek	..!uunet!grebyn!macom1!larry	Centel Federal Systems
		larry@macom1.UUCP		11400 Commerce Park Drive
						Reston, VA 22091-1506
My views do not reflect those of Centel		703-758-7000

chris@mimsy.UUCP (Chris Torek) (10/13/89)

In article <1029@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM
(Wm E Davidsen Jr) writes:
>  The Encore version of make looks at an environment variable and
>determines how many copies of the C compiler to start. On a machine with
>8 CPUs you get a blindingly fast make compared to doing the same thing
>(in serial) on a faster machine.

(Not if the serial machine is more than 8 times faster, or if there is
only one source file.)

Unfortunately, the Encore version of cc, which is apparently a Greenhills
C compiler, has all of its `phases' built in.  Thus, if you are compiling
a single file, you cannot preprocess on cpu 0, compile on cpu 1, and
assemble on cpu 2 all at the same time.

Given the standard edit/compile/debug cycle, this---combining
everything---seems to me to be a major mistake.  Well, not so major as
all that, perhaps, since most of the time is spent in the compilation
part, not in preprocessing or assembly.  Still, the potential was
there, and would return if Encore used gcc as their standard compiler.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

david@quad1.quad.com (David A. Fox) (10/13/89)

In article <20140@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>
>Unfortunately, the Encore version of cc, which is apparently a Greenhills
>C compiler, has all of its `phases' built in.  Thus, if you are compiling
>a single file, you cannot preprocess on cpu 0, compile on cpu 1, and
>assemble on cpu 2 all at the same time.
>
>In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
>Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

Tsk, tsk. You are forgetting about the -X182 option to cc :-) : 

.c.o:
	$(CC) $(CFLAGS) -S -c -Wc,-X182 $< | as -o $*.o 

I saw this in some Encore makefiles once.  I sure don't bother, but now
that I've been reminded...  Also worth looking at, perhaps, are the
(apparently obsolete) "-A" and (apparently preferred) "-q
nodirect_code" options; they also seem to fork off processes like
crazy.

After all, the more processes you use, the sooner it'll be done. 

Right? :-) 


-- 
David  A. Fox					Quadratron Systems Inc.	
Inet: david@quad.com, Postmaster@quad.com
UUCP: david@quad1.uucp 
      uunet!ccicpg!quad1!david

"Man, woman, child... All is up against the wall - of science." 

bph@buengc.BU.EDU (Blair P. Houghton) (10/14/89)

In article <20140@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <1029@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM
>(Wm E Davidsen Jr) writes:
>>  The Encore version of make looks at an environment variable and
>>determines how many copies of the C compiler to start. On a machine with
>>8 CPUs you get a blindingly fast make compared to doing the same thing
>>(in serial) on a faster machine.

But not perfectly reliably.  It's messed up on some dependencies in
large makes we've run in the past.  Sometimes it gets ahead of itself
and goes on to the next step before some of the parallel cc'ing is
completed in this step...it hasn't done this in a long while, though.
It doesn't stop me from setenv'ing PARALLEL to 12, or anything...it's
just _one_ more possible reason for having to rerun a "make World." :-)

>Unfortunately, the Encore version of cc, which is apparently a Greenhills
>C compiler, has all of its `phases' built in.  Thus, if you are compiling
>a single file, you cannot preprocess on cpu 0, compile on cpu 1, and
>assemble on cpu 2 all at the same time.

Much harder to manage if you've got eight processors and PARALLEL==3.
Might even cause a performance decrease if the ccom phase gets assigned
to the overlapping processor.  With the entire cc constrained to one
processor, you're assured of eight simultaneous compilations (for whatever
an assurance of regularity from a computer is worth; I mean, aren't they
supposed to manage the complexities and leave us to the iced tea?).

>Given the standard edit/compile/debug cycle, this---combining
>everything---seems to me to be a major mistake.
>Well, not so major as all that, perhaps, since most of the time is
>spent in the compilation part, not in preprocessing or assembly.
>Still, the potential was there, and would return if Encore used gcc as
>their standard compiler.

Uh, who does?  :-)  Seems to me you can usually snarf a _newer_ version
of gcc than any computer company is prepared to deliver with a packaged
system.  Then again, I haven't tried to build gcc on our Encore (it kicks
silicon butt on the uVAXen, tho').  Any reason I should or shouldn't?

				--Blair
				  "I just want to know when rn
				   is going to take advantage of
				   parallelism.  :-) :-) :-)"

meissner@twohot.rtp.dg.com (Michael Meissner) (10/16/89)

In article <20140@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:

>  In article <1029@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM
>  (Wm E Davidsen Jr) writes:
>  >  The Encore version of make looks at an environment variable and
>  >determines how many copies of the C compiler to start. On a machine with
>  >8 CPUs you get a blindingly fast make compared to doing the same thing
>  >(in serial) on a faster machine.
>  
>  (Not if the serial machine is more than 8 times faster, or if there is
>  only one source file.)
>  
>  Unfortunately, the Encore version of cc, which is apparently a Greenhills
>  C compiler, has all of its `phases' built in.  Thus, if you are compiling
>  a single file, you cannot preprocess on cpu 0, compile on cpu 1, and
>  assemble on cpu 2 all at the same time.
>  
>  Given the standard edit/compile/debug cycle, this---combining
>  everything---seems to me to be a major mistake.  Well, not so major as
>  all that, perhaps, since most of the time is spent in the compilation
>  part, not in preprocessing or assembly.  Still, the potential was
>  there, and would return if Encore used gcc as their standard compiler.

Especially when you optimize with gcc, most of the time (80% or more) is spent in
cc1, which has the following passes over the RTL file:

    *	The initial pass creating the RTL file from the TREE file
	created by the parser;

    *	A pass to copy any shared RTL structure that should not be
	shared.

    *	The first jump optimization pass.

    *	A pass to scan for registers to prepare for common sub-
	expression elimination.

    *	The common sub-expression elimination pass.

    *	Another jump optimization pass.

    *	Another register scan pass for loop optimizations.

    *	A loop optimization pass.

    *	A flow analysis pass.

    *	A combiner pass to combine multiple RTL expressions into
	larger RTL expressions.

    *	A pass to allocate registers that are live within a single
	basic block.

    *	A pass to allocate registers whose lifetime spans multiple
	basic blocks.

    *	A final jump optimization pass.

    *	An optional delayed branch recognition pass.

    *	A final pass that expands peepholes, and emits assembler code.

Also note that when using -pipe, the preprocessor internally buffers the
entire text and writes it out in one fell swoop at the end.  This means
that in general only the compiler proper (cc1) and the assembler run in
parallel.  This helps to some degree.  A few months ago, I measured
how much it helped on a dual-processor 88100: the build time of the
entire compiler was about 14 minutes without -pipe, and about 10-12
minutes with -pipe.

--

Michael Meissner, Data General.				If compiles were much
Uucp:		...!mcnc!rti!xyzzy!meissner		faster, when would we
Internet:	meissner@dg-rtp.DG.COM			have time for netnews?