[comp.lang.misc] Assembly language and Speed...

billw@navajo.STANFORD.EDU (William E. Westfield) (12/19/86)

Well, I'll admit to doing almost all of my current programming in
assembly languages of various sorts (mostly tops20 monitor and utility
hacking - it's already in macro, so I don't have much choice.  On the
other hand, the PDP10, like the PDP11, has an artistically beautiful
instruction set.  The other thing I program a lot in assembly language
is Intel 808x-based MSDOS machines...)

While higher level languages certainly have portability advantages in
many cases (but not all), and there are clear areas where complexity
forbids assembly (I hope never to have to write a package doing heavy
floating point math in assembler, even on a machine that has floating
point instructions!), I'd like to argue against the idea that HLL code
is only a "little bit" slower than assembly code.

It is true that a compiler can generate code as good as or better than
a human's for straight-line math code.  A good compiler ought to be able
to optimize loops, array accesses, and register allocation for complex
expressions as well as a human can.  But where HLLs fall down is when you
start to do function and procedure calls.  And of course you should do
a lot of those in a well structured program.

An assembly language programmer has the option of writing "fast"
subroutines that pass arguments in registers and don't bother setting
up stack frames, and so on.  He can make sure that the arguments he is
passing in registers are already in the registers that they are
supposed to be in, and results go where they are ready to be used next.
Arguments can be passed by value or by reference, whichever is more
convenient, or faster.  And of course there are special instructions to
be exploited.

As an extreme example (one I have actually done), consider writing
some kind of parser in a high level language.  One of the things you
have to do is push and pop tokens from a stack.  Great.  Almost every
popular processor has instructions that do PUSH and POP in hardware.
Some processors do overflow and/or underflow detection at the same
time.  Many processors allow multiple stacks to exist concurrently
(multiple stack pointers in registers).  In other cases it is
perfectly possible to use the hardware CALL/RETURN/etc. stack for user
values too.  So what happens when you try to do this in PASCAL, huh?
Well, you write functions called PUSH and POP, and your arguments get
put on a stack frame on the hardware stack, and you do a call
instruction which puts the return address on the stack and goes off to
code that probably fetches the value of the stack pointer from memory and
puts it back, and does a return instruction that fetches the return
address back, and then you clean up the stack frame and continue on.
Wow, that was only 10 or 20 times slower than the hardware instruction.
Boy am I glad that I was able to write that in a high level language
with error checking and things like that instead of having to remember
the strange mnemonic for the PUSH instruction (gee, it could have been
something really nasty like MOVW source,A2@-).

That was a frustrating Pascal program to write.

[OK, I'll allow that in C this might end up written as a macro
 doing something like *sp++ = &source that would have a good chance
 of being optimized to a single instruction too, but this is not
 specifically a C vs Assembler debate...]
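
[For concreteness, here is a minimal sketch of both versions in C -
 the token type, stack size, and all the names are made up for
 illustration:

	typedef int TOKEN;		/* whatever a token really is */
	#define STACKSIZE 100

	TOKEN stack[STACKSIZE];
	TOKEN *sp = stack;		/* next free slot; grows upward */

	/* macro version: each use can compile down to one or two
	   instructions, much like the hardware PUSH and POP */
	#define PUSH(tok)	(*sp++ = (tok))
	#define POP()		(*--sp)

	/* function version: each use pays the full call, stack-frame,
	   and return overhead described above */
	void push(TOKEN tok)	{ *sp++ = tok; }
	TOKEN pop(void)		{ return *--sp; }

 Neither version checks for overflow or underflow, of course - the
 hardware detection mentioned above comes for free, while here it
 would cost an explicit test per operation.]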

Sigh.

BillW, writing small, fast, programs, in assembler, and happy.

amos@instable.UUCP (Amos Shapir) (12/19/86)

In article <1233@navajo.STANFORD.EDU> billw@navajo.STANFORD.EDU (William E. Westfield) writes:
>... I'd like to argue against the idea that HLL code
>is only a "little bit" slower than assembly code.
>
> [describing a complicated way to do stack ops in PASCAL subroutines,
  and the complicated code that is generated]

What makes you think that a *good* optimizing compiler can't expand
function calls in-line and optimize them to a point that comes very
close to the intended machine instruction?  Any logic that you can put
into the programming process can be translated into an algorithm and
done by the compiler.  It only takes a lot of CPU power and memory (at
compile time only!) and these become cheaper all the time.
(I know most of you still can't use such compilers, but they also
become cheaper and more available - no flames on that, please)
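
To illustrate with a made-up C fragment (the names square, f, and y
are invented):

	int square(int x)	{ return x * x; }

	int y;
	void f(int a, int b)	{ y = square(a + b); }

an inlining compiler can compile f as if it had been written

	void f(int a, int b)	{ int t = a + b;  y = t * t; }

with no call instruction, no stack frame, and no argument copying;
note that the compiler introduces the temporary itself, so the
expansion stays correct even though the argument is used twice.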

>BillW, writing small, fast, programs, in assembler, and happy.
Amos Shapir, writing small, fast, programs, in C, and happy - much happier
than I'd be if I had to write them in assembler!


-- 
	Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel
(011-972) 52-522261  amos%nsta@nsc 34.48'E 32.10'N

shebs@utah-cs.UUCP (Stanley Shebs) (12/19/86)

In article <1233@navajo.STANFORD.EDU> billw@navajo.STANFORD.EDU (William E. Westfield) writes:
>But where HLLs fall down is when you
>start to do function and procedure calls.  And of course you should do
>a lot of those in a well structured program.

Can you say "interprocedural analysis"?  A really hot compiler would
flush every subroutine call that it possibly could.  Disadvantages are
that you must have the source code for both caller and callee available
to mess with, and the resulting code is harder to debug at the object
level (no stack frames to look at).

>An assembly language programmer has the option of writing "fast"
>subroutines that pass arguments in registers and don't bother setting
>up stack frames, and so on.

Some language implementations (for instance Portable Standard Lisp)
*do* pass arguments in registers.  But of course since a function may
be called before it is defined, the calling protocol has to be more
conservative and save registers in non-tail-recursive cases.

>...  One of the things you
>have to do is push and pop tokens from a stack.  Great.  Almost every
>popular processor has instructions that do PUSH and POP in hardware.
>[... various hardware features]
>So what happens when you try to do this in PASCAL, huh?
>Well, you write functions called PUSH and POP, and your arguments get
>put on a stack frame on the hardware stack, and you do a call
>instruction which puts the return address on the stack and goes off to
>code that probably fetches the value of the stack pointer from memory and
>puts it back, and does a return instruction that fetches the return
>address back, and then you clean up the stack frame and continue on.

The first right answer is "don't use Pascal, use a high-level language".
The second right answer has already been mentioned by others, and that is
to "fix the compiler".  You need a sufficiently high-level language (like
Lisp) so that you can write the push and pop operations as normal functions.
For prototyping, you can use a dumb implementation, then when it's time
to optimize, you hack the compiler to opencode the function calls into
an appropriate sequence of instructions.  Of course, hacking the compiler
may not be easy!  On the other hand, some compilers (such as the PSL
and PCLS compilers) are highly declarative, and it is not unusually hard
to add optimizations.  At the moment, you still have to know quite a bit
about how the system works internally, and we've been working on ways to
make optimizations even easier to add/modify.  Ultimately, we want a compiler
that is completely open and accessible to programmers.

>BillW, writing small, fast, programs, in assembler, and happy.

							stan shebs
							(orion.utah.edu)

rentsch@unc.UUCP (Tim Rentsch) (12/24/86)

In some article William E. Westfield writes:
> It is true that a compiler can generate code as good as or better than
> a human's for straight-line math code.  A good compiler ought to be able
> to optimize loops, array accesses, and register allocation for complex
> expressions as well as a human can.  But where HLLs fall down is when you
> start to do function and procedure calls.  And of course you should do
> a lot of those in a well structured program.

It seems to me that a well-structured programming style argues in
favor of an HLL, not against it.  The reason is that a provision for
"inline" procedure calls supports a highly procedurized style, while
not incurring the procedure call overhead (at runtime).  Contrast
with assembly, where you write 'PUSHJ' (or whatever) to call a
subroutine.  Oh, I know all about macro assemblers, but macro
processors are never quite transparent, and then you have two
*different* mechanisms (with peculiar interactions).  With inlineable
procedures, on the other hand, there is uniformity of syntax,
identical semantics, strong type checking, all those other good
things, AND no procedure call overhead.  Just like having your cake
and eating it, too.  



> An assembly language programmer has the option of writing "fast"
> subroutines that pass arguments in registers and don't bother setting
> up stack frames, and so on.  He can make sure that the arguments he is
> passing in registers are already in the registers that they are
> supposed to be in, and results go where they are ready to be used next.
> Arguments can be passed by value or by reference, whichever is more
> convenient, or faster.  And of course there are special instructions to
> be exploited.

Again, inline procedures give a safer, easier-to-use mechanism, and
they run faster as well.  Argument passing?  Good compilers (some
actually exist) will check the body of the procedure and pass
parameters by renaming the variables in the procedure body, if the
renaming preserves the semantics of the procedure call.  The result
is that no data is moved (as in the assembler case), but without the
programmer having to think about whether the improved code will "work".  Here
is the crucial difference between HLL and assembler -- when I write
in assembler I have to think about each such optimization to be sure
it does the right thing.  Using inline procedure calls, the
semantics of the call is guaranteed not to change if the procedure
is made inline, so I never have to worry whether passing an argument by
reference is identical to passing it by value -- the compiler does
that for me.
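
A made-up C-style illustration of the renaming: given

	int min(int a, int b)	{ return a < b ? a : b; }

	int use(int x, int y)	{ return min(x, y); }

the compiler can expand use into the equivalent of

	int use(int x, int y)	{ return x < y ? x : y; }

where the formals a and b have simply been renamed to x and y.  No
argument is moved at all, and the compiler has checked that the
renaming is safe (the body does not assign to a or b).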

Exploiting special instructions is something most compilers do
poorly (if at all), so there should be an (infrequently used) escape
mechanism to generate them directly.  BUT, this escape should be
much more structured than the assembly language style of "anything
goes".  For example, the generated instructions should be
encapsulated within a procedure.  This way, the caller need not know
whether he is using the special instruction, and one implementation
can be exchanged for another completely transparently.  If combined
with the inline mechanism, the procedure call overhead again
disappears, and we obtain the best of both worlds.
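
A hypothetical C-flavored sketch of the encapsulation (the function
and its portable body are invented for illustration; the machine-code
escape itself is whatever mechanism the compiler provides):

	/* Count the one-bits in a word.  Callers never know whether
	   they get this portable loop or a version whose body has
	   been replaced by a single special instruction; combined
	   with inlining, even the call overhead disappears. */
	int bitcount(unsigned w)
	{
		int n = 0;

		while (w != 0) {
			n += w & 1;	/* add in the low bit */
			w >>= 1;
		}
		return n;
	}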



Lest anyone think such a language is just a fantasy, I should say
that the MESA programming language supports both INLINE procedures
and procedures escaped to machine code.  If your view is that
therefore MESA is not an HLL but rather an assembler, well, I won't
disagree with you, but I will say that in that case the MESA
assembler is better than your assembler.

cheers,

txr

rh@mit-eddie.MIT.EDU (Randy Haskins) (12/25/86)

There are problems with INLINE, however.  I have forsaken the
world of big things and started writing code for PC's, and for
the first time in my life, have to give serious thought to trading
speed for space in some cases.  Inline coding may be faster, but
it takes up more space.  The coder has to decide the relative
importance for each case. 

Just to add to the argument, here is an example of what a human can do
that I seriously doubt a compiler would do.  I first saw something
like this in some local hacks to the Monitor for a Tops-20 at MIT.  At
first, I was offended, but then I saw the sheer elegance of it.

PRSPC:	Skipa	T2,[32.]	; move a space (ASCII 32) into T2, skip next instruction
PRTAB:	Movei	T2,9.		; put a tab (ASCII 9) into T2
	Movei	T1,.PRIOU	; primary output
	Bout%			; byte output
	 Ercal	.+1		; ignore error
	Popj	P		; return

Not only does it avoid having to make an extraneous jump or call (in order
to use common code), it saves space.  Would a compiler "know" to do something
like this?  I doubt it.

Random
-- 
Randwulf  (Randy Haskins);  Path= genrad!mit-eddie!rh

ggs@ulysses.homer.nj.att.com (Griff Smith) (12/27/86)

In article <4375@mit-eddie.MIT.EDU>, rh@mit-eddie.MIT.EDU (Randy Haskins) writes:
> Just to add to the argument, here is an example of what a human can do
> that I seriously doubt a compiler would do.  I first saw something
> like this in some local hacks to the Monitor for a Tops-20 at MIT.  At
> first, I was offended, but then I saw the sheer elegance of it.
> 
> PRSPC:	Skipa	T2,[32.]	; move a space (ASCII 32) into T2, skip next instruction
> PRTAB:	Movei	T2,9.		; put a tab (ASCII 9) into T2
> 	Movei	T1,.PRIOU	; primary output
> 	Bout%			; byte output
> 	 Ercal	.+1		; ignore error
> 	Popj	P		; return
> 
> Not only does it avoid having to make an extraneous jump or call (in order
> to use common code), it saves space.  Would a compiler "know" to do something
> like this?  I doubt it.
> 
> Randwulf  (Randy Haskins);  Path= genrad!mit-eddie!rh

Actually, this kind of optimization was standard for the TOPS-10 and
TOPS-20 Fortran and LISP compilers.  A simple peephole optimizer could
find these cases easily.  The first step would be for the optimizer to
recognize that the ends of the two functions are identical; it would
then do a space optimization by creating a jump to the common code.
The {skipa, movei} trick would then fall out while applying standard
peephole optimizations of moves and jumps.  As an example of how far a
compiler could carry it, consider the following hypothetical output
from a (probably non-existent) optimizing C compiler:

Given the statement:

	if (a == 0 || a == 1)
		b += 1;
	else
		b -= 1;

the raw PDP-10 assembly language from the code generator would be something like:

	MOVE	R1,a	; if (a == 0
	JUMPE	R1,L1
	CAIE	R1,1	; || a == 1)
	 JRST	L2
L1:
	AOS	b	; b += 1
	JRST	L3
L2:
	SOS	b	; b -= 1
L3:

A simple optimization of the last five lines would reduce this to:

	MOVE	R1,a	; if (a == 0
	JUMPE	R1,L1
	CAIE	R1,1	; || a == 1)
	 JRST	L2
L1:
	AOSA	b	; b += 1 (skip next instruction)
L2:
	 SOS	b	; b -= 1

This can then be further optimized by inverting the sense of the compare
and eliminating the jump (JRST):

	MOVE	R1,a	; if (a == 0
	JUMPE	R1,L1
	CAIN	R1,1	; || a == 1)
L1:
	 AOSA	b	; b += 1 (skip next instruction)
	 SOS	b	; b -= 1

Finally, by replacing the {move, jumpe} with an instruction that loads
and tests in one operation we get:

	SKIPE	R1,a	; if (a == 0
	 CAIN	R1,1	; || a == 1)
	 AOSA	b	; b += 1 (skip next instruction)
	 SOS	b	; b -= 1

None of these substitutions are any more complicated than ones commonly
used by the UNIX(R) C compiler, yet the resulting sequence of instructions
appears to be the result of clever design.  Given how long it took to
make sure I didn't have any mistakes in the above example, I would much
rather have a compiler do it for me.
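
For the flavor of how mechanical this is, here is a made-up sketch in
C of a single peephole rule - delete a JRST whose target label is the
very next line - written as the sliding-window pass such optimizers
use (the instruction representation is invented for illustration):

	#include <stdio.h>
	#include <string.h>

	/* One peephole rule: a jump to the immediately following
	   label is a jump to the next instruction, so drop it.
	   Instructions are just text lines here; returns the new
	   instruction count after compacting the array in place. */
	int peep(char *code[], int n)
	{
		int i, out = 0;
		char target[32], label[32];

		for (i = 0; i < n; i++) {
			if (i + 1 < n
			    && sscanf(code[i], " JRST %31s", target) == 1
			    && sscanf(code[i + 1], " %31[^:]:", label) == 1
			    && strcmp(target, label) == 0)
				continue;	/* useless jump: drop it */
			code[out++] = code[i];
		}
		return out;
	}

	int main(void)
	{
		char *code[] = { "\tJRST\tL3", "L3:", "\tSOS\tb" };
		int i, n = peep(code, 3);

		for (i = 0; i < n; i++)	/* prints the two surviving lines */
			puts(code[i]);
		return 0;
	}

A real peephole pass is just a table of such rules applied repeatedly
until none fires; the {skipa, movei} trick and the rewrites above all
fall out of rules of about this size.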
-- 
Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		1-201-582-7736
UUCP:		{allegra|ihnp4}!ulysses!ggs
Internet:	ggs@ulysses.uucp

rentsch@unc.UUCP (Tim Rentsch) (12/31/86)

In article <4375@mit-eddie.MIT.EDU> rh@eddie.MIT.EDU (Randy Haskins) writes:
> There are problems with INLINE, however.  I have forsaken the
> world of big things and started writing code for PC's, and for
> the first time in my life, have to give serious thought to trading
> speed for space in some cases.  Inline coding may be faster, but
> it takes up more space.  The coder has to decide the relative
> importance for each case. 

Good point.  On the other hand, any procedure which is called only
once (such as those used to provide structure) will run faster and
be smaller if expanded inline.



> Just to add to the argument, here is an example of what a human can do
> that I seriously doubt a compiler would do.  I first saw something
> like this in some local hacks to the Monitor for a Tops-20 at MIT.  At
> first, I was offended, but then I saw the sheer elegance of it.
> 
> PRSPC:	Skipa	T2,[32.]	; move a space (ASCII 32) into T2, skip next instruction
> PRTAB:	Movei	T2,9.		; put a tab (ASCII 9) into T2
> 	Movei	T1,.PRIOU	; primary output
> 	Bout%			; byte output
> 	 Ercal	.+1		; ignore error
> 	Popj	P		; return

Yes, clever code, but be careful not to mistake saving space in the
small for saving space in the large.  If the routine were duplicated,
it would cost only a few extra instructions - in other words, not much.
Programs usually get big through more global design choices, and
these little optimizations don't contribute much one way or the
other.  As with coding speed, it's the *big* decisions that matter,
and this isn't one of them.

cheers,

txr