[comp.arch] More on Linpack pivoting: isamax and instruction set design

dgh@validgh.com (David G. Hough on validgh) (06/14/91)

Refering to isamax in my previous Linpack posting brought to mind
the quandary that it represents for computer instruction set architects.

I believe that Cray first noticed that when LU-factoring matrices of dimension
n, although saxpy operations
( X += c * Y for scalar c and vector X and Y) are n times more frequent
than isamax operations (find i such that Xi has largest magnitude in X),
you can gain a lot more from agressive vector hardware design and compiler
technology on saxpy than on isamax, so that eventually isamax becomes
the bottleneck.  The heart of isamax is

      do 30 i = 2,n
         if(abs(dx(i)).le.dmax) go to 30
         isamax = i
         dmax = abs(dx(i))
   30 continue

What kinds of additional
instructions belong in an instruction set architecture that is
intended to be scalable, from one-chip systems with inexpensive memory,
to very high performance systems implemented with multiple paths to memory
and perhaps multiple functional units (integer, float, branch)
that may, however, be relatively distant, so that conditional
branches become quite expensive?
A simple max/min won't help.   How can multiple comparisons be overlapped?
Can the "abs" be concealed?
-- 

David Hough

dgh@validgh.com		uunet!validgh!dgh	na.hough@na-net.ornl.gov

andy@DEC-Lite.Stanford.EDU (Andy Freeman) (06/14/91)

In article <396@validgh.com> dgh@validgh.com (David G. Hough on validgh) writes:
>      do 30 i = 2,n
>         if(abs(dx(i)).le.dmax) go to 30
>         isamax = i
>         dmax = abs(dx(i))
>   30 continue

One instruction that can eliminate the conditional branch is
conditional assignment.  (It is unclear whether it should be a 3 in 1
out operation or a 2 in 1 out operation with the destination as an
implicit operand.)  In the above, it can be used like:

	isamax = 1
	dmax = abs(dx(1))
	do 30 i = 2, n
	    absdx = abs(dx(i))
	    ibig  = absdx .le. dmax
	    asm("<if ibig is true, replace isamax's value with i>")
	    asm("<if ibig is true, replace dmax's value with absdx>")
    30	continue

Obviously one can pipeline the abs.  The delay caused by the
loop-carried dependence can be eliminated by unrolling (or recursive
doubling) and computing separate values of ibig and dmax for different
sections of the array.

One has to be careful with this, as branch-prediction might hide the
branch penalty associated with the original version.

-andy
--
UUCP:    {arpa gateways, sun, decwrl, uunet, rutgers}!neon.stanford.edu!andy
ARPA:    andy@neon.stanford.edu

jbuck@forney.berkeley.edu (Joe Buck) (06/14/91)

In article <396@validgh.com>, dgh@validgh.com (David G. Hough on validgh) writes:
|> [ what architectural features speed this up? ]
!>
|>       do 30 i = 2,n
|>          if(abs(dx(i)).le.dmax) go to 30
|>          isamax = i
|>          dmax = abs(dx(i))
|>    30 continue

DSP chips are good for things like this.

The following routine takes 10+3N cycles (60 nsec/cycle for the
original C30):

	ldi	dx,ar0			; ar0 points to the data
	ldi	@n,rc			; vector length
	subi	1,rc			; n-1 to get n loops
	ldf	-1.0,r1			; set max abs value to -1
	rptb	loop			; start zero overhead loop
;...........................
	absf	*ar0++,r0		; r0 = absval of dx[i]
	cmpf	r0,r1			; larger than max?
loop:	ldigt	rc,r2			; if so, mark its position
;...........................
; rc is decremented once each time -- it's n-1 if the first term is
; the max, n-2 if the second, etc.  So n-rc would be the isamax
; output of a Fortran routine.
	ldi	@n,r0
	subi	rc,r0
; now r0 has isamax and r1 has dmax. (extra instructions needed
; to do a C call interface).


Several elements contribute to speed: the zero-overhead loop, the
conditional load (ldigt), the absolute value instruction, and
(sorry, purists) the autoincrement addressing mode.

The RS/6000 already has, in many cases, zero-overhead loops.
I have found, though, that conditional loads are a big win
on heavily pipelined machines where a branch would cause a
large pipeline penalty.


--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck

bs@frieda.mitre.org (Robert D. Silverman) (06/14/91)

In article <1991Jun13.234834.22970@neon.Stanford.EDU> andy@DEC-Lite.Stanford.EDU (Andy Freeman) writes:
:In article <396@validgh.com> dgh@validgh.com (David G. Hough on validgh) writes:
:>      do 30 i = 2,n
:>         if(abs(dx(i)).le.dmax) go to 30
:>         isamax = i
:>         dmax = abs(dx(i))
:>   30 continue
:
:One instruction that can eliminate the conditional branch is
     ^^^^^^^^^^^^
:conditional assignment.  (It is unclear whether it should be a 3 in 1
:	do 30 i = 2, n
:	    absdx = abs(dx(i))
:	    ibig  = absdx .le. dmax
	    ^^^^^^^^^^^^^^^^^^^^^^^^

I would be very interested in seeing the assembler code that gets
emitted for this line of Fortran. How can this statement get executed
WITHOUT a branch??
 
I also have never seen a machine that has conditional assignment
implemented as an instruction! (As indicated above). What machine is
this? Could you show us the assembler syntax??

I ** believe ** that the above conditional assignment will get compiled
into a test and branch. I don't know of any machine that provides
instructions to do otherwise. It doesn't appear to the Fortan programmer
to be a branch because the branch is 'hidden'.

Of course, I could be wrong.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

chrisg@cbmvax.commodore.com (Chris Green) (06/14/91)

In article <1991Jun13.234834.22970@neon.Stanford.EDU> andy@DEC-Lite.Stanford.EDU (Andy Freeman) writes:
>In article <396@validgh.com> dgh@validgh.com (David G. Hough on validgh) writes:
>>      do 30 i = 2,n
>>         if(abs(dx(i)).le.dmax) go to 30
>>         isamax = i
>>         dmax = abs(dx(i))
>>   30 continue
>

	Motorola's DSP 96000 family has some instructions that would help this
loop out a lot. They have conditional assign, and two moves per instruction.

I think the code for this loop could be done with the following instructions
(I don't know the exact mnemonics, but here's the general idea):


	int abs1=dx(0)
	i=&dx[2]

lp:	set abs1=abs(abs1). fetch *(i++) to abs2
	compare abs1 to dmax
	conditionally store i to isamax and abs1 to dmax
	
	set abs2=abs(abs2). fetch *(i++) to abs1
	compare abs2 to dmax
	conditionally store i to isamax and abs2 to dmax

	6 instructions for 2 iterations ain't bad. you'd have to do one SUB at
the end to correct isamax to be i instead of &dx[i]. zero overhead looping handles
the iteration.

-- 
*-------------------------------------------*---------------------------*
|Chris Green - Graphics Software Engineer   - chrisg@commodore.COM      f
|                  Commodore-Amiga          - uunet!cbmvax!chrisg       n
|My opinions are my own, and do not         - killyouridolssonicdeath   o
|necessarily represent those of my employer.- itstheendoftheworld       r
*-------------------------------------------*---------------------------d

rmb@omews57.intel.com (Bob Bentley) (06/15/91)

In article <1991Jun14.134338.4673@linus.mitre.org> bs@frieda.mitre.org (Robert D. Silverman) writes:
> 
>I also have never seen a machine that has conditional assignment
>implemented as an instruction! ....... What machine is this?
>Could you show us the assembler syntax??
>

The Intel i960(tm) architecture has a limited type of conditional assignment
in the form of the TEST instructions, which are used to test the condition
code and set a boolean value in a register.  They are intended primarily
for implementing C statements of the form:

			a = b > c

without having to do conditional branches.  The assembler code might look like:

			cmpo  r5, r6
			testg r4

which causes r4 to be set to 0 or 1 depending on the result of the comparison.

See any 80960 Programmers Reference Manual or the Myers & Budde book for more
details.

	Bob Bentley
 	Intel Corp., M/S JF1-58			E-mail: rmb@ichips.intel.com 
 	5200 N.E. Elam Young Parkway		Phone:  (503) 696-4728          
 	Hillsboro, Oregon 97124			Fax:    (503) 696-4515

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (06/16/91)

In article <1991Jun14.134338.4673@linus.mitre.org>, bs@frieda.mitre.org (Robert D. Silverman) writes:
> :	    absdx = abs(dx(i))
> :	    ibig  = absdx .le. dmax
> 	    ^^^^^^^^^^^^^^^^^^^^^^^ 
> I would be very interested in seeing the assembler code that gets
> emitted for this line of Fortran. How can this statement get executed
> WITHOUT a branch??

Quite a lot of machines can do this, including VAXes, MC680x0, MC88k,
Prime 50 series, Burroughs machines, NS32?32.  Since you asked for
assembler, here's NS32532 assembly code.

	# assume that DX is a static address and that I is in R0
	absf	DX[R0:D], F0	# F0 := abs(dx(i))
	# assume that dmax is in F1
	cmpf	F1, F0		# N := F1 > F0, Z := F1 = F0
	# assume that ibig is to be in R1
	sgtd	R1		# R1 := 1 if abs(dx(i)) > dmax, 0 otherwise

Indeed, even an 8086 can do it.  Once you've done the comparison,
	lahf			# AH := [Sign,Zero,*,Aux,*,Parity,*,Carry]
will do the trick, if your Fortran compiler takes "negative" as the
representation of .true., "non-negative" as the representation of .false.

> I also have never seen a machine that has conditional assignment
> implemented as an instruction!

Perhaps someone who knows the ARM instruction set can comment on this.
There was a trick that could be used on KA-10s.  Registers could be
addressed as memory.  If the NS32532 supported that hack for FP
registers, then something like
	movf	F1, "F0"[R1:D]
would be usable in this case.  I have a notion that something similar
could be done on the AMD 29000.

As a single instruction, you may not find it, but it isn't hard to do a
conditional assignment using straight-line code.  Suppose R0 holds a 0
or 1 (which we may destroy), and we want to do
	R2 := R3 IF R0
Here's the code, using NS32532 mnemonics.
	negd	r0, r0	# r0 := if <change wanted> then ~0 else 0
	andd	r0, r3	# if not <change wanted> then r3 := 0
	comd	r0, r0	# r0 := if <change wanted> then 0 else ~0
	andd	r0, r2	# if <change wanted> then r2 := 0
	ord	r3, r2	# r2 := if <change wanted> then 0 | r3
			#       else r2 | 0

Of course, as Herman Rubin will point out if I don't, this involves doing
bitwise operations on the floating-point registers, which is why I quietly
did it to R2, R3 instead of F1, F0.  (NB:  the instructions above are just
a simple MUX done wordwise.)

-- 
Q:  What should I know about quicksort?   A:  That it is *slow*.
Q:  When should I use it?  A:  When you have only 256 words of main storage.

bbeckwit@next.com (Bob Beckwith) (06/19/91)

In article <1991Jun14.134338.4673@linus.mitre.org> bs@frieda.mitre.org (Robert D. Silverman) writes:
>
> [example code deleted]
>
>How can this statement get executed WITHOUT a branch??
> 
>I also have never seen a machine that has conditional assignment
>implemented as an instruction! (As indicated above). What machine is
>this? Could you show us the assembler syntax??
>
>I ** believe ** that the above conditional assignment will get compiled
>into a test and branch. I don't know of any machine that provides
>instructions to do otherwise.

  Someone who knows better can correct me if I'm wrong (and they might also 
provide the assembler syntax, any ex-Multifloyds wanna speak up?), but I believe
that the Multiflow Trace machines had a conditional assignment instruction.

  --Bob

--
Bob_Beckwith@NeXT.COM
Digital Hardware Engineering
NeXT Computer, Inc.
900 Chesapeake Drive
Redwood City, CA USA

mac@gold.kpc.com (Mike McNamara) (06/22/91)

	Ah, I see I posted too soon.  I now understand that the
discussion is about linpack's IDAMAX routine.  I think the current
champion hardware for this loop is the ardent titan vector unit.  It
has the DRAMAX instruction. This is translated Double precision
Reduction Absolute MAXimum value instruction.  Run over a vector, it
returns in a scalar the maximum absolute value of that vector. 
	( The machine supports the single precision, minimum absolute
value, and non absolute variants, just so no one would think the
instruction was inserted just for Dongarra.  :-)

	Of course, the failing is that IDAMAX doesn't care
particularily about the actual maximum value; it only wants to know
the *index* of that value.  But then a DVCEQ instruction is fully
vectorized also...

	-mac
--
+-----------+-----------------------------------------------------------------+
|mac@kpc.com| Increasing Software complexity lets us sell Mainframes as       |
|           | personal computers. Carry on, X windows/Postscript/emacs/CASE!! |
+-----------+-----------------------------------------------------------------+

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (06/22/91)

>>>>> On 21 Jun 91 17:24:05 GMT, mac@gold.kpc.com (Mike McNamara) said:

Mike> 	[....] the discussion is about linpack's IDAMAX routine.  I
Mike> think the current champion hardware for this loop is the ardent
Mike> titan vector unit.  It has the DRAMAX instruction. This is
Mike> translated Double precision Reduction Absolute MAXimum value
Mike> instruction.  Run over a vector, it returns in a scalar the
Mike> maximum absolute value of that vector.  	( The machine supports
Mike> the single precision, minimum absolute value, and non absolute
Mike> variants, just so no one would think the instruction was
Mike> inserted just for Dongarra.  :-)

The Cyber 205/ETA-10 contain an instruction which returns *both* the
extreme value and its index.   It can optionally ignore the sign bit.

Timing is something like 1 cycle per element plus 16 cycles for every
new maximum found.  On one of the Livermore Loops we saw a very large
speedup going from double-precision to single precision.  It turns out
that the vector to be tested was a *very* slowly increasing function,
so that the 64-bit version found lots more new maxima than the 32-bit
version....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET