dgh@validgh.com (David G. Hough on validgh) (06/14/91)
Refering to isamax in my previous Linpack posting brought to mind the quandary that it represents for computer instruction set architects. I believe that Cray first noticed that when LU-factoring matrices of dimension n, although saxpy operations ( X += c * Y for scalar c and vector X and Y) are n times more frequent than isamax operations (find i such that Xi has largest magnitude in X), you can gain a lot more from agressive vector hardware design and compiler technology on saxpy than on isamax, so that eventually isamax becomes the bottleneck. The heart of isamax is do 30 i = 2,n if(abs(dx(i)).le.dmax) go to 30 isamax = i dmax = abs(dx(i)) 30 continue What kinds of additional instructions belong in an instruction set architecture that is intended to be scalable, from one-chip systems with inexpensive memory, to very high performance systems implemented with multiple paths to memory and perhaps multiple functional units (integer, float, branch) that may, however, be relatively distant, so that conditional branches become quite expensive? A simple max/min won't help. How can multiple comparisons be overlapped? Can the "abs" be concealed? -- David Hough dgh@validgh.com uunet!validgh!dgh na.hough@na-net.ornl.gov
andy@DEC-Lite.Stanford.EDU (Andy Freeman) (06/14/91)
In article <396@validgh.com> dgh@validgh.com (David G. Hough on validgh) writes: > do 30 i = 2,n > if(abs(dx(i)).le.dmax) go to 30 > isamax = i > dmax = abs(dx(i)) > 30 continue One instruction that can eliminate the conditional branch is conditional assignment. (It is unclear whether it should be a 3 in 1 out operation or a 2 in 1 out operation with the destination as an implicit operand.) In the above, it can be used like: isamax = 1 dmax = abs(dx(1)) do 30 i = 2, n absdx = abs(dx(i)) ibig = absdx .le. dmax asm("<if ibig is true, replace isamax's value with i>") asm("<if ibig is true, replace dmax's value with absdx>") 30 continue Obviously one can pipeline the abs. The delay caused by the loop-carried dependence can be eliminated by unrolling (or recursive doubling) and computing separate values of ibig and dmax for different sections of the array. One has to be careful with this, as branch-prediction might hide the branch penalty associated with the original version. -andy -- UUCP: {arpa gateways, sun, decwrl, uunet, rutgers}!neon.stanford.edu!andy ARPA: andy@neon.stanford.edu
jbuck@forney.berkeley.edu (Joe Buck) (06/14/91)
In article <396@validgh.com>, dgh@validgh.com (David G. Hough on validgh) writes: |> [ what architectural features speed this up? ] !> |> do 30 i = 2,n |> if(abs(dx(i)).le.dmax) go to 30 |> isamax = i |> dmax = abs(dx(i)) |> 30 continue DSP chips are good for things like this. The following routine takes 10+3N cycles (60 nsec/cycle for the original C30): ldi dx,ar0 ; ar0 points to the data ldi @n,rc ; vector length subi 1,rc ; n-1 to get n loops ldf -1.0,r1 ; set max abs value to -1 rptb loop ; start zero overhead loop ;........................... absf *ar0++,r0 ; r0 = absval of dx[i] cmpf r0,r1 ; larger than max? loop: ldigt rc,r2 ; if so, mark its position ;........................... ; rc is decremented once each time -- it's n-1 if the first term is ; the max, n-2 if the second, etc. So n-rc would be the isamax ; output of a Fortran routine. ldi @n,r0 subi rc,r0 ; now r0 has isamax and r1 has dmax. (extra instructions needed ; to do a C call interface). Several elements contribute to speed: the zero-overhead loop, the conditional load (ldigt), the absolute value instruction, and (sorry, purists) the autoincrement addressing mode. The RS/6000 already has, in many cases, zero-overhead loops. I have found, though, that conditional loads are a big win on heavily pipelined machines where a branch would cause a large pipeline penalty. -- Joe Buck jbuck@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
bs@frieda.mitre.org (Robert D. Silverman) (06/14/91)
In article <1991Jun13.234834.22970@neon.Stanford.EDU> andy@DEC-Lite.Stanford.EDU (Andy Freeman) writes: :In article <396@validgh.com> dgh@validgh.com (David G. Hough on validgh) writes: :> do 30 i = 2,n :> if(abs(dx(i)).le.dmax) go to 30 :> isamax = i :> dmax = abs(dx(i)) :> 30 continue : :One instruction that can eliminate the conditional branch is ^^^^^^^^^^^^ :conditional assignment. (It is unclear whether it should be a 3 in 1 : do 30 i = 2, n : absdx = abs(dx(i)) : ibig = absdx .le. dmax ^^^^^^^^^^^^^^^^^^^^^^^^ I would be very interested in seeing the assembler code that gets emitted for this line of Fortran. How can this statement get executed WITHOUT a branch?? I also have never seen a machine that has conditional assignment implemented as an instruction! (As indicated above). What machine is this? Could you show us the assembler syntax?? I ** believe ** that the above conditional assignment will get compiled into a test and branch. I don't know of any machine that provides instructions to do otherwise. It doesn't appear to the Fortan programmer to be a branch because the branch is 'hidden'. Of course, I could be wrong. -- Bob Silverman #include <std.disclaimer> Mitre Corporation, Bedford, MA 01730 "You can lead a horse's ass to knowledge, but you can't make him think"
chrisg@cbmvax.commodore.com (Chris Green) (06/14/91)
In article <1991Jun13.234834.22970@neon.Stanford.EDU> andy@DEC-Lite.Stanford.EDU (Andy Freeman) writes: >In article <396@validgh.com> dgh@validgh.com (David G. Hough on validgh) writes: >> do 30 i = 2,n >> if(abs(dx(i)).le.dmax) go to 30 >> isamax = i >> dmax = abs(dx(i)) >> 30 continue > Motorola's DSP 96000 family has some instructions that would help this loop out a lot. They have conditional assign, and two moves per instruction. I think the code for this loop could be done with the following instructions (I don't know the exact mnemonics, but here's the general idea): int abs1=dx(0) i=&dx[2] lp: set abs1=abs(abs1). fetch *(i++) to abs2 compare abs1 to dmax conditionally store i to isamax and abs1 to dmax set abs2=abs(abs2). fetch *(i++) to abs1 compare abs2 to dmax conditionally store i to isamax and abs2 to dmax 6 instructions for 2 iterations ain't bad. you'd have to do one SUB at the end to correct isamax to be i instead of &dx[i]. zero overhead looping handles the iteration. -- *-------------------------------------------*---------------------------* |Chris Green - Graphics Software Engineer - chrisg@commodore.COM f | Commodore-Amiga - uunet!cbmvax!chrisg n |My opinions are my own, and do not - killyouridolssonicdeath o |necessarily represent those of my employer.- itstheendoftheworld r *-------------------------------------------*---------------------------d
rmb@omews57.intel.com (Bob Bentley) (06/15/91)
In article <1991Jun14.134338.4673@linus.mitre.org> bs@frieda.mitre.org (Robert D. Silverman) writes: > >I also have never seen a machine that has conditional assignment >implemented as an instruction! ....... What machine is this? >Could you show us the assembler syntax?? > The Intel i960(tm) architecture has a limited type of conditional assignment in the form of the TEST instructions, which are used to test the condition code and set a boolean value in a register. They are intended primarily for implementing C statements of the form: a = b > c without having to do conditional branches. The assembler code might look like: cmpo r5, r6 testg r4 which causes r4 to be set to 0 or 1 depending on the result of the comparison. See any 80960 Programmers Reference Manual or the Myers & Budde book for more details. Bob Bentley Intel Corp., M/S JF1-58 E-mail: rmb@ichips.intel.com 5200 N.E. Elam Young Parkway Phone: (503) 696-4728 Hillsboro, Oregon 97124 Fax: (503) 696-4515
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (06/16/91)
In article <1991Jun14.134338.4673@linus.mitre.org>, bs@frieda.mitre.org (Robert D. Silverman) writes: > : absdx = abs(dx(i)) > : ibig = absdx .le. dmax > ^^^^^^^^^^^^^^^^^^^^^^^ > I would be very interested in seeing the assembler code that gets > emitted for this line of Fortran. How can this statement get executed > WITHOUT a branch?? Quite a lot of machines can do this, including VAXes, MC680x0, MC88k, Prime 50 series, Burroughs machines, NS32?32. Since you asked for assembler, here's NS32532 assembly code. # assume that DX is a static address and that I is in R0 absf DX[R0:D], F0 # F0 := abs(dx(i)) # assume that dmax is in F1 cmpf F1, F0 # N := F1 > F0, Z := F1 = F0 # assume that ibig is to be in R1 sgtd R1 # R1 := 1 if abs(dx(i)) > dmax, 0 otherwise Indeed, even an 8086 can do it. Once you've done the comparison, lahf # AH := [Sign,Zero,*,Aux,*,Parity,*,Carry] will do the trick, if your Fortran compiler takes "negative" as the representation of .true., "non-negative" as the representation of .false. > I also have never seen a machine that has conditional assignment > implemented as an instruction! Perhaps someone who knows the ARM instruction set can comment on this. There was a trick that could be used on KA-10s. Registers could be addressed as memory. If the NS32532 supported that hack for FP registers, then something like movf F1, "F0"[R1:D] would be usable in this case. I have a notion that something similar could be done on the AMD 29000. As a single instruction, you may not find it, but it isn't hard to do a conditional assignment using straight-line code. Suppose R0 holds a 0 or 1 (which we may destroy), and we want to do R2 := R3 IF R0 Here's the code, using NS32532 mnemonics. negd r0, r0 # r0 := if <change wanted> then ~0 else 0 andd r0, r3 # if not <change wanted> then r3 := 0 comd r0, r0 # r0 := if <change wanted> then 0 else ~0 andd r0, r2 # if <change wanted> then r2 := 0 ord r3, r2 # r2 := if <change wanted> then 0 | r3 # else r2 | 0 Of course, as Herman Rubin will point out if I don't, this involves doing bitwise operations on the floating-point registers, which is why I quietly did it to R2, R3 instead of F1, F0. (NB: the instructions above are just a simple MUX done wordwise.) -- Q: What should I know about quicksort? A: That it is *slow*. Q: When should I use it? A: When you have only 256 words of main storage.
bbeckwit@next.com (Bob Beckwith) (06/19/91)
In article <1991Jun14.134338.4673@linus.mitre.org> bs@frieda.mitre.org (Robert D. Silverman) writes: > > [example code deleted] > >How can this statement get executed WITHOUT a branch?? > >I also have never seen a machine that has conditional assignment >implemented as an instruction! (As indicated above). What machine is >this? Could you show us the assembler syntax?? > >I ** believe ** that the above conditional assignment will get compiled >into a test and branch. I don't know of any machine that provides >instructions to do otherwise. Someone who knows better can correct me if I'm wrong (and they might also provide the assembler syntax, any ex-Multifloyds wanna speak up?), but I believe that the Multiflow Trace machines had a conditional assignment instruction. --Bob -- Bob_Beckwith@NeXT.COM Digital Hardware Engineering NeXT Computer, Inc. 900 Chesapeake Drive Redwood City, CA USA
mac@gold.kpc.com (Mike McNamara) (06/22/91)
Ah, I see I posted too soon. I now understand that the discussion is about linpack's IDAMAX routine. I think the current champion hardware for this loop is the ardent titan vector unit. It has the DRAMAX instruction. This is translated Double precision Reduction Absolute MAXimum value instruction. Run over a vector, it returns in a scalar the maximum absolute value of that vector. ( The machine supports the single precision, minimum absolute value, and non absolute variants, just so no one would think the instruction was inserted just for Dongarra. :-) Of course, the failing is that IDAMAX doesn't care particularily about the actual maximum value; it only wants to know the *index* of that value. But then a DVCEQ instruction is fully vectorized also... -mac -- +-----------+-----------------------------------------------------------------+ |mac@kpc.com| Increasing Software complexity lets us sell Mainframes as | | | personal computers. Carry on, X windows/Postscript/emacs/CASE!! | +-----------+-----------------------------------------------------------------+
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (06/22/91)
>>>>> On 21 Jun 91 17:24:05 GMT, mac@gold.kpc.com (Mike McNamara) said:
Mike> [....] the discussion is about linpack's IDAMAX routine. I
Mike> think the current champion hardware for this loop is the ardent
Mike> titan vector unit. It has the DRAMAX instruction. This is
Mike> translated Double precision Reduction Absolute MAXimum value
Mike> instruction. Run over a vector, it returns in a scalar the
Mike> maximum absolute value of that vector. ( The machine supports
Mike> the single precision, minimum absolute value, and non absolute
Mike> variants, just so no one would think the instruction was
Mike> inserted just for Dongarra. :-)
The Cyber 205/ETA-10 contain an instruction which returns *both* the
extreme value and its index. It can optionally ignore the sign bit.
Timing is something like 1 cycle per element plus 16 cycles for every
new maximum found. On one of the Livermore Loops we saw a very large
speedup going from double-precision to single precision. It turns out
that the vector to be tested was a *very* slowly increasing function,
so that the 64-bit version found lots more new maxima than the 32-bit
version....
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET