[comp.arch] isamax and instruction set design

preston@asta.rice.edu (Preston Briggs) (06/14/91)

bs@frieda.mitre.org (Robert D. Silverman) writes:

>:	    ibig  = absdx .le. dmax
>	    ^^^^^^^^^^^^^^^^^^^^^^^^
>
>I would be very interested in seeing the assembler code that gets
>emitted for this line of Fortran. How can this statement get executed
>WITHOUT a branch??

The RS/6000 could do this, basically assigning ibig to one of the
condition code registers.  Actually, many machines could, though they don't
always have enough CC registers to be useful.

If this test can be issued enough in advance, then later tests
of 'ibig' would be free.

Don't know of any conditional assignments, though we might
use a skip instruction to accomplish the purpose.

Preston Briggs

bs@faron.mitre.org (Robert D. Silverman) (06/15/91)

In article <1991Jun14.163141.17728@rice.edu> preston@asta.rice.edu (Preston Briggs) writes:
:bs@frieda.mitre.org (Robert D. Silverman) writes:
:
:>:	    ibig  = absdx .le. dmax
:>	    ^^^^^^^^^^^^^^^^^^^^^^^^
:>
:>I would be very interested in seeing the assembler code that gets
:>emitted for this line of Fortran. How can this statement get executed
:>WITHOUT a branch??
:
:The RS/6000 could do this, basically assigning ibig to one of the
:condition code registers.  Actually, many machines could, though they don't
:always have enough CC registers to be useful.
 
However, there is STILL a problem with the above expression.
 
absdx .le. dmax  is a BOOLEAN expression. It evaluates to either TRUE
or FALSE.

Hence, the above code simply says to store the value TRUE or FALSE
into ibig according to whether absdx is less than dmax

IT DOES NOT MEAN

ibig = MINIMUM(absdx, dmax)

For people to do this 'conditional assignement' it would have to be:

ibig = if (absdx .le. dmax) then absdx
       else dmax


--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

preston@asta.rice.edu (Preston Briggs) (06/15/91)

bs@faron.mitre.org (Robert D. Silverman) writes:

>:>:	    ibig  = absdx .le. dmax
>:>	    ^^^^^^^^^^^^^^^^^^^^^^^^
>:>
>:>I would be very interested in seeing the assembler code that gets
>:>emitted for this line of Fortran. How can this statement get executed
>:>WITHOUT a branch??

And I wrote

>:The RS/6000 could do this, basically assigning ibig to one of the
>:condition code registers.

And Silverman replies

>However, there is STILL a problem with the above expression.
> 
>absdx .le. dmax  is a BOOLEAN expression. It evaluates to either TRUE
>or FALSE.
>
>Hence, the above code simply says to store the value TRUE or FALSE
>into ibig according to whether absdx is less than dmax
>
>IT DOES NOT MEAN
>
>ibig = MINIMUM(absdx, dmax)


Right.  I think it's time for some of the original context:

andy@DEC-Lite.Stanford.EDU (Andy Freeman) wrote:
>
>	isamax = 1
>	dmax = abs(dx(1))
>	do 30 i = 2, n
>	    absdx = abs(dx(i))
>	    ibig  = absdx .le. dmax
>	    asm("<if ibig is true, replace isamax's value with i>")
>	    asm("<if ibig is true, replace dmax's value with absdx>")
>    30	continue

Surely it's clear that ibig is intended to be a boolean,
and hence may be (even "should be") allocated to a condition code register.
I expect Silverman misread the original code, think that 'ibig'
was to hold the min of absdx and dmax.

The question he probably wanted to ask is: How do you implement

	asm("<if ibig is true, replace isamax's value with i>")

??  Unfortunately, I answered the question he asked rather
than what he may have meant.  Trying again, ...

I suggest a skip instruction and a copy.

The idea of a skip is that the instruction fetcher gets ahead of
the executer.  By the time the condition is resolved, the fetcher
has already fetched well past the potentially skipped instruction;
therefore, execution can continue, skipping or not, with minimal
bubbling of the pipeline.  Of course, this _is_ a form of conditional
branch; but it's fast, executing in only 1 cycle, with an extra cycle
for the copy (if not skipping).

Tera's machine has (or will have) skip instructions.
Probably others do too.

Preston Briggs

slackey@bbn.com (Stan Lackey) (06/19/91)

In article <1991Jun14.163141.17728@rice.edu> preston@asta.rice.edu (Preston Briggs) writes:
>bs@frieda.mitre.org (Robert D. Silverman) writes:
>
>>:	    ibig  = absdx .le. dmax
>>	    ^^^^^^^^^^^^^^^^^^^^^^^^
>>
>>I would be very interested in seeing the assembler code that gets
>>emitted for this line of Fortran. How can this statement get executed
>>WITHOUT a branch??
>
>Don't know of any conditional assignments, though we might
>use a skip instruction to accomplish the purpose.

I once worked on a machine that had instructions like x=min(y,z) in hardware.
Also there were vector instructions that would elementwise compare vectors
and build a vector of logicals which could be applied to a subsequent vector
op.  So a loop of the form

  where (a(i) .lt. b(i)) c(i) = d(i) + e(i)

could be done with two vector instructions and no branches (other than
loop control, of course).

-Stan

tonys@pyra.co.uk (Tony Shaughnessy) (06/19/91)

In article <64739@bbn.BBN.COM> slackey@BBN.COM (Stan Lackey) writes:
>
>I once worked on a machine that had instructions like x=min(y,z) in hardware.
>Also there were vector instructions that would elementwise compare vectors
>and build a vector of logicals which could be applied to a subsequent vector
>op.  So a loop of the form
>
>  where (a(i) .lt. b(i)) c(i) = d(i) + e(i)
>
>could be done with two vector instructions and no branches (other than
>loop control, of course).
>
>-Stan

How many cycles would it take to run each of these instructions? If it would
take many cycles, then wouldn't this dominate any expense incurred in taking
a branch, or are there other reasons for avoiding a branch?

--
Tony Shaughnessy
tonys@pyra.co.uk

	"Pedal away those tag-nut blues"

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (06/19/91)

>>>>> On 19 Jun 91 09:18:53 GMT, tonys@pyra.co.uk (Tony Shaughnessy) said:

Tony> In article <64739@bbn.BBN.COM> slackey@BBN.COM (Stan Lackey) writes:
>
>I once worked on a machine that had instructions like x=min(y,z) in
>hardware.  Also there were vector instructions that would elementwise
>compare vectors and build a vector of logicals which could be applied
>to a subsequent vector op.  So a loop of the form
>
>  where (a(i) .lt. b(i)) c(i) = d(i) + e(i)
>
>could be done with two vector instructions and no branches (other than
>loop control, of course).
>
>-Stan

Tony> How many cycles would it take to run each of these instructions?
Tony> If it would take many cycles, then wouldn't this dominate any
Tony> expense incurred in taking a branch, or are there other reasons
Tony> for avoiding a branch?

On the Cyber 205/ETA-10, the second instruction requires one cycle per
element, whether the result is stored or not.  I think that the first
instruction also requires one cycle per element.  Then add two vector
startups to the total time required.  So the cycle count is something
like (on the Cyber 205):

	WHERE block: 140 + 2*N
	original op:  70 +   N	(No IF tests -- compute at all locations)

For long vectors, if a(i)<b(i) half of the time, then the WHERE block
above will run at 1/4 the MFLOPS rate of the basic loop.  If you get
to re-use the control vector generated by the comparison, then you can
amortize the overhead cost of the comparison over all of its uses.
(The compiler was never smart enough to do this for you, so you had to
create a bit vector manually and use it explicitly).

Since scalar code tended to be a bit slow on the Cyber 205/ETA-10
machines, the point at which the scalar and vector codes ran at the
same speed might be (for long vectors) where a(i) .lt. b(i) is true at
least 15-20% of the time.
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

slackey@bbn.com (Stan Lackey) (06/20/91)

>>>>>> On 19 Jun 91 09:18:53 GMT, tonys@pyra.co.uk (Tony Shaughnessy) said:
>
>>I once worked on a machine that had instructions like x=min(y,z) in
>>hardware.  Also there were vector instructions that would elementwise
>>compare vectors and build a vector of logicals which could be applied
>>to a subsequent vector op.  So a loop of the form
>>
>>  where (a(i) .lt. b(i)) c(i) = d(i) + e(i)
>>
>>could be done with two vector instructions and no branches (other than
>>loop control, of course).
>
>Tony> How many cycles would it take to run each of these instructions?
>Tony> If it would take many cycles, then wouldn't this dominate any
>Tony> expense incurred in taking a branch, or are there other reasons
>Tony> for avoiding a branch?

The min/max took one or two depending upon which version of the CPU
you had.  The interesting thing about the scalar machine was it was
heavily pipelined, and used branch prediction.  Unfortunately,
branches which depend upon floating point values are often wrong 50%
of the time, causing the pipe to be "drained" and restarted.  In this
case the min/max took the same amount of time as the compare by
itself, so the branch followed by the assignment would be added on.

The vector case took one cycle per element plus overhead of a few
cycles per instruction, so the total time was 2*(#elements +
overhead).

This machine did not have conditional scalar instructions as others
mentioned in this thread, but that is a good solution to this class of
problem: instructons are just fetched and dropped down a pipe;
"masked" instructions are just no-op'ed; the scalar pipeline is not
disturbed.

-Stan