[comp.arch] Why FP at all?

stephen@estragon.uchicago.edu (Stephen P Spackman) (09/05/90)

Please excuse the quasi-flamage....

Am I really dense, or have I completely missed the point? Why are we
burning silicon on floating point arithmetic when we could have fast
128-bit INTEGER arithmetic? Why do all the great arguments for RISC
suddenly evaporate when FP looms?

Seems to me that 90% of the floating point code I've seen had a
dynamic range that was low *and potentially known to the compiler*; it
could have been directly compiled into scaled integers. Of what's
left, much was pretty weird stuff with unpredictable behaviour and the
exponent in fixed format FP was not big ENOUGH and maybe it should
have been broken out into a separately specifiable (and independently
computed) exponent ([int, long]float, you see).

Ok, so we can do the full analysis; maybe there're a couple of
normalisation-shifts that deserve instructions, but I am *so tired* of
having the silicon on my PC and my workstation wasted on floating
point when all I want is Emacs and Unix and X to be tolerably fast
(which on a Sparc they aren't). Sniff.

Sorry about that. But it really does seem to me that floating point is
the most INCREDIBLY arcane and domain-specific hack, it has nothing
approaching the utility of arbitrary-precision integer arithmetic,
bitblt, finite-field arithmetic, graph-rewrite, unification or just
more registers.

Of course, I'm not into weather-prediction. Then again, I only ever
met one programmer who was.

Somehow it seems to me that we are being led by the nose by marketing
types and the ghosts of languages past.

stephen p spackman  stephen@estragon.uchicago.edu  312.702.3982

amull@Morgan.COM (Andrew P. Mullhaupt) (09/05/90)

In article <STEPHEN.90Sep5000536@estragon.uchicago.edu>, stephen@estragon.uchicago.edu (Stephen P Spackman) writes:
> Please excuse the quasi-flamage....
> 
> Am I really dense, or have I completely missed the point? Why are we
> burning silicon on floating point arithmetic when we could have fast
> 128-bit INTEGER arithmetic? Why do all the great arguments for RISC
> suddenly evaporate when FP looms?

I can only answer your second question.

> Seems to me that 90% of the floating point code I've seen had a
> dynamic range that was low *and potentially known to the compiler*; it

It would seem you haven't seen much code for scientific computation,
or many subroutine libraries. If I compile, say, an inner product
subroutine, am I then supposed to go around remembering what scaling
is appropriate and what length vectors it can handle?

> could have been directly compiled into scaled integers. Of what's
> left, much was pretty weird stuff with unpredictable behaviour and the
> exponent in fixed format FP was not big ENOUGH and maybe it should
> have been broken out into a separately specifiable (and independently
> computed) exponent ([int, long]float, you see).

Ok, _you_ might consider Linpack pretty weird stuff, but I think this
would put you in a decided minority.

> Ok, so we can do the full analysis; maybe there're a couple of
> normalisation-shifts that deserve instructions, but I am *so tired* of
> having the silicon on my PC and my workstation wasted on floating
> point when all I want is Emacs and Unix and X to be tolerably fast
> (which on a Sparc they aren't). Sniff.

I keep my emacs and X on a Sun 3 where they belong. The RISC machines
have better things to do than message passing. 

> Of course, I'm not into weather-prediction. Then again, I only ever
> met one programmer who was.

NCAR is actually a large place. Weather prediction is a billion dollar
problem. Damage due to severe storms, floods, droughts is usually
in the billions _every_ year. Our best attempts in this regard still
fall short, but you can't argue that they're not worth it. It may
be possible to predict El Nino (that's what my old Ph.D. advisor is
into these days...) and there is a lot of California exposed to this
one. Naturally, a billion dollar problem can justify quite a few
programmers.

> Somehow it seems to me that we are being led by the nose by marketing
> types and the ghosts of languages past.

There are a lot of ghosts put into the market, mostly by marketing
types. When we buy a machine, we find out if it can run our
proprietary software, and if so, how fast. Floating performance is
a big piece of the picture, along with our need for massive I/O
throughput and our evaluation of the operating system. Sure, we need
good integer performance, but we're not even remotely likely to
consider replacing floating point with fixed arithmetic. It's a
bad trade off for us on any of the present machines.

Later,
Andrew Mullhaupt


Disclaimer: opinions expressed here are my own. 

aglew@crhc.uiuc.edu (Andy Glew) (09/07/90)

>Is it so absurd to suggest, in sum, that exposing separate mantissa
>and exponent to the optimiser might result in *speedup* due to
>constant propagation and expression-rearrangement, while at the same
>time increasing expressivity by allowing an INDEPENDENT choice of
>mantissa and exponent sizes?

Someone at MIT(?) (MS thesis?) had a paper on "Micro-optimization of
floating point" in a conference within the last few years.

Note that some of the same ideas can be applied to hardware: for
example, you can combine the normalization post-shift with the
alignment pre-shift for forwarded FP operands, saving a cycle.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

stephen@estragon.uchicago.edu (Stephen P Spackman) (09/07/90)

Let me try to rephrase my question in a less controversial form. And
please understand that I *do* appreciate the importance of weather
forecasting, and the fact that there is a genuine *need* for
supercomputing for that and all the other applications that involve
the simulation of large dynamical systems. But it does seem to me that
most cycles really do go on operating systems and user interface, and
putting integrated floating point into a silly little workstation like
a Sparc or an 80486 machine is serious overkill (and I'm almost
serious: these machines aren't fast enough for editing anymore) and a
poor use of gates. Especially since, I conjecture, emulation could
theoretically be faster.

So here's the restatement:

SUPPOSE that hardware floating point were NOT a given. Suppose we were
completely free to choose how to support the user. But users have
these needs involving non-integral numbers. What should we do?

Well, any fast conventional architecture will have good support for
non-negative integers, at least - they're needed for addressing. And
it'll have good support for function call - that's needed to execute
code.

So the question is, what ELSE do we have to put in to get good
coverage?

All I'm thinking is that an FPU *may not* be the best answer. It's
1990 now; we can rely on the compiler. Language-driven architecture is
dead (though language-tuned architecture is another story). Semantic
gap is, if not a myth, then a strength - compilers need elbowroom in
which to optimise.

Is it so absurd to suggest that outside of the supercomputer market,
scaled integers might honestly be a better solution for the problems
that need solving (assuming that there is proper compiler support for
them, and it isn't all hand-coded at every step)?

Is it so absurd to suggest that there might be PARTS of floating point
instructions that, in the hands of a good optimiser, might be used to
generate better code than their wholes (remembering the VAX CALLS
linkage... :-)?

Is it so absurd to suggest, in sum, that exposing separate mantissa
and exponent to the optimiser might result in *speedup* due to
constant propagation and expression-rearrangement, while at the same
time increasing expressivity by allowing an INDEPENDENT choice of
mantissa and exponent sizes?

Is it so absurd to suggest that the effort and the silicon that go
into an FPU might be better spent on supporting some other datatype
that is of more _general_ applicability?

Maybe this has all been tried and I just haven't heard about it. If
*that's* the case, please point me at some references. But I have the
distinct impression that we're coasting along on research that was
done thirty years ago or more, and that may need to be updated in the
light of changing technology.

stephen p spackman  stephen@estragon.uchicago.edu  312.702.3982

billms@dip.eecs.umich.edu (Bill Mangione-Smith) (09/07/90)

In article <AGLEW.90Sep6233218@dwarfs.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:
>
>Someone at MIT(?) (MS thesis?) had a paper on "Micro-optimization of
>floating point" in a conference within the last few years.

William Dally, ASPLOS III.

>Andy Glew

bill


--
-------------------------------
	Bill Mangione-Smith
	billms@eecs.umich.edu

khb@chiba.Eng.Sun.COM (Keith Bierman - SPD Advanced Languages) (09/08/90)

In article <STEPHEN.90Sep6215928@estragon.uchicago.edu> stephen@estragon.uchicago.edu (Stephen P Spackman) writes:

...   All I'm thinking is that an FPU *may not* be the best answer. It's
   1990 now; we can rely on the compiler. Language-driven architecture is
   dead (though language-tuned architecture is another story). Semantic
   gap is, if not a myth, then a strength - compilers need elbowroom in
   which to optimise.

   Is it so absurd to suggest that outside of the supercomputer market,
...

It isn't absurd to think about it. The literature contains such
thoughts over the years, going back at least 15 years. Probably more.
The consensus view of those who cast the little buggers, though, has so
far been that this isn't a good idea.

Often folks employ the heuristic that any instruction which gets used
frequently, say 3+% of the time, has certainly earned its keep. FP
instructions satisfy that. There are all sorts of other data points
also. 

Go, formalize your proposal, gather statistics from "real" programs
(spec, perfect club, US steel, etc.) using both the conventional and
your special compiler (and possibly other candidate special compilers)
on a variety of machines and publish the results and your conclusion.
--
----------------------------------------------------------------
Keith H. Bierman    kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

sritacco@hpdmd48.boi.hp.com (Steve Ritacco) (09/08/90)

> putting integrated floating point into a silly little workstation like
> a Sparc or an 80486 machine is serious overkill (and I'm almost
> serious: these machines aren't fast enough for editing anymore) and a
> poor use of gates. Especially since, I conjecture, emulation could
> theoretically be faster.

This I agree with.  Especially when we are talking about the possibility
of single processors executing multiple instructions per cycle.  

> Is it so absurd to suggest that there might be PARTS of floating point
> instructions that, in the hands of a good optimiser, might be used to
> generate better code than their wholes (remembering the VAX CALLS
> linkage... :-)?
>
> Is it so absurd to suggest, in sum, that exposing separate mantissa
> and exponent to the optimiser might result in *speedup* due to
> constant propagation and expression-rearrangement, while at the same
> time increasing expressivity by allowing an INDEPENDENT choice of
> mantissa and exponent sizes?

Very true; who needs IEEE format anyway?  Give me a processor capable of
doing a few arithmetic instructions in a single cycle, with a single-cycle
multiply, and I think you've got it.  Let's use all the FPU silicon to do
more needed operations, and good floating point could fall out anyway.

> Is it so absurd to suggest that the effort and the silicon that go
> into an FPU might be better spent on supporting some other datatype
> that is of more _general_ applicability?

Might be worth a try.

vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/09/90)

In article <14900015@hpdmd48.boi.hp.com> sritacco@hpdmd48.boi.hp.com (Steve Ritacco) writes:
\\\
>Very true; who needs IEEE format anyway?  Give me a processor capable of
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>doing a few arithmetic instructions in a single cycle, with a single-cycle
>multiply, and I think you've got it.  Let's use all the FPU silicon
\\\

Good G*d! The IEEE std _was_ intended to produce some kind of uniformity
in results across different kinds of h/w. (Rather like the idea
of getting results accurate to the _bit_ in FP calcs.)

Now, _some_ of us wouldn't go _near_ an FP calc in a month of weekends,
but some of us like to waste cycles rather than $M simulating
the ``real'' world (where physics tells us we really don't _need_
FP since it's all discrete; FP is just a handy hack) and sundry
other pursuits.

However, I'm sure the DSP guys _love_ this kind of idea. The trouble comes
when they try to port their whizz-bang code to some other processor
(e.g. when their current chip/fabricator is superseded).

-Kym Horsell

usenet@nlm.nih.gov (usenet news poster) (09/09/90)

In article <14900015@hpdmd48.boi.hp.com> 
sritacco@hpdmd48.boi.hp.com (Steve Ritacco) writes (and quotes):
>> putting integrated floating point into a silly little workstation like
>> a Sparc or an 80486 machine is serious overkill ...
>
>> Is it so absurd to suggest, in sum, that exposing separate mantissa
>> and exponent to the optimiser might result in *speedup* due to
>> constant propagation and expression-rearrangement

The chained multiply-and-add FP hardware in processors like the IBM 6000
effectively does this.  The marginal gain from putting off resolution of
the exponent for more than every other operation is going to be small.

>> while at the same
>> time increasing expressivity by allowing an INDEPENDENT choice of
>> mantissa and exponent sizes?
>
>Very true; who needs IEEE format anyway?

The market is in simulation and modeling.  Everything from stockbrokers
running econometric models to chemists looking at molecules.  IEEE
format has proven to be a reasonable balance which allows you to write
general purpose tools that function over a wide range of input values.
Between 32 bit integer and 32/64 bit FP and an occasional algorithmic
tweak, the vast majority of data can be reasonably well represented.
Custom fixed point formats have a place in DSP where performance is
critical and you have the advantage of knowing exactly where the input
data is coming from and what values will be acceptable.

>Give me a processor capable of
>doing a few arithmetic instructions in a single cycle, with a single
>cycle multiply, and I think you've got it.  

A lot of the "superscalar" marketing hype is really just FP coprocessors.
Take away the load/store operations in a superscalar RISC (those used to 
be part of the CISC instruction anyway) and the FPU, and what have you
got left?  ~one op/cycle.

>Lets use all the FPU silicon
>to do more needed operations and good floating point could fall out anyway.

Matching similar levels of integration for the IPU and FPU, I have yet
to see software emulation of FP that comes anywhere close to the speed
of a hardware FPU.

David States

gillies@m.cs.uiuc.edu (09/10/90)

1. Does the tail wag the dog?

2. Who needs floating point when *everyone* knows that operating systems
code is the most important / most prevalent type of software.

Does anyone see the similarity in the two statements above?  How about
these [circa 1970]:

1.  Does the tail wag the dog?

2.  Who needs BitBlt support when *everyone* knows that user
interfaces are dominated by 24*80 displays and job control languages.

kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (09/11/90)

In article <KHB.90Sep7103618@chiba.Eng.Sun.COM> khb@chiba.Eng.Sun.COM (Keith Bierman - SPD Advanced Languages) writes:
>Go, formalize your proposal, gather statistics from "real" programs
>(spec, perfect club, US steel, etc.) using both the conventional and


"real" programs?  Programs cease to be "real" as soon as they become benchmarks.

Sometimes it is not the code that is "real"; it is the data-set.

Very few of these codes are real as far as supercomputing goes.  Most of them
fit within the whole of 16 MBytes and take the whole of several minutes to
execute.

Shahin.

jgk@osc.COM (Joe Keane) (09/11/90)

In article <KHB.90Sep7103618@chiba.Eng.Sun.COM> khb@chiba.Eng.Sun.COM (Keith
Bierman - SPD Advanced Languages) writes:
>Often folks employ the heuristic that any instruction which gets used
>frequently, say 3+% of the time has certainly earned its keep. FP
>instructions satisfy that. There are all sorts of other data points
>also. 
>
>Go, formalize your proposal, gather statistics from "real" programs
>(spec, perfect club, US steel, etc.) using both the conventional and
>your special compiler (and possibly other candidate special compilers)
>on a variety of machines and publish the results and your conclusion.

This doesn't work.  If you get statistics from C programs and make a machine
based on that, you'll get a C machine.  Given the way C is, this machine will
be good at subroutine calls, floating-point arithmetic, and pointer
dereferencing.  Conversely, it will probably be not so good at co-routines,
multi-precision arithmetic, and associative lookup.  If you optimize the
machine based on the language, and then adjust the language based on what is
efficient on the machine, you're stuck in a loop.  It's time to get out.

jgk@osc.COM (Joe Keane) (09/11/90)

In article <3961@bingvaxu.cc.binghamton.edu>
vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>Good G*d! The IEEE std _was_ intended to produce some kind of uniformity
>in results across different kinds of h/w. (Rather like the idea
>of getting results accurate to the _bit_ in FP calcs.)

You could argue that this is bad not good.  Suppose you ran a computation on
machine X and it gives the answer -37.69, then you moved to machine Y and it
gives the same answer.  This might give you some unfounded confidence that the
answer is actually -37.69.  Actually the right answer could be 2.00 but you
don't know that.  It used to be that if you wanted to check an answer you
would run it on a different machine, but now this doesn't do much for you.

In fact you could argue that if you're going to do rounding, the best way to
do it is randomly.  A couple of free-running oscillators on your FP chip
wouldn't take up too much space.  If you did this the error in the expected
value could actually be much lower than 1 LSB.  If you ran the same program 20
times, you'd get 20 different answers.  However, if these agreed well you'd
have good reason to believe that the answer is right.

I'm not arguing that multiple runs is a substitute for good numerical
analysis, but it can point out that something is drastically wrong.  The
floating point on machine X may be much better than that on machine Z, but if
you get a terrible answer from machine Z, I wouldn't trust the answer from
machine X so much either.

I'm sort of playing devil's advocate here.  Actually I think the IEEE standard
is very good as far as FP goes.  However, if you're dealing with inherently
inaccurate computations, a little diversity may be a good thing.

geoff@hls0.hls.oz (Geoff Bull) (09/12/90)

In article <14900015@hpdmd48.boi.hp.com> sritacco@hpdmd48.boi.hp.com (Steve Ritacco) writes:

>Very true; who needs IEEE format anyway?  Give me a processor capable of

IEEE format is one of the better things that has happened to the industry.
You seem to have forgotten the bad old days when numerical programming was
a black art, and programs would give different answers on different machines.
-- 
Geoff Bull (Senior Engineer)	Phone  : (+61 48) 68 3490
Highland Logic Pty. Ltd.	Fax    : (+61 48) 68 3474
348-354 Argyle St		ACSnet : geoff@hls0.hls.oz.au
Moss Vale, 2577, AUSTRALIA

gillies@m.cs.uiuc.edu (09/12/90)

/* Written  9:59 pm  Sep  6, 1990 by stephen@estragon.uchicago.edu in m.cs.uiuc.edu:comp.arch */
> But it does seem to me that most cycles really do go on operating
> systems and user interface, and putting integrated floating point into
> a silly little workstation like a Sparc or an 80486 machine is serious
> overkill.

My favorite device-independent screen language is Display PostScript
(is there another one that is device-independent?).  This is user
interface code.  Display PostScript really beats up on a floating
point chip.

Once we have all machines running floating point, I think a whole new
class of applications will emerge to take advantage of this feature.
Remember, the software development biz is now dominated by the common
denominator (typically PC's).  Most floating-point intensive
applications are unthinkable on today's integer workstation.