[comp.arch] Performance increase - a suggestion

physh@unicom.UUCP (Jon 'Quality in - Quantity out' Foreman) (01/14/88)

	I really don't think the real world needs anything more
expansive than a 32 bit processor to get most jobs done.  Given that
(even if only for argument's sake), here are my ideas for a possible way
to increase performance at relatively little cost.

	It seems to me that one way to get a further increase in
performance without increasing clock speeds, and thus memory and other
chip costs, would be to leave the processor at 32 bits and increase the
external data path width to, say, 128 bits.  Then it may be possible to
get more than one instruction per fetch.  Since most computers fetch
(roughly) one instruction per operation, this would seem to translate
into a three-times-or-better increase in performance.  (I'm probably
wrong here, but in which direction I have no idea.)

	Data, as seen by the processor, would still look like 32 bits,
but with a single (and almost free) external data-word cache you could
also increase the performance of sequential data read operations (say a
sequential table lookup), because one external fetch would net you four
32 bit words which you can use later.  This would probably not work as
well when writing to memory, because of DMA, etc.
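
	In C terms, the single-line buffer might be modeled roughly like
this (MEM[], the 16-byte line size, and the cycle counter are illustrative
only, a sketch of the idea rather than a description of any real part):

#include <stdint.h>
#include <stdio.h>

/* Sketch only: MEM[] stands in for main memory behind a 128-bit bus. */
static uint32_t MEM[1024];

static uint32_t line_tag = 0xFFFFFFFFu;  /* address of the buffered line   */
static uint32_t line_buf[4];             /* four 32-bit words = 128 bits   */
static int bus_cycles = 0;               /* wide fetches actually issued   */

static uint32_t read32(uint32_t addr)    /* addr is a byte address         */
{
    uint32_t tag = addr & ~0xFu;         /* 16-byte (128-bit) line address */
    if (tag != line_tag) {               /* miss: one wide fetch fills it  */
        for (int i = 0; i < 4; i++)
            line_buf[i] = MEM[tag / 4 + i];
        line_tag = tag;
        bus_cycles++;
    }
    return line_buf[(addr >> 2) & 3];    /* hit: no external cycle needed  */
}

int main(void)
{
    for (uint32_t a = 0; a < 64; a += 4) /* sequential table scan          */
        (void)read32(a);
    printf("16 word reads, %d wide bus cycles\n", bus_cycles);  /* 4 */
    return 0;
}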

	One additional advantage of such an approach would be that you
could increase the clock rate on the cpu and still have lots of memory
bandwidth.

	Implementation of DMA devices may be tricky, and it may be a help
to have the DMA devices view the memory array as 8 or 16 bits wide. Also,
I/O instructions (like on the 8086 family) would have to be worked out
somehow.

	There is one other apparent bonus: you could fetch an IEEE
temporary-real (80-bit extended) number in one fetch, but compared to
compute time this may not really be all that wonderful.

	There is at least one obvious downside, however.  Say one chooses
a 128 bit data path: one would have to supply at least that many pins
on the chip to handle it.  With control signals and all, you could
end up needing 200 pins on the chip.  One could also argue that laying
such a thing out on a circuit board would be a real drag, but I tend not
to get too excited about this, mostly because of the reality of multilayer
circuit boards.

--------------

	Ok, gang.  I don't have the resources or know-how to implement such
a thing, so in the interests of learning, please feel free to pick me
to pieces.  I really hope this is a good idea, because I am already feeling
that my 16 MHz '286 box is too slow, and I really can't afford a machine
that uses 20 ns RAM. :-)

	Has anyone experimented with this, or implemented it?  I
would at least like to know.

	Jon Foreman
-- 
ucbvax!pixar!\            | For small letters  | ~~~~~~~\~~~   That's spelled
hoptoad!well!unicom!physh | ( < 10K ) only:    |  Jon  }()      "physh" and 
       ptsfa!/            | physh@ssbn.wlk.com |        /     pronounced "fish".

johnl@ima.ISC.COM (John R. Levine) (01/15/88)

In article <235@unicom.UUCP> physh@unicom.UUCP writes:
>	It seems to me that one way to get a further increase in
>performance without increasing clock speeds and thus memory and other
>chip costs, would be to leave the processor at 32 bits and increase the
>external data path width to say 128 bits.  Then it may be possible to
>get at least one instruction per fetch.  ...

This is a great idea, but one that has been well known for a long time. The
360/85, just about the first commercial machine with a cache, did exactly that
for the path from core (and I mean core) to the cache. It still seems to be a
standard technique in machines that have separate buses for memory and I/O and
so avoid issues like what a 128 bit path to a serial terminal means.

Speaking of I/O buses, one of the ways that IBM has increased channel speeds
over the years is to move from the original one byte wide System/360 I/O bus
to wider ones.
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.something
Gary Hart for President -- Let's win one for the zipper.

hutchson@convexc.UUCP (01/16/88)

A similar scheme was used on the GE-600 and its successors up to the
Honeywell DPS-70.  All memory references were double-word (72 bits);
nearly all instructions were single-word, so an instruction fetch got two
instructions.  It "worked" very well; a 600-family machine got about the
same performance as an I** 370-family machine with double the memory access
speed.  (That's raw performance, sorta like mips'es; throughput on real
multiprocessing loads was driven by other architectural details and by
OS behavior:  the GE would appear to be up to an order of magnitude faster.)

oconnor@sunray.steinmetz (Dennis M. O'Connor) (01/16/88)

An article by physh@unicom.UUCP says:
 [ proposes increasing computer performance by making busses
   128 bits wide, while leaving data-paths 32-bits wide ]

Comments :
1. We're not talking a microcomputer here, not with current
   packaging & interconnect technology. So for now this
   proposal is only applicable to super-minis, mainframes
   and super- or hyper-computers.

   ( A hyper-computer is a super-computer being promoted
     by an over-excited marketing division :-)

   Many machines do what you have proposed, on the "far" (non-CPU)
   side of their caches. Even micros can do this, since cache
   data width isn't as limited by pins-per-package as CPU
   bus width is in a microcomputer.

2. When considering the total system cost of such a machine,
   the CPU datapaths are only a small part. Admittedly, making
   them 64 bits wide may slow them down ( adds 1 gate delay
   to your carry chain, plus higher parasitics and greater
   distance between components ) but for many applications
   ( like banking, where you need to keep track of more than
   16 billion tenths-of-a-cent ( $16 million ), and scientific
   computing, where double-precision is becoming the LEAST
   precise thing people are willing to use ) you probably win.

3. When you mentioned fetching instructions, you seemed to assume
   that instruction size had to equal data word size. This isn't true.
   There is no direct relation between word size and instruction
   size. There is some relation between word size and address space
   ( if you have n-bit data paths, you ought to have n-bit addresses,
     and vice-versa; but that's just a sensible opinion ).

To sum it all up : Yes, the idea is good, it is used already,
and yields higher memory bandwidth at reduced cost. But it's not
practical for single-chip implementation yet : when we get
beyond 244-pin high-speed packages ( e.g. wafer-scale, or
whatever other high-density interconnect technology you like )
then it will probably be used in the micro arena.

This feature migration from mainframes to micros is not
at all uncommon. Or unexpected.


--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (01/16/88)

In article <235@unicom.UUCP> physh@unicom.UUCP writes:
...
| 	It seems to me that one way to get a further increase in
| performance without increasing clock speeds and thus memory and other
| chip costs, would be to leave the processor at 32 bits and increase the
| external data path width to say 128 bits.  Then it may be possible to
| get at least one instruction per fetch.  Since most computers fetch
| (mostly) one instruction per operation this would seem to translate
| into a 3 times plus increase in performance. (I'm probably wrong here,
| but in which direction I have no idea.)

There are other ways to get a bandwidth increase. One is to go to a
Harvard architecture, with the data and code using separate busses. The
68030 has gone part way on this, with an internal Harvard architecture
on chip, separate memory caches for each internal bus, and a multiplexed
external bus.

I believe that Intel will separate the code, data, and i/o busses in the
80486, but that's based on rumor.

Sequent reduces bus usage by having a local cache and delayed write
through. When a value is modified in the cache, it is *not* written to
memory. Only when the value is flushed from cache, or when another
processor reads the value, is the modified value placed on the bus. When
another processor reads the value, the processor cache which contains
the most recently modified version of the data places it on the bus, and
the other processor *and the memory* are updated.

Obviously there is more to this, to handle the case where (1) processor
A reads data from memory, (2) processor B reads data from memory, (3)
processor A updates the cached value, and (4) processor B updates the
cached value. Now when processor C reads the value, both A and B will
try to supply it. I assume that this works, since they have 30
processors running on the bus, but I don't know how.

The disadvantage of wider external busses is cost. With a wider bus you
have more support logic, and that adds to the cost. Since many data
accesses are made on bytes, the extra bus bandwidth doesn't buy you as
much as it could, although cache improves this. Since most CPUs already
have a pipeline, the wider bandwidth reduces the number of bus cycles,
but doesn't improve the rate at which the CPU can execute.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

kyriazis@pawl22.pawl.rpi.edu (George Kyriazis) (01/16/88)

In article <8843@steinmetz.steinmetz.UUCP> sunray!oconnor@steinmetz.UUCP writes:
>An article by physh@unicom.UUCP says:
> [ proposes increasing computer performance by making busses
>   128 bits wide, while leaving data-paths 32-bits wide ]
>
  I don't know what he says exactly, but he mentions expanding data paths to
128 bits.  I thought of the same thing just a few days ago, but for
fetching instructions only.

>Comments :
>1. We're not talking a microcomputer here, not with current
>   packaging & interconnect technology. So for now this
>   proposal is only applicable to super-minis, mainframes
>   and super- or hyper-computers.
>
It might as well be in a micro-computer.  OK, maybe something the size of 
a SUN.  Memory is usually what produces the bottleneck; the CPU can get
data very fast.  You can get 128 bits of data out of the memory at once, but
give it to the CPU in 32-bit chunks.  What I state here is maybe another
(slightly different) proposal.  As I stated before, I'd rather do it for
instructions.  OK, Harvard architecture.  You read 128 bits at a time, and
(assuming a 32-bit CPU) you feed the CPU with 32 bits at 4 times the
speed.  You might as well feed all this data into shift registers, and then 
shift data out to the CPU.  Assuming 64K*1 chips, you will need 128 chips,
therefore 64K*128 = 8 megabits (1 Mbyte).  There are also 64K*4 chips, so
the minimum number of chips can go down by a factor of 4.  A quite feasible
approach for a medium-sized computer (close to a micro).
  The only problem when doing that is jump instructions.  Assume that memory
operates at its fastest possible speed.  If you meet a jump instruction
in the middle of the 128-bit word, you'll have to (more or less) execute
all the rest up till the end of the fetch.  Some RISC CPUs have done this,
but for only one instruction.  Can a compiler successfully put 3 useful
instructions after the jump??  Maybe it sounds too cheap: "After any jump
the CPU executes at most three instructions after it".  (Actually it turns
out that the jump has to be in the first of the 4 instructions in the 128-bit
word, so that the memory can get the right address.) 

>3. When you mentioned fetching instructions, you seemed to assume
>   that instruction size had to equal data word size. This isn't true.
>   There is no direct relation between word size and instruction
>   size. There is some relation between word size and address space
>   ( if you have n-bit data paths, you ought to have n-bit addresses,
>    and vice-versa. This is just a sensible opinion ? )
>
  I tend to agree on that.  It's better to have only as much memory as you
can reference, not more.  Not like the poor 8086 that tries to access 1 Mbyte
with 16-bit index registers.


*******************************************************
*George C. Kyriazis                                   *    Gravity is a myth
*userfe0e@mts.rpi.edu or userfe0e@rpitsmts.bitnet     *        \    /
*Electrical and Computer Systems Engineering Dept.    *         \  /
*Rensselear Polytechnic Institute, Troy, NY 12180     *          ||
*******************************************************      Earth sucks.

mason@tmsoft.UUCP (Dave Mason) (01/17/88)

In article <8844@steinmetz.steinmetz.UUCP> davidsen@crdos1.UUCP (bill davidsen) writes:
>Sequent reduces bus usage by having a local cache and delayed write
>through. When a value is modified in the cache, it is *not* written to
>memory. Only when the value is flushed from cache, or when another
>processor reads the value, is the modified value placed on the bus. When
>another processor reads the value, the processor cache which contains
>the most recently modified version of the data places it on the bus, and
>the other processor *and the memory* are updated.

There's actually a bit more to it than this.  A processor only changes
the cache if it has EXCLUSIVE ownership of the word (in fact every
write to memory is treated this way, I believe).  When another
processor asks for a word, the owner gives it (rather than memory),
but also holds on to it, marking it as shared ownership (the requestor
also marks it as shared ownership).  Then before modifying it, a
processor negotiates over the bus for exclusive ownership.  Many
processors may simultaneously hold a word for shared access.
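
A rough C model of the ownership states described above (the names and
transitions here are a generic guess at this kind of scheme, not Sequent's
actual protocol; the single shared "memory" word is only for illustration):

#include <stdio.h>

enum state { INVALID, SHARED, EXCLUSIVE };

struct line { enum state st; int data; };

#define NCPU 4
static struct line cache[NCPU];          /* one cached word per CPU      */
static int memory = 0;

static int cpu_read(int me)
{
    if (cache[me].st == INVALID) {
        /* a cache holding the word EXCLUSIVE supplies it, updates memory,
           and drops to SHARED; otherwise memory supplies it */
        for (int i = 0; i < NCPU; i++)
            if (i != me && cache[i].st == EXCLUSIVE) {
                memory = cache[i].data;
                cache[i].st = SHARED;
            }
        cache[me].data = memory;
        cache[me].st = SHARED;
    }
    return cache[me].data;
}

static void cpu_write(int me, int v)
{
    if (cache[me].st != EXCLUSIVE) {
        /* negotiate on the bus for exclusive ownership: all other
           copies are invalidated before the write proceeds */
        for (int i = 0; i < NCPU; i++)
            if (i != me) cache[i].st = INVALID;
        cache[me].st = EXCLUSIVE;
    }
    cache[me].data = v;                  /* the write stays in the cache */
}

int main(void)
{
    cpu_write(0, 42);                    /* CPU 0 now owns the word      */
    printf("%d\n", cpu_read(1));         /* CPU 1 reads 42; both SHARED  */
    return 0;
}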

  I think this is a particularly elegant solution to the cache
coherency problem.  Does anyone know if Sequent invented it (seems
rather too obvious for that), or a proper reference to who did? 
	../Dave

tim@amdcad.AMD.COM (Tim Olson) (01/18/88)

In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
| You read 128 bits at a time, and
| (assuming a 32-bit CPU) you feed the CPU with 32 bits at 4 times the
| speed.  You might as weel feed all this data in shift registers, and then 
| shift data out to the CPU.  Assuming 64K*1 chips, you will need 128 chips
| therefore 64K*128 = 8 megs.  There are also 64K*4 chips, so minimum memory
| can go down by a factor of 4.  A quite feasible approach for a medium
| sized computer (close to a micro).

Or you can simply interleave the memories 4 ways, and reduce the path
between memory and the instruction bus to 32 bits.
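
For what it's worth, the bank selection for 4-way interleaving is trivial;
a tiny illustrative C fragment (byte addresses and 32-bit words assumed):

#include <stdint.h>
#include <stdio.h>

#define NBANKS 4
/* Consecutive 32-bit words fall in consecutive banks, so sequential
   instruction fetches can overlap their access times. */

int main(void)
{
    for (uint32_t addr = 0; addr < 32; addr += 4)       /* byte addresses */
        printf("addr %2u -> bank %u, row %u\n",
               addr, (addr >> 2) & (NBANKS - 1), addr >> 4);
    return 0;
}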

|   The only problem when doing that is jump instructions.  Assume that memory
| operates at its fastest possible speed.  If you meet a jump instruction
| in the middle of the 128-bit word, you'll have to (more or less) execute
| all the rest up till the end of the fetch.  Some RISC CPU's have done this
| but for only one instruction.  Can a compiler succesfully put 3 useful
| instructions after the jump??  Maybe it sounds too cheap: "After any jump
| the CPU executes at most three instructions after it". (Actually it turns
| out that the jump has to be in the first of the 4 instructions in the 128-bit
| word, so as the memory can get the right address.) 

It is very hard to effectively schedule more than 1 instruction after a
delayed-branch, and even more prohibitive to force branches to occur
only at 4-instruction boundaries.  A solution to this problem is to
throw away the instructions coming from memory, sourcing them instead
from a cache while the instruction stream is restarted.  This is what
the Branch-Target Cache is for on the Am29000.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

mash@mips.UUCP (John Mashey) (01/18/88)

In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
.....
>but for only one instruction.  Can a compiler succesfully put 3 useful
>instructions after the jump?? ....
No:
	1st delay slot: fill 70+% of the time, or more
	2nd: a lot less (studied at Stanford; I don't recall the results,
		because they were low enough we never considered it)
	3rd: forget it.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

aglew@ccvaxa.UUCP (01/18/88)

>/* Written  2:43 pm  Jan 17, 1988 by tim@amdcad.AMD.COM in ccvaxa:comp.arch */
>In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
>|   The only problem when doing that is jump instructions.  Assume that memory
>| operates at its fastest possible speed.  If you meet a jump instruction
>| in the middle of the 128-bit word, you'll have to (more or less) execute
>| all the rest up till the end of the fetch.  Some RISC CPU's have done this
>| but for only one instruction.  Can a compiler succesfully put 3 useful
>| instructions after the jump??  Maybe it sounds too cheap: "After any jump
>| the CPU executes at most three instructions after it". (Actually it turns
>| out that the jump has to be in the first of the 4 instructions in the 128-bit
>| word, so as the memory can get the right address.) 
>
>It is very hard to effectively schedule more than 1 instruction after a
>delayed-branch, and even more prohibitive to force branches to occur
>only at 4-instruction boundaries.  A solution to this problem is to
>throw away the instructions coming from memory, sourcing them instead
>from a cache while the instruction stream is restarted.  This is what
>the Branch-Target Cache is for on the Am29000.
>
>	-- Tim Olson

Another approach is to execute all of the instructions come what may.
This is the VLIW, trace-scheduling, approach commercialized by Multiflow.



Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have nameserver
    aglew@gswd-vms.gould.com    - if you don't
    aglew@gswd-vms.arpa 	- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

mbutts@mntgfx.mentor.com (Mike Butts) (01/19/88)

In article <844@ima.ISC.COM>, johnl@ima.ISC.COM (John R. Levine) writes:
> In article <235@unicom.UUCP> physh@unicom.UUCP writes:
> >	It seems to me that one way to get a further increase in
> >performance without increasing clock speeds and thus memory and other
> >chip costs, would be to leave the processor at 32 bits and increase the
> >external data path width to say 128 bits.  Then it may be possible to
> >get at least one instruction per fetch.  ...
> 
> This is a great idea, but one that has been well known for a long time. The
> 360/85, just about the first commercial machine with a cache, did exactly that
> for the path from core (and I mean core) to the cache. It still seems to be a
> standard technique in machines that have separate buses for memory and I/O and
> so avoid issues like what a 128 bit path to a serial terminal means.
> 

This is indeed a fine idea, both for data and for instructions, but the big
problem with wider datapaths nowadays is the pin count costs on the CPU chip.

When fast computers were built at the board level, very wide busses were 
relatively easy to achieve.  Now there's a great speed advantage from having
the whole CPU on one chip, but VLSI packages have severe pin count limitations.
More than 100-150 pins on a chip gets very expensive very fast, and more than
256 becomes untestable most places.  This is at least one reason why MIPS and
others must go to the trouble to double-clock their busses.  Also, driving
lots of outputs from high to low at once will put a big bounce on the chip's 
internal ground system, risking dropped bits or even destructive latchup.

I think the tension between wide busses and packaging is one of the toughest
problems faced by CPU engineers these days.



-- 
Mike Butts, Research Engineer                         503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts OR  mbutts@pdx.MENTOR.COM
These are my opinions, & not necessarily those of Mentor Graphics.

petolino%joe@Sun.COM (Joe Petolino) (01/19/88)

> [ description of ownership protocol for cache coherence ]
>
>  I think this is a particularly elegant solution to the cache
>coherency problem.  Does anyone know if Sequent invented it (seems
>rather too obvious for that), or a proper reference to who did? 

Amdahl was designing systems like this almost ten years ago (all of the
580-series processors use ownership schemes).  Note that cache consistency
can be a problem even in single-processor systems, if Virtual aliases are
allowed in a virtually-addressed cache.

The earliest references I could find by digging through my old Alan Jay Smith
papers are:

   C.K. Tang, "Cache System Design in the Tightly Coupled Multiprocessor
   System", Proc. NCC, 1976.

   Lucien Censier and Paul Feautrier, "A New Solution to Coherence Problems
   in Multicache Systems", IEEETC, C-27, 12, Dec 1978.

I have no idea whether either of these papers actually describes an ownership
protocol.

-Joe

mason@tmsoft.UUCP (Dave Mason) (01/19/88)

In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
"In article <8843@steinmetz.steinmetz.UUCP> sunray!oconnor@steinmetz.UUCP writes:
">An article by physh@unicom.UUCP says:
"> [ proposes increasing computer performance by making busses
">   128 bits wide, while leaving data-paths 32-bits wide ]
">
"  I don't know what he says exactly, but he mentions expanding data paths to
"128 bits.  I don't know, but I thought of that, just a few days ago, but for
"fetching instructions only.
" ...
"  The only problem when doing that is jump instructions.  Assume that memory
"operates at its fastest possible speed.  If you meet a jump instruction
"in the middle of the 128-bit word, you'll have to (more or less) execute
"all the rest up till the end of the fetch.
" ... (Actually it turns
"out that the jump has to be in the first of the 4 instructions in the 128-bit
"word, so as the memory can get the right address.) 

Actually (like everything else) it's already been done.  In this case
by the Control Data 6000 series at least.  They had a mix of 15 and 30
bit instructions in a 60 bit word.  (Probably applies to all other
Cray designs as well, although I haven't programmed others.)

Any label (that could therefore be branched to) forced alignment on a
new word.  Also a 30 bit instruction when you only had 15 bits left in
the current word would go to the next word.  There wasn't a delayed
jump however, so there was no attempt to fill the rest of the word with
useful instructions, it was simply filled with NOPs.  Also a jump could
appear anywhere in a word.  (The FORTRAN compiler was pretty advanced
for its day, and I believe did some code migration to fill words)
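
A toy C sketch of that packing rule (the 15-bit NOP encoding and the parcel
values are invented; only the fill-and-align behavior follows the
description above):

#include <stdint.h>
#include <stdio.h>

#define WORD_BITS 60
#define NOP15     0x7FFFu              /* invented 15-bit no-op encoding  */

static uint64_t word;                  /* 60-bit word being assembled     */
static int used;                       /* bits already used in it         */
static int nwords;

static void flush(void)                /* pad with NOPs and emit the word */
{
    while (used > 0 && used < WORD_BITS) {
        word = (word << 15) | NOP15;
        used += 15;
    }
    if (used) {
        printf("word %d: %015llx\n", ++nwords, (unsigned long long)word);
        word = 0;
        used = 0;
    }
}

static void emit(uint32_t parcel, int bits)   /* bits is 15 or 30         */
{
    if (used + bits > WORD_BITS)       /* a 30-bit parcel that won't fit  */
        flush();                       /* goes to the next word           */
    word = (word << bits) | parcel;
    used += bits;
    if (used == WORD_BITS)
        flush();
}

static void label(void) { flush(); }   /* branch targets start a new word */

int main(void)
{
    emit(0x1111, 15); emit(0x2222, 15); emit(0x3333, 15);
    label();                           /* last 15 bits become a NOP       */
    emit(0x12345678u & 0x3FFFFFFFu, 30);
    flush();
    return 0;
}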

root@mfci.UUCP (SuperUser) (01/21/88)

In article <28200088@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>>/* Written  2:43 pm  Jan 17, 1988 by tim@amdcad.AMD.COM in ccvaxa:comp.arch */
>>In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
>>|   The only problem when doing that is jump instructions.  Assume that memory
>>| operates at its fastest possible speed.  If you meet a jump instruction
>>| in the middle of the 128-bit word, you'll have to (more or less) execute
>>| all the rest up till the end of the fetch.  Some RISC CPU's have done this
>>| but for only one instruction.  Can a compiler succesfully put 3 useful
>>| instructions after the jump??  Maybe it sounds too cheap: "After any jump
>>| the CPU executes at most three instructions after it". (Actually it turns
>>| out that the jump has to be in the first of the 4 instructions in the 128-bit
>>| word, so as the memory can get the right address.) 
>>
>>It is very hard to effectively schedule more than 1 instruction after a
>>delayed-branch, and even more prohibitive to force branches to occur
>>only at 4-instruction boundaries.  A solution to this problem is to
>>throw away the instructions coming from memory, sourcing them instead
>>from a cache while the instruction stream is restarted.  This is what
>>the Branch-Target Cache is for on the Am29000.
>>
>>	-- Tim Olson
>
>Another approach is to execute all of the instructions come what may.
>This is the VLIW, trace-scheduling, approach commercialized by Multiflow.
>
>
>
>Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
>    aglew@gould.com     	- preferred, if you have nameserver
>    aglew@gswd-vms.gould.com    - if you don't
>    aglew@gswd-vms.arpa 	- if you use DoD hosttable
>    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
>   
>My opinions are my own, and are not the opinions of my employer, or any
>other organisation. I indicate my company only so that the reader may
>account for any possible bias I may have towards our products.


Right, and moreover, Multiflow's Trace can have as many as 4
conditional branches plus a fall-through in a given instruction
(prioritized on a per-instr basis using a simple but clever
arbitration scheme.)  While we have a delayed-branch (2 clock beats
per instruction, and one set-of-branches per instruction, so at least
one beat of real work always gets executed) the compiler's forte is
finding useful things to do and scheduling them wherever they fit
best.  We also use four 32-bit 65 ns buses at full tilt to fill the
icache (the same buses are used for memory transfers, or CPU
cluster-to-cluster transfers), another recent topic of interest here.

Also, while I'm at it, we provide a way for programmers to indicate
likelihood of branching one way or another if they happen to know
(people who are porting their code from a Cray usually do).  The
keyword isn't "Frequency" but the intent is the same.  On the other
hand, we have found that our compiler does a fine job of predicting
branch probabilities without user input and also without having to 
profile the code.  We initially thought the profiling would be very
important, but (at least for branch prediction) it hasn't turned out
that way.

    -Bob Colwell
     Multiflow Computer
     175 N. Main St.
     Branford, CT 06405  203-488-6090

<Standard disclaimer:  I speak only for me>

aglew@ccvaxa.UUCP (01/22/88)

>	I really don't think the real world really needs anything more
>expansive than a 32 bit processor to get most jobs done.

Probably, but... I wonder how much interest might be out there for
a true double-precision floating point engine - one that did 64 bit
floating point, or IEEE 80 bit extended floating point, or even 
128 bit floating point, as its native floating point mode, as fast
as single precision on nearly any other machine in its price range?

I'm sure that most people wouldn't need this, but some might - and I'd 
like to get a feel for the size of such a niche, if it exists. Post
if you want to discuss, or mail me (I'll summarize to the net if I get
a large enough sample).


Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have nameserver
    aglew@gswd-vms.gould.com    - if you don't
    aglew@gswd-vms.arpa 	- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

rorden@kolob.UUCP (Randy Rorden) (01/22/88)

in article <39245@sun.uucp>, petolino%joe@Sun.COM (Joe Petolino) says:
> 
> [in response to request for references on cache ownership protocols]
>
> The earliest references I could find by digging through my old Alan Jay Smith
> papers are:
> 
>    C.K. Tang, "Cache System Design in the Tightly Coupled Multiprocessor
>    System", Proc. NCC, 1976.
> 
>    Lucien Censier and Paul Feautrier, "A New Solution to Coherence Problems
>    in Multicache Systems", IEEETC, C-27, 12, Dec 1978.
> 
> I have no idea whether either of these papers actually describes an ownership
> protocol.
> 
> -Joe

I ran across the following article which describes how to use a Shared bit
and Dirty bit for cache consistency management in a multiprocessor system:

	Charles P. Thacker and Lawrence C. Stewart, "Firefly: a Multiprocessor
	Workstation", Proc. ASPLOS II, 1987.


On another aspect of the Sequent bus interface - I once read an article that
explained that it uses a request/response type protocol to avoid tying up the
bus for the entire length of a read or write access.  It also mentioned that
several processors could have requests pending at the same time but that
somehow they avoided sending requests to devices that already had "full"
request queues.  This is the part I couldn't figure out - does each
processor "count" how many requests and replies have been transacted with
each possible device?  Or do they just monitor how many total requests are
pending and assume the worst case that they all went to the same device?
Either of these methods requires some understanding of how deep a given
device's request queue is.
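
For concreteness, the first scheme suggested above might look roughly like
this in C (QUEUE_DEPTH and the per-device counters are assumptions for the
sketch, not a description of the actual Sequent bus):

#include <stdbool.h>
#include <stdio.h>

#define NDEV        8
#define QUEUE_DEPTH 4              /* assumed per-device queue depth     */

static int outstanding[NDEV];      /* requests sent but not yet answered */

static bool try_request(int dev)
{
    if (outstanding[dev] >= QUEUE_DEPTH)
        return false;              /* would overflow; hold off and retry */
    outstanding[dev]++;            /* issue the request on the bus       */
    return true;
}

static void got_response(int dev)
{
    outstanding[dev]--;            /* a response frees a queue slot      */
}

int main(void)
{
    int issued = 0;
    for (int i = 0; i < 6; i++)    /* burst of 6 requests to device 3    */
        if (try_request(3)) issued++;
    got_response(3);               /* one reply comes back               */
    if (try_request(3)) issued++;
    printf("%d of 7 issued\n", issued);      /* prints "5 of 7 issued"   */
    return 0;
}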


Randy Rorden		    {ihnp4,byuadam,uunet}!iconsys!rorden
Icon International, Inc.    ARPANET: icon%byuadam.bitnet@wiscvm.wisc.edu
Orem, Utah 84058	    BITNET: icon%byuadam.bitnet
(801) 225-6888		    <This space reserved for witty remark>

alvitar@madhat.UUCP (Phil Harbison) (01/29/88)

In article <39245@sun.uucp>, petolino%joe@Sun.COM (Joe Petolino) writes:
> > [ description of ownership protocol for cache coherence ]
> >Does anyone know if Sequent invented it (seems
> >rather too obvious for that), or a proper reference to who did? 
> 
> Amdahl was designing systems like this almost ten years ago ...

Synapse, a manufacturer of fault-tolerant systems based on the 68000,
used an ownership protocol on their proprietary bus.  I'm not sure if
Synapse is still in business, but I believe they started building their
68K systems around 1982.  The memory controllers maintained extra bits
to indicate whether any public or private copies of a memory block (4
32-bit words) were checked out.  Any time you performed a read and the
cache missed, you had to do a "public" read from the memory controller.
Public copies of a memory block were read-only.  Before performing a
write, you had to do a "private" read from the memory.  If no other
private copies were checked out, the memory controller would respond and
the bus watchers on all other caches would invalidate their public
copies.  If a private copy was already checked out, the memory
controller would respond with an abort-and-retry signal.  Then the bus
watcher with the private copy would update main memory and invalidate
its copy.
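
A rough C sketch of the memory controller's decision in the scheme just
described (the structure and names are mine, not Synapse's; the handling of
a public read against a privately held block is an assumption, since the
post only spells out the private-read case):

#include <stdio.h>

enum reply { GRANT_PUBLIC, GRANT_PRIVATE, ABORT_AND_RETRY };

struct block_dir {
    int public_copies;          /* read-only copies checked out to caches   */
    int privately_held;         /* nonzero if one cache owns it for writing */
};

static enum reply public_read(struct block_dir *b)
{
    if (b->privately_held)
        return ABORT_AND_RETRY; /* assumed: owner writes back, reader retries */
    b->public_copies++;
    return GRANT_PUBLIC;
}

static enum reply private_read(struct block_dir *b)
{
    if (b->privately_held)
        return ABORT_AND_RETRY; /* current owner updates memory first       */
    b->privately_held = 1;      /* bus watchers invalidate public copies    */
    b->public_copies = 0;
    return GRANT_PRIVATE;
}

int main(void)
{
    struct block_dir blk = { 0, 0 };
    printf("%d\n", public_read(&blk));    /* 0: granted read-only           */
    printf("%d\n", private_read(&blk));   /* 1: granted for writing         */
    printf("%d\n", private_read(&blk));   /* 2: abort-and-retry             */
    return 0;
}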

-- 
Live: Phil Harbison
USPS: 3409 Grassfort Drive, Huntsville, AL 35805-5421
UUCP: {clyde,uunet}!madhat!alvitar
PSTN: 205-881-4317

oconnor@sunray.steinmetz (Dennis M. O'Connor) (01/30/88)

An article by aglew@ccvaxa.UUCP says:
]
]>	I really don't think the real world really needs anything more
]>expansive than a 32 bit processor to get most jobs done.
]
]Probably, but... I wonder how much interest might be out there for
]a true double-precision floating point engine - one that did 64 bit
]floating point, or IEEE 80 bit extended floating point, or even 
]128 bit floating point, as its native floating point mode, as fast
]as single precision on nearly any other machine in its price range?
]
]I'm sure that most people wouldn't need this, but some might - and I'd 
]like to get a feel for the size of such a niche, if it exists.
]
]Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   

There is indeed a market for this. Several in fact.

  1. Simulation of Physical Systems ( Chemistry, Aerodynamics, Mechanics )
     These people need all the accuracy they can get. Truncation
     error just grows too fast on 32-bit floats, and the dynamic range
     is too small.

  2. Real-time control systems ( aircraft engines, aircraft control surfaces )
     These systems can't afford to flake out because of truncation
     errors accumulating or dynamic range being exceeded. They also
     are so tightly pressed for performance that they can't afford
     elaborate checks of their results. 

  3. Sonar and OTH Radar systems
     These are TOUGH signal processing problems, with very noisy
     data, needing long, complicated munging. And must be real time.

  4. Strategic Weapons Systems, including SDI
     Think about aiming at a 1 meter target that's 500,000 meters away
     and moving at 10,000 meters per second. How much error can
     you afford ? How much time have you got to do the job ?

Single precision is apparently becoming obsolete. It's just too
much of a pain to deal with. Double precision is the only way to go.
Optimize for double, keep single only if you need "backwards"
compatibility. "You'll thank me for it."

  
--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

roy@phri.UUCP (Roy Smith) (01/30/88)

In article <28200089@ccvaxa> aglew@ccvaxa.UUCP writes:
> I wonder how much interest might be out there for a true double-precision
> floating point engine - one that did 64 bit floating point, or IEEE 80
> bit extended floating point, or even 128 bit floating point, as its
> native floating point mode, as fast as single precision on nearly any
> other machine in its price range?

	I don't know about price range, but doesn't the FPS-164 do
exclusively 64-bit floating point?

	Actually, I wonder if a RISC-FPP would make sense.  Consider a
machine like a Vax.  Let's say we got rid of all the multiple precision
floating point formats and instructions (at last count, F, D, G, and H; did
I miss any?) and made all floating point math a single high-precision
format.  Clearly that would save silicon.  If we took that silicon real
estate and devoted it to making that single floating point format work
faster, could we build a machine which only has double precision, but does
it as fast as the old vax did single precision?

	Given that we could do double floating math as fast as single
floating math, the only advantage single would have left would be saving
memory on large arrays.  Maybe we'd have to keep {load,store}
{single,double} instructions around for that.
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (01/30/88)

>Another approach is to execute all of the instructions come what may.
>This is the VLIW, trace-scheduling, approach commercialized by Multiflow.
>
>Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
>    aglew@gould.com     	- preferred, if you have nameserver

VLIW has been part of Floating Point Systems' machines for more than 
10 years.

cik@l.cc.purdue.edu (Herman Rubin) (02/01/88)

Whatever accuracy is built into the floating point processor, more may
be needed.  If the appropriate hardware is included, almost double the
"usual" precision is relatively easily obtainable, but no more.  For more, it
is necessary to go to integer arithmetic--even when there is no integer
arithmetic, it is necessary to force the floating point hardware to
treat integers.

Thus if more accuracy than anticipated is needed, good integer arithmetic
is called for.  This means unsigned multiplication and division, among
other not-necessarily-available constructs.  I suggest that however many
bits of floating point mantissa are directly treatable by the hardware,
integer multiplication of numbers of that width, with a product of twice
the length, be provided in hardware, that it be unsigned, and that
additional reasonable hardware to facilitate the arithmetic be available.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

dave@micropen (David F. Carlson) (02/01/88)

In article <3127@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> In article <28200089@ccvaxa> aglew@ccvaxa.UUCP writes:
> > I wonder how much interest might be out there for a true double-precision
> > floating point engine - one that did 64 bit floating point, or IEEE 80
> > bit extended floating point, or even 128 bit floating point, as its
> > native floating point mode, as fast as single precision on nearly any
> > other machine in its price range?
> 
> 	Given that we could do double floating math as fast as single
> floating math, the only advantage single would have left would be saving
> memory on large arrays.  Maybe we'd have to keep {load,store}
> Roy Smith, {allegra,cmcl2,philabs}!phri!roy
> System Administrator, Public Health Research Institute
> 455 First Avenue, New York, NY 10016

This is a question/problem that has been gnawing at me for a long time:
it is the case (K&R p184) that "All floating arithmetic in C is carried
out in double-precision," regardless of whether it is declared as single
precision or not.  Given that a large majority of code being written (at
least at a typical UNIX site) is in C, why not dispose of single precision
floating point altogether and concentrate the silicon real estate on
something that my compiler might actually use?  Alternately, perhaps
hardware/arch. people can convince C compiler writers that (as I often
find) single precision is more than sufficient for all but the most
numerically nasty problems and should be used for float arithmetic unless
the code directs otherwise.  (It seems to go against UNIX philosophy to
have a compiler force double precision arithmetic when the source code
clearly indicates that single precision would suffice.)  I am willing to
bet that my FPA has *never* executed a single precision arithmetic
operation (other than double->single and single->double)!!
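
A tiny example of the rule being complained about (it needs a pre-C23
compiler for the old-style definition).  With the old-style definition the
caller promotes the float arguments to double, and under the original K&R
rules the multiply itself was also carried out in double; with an ANSI
prototype the arguments stay float and the compiler may keep the whole
expression in single precision:

#include <stdio.h>

float kr_scale(x, k)             /* old-style (K&R) definition            */
float x, k;                      /* the caller passes these as doubles    */
{
    return x * k;                /* done in double under K&R-era rules    */
}

float ansi_scale(float x, float k)   /* prototyped: may stay in single    */
{
    return x * k;
}

int main(void)
{
    float a = 1.0f / 3.0f;
    printf("%.10f\n", (double)kr_scale(a, 3.0f));
    printf("%.10f\n", (double)ansi_scale(a, 3.0f));
    return 0;
}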

-- 
David F. Carlson, Micropen, Inc.
...!{ames|harvard|rutgers|topaz|...}!rochester!ur-valhalla!micropen!dave

"The faster I go, the behinder I get." --Lewis Carroll

alanh@tekig4.TEK.COM (Al Hooton, the available sailing crew member) (02/02/88)

	Gee... the first machine I programmed in assembly was the CDC
6600, which has 60-bit words and 15-bit or 30-bit instructions...

	I always thought that the machines that DIDN'T take this
approach were the weird ones  :-)
________________________________ ______________________________________________
				|
 What's the only thing better	|	Al Hooton, Tektronix Inc.
	than a sailboat??	|
				|	...tektronix!tekig4!alanh
				|
	TWO sailboats!!		|	(503) 627-3904, 591-8460
________________________________|______________________________________________
Discail^H^H^H^H^H^H^HDics^H^H^H^HDisclaiemer^H^H^H^H^H^H^H^H^H^H^H Oh, bag it!

rbbb@acornrc.UUCP (David Chase) (02/02/88)

In article <671@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> 
> Whatever accuracy is built into the floating point processor, more may
> be needed.  ...
> ... For more, it
> is necessary to go to integer arithmetic--even when there is no integer
> arithmetic, it is necessary to force the floating point hardware to 
> treat integers.

This is not a direct contradiction, but rather than fuss with
reimplementing the wheel, try using someone else's.  See Boehm,
Cartwright, Riggle, and O'Donnell, "Exact Real Arithmetic: A Case Study
in Higher Order Programming" in the 1986 Lisp and Functional Programming 
Conference.  It's real (pun intended) and it is faster than glacial.  It
has been used to build interpreters, machine simulators, and an RPN
calculator.  

What you are talking about doing sounds tedious and error-prone.  (Yes,
I know that what I describe is much slower than hand-coded assembler
integer arithmetic, but how fast a programmer are you, anyway?  Can you
TRUST the results of your programs?)

It is interesting that *programmers* are debating the pros and cons of
the various architectures.  Why worry?  Unless you are using assembly
language, the interesting things should be the performance of the
architecture+compiler combination and the price tag on the whole
package.  Who cares how the arithmetic is done, as long as it is fast
and accurate?

David Chase, Olivetti Research Center, Menlo Park
(chase%orc.uucp@unix.sri.com or oliveb!orc!chase)

guy@gorodish.Sun.COM (Guy Harris) (02/02/88)

> Alternately, perhaps hardware/arch. people can convince C compiler writers
> that (as I often find) single precision is more than sufficient for all but
> the most numerically nasty problems and should be used for float arithmetic
> unless code directed otherwise.

They already have: see the latest ANSI C draft standard.  The reason why double
precision was specified in K&R was, in part (as was, I think, stated here
earlier) that it was easier to implement on the PDP-11 that way, as the
PDP-11's floating point instructions (FP11-style) didn't come in
single-precision and double-precision flavors; instead, there was a mode bit
that you had to set to single-precision or double-precision mode with special
instructions.  (It was certainly *possible* to support both single-precision
and double-precision arithmetic in compiled code - FORTRAN-IV Plus offers an
existence proof - but I suspect it was deemed unnecessary in the context of C.)
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (02/02/88)

In article <408@micropen> dave@micropen (David F. Carlson) writes:
>This is a question/problem that has been gnawing on me for a long time:
>It is the case (K&R p184) that "All floating arithmetic in C is carried
>out in double-precision;" regardless whether declared as single precision
>or not.  Given that a large majority of code being written (at least at
>a typical UNIX site) is in C, why not dispose of single precision floating
>point altogether and concentrate the silicon real-estate to something
>that my compiler might actaully use.  Alternately, perhaps hardware/arch.
>people can convince C compiler writers that (as I often find) single
>precision is more than sufficient for all but the most numerically nasty
>problems and should be used for float arithmetic unless code directed
>otherwise.  (It seems to go against UNIX philosophy to have a compiler
>force a double precision arithmetic when source code clearly indicates
>that single precision would suffice.)  I am willing to be that my FPA
>has *never* executed a single precision arithmetic operation (other than
>double->single and single->double)!!
This is the "boondoggle" of the C programming language, which I know and
love, that arose from its early PDP11+FP11 targets.  It is being "fixed",
read that PATCHED, in the ANSI spec by introducing the new type "long double"
and/or by the use of function prototypes.  Let's not reinvent the FP11
arcanity which is the source of all this evil!  Many C compilers have some
flag which causes float expressions to be handled in single precision, and
a whole lot of users are using this option to get fast code.

For those of you who are younger than the PDP11, the FP11 was the PDP11's
floating point hardware, which had two modes (single and double precision) and
could only be in one mode at a time.  The cheap fix was to do all computation
in double; the "double mode" did have a conversion instruction which allowed
loading a float and converting it to a double on the fly.  Doing all
computation in 64, 80, or 128 bits is no way to make a computer run fast.

root@mfci.UUCP (SuperUser) (02/02/88)

In article <3127@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <28200089@ccvaxa> aglew@ccvaxa.UUCP writes:
>> I wonder how much interest might be out there for a true double-precision
>> floating point engine - one that did 64 bit floating point, or IEEE 80
>> bit extended floating point, or even 128 bit floating point, as its
>> native floating point mode, as fast as single precision on nearly any
>> other machine in its price range?
>
>	I don't know about price range, but doesn't the FPS-164 do
>exclusively 64-bit floating point?
>
>	Actually, I wonder if a RISC-FPP would make sense.  Consider a
>machine like a Vax.  Let's say we got rid of all the multiple precision
>floating point formats and instructions (at last count, F, D, G, and H; did
>I miss any?) and made all floating point math a single high-precision
>format.  Clearly that would save silicon.  If we took that silicon real
>estate and devoted it to making that single floating point format work
>faster, could we build a machine which only has double precision, but does
>it as fast as the old vax did single precision?
>
>	Given that we could do double floating math as fast as single
>floating math, the only advantage single would have left would be saving
>memory on large arrays.  Maybe we'd have to keep {load,store}
>{single,double} instructions around for that.
>-- 
>Roy Smith, {allegra,cmcl2,philabs}!phri!roy
>System Administrator, Public Health Research Institute
>455 First Avenue, New York, NY 10016


I don't believe you can do double precision math as fast as single
precision math if both are implemented in the same technology.  If
we're including division and sqrts, derived via a high-radix iterative
procedure, it's certainly not true, since you get only one or two
bits of mantissa per trip through the ALU.  In that case the time
to solution is proportional to the number of bits of result you want.

If we're only discussing addition, subtraction, and multiplication,
then I still don't believe it.  There's an adder at the heart of each
of those, and its width decides its speed -- the wider, the slower
(more levels of carry-lookahead).  If your choice is between making
one engine to do both single and double precision (or dbl and quad),
or making only dbl (quad), then I think the engine that has less to
do can be made slightly faster.

My personal perception of the market for scientific computation is
that given a choice between more precision and more speed, speed
wins hands down.

Bob Colwell
Multiflow Computer  <Above is personal opinion only!!>
175 N. Main St.
Branford CT 06405    mfci!colwell@uunet.uucp  203-488-6090

root@mfci.UUCP (SuperUser) (02/03/88)

In article <3536@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu (Shahin Kahn) writes:
>>Another approach is to execute all of the instructions come what may.
>>This is the VLIW, trace-scheduling, approach commercialized by Multiflow.
>>
>>Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
>>    aglew@gould.com     	- preferred, if you have nameserver
>
>VLIW has been part of Floating Point Systems' machines for more than 
>10 years.

Under that definition of VLIW, every microcoded machine ever made
qualifies.  It's never been that big a trick to make a machine with
many parallel functional units.  The trick is to be able to keep them
busy without hand coding, and to provide the right interface so that
the compiler can express the parallelism that it finds.

Read Ellis' thesis;  he presents reasonable arguments as to why FPS'
architectures have hot spots that make it a very difficult target
for a compiler.


Bob Colwell
Multiflow Computer  <Above is personal opinion only!!>
175 N. Main St.
Branford CT 06405    mfci!colwell@uunet.uucp  203-488-6090

gwu@clyde.ATT.COM (George Wu) (02/04/88)

Robert Colwell posted:

> If we're only discussing addition, subtraction, and multiplication,
> then I still don't believe it.  There's an adder at the heart of each
> of those, and its width decides its speed -- the wider, the slower
> (more levels of carry-lookahead).

     You can significantly reduce the time it takes to do a multiply by
using a faster algorithm.  The bottleneck is carry propagation.  A tree
of carry-save adders eliminates it from the partial-product reduction,
giving you O(log2(n)) levels instead of O(n), so the difference between
32 and 64 bits becomes trivial.

     Sorry I don't have more detail on the CSA matrix, but this is all
from a dim memory. If anyone wishes, *email* me a request, and I'll
provide more information after digging out my notes.
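
A rough bit-level C model of the 3:2 carry-save step in question (a sketch
only): three operands are reduced to two with no carry propagation, so a
tree of these handles the partial products in O(log n) levels and only the
final two-operand add needs a full carry chain.

#include <stdint.h>
#include <stdio.h>

static void csa(uint64_t a, uint64_t b, uint64_t c,
                uint64_t *sum, uint64_t *carry)
{
    *sum   = a ^ b ^ c;                           /* per-bit sum, no carries */
    *carry = ((a & b) | (b & c) | (a & c)) << 1;  /* carry bits, shifted     */
}

int main(void)
{
    uint64_t s, k;
    csa(13, 25, 7, &s, &k);
    printf("%llu\n", (unsigned long long)(s + k));   /* prints 45 = 13+25+7 */
    return 0;
}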

						George J Wu
						rutgers!clyde!gwu

oconnor@sunset.steinmetz (Dennis M. O'Connor) (02/04/88)

An article by colwell@m6.UUCP (Robert Colwell) says:
] I don't believe you can do double precision math as fast as single
] precision math if both are implemented in the same technology.  If
] we're including division and sqrts, derived via a high-radix iterative
] procedure, it's certainly not true, since you get only one or two
] bits of mantissa per trip through the ALU.  In that case the time
] to solution is proportional to the number of bits of result you want.

NO. Sorry. But the fastest floating point division and root
algorithms use Newton-Raphson iteration, where the time to
solution is proportional to log2( number_of_bits_of_result ).
That is, if single-precision takes 4 iterations, double precision
will take 5, and quad will take 6.
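
A sketch of that iteration for the reciprocal 1/d in C (the crude seed and
the assumption that d has been scaled into [1,2) are purely illustrative;
real hardware seeds from a small table).  Each pass roughly doubles the
number of correct bits, so doubling the precision of the result costs about
one more pass:

#include <stdio.h>

static double recip(double d, int iterations)
{
    double x = 0.5;                 /* crude seed, good enough for d in [1,2) */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - d * x);      /* Newton-Raphson step for 1/d            */
    return x;
}

int main(void)
{
    for (int n = 1; n <= 6; n++)
        printf("%d iterations: 1/1.5 ~= %.17f\n", n, recip(1.5, n));
    return 0;
}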

] If we're only discussing addition, subtraction, and multiplication,
] then I still don't believe it.  There's an adder at the heart of each
] of those, and its width decides its speed -- the wider, the slower
] (more levels of carry-lookahead).  If your choice is between making
] one engine to do both single and double precision (or dbl and quad),
] or making only dbl (quad), then I think the engine that has less to
] do can be made slightly faster.

Doubling the width of a carry-select adder adds one gate delay to
its latency. Carry-lookahead schemes have a similar less-than-
proportionate penalty for speed up. Subtract is just an add.
Floating point adds also need de- and re-normalization.

Multipliers are more complex : all the fast stuff uses big
booth-encoded adder arrays. But the final carry resolution and
the justification and normalization logic add significantly to
the latency of the multiply.

] My personal perception of the market for scientific computation is
] that given a choice between more precision and more speed, speed
] wins hands down.

Anyone wanna buy a high-speed FPU for 16-bit ( 5 bit exponent,
10 bit fraction, 1 bit sign ) reals ? I didn't think so.
It doesn't matter how fast you are if your answer is wrong.
For many problems being tackled today, single-precision
introduces too much error. So it is not used, regardless
of how much faster it is.

Besides, double-precision ( 64 bit ) floating multiply and add
should only take about 20% longer than single-precision multiply
and add. Same for division and square roots.

] Bob Colwell
] Multiflow Computer  <Above is personal opinion only!!] 

Bob, you seem a little naive about floating-point hardware.
It's not all shift-add or shift-compare-subtract anymore.



--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

hes@ecsvax.UUCP (Henry Schaffer) (02/04/88)

In article <230@m2.mfci.UUCP>, root@mfci.UUCP (SuperUser) writes:
> ...
> My personal perception of the market for scientific computation is
> that given a choice between more precision and more speed, speed
> wins hands down.
> 
> Bob Colwell
> Multiflow Computer  <Above is personal opinion only!!>
> 175 N. Main St.
> Branford CT 06405    mfci!colwell@uunet.uucp  203-488-6090

  I agree, but suggest that this is a result of the triumph of marketing
over good sense.  More speed *always* is useful, and there is a limit to
how much precision one needs, so there clearly is a domain in which more
speed is more useful than more precision - but I don't believe that
the usual single precision (4 byte) floating point is in that domain.

  If I'm right, then one should design for speed rather than precision -
shouldn't one?

  (One could argue that the smaller word decreases storage needs as well
as increases speed - which is correct, but one can find many examples of
choice of algorithms in which the tradeoff between speed and accuracy 
doesn't significantly affect storage requirements, and yet people either
choose speed or don't take the effort to choose accuracy.)

--henry schaffer  n c state univ

cik@l.cc.purdue.edu (Herman Rubin) (02/04/88)

In article <9408@steinmetz.steinmetz.UUCP>, oconnor@sunset.steinmetz (Dennis M. O'Connor) writes:
> An article by colwell@m6.UUCP (Robert Colwell) says:
> ] I don't believe you can do double precision math as fast as single
> ] precision math if both are implemented in the same technology.  
>  .....
 
> NO. Sorry. But the fastest floating point division and root
> algorithms use Newton-Raphson iteration, where the time to
> solution is proportional to log2( number_of_bits_of_result ).
> That is, if single-precision takes 4 iterations, double precision
> will take 5, and quad will take 6.
 
You are essentially correct about the number of _iterations_.  However,
unless the accuracy of an iteration is built into the hardware, the cost
of an iteration grows more quickly than the length of the operands.
It is possible to design hardware such that the hardware "double
precision" may not be too much slower than the hardware "single
precision."  However, to extend precision beyond that built into the
hardware, the number must be broken into blocks, and multiplication
operations must be available whose results carry twice the number of
bits in a block.  In addition, floating point arithmetic adds so many
problems, especially if normalized, that it is highly advisable to do
the operations in integer arithmetic.

To summarize, arithmetic whose precision is beyond the built-in is
only somewhat expensive if the hardware is designed so that results
of double the built-in length are available (i.e., if 52-bit
mantissas are supported, 104-bit products must be available).  Beyond
that, such arithmetic can be done with some difficulty in unnormalized
floating-point, is extremely difficult in normalized floating-point,
and is easiest in fixed-point.
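
A small C illustration of the double-length building block in
question: a 32x32 -> 64-bit unsigned product assembled from 16-bit
pieces.  The widths and names are only for illustration; the same
shape applies to 52-bit fractions and 104-bit products.

#include <stdint.h>

/* Software extended precision is built from exactly this step: each
   block-by-block multiply must deliver its full double-length result
   so nothing is lost when the pieces are recombined. */
void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;

    uint32_t p00 = a0 * b0;     /* each partial product fits in 32 bits */
    uint32_t p01 = a0 * b1;
    uint32_t p10 = a1 * b0;
    uint32_t p11 = a1 * b1;

    uint32_t mid = (p00 >> 16) + (p01 & 0xFFFFu) + (p10 & 0xFFFFu);

    *lo = (p00 & 0xFFFFu) | (mid << 16);
    *hi = p11 + (p01 >> 16) + (p10 >> 16) + (mid >> 16);
}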

> Besides, double-precision ( 64 bit ) floating multiply and add
> should only take about 20% longer than single-precision multiply
> and add. Same for division and square roots.

If additional "silicon real estate" is added, this is true.  The
chip area is doubled for addition, and probably a factor of 3-4
for multiplication.


-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

henry@utzoo.uucp (Henry Spencer) (02/04/88)

> This is the "boondoggle" of the C programming language, which I know and
> love, that arose from its early PDP11+FP11 targets...

Not entirely true.  To be certain of my facts on this, I dug out the article
Dennis Ritchie posted about it some five years ago.  Although he acknowledges
the FP11's contribution, he cites that as reason number two, not reason
number one, for C doing everything in double.  Reason number one was the
problem of getting the types of function arguments and return values correct,
combined with a desire to avoid multiple versions of library functions.  He
repeatedly and clearly states that this was the most important reason for
this design decision.

> ... It is being "fixed", read
> that PATCHED, in the ANSI spec by introducing the new type "long double"
> and or by the use of function prototypes...

Well, no, not quite, it is being fixed -- really fixed -- by allowing the
compiler to do floating-point arithmetic in float, rather than double, if
all operands are float, and by permitting function arguments to be passed
as floats given suitable prototypes.  The problem of multiple library
functions can't be entirely avoided for the speed demons, but function
prototypes should make things manageable for ordinary users.  While some
of the permissiveness in the wording may be regrettable (e.g. it is still
legal to do all arithmetic in double, for backward compatibility), the
overall scheme really is a proper fix, not a "patch".
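
A tiny example of both halves of the fix, using a made-up function
name:

/* With this prototype in scope, callers may pass the arguments as
   genuine floats rather than promoting them to double. */
float scale(float x, float factor);

float scale(float x, float factor)
{
    /* Both operands are float, so under the new rules the compiler
       may perform the multiply in single precision; doing it in
       double remains legal for backward compatibility. */
    return x * factor;
}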

There is room for debate about whether function prototypes are a Good Thing
overall, but they do go a long way towards cleaning up this problem.

"long double" is being introduced mostly because of a non-trivial number
of machines which also have 128-bit floating point.  It has nothing to do
with the all-arithmetic-in-double issue.
-- 
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry

yuval@taux01.UUCP (Gideon Yuval) (02/05/88)

In a recent message, Dennis O'Connor says:
>Message-ID: <9408@steinmetz.steinmetz.UUCP>

> NO. Sorry. But the fastest floating point division and root
> algorithms use Newton-Raphson iteration, where the time to
> solution is proportional to log2( number_of_bits_of_result ).

Since, under machine arithmetic, a/b is NOT equal to a*(1/b), and since the
IEEE floating-point standard demands EXACT rounding, I don't see how Newton-
Raphson is of any use in an IEEE environment.
-- 
Gideon Yuval, +972-52-522255 (work), -2-690992 (home), yuval@taux01.nsc.com

oconnor@sunset.steinmetz (Dennis M. O'Connor) (02/09/88)

An article by yuval@taux01.UUCP (Gideon Yuval) says:
>Since, under machine arithmetic, a/b is NOT equal to a*(1/b), and since the
>IEEE floating-point standard demands EXACT rounding, I don't see how Newton-
>Raphson is of any use in an IEEE environment.
>-- 
>Gideon Yuval, +972-52-522255 (work), -2-690992 (home), yuval@taux01.nsc.com

Newton-Raphson does not ( necessarily ) involve actually computing 1/b
and then multiplying by a, since as you mention this is NOT the same
as a/b. What can be done is to create a two-variable function F which,
when invoked as F( 1, b ), produces 1/b, and when invoked as F( a, b ),
produces a/b. This function does NOT form 1/b as part of its
algorithm. Instead it determines a series of operations that, if
applied to the second variable, would produce 1, and applies that
series of operations to the first variable. When done in "infinite"
precision this is equivalent to a*(1/b); it is not equivalent to
a*(1/b) on finite-precision machines. The first use of the algorithm
I'm aware of was in the IBM 360/195, I think. I don't see how rounding
is any more a problem with this approach than with any other approach.
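
A bare-bones C sketch of that scheme, in the form usually credited to
Goldschmidt (the function name, the pre-scaling assumption, and the
iteration count are mine, for illustration):

/* Numerator and denominator are multiplied by the same factor
   f = 2 - d each pass, driving d toward 1, so n tends to a/b without
   1/b ever being formed as a separate value.  Assumes b has been
   pre-scaled into [0.5, 1), as a hardware unit would arrange. */
double goldschmidt_div(double a, double b, int iterations)
{
    double n = a, d = b;
    int i;
    for (i = 0; i < iterations; i++) {
        double f = 2.0 - d;
        n *= f;
        d *= f;
    }
    return n;           /* approximately a/b; d is now close to 1 */
}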


--
	Dennis O'Connor 	oconnor@sunset.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
    "Nuclear War is NOT the worst thing people can do to this planet."

mbutts@mntgfx.mentor.com (Mike Butts) (02/09/88)

In article <231@m2.mfci.UUCP>, root@mfci.UUCP (SuperUser) writes:
> In article <3536@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu (Shahin Kahn) writes:
> >VLIW has been part of Floating Point Systems' machines for more than 
> >10 years.
> 
> Under that definition of VLIW, every microcoded machine ever made
> qualifies.  It's never been that big a trick to make a machine with
> many parallel functional units.  The trick is to be able to keep them
> busy without hand coding, and to provide the right interface so that
> the compiler can express the parallelism that it finds.
> 
> Read Ellis' thesis;  he presents reasonable arguments as to why FPS'
> architectures have hot spots that make it a very difficult target
> for a compiler.
> 

1) Contrary to popular impression, FPS has been compiling directly from
FORTRAN to their 64-bit "LIW" 164 machine since the early '80s.  The code is
of very high quality, and schedules operations onto the multi-field
instructions across a whole basic block.  For example, the memory system
in the FPS 164 is pipelined, and software accounts for the pipeline length,
not a hardware scoreboard.  The FPS compiler made full use of this pipeline,
scheduling appropriately.  It also has a very effective loop pipeliner.
In fact, I believe Josh Fisher & Co. used 164 hardware for their early work at 
Yale that grew into Multiflow.  

I suppose it's just a consequence of FPS' non-existent paper publishing and 
lack of marketing that people generally still think FPS machines are hand-coded 
array processors.  There is no code layer between the FORTRAN and the 64-bit LIW, 
and the LIW fields are encoded at a minicomputer level of expression, 
so it is not microcode.

Mentor's Compute Engine is also a 64-bit LIW machine, oriented more towards
EDA-type codes, with optimizing, scheduling C, FORTRAN and Pascal compilers that 
make effective use of the machine.  (For historical purposes only - 
not a commercial ;-) )

Don't get me wrong, I have a high opinion of the way Multiflow is developing
VLIW technology, I just feel the early FPS people (I wasn't one of them) should
get more credit for blazing this trail 5-10 years ago.

2) Please give us a reference for Ellis' thesis.


-- 
Mike Butts, Research Engineer         KC7IT           503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts OR  mbutts@pdx.MENTOR.COM
These are my opinions, & not necessarily those of Mentor Graphics.

aglew@ccvaxa.UUCP (02/09/88)

>Since, under machine arithmetic, a/b is NOT equal to a*(1/b), and since the
>IEEE floating-point standard demands EXACT rounding, I don't see how Newton-
>Raphson is of any use in an IEEE environment.
>
>Gideon Yuval, +972-52-522255 (work), -2-690992 (home), yuval@taux01.nsc.com

Use Newton-Raphson to approximate 1/b, and then correct.


Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have nameserver
    aglew@gswd-vms.gould.com    - if you don't
    aglew@gswd-vms.arpa 	- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

johnl@ima.ISC.COM (John R. Levine) (02/10/88)

In article <1988Feb8.121200.370@mntgfx.mentor.com> mbutts@mntgfx.mentor.com (Mike Butts) writes:
>2) Please give us a reference for Ellis' thesis.

John R. Ellis, "Bulldog: A Compiler for VLIW Architectures," MIT Press,
Cambridge Mass., 1986.

ISBN 0-262-05034-X

It won the 1985 ACM Doctoral Dissertation award, and is one heck of a
piece of work (which I'd say even if he hadn't been in the office across
from mine.)
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.something
Gary Hart for President -- Let's win one for the zipper.

root@mfci.UUCP (SuperUser) (02/13/88)

In article <9408@steinmetz.steinmetz.UUCP> sunset!oconnor@steinmetz.UUCP writes:
]An article by colwell@m6.UUCP (Robert Colwell) says:
]] I don't believe you can do double precision math as fast as single
]] precision math if both are implemented in the same technology.
]]...
]NO. Sorry. But the fastest floating point division and root
]algorithms use Newton-Raphson iteration, where the time to
]solution is proportional to log2( number_of_bits_of_result ).
]That is, if single-precision takes 4 iterations, double precision
]will take 5, and quad will take 6.
]
]] If we're only discussing addition, subtraction, and multiplication,
]] then I still don't believe it.  There's an adder at the heart of each
]] of those, and its width decides its speed -- the wider, the slower
]] (more levels of carry-lookahead). 
]]...
]Doubling the width of a carry-select adder adds one gate delay to
]its latency. Carry-lookahead schemes pay a similarly less-than-
]proportional penalty as the width grows. Subtract is just an add.
]Floating point adds also need de- and re-normalization.
]
]Multipliers are more complex: all the fast stuff uses big
]Booth-encoded adder arrays. But the final carry-resolution,
]the justification and normalization logic add significantly to
]the latency of the multiply.
]
]] Bob Colwell
]
]Bob, you seem a little naive about floating-point hardware.
]It's not all shift-add or shift-compare-subtract anymore.
]
]	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
]				ARPA: OCONNORDM@ge-crd.arpa

Whether I'm naive or not isn't relevant to anything.  In fact, I
don't see where you and I disagreed anywhere above;  I said more
bits takes longer, and you quantified it.

On the other hand, now that you brought it up, I'm with Gideon Yuval
on the subject of rounding.  As far as I know, you can't get double
precision IEEE round-to-nearest sqrt or divide with only 5 iterations
of Newton-Raphson.  I believe that gives you enough *accuracy*,
meaning all of the mantissa bits are meaningful at that point, but
you still won't know if the number you just got from iteration number
5 is the best finite-precision approximation to the real answer
(meaning the infinite-precision case) unless you then proceed to
another step which can figure that out.  Iterative hardware, of
course, does one extra step and rounds the mantissa according to
whether the guard bit was a 1 or 0 (and if the mantissa then
overflows, the exponent must be bumped, and if that overflows...)
If you do the N-R in extended precision, obviously the conversion
handles the rounding exactly.
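
One concrete form that extra step can take is sketched below; the
function name is made up, it leans on an exact fused multiply-add
(written with the much later C99 fma() as a stand-in), and it is meant
as an illustration of the idea, not a proof of correct IEEE rounding
in every case:

#include <math.h>

/* Given an approximation q to a/b that is already good to within
   about one unit in the last place, compute the remainder a - q*b
   with a fused multiply-add and check whether either neighbouring
   representable value leaves a smaller remainder; if so, that
   neighbour is the better-rounded quotient. */
double round_quotient(double a, double b, double q)
{
    double r    = fma(-q, b, a);
    double q_up = nextafter(q,  INFINITY);
    double q_dn = nextafter(q, -INFINITY);

    if (fabs(fma(-q_up, b, a)) < fabs(r)) return q_up;
    if (fabs(fma(-q_dn, b, a)) < fabs(r)) return q_dn;
    return q;
}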

You might well argue that many programs are not sensitive to noise
in the lsb of an IEEE double precision sqrt or divide, and that's
certainly true.  I have seen, however, at least one program that
required absolutely monotonic behavior in the lsb or it would take
unacceptably long to converge.  And then there are programs which
demand IEEE accuracy everywhere.

Bob Colwell
Multiflow Computer  <Above is personal opinion only!!>
175 N. Main St.
Branford CT 06405    mfci!colwell@uunet.uucp  203-488-6090

root@mfci.UUCP (SuperUser) (02/21/88)

>> 
>> Under that definition of VLIW, every microcoded machine ever made
>> qualifies.  It's never been that big a trick to make a machine with
>> many parallel functional units.  The trick is to be able to keep them
>> busy without hand coding, and to provide the right interface so that
>> the compiler can express the parallelism that it finds.
>> 
>
>1) Contrary to popular impression, FPS has been compiling directly from
>FORTRAN to their 64-bit "LIW" 164 machine since the early '80s.  The code is
>of very high quality, and schedules operations onto the multi-field
>instructions across a whole basic block.  For example, the memory system
>in the FPS 164 is pipelined, and software accounts for the pipeline length,
>not a hardware scoreboard.  The FPS compiler made full use of this pipeline,
>scheduling appropriately.  It also has a very effective loop pipeliner.
>In fact, I believe Josh Fisher & Co. used 164 hardware for their early work at 
>Yale that grew into Multiflow.  

The above comments are well taken, but the 164 had one basic flaw.  The 164
architecture was designed years before anybody contemplated writing a
compiler for it.

Largely as a result, there are problems with the architecture that make it
of limited suitability as a compiler target.  For example, the machine has
only half the register bandwidth required to support the functional units'
requirements for operands and results.  If the functional units are to be
used at full speed, the pipelined output of one must be used "live" as an
input to the other.  This causes great difficulty for a scheduler; the
machine cannot be modeled as independent functional units, and scheduling
cannot proceed based only on data precedence and the availability of the
required functional unit.  

Symptoms of these limitations show up in the wide disparity between
performance achieved by FPS' compilers and by hand coders, and the resulting
commitment by FPS to the production of large mathematical libraries written
in hand code.

John Ellis' thesis discusses these issues in greater depth.

>
>2) Please give us a reference for Ellis' thesis.
>
John R. Ellis, "Bulldog: A Compiler for VLIW Architectures".  1986.
ACM Doctoral Dissertation Award Series, MIT Press, Cambridge, MA.

See also Colwell et al., "A VLIW Architecture for a Trace Scheduling Compiler",
in Proceedings, ASPLOS-II (Computer Architecture News, OS Review, or SIGPLAN
Notices, October 1987) for an update on our architectural approach.