physh@unicom.UUCP (Jon 'Quality in - Quantity out' Foreman) (01/14/88)
	I really don't think the real world needs anything more expansive
than a 32-bit processor to get most jobs done.  Given that (even if only
for argument's sake), here are my ideas on a possible way to increase
performance for relatively little cost.

	It seems to me that one way to get a further increase in
performance without increasing clock speeds, and thus memory and other
chip costs, would be to leave the processor at 32 bits and increase the
external data path width to, say, 128 bits.  Then it may be possible to
get at least one instruction per fetch.  Since most computers fetch
(mostly) one instruction per operation, this would seem to translate
into a 3-times-plus increase in performance.  (I'm probably wrong here,
but in which direction I have no idea.)

	Data, as seen by the processor, would still look like 32 bits,
but with a single (and almost free) external data word cache, you could
also increase the performance of sequential data read operations (say, a
sequential table lookup), because your one 128-bit fetch would net you
four 32-bit words which you can use later.  This would probably not work
as well when writing to memory, because of DMA, etc.

	One additional advantage of such an approach would be that you
could increase the clock rate on the CPU and still have lots of memory
bandwidth.  Implementation of DMA devices may be tricky, and it may be a
help to have the DMA devices view the memory array as 8 or 16 bits wide.
Also, I/O instructions (like on the 8086 family) would have to be worked
out somehow.  There is one other apparent bonus: you could fetch an IEEE
temporary real number in one fetch, but compared to compute time, this
may not really be all that wonderful.

	There is at least one obvious downside, however.  Say one
chooses a 128-bit data path: one would have to supply at least that
number of pins on the chip to handle it.  With control signals and all,
you could end up needing 200 pins on the chip.  One could also argue
that laying out such a thing on a circuit board would be a real drag,
but I tend not to get too excited about this, mostly because of the
reality of multilayer circuit boards.

--------------

	Ok, gang.  I don't have the resources or know-how to implement
such a thing, so in the interests of learning, please feel free to pick
me to pieces.  I really hope this is a good idea, because I am already
feeling that my 16 MHz '286 box is too slow, and I really can't afford a
machine that uses 20 ns RAM.  :-)  Has anyone experimented with this, or
has implemented it?  I at least would like to know.

					Jon Foreman
--
ucbvax!pixar!\            | For small letters  | ~~~~~~~\~~~  That's spelled
hoptoad!well!unicom!physh | ( < 10K ) only:    | Jon }()      "physh" and
       ptsfa!/            | physh@ssbn.wlk.com |       /      pronounced "fish".
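[To make the proposed line buffer concrete, here is a minimal C sketch
of a 32-bit processor reading through a single 128-bit fetch buffer.
Everything here, including the fetch128() primitive, is an illustrative
assumption, not any real design.]

	#define LINE_WORDS 4	/* 128 bits = four 32-bit words */

	static unsigned long line_buf[LINE_WORDS];
	static unsigned long line_tag = 0xFFFFFFFF;	/* nothing buffered yet */

	/* Hypothetical primitive: one 128-bit-wide memory cycle. */
	extern void fetch128(unsigned long line_addr,
	                     unsigned long buf[LINE_WORDS]);

	unsigned long read32(unsigned long addr)	/* one 32-bit CPU read */
	{
		unsigned long line = addr & ~(unsigned long)(LINE_WORDS * 4 - 1);

		if (line != line_tag) {		/* miss: one wide fetch */
			fetch128(line, line_buf);
			line_tag = line;
		}
		return line_buf[(addr >> 2) & (LINE_WORDS - 1)];
	}

[On a sequential scan, three of every four 32-bit reads hit the buffer,
which is the effect the post is after.]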
johnl@ima.ISC.COM (John R. Levine) (01/15/88)
In article <235@unicom.UUCP> physh@unicom.UUCP writes:
>	It seems to me that one way to get a further increase in
>performance without increasing clock speeds and thus memory and other
>chip costs, would be to leave the processor at 32 bits and increase the
>external data path width to say 128 bits.  Then it may be possible to
>get at least one instruction per fetch. ...

This is a great idea, but one that has been well known for a long time.
The 360/85, just about the first commercial machine with a cache, did
exactly that for the path from core (and I mean core) to the cache.  It
still seems to be a standard technique in machines that have separate
buses for memory and I/O and so avoid issues like what a 128-bit path
to a serial terminal means.

Speaking of I/O buses, one of the ways that IBM has increased channel
speeds over the years is to move from the original one-byte-wide
System/360 I/O bus to wider ones.
--
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.something
Gary Hart for President -- Let's win one for the zipper.
hutchson@convexc.UUCP (01/16/88)
A similar scheme was used on the GE-600 and its successors up to the Honeywell DPS-70. All memory references were double-word (72 bits); nearly all instructions were single-word, so an instruction fetch got two instructions. It "worked" very well; a 600-family machine got about the same performance as an I** 370-family machine with double the memory access speed. (That's raw performance, sorta like mips'es; throughput on real multiprocessing loads was driven by other architectural details and by OS behavior: the GE would appear to be up to an order of magnitude faster.)
oconnor@sunray.steinmetz (Dennis M. O'Connor) (01/16/88)
An article by physh@unicom.UUCP says:
  [ proposes increasing computer performance by making busses
    128 bits wide, while leaving data-paths 32-bits wide ]

Comments :

1. We're not talking a microcomputer here, not with current packaging &
   interconnect technology.  So for now this proposal is only applicable
   to super-minis, mainframes and super- or hyper-computers.  ( A
   hyper-computer is a super-computer being promoted by an over-excited
   marketing division :-)  Many machines do what you have proposed, on
   the "far" (non-CPU) side of their caches.  Even micros can do this,
   since cache data width isn't as limited by pins-per-package as CPU
   bus width is in a microcomputer.

2. When considering the total system cost of such a machine, the CPU
   datapaths are only a small part.  Admittedly, making them 64 bits
   wide may slow them down ( it adds a gate delay to your carry chain,
   plus higher parasitics and greater distance between components ) but
   for many applications ( like banking, where you need to keep track of
   more than 16 billion tenths-of-a-cent ( $16 million ), and scientific
   computing, where double precision is becoming the LEAST precise thing
   people are willing to use ) you probably win.

3. When you mentioned fetching instructions, you seemed to assume that
   instruction size had to equal data word size.  This isn't true.
   There is no direct relation between word size and instruction size.
   There is some relation between word size and address space ( if you
   have n-bit data paths, you ought to have n-bit addresses, and
   vice-versa.  This is just a sensible opinion ? )

To sum it all up : Yes, the idea is good, it is used already, and it
yields higher memory bandwidth at reduced cost.  But it's not practical
for single-chip implementation yet : when we get beyond 244-pin
high-speed packages ( e.g. wafer-scale, or whatever other high-density
interconnect technology you like ) then it will probably be used in the
micro arena.  This feature migration from mainframes to micros is not
at all uncommon.  Or unexpected.
--
  Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
  ARPA: OCONNORDM@ge-crd.arpa
  "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"
davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (01/16/88)
In article <235@unicom.UUCP> physh@unicom.UUCP writes: ...
| 	It seems to me that one way to get a further increase in
| performance without increasing clock speeds and thus memory and other
| chip costs, would be to leave the processor at 32 bits and increase the
| external data path width to say 128 bits.  Then it may be possible to
| get at least one instruction per fetch.  Since most computers fetch
| (mostly) one instruction per operation this would seem to translate
| into a 3 times plus increase in performance.  (I'm probably wrong here,
| but in which direction I have no idea.)

There are other ways to get a bandwidth increase.  One is to go to a
Harvard architecture, with the data and code using separate busses.  The
68030 has gone part way on this, with an internal Harvard architecture
on chip, separate memory caches for each internal bus, and a multiplexed
external bus.  I believe that Intel will separate the code, data, and
i/o busses in the 80486, but that's based on rumor.

Sequent reduces bus usage by having a local cache and delayed write
through.  When a value is modified in the cache, it is *not* written to
memory.  Only when the value is flushed from cache, or when another
processor reads the value, is the modified value placed on the bus.
When another processor reads the value, the processor cache which
contains the most recently modified version of the data places it on the
bus, and the other processor *and the memory* are updated.

Obviously there is more to this, to handle the case where (1) processor
A reads data from memory, (2) processor B reads data from memory, (3)
processor A updates the cached value, and (4) processor B updates the
cached value.  Now when processor C reads the value, both A and B will
try to supply it.  I assume that this works, since they have 30
processors running on the bus, but I don't know how.

The disadvantage of wider external busses is cost.  With a wider bus you
have more support logic, and that adds to the cost.  Since many data
accesses are made on bytes, the extra bus bandwidth doesn't buy you as
much as it could, although cache improves this.  Since most CPUs already
have a pipeline, the wider bandwidth reduces the number of bus cycles,
but doesn't improve the rate at which the CPU can execute.
--
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
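[A minimal C sketch of the delayed write-through policy described above.
The state names and line layout are illustrative guesses, not Sequent's
actual implementation.]

	enum state { INVALID, CLEAN, DIRTY };

	struct cline {
		unsigned long tag;
		enum state    st;
		unsigned long data[4];
	};

	/* A write dirties only the local copy; memory sees the new
	   value later, on a flush or when another processor asks. */
	void cache_write(struct cline *l, int word, unsigned long v)
	{
		l->data[word] = v;
		l->st = DIRTY;	/* goes to the bus only on flush, or
				   when another CPU reads this line */
	}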
kyriazis@pawl22.pawl.rpi.edu (George Kyriazis) (01/16/88)
In article <8843@steinmetz.steinmetz.UUCP> sunray!oconnor@steinmetz.UUCP writes:
>An article by physh@unicom.UUCP says:
>  [ proposes increasing computer performance by making busses
>    128 bits wide, while leaving data-paths 32-bits wide ]
>
  I don't know what he says exactly, but he mentions expanding data
paths to 128 bits.  I don't know, but I thought of that just a few days
ago, though for fetching instructions only.

>Comments :
>1. We're not talking a microcomputer here, not with current
>   packaging & interconnect technology.  So for now this
>   proposal is only applicable to super-minis, mainframes
>   and super- or hyper-computers.
>
  It might as well be in a micro-computer.  OK, maybe something the size
of a SUN.  Memories are usually what produces the bottleneck; the CPU
can get data very fast.  You can get 128 bits of data out of the memory
at once, but give it to the CPU in 32-bit chunks.

  What I state here is maybe another (slightly different) proposal.  As
I stated before, I'd rather do it for instructions.  OK, Harvard
architecture.  You read 128 bits at a time, and (assuming a 32-bit CPU)
you feed the CPU with 32 bits at 4 times the speed.  You might as well
feed all this data into shift registers, and then shift data out to the
CPU.  Assuming 64K*1 chips, you will need 128 chips, therefore 64K*128 =
8 megabits (1 megabyte).  There are also 64K*4 chips, so minimum memory
can go down by a factor of 4.  A quite feasible approach for a
medium-sized computer (close to a micro).

  The only problem when doing that is jump instructions.  Assume that
memory operates at its fastest possible speed.  If you meet a jump
instruction in the middle of the 128-bit word, you'll have to (more or
less) execute all the rest up till the end of the fetch.  Some RISC CPUs
have done this, but for only one instruction.  Can a compiler
successfully put 3 useful instructions after the jump??  Maybe it sounds
too cheap: "After any jump the CPU executes at most three instructions
after it".  (Actually it turns out that the jump has to be in the first
of the 4 instructions in the 128-bit word, so that the memory can get
the right address.)

>3. When you mentioned fetching instructions, you seemed to assume
>   that instruction size had to equal data word size.  This isn't true.
>   There is no direct relation between word size and instruction
>   size.  There is some relation between word size and address space
>   ( if you have n-bit data paths, you ought to have n-bit addresses,
>   and vice-versa.  This is just a sensible opinion ? )
>
  I tend to agree on that.  It's better to have only as much memory as
you can reference, not more.  Not like the poor 8086 that tries to
access 1 Mbyte with 16-bit index registers.

*******************************************************
*George C. Kyriazis                                   *  Gravity is a myth
*userfe0e@mts.rpi.edu or userfe0e@rpitsmts.bitnet     *       \ /
*Electrical and Computer Systems Engineering Dept.    *        \ /
*Rensselaer Polytechnic Institute, Troy, NY 12180     *         ||
*******************************************************
				Earth sucks.
mason@tmsoft.UUCP (Dave Mason) (01/17/88)
In article <8844@steinmetz.steinmetz.UUCP> davidsen@crdos1.UUCP (bill davidsen) writes:
>Sequent reduces bus usage by having a local cache and delayed write
>through.  When a value is modified in the cache, it is *not* written to
>memory.  Only when the value is flushed from cache, or when another
>processor reads the value, is the modified value placed on the bus.  When
>another processor reads the value, the processor cache which contains
>the most recently modified version of the data places it on the bus, and
>the other processor *and the memory* are updated.

There's actually a bit more to it than this.  A processor only changes
the cache if it has EXCLUSIVE ownership of the word (in fact every write
to memory is treated this way, I believe).  When another processor asks
for a word, the owner gives it (rather than memory), but also holds on
to it, marking it as shared ownership (the requestor also marks it as
shared ownership).  Then, before modifying it, a processor negotiates
over the bus for exclusive ownership.  Many processors may
simultaneously hold a word for shared access.

	I think this is a particularly elegant solution to the cache
coherency problem.  Does anyone know if Sequent invented it (it seems
rather too obvious for that), or a proper reference to who did?
	../Dave
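[As an illustration only -- the state names and bus primitive below are
invented for the sketch, not taken from Sequent's design -- the
shared/exclusive rule described above looks roughly like this in C:]

	enum own { INVALID, SHARED, EXCLUSIVE };

	struct cline {
		unsigned long tag;
		enum own      st;
		unsigned long data[4];
	};

	/* Hypothetical bus primitive: wins exclusive ownership, and
	   every other cache demotes its copy of the line to INVALID. */
	extern void bus_get_exclusive(unsigned long addr);

	void coherent_write(struct cline *l, unsigned long addr,
	                    unsigned long v)
	{
		if (l->st != EXCLUSIVE) {	/* negotiate first */
			bus_get_exclusive(addr);
			l->st = EXCLUSIVE;
		}
		l->data[0] = v;			/* now safe to modify */
	}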
tim@amdcad.AMD.COM (Tim Olson) (01/18/88)
In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
| You read 128 bits at a time, and
| (assuming a 32-bit CPU) you feed the CPU with 32 bits at 4 times the
| speed.  You might as well feed all this data into shift registers, and
| then shift data out to the CPU.  Assuming 64K*1 chips, you will need
| 128 chips, therefore 64K*128 = 8 megabits (1 megabyte).  There are also
| 64K*4 chips, so minimum memory can go down by a factor of 4.  A quite
| feasible approach for a medium-sized computer (close to a micro).

Or you can simply interleave the memories 4 ways, and reduce the path
between memory and the instruction bus to 32 bits.

| The only problem when doing that is jump instructions.  Assume that
| memory operates at its fastest possible speed.  If you meet a jump
| instruction in the middle of the 128-bit word, you'll have to (more or
| less) execute all the rest up till the end of the fetch.  Some RISC
| CPUs have done this, but for only one instruction.  Can a compiler
| successfully put 3 useful instructions after the jump??  Maybe it
| sounds too cheap: "After any jump the CPU executes at most three
| instructions after it".  (Actually it turns out that the jump has to
| be in the first of the 4 instructions in the 128-bit word, so that the
| memory can get the right address.)

It is very hard to effectively schedule more than 1 instruction after a
delayed branch, and even more prohibitive to force branches to occur
only at 4-instruction boundaries.  A solution to this problem is to
throw away the instructions coming from memory, sourcing them instead
from a cache while the instruction stream is restarted.  This is what
the Branch-Target Cache is for on the Am29000.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)
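[The idea of a branch-target cache can be pictured in a few lines of C.
The sizes and names here are invented for illustration and are not the
Am29000's actual parameters.]

	#define BTC_ENTRIES 32
	#define BTC_INSNS    4	/* target instructions held per entry */

	struct btc_entry {
		unsigned long target;		/* branch-target address */
		unsigned long insn[BTC_INSNS];	/* first insns at target */
		int           valid;
	};

	static struct btc_entry btc[BTC_ENTRIES];

	/* On a taken branch: if the target hits, these instructions
	   issue immediately while the memory fetch restarts; on a
	   miss, the pipeline waits for memory and fills the entry. */
	unsigned long *btc_lookup(unsigned long target)
	{
		struct btc_entry *e = &btc[(target >> 2) % BTC_ENTRIES];

		if (e->valid && e->target == target)
			return e->insn;
		return (unsigned long *)0;
	}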
mash@mips.UUCP (John Mashey) (01/18/88)
In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
.....
>but for only one instruction.  Can a compiler successfully put 3 useful
>instructions after the jump?? ....

No:
	1st delay slot: filled 70+% of the time, or more
	2nd: a lot less (studied at Stanford; I don't recall the results,
	     because they were low enough we never considered it)
	3rd: forget it.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
aglew@ccvaxa.UUCP (01/18/88)
>/* Written 2:43 pm Jan 17, 1988 by tim@amdcad.AMD.COM in ccvaxa:comp.arch */
>In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
>| The only problem when doing that is jump instructions. ...  Can a
>| compiler successfully put 3 useful instructions after the jump?? ...
>
>It is very hard to effectively schedule more than 1 instruction after a
>delayed-branch, and even more prohibitive to force branches to occur
>only at 4-instruction boundaries.  A solution to this problem is to
>throw away the instructions coming from memory, sourcing them instead
>from a cache while the instruction stream is restarted.  This is what
>the Branch-Target Cache is for on the Am29000.
>
>	-- Tim Olson

Another approach is to execute all of the instructions come what may.
This is the VLIW, trace-scheduling, approach commercialized by Multiflow.

Andy "Krazy" Glew.  Gould CSD-Urbana.  1101 E. University, Urbana, IL 61801
    aglew@gould.com		- preferred, if you have nameserver
    aglew@gswd-vms.gould.com	- if you don't
    aglew@gswd-vms.arpa		- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa	- domains are supposed to make things easier?

My opinions are my own, and are not the opinions of my employer, or any
other organisation.  I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
mbutts@mntgfx.mentor.com (Mike Butts) (01/19/88)
In article <844@ima.ISC.COM>, johnl@ima.ISC.COM (John R. Levine) writes:
> In article <235@unicom.UUCP> physh@unicom.UUCP writes:
> > It seems to me that one way to get a further increase in
> >performance without increasing clock speeds and thus memory and other
> >chip costs, would be to leave the processor at 32 bits and increase the
> >external data path width to say 128 bits.  Then it may be possible to
> >get at least one instruction per fetch. ...
>
> This is a great idea, but one that has been well known for a long time.
> The 360/85, just about the first commercial machine with a cache, did
> exactly that for the path from core (and I mean core) to the cache.  It
> still seems to be a standard technique in machines that have separate
> buses for memory and I/O and so avoid issues like what a 128 bit path
> to a serial terminal means.

This is indeed a fine idea, both for data and for instructions, but the
big problem with wider datapaths nowadays is the pin count costs on the
CPU chip.  When fast computers were built at the board level, very wide
busses were relatively easy to achieve.  Now there's a great speed
advantage from having the whole CPU on one chip, but VLSI packages have
severe pin count limitations.  More than 100-150 pins on a chip gets
very expensive very fast, and more than 256 becomes untestable most
places.  This is at least one reason why MIPS and others must go to the
trouble to double-clock their busses.  Also, driving lots of outputs
from high to low at once will put a big bounce on the chip's internal
ground system, risking dropped bits or even destructive latchup.

I think the tension between wide busses and packaging is one of the
toughest problems faced by CPU engineers these days.
--
Mike Butts, Research Engineer         503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts OR mbutts@pdx.MENTOR.COM
These are my opinions, & not necessarily those of Mentor Graphics.
petolino%joe@Sun.COM (Joe Petolino) (01/19/88)
> [ description of ownership protocol for cache coherence ]
>
>	I think this is a particularly elegant solution to the cache
>coherency problem.  Does anyone know if Sequent invented it (seems
>rather too obvious for that), or a proper reference to who did?

Amdahl was designing systems like this almost ten years ago (all of the
580-series processors use ownership schemes).  Note that cache
consistency can be a problem even in single-processor systems, if
virtual aliases are allowed in a virtually-addressed cache.

The earliest references I could find by digging through my old Alan Jay
Smith papers are:

   C.K. Tang, "Cache System Design in the Tightly Coupled Multiprocessor
   System", Proc. NCC, 1976.

   Lucien Censier and Paul Feautrier, "A New Solution to Coherence
   Problems in Multicache Systems", IEEE Trans. on Computers, C-27, 12,
   Dec 1978.

I have no idea whether either of these papers actually describes an
ownership protocol.

-Joe
mason@tmsoft.UUCP (Dave Mason) (01/19/88)
In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
" I don't know what he says exactly, but he mentions expanding data paths to
"128 bits.  I don't know, but I thought of that just a few days ago, though
"for fetching instructions only.
" ...
" The only problem when doing that is jump instructions.  Assume that memory
"operates at its fastest possible speed.  If you meet a jump instruction
"in the middle of the 128-bit word, you'll have to (more or less) execute
"all the rest up till the end of the fetch.
" ...  (Actually it turns
"out that the jump has to be in the first of the 4 instructions in the
"128-bit word, so that the memory can get the right address.)

Actually (like everything else) it's already been done, in this case by
the Control Data 6000 series at least.  They had a mix of 15- and 30-bit
instructions in a 60-bit word.  (This probably applies to all other Cray
designs as well, although I haven't programmed the others.)  Any label
(that could therefore be branched to) forced alignment on a new word.
Also, a 30-bit instruction when you only had 15 bits left in the current
word would go to the next word.  There wasn't a delayed jump, however,
so there was no attempt to fill the rest of the word with useful
instructions; it was simply filled with NOPs.  Also, a jump could appear
anywhere in a word.  (The FORTRAN compiler was pretty advanced for its
day, and I believe did some code migration to fill words.)
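[The packing rule described above is simple enough to state in a few
lines of C.  This is an illustrative sketch of the rule only, not CDC's
actual assembler.]

	#define WORD_BITS 60	/* CDC 6000-series word */

	static int bits_used;	/* bits already filled in current word */

	/* Place one 15- or 30-bit instruction "parcel".  A branch
	   target, or a parcel that won't fit, pads the rest of the
	   current word with NOPs and starts a new word. */
	void emit_parcel(int width, int is_branch_target)
	{
		if (is_branch_target || bits_used + width > WORD_BITS)
			bits_used = 0;		/* NOP-fill, next word */
		bits_used += width;
	}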
root@mfci.UUCP (SuperUser) (01/21/88)
In article <28200088@ccvaxa> aglew@ccvaxa.UUCP writes:
>>In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
>>| The only problem when doing that is jump instructions. ...  Can a
>>| compiler successfully put 3 useful instructions after the jump?? ...
>>
>>It is very hard to effectively schedule more than 1 instruction after a
>>delayed-branch, and even more prohibitive to force branches to occur
>>only at 4-instruction boundaries.  A solution to this problem is to
>>throw away the instructions coming from memory, sourcing them instead
>>from a cache while the instruction stream is restarted.  This is what
>>the Branch-Target Cache is for on the Am29000.
>>
>>	-- Tim Olson
>
>Another approach is to execute all of the instructions come what may.
>This is the VLIW, trace-scheduling, approach commercialized by Multiflow.
>
>Andy "Krazy" Glew.  Gould CSD-Urbana.

Right, and moreover, Multiflow's Trace can have as many as 4 conditional
branches plus a fall-through in a given instruction (prioritized on a
per-instruction basis using a simple but clever arbitration scheme).
While we have a delayed branch (2 clock beats per instruction, and one
set-of-branches per instruction, so at least one beat of real work
always gets executed), the compiler's forte is finding useful things to
do and scheduling them wherever they fit best.

We also use four 32-bit 65 ns buses at full tilt to fill the icache (the
same buses are used for memory transfers, or CPU cluster-to-cluster
transfers), another recent topic of interest here.

Also, while I'm at it, we provide a way for programmers to indicate the
likelihood of branching one way or another if they happen to know
(people who are porting their code from a Cray usually do).  The keyword
isn't "Frequency" but the intent is the same.  On the other hand, we
have found that our compiler does a fine job of predicting branch
probabilities without user input and also without having to profile the
code.  We initially thought the profiling would be very important, but
(at least for branch prediction) it hasn't turned out that way.

-Bob Colwell
Multiflow Computer
175 N. Main St.
Branford, CT 06405   203-488-6090
<Standard disclaimer: I speak only for me>
aglew@ccvaxa.UUCP (01/22/88)
>	I really don't think the real world really needs anything more
>expansive than a 32 bit processor to get most jobs done.

Probably, but...  I wonder how much interest might be out there for a
true double-precision floating point engine - one that did 64 bit
floating point, or IEEE 80 bit extended floating point, or even 128 bit
floating point, as its native floating point mode, as fast as single
precision on nearly any other machine in its price range?

I'm sure that most people wouldn't need this, but some might - and I'd
like to get a feel for the size of such a niche, if it exists.  Post if
you want to discuss, or mail me (I'll summarize to the net if I get a
large enough sample).

Andy "Krazy" Glew.  Gould CSD-Urbana.  1101 E. University, Urbana, IL 61801
    aglew@gould.com		- preferred, if you have nameserver
    aglew@gswd-vms.gould.com	- if you don't
    aglew@gswd-vms.arpa		- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa	- domains are supposed to make things easier?

My opinions are my own, and are not the opinions of my employer, or any
other organisation.  I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
rorden@kolob.UUCP (Randy Rorden) (01/22/88)
in article <39245@sun.uucp>, petolino%joe@Sun.COM (Joe Petolino) says:
>
> [in response to request for references on cache ownership protocols]
>
> The earliest references I could find by digging through my old Alan Jay
> Smith papers are:
>
>    C.K. Tang, "Cache System Design in the Tightly Coupled Multiprocessor
>    System", Proc. NCC, 1976.
>
>    Lucien Censier and Paul Feautrier, "A New Solution to Coherence
>    Problems in Multicache Systems", IEEE Trans. on Computers, C-27, 12,
>    Dec 1978.
>
> I have no idea whether either of these papers actually describes an
> ownership protocol.
>
> -Joe

I ran across the following article, which describes how to use a Shared
bit and a Dirty bit for cache consistency management in a multiprocessor
system:

   Charles P. Thacker and Lawrence C. Stewart, "Firefly: a Multiprocessor
   Workstation", Proc. ASPLOS II, 1987.

On another aspect of the Sequent bus interface - I once read an article
that explained that it uses a request/response type protocol to avoid
tying up the bus for the entire length of a read or write access.  It
also mentioned that several processors could have requests pending at
the same time, but that somehow they avoided sending requests to devices
that already had "full" request queues.  This is the part I couldn't
figure out - does each processor "count" how many requests and replies
have been transacted with each possible device?  Or do they just monitor
how many total requests are pending and assume the worst case, that they
all went to the same device?  Either of these methods requires some
understanding of how deep a given device's request queue is.

Randy Rorden                 {ihnp4,byuadam,uunet}!iconsys!rorden
Icon International, Inc.     ARPANET: icon%byuadam.bitnet@wiscvm.wisc.edu
Orem, Utah 84058             BITNET: icon%byuadam.bitnet
(801) 225-6888               <This space reserved for witty remark>
alvitar@madhat.UUCP (Phil Harbison) (01/29/88)
In article <39245@sun.uucp>, petolino%joe@Sun.COM (Joe Petolino) writes:
> > [ description of ownership protocol for cache coherence ]
> >Does anyone know if Sequent invented it (seems
> >rather too obvious for that), or a proper reference to who did?
>
> Amdahl was designing systems like this almost ten years ago ...

Synapse, a manufacturer of fault-tolerant systems based on the 68000,
used an ownership protocol on their proprietary bus.  I'm not sure if
Synapse is still in business, but I believe they started building their
68K systems around 1982.

The memory controllers maintained extra bits to indicate whether any
public or private copies of a memory block (4 32-bit words) were checked
out.  Any time you performed a read and the cache missed, you had to do
a "public" read from the memory controller.  Public copies of a memory
block were read-only.  Before performing a write, you had to do a
"private" read from the memory.  If no other private copies were checked
out, the memory controller would respond and the bus watchers on all
other caches would invalidate their public copies.  If a private copy
was already checked out, the memory controller would respond with an
abort-and-retry signal.  Then the bus watcher with the private copy
would update main memory and invalidate its copy.
--
Live: Phil Harbison
USPS: 3409 Grassfort Drive, Huntsville, AL 35805-5421
UUCP: {clyde,uunet}!madhat!alvitar
PSTN: 205-881-4317
oconnor@sunray.steinmetz (Dennis M. O'Connor) (01/30/88)
An article by aglew@ccvaxa.UUCP says:
]>	I really don't think the real world really needs anything more
]>expansive than a 32 bit processor to get most jobs done.
]
]Probably, but...  I wonder how much interest might be out there for
]a true double-precision floating point engine - one that did 64 bit
]floating point, or IEEE 80 bit extended floating point, or even
]128 bit floating point, as its native floating point mode, as fast
]as single precision on nearly any other machine in its price range?
]
]I'm sure that most people wouldn't need this, but some might - and I'd
]like to get a feel for the size of such a niche, if it exists.
]
]Andy "Krazy" Glew.  Gould CSD-Urbana.

There is indeed a market for this.  Several, in fact.

1. Simulation of Physical Systems ( Chemistry, Aerodynamics, Mechanics )
   These people need all the accuracy they can get.  Truncation error
   just grows too fast on 32-bit floats, and the dynamic range is too
   small.

2. Real-time control systems ( aircraft engines, aircraft control
   surfaces )  These systems can't afford to flake out because of
   truncation errors accumulating or dynamic range being exceeded.  They
   are also so tightly pressed for performance that they can't afford
   elaborate checks of their results.

3. Sonar and OTH Radar systems
   These are TOUGH signal processing problems, with very noisy data,
   needing long, complicated munging.  And they must be real time.

4. Strategic Weapons Systems, including SDI
   Think about aiming at a 1 meter target that's 500,000 meters away.
   That's moving at 10,000 meters per second.  How much error can you
   afford?  How much time have you got to do the job?

Single precision is apparently becoming obsolete.  It's just too much of
a pain to deal with.  Double precision is the only way to go.  Optimize
for double; keep single only if you need "backwards" compatibility.
"You'll thank me for it."
--
  Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
  ARPA: OCONNORDM@ge-crd.arpa
  "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"
roy@phri.UUCP (Roy Smith) (01/30/88)
In article <28200089@ccvaxa> aglew@ccvaxa.UUCP writes:
> I wonder how much interest might be out there for a true double-precision
> floating point engine - one that did 64 bit floating point, or IEEE 80
> bit extended floating point, or even 128 bit floating point, as its
> native floating point mode, as fast as single precision on nearly any
> other machine in its price range?

	I don't know about price range, but doesn't the FPS-164 do
exclusively 64-bit floating point?

	Actually, I wonder if a RISC-FPP would make sense.  Consider a
machine like a Vax.  Let's say we got rid of all the multiple precision
floating point formats and instructions (at last count, F, D, G, and H;
did I miss any?) and made all floating point math a single
high-precision format.  Clearly that would save silicon.  If we took
that silicon real estate and devoted it to making that single floating
point format work faster, could we build a machine which only has double
precision, but does it as fast as the old Vax did single precision?

	Given that we could do double floating math as fast as single
floating math, the only advantage single would have left would be saving
memory on large arrays.  Maybe we'd have to keep {load,store}
{single,double} instructions around for that.
--
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (01/30/88)
>Another approach is to execute all of the instructions come what may.
>This is the VLIW, trace-scheduling, approach commercialized by Multiflow.
>
>Andy "Krazy" Glew.  Gould CSD-Urbana.  1101 E. University, Urbana, IL 61801
>    aglew@gould.com - preferred, if you have nameserver

VLIW has been part of Floating Point Systems' machines for more than 10
years.
cik@l.cc.purdue.edu (Herman Rubin) (02/01/88)
Whatever accuracy is built into the floating point processor, more may
be needed.  If the appropriate hardware is included, almost double the
"usual" precision is relatively easily obtainable, but no more.  For
more, it is necessary to go to integer arithmetic - and even when there
is no integer arithmetic, it is necessary to force the floating point
hardware to treat integers.  Thus if more accuracy than anticipated is
needed, good integer arithmetic is called for.  This means unsigned
multiplication and division, among other not-necessarily-available
constructs.

I suggest that however many bits in the mantissa of floating point
numbers are directly treatable by the hardware, integer multiplication
of numbers of that accuracy, with a product of twice the length, be
provided in hardware; that this be unsigned; and that additional
reasonable hardware to facilitate the arithmetic be available.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
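[The primitive Rubin asks for - an unsigned multiply whose product is
twice the operand length - can be synthesized when it's missing, which
is exactly the cost he is arguing against.  A sketch in portable C,
building a 32x32->64 product from 16-bit halves (assuming 32-bit
unsigned longs):]

	void umul32x32(unsigned long a, unsigned long b,
	               unsigned long *hi, unsigned long *lo)
	{
		unsigned long a0 = a & 0xFFFF, a1 = a >> 16;
		unsigned long b0 = b & 0xFFFF, b1 = b >> 16;
		unsigned long p0 = a0 * b0;
		unsigned long p1 = a1 * b0 + (p0 >> 16);  /* can't overflow */
		unsigned long p2 = a0 * b1 + (p1 & 0xFFFF);

		*lo = (p2 << 16) | (p0 & 0xFFFF);
		*hi = a1 * b1 + (p1 >> 16) + (p2 >> 16);
	}

[With the hardware instruction Rubin wants, all of the above is a single
multiply.]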
dave@micropen (David F. Carlson) (02/01/88)
In article <3127@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> In article <28200089@ccvaxa> aglew@ccvaxa.UUCP writes:
> > I wonder how much interest might be out there for a true double-precision
> > floating point engine - one that did 64 bit floating point, or IEEE 80
> > bit extended floating point, or even 128 bit floating point, as its
> > native floating point mode, as fast as single precision on nearly any
> > other machine in its price range?
>
> Given that we could do double floating math as fast as single
> floating math, the only advantage single would have left would be saving
> memory on large arrays.  Maybe we'd have to keep {load,store}
> {single,double} instructions around for that.

This is a question/problem that has been gnawing on me for a long time.
It is the case (K&R p. 184) that "All floating arithmetic in C is
carried out in double-precision," regardless of whether it is declared
as single precision or not.  Given that a large majority of code being
written (at least at a typical UNIX site) is in C, why not dispose of
single precision floating point altogether and concentrate the silicon
real estate on something that my compiler might actually use?
Alternately, perhaps hardware/architecture people can convince C
compiler writers that (as I often find) single precision is more than
sufficient for all but the most numerically nasty problems, and should
be used for float arithmetic unless the code directs otherwise.  (It
seems to go against UNIX philosophy to have a compiler force double
precision arithmetic when source code clearly indicates that single
precision would suffice.)  I am willing to bet that my FPA has *never*
executed a single precision arithmetic operation (other than
double->single and single->double)!!
--
David F. Carlson, Micropen, Inc.
...!{ames|harvard|rutgers|topaz|...}!rochester!ur-valhalla!micropen!dave
"The faster I go, the behinder I get."  --Lewis Carroll
alanh@tekig4.TEK.COM (Al Hooton, the available sailing crew member) (02/02/88)
Gee... the first machine I programmed in assembly was the CDC 6600,
which has 60-bit words and 15-bit or 30-bit instructions...  I always
thought that the machines that DIDN'T take this approach were the weird
ones  :-)

What's the only thing better than a sailboat??  | Al Hooton, Tektronix Inc.
             TWO sailboats!!                    | ...tektronix!tekig4!alanh
                                                | (503) 627-3904, 591-8460
Discail^H^H^H^H^H^H^HDics^H^H^H^HDisclaiemer^H^H^H^H^H^H^H^H^H^H^H Oh, bag it!
rbbb@acornrc.UUCP (David Chase) (02/02/88)
In article <671@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> Whatever accuracy is built into the floating point processor, more may
> be needed. ...
> ... For more, it
> is necessary to go to integer arithmetic - and even when there is no
> integer arithmetic, it is necessary to force the floating point
> hardware to treat integers.

This is not a direct contradiction, but rather than fuss with
reimplementing the wheel, try using someone else's.  See Boehm,
Cartwright, Riggle, and O'Donnell, "Exact Real Arithmetic: A Case Study
in Higher Order Programming", in the 1986 Lisp and Functional
Programming Conference.  It's real (pun intended) and it is faster than
glacial.  It has been used to build interpreters, machine simulators,
and an RPN calculator.  What you are talking about doing sounds tedious
and error-prone.  (Yes, I know that what I describe is much slower than
hand-coded assembler integer arithmetic, but how fast a programmer are
you, anyway?  Can you TRUST the results of your programs?)

It is interesting that *programmers* are debating the pros and cons of
the various architectures.  Why worry?  Unless you are using assembly
language, the interesting things should be the performance of the
architecture+compiler combination and the price tag on the whole
package.  Who cares how the arithmetic is done, as long as it is fast
and accurate?

David Chase, Olivetti Research Center, Menlo Park
(chase%orc.uucp@unix.sri.com or oliveb!orc!chase)
guy@gorodish.Sun.COM (Guy Harris) (02/02/88)
> Alternately, perhaps hardware/architecture people can convince C
> compiler writers that (as I often find) single precision is more than
> sufficient for all but the most numerically nasty problems, and should
> be used for float arithmetic unless the code directs otherwise.

They already have: see the latest ANSI C draft standard.

The reason why double precision was specified in K&R was, in part (as
was, I think, stated here earlier), that it was easier to implement on
the PDP-11 that way, as the PDP-11's floating point instructions
(FP11-style) didn't come in single-precision and double-precision
flavors; instead, there was a mode bit that you had to set to
single-precision or double-precision mode with special instructions.
(It was certainly *possible* to support both single-precision and
double-precision arithmetic in compiled code - FORTRAN-IV Plus offers an
existence proof - but I suspect it was deemed unnecessary in the context
of C.)

	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (02/02/88)
In article <408@micropen> dave@micropen (David F. Carlson) writes:
>This is a question/problem that has been gnawing on me for a long time.
>It is the case (K&R p. 184) that "All floating arithmetic in C is
>carried out in double-precision," regardless of whether it is declared
>as single precision or not. ...  I am willing to bet that my FPA has
>*never* executed a single precision arithmetic operation (other than
>double->single and single->double)!!

This is the "boondoggle" of the C programming language, which I know and
love, that arose from its early PDP11+FP11 targets.  It is being
"fixed", read that PATCHED, in the ANSI spec by introducing the new type
"long double" and/or by the use of function prototypes.  Let's not
reinvent the FP11 arcanity which is the source of all this evil!

Many C compilers have some flag which causes float expressions to be
handled in single precision, and a whole lot of users are using this
option to get fast code.

For those of you who are younger than the PDP11: the FP11 was the
PDP11's floating point hardware, which had two modes (single and double
precision) and could only be in one mode at a time.  The cheap fix was
to do all computation in double; "double mode" did have a conversion
instruction which allowed loading a float, converting it to a double on
the fly.  Doing all computation in 64, 80, or 128 bits is no way to make
a computer run fast.
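[The difference under discussion fits in one function.  Under the K&R
rule, both operands below are widened and the multiply is done in
double, with a truncation back to float on return; under the ANSI draft
(or the single-precision compiler flags mentioned above) the whole thing
may legally be one float multiply:]

	float scale(float x, float k)
	{
		return x * k;	/* K&R: widen, double multiply, narrow.
				   ANSI: may stay entirely in float.   */
	}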
root@mfci.UUCP (SuperUser) (02/02/88)
In article <3127@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>In article <28200089@ccvaxa> aglew@ccvaxa.UUCP writes:
>> I wonder how much interest might be out there for a true double-precision
>> floating point engine - one that did 64 bit floating point, or IEEE 80
>> bit extended floating point, or even 128 bit floating point, as its
>> native floating point mode, as fast as single precision on nearly any
>> other machine in its price range?
>
>	I don't know about price range, but doesn't the FPS-164 do
>exclusively 64-bit floating point?
>
>	Actually, I wonder if a RISC-FPP would make sense. ...  If we
>took that silicon real estate and devoted it to making that single
>floating point format work faster, could we build a machine which only
>has double precision, but does it as fast as the old Vax did single
>precision?

I don't believe you can do double precision math as fast as single
precision math if both are implemented in the same technology.  If we're
including division and sqrts, derived via a high-radix iterative
procedure, it's certainly not true, since you get only one or two bits
of mantissa per trip through the ALU.  In that case the time to solution
is proportional to the number of bits of result you want.

If we're only discussing addition, subtraction, and multiplication, then
I still don't believe it.  There's an adder at the heart of each of
those, and its width decides its speed -- the wider, the slower (more
levels of carry-lookahead).  If your choice is between making one engine
to do both single and double precision (or dbl and quad), or making only
dbl (quad), then I think the engine that has less to do can be made
slightly faster.

My personal perception of the market for scientific computation is that
given a choice between more precision and more speed, speed wins hands
down.

Bob Colwell            mfci!colwell@uunet.uucp    203-488-6090
Multiflow Computer     <Above is personal opinion only!!>
175 N. Main St.
Branford CT 06405
root@mfci.UUCP (SuperUser) (02/03/88)
In article <3536@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu (Shahin Kahn) writes:
>>Another approach is to execute all of the instructions come what may.
>>This is the VLIW, trace-scheduling, approach commercialized by Multiflow.
>
>VLIW has been part of Floating Point Systems' machines for more than
>10 years.

Under that definition of VLIW, every microcoded machine ever made
qualifies.  It's never been that big a trick to make a machine with many
parallel functional units.  The trick is to be able to keep them busy
without hand coding, and to provide the right interface so that the
compiler can express the parallelism that it finds.  Read Ellis' thesis;
he presents reasonable arguments as to why FPS' architectures have hot
spots that make them a very difficult target for a compiler.

Bob Colwell            mfci!colwell@uunet.uucp    203-488-6090
Multiflow Computer     <Above is personal opinion only!!>
175 N. Main St.
Branford CT 06405
gwu@clyde.ATT.COM (George Wu) (02/04/88)
Robert Colwell posted:
> If we're only discussing addition, subtraction, and multiplication,
> then I still don't believe it.  There's an adder at the heart of each
> of those, and its width decides its speed -- the wider, the slower
> (more levels of carry-lookahead).

You can significantly reduce the time it takes to do a multiply by using
a faster algorithm.  The bottleneck here is the carry-lookahead.  A
matrix of carry-save adders eliminates this, giving you O(log2(n))
instead of O(n).  So the difference between 32 and 64 bits becomes
trivial.

Sorry I don't have more detail on the CSA matrix, but this is all from a
dim memory.  If anyone wishes, *email* me a request, and I'll provide
more information after digging out my notes.

	George J Wu
	rutgers!clyde!gwu
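[For the dim memory: the building block of a CSA matrix is the 3-to-2
compressor, which reduces three addends to two with no carry propagation
at all, so a tree of them finishes in O(log n) levels and only the final
two-operand add needs a real carry chain.  One stage, sketched in C:]

	/* One carry-save stage: three addends in, two out.  Each bit
	   position is summed independently - no ripple, no lookahead.
	   Invariant: a + b + c == *sum + *carry. */
	void csa(unsigned long a, unsigned long b, unsigned long c,
	         unsigned long *sum, unsigned long *carry)
	{
		*sum   = a ^ b ^ c;
		*carry = ((a & b) | (a & c) | (b & c)) << 1;
	}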
oconnor@sunset.steinmetz (Dennis M. O'Connor) (02/04/88)
An article by colwell@m6.UUCP (Robert Colwell) says:
] I don't believe you can do double precision math as fast as single
] precision math if both are implemented in the same technology.  If
] we're including division and sqrts, derived via a high-radix iterative
] procedure, it's certainly not true, since you get only one or two
] bits of mantissa per trip through the ALU.  In that case the time
] to solution is proportional to the number of bits of result you want.

NO.  Sorry.  But the fastest floating point division and root algorithms
use Newton-Raphson iteration, where the time to solution is proportional
to log2( number_of_bits_of_result ).  That is, if single precision takes
4 iterations, double precision will take 5, and quad will take 6.

] If we're only discussing addition, subtraction, and multiplication,
] then I still don't believe it.  There's an adder at the heart of each
] of those, and its width decides its speed -- the wider, the slower
] (more levels of carry-lookahead).  If your choice is between making
] one engine to do both single and double precision (or dbl and quad),
] or making only dbl (quad), then I think the engine that has less to
] do can be made slightly faster.

Doubling the width of a carry-select adder adds one gate delay to its
latency.  Carry-lookahead schemes have a similar less-than-proportionate
penalty for speedup.  Subtract is just an add.  Floating point adds also
need de- and re-normalization.  Multipliers are more complex: all the
fast stuff uses big Booth-encoded adder arrays.  But the final carry
resolution and the justification and normalization logic add
significantly to the latency of the multiply.

] My personal perception of the market for scientific computation is
] that given a choice between more precision and more speed, speed
] wins hands down.

Anyone wanna buy a high-speed FPU for 16-bit ( 5 bit exponent, 10 bit
fraction, 1 bit sign ) reals?  I didn't think so.  It doesn't matter how
fast you are if your answer is wrong.  For many problems being tackled
today, single precision introduces too much error.  So it is not used,
regardless of how much faster it is.  Besides, double-precision ( 64
bit ) floating multiply and add should only take about 20% longer than
single-precision multiply and add.  Same for division and square roots.

] Bob Colwell
] Multiflow Computer	<Above is personal opinion only!!>

Bob, you seem a little naive about floating-point hardware.  It's not
all shift-add or shift-compare-subtract anymore.
--
  Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
  ARPA: OCONNORDM@ge-crd.arpa
  "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"
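[The iteration count cited above follows from quadratic convergence:
each pass of x' = x*(2 - b*x) roughly doubles the number of correct bits
of 1/b.  A toy C version; the normalization assumption, crude seed, and
fixed iteration count are illustrative, not any real FPU's parameters:]

	/* Newton-Raphson reciprocal: quadratic convergence, so the
	   iteration count grows as log2 of the bits wanted. */
	double recip(double b)		/* assumes 1.0 <= b < 2.0 */
	{
		double x = 1.0 / 1.5;	/* crude seed; real hardware
					   uses a small lookup table */
		int i;

		for (i = 0; i < 6; i++)
			x = x * (2.0 - b * x);	/* doubles correct bits */
		return x;
	}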
hes@ecsvax.UUCP (Henry Schaffer) (02/04/88)
In article <230@m2.mfci.UUCP>, root@mfci.UUCP (SuperUser) writes:
> ...
> My personal perception of the market for scientific computation is
> that given a choice between more precision and more speed, speed
> wins hands down.
>
> Bob Colwell
> Multiflow Computer

I agree, but suggest that this is a result of the triumph of marketing
over good sense.  More speed *always* is useful, and there is a limit to
how much precision one needs, so there clearly is a domain in which more
speed is more useful than more precision - but I don't believe that the
usual single precision (4 byte) floating point is in that domain.  If
I'm right, then one should design for speed rather than precision -
shouldn't one?

(One could argue that the smaller word decreases storage needs as well
as increases speed - which is correct, but one can find many examples of
choices of algorithm in which the tradeoff between speed and accuracy
doesn't significantly affect storage requirements, and yet people either
choose speed or don't take the effort to choose accuracy.)

--henry schaffer  n c state univ
cik@l.cc.purdue.edu (Herman Rubin) (02/04/88)
In article <9408@steinmetz.steinmetz.UUCP>, oconnor@sunset.steinmetz (Dennis M. O'Connor) writes:
> An article by colwell@m6.UUCP (Robert Colwell) says:
> ] I don't believe you can do double precision math as fast as single
> ] precision math if both are implemented in the same technology.
> .....
> NO.  Sorry.  But the fastest floating point division and root
> algorithms use Newton-Raphson iteration, where the time to
> solution is proportional to log2( number_of_bits_of_result ).
> That is, if single precision takes 4 iterations, double precision
> will take 5, and quad will take 6.

You are essentially correct about the number of _iterations_.  However,
unless the accuracy of an iteration is built into the hardware, the cost
of an iteration grows more quickly than the length of the operands.

It is possible to design hardware such that the hardware "double
precision" may not be too much slower than the hardware "single
precision."  However, to extend precision beyond that built into the
hardware, it is necessary that the number be broken into blocks such
that operations yielding results of twice the number of bits in the unit
of multiplication are available.  In addition, floating point arithmetic
adds so many problems, especially if normalized, that it is highly
advisable to do the operations in integer arithmetic.

To summarize: arithmetic whose precision is beyond the built-in is only
somewhat expensive if the hardware is designed so that results of double
the length of the built-in are available (i.e., if 52-bit mantissas are
supported, 104-bit products must be available).  This can be done with
some difficulty if it has to be done in unnormalized floating point, it
is extremely difficult in normalized floating point, and it is easiest
in fixed point.

> Besides, double-precision ( 64 bit ) floating multiply and add
> should only take about 20% longer than single-precision multiply
> and add.  Same for division and square roots.

If additional "silicon real estate" is added, this is true.  The chip
area is doubled for addition, and probably goes up by a factor of 3-4
for multiplication.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
henry@utzoo.uucp (Henry Spencer) (02/04/88)
> This is the "boondoggle" of the C programming language, which I know and > love, that arose from its early PDP11+FP11 targets... Not entirely true. To be certain of my facts on this, I dug out the article Dennis Ritchie posted about it some five years ago. Although he acknowledges the FP11's contribution, he cites that as reason number two, not reason number one, for C doing everything in double. Reason number one was the problem of getting the types of function arguments and return values correct, combined with a desire to avoid multiple versions of library functions. He repeatedly and clearly states that this was the most important reason for this design decision. > ... It is being "fixed", read > that PATCHED, in the ANSI spec by introducing the new type "long double" > and or by the use of function prototypes... Well, no, not quite, it is being fixed -- really fixed -- by allowing the compiler to do floating-point arithmetic in float, rather than double, if all operands are float, and by permitting function arguments to be passed as floats given suitable prototypes. The problem of multiple library functions can't be entirely avoided for the speed demons, but function prototypes should make things manageable for ordinary users. While some of the permissiveness in the wording may be regrettable (e.g. it is still legal to do all arithmetic in double, for backward compatibility), the overall scheme really is a proper fix, not a "patch". There is room for debate about whether function prototypes are a Good Thing overall, but they do go a long way towards cleaning up this problem. "long double" is being introduced mostly because of a non-trivial number of machines which also have 128-bit floating point. It has nothing to do with the all-arithmetic-in-double issue. -- Those who do not understand Unix are | Henry Spencer @ U of Toronto Zoology condemned to reinvent it, poorly. | {allegra,ihnp4,decvax,utai}!utzoo!henry
yuval@taux01.UUCP (Gideon Yuval) (02/05/88)
In a recent message, Dennis O'Connor says:
>Message-ID: <9408@steinmetz.steinmetz.UUCP>
> NO. Sorry. But the fastest floating point division and root
> algorithms use Newton-Raphson iteration, where the time to
> solution is proportional to log2( number_of_bits_of_result ).

Since, under machine arithmetic, a/b is NOT equal to a*(1/b), and since
the IEEE floating-point standard demands EXACT rounding, I don't see
how Newton-Raphson is any use in an IEEE environment.
--
Gideon Yuval, +972-52-522255 (work), -2-690992 (home), yuval@taux01.nsc.com
oconnor@sunset.steinmetz (Dennis M. O'Connor) (02/09/88)
An article by yuval@taux01.UUCP (Gideon Yuval) says:
>Since, under machine arithmetic, a/b is NOT equal to a*(1/b), and since the
>IEEE floating-point standard demands EXACT rounding, I don't see how Newton-
>Raphson is any use in an IEEE environment.
>--
>Gideon Yuval, +972-52-522255 (work), -2-690992 (home), yuval@taux01.nsc.com

Newton-Raphson does not ( necessarily ) involve actually computing 1/b
and then multiplying by a, since as you mention this is NOT the same as
a/b. What can be done is to create a two-variable function F which,
when invoked as F( 1, b ), produces 1/b, and when invoked as F( a, b ),
produces a/b. This function does NOT form 1/b as part of its algorithm.
Instead it performs a series of operations that, if applied to the
second variable, would produce 1, and applies this same series of
operations to the first variable. Done in "infinite" precision this is
equivalent to a*(1/b); it is not equivalent to a*(1/b) in
finite-precision machines.

The first use of the algorithm I'm aware of was in the IBM 360/195, I
think. I don't see how rounding is any more a problem with this
approach than with any other approach.
--
    Dennis O'Connor   oconnor@sunset.steinmetz.UUCP ??
                      ARPA: OCONNORDM@ge-crd.arpa
   "Nuclear War is NOT the worst thing people can do to this planet."
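
What O'Connor describes -- driving the denominator to 1 while applying
the same multiplications to the numerator -- matches what is usually
called Goldschmidt's algorithm. A minimal C sketch (mine; the scaling
and iteration count are illustrative), in which 1/b is never formed:

#include <stdio.h>
#include <math.h>

/* Goldschmidt-style division: repeatedly multiply numerator and
 * denominator by (2 - denominator), so the denominator converges
 * to 1 and the numerator converges to a/b.
 */
double goldschmidt_div(double a, double b, int iterations)
{
    int e, i;
    double d = frexp(b, &e);   /* scale b into [0.5, 1) ...        */
    double n = ldexp(a, -e);   /* ... and a by the same power of 2 */

    for (i = 0; i < iterations; i++) {
        double f = 2.0 - d;    /* correction factor      */
        n *= f;                /* numerator   -> a/b     */
        d *= f;                /* denominator -> 1.0     */
    }
    return n;
}

int main(void)
{
    printf("%.17g\n", goldschmidt_div(1.0, 3.0, 6));  /* ~0.333... */
    return 0;
}
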
mbutts@mntgfx.mentor.com (Mike Butts) (02/09/88)
In article <231@m2.mfci.UUCP>, root@mfci.UUCP (SuperUser) writes:
> In article <3536@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu (Shahin Kahn) writes:
> >VLIW has been part of Floating Point Systems' machines for more than
> >10 years.
>
> Under that definition of VLIW, every microcoded machine ever made
> qualifies. It's never been that big a trick to make a machine with
> many parallel functional units. The trick is to be able to keep them
> busy without hand coding, and to provide the right interface so that
> the compiler can express the parallelism that it finds.
>
> Read Ellis' thesis; he presents reasonable arguments as to why FPS'
> architectures have hot spots that make it a very difficult target
> for a compiler.

1) Contrary to popular impression, FPS has been compiling directly from
FORTRAN to their 64-bit "LIW" 164 machine since the early '80s. The
code is of very high quality, and schedules operations onto the
multi-field instructions across a whole basic block. For example, the
memory system in the FPS 164 is pipelined, and software accounts for
the pipeline length, not a hardware scoreboarder. The FPS compiler made
full use of this pipeline, scheduling appropriately. It also has a very
effective loop pipeliner. In fact, I believe Josh Fisher & Co. used 164
hardware for their early work at Yale that grew into Multiflow.

I suppose it's just a consequence of FPS' non-existent paper publishing
and lack of marketing that people generally still think FPS machines
are hand-coded array processors. There is no intermediate code between
the FORTRAN and the 64-bit LIW, and the LIW fields are encoded at a
minicomputer level of expression, so it is not microcode.

Mentor's Compute Engine is also a 64-bit LIW machine, oriented more
towards EDA-type codes, with optimizing, scheduling C, FORTRAN and
Pascal compilers that make effective use of the machine. (For
historical purposes only - not a commercial ;-) )

Don't get me wrong, I have a high opinion of the way Multiflow is
developing VLIW technology, I just feel the early FPS people (I wasn't
one of them) should get more credit for blazing this trail 5-10 years
ago.

2) Please give us a reference for Ellis' thesis.
--
Mike Butts, Research Engineer             KC7IT           503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts  OR  mbutts@pdx.MENTOR.COM
These are my opinions, & not necessarily those of Mentor Graphics.
aglew@ccvaxa.UUCP (02/09/88)
>Since, under machine arithmetic, a/b is NOT equal to a*(1/b), and since the
>IEEE floating-point standard demands EXACT rounding, I don't see how Newton-
>Raphson is any use in an IEEE environment.
>
>Gideon Yuval, +972-52-522255 (work), -2-690992 (home), yuval@taux01.nsc.com

Use Newton-Raphson to approximate 1/b, and then correct.

Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801
    aglew@gould.com             - preferred, if you have nameserver
    aglew@gswd-vms.gould.com    - if you don't
    aglew@gswd-vms.arpa         - if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
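
One way to read "then correct" (my sketch, not Glew's; real hardware
computes the remainder exactly, and C99's fma() stands in for that
here): multiply by the approximate reciprocal, then use the remainder
a - q*b to nudge the quotient toward the correctly rounded result:

#include <math.h>
#include <stdio.h>

/* Approximate-then-correct division: q = a * x, where x ~ 1/b from
 * Newton-Raphson, is nearly right; one remainder-based correction
 * step repairs the last bit or two.
 */
double corrected_div(double a, double b, double recip_approx)
{
    double q = a * recip_approx;     /* nearly correct quotient      */
    double r = fma(-q, b, a);        /* remainder a - q*b, to full
                                        accuracy via fused multiply-add */
    return fma(r, recip_approx, q);  /* one correction step          */
}

int main(void)
{
    double x = 1.0 / 3.0;            /* stand-in for an N-R result   */
    printf("%.17g\n", corrected_div(2.0, 3.0, x));
    return 0;
}
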
johnl@ima.ISC.COM (John R. Levine) (02/10/88)
In article <1988Feb8.121200.370@mntgfx.mentor.com> mbutts@mntgfx.mentor.com (Mike Butts) writes:
>2) Please give us a reference for Ellis' thesis.

John R. Ellis, "Bulldog: A Compiler for VLIW Architectures," MIT Press,
Cambridge Mass., 1986. ISBN 0-262-05034-X

It won the 1985 ACM Doctoral Dissertation award, and is one heck of a
piece of work (which I'd say even if he hadn't been in the office
across from mine.)
--
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.something
Gary Hart for President -- Let's win one for the zipper.
root@mfci.UUCP (SuperUser) (02/13/88)
In article <9408@steinmetz.steinmetz.UUCP> sunset!oconnor@steinmetz.UUCP writes:
]An article by colwell@m6.UUCP (Robert Colwell) says:
]] I don't believe you can do double precision math as fast as single
]] precision math if both are implemented in the same technology.
]]...
]NO. Sorry. But the fastest floating point division and root
]algorithms use Newton-Raphson iteration, where the time to
]solution is proportional to log2( number_of_bits_of_result ).
]That is, if single-precision takes 4 iterations, double precision
]will take 5, and quad will take 6.
]
]] If we're only discussing addition, subtraction, and multiplication,
]] then I still don't believe it. There's an adder at the heart of each
]] of those, and its width decides its speed -- the wider, the slower
]] (more levels of carry-lookahead).
]]...
]Doubling the width of a carry-select adder adds one gate delay to
]its latency. Carry-lookahead schemes have a similar less-than-
]proportionate penalty for speed up. Subtract is just an add.
]Floating point adds also need de- and re-normalization.
]
]Multipliers are more complex : all the fast stuff uses big
]booth-encoded adder arrays. But the final carry-resolution,
]the justification and normalization logic add significantly to
]the latency of the multiply.
]
]] Bob Colwell
]
]Bob, you seem a little naive about floating-point hardware.
]It's not all shift-add or shift-compare-subtract anymore.
]
] Dennis O'Connor oconnor@sungoddess.steinmetz.UUCP ??
] ARPA: OCONNORDM@ge-crd.arpa
Whether I'm naive or not isn't relevant to anything. In fact, I
don't see where you and I disagreed anywhere above; I said more
bits takes longer, and you quantified it.
On the other hand, now that you brought it up, I'm with Gideon Yuval
on the subject of rounding. As far as I know, you can't get double
precision IEEE round-to-nearest sqrt or divide with only 5 iterations
of Newton-Raphson. I believe that gives you enough *accuracy*,
meaning all of the mantissa bits are meaningful at that point, but
you still won't know if the number you just got from iteration number
5 is the best finite-precision approximation to the real answer
(meaning the infinite-precision case) unless you then proceed to
another step which can figure that out. Iterative hardware, of
course, does one extra step and rounds the mantissa according to
whether the guard bit was a 1 or 0 (and if the mantissa then
overflows, the exponent must be bumped, and if that overflows...)
If you do the N-R in extended precision, obviously the conversion
handles the rounding exactly.
You might well argue that many programs are not sensitive to noise
in the lsb of an IEEE double precision sqrt or divide, and that's
certainly true. I have seen, however, at least one program that
required absolutely monotonic behavior in the lsb or it would take
unacceptably long to converge. And then there are programs which
demand IEEE accuracy everywhere.
Bob Colwell
Multiflow Computer <Above is personal opinion only!!>
175 N. Main St.
Branford CT 06405 mfci!colwell@uunet.uucp 203-488-6090
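
The extra step Colwell mentions can be pictured as a remainder
comparison (my illustration only; hardware does this with exact
internal arithmetic, approximated here by C99's fma()): a candidate
quotient is correctly rounded to nearest when no representable
neighbour leaves a smaller remainder.

#include <math.h>
#include <stdio.h>

/* After the iterations give a candidate quotient q whose mantissa
 * bits are all "meaningful", check whether q is the best
 * finite-precision approximation to a/b by comparing |a - q*b|
 * against the remainders of q's representable neighbours.
 */
int is_correctly_rounded(double a, double b, double q)
{
    double r  = fabs(fma(q, b, -a));       /* |a - q*b|             */
    double up = nextafter(q,  HUGE_VAL);   /* next value above q    */
    double dn = nextafter(q, -HUGE_VAL);   /* next value below q    */
    double ru = fabs(fma(up, b, -a));
    double rd = fabs(fma(dn, b, -a));
    return r <= ru && r <= rd;             /* no neighbour is closer */
}

int main(void)
{
    double q = 2.0 / 3.0;
    printf("%d\n", is_correctly_rounded(2.0, 3.0, q));  /* expect 1 */
    return 0;
}
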
root@mfci.UUCP (SuperUser) (02/21/88)
>>
>> Under that definition of VLIW, every microcoded machine ever made
>> qualifies. It's never been that big a trick to make a machine with
>> many parallel functional units. The trick is to be able to keep them
>> busy without hand coding, and to provide the right interface so that
>> the compiler can express the parallelism that it finds.
>>
>
>1) Contrary to popular impression, FPS has been compiling directly from
>FORTRAN to their 64-bit "LIW" 164 machine since the early '80s. The code is
>of very high quality, and schedules operations onto the multi-field
>instructions across a whole basic block. For example, the memory system
>in the FPS 164 is pipelined, and software accounts for the pipeline length,
>not a hardware scoreboarder. The FPS compiler made full use of this pipeline,
>scheduling appropriately. It also has a very effective loop pipeliner.
>In fact, I believe Josh Fisher & Co. used 164 hardware for their early work at
>Yale that grew into Multiflow.

The above comments are well taken, but the 164 had one basic flaw. The
164 architecture was designed years before anybody contemplated writing
a compiler for it. Largely as a result, there are problems with the
architecture that make it of limited suitability as a compiler target.

For example, the machine has only half the register bandwidth required
to support the functional units' requirements for operands and results.
If the functional units are to be used at full speed, the pipelined
output of one must be used "live" as an input to the other. This causes
great difficulty for a scheduler; the machine cannot be modeled as
independent functional units, and scheduling cannot proceed based only
on data precedence and the availability of the required functional
unit. Symptoms of these limitations show up in the wide disparity
between performance achieved by FPS' compilers and by hand coders, and
in the resulting commitment by FPS to the production of large
mathematical libraries written in hand code. John Ellis' thesis
discusses these issues in greater depth.

>2) Please give us a reference for Ellis' thesis.

John R. Ellis, "Bulldog: A Compiler for VLIW Architectures". 1986. ACM
Doctoral Dissertation Award Series, MIT Press, Cambridge, MA.

See also Colwell et al., "A VLIW Architecture for a Trace Scheduling
Compiler", in Proceedings, ASPLOS-II (Computer Architecture News, OS
Review, or SIGPLAN Notices, October 1987) for an update on our
architectural approach.