baum@apple.UUCP (Allen J. Baum) (01/01/70)
--------
[]
A while back, somebody wrote:
>BTW, there is an address modification procedure which is missing on all
>machines I have seen except the UNIVAC's. That is to consider the register
>file as a memory block and allow indexing on it...

Somebody else wrote:
>The PDP-10 also did this. The first 16 memory locations were the registers.
>There was an option to get fast (non-core) memory for these few bits.

And Brian Utterback replied:
>Another advantage the PDP-10 had by mapping the registers to the memory space,
>other than indexing, was in execution. You could load a short loop into the
>registers and jump to them! The loop would run much faster, executing out
>of the registers.

This is not the case exactly. EMACS did, in fact, use this trick to
significantly speed up such things as 'search', but when the KL-10 & -20
processors came out, this trick ran SLOWER than running code out of
regular RAM.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum  (408)973-3385
cik@l.cc.purdue.edu (Herman Rubin) (09/21/87)
There are many instructions which are easy to implement in hardware, but
for which software implementation may be so costly that a procedure using
the instruction may be worthless. Some of these instructions have been
implemented in the past and have died because the ill-designed languages
do not even recognize their existence. Others have not been included due
to the non-recognition of them by the so-called experts and by the stupid
attitude that something should not be implemented unless 99.99% of the
users of the machine should be able to want the instruction _now_. As you
can tell from this article, I consider the present CISC computers to be
RISCy.

One situation of this type which has been discussed in this newsgroup is
the proper treatment of quotient and remainder for integer division when
the numbers are not both positive. Everyone took a stand for some
specification. I say "let the user decide." Even if both signs are
positive, which alternative I want for one problem may not be the one I
want for another problem. Having 2-4 bits to specify the alternative for
each sign combination should take very little run time and little space.

Since floating point machines first came out, the much needed instruction
to divide one floating point number by another with an integer quotient
and a floating remainder has not, to my knowledge, appeared. If you need
to see uses of this, look at any good trigonometric or exponential
subroutine.

With the advent of floating point, fixed point operations seem to be
vanishing. On the early floating point machines, frequently numerical
functions would be done in fixed point for speed and accuracy. The need
for this has not changed, but the availability has. Also, it should be
possible to convert between fixed and floating point without the overhead
of a multiply; this was possible on the UNIVAC 1108 and 1110. Another
operation is to multiply a floating point number by a power of 2 by
adding to the exponent; this was on the CDC 3600. The need for this as a
separate instruction is because of the possibility of overflow and/or
underflow.

I have run into situations in non-uniform random number generation for
which considerable time is needed to carry out tests which would be
better handled as exceptions. One of these is to decrement an index, use
the result for a read or write instruction if non-negative, and interrupt
if negative to a user-provided exception handler. Another is to find the
distance to the next one in a bit stream, with an interrupt if the stream
is emptied. There are procedures which are extremely efficient
computationally, but for which the overhead is large if this is not
hardware; if a higher level language has to be used for the instruction,
I would call the cost prohibitive. The VAXen have in hardware (at least
for some machines) a FFO instruction, but it requires three other
operations, one of which is a conditional, to get one result.

On many machines, even if fixed point arithmetic is in the hardware,
multiplication and division cannot be unsigned. All of the multiple
precision software with which I am familiar is sign-magnitude. An
additional hardware bit to say whether signed or unsigned arithmetic is
to be used would be cheap. (It is extremely difficult to program multiple
precision arithmetic in floating point. It is difficult on machines, of
which there are many, which do not have reasonable integer
multiplication.)

I make no pretense that this list is complete.
While I might find it useful, I would not suggest that transcendental
functions (except for the CORDIC routines) be hardware, as they would be
merely encoding a software algorithm using the existing instructions as a
hardware, rather than software, series of instructions. What I am
suggesting is that instructions manipulating the bits in different ways,
or using easy branching at nanocode time instead of slow branching when
the hardware cannot use the non-restrictive nature of the branch, should
be. The cost of the CPU is usually a small part of the cost of the
computer.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
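[As a concrete illustration of the "let the user decide" point above: the
choice between, say, truncating and flooring division can be expressed as
a small routine on top of C's own (truncating) divide. This is only a
sketch; the routine name and convention codes are invented for the
example, and the sign of % for negative operands was implementation-
defined in C at the time, so the common truncating behavior is assumed.]

    #include <stdlib.h>

    /* Hypothetical convention selector:
       0 = truncate toward zero (C's "/"),
       1 = floor (round quotient toward minus infinity).
       A hardware version could encode one such choice per sign
       combination in a few mode bits, as Rubin suggests. */
    long divide(long a, long b, int convention, long *rem)
    {
        long q = a / b;      /* truncating quotient (assumed) */
        long r = a % b;      /* remainder with the sign of a (assumed) */

        if (convention == 1 && r != 0 && ((r < 0) != (b < 0))) {
            q -= 1;          /* adjust so r takes the sign of the divisor */
            r += b;
        }
        *rem = r;
        return q;
    }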
haynes@ucscc.UCSC.EDU.ucsc.edu (99700000) (09/22/87)
In article <581@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>...
>With the advent of floating point, fixed point operations seem to be
>vanishing. On the early floating point machines, frequently numerical
>functions would be done in fixed point for speed and accuracy. The need
>for this has not changed, but the availability has. Also, it should be
>possible to convert between fixed and floating point without the overhead
>of a multiply; this was possible on the UNIVAC 1108 and 1110.

Burroughs (pre-Unisys) handles this by making all numbers floating point.
Integers simply have a zero exponent. The normalizing algorithm tries to
keep the exponent zero rather than invariably normalizing. So
fixed-to-float takes no time; float-to-fixed may take time.

haynes@ucscc.ucsc.edu  haynes@ucscc.bitnet  ..ucbvax!ucscc!haynes
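[A toy software model may make the asymmetry clearer. This is only a
sketch of the idea, not the actual Burroughs word format; the struct and
function names are made up, and a nonnegative mantissa is assumed so the
shifts are well defined.]

    /* value = mantissa * 2^exponent; an "integer" is any value whose
       exponent happens to be zero. */
    struct num { long mantissa; int exponent; };

    struct num from_int(long i)       /* fixed -> float: no work at all */
    {
        struct num n;
        n.mantissa = i;
        n.exponent = 0;
        return n;
    }

    long to_int(struct num n)         /* float -> fixed: may need a shift */
    {
        if (n.exponent >= 0)
            return n.mantissa << n.exponent;
        return n.mantissa >> -n.exponent;   /* truncates; real hardware
                                               would round and check
                                               for overflow */
    }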
tim@amdcad.AMD.COM (Tim Olson) (09/22/87)
In article <581@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
+-----
| There are many instructions which are easy to implement in hardware, but
| for which software implementation may even be so costly that a procedure
| using the instruction may be worthless. Some of these instructions have
| been implemented in the past and have died because the ill-designed
| languages do not even recognize their existence. Others have not been
| included due to the non-recognition of them by the so-called experts and
| by the stupid attitude that something should not be implemented unless
| 99.99% of the users of the machine should be able to want the instruction
| _now_. As you can tell from this article, I consider the present CISC
| computers to be RISCy.
+-----
From the following examples, it sure appears as if you are arguing for
"letting the user decide" how certain functions are implemented. The
easiest (and probably best) way to do this is to provide a fast, fixed
set of primitive operations, and let users build what they need from
that set (i.e. RISC).
+-----
| One situation of this type which has been discussed in this newsgroup is
| the proper treatment of quotient and remainder for integer division when
| the numbers are not both positive. Everyone took a stand for some
| specification. I say "let the user decide." Even if both signs are
| positive, which alternative I want for one problem may not be the one I
| want for another problem. Having 2-4 bits to specify the alternative for
| each sign combination should take very little run time and little space.
+-----
With the correct primitives, you can easily code these as procedures
which will run *as fast* as standard div, mod, rem.
+-----
| I have run into situations in non-uniform random number generation for which
| considerable time is needed to carry out tests which would be better handled
| as exceptions. One of these is to decrement an index, use the result for a
| read or write instruction if non-negative, and interrupt if negative to a
| user-provided exception handler.
+-----
Fast (free) detection of over/underflow conditions is important,
especially to efficiently implement languages with runtime
bounds-checking and exception handling. This is why the Am29000 (and
other RISC processors) have, in addition to the standard add and sub
instructions, adds (add signed) and addu (add unsigned) which trap on
overflow/underflow conditions.

-- Tim Olson
Advanced Micro Devices
(tim@amdcad.amd.com)
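[For illustration, here is what the same check costs when the hardware
does not trap: an extra compare and branch on every addition. A hedged C
sketch only; the function name is invented, and the trap is simulated
with abort().]

    #include <limits.h>
    #include <stdlib.h>

    /* Software emulation of an "add signed, trap on overflow"
       primitive.  A hardware adds instruction makes this check free. */
    long add_signed_checked(long a, long b)
    {
        if ((b > 0 && a > LONG_MAX - b) ||
            (b < 0 && a < LONG_MIN - b))
            abort();            /* stand-in for the overflow trap */
        return a + b;
    }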
henry@utzoo.UUCP (Henry Spencer) (09/23/87)
> One situation of this type which has been discussed in this newsgroup is
> the proper treatment of quotient and remainder for integer division when
> the numbers are not both positive. Everyone took a stand for some
> specification. I say "let the user decide."...

Of course, on most of the RISC machines the user *does* have the choice,
since division is generally done in software rather than hardware.

> Since floating point machines first came out, the much needed instruction
> to divide one floating point number by another with an integer quotient
> and a floating remainder has not, to my knowledge, appeared...

Although you don't get it bundled into one instruction, the pieces needed
to do this are present in any IEEE floating-point implementation, e.g. the
68881. The remainder can be had with one instruction (on the 68881, FMOD
or FREM depending on exactly what you're doing), the quotient would take
two I think (just a divide and a convert-to-integer).

> ... Another
> operation is to multiply a floating point number by a power of 2 by
> adding to the exponent; this was on the CDC 3600...

FSCALE on the 68881.

> ... Another is to find the distance to the next
> one in a bit stream, with an interrupt if the stream is emptied...

On most modern machines it should be possible to write a loop that will do
this at very nearly full memory bandwidth, looking at a byte or a word at
a time and using table lookup for the final bit-picking. I am constantly
amused by people who scream for bit-flipping instructions when doing it a
byte or a word at a time, using table lookup for non-trivial functions, is
still faster. "Work smart, not hard".

> On many machines, even if fixed point arithmetic is in the hardware,
> multiplication and division cannot be unsigned...

Again, on the RISCs you generally get your choice, because multiply is
done in tuned software rather than hardware. (And it's usually faster
than a CISC multiply, since most multiplies are by small integer constants
that a RISC can generate custom code for.)

> I would not suggest that transcendental functions (except for the CORDIC
> routines) be hardware, as they would be merely encoding a software algorithm
> using the existing instructions as a hardware, rather than software, series
> of instructions...

Actually, there is one fairly good argument for putting the transcendentals
in hardware, to wit making a high-quality implementation available cheaply.
The transcendentals in (say) the 68881 are *better* than anything you will
come up with in software without large amounts of work. You can buy a 68881
for far less than it would cost you to commission or license equivalent code.

> What I am suggesting is that instructions manipulating the
> bits in different ways, or using easy branching at nanocode time instead of
> slow branching when the hardware cannot use the non-restrictive nature of the
> branch, should be...

Note that many RISCs are aimed quite specifically at this objective:
giving the programmer (or, more usually, compiler writer) detailed control
of the hardware, rather than putting a half-baked interpretive layer in
between. To misquote the famous adage, "microcode stands between the user
and the hardware".
--
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry
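[A sketch of the byte-at-a-time, table-lookup bit scan Spencer describes.
This assumes the stream is an array of unsigned chars with the most
significant bit first; the table and function names are made up for the
example, and init_table() must be called once before the scan.]

    /* first_one_in_byte[b] = bit number (0 = MSB) of the leading 1 in
       byte b, or 8 if b is zero. */
    static int first_one_in_byte[256];

    static void init_table(void)
    {
        int b, i;
        for (b = 0; b < 256; b++) {
            first_one_in_byte[b] = 8;
            for (i = 0; i < 8; i++)
                if (b & (0x80 >> i)) { first_one_in_byte[b] = i; break; }
        }
    }

    /* Distance, in bits, from the start of the stream to the first 1
       bit; -1 if the stream is exhausted (the "interrupt" case). */
    long find_first_one(const unsigned char *stream, long nbytes)
    {
        long i;
        for (i = 0; i < nbytes; i++)   /* runs at near memory bandwidth */
            if (stream[i] != 0)
                return i * 8 + first_one_in_byte[stream[i]];
        return -1;                     /* caller handles the exception */
    }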
cik@l.cc.purdue.edu.UUCP (09/23/87)
In article <18336@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> In article <581@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> +-----
> From the following examples, it sure appears as if you are arguing for
> "letting the user decide" how certain functions are implemented. The
> easiest (and probably best) way to do this is to provide a fast, fixed
> set of primitive operations, and let users build what they need from
> that set (i.e. RISC).
> +-----
> +-----
> With the correct primitives, you can easily code these as procedures
> which will run *as fast* as standard div, mod, rem.
> +-----
> -- Tim Olson
> Advanced Micro Devices
> (tim@amdcad.amd.com)

Olson greatly underestimates the number of RISC instructions needed to do
even a fair job. If the user is going to be able to do the things needed
efficiently, the combining of instructions must be done at the "nanocode"
level. Frankly, I think that having thousands of instructions, arranged
so that decoding patterns can be used, is much easier.

One of the reasons for the problem is that such things as arranging which
way the quotient and remainder are formed depending on the signs of the
arguments is extremely clumsy in software unless at least the adjustment
procedure is hardware. I cannot think of any reasonable method even in
"microcode" to do the trivial operations to achieve this for the four
combinations of signs, especially if the choices change at run time. I
have read on this net of hardware which enables the user to specify
_sometimes_ that a particular branch should be assumed, with an exception
otherwise; I have not seen such. I have not seen any remotely efficient
bit-handling hardware on any machine. (BTW, I am interested in seeing
what modifications I must make to my procedures based on the architecture
of particular machines; if you know of one which is sufficiently
different, I would be interested.)

Of the machines I know, only the 680xx, 8086 and similar (although
otherwise I consider its architecture horrible), and 16/320xx have (I
believe) the right integer operations. To do unsigned multiplication with
only signed multiplication available requires that 2 conditional
additions must be done after the multiplication; as machines get faster,
conditional operations are bad except in nanocode. Unsigned division is
so complicated that one introduces other inefficiencies instead.

BTW, there is an address modification procedure which is missing on all
machines I have seen except the UNIVAC's. That is to consider the register
file as a memory block and allow indexing on it. Another missing procedure
is to enable the register file to be treated as a block of memory so that
bytes or short words can be addressed. These two operations can be
combined on a byte-addressable machine.

I am definitely not the person to run this, but maybe there should be a
mailing list to communicate suggestions about the manifold instructions
which would be profitable in hardware. Also, I know a little about
microcode (enough to know that what I think should be done cannot be done
that way) and very little about nanocode. I find it relatively easy to
read the manuals describing the machine instructions.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
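[For reference, the "2 conditional additions" Rubin mentions are the
correction needed to recover the high word of an unsigned product from a
signed double-length multiply. A hedged sketch, using modern C fixed-width
types purely for brevity; the function name is invented.]

    #include <stdint.h>

    /* High word of an unsigned 32x32 multiply, on a machine that only
       has a signed 32x32 -> 64 multiply. */
    uint32_t umulh_from_signed(uint32_t a, uint32_t b)
    {
        int64_t  sprod = (int64_t)(int32_t)a * (int32_t)b; /* signed mult */
        uint32_t hi = (uint32_t)((uint64_t)sprod >> 32);

        if ((int32_t)a < 0) hi += b;   /* conditional correction #1 */
        if ((int32_t)b < 0) hi += a;   /* conditional correction #2 */
        return hi;                     /* the low word needs no fixup */
    }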
ccplumb@watmath.waterloo.edu (Colin Plumb) (09/24/87)
In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>BTW, there is an address modification procedure which is missing on all
>machines I have seen except the UNIVAC's. That is to consider the register
>file as a memory block and allow indexing on it. Another missing procedure
>is to enable the register file to be treated as a block of memory so that
>bytes or short words can be addressed. These two operations can be combined
>on a byte-addressable machine.

The PDP-10 also did this. The first 16 memory locations were the registers.
There was an option to get fast (non-core) memory for these few bits.

(I think this is interesting, since these days you'd implement the
registers on-chip (on the CPU board, at least), and handle memory accesses
to them as a special case.)

-Colin (ccplumb@watmath)

P.S. No, I'm not that old - I just read the manual today, thinking I should
know about the first (as far as I know) machine to have a register file.
alverson@decwrl.UUCP (09/24/87)
In article <8646@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Although you don't get it bundled into one instruction, the pieces needed
>to do this are present in any IEEE floating-point implementation, e.g. the
>68881. The remainder can be had with one instruction (on the 68881, FMOD
>or FREM depending on exactly what you're doing), the quotient would take
>two I think (just a divide and a convert-to-integer).

Careful now. FREM gets you the remainder you want. However, getting the
integer quotient is actually harder. The problem occurs when the quotient
is larger than an integer. Often you want the low few bits of the integer
quotient when using FREM to do range reduction. The last time I looked the
IEEE standard did not provide for these. However, most chips give 3 or so
of the low end bits, since the designers have actually thought about why
you want FREM.

Overall though, I agree with Henry. The main reason most of the
complicated instructions mentioned do not show up in RISC's is that there
is no way to express the action in C or Pascal such that the compiler can
reasonably determine to select the complicated instruction over a sequence
of simpler ones.

Bob
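[To show the range-reduction use being discussed: reduce x modulo pi/2
while keeping only the low bits of the integer quotient to select the
quadrant. This is a naive double-precision sketch for illustration only;
it loses accuracy for large |x| (a real library does the reduction in
extended precision), the function name is invented, and M_PI is assumed
to be provided by math.h.]

    #include <math.h>

    /* Reduce x into roughly [-pi/4, pi/4] and report the quadrant
       (only the low 2 bits of the integer quotient matter). */
    double reduce_pi_over_2(double x, int *quadrant)
    {
        double q = floor(x / (M_PI / 2.0) + 0.5);  /* nearest integer */

        *quadrant = (int)fmod(q, 4.0);             /* low 2 bits of q */
        if (*quadrant < 0)
            *quadrant += 4;
        return x - q * (M_PI / 2.0);
    }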
daveb@geac.UUCP (Brown) (09/24/87)
In article <8646@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
|[discussion of operations in the MC 68881]
|| ... Another is to find the distance to the next
|| one in a bit stream, with an interrupt if the stream is emptied...
|
| On most modern machines it should be possible to write a loop that will do
| this at very nearly full memory bandwidth, looking at a byte or a word at
| a time and using table lookup for the final bit-picking. I am constantly
| amused by people who scream for bit-flipping instructions when doing it a
| byte or a word at a time, using table lookup for non-trivial functions, is
| still faster. "Work smart, not hard".
|
The distance-to-next bit instruction is, for operands of about 2-4 words
in length, called "floating normalize". A chess program (Johnathon
Schaefer's) I once worked on used this...
--
David Collier-Brown.                    {mnetor|yetti|utgpu}!geac!daveb
Geac Computers International Inc.,    |  Computer Science loses its
350 Steelcase Road, Markham, Ontario, |  memory (if not its mind)
CANADA, L3R 1B3 (416) 475-0525 x3279  |  every 6 months.
mash@mips.UUCP (09/24/87)
In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
....general discussion of wishlists of things that should be in hardware;
more support for various flavors of multiply/divide, bit manipulation, etc.
>Olson greatly underestimates the number of RISC instructions needed to do
>even a fair job. If the user is going to be able to do the things needed
>efficiently, the combining of instructions must be done at the "nanocode"
>level. Frankly, I think that having thousands of instructions, arranged
>so that decoding patterns can be used, is much easier. ...

Earlier, Herman had written:
>There are many instructions which are easy to implement in hardware, but
>for which software implementation may even be so costly that a procedure
>using the instruction may be worthless. Some of these instructions have
>been implemented in the past and have died because the ill-designed
>languages do not even recognize their existence. Others have not been
>included due to the non-recognition of them by the so-called experts and
>by the stupid attitude that something should not be implemented unless
>99.99% of the users of the machine should be able to want the instruction
>_now_. As you can tell from this article, I consider the present CISC
>computers to be RISCy....

Sigh. There is a useful point embedded here, but it sounds like a topic
I'd thought beaten to death in this newsgroup has to be reviewed one more
time.

Legitimate point:
a) If an operation is not supported in hardware, AND IF
b) Doing it in software takes a lot longer, AND IF
c) Programs use that operation with high dynamic frequency, THEN
d) Providing that operation in hardware might be worthwhile.

For example, there have been some ludicrous statements in the press about
"RISC machines can't do floating point": if floating point is important to
you, you'd better include it in the instruction set.

However, the general approach (anecdotal) is not the way people design
computers, these days, and for good reason. As noted before here, a
plausible way to design a computer is:

1) Pick a REPRESENTATIVE set of benchmarks.
2) Do a first-cut architecture, based on past experience.
3) Do compilers.
4) Add or delete features, measuring the impact by running
   compiled/assembled code through architectural simulators.
5) Iterate until you can't find anything else to add that actually
   improves performance by some noticeable amount, or until you run out
   of time.

This is not a perfect recipe, of course. For example, if the benchmark
set is chosen poorly, bad surprises will happen. However, what people
don't do these days is design architectures by saying "I remember code
where it would have been handy to have operation X, which was stupidly
not provided. Let's add it."

What's needed to be useful is NOT a list of anecdotes about features that
might be useful in some cases [and indeed, they might], or that look
interesting when one reads the instruction manuals, or that look like they
save a few cycles here or there in the context of small code sequences,
but hard DATA about the benefits of including them. For example, much
more useful to people who design computers is reasoning like:

1) Here is a specific application program, or even better, this is known
   to be typical of an important class of applications.
2) Running on a computer that lacks feature X, with appropriate
   instrumentation, we've found that the addition of X would reduce the
   runtime by Y%.

One of the other comments wished to have back the UNIVAC-style addressing
of registers, with no backup for why this would be good. As it happens:
a) This can be costly in hardware, especially in single-chip
   implementations, unless the whole architecture is built around it
   (like CRISP).
b) It can complexify life for optimizing compilers.
Show us some data why this feature is worth more than it hurts. Data, not
anecdotes.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {decvax,ucbvax,ihnp4}!decwrl!mips!mash OR mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
baum@apple.UUCP (09/24/87)
--------
[]
>In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>
>Olson greatly underestimates the number of RISC instructions needed to do
>even a fair job.

I don't think that Olson is underestimating anything. Most RISC
architectures have a divide step instruction, which is precisely what
underlying microcode would use. Furthermore, in order to get
signed/unsigned variations, microcode has to do the same kinds of
conditional operations that a RISC would have to do. It is a mistake to
assume that a RISC would be slower to do these than a microcoded engine;
some RISC machines (Acorn ARM, HP Spectrum) have support for conditional
operations. Furthermore, any hardware support in excess of this will
inevitably slow the basic cycle down (I've been through the exercise).

> I have not seen any remotely efficient bit-handling hardware on any machine.

Check out the HP Spectrum.

> To do unsigned
>multiplication with only signed multiplication available requires that
>2 conditional additions must be done after the multiplication; as machines
>get faster conditional operations are bad except in nanocode. Unsigned
>division is so complicated that one introduces other inefficiencies instead.

Again, you make the mistake of believing that for some reason nanocode is
somehow magically faster or more efficient than a well designed
instruction set. Wrong. Microcode, or nanocode, has to go through all the
same operations that assembly level code does. While special purpose data
paths can be included to make the sign correction run faster, it is just
that: special purpose. It can't be used for anything else, it may have the
effect of making everything else run slower, and making division run a
cycle or two faster will have no noticeable effect on performance. It's
VERY difficult to make fixed point division run faster than a bit per
cycle, without a LOT of hardware. By leaving out the special purpose
speedup stuff, you can afford to include some VERY useful general purpose
speedup stuff: More registers, perhaps, or branch folding logic a la the
AT&T CRISP.

>BTW, there is an address modification procedure which is missing on all
>machines I have seen except the UNIVAC's. That is to consider the register
>file as a memory block and allow indexing on it. Another missing procedure
>is to enable the register file to be treated as a block of memory so that
>bytes or short words can be addressed. These two operations can be combined
>on a byte-addressable machine.

The original PDP-10 from DEC allowed that, originally because registers
were real expensive, so that hardware registers were an expensive (but
effective) speedup option; otherwise, they went to real memory. Registers
were the first 16 locations in memory. This came back to bite them in the
later KL models, because instructions could be put into the registers and
executed from them. While this was a real speedup hack on the older
models, it slowed down the newer ones.

The AT&T CRISP doesn't have any registers. But, by caching the top of the
local frame, references to locals are effectively turned into register
references, and you get register windows as well. You can index into these
'registers', byte access them, and reference them with short 5-bit fields
in the instruction.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum  (408)973-3385
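[To make "divide step" concrete: the loop below is the kind of one-bit-
per-iteration restoring division that a divide-step instruction lets you
unroll, one quotient bit per step. This is a generic sketch, not any
particular machine's divide-step semantics; the function name and the use
of fixed-width types are assumptions made for the example.]

    #include <stdint.h>

    /* Unsigned restoring division, one quotient bit per "step".
       divisor must be nonzero; a signed version would wrap this with
       the sign fixups being argued about above. */
    uint32_t udiv_by_steps(uint32_t dividend, uint32_t divisor,
                           uint32_t *rem)
    {
        uint32_t quotient = 0, remainder = 0;
        int i;

        for (i = 31; i >= 0; i--) {              /* 32 "divide steps" */
            remainder = (remainder << 1) | ((dividend >> i) & 1);
            quotient <<= 1;
            if (remainder >= divisor) {          /* conditional subtract */
                remainder -= divisor;
                quotient |= 1;
            }
        }
        if (rem)
            *rem = remainder;
        return quotient;
    }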
earl@mips.UUCP (09/24/87)
In article <8646@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
> Actually, there is one fairly good argument for putting the transcendentals
> in hardware, to wit making a high-quality implementation available cheaply.
> The transcendentals in (say) the 68881 are *better* than anything you will
> come up with in software without large amounts of work. You can buy a 68881
> for far less than it would cost you to commission or license equivalent code.

The 68881 transcendentals are not implemented in hardware; they are
implemented in microcode. I believe the extra 0.5-1.5ulp of accuracy of
the 68881 is due to the use of extended precision calculations, not to
either hardware or algorithm (simple rational approximations are very
accurate too when evaluated in extended precision). This is likely one
reason why the numerical analysts put extended precision in the IEEE
standard.

Usually when someone says "X should be in hardware", it's because they
haven't thought very much about how to solve the problem. Usually the
easiest way to solve a problem without thinking about it is to say
"someone else should do it". In this case "the hardware designer should
solve it". If an extra half bit of accuracy for transcendentals is
important (I'm not sure it is), then the right way to accomplish this is
to add IEEE extended precision hardware, not transcendental instructions.
In some ways, this is the RISC approach: when someone says "I need X to do
Y", first ignore X, and then figure out the right way to provide
general-purpose building blocks to accomplish Y.

A final note: implementing transcendentals in 68881 microcode did nothing
to make them fast. The cycle counts for sin, cos, tan, atan, log, exp,
etc. average about 3.5 times longer for 68881 instructions than for MIPS
R2000 libm subroutines.
alan@pdn.UUCP (Alan Lovejoy) (09/25/87)
In article <705@gumby.UUCP> earl@mips.UUCP (Earl Killian) writes: >extended precision hardware, not transcendental instructions. In some >ways, this is the RISC approach: when someone says "I need X to do Y", >first ignore X, and then figure out the right way to provide >general-purpose building blocks to accomplish Y. Interesting. Sounds like my philosophy for programming language design: the best way to provide a feature is to build the proper abstraction mechanisms and primitive operations into the language that will provide the most general solution to the problem which makes the feature desirable. The analogy with 'primitive operations' in hardware is clear, but what's the hardware equivalent of an abstraction mechanism? Perhaps some of the features of the Smalltalk Virtual Machine represent hardware abstraction mechanisms. I think user-definable microcode routines would also qualify. Hmmm... --alan@pdn
greg@xios.XIOS.UUCP (Greg Franks) (09/25/87)
cik@l.cc.purdue.edu (Herman Rubin) writes about the need for all sorts of fancy instructions done in hardware. Might I suggest the good-ole 780 and certain Burroughs machines which let you load your own microcode. The details of this operation are left as an exercise for the reader.... (especially because a: I don't have a VAX to play with, and b: I don't know the details of this operation myself :-) ). -- Greg Franks XIOS Systems Corporation, 1600 Carling Avenue, (613) 725-5411 Ottawa, Ontario, Canada, K1Z 8R8 uunet!mnetor!dciem!nrcaer!xios!greg "Vermont ain't flat!"
esf00@amdahl.amdahl.com (Elliott S. Frank) (09/25/87)
>In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>BTW, there is an address modification procedure which is missing on all
>>machines I have seen except the UNIVAC's. That is to consider the register
>>file as a memory block and allow indexing on it. Another missing procedure
>>is to enable the register file to be treated as a block of memory so that
>>bytes or short words can be addressed. These two operations can be combined

In article <14750@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu
(Colin Plumb) writes:
> ...PDP-10...
> the first (as far as I know) machine to have a register file.

The feature was part of the UNIVAC SS-90 and the 110x machines (all of
which [architecturally] predate the founding of DEC). It may be in the
SS-80, too (but that machine was before my time :-)).

It is one of the ultimate cases of providing a hardware feature where the
implementation ignores the architecture. (Can you say `dependent upon a
side effect'?) The UNIVAC 1108 assumed (if memory serves me right) that
indirect addressing would reference the memory `masked' by the registers.
Direct references to the memory locations accessed the registers. Two
different access modes referring to the same location referenced two
different objects!
--
Elliott S Frank    ...!{hplabs,ames,sun}!amdahl!esf00     (408) 746-6384
               or ....!{bnrmtv,drivax,hoptoad}!amdahl!esf00

[the above opinions are strictly mine, if anyone's.]
[the above signature may or may not be repeated, depending upon some
inscrutable property of the mailer-of-the-week.]
thomsen@trwspf.TRW.COM (Mark Thomsen) (09/26/87)
In article <704@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>However, the general approach (anecdotal) is not the way people design
>computers, these days, and for good reason.
>As noted before here, a plausible way to design a computer is:
>
>1) Pick a REPRESENTATIVE set of benchmarks.
>2) Do a first-cut architecture, based on past experience.
>3) Do compilers.
>4) Add or delete features, measuring the impact by running compiled/assembled
>code through architectural simulators.
>5) Iterate until you can't find anything else to add that actually
>improves performance by some noticeable amount, or until you run out of time.

I second this -- a processor design in the abstract space of "features
that are useful in current processors, that would seem to be useful, or
that every other processor has" creates the rut that the breakthrough
designs have been exploiting.

I am familiar with the MIPS, Transputer, and Lilith processors and all
three seem very efficiently designed for their role. By efficiency I mean
speed of execution of the applications run relative to clock speeds, as
compared to competitive processors. The Lilith for example (Wirth's
desktop marvel) is a legitimate 1 MIP workstation with a clock speed of
6.67 MHz. Now, if you want to use it for numerical analysis it practically
grinds to a halt. But for programming support, document production, and
such it is exceptional. It also supports bit-map operations at 30 Mb/s.
Not bad for its age.

Please don't go bashing the good processors of our time. The bad
processors are so easy to pick out. However, after weeding them out the
remainders generally have some domain of support (Unix/C, Lisp, Modula-2,
assembly language, CAD, program development, document production,
numerical analysis) that they are quite good for. Learn from them and
sally forth to the next generation.

This is self-serving. I get to use the processors and I am delighted to
see what is happening. Better and better processors are becoming
available.

Mark R. Thomsen
peter@sugar.UUCP (Peter da Silva) (09/27/87)
In article <14750@watmath.waterloo.edu>, ccplumb@watmath.waterloo.edu (Colin Plumb) writes: > In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: > >BTW, there is an address modification procedure which is missing on all > >machines I have seen except the UNIVAC's. That is to consider the register > >file as a memory block and allow indexing on it... > The PDP-10 also did this. The first 16 memory locations were the registers. > There was an option to get fast (non-core) memory for these few bits. The TI 99 processor has something like an address base register, and uses the next X words of memory as the registers. A standard trick (apparently) is to map the registers into the I/O page. I think the subroutine call mechanism involves copying the PC and assigning a new register file. Sort of like a slow RISC. A very interesting design, anyway. -- -- Peter da Silva `-_-' ...!hoptoad!academ!uhnix1!sugar!peter -- 'U` Have you hugged your wolf today? -- Disclaimer: These aren't mere opinions... these are *values*.
henry@utzoo.UUCP (Henry Spencer) (09/29/87)
> > The transcendentals in (say) the 68881 are *better* than anything you will
> > come up with in software without large amounts of work...
>
> The 68881 transcendentals are not implemented in hardware; they are
> implemented in microcode. I believe the extra 0.5-1.5ulp of accuracy
> of the 68881 is due to the use of extended precision calculations, not
> to either hardware or algorithm (simple rational approximations are
> very accurate too when evaluated in extended precision)...

Nope, sorry, you have misunderstood slightly. I wasn't saying "the 68881
is more accurate than carefully-implemented double-precision software such
as one would expect from e.g. MIPSco"; I was saying "the 68881 is more
accurate than the sloppy first-cut software that one confidently expects
XYZ Vaporboxes Inc. to ship as its `production' release". The point is not
that the 68881 has inherent advantages over software, but that it
represents a *cheap* *prepackaged* high-quality solution. In principle one
could find the same thing in software, but commercial realities make this
unlikely unless it comes from a university: the 68881 can be cheaply and
widely sold at a profit because *it cannot be pirated easily*.

I agree that the right way to do transcendentals is in software, with help
(e.g. extended-precision arithmetic) in the hardware when appropriate.
But how much carefully-written software can you buy for the price of one
68881?
--
"There's a lot more to do in space   |  Henry Spencer @ U of Toronto Zoology
than sending people to Mars." --Bova | {allegra,ihnp4,decvax,utai}!utzoo!henry
stachour@umn-cs.UUCP (09/29/87)
> In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> ...
>
> As noted before here, a plausible way to design a computer is:
>
> 1) Pick a REPRESENTATIVE set of benchmarks.
> 2) Do a first-cut architecture, based on past experience.
> 3) Do compilers.

I've NEVER seen anyone design compilers for a machine that is only being
simulated, choose the architecture of the hardware based on measurement,
and build the machine later. (Well, one exception, Multics many years
ago, but that design set goals seldom met now.)

> 4) Add or delete features, measuring the impact by running
>    compiled/assembled code through architectural simulators.
> 5) Iterate until you can't find anything else to add that actually
>    improves performance by some noticeable amount, or until you run
>    out of time.
>
> This is not a perfect recipe, of course. For example, if the benchmark
> set is chosen poorly, bad surprises will happen.

However, the biggest problem in getting support is finding relevant
benchmarks. For example, Bill Young's article about design &
implementation showed that unix security was broken, and all unix systems
vulnerable, because of a 'bug'. The bug was caused by overrunning an array
bound. But people don't write code that checks array bounds. They even
choose ugly languages like C that don't check rather than ones (like PL/I
or Pascal) that do, because they can't stand the inefficiency (or so they
say). Instead they write buggy programs! It is these buggy programs that
get used for the benchmarks; they're what's there...

But try to find a benchmark with lots of good array-checking in it.
Unless the program was written in Algol for a B55xx or PL/I for a GE6xxx,
you probably won't. So putting array-checking into hardware (to make it
reasonable for all programmers to do something which most of them know
they should, but don't) will not happen, because the benchmarks will not
contain code that checks array bounds.

BENCHMARKS PREVENT ONE FROM REPEATING ERRORS OF THE PAST, *BUT* THEY ARE
NOT VERY HELPFUL IN GUIDING THE FUTURE.

Paul Stachour
Honeywell SCTC (Stachour@HI-Multics)
UMinn. Computer Science (stachour at umn-cs.edu)
eugene@pioneer.arpa (Eugene Miya N.) (09/29/87)
It's pretty obvious to put vector and floating point hardware in silicon
with products like the Weitek, but I was having a discussion with a
colleague about LISP machines, Intellicorp and all those companies doing
"AI." What about putting CDR hardware into machines? The colleague pointed
out that SUN is the only company doing well in this arena. Agree or
disagree? Aren't Symbolics, TI, LMI doing okay?

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene

On second thought, don't send me follow ups on this one.
chuck@amdahl.amdahl.com (Charles Simmons) (09/30/87)
In article <2917@ames.arpa> eugene@pioneer.UUCP (Eugene Miya N.) writes:
>It's pretty obvious to put vector and floating point hardware
>in silicon with products like the Weitek, but I was having a discussion
>with a colleague about LISP machines, Intellicorp and all those
>companies doing "AI." What about putting CDR hardware into machines?
>The colleague pointed out that SUN is the only company doing well in
>this arena. Agree or disagree? Aren't Symbolics, TI, LMI doing okay?
>
>From the Rock of Ages Home for Retired Hackers:
>
>--eugene miya
>  {hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene

John Hennessy was giving a talk on RISC architectures in Santa Clara
today. You should take a look at the performance ratios between a MIPS
processor running LISP and dedicated LISP architectures running LISP.
MIPS seems to win big.

(Hopefully, John Mashey or some other knowledgeable person at MIPS will
correct me if I misunderstood the slide. It may be the case that MIPS was
only comparing LISP performance against general purpose processors like a
Cray, Vax, and a couple of small LISP boxes.)

-- Chuck
lamaster@pioneer.arpa (Hugh LaMaster) (09/30/87)
In article <15393@amdahl.amdahl.com> chuck@amdahl.amdahl.com (Charles Simmons) writes:
>In article <2917@ames.arpa> eugene@pioneer.UUCP (Eugene Miya N.) writes:
>>It's pretty obvious to put vector and floating point hardware
>>in silicon with products like the Weitek, but I was having a discussion
>>with a colleague about LISP machines, Intellicorp and all those
>>companies doing "AI." What about putting CDR hardware into machines?
>>this arena. Agree or disagree? Aren't Symbolics, TI, LMI doing okay?
>John Hennessy was giving a talk on RISC architectures in Santa Clara
:
>MIPS processor running LISP and dedicated LISP architectures running
>LISP. MIPS seems to win big.
:

On a lot of benchmarks, a general purpose (e.g. MIPS, 68020, etc.)
processor will run faster than a special purpose LISP machine. However,
the speed of, and behavior during the process of, garbage collection, is
not always so well advertised. CDR support is easily added to any general
purpose machine, even a "RISC" machine, but probably not necessary. But,
it is a good idea (actually, I would like to see more machines with a
"descriptor" hardware data type that could be used for lists, pointers,
and vectors). Hardware garbage collection support is an entirely different
question.

However, the performance advantage of the LISP machines has been eroded
for the simple reason that general purpose machines, being used in a much
bigger marketplace, have typically gone through new generations much more
quickly and tend to use current technology. It is difficult to finance the
level of R&D necessary to do that for a special purpose processor. AI
machines are not unique in this respect.

Hugh LaMaster, m/s 233-9,             UUCP {topaz,lll-crg,ucbvax}!
NASA Ames Research Center                  ames!pioneer!lamaster
Moffett Field, CA 94035               ARPA lamaster@ames-pioneer.arpa
Phone:  (415)694-6117                 ARPA lamaster@pioneer.arc.nasa.gov

(Disclaimer: "All opinions solely the author's responsibility")
eugene@pioneer.arpa (Eugene Miya N.) (09/30/87)
In summary, as requested (I had not originally intended to):

From: johnl@ima.ISC.COM (John R. Levine)
In article <2917@ames.arpa> you write:
>The colleague pointed out that SUN is the only company doing well in
>this arena. Agree or disagree? Aren't Symbolics, TI, LMI doing okay?

No, actually they're not. LMI essentially went bankrupt, the empty shell
of the company was picked up by some Canadians real cheap. Symbolics is
doing marginally well, but my impression is that's due to software more
than hardware. Their future may well be in moving their software to
platforms like MIPSco machines or even 386's.

I was around when the T version of Scheme was under construction at Yale,
and got the strong impression that for practically anything you want to do
in Lisp, a little cleverness lets you do it on a conventional processor
with little performance loss compared to a microcoded version in similar
technology. But for the same money you can certainly buy a lot faster high
volume conventional machine than a special purpose machine like the
Symbolics box. Looks to me like the RISC philosophy wins again.

John

From: mips!mash@ames (John Mashey)
This must have been sarcastic. LMI is out of that business. Symbolics is
hurting. I don't know how the AI part of TI is doing. Sun's SPARC
architecture has only the slightest caterings to LISP, and I hear from
friends that most of the AI folks do NOT use those features, because it
turns out that you can't get at them well thru the O.S.
reiter@pandora.UUCP (10/01/87)
In article <2917@ames.arpa> eugene@pioneer.UUCP (Eugene Miya N.) writes:
>It's pretty obvious to put vector and floating point hardware
>in silicon with products like the Weitek, but ...
> what about putting CDR hardware into machines?
>The colleague pointed out that SUN is the only company doing well in
>this arena. Agree or disagree? Aren't Symbolics, TI, LMI doing okay?

The last I heard, the various LISP companies were in trouble because of
stagnating sales (i.e. sales are constant - they're not decreasing), which
could be as much due to market saturation as anything else.

Technically, CDR is just a pointer operation and is trivial to implement
on any machine. The special hardware that LISP machines tend to have are:

1) Tagged data. The tags give data type. A word in memory might, for
example, consist of 32 data bits and a 4 bit type field. Note that in
LISP, variables are not typed, so the types of data elements in an
operation may not be known at compile time.

2) Memory management. Support for CONS and for garbage collection.

3) Special caches, instruction sets, etc., which are geared towards LISP.

[This list is by no means exhaustive]

Whether LISP machines give any price/performance advantage over
conventional machines is unclear. I once looked into this, and received
highly contradictory data. In any case, since LISP tends to be a
"research" language (as opposed to a "production" language), most LISP
people are more interested in good software development environments than
in hardware speed.

Ehud Reiter
reiter@harvard      (ARPA,BITNET,UUCP)
reiter@harvard.harvard.EDU  (new ARPA)
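[A rough sketch of what the tag check in point (1) costs in software on an
untagged machine. The 2-bit low-order tag, the tag values, and the
function names are all invented for this example; real LISP machines use
wider tags and check them in parallel with the operation rather than with
explicit instructions.]

    #include <stdlib.h>

    #define TAG_MASK   3UL      /* low 2 bits of every word are the tag */
    #define TAG_FIXNUM 0UL
    #define TAG_CONS   1UL

    typedef unsigned long lispval;

    struct cons { lispval car, cdr; };

    /* Assumes malloc returns at least 4-byte-aligned storage, so the
       low 2 bits of a cons pointer are free to hold the tag. */
    lispval make_cons(lispval car_v, lispval cdr_v)
    {
        struct cons *c = malloc(sizeof *c);
        if (c == NULL) abort();
        c->car = car_v;
        c->cdr = cdr_v;
        return (lispval)c | TAG_CONS;
    }

    lispval cdr(lispval v)
    {
        if ((v & TAG_MASK) != TAG_CONS)   /* the software tag check */
            abort();                      /* stand-in for a type trap */
        return ((struct cons *)(v & ~TAG_MASK))->cdr;
    }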
rbbb@acornrc.UUCP (10/02/87)
In article <2207@umn-cs.UUCP>, stachour@umn-cs.UUCP (Paul Stachour) writes:
> However, the biggest problem in getting support is finding relevant
> benchmarks. ...
> But try to find a benchmark with lots of good array-checking in it.
> Unless the program was written in Algol for a B55xx or PL/I for a
> GE6xxx, you probably won't. So putting array-checking into hardware
> (to make it reasonable for all programmers to do something which
> most of them know they should, but don't) will not happen, because
> the benchmarks will not contain code that checks array bounds.

If you check some of the IBM 801 and 801-related literature, I believe you
will find a discussion of optimizations that (safely) remove checking
code, so at least someone has thought about this problem. Running a
language with array bounds checking, the 801 (in combination with the PL8
compiler) ran very well. Again, RISC wins (with a clever enough compiler).

David Chase
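[For illustration, one kind of safe check removal the 801 literature
discusses is hoisting a per-element bounds test out of a loop. This is a
hypothetical sketch of the idea, not the PL8 compiler's actual
transformation; the array, bound, and function names are made up.]

    #include <stdlib.h>

    #define N 100
    static int a[N];

    /* Naive version: one bounds check per iteration. */
    long sum_checked(int lo, int hi)
    {
        long s = 0;
        int i;
        for (i = lo; i < hi; i++) {
            if (i < 0 || i >= N) abort();   /* per-element check */
            s += a[i];
        }
        return s;
    }

    /* After hoisting: the compiler proves i stays in [lo, hi) and
       checks the whole range once, before the loop runs. */
    long sum_hoisted(int lo, int hi)
    {
        long s = 0;
        int i;
        if (lo < hi && (lo < 0 || hi > N))
            abort();                        /* single range check */
        for (i = lo; i < hi; i++)
            s += a[i];
        return s;
    }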
blu@hall.cray.com (Brian Utterback) (10/02/87)
In article <826@sugar.UUCP> peter@sugar.UUCP (Peter da Silva) writes:
>In article <14750@watmath.waterloo.edu>, ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>> In article <582@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>> >BTW, there is an address modification procedure which is missing on all
>> >machines I have seen except the UNIVAC's. That is to consider the register
>> >file as a memory block and allow indexing on it...
>> The PDP-10 also did this. The first 16 memory locations were the registers.
>> There was an option to get fast (non-core) memory for these few bits.

Another advantage the PDP-10 had by mapping the registers to the memory
space, other than indexing, was in execution. You could load a short loop
into the registers and jump to them! The loop would run much faster,
executing out of the registers.

Brian Utterback
Cray Research Inc.
(603) 888-3083
franka@mmintl.UUCP (Frank Adams) (10/02/87)
In article <2913@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes: |The special hardware that LISP machines tend to have are: | | 1) Tagged data. The tags give data type. A word in memory might, for |example, consist of 32 data bits and a 4 bit type field. | | 2) Memory management. Support for CONS and for garbage collection. | | 3) Special caches, instruction sets, etc., which are geared towards LISP. | |[This list is by no means exhaustive] This looks like much the same sort of stuff that one would want for Smalltalk. Has anyone looked at implementing Smalltalk on Lisp machines? (Of course, if Lisp machines really *don't* give better Lisp performance for the price than conventional architectures, it is unlikely that they would do better for Smalltalk.) -- Frank Adams ihnp4!philabs!pwa-b!mmintl!franka Ashton-Tate 52 Oakland Ave North E. Hartford, CT 06108
lyang%scherzo@Sun.COM (Larry Yang) (10/08/87)
In article <8668@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes: >Nope, sorry, you have misunderstood slightly. I wasn't saying "the 68881 >is more accurate than carefully-implemented double-precision software such >as one would expect from e.g. MIPSco"; I was saying "the 68881 is more >accurate than the sloppy first-cut software that one confidently expects >XYZ Vaporboxes Inc. to ship as its `production' release". The point is not >that the 68881 has inherent advantages over software, but that it represents >a *cheap* *prepackaged* high-quality solution. In principle one could find >the same thing in software, but commercial realities make this unlikely >unless it comes from a university: the 68881 can be cheaply and widely >sold at a profit because *it cannot be pirated easily*. > >I agree that the right way to do transcendentals is in software, with help >(e.g. extended-precision arithmetic) in the hardware when appropriate. >But how much carefully-written software can you buy for the price of one >68881? How much more time would Motorola buy if they didn't do transcendentals in micro/nanocode and had software engineers write libraries that they could sell to customers? Could the 881 be fit onto a smaller die (i.e., easier layout, better yield)? What's wrong with Motorola saying: "Here's this wonderful fp chip we've made. It does all the basic fp operations really fast. If you want to do sin, cos, and stuff, then here are the software library routines that are guaranteed to work." Are there no competent software engineers at these IC houses? I'll have to admit that I haven't designed any floating point arithmetic, so if I'm way off base, someone please correct me. (Of course, I didn't have to request this... :-) But it would seem that much would be gained from the chip design/fab/test area if the sweating over complex functions would be moved to the software realm. ************************************************************************* --Larry Yang [lyang@sun.com,{backbone}!sun!lyang]| A REAL _|> /\ | _ _ _ Sun Microsystems, Inc., Mountain View, CA | signature | | | / \ | \ / \ Hobbes: "Why do we play war and not peace?" | <|_/ \_| \_/\| |_\_| Calvin: "Too few role models." | _/ _/
oconnor@sunray.steinmetz (Dennis Oconnor) (10/09/87)
In article <30382@sun.uucp> lyang@sun.UUCP (Larry Yang) writes:
>How much more time would Motorola buy if they didn't do transcendentals
>in micro/nanocode and had software engineers write libraries that they
>could sell to customers? Could the 881 be fit onto a smaller die (i.e.,
>easier layout, better yield)? What's wrong with Motorola saying:
>"Here's this wonderful fp chip we've made. It does all the basic
>fp operations really fast. If you want to do sin, cos, and stuff, then
>here are the software library routines that are guaranteed to work."
>Are there no competent software engineers at these IC houses?

Anyone who can write microcode for the chip considered the STANDARD (as
in, if you and the '881 disagree on a result, you are wrong) for IEEE
floating point MUST be a competent software engineer.

>I'll have to admit that I haven't designed any floating point arithmetic,

Actually, I could tell this without you admitting it, so you needn't have.

>so if I'm way off base, someone please correct me. (Of course, I didn't
>have to request this... :-) But it would seem that much would be gained
>from the chip design/fab/test area if the sweating over complex functions
>would be moved to the software realm.
>
>--Larry Yang [lyang@sun.com,{backbone}!sun!lyang]
>  Sun Microsystems, Inc., Mountain View, CA

Right, great idea (sarcasm). For every operation currently done in the
'881 in microcode, make the user fetch an instruction from i-mem. Boy,
that'll improve performance! (heavier sarcasm). Hey, and don't forget to
use plenty of the available user registers in these routines.

The 68000, '010 and '020 family are NOT RISCs. Putting a RISCy FP
Coprocessor on them makes as much sense as putting a telescopic sight on
a sawed-off shotgun: different design contexts bring about different
solutions.
--
Dennis O'Connor   oconnor@sungoddess.steinmetz.UUCP ??
ARPA: OCONNORDM@ge-crd.arpa
"If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"
nerd@percival.UUCP (Michael Galassi) (10/10/87)
In article <30382@sun.uucp> lyang@sun.UUCP (Larry Yang) writes:
...
>I'll have to admit that I haven't designed any floating point arithmetic,
>so if I'm way off base, someone please correct me.

I've designed some floating point routines for the 68k and can testify
that it is not overly difficult, the hard part being determining what
ranges of inputs are valid and verifying that the function is well behaved
over all that range.

>... But it would seem that much would be gained
>from the chip design/fab/test area if the sweating over complex functions
>would be moved to the software realm.

The biggest advantage I see in using microcode to do the FP is that you
save the memory references while the computation is being done, freeing up
the bus for the likes of dma, concurrent cpu operations, and any other bus
master's intervention.
--
If my employer knew my opinions he would probably look for another engineer.
Michael Galassi, Frye Electronics, Tigard, OR
...!tektronix!reed!percival!nerd
lyang%scherzo@Sun.COM (Larry Yang) (10/13/87)
In article <7587@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes: > >Anyone who can write microcode for the chip considered the STANDARD >( as in, if you and the '881 disagree on a result , you are wrong ) >for IEEE floating point MUST be a competent software engineer. > Hmm. I'll concede this point. Microcoders *are* software people, in some sense. And in order to understand the mystical IEEE standard, one would have to be pretty competent. > >The 68000, '010 and '020 family are NOT RISCs. Putting a RISCy >FP Coprocessor on them makes as much sense as putting a >telescopic sight on a sawed-off shotgun : different design >contexts bring about different solutions. > Good point, although it was hard to extract out of all that sarcasm. :-) Making the 881 'RISCY' would have been incredibly foolish, given the 'CISCY' design of the 68000 family. The analogy with the shotgun is very appropriate. --Larry Yang [lyang@sun.com,{backbone}!sun!lyang]| A REAL _|> /\ | _ _ _ Sun Microsystems, Inc., Mountain View, CA | signature | | | / \ | \ / \ Hobbes: "Why do we play war and not peace?" | <|_/ \_| \_/\| |_\_| Calvin: "Too few role models." | _/ _/