ccplumb@watmath.UUCP (12/09/87)
In article <6901@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>A small addendum: while the 801 may not have played an enormous role
>[in the design of the HP Precision architecture], System/370 certainly
>did! Many of the design decisions in Spectrum were in reaction to
>problems we saw (& measured & were told about) in the 370 architecture,
>from both IBM and Amdahl.

This got me thinking (with apologies for ever uttering anything faintly resembling praise for IBM) that the 360/370/whatever architecture isn't all that bad. (Perhaps you can tell me via mail what the problems are. I'm interested, but it would probably be ancient headgear to others.)

With the exception, of course, of all the packed decimal divide stuff. The most recent processor family I know of that still faintly supports BCD math is the 68000. Hooray! It seems people are finally realizing that, while for simple processing (say, adding up a payroll) the work of binary <-> decimal conversion far exceeds the processing that's done on the binary numbers, it isn't worth jumping through hoops in hardware and software to avoid the translation. The overhead is fixed and won't kill your system. Optimize somewhere else. Surely you can improve the disk buffering algorithm to detect sequential access and do read-ahead and purge-behind.

Does anyone else have any ideas about COBO* (I can't bring myself to say "Language") support in architectures? Is it just hanging on like bad smells and MS-DOS, or is it worth anything at all?
--
-Colin (watmath!ccplumb)

(To the tune of "I don't know why she swallowed the fly")
I don't know why the process won't die.
Perhaps it's a BUG.

Verses, anyone?
baum@apple.UUCP (Allen J. Baum) (12/10/87)
--------
[]
>In article <15782@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>This got me thinking (with apologies for ever uttering anything faintly
>resembling praise for IBM), that the 360/370/whatever architecture isn't
>all that bad. (Perhaps you can tell me via mail what the problems are.
>I'm interested, but it would probably be ancient headgear to others.)
>
>With the exception, of course, of all the packed decimal divide stuff.

The idea of an architecture, to the extent that IBM did it in the 360/370, is amazing. You can build compatible systems just by looking at the P.O.O. No one else does it as well. All the creepy little corner cases are there.

The actual architecture is none too good. For example, there is no prohibition against writing into the instruction stream, which gives cache designers headaches. The Test-under-Mask should be a simple Test-Bit, or even Test-Bit&Branch. Branches should be PC-relative! Base+Index+Displacement is not necessary & an utter pain. You can dissect the instruction set to death and find many problems.

The HP Precision has some minimal support for decimal operations: Decimal Correct and Intermediate Decimal Correct. A decimal add would be something like (assume packed, unsigned, RegA+RegB->RegC):

  Add RegA+0x66666666 -> temp   ;optionally trap if any digit carry--
                                ;this checks for illegal decimal ops
  Add RegB+temp -> RegC         ;save digit carries
  DCor RegC -> RegC             ;subtracts 6 everywhere there was no digit carry

The intermediate decimal correct is similar; it combines the DCor with the add of 0x66666666.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
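The three-step sequence above can be sketched in C. This is my hedged translation, not HP's code: the function and variable names are mine, and a 64-bit temporary stands in for the hardware's carry bit so the top digit's carry-out isn't lost.

```c
#include <stdint.h>

/* Packed-decimal add of eight unsigned BCD digits per 32-bit word,
 * using the 0x66666666 bias trick described in the post above.
 * Illustrative sketch; names are mine, not HP-PA's. */
uint32_t bcd_add8(uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a + 0x66666666u;   /* Add RegA+0x66666666 -> temp */
    uint64_t s = t + b;                       /* Add RegB+temp -> RegC       */
    /* XOR recovers the carry-in bit of each nibble column; shifting
     * right 4 aligns each digit's carry-out with the digit itself.   */
    uint64_t cout = ((t ^ b ^ s) & 0x111111110ULL) >> 4;
    /* DCor: every digit with no carry-out still holds the +6 bias,
     * so subtract 6 from exactly those digit positions.              */
    uint32_t fix = (uint32_t)(~cout & 0x11111111u) * 6u;
    return (uint32_t)s - fix;
}
```

For example, bcd_add8(0x19, 0x23) yields 0x42, i.e. decimal 19 + 23 = 42 with every digit staying a valid BCD digit.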
mash@mips.UUCP (John Mashey) (12/10/87)
In article <15782@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
....
>The most recent processor family I know of that still faintly supports
>BCD math is the 68000.....
>Does anyone else have any ideas about COBO* (I can't bring myself to
>say "Language") support in architectures? Is it just hanging on like
>bad smells and MS-DOS, or is it worth anything at all?

HP added some support (described in Allen Baum's posting), and is probably the only RISC machine to have much of anything there explicitly for COBOL. We didn't aim at COBOL with the R2000, but it happens to be pretty good for it, since loads, stores, branches, and function calls are all fast. Finally, it turns out that the unaligned-word operations are wonderfully useful for getting good code for the 100 or so worthwhile low-level computations and data movements.

[opinion]: Any architecture that wants to be truly widespread had better be reasonable for more than C, Pascal, Modula-2, or FORTH (just to pick a random example). Regardless of what anybody thinks of COBOL, an architecture has no chance in many quarters if it doesn't run COBOL at least adequately. [That doesn't mean having lots of decimal operations, just that the performance be OK.] Ignoring COBOL is almost like ignoring FORTRAN.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
hansen@mips.UUCP (Craig Hansen) (12/11/87)
>In article <15782@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>...the 360/370/whatever architecture isn't
>all that bad. ...
>With the exception, of course, of all the packed decimal divide stuff.

A recent (well, my journal stack is getting deep these days) article in CACM interviewed two of the IBM architecture folks on the life and times of the 360/370/30xx/43xx series architectures. When asked what they would have done differently, in hindsight, their top item was "drop the packed decimal instructions," in that the only justification for having them was that the 8-bit-wide implementation of the 360 did decimal operations significantly faster with them than without them. The days of 8-bit 360's are long past.
--
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com
ok@quintus.UUCP (Richard A. O'Keefe) (12/11/87)
The Intel 80386 is a pretty recent machine. It still has instructions to support decimal arithmetic: AAA, AAD, AAM, AAS, DAA, DAS. Of course this is to keep all those 8086 programs in the air...

I saw a paper years ago from somebody at Burroughs explaining some of the decisions they had made for the B6700. That was explicitly designed to run Algol, Fortran, and COBOL. It did a reasonable job of PL/I and PASCAL as well, but didn't make a good Lisp machine. They were very much concerned to support COBOL well. They decided that the best thing was to leave out decimal arithmetic entirely, but to provide fast binary<->decimal conversion and decimal scaling.

This seems like the right decision for any machine: even C programmers want binary<->decimal conversion and have no objection to it being fast, and even COBOL has USAGE COMPUTATIONAL.
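The Burroughs choice above (no decimal arithmetic, just fast conversion) can be illustrated with a minimal C sketch of the binary -> packed-decimal direction, done by repeated divide-by-10. This is my own illustration of the operation's shape, not the B6700's algorithm, and it assumes an unsigned value of at most eight digits.

```c
#include <stdint.h>

/* Convert an unsigned binary integer (up to 8 decimal digits) to
 * packed BCD, one digit per nibble, least significant digit in the
 * low nibble. Illustrative sketch only; names are mine. */
uint32_t bin_to_packed(uint32_t n)
{
    uint32_t p = 0;
    for (int shift = 0; shift < 32; shift += 4) {
        p |= (n % 10) << shift;   /* peel off the next decimal digit */
        n /= 10;
    }
    return p;
}
```

For example, bin_to_packed(1987) yields 0x00001987; a machine that makes this loop (and its inverse) fast serves COBOL without carrying decimal add, subtract, multiply, and divide in the instruction set.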
sjc@mips.UUCP (Steve "The" Correll) (12/12/87)
In article <15782@watmath.waterloo.edu>, ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
> The most recent processor family I know of that still faintly supports
> BCD math is the 68000...Is [Cobol support] just hanging on like
> bad smells and MS-DOS, or is it worth anything at all?

As is often the case, if hardware BCD instructions do everything you need and exactly what you need, you win; if they do almost everything you need and almost exactly what you need, you might as well not use them.

I just compared the "inner loop" code for a commercial MC68000 implementation of packed decimal addition, which uses the "ABCD" instruction to add a pair of digits and propagate the carry, against the MIPS R2000 implementation by E. Killian and K. Mortensen, which has no BCD support (just XOR).

  MC68000 cost: 5 * ((precision + 1) div 2)
  R2000 cost:   (precision <= 7) ? 19 : ((precision <= 15) ? 30 : ...)

The R2000, in typical RISC fashion, assumes that packed decimal strings are word-aligned, and manipulates a word at a time, so it costs 19 instructions whether you have 1 digit or 7. But it turns out that the work done by the MC68000 "ABCD" instruction, which adds a pair of packed-decimal digits and propagates the carry for the next pair, just isn't an important part of the total. Outside the inner loop, both machines must extract and deposit sign nibbles; both must compare the signs to see whether you're really doing subtraction; both must check that they don't create a negative-zero result. All of this happens outside the loop which uses "ABCD", and costs more for the MC68000.

I arbitrarily picked a case as "reasonable" (both operands positive, both operands have the same number of digits, odd number of digits, not all digits zero, and a nonzero result). Subject to my slightly sloppy instruction-counting, the cost was:

  digits  MC68000 instructions  MIPS R2000 instructions
    1             34                      28
    3             39                      28
    5             44                      28
    7             49                      28
    9             54                      49
   11             59                      49
  ...
(I ignored some scaling which the R2000 does inside the 8..15-digit routine but the MC68000 does outside its routine.)

If you pick a harder case (the operands aren't known to have the same number of digits or aren't known to have the decimal point in the same place; you have to remap the sign nibble because there are multiple representations; and so on) both implementations become much more expensive; I didn't have the patience to count instructions to see how they compare.

The moral: if you have an instruction for "add or subtract two packed decimal strings including encoded signs where the sources and destination may have disparate lengths with decimal points in different places", it will be a big help. If you have only "ABCD", you might as well use XOR.
--
...decwrl!mips!sjc
Steve Correll
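The out-of-loop work Correll describes, stripping and comparing sign nibbles before any digit loop runs, can be sketched in C. The layout here is an assumption of mine for illustration (IBM-style packed decimal: low nibble is the sign, 0xC for plus, 0xD for minus), not a description of either implementation's actual format.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical packed-decimal layout: digits in the high nibbles,
 * sign code in the low nibble (0xC = +, 0xD = -). */

/* Same sign means a true add; different signs mean an effective
 * subtract, with all the borrow and negative-zero checks that entails. */
bool same_sign(uint32_t a, uint32_t b)
{
    return (a & 0xFu) == (b & 0xFu);
}

/* Deposit the digits alone, ready for whatever inner loop follows. */
uint32_t strip_sign(uint32_t x)
{
    return x >> 4;
}
```

The point of the sketch: these steps (plus sign re-deposit and the negative-zero fixup) run once per operation on both machines, and an ABCD-style instruction does nothing to shrink them.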
larry@mips.UUCP (Larry Weber) (12/12/87)
In article <1101@quacky.UUCP> sjc@mips.UUCP (Steve "The" Correll) writes:
>In article <15782@watmath.waterloo.edu>, ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
> digits  MC68000 instructions  MIPS R2000 instructions
>   1             34                      28
>   3             39                      28
>   5             44                      28
>   7             49                      28
>   9             54                      49
>  11             59                      49
> ...

Note that these numbers are instruction counts, not cycle counts. All 68020 instructions take at least 2 cycles (subject to a slight amount of overlap in the processor). In fact, the ABCD instruction that was used in this example takes 4 or 5 cycles. Given equal cycle times on the two machines, you would expect a performance difference of 2 to 4 (depending on whether you are buying or selling).

Of course if you really must use a 68020 and want the best performance, you should implement an algorithm along the lines of ours and just ignore the ABCD instruction. In other words, treat the 68020 like a RISC machine. This is not as absurd as it may sound. When the 801 group implemented its PL.8 compiler for the 370 (after making it generate 801 code), they treated the 370 like a RISC machine and also got better results from the 370. Lots of people have pointed out that the fastest way to call a subroutine on a VAX is to NOT use 'calls' (call subroutine).
--
-Larry Weber DISCLAIMER: I speak only for myself, and I even deny that.
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!larry, DDD:408-720-1700, x214
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
fotland@hpcupt1.HP.COM (Dave Fotland) (12/24/87)
>For symbolic languages, tagged pointer
>operations are very important. Typically, tagged dispatch and tagged
>pointer following are done quite a lot, and cutting the number of
>machine instructions to do these things can make quite a difference in
>performance. Take the example of following a tagged pointer. If
>the tag is kept in the high few bits of a 32 bit address, one must AND the
>pointer with a 32 bit constant. Quite a lot of overhead, when a simple
>addressing mode that ignored the high, say, 4 bits of the address would
>do the trick perfectly. I know this runs contrary to the RISC ideal.
>But on a CISC, this is no more arcane than some of the other addressing
>modes.

On the HP Precision architecture, loads are indexed with index shifting according to the operand size. If the upper two bits are the tag, and the pointer is 32 bits, the load word will shift the index left two bits and ignore the tag. No requirement for ANDing registers and no extra instructions on this RISC machine! If a bigger tag is needed, we only need one extra instruction, an extract, which can be used to strip off the tag.

>Similarly for tagged dispatch. If there were an instruction to take
>the top, say, 2 bits of a register, shift them right a whole bunch of
>places, add them to a given address, and jump to the address stored
>there. Sure, this could be done with a shift or rotate, an and, an add,
>and an indirect jump. But wouldn't you rather do one instruction than
>4?

In the HP Precision architecture there are exactly the primitives you want: the extract instruction, which extracts a bit field from a word and right-justifies it in the result. We paid a lot of attention to instruction count when we were designing HP-PA, and we found that operating systems and LISP did a lot of bit-field manipulation, so we included extract and its converse, deposit, in the instruction set. Your tagged dispatch would be an extract followed by a load and a branch. Three instructions, but just as fast as the microcoded single instruction on the CISC machine.

>The point machine designers should take into account is that more and
>more, people are buying general-purpose hardware rather than the
>expensive specialized hardware. Therefore, they should design their
>machines taking symbolic languages, CAD, and other specialized tasks
>into account.

I agree! But general purpose does not have to mean CISC! HP-PA is a general purpose RISC architecture.

-David Fotland
fotland@hpda.HP.COM
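The extract/load/branch dispatch described above can be sketched in C. The field positions and names here are my own illustration (a 2-bit tag in the high bits of a 32-bit word), not HP-PA's actual conventions or any particular Lisp system's layout.

```c
#include <stdint.h>

#define TAG_BITS 2

/* "Extract": pull the tag field out of the word, right-justified. */
static inline uint32_t tag_of(uint32_t w)
{
    return w >> (32 - TAG_BITS);
}

/* Strip the tag, leaving the address/payload bits. */
static inline uint32_t addr_of(uint32_t w)
{
    return w & (0xFFFFFFFFu >> TAG_BITS);
}

/* Tagged dispatch: extract the tag, load the handler from a table,
 * branch to it -- the three-instruction sequence described above. */
typedef void (*handler_t)(uint32_t payload);

void dispatch(uint32_t w, handler_t table[1u << TAG_BITS])
{
    table[tag_of(w)](addr_of(w));
}
```

For example, with tag 3 in the top two bits and payload 5 (word 0xC0000005), tag_of returns 3 and addr_of returns 5; the table index is exactly what HP-PA's extract would produce in one instruction.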