[comp.arch] Brain-Clogging Decimal

ccplumb@watmath.UUCP (12/09/87)

In article <6901@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>A small addendum: while the 801 may not have played an enormous role,
 [In the design of the HP precision architecture]
>System/370 certainly did! Many of the design decisions in Spectrum were in
>reaction to problems we saw (& measured & were told about) in the 370 
>architecture, from both IBM and Amdahl.

This got me thinking (with apologies for ever uttering anything faintly
resembling praise for IBM), that the 360/370/whatever architecture isn't
all that bad.  (Perhaps you can tell me via mail what the problems are.
I'm interested, but it would probably be ancient headgear to others.)

With the exception, of course, of all the packed decimal divide stuff.

The most recent processor family I know of that still faintly supports
BCD math is the 68000.  Hooray!  It seems people are finally realizing
that, while for simple processing (say, adding up a payroll), the work
of binary <-> decimal conversion far exceeds the processing that's done
on the binary numbers, it isn't worth jumping through hoops in hardware
and software to avoid the translation.  The overhead is fixed and won't
kill your system.  Optimize somewhere else.  Surely you can improve the
disk buffering algorithm to detect sequential access and do read-ahead
and purge-behind.

Does anyone else have any ideas about COBO* (I can't bring myself to
say "Language") support in architectures?  Is it just hanging on like
bad smells and MS-DOS, or is it worth anything at all?
--
	-Colin (watmath!ccplumb)

(To the tune of "I don't know why she swallowed the fly")
I don't know why the process won't die.
Perhaps it's a BUG.

Verses, anyone?

baum@apple.UUCP (Allen J. Baum) (12/10/87)

>In article <15782@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:

>This got me thinking (with apologies for ever uttering anything faintly
>resembling praise for IBM), that the 360/370/whatever architecture isn't
>all that bad.  (Perhaps you can tell me via mail what the problems are.
>I'm interested, but it would probably be ancient headgear to others.)
>
>With the exception, of course, of all the packed decimal divide stuff.

The idea of an architecture, to the extent that IBM did it in the 360/370, is
amazing. You can build compatible systems just by looking at the P.O.O.
(Principles of Operation). No one else does it as well. All the creepy little
corner cases are there.

The actual architecture is none too good. For example, there is no prohibition
against writing into the instruction stream, which gives cache designers
headaches. The Test-under-Mask should be a simple Test-Bit, or even
TestBit&Branch. Branches should be PC relative! Base+Index+Displacement is not
necessary & an utter pain. You can dissect the instruction set to death and
find many problems.

The HP precision has some minimal support for decimal operations: Decimal
Correct and Intermediate Decimal Correct. A decimal add would be something
like: (assume packed, unsigned RegA+regB->RegC)
     Add RegA+0x66666666 -> temp ;optionally trap if any digit carry-- this
                                 ;checks for illegal decimal ops
     Add RegB+temp       -> RegC ;save digit carries
     DCor RegC           -> RegC ;subtracts 6 everywhere there was no digit carry

The intermediate decimal correct is similar; it combines the DCor with the
add of 0x66666666.
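
A minimal C sketch of the three-step sequence above, for eight unsigned
packed digits in a 32-bit word. The name bcd_add8, the carry_out parameter,
and the 64-bit intermediates are assumptions for illustration. Real hardware
would latch the digit carries during the add; here they are recomputed with
XOR. Only the first operand is checked for illegal digits, mirroring the
optional trap on the first add.

```c
#include <assert.h>
#include <stdint.h>

/* Adds two unsigned packed-decimal words (8 digits each).
 * Hypothetical helper, not an HP-PA intrinsic. */
uint32_t bcd_add8(uint32_t a, uint32_t b, unsigned *carry_out)
{
    /* Step 1: bias every digit of a by 6. A legal digit (0..9) becomes
     * 6..15 and cannot carry into its neighbour. */
    uint64_t temp = (uint64_t)a + 0x66666666u;

    /* Bit k of (x ^ y ^ (x + y)) is the carry INTO bit k, so bits
     * 4, 8, ..., 32 hold the carry OUT of each digit. Any carry here
     * means a contained an illegal digit (the "trap" case). */
    uint64_t bias_carries = ((uint64_t)a ^ 0x66666666u ^ temp) >> 4;
    assert((bias_carries & 0x11111111u) == 0);

    /* Step 2: plain binary add. Thanks to the bias, a digit carries out
     * exactly when the decimal digit sum is 10 or more. */
    uint64_t sum = temp + b;
    uint32_t digit_carry =
        (uint32_t)(((temp ^ b ^ sum) >> 4) & 0x11111111u);

    /* Step 3 (the DCor step): subtract 6 from every digit that did NOT
     * carry. Those digits still hold the bias, so each is at least 6 and
     * the subtraction never borrows across digits. */
    uint32_t no_carry = digit_carry ^ 0x11111111u;
    uint32_t fix = no_carry * 6u;

    *carry_out = (digit_carry >> 28) & 1u;  /* carry out of the top digit */
    return (uint32_t)sum - fix;
}
```

For example, bcd_add8(0x00000019, 0x00000003, &c) yields 0x00000022 with
c == 0, and bcd_add8(0x99999999, 0x00000001, &c) yields 0 with c == 1.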

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

mash@mips.UUCP (John Mashey) (12/10/87)

In article <15782@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
....
>The most recent processor family I know of that still faintly supports
>BCD math is the 68000.....
>Does anyone else have any ideas about COBO* (I can't bring myself to
>say "Language") support in architectures?  Is it just hanging on like
>bad smells and MS-DOS, or is it worth anything at all?

HP added some support (described in Allen Baum's posting),
and is probably the only RISC machine to have much of anything
there explicitly for COBOL.

We didn't aim at COBOL with the R2000, but it happens to be pretty good
for it, since loads, stores, branches, and function calls are all fast.
Finally, it turns out that the unaligned-word operations are wonderfully
useful for getting good code for the 100 or so worthwhile low-level
computations and data movements.

[opinion]: Any architecture that wants to be truly widespread had better be
reasonable for more than C, Pascal, Modula-2, or FORTH (just to pick
a random example).  Regardless of what anybody thinks of COBOL,
an architecture has no chance in many quarters if it doesn't
run COBOL at least adequately. [That doesn't mean having lots of
decimal operations, just that the performance be OK.]
Ignoring COBOL is almost like ignoring FORTRAN.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

hansen@mips.UUCP (Craig Hansen) (12/11/87)

>In article <15782@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>...the 360/370/whatever architecture isn't
>all that bad. ...
>With the exception, of course, of all the packed decimal divide stuff.

A recent (well, my journal stack is getting deep these days) article in CACM
interviewed two of the IBM architecture folks on the life and times of the
360/370/30xx/43xx series architectures. When asked what they would have done
differently, in hindsight, their top item was "drop the packed decimal
instructions," in that the only justification for having them was that the
8-bit-wide implementation of the 360 did decimal operations significantly
faster with them than without them. The days of 8-bit 360's are long past.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com

ok@quintus.UUCP (Richard A. O'Keefe) (12/11/87)

The Intel 80386 is a pretty recent machine.
It still has instructions to support decimal arithmetic:
	AAA, AAD, AAM, AAS, DAA, DAS.
Of course this is to keep all those 8086 programs in the air...

I saw a paper years ago from somebody at Burroughs explaining some of
the decisions they had made for the B6700.  That was explicitly designed
to run Algol, Fortran, and COBOL.  It did a reasonable job of PL/I and
PASCAL as well, but didn't make a good Lisp machine.

They were very much concerned to support COBOL well.  They decided that
the best thing was to leave out decimal arithmetic entirely, but to
provide fast binary<->decimal conversion and decimal scaling.

This seems like the right decision for any machine:  even C programmers
want binary<->decimal conversion and have no objection to it being fast,
and even COBOL has USAGE COMPUTATIONAL.
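
To make the "convert, compute in binary, convert back" approach concrete,
here is a rough C sketch for eight-digit unsigned packed decimal. The
function names and the 32-bit word size are assumptions for illustration,
and this is plain software, not a model of how the B6700 did it. Decimal
scaling is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Packed decimal (8 digits, most significant digit in the top nibble)
 * to binary. Assumes every nibble is a legal digit 0..9. */
uint32_t bcd_to_binary(uint32_t bcd)
{
    uint32_t value = 0;
    for (int shift = 28; shift >= 0; shift -= 4)
        value = value * 10u + ((bcd >> shift) & 0xFu);
    return value;
}

/* Binary to packed decimal. The value must fit in 8 decimal digits. */
uint32_t binary_to_bcd(uint32_t value)
{
    assert(value < 100000000u);
    uint32_t bcd = 0;
    for (int shift = 0; shift < 32; shift += 4) {
        bcd |= (value % 10u) << shift;  /* peel off the low decimal digit */
        value /= 10u;
    }
    return bcd;
}
```

With these two helpers a decimal add becomes
binary_to_bcd(bcd_to_binary(x) + bcd_to_binary(y)), and all the arithmetic
in between is ordinary binary.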

sjc@mips.UUCP (Steve "The" Correll) (12/12/87)

In article <15782@watmath.waterloo.edu>, ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
> The most recent processor family I know of that still faintly supports
> BCD math is the 68000...Is [Cobol support] just hanging on like
> bad smells and MS-DOS, or is it worth anything at all?

As is often the case, if hardware BCD instructions do everything you need
and exactly what you need, you win; if they do almost everything you need
and almost exactly what you need, you might as well not use them.

I just compared the "inner loop" code for a commercial MC68000 implementation
of packed decimal addition, which uses the "ABCD" instruction to add a pair of
digits and propagate the carry, against the MIPS R2000 implementation by
E. Killian and K. Mortensen, which has no BCD support (just XOR).

  MC68000 cost: 	5 * ((precision + 1) div 2)

  R2000 cost:		(precision <= 7) ? 19 : ((precision <= 15 ? 30 : ...))

The R2000, in typical RISC fashion, assumes that packed decimal strings
are word aligned, and manipulates a word at a time, so that it costs 19
instructions whether you have 1 digit or 7.

But it turns out that the work done by the MC68000 "ABCD" instruction,
which adds a pair of packed-decimal digits and propagates the carry for
the next pair, just isn't an important part of the total. Outside the
inner loop, both machines must extract and deposit sign nibbles; both
must compare the signs to see whether you're really doing subtraction;
both must check that they don't create a negative zero result. All of
this happens outside the loop which uses "ABCD", and costs more for the
MC68000.

I arbitrarily picked a case as "reasonable" (both operands positive,
both operands have the same number of digits, odd number of digits, not
all digits zero, and a nonzero result).  Subject to my slightly sloppy
instruction-counting, the cost was:

	digits		MC68000 instructions 	MIPS R2000 instructions
	1 		34			28
	3		39			28
	5		44			28
	7		49			28
	9		54			49
	11		59			49
	...

(I ignored some scaling which the R2000 does inside the 8..15 digit routine 
but the MC68000 does outside its routine.)

If you pick a harder case (the operands aren't known to have the same
number of digits or aren't known to have the decimal point in the same
place; you have to remap the sign nibble because there are multiple
representations; and so on) both implementations become much more expensive;
I didn't have the patience to count instructions to see how they compare.

The moral: if you have an instruction for "add or subtract two packed decimal
strings including encoded signs where the sources and destination may
have disparate lengths with decimal points in different places", it will be a
big help. If you have only "ABCD", you might as well use XOR.
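
For a feel of the work outside the digit loop (sign extraction, sign
comparison, sign remapping, and the negative-zero check), here is a hedged
C sketch. It assumes seven digits plus a sign nibble in the low four bits of
a 32-bit word, with IBM-style sign codes (C and D preferred; A, E, F also
plus; B also minus). The post does not say which encoding either
implementation used, and all names here are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SIGN_PLUS = 0xC, SIGN_MINUS = 0xD };  /* preferred sign codes */

/* Extract the sign nibble and fold the alternate codes onto plus/minus. */
bool sign_is_negative(uint32_t packed)
{
    unsigned sign = packed & 0xFu;
    assert(sign >= 0xAu);            /* 0..9 are digits, not sign codes */
    return sign == 0xBu || sign == 0xDu;
}

/* The seven digit nibbles, right-justified. */
uint32_t digits_of(uint32_t packed)
{
    return packed >> 4;
}

/* An "add" of operands with unlike signs is really a subtraction. */
bool is_effective_subtraction(uint32_t a, uint32_t b)
{
    return sign_is_negative(a) != sign_is_negative(b);
}

/* Deposit a preferred sign nibble, never producing negative zero. */
uint32_t make_packed(uint32_t digits, bool negative)
{
    assert((digits >> 28) == 0);     /* at most seven digit nibbles */
    if (digits == 0)
        negative = false;
    return (digits << 4) | (uint32_t)(negative ? SIGN_MINUS : SIGN_PLUS);
}
```

None of this touches the digit-pair add itself, which is the point: an
instruction like "ABCD" does nothing for these steps.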

-- 
...decwrl!mips!sjc						Steve Correll

larry@mips.UUCP (Larry Weber) (12/12/87)

In article <1101@quacky.UUCP> sjc@mips.UUCP (Steve "The" Correll) writes:
>In article <15782@watmath.waterloo.edu>, ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>	digits		MC68000 instructions 	MIPS R2000 instructions
>	1 		34			28
>	3		39			28
>	5		44			28
>	7		49			28
>	9		54			49
>	11		59			49
>	...
>
Note that these numbers are instruction counts, not cycle counts.  All 68020
instructions take at least 2 cycles (subject to a slight amount of overlap
in the processor).  In fact the ABCD instruction that was used in this example
takes 4 or 5 cycles.  Given equal cycle times on the two machines, you 
would expect a performance difference of a factor of 2 to 4 (depending on
whether you are buying or selling).

Of course if you really must use a 68020 and want the best performance,
you should implement an algorithm along the lines of ours and just
ignore the ABCD instruction.  In other words, treat the 68020 like a 
RISC machine.

This is not as absurd as it may sound.  When the 801 group implemented
its PL.8 compiler for the 370 (after making it generate 801 code), they
treated the 370 like a RISC machine and also got better results from the
370.  Lots of people have pointed out that the fastest way to call a subroutine
in a VAX is to NOT use 'calls' (call subroutine).

-- 
-Larry Weber  DISCLAIMER: I speak only for myself, and I even deny that.
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!larry,   DDD:408-720-1700, x214
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

fotland@hpcupt1.HP.COM (Dave Fotland) (12/24/87)

>For symbolic languages, tagged pointer
>operations are very important.  Typically, tagged dispatch and tagged
>pointer following are done quite a lot, and cutting the number of
>machine instructions to do these things can make quite a difference in
>performance.  Take the example of following a tagged pointer.  If
>the tag is kept in the high few bits of a 32 bit address, one must AND the
>pointer with a 32 bit constant.  Quite a lot of overhead, when a simple
>addressing mode that ignored the high, say, 4 bits of the address would
>do the trick perfectly.  I know this runs contrary to the RISC ideal.
>But on a CISC, this is no more arcane than some of the other addressing
>modes.  

On the HP precision architecture, loads are indexed with index
shifting according to the operand size.  If the upper two bits are the 
tag, and the pointer is 32 bits, the load word will shift the index left 
two bits, and ignore the tag.  No requirement for anding registers and no
extra instructions on this RISC machine!  If a bigger tag is needed,
we only need one extra instruction, an extract, which can be used to
strip off the tag.
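
A small C emulation of the trick, assuming the tagged value holds a 2-bit
tag in the top bits and a word index (not a byte address) in the low 30
bits. The helper names are hypothetical, and the C code only models the
32-bit address arithmetic base + (index << 2); it is not HP-PA code.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 30   /* tag lives in bits 31..30 (assumed layout) */

uint32_t make_tagged(unsigned tag, uint32_t word_index)
{
    assert(tag < 4u && word_index < (1u << TAG_SHIFT));
    return ((uint32_t)tag << TAG_SHIFT) | word_index;
}

unsigned tag_of(uint32_t tagged)
{
    return tagged >> TAG_SHIFT;
}

/* What the scaled-index load computes: shifting left by two multiplies
 * the word index by 4 and pushes the two tag bits off the top of the
 * 32-bit word, so no separate AND is needed. */
uint32_t scaled_offset(uint32_t tagged)
{
    return (uint32_t)(tagged << 2);
}

/* Emulated "load word, indexed": base plus the scaled byte offset. */
uint32_t load_word(const uint32_t *base, uint32_t tagged)
{
    return base[scaled_offset(tagged) / 4u];
}
```

With a tag of 3 and a word index of 5, scaled_offset gives 20 regardless
of the tag, so load_word fetches base[5].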

>Similarly for tagged dispatch.  If there were an instruction to take
>the top, say, 2 bits of a register, shift them right a whole bunch of
>places, add them to a given address, and jump to the address stored
>there.  Sure, this could be done with a shift or rotate, an and, an add,
>and an indirect jump.  But wouldn't you rather do one instruction than
>4?

In HP precision architecture there is exactly the primitive you
want: the extract instruction, which extracts a bit field from
a word and right justifies it in the result.  We paid a lot of
attention to instruction count when we were designing HP-PA, and
we found that operating systems and LISP did a lot of bit field
manipulation so we included extract and its converse, deposit,
in the instruction set.  Your tagged dispatch would be an extract
followed by a load and a branch.  Three instructions, but just as
fast as the microcoded single instruction on the CISC machine.
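
The extract / load / branch sequence can be sketched in C as below. The
field helpers number bits from the least significant end, which is a
simplification chosen here and not the HP-PA convention; the handler names,
the kind enum, and the 2-bit tag layout are likewise assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Extract: pull out `width` bits starting at bit `lsb`, right-justified. */
uint32_t extract_field(uint32_t word, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width <= 32 && lsb + width <= 32);
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (word >> lsb) & mask;
}

/* Deposit: the converse, writing the low `width` bits of `value` into
 * `word` at bit `lsb` and leaving the other bits alone. */
uint32_t deposit_field(uint32_t word, uint32_t value,
                       unsigned lsb, unsigned width)
{
    assert(width >= 1 && width <= 32 && lsb + width <= 32);
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    mask <<= lsb;
    return (word & ~mask) | ((value << lsb) & mask);
}

enum kind { KIND_FIXNUM, KIND_CONS, KIND_SYMBOL, KIND_OTHER };
typedef enum kind (*handler_fn)(uint32_t word);

/* Placeholder handlers; a real system would do type-specific work. */
static enum kind on_fixnum(uint32_t w) { (void)w; return KIND_FIXNUM; }
static enum kind on_cons(uint32_t w)   { (void)w; return KIND_CONS; }
static enum kind on_symbol(uint32_t w) { (void)w; return KIND_SYMBOL; }
static enum kind on_other(uint32_t w)  { (void)w; return KIND_OTHER; }

static const handler_fn dispatch_table[4] = {
    on_fixnum, on_cons, on_symbol, on_other
};

enum kind dispatch(uint32_t word)
{
    uint32_t tag = extract_field(word, 30, 2);  /* 1: extract the tag  */
    handler_fn target = dispatch_table[tag];    /* 2: load the address */
    return target(word);                        /* 3: branch to it     */
}
```

A word with top bits 01, such as 0x40000007, dispatches to on_cons.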

>The point machine designers should take into account is that more and
>more, people are buying general-purpose hardware rather than the
>expensive specialized hardware.  Therefore, they should design their
>machines taking symbolic languages, CAD, and other specialized tasks
>into account.

I agree!  But general purpose does not have to mean CISC!  HP-PA is
a general purpose RISC architecture.

-David Fotland fotland@hpda.HP.COM