[comp.arch] Architecture questions

cik@l.cc.purdue.edu (Herman Rubin) (09/06/90)

In article <10057@goofy.Apple.COM>, pauls@apple.com (Paul Sweazey) writes:

			.......................

> It is obvious to most that today's computing systems would not get faster 
> by having the hardware support bit-level addressing, much less an infinite 
> number of data types.  But there are entire classes of systems that don't 
> get explored (or even conceived of) because they don't map well into the 
> architectural constraints of computing systems today.

I disagree.  It is the case, of course, that if the computations are done by
the current methods with the current structures, they are likely not to
improve if other hardware is added.  But would the same algorithms be used
with these other hardware facilities?

> I am mildly surprised that brainstorms about revolutionary architectures 
> are rare in comp.arch.  We could be just a brainstorm away from the next 
> killer-micro-wave.

I am not at all surprised.  Look at the vitriol against those who suggest
that integer arithmetic is for more than addresses and loop counting.  Look
at the arguments that including such-and-such would break the current 
compilers, and conflict with the current languages.

On and off, I have suggested numerous operations which have arisen from
computational problems in probability and statistics which would be easy
in hardware and are at least moderately difficult in software.  The common
response (there have been a few exceptions) is that nobody should want
those things.  We need, instead, the view that if something can be done
easily in hardware it should, and that if the concept is not in the language,
that is the fault of the language.

This group really should be concerned with the last quoted paragraph, and
not with how to limit hardware and software.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

rwallace@vax1.tcd.ie (09/08/90)

In article <2516@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
>> It is obvious to most that today's computing systems would not get faster 
>> by having the hardware support bit-level addressing, much less an infinite 
>> number of data types.  But there are entire classes of systems that don't 
>> get explored (or even conceived of) because they don't map well into the 
>> architectural constraints of computing systems today.
> 
> I disagree.  It is the case, of course, that if the computations are done by
> the current methods with the current structures, they are likely not to
> improve if other hardware is added.  But would the same algorithms be used
> with these other hardware facilities?
> 
> On and off, I have suggested numerous operations which have arisen from
> computational problems in probability and statistics which would be easy
> in hardware and are at least moderately difficult in software.  The common
> response (there have been a few exceptions) is that nobody should want
> those things.  We need, instead, the view that if something can be done
> easily in hardware it should, and that if the concept is not in the language,
> that is the fault of the language.

Consider the following kinds of operations:

1. Scientific and engineering work (floating-point)

2. Text processing (byte addressing)

3. Transaction processing (floating-point, bytes and fixed-length integers)

4. Graphics (best done in chunks as large as the CPU can handle, using barrel
   shifters and masking)

5. Compiling (integers and pointers)

 - in no particular order. These make up something over 99% of all computation.
Hence hardware is designed to make them fast. This means dealing with
floating-point, integers and bytes fast. To be sure you could probably list 50
different kinds of algorithms that would benefit from bit addressing but these
make up less than 1% of the computation that gets done, and the extra hardware
would slow down other operations and increase costs. Why should 99% of users
pay in money and time for something that will improve processing speed for 1%?
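
To make point 4 concrete: the word-at-a-time style amounts to masking at
the edges of a span and storing whole words in the middle.  A sketch in
C (the names and the MSB-first pixel convention are illustrative only):

	#include <stdint.h>

	/* Fill pixels [x0, x1) of one row of a 1-bit-deep bitmap,
	   touching whole 32-bit words and masking only at the edges. */
	void fill_span(uint32_t *row, int x0, int x1)
	{
		uint32_t lmask = 0xFFFFFFFFu >> (x0 & 31);
		uint32_t rmask = ~(0xFFFFFFFFu >> (x1 & 31));
		int w0 = x0 >> 5, w1 = x1 >> 5;

		if (w0 == w1) {			/* span within one word */
			row[w0] |= lmask & rmask;
			return;
		}
		row[w0++] |= lmask;		/* ragged left edge */
		while (w0 < w1)			/* solid middle words */
			row[w0++] = 0xFFFFFFFFu;
		if (rmask)			/* ragged right edge */
			row[w1] |= rmask;
	}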

-- 
"To summarize the summary of the summary: people are a problem"
Russell Wallace, Trinity College, Dublin
rwallace@vax1.tcd.ie

aglew@crhc.uiuc.edu (Andy Glew) (09/09/90)

>Consider the following kinds of operations:
>1. Scientific and engineering work (floating-point)
>2. Text processing (byte addressing)
>3. Transaction processing (floating-point, bytes and fixed-length integers)
>4. Graphics (best done in chunks as large as the CPU can handle, using barrel
>   shifters and masking)
>5. Compiling (integers and pointers)
> - in no particular order. These make up something over 99% of all computation.

You'd probably do well to add an extra entry for database queries and
decision support to this list (that is, if that 99% is supposed to be
weighted by importance, or by the reasons people buy computers).  Yes,
that's distinct from, although intimately related to, transaction
processing, but it tends to have different characteristics.
    Transaction processing is stuff like handling lots of ATMs.
Database queries and decision support is something like the bank
president requesting a summary of all ATM traffic records, looking for
all ATMs that do 20% of their business in rush hours in denominations
of $20 or less.
    DeWitt (UW Madison), Mr. Database, says large queries may be only a
small portion of database activities, but transactions are fast enough
already; queries are the most important part.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

dave@tygra.UUCP (David Conrad) (09/09/90)

In article <6838.26e7f109@vax1.tcd.ie> rwallace@vax1.tcd.ie writes:
;Consider the following kinds of operations:
;1. Scientific and engineering work (floating-point)
;2. Text processing (byte addressing)
;3. Transaction processing (floating-point, bytes and fixed-length integers)
;4. Graphics (best done in chunks as large as the CPU can handle, using barrel
;   shifters and masking)
;5. Compiling (integers and pointers)
;
; - in no particular order. These make up something over 99% of all computation.
;Hence hardware is designed to make them fast. This means dealing with
;floating-point, integers and bytes fast. To be sure you could probably list 50
;different kinds of algorithms that would benefit from bit addressing but these
;make up less than 1% of the computation that gets done, and the extra hardware
;would slow down other operations and increase costs. Why should 99% of users
;pay in money and time for something that will improve processing speed for 1%?
;-- 
;Russell Wallace, Trinity College, Dublin

I'll only list two: Huffman coding and LZW compression.  But notice, every one
of the things you've mentioned typically uses large disk files.  And I've listed
data compression algorithms.  If data could be uncompressed on the fly faster,
then this would be done more often - with a decrease in the disk I/O needed by
these operations, paid for in computation time.  Disk reads vs.
number-crunching cycles.  My guess is there would be a net gain in efficiency
there.
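
To see why bit operations dominate here, consider the inner loop of a
Huffman decoder, which is nothing but single-bit fetches.  A sketch in
C, with all names illustrative:

	struct node {
		int is_leaf, symbol;
		struct node *left, *right;
	};

	/* Walk the code tree one input bit at a time until a leaf is
	   reached; next_bit() is whatever bit extraction the machine
	   makes cheap (or expensive). */
	int decode_symbol(const struct node *tree, int (*next_bit)(void))
	{
		const struct node *n = tree;

		while (!n->is_leaf)
			n = next_bit() ? n->right : n->left;
		return n->symbol;
	}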
--
David Conrad                    |     "To Do Is To Be" -- Socrates
dave@tygra.ddmi.com             |     "To Be Is To Do" -- Plato
I cannot prevent the below.     |     "Do Be Do Be Do" -- Sinatra
-- 
=  CAT-TALK Conferencing Network, Prototype Computer Conferencing System  =
-  1-800-825-3069, 300/1200/2400/9600 baud, 8/N/1. New users use 'new'    - 
=  as a login id.    <<Redistribution to GEnie PROHIBITED!!!>>>           =
E-MAIL Address: dave@ThunderCat.COM

peter@ficc.ferranti.com (Peter da Silva) (09/09/90)

In article <6838.26e7f109@vax1.tcd.ie> rwallace@vax1.tcd.ie writes:
> 4. Graphics (best done in chunks as large as the CPU can handle, using barrel
>    shifters and masking)

That depends. Bit pointers could be used to make any graphics
operations that work on a pixel level (as opposed to BitBlt) a whole
lot faster. Rotated images, for example... BitBlt can be used (fairly
inefficiently) to rotate by 90 degree increments, but you need to operate
on a pixel level otherwise.
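
Rotation by an arbitrary angle means touching every pixel individually,
and each access is a shift-and-mask dance.  A sketch in C (conventions
are illustrative):

	#include <stdint.h>

	/* Read one pixel of a 1-bit-deep bitmap; rows are MSB-first,
	   stride_words 32-bit words long. */
	int get_pixel(const uint32_t *bitmap, int stride_words, int x, int y)
	{
		uint32_t w = bitmap[y * stride_words + (x >> 5)];

		return (w >> (31 - (x & 31))) & 1;
	}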

How much bit-banging is done on a Mac, or in the rendering engine of your
laser printer?

I suspect, myself, that communications, statistics, and finance consume
the majority of computational resources.  Compilers aren't even in the noise.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com

cik@l.cc.purdue.edu (Herman Rubin) (09/10/90)

In article <6838.26e7f109@vax1.tcd.ie>, rwallace@vax1.tcd.ie writes:
> In article <2516@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:

			........................

> Consider the following kinds of operations:
> 
> 1. Scientific and engineering work (floating-point)
> 
> 2. Text processing (byte addressing)
> 
> 3. Transaction processing (floating-point, bytes and fixed-length integers)
> 
> 4. Graphics (best done in chunks as large as the CPU can handle, using barrel
>    shifters and masking)
> 
> 5. Compiling (integers and pointers)
> 
>  - in no particular order. These make up something over 99% of all computation.
> Hence hardware is designed to make them fast. This means dealing with
> floating-point, integers and bytes fast. To be sure you could probably list 50
> different kinds of algorithms that would benefit from bit addressing but these
> make up less than 1% of the computation that gets done, and the extra hardware
> would slow down other operations and increase costs. Why should 99% of users
> pay in money and time for something that will improve processing speed for 1%?

Every trigonometric and exponential routine starts out with a division, getting
an integer quotient and a remainder.  Usually, these are heavily kludged,
because of the inadequacy of the hardware.  Even the multiplication by the
quotient is kludged.
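
In C the step looks something like this (a minimal sketch of range
reduction for sin(x), with names of my choosing; real libraries do the
multiply-and-subtract in extended precision, which is exactly where the
kludging comes in, and the naive version below loses bits for large x):

	#include <math.h>

	static const double HALF_PI = 1.57079632679489661923;

	/* The integer quotient selects the quadrant; the remainder is
	   the reduced argument handed to the polynomial approximation. */
	int reduce(double x, double *r)
	{
		double k = floor(x / HALF_PI);	/* integer quotient */

		*r = x - k * HALF_PI;		/* remainder in [0, pi/2) */
		return (int)k & 3;		/* quadrant, 0..3 */
	}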

Text processing, for anyone who needs more than ASCII, is in a very primitive
state.  Even ASCII processing is limited.  Again, hardware instructions would
help.

There are quite a few situations in transaction processing which could use a
few intelligent instructions, like reading from a buffer with an automatic
interrupt when the buffer becomes empty.

People producing graphics software have complained loudly about the lack of
fixed-point arithmetic.

Compilers and languages are weak, and those who live by them seem incapable of
understanding that anything else can and should be done.

What about simulation (Monte Carlo)?  I have methods which I would not suggest
as economical without certain fast bit operations.

The question of cost is relevant.  With the cost of the entire ALU being a
rather small part of the cost of the machine, and with the units wanted
having a great deal of overlap with the present units, I doubt that more
flexible hardware would cost even 1% more.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

guy@auspex.auspex.com (Guy Harris) (09/11/90)

>Text processing, for anyone who needs more than ASCII, is in a very primitive
>state.  Even ASCII processing is limited.  Again, hardware instructions would
>help.

Such as?

>There are quite a few situations in transaction processing which could use a
>few intelligent instructions, like reading from a buffer with an automatic
>interrupt when the buffer becomes empty.

"Reading from a buffer" in what sense?  Is this just an in-memory buffer
being read by a transaction processing application - in which case I
don't see how an *interrupt* would help, as a dumb old conditional
branch testing whether the number of characters left in the buffer was
zero would probably be *faster* than an interrupt (no tons of context to
save, etc.), or is it something else?

stephen@estragon.uchicago.edu (Stephen P Spackman) (09/11/90)

In article <4043@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
   >There are quite a few situations in transaction processing which could use a
   >few intelligent instructions, like reading from a buffer with an automatic
   >interrupt when the buffer becomes empty.

This operation localises better in the MMU than in the CPU (which is
nice). It's not conceptually difficult to do, either.

   "Reading from a buffer" in what sense?  Is this just an in-memory buffer
   being read by a transaction processing application - in which case I
   don't see how an *interrupt* would help, as a dumb old conditional
   branch testing whether the number of characters left in the buffer was
   zero would probably be *faster* than an interrupt (no tons of context to
   save, etc.), or is it something else?

Er. You have an 8k buffer. Each branch takes 1 cycle to test, 1 cycle
if not taken. There goes 16k cycles. You're telling me that servicing
a trap is going to take 16k cycles?

stephen p spackman  stephen@estragon.uchicago.edu  312.702.3982

d6b@psuecl.bitnet (09/12/90)

In article <4043@auspex.auspex.com>, guy@auspex.auspex.com (Guy Harris) writes:
>>Text processing, for anyone who needs more than ASCII, is in a very primitive
>>state.  Even ASCII processing is limited.  Again, hardware instructions would
>>help.
>
> Such as?

Typically you only use 8 bits of your 32- or 64-bit registers/data bus when
manipulating text. Obviously you can add instructions to improve this
situation, and there are several examples in the real world of this being
done. (The Am29000 and the -gasp- Intel chips come to mind).

-- Dan Babcock
How do great minds get great ideas? With a little help from their Amigas!!!

guy@auspex.auspex.com (Guy Harris) (09/13/90)

>   "Reading from a buffer" in what sense?  Is this just an in-memory buffer
>   being read by a transaction processing application - in which case I
>   don't see how an *interrupt* would help, as a dumb old conditional
>   branch testing whether the number of characters left in the buffer was
>   zero would probably be *faster* than an interrupt (no tons of context to
>   save, etc.), or is it something else?
>
>Er. You have an 8k buffer. Each branch takes 1 cycle to test, 1 cycle
>if not taken. There goes 16k cycles. You're telling me that servicing
>a trap is going to take 16k cycles?

OK, so what is the extra magic instruction here?  I assume it basically
amounts to a block move; if the basic "reading from the buffer"
operation is done with a loop, only terminated with a trap instead of a
conditional branch, the question is whether servicing the trap takes
longer than the conditional branch at loop termination (which may be an
untaken branch, depending on the code in the loop).

A quick look at e.g.  the 68020 book indicates that a byte-offset
conditional branch with the branch not taken takes as much time as a
conditional trap (with no argument) if the trap isn't taken, but if the
trap *is* taken it takes a fair bit longer.

I suspect the trap is unlikely to win if it's just a conditional trap
instruction.  Therefore, I assume the instruction is a block move.

So, if it's a block move, what's the point of the trap? Just have the
instruction do a conditional branch, which, again, will probably take
fewer cycles than the trap.

(We don't even count any cycles spent in the trap service routine in
either case.)

Thus, the interrupt doesn't help in either case.

The question then is whether the *rest* of the instruction helps.  If
you can run the copy loop at memory speed (we assume here that the block
move can run at memory speed) by building it out of ordinary boring
instructions, it probably won't help.  If you can e.g. stuff the actual
fetch-from-the-buffer instruction into the delay slot on a machine with
branch delays, you might be able to come close enough to the block move
instruction (or do as well) - and we haven't even considered unrolling
the loop here.

Of course, the other possibility might be to have the application use
"locate mode", and just use the data as it is in the buffer rather than
copying it out, if that's possible....
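
The boring software answer, for what it's worth, amortizes the test the
way stdio's getc() does: one decrement-and-compare per item on the
common path, with the refill taken once per bufferful.  A sketch (names
are mine, not any particular library's):

	struct inbuf {
		int cnt;		/* items remaining in the buffer */
		unsigned char *ptr;	/* next item */
	};

	/* Reload the buffer, reset cnt and ptr, return the first item;
	   this is the rare path that the trap proposal would replace. */
	int refill(struct inbuf *b);

	/* Common path: one test, one fetch. */
	#define GETB(b) ((b)->cnt-- > 0 ? (int)*(b)->ptr++ : refill(b))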

guy@auspex.auspex.com (Guy Harris) (09/13/90)

>Typically you only use 8 bits of your 32- or 64-bit registers/data bus when
>manipulating text. Obviously you can add instructions to improve this
>situation, and there are several examples in the real world of this being
>done. (The Am29000 and the -gasp- Intel chips come to mind).

Or you can *not* add them, and just do it anyway.  I seem to remember
the 29K having an instruction to test any of the bytes in a word for
zero; other systems do this in software - does the 29K do it faster?
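
The software version, for reference, is the usual word trick.  A sketch
in C (the constant patterns are the point; a nonzero result means some
byte of x is zero):

	#include <stdint.h>

	int has_zero_byte(uint32_t x)
	{
		/* Subtracting 1 from every byte can set a byte's high
		   bit only where a borrow occurred; masking with ~x
		   throws out bytes that already had their high bit set,
		   leaving a nonzero result exactly when some byte is 0. */
		return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
	}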

cik@l.cc.purdue.edu (Herman Rubin) (09/13/90)

In article <4056@auspex.auspex.com>, guy@auspex.auspex.com (Guy Harris) writes:
> >   "Reading from a buffer" in what sense?  Is this just an in-memory buffer
> >   being read by a transaction processing application - in which case I
> >   don't see how an *interrupt* would help, as a dumb old conditional
> >   branch testing whether the number of characters left in the buffer was
> >   zero would probably be *faster* than an interrupt (no tons of context to
> >   save, etc.), or is it something else?
> >
> >Er. You have an 8k buffer. Each branch takes 1 cycle to test, 1 cycle
> >if not taken. There goes 16k cycles. You're telling me that servicing
> >a trap is going to take 16k cycles?
> 
> OK, so what is the extra magic instruction here?  I assume it basically
> amounts to a block move; if the basic "reading from the buffer"
> operation is done with a loop, only terminated with a trap instead of a
> conditional branch, the question is whether servicing the trap takes
> longer than the conditional branch at loop termination (which may be an
> untaken branch, depending on the code in the loop).

In computer generation of random variables, essentially all efficient methods
take a random number of input items.  These items may even be bits.  I believe
that this is an obvious case of a buffered read.  Keep in mind that in a 
Monte Carlo calculation, millions are not that large.

Now it is possible to check at certain times the status of the buffer, and to
have policies to greatly reduce the frequency of a buffer being empty, even
to the extent that this will rarely happen at all.  But it must be provided
for.  It is not at all difficult to have a situation where billions of reads
can be performed, with the probability of a read finding the buffer empty 
being so low that the great bulk of runs will never have an interrupt.

If the items being read are bits, the problems are even worse.  One wants the
bits to be in fast memory, such as registers.  Now we have the following 
situation, to read one bit:

	if (nbits == 0) {		/* current word is used up */
		if (nbuf == 0)
			BUFREFILL;	/* the (rare) refill/interrupt case */
		word = buffer[--nbuf];	/* read next word from the buffer */
		nbits = REGSIZE;
	}
	nbits -= 1;
	bit = word & 1;			/* get the bottom bit */
	word >>= 1;			/* so the next bit is ready next time */

and then use the bit.  If we wish to do a conditional transfer on the bit,
a very common situation, shifting and then branching on carry or overflow
would eliminate one operation, if such tests were available.  They are not
present in the HLLs; are they even present in the architecture any more?

The algorithms which would use this are efficient from the standpoint that
few bits are used, and only simple computations are made (for this purpose,
a multiplication is not simple).  Since we are interested in random output,
the result of this operation is not needed per se--a totally different result
generated by a totally different algorithm can be used.  Those far more
complicated algorithms (from the standpoint of the amount of actual bit
processing which is done) will run far faster on present hardware, so the
bit-efficient algorithms are not even going to be coded.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

rwallace@vax1.tcd.ie (09/14/90)

In article <390@tygra.UUCP>, dave@tygra.UUCP (David Conrad) writes:
> I'll only list two: Huffman coding and LZW compression.  But notice, every one
> of the things you've mentioned typically uses large disk files.  And I've listed
> data compression algorithms.  If data could be uncompressed on the fly faster,
> then this would be done more often - with a decrease in the disk I/O needed by
> these operations, paid for in computation time.  Disk reads vs.
> number-crunching cycles.  My guess is there would be a net gain in efficiency
> there.
> --

To make Huffman more efficient you don't just need bit pointers, you need bit
pointers that can overlap a word boundary. I'm not familiar with the LZW
algorithm but I suspect the same thing applies. To implement this you'd need so
much extra hardware in the data fetch path that the reduction in speed in
instruction fetches, counter operations, word transfers to/from disk buffers
etc. would probably more than outweigh the speed increase in bit manipulation.
In other words, even data compression could end up SLOWER with non-aligned bit
pointers. And as I said I don't believe there are significant uses for aligned
bit pointers (which would still slow down the chip but not as much).
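
For reference, the software equivalent of an unaligned bit pointer is a
multi-byte gather and shift, something like this sketch in C (names are
mine; it assumes the buffer is padded so the gather can't run off the
end):

	#include <stdint.h>
	#include <stddef.h>

	/* Extract n bits (1 <= n <= 31) starting at bit offset off,
	   MSB first, allowing the field to straddle word boundaries. */
	uint32_t get_bits(const uint8_t *buf, size_t off, int n)
	{
		uint64_t w = 0;
		size_t byte = off >> 3;
		int i;

		for (i = 0; i < 8; i++)		/* gather 64 bits */
			w = (w << 8) | buf[byte + i];
		return (uint32_t)(w >> (64 - (int)(off & 7) - n))
		       & ((1u << n) - 1);
	}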

-- 
"To summarize the summary of the summary: people are a problem"
Russell Wallace, Trinity College, Dublin
rwallace@vax1.tcd.ie

d6b@psuecl.bitnet (09/14/90)

In article <4057@auspex.auspex.com>, guy@auspex.auspex.com (Guy Harris) writes:
>>Typically you only use 8 bits of your 32- or 64-bit registers/data bus when
>>manipulating text. Obviously you can add instructions to improve this
>>situation, and there are several examples in the real world of this being
>>done. (The Am29000 and the -gasp- Intel chips come to mind).
>
> Or you can *not* add them, and just do it anyway.  I seem to remember
> the 29K having an instruction to test any of the bytes in a word for
> zero; other systems do this in software - does the 29K do it faster?

Certainly. You can "often" speed things up by adding hardware. After all,
the reason why RISC chips are faster than typical CISC chips is that
they use 32-bit fixed-field opcodes and immediate register address
decoding, not because they have fewer instructions...(am I asking for
a flame here? :-)) Actually, a lot of so-called RISC chips have
instructions that are more complicated than those of CISC chips, simply by
stacking things up in parallel (once again using the larger opcode size
to advantage). Obviously if you're just doing scientific computation, or
some such, text-handling support is superfluous. Adding features does
typically slow things down, but I claim that if one is sufficiently
clever the overall speed decrease can be made very small. Of course
there's also an opportunity cost involved, but what the heck - text
manipulation is pretty common.

-- Dan Babcock