[comp.arch] hmmm, BitBlt, New Instructions For RISC

alan@oz.nm.paradyne.com (Alan Lovejoy) (02/23/90)

In article <1990Feb19.174926.7899@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <7404@pdn.paradyne.com> alan@oz.paradyne.com () writes:
>>>[small BitBlts]
>>>overhead of deciding what BitBlt to do completely swamps the BitBlting
>>>itself.  (Credit to Pike, once again, for pointing this out.)
>>
>>Sorry, no.  It only takes roughly 50 cycles (on an 88k) to setup for a blit,
>>if you aren't generating code on the fly (and for the common cases, you
>>don't generate code on the fly).  Since a character blit has a copyrect
>>whose width is 7 to 15 (typically) and whose height is also 7 to 15, that
>>means we need to blit roughly 10 words at roughly 10 cycles per word...
>
>I'd be interested to know exactly what you include under "setup".  Pike's
>numbers showed a much greater disparity the other way, counting *all*
>the overhead.

Consulting my source code reveals the following:

Given two pointers to two bitmaps (source and destination) which are structs
describing a) the pixel width and height of the bitmaps, b) the pixel depth, 
and c) the base address for the bits in the bitmap, and given the two origins
(in source and destination) of the the copyrect and  the extent (width and 
height) of the copyrect, my code can "set up" for the bitblt loop in
38 cycles on an 88k.  Of course, this does not count procedure call overhead
(other than LOCAL (as opposed to caller) register save/restore), and it does
not account for clipping (which takes about 10 cycles if everything
is within proper bounds).  Add a small (less than 10) number of cycles
for figuring out that one of the precooked blits can be used.  Total overhead
with clipping and two levels of procedure call (the first is the generic blit
function which does clipping, determines which blit to use, perhaps generates
the code for it, and then calls it) is very close to 60 cycles.  So "roughly
50 cycles" was a slightly understated approximation, but not by much, 
especially since it is often possible to a) reuse the specific blit function
several times once the correct one has been determined (and perhaps generated),
and b) avoid clipping.

I suspect the major reason for the low cycle count here is the 88k MUL 
instruction, which only has a four cycle latency, and which is fully
pipelined (6 multiplies can be in the pipe at the same time).  There are
NO PIPELINE STALLS in my code!  Since there are 5 multiplies by numbers
which cannot be known at translation time, even an R3000 will have much
worse cycle counts than the 88k for setup (add roughly 30 NO-OP cycles for an
R3000, hundreds of cycles for most other CPUs).  Also, I use EXTU and AND to 
do division and modulus by 32 (the number of bits per word);  if one uses
DIV, cycle counts will of course explode drastically.

____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>

henry@utzoo.uucp (Henry Spencer) (02/24/90)

In article <7452@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>>>... It only takes roughly 50 cycles (on an 88k) to setup for a blit...
>>I'd be interested to know exactly what you include under "setup".  Pike's
>>numbers showed a much greater disparity the other way, counting *all*
>>the overhead.
>
>Consulting my source code reveals the following:
>
>Given two pointers to two bitmaps (source and destination) which are structs
>describing...

Where do you get those two pointers?  For character blitting you'll need
font lookup and possibly some other bits of overhead, depending on exactly
how the stuff is being invoked by higher levels.  The relevant measurement
of overhead for character blitting is "end to end", including absolutely
everything.  However, done well, that's probably not going to add a whole
lot more.  That *is* an impressively fast setup you've got there.
-- 
"The N in NFS stands for Not, |     Henry Spencer at U of Toronto Zoology
or Need, or perhaps Nightmare"| uunet!attcan!utzoo!henry henry@zoo.toronto.edu