[comp.arch] BitBlt, new instructions for RISC.

alan@oz.paradyne.com (02/24/90)

In an email message to me, bbd@rice.edu wrote:
>In article <7398@pdn.paradyne.com> I wrote:
>>Another important case is when drawing vertical lines.
>>Remember, 32 * 32 = 1024; it only takes 32 32-bit words to store a full line
>>of monochrome pixels on a megapixel (1024x1024) screen.

>[On most systems, doesn't this special case occur for horizontal,
>rather than vertical lines?  My next 2 paragraphs assume so...]

>The line only fits into the 32 words if it is going in the proper (eg.
>horizontal) direction.  If the line goes the other way (eg. vertical),
>it takes 1024 32-bit words to store it, if it's stored in the obvious
>way.

To which I attempted to reply via email. However:

   ----- Transcript of session follows -----
>>> RCPT To:<bbd@rice.edu>
<<< 550 <bbd@rice.edu>... User unknown
550 rice.edu!bbd... User unknown

So here is my reply:

I apologize:  I did not realize how easy it would be to misinterpret what I
wrote.  I did not mean to imply that pixels which are vertically sequential
are also stored sequentially in memory.  The point of my comment was how
"narrow" the screen is in terms of words of memory, to make it graphically :-)
obvious that many blits fall within a single word or two.  The point about
vertical lines is that they always (unless they, or the pixels themselves,
are very thick) fall within one or two words horizontally.

>>Also, with burst-mode cache refills and sequential word accesses, the cache
>>miss penalty is not that significant: you only "miss" one word in sixteen
>>(typically).

>Eh?  If I'm reading a character glyph from a bitmap that is
>approximately the size of the screen, then each row of the glyph is
>separated by about 30 32-bit words, right?  That would play hell with
>burst-mode refilled caches, right?

Again, I apologize:  my comment was addressed to the big bitblt case that
the original poster was concerned about.  However, be advised that character
glyphs may be stored one character glyph per character (so that the entire
glyph bitmap fits into a single cache line), or perhaps one whole font face
per glyph.  In any case, characters are normally blitted in a loop which
writes a whole string to the screen, often soon followed by another.  So it 
is likely that the most frequently used characters would have their pixels
in the cache already.

>>then the total loop cycle count is 8, an 11% improvement.  Isn't it normally
>>considered worthwhile to add an instruction if it improves performance by as
>>little as 2 or 3 percent?

>Perhaps it's worthwhile if it improves overall performance by 2 or 3%.
>You've only (theoretically) improved performance of the _blitter_ by
>8% or 11%.  Will the "bmerge" (or whatever) instruction be used
>anywhere else?  Sure, a compiler could be _taught_ to use it, but it
>won't bother to emit this instruction unless the magical pattern "(Rx
>& Ry) | (Rz & ~Ry)" shows up in the intermediate code.  Where else would
>this occur beside BitBlt?

BMERGE would be useful for dealing with unaligned accesses (remember
those?).  That's essentially what BitBlt does:  unaligned accesses.  Most
CPUs are at least byte aligned, if not half-word (68000) or word aligned (RISC).
BitBlt is not necessary on a bit-aligned CPU.
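A minimal C sketch of the merge primitive under discussion, applied to the first-word/last-word edge case of a blit (the names `bmerge` and `write_bits` are illustrative, not from any real ISA):

```c
#include <stdint.h>

/* The proposed BMERGE primitive: take bits of x where the mask is 1,
   bits of z where the mask is 0 -- i.e. (Rx & Ry) | (Rz & ~Ry). */
static uint32_t bmerge(uint32_t x, uint32_t mask, uint32_t z)
{
    return (x & mask) | (z & ~mask);
}

/* Example use: deposit a run of source pixels into a destination word
   without disturbing the bits outside [lo, lo+len) -- exactly the
   partial-word case at the edges of a BitBlt, or the splicing step
   of a bit-unaligned copy. */
static uint32_t write_bits(uint32_t dst, uint32_t src, int lo, int len)
{
    uint32_t mask = (len >= 32) ? 0xFFFFFFFFu
                                : (((1u << len) - 1u) << lo);
    return bmerge(src, mask, dst);
}
```

With BMERGE as a single instruction, the three logical operations in the inner loop collapse to one.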

>> Don't people spend mongo$ to increase the cache size for no more than
>> a 10% performance improvement?

>Again, I think these mongo$ are going for an across-the-board 10%
>performance improvement.

Well, suppose you want to market your CPU as a graphics engine?

____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>

bbc@legia.rice.edu (Benjamin Chase) (02/26/90)

alan@oz.paradyne.com writes:

>In an email message to me, bbd@rice.edu wrote:
>To which I attempted to reply via email. However:
>[it failed]
>So here is my reply:

>I apologize:  I did not realize how easy it would be to misinterpret what I
>wrote.

How painfully true.  I'm bbc@rice.edu (as in British Broadcasting
Company), not bbd@rice.edu.

>  I did not mean to imply that pixels which are vertically sequential
>are also stored sequentially in memory.  The point of my comment was how
>"narrow" the screen is in terms of words of memory, to make it graphically :-)
>obvious that many blits fall within a single word or two.

But the _screen_ isn't narrow.  A modern monochrome display is 32
(32 bit) words wide.

>The point about
>vertical lines is that they always (unless they, or the pixels themselves,
>are very thick) fall within one or two words horizontally.

Yes, and ~1000 words vertically.  And these words are spaced at ~32
intervals.  Thus, when you draw a vertical line, you get a word of
screen memory, perform some operation on it to turn on or off ~1 bit
of that word, and then write it back to screen memory.  Then, you get
the next word, which is 32 words further along, and whoops, it's not
in the data cache, because your silly fetcher got the next 3
consecutive words of screen memory (which you won't be needing right
now because your vertical line is so skinny and all), rather than the
next 3 words spaced at 32 word offsets, which is what you really
wanted it to do if you were drawing a vertical line.
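The access pattern being described can be sketched in C (the 1024x1024 monochrome geometry is from the thread; the function itself is illustrative):

```c
#include <stdint.h>

#define WORDS_PER_LINE 32   /* 1024-pixel-wide, 1 bit/pixel screen */

/* Draw a 1-pixel-wide vertical line.  Each iteration touches exactly
   one 32-bit word, and successive words are WORDS_PER_LINE apart --
   so a cache that fetches the next few *consecutive* words on a miss
   prefetches nothing this loop will use. */
static void vline(uint32_t *fb, int x, int y0, int y1)
{
    uint32_t *p   = fb + (uint32_t)y0 * WORDS_PER_LINE + (x >> 5);
    uint32_t  bit = 0x80000000u >> (x & 31);  /* MSB = leftmost pixel */
    for (int y = y0; y <= y1; y++) {
        *p |= bit;              /* read-modify-write one word */
        p  += WORDS_PER_LINE;   /* stride: one full scanline  */
    }
}
```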

>However, be advised that character
>glyphs may be stored one character glyph per character (so that the entire
>glyph bitmap fits into a single cache line), or perhaps one whole font face
>per glyph.

Fine, I'll certainly agree with "one character glyph per character",
and that for any normal (monochrome) screen resolution and font face
that a row of a character glyph fits into a single cache line, and
usually into a single machine word.

>  In any case, characters are normally blitted in a loop which
>writes a whole string to the screen, often soon followed by another.  So it 
>is likely that the most frequently used characters would have their pixels
>in the cache already.

But that's only half the problem.  What about the screen memory that
isn't in the cache because it won't fit?  (Especially when you're
filling the cache with screen words that you won't be writing upon, as
in my example above.)

>Well, suppose you want to market your CPU as a graphics engine?

That is another story entirely.  I thought this whole thing started by
discussing what sorts of new instructions might fit into the next RISC
processor.  Granted, just because it's a "RISC processor" (TM) :-)
doesn't mean it won't be marketed as a graphics engine.

Now, getting back to "new instructions for RISCs", sort of...  [I
think most of this is from a brain-storming session one recent night
that afflicted me and Preston Briggs. :-) ]

Suppose you're designing a superscalar architecture.  A recent example
of such an architecture allows a branch, an "integer instruction", and
a floating point instruction to be issued all at once.  This seems
like a good idea, especially if your chip set will be used for
heavy-duty number crunching, or for generating really high SPEC marks,
especially on those floating point benchmarks. :-) But if you weren't
so interested in floating point, what else would you put into that
part of the instruction?

Well, probably not graphics operations, per se, since you wouldn't
expect graphics operations to be sprinkled throughout your code, where
the compiler would find them and fold them together with branches and
additions into neat little triplets.  Even counting on a MERGE
instruction to occur throughout your "instruction stream" is a little
extreme, because if all the memory accesses turn out to be aligned,
you've just thrown away a large chunk of your superscalar chip's
potential performance (cf.  relative performance of the IBM 6000s on
integer vs.  floating point benchmarks).

Perhaps in the floating point slot of the superscalar instruction,
you'd put (among other things?) little hints to the cache instead?
Whisper gently into the processor's ear, saying "get the value in
register N, and use that as a stride for yanking words into the cache
until I tell you something different".

Perhaps N contains the width (in machine words) of a bitmap, and this
will cause the cache to get the 4 words out of your bitmap, one
(conceptually) stacked on top of another, so that you can blit 4 rows
of your character glyph into them, before missing again.
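The stride-hint case can at least be approximated in software today: issue an explicit prefetch for the next row's destination word while working on the current one. This sketch uses the GCC/Clang `__builtin_prefetch` builtin as a stand-in for the hypothetical hint slot; the function and its parameters are mine, not anything from the thread.

```c
#include <stdint.h>

/* Software approximation of a strided-prefetch hint: before ORing
   row y of a glyph into the bitmap, prefetch the destination word
   for row y+1, which lies `stride` words (one scanline) away. */
static void blit_glyph_rows(uint32_t *dst, const uint32_t *glyph,
                            int rows, int stride)
{
    for (int y = 0; y < rows; y++) {
        if (y + 1 < rows)
            __builtin_prefetch(dst + stride, 1);  /* 1 = for write */
        *dst |= glyph[y];   /* OR one glyph row into the bitmap */
        dst  += stride;     /* advance one scanline, not one word */
    }
}
```

A hardware hint would do this for every subsequent miss without an instruction per row; the software version costs an issue slot each iteration.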

Or maybe you're addressing an array in the "wrong order" (for a
particular language), not getting consecutive elements, but getting
only one element out of each row or column, before moving onto the
next row or column.  Or maybe you used one of those wacky
array-reshaping commands from FORTRAN 2001...  :-) Then register N
might contain the number of words in a row or column.

Or maybe the correct thing to do in a particular case is tell the
processor not to load anything extra into the cache, just get what's
needed, because the next several instructions will be loading words
from all over the place, and just because location A was loaded is
absolutely no indication that location A+1 will be needed anytime
soon.

Or maybe the current instruction is loading a value that won't be
needed again for a loooong time.  In this case, we'd like to tell the
cache to just hand the word to the processor, and not waste any space
keeping a copy for itself.

A good compiler might notice all of these cases, and construct the
appropriate whisper for that third slot of the superscalar
instruction.

Does anyone else have thoughts on this?  What kinds of things would
_you_ put in the slots of your superscalar chip(s)?
--
	Ben Chase <bbc@rice.edu>, Rice University, Houston, Texas

jesup@cbmvax.commodore.com (Randell Jesup) (02/27/90)

In article <BBC.90Feb25195510@legia.rice.edu> Benjamin Chase <bbc@rice.edu> writes:
>Perhaps in the floating point slot of the superscalar instruction,
>you'd put (among other things?) little hints to the cache instead?
>Whisper gently into the processor's ear, saying "get the value in
>register N, and use that as a stride for yanking words into the cache
>until I tell you something different".

	This idea exists (though not in superscalar form) in the RPM-40
cache and coprocessor design.  The cache control appears to the processor
as a logical coprocessor.  You can fill pipeline interlocks, load delays,
etc with cache setup/hint instructions.  The RPM-40 had lines going to the
cache to tell it which register (if any) a load/store was relative to, which
could be used by the cache to act differently according to register number.

	For example, access relative to register 1 might be sequential,
relative to register 2 might skip X words on each access, register 3 might
have random access habits, etc. (there were a couple of other things about it
you could control also).  For I-Cache, you could control how many entries of
the Branch-Target-Cache (BTC) were handled by LRU, and how many the software
wanted to provide the addresses to cache.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

zs04+@andrew.cmu.edu (Zachary T. Smith) (02/27/90)

For all the engineering it would take (manhours) to get a given RISC
processor to do 'fast' blitting, one might as well wait a few years
until the (semiconductor) industry is capable of putting a blitter on
a 1 megabit VRAM chip (or better yet on a 4 megabit chip), and then
just do that. It'll get done eventually, so why come up with hacks in
the meantime?

-Zach Smith (zs04@andrew.cmu.edu)

PS: Why wouldn't it get done? A blitter that operates on 1024 or 2048
bit lines per op and which can cache fonts in the DRAM as well would
make even a very fast blit processor running w/standard VRAM look like a
toy.

alan@oz.nm.paradyne.com (Alan Lovejoy) (02/28/90)

In article <BBC.90Feb25195510@legia.rice.edu> Benjamin Chase <bbc@rice.edu> writes:
<alan@oz.paradyne.com writes:
<
<>In an email message to me, bbd@rice.edu wrote:
<>To which I attempted to reply via email. However:
<>[it failed]
<>So here is my reply:
<
<>I apologize:  I did not realize how easy it would be to misinterpret what I
<>wrote.
<
<How painfully true.  I'm bbc@rice.edu (as in British Broadcasting
<Company), not bbd@rice.edu.

Hey, shit happens :-) (translation: we all make mistakes, from little typos
to Great Social Blunders.  Peace, bro.).

<>  I did not mean to imply that pixels which are vertically sequential
<>are also stored sequentially in memory.  The point of my comment was how
<>"narrow" the screen is in terms of words of memory, to make it graphically :-)
<>obvious that many blits fall within a single word or two.
<
<But the _screen_ isn't narrow.  A modern monochrome display is 32
<(32 bit) words wide.

Well, I think ~30 words is "narrow."  It's a much less imposing number than
1024, which approximates the number of pixels along the x-axis.  But it
depends on your point of view, I guess.

<>The point about
<>vertical lines is that they always (unless they, or the pixels themselves,
<>are very thick) fall within one or two words horizontally.
<
<Yes, and ~1000 words vertically.  And these words are spaced at ~32
<intervals.  Thus, when you draw a vertical line, you get a word of
<screen memory, perform some operation on it to turn on or off ~1 bit
<of that word, and then write it back to screen memory.  Then, you get
<the next word, which is 32 words further along, and whoops, it's not
<in the data cache, because your silly fetcher got the next 3
<consecutive words of screen memory (which you won't be needing right
<now because your vertical line is so skinny and all), rather than the
<next 3 words spaced at 32 word offsets, which is what you really
<wanted it to do if you were drawing a vertical line.

Yep. That's why one "loads" the word from screen memory before "load"-ing
the word from the source bitmap, so that there are as many cycles as possible
between the load instruction which fetches the destination word and the 
first instruction which accesses that word.  This is where the 88k's multiple
independent functional units, full pipelining and register scoreboarding
really help.  But even so, one may in fact pay some of the performance penalty 
you describe, depending on the details of the situation.  Different hardware 
configurations exact different penalties for cache misses, especially in the 
case of the frame buffer.  Also, if there is a mask (stipple) bitmap, that 
adds  roughly 4 to 6 more cycles between the destination bitmap load and the 
first access of the loaded word.  One can also get REALLY aggressive, and
have different versions of bitblt optimized for the (tall, narrow) and
(short, wide) cases.  But best of all, the code for striding to the next line
(of each bitmap) can execute immediately after the loads for the first words in
those lines, but before accessing the loaded words, adding many cycles between
first-word-in-line loads and first access to those words.  My code gets
really aggressive :-).

But what am I doing giving away my secrets like this!  
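The load-scheduling trick described above looks something like this in C (a sketch under my own naming; the real 88k code would be hand-scheduled assembly, and the shift handling here is simplified):

```c
#include <stdint.h>

/* Hide the destination-load latency: issue the load of the dest word
   FIRST, then do the source fetch, shift, and carry bookkeeping while
   that load is (one hopes) still in flight, and only touch the loaded
   value in the final combine.  On a scoreboarded machine like the
   88100, the intervening work runs under the load. */
static void blit_row(uint32_t *dst, const uint32_t *src,
                     int nwords, int shift)
{
    uint32_t carry = 0;
    for (int i = 0; i < nwords; i++) {
        uint32_t d = dst[i];                      /* load dest first   */
        uint32_t s = src[i];                      /* then source       */
        uint32_t aligned = (s >> shift) | carry;  /* align to dest     */
        carry = (shift == 0) ? 0 : (s << (32 - shift));
        dst[i] = d | aligned;                     /* d used only here  */
    }
}
```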

<>  In any case, characters are normally blitted in a loop which
<>writes a whole string to the screen, often soon followed by another.  So it 
<>is likely that the most frequently used characters would have their pixels
<>in the cache already.
<
<But that's only half the problem.  What about the screen memory that
<isn't in the cache because it won't fit?  (Especially when you're
<filling the cache with screen words that you won't be writing upon, as
<in my example above.)

The usual case FOR ME is to blit a string of characters whose glyphs are
one bit per pixel to a destination bitmap (often the screen) which is 8 bits 
per pixel.  The way I actually do this is to blit the character string to
a temporary one-bit-per-pixel destination bitmap (which can be treated as a
special case), then blit the temporary bitmap to the screen using a 
bit-depth-changing/pixel-value-translating algorithm.  This (bit-depth-changing
blit) algorithm provides even more cycles between the load-word-from-dest 
instruction and the first instruction accessing that word.  And of course,
it tends to shift performance considerations away from first word/last word
issues back toward inner-loop issues.  And it's a very big win, much more so
than one would think.
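A sketch of that depth-changing blit's inner step, expanding a 1 bit/pixel row to 8 bits/pixel via a table of precomputed nibble patterns (the table-driven approach and the fg/bg parameters are my illustration, not Alan's actual code):

```c
#include <stdint.h>
#include <string.h>

static uint32_t expand4[16];   /* nibble -> four 8-bit pixels */

/* Precompute the 16 possible 4-pixel expansions for a given
   foreground/background pixel value. */
static void init_expand(uint8_t fg, uint8_t bg)
{
    for (int n = 0; n < 16; n++) {
        uint8_t px[4];
        for (int b = 0; b < 4; b++)        /* bit 3 = leftmost pixel */
            px[b] = (n & (8 >> b)) ? fg : bg;
        memcpy(&expand4[n], px, 4);
    }
}

/* Expand one 1-bpp source row into an 8-bpp destination row,
   eight pixels (one source byte) per iteration. */
static void expand_row(uint8_t *dst8, const uint8_t *src1, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        uint8_t byte = src1[i];
        memcpy(dst8 + 8*i,     &expand4[byte >> 4], 4);  /* high nibble */
        memcpy(dst8 + 8*i + 4, &expand4[byte & 15], 4);  /* low nibble  */
    }
}
```

Two table lookups and two stores replace eight per-pixel bit tests, which is where the "bigger win than one would think" tends to come from.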

<Now, getting back to "new instructions for RISCs", sort of...  [I
<think most of this is from a brain-storming session one recent night
<that afflicted me and Preston Briggs. :-) ]
<
<Perhaps in the floating point slot of the superscalar instruction,
<you'd put (among other things?) little hints to the cache instead?
<Whisper gently into the processor's ear, saying "get the value in
<register N, and use that as a stride for yanking words into the cache
<until I tell you something different".

Check out the RPM-40.  It does something along these lines.  

In order for hardware to "go faster," it must either "do more work" per cycle 
or go through more cycles per second.  In order to "do more work per cycle," 
it must find more things it can do simultaneously or in parallel.  In order to 
do more things simultaneously/in parallel, it must have more parallel hardware,
more information per instruction, and instruction semantics which are composed 
of more primitive components. Also, the hardware must make fewer assumptions 
about the future, so that its behavior becomes more fully programmable, allowing
greater optimization.  Obviously, the more information the instruction stream 
provides about what will happen in the future, the more work the hardware can 
do in advance (and in parallel with other things) to get ready for it.  

____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>

jesup@cbmvax.commodore.com (Randell Jesup) (03/05/90)

[Sorry if people see this twice; news was broken the first time I posted it.]

In article <AZua2yC00WB5Qqd2FJ@andrew.cmu.edu> zs04+@andrew.cmu.edu (Zachary T. Smith) writes:
>until the (semiconductor) industry is capable of putting a blitter on
>a 1 megabit VRAM chip (or better yet on a 4 megabit chip), and then
>just do that. It'll get done eventually, so why come up with hacks in
>the meantime?
...
>PS: Why wouldn't it get done? A blitter that operates on 1024 or 2048
>bit lines per op and which can cache fonts in the DRAM as well would
>make even a very fast blit processor running w/standard VRAM look like a
>toy.

	I wouldn't hold my breath.  It may happen, but the cost will probably
be prohibitive even if it does.

	Some off-the-cuff technical problems I can think of are:
1) The blitter can only access data on the same "BRAM".
2) It complicates video system design _a lot_, since the font/etc data must
   be stored in non-visible portions of each ram.  This gets bad, since
   nothing normally forces characters to appear on any given alignment.
Why go further?  I think those two alone will kill the idea.

	Personally, I think the correct solution is not to modify RISC cpu
designs to make blitting easy, but to use custom blitter/etc chips in the
video sub-system.  They can be optimized for their intended function more than
any RISC cpu can (since it must also be a good general-purpose cpu).  They
can be isolated from the main CPU, allowing limited multi-processing (tell the
blitter to start, and go off and do other things while waiting).  Isolating
the video system from the main CPU bus means less bus contention (seen by the
CPU), etc.

	This follows the methodology that hardware (CPU/whatever) should be
optimized for its task whenever possible.  You look at the speed (measured
in both blitting speed and overall system throughput) and the cost, and then
decide.

	Of course, I could be biased, since the Amiga uses a custom video
coprocessor, and runs much faster when the CPU runs from ROM/fast-ram, while
the blitter (and display "copper" (coprocessor)) play with "chip" (video) ram.


-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"