[comp.arch] hmmm

mpogue@dg.dg.com (Mike Pogue) (02/20/90)

In article <22609@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>In article <1990Feb18.084034.7015@utzoo.uucp> henry@utzoo.uucp
>(Henry Spencer) writes:

>Another interesting side issue is what happens as screens get denser
>(e.g., 2048x2048 24 bit true color displays showing text): the number of
>bits or bytes moved to draw a character goes up to the point where the
>move starts to dominate again.  When we reach page-quality screens
>(4k x 4k or higher) we will definitely need something fancy (a complete
>4096x4096x24 display is 48 MB, which is a lot to move in a short time).
>Fortunately, we have some time before these displays will become popular. :-)

  Actually, this is exactly what we have found.  On a low-end machine, the
memory-CPU bandwidth is simply not big enough to support BITBLTs on color screens.

  I think DEC found this out, when they looked at where all the time was being
spent on their 2100/3100 workstations.  Their next workstation (or so it has
been reported) supports BITBLT in hardware.

  In fact, the January 29, 1990 issue of EE Times quotes ONLY BITBLT performance
and rectangle fill.  Where are the vectors/second numbers?  Maybe they aren't
any faster than on the raw 2100/3100 hardware....We'll have to wait and see.

  For peppy response, we have found it typically takes about 20M pixels/second
BITBLT speed.  Less than that, and things appear sluggish.  On an 8 bit/pixel
color system, this works out to moving (rotate/merge included) about 20M bytes/second.

  A 24b/p design is even worse, requiring a minimum of 3 * 20M = 60M bytes/second.
Memory in this case is often organized to support 24/32 bits/pixel, which requires
a read and a write per pixel (4 bytes at a time).  This is then 20M reads/second
and 20M writes/second, something most RISCs (remember video memory is not normally 
cached!) really don't want to sit around doing.
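
  As a rough check on the arithmetic above, here is a minimal sketch.  The
function names are made up for illustration, and it assumes every pixel is
stored exactly once per blit (no overdraw, no source fetch counted):

```python
def blit_traffic(pixels_per_sec, bytes_per_pixel):
    """Bytes per second written if each pixel is stored exactly once."""
    return pixels_per_sec * bytes_per_pixel


def framebuffer_bytes(width, height, bits_per_pixel):
    """Size in bytes of a packed frame buffer of the given geometry."""
    return width * height * bits_per_pixel // 8


if __name__ == "__main__":
    rate = 20_000_000  # the 20M pixels/second "peppy" threshold quoted above
    print(blit_traffic(rate, 1))   # 8 bits/pixel
    print(blit_traffic(rate, 3))   # 24 bits/pixel, packed
    print(framebuffer_bytes(4096, 4096, 24) / (1024 * 1024), "MB")
```

  With memory organized as 32 bits per pixel, the same rate means 20M word
reads plus 20M word writes per second whenever the destination has to be
read before it is merged.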

  On the AViiON 300 and 400 and the SparcStation GXP, the BITBLT engine is in hardware.  Doing it
by an instruction in the RISC CPU will NOT be good enough, IMHO, until the clock rates
and the CPU-Video Memory bandwidth go way up.

Mike Pogue
Data General

Speaking for myself....

alan@oz.nm.paradyne.com (Alan Lovejoy) (02/23/90)

In article <9734@cbmvax.commodore.com> jesup@cbmvax.cbm.commodore.com (Randell Jesup) writes:
<In article <7398@pdn.paradyne.com> alan@oz.paradyne.com () writes:
<>>In any case, bitblt is memory bound as it is ("bitblt plays havoc with
<>>data caches" [Pike], after all).  I doubt that a single instruction BMERGE
<>>would help the typical bitblt inner loop very much.
<>
<>But most BitBlt calls are for relatively small regions, such as when blitting
<>character glyphs.  Often, the inner-most loop never executes in such cases,
<>because each pixel row falls entirely within the same word, or only crosses
<>one word boundary.  Another important case is when drawing vertical lines.
<>Remember, 32 * 32 = 1024; it only takes 32 32-bit words to store a full line
<>of monochrome pixels on a megapixel (1024x1024) screen.
<
<	Be careful of one's assumptions.  On some systems, the relative
<frequencies of character (small) blits to animation/gfx (large) blits is
<reversed, or at least a different balance.  Also, once the character speed
<is "acceptable" (a moving target, true), speed of other operations (like
<large-area drawing, line-draw, animation, etc) can become more important to
<the market, even if character display predominates in volume.  This is of
<course because humans can see the rendering speed, and there are threshold
<levels at which perception of speed is different (especially given scan rates
<and persistence of vision).

Animation involves repeatedly blitting the same SOURCE bitmap to the screen.
This increases the likelihood that the source pixels will be in the cache,
one would think.  And the longer each pixel-line is, the more the speed of
the innermost loop dominates overall blit speed.
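
To put numbers on the word-boundary argument, here is a small sketch.  It
assumes a monochrome bitmap with 32-bit words and a blit loop that handles
the first and last word of each row with edge masks, leaving only the words
in between to the inner loop.  The function name is hypothetical:

```python
WORD_BITS = 32  # assumed word size for a 1-bit-per-pixel bitmap


def row_words(x, width, word_bits=WORD_BITS):
    """Return (words touched, interior words) for one row of a blit.

    x is the leftmost destination pixel and width the row length in pixels.
    Interior words are the ones between the two masked edge words, i.e. the
    ones an inner loop would copy whole.
    """
    first = x // word_bits
    last = (x + width - 1) // word_bits
    total = last - first + 1
    interior = max(total - 2, 0)
    return total, interior


if __name__ == "__main__":
    print(row_words(10, 8))     # an 8-pixel glyph row inside one word
    print(row_words(28, 8))     # the same glyph straddling a word boundary
    print(row_words(0, 1024))   # a full scan line on a 1024-wide screen
```

For glyph-sized rows the interior count is zero, so all the time goes to
setup and edge handling; for a full scan line nearly every word is interior.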

<	Also watch out for assumptions about screen depth.  Even character
<blits spend more time in the inner loop on, say, a 24-bit deep screen.

Not necessarily.  Just because the destination bitmap is N bits per pixel
does NOT mean the source bitmap is N bits per pixel, ESPECIALLY in the
case of character glyphs.  Also, I have found that it is more efficient to 
blit an entire string of characters from one-bit-per-pixel character glyphs
to a one-bit-per-pixel temporary bitmap, and only then to blit the temporary
bitmap to the actual multibit-per-pixel destination than it is to do pixel-depth
changing blits a character at a time (2 to 3 times faster!).  This is due to
the special characteristics of the pixel-depth changing bitblt algorithm (or at
least, the one I designed and use), which has a very high setup cost but
is otherwise VERY efficient.  This algorithm translates source pixels into
any arbitrary set of destination pixels as a "free" side effect of the change
in pixel depth.
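
The algorithm itself is not spelled out here.  One common way to get the
same trade-off (expensive setup, cheap inner loop, color translation for
free) is a lookup table indexed by source byte.  The sketch below shows that
table-driven approach as an assumption, not as the algorithm described
above; the names, the 8-bit destination depth, and the leftmost-pixel-in-
bit-7 convention are all illustrative choices:

```python
def build_expand_table(fg, bg):
    """Setup step: map every source byte (8 one-bit pixels) to 8 dest bytes.

    fg and bg are the 8-bit destination pixel values for 1 and 0 bits.
    Bit 7 of the source byte is taken to be the leftmost pixel (assumption).
    """
    table = []
    for value in range(256):
        table.append(bytes(fg if value & (0x80 >> bit) else bg
                           for bit in range(8)))
    return table


def expand_row(src_row, table):
    """Inner loop: one table lookup per source byte, no per-pixel tests."""
    return b"".join(table[b] for b in src_row)


if __name__ == "__main__":
    table = build_expand_table(fg=7, bg=0)     # paid once per color pair
    temp_row = bytes([0b10100000, 0b11111111]) # row of the 1bpp temp bitmap
    print(list(expand_row(temp_row, table)))
```

Building the table costs 256 entries up front, which is why it pays to
compose the whole string at one bit per pixel first and expand only once.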

<	Where special hardware is useful is for non-special cases, like
<diagonal lines - look at the difference between horizontal/vertical line
<draw to diagonal linedraw speed on machines such as the Mac and the Amiga.
<The Mac is very fast at horiz/vert line-draw, but (relatively) slow at
<diagonals (by a large factor).  The Amiga shows _much_ less degradation on
<diagonals due to the built-in blitter with line-draw.

But what is the relative frequency with which diagonal lines are drawn as
compared to non-diagonals?  RISC teaches that one should optimize for the more 
frequent cases.

____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>

jesup@cbmvax.commodore.com (Randell Jesup) (02/24/90)

In article <7453@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>In article <9734@cbmvax.commodore.com> jesup@cbmvax.cbm.commodore.com (Randell Jesup) writes:
><	Be careful of one's assumptions.  On some systems, the relative
><frequencies of character (small) blits to animation/gfx (large) blits is
><reversed, or at least a different balance.  Also, once the character speed
><is "acceptable" (a moving target, true), speed of other operations (like
><large-area drawing, line-draw, animation, etc) can become more important to
><the market, even if character display predominates in volume.  This is of
><course because humans can see the rendering speed, and there are threshold
><levels at which perception of speed is different (especially given scan rates
><and persistence of vision).
>
>Animation involves repeatedly blitting the same SOURCE bitmap to the screen.
>This increases the likelihood that the source pixels will be in the cache,
>one would think.  And the longer each pixel-line is, the more the speed of
>the innermost loop dominates overall blit speed.

	Highly unlikely.  Often the bitmaps are different, and the bitmaps
are often far larger than reasonable caches (even if nothing but blitting is
going on, which is not normally the case).

	We agree that it increases the importance of the inner loop (a lot).
Scrolling and moving rectangles are also important (depending on use, of
course).

><	Also watch out for assumptions about screen depth.  Even character
><blits spend more time in the inner loop on, say, a 24-bit deep screen.
>
>Not necessarily.  Just because the destination bitmap is N bits per pixel
>does NOT mean the source bitmap is N bits per pixel, ESPECIALLY in the
>case of character glyphs.

	True.  Not all screens are "chunky", either - some are bitplanes.
However, you still need to modify all the bits of the pixels you're dealing
with, whether you write 1's or 0's.
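
	A small sketch of the bitplane case, to make the point concrete.  It
assumes each plane is a separate byte array with the leftmost pixel in bit 7;
the function name and layout are illustrative only, not any particular
machine's:

```python
def set_pixel_planar(planes, row_bytes, x, y, color):
    """Write one pixel of the given color into a planar bitmap.

    planes is a list of bytearrays, one per bit of pixel depth.  Every plane
    is touched: a 1 bit is set or a 0 bit is cleared, so the cost grows with
    depth even when the source glyph is only one bit deep.
    Returns the number of planes written.
    """
    offset = y * row_bytes + x // 8
    mask = 0x80 >> (x % 8)
    for plane_index, plane in enumerate(planes):
        if (color >> plane_index) & 1:
            plane[offset] |= mask
        else:
            plane[offset] &= ~mask & 0xFF
    return len(planes)


if __name__ == "__main__":
    planes = [bytearray([0xFF]) for _ in range(4)]  # 4 planes, 8x1 pixels
    print(set_pixel_planar(planes, 1, 0, 0, 0b0101))
    print([hex(p[0]) for p in planes])
```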

><	Where special hardware is useful is for non-special cases, like
><diagonal lines - look at the difference between horizontal/vertical line
><draw to diagonal linedraw speed on machines such as the Mac and the Amiga.
><The Mac is very fast at horiz/vert line-draw, but (relatively) slow at
><diagonals (by a large factor).  The Amiga shows _much_ less degradation on
><diagonals due to the built-in blitter with line-draw.
>
>But what is the relative frequency with which diagonal lines are drawn as
>compared to non-diagonals?  RISC teaches that one should optimize for the more 
>frequent cases.

	It depends on usage.  And usage often depends on what's fast already.
And, as I said, unlike program execution, human perception of time is
non-linear, and also has other effects due to being able to see partially
completed renderings (often).

	The optimal bitblit for a character-only display is different than that
for one that has window borders but no patterns, which is different from one
with patterns, which is different from one that displays icons, which is
different from one that is used to display CAD images, etc, etc, etc.  Because of
this, it's important to consider all applications of your hardware.  The effect
is more pronounced than in most general instruction set design issues, but it's
there, too.  If you were here, remember the discussions between myself and
John Mashey about R2000 vs RPM-40, and the fact that we had to first define
the environment (i.e. type of programs) before doing comparisons.  R2000 is
a better match for Unix than the RPM-40 (given the same tech), RPM-40 is better
for embedded and small systems (which were a major factor in its design).

	Designing to handle the general cases well can bring great flexibility
to the use of your hardware, even if you give up a little in some special
cases (and you may not have to).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"