[comp.arch] Sw vs. Hw BitBlit

pf@diab.se (Per Fogelström) (07/26/88)

These are my very own reflections on the recent discussion.

I have during my days worked with a few software- and hardware-driven
graphics systems. And I cannot, from what I have learned until now,
accept that a software BitBlit done by a general-purpose micro (for example
a 68K) can be faster than a "hardware" one.

If we assume that the speed of the BitBlit is dependent on the memory
bandwidth, all bus cycles doing anything other than moving data are overhead.
Agree? Okay, so even if a 68k is doing a "Blit" in straight code,
it will consume some memory bandwidth. A microprogrammed "hardware" blit
will be able to use every memory cycle for data accesses, thus having a much
higher transfer rate. Correct me if I'm wrong...

I know that the Amiga has been discussed. From what I know,
the Amiga has the graphics part of the main memory isolated from the rest of
the bus. This means that the Blitter in the Amiga can do a Blit without
disturbing the 68k CPU. And even if the 68k accesses the graphics memory,
the blitter will still have 50% of the available bandwidth left for its work,
because the 68k only needs the other 50%. (Hope I'm not too misinformed.)

Assuming we have a fast micro (a 68030 or an NS32535), they would at least be
supported by their on-chip caches. Even if the hit rate in these caches is
as low as 50%, an external hardware Blitter could use the other 50% for its
work, and the micro would run at full speed. So the CPU does not have to
wait for the memory.

Someone pointed out that placing characters is the main work for the BitBlit.
Yes, that is correct in some systems and this is a problem in many cases.
Placing a character can take from 20 microseconds and up, and the CPU has to
wait for the blitter to be ready before placing the next character. Then
20 microseconds is too little for a context switch and the CPU is wasting time
waiting. But 20-30 microseconds is at least faster than the CPU can place
the character anyway. And if the graphics engine is smart enough it can fetch
characters from the buffer or main memory itself, offloading the micro
until it needs some help with a special character (scrolling etc.).

And talking about scrolling! Everyone who has seen a Sun scroll must agree
that the screen must be scrolled in less than one frame (1/70 of a second)
to make a pleasant impression. I would like to see a 68k do that with
a 1k x 1k x 8 plane display in one frame time.
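
For a sense of scale, the memory traffic this implies can be estimated with a little arithmetic. The sketch below assumes that a full-screen scroll reads and writes every byte of the frame buffer exactly once, and it ignores instruction fetches and display refresh; the function name is purely illustrative.

```c
/* Bytes of memory traffic per second needed to scroll a whole screen
 * once per frame: every byte is read once and written once.
 * Assumption: packed planes, no other bus activity. */
double scroll_bandwidth_bytes_per_sec(long width, long height,
                                      int planes, double frames_per_sec)
{
    double screen_bytes = (double)width * (double)height * planes / 8.0;
    return 2.0 * screen_bytes * frames_per_sec;
}
```

For 1024 x 1024 pixels, 8 planes and 70 frames per second this gives 2 x 1,048,576 x 70 = 146,800,640 bytes per second, i.e. roughly 147 MB/s of sustained traffic.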

Another argument I have heard is "But look at the Mac, it's CPU-driven and it's
fast!". Well, I just have one answer to that: what is the display resolution?

OK, start flaming, I'm ready, but beware, my blitter is fast!!

jesup@cbmvax.UUCP (Randell Jesup) (07/28/88)

In article <399@ma.diab.se> pf@ma.UUCP (Per Fogelström) writes:
>Someone pointed out that placing characters is the main work for the BitBlit.
>Yes, that is correct in some systems and this is a problem in many cases.
>Placing a character can take from 20 microseconds and up, and the CPU has to
>wait for the blitter to be ready before placing the next character. Then
>20 microseconds is too little for a context switch and the CPU is wasting time
>waiting. But 20-30 microseconds is at least faster than the CPU can place
>the character anyway. And if the graphics engine is smart enough it can fetch
>characters from the buffer or main memory itself, offloading the micro
>until it needs some help with a special character (scrolling etc.).

	The only trouble with doing characters with a blitter is that setup
time is usually longer than the actual blit (at least for 8N-wide
fixed-width fonts that happen to be aligned on "nice" boundaries).  For this
reason, 1.3 of the Amiga OS has something that intercepts Text() calls and
renders them via the CPU if the font is 8 pixels wide and aligned on a byte
boundary.  This makes things like emacs really blazingly fast, with no
discernible rendering time.  The blitter and normal software handle the more
general cases, and scrolling.

>And talking about scrolling! Everyone who has seen a Sun scroll must agree
>that the screen must be scrolled in less than one frame (1/70 of a second)
>to make a pleasant impression. I would like to see a 68k do that with
>a 1k x 1k x 8 plane display in one frame time.

	I get really annoyed at Sun-2 (no blitter) scrolling speeds.  And it's
only dealing with 1 bitplane!

>Another argument I have heard is "But look at the Mac, it's CPU-driven and it's
>fast!". Well, I just have one answer to that: what is the display resolution?

	Not compared to an Amiga with a blitter.  Try dragging a color window
on a Mac II (remember, it has an '020 at twice the speed of the Amiga's
'000).

	Blitters can also do VERY fast line draws, with a few extra gates/
registers, as well as area fills, etc.

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

root@cca.ucsf.edu (Computer Center) (07/28/88)

It's an old debating trick to try to make points against the opposing
view by mis-characterizing it and then arguing against the distorted
image.

Every one posting on this subject advocating cpu data shuffling for
displays has tried to fake us all out by pretending that this is
done in word sized units all neatly aligned on word boundaries.

Now let's see your software timings for real _bit_blit_ operations
such as moving a block 37 bits wide aligned starting at bit 17 in the
source position and starting at bit 29 in the destination position
on a machine with 32-bit registers and data paths.
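
To make the case concrete, here is a minimal C sketch of one row of such a copy. It assumes bit 0 is the most significant bit of word 0, that source and destination do not overlap, and that a 64-bit temporary is available for the shifting (a real 32-bit implementation would shift across register pairs instead). The names get_bits, put_bits and blit_row are hypothetical, and only a plain copy is shown, with no OR/AND/XOR raster operation.

```c
#include <stdint.h>

/* Read n bits (1..32) starting at bit position pos.
 * Bit 0 is the most significant bit of word 0. */
static uint32_t get_bits(const uint32_t *w, unsigned pos, unsigned n)
{
    unsigned idx = pos / 32, off = pos % 32;
    uint64_t pair = (uint64_t)w[idx] << 32;

    if (off + n > 32)               /* field straddles a word boundary */
        pair |= w[idx + 1];
    return (uint32_t)((pair << off) >> (64 - n));
}

/* Write the low n bits (1..32) of v at bit position pos,
 * leaving all surrounding bits untouched. */
static void put_bits(uint32_t *w, unsigned pos, unsigned n, uint32_t v)
{
    unsigned idx = pos / 32, off = pos % 32;
    uint64_t field = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    uint64_t mask = (field << (64 - n)) >> off;
    uint64_t bits = ((uint64_t)v << (64 - n)) >> off;
    uint32_t hi_mask = (uint32_t)(mask >> 32);
    uint32_t lo_mask = (uint32_t)mask;

    w[idx] = (w[idx] & ~hi_mask) | (uint32_t)(bits >> 32);
    if (lo_mask != 0)               /* spill into the next word */
        w[idx + 1] = (w[idx + 1] & ~lo_mask) | (uint32_t)bits;
}

/* Copy nbits from src (starting at bit sbit) to dst (starting at dbit). */
void blit_row(uint32_t *dst, unsigned dbit,
              const uint32_t *src, unsigned sbit, unsigned nbits)
{
    while (nbits > 0) {
        unsigned n = nbits < 32 ? nbits : 32;
        put_bits(dst, dbit, n, get_bits(src, sbit, n));
        sbit += n;
        dbit += n;
        nbits -= n;
    }
}
```

A call such as blit_row(dst, 29, src, 17, 37) handles the row described above in two steps, one of 32 bits and one of 5; every step needs a shift and two masked read-modify-write cycles, which is the work a word-aligned copy avoids.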


Thos Sumner       (thos@cca.ucsf.edu)   BITNET:  thos@ucsfcca
(The I.G.)        (...ucbvax!ucsfcgl!cca.ucsf!thos)

OS|2 -- an Operating System for puppets.

#include <disclaimer.std>

guy@gorodish.Sun.COM (Guy Harris) (07/28/88)

> It's an old debating trick to try to make points against the opposing
> view by mis-characterizing it and then arguing against the distorted
> image.

It's an equally old trick to make a counterfactual assertion and treat it as an
axiom....

> Every one posting on this subject advocating cpu data shuffling for
> displays has tried to fake us all out by pretending that this is
> done in word sized units all neatly aligned on word boundaries.
> 
> Now let's see your software timings for real _bit_blit_ operations
> such as moving a block 37 bits wide aligned starting at bit 17 in the
> source position and starting at bit 29 in the destination position
> on a machine with 32-bit registers and data paths.

*Sigh*  I don't think *anybody* claimed that all bit moving is "done in word
size units all neatly aligned on word boundaries."  *HOWEVER*:

	For applications in terminals, there are three cases of "bitblt" that
	dominate: drawing characters, scrolling windows and window-window
	operations such as exchanging off-screen data with the display.  These
	cases also cover the most common graphics operations on personal
	computers.

	  Drawing a character requires decoding a fount structure to find the
	location of the character in the fount bitmap and calling "bitblt" to
	draw the character on the display.  For a general fount format and
	typical character sizes, over half the total time to draw a character
	on the Blit goes into overhead: at least one subroutine call and setup,
	opening the fount, building the argument list for "bitblt", calling
	"bitblt", and having "bitblt" in turn decode and clip its arguments and
	decide how to draw the image.  Because the characters are so small --
	drawing the letter 'a' touches 7 words of memory -- actually changing
	the pixels in the destination bitmap is relatively unimportant.  Our
	overhead is not unreasonable; the Blit draws about 2500 characters per
	second in the standard fount, which is 9 pixels (not 8) by 14.  An
	experimental version with eight-bit wide characters drawn only on byte
	boundaries, that avoided the overhead of calling "bitblt" and used a
	special fount format that was easy to decode (the current format is
	somewhat compressed for economy of memory), was only a factor of two
	faster.  This is insufficient speed-up for so great a loss of
	generality.

	  The second common case of "bitblt" is scrolling a rectangular region
	of a bitmap, usually the display.  Since the word boundaries in the
	scan lines of a bitmap are at the same place in each line, the speed of
	scrolling depends primarily on the speed of the MC68000 instruction

		mov.l	%a0@+, %a1@+

	or, in C,

		register long *p, *q;
		*p++ = *q++;

	For typical rectangles, the edges, which must be handled with more
	complicated code, do not dominate the performance.  There is nothing
	hardware can do to accelerate this loop except provide faster memory
	access.  If the display were accessed through a narrower or clumsier
	interface, it would take longer to move the data.

	  The last common case is shuffling on- and off-screen rectangles.  It
	can be made fast by a simple observation: the off-screen bitmaps are
	allocated by "balloc", which is given as argument the rectangle on the
	display occupied by the data.  This rectangle is assigned to "rect" in
	the resulting "Bitmap".  "balloc" can therefore allocate the bitmap so
	that the word boundaries occur in the same places in the image as they
	do in the display, reducing to the scrolling case the "bitblt" call
	that copies the data.  This is the last feature of the "Bitmap" data
	structure: "Bitmap.rect" defines not only the co-ordinate system but
	also the word fragmentation; the "x" co-ordinate modulo 16 is 0 at the
	first bit of the word in every bitmap.  This results in a factor of two
	to four speed-up for window-shuffling "bitblt" operations and combines
	neatly with the way textures are generated without diminishing the
	generality of the graphics primitives.  Of course, there is also the
	wide, non-aligned case of "bitblt" to be supported, but almost by
	construction it occurs rarely, and the memory and software are clean
	enough to make it acceptably fast when it is executed.

from "Hardware/Software Trade-offs for Bitmap Graphics on the Blit", Rob Pike,
Bart Locanthi, and John Reiser, Software-Practice and Experience, Vol. 15(2),
131-151 (February 1985).

I tend to believe Rob Pike and company when they say that "for real _bit_blit_
operations such as moving a block 37 bits wide aligned starting at bit 17 in
the source position and starting at bit 29 in the destination position on a
machine with 32-bit registers and data paths" are not typical (at least in the
way they used Blits) except for character painting, where overhead above and
beyond the bit-pushing dominates.  If you have evidence to indicate that this
is not the case, let's see it.

In the aforementioned paper, they also discuss timings.  They compare a Sun-1
(with a somewhat unusual frame buffer), a Sun-2 (with a conventional frame
buffer with a BitBlt chip that acts only on the frame buffer), and the Blit.  I
don't know how much the Sun-2 with BitBlt chip resembles the "hardware BitBlt"
support that has been discussed here, but here are the figures (minus those for
the atypical Sun-1 frame buffer); all timings are in milliseconds:

                                  Sun-2          Sun-2
                                  (display,      (memory,
Operation                         w/BB chip)     no BB chip)    Blit

Scroll screen vertically          109            82.2           129
Scroll screen horizontally        110            311            376
Letter 'a' at random positions
  on the screen                   0.34           0.74           0.42
Texturing a random 40x40 square   0.82           1.78           1.60

"The characters were drawn in a 9x14 pixel fount, but the bounding box for the
letter 'a' is only 8x7.  Both systems used "bitblt" to draw characters, rather
than special purpose primitives, and executed clipping code." (from the
article)

So it appears that an 8 MHz 68000 (Blit) can compete reasonably well with a
10 MHz 68010 (Sun-2), even with the assistance of the Sun-2's BitBlt chip.  I
don't know why the Sun-2 scrolled vertically memory-to-memory *faster* than
it did display-to-display.

If a BitBlt chip is reasonably cheap, and can do the whole job, it may be worth
it.  Note that in the cases shown, you got at most a 3.5x speedup (scroll
screen horizontally).  For vertical scrolling, you got only 1.18x; for randomly
drawing the letter 'a', you got only 1.23x; and for texturing a random 40x40
square, you got 1.95x.  How cheap does it have to be for that to be worth it?
(The "do the whole job" comes from comments made in the paper that a
half-hearted hardware assist can get in the way, rather than help.)
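
The speedups quoted here appear to be the Blit column divided by the Sun-2 "display w/BB chip" column. A small C sketch that recomputes them from the timings in the table follows; the struct and function names are illustrative, and the choice of which two columns to compare is an assumption inferred from the numbers.

```c
#include <stddef.h>

/* Timings in milliseconds, copied from the table above. */
struct blt_timing {
    const char *operation;
    double sun2_display_bb;   /* Sun-2, display with BitBlt chip */
    double sun2_memory;       /* Sun-2, memory, no BitBlt chip   */
    double blit;              /* Blit, software only             */
};

static const struct blt_timing timings[] = {
    { "scroll vertically",   109.0,   82.2,  129.0  },
    { "scroll horizontally", 110.0,  311.0,  376.0  },
    { "letter 'a'",            0.34,   0.74,   0.42 },
    { "texture 40x40",         0.82,   1.78,   1.60 },
};

/* How many times faster the Sun-2 with BitBlt chip was than the Blit. */
double chip_speedup_over_blit(size_t row)
{
    return timings[row].blit / timings[row].sun2_display_bb;
}
```

The four ratios come out at about 1.18, 3.42, 1.24 and 1.95, close to the figures quoted, with the 3.5x being a generous rounding of 3.42.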

aglew@urbsdc.Urbana.Gould.COM (07/28/88)

>It's an old debating trick to try to make points against the opposing
>view by mis-characterizing it and then arguing against the distorted
>image.
>
>Every one posting on this subject advocating cpu data shuffling for
>displays has tried to fake us all out by pretending that this is
>done in word sized units all neatly aligned on word boundaries.
>
>Now let's see your software timings for real _bit_blit_ operations
>such as moving a block 37 bits wide aligned starting at bit 17 in the
>source position and starting at bit 29 in the destination position
>on a machine with 32-bit registers and data paths.
>
>Thos Sumner       (thos@cca.ucsf.edu)   BITNET:  thos@ucsfcca

Has anyone got usage statistics for Blit operations? For example, how many
are well aligned, on word/halfword/byte boundaries? How many are for
characters, etc.?

About 3 years ago, when I was trying to choose areas for research,
statistics for graphics operations, like those we are familiar with
for instruction set usage, were suggested. I haven't been looking
in the meantime - has anybody done work on this?

aglew@gould.com

henry@utzoo.uucp (Henry Spencer) (07/29/88)

In article <399@ma.diab.se> pf@ma.UUCP (Per Fogelström) writes:
>... all bus cycles doing anything other than moving data are overhead.
>Agree? Okay, so even if a 68k is doing a "Blit" in straight code,
>it will consume some memory bandwidth. A microprogrammed "hardware" blit
>will be able to use every memory cycle for data accesses, thus having a much
>higher transfer rate. Correct me if I'm wrong...

Almost right, which means "wrong".  Take out the word "much" and I'll go
along with it.  Bulk data movement, like scrolling, can be done with 68k
instructions like MOVEM, which move a couple of dozen words of data for
every instruction fetch.  Yes, avoiding the fetches would speed things up,
but not by nearly as much as you think.  People designing things like
Blitters, DMA interfaces, etc., consistently ignore just how quickly a
modern CPU can move data if the programmer really sits down and thinks
for a while about how to do it.  Most modern CPUs can nearly saturate
their buses with data movement if they really try.
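
A rough way to put numbers on this is to count bus words per loop iteration. The sketch below is a simplified model, not a cycle-accurate one: it assumes a 16-bit bus (so each 32-bit long costs two data words to read and two to write), counts MOVE.L (A0)+,(A1)+ as one instruction word, counts a MOVEM.L load/store pair as two 2-word instructions, and ignores loop control and any other overhead. The function name is hypothetical.

```c
/* Fraction of bus words that carry data, given the number of
 * instruction-fetch words and data words per loop iteration. */
double bus_data_fraction(unsigned fetch_words, unsigned data_words)
{
    return (double)data_words / (double)(fetch_words + data_words);
}
```

Under these assumptions a single MOVE.L copy uses 4 of every 5 bus words for data (80%), while a MOVEM pair moving 12 longs uses 48 of 52 (about 92%), so a blitter that fetched no instructions at all would gain under 10% on that loop.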

>Assuming we have a fast micro (a 68030 or an NS32535), they would at least be
>supported by their on-chip caches. Even if the hit rate in these caches is
>as low as 50%, an external hardware Blitter could use the other 50% ...

You miss an important point:  those caches are not there to free up external
memory cycles, they are there to help slow memory keep up with a fast CPU.
It's not at all inconceivable to get 50% cache hits (which is low for an
instruction cache but good for a tiny data cache like the 030's) *and*
complete saturation of the external memory bandwidth, when one of those
CPUs gets going.

>Someone pointed out that placing characters is the main work for the BitBlit.
>Yes, that is correct in some systems and this is a problem in many cases.
>Placing a character can take from 20 microseconds and up, and the CPU has to
>wait for the blitter to be ready before placing the next character...

This is in fact nearly irrelevant, because there are probably 200us or more
of overhead required before that 20us BitBlt.  Character drawing is a case
where BitBlt speed is irrelevant, because character drawing speeds are
TOTALLY dominated by the overhead of finding the character and deciding
where to put it.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

gillies@p.cs.uiuc.edu (07/29/88)

Re: Render characters

Another approach (taken by the Xerox DLion) is to have a separate
instruction just for displaying characters.  This instruction, called
(appropriately) "TextBLT", knows about the font table formats and is
specialized for displaying rectangular blobs of text.  TextBLT is also
implemented in microcode, on the DLion's AMD2900-based CPU.

daver@nscimg.b16.sc.NSC.COM (Dave Rand) (07/29/88)

In article <1313@ucsfcca.ucsf.edu> root@cca.ucsf.edu (Computer Center) writes:
>
>Now let's see your software timings for real _bit_blit_ operations
>such as moving a block 37 bits wide aligned starting at bit 17 in the
>source position and starting at bit 29 in the destination position
>on a machine with 32-bit registers and data paths.
>
>Thos Sumner       (thos@cca.ucsf.edu)   BITNET:  thos@ucsfcca

On the NS32CG16 (32 bit CPU, 16 bit external data path), here are the
numbers.

The source is 3 words (less, really, but that's life in BITBLT). This
needs to be moved to 4 (NOT 3) words of destination. The shift is 12. The
height is not shown: I assume 32 lines.
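
The word counts and the shift follow from the bit positions in the question once 16-bit words are assumed (matching the external data path). A small sketch, with hypothetical function names and bits numbered from 0:

```c
/* Number of words touched by a bit field of `width` bits starting at
 * bit `start`, with words of `word_bits` bits (bits numbered from 0). */
unsigned words_spanned(unsigned start, unsigned width, unsigned word_bits)
{
    return (start + width - 1) / word_bits - start / word_bits + 1;
}

/* Shift distance needed to move data from bit offset `src` to bit
 * offset `dst`, modulo the word size. */
unsigned shift_amount(unsigned src, unsigned dst, unsigned word_bits)
{
    return (dst + word_bits - src % word_bits) % word_bits;
}
```

With start bits 17 and 29 and a width of 37, words_spanned gives 3 source words and 4 destination words, and shift_amount(17, 29, 16) gives 12.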

The times are given in clocks, and MICROseconds, assuming 15 MHz operation.

EXTBLT	35 + ( 13 + (12*4)) * 32	 = 1987 clocks, or 132 usecs
BBFOR	48 + ( 61 + (4+4) + 25 + 4) * 32 = 3184 clocks, or 212 usecs
BBOR	42 + (107 + (4+4) + 44 + 4) * 32 = 5258 clocks, or 350 usecs
BBAND	45 + (111 + (4+4) + 44 + 4) * 32 = 5389 clocks, or 359 usecs
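
Each row above has the same shape: a fixed setup cost plus a per-line cost multiplied by the 32 lines, then divided by the clock rate. A sketch that reproduces the arithmetic, with illustrative function names:

```c
/* Total clocks for a blit: fixed setup plus a per-line cost times lines. */
unsigned blt_clocks(unsigned setup, unsigned per_line, unsigned lines)
{
    return setup + per_line * lines;
}

/* Convert a clock count to microseconds at the given clock rate in MHz. */
double clocks_to_usec(unsigned clocks, double mhz)
{
    return clocks / mhz;
}
```

For example, blt_clocks(35, 13 + 12*4, 32) gives 1987 clocks for EXTBLT, or about 132 microseconds at 15 MHz. On these figures the hardware-assisted EXTBLT is roughly 1.6 times faster than the microcoded BBFOR and roughly 2.7 times faster than BBAND.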

The EXTBLT instruction in the NS32CG16 drives the DP8510 BITBLT unit. This
does the shift and ALU operation in hardware - the CPU provides only the
addresses.

The BBFOR, BBOR, BBAND (and other BITBLT functions) are implemented in
microcode directly. These instructions execute WITHOUT hardware assist
of any form. The times shown include the shift of 12 (shifts of 0-8 bits
are hidden by the fetch time of the destination data, due to the scheduled
load/pipeline feature of the Series 32000 architecture).

BBFOR is a "Fast OR", performing a left-to-right BITBLT operation. BBOR,
BBAND, EXTBLT and the other BITBLT operations in the NS32CG16 allow a
full 4-direction (left-to-right, right-to-left, top down, and bottom
up) BITBLT.

In moderate quantity, the price is $20-30.

If you need more information, please contact me.

Dave Rand
daver@nscimg.nsc.com {pyramid|sun}!nsc!nscimg!daver
These opinions in no way represent those of National Semiconductor.

root@cca.ucsf.edu (Computer Center) (07/29/88)

In article <61783@sun.uucp>, guy@gorodish.Sun.COM (Guy Harris) writes:

(In an attempt to refute my point about substituting word-blitting
for bit-blitting without admitting it being a debater's trick).

> 
> 	For applications in terminals, there are three cases of "bitblt" that
> 	dominate: drawing characters, scrolling windows and window-window
> 	operations such as exchanging off-screen data with the display.  These
> 	cases also cover the most common graphics operations on personal
> 	computers.
> ...
> 
> from "Hardware/Software Trade-offs for Bitmap Graphics on the Blit", Rob Pike,
> Bart Locanthi, and John Reiser, Software-Practice and Experience, Vol. 15(2),
> 131-151 (February 1985).
> 

Another debater's trick: this one is called Appeal to Authority.

Never mind, I suppose, that this is 3 1/2 years after publication
of the above and who knows how long before that it was written.
A few things have happened in this business since then.

But the real point is that these are exactly the applications that
should not be blitted at all; the video mapping controller should
be handling all of that.

For example, at last month's Usenix meeting Bell Technologies was
showing their Intel 82786 (I hope I got the number right) video
controller running smoothly scrolled text over 2/3 of a high-res
screen while occupying the remainder with instant opening and closing
overlapped windows. No jerks, no glitches, no skew were to be seen.

It sure made the skew distorted scrolling of the corner cutting
move-screen-bits-with-the-cpu systems look awful.

Thos Sumner       (thos@cca.ucsf.edu)   BITNET:  thos@ucsfcca
(The I.G.)        (...ucbvax!ucsfcgl!cca.ucsf!thos)

OS|2 -- an Operating System for puppets.

#include <disclaimer.std>

jesup@cbmvax.UUCP (Randell Jesup) (07/29/88)

In article <61783@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>	For applications in terminals, there are three cases of "bitblt" that
>	dominate: drawing characters, scrolling windows and window-window
>	operations such as exchanging off-screen data with the display.  These
>	cases also cover the most common graphics operations on personal
>	computers.

	They were dealing with straight terminals and simple rectangular
windows, being used as a mainly character-oriented interface to larger
machines.  A very different environment from today's microcomputers,
such as the Amiga, Mac, etc.

>	decide how to draw the image.  Because the characters are so small --
>	drawing the letter 'a' touches 7 words of memory -- actually changing
>	the pixels in the destination bitmap is relatively unimportant.  Our

	Characters are a somewhat special case, and are well worth
special-casing in the code.  The blitter does help a lot with proportional
kerned fonts, less with monospaced fonts, and not at all with monospaced
byte-multiple wide fonts aligned on byte boundaries.  Unfortunately, this
last case doesn't happen often, especially in a windowing environment.
When it does, it can make editors that cover the screen several times faster.

>	  The second common case of "bitblt" is scrolling a rectangular region
>	of a bitmap, usually the display.  Since the word boundaries in the
>	scan lines of a bitmap are at the same place in each line, the speed of
>	scrolling depends primarily on the speed of the MC68000 instruction

	Once again, this is true in a text-based environment.  In a WIMP
environment, this is much less true.  Block operations usually start on
arbitrary boundaries, and tend to be inconvenient widths.

>		register long *p, *q;
>		*p++ = *q++;
>
>	For typical rectangles, the edges, which must be handled with more
>	complicated code, do not dominate the performance.  There is nothing
>	hardware can do to accelerate this loop except provide faster memory
>	access.  If the display were accessed through a narrower or clumsier
>	interface, it would take longer to move the data.

	This is nowhere near as fast as the memory system can go nowadays,
even given the slowest/cheapest DRAMS.  For that loop, even unrolled, the
cpu is being used at least 33% for instruction fetch, and even so the CPU
only uses every other memory cycle.

>	  The last common case is shuffling on- and off-screen rectangles.  It
>	can be made fast by a simple observation: the off-screen bitmaps are
>	allocated by "balloc", which is given as argument the rectangle on the
>	display occupied by the data.  This rectangle is assigned to "rect" in
>	the resulting "Bitmap".  "balloc" can therefor allocate the bitmap so
>	that the word boundaries occur in the same places in the image as they
>	do in the display, reducing to the scrolling case the "bitblt" call
>	that copies the data.

	This is nowhere near the common case on machines like the Amiga.

>I tend to believe Rob Pike and company when they say that "for real _bit_blit_
>operations such as moving a block 37 bits wide aligned starting at bit 17 in
>the source position and starting at bit 29 in the destination position on a
>machine with 32-bit registers and data paths" are not typical (at least in the
>way they used Blits) except for character painting, where overhead above and
>beyond the bit-pushing dominates.  If you have evidence to indicate that this
>is not the case, let's see it.

	You've said the operative clause: the way they used Blits.  As I've
said, blitter hardware can buy you linedraw and areafill as well, relatively
cheaply.  These things are MUCH faster as part of a blitter than when done
by the CPU, up to 20x for linedraw.

>If a BitBlt chip is reasonably cheap, and can do the whole job, it may be worth
>it.  Note that in the cases shown, you got at most a 3.5x speedup (scroll
>screen horizontally).  For vertical scrolling, you got only 1.18x; for randomly
>drawing the letter 'a', you got only 1.23x; and for texturing a random 40x40
>square, you got 1.95x.  How cheap does it have to be for that to be worth it?

	You get bigger wins in animation or multitasking environments.  A
blitter is relatively cheap if you already need video chips (of course,
Commodore has chip design facilities, and uses custom chips for most
things).  The blitter on the Amiga is just a part of one of the graphics
chips, maybe 1/4 of it.

	A factor of 2-4x can make a really amazing difference in perceived
speed, especially if update operations go down to 1-frame time.  Using my
Sun-2 (no blitter, no color) is positively painful compared to my Amiga, even
though the Amiga is running in 4 colors (in this case).

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

elg@killer.DALLAS.TX.US (Eric Green) (07/29/88)

In message <61783@sun.uucp>, guy@gorodish.Sun.COM (Guy Harris) says:
>If a BitBlt chip is reasonably cheap, and can do the whole job, it may be worth
>it.  Note that in the cases shown, you got at most a 3.5x speedup (scroll
>screen horizontally).  For vertical scrolling, you got only 1.18x; for randomly
>drawing the letter 'a', you got only 1.23x; and for texturing a random 40x40
>square, you got 1.95x.  How cheap does it have to be for that to be worth it?
>(The "do the whole job" comes from comments made in the paper that a
>half-hearted hardware assist can get in the way, rather than help.)
>

It's interesting to note that the Amiga chipset was originally
designed for "the ultimate video game", which required a) low cost,
and b) the ability to move random irregularly shaped objects with
blazing speed, doing logic operations upon the operands (e.g. one
favorite video game trick is EOR'ing the moving object into the
background bitmap, then EOR'ing it out when it's ready to be moved,
and EOR'ing it into the new location).    Amazing how well-suited such a
chip is for a low-cost windowing system... well, not-so-amazing,
really, since the chipset designers knew an awful lot about designing
high speed video systems, while the designers of the Sun probably
didn't have that experience when they were faced with the problem of
speeding up their graphics rendering. Seems like the video game jocks have
something to show us Unix jocks, after all....
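
The EOR trick works because exclusive-or is its own inverse: applying the same sprite twice restores the background exactly, so no saved copy of the background is needed. A minimal sketch in C, assuming a flat buffer of 32-bit words and a sprite of the same size (the function name is illustrative, and masking of irregular shapes is left out):

```c
#include <stddef.h>
#include <stdint.h>

/* XOR a sprite into (or back out of) a background buffer of n words.
 * Calling it twice with the same sprite restores the background. */
void xor_sprite(uint32_t *background, const uint32_t *sprite, size_t n)
{
    for (size_t i = 0; i < n; i++)
        background[i] ^= sprite[i];
}
```

Moving an object is then three calls: one to draw it, one at the same place to erase it, and one at the new location.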

Just some meaningless trivia to generate flames... ;-)

     Eric

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
       MISFORTUNE, n. The kind of fortune that never misses.

henry@utzoo.uucp (Henry Spencer) (08/01/88)

In article <1315@ucsfcca.ucsf.edu> root@cca.ucsf.edu (Computer Center) writes:
>For example, at last month's Usenix meeting Bell Technologies was
>showing their Intel 82786 (I hope I got the number right) video
>controller running smoothly scrolled text over 2/3 of a high-res
>screen while occupying the remainder with instant opening and closing
>overlapped windows. No jerks, no glitches, no skew were to be seen.
>
>It sure made the skew distorted scrolling of the corner cutting
>move-screen-bits-with-the-cpu systems look awful.

To quote someone whose name I can't recall :-), "another debater's
trick"!  This time, comparing tomorrow's system with yesterday's.
A 25 MHz AMD 29000 (note, not 2900) and a suitably cooperative memory
subsystem should be able to do a *software* BitBlt that would make
an Amiga look equally awful.  If you want to compare the latest hot
BitBlt chip, compare it against the latest hot CPU.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

glennw@nsc.nsc.com (Glenn Weinberg) (08/02/88)

In article <1988Jul28.173301.7275@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <399@ma.diab.se> pf@ma.UUCP (Per Fogelström) writes:
>>Assuming we have a fast micro (a 68030 or an NS32535), they would at least be
>>supported by their on-chip caches. Even if the hit rate in these caches is
>>as low as 50%, an external hardware Blitter could use the other 50% ...
>
>You miss an important point:  those caches are not there to free up external
>memory cycles, they are there to help slow memory keep up with a fast CPU.
>It's not at all inconceivable to get 50% cache hits (which is low for an
>instruction cache but good for a tiny data cache like the 030's) *and*
>complete saturation of the external memory bandwidth, when one of those
>CPUs gets going.

It is not true that the sole purpose of cache is to help slow memory keep
up with fast processors.  In multiprocessors, in particular, one of the
most important functions of caches (and especially copy-back caches) is to
reduce bus traffic and so allow a relatively slow bus to support a number
of fast processors.

Within the next five years, this "relatively slow bus" will need a
bandwidth of multiple hundreds of Megabytes per second (you read that right--
the bus will need a bandwidth of several Gigabits per second) in order to
support a multiprocessor system made up of, say, 8-16 50-MIPS processors.
And the only way that you limit yourself to "only" needing hundreds of
Megabytes per second is by using copy-back caches.

-- 
Glenn Weinberg					Email: glennw@nsc.nsc.com
National Semiconductor Corporation		Phone: (408) 721-8102
(My opinions are strictly my own, but you can borrow them if you want.)

guy@gorodish.Sun.COM (Guy Harris) (08/02/88)

> (In an attempt to refute my point about substituting word-blitting
> for bit-blitting without admitting it being a debater's trick).

Excuse me, but I don't *consider* it a debating trick.  Unless you can
demonstrate that it *is* one - which you have *not* done - I have no intention
of "admitting it is (one)."

However, I *do* consider blithely dismissing arguments you don't like as
"debating tricks" to be a debating trick.

The point made by Pike and company is that the bulk of the operations performed
on the Blit *were* word-oriented, except for some that were dominated by
overhead above-and-beyond the bit-pushing, and therefore that the fact that
pushing bits on arbitrary boundaries is more expensive isn't important.

This is similar to the point that in many applications, integer multiplications
are usually multiplications by constants, and therefore machines that don't
have multiply instructions don't suffer a big performance hit in those
applications.  You *don't* always have to make the *general* case fast; you
want to concentrate on making the *common* case fast.
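
As an illustration of the multiplication point, a multiply by a known constant can be replaced by shifts and adds. The sketch below uses two constants chosen purely as examples (10, and 80 as a hypothetical bytes-per-scan-line value); unsigned arithmetic is used so the shifts are well defined.

```c
/* Multiply by the constant 10 using shifts and an add: 10x = 8x + 2x. */
unsigned mul10(unsigned x)
{
    return (x << 3) + (x << 1);
}

/* Multiply by the constant 80 the same way: 80x = 64x + 16x. */
unsigned mul80(unsigned x)
{
    return (x << 6) + (x << 4);
}
```

A compiler for a machine without a multiply instruction can emit this kind of sequence whenever the multiplier is known at compile time, which is the common case being described.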

> Another debater's trick: this one is called Appeal to Authority.

Umm, right.  By this logic, *any* citation of *any* paper is "Appeal to
Authority", and thus dismissable as a "debating trick".  Clever trick, that.

Sorry, but Pike and company, at least, have demonstrated some level of
expertise in the matter of making bit-mapped display hardware and software.  As
such, appealing to their authority is not without merit.

> Never mind, I suppose, that this is 3 1/2 years after publication
> of the above and who knows how long before that it was written.
> A few things have happened in this business since then.

In other words, the common types of bitblt operations have changed since then?

elg@killer.DALLAS.TX.US (Eric Green) (08/02/88)

In message <62296@sun.uucp>, guy@gorodish.Sun.COM (Guy Harris) says:
$Sorry, but Pike and company, at least, have demonstrated some level of
$expertise in the matter of making bit-mapped display hardware and software.  As
$such, appealing to their authority is not without merit.
$
$$ Never mind, I suppose, that this is 3 1/2 years after publication
$$ of the above and who knows how long before that it was written.
$$ A few things have happened in this business since then.
$
$ In other words, the common types of bitblt operations have changed since then?

 No, but the common types of CPU RAM have :-). For example, the slowest
 256K DRAMs that I've seen are 150ns, while I remember the "good old
 days" of 64K DRAMs, where 250ns was fast.... I won't bother you
 further by taking the Wayback machine back to the late 70's,
 when 16K DRAMs were lucky to keep up with a 6502.

 What made sense for an 8 MHz 68000 in 1980 did not necessarily make
 sense 4 years later for that same 8 MHz 68000.....

 --
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
       MISFORTUNE, n. The kind of fortune that never misses.

jesup@cbmvax.UUCP (Randell Jesup) (08/03/88)

In article <62296@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> Never mind, I suppose, that this is 3 1/2 years after publication
>> of the above and who knows how long before that it was written.
>> A few things have happened in this business since then.
>
>In other words, the common types of bitblt operations have changed since then?

	Yes.  Pike et al were only looking at blitters used for text-
oriented terminals that also had graphics capabilities.
-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

jesup@cbmvax.UUCP (Randell Jesup) (08/03/88)

In article <1988Aug1.061714.25907@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>To quote someone whose name I can't recall :-), "another debater's
>trick"!  This time, comparing tomorrow's system with yesterday's.
>A 25 MHz AMD 29000 (note, not 2900) and a suitably cooperative memory
>subsystem should be able to do a *software* BitBlt that would make
>an Amiga look equally awful.  If you want to compare the latest hot
>BitBlt chip, compare it against the latest hot CPU.

	Sure, and let me make a 1.2u CMOS version of the Amiga blitter and
it'll do the same thing to the 29000.  The Amiga blitter is in 3u NMOS or
HMOS or some such, 4+ year old tech, running at 7 MHz with a 16-bit bus.

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

henry@utzoo.uucp (Henry Spencer) (08/03/88)

In article <4410@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>	Sure, and let me make a 1.2u CMOS version of the amiga blitter and
>it'll do the same thing to the 29000.  The Amiga blitter is in 3u NMos or
>HMos or some such, 4+ year old tech, running at 7 Mhz with a 16bit bus.

My point was that if the opposition can make unfair comparisons (new Intel
hardware against ten-year-old CPU), I can make them too.

I'm not sure I'd bet on 1.2u CMOS beating the 29000, though:  that processor
is *really good* at saturating memory bandwidth.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (08/03/88)

In article <4409@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>	...  Pike et al were only looking at blitters used for text-
>oriented terminals that also had graphics capabilities.

I would conjecture -- note that this is only a conjecture -- that even
the highly graphics-oriented machines spend far more time displaying plain
old text than most people think.  I'd love to see numbers on this; does
anybody have some?
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

anc@camcon.co.uk (Adrian Cockcroft) (08/05/88)

In article <76700044@p.cs.uiuc.edu>, gillies@p.cs.uiuc.edu writes:
> 
> Re: Render characters
> 
> Another approach (taken by the Xerox DLion) is to have a separate
> instruction just for displaying characters.  This instruction, called
> (appropriately) "TextBLT", knows about the font table formats and
> specialized for displaying rectangular blobs of text.  TextBLT is also
> implemented in microcode, on the DLion's AMD2900-based CPU.

The Intel 82786 has a CHARBLT instruction.  There are two forms.  In the nicer
one you define a font to the chip: up to 256 16x16-pixel characters, mapped
through an indirection table so that, e.g., all unwanted chars map to the same
glyph.  You then give it a string and a character count, and the CHARBLT
instruction draws proportionally spaced characters for you (the font can be
kerned for italic).  This runs at full memory bandwidth.  The font has a header
for each glyph giving its size and some mode control bits.
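
A rough software sketch of the indirection-table idea, assuming a made-up
font layout.  The struct fields and function names here are hypothetical
and are NOT the 82786's real font format or command set; the sketch only
shows the table lookup and the proportional pen advance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GLYPH_H 16

/* Hypothetical glyph record: a small header, then up to 16 rows of 16 pixels. */
struct glyph {
    uint8_t  width;              /* pen advance in pixels (proportional) */
    uint8_t  height;             /* rows actually used, <= MAX_GLYPH_H */
    uint16_t rows[MAX_GLYPH_H];  /* one bit per pixel, MSB = leftmost */
};

struct font {
    const struct glyph *glyphs;  /* glyph storage */
    uint8_t map[256];            /* indirection: character code -> glyph index */
};

/* Look a character up through the indirection table. */
const struct glyph *font_lookup(const struct font *f, unsigned char c)
{
    assert(f != NULL);
    return &f->glyphs[f->map[c]];
}

/* Total pen advance, in pixels, for a counted string. */
unsigned string_advance(const struct font *f, const unsigned char *s, size_t n)
{
    unsigned x = 0;
    for (size_t i = 0; i < n; i++)
        x += font_lookup(f, s[i])->width;
    return x;
}
```

Leaving every unused entry of `map` at 0 sends all unwanted character codes
to glyph 0, which is the "all unwanted chars map to the same glyph" trick.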

The 82786 can also have a very high memory bandwidth of 40 Mbytes/s on a 16 bit
wide bus. It uses page mode DRAMs in two banks interleaved so that a new word
is read every 50ns. A burst lasting about a microsecond fills the 25 word
FIFO that feeds the video output registers, leaving plenty of memory
bandwidth for drawing operations. I think the blitter also does a block fetch
although it might use a RMW cycle. The CHARBLT runs at 20000 chars/sec.
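
The figures above can be checked with a little arithmetic.  This is only a
sketch taking the quoted numbers at face value (16 bit bus, one word every
50ns, 25 word FIFO); the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bandwidth in bytes per second for a bus that delivers one word of
   `bus_bits` bits every `cycle_ns` nanoseconds. */
uint64_t peak_bytes_per_sec(unsigned bus_bits, unsigned cycle_ns)
{
    assert(cycle_ns > 0);
    return (uint64_t)(bus_bits / 8) * 1000000000u / cycle_ns;
}

/* Time in nanoseconds to fill a FIFO of `words` words at that rate. */
unsigned fifo_fill_ns(unsigned words, unsigned cycle_ns)
{
    return words * cycle_ns;
}
```

With 16 bits every 50ns this gives 40 million bytes per second, and a
25 word FIFO fills in 1250ns, consistent with "about a microsecond".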

In general the 82786 is probably faster than the 34010 but is less
programmable.
-- 
  |   Adrian Cockcroft                  ..!uunet!mcvax!ukc!camcon!anc
-[T]- Cambridge Consultants Ltd,        anc@uk.co.camcon or anc@camcon.uucp
  |   Science Park, Cambridge CB4 4DW, England, UK    (0223) 358855
      (You are in a maze of twisty little C004's, all alike...)

mitch@Stride.COM (Thomas Mitchell) (08/06/88)

In article <1988Jul28.173301.7275@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <399@ma.diab.se> pf@ma.UUCP (Per Fogelstr|m) writes:
>>... all bus cycles doing anything else than movin data is overhead.
>>Agree ??.  Okay, so even if a 68k is doing a "Blit" in straight code
>>Correct me if i'm wrong...
>Almost right, which means "wrong".  Take out the word "much" and I'll go
>... People designing things like
>Blitters, DMA interfaces, etc., consistently ignore just how quickly a
>modern CPU can move data if the programmer really sits down and thinks

This is true, and a common surprise.  Many 'DMA' processors are not
as fast as the main processor.  They commonly do not have a bus
interface equal to the processor's, or an instruction cache, or the
other goodies we now expect in a microprocessor.

In a system the DMA processor also must arbitrate with the processor
for control of the bus.  Then communicate (message?) with the
processor ....  Well--

The result is that DMA processors are a loss except to the sales
department.   

Note that I have been careful to say DMA processor and not DMA
device.  It is possible to build custom hardware (a device) that
does DMA to or from main memory vastly faster than a 'programmed'
transfer, but such things are rare today.

-- 
Thomas P. Mitchell (mitch@stride1.Stride.COM)
Phone: (702)322-6868	TWX: 910-395-6073	FAX: (702)322-7975
MicroSage Computer Systems Inc.
Opinions expressed are probably mine.

anc@camcon.co.uk (Adrian Cockcroft) (08/08/88)

In article <76700044@p.cs.uiuc.edu>, gillies@p.cs.uiuc.edu writes:
> 
> Re: Render characters
> 
> Another approach (taken by the Xerox DLion) is to have a separate
> instruction just for displaying characters.  This instruction, called
> (appropriately) "TextBLT", knows about the font table formats and
> specialized for displaying rectangular blobs of text.  TextBLT is also
> implemented in microcode, on the DLion's AMD2900-based CPU.

The Intel 82786 has a charblt instruction. There are two forms, in the nicest
one you define a font to the chip, up to 256 16x16 pixel characters mapped
through an indirection table so that e.g. all unwanted chars map to the same
glyph, you then give it a string and a charcount and the CHARBLT instruction
draws proportionally spaced characters for you (the font can be kerned for
italic). This runs at full memory bandwidth speeds. The font has a header
for each glyph giving its size and some mode control bits.

The 82786 can also have a very high memory bandwidth of 40 Mbytes/s on a 16 bit
wide bus. It uses page mode DRAMS in two banks interleaved so that a new word
is read every 50ns. A burst lasting about a microsecond fills the 25 word
FIFO that feeds the video output registers, leaving plenty of memory
bandwidth for drawing operations. I think the blitter also does a block fetch
although it might use a RMW cycle. The CHARBLT runs at 20000 chars/sec.
The CHARBLT can draw 1 bit deep characters into 1,2,4 or 8 bit deep bitmaps.

-- 
  |   Adrian Cockcroft                  ..!uunet!mcvax!ukc!camcon!anc
-[T]- Cambridge Consultants Ltd,        anc@uk.co.camcon or anc@camcon.uucp
  |   Science Park, Cambridge CB4 4DW, England, UK    (0223) 358855
      (You are in a maze of twisty little C004's, all alike...)

rminnich@super.ORG (Ronald G Minnich) (08/08/88)

In article <1988Aug3.153415.9033@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <4409@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>>	...  Pike et al were only looking at blitters used for text-
>>oriented terminals that also had graphics capabilities.
>I would conjecture -- note that this is only a conjecture -- that even
>the highly graphics-oriented machines spend far more time displaying plain
>old text than most people think.  I'd love to see numbers on this; does
>anybody have some?
   well i would guess mine does, when i am not playing computer games :-)
(about 50% of the time?) 
   One issue that has been ignored in this discussion is that we are 
all talking about slightly different things. Henry is talking about blitters,
and others are talking about amigas. Seems to me that there is a BIG
difference between the graphics supported by Blit and the graphics 
supported by the amiga- something that many probably don't know.
For example, the amiga supports hardware windows. Now, they are
a little limited, in that they have to be the width of the screen, so
as a result the amiga OS people implemented them as screens, with multiple
of the traditional type of window per screen. Thus you actually have
several different worlds, each with their own color map and such, each 
with their own set of windows. You can have a game with 10 open windows
on one screen, then flip to your X-windows screen with its 5 or 10 windows
and its 2-bit-plane color map, then to your Deluxe Paint screen with
its own 4096-color HAM map. The Screens are one thing i like best about
the amiga, esp. since they can overlay each other on the physical display
and flipping between them takes no time at all. I have yet to see 
this sort of graphics on anything other than an amiga, and i miss it 
a lot when i use other machines.
   It was mentioned somewhere at one point that the original Xerox
workstations wanted to support this sort of multiple world environment, 
but the graphics support was not there (I think it was the Dorado).
   The other day i saw an IBM Peanut running X windows. The color map
on the VGA changed as you moved from window to window. It just about
drove me bats. That machine really needed screens, but i think it's
gotta happen in hardware if it is going to happen at all.
    To sum up,
   Amiga graphics hardware != blitter chip.
		and maybe  >
ron

stevew@nsc.nsc.com (Steve Wilson) (08/08/88)

In article <840@stride.Stride.COM> mitch@stride.stride.com.UUCP (Thomas Mitchell) writes:
>
>This is true, and a common surprise.  Many 'DMA' processors are not
>as fast as the main processor. They commonly do not have, a
>bus interface equal to the processor or instruction cache or other
>goodies we now expect in a micro-processor.  
>
>In a system the DMA processor also must arbitrate with the processor
>for control of the bus.  Then communicate (message?) with the
>processor ....  Well--
>
>The result is that DMA processors are a loss except to the sales
>department.   

If you're only moving one byte of data between interrupts to the CPU
telling him you've moved some data, then you're right.  A DMA controller
is a WIN when you can tell it to move a large fixed length piece of data
and then forget about it and go do something else.  You're correct about
losing memory bandwidth for the CPU, but this is a design trade-off
point.  My favorite example is a serial I/O controller I designed
(Henry..You listening!!) where I had 12 serial I/O channels.  The
processor could handle about 7000 interrupts/sec.  How ya gonna do
19.2Kb of constant data flow across 12 channels with that kind of interrupt
response time?  DMA was a cheap answer. (Henry, I'm NOT going to put
the DMA hardware on a general purpose CPU!)
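
A back-of-the-envelope sketch of that example.  It assumes the 19.2Kb
figure is per channel, 10 bit times per character (start + 8 data + stop),
and one interrupt per character when there is no DMA; none of those
details are stated above, and the 64 character block size is made up.

```c
#include <assert.h>

/* Characters per second arriving on `channels` async serial lines, each
   running at `bits_per_sec`, with `bits_per_char` bit times per character. */
unsigned long chars_per_sec(unsigned channels, unsigned long bits_per_sec,
                            unsigned bits_per_char)
{
    assert(bits_per_char > 0);
    return channels * (bits_per_sec / bits_per_char);
}

/* Interrupt rate when the CPU is interrupted once per `chars_per_irq`
   characters (1 = no DMA; larger = DMA block size).  Rounds up. */
unsigned long irqs_per_sec(unsigned long cps, unsigned long chars_per_irq)
{
    assert(chars_per_irq > 0);
    return (cps + chars_per_irq - 1) / chars_per_irq;
}
```

Under those assumptions 12 channels deliver 23040 chars/sec, well over the
7000 interrupts/sec the processor could take, while 64 character DMA
blocks would cut that to 360 interrupts/sec.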

Point being that there are applications where special hardware such
as DMA makes sense, and there are applications where it's a dumb idea!

Steve Wilson
National Semiconductor

[The above opinions are mine, not those of my employer! ]

pf@diab.se (Per Fogelstr|m) (08/09/88)

In article <840@stride.Stride.COM> mitch@stride.stride.com.UUCP (Thomas Mitchell) writes:
>..........................  Many 'DMA' processors are not
>as fast as the main processor. They commonly do not have, a
>bus interface equal to the processor or instruction cache or other
>goodies we now expect in a micro-processor.  
>
>If I was careful -- I use the words DMA processor and not DMA
>device.  It is possible to build custom hardware (a device) that
>does DMA to or from main memory vastly faster than a 'programed'
>transfer but such things are today rare.

As a matter of fact, modern buses, supporting more than 100Mb transfer
rate, might saturate the processor. :-)))  Okay, okay, let's be serious.
What i mean is that there is no need for what was called a DMA processor
back in the good ol' days. DMA channels were needed because the main
processor couldn't handle the data rate from mag tapes and disks, etc.
What is needed today in multiprocessor systems is a mechanism which
allows the "programmed CPU" on the disk controller board to burst the
data over the bus fast as H..L. This is to obey the rules for multiprocessor
buses:  1. Don't use the bus.  2. If you must, be fast.  3. To be fast,
transfer more data than addresses, e.g. use block transfers.
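
Rule 3 can be put in numbers with a small sketch.  It assumes a
multiplexed bus where each transfer costs one address cycle followed by
the data cycles; real buses add arbitration and turnaround cycles that
this ignores, and the function name is made up.

```c
#include <assert.h>

/* Percentage of bus cycles that carry data when each transfer is one
   address cycle followed by `burst` data cycles (integer, rounded down). */
unsigned data_cycle_percent(unsigned burst)
{
    assert(burst > 0);   /* a transfer carries at least one data word */
    return 100u * burst / (burst + 1u);
}
```

Single-word transfers use only 50% of the cycles for data; a 4 word block
gets 80% and a 16 word block about 94%.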

hwt@leibniz.UUCP (Henry Troup) (08/10/88)

The estimable Mr. Spencer (got to keep all the Henrys straight)
asks if anyone has numbers for how much character i/o happens
as against graphics on a graphics (bitmap) terminal.
I don't know, but I do remember that character writing speed was a
big thing for Macintosh QuickDraw (9k characters per second). It's
in the Byte interview in 1984.

a
hw (percent) leibniz@bnr-di
t

daveh@cbmvax.UUCP (Dave Haynie) (08/10/88)

in article <61783@sun.uucp>, guy@gorodish.Sun.COM (Guy Harris) says:
> Keywords: BitBlit.

> 	  The second common case of "bitblt" is scrolling a rectangular region
> 	of a bitmap, usually the display.  Since the word boundaries in the
> 	scan lines of a bitmap are at the same place in each line, the speed of
> 	scrolling depends primarily on the speed of the MC68000 instruction

> 		mov.l	%a0@+, %a1@+

> 	or, in C,

> 		register long *p, *q;
> 		*p++ = *q++;

> 	For typical rectangles, the edges, which must be handled with more
> 	complicated code, do not dominate the performance.  There is nothing
> 	hardware can do to accelerate this loop except provide faster memory
> 	access.  If the display were accessed through a narrower or clumsier
> 	interface, it would take longer to move the data.

With a MC68000, not so.  Given an equal memory access speed, something like
a DMA controller can be several times faster than the 68000.  All it needs
do is fetch data from location A, dump it to location B, and increment some
internal counters.  While it looks like that's what the 68000 is doing, it's
really also fetching the move instruction and a branch instruction of some
kind.  So for every word moved, you're probably fetching as many instruction
words as overhead.  Certainly the 68010 in some cases and the 68020 in most
cases solve this problem via caching, but I can't yet buy either of these 
parts for the $2.50 or so I pay for a 68000.  
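
A crude model of that overhead, counting only bus accesses.  It assumes
two instruction words fetched per word moved (the move plus a branch) and
ignores internal execution cycles and prefetch effects; the function names
and the 500ns bus cycle in the note below are illustrative choices.

```c
#include <assert.h>

/* Bus accesses needed to copy `words` words when every loop iteration also
   fetches `insn_words` instruction words.  0 models a DMA engine, or a
   loop running entirely out of an instruction cache. */
unsigned long copy_bus_accesses(unsigned long words, unsigned insn_words)
{
    const unsigned data_accesses = 2;   /* one read plus one write per word */
    return words * (data_accesses + insn_words);
}

/* Words moved per second if each bus access takes `cycle_ns` nanoseconds. */
unsigned long words_per_sec(unsigned cycle_ns, unsigned insn_words)
{
    assert(cycle_ns > 0);
    return 1000000000ul / ((unsigned long)cycle_ns * (2 + insn_words));
}
```

At a 500ns bus cycle the model gives 1000000 words/sec with no instruction
fetches against 500000 with two per word, i.e. a factor of two from fetch
overhead alone, before any other loop costs are counted.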

> If a BitBlt chip is reasonably cheap, and can do the whole job, it may be worth
> it.  Note that in the cases shown, you got at most a 3.5x speedup (scroll
> screen horizontally).  For vertical scrolling, you got only 1.18x; for randomly
> drawing the letter 'a', you got only 1.23x; and for texturing a random 40x40
> square, you got 1.95x.  How cheap does it have to be for that to be worth it?
> (The "do the whole job" comes from comments made in the paper that a
> half-hearted hardware assist can get in the way, rather than help.)

You also have to consider a few more things.  For instance, if you have a blitter
that operates on video memory and lets the CPU do things with non video memory
in parallel (like on the Amiga, and apparently on the Sun mentioned), then you
have a big advantage, in that any blit may end up costing nothing but the setup
time in terms of real CPU usage.  Still no good reason to use the blitter for
small, single character blits, but it can really be a justification for larger
things.  And given that a blit chip can often be a much simpler design than the
host CPU, there's a real good chance it WILL be able to have a faster path to
memory.

That depends of course on the chip and the base CPU in your system.  If the
combination of a blitter chip and 68000 ran me more than a 68020, that had
better be one heck of a blitter, or I'm wasting my $$$ -- the 68020 being more
general purpose than a blitter can give you a better overall system performance.
But if I can get my blitter and 68000 CPU and maybe a bunch of other functions 
for less than the cost of a 68010, I'm probably winning (if I'm not concerned
about the 68010's virtual memory facilities, which a Sun of course obviously
is).

-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

haahr@phoenix.Princeton.EDU (Paul Gluckauf Haahr) (08/11/88)

In article <118@leibniz.UUCP> hwt@leibniz.UUCP (Henry Troup) writes:
> The estimable Mr. Spencer (got to keep all the Henry's straight)
> queries if anyone has numbers for how much character i/o happens
> as against graphics on a graphics (bitmap) terminal.
> I don't know, but I do remember that character writing speed was a
> big thing for MacIntosh QuickDraw (9k characters per second). It's
> in the Byte interview in 1984.

The Byte reference is to the February 1984 issue, and the rendering
speed given was actually 7K characters/second.  (the information is
given on page 37 and repeated on page 76).  Still, remarkably fast for
a 68000, even given that this was done in hand coded assembly
language.  The 9K seemed about a factor 5 too high, which is why I
looked the article up.  7000 chars/sec is still faster than I would
have expected.

They do not give sizes of the characters, and say in the article that
it is irrelevant, but that still probably assumes something like 9x14.
Much larger characters would probably hurt performance.  Later Macs may
be faster (if someone recoded the QuickDraw stuff to use the bit field
instructions, the Mac II could scream).

By way of comparison, the Pike/Locanthi/Reiser "Hardware/Software
Tradeoffs for Bitmap Graphics on the Blit" paper gives numbers (page
146) that work out to 2400 chars/sec for the blit, 900 chars/sec for
the Sun-1, and 2950 for the Sun-2.  This assumes rendering one
character at a time.  Locanthi's fastest example from the EUUG "Fast
bitblt() with asm() and cpp" paper gives 6200 chars per second for a 16
MHz, 2 wait state 68020.  Again, this is for one bitblt() call per
character.

My monochrome sun-3/60 (68020, 20 MHz, bwtwo, with the normal, not high
resolution, monitor), using the large console font (gallant.r.19) comes
out to about 3200 chars/sec.  I have no idea if they render more than
one character at once.  This is a very large font, however, and the
output routine is in the prom monitor.  I did not try to write a
program to test pixrect character speeds on a normal sized font.

My own bitblt, on the same sun-3/60, for an 8x14 font, gives 8100
chars/sec, if characters are rendered individually.  If characters are
batched up and bitblt() is called only once, the speed is > 16000
chars/sec.  This code is a combination of c (with inline assembly for
fetching the characters from the font bitmap and bitblt()s narrower
than one word) and compile-on-the-fly code for bitblt()s spanning word
boundaries.
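
Turning those rates into time per character makes the batching effect
easier to see.  A trivial sketch; the function name is made up, and the
16000 figure is the lower bound quoted above.

```c
#include <assert.h>

/* Average microseconds spent per character at a given rendering rate
   (integer, rounded down). */
unsigned long us_per_char(unsigned long chars_per_sec)
{
    assert(chars_per_sec > 0);
    return 1000000ul / chars_per_sec;
}
```

At 8100 chars/sec that is about 123 microseconds per character, against
about 62 at 16000 chars/sec.  If batching removes only the per-call setup
(an assumption), roughly half of the per-character time was call overhead.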

The real point:  the Macintosh, with no hardware assist, and hand-coded
assembly, draws characters very fast.

paul haahr princeton!haahr or haahr@princeton.edu

henry@utzoo.uucp (Henry Spencer) (08/13/88)

In article <5493@nsc.nsc.com> stevew@nsc.UUCP (Steve Wilson) writes:
>>The result is that DMA processors are a loss except to the sales
>>department.   
>
>If your only moving one byte of data between interupts to the CPU
>telling him you've moved some data, then your right.  A DMA controller
>is a WIN when you can tell it to move a large fixed length piece of data
>and then forget about it and go do something else...  [In an example] the
>processor could handle about 7000 interupts/sec.  How ya gonna do
>19.2Kb of constant data flow across 12 channels with that kind of interupt 
>response time.  DMA was a cheap answer...

I think you've missed the point slightly.  Clearly, for high data rates
it is necessary to have buffering of some kind, to keep the latency
requirements down to the point where the CPU can satisfy them.  One way
of doing that is to have the device DMA into memory.  Another way is to
put buffering on board, but have the CPU do the actual transfer into
memory when a reasonable amount of data has accumulated.  Actual timings
in several cases have clearly shown that buffered devices with CPU data
movement can beat DMA devices.  The main reason is that in many systems,
the CPU normally has possession of the bus and the DMA device must first
throw the CPU off.  There can be quite substantial overhead in doing so,
and if the DMA device then transfers a few bytes and goes away again,
the bus-ownership-transfer overhead can hurt throughput badly.  For a
modern CPU which can move data quickly, it is worth considering using
buffering instead of DMA in the peripherals.  (Side benefits are that
it makes the drivers simpler, and it's much more flexible -- most DMA
schemes can't do things like putting network headers down in one place
and the actual packet data down in another, or calculating an IP checksum
as the data is being moved.)
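
A minimal sketch of the checksum-while-copying idea, assuming the data
sits in a byte-addressable device buffer.  The function name is made up;
the sum is the usual 16-bit ones' complement Internet checksum, computed
in the same pass as the copy so the data is touched only once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `len` bytes from `src` to `dst` and return the Internet checksum
   of the data.  The 32-bit accumulator is safe for len <= 65535. */
uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint32_t sum = 0;
    size_t i = 0;

    assert(len <= 65535);

    /* Move two bytes at a time, adding each big-endian 16-bit word. */
    for (; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }

    /* An odd trailing byte is treated as if padded with a zero byte. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back in (end-around carry), then complement. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A plain DMA engine would leave the checksum as a second pass over memory.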
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

pf@diab.se (Per Fogelstr|m) (08/13/88)

In article <1843@gofast.camcon.co.uk> anc@camcon.co.uk (Adrian Cockcroft) writes:
>
>In general the 82786 is probably faster then the 34010 but is less
>programmable.
>-- 

And the NS DP8500 is much faster than the 82786 and at least as programmable
as the TI 34010. And the DP8500 can have glyphs up to 256x256 and can do
kerning as well. I did a CRT emulator by only adding a UART to the chip,
easy as 1 2 3. Look, Mom, no "CPU".

henry@utzoo.uucp (Henry Spencer) (08/14/88)

In article <1848@titan.camcon.co.uk> anc@camcon.co.uk (Adrian Cockcroft) writes:
>The Intel 82786 has a charblt instruction. There are two forms, in the nicest
>one you define a font to the chip, up to 256 16x16 pixel characters...

So if my characters are, say, 17x17, I can't use it?  This is precisely the
sort of stupid restriction that makes people forget the chip and do it in
software instead, to save the hassle of deciding when the hardware is
actually useful.

>... (the font can be kerned for italic)....

How is the kerning defined?  10-1 it's some sloppy kludge.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

pf@diab.se (Per Fogelstr|m) (08/21/88)

In article <1988Aug13.205229.24467@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <1848@titan.camcon.co.uk> anc@camcon.co.uk (Adrian Cockcroft) writes:
>>The Intel 82786 has a charblt instruction. There are two forms, in the nicest
>>one you define a font to the chip, up to 256 16x16 pixel characters...
>
>So if my characters are, say, 17x17, I can't use it?  This is precisely the
>sort of stupid restriction that makes people forget the chip and do it in
>software instead,

Intel always puts a lot of dumb restrictions in their silicon. However it is
possible to use larger fonts by using normal bitblit transfers. The sad thing
is that you can't take advantage of the special functions used by charblt
(table lookup etc.).  The NS DP8500 also has a charblt instruction, but this chip
can handle up to 65536 characters of up to 256x256 pixels. Yes, I know this is
a restriction as well, but i think it will last for a while.

yuval@taux02.UUCP (Gideon Yuval) (09/01/88)

Are the S/W bitBLT algorithms available (preferably in "C")?
-- 
Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work)
 Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
                                                TWX: 33691, fax: +972-52-558322