[comp.sys.amiga.tech] Using CPU instead of Blitter for speed

dgaudet@crocus.uwaterloo.ca (Dean Gaudet) (06/04/90)

Call me crazy, but wouldn't using the CPU (a 020 or 030 I suppose) instead of
the blitter for speed cause a decrease in the overall speed of the machine?  
i.e., one of the reasons the Amiga seems faster than a similar speed PC or Mac
is because of the support hardware and the parallelism.

Perhaps someone should rewrite the blitter in software and make the blit
routines "intelligently" decide whether to use the CPU or the blitter chip.
For example:  if CPU is a 030 and CPU traffic is slow, then use the CPU
otherwise use the blitter.

Dean Gaudet

jonabbey@walt.cc.utexas.edu (Jonathan Abbey) (06/05/90)

I'd like some hard data on the blitter vs. the CPU for polygon rendering..
I've heard people here assert that the CPU is faster, even on a 68000.

How much faster?  Are you using the blitter to copy the polygon into the
appropriate bitplanes once it is drawn, or are you using the processor to
fill the bitplanes in parallel?

I'm assuming you're using an ordered edge list?

Thanks for any elucidation..



Jonathan Abbey                    (512) 926-5934 | Amiga Programmer Wanna-be 
jonabbey@ccwf.cc.utexas.edu        bix: jonabbey +----------------------------- 
The University of Texas at Austin - CS Undergrad | Speaking for myself, at best

jcs@crash.cts.com (John Schultz) (06/06/90)

In article <1990Jun4.134811.12142@watdragon.waterloo.edu> dgaudet@crocus.uwaterloo.ca (Dean Gaudet) writes:
>Call me crazy, but wouldn't using the CPU (a 020 or 030 I suppose) instead of
>the blitter for speed cause a decrease in the overall speed of the machine?  
>i.e., one of the reasons the Amiga seems faster than a similar speed PC or Mac
>is because of the support hardware and the parallelism.

  Unfortunately, with a bitplane oriented display, you only get parallelism
on your last bitplane blit. True, you can compute your next blit values before
waiting, but that buys you little time.  For polygons, the processor is
definitely faster, even on a nofastmem A500. But, for general block copying,
such as windows, arcade style animation, etc, the blitter is much faster,
and easier to use.  If the application is oriented such that the blits
can be interleaved with other processor operation, more parallelism could
be achieved, but this is usually not the case in a general purpose
application.

>Perhaps someone should rewrite the blitter in software and make the blit
>routines "intelligently" decide whether to use the CPU or the blitter chip.
>For example:  if CPU is a 030 and CPU traffic is slow, then use the CPU
>otherwise use the blitter.

  As far as the ROM is concerned, it's fine as is. If a specific application
warrants higher performance, do what is fastest. Either directly access
the blitter with custom code, or use the processor to render polygons. I
haven't tried using the 030 for block moves as the blitter is highly
efficient in this case. Using the blitter and processor on alternate
bitplanes *simultaneously* would yield very high throughput.

  John

cmcmanis@stpeter.Eng.Sun.COM (Chuck McManis) (06/06/90)

In article <30974@ut-emx.UUCP> (Jonathan Abbey) writes:
>I'd like some hard data on the blitter vs. the CPU for polygon rendering..
>I've heard people here assert that the CPU is faster, even on a 68000.

The definitive answer to this was given by Jeremy San (wrote Starglider)
at the S.F. DevCon last year. His assessment was that the blitter and the
CPU were about equal. Where they varied was in the effort it took to set
up the "blit" versus the speed at which the blit was accomplished. The 
blitter is faster than the 68000 once it gets going but the 68000 could
set up for the next blit a lot faster than the blitter could. 68020 and
above processors won on both counts, mostly because of the instruction
cache/clock speed issues. So his analysis was that one's rendering function
had to be smart enough to "know" if the blit was larger than the cutoff
point for the processor and if so feed it to the blitter. The primary
advantage of the blitter was being able to calculate new data simultaneously
with the CPU while the blit is going on, although to make this truly
effective you have to make sure that the CPU doesn't go to the "Chip"
bus and that means True Fast Ram, and no interrupts (because fetching
the interrupt vector requires talking to chip ram.)
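
The cutoff idea can be sketched in C roughly as follows. This is only an illustration: `BlitPolicy`, `choose_engine`, and the cutoff field are made-up names, not anything from the ROM or from the DevCon talk, and the real threshold would have to be measured on each CPU and memory configuration.

```c
#include <stddef.h>

typedef enum { USE_CPU, USE_BLITTER } Engine;

/* Hypothetical tuning record; the cutoff must be found by measurement. */
typedef struct {
    size_t cutoff_words;   /* operations of at least this many words go to the blitter */
} BlitPolicy;

/* Pick an engine for a rectangular operation of width_words x height_rows. */
Engine choose_engine(const BlitPolicy *policy, size_t width_words, size_t height_rows)
{
    size_t total_words = width_words * height_rows;
    return (total_words >= policy->cutoff_words) ? USE_BLITTER : USE_CPU;
}
```

On a faster processor the same code would simply be given a larger `cutoff_words`, so that more of the small jobs stay on the CPU.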

--
--Chuck McManis						    Sun Microsystems
uucp: {anywhere}!sun!cmcmanis   BIX: <none>   Internet: cmcmanis@Eng.Sun.COM
These opinions are my own and no one elses, but you knew that didn't you.
"I tell you this parrot is bleeding deceased!"

barnettj@dollar.crd.ge.com (Janet A Barnett) (06/07/90)

In article <3009@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
>  Unfortunately, with a bitplane oriented display, you only get parallelism
>on your last bitplane blit. True, you can compute your next blit values before
>waiting, but that buys you little time.

What about the graphics library routine QBlit(blitnode)?  I set up all
my blits ahead of time in a linked list pointed to by the blitnode
argument to QBlit().  The blitnode contains a pointer to a routine to
call that handles the actual blitter register stuffing.  QBlit adds my
blits to a queue maintained in GfxBase; when the blitter is ready, THE
OPERATING SYSTEM CALLS my blitter routines.  So what, you say, if the
OS waits, how is that any better than doing my own WaitBlits?  The
trick is that the QBlit routine makes use of the blitter_done
interrupt. This means that the CPU is free to go about other business
until each blit is done, at which point the interrupt service kicks
in.  By setting the CLEANUP flag in the blitnode structure, a special
routine can be called when the list of blits is exhausted.  In my
stuff, the CLEANUP routine usually sends a message to the task that
launched the blits so it can know when the blits are complete.  Seems
ideal.
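
The queue-plus-interrupt pattern can be modelled in portable C along these lines. This is a host-side simulation under my own assumptions, not the graphics.library interface: `BlitJob`, `BlitQueue`, `queue_blit`, and `blit_done_interrupt` are hypothetical names, and the real `struct bltnode` and `QBlit()` differ in detail (see the AutoDocs).

```c
#include <stddef.h>

/* Hypothetical stand-in for a queued blit; NOT the real struct bltnode. */
typedef struct BlitJob {
    struct BlitJob *next;
    void (*start)(struct BlitJob *self);  /* would stuff the blitter registers */
    void (*cleanup)(void *ctx);           /* optional, run when the job retires */
    void *ctx;
} BlitJob;

typedef struct {
    BlitJob *head, *tail;
    int busy;                             /* nonzero while a job is running */
} BlitQueue;

/* Start the job at the head of the queue, if there is one. */
static void start_next(BlitQueue *q)
{
    q->busy = (q->head != NULL);
    if (q->head && q->head->start)
        q->head->start(q->head);
}

/* Append a job and return at once; kick the blitter only if it is idle. */
void queue_blit(BlitQueue *q, BlitJob *job)
{
    job->next = NULL;
    if (q->tail) q->tail->next = job; else q->head = job;
    q->tail = job;
    if (!q->busy) start_next(q);
}

/* Model of the blit-finished interrupt: retire the head, start the next job. */
void blit_done_interrupt(BlitQueue *q)
{
    BlitJob *done = q->head;
    if (!done) return;
    q->head = done->next;
    if (!q->head) q->tail = NULL;
    if (done->cleanup) done->cleanup(done->ctx);
    start_next(q);
}
```

The point of the structure is that `queue_blit` never waits: the caller carries on computing, and only the interrupt handler touches the queue when a blit completes.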

Of course, everything has caveats.  Obviously, there is more overhead
in QBlit than OwnBlitter/WaitBlit/DisownBlitter, but hopefully you can
make use of the time you don't spend in WaitBlit.  Further, if your
blits are small, it is possible that one interrupt will start several
blits in a row, probably resulting in no real net improvement from the
parallel nature of the coprocessor.  And, I read once, long ago, that
there was a problem with WaitBlit in a heavily loaded system. Could
this manifest itself with QBlit?  I don't know.  My tests of
multitasking with QBlit have consisted of doing DIRs of DF0: at the
same time my graphics are running.  Result? Both processes go slow,
but the blitter seems to be shared correctly.

Consider also QBlit's sibling QBSBlit.  Set the beam sync element of
your blitnode to a vertical scan-line position and the OS will attempt
to set an interrupt based on the 60Hz CIA timer such that your blitter
routine is called after the e-beam has passed the indicated scan-line.
Semiuseful if you're being niggling with display memory and you don't
mind the occasional glitch when the CPU can't get to your blit in
time.  (When you're making star hash out of some viscous,
evil-smelling alien, a little extra sparkle in the debris generally
goes unnoticed.)  See the AutoDocs for more info.

So, even though the blitter may be slower than a 68030, it still
represents a powerful resource when used properly.  By the way, which
processor will draw a line faster?  The blitter (once started) can set
a single pixel in a line every six 7MHz bus cycles (1.17E6 pixels/sec).
A line algorithm in Steve Williams' "68030 Assembly Language Reference"
shows about 50 instructions for implementing the Bresenham Line
Algorithm.  Unfortunately, this otherwise excellent reference has no
instruction timings, but I'll be generous and allow 4 cycles for each
instruction.  At 28MHz, this gives us about .14E6 pixels/sec.  Hmmm.
Maybe a 68040 could do better.
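
For anyone who wants to redo the arithmetic with other numbers, here it is as a small C sketch. The function names are made up for illustration; the figures plugged in are the ones quoted above (six cycles per pixel on a 7MHz bus, 50 instructions at an assumed 4 cycles each on a 28MHz CPU).

```c
/* Pixels per second for a device that needs a fixed number of cycles per pixel. */
double pixels_per_second(double clock_hz, double cycles_per_pixel)
{
    return clock_hz / cycles_per_pixel;
}

/* CPU estimate: instructions per pixel times an assumed flat cycle cost. */
double cpu_pixels_per_second(double clock_hz, int instructions_per_pixel,
                             int cycles_per_instruction)
{
    return pixels_per_second(clock_hz,
                             (double)instructions_per_pixel * cycles_per_instruction);
}
```

With these, `pixels_per_second(7e6, 6)` comes to roughly 1.17E6 and `cpu_pixels_per_second(28e6, 50, 4)` to 0.14E6, matching the numbers in the text. The result is only as good as the instruction count and the flat 4-cycle guess.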

(See Tomas Rokicki's BlitLab for an explanation of how to draw lines
with the blitter.)

jcs@crash.cts.com (John Schultz) (06/07/90)

In article <30974@ut-emx.UUCP> jonabbey@walt.cc.utexas.edu (Jonathan Abbey) writes:
>I'd like some hard data on the blitter vs. the CPU for polygon rendering..
>I've heard people here assert that the CPU is faster, even on a 68000.
>
>How much faster?  Are you using the blitter to copy the polygon into the
>appropriate bitplanes once it is drawn, or are you using the processor to
>fill the bitplanes in parallel?

  I compute the specific mask for the specific long word, then write the
mask appropriately into the bitplanes (move.l, or.l, or not-and.l). The
blitter is never used.
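
In C terms, the mask-and-write step might look something like the sketch below. This is an illustration rather than the actual routine: `span_mask` and `fill_span` are hypothetical names, a 320 pixel wide, four bitplane layout is assumed, and pixel 0 is taken to be the most significant bit of each longword, as it is on the big-endian 68000.

```c
#include <stdint.h>

#define PLANES 4

/* Mask covering pixels x0..x1 (inclusive, 0..31) of one longword; pixel 0 is the MSB. */
static uint32_t span_mask(int x0, int x1)
{
    uint32_t left  = 0xFFFFFFFFu >> x0;
    uint32_t right = 0xFFFFFFFFu << (31 - x1);
    return left & right;
}

/* Fill pixels x0..x1 of one row with `color`; row[p] points at that row in plane p. */
void fill_span(uint32_t *row[PLANES], int x0, int x1, unsigned color)
{
    for (int w = x0 >> 5; w <= (x1 >> 5); w++) {
        int lo = (w == (x0 >> 5)) ? (x0 & 31) : 0;
        int hi = (w == (x1 >> 5)) ? (x1 & 31) : 31;
        uint32_t m = span_mask(lo, hi);
        for (int p = 0; p < PLANES; p++) {
            if (color & (1u << p))
                row[p][w] |= m;    /* or.l: set the plane bit */
            else
                row[p][w] &= ~m;   /* not-and: clear the plane bit */
        }
    }
}
```

When the mask is all ones the OR and AND-NOT degenerate into plain stores, which is where a straight `move.l` can be used instead.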

>I'm assuming you're using an ordered edge list?

  I'm using a highly efficient table fill algorithm I developed. The
trick is to fill the minX/maxX tables as rapidly as possible, then
draw lines from minX to maxX, for the polygon's minY to maxY.
This only works for convex polygons.
The "lines" are 32 bit masks, so worst case for a 320 wide screen would
take ten 32 bit writes. The reason the processor is faster than the blitter
in filling polygons, is that the blitter must clear a TmpRas the
size of the extent of the polygon, draw the outline with the blitter,
xor maximum y points with the processor, fill the mask with the blitter,
then blit the mask to each bitplane. If you want polygons that don't have
broken or coarse edges, you must re-outline the mask with lines using 
the blitter.  For a four bitplane display, this is equivalent to writing
to a six bitplane display, as well as having to draw a wireframe polygon
to two bitplanes. Furthermore, the blitter must work with a rectangle, and
thus much data movement is wasted, as most polygons are not rectangles.
The processor only moves data where the polygon exists.
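
The minX/maxX table idea, in a rough C sketch. This is not the routine described above, just one simple way to build such tables for a convex polygon: the names (`Point`, `build_extents`, `min_x`, `max_x`), the 200-row screen height, and the plain integer interpolation along each edge are all assumptions made for illustration.

```c
#include <limits.h>

#define SCREEN_H 200   /* assumed display height in rows */

typedef struct { int x, y; } Point;

int min_x[SCREEN_H], max_x[SCREEN_H];   /* per-row horizontal extent */

/* Widen the extent of row y so that it includes x. */
static void widen(int y, int x)
{
    if (y < 0 || y >= SCREEN_H) return;
    if (x < min_x[y]) min_x[y] = x;
    if (x > max_x[y]) max_x[y] = x;
}

/* Walk one edge top to bottom, recording its x position on every row. */
static void scan_edge(Point a, Point b)
{
    if (a.y > b.y) { Point t = a; a = b; b = t; }
    int dy = b.y - a.y;
    if (dy == 0) { widen(a.y, a.x); widen(b.y, b.x); return; }
    for (int y = a.y; y <= b.y; y++)
        widen(y, a.x + (int)((long)(b.x - a.x) * (y - a.y) / dy));
}

/* Build the tables for a convex polygon; the clipped row range comes back in y0/y1. */
void build_extents(const Point *v, int n, int *y0, int *y1)
{
    *y0 = INT_MAX; *y1 = INT_MIN;
    for (int i = 0; i < n; i++) {
        if (v[i].y < *y0) *y0 = v[i].y;
        if (v[i].y > *y1) *y1 = v[i].y;
    }
    if (*y0 < 0) *y0 = 0;
    if (*y1 >= SCREEN_H) *y1 = SCREEN_H - 1;
    for (int y = *y0; y <= *y1; y++) { min_x[y] = INT_MAX; max_x[y] = INT_MIN; }
    for (int i = 0; i < n; i++) scan_edge(v[i], v[(i + 1) % n]);
}
```

After `build_extents`, the caller draws one horizontal span from `min_x[y]` to `max_x[y]` for each row from `y0` to `y1`. A single min/max pair per row is exactly why this only works for convex shapes.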
  I just tested out my latest code on a nofastmem 500, and it was about
1 frame/sec slower for small polygons, and 1-2 frames/sec faster on large
polygons.  I measure speed differences real time with a frame rate counter.
I just press a key to toggle between rendering methods. Using the blitter
I'll get a frame rate of 17 f/s, while the processor cranks out
24 f/s. That's about a 41% improvement. In some cases the blitter will
bog down to 8 f/s while the processor cranks out 24 f/s, for a 200%
improvement (three times the frame rate). This is comparing my custom
blitter code that *does not*
re-outline the blitter mask, so the polygons look ugly. The rom code
is much slower as it re-outlines the masks. The processor filled polygons
look great. On an Amiga 3000, the processor is at least another 40% faster
than my 25 MHz GVP because of the 32 bit chip ram.

  John

jesup@cbmvax.commodore.com (Randell Jesup) (06/07/90)

In article <136730@sun.Eng.Sun.COM> cmcmanis@stpeter.Eng.Sun.COM (Chuck McManis) writes:
>In article <30974@ut-emx.UUCP> (Jonathan Abbey) writes:
>>I'd like some hard data on the blitter vs. the CPU for polygon rendering..
>>I've heard people here assert that the CPU is faster, even on a 68000.
>
>The definitive answer to this was given by Jeremy San (wrote Starglider)
>at the S.F. DevCon last year. His assessment was that the blitter and the
>CPU were about equal. Where they varied was in the effort it took to set
>up the "blit" versus the speed at which the blit was accomplished. The 
>blitter is faster than the 68000 once it gets going but the 68000 could
>set up for the next blit a lot faster than the blitter could. 68020 and
>above processors won on both counts, mostly because of the instruction
>cache/clock speed issues.

	There are some other caveats as well.  The polygons he was dealing
with are filled with a solid color (no patterning), and are regular in
certain ways, in particular there can only be two edge-crossings on any given
horizontal line through it.  Checking for that is far harder (and costs more)
than the amount of time it saves, perhaps even on an '030 (then again,
the '020 and '030 are pretty fast, and have better bit-field instructions).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

kp74615@kaakkuri.tut.fi (Karri Tapani Palovuori) (06/07/90)

In article <3022@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:

>then blit the mask to each bitplane. If you want polygons that don't have
>broken or coarse edges, you must re-outline the mask with lines using 
>the blitter.  

This is not necessary. You can draw the leftmost lines one bit too left and
use exclusive fill mode.
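
For readers who have not used the fill modes, here is a small software model of one 16-bit word, as I understand the Hardware Manual's description: the fill runs right to left, each set bit toggles the fill carry, and exclusive mode leaves out the pixel that switches the carry off (the left edge of each span). `fill_word` is a made-up name, and the model assumes the carry starts at 0.

```c
#include <stdint.h>

/* Model of a blitter area fill over one word. exclusive != 0 selects exclusive fill. */
uint16_t fill_word(uint16_t src, int exclusive)
{
    uint16_t out = 0;
    int carry = 0;
    for (int bit = 0; bit < 16; bit++) {      /* bit 0 is the rightmost pixel */
        int in   = (src >> bit) & 1;
        int next = carry ^ in;                /* each set bit toggles the carry */
        int pix  = exclusive ? next : (carry | in);
        out |= (uint16_t)(pix << bit);
        carry = next;
    }
    return out;
}
```

With the outline word `0x2418` (binary 0010010000011000), inclusive fill gives `0x3C18` and exclusive fill gives `0x1C08`: each span loses its leftmost pixel, which is why the left edges get drawn one bit further left.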

Further, with blitter optimized code you can draw polygons of any shape,
including holes in them. This is a great advantage in some situations.

I agree that the 68000 is sometimes faster than the blitter. But it really
depends on the average polygon size and shape.

But the speedup gained by faster processors is one of the strongest points
supporting CPU-drawing. I think.

>  John

Karri

jcs@crash.cts.com (John Schultz) (06/08/90)

In article <8256@crdgw1.crd.ge.com> barnettj@dollar.crd.ge.com (Janet A Barnett) writes:
>In article <3009@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
>>  Unfortunately, with a bitplane oriented display, you only get parallelism
>>on your last bitplane blit. True, you can compute your next blit values before
>>waiting, but that buys you little time.
>
>What about the graphics library routine QBlit(blitnode)?  I set up all
>my blits ahead of time in a linked list pointed to by the blitnode
>argument to QBlit().  The blitnode contains a pointer to a routine to

  I wrote my own blitter interrupt code to clear the screen in rectangular
chunks as opposed to a linear clear of all memory. The interrupt version
was slower. The problem with queueing up blits is that your application
may not be able to do anything else until the blits are done. In a 
flight simulator, you must draw all of your polygons in the correct order.
This means that the polygons must be transformed, clipped and sorted first,
then all of the drawing must take place. Further, the processor is used
to draw points. The points can't be drawn until the blitter is finished.
I was going to put processor drawing code in the blitter interrupt
code, but the payoff of using blitter interrupts was too little to 
continue. 

>Of course, everything has caveats.  Obviously, there is more overhead
>in QBlit than OwnBlitter/WaitBlit/DisownBlitter, but hopefully you can
>make use of the time you don't spend in WaitBlit.  Further, if your
>blits are small, it is possible that one interrupt will start several
>blits in a row, probably resulting in no real net improvement from the
>parallel nature of the coprocessor.  And, I read once, long ago, that
>there was a problem with WaitBlit in a heavily loaded system. Could

  WaitBlit() shouldn't have any problems. If you roll your own, you
must read a chip memory or hardware location before testing the blit
done bit, as described in the 1.3 Amiga Hardware Manual.

>this manifest itself with QBlit?  I don't know.  My tests of
>multitasking with QBlit have consisted of doing DIRs of DF0: at the
>same time my graphics are running.  Result? Both processes go slow,
>but the blitter seems to be shared correctly.
>
>Consider also QBlit's sibling QBSBlit.  Set the beam sync element of
>your blitnode to a vertical scan-line position and the OS will attempt
>to set an interrupt based on the 60Hz CIA timer such that your blitter
>routine is called after the e-beam has passed the indicated scan-line.
>Semiuseful if you're being niggling with display memory and you don't
>mind the occasional glitch when the CPU can't get to your blit in
>time.  (When you're making star hash out of some viscous,
>evil-smelling alien, a little extra sparkle in the debris generally
>goes unnoticed.)  See the AutoDocs for more info.

  Some of us purr-fectionists will notice, and exclaim, "Hack!" :-).
Double or triple buffering should handle any animation situation.

>So, even though the blitter may be slower than a 68030, it still
>represents a powerful resource when used properly.  By the way, which

  You are true. Currently, for block moves, the blitter says to the
processor, "You can't touch this!"

>processor will draw a line faster?  The blitter (once started) can set
>a single pixel in a line every six 7MHz bus cycles (1.17E6 pixels/sec).
>A line algorithm in Steve Williams' "68030 Assembly Language Reference"
>shows about 50 instructions for implementing the Bresenham Line
>Algorithm.  Unfortunately, this otherwise excellent reference has no
>instruction timings, but I'll be generous and allow 4 cycles for each
>instruction.  At 28MHz, this gives us about .14E6 pixels/sec.  Hmmm.
>Maybe a 68040 could do better.

  Heh, heh.  That example doesn't work. It draws nice 45 degree lines, and
that's it.  I use a high performance fixed point method derived from
68000 Assembly Language, by Krantz and Stanley. 
Take a look at the main loop (this is for a four bitplane display,
40 bytes per row, a2-a5 are the bitplane ptrs):

PMD [YAAD] Program Module Dismemberer V.14
Copyright (c) 1990, HoweSoft, Inc. All Rights Reserved.
*  Program_unit #0 name "<UNNAMED>".
*  Code_Hunk [PUBLIC] #0 Length = 60 bytes [15 longwords]
	move.l	D4,D0				; 4	CYCLES
	move.l	D5,D1				; 4	CYCLES
	add.l	A0,D0				; 8	CYCLES
	add.l	A1,D1				; 8	CYCLES
	swap	D0				; 4	CYCLES
	swap	D1				; 4	CYCLES
	move.w	D1,D3				; 4	CYCLES
	add.w	D1,D1				; 4	CYCLES
	add.w	D1,D1				; 4	CYCLES
	add.w	D1,D3				; 4	CYCLES
	lsl.w	#3,D3				; 12	CYCLES
	move.w	D0,D1				; 4	CYCLES
	lsr.w	#3,D0				; 12	CYCLES
	add.w	D0,D3				; 4	CYCLES
	andi.w	#$07,D1				; 8	CYCLES
	not.b	D1				; 4	CYCLES
	bset	D1,$00(A2,D3.W)			; 18	CYCLES
	bset	D1,$00(A3,D3.W)			; 18	CYCLES
	bset	D1,$00(A4,D3.W)			; 18	CYCLES
	bset	D1,$00(A5,D3.W)			; 18	CYCLES
	add.l	D6,D4				; 8	CYCLES
	add.l	D7,D5				; 8	CYCLES
	dbf	D2,-$38(PC)			; 10+	CYCLES
	rts					; 16	CYCLES
* 206 Total Cycles
*  Hunk_End.
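
For those who don't read 68000 assembly, the loop above does roughly the following, written here as a C sketch. The variable roles are my reading of the listing, not something stated in it: x and y are 16.16 fixed point values (D4/D5), stepped by D6/D7, with a constant added before the integer part is taken (A0/A1, assumed here to be a rounding or origin bias). The function name and parameter names are made up, and coordinates are assumed to be on screen.

```c
#include <stdint.h>

#define ROW_BYTES 40
#define PLANES 4

/* Plot `count` pixels along a line, setting the bit in all four bitplanes. */
void draw_line_fixed(uint8_t *plane[PLANES],
                     int32_t x, int32_t y,            /* start position, 16.16 */
                     int32_t dx, int32_t dy,          /* step per pixel, 16.16 */
                     int32_t x_bias, int32_t y_bias,  /* added before truncation */
                     int count)
{
    while (count-- > 0) {
        /* swap: take the integer half of the fixed point values */
        unsigned px = ((uint32_t)x + (uint32_t)x_bias) >> 16;
        unsigned py = ((uint32_t)y + (uint32_t)y_bias) >> 16;
        /* y*40 via (y + 4y) << 3 in the listing, plus x/8 for the byte */
        unsigned offset = py * ROW_BYTES + (px >> 3);
        /* not.b + bset: pixel 0 is the top bit of the byte */
        uint8_t bit = (uint8_t)(0x80u >> (px & 7));
        for (int p = 0; p < PLANES; p++)
            plane[p][offset] |= bit;
        x += dx;
        y += dy;
    }
}
```

One difference to keep in mind: `dbf` runs the loop D2+1 times, whereas `count` here is the number of pixels actually plotted.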

  So, what does this work out to on a cached 030? Tough call, so I
tested it real time against the blitter. It's faster for small lines
and slightly slower for very long lines. If we had faster processor
access to chip ram, we could really cook.

>(See Tomas Rokicki's BlitLab for an explanation of how to draw lines
>with the blitter.)

  Also, the 1.3 Hardware Manual explains how to draw lines with the blitter,
as well as example code.

  John

jcs@crash.cts.com (John Schultz) (06/08/90)

In article <13493@etana.tut.fi> kp74615@kaakkuri.tut.fi (Karri Tapani Palovuori) writes:
>In article <3022@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
>
>>then blit the mask to each bitplane. If you want polygons that don't have
>>broken or coarse edges, you must re-outline the mask with lines using 
>>the blitter.  
>
>This is not necessary. You can draw the leftmost lines one bit too left and
>use exclusive fill mode.

  I tried exclusive fill, inverting the mask, etc. This did produce sharper
polygons, *but* they still fell apart when horizontally thin.

>Further, with blitter optimized code you can draw polygons of any shape,
>including holes in them. This is a great advantage in some situations.

  That's true, as well as concave polygons.

>I agree that the 68000 is sometimes faster than the blitter. But it really
>depends on the average polygon size and shape.
>
>But the speedup gained by faster processors is one of the strongest points
>supporting CPU-drawing. I think.

  You think right. I just read on comp.sys.amiga of a TI TIGA board
for the A3000. A 34010 will put to rest any dialogue disputing processor/
blitter speed. If it's a 34020 (250 megs/sec transfer), the Amiga will
be a most formidable low cost graphics speed demon.

  John

jesup@cbmvax.commodore.com (Randell Jesup) (06/08/90)

In article <3035@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
>In article <8256@crdgw1.crd.ge.com> barnettj@dollar.crd.ge.com (Janet A Barnett) writes:
>>What about the graphics library routine QBlit(blitnode)?  I set up all
>>my blits ahead of time in a linked list pointed to by the blitnode
>>argument to QBlit().  The blitnode contains a pointer to a routine to
>
>  I wrote my own blitter interrupt code to clear the screen in rectangular
>chunks as opposed to a linear clear of all memory. The interrupt version
>was slower. The problem with queueing up blits is that your application
>may not be able to do anything else until the blits are done. In a 
>flight simulator, you must draw all of your polygons in the correct order.
>This means that the polygons must be transformed, clipped and sorted first,
>then all of the drawing must take place. Further, the processor is used
>to draw points. The points can't be drawn until the blitter is finished.
>I was going to put processor drawing code in the blitter interrupt
>code, but the payoff of using blitter interrupts was too little to 
>continue. 

	However, with design oriented towards it you can have the extra cycles
be used.  Either use the blitter interrupts/QBlit and have it do the point
renderings from there, and then signal the main process, or split it into
rendering and calculation processes.  I suspect the interrupts are faster,
but the separate process/task is easier.

	This allows some nice tricks to make use of all your horsepower to
produce a smoother update rate.

	However, if you meet the special requirements for the polygons that
have been mentioned before, then you are probably better off using the processor
(if you're after maximum speed).  Or you could pass all lines larger than X
to the blitter, and smaller ones to the processor, or use the blitter to
block-copy backgrounds/clear mem/whatever while the processor is rendering
polygons into a different buffer, etc, etc.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"