dgaudet@crocus.uwaterloo.ca (Dean Gaudet) (06/04/90)
Call me crazy, but wouldn't using the CPU (a 020 or 030 I suppose) instead of the blitter for speed cause a decrease in the overall speed of the machine? i.e., one of the reasons the Amiga seems faster than a similar speed PC or Mac is because of the support hardware and the parallelism.

Perhaps someone should rewrite the blitter in software and make the blit routines "intelligently" decide whether to use the CPU or the blitter chip. For example: if CPU is a 030 and CPU traffic is slow, then use the CPU; otherwise use the blitter.

Dean Gaudet
jonabbey@walt.cc.utexas.edu (Jonathan Abbey) (06/05/90)
I'd like some hard data on the blitter vs. the CPU for polygon rendering.. I've heard people here assert that the CPU is faster, even on a 68000.

How much faster? Are you using the blitter to copy the polygon into the appropriate bitplanes once it is drawn, or are you using the processor to fill the bitplanes in parallel?

I'm assuming you're using an ordered edge list?

Thanks for any elucidation..

Jonathan Abbey (512) 926-5934 | Amiga Programmer Wanna-be
jonabbey@ccwf.cc.utexas.edu bix: jonabbey +-----------------------------
The University of Texas at Austin - CS Undergrad | Speaking for myself, at best
jcs@crash.cts.com (John Schultz) (06/06/90)
In article <1990Jun4.134811.12142@watdragon.waterloo.edu> dgaudet@crocus.uwaterloo.ca (Dean Gaudet) writes:
>Call me crazy, but wouldn't using the CPU (a 020 or 030 I suppose) instead of
>the blitter for speed cause a decrease in the overall speed of the machine?
>i.e., one of the reasons the Amiga seems faster than a similar speed PC or Mac
>is because of the support hardware and the parallelism.

Unfortunately, with a bitplane oriented display, you only get parallelism on your last bitplane blit. True, you can compute your next blit values before waiting, but that buys you little time.

For polygons, the processor is definitely faster, even on a nofastmem A500. But for general block copying, such as windows, arcade style animation, etc., the blitter is much faster and easier to use. If the application is organized so that the blits can be interleaved with other processor operations, more parallelism can be achieved, but this is usually not the case in a general purpose application.

>Perhaps someone should rewrite the blitter in software and make the blit
>routines "intelligently" decide whether to use the CPU or the blitter chip.
>For example: if CPU is a 030 and CPU traffic is slow, then use the CPU
>otherwise use the blitter.

As far as the ROM is concerned, it's fine as is. If a specific application warrants higher performance, do what is fastest: either access the blitter directly with custom code, or use the processor to render polygons. I haven't tried using the 030 for block moves, as the blitter is highly efficient in this case. Using the blitter and processor on alternate bitplanes *simultaneously* would yield very high throughput.

John
cmcmanis@stpeter.Eng.Sun.COM (Chuck McManis) (06/06/90)
In article <30974@ut-emx.UUCP> (Jonathan Abbey) writes:
>I'd like some hard data on the blitter vs. the CPU for polygon rendering..
>I've heard people here assert that the CPU is faster, even on a 68000.

The definitive answer to this was given by Jeremy San (who wrote StarGlider) at the S.F. DevCon last year. His assessment was that the blitter and the CPU were about equal. Where they varied was in the effort it took to set up the "blit" versus the speed at which the blit was accomplished. The blitter is faster than the 68000 once it gets going, but the 68000 could set up for the next blit a lot faster than the blitter could. 68020 and above processors won on both counts, mostly because of the instruction cache/clock speed issues.

So his analysis was that one's rendering function had to be smart enough to "know" if the blit was larger than the cutoff point for the processor and, if so, feed it to the blitter. The primary advantage of the blitter was being able to calculate new data with the CPU while the blit is going on, although to make this truly effective you have to make sure that the CPU doesn't go to the "Chip" bus, and that means True Fast RAM and no interrupts (because fetching the interrupt vector requires talking to chip RAM).

--
--Chuck McManis Sun Microsystems
uucp: {anywhere}!sun!cmcmanis BIX: <none> Internet: cmcmanis@Eng.Sun.COM
These opinions are my own and no one else's, but you knew that didn't you.
"I tell you this parrot is bleeding deceased!"
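The "cutoff point" idea above can be sketched as a small cost comparison. This is only an illustration in C: the `CostModel` structure, the `choose_engine` function and the linear cost formula are assumptions made for the sketch, not anything presented at the DevCon, and real numbers would have to be measured on the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model. The values plugged in must come from
 * measurement on the target machine; nothing here is from the thread. */
typedef struct {
    unsigned long setup_cycles;    /* fixed cost before any data moves */
    unsigned long cycles_per_word; /* cost per 16-bit word once running */
} CostModel;

typedef enum { USE_CPU, USE_BLITTER } Engine;

/* Estimated total cycles for an operation touching `words` words. */
static unsigned long estimate_cycles(const CostModel *m, unsigned long words)
{
    return m->setup_cycles + m->cycles_per_word * words;
}

/* Pick the engine with the lower estimate; ties go to the CPU. */
Engine choose_engine(const CostModel *cpu, const CostModel *blitter,
                     unsigned long words)
{
    assert(cpu != NULL && blitter != NULL);
    return estimate_cycles(blitter, words) < estimate_cycles(cpu, words)
               ? USE_BLITTER
               : USE_CPU;
}
```

With a cheap-setup, slow-per-word CPU model and an expensive-setup, fast-per-word blitter model, small operations go to the CPU and large ones to the blitter, which is the behaviour described above. The sketch ignores the parallelism argument entirely (the CPU being free while the blitter runs), so it would understate the blitter's value on a machine with true Fast RAM.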
barnettj@dollar.crd.ge.com (Janet A Barnett) (06/07/90)
In article <3009@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
> Unfortunately, with a bitplane oriented display, you only get parallelism
>on your last bitplane blit. True, you can compute your next blit values before
>waiting, but that buys you little time.

What about the graphics library routine QBlit(blitnode)? I set up all my blits ahead of time in a linked list pointed to by the blitnode argument to QBlit(). The blitnode contains a pointer to a routine to call that handles the actual blitter register stuffing. QBlit adds my blits to a queue maintained in GfxBase; when the blitter is ready, THE OPERATING SYSTEM CALLS my blitter routines.

So what, you say, if the OS waits, how is that any better than doing my own WaitBlits? The trick is that the QBlit routine makes use of the blitter_done interrupt. This means that the CPU is free to go about other business until each blit is done, at which point the interrupt service kicks in. By setting the CLEANUP flag in the blitnode structure, a special routine can be called when the list of blits is exhausted. In my stuff, the CLEANUP routine usually sends a message to the task that launched the blits so it can know when the blits are complete. Seems ideal.

Of course, everything has caveats. Obviously, there is more overhead in QBlit than OwnBlitter/WaitBlit/DisownBlitter, but hopefully you can make use of the time you don't spend in WaitBlit. Further, if your blits are small, it is possible that one interrupt will start several blits in a row, probably resulting in no real net improvement from the parallel nature of the coprocessor. And, I read once, long ago, that there was a problem with WaitBlit in a heavily loaded system. Could this manifest itself with QBlit? I don't know. My tests of multitasking with QBlit have consisted of doing DIRs of DF0: at the same time my graphics are running. Result? Both processes go slow, but the blitter seems to be shared correctly.

Consider also QBlit's sibling QBSBlit.
Set the beam sync element of your blitnode to a vertical scan-line position and the OS will attempt to set an interrupt based on the 60Hz CIA timer such that your blitter routine is called after the e-beam has passed the indicated scan-line. Semiuseful if you're being niggling with display memory and you don't mind the occasional glitch when the CPU can't get to your blit in time. (When you're making star hash out of some viscous, evil-smelling alien, a little extra sparkle in the debris generally goes unnoticed.) See the AutoDocs for more info.

So, even though the blitter may be slower than a 68030, it still represents a powerful resource when used properly.

By the way, which processor will draw a line faster? The blitter (once started) can set a single pixel in a line every six 7MHz bus cycles (1.17E6 pixels/sec). A line algorithm in Steve Williams's "68030 Assembly Language Reference" shows about 50 instructions for implementing the Bresenham Line Algorithm. Unfortunately, this otherwise excellent reference has no instruction timings, but I'll be generous and allow 4 cycles for each instruction. At 28MHz, this gives us about .14E6 pixels/sec. Hmmm. Maybe a 68040 could do better.

(See Tomas Rokicki's BlitLab for an explanation of how to draw lines with the blitter.)
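The two throughput figures above are simple divisions, reproduced here in C as a check. The function names are made up for the sketch; the inputs are the post's own (a nominal 7 MHz bus, six cycles per blitter pixel, 50 instructions at an assumed 4 cycles each, 28 MHz CPU clock). Note the CPU figure inherits the post's assumption that all 50 instructions run for every pixel.

```c
#include <assert.h>

/* Pixels per second for a given clock rate and cost per pixel. */
double pixels_per_second(double clock_hz, double cycles_per_pixel)
{
    assert(clock_hz > 0.0 && cycles_per_pixel > 0.0);
    return clock_hz / cycles_per_pixel;
}

/* Blitter figure from the post: one pixel every six 7 MHz bus cycles. */
double blitter_line_rate(void)
{
    return pixels_per_second(7.0e6, 6.0);
}

/* CPU figure from the post: 50 instructions at an assumed
 * 4 cycles each, on a 28 MHz processor. */
double cpu_line_rate(void)
{
    return pixels_per_second(28.0e6, 50.0 * 4.0);
}
```

Under those assumptions the blitter comes out at roughly 1.17 million pixels per second and the CPU at 140,000, matching the 1.17E6 and .14E6 quoted in the post.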
jcs@crash.cts.com (John Schultz) (06/07/90)
In article <30974@ut-emx.UUCP> jonabbey@walt.cc.utexas.edu (Jonathan Abbey) writes:
>I'd like some hard data on the blitter vs. the CPU for polygon rendering..
>I've heard people here assert that the CPU is faster, even on a 68000.
>
>How much faster? Are you using the blitter to copy the polygon into the
>appropriate bitplanes once it is drawn, or are you using the processor to
>fill the bitplanes in parallel?

I compute the specific mask for the specific long word, then write the mask appropriately into the bitplanes (move.l, or.l, or not-and.l). The blitter is never used.

>I'm assuming you're using an ordered edge list?

I'm using a highly efficient table fill algorithm I developed. The trick is to fill the minX/maxX tables as rapidly as possible, then draw lines from minX to maxX, for the polygon's minY to maxY. This only works for convex polygons. The "lines" are 32 bit masks, so worst case for a 320 wide screen would take ten 32 bit writes.

The reason the processor is faster than the blitter in filling polygons is that the blitter must clear a temprast the size of the extent of the polygon, draw the outline with the blitter, xor maximum y points with the processor, fill the mask with the blitter, then blit the mask to each bitplane. If you want polygons that don't have broken or coarse edges, you must re-outline the mask with lines using the blitter. For a four bitplane display, this is equivalent to writing to a six bitplane display, as well as having to draw a wireframe polygon to two bitplanes. Furthermore, the blitter must work with a rectangle, and thus much data movement is wasted, as most polygons are not rectangles. The processor only moves data where the polygon exists.

I just tested out my latest code on a nofastmem 500, and it was about 1 frame/sec slower for small polygons, and 1-2 frames/sec faster on large polygons. I measure speed differences real time with a frame rate counter. I just press a key to toggle between rendering methods.
Using the blitter I'll get a frame rate of 17 f/s, while the processor cranks out 24 f/s. That's about a 41% improvement. In some cases the blitter will bog down to 8 f/s while the processor cranks out 24 f/s, for a 200% improvement (three times the frame rate). This is comparing my custom blitter code that *does not* re-outline the blitter mask, so the polygons look ugly. The ROM code is much slower as it re-outlines the masks. The processor filled polygons look great.

On an Amiga 3000, the processor is at least another 40% faster than my 25 MHz GVP because of the 32 bit chip RAM.

John
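The table-driven fill described in this post can be sketched in C roughly as follows. This is an illustration, not John's code: the function names, the table layout and the use of a plain `uint32_t` array as a bitplane are all assumptions. It takes the minX/maxX tables as already filled in and shows only the span step, where each row is written as 32-bit masks (at most ten writes per row on a 320 pixel wide plane), setting or clearing bits depending on that plane's bit of the colour.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W      320
#define LONGS_PER_ROW (SCREEN_W / 32) /* ten longwords per row */

/* Mask for pixels lo..hi (each 0..31) within one longword.
 * Bit 31 is treated as the leftmost pixel, as on the Amiga. */
static uint32_t span_mask(int lo, int hi)
{
    return (0xFFFFFFFFu >> lo) & (0xFFFFFFFFu << (31 - hi));
}

/* Fill pixels min_x..max_x (inclusive) on one row of one bitplane.
 * plane_bit is this plane's bit of the colour: 1 sets, 0 clears. */
void fill_span(uint32_t *row, int min_x, int max_x, int plane_bit)
{
    assert(row != 0 && 0 <= min_x && min_x <= max_x && max_x < SCREEN_W);
    int first = min_x >> 5;
    int last = max_x >> 5;
    for (int i = first; i <= last; i++) {
        int lo = (i == first) ? (min_x & 31) : 0;
        int hi = (i == last) ? (max_x & 31) : 31;
        uint32_t mask = span_mask(lo, hi);
        if (plane_bit)
            row[i] |= mask;  /* the "or" case */
        else
            row[i] &= ~mask; /* the "not-and" case */
    }
}

/* Fill a convex polygon on one bitplane from per-row tables
 * (hypothetical layout: tables indexed by screen row). */
void fill_from_tables(uint32_t *plane, const int *min_x, const int *max_x,
                      int min_y, int max_y, int plane_bit)
{
    for (int y = min_y; y <= max_y; y++)
        fill_span(plane + y * LONGS_PER_ROW, min_x[y], max_x[y], plane_bit);
}
```

Only the longwords the span actually touches are written, which is the point made above about the processor moving data only where the polygon exists. A tuned 68000 version would presumably use a plain move.l for fully covered interior longwords; the sketch does not bother.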
jesup@cbmvax.commodore.com (Randell Jesup) (06/07/90)
In article <136730@sun.Eng.Sun.COM> cmcmanis@stpeter.Eng.Sun.COM (Chuck McManis) writes:
>In article <30974@ut-emx.UUCP> (Jonathan Abbey) writes:
>>I'd like some hard data on the blitter vs. the CPU for polygon rendering..
>>I've heard people here assert that the CPU is faster, even on a 68000.
>
>The definitive answer to this was given by Jeremy San (wrote StarGlider)
>at the S.F. DevCon last year. His assessment was that the blitter and the
>CPU were about equal. Where they varied was in the effort it took to set
>up the "blit" versus the speed at which the blit was accomplished. The
>blitter is faster than the 68000 once it gets going but the 68000 could
>set up for the next blit a lot faster than the blitter could. 68020 and
>above processors won on both counts, mostly because of the instruction
>cache/clock speed issues.

There are some other caveats as well. The polygons he was dealing with are filled with a solid color (no patterning), and are regular in certain ways; in particular, there can only be two edge-crossings on any given horizontal line through the polygon. Checking for that at run time is far harder (and more expensive) than the time saved would justify, perhaps even on an '030 (then again, the '020 and '030 are pretty fast, and have better bit-field instructions).

--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com BIX: rjesup
Common phrase heard at Amiga Devcon '89: "It's in there!"
kp74615@kaakkuri.tut.fi (Karri Tapani Palovuori) (06/07/90)
In article <3022@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
>then blit the mask to each bitplane. If you want polygons that don't have
>broken or coarse edges, you must re-outline the mask with lines using
>the blitter.

This is not necessary. You can draw the leftmost lines one bit too far left and use exclusive fill mode.

Further, with blitter optimized code you can draw polygons of any shape, including holes in them. This is a great advantage in some situations.

I agree that the 68000 is sometimes faster than the blitter. But it really depends on the average polygon size and shape.

But the speedup gained by faster processors is one of the strongest points supporting CPU-drawing. I think.

> John

Karri
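For readers unfamiliar with the two fill modes, here is a small software model in C of the blitter's area fill on a single 16-bit word, as I understand the description in the Amiga Hardware Reference Manual: the fill runs right to left, each set bit toggles a fill carry, inclusive mode keeps both boundary bits and exclusive mode drops the left one. The function name and interface are invented for the sketch, and it models only the fill logic, not real blitter registers or timing.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of blitter area fill on one word.
 * src:       outline bits (one bit per edge crossing)
 * exclusive: 0 = inclusive fill, nonzero = exclusive fill
 * carry:     fill carry in/out, so words can be chained right to left */
uint16_t blitter_fill_word(uint16_t src, int exclusive, int *carry)
{
    assert(carry != 0 && (*carry == 0 || *carry == 1));
    uint16_t out = 0;
    for (int bit = 0; bit < 16; bit++) { /* bit 0 = rightmost pixel */
        int in = (src >> bit) & 1;
        int px = exclusive ? (in ^ *carry) : (in | *carry);
        out |= (uint16_t)(px << bit);
        *carry ^= in; /* every outline bit toggles the carry */
    }
    return out;
}
```

In this model an outline with bits at positions L and R (L to the left of R) fills R through L in inclusive mode but only R through L-1's right neighbour, i.e. everything except L itself, in exclusive mode. That appears to be why drawing the left edge one bit further left, as suggested above, restores the intended width without re-outlining.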
jcs@crash.cts.com (John Schultz) (06/08/90)
In article <8256@crdgw1.crd.ge.com> barnettj@dollar.crd.ge.com (Janet A Barnett) writes:
>In article <3009@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
>> Unfortunately, with a bitplane oriented display, you only get parallelism
>>on your last bitplane blit. True, you can compute your next blit values before
>>waiting, but that buys you little time.
>
>What about the graphics library routine QBlit(blitnode)? I set up all
>my blits ahead of time in a linked list pointed to by the blitnode
>argument to QBlit(). The blitnode contains a pointer to a routine to

I wrote my own blitter interrupt code to clear the screen in rectangular chunks as opposed to a linear clear of all memory. The interrupt version was slower.

The problem with queueing up blits is that your application may not be able to do anything else until the blits are done. In a flight simulator, you must draw all of your polygons in the correct order. This means that the polygons must be transformed, clipped and sorted first, then all of the drawing must take place. Further, the processor is used to draw points. The points can't be drawn until the blitter is finished. I was going to put processor drawing code in the blitter interrupt code, but the payoff of using blitter interrupts was too little to continue.

>Of course, everything has caveats. Obviously, there is more overhead
>in QBlit than OwnBlitter/WaitBlit/DisownBlitter, but hopefully you can
>make use of the time you don't spend in WaitBlit. Further, if your
>blits are small, it is possible that one interrupt will start several
>blits in a row, probably resulting in no real net improvement from the
>parallel nature of the coprocessor. And, I read once, long ago, that
>there was a problem with WaitBlit in a heavily loaded system. Could

WaitBlit() shouldn't have any problems. If you roll your own, you must read a chip memory or hardware location before testing the blit done bit, as described in the 1.3 Amiga Hardware Manual.
>this manifest itself with QBlit? I don't know. My tests of
>multitasking with QBlit have consisted of doing DIRs of DF0: at the
>same time my graphics are running. Result? Both processes go slow,
>but the blitter seems to be shared correctly.
>
>Consider also QBlit's sibling QBSBlit. Set the beam sync element of
>your blitnode to a vertical scan-line position and the OS will attempt
>to set an interrupt based on the 60Hz CIA timer such that your blitter
>routine is called after the e-beam has passed the indicated scan-line.
>Semiuseful if you're being niggling with display memory and you don't
>mind the occasional glitch when the CPU can't get to your blit in
>time. (When you're making star hash out of some viscous,
>evil-smelling alien, a little extra sparkle in the debris generally
>goes unnoticed.) See the AutoDocs for more info.

Some of us purr-fectionists will notice, and exclaim, "Hack!" :-). Double or triple buffering should handle any animation situation.

>So, even though the blitter may be slower than a 68030, it still
>represents a powerful resource when used properly. By the way, which

You are true. Currently, for block moves, the blitter says to the processor, "You can't touch this!"

>processor will draw a line faster? The blitter (once started) can set
>a single pixel in a line every six 7MHz bus cycles (1.17E6 pixels/sec).
>A line algorithm in Steve Williams's "68030 Assembly Language Reference"
>shows about 50 instructions for implementing the Bresenham Line
>Algorithm. Unfortunately, this otherwise excellent reference has no
>instruction timings, but I'll be generous and allow 4 cycles for each
>instruction. At 28MHz, this gives us about .14E6 pixels/sec. Hmmm.
>Maybe a 68040 could do better.

Heh, heh. That example doesn't work. It draws nice 45 degree lines, and that's it. I use a high performance fixed point method derived from 68000 Assembly Language, by Krantz and Stanley.
Take a look at the main loop (this is for a four bitplane display, 40 bytes per row; a2-a5 are the bitplane ptrs):

PMD [YAAD] Program Module Dismemberer V.14
Copyright (c) 1990, HoweSoft,Inc. All Rights Reserved.

* Program_unit #0 name "<UNNAMED>".
* Code_Hunk [PUBLIC] #0 Length = 60 bytes [15 longwords]
        move.l  D4,D0              ; 4 CYCLES
        move.l  D5,D1              ; 4 CYCLES
        add.l   A0,D0              ; 8 CYCLES
        add.l   A1,D1              ; 8 CYCLES
        swap    D0                 ; 4 CYCLES
        swap    D1                 ; 4 CYCLES
        move.w  D1,D3              ; 4 CYCLES
        add.w   D1,D1              ; 4 CYCLES
        add.w   D1,D1              ; 4 CYCLES
        add.w   D1,D3              ; 4 CYCLES
        lsl.w   #3,D3              ; 12 CYCLES
        move.w  D0,D1              ; 4 CYCLES
        lsr.w   #3,D0              ; 12 CYCLES
        add.w   D0,D3              ; 4 CYCLES
        andi.w  #$07,D1            ; 8 CYCLES
        not.b   D1                 ; 4 CYCLES
        bset    D1,$00(A2,D3.W)    ; 18 CYCLES
        bset    D1,$00(A3,D3.W)    ; 18 CYCLES
        bset    D1,$00(A4,D3.W)    ; 18 CYCLES
        bset    D1,$00(A5,D3.W)    ; 18 CYCLES
        add.l   D6,D4              ; 8 CYCLES
        add.l   D7,D5              ; 8 CYCLES
        dbf     D2,-$38(PC)        ; 10+ CYCLES
        rts                        ; 16 CYCLES
* 206 Total Cycles
* Hunk_End.

So, what does this work out to on a cached 030? Tough call, so I tested it real time against the blitter. It's faster for small lines and slightly slower for very long lines. If we had faster processor access to chip RAM, we could really cook.

>(See Tomas Rokicki's BlitLab for an explanation of how to draw lines
>with the blitter.)

Also, the 1.3 Hardware Manual explains how to draw lines with the blitter, as well as example code.

John
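For anyone who does not read 68000 assembly, the loop in this post can be approximated in C as below. This is a reading of the listing, not John's source: the function and parameter names are invented, x/y and their per-step deltas are taken to be 16.16 fixed point (D4/D5 and D6/D7), and `x_bias`/`y_bias` stand in for whatever A0/A1 hold, which the post does not say (a rounding term or an origin offset would both fit). Coordinates are assumed to stay on screen.

```c
#include <assert.h>
#include <stdint.h>

#define BYTES_PER_ROW 40
#define NUM_PLANES    4

/* Plot steps+1 pixels in all four bitplanes, stepping x and y by
 * fixed-point deltas. dbf runs its body count+1 times, hence the
 * inclusive loop bound. */
void draw_line_fixed(uint8_t *planes[NUM_PLANES],
                     uint32_t x, uint32_t y, uint32_t dx, uint32_t dy,
                     uint32_t x_bias, uint32_t y_bias, unsigned steps)
{
    assert(planes != 0);
    for (unsigned i = 0; i <= steps; i++) {
        /* add.l + swap: integer part of the biased coordinate */
        uint16_t px = (uint16_t)((x + x_bias) >> 16);
        uint16_t py = (uint16_t)((y + y_bias) >> 16);

        /* y*40 + x/8; the listing builds y*40 as (y + 4y) << 3 */
        unsigned offset = (unsigned)py * BYTES_PER_ROW + (px >> 3);

        /* not.b on (x & 7) selects bit 7-(x & 7): leftmost pixel = bit 7 */
        uint8_t bit = (uint8_t)(0x80u >> (px & 7));

        for (int p = 0; p < NUM_PLANES; p++)
            planes[p][offset] |= bit; /* bset in each plane */

        x += dx; /* add.l D6,D4 */
        y += dy; /* add.l D7,D5 */
    }
}
```

Like the listing, this sets the pixel in every plane, so it always draws in the highest colour. The unsigned 32-bit arithmetic wraps the same way the 68000 adds do, so a negative slope can be passed as the two's complement of the delta.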
jcs@crash.cts.com (John Schultz) (06/08/90)
In article <13493@etana.tut.fi> kp74615@kaakkuri.tut.fi (Karri Tapani Palovuori) writes:
>In article <3022@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
>
>>then blit the mask to each bitplane. If you want polygons that don't have
>>broken or coarse edges, you must re-outline the mask with lines using
>>the blitter.
>
>This is not necessary. You can draw the leftmost lines one bit too left and
>use exclusive fill mode.

I tried exclusive fill, inverting the mask, etc. This did produce sharper polygons, *but* they still fell apart when horizontally thin.

>Further, with blitter optimized code you can draw polygons of any shape,
>including holes in them. This is a great advantage in some situations.

That's true, as well as concave polygons.

>I agree that the 68000 is sometimes faster than the blitter. But it really
>depends on the average polygon size and shape.
>
>But the speedup gained by faster processors is one of the strongest points
>supporting CPU-drawing. I think.

You think right. I just read on comp.sys.amiga of a TI TIGA board for the A3000. A 34010 will put to rest any dialogue disputing processor/blitter speed. If it's a 34020 (250 megs/sec transfer), the Amiga will be a most formidable low cost graphics speed demon.

John
jesup@cbmvax.commodore.com (Randell Jesup) (06/08/90)
In article <3035@crash.cts.com> jcs@crash.cts.com (John Schultz) writes:
>In article <8256@crdgw1.crd.ge.com> barnettj@dollar.crd.ge.com (Janet A Barnett) writes:
>>What about the graphics library routine QBlit(blitnode)? I set up all
>>my blits ahead of time in a linked list pointed to by the blitnode
>>argument to QBlit(). The blitnode contains a pointer to a routine to
>
> I wrote my own blitter interrupt code to clear the screen in rectangular
>chunks as opposed to a linear clear of all memory. The interrupt version
>was slower. The problem with queueing up blits is that your application
>may not be able to do anything else until the blits are done. In a
>flight simulator, you must draw all of your polygons in the correct order.
>This means that the polygons must be transformed, clipped and sorted first,
>then all of the drawing must take place. Further, the processor is used
>to draw points. The points can't be drawn until the blitter is finished.
>I was going to put processor drawing code in the blitter interrupt
>code, but the payoff of using blitter interrupts was too little to
>continue.

However, with a design oriented towards it, you can put the extra cycles to use. Either use the blitter interrupts/QBlit and have it do the point rendering from there, and then signal the main process, or split the work into rendering and calculation processes. I suspect the interrupts are faster, but the separate process/task is easier. This allows some nice tricks to make use of all your horsepower to produce a smoother update rate.

However, if you meet the special requirements for the polygons that have been mentioned before, then you are probably better off using the processor (if you're after maximum speed). Or you could pass all lines larger than X to the blitter, and smaller ones to the processor, or use the blitter to block-copy backgrounds/clear mem/whatever while the processor is rendering polygons into a different buffer, etc, etc.
-- Randell Jesup, Keeper of AmigaDos, Commodore Engineering. {uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com BIX: rjesup Common phrase heard at Amiga Devcon '89: "It's in there!"