[comp.sys.amiga.tech] How do you draw to the screen quickly?

jwz@teak.berkeley.edu (Jamie Zawinski) (02/09/90)

I just recently got "popi", the image processing program described in 
Gerard Holzmann's book "The Digital Darkroom"; this program runs on a variety
of systems, but the conditionally-compiled Amiga code is really really slow,
and I'd like to fix this.  So.

Given a one-dimensional array of 8 bit (greyscale) quantities, I need to be
able to draw one scanline.  There must be a way to do this without repeatedly
calling WritePixel(), right?  (I realize that the Amiga can only display four
bits of grey, but 8 bits is used internally by the program, and I'd rather 
not alter its portable data structures.)  Any ideas?

		-- Jamie

a464@mindlink.UUCP (Bruce Dawson) (02/09/90)

     One way of speeding this up is to write your own WritePixel() routine.
This ONLY works if you have opened a custom screen and you aren't going to be
having any windows over top of the window you are writing in to.  You also have
to make sure that menus don't come down over top while you are writing pixels,
or your results will have holes in them.  You can use LockLayers() or some such
to prevent that from happening, but you'd better not LockLayers() for every
pixel or performance will still be really poor (ie; do a LockLayers(), render a
bunch of pixels, then do an UnLockLayers() would work).
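
     A minimal sketch of what such a routine might look like, written in
portable C so it can be tried away from the Amiga.  PlanarBitmap and
put_pixel() are hypothetical names standing in for the plane pointers and
bytes-per-row value you would take from your own screen's bitmap.  The sketch
assumes at most four planes and the leftmost pixel in the most significant bit
of each byte, and it does no clipping or layer locking at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of a planar bitmap that we need. */
typedef struct {
    uint8_t *planes[4];    /* one pointer per bitplane                 */
    size_t bytes_per_row;  /* width of one scanline in bytes           */
    int depth;             /* number of planes in use (assumed <= 4)   */
} PlanarBitmap;

/* Set pixel (x, y) to the given colour index by writing one bit into
   each plane.  No bounds checking: the caller must stay inside the bitmap. */
void put_pixel(PlanarBitmap *bm, int x, int y, unsigned color)
{
    size_t offset = (size_t)y * bm->bytes_per_row + (size_t)(x >> 3);
    uint8_t mask = (uint8_t)(0x80u >> (x & 7)); /* leftmost pixel = MSB */
    int p;

    for (p = 0; p < bm->depth; p++) {
        if (color & (1u << p))
            bm->planes[p][offset] |= mask;            /* set bit p of colour   */
        else
            bm->planes[p][offset] &= (uint8_t)~mask;  /* clear bit p of colour */
    }
}
```

     For example, with bytes_per_row = 4, put_pixel(&bm, 9, 1, 5) touches byte
5 of each plane, setting bit 0x40 in planes 0 and 2 and clearing it in planes
1 and 3.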

     Another way, probably better, is to create a one scan-line buffer. Use
some assembler code to translate an array of colours into your buffer, then use
DrawImage() or some such to render the whole line onto the screen/window.  This
is more robust.  Warning:  When I last checked (several years ago I admit),
DrawImage didn't block menus (ie; didn't do a LockLayers()), meaning that menus
could drop down and get drawn over by DrawImage(), thus munging the display.
If this is still true, put your own LockLayers() calls around calls to
DrawImage.

     All of this is irrelevant if the calcs are the real CPU hog.  Write a
program to just write pixels to the screen and time it.  If it takes 1% as long
as the program you're optimizing, ignore screen write time.  If it takes 50% as
long...

.Bruce.

mwm@raven.pa.dec.com (Mike (Under Construction) Meyer) (02/10/90)

>> I just recently got "popi", the image processing program described in 
>> Gerard Holzmann's book "The Digital Darkroom"; this program runs on a variety
>> of systems, but the conditionally-compiled Amiga code is really really slow,
>> and I'd like to fix this.  So.

The problem isn't really the Amiga specific code; the problem is that
the thing is just plain slow (it's barely useable on a DECStation
3100, at 14 MIPS). The heart of the code is a loop that interprets the
stack-machine code for the transformation expression once for every
pixel on the screen.

The easy solution is to use fewer pixels - say 128x128 instead of
512x512; giving you a factor of 16 increase in speed with no work. The
hard solution is to rewrite the "compiler" to generate a loop that
runs in 68K machine language & run that, instead of generating
stack-machine code and interpreting that (this is what pico, mentioned
in the book, did). An intermediate solution is to write something that
translates the stack-machine code into 68K ML, and run that instead of
the stack-machine code.

Of course, speeding up the display code wouldn't hurt. But I don't see
much point until the transformation code is fixed. In any case, the
best way I can see to do the drawing code is to have the blitter move
one (two?) bit-plane at a time from the image array into the drawing
area. You could also do this for each scan line, but that will be
slower.

Whatever you do, let us know!

	<mike

--
He was your reason for living				Mike Meyer
So you once said					mwm@berkeley.edu
Now your reason for living				ucbvax!mwm
Has left you half dead					mwm@ucbjade.BITNET

a464@mindlink.UUCP (Bruce Dawson) (02/11/90)

> doug writes:
> 
> Msg-ID: <658@xdos.UUCP>
> Posted: 11 Feb 90 17:22:51 GMT
> 
> Org.  : Hunter Systems, Mountain View CA (Silicon Valley)
> Person: Doug Merritt
> 
> In article <1092@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
> >
> >     One way of speeding this up is to write your own WritePixel() routine.
> 
> To give some idea, I did this for Thad's FaceShower program. It originally
> used WritePixel for displaying 256x256 (approx.) pixels, and took about
> 45 seconds. Changing to direct writes dropped it to around 20 seconds.
> So if that's the approximate amount of time you want to save, then this
> might be the way to go. (Factor of two.)
> 
> But usually it's the algorithm that benefits most from optimization.
> Changing the inline dithering calculations to table lookups with layout
> optimized for the screen's bitplane layout dropped the time from 20
> seconds to around 1 second. (Factor of twenty.)
> 
> Member, Crusaders for a Better Tomorrow         Professional Wildeyed Visionary

     Be careful where you give the credit for the speedup.  Changing the
algorithm, in this particular case, saved nineteen seconds.  Changing the
WritePixel() routine saved twenty-five seconds.  The only reason that changing
the algorithm looked so much better was because when you optimized it, it was
the only major user of CPU time left.  If you'd changed the algorithm and then
changed WritePixel, you would have reported just under a factor of two
improvement for the algorithm, and a twenty-six times increase for WritePixel.

     In reality, the two changes should share the credit, and little more than
a factor of two improvement is possible without _both_ of them.
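
     The arithmetic can be checked with a few lines of C.  This is only a
sketch: the Speedups struct and order_speedups() are hypothetical names, and
the inputs are the rounded figures from the quoted article (45 seconds total,
25 saved by the pixel routine, 19 saved by the algorithm).

```c
/* Apparent speedup factors when two fixed savings are applied in order. */
typedef struct {
    double first_factor;   /* speedup credited to the change made first  */
    double second_factor;  /* speedup credited to the change made second */
} Speedups;

/* total: original run time in seconds.
   first_saving, second_saving: seconds removed by each change. */
Speedups order_speedups(double total, double first_saving, double second_saving)
{
    Speedups s;
    double after_first = total - first_saving;

    s.first_factor = total / after_first;
    s.second_factor = after_first / (after_first - second_saving);
    return s;
}
```

     order_speedups(45, 25, 19) gives 2.25 and then 20, which is the order
reported.  order_speedups(45, 19, 25) gives about 1.73 and then 26.  The
seconds saved are the same either way; only the ratios move.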

.Bruce.

     P.S.  I totally agree that the potential speed increases from a good
algorithm (as opposed to rewriting in hand coded assembler or similar
techniques) are generally more important.  But sometimes both are necessary,
and an optimized WritePixel() routine is particularly easy and can speed things
up greatly.

p554mve@mpirbn.UUCP (Michael van Elst) (02/11/90)

In article <21908@pasteur.Berkeley.EDU> Jamie Zawinski <jwz@teak.berkeley.edu> writes:
>Given a one-dimensional array of 8 bit (greyscale) quantities, I need to be
>able to draw one scanline.  There must be a way to do this without repeatedly
>calling WritePixel(), right?  (I realize that the Amiga can only display four
>bits of grey, but 8 bits is used internally by the program, and I'd rather 
>not alter its portable data structures.)  Any ideas?

The fastest way to fill lines is to move them right into the display bitmap.
You have to split the incoming pixel values into bits, say 32 pixels
at a time and construct the memory words that are needed for bitplanes.
Since you only want to display 16 colors you might use 4 registers where
the four longwords for each bitplane are assembled. Now that's for
68000 programmers but a good C-compiler could achieve nearly the same
performance.
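
As a rough sketch of that idea in C: the hypothetical pack32() below takes 32
greyscale bytes and builds one longword per bitplane, keeping the four
accumulators in locals that a compiler can hold in registers. It assumes the
top four bits of each byte serve as the colour index and that pixel 0 belongs
in the most significant bit. Storing the results into real bitplane memory is
left out.

```c
#include <stdint.h>

/* Pack 32 grey pixels (8 bits each) into one 32-bit word per bitplane.
   The top four bits of each pixel are used as the 4-bit colour index. */
void pack32(const uint8_t pixels[32], uint32_t plane_words[4])
{
    uint32_t w0 = 0, w1 = 0, w2 = 0, w3 = 0;
    int i;

    for (i = 0; i < 32; i++) {
        unsigned level = pixels[i] >> 4;       /* 8-bit grey -> 4-bit level */

        /* Shift each accumulator left and append this pixel's bit, so that
           pixel 0 ends up in bit 31 and pixel 31 in bit 0. */
        w0 = (w0 << 1) | (level & 1u);
        w1 = (w1 << 1) | ((level >> 1) & 1u);
        w2 = (w2 << 1) | ((level >> 2) & 1u);
        w3 = (w3 << 1) | ((level >> 3) & 1u);
    }

    plane_words[0] = w0;
    plane_words[1] = w1;
    plane_words[2] = w2;
    plane_words[3] = w3;
}
```

On a big-endian machine such as the 68000 each result can be stored straight
into its plane as a longword. On a little-endian host the bytes would have to
be written out most significant first to get the same memory layout.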

One problem still arises: if you write into the screen's bitmap you get
into collisions with Intuition rendering (like menus). And arbitrarily
positioned windows are another hack.

All these problems can be avoided if you put your pixels into an offscreen
bitmap (pretty aligned, no locking needed). And then make a call to
BltBitMapRastPort to move the data onto the screen.

If you consider speed you may use double buffering. When one bitmap
is drawn, the second can be copied to the screen. Note that it is not as
easy to do completely asynchronous blitter operations. The simplest way
is to use another drawing task that calls the graphics functions.

Michael van Elst
uunet!unido!mpirbn!p554mve

doug@xdos.UUCP (Doug Merritt) (02/12/90)

In article <1092@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>
>     One way of speeding this up is to write your own WritePixel() routine.

To give some idea, I did this for Thad's FaceShower program. It originally
used WritePixel for displaying 256x256 (approx.) pixels, and took about
45 seconds. Changing to direct writes dropped it to around 20 seconds.
So if that's the approximate amount of time you want to save, then this
might be the way to go. (Factor of two.)

But usually it's the algorithm that benefits most from optimization.
Changing the inline dithering calculations to table lookups with layout
optimized for the screen's bitplane layout dropped the time from 20
seconds to around 1 second. (Factor of twenty.)
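
The FaceShower code is not shown here, so the following only illustrates the
general trick of trading inline dithering arithmetic for a table; it is not
that program's routine. It assumes a 2x2 ordered dither from 8-bit grey down
to 16 levels, and the names dither_lut, build_dither_lut() and
dither_scanline() are made up for the example. Arranging the table output to
match the bitplane layout is not attempted.

```c
#include <stdint.h>

/* dither_lut[y & 1][x & 1][grey] holds the 4-bit level for that pixel. */
static uint8_t dither_lut[2][2][256];

/* Fill the table once, before any drawing. */
void build_dither_lut(void)
{
    /* 2x2 ordered-dither matrix; each entry selects a rounding threshold. */
    static const unsigned matrix[2][2] = { { 0, 2 }, { 3, 1 } };
    int y, x, g;

    for (y = 0; y < 2; y++) {
        for (x = 0; x < 2; x++) {
            unsigned threshold = matrix[y][x] * 64u + 32u;

            for (g = 0; g < 256; g++)
                dither_lut[y][x][g] =
                    (uint8_t)(((unsigned)g * 15u + threshold) / 255u);
        }
    }
}

/* Convert one scanline of 8-bit greys to 4-bit levels by lookup alone. */
void dither_scanline(const uint8_t *grey, uint8_t *level, int width, int y)
{
    int x;

    for (x = 0; x < width; x++)
        level[x] = dither_lut[y & 1][x & 1][grey[x]];
}
```

With this table a mid grey of 128 comes out as levels 7, 8 on even rows and
8, 7 on odd rows, so the inner loop does no multiplies or divides at all.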

When I hand optimized the assembler in the inner loop, it went from 1.2
seconds to .85 seconds. (Factor of 1.4)

The moral of this is obviously that you should always optimize the
algorithm first, before you start dinking around with bypassing AmigaDOS
and hacking assembler. That's where the biggest wins are.

Then again, running it on Thad's 68020 sped it up from 0.85 sec to 0.25 sec,
and you can't beat a factor of 3.4 speedup with no source changes! :-)
	Doug
-- 
Doug Merritt		{pyramid,apple}!xdos!doug
Member, Crusaders for a Better Tomorrow		Professional Wildeyed Visionary

doug@xdos.UUCP (Doug Merritt) (02/13/90)

In article <1109@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>     Be careful where you give the credit for the speedup.  Changing the
>algorithm, in this particular case, saved nineteen seconds. [...]

Hmm. Sounds logical; looks like you caught me committing sloppy thinking.
Right, it's additive in this case, not multiplicative.

In any case it's still a good testimonial to those who think that
coding everything in assembler is the only way to go. There was an
article called "FastPix" in the Nov. Amazing Computing that showed
how to do a custom pixel writing routine in assembler to save time,
which will mislead many people.

The impression given is that such a routine would be the ultimate in speed,
just because it's written in assembler. Whereas usually a bigger gain
can be gotten from staying in C and experimenting with different
algorithmic speedups. Using assembler makes sense only when all other
alternatives have been exhausted, and that's the only source of speedup
left.

Several times I've seen commercial products advertised as "(re)written
100% in assembler!!!!!!", as if that were a plus. Inner loops should
be in assembler, not entire programs.
	Doug

P.S. I suppose I'll now be inundated with flames from people pointing
out that, since they have no hard disk, size of programs is critical
to squeeze as many as possible onto their floppies. Yeah, I know, I was
without a hard disk on my Amiga for years. But although some people write
really perfect programs in assembler, with most you pay for that size
bonus with generally buggier programs, simplistic algorithms, etc.
-- 
Doug Merritt		{pyramid,apple}!xdos!doug
Member, Crusaders for a Better Tomorrow		Professional Wildeyed Visionary

cmcmanis@stpeter.Sun.COM (Chuck McManis) (02/13/90)

In article <1092@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>     [good comments about speeding up writing deleted]

> Warning:  When I last checked (several years ago I admit), DrawImage 
> didn't block menus (ie; didn't do a LockLayers()), meaning that menus
> could drop down and get drawn over by DrawImage(), thus munging the display.
> If this is still true, put your own LockLayers() calls around calls to
> DrawImage.

In fact what is going on here is that the Screen rastport doesn't have an
associated LayerInfo structure ('cuz it isn't layered) and this makes
it faster to write to because the software doesn't check for clipping or
"locked layers" but has the drawback that when Intuition thinks it has
the system "locked" (like when a menu is down) writing to the screen
will continue unconstrained. 

When writing to the Screen rastport you had better check your clipping
range and, if you use menus, do something like MENU_VERIFY to figure
out when menus are down.


--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis  ARPAnet: cmcmanis@Eng.Sun.COM
These opinions are my own and no one elses, but you knew that didn't you.
"If it didn't have bones in it, it wouldn't be crunchy now would it?!"

nsw@cbnewsm.ATT.COM (Neil Weinstock) (02/13/90)

In article <1092@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>
>     One way of speeding this up is to write your own WritePixel() routine.
[ ... ]

Geez, you guys are all missing the obvious and correct solution.  The
answer, as it almost always is, is to use a lookup table.  Use the data for 
each scan line as the lookup index (simply concatenate the pixel values into
one long binary value), and put the corresponding Image structure into that 
table entry.  So, you just take the data, look up the appropriate scan line, 
and use DrawImage().  Only one operation per scan line.  How much quicker
can you get?

Extending this method to the entire screen is left as an exercise for the
reader.
    ________________    __________________    _________________________
////                \\//                  \\//                         \\\\
\\\\ Neil Weinstock //\\ att!cord!nsw  or //\\ "Your hair is so...     ////
//// AT&T Bell Labs \\// nsw@cord.att.com \\//  lustre-laden." - Moss  \\\\
\\\\________________//\\__________________//\\_________________________////

jeh@elmgate.UUCP (Ed Hanway) (02/16/90)

In article <9231@cbnewsm.ATT.COM> nsw@cbnewsm.ATT.COM (Neil Weinstock) writes:
>Geez, you guys are all missing the obvious and correct solution.  The
>answer, as it almost always is, is to use a lookup table.  Use the data for 
>each scan line as the lookup index (simply concatenate the pixel values into
>one long binary value), and put the corresponding Image structure into that 
>table entry.  So, you just take the data, look up the appropriate scan line, 
>and use DrawImage().  Only one operation per scan line.  How much quicker
>can you get?

It's a joke, right? My calculator gave up but "bc" dutifully told me
that a look-up table for each possible low-res scan line would take
3176248838703735071388317970997340557438739699483247316076389267090944\
8424296143487586179936927258864875554241803836083347728168482432449060\
4855548268961204923221725750403452934972636403246568312773419816518854\
1019193627059188378560474396174450152676920272486972400155193608576973\
9549574787063085604610601339929877328920047078547471584268060390863192\
35148982818821631076219391836160 megabytes of RAM, assuming 4 bit planes.

So are you going to post the table? :-)
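
For anyone who wants to check the figure without "bc": it is consistent with
2^1280 table entries (320 low-res pixels times 4 bits each as the index) of
160 bytes apiece, counted in megabytes of 2^20 bytes, which works out to
160 * 2^1260. Those parameters are inferred from the size of the number; the
post does not state them. The sketch below, with hypothetical function names,
does the multiplication by repeated decimal doubling so it needs no bignum
library.

```c
#include <stddef.h>

#define MAX_DIGITS 512

/* Write the decimal digits of multiplier * 2^shift into out as a string.
   Returns the number of digits, or 0 if a buffer is too small. */
size_t times_power_of_two(unsigned multiplier, unsigned shift,
                          char *out, size_t out_size)
{
    unsigned char digits[MAX_DIGITS];  /* least significant digit first */
    size_t n = 0, i;
    unsigned k;

    /* Split the multiplier into decimal digits. */
    do {
        digits[n++] = (unsigned char)(multiplier % 10u);
        multiplier /= 10u;
    } while (multiplier != 0);

    /* Double the whole number 'shift' times, carrying digit by digit. */
    for (k = 0; k < shift; k++) {
        unsigned carry = 0;

        for (i = 0; i < n; i++) {
            unsigned v = digits[i] * 2u + carry;
            digits[i] = (unsigned char)(v % 10u);
            carry = v / 10u;
        }
        if (carry != 0) {
            if (n == MAX_DIGITS)
                return 0;
            digits[n++] = (unsigned char)carry;
        }
    }

    if (n + 1 > out_size)
        return 0;
    for (i = 0; i < n; i++)
        out[i] = (char)('0' + digits[n - 1 - i]);  /* most significant first */
    out[n] = '\0';
    return n;
}

/* Megabytes for a table with one raw scanline per possible scanline,
   assuming 320 pixels, 4 bitplanes and 2^20 bytes per megabyte. */
size_t scanline_table_megabytes(char *out, size_t out_size)
{
    const unsigned width = 320, planes = 4;
    const unsigned index_bits = width * planes;   /* 1280 bits of index   */
    const unsigned entry_bytes = index_bits / 8;  /* 160 bytes per entry  */

    return times_power_of_two(entry_bytes, index_bits - 20, out, out_size);
}
```

Under those assumptions the result is a 382-digit number beginning 3176, the
same length and leading digits as the figure quoted above.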