[comp.sys.mac] image rotation

kim@analog.UUCP (Kim Helliwell) (12/11/87)

I am interested in any information anyone may have on algorithms for
rotating bitmaps by +/- 90 degrees.

I am aware of two-- the obvious brute-force method of grabbing a pixel
and swapping indices (with an appropriate translation) before stuffing
it into another bitmap, and the parallel blitter version which was
described in an issue of MacTutor a couple of years ago.

I have been thinking about the relative speeds of these two, and have
come up with the following analysis:

	the brute force method obviously goes as n*n, where n is the
	number of pixels on a side of the bitmap.

	the blitter version actually goes as n*n*log(n), because of the
	log(n) steps required and the n*n memory reads/writes required
	by the blitting operation at each step.

The log() means log base two, of course.  What this tells me is that the
blitter version is only a win if you have a hardware blitter (with
lots of parallelism and DMA, for example).  For use with CopyBits, it
appears to be a lab curiosity.  Or is there something I am missing?
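
For concreteness, here is a rough sketch of the brute-force approach I have
in mind (not my actual code, and untested; it assumes 1-bit-deep bitmaps, a
clockwise rotation, and rowBytes values as in a Mac BitMap, and the names
are just illustrative):

/* Brute force: walk every source pixel, compute where it lands, and
** set or clear that bit in the destination.  The source is h rows by
** w columns; the destination is w rows by h columns.
*/
rotate_brute(src, srcRowBytes, w, h, dst, dstRowBytes)
unsigned char *src, *dst;
int srcRowBytes, w, h, dstRowBytes;
{
    int r, c, dc;

    for (r = 0; r < h; r++)
        for (c = 0; c < w; c++) {
            dc = h - 1 - r;         /* source (r, c) lands at dest (c, h-1-r) */
            if (src[r * srcRowBytes + (c >> 3)] & (0x80 >> (c & 7)))
                dst[c * dstRowBytes + (dc >> 3)] |=  (0x80 >> (dc & 7));
            else
                dst[c * dstRowBytes + (dc >> 3)] &= ~(0x80 >> (dc & 7));
        }
}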

In any case, I have tried both methods, and neither of them is anywhere
close to as fast as MacPaint.  I am interested in speeding things up, and
would like to know whether anyone can do this faster.  I admit that I haven't
done much optimizing to date.  Would rewriting it in straight assembly
be enough to speed it up?  (I am writing it in C at present.)  Or is there
something really clever that Paint does?

Any leads or help will be appreciated.

Kim Helliwell
hplabs!analog!kim

oster@dewey.soe.berkeley.edu (David Phillip Oster) (12/12/87)

In article <600@analog.UUCP> kim@analog.UUCP (Kim Helliwell) writes:
>I am interested in any information anyone may have on algorithms for
>rotating bitmaps by +/- 90 degrees.

In the summer, Dr. Dobb's magazine of software tools said that they were
planning a graphics issue and that outlines for articles could be posted
to a certain computer mail account. Well, I sent them an article outline
and got no reply, even though I posted it by logging in to the very machine
their account was on.

My article was on this very topic. The article in MacTutor refers to a
scheme described in the first Smalltalk-80 book. That scheme would work
well if every CopyBits call took the same time, independent of its
arguments.

The scheme I use is: divide the picture to be rotated into 8x8 pixel
blocks. (Pad the edges to make them come out exactly on block boundaries.)
Now the destination coordinates of each block are pretty obvious.  For
example:

a b c d     i e a
e f g h =>  j f b
i j k l     k g c
            l h d

The last remaining step is to take one of these 8x8 blocks and rotate
it. The following C code is a schema for how you do that.

unsigned char src8[8], dest8[8];	/* the 8 rows of the source and dest blocks */
unsigned char src, srcBit, destBit;
int i, j;

for(i = 0, destBit = 1; i < 8; i++, destBit *= 2){
  src = src8[i];		/* fetch source row i */
  for(j = 0, srcBit = 0x80; j < 8; j++, srcBit /= 2){
    if(src & srcBit){		/* if the source bit is on */
      dest8[j] |= destBit;	/* turn the dest bit on */
    }else{			/* else	*/
      dest8[j] &= ~destBit;     /* turn the dest bit off */
    }
  }
}

Of course, you'd use plenty of register variables, and replace the
explicit array referencing here with pointer arithmetic.

Then you copybits from the offscreen bitmap where you have been doing the
work back to the screen.
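
Putting it together, the bookkeeping around that schema might look roughly
like the following untested sketch (the function and parameter names are
just illustrative; it assumes 1-bit-deep bitmaps whose dimensions are
already padded to multiples of 8, rotated clockwise into the offscreen
destination):

/* Rotate the whole source 90 degrees clockwise into an offscreen
** destination.  The source is (height) rows of (srcRowBytes) bytes;
** the destination has (srcRowBytes * 8) rows of (destRowBytes) bytes,
** with destRowBytes at least height/8.
*/
rotate90(srcBase, srcRowBytes, height, destBase, destRowBytes)
unsigned char *srcBase, *destBase;
int srcRowBytes, height, destRowBytes;
{
    int blockRows = height / 8;		/* 8-pixel block rows in the source */
    int br, bc, i, j;
    unsigned char *sp, *dp;
    unsigned char src8[8], dest8[8], src, srcBit, destBit;

    for (br = 0; br < blockRows; br++)
        for (bc = 0; bc < srcRowBytes; bc++) {
            sp = srcBase + (br * 8) * srcRowBytes + bc;
            /* source block (br, bc) lands at dest block (bc, blockRows-1-br) */
            dp = destBase + (bc * 8) * destRowBytes + (blockRows - 1 - br);

            for (i = 0; i < 8; i++)	/* gather the block's 8 source rows */
                src8[i] = sp[i * srcRowBytes];

            for (i = 0, destBit = 1; i < 8; i++, destBit *= 2) {  /* the schema above */
                src = src8[i];
                for (j = 0, srcBit = 0x80; j < 8; j++, srcBit /= 2) {
                    if (src & srcBit)
                        dest8[j] |= destBit;
                    else
                        dest8[j] &= ~destBit;
                }
            }

            for (j = 0; j < 8; j++)	/* scatter the rotated rows into place */
                dp[j * destRowBytes] = dest8[j];
        }
}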

I've programmed this up, and it _is_ fast.

--- David Phillip Oster            --A Sun 3/60 makes a poor Macintosh II.
Arpa: oster@dewey.soe.berkeley.edu --A Macintosh II makes a poor Sun 3/60.
Uucp: {uwvax,decvax,ihnp4}!ucbvax!oster%dewey.soe.berkeley.edu

ali@rocky.STANFORD.EDU (Ali Ozer) (12/12/87)

In article ... oster@dewey.soe.berkeley.edu.UUCP (David Phillip Oster) writes:
>In article <600@analog.UUCP> kim@analog.UUCP (Kim Helliwell) writes:
>>I am interested in any information anyone may have on algorithms for
>>rotating bitmaps by +/- 90 degrees.
>The scheme I use is: divide the picture to be rotated into 8x8 pixel
>blocks. (Pad the edges to make them come out exactly on block boundaries.)

I've included (at the end of this article) the 8x8 rotation code, in
assembly, written by Tomas Rokicki (rokicki@sushi.stanford.edu).
It's lengthy, mainly because the loops are unrolled. It was written
for the Amiga, but it's general enough that, with a little attention to
how the parameters come in, it should work on the Mac. And it's fast ---
I use it to rotate 64x64, 4-bit-plane images. (But it's a CPU hog...)

Also, no matter how large the bitmap you're rotating is, you only need
ONE 8x8 temporary, so you can do the rotation in the source area itself.
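
For instance, the in-place bookkeeping might go something like the following
untested sketch of mine (it assumes a square bitmap, assumes flip() below
really does rotate its 8x8 block clockwise, and the names are just
illustrative):

/* In-place clockwise rotation of a square 1-bit bitmap that is nb*8
** pixels (nb bytes) on a side.  Each pass moves four 8x8 blocks around
** a cycle, rotating each with flip(); tmp is the single 8x8 temporary.
*/
#define BLOCK(r, c)  (base + (r) * 8 * rowBytes + (c))

rotate_square_in_place(base, nb)
unsigned char *base;
int nb;
{
    int rowBytes = nb;
    int r, c, i;
    unsigned char tmp[8];

    for (r = 0; r < nb / 2; r++)
        for (c = r; c < nb - 1 - r; c++) {
            flip(BLOCK(r, c), rowBytes, tmp, 1);	/* rotated (r,c) -> temp */
            flip(BLOCK(nb-1-c, r), rowBytes, BLOCK(r, c), rowBytes);
            flip(BLOCK(nb-1-r, nb-1-c), rowBytes, BLOCK(nb-1-c, r), rowBytes);
            flip(BLOCK(c, nb-1-r), rowBytes, BLOCK(nb-1-r, nb-1-c), rowBytes);
            for (i = 0; i < 8; i++)			/* temp -> (c, nb-1-r) */
                BLOCK(c, nb-1-r)[i * rowBytes] = tmp[i];
        }

    if (nb & 1)		/* odd nb: the center block rotates onto itself, which */
			/* is safe since flip() reads all eight bytes first    */
        flip(BLOCK(nb/2, nb/2), rowBytes, BLOCK(nb/2, nb/2), rowBytes);
}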

Ali Ozer, ali@rocky.stanford.edu
--------------------------------
/* Rotation code from Tomas Rokicki.
** If you use it, please leave the author's name intact!
*/

/*
 *   This code flips an 8 x 8 rectangle to the right 90 degrees.
 *   Note that the blitter could be up to 4 times faster (according
 *   to my preliminary calculations), but it would probably also be
 *   more complex, and this should be plenty fast.  Also, for small
 *   rectangles like 16 x 16, this will certainly be faster.
 */
/*
 *   This subroutine takes four arguments, two for the source and
 *   two for the destination.  There is a pointer to the first byte
 *   to be flipped, and the value to be added to the address for
 *   each subsequent byte.
 */
flip(sp, sw, dp, dw)
unsigned char *sp, *dp ;
int sw, dw ;
{
/* sp = 8(a5), sw = 12(a5), dp = 14(a5), dw = 18(a5) */
#asm
;
;   This subroutine actually flips the eight bytes in the lower byte
;   of d0-d7, and sticks the results in the upper bytes of d0-d7,
;   rotating their previous contents down.  Don't let the length of
;   it scare you; you might be able to discern a pattern in the
;   instructions.  It's really quite simple.
;
	movem.w	reglist,-(sp)
	move.l	8(a5),a0
	move.w	12(a5),a1
	move.b	(a0),d0
	add.w	a1,a0
	move.b	(a0),d1
	add.w	a1,a0
	move.b	(a0),d2
	add.w	a1,a0
	move.b	(a0),d3
	add.w	a1,a0
	move.b	(a0),d4
	add.w	a1,a0
	move.b	(a0),d5
	add.w	a1,a0
	move.b	(a0),d6
	add.w	a1,a0
	move.b	(a0),d7
	roxr.b	#1,d0
	roxr.w	#1,d0
	roxr.w	#1,d1
	roxr.w	#1,d0
	roxr.w	#1,d2
	roxr.w	#1,d0
	roxr.w	#1,d3
	roxr.w	#1,d0
	roxr.w	#1,d4
	roxr.w	#1,d0
	roxr.w	#1,d5
	roxr.w	#1,d0
	roxr.w	#1,d6
	roxr.w	#1,d0
	roxr.w	#1,d7
	roxl.w	#8,d0
	roxr.b	#1,d1
	roxr.w	#1,d1
	roxr.w	#1,d2
	roxr.w	#1,d1
	roxr.w	#1,d3
	roxr.w	#1,d1
	roxr.w	#1,d4
	roxr.w	#1,d1
	roxr.w	#1,d5
	roxr.w	#1,d1
	roxr.w	#1,d6
	roxr.w	#1,d1
	roxr.w	#1,d7
	roxl.w	#8,d1
	roxr.b	#1,d2
	roxr.w	#1,d2
	roxr.w	#1,d3
	roxr.w	#1,d2
	roxr.w	#1,d4
	roxr.w	#1,d2
	roxr.w	#1,d5
	roxr.w	#1,d2
	roxr.w	#1,d6
	roxr.w	#1,d2
	roxr.w	#1,d7
	roxl.w	#8,d2
	roxr.b	#1,d3
	roxr.w	#1,d3
	roxr.w	#1,d4
	roxr.w	#1,d3
	roxr.w	#1,d5
	roxr.w	#1,d3
	roxr.w	#1,d6
	roxr.w	#1,d3
	roxr.w	#1,d7
	roxl.w	#8,d3
	roxr.b	#1,d4
	roxr.w	#1,d4
	roxr.w	#1,d5
	roxr.w	#1,d4
	roxr.w	#1,d6
	roxr.w	#1,d4
	roxr.w	#1,d7
	roxl.w	#8,d4
	roxr.b	#1,d5
	roxr.w	#1,d5
	roxr.w	#1,d6
	roxr.w	#1,d5
	roxr.w	#1,d7
	roxl.w	#8,d5
	roxr.b	#1,d6
	roxr.w	#1,d6
	roxr.w	#1,d7
	roxl.w	#8,d6
	roxr.b	#1,d7
	roxl.w	#8,d7
	move.l	14(a5),a0
	move.w	18(a5),a1
	move.b	d7,(a0)
	add.w	a1,a0
	move.b	d6,(a0)
	add.w	a1,a0
	move.b	d5,(a0)
	add.w	a1,a0
	move.b	d4,(a0)
	add.w	a1,a0
	move.b	d3,(a0)
	add.w	a1,a0
	move.b	d2,(a0)
	add.w	a1,a0
	move.b	d1,(a0)
	add.w	a1,a0
	move.b	d0,(a0)
	movem.w	(sp)+,reglist
reglist	reg	d0/d1/d2/d3/d4/d5/d6/d7/a0/a1
#endasm
}

gillies@uiucdcsp.cs.uiuc.edu (12/13/87)

I think your analysis of the parallel bitblt algorithm is fallacious.
The speed of a BitBLT EVEN IN ASSEMBLY LANGUAGE is approx n*n/16 or
n*n/32, depending upon the word size of your machine.  See [Deutsch84]
to learn how to dynamically "compile" a bitblt request into machine
language, then execute it like a bat out of hell: a kind of
self-modifying (self-compiling?) algorithm.  For the particular Mac rotation
problem, you could hand-code each of the 1..9 bitblt operations.

The recursive BitBLT trick is outlined in the book "Smalltalk-80: The
Language and its Implementation", pp. 408-410.  Using clever boolean ops
and masks, you can do all the recursive BitBLTs in parallel, hence you
need only log(n) BITBLTs.  Since the Mac screen is approximately 512*512
dots, a full-screen BITBLT rotation should take:

(n*n/16)*log(2,512) = 9*(n*n)/16, or roughly (n*n/2), time.

Thus the blitter algorithm should *always* outperform the
simple-minded bit-by-bit transfer algorithm on the macintosh.

[Deutsch84] (E-mail me for the ref -- it's not here, or check the ACM Guide)

Don Gillies {ihnp4!uiucdcs!gillies} U of Illinois
            {gillies@p.cs.uiuc.edu}

oster@dewey.soe.berkeley.edu (David Phillip Oster) (12/14/87)

In article <76000070@uiucdcsp> gillies@uiucdcsp.cs.uiuc.edu writes:

>I think your analysis of the parallel bitblt algorithm is fallacious.
> [much deleted, check it, it is interesting]
Your posting shows the blind preference for elegance that I was critiquing
in my article. Think it through:

1.) Here is the analysis again:

The recursive version does O(log N) BitBlts to rotate an NxN bitmap
by 90 degrees. (It does three BitBlts to interchange quarters, then must
recursively interchange those quarters all the way down.  Each recursive
step takes another three BitBlts.)

In addition, the recursive algorithm
only works directly for squares that are a power of 2 on edge. (To
handle other sizes, you usually copy into an offscreen bitmap that is
large enough, rotate it, and copy the result back.)

The 8x8 transpose version does O(1) BitBlts to rotate a bitmap by 90 degrees.
In addition, it works directly for any rectangle whose edges are a
multiple of 8.


2.) By experiment:
I have actually coded up and run both versions. For a 256 by 256 bitmap,
the 8x8 transpose method was about 10 times faster than the recursive
CopyBits method. (And I didn't even optimize the transpose method.)

The recursive method is elegant, but _slow_. The point of my article is that
the elegant algorithms you learn to appreciate in an algorithms class 
sometimes really _are_ worse than using brute force in the real world.

--- David Phillip Oster            --A Sun 3/60 makes a poor Macintosh II.
Arpa: oster@dewey.soe.berkeley.edu --A Macintosh II makes a poor Sun 3/60.
Uucp: {uwvax,decvax,ihnp4}!ucbvax!oster%dewey.soe.berkeley.edu

kim@analog.UUCP (Kim Helliwell) (12/17/87)

In article <76000070@uiucdcsp>, gillies@uiucdcsp.cs.uiuc.edu writes:
> 
> I think your analysis of the parallel bitblt algorithm is fallacious.
> The speed of a BitBLT EVEN IN ASSEMBLY LANGUAGE is approx n*n/16 or
> n*n/32, depending upon the word size of your machine.  See [Deutsch84]
> to learn how to dynamically "compile" a bitblt request into machine
> language, then execute it like a bat out of hell: a kind of
> self-modifying (self-compiling?) algorithm.  For the particular Mac rotation
> problem, you could hand-code each of the 1..9 bitblt operations.
> 
> ...
> ...
> 
> (n*n/16)*log(2,512) = 9*(n*n)/16, or roughly (n*n/2), time.
> 
> Thus the blitter algorithm should *always* outperform the
> simple-minded bit-by-bit transfer algorithm on the macintosh.

I think you are the one with the fallacious argument.  Factors like
16 or log(some constant) are irrelevant for purposes of analyzing
CPU time requirements.

My original analysis was sketchy, but here it is in a slightly expanded
form:

	Suppose that I wish to rotate an N X N bitmap.  If I do it the
brute-force way, it will take N*N operations, where an operation is
defined as a memory transfer with a masking operation thrown in somewhere
to get at the right bit of a particular byte (OK, have it your way--to get
at the right bit of a long word).

If I use the recursive bitBLT version, it will take log(N) steps (this is
obviously a log base 2), and each step consists of several bitblt operations.
I counted 15 for the version I was using, but I suppose that this could
be reduced by some clever rewrite of the blit operation.  I won't quibble
about the factor of 15--it's irrelevant when compared to the powers of
N we are working with.  So let's suppose that we somehow magically manage
to do the whole thing with 1 blit per step(!)  What is it we are doing?
We are moving some bytes (OK, longwords) from one place to another in
memory, again with a masking operation thrown in.  In the real algorithm,
many of the blits involved moving only half or a quarter of the full
bitmap, but again this factor is irrelevant.  I figure that
I gave up the factor of 15, so it's only fair for you to concede the
factor of 4 or 2, and we'll say for the purposes of argument that the
one blit per recursion involves the whole bitmap, which is N*N/(8, 16, or 32)
operations.  The constant factors are ignorable by the principle already
enunciated, so I get that the blit version is O(N*N*log(N)).

Why do we ignore the constant factors?  Because, although in specific
cases you could show that for certain values of N, and for specific versions
of the two algorithms, the blit version would win, what is relevant is
what happens as N increases.  At some point, N*N*log(N) will become
bigger than N*N, NO MATTER WHAT CONSTANT FACTORS MULTIPLY THEM.
That is why the sorting literature spends so much time discussing the
relative merits of various algorithms for various sizes of problem--often
an algorithm that is less efficient overall is a win for a specific case,
because the size of the problem and the speed of the comparison fit the
method better.

BTW, the brute force method wins in another way: it can be used to rotate
non-square bitmaps, and bitmaps whose sides are not powers of two, which
can be a big win if you want to rotate a long narrow strip (which I did).

gillies@uiucdcsp.cs.uiuc.edu (12/22/87)

In article <>, kim@analog.UUCP writes:
> I think you are the one with the fallacious argument.  Factors like
>16 or log(some constant) are irrelevant for purposes of analyzing
> CPU time requirements.
>
> My original analysis was sketchy, but here it is in a slightly expanded
> form:
>
>	Suppose that I wish to rotate an N X N bitmap.  If I do it the
> brute-force way, it will take N*N operations, where an operation is
> defined as a memory transfer with a masking operation thrown in somewhere
> to get at the right bit of a particular byte (OK, have it your way--to get
> at the right bit of a long word).
>.....

When you analyze an algorithm whose running time is dominated by the
size of its output (n*n in the rotation algorithm), it is appropriate
to do a low-level Knuth-type analysis, or at least attempt to do it.

There are many ways to speed up "hopelessly slow" bit algorithms by
exploiting the binary parallelism of 16-, 32-, or 64-bit boolean and
arithmetic operations.  For instance, you can count the bits in a
32-bit computer word in 5 iterations of one loop (O(log(wordsize)) time).
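
Roughly, that bit-counting trick looks like this (just a sketch, with the
loop unrolled into its five steps):

/* Count the one-bits in a 32-bit word by summing adjacent fields in
** parallel: 1-bit fields into 2-bit sums, 2-bit into 4-bit, and so on.
** Five doubling steps cover all 32 bits.
*/
unsigned long count_bits(x)
unsigned long x;
{
    x = (x & 0x55555555L) + ((x >>  1) & 0x55555555L);
    x = (x & 0x33333333L) + ((x >>  2) & 0x33333333L);
    x = (x & 0x0F0F0F0FL) + ((x >>  4) & 0x0F0F0F0FL);
    x = (x & 0x00FF00FFL) + ((x >>  8) & 0x00FF00FFL);
    x = (x & 0x0000FFFFL) + ((x >> 16) & 0x0000FFFFL);
    return x;
}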

Obviously, ANY algorithm that rotates an n*n bitmap must take O(n*n)
time just to produce its output, unless you use a degenerate representation
(like storing all 4 rotations of the bitmap and just returning a
pointer in O(1) time).

You wanted some reasons why Atkinson's rotation algorithm might run so
fast.  You referred to a particular algorithm on a particular machine
with a particular screen size.  And things are not as they seem: using
some elementary analysis for that particular case, it looks like the
BITBLT algorithm is still a win over the brute force method.

Of course, as another notewriter pointed out, you can get even more clever
in doing the rotation by breaking the screen into 8*8 blocks.  This
method *still* does not "sound" like its asymptotic performance matches
that of the dumb brute-force algorithm, but I bet it runs faster on any
particular commercial processor you choose for a benchmark.  Constants count.

Don Gillies {ihnp4!uiucdcs!gillies} U of Illinois
            {gillies@p.cs.uiuc.edu}