kim@analog.UUCP (Kim Helliwell) (12/11/87)
I am interested in any information anyone may have on algorithms for rotating bitmaps by +/- 90 degrees. I am aware of two-- the obvious brute-force method of grabbing a pixel and swapping indices (with an appropriate translation) before stuffing it into another bitmap, and the parallel blitter version which was described in an issue of MacTutor a couple of years ago. I have been thinking about the relative speeds of these two, and I have come up with the following analysis of the relative speeds: the brute force method obviously goes as n*n, where n is the number of pixels on a side of the bitmap. the blitter version actually goes as n*n*log(n), because the log(n) steps required and the n*n memory read/writes required by the blitting operation. The log() means log base two, of course. What this tells me is that the blitter version is only a win if you have a hardware blitter (with lots of parallism, for example, and DMA). For use with CopyBits, it appears to be a lab curiosity. Or is there something I am missing? In any case, I have tried both methods, and neither of them are anywhere close to as fast as MacPaint. I am interested in speeding things up, and want to know if anyone knows how to do things faster? I admit that I haven't done much to optimize things to date. Would doing the thing in straight assembly code be the only thing I need to do to speed it up? (I am writing it in C at present). Or is there something really clever that Paint does? Any leads or help will be appreciated. Kim Helliwell hplabs!analog!kim
oster@dewey.soe.berkeley.edu (David Phillip Oster) (12/12/87)
In article <600@analog.UUCP> kim@analog.UUCP (Kim Helliwell) writes: >I am interested in any information anyone may have on algorithms for >rotating bitmaps by +/- 90 degrees. In the summer, Dr. Dobb's magazine of software tools said that they were planning a graphics issue and that outlines for articles could be posted to a certain computer mail account. Well, I sent them an article outline and got no reply, even though I logged in to the same machine they had an account on to post. My article was on this very topic. The article in MacTutor refers to a scheme described in the first Smalltalk-80 book. This scheme works well if all CopyBits took the same time, independent of their arguments. The scheme I use is: divide the picture to be rotated into 8x8 pixel blocks. (Pad the edges to make them come out exactly on block boundaries.) Now the destination coordinates of each block are pretty obvious: example: a b c d i e a e f g h => j f b i j k l k g c l h d The last remaining step is to take one of these 8x8 blocks and rotate it. The following C code is a schema for how you do that. for(i = 0, destBit = 1; i < 8; i++, destBit *= 2){ src = src8[i]; for(j = 0, srcBit = 0x80; j < 8; j++, srcBit /= 2){ if(src & srcBit){ /* if the srcbit is on */ dest8[j] |= destBit; /* turn the dest bit on */ }else{ /* else */ dest8[j] &= ~destBit; /* turn the dest bit off */ } } } Of course, you'd use plenty of register variables, and replace the explicit array referencing here with pointer arithmetic. Then you copybits from the offscreen bitmap where you have been doing the work back to the screen. I've programmed this up, and it _is_ fast. --- David Phillip Oster --A Sun 3/60 makes a poor Macintosh II. Arpa: oster@dewey.soe.berkeley.edu --A Macintosh II makes a poor Sun 3/60. Uucp: {uwvax,decvax,ihnp4}!ucbvax!oster%dewey.soe.berkeley.edu
ali@rocky.STANFORD.EDU (Ali Ozer) (12/12/87)
In article ... oster@dewey.soe.berkeley.edu.UUCP (David Phillip Oster) writes: >In article <600@analog.UUCP> kim@analog.UUCP (Kim Helliwell) writes: >>I am interested in any information anyone may have on algorithms for >>rotating bitmaps by +/- 90 degrees. >The scheme I use is: divide the picture to be rotated into 8x8 pixel >blocks. (Pad the edges to make them come out exactly on block boundaries.) I've included (at the end of this article) the 8x8 rotation code, in assembly, written by Tomas Rokicki (rokicki@sushi.stanford.edu). It's lengthy, mainly because the loops are unrolled. It was written for the Amiga, but it's general enough such that with a little attention to how the parameters come in it should work on the Mac. And it's fast --- I use it to rotate 64x64 4 bit plane images. (But it's a CPU hog...) Also, when doing the rotation, no matter how large the bitmap you're rotating is, you only need ONE 8x8 temporary. Then you can do the rotation in the source area itself. Ali Ozer, ali@rocky.stanford.edu -------------------------------- /* Rotation code from Tomas Rokicki. ** If you use it, please leave the author's name intact! */ /* * This code flips an 8 x 8 rectangle to the right 90 degrees. * Note that the blitter could be up to 4 times faster (according * to my preliminary calculations), but it would probably also be * more complex, and this should be plenty fast. Also, for small * rectangles like 16 x 16, this will certainly be faster. */ /* * This subroutine takes four arguments, two for the source and * two for the destination. There is a pointer to the first byte * to be flipped, and the value to be added to the address for * each subsequent byte. */ flip(sp, sw, dp, dw) unsigned char *sp, *dp ; int sw, dw ; { /* 8(a5), 14(a5), 12(a5), 18(a5) */ #asm ; ; This subroutine actually flips the eight bytes in the lower byte ; of d0-d7, and sticks the results in the upper bytes of d0-d7, ; rotating their previous contents down. Don't let the length of ; it scare you; you might be able to discern a pattern in the ; instructions. It's really quite simple. ; movem.w reglist,-(sp) move.l 8(a5),a0 move.w 12(a5),a1 move.b (a0),d0 add.w a1,a0 move.b (a0),d1 add.w a1,a0 move.b (a0),d2 add.w a1,a0 move.b (a0),d3 add.w a1,a0 move.b (a0),d4 add.w a1,a0 move.b (a0),d5 add.w a1,a0 move.b (a0),d6 add.w a1,a0 move.b (a0),d7 roxr.b #1,d0 roxr.w #1,d0 roxr.w #1,d1 roxr.w #1,d0 roxr.w #1,d2 roxr.w #1,d0 roxr.w #1,d3 roxr.w #1,d0 roxr.w #1,d4 roxr.w #1,d0 roxr.w #1,d5 roxr.w #1,d0 roxr.w #1,d6 roxr.w #1,d0 roxr.w #1,d7 roxl.w #8,d0 roxr.b #1,d1 roxr.w #1,d1 roxr.w #1,d2 roxr.w #1,d1 roxr.w #1,d3 roxr.w #1,d1 roxr.w #1,d4 roxr.w #1,d1 roxr.w #1,d5 roxr.w #1,d1 roxr.w #1,d6 roxr.w #1,d1 roxr.w #1,d7 roxl.w #8,d1 roxr.b #1,d2 roxr.w #1,d2 roxr.w #1,d3 roxr.w #1,d2 roxr.w #1,d4 roxr.w #1,d2 roxr.w #1,d5 roxr.w #1,d2 roxr.w #1,d6 roxr.w #1,d2 roxr.w #1,d7 roxl.w #8,d2 roxr.b #1,d3 roxr.w #1,d3 roxr.w #1,d4 roxr.w #1,d3 roxr.w #1,d5 roxr.w #1,d3 roxr.w #1,d6 roxr.w #1,d3 roxr.w #1,d7 roxl.w #8,d3 roxr.b #1,d4 roxr.w #1,d4 roxr.w #1,d5 roxr.w #1,d4 roxr.w #1,d6 roxr.w #1,d4 roxr.w #1,d7 roxl.w #8,d4 roxr.b #1,d5 roxr.w #1,d5 roxr.w #1,d6 roxr.w #1,d5 roxr.w #1,d7 roxl.w #8,d5 roxr.b #1,d6 roxr.w #1,d6 roxr.w #1,d7 roxl.w #8,d6 roxr.b #1,d7 roxl.w #8,d7 move.l 14(a5),a0 move.w 18(a5),a1 move.b d7,(a0) add.w a1,a0 move.b d6,(a0) add.w a1,a0 move.b d5,(a0) add.w a1,a0 move.b d4,(a0) add.w a1,a0 move.b d3,(a0) add.w a1,a0 move.b d2,(a0) add.w a1,a0 move.b d1,(a0) add.w a1,a0 move.b d0,(a0) movem.w (sp)+,reglist reglist reg d0/d1/d2/d3/d4/d5/d6/d7/a0/a1 #endasm }
gillies@uiucdcsp.cs.uiuc.edu (12/13/87)
I think your analysis of the parallel bitblt algorithm is fallacious.
The speed of a BitBLT EVEN IN ASSEMBLY LANGUAGE is approx n*n/16 or
n*n/32, depending upon the word size of your machine.  See [Deutsch84]
to learn how to dynamically "compile" a bitblt request into machine
language, then execute it like a bat out of hell, a type of self
modifying (self-compiling?) algorithm.  For the particular mac rotation
problem, you could hand-code each of the 1..9 bitblt operations.
The recursive Bitblt trick is outlined in the book "Smalltalk-80: The
language and its implementation" pp 408-410.  Using clever boolean ops
and masks, you can do all the recursive BitBLTs in parallel hence you
need only log(n) BITBLTs.  Since the Mac screen is approx 512*512
dots, then a full screen BITBLT rotation should take:
(n*n/16)*log(2,512) == (n*n/2) time.
Thus the blitter algorithm should *always* outperform the
simple-minded bit-by-bit transfer algorithm on the macintosh.
[Deustch84] (E-mail me for the ref -- it's not here, or check ACM Guide)
Don Gillies {ihnp4!uiucdcs!gillies} U of Illinois
            {gillies@p.cs.uiuc.edu}/* End of text from uiucdcsp:comp.sys.mac */oster@dewey.soe.berkeley.edu (David Phillip Oster) (12/14/87)
In article <76000070@uiucdcsp> gillies@uiucdcsp.cs.uiuc.edu writes: >I think your analysis of the parallel bitblt algorithm is fallacious. > [much deleted, check it, it is interesting] You're posting shows the blind preference for elegance that I was critiquing in my article. Think it through: 1.) Here is the analysis again: The recursive version for rotating an NxN bitmap does O(log N) BitBlts to rotate a bitmap by 90 degrees. (It does three BitBlts to interchange quarters, then must recursively interchange those quarters all the way down. Each recursive step takes another three BitBlts.) In addition, the recursive algorithm only works directly for squares that are a power of 2 on edge. (To handle other sizes, you usually copy into an offscreen bitmap that is large enough, rotate it, and copy the result back.) The 8x8 transpose version does O(1) bitblt to rotate a bitmap by 90 degrees. In addition, it works directly for any rectangle that has edges that are a multiple of 8. 2.) by experiment. I have actually coded up and run both versions. For a 256 by 256 bitmap, the 8x8 transpose method was about 10 times faster than the recursive CopyBits method. (And, I didn't even optimize the transpose method.) The recursive method is elegant, but _slow_. The point of my article is that the elegant algorithms you learn to appreciate in an algorithms class sometimes really _are_ worse than using brute force in the real world. --- David Phillip Oster --A Sun 3/60 makes a poor Macintosh II. Arpa: oster@dewey.soe.berkeley.edu --A Macintosh II makes a poor Sun 3/60. Uucp: {uwvax,decvax,ihnp4}!ucbvax!oster%dewey.soe.berkeley.edu
kim@analog.UUCP (Kim Helliwell) (12/17/87)
In article <76000070@uiucdcsp>, gillies@uiucdcsp.cs.uiuc.edu writes: > > I think your analysis of the parallel bitblt algorithm is fallacious. > The speed of a BitBLT EVEN IN ASSEMBLY LANGUAGE is approx n*n/16 or > n*n/32, depending upon the word size of your machine. See [Deutsch84] > to learn how to dynamically "compile" a bitblt request into machine > language, then execute it like a bat out of hell, a type of self > modifying (self-compiling?) algorithm. For the particular mac rotation > problem, you could hand-code each of the 1..9 bitblt operations. > > ... > ... > > (n*n/16)*log(2,512) == (n*n/2) time. > > Thus the blitter algorithm should *always* outperform the > simple-minded bit-by-bit transfer algorithm on the macintosh. I think you are the one with the fallacious argument. Factors like 16 or log(some constant) are irrelevant for purposes of analyzing CPU time requirements. My original analysis was sketchy, but here it is in a slightly expanded form: Suppose that I wish to rotate an N X N bitmap. If I do it the brute-force way, it will take N*N operations (where an operation is defined as a memory transfer with a masking operation thrown in somewhere (to get at the right bit of a particular byte. Ok, have it your way--to get at the right bit of a long word). If I use the recursive bitBLT version, it will take log(N) steps (this is obviously a log base 2), and each step consists of several bitblt operations. I counted 15 for the version I was using, but I suppose that this could be reduced for by some clever rewrite of the blit operation. I won't quibble about the factor of 15--it's irrelevent when compared to the powers of N we are working with. So let's suppose that we somehow magically manage to do the whole thing with 1 blit per step(!) What is it we are doing? We are moving some bytes (OK, longwords) from one place to another in memory, again with a masking operation thrown in. In the real algorithm, many of the blit's involved moving only half or a quarter of the full bitmap, but again this is a factor that is irrelevant. I figure that I gave up the factor of 15, so it's only fair for you to concede the factor of 4 or 2, and we'll say for the purposes of argument that the one blit per recursion involves the whole bitmap, which is N*N/(8,16,or 32). The constant factors are ignorable by the principal already enunciated, so I get the the blit version is O(N*N*log(N)). Why do we ignore the constant factors? Because, although in specific cases you could show that for certain values of N, and for specific versions of the two algorithms, that the blit version would win, the thing that is relevant is what happens as N increases. At some point, N*N*log(N) will become bigger than N*N, NO MATTER WHAT CONSTANT FACTORS MULTIPLY THEM. That is why there is so much discussion of sorting in the literature discussing relative merits of various algorithms for various sizes of problem--often a less overall efficient algorithm for a specific case is a win because the size of the problem and the speed of the comparison fits the method better. BTW, the brute force method wins in another way. That is, it can be used to rotate non-square bitmaps and non-integral values of N, which can be a big win if you want to rotate a long narrow strip (which I did).
gillies@uiucdcsp.cs.uiuc.edu (12/22/87)
In article <>, kim@analog.UUCP writes: > I think you are the one with the fallacious argument. Factors like >16 or log(some constant) are irrelevant for purposes of analyzing > CPU time requirements. > > My original analysis was sketchy, but here it is in a slightly expanded > form: > > Suppose that I wish to rotate an N X N bitmap. If I do it the > brute-force way, it will take N*N operations (where an operation is > defined as a memory transfer with a masking operation thrown in somewhere (to > get at the right bit of a particular byte. Ok, have it your way--to get at > the right bit of a long word). >..... When you analyze an algorithm whose running time is dominated by the size of its output (n*n in the rotation algorithm), it is appropriate to do a low-level Knuth-type analysis, or at least attempt to do it. There are many ways to speed up a "hopelessly slow" bit algorithms by exploiting the binary parallelism 16, 32, or 64-bit boolean and arithmetic operations. For instance, you can count the bits in a 32-bit computer word in 5 iterations of one loop (O(log(wordsize)) time). Obviously, ANY algorithm that rotates an n*n bitmap must take O(n*n) time to do its output, unless you use a degenerate representation (like storing all 4 rotations of a bitmap, and just returning a pointer in O(1) time). You wanted some reasons why Atkinson's rotation algorithm might run so fast. You referred to a particular algorithm on a particular machine with a particular screen size. And things are not as it seems: Using some elementary analysis for the particular case, it looks like the BITBLT algorithm is still a win over the brute force method. Of course, as another notewriter pointed out, you can get even more clever in doing the rotation by breaking the screen into 8*8 blocks. This method *still* does not "sound" like its asymptotic performance equals the dumb brute force algorithm, but I bet it runs faster on any particular commercial processor you choose for a benchmark. Constants count. Don Gillies {ihnp4!uiucdcs!gillies} U of Illinois {gillies@p.cs.uiuc.edu}