alan@oz.paradyne.com (02/24/90)
In an email message to me, bbd@rice.edu wrote:

>In article <7398@pdn.paradyne.com> I wrote:
>>Another important case is when drawing vertical lines.
>>Remember, 32 * 32 = 1024; it only takes 32 32-bit words to store a full line
>>of monochrome pixels on a megapixel (1024x1024) screen.
>[On most systems, doesn't this special case occur for horizontal,
>rather than vertical lines?  My next 2 paragraphs assume so...]
>The line only fits into the 32 words if it is going in the proper (e.g.
>horizontal) direction.  If the line goes the other way (e.g. vertical),
>it takes 1024 32-bit words to store it, if it's stored in the obvious
>way.

To which I attempted to reply via email.  However:

   ----- Transcript of session follows -----
   >>> RCPT To:<bbd@rice.edu>
   <<< 550 <bbd@rice.edu>... User unknown
   550 rice.edu!bbd... User unknown

So here is my reply:

I apologize: I did not realize how easy it would be to misinterpret what
I wrote.  I did not mean to imply that pixels which are vertically
sequential are also stored sequentially in memory.  The point of my
comment was how "narrow" the screen is in terms of words of memory, to
make it graphically :-) obvious that many blits fall within a single word
or two.  The point about vertical lines is that they always (unless they,
or the pixels themselves, are very thick) fall within one or two words
horizontally.

>>Also, with burst-mode cache refills and sequential word accesses, the cache
>>miss penalty is not that significant: you only "miss" one word in sixteen
>>(typically).
>Eh?  If I'm reading a character glyph from a bitmap that is
>approximately the size of the screen, then each row of the glyph is
>separated by about 30 32-bit words, right?  That would play hell with
>burst-mode refilled caches, right?

Again, I apologize: my comment was addressed to the big bitblt case that
the original poster was concerned about.
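The word arithmetic behind that "narrow" figure can be sketched in C.
(The bit order here, bit 31 as the leftmost pixel, is an assumption of
the sketch, not a claim about any particular hardware.)

```c
#include <stdint.h>

#define WORDS_PER_ROW 32   /* 1024 pixels / 32 bits per word */

/* Set one pixel in a 1024x1024 monochrome framebuffer.  Assumes bit 31
 * of each word is the leftmost pixel; real hardware varies. */
static void set_pixel(uint32_t *fb, int x, int y)
{
    fb[y * WORDS_PER_ROW + (x >> 5)] |= 1u << (31 - (x & 31));
}
```

Note that a vertical line holds x fixed, so it lands in the same word
column (x >> 5) on every row: one or two words horizontally, as claimed.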
However, be advised that character glyphs may be stored one character
glyph per character (so that the entire glyph bitmap fits into a single
cache line), or perhaps one whole font face per glyph.  In any case,
characters are normally blitted in a loop which writes a whole string to
the screen, often soon followed by another.  So it is likely that the
most frequently used characters would have their pixels in the cache
already.

>>then the total loop cycle count is 8, an 11% improvement.  Isn't it
>>normally considered worthwhile to add an instruction if it improves
>>performance by as little as 2 or 3 percent?
>Perhaps it's worthwhile if it improves overall performance by 2 or 3%.
>You've only (theoretically) improved performance of the _blitter_ by
>8% or 11%.  Will the "bmerge" (or whatever) instruction be used
>anywhere else?  Sure, a compiler could be _taught_ to use it, but it
>won't bother to emit this instruction unless the magical pattern "(Rx
>& Ry) | (Rz & ~Ry)" shows up in the intermediate code.  Where else
>would this occur besides BitBlt?

BMERGE would be useful for dealing with unaligned accesses (remember
those?).  That's essentially what BitBlt does: unaligned accesses.  Most
CPUs are at least byte aligned, if not half-word (68000) or word aligned
(RISC).  BitBlt is not necessary on a bit-aligned CPU.

>>Don't people spend mongo$ to increase the cache size for no more than
>>a 10% performance improvement?
>Again, I think these mongo$ are going for an across-the-board 10%
>performance improvement.

Well, suppose you want to market your CPU as a graphics engine?

____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me.
Mottos:  << Many are cold, but few are frozen. >>  << Frigido, ergo sum. >>
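The "(Rx & Ry) | (Rz & ~Ry)" pattern argued over in the post above is
easy to state in C; this sketch (function names are mine, not from the
thread) shows it doing exactly the partial-word write that occurs at the
edges of a bitblt:

```c
#include <stdint.h>

/* The merge pattern under discussion: take bits from src where mask is
 * 1 and bits from dst where mask is 0 -- (Rx & Ry) | (Rz & ~Ry). */
static uint32_t bmerge(uint32_t src, uint32_t mask, uint32_t dst)
{
    return (src & mask) | (dst & ~mask);
}

/* Example edge case: overwrite only the low 12 pixel bits of a
 * destination word, leaving the neighboring pixels untouched. */
static uint32_t write_partial(uint32_t src, uint32_t dst)
{
    return bmerge(src, 0x00000FFFu, dst);
}
```

The same merge also stitches the two halves of a word fetched across an
unaligned boundary, which is the wider use Lovejoy is pointing at.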
bbc@legia.rice.edu (Benjamin Chase) (02/26/90)
alan@oz.paradyne.com writes:
>In an email message to me, bbd@rice.edu wrote:
>To which I attempted to reply via email.  However:
>[it failed]
>So here is my reply:
>
>I apologize: I did not realize how easy it would be to misinterpret what
>I wrote.

How painfully true.  I'm bbc@rice.edu (as in British Broadcasting
Company), not bbd@rice.edu.

>I did not mean to imply that pixels which are vertically sequential
>are also stored sequentially in memory.  The point of my comment was how
>"narrow" the screen is in terms of words of memory, to make it
>graphically :-) obvious that many blits fall within a single word or two.

But the _screen_ isn't narrow.  A modern monochrome display is 32
(32-bit) words wide.

>The point about vertical lines is that they always (unless they, or the
>pixels themselves, are very thick) fall within one or two words
>horizontally.

Yes, and ~1000 words vertically.  And these words are spaced at ~32-word
intervals.  Thus, when you draw a vertical line, you get a word of screen
memory, perform some operation on it to turn on or off ~1 bit of that
word, and then write it back to screen memory.  Then, you get the next
word, which is 32 words further along, and whoops, it's not in the data
cache, because your silly fetcher got the next 3 consecutive words of
screen memory (which you won't be needing right now because your vertical
line is so skinny and all), rather than the next 3 words spaced at
32-word offsets, which is what you really wanted it to do if you were
drawing a vertical line.

>However, be advised that character glyphs may be stored one character
>glyph per character (so that the entire glyph bitmap fits into a single
>cache line), or perhaps one whole font face per glyph.

Fine, I'll certainly agree with "one character glyph per character", and
that for any normal (monochrome) screen resolution and font face, a row
of a character glyph fits into a single cache line, and usually into a
single machine word.
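That vertical-line scenario can be sketched as a loop (the layout is the
same assumed one as before: 32 words per row, bit 31 leftmost).  Every
iteration strides a full row of words, so a cache fill of the next few
*consecutive* words is pure waste:

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_ROW 32   /* 1024-pixel-wide monochrome screen */

/* Draw a 1-pixel-wide vertical line at column x from row y0 to y1.
 * Each read-modify-write touches a word WORDS_PER_ROW beyond the last,
 * so a fetcher that pulls in the next 3 consecutive words on a miss is
 * prefetching pixels this loop never uses. */
static void vline(uint32_t *fb, int x, int y0, int y1)
{
    uint32_t bit = 1u << (31 - (x & 31));
    uint32_t *p = fb + (size_t)y0 * WORDS_PER_ROW + (x >> 5);
    for (int y = y0; y <= y1; y++, p += WORDS_PER_ROW)
        *p |= bit;
}
```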
>In any case, characters are normally blitted in a loop which writes a
>whole string to the screen, often soon followed by another.  So it is
>likely that the most frequently used characters would have their pixels
>in the cache already.

But that's only half the problem.  What about the screen memory that
isn't in the cache because it won't fit?  (Especially when you're filling
the cache with screen words that you won't be writing upon, as in my
example above.)

>Well, suppose you want to market your CPU as a graphics engine?

That is another story entirely.  I thought this whole thing started by
discussing what sorts of new instructions might fit into the next RISC
processor.  Granted, just because it's a "RISC processor" (TM) :-)
doesn't mean it won't be marketed as a graphics engine.

Now, getting back to "new instructions for RISCs", sort of...  [I think
most of this is from a brain-storming session one recent night that
afflicted me and Preston Briggs. :-)]

Suppose you're designing a superscalar architecture.  A recent example of
such an architecture allows a branch, an "integer instruction", and a
floating point instruction to be issued all at once.  This seems like a
good idea, especially if your chip set will be used for heavy-duty number
crunching, or for generating really high SPEC marks, especially on those
floating point benchmarks. :-)

But if you weren't so interested in floating point, what else would you
put into that part of the instruction?  Well, probably not graphics
operations, per se, since you wouldn't expect graphics operations to be
sprinkled throughout your code, where the compiler would find them and
fold them together with branches and additions into neat little triplets.
Even counting on a MERGE instruction to occur throughout your
"instruction stream" is a little extreme, because if all the memory
accesses turn out to be aligned, you've just thrown away a large chunk of
your superscalar chip's potential performance (cf.
relative performance of the IBM 6000s on integer vs. floating point
benchmarks).

Perhaps in the floating point slot of the superscalar instruction, you'd
put (among other things?) little hints to the cache instead?  Whisper
gently into the processor's ear, saying "get the value in register N, and
use that as a stride for yanking words into the cache until I tell you
something different".  Perhaps N contains the width (in machine words) of
a bitmap, and this will cause the cache to get the 4 words out of your
bitmap, one (conceptually) stacked on top of another, so that you can
blit 4 rows of your character glyph into them, before missing again.

Or maybe you're addressing an array in the "wrong order" (for a
particular language), not getting consecutive elements, but getting only
one element out of each row or column, before moving on to the next row
or column.  Or maybe you used one of those wacky array-reshaping commands
from FORTRAN 2001... :-)  Then register N might contain the number of
words in a row or column.

Or maybe the correct thing to do in a particular case is to tell the
processor not to load anything extra into the cache, just get what's
needed, because the next several instructions will be loading words from
all over the place, and just because location A was loaded is absolutely
no indication that location A+1 will be needed anytime soon.

Or maybe the current instruction is loading a value that won't be needed
again for a loooong time.  In this case, we'd like to tell the cache to
just hand the word to the processor, and not waste any space keeping a
copy for itself.

A good compiler might notice all of these cases, and construct the
appropriate whisper for that third slot of the superscalar instruction.

Does anyone else have thoughts on this?  What kinds of things would _you_
put in the slots of your superscalar chip(s)?
--
	Ben Chase <bbc@rice.edu>, Rice University, Houston, Texas
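There is no C-level equivalent of the stride-register "whisper" Chase
proposes, but a per-access version of the idea survives today in the
GCC/Clang `__builtin_prefetch` extension.  This sketch (my own
illustration, not from the thread) prefetches one bitmap-row stride
ahead while touching one word per row:

```c
#include <stddef.h>
#include <stdint.h>

/* OR one source word into one destination word per bitmap row, telling
 * the cache (via the GCC/Clang __builtin_prefetch extension; second
 * argument 1 = prefetch for write) to start pulling in the word one
 * row-stride ahead -- a software stand-in for "use register N as the
 * fetch stride". */
static void or_column(uint32_t *dst, const uint32_t *src,
                      int rows, int words_per_row)
{
    for (int r = 0; r < rows; r++) {
        if (r + 1 < rows)
            __builtin_prefetch(&dst[(size_t)(r + 1) * words_per_row], 1);
        dst[(size_t)r * words_per_row] |= src[r];
    }
}
```

A prefetch is only a hint, which is the point: the hardware stays simple
and the software supplies the stride knowledge.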
jesup@cbmvax.commodore.com (Randell Jesup) (02/27/90)
In article <BBC.90Feb25195510@legia.rice.edu> Benjamin Chase <bbc@rice.edu> writes:
>Perhaps in the floating point slot of the superscalar instruction,
>you'd put (among other things?) little hints to the cache instead?
>Whisper gently into the processor's ear, saying "get the value in
>register N, and use that as a stride for yanking words into the cache
>until I tell you something different".

	This idea exists (though not in superscalar form) in the RPM-40
cache and coprocessor design.  The cache control appears to the processor
as a logical coprocessor.  You can fill pipeline interlocks, load delays,
etc. with cache setup/hint instructions.

	The RPM-40 had lines going to the cache to tell it which register
(if any) a load/store was relative to, which could be used by the cache
to act differently according to register number.  For example, accesses
relative to register 1 might be sequential, those relative to register 2
might skip X words on each access, register 3 might have random access
habits, etc.  (There were a couple of other things about it you could
control also.)

	For the I-cache, you could control how many entries of the
Branch-Target Cache (BTC) were handled by LRU, and how many the software
wanted to provide the addresses to cache.
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup
Common phrase heard at Amiga Devcon '89: "It's in there!"
zs04+@andrew.cmu.edu (Zachary T. Smith) (02/27/90)
For all the engineering it would take (man-hours) to get a given RISC
processor to do 'fast' blitting, one might as well wait a few years until
the (semiconductor) industry is capable of putting a blitter on a
1-megabit VRAM chip (or better yet, on a 4-megabit chip), and then just
do that.  It'll get done eventually, so why come up with hacks in the
meantime?

-Zach Smith (zs04@andrew.cmu.edu)

PS: Why wouldn't it get done?  A blitter that operates on 1024- or
2048-bit lines per op, and which can cache fonts in the DRAM as well,
would make even a very fast blit processor running w/standard VRAM look
like a toy.
alan@oz.nm.paradyne.com (Alan Lovejoy) (02/28/90)
In article <BBC.90Feb25195510@legia.rice.edu> Benjamin Chase <bbc@rice.edu> writes:
>alan@oz.paradyne.com writes:
>
>>In an email message to me, bbd@rice.edu wrote:
>>To which I attempted to reply via email.  However:
>>[it failed]
>>So here is my reply:
>
>>I apologize: I did not realize how easy it would be to misinterpret what
>>I wrote.
>
>How painfully true.  I'm bbc@rice.edu (as in British Broadcasting
>Company), not bbd@rice.edu.

Hey, shit happens :-) (translation: we all make mistakes, from little
typos to Great Social Blunders.  Peace, bro.).

>>I did not mean to imply that pixels which are vertically sequential
>>are also stored sequentially in memory.  The point of my comment was
>>how "narrow" the screen is in terms of words of memory, to make it
>>graphically :-) obvious that many blits fall within a single word or
>>two.
>
>But the _screen_ isn't narrow.  A modern monochrome display is 32
>(32 bit) words wide.

Well, I think ~30 words is "narrow."  It's a much less imposing number
than 1024, which approximates the number of pixels along the x-axis.  But
it depends on your point of view, I guess.

>>The point about vertical lines is that they always (unless they, or the
>>pixels themselves, are very thick) fall within one or two words
>>horizontally.
>
>Yes, and ~1000 words vertically.  And these words are spaced at ~32
>intervals.  Thus, when you draw a vertical line, you get a word of
>screen memory, perform some operation on it to turn on or off ~1 bit
>of that word, and then write it back to screen memory.  Then, you get
>the next word, which is 32 words further along, and whoops, it's not
>in the data cache, because your silly fetcher got the next 3
>consecutive words of screen memory (which you won't be needing right
>now because your vertical line is so skinny and all), rather than the
>next 3 words spaced at 32 word offsets, which is what you really
>wanted it to do if you were drawing a vertical line.

Yep.
That's why one "loads" the word from screen memory before "load"-ing the
word from the source bitmap, so that there are as many cycles as possible
between the load instruction which fetches the destination word and the
first instruction which accesses that word.  This is where the 88k's
multiple independent functional units, full pipelining and register
scoreboarding really help.  But even so, one may in fact pay some of the
performance penalty you describe, depending on the details of the
situation.  Different hardware configurations exact different penalties
for cache misses, especially in the case of the frame buffer.

Also, if there is a mask (stipple) bitmap, that adds roughly 4 to 6 more
cycles between the destination bitmap load and the first access of the
loaded word.  One can also get REALLY aggressive, and have different
versions of bitblt optimized for the (tall, narrow) and (short, wide)
cases.  But best of all, the code for striding to the next line (of each
bitmap) can execute immediately after the loads for the first words in
those lines, but before accessing the loaded words, adding many cycles
between first-word-in-line loads and first access to those words.  My
code gets really aggressive :-).  But what am I doing giving away my
secrets like this!

>>In any case, characters are normally blitted in a loop which writes a
>>whole string to the screen, often soon followed by another.  So it is
>>likely that the most frequently used characters would have their pixels
>>in the cache already.
>
>But that's only half the problem.  What about the screen memory that
>isn't in the cache because it won't fit?  (Especially when you're
>filling the cache with screen words that you won't be writing upon, as
>in my example above.)

The usual case FOR ME is to blit a string of characters whose glyphs are
one bit per pixel to a destination bitmap (often the screen) which is 8
bits per pixel.
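The load-the-destination-early trick described above can be caricatured
in C.  The function and its shift-alignment details are my illustration
of the idea, not Lovejoy's actual code; on a scoreboarded machine like
the 88k, the compiler or a hand scheduler hoists the dst load well before
its use:

```c
#include <stdint.h>

/* One row of an unaligned OR-blit.  The destination word is loaded
 * first so its (possibly cache-missing) fetch can overlap the source
 * shift/align work; d is not consumed until the final merge.
 * When shift != 0, src must have nwords + 1 readable words. */
static void blit_row(uint32_t *dst, const uint32_t *src,
                     int nwords, int shift)
{
    for (int i = 0; i < nwords; i++) {
        uint32_t d = dst[i];                 /* issue dest load early */
        uint32_t s = src[i] << shift;
        if (shift)
            s |= src[i + 1] >> (32 - shift); /* align across the seam */
        dst[i] = d | s;                      /* consume d last */
    }
}
```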
The way I actually do this is to blit the character string to a temporary
one-bit-per-pixel destination bitmap (which can be treated as a special
case), then blit the temporary bitmap to the screen using a
bit-depth-changing/pixel-value-translating algorithm.  This
(bit-depth-changing blit) algorithm provides even more cycles between the
load-word-from-dest instruction and the first instruction accessing that
word.  And of course, it tends to shift performance considerations away
from first-word/last-word issues back toward inner-loop issues.  And it's
a very big win, much more so than one would think.

>Now, getting back to "new instructions for RISCs", sort of...  [I think
>most of this is from a brain-storming session one recent night that
>afflicted me and Preston Briggs. :-)]
>
>Perhaps in the floating point slot of the superscalar instruction,
>you'd put (among other things?) little hints to the cache instead?
>Whisper gently into the processor's ear, saying "get the value in
>register N, and use that as a stride for yanking words into the cache
>until I tell you something different".

Check out the RPM-40.  It does something along these lines.

In order for hardware to "go faster," it must either "do more work" per
cycle or go through more cycles per second.  In order to "do more work
per cycle," it must find more things it can do simultaneously or in
parallel.  In order to do more things simultaneously/in parallel, it must
have more parallel hardware, more information per instruction, and
instruction semantics which are composed of more primitive components.
Also, the hardware must make fewer assumptions about the future, so that
its behavior becomes more fully programmable, allowing greater
optimization.  Obviously, the more information the instruction stream
provides about what will happen in the future, the more work the hardware
can do in advance (and in parallel with other things) to get ready for
it.
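A minimal version of that depth-changing step, expanding one row of the
1-bit temporary bitmap to 8-bit pixels through a value-translation table,
might look like this (names, layout, and the bit-31-leftmost convention
are assumptions of the sketch):

```c
#include <stdint.h>

/* Expand npix pixels of a 1-bit-per-pixel row into 8-bit pixels,
 * translating each bit through a two-entry pixel-value table.
 * Assumes bit 31 of each source word is the leftmost pixel. */
static void expand_row_1to8(uint8_t *dst, const uint32_t *src,
                            int npix, const uint8_t val[2])
{
    for (int i = 0; i < npix; i++) {
        uint32_t bit = (src[i >> 5] >> (31 - (i & 31))) & 1u;
        dst[i] = val[bit];
    }
}
```

Since the inner loop only reads the source and writes the destination
sequentially, the first-word/last-word edge cases of a general bitblt
disappear, which is the shift in performance considerations described
above.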
____"Congress shall have the power to prohibit speech offensive to Congress"____ Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL. Disclaimer: I do not speak for AT&T Paradyne. They do not speak for me. Mottos: << Many are cold, but few are frozen. >> << Frigido, ergo sum. >>
jesup@cbmvax.commodore.com (Randell Jesup) (03/05/90)
[Sorry if people see this twice; news was broken the first time I posted
it.]

In article <AZua2yC00WB5Qqd2FJ@andrew.cmu.edu> zs04+@andrew.cmu.edu (Zachary T. Smith) writes:
>until the (semiconductor) industry is capable of putting a blitter on
>a 1 megabit VRAM chip (or better yet on a 4 megabit chip), and then
>just do that.  It'll get done eventually, so why come up with hacks in
>the meantime?
...
>PS: Why wouldn't it get done?  A blitter that operates on 1024 or 2048
>bit lines per op and which can cache fonts in the DRAM as well would
>even a very fast blit processor running w/standard VRAM look like a
>toy.

	I wouldn't hold my breath.  It may happen, but the cost will
probably be prohibitive even if it does.  Some off-the-cuff technical
problems I can think of are:

1) The blitter can only access data on the same "BRAM".

2) It complicates video system design _a lot_, since the font/etc. data
   must be stored in non-visible portions of each RAM.  This gets bad,
   since nothing normally forces characters to appear on any given
   alignment.

Why go further?  I think those will kill the idea already.

	Personally, I think the correct solution is not to modify RISC
CPU designs to make blitting easy, but to use custom blitter/etc. chips
in the video subsystem.  They can be optimized for their intended
function more than any RISC CPU can (since it must also be a good
general-purpose CPU).  They can be isolated from the main CPU, allowing
limited multi-processing (tell the blitter to start, and go off and do
other things while waiting).  Isolating the video system from the main
CPU bus means less bus contention (seen by the CPU), etc.

	This follows the methodology that hardware (CPU/whatever) should
be optimized for its task whenever possible.  You look at the speed
(measured in both blitting speed and overall system throughput) and the
cost, and then decide.
Of course, I could be biased, since the Amiga uses a custom video coprocessor, and runs much faster when the CPU runs from ROM/fast-ram, while the blitter (and display "copper" (coprocessor)) play with "chip" (video) ram. -- Randell Jesup, Keeper of AmigaDos, Commodore Engineering. {uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com BIX: rjesup Common phrase heard at Amiga Devcon '89: "It's in there!"