ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/15/91)
What do RISC designers think of bitfield instructions? (A la the VAX and 680x0 [x >= 2] processors: extract so many bits at a certain offset, insert so many bits at a certain offset, etc.) I was doing some work on some GIF image decompression code on a 68020 processor (a Macintosh LC)--a lot of bit manipulation involved. I looked up the instruction timings for the bitfield instructions in my 68020 handbook, and was mildly dismayed to find that they took about twice as long to execute as a shift or mask instruction! But I tried using them anyway. Motorola's bit-numbering scheme in these instructions was the opposite of the natural order for solving my problem, so I first had to reverse the order of each bufferful of bytes after reading them in, before I could start extracting the compressed data symbols. The result? I sped up the code by more than 20%. I know the compiler I was using generated pretty lousy code, but it allowed the insertion of in-line machine code, which was how I was fine-tuning the time-consuming parts. Do other people find these instructions useful? Lawrence D'Oliveiro fone: +64-71-562-889 Computer Services Dept fax: +64-71-384-066 University of Waikato electric mail: ldo@waikato.ac.nz Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00 To someone with a hammer and a screwdriver, every problem looks like a nail with threads.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/15/91)
In article <1991Apr15.193425.3436@waikato.ac.nz> ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: | The result? I sped up the code by more than 20%. I know the | compiler I was using generated pretty lousy code, but it allowed | the insertion of in-line machine code, which was how I was | fine-tuning the time-consuming parts. | | Do other people find these instructions useful? I have two more questions, which I think would be less controversial... a) for machines which have to do a lot of this would a bitfield coprocessor make sense? I ballpark guess a 5:1 speedup of those instructions. b) should the next version of ANSI/ISO C include a standard bit fiddle library, and what should be in it. My thoughts: - get/put bit(s) to/from a buffer getbits(char * buf, int startbit, int nbits) - read/write bits from a file readbits(FILE * fp, char * buffer, int nbits) -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
jdarcy@seqp4.ORG (Jeff d'Arcy) (04/15/91)
ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: >What do RISC designers think of bitfield instructions? (A la >the VAX and 680x0 [x >= 2] processors: extract so many bits at >a certain offset, insert so many bits at a certain offset, etc.) I'm not a RISC designer, but I play one on TV. Oops. Seriously, I think the RISC designers out there would be fools not to understand the value of making their chips pleasant for low-level OS geeks (like me) to use, so here are my two bits. My experience with bitfield instructions is with the Moto 88K, which I think uses them in a very elegant fashion. The 88K has *no* shift instructions (it does have rotate), using the make, extract, and extract unsigned bitfield instructions for left shift, arithmetic and logical right shift respectively. These execute in one cycle apiece like all other integer instructions and provide a very intuitive superset of the shift instructions. I found these instructions particularly useful when I was doing instruction emulation in my last OS's exception handlers; they allowed me to play with instruction fields directly instead of having to do all sorts of annoying shift-and-mask operations. -- Jeff d'Arcy, Generic MTS, Sequoia Systems Inc. <jdarcy%seqp4@m2c.org> Yonder nor sorghum stenches shut ladle gulls stopper torque wet strainers
kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (04/15/91)
In article <1991Apr15.193425.3436@waikato.ac.nz>, ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: |> What do RISC designers think of bitfield instructions? (A la |> the VAX and 680x0 [x >= 2] processors: extract so many bits at |> a certain offset, insert so many bits at a certain offset, etc.) |> |> Do other people find these instructions useful? |> The Motorola 88000 has bitfield instructions which execute in one cycle. I found them useful for stuff I did, and I miss them on other machines which don't have them. ----------------------------------------------------------------------------- == jeff kenton Consulting at kenton@decvax.dec.com == == (617) 894-4508 (603) 881-0011 == -----------------------------------------------------------------------------
ts@cup.portal.com (Tim W Smith) (04/16/91)
I know that the reason bitfield instructions were listed in the early 386 documentation and then disappeared in later documents was because the guy who was writing the Unix C compiler for the 386 port told Intel that his compiler would *never* *ever* use these instructions because, once you took into account having to get everything into the right registers, these instructions were slower and took more code space than doing it with shifts and masks. So, given the choice of finishing debugging the microcode for instructions that were not going to be used, or by taking them out, Intel took them out. Tim Smith
zik@dec19.cs.monash.edu.au (Michael Saleeba) (04/16/91)
In <1991Apr15.193425.3436@waikato.ac.nz> ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: >What do RISC designers think of bitfield instructions? (A la >the VAX and 680x0 [x >= 2] processors: extract so many bits at >a certain offset, insert so many bits at a certain offset, etc.) I rather like the Texas Instruments graphics processors' attitude to this problem. All instructions are addressed by bit, irrespective of the maximum physical word size, and most instructions (as far as I remember) allow a nice range of bitfield lengths for their operands. Surely this is a neater way to design a system than to rely on combinations of arbitrary byte and word boundaries. It could be argued that this would lead to a conflict with the basic goals of RISC design and general inefficiency. I suppose the best comment here is that the TMS34010 and TMS34020 are quite RISCy, and are also pretty damn fast; mainly through exploiting RISC design goals. ______ _ |___ / _ | | __ "I don't want the world - I just want your half." / / |_| | |/ / / / _ | / Name: Michael Saleeba / /__ | | | \ At: Monash University /_____| |_| |_|\_\ E-mail: zik@bruce.oz.au
ckp@grebyn.com (Checkpoint Technologies) (04/16/91)
In article <3339@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: > b) should the next version of ANSI/ISO C include a standard bit fiddle > library, and what should be in it. My thoughts: > - get/put bit(s) to/from a buffer > getbits(char * buf, int startbit, int nbits) > - read/write bits from a file > readbits(FILE * fp, char * buffer, int nbits) You can write any library routines you want for this, right now. I want bit field arrays: struct s { unsigned int bitflags :1 [4096]; }; On the other hand, perhaps extending bit fields is less valuable than allowing packed data objects. Suppose, for example, the next C allows the the following type to occupy 2 bits: typedef enum { zero=0, one, two, three } twobits; Then an array of 'twobits' would be packed tightly. For compatibility, a new qualifier would probably be in order: packed twobits tb[1024]; A pointer to a bit field is possible; the language simply must create a storage pointer plus a bit offset, so it would truly be a different representation than for other pointers. But you all always knew that pointers to different types could have different representations, didn't you??? But then, all this stuff belong in alt.lang.cfutures... -- First comes the logo: C H E C K P O I N T T E C H N O L O G I E S / / ckp@grebyn.com \\ / / Then, the disclaimer: All expressed opinions are, indeed, opinions. \ / o Now for the witty part: I'm pink, therefore, I'm spam! \/
mslater@cup.portal.com (Michael Z Slater) (04/16/91)
Michael Saleeba writes: >I rather like the Texas Instruments graphics processors' attitude to this >problem. All instructions are addressed by bit, irrespective of the maximum >physical word size, and most instructions (as far as I remember) allow a >nice range of bitfield lengths for their operands. Surely this is a neater >way to design a system than to rely on combinations of arbitrary byte and >word boundaries. > >It could be argued that this would lead to a conflict with the basic goals of >RISC design and general inefficiency. I suppose the best comment here is that >the TMS34010 and TMS34020 are quite RISCy, and are also pretty damn fast; >mainly through exploiting RISC design goals. I don't think the 340x0 can be called "quite RISCy" in any meaningful sense of the term. They have a load/store architecture, but the aren't heavily pipelined (each instruction takes 4 clocks on a 34020, 8 clocks on a 34010) and use microcode. They have complex addressing modes and variable-length instructions. As for "pretty damn fast", the 40-MHz 34020 executes at a PEAK rate of 10 NATIVE MIPS. It may be fast in some applications, but this comes from its CISCyness -- the fact that it has complex instructions that are well- suited to some applications, notably graphics. As for RISCs with bit-field operations, both the 88000 and HP-PA have them. Michael Slater, Microprocessor Report mslater@cup.portal.com
ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/17/91)
In article <712@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeff d'Arcy) mentions
the bitfield instructions on the Motorola 88K processor family.
Someone else has kindly e-mailed me a description of the 88K bitfield
instructions. They are truly remarkable animals, except for some reason
the variable versions take *both* the offset and field width (as bitfields)
from the same register.
In other words, if you were dealing with a variable bitfield, you'd need
another sequence of bitfield instructions (using fixed, immediate values for
the width and offset) to set up the operands for the first one! I'd venture
to suggest this would nullify any performance gain from trying to use bitfield
instructions in the first place, except that Jeff D'Arcy also mentions that
the 88K doesn't have any shift or mask operators, so you don't have the
choice.
Why did they design the 88K instructions to pack two operands into
one register?
Lawrence D'Oliveiro fone: +64-71-562-889
Computer Services Dept fax: +64-71-384-066
University of Waikato electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
Be kind to plants--eat more herbivores.
kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (04/17/91)
In article <1991Apr17.174918.3458@waikato.ac.nz>, ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: |> In article <712@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeff d'Arcy) mentions |> the bitfield instructions on the Motorola 88K processor family. |> |> Someone else has kindly e-mailed me a description of the 88K bitfield |> instructions. They are truly remarkable animals, except for some reason |> the variable versions take *both* the offset and field width (as bitfields) |> from the same register. |> |> In other words, if you were dealing with a variable bitfield, you'd need |> another sequence of bitfield instructions (using fixed, immediate values for |> the width and offset) to set up the operands for the first one! |> |> Why did they design the 88K instructions to pack two operands into |> one register? |> I don't remember any case where having them in the same register (you want them in 2 registers?) was inefficient. Furthermore, the instruction set encoding has space for 3 register specifiers -- source, destination and width/offset in this case. There would be no way to encode width and offset separately. ----------------------------------------------------------------------------- == jeff kenton Consulting at kenton@decvax.dec.com == == (617) 894-4508 (603) 881-0011 == -----------------------------------------------------------------------------
jdarcy@seqp4.ORG (Jeff d'Arcy) (04/17/91)
ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: >Someone else has kindly e-mailed me a description of the 88K bitfield >instructions. They are truly remarkable animals, except for some reason >the variable versions take *both* the offset and field width (as bitfields) >from the same register. When I saw this in the manual, I wondered why one would ever use them. I eventually found a use; when emulating misaligned accesses I needed to do the effective address calculation manually, and for the scaled addressing mode this was most easily accomplished by extracting the instruction size into a register and then using that register as an argument to the next make-bitfield instruction. >In other words, if you were dealing with a variable bitfield, you'd need >another sequence of bitfield instructions (using fixed, immediate values for >the width and offset) to set up the operands for the first one! I'd venture >to suggest this would nullify any performance gain from trying to use bitfield >instructions in the first place, except that Jeff D'Arcy also mentions that >the 88K doesn't have any shift or mask operators, so you don't have the >choice. Unlike your name, mine has a small "d". I'm sure you understand. Secondly, the 88K *does* have mask operations, lacking only the shifts which are merely subsets of the make/extract bitfield instructions. There is *no* shift operation which cannot be simulated in a single instruction this way. As for the variable-argument problem, the extra instructions to set up the operand would be present using shifts instead of bitfield ops. As it is, Motorola wisely allowed a width of 0 to mean "all bits to the left of the offset" (left = MSB), and the offset for the bitfield ops is in the *low* order five bits of the register operand. 
Consider the following: ; 88K notation is dst,src1,src2 - src1 is always a register ; extract (unsigned) bits 26-27 of r10 into r11 extu r11,r10,26<2> ; left shift r12 according to r11 mak r12,r12,r11 After the first instruction, r11 contains: 3 2 1 1098765432109876543210 98765 43210 +----------------------+-----+-----+ |0000000000000000000000|00000|000xx| +----------------------+-----+-----+ The field in the middle (bits 5-9) is the bitfield width; the field on the right (bits 0-4) is the bitfield offset. The second instruction therefore takes bits 0-31 (width of 0 gets translated to 32) of r12 and puts them in r12 starting at position <xx> (the extracted field from r10); excess bits on the left are of course discarded. The point of this detailed explation is thus: I don't know of any CPU that can do such a manipulation in *fewer* than two instructions, and I'd be rather surprised to find one. Hence, the "inconvenience" of Motorola's encoding is actually a win. (Any factual errors are the result of my not having at 88K manual handy since I don't work with the 88K any more) -- Jeff d'Arcy, Generic MTS, Sequoia Systems Inc. <jdarcy%seqp4@m2c.org> Yonder nor sorghum stenches shut ladle gulls stopper torque wet strainers
hamilton@siberia.rtp.dg.com (Eric Hamilton) (04/17/91)
In article <1991Apr17.174918.3458@waikato.ac.nz>, ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: |> |> Someone else has kindly e-mailed me a description of the 88K bitfield |> instructions. They are truly remarkable animals, except for some reason |> the variable versions take *both* the offset and field width (as bitfields) |> from the same register. |> |> In other words, if you were dealing with a variable bitfield, you'd need |> another sequence of bitfield instructions (using fixed, immediate values for |> the width and offset) to set up the operands for the first one! I'd venture |> to suggest this would nullify any performance gain from trying to use bitfield |> instructions in the first place, except that Jeff D'Arcy also mentions that |> the 88K doesn't have any shift or mask operators, so you don't have the |> choice. The 88000 has shift operations. It gives you all flavors (left, right, arithmetic, logical, constant or variable shift count) in one clock. They're special cases of the bitfield operations, so they may not be recognized as shifts, but they are. The 88000 also has mask operators (rd,rs1,rs2 are registers, imm16 a 16-bit literal): and rd,rs1,rs2 rd = rs1 & rs2; and.c rd,rs1,rs2 rd = rs1 & ~rs2 and rd,rs1,imm16 rd = rs1 & (0xffff0000 | imm16) and.u rd,rs1,imm16 rd = rs1 & (0x0000ffff | (imm16 << 16)) mask rd,rs1,imm16 rd = rs1 & imm16 mask.u rd,rs1,imm16 rd = rs1 & (imm16 << 16) as well as the bit-field operations. Packing the operands of the variable bit-field operations into one register doesn't really impose much performance penalty (all instructions are single-cycle): Any shift (constant or variable shift count): 1 instruction. Any constant width and offset bitfield operation: 1 instruction. Any variable width, zero offset operation: 2 instructions. Any variable width, non-zero offset operation: 3 instructions. Any variable width, variable offset operation: 3 instructions. 
These are not the penalties - they're the total instructions to perform the operation. They're pretty good, especially if you consider the dynamic frequency of the variable width and offset operations. I think you'll be hard-pressed to better with shift-and-mask sequences. |> |> Why did they design the 88K instructions to pack two operands into |> one register? |> With the operands packed, the bit field operations are already three-register instructions: source, destination, and width/offset specifier. For a single-cycle operation, that's about all that's reasonable - five more bits in the instruction to specify another register, and another read port in the register file, just so we can do variable-width/length operations in one or two fewer cycles? That can't be right. -- ---------------------------------------------------------------------- Eric Hamilton +1 919 248 6172 Data General Corporation hamilton@dg-rtp.rtp.dg.com 62 Alexander Drive ...!mcnc!rti!xyzzy!hamilton Research Triangle Park, NC 27709, USA
torbenm@diku.dk (Torben [gidius Mogensen) (04/18/91)
ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: > What do RISC designers think of bitfield instructions? (A la > the VAX and 680x0 [x >= 2] processors: extract so many bits at > a certain offset, insert so many bits at a certain offset, etc.) > > Do other people find these instructions useful? > As John Bowler pointed out in another article, bit masking instructions can be simulated with a few shift and logical instructions, and you oten don't need the full generality of the masks, as the bitfields tend to be at one end or other of the word. What is (often?) needed in graphics systems is to convert raster arrays from one number of bits per pixel to another, for example 1 bit per pixel to two bits per pixel. With the normal instruction selection on processors, this requires handling each pixel on its own, making the operation quite slow. If you had bit ZIP and UNZIP instructions as described below, the operations could be done on whole words (or half words), at least if the bits per pixel are powers of two. A ZIP instructions takes two words abcde..., ABCDE... and merges the bits into a double word aAbBcCdDeE... (or two half words into a word) An UNZIP instruction is the inverse of ZIP, and thus takes a double word aAbBcCdDeE... and produces two words abcde..., ABCDE... (or takes a word and produces two half words). Thus if you want to take a one-bit-per-pixel word and produce a two-bit-per-pixel double word, you ZIP if with 0, -1 or itself, which would yield the following colour translations: X ZIP 0: 0 -> 00, 1 -> 10 0 ZIP X: 0 -> 00, 1 -> 01 X ZIP -1: 0 -> 01, 1 -> 11 -1 ZIP X: 0 -> 10, 1 -> 11 X ZIP X: 0 -> 00, 1 -> 11 other translations can be obtained using logical operations on the result. Conversion of two-bit-per-pixel words to four-bits-per-pixel double words can also be done, but specific colour translation is more tricky. to convert from a two-bit-per-pixel double word to a one-bit per pixel you UNZIP it into two words. 
Specific colour translations can be obtained using logical operations between the two words (and possibly constants). ZIP and UNZIP can be implemented very efficiently (they are just crossing wires), but will take up rather much space (similar to a barrel shifter I would guess), so it would probably only be worthwhile on graphics processors. You could also just implement EXPAND and CONTRACT, and use shifts and logical operations to get the full ZIP and UNZIP functionality. EXPAND ZIPs a word with zero, and CONTRACT is an UNZIP where only one of the outputs is generated. Does anyone know of any processors (graphics or general purpose) that have such instructions? Torben Mogensen (torbenm@diku.dk)
spot@CS.CMU.EDU (Scott Draves) (04/18/91)
In article <1991Apr18.093804.18183@odin.diku.dk> torbenm@diku.dk (Torben [gidius Mogensen) writes:
What is (often?) needed in graphics systems is to convert raster
arrays from one number of bits per pixel to another, for example 1 bit
per pixel to two bits per pixel.
this rarely happens. 1 -> 8 or 1 -> 24 happens on occasion, but i
doubt any of these are significant compared to blitting and rendering.
With the normal instruction selection
on processors, this requires handling each pixel on its own, making
the operation quite slow.
when it does happen, you can use look up tables. to do your 1 -> 2
bits per pixel example, i would use a table of 256 entries, 16 bits
each.
It also works for bitmap magnification.
A ZIP instructions takes two words abcde..., ABCDE... and merges the
bits into a double word aAbBcCdDeE... (or two half words into a word)
these can be done with the look up tables in a similar fashion. make
a table called halfzip that spaces out the bits of a word:
halfzip[abcd] = 0a0b0c0d. then your ZIP(i,j) = halfzip[i] |
(halfzip[j] << 1). Or you can use two separate tables to avoid the
shift.
An UNZIP instruction is the inverse of ZIP, and thus takes a double
word aAbBcCdDeE... and produces two words abcde..., ABCDE... (or takes
a word and produces two half words).
I can't think of a similar game for unzip, but....
to convert from a two-bit-per-pixel double word to a one-bit per pixel
you UNZIP it into two words. Specific colour translations can be
obtained using logical operations between the two words (and possibly
constants).
...this is a very bad way to convert from 2 bit to 1 bit. Dithering
should be used if you want the result to look good at all. converting
down to 1 bit even less common than converting from 1 bit
You could also just implement EXPAND and
CONTRACT, and use shifts and logical operations to get the full ZIP
and UNZIP functionality. EXPAND ZIPs a word with zero, and CONTRACT is
an UNZIP where only one of the outputs is generated.
EXPAND = halfzip
--
christianity is stupid
Scott Draves communism is good
spot@cs.cmu.edu give up
ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/18/91)
In article <715@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeff d'Arcy) writes: > Unlike your name, mine has a small "d". I'm sure you understand. Whoops! After all the fuss I make about people spelling *my* name correctly, please accept my humble apologues. [detailed example of emulating power-of-two operand scaling using just two bitfield instructions, ending with] > Hence, the "inconvenience" of Motorola's encoding is actually a win. I agree you haven't lost anything in this example, but I don't see that you've gained anything, either, other than saving a register. Your example works OK on this architecture, mine would work less well, and that's all. Todd Bridges (toddb@illini.sps.mot.com) has, if I understand him correctly, given me what I think the real reason is for these instructions being defined the way they are: 88K instructions can't have more than three operands. Lawrence D'Oliveiro fone: +64-71-562-889 Computer Services Dept fax: +64-71-384-066 University of Waikato electric mail: ldo@waikato.ac.nz Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00 Whoever said "There is no such thing as a free lunch" obviously never heard of the Buy-No-Meal Theorem.
henry@zoo.toronto.edu (Henry Spencer) (04/19/91)
In article <1991Apr18.093804.18183@odin.diku.dk> torbenm@diku.dk (Torben [gidius Mogensen) writes: >A ZIP instructions takes two words abcde..., ABCDE... and merges the >bits into a double word aAbBcCdDeE... (or two half words into a word) >An UNZIP instruction is the inverse of ZIP... Unless access to memory is seriously expensive, table lookup can implement both of these operations quite efficiently. Agreed, it's not as fast as having them wired in... -- And the bean-counter replied, | Henry Spencer @ U of Toronto Zoology "beans are more important". | henry@zoo.toronto.edu utzoo!henry
glew@pdx007.intel.com (Andy Glew) (04/19/91)
Someone else has kindly e-mailed me a description of the 88K bitfield instructions. They are truly remarkable animals, except for some reason the variable versions take *both* the offset and field width (as bitfields) from the same register. In other words, if you were dealing with a variable bitfield, you'd need another sequence of bitfield instructions (using fixed, immediate values for the width and offset) to set up the operands for the first one! I'd venture to suggest this would nullify any performance gain from trying to use bitfield instructions in the first place, except that Jeff D'Arcy also mentions that the 88K doesn't have any shift or mask operators, so you don't have the choice. Why did they design the 88K instructions to pack two operands into one register? I hate to defend the 88K, now that I have moved from Motorola to Intel, but setting up the dynamic operands in a register is acceptable for one of the biggest uses of the bitfield operations - bitblt, where the amount of the shift is known before the loop begins. Setup once. -- Andy Glew, glew@ichips.intel.com Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, Hillsboro, Oregon 97124-6497 This is a private posting; it does not indicate opinions or positions of Intel Corp.
graeme@labtam.labtam.oz (Graeme Gill) (04/19/91)
In article <SPOT.91Apr18123711@WOOZLE.GRAPHICS.CS.CMU.EDU>, spot@CS.CMU.EDU (Scott Draves) writes: > In article <1991Apr18.093804.18183@odin.diku.dk> torbenm@diku.dk (Torben [gidius Mogensen) writes: > > What is (often?) needed in graphics systems is to convert raster > arrays from one number of bits per pixel to another, for example 1 bit > per pixel to two bits per pixel. > > this rarely happens. 1 -> 8 or 1 -> 24 happens on occasion, but i > doubt any of these are significant compared to blitting and rendering. Consider that many applications are written to run on monochrome devices, so they use single level bitmaps and fonts to encode graphic objects. When rendered on a colour device, the 1->8 or 1->24 bit expansion is used a great deal. > when it does happen, you can use look up tables. to do your 1 -> 2 > bits per pixel example, i would use a table of 256 entries, 16 bits > each. In fast hardware the memory bandwidth is the limit. Using lookup tables doubles the number of memory accesses, thereby halving the ultimate speed. Hardware support for 1->power_of_two expansion is a desirable feature in a graphics processor. Graeme Gill Labtam Australia
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/19/91)
In article <1991Apr18.195841.10588@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: | Unless access to memory is seriously expensive, table lookup can implement | both of these operations quite efficiently. Agreed, it's not as fast as | having them wired in... This is true, but the original topic was somewhat more general, breaking N bit chunks into parts and assembling them. In the case where N is a power of two I think the table lookup loses if I want the data elsewhere. I can generate it faster than I can read it from memory (on most machines). The problem comes when I want to go from words back to byte packing. In this case a lookup doesn't help, and I have to fetch, shift, and or the bits regardless of wether I do it with an instruction or a loop. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
henry@zoo.toronto.edu (Henry Spencer) (04/20/91)
In article <3357@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: > The problem comes when I want to go from words back to byte packing. >In this case a lookup doesn't help, and I have to fetch, shift, and or >the bits regardless of wether I do it with an instruction or a loop. I'm not quite sure what you mean by "go from words back to byte packing". If you're referring to the "unzip" instruction, transforming AaBbCcDd to ABCDabcd, note that you can do the shifts when generating tables. You look up AaBb in *two* tables, one giving you AB-- and one giving ab--, and look up CdDd in tables that give you --CD and --cd. You still get to OR them together yourself, however. How the tradeoffs go on this depends a lot on memory access speeds (and on whether your caches are big enough to contain your lookup tables!), and also on whether the operations being done are regular enough to lend themselves to table generation, but table lookup is always worth considering for messy bit operations. -- And the bean-counter replied, | Henry Spencer @ U of Toronto Zoology "beans are more important". | henry@zoo.toronto.edu utzoo!henry
rex@cs.su.oz (Rex Di Bona) (04/22/91)
In article <10408@labtam.labtam.oz> graeme@labtam.labtam.oz (Graeme Gill) writes: > Consider that many applications are written to run on monochrome > devices, so they use single level bitmaps and fonts to encode graphic > objects. When rendered on a colour device, the 1->8 or 1->24 bit > expansion is used a great deal. This is true, but misleading. You do not want to convert 0 -> 00000000 and 1 -> 11111111, but 0-> some colour, and 1-> some other colour. Both of these colours are user selectable. This requires more hardware support than just a bit stretcher. > > when it does happen, you can use look up tables. to do your 1 -> 2 > > bits per pixel example, i would use a table of 256 entries, 16 bits > > each. > > In fast hardware the memory bandwidth is the limit. Using lookup tables > doubles the number of memory accesses, thereby halving the ultimate speed. > Hardware support for 1->power_of_two expansion is a desirable feature > in a graphics processor. > > > Graeme Gill > Labtam Australia You should be able to do all of the rendering in software, keeping the intermediate values in registers. This requires a better layout of the memory for the graphics 'screen'. You could use a lookup table for additional speed (as an example, if your video memory was 8 bits deep, and layed out so that 4 pixels occupied a 32 bit word you could construct a 16 element table (each 32 bits wide) with the appropriate spaces filled in. To set up the table would require 16 memory stores, but you rendering would be twice as fast (one read, one write as opposed to 4 writes) as the byte at a time method). You could remove the read by having all 16 values stored in registers, and do a branch to the apppropriate store, or other hardware nasties. The one instruction I keep wishing for when doing these sort of graphic operations is a rotate instruction. 
It can be used for line drawing, as a counter for the above routine (it wouldn't require reloading for each loop however) and does save instructions... I have found that normal shifts are fine for bitblts, even those on random boundaries and for random widths. I would suspect the additional overheads of determining when to use the 88K type bitfield operations would swamp the usefulness of them. -------- Rex di Bona (rex@cs.su.oz.au) Penguin Lust is NOT immoral
peter@ficc.ferranti.com (peter da silva) (04/22/91)
In article <SPOT.91Apr18123711@WOOZLE.GRAPHICS.CS.CMU.EDU>, spot@CS.CMU.EDU (Scott Draves) writes:
> when it does happen, you can use look up tables. to do your 1 -> 2
> bits per pixel example, i would use a table of 256 entries, 16 bits
> each.

Wouldn't this tend to be a cache-basher? Particularly a problem when
you're already bashing the cache with the bitmap anyway?
--
Peter da Silva.  `-_-'  peter@ferranti.com  +1 713 274 5180.
                 'U`  "Have you hugged your wolf today?"
sef@kithrup.COM (Sean Eric Fagan) (04/23/91)
In article <PUWA8R4@xds13.ferranti.com> peter@ficc.ferranti.com (peter da silva) writes:
>Wouldn't this tend to be a cache-basher? Particularly a problem when you're
>already bashing the cache with the bitmap anyway?

Perhaps that's where a 'don't cache' attribute on a load/store might be
useful... You could probably do it now by telling the OS not to cache the
page your data are in (although that won't work on all systems).
--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
henry@zoo.toronto.edu (Henry Spencer) (04/23/91)
In article <1991Apr23.053619.13474@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:
>>Wouldn't this tend to be a cache-basher? Particularly a problem when you're
>>already bashing the cache with the bitmap anyway?
>
>Perhaps that's where a 'don't cache' attribute on a load/store might be
>useful...

Anybody who caches a frame buffer is crazy. Especially if the cache isn't
write-through, in which case your frame-buffer updates show up on the
screen some arbitrary time later!

Table-lookup algorithms (the original topic) can be somewhat hard on
caches, although the impact is minimized if the tables are small (i.e.
lookup by bytes, not by shorts!) and the cache is large.
--
And the bean-counter replied,         | Henry Spencer @ U of Toronto Zoology
"beans are more important".           | henry@zoo.toronto.edu  utzoo!henry
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/24/91)
In article <1991Apr23.152155.2298@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
| Anybody who caches a frame buffer is crazy. Especially if the cache isn't
| write-through, in which case your frame-buffer updates show up on the screen
| some arbitrary time later!

The second part of this is true, but in a system which writes data to the
bus aggressively, the arbitrary time is in ms, not seconds or minutes.
Write-back caches seem to fall into two groups: those which try to unload
the bus by writing to memory only when the cache is getting full, and
aggressive caches which write whenever the bus is not busy.

I played with caching a frame buffer, and it had a positive effect on
performance, since calculation and writing of points were overlapped.
Obviously a bank-selected frame buffer can't (easily) be cached, but other
than that I don't see any reason why a cache won't work nicely. I don't
think a write-through cache would make sense, because *most* programs
don't read back the buffer.

I agree with the conclusion that caching a frame buffer is not cost
effective in most applications, but I'm not quite as firmly set against it
as Henry.
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    "Most of the VAX instructions are in microcode,
     but halt and no-op are in hardware for efficiency"
johnl@iecc.cambridge.ma.us (John R. Levine) (04/24/91)
In article <1991Apr23.152155.2298@zoo.toronto.edu> you write:
>Anybody who caches a frame buffer is crazy. ...

Given how slow some of the frame buffers on PC clones are, upwards of 10
wait states per 16-bit word, caching could be a big performance win. It
should probably be write-through, though I can even imagine situations
where you're redrawing an entire window and, so long as you can flush it
when you're done, a little smudge during the repaint due to a write-back
cache wouldn't be offensive.

Recent models of screen buffer do remap all of the planes at the same
place in the address space, which makes a lot of this moot unless you can
do a page flush without a context switch. But even that's possible on the
386 -- it does let you enable individual device addresses for user I/O.

Regards,
John Levine, johnl@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl
graeme@labtam.labtam.oz (Graeme Gill) (04/24/91)
In article <2325@cluster.cs.su.oz.au>, rex@cs.su.oz (Rex Di Bona) writes:
> This is true, but misleading. You do not want to convert 0 -> 00000000 and
> 1 -> 11111111, but 0 -> some colour, and 1 -> some other colour. Both of
> these colours are user selectable.

The ideal graphics support would be an expand instruction with foreground
and background colour registers, but a 0 -> 00000000 and 1 -> 11111111
instruction would still be very useful when internal operations are many
times faster than memory accesses. The expanded bitmap is used as a mask
to merge the foreground and background colours together (as well as a
plane mask, perhaps).

> You should be able to do all of the rendering in software, keeping the
> intermediate values in registers. This requires a better layout of the
> memory for the graphics 'screen'. You could use a lookup table for
> additional speed (as an example, if your video memory was 8 bits deep,
> and laid out so that 4 pixels occupied a 32 bit word you could construct
> a 16 element table (each 32 bits wide) with the appropriate spaces filled
> in. To set up the table would require 16 memory stores, but your
> rendering would be twice as fast (one read, one write as opposed to 4
> writes) as the byte at a time method). You could remove the read by
> having all 16 values stored in registers, and do a branch to the
> appropriate store, or other hardware nasties.

Umm. Packed frame stores are used because they speed up other very
important operations like fill and copy.

Currently I do an expand operation like this:

	Read 32 bits of source (word read)
	Lookup 8 bits of the source in the expand table (double word read)
	Lookup 8 bits of the source in the expand table (double word read)
	Write to the destination (quad word write).
	Lookup 8 bits of the source in the expand table (double word read)
	Lookup 8 bits of the source in the expand table (double word read)
	Write to the destination (quad word write).

With expand support I could do this:

	Read 32 bits of source (word read)
	Write to the destination (quad word write).
	Write to the destination (quad word write).

> I would suspect the additional overheads of determining when to use the 88K
> type bitfield operations would swamp the usefulness of them.

When speed is important, these sorts of routines are often pre-compiled
for various cases (e.g. alignment, direction), so one selects the
appropriate routine, not the appropriate instruction.

	Graeme Gill
	Labtam Australia
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/24/91)
In article <9104231527.AA09611@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:
| Recent models of screen buffer do remap all of the planes at the same
| place in the address space, which makes a lot of this moot unless you can
| do a page flush without a context switch. But even that's possible on the
| 386 -- it does let you enable individual device addresses for user I/O.

I have been waiting for frame buffers to come out with an option to map
the whole thing in extended memory, say between 15-16MB. I predicted this
as a popular enhancement when EGA first came out, and I was wrong. I find
it hard to believe that no one has done this, since even a 286 can use it,
but I don't know of anyone who has. And I'm told that the new XGA from IBM
is still bank swapping so they can use an 8086.

Slavish dedication to being backward. Er, backward compatibility, I meant,
didn't I?
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    "Most of the VAX instructions are in microcode,
     but halt and no-op are in hardware for efficiency"
sef@kithrup.COM (Sean Eric Fagan) (04/25/91)
In article <3365@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> I have been waiting for frame buffers to come out with an option to
>map the whole thing in extended memory, say between 15-16MB. I predicted
>this as a popular enhancement when EGA first came out, and I was wrong.
>I find it hard to believe that no one has done this, since even a 286 can
>use it, but I don't know of anyone who has.

I have an old Cornerstone card and monitor, capable of 1600x1280x1.
Really nice. It maps into memory as a 512k chunk of memory (I know, I
know, that isn't 512k worth of graphics). It's got one minor problem
making it useless for me (namely, reads of the odd bytes aren't possible),
but they had the right idea.

There is at least one video card for EISA which maps into 0xd0000000, I
believe. Takes a megabyte there.
--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
henry@zoo.toronto.edu (Henry Spencer) (04/25/91)
In article <9104231527.AA09611@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>>Anybody who caches a frame buffer is crazy. ...
>
>Given how slow some of the frame buffers on PC clones are, upwards of 10
>wait states per 16 bit word, caching could be a big performance win.

It's only a win if you are reading that word *repeatedly*. This isn't too
likely to be a major performance issue. If you are twiddling around in a
small area, odds are good that your performance is dominated by overhead
and not by frame-buffer accesses. If you're doing something big, it's
probably being done a word at a time without repeated accesses to the same
word, and it's also probably going to bust your cache if you cache those
accesses.
--
And the bean-counter replied,         | Henry Spencer @ U of Toronto Zoology
"beans are more important".           | henry@zoo.toronto.edu  utzoo!henry
scottw@ico.isc.com (Scott Wiesner) (04/25/91)
From article <1991Apr24.173400.13397@zoo.toronto.edu>, by henry@zoo.toronto.edu (Henry Spencer):
> In article <9104231527.AA09611@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>>>Anybody who caches a frame buffer is crazy. ...
>>
>>Given how slow some of the frame buffers on PC clones are, upwards of 10
>>wait states per 16 bit word, caching could be a big performance win.
>
> It's only a win if you are reading that word *repeatedly*. This isn't
> too likely to be a major performance issue.

Unfortunately, there's a bigger problem lurking here. Most of the "frame
buffers" on PCs aren't really frame buffers in the traditional workstation
sense. Most don't map all their video memory into the CPU's address space.
They usually map 64k, and use some I/O registers to do bank switching.
Accesses of the same address may be looking at different parts of video
memory.

Also, things like EGAs and VGAs are even more complicated, with 4 planes
of memory being shadowed at the same address, with one I/O port selecting
which plane is of interest on a read, and another I/O port selecting what
action should be taken on a write. (Actions include masking with a plane
mask, a bit mask, 4 choices of rasterop, etc.) You definitely don't want
to be doing standard caching on one of these cards.

Scott Wiesner
Interactive Systems
X Development Group
andreww@uniwa.uwa.oz (Andrew John Williams) (04/26/91)
Not really anything to do with this - but how's this for 'orrible...

I picked up the 1980 Nat Semi MOS book recently, and had a look at the
COPS420 (or whatever it was) specs. No shift instruction. The way they
suggested was to swap the accumulator out to the shift register (also
connected to the outside world) and wait. The shift register was
synchronised to the instruction cycle, so to shift one bit you swapped A
back in on the next instruction. For a 2-bit shift, you inserted a NOP (or
other such useful instruction). Don't worry though - the maximum delay was
4 instructions, 'cos that's how big A was. And people complain about the
8086!

John West (stealing Andrew's account)
Do fish need aqualungs?