[comp.arch] Bitfield instructions--a good idea?

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/15/91)

What do RISC designers think of bitfield instructions? (A la
the VAX and 680x0 [x >= 2] processors: extract so many bits at
a certain offset, insert so many bits at a certain offset, etc.)

I was doing some work on some GIF image decompression code on a
68020 processor (a Macintosh LC)--a lot of bit manipulation involved.
I looked up the instruction timings for the bitfield instructions in
my 68020 handbook, and was mildly dismayed to find that they took about
twice as long to execute as a shift or mask instruction!

But I tried using them anyway. Motorola's bit-numbering scheme
in these instructions was the opposite of the natural order for
solving my problem, so I first had to reverse the order of each
bufferful of bytes after reading them in, before I could start
extracting the compressed data symbols.

The result? I sped up the code by more than 20%. I know the
compiler I was using generated pretty lousy code, but it allowed
the insertion of in-line machine code, which was how I was
fine-tuning the time-consuming parts.
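For reference, the shift-and-mask style that the bitfield instructions were competing against looks roughly like this in C (a hypothetical sketch, not the original inline code; the helper name, the LSB-first bit order, and the padded buffer are my assumptions):

```c
#include <stddef.h>

/* Read an nbits-wide code (nbits <= 16) starting at absolute bit
 * offset 'bitpos' in a byte buffer, LSB-first as GIF's LZW stream is
 * laid out.  Hypothetical helper, not the poster's actual code. */
static unsigned getcode(const unsigned char *buf, size_t bitpos, unsigned nbits)
{
    size_t byte = bitpos >> 3;
    unsigned shift = (unsigned)(bitpos & 7);
    /* Gather three bytes covering the field (assumes the buffer is
     * padded so reading two bytes past the field is safe). */
    unsigned long acc = buf[byte]
                      | ((unsigned long)buf[byte + 1] << 8)
                      | ((unsigned long)buf[byte + 2] << 16);
    return (unsigned)((acc >> shift) & ((1ul << nbits) - 1));
}
```

GIF's LZW codes are at most 12 bits wide, so the three gathered bytes always cover the field.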

Do other people find these instructions useful?

Lawrence D'Oliveiro                       fone: +64-71-562-889
Computer Services Dept                     fax: +64-71-384-066
University of Waikato            electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand    37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
To someone with a hammer and a screwdriver, every problem looks
like a nail with threads.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/15/91)

In article <1991Apr15.193425.3436@waikato.ac.nz> ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:

| The result? I sped up the code by more than 20%. I know the
| compiler I was using generated pretty lousy code, but it allowed
| the insertion of in-line machine code, which was how I was
| fine-tuning the time-consuming parts.
| 
| Do other people find these instructions useful?

  I have two more questions, which I think would be less
controversial...

 a) for machines which have to do a lot of this would a bitfield
    coprocessor make sense? I ballpark guess a 5:1 speedup of those
    instructions. 

 b) should the next version of ANSI/ISO C include a standard bit fiddle
    library, and what should be in it. My thoughts:
    - get/put bit(s) to/from a buffer
	getbits(char * buf, int startbit, int nbits)
    - read/write bits from a file
	readbits(FILE * fp, char * buffer, int nbits)
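A minimal sketch of how the proposed getbits() could behave (the signature is from the post; the MSB-first bit order and the long return type are my assumptions, not part of the proposal):

```c
#include <limits.h>

/* Sketch of the proposed getbits(): return 'nbits' bits (nbits <= 31)
 * starting at bit 'startbit' of 'buf', MSB-first within each byte.
 * Bit order and return type are assumptions. */
long getbits(char *buf, int startbit, int nbits)
{
    long v = 0;
    for (int i = 0; i < nbits; i++) {
        int bit = startbit + i;
        v = (v << 1)
          | (((unsigned char)buf[bit / CHAR_BIT]
              >> (CHAR_BIT - 1 - bit % CHAR_BIT)) & 1);
    }
    return v;
}
```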
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

jdarcy@seqp4.ORG (Jeff d'Arcy) (04/15/91)

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
>What do RISC designers think of bitfield instructions? (A la
>the VAX and 680x0 [x >= 2] processors: extract so many bits at
>a certain offset, insert so many bits at a certain offset, etc.)

I'm not a RISC designer, but I play one on TV.  Oops.  Seriously, I think
the RISC designers out there would be fools not to understand the value of
making their chips pleasant for low-level OS geeks (like me) to use, so
here are my two bits.

My experience with bitfield instructions is with the Moto 88K, which I think
uses them in a very elegant fashion.  The 88K has *no* shift instructions
(it does have rotate), using the make, extract, and extract unsigned bitfield
instructions for left shift, arithmetic and logical right shift respectively.
These execute in one cycle apiece like all other integer instructions and
provide a very intuitive superset of the shift instructions.

I found these instructions particularly useful when I was doing instruction
emulation in my last OS's exception handlers; they allowed me to play with
instruction fields directly instead of having to do all sorts of annoying
shift-and-mask operations.
-- 

  Jeff d'Arcy, Generic MTS, Sequoia Systems Inc. <jdarcy%seqp4@m2c.org>
Yonder nor sorghum stenches shut ladle gulls stopper torque wet strainers

kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (04/15/91)

In article <1991Apr15.193425.3436@waikato.ac.nz>, ldo@waikato.ac.nz
(Lawrence D'Oliveiro, Waikato University) writes:
|> What do RISC designers think of bitfield instructions? (A la
|> the VAX and 680x0 [x >= 2] processors: extract so many bits at
|> a certain offset, insert so many bits at a certain offset, etc.)
|> 
|> Do other people find these instructions useful?
|> 

The Motorola 88000 has bitfield instructions which execute in one cycle.  I
found them useful for stuff I did, and I miss them on other machines which
don't have them.

-----------------------------------------------------------------------------
==	jeff kenton		Consulting at kenton@decvax.dec.com        ==
==	(617) 894-4508			(603) 881-0011			   ==
-----------------------------------------------------------------------------

ts@cup.portal.com (Tim W Smith) (04/16/91)

I know that the reason bitfield instructions were listed in the
early 386 documentation and then disappeared in later documents
was because the guy who was writing the Unix C compiler for the
386 port told Intel that his compiler would *never* *ever* use
these instructions because, once you took into account having to
get everything into the right registers, these instructions were
slower and took more code space than doing it with shifts and
masks.

So, given the choice of finishing debugging the microcode for
instructions that were not going to be used, or taking them
out, Intel took them out.

						Tim Smith

zik@dec19.cs.monash.edu.au (Michael Saleeba) (04/16/91)

In <1991Apr15.193425.3436@waikato.ac.nz> ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:

>What do RISC designers think of bitfield instructions? (A la
>the VAX and 680x0 [x >= 2] processors: extract so many bits at
>a certain offset, insert so many bits at a certain offset, etc.)

I rather like the Texas Instruments graphics processors' attitude to this
problem. All instructions are addressed by bit, irrespective of the maximum
physical word size, and most instructions (as far as I remember) allow a
nice range of bitfield lengths for their operands. Surely this is a neater
way to design a system than to rely on combinations of arbitrary byte and
word boundaries.

It could be argued that this would lead to a conflict with the basic goals of
RISC design and general inefficiency. I suppose the best comment here is that
the TMS34010 and TMS34020 are quite RISCy, and are also pretty damn fast;
mainly through exploiting RISC design goals.

 ______      _
|___  /  _  | | __	"I don't want the world - I just want your half."
   / /  |_| | |/ /
  / /    _  |   / 		Name:		Michael Saleeba
 / /__  | | |   \ 		At:		Monash University
/_____| |_| |_|\_\		E-mail:		zik@bruce.oz.au

ckp@grebyn.com (Checkpoint Technologies) (04/16/91)

In article <3339@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> b) should the next version of ANSI/ISO C include a standard bit fiddle
>    library, and what should be in it. My thoughts:
>    - get/put bit(s) to/from a buffer
>	getbits(char * buf, int startbit, int nbits)
>    - read/write bits from a file
>	readbits(FILE * fp, char * buffer, int nbits)

  You can write any library routines you want for this, right now.

  I want bit field arrays: struct s { unsigned int bitflags :1 [4096]; };

  On the other hand, perhaps extending bit fields is less valuable than
allowing packed data objects.  Suppose, for example, the next C allows
the following type to occupy 2 bits:

  typedef enum { zero=0, one, two, three } twobits;

  Then an array of 'twobits' would be packed tightly.  For compatibility,
a new qualifier would probably be in order:

  packed twobits tb[1024];

  A pointer to a bit field is possible; the language simply must
create a storage pointer plus a bit offset, so it would truly
be a different representation than for other pointers.  But you all
always knew that pointers to different types could have different
representations, didn't you???
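That two-part representation can be sketched in ordinary C as an explicit struct (all names here are hypothetical; this is roughly what a compiler would have to synthesize for a 2-bit element type):

```c
#include <stddef.h>

/* A hypothetical "pointer to a 2-bit element": a byte address plus a
 * bit offset -- the two-part representation described above. */
struct bitptr {
    unsigned char *base;
    unsigned bitoff;            /* 0, 2, 4 or 6 within *base */
};

static unsigned bp_get(struct bitptr p)
{
    return (*p.base >> p.bitoff) & 3u;
}

static void bp_set(struct bitptr p, unsigned v)
{
    *p.base = (unsigned char)((*p.base & ~(3u << p.bitoff))
                            | ((v & 3u) << p.bitoff));
}

/* Index into a packed array of 2-bit elements (4 per byte). */
static struct bitptr bp_index(unsigned char *arr, size_t i)
{
    struct bitptr p = { arr + i / 4, (unsigned)(i % 4) * 2 };
    return p;
}
```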

  But then, all this stuff belongs in alt.lang.cfutures...
-- 
First comes the logo: C H E C K P O I N T  T E C H N O L O G I E S      / /  
                                                ckp@grebyn.com      \\ / /    
Then, the disclaimer:  All expressed opinions are, indeed, opinions. \  / o
Now for the witty part:    I'm pink, therefore, I'm spam!             \/

mslater@cup.portal.com (Michael Z Slater) (04/16/91)

Michael Saleeba writes:

>I rather like the Texas Instruments graphics processors' attitude to this
>problem. All instructions are addressed by bit, irrespective of the maximum
>physical word size, and most instructions (as far as I remember) allow a
>nice range of bitfield lengths for their operands. Surely this is a neater
>way to design a system than to rely on combinations of arbitrary byte and
>word boundaries.
>
>It could be argued that this would lead to a conflict with the basic goals of
>RISC design and general inefficiency. I suppose the best comment here is that
>the TMS34010 and TMS34020 are quite RISCy, and are also pretty damn fast;
>mainly through exploiting RISC design goals.

I don't think the 340x0 can be called "quite RISCy" in any meaningful sense of
the term. They have a load/store architecture, but they aren't heavily
pipelined (each instruction takes 4 clocks on a 34020, 8 clocks on a 34010)
and use microcode. They have complex addressing modes and variable-length
instructions.

As for "pretty damn fast", the 40-MHz 34020 executes at a PEAK rate of 
10 NATIVE MIPS.  It may be fast in some applications, but this comes from
its CISCyness -- the fact that it has complex instructions that are well-
suited to some applications, notably graphics.

As for RISCs with bit-field operations, both the 88000 and HP-PA have them.

Michael Slater, Microprocessor Report   mslater@cup.portal.com

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/17/91)

In article <712@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeff d'Arcy) mentions
the bitfield instructions on the Motorola 88K processor family.

Someone else has kindly e-mailed me a description of the 88K bitfield
instructions. They are truly remarkable animals, except for some reason
the variable versions take *both* the offset and field width (as bitfields)
from the same register.

In other words, if you were dealing with a variable bitfield, you'd need
another sequence of bitfield instructions (using fixed, immediate values for
the width and offset) to set up the operands for the first one! I'd venture
to suggest this would nullify any performance gain from trying to use bitfield
instructions in the first place, except that Jeff D'Arcy also mentions that
the 88K doesn't have any shift or mask operators, so you don't have the
choice.

Why did they design the 88K instructions to pack two operands into
one register?

Lawrence D'Oliveiro                       fone: +64-71-562-889
Computer Services Dept                     fax: +64-71-384-066
University of Waikato            electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand    37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
Be kind to plants--eat more herbivores.

kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (04/17/91)

In article <1991Apr17.174918.3458@waikato.ac.nz>, ldo@waikato.ac.nz
(Lawrence D'Oliveiro, Waikato University) writes:
|> In article <712@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeff d'Arcy) mentions
|> the bitfield instructions on the Motorola 88K processor family.
|> 
|> Someone else has kindly e-mailed me a description of the 88K bitfield
|> instructions. They are truly remarkable animals, except for some reason
|> the variable versions take *both* the offset and field width (as bitfields)
|> from the same register.
|> 
|> In other words, if you were dealing with a variable bitfield, you'd need
|> another sequence of bitfield instructions (using fixed, immediate values for
|> the width and offset) to set up the operands for the first one!
|> 
|> Why did they design the 88K instructions to pack two operands into
|> one register?
|> 

I don't remember any case where having them in the same register (you want
them in 2 registers?) was inefficient.  Furthermore, the instruction set
encoding has space for 3 register specifiers -- source, destination and
width/offset in this case.  There would be no way to encode width and offset
separately.

-----------------------------------------------------------------------------
==	jeff kenton		Consulting at kenton@decvax.dec.com        ==
==	(617) 894-4508			(603) 881-0011			   ==
-----------------------------------------------------------------------------

jdarcy@seqp4.ORG (Jeff d'Arcy) (04/17/91)

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
>Someone else has kindly e-mailed me a description of the 88K bitfield
>instructions. They are truly remarkable animals, except for some reason
>the variable versions take *both* the offset and field width (as bitfields)
>from the same register.

When I saw this in the manual, I wondered why one would ever use them.  I
eventually found a use; when emulating misaligned accesses I needed to do
the effective address calculation manually, and for the scaled addressing
mode this was most easily accomplished by extracting the instruction size
into a register and then using that register as an argument to the next
make-bitfield instruction.

>In other words, if you were dealing with a variable bitfield, you'd need
>another sequence of bitfield instructions (using fixed, immediate values for
>the width and offset) to set up the operands for the first one! I'd venture
>to suggest this would nullify any performance gain from trying to use bitfield
>instructions in the first place, except that Jeff D'Arcy also mentions that
>the 88K doesn't have any shift or mask operators, so you don't have the
>choice.

Unlike your name, mine has a small "d".  I'm sure you understand.

Secondly, the 88K *does* have mask operations, lacking only the shifts which
are merely subsets of the make/extract bitfield instructions.  There is *no*
shift operation which cannot be simulated in a single instruction this way.

As for the variable-argument problem, the extra instructions to set up the
operand would be present using shifts instead of bitfield ops.  As it is,
Motorola wisely allowed a width of 0 to mean "all bits to the left of the
offset" (left = MSB), and the offset for the bitfield ops is in the *low*
order five bits of the register operand.  Consider the following:

	; 88K notation is dst,src1,src2 - src1 is always a register
	; extract (unsigned) bits 26-27 of r10 into r11 (width 2, offset 26)
	extu	r11,r10,2<26>

	; left shift r12 according to r11
	mak	r12,r12,r11

After the first instruction, r11 contains:

	  3         2         1
	 1098765432109876543210 98765 43210
	+----------------------+-----+-----+
	|0000000000000000000000|00000|000xx|
	+----------------------+-----+-----+

The field in the middle (bits 5-9) is the bitfield width; the field on
the right (bits 0-4) is the bitfield offset.  The second instruction
therefore takes bits 0-31 (width of 0 gets translated to 32) of r12 and
puts them in r12 starting at position <xx> (the extracted field from
r10); excess bits on the left are of course discarded.

The point of this detailed explanation is thus: I don't know of any CPU
that can do such a manipulation in *fewer* than two instructions, and
I'd be rather surprised to find one.  Hence, the "inconvenience" of
Motorola's encoding is actually a win.
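The packed-operand convention described above can be modelled in C roughly as follows (a sketch reconstructed from this thread's description only, so treat the details as assumptions):

```c
#include <stdint.h>

/* Model of the 88K variable-form extu/mak: the third operand carries
 * the width in bits 5-9 and the offset in bits 0-4, with a width of 0
 * meaning 32.  Sketch only, based on the description in this thread. */
static uint32_t bf_mask(unsigned w)         /* w in 1..32 */
{
    return (w >= 32) ? 0xFFFFFFFFu : ((1u << w) - 1u);
}

static uint32_t extu88(uint32_t src, uint32_t wo)
{
    unsigned o = wo & 31u, w = (wo >> 5) & 31u;
    if (w == 0) w = 32;                     /* width 0 means 32 */
    return (src >> o) & bf_mask(w);
}

static uint32_t mak88(uint32_t src, uint32_t wo)
{
    unsigned o = wo & 31u, w = (wo >> 5) & 31u;
    if (w == 0) w = 32;
    return (src & bf_mask(w)) << o;         /* excess high bits fall off */
}
```

With the width bits zero, mak88() degenerates to a plain left shift, which is the case the mak example above relies on.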

(Any factual errors are the result of my not having an 88K manual handy,
since I don't work with the 88K any more.)
-- 

  Jeff d'Arcy, Generic MTS, Sequoia Systems Inc. <jdarcy%seqp4@m2c.org>
Yonder nor sorghum stenches shut ladle gulls stopper torque wet strainers

hamilton@siberia.rtp.dg.com (Eric Hamilton) (04/17/91)

In article <1991Apr17.174918.3458@waikato.ac.nz>, ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
|> 
|> Someone else has kindly e-mailed me a description of the 88K bitfield
|> instructions. They are truly remarkable animals, except for some reason
|> the variable versions take *both* the offset and field width (as bitfields)
|> from the same register.
|> 
|> In other words, if you were dealing with a variable bitfield, you'd need
|> another sequence of bitfield instructions (using fixed, immediate values for
|> the width and offset) to set up the operands for the first one! I'd venture
|> to suggest this would nullify any performance gain from trying to use bitfield
|> instructions in the first place, except that Jeff D'Arcy also mentions that
|> the 88K doesn't have any shift or mask operators, so you don't have the
|> choice.

The 88000 has shift operations.
It gives you all flavors (left, right, arithmetic, logical, constant or
variable shift count) in one clock.  They're special cases of the
bitfield operations, so they may not be recognized as shifts, but they are.

The 88000 also has mask operators (rd,rs1,rs2 are registers, imm16 a 16-bit
literal):

	and	rd,rs1,rs2	rd = rs1 & rs2;
	and.c	rd,rs1,rs2	rd = rs1 & ~rs2
	and	rd,rs1,imm16	rd = rs1 & (0xffff0000 | imm16)
	and.u	rd,rs1,imm16	rd = rs1 & (0x0000ffff | (imm16 << 16))
	mask	rd,rs1,imm16	rd = rs1 & imm16
	mask.u	rd,rs1,imm16	rd = rs1 & (imm16 << 16)

as well as the bit-field operations.

Packing the operands of the variable
bit-field operations into one register doesn't really impose much performance
penalty (all instructions are single-cycle):

Any shift (constant or variable shift count):		1 instruction.
Any constant width and offset bitfield operation:	1 instruction.
Any variable width, zero offset operation:		2 instructions.
Any variable width, non-zero offset operation:		3 instructions.
Any variable width, variable offset operation:		3 instructions.  

These are not the penalties - they're the total instructions to perform the operation.
They're pretty good, especially if you consider the dynamic frequency
of the variable width and offset operations.  I think you'll be hard-pressed
to do better with shift-and-mask sequences.

|> 
|> Why did they design the 88K instructions to pack two operands into
|> one register?
|> 
With the operands packed, the bit field operations are already three-register
instructions:  source, destination, and width/offset specifier.  For a
single-cycle operation, that's about all that's reasonable - five more
bits in the instruction to specify another register, and another read port
in the register file, just so we can do variable-width/length operations
in one or two fewer cycles?  That can't be right. 

-- 
----------------------------------------------------------------------
Eric Hamilton				+1 919 248 6172
Data General Corporation		hamilton@dg-rtp.rtp.dg.com
62 Alexander Drive			...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC  27709, USA

torbenm@diku.dk (Torben Ægidius Mogensen) (04/18/91)

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:

> What do RISC designers think of bitfield instructions? (A la
> the VAX and 680x0 [x >= 2] processors: extract so many bits at
> a certain offset, insert so many bits at a certain offset, etc.)
> 
> Do other people find these instructions useful?
> 

As John Bowler pointed out in another article, bit masking
instructions can be simulated with a few shift and logical
instructions, and you oten don't need the full generality of the
masks, as the bitfields tend to be at one end or other of the word.

What is (often?) needed in graphics systems is to convert raster
arrays from one number of bits per pixel to another, for example 1 bit
per pixel to two bits per pixel. With the normal instruction selection
on processors, this requires handling each pixel on its own, making
the operation quite slow.

If you had bit ZIP and UNZIP instructions as described below, the
operations could be done on whole words (or half words), at least if
the bits per pixel are powers of two.

A ZIP instruction takes two words abcde..., ABCDE... and merges the
bits into a double word aAbBcCdDeE... (or two half words into a word)

An UNZIP instruction is the inverse of ZIP, and thus takes a double
word aAbBcCdDeE... and produces two words abcde..., ABCDE... (or takes
a word and produces two half words).

Thus if you want to take a one-bit-per-pixel word and produce a
two-bit-per-pixel double word, you ZIP it with 0, -1 or itself, which
would yield the following colour translations:

	X ZIP 0:   0 -> 00, 1 -> 10
	0 ZIP X:   0 -> 00, 1 -> 01
	X ZIP -1:  0 -> 01, 1 -> 11
	-1 ZIP X:  0 -> 10, 1 -> 11
	X ZIP X:   0 -> 00, 1 -> 11

Other translations can be obtained using logical operations on the
result. Conversion of two-bit-per-pixel words to four-bit-per-pixel
double words can also be done, but specific colour translation is more
tricky.

To convert from a two-bit-per-pixel double word to a one-bit-per-pixel
word, you UNZIP it into two words. Specific colour translations can be
obtained using logical operations between the two words (and possibly
constants).

ZIP and UNZIP can be implemented very efficiently (they are just
crossing wires), but will take up rather a lot of space (similar to a
barrel shifter I would guess), so it would probably only be worthwhile
on graphics processors. You could also just implement EXPAND and
CONTRACT, and use shifts and logical operations to get the full ZIP
and UNZIP functionality. EXPAND ZIPs a word with zero, and CONTRACT is
an UNZIP where only one of the outputs is generated.
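In software, ZIP of two 16-bit halves can be written with the usual bit-spreading trick (a sketch of the operation described above, not an existing instruction; function names are mine):

```c
#include <stdint.h>

/* Spread the low 16 bits of x so bit i moves to bit 2*i. */
static uint32_t spread16(uint32_t x)
{
    x &= 0xFFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* ZIP of two 16-bit half words into one word: a's bit i lands in the
 * higher bit of each pair, so a ZIP 0 maps 0 -> 00 and 1 -> 10, as in
 * the translation table above. */
static uint32_t zip16(uint32_t a, uint32_t b)
{
    return (spread16(a) << 1) | spread16(b);
}
```

EXPAND as defined above is then just zip16(x, 0).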

Does anyone know of any processors (graphics or general purpose) that
have such instructions?

	Torben Mogensen (torbenm@diku.dk)

spot@CS.CMU.EDU (Scott Draves) (04/18/91)

In article <1991Apr18.093804.18183@odin.diku.dk> torbenm@diku.dk (Torben Ægidius Mogensen) writes:

   What is (often?) needed in graphics systems is to convert raster
   arrays from one number of bits per pixel to another, for example 1 bit
   per pixel to two bits per pixel.

this rarely happens.  1 -> 8 or 1 -> 24 happens on occasion, but i
doubt any of these are significant compared to blitting and rendering.

   With the normal instruction selection
   on processors, this requires handling each pixel on its own, making
   the operation quite slow.

when it does happen, you can use look up tables. to do your 1 -> 2
bits per pixel example, i would use a table of 256 entries, 16 bits
each.

It also works for bitmap magnification.

   A ZIP instructions takes two words abcde..., ABCDE... and merges the
   bits into a double word aAbBcCdDeE... (or two half words into a word)

these can be done with the look up tables in a similar fashion.  make
a table called halfzip that spaces out the bits of a word:
halfzip[abcd] = 0a0b0c0d.  then your ZIP(i,j) = halfzip[i] |
(halfzip[j] << 1).  Or you can use two separate tables to avoid the
shift.

   An UNZIP instruction is the inverse of ZIP, and thus takes a double
   word aAbBcCdDeE... and produces two words abcde..., ABCDE... (or takes
   a word and produces two half words).

I can't think of a similar game for unzip, but....

   to convert from a two-bit-per-pixel double word to a one-bit per pixel
   you UNZIP it into two words. Specific colour translations can be
   obtained using logical operations between the two words (and possibly
   constants).

...this is a very bad way to convert from 2 bit to 1 bit.  Dithering
should be used if you want the result to look good at all.  converting
down to 1 bit is even less common than converting from 1 bit.
   
   You could also just implement EXPAND and
   CONTRACT, and use shifts and logical operations to get the full ZIP
   and UNZIP functionality. EXPAND ZIPs a word with zero, and CONTRACT is
   an UNZIP where only one of the outputs is generated.

EXPAND = halfzip
--

			christianity is stupid
Scott Draves		communism is good
spot@cs.cmu.edu		give up

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/18/91)

In article <715@seqp4.UUCP>, jdarcy@seqp4.ORG (Jeff d'Arcy) writes:

> Unlike your name, mine has a small "d".  I'm sure you understand.

Whoops! After all the fuss I make about people spelling *my* name
correctly, please accept my humble apologies.

  [detailed example of emulating power-of-two operand scaling
   using just two bitfield instructions, ending with]
> Hence, the "inconvenience" of Motorola's encoding is actually a win.

I agree you haven't lost anything in this example, but I don't see
that you've gained anything, either, other than saving a register.
Your example works OK on this architecture, mine would work less
well, and that's all.

Todd Bridges (toddb@illini.sps.mot.com) has, if I understand him
correctly, given me what I think the real reason is for these
instructions being defined the way they are: 88K instructions can't
have more than three operands.

Lawrence D'Oliveiro                       fone: +64-71-562-889
Computer Services Dept                     fax: +64-71-384-066
University of Waikato            electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand    37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
Whoever said "There is no such thing as a free lunch" obviously never
heard of the Buy-No-Meal Theorem.

henry@zoo.toronto.edu (Henry Spencer) (04/19/91)

In article <1991Apr18.093804.18183@odin.diku.dk> torbenm@diku.dk (Torben Ægidius Mogensen) writes:
>A ZIP instructions takes two words abcde..., ABCDE... and merges the
>bits into a double word aAbBcCdDeE... (or two half words into a word)
>An UNZIP instruction is the inverse of ZIP...

Unless access to memory is seriously expensive, table lookup can implement
both of these operations quite efficiently.  Agreed, it's not as fast as
having them wired in...
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

glew@pdx007.intel.com (Andy Glew) (04/19/91)

    Someone else has kindly e-mailed me a description of the 88K bitfield
    instructions. They are truly remarkable animals, except for some reason
    the variable versions take *both* the offset and field width (as bitfields)
    from the same register.

    In other words, if you were dealing with a variable bitfield, you'd need
    another sequence of bitfield instructions (using fixed, immediate values for
    the width and offset) to set up the operands for the first one! I'd venture
    to suggest this would nullify any performance gain from trying to use bitfield
    instructions in the first place, except that Jeff D'Arcy also mentions that
    the 88K doesn't have any shift or mask operators, so you don't have the
    choice.

    Why did they design the 88K instructions to pack two operands into
    one register?

I hate to defend the 88K, now that I have moved from Motorola to
Intel, but setting up the dynamic operands in a register is acceptable
for one of the biggest uses of the bitfield operations - bitblt, where
the amount of the shift is known before the loop begins.  Setup once.


--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

graeme@labtam.labtam.oz (Graeme Gill) (04/19/91)

In article <SPOT.91Apr18123711@WOOZLE.GRAPHICS.CS.CMU.EDU>, spot@CS.CMU.EDU (Scott Draves) writes:
> In article <1991Apr18.093804.18183@odin.diku.dk> torbenm@diku.dk (Torben Ægidius Mogensen) writes:
> 
>    What is (often?) needed in graphics systems is to convert raster
>    arrays from one number of bits per pixel to another, for example 1 bit
>    per pixel to two bits per pixel.
> 
> this rarely happens.  1 -> 8 or 1 -> 24 happens on occasion, but i
> doubt any of these are significant compared to blitting and rendering.

    Consider that many applications are written to run on monochrome
devices, so they use single level bitmaps and fonts to encode graphic
objects. When rendered on a colour device, the 1->8 or 1->24 bit
expansion is used a great deal.

> when it does happen, you can use look up tables. to do your 1 -> 2
> bits per pixel example, i would use a table of 256 entries, 16 bits
> each.

    In fast hardware the memory bandwidth is the limit. Using lookup tables
doubles the number of memory accesses, thereby halving the ultimate speed.
Hardware support for 1->power_of_two expansion is a desirable feature
in a graphics processor.


	Graeme Gill
	Labtam Australia

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/19/91)

In article <1991Apr18.195841.10588@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

| Unless access to memory is seriously expensive, table lookup can implement
| both of these operations quite efficiently.  Agreed, it's not as fast as
| having them wired in...

  This is true, but the original topic was somewhat more general,
breaking N bit chunks into parts and assembling them. In the case where
N is a power of two I think the table lookup loses if I want the data
elsewhere. I can generate it faster than I can read it from memory (on
most machines). 

  The problem comes when I want to go from words back to byte packing.
In this case a lookup doesn't help, and I have to fetch, shift, and or
the bits regardless of whether I do it with an instruction or a loop.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

henry@zoo.toronto.edu (Henry Spencer) (04/20/91)

In article <3357@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  The problem comes when I want to go from words back to byte packing.
>In this case a lookup doesn't help, and I have to fetch, shift, and or
>the bits regardless of whether I do it with an instruction or a loop.

I'm not quite sure what you mean by "go from words back to byte packing".
If you're referring to the "unzip" instruction, transforming AaBbCcDd to
ABCDabcd, note that you can do the shifts when generating tables.  You
look up AaBb in *two* tables, one giving you AB-- and one giving ab--,
and look up CcDd in tables that give you --CD and --cd.  You still get
to OR them together yourself, however.
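For an 8-bit chunk, the pre-shifted tables Henry describes might be built like this (a sketch; the nibble granularity and the table names are my choices):

```c
#include <stdint.h>

/* Pre-shifted tables for an 8-bit unzip (AaBbCcDd -> ABCDabcd), with
 * the shifts baked in at table-build time as described above.  Four
 * 16-entry tables, indexed by the high or low nibble. */
static uint8_t hi_cap[16], hi_low[16], lo_cap[16], lo_low[16];

static void init_unzip(void)
{
    for (int n = 0; n < 16; n++) {
        /* A nibble XxYy: bits 3,1 are the "capital" stream and
         * bits 2,0 the lower-case stream. */
        unsigned cap = (((n >> 3) & 1) << 1) | ((n >> 1) & 1);
        unsigned low = (((n >> 2) & 1) << 1) | (n & 1);
        hi_cap[n] = (uint8_t)(cap << 6);    /* AB------ */
        hi_low[n] = (uint8_t)(low << 2);    /* ----ab-- */
        lo_cap[n] = (uint8_t)(cap << 4);    /* --CD---- */
        lo_low[n] = (uint8_t)low;           /* ------cd */
    }
}

/* Two lookups per nibble, then OR the pieces together yourself,
 * exactly as in the description. */
static uint8_t unzip8(uint8_t x)
{
    return (uint8_t)(hi_cap[x >> 4] | hi_low[x >> 4]
                   | lo_cap[x & 15] | lo_low[x & 15]);
}
```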

How the tradeoffs go on this depends a lot on memory access speeds (and
on whether your caches are big enough to contain your lookup tables!),
and also on whether the operations being done are regular enough to lend
themselves to table generation, but table lookup is always worth considering
for messy bit operations.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

rex@cs.su.oz (Rex Di Bona) (04/22/91)

In article <10408@labtam.labtam.oz> graeme@labtam.labtam.oz (Graeme Gill) writes:
>     Consider that many applications are written to run on monochrome
> devices, so they use single level bitmaps and fonts to encode graphic
> objects. When rendered on a colour device, the 1->8 or 1->24 bit
> expansion is used a great deal.

This is true, but misleading. You do not want to convert 0 -> 00000000 and
1 -> 11111111, but 0-> some colour, and 1-> some other colour. Both of
these colours are user selectable. This requires more hardware support
than just a bit stretcher.
 
> > when it does happen, you can use look up tables. to do your 1 -> 2
> > bits per pixel example, i would use a table of 256 entries, 16 bits
> > each.
> 
>     In fast hardware the memory bandwidth is the limit. Using lookup tables
> doubles the number of memory accesses, thereby halving the ultimate speed.
> Hardware support for 1->power_of_two expansion is a desirable feature
> in a graphics processor.
> 
> 
> 	Graeme Gill
> 	Labtam Australia

You should be able to do all of the rendering in software, keeping the
intermediate values in registers. This requires a better layout of the
memory for the graphics 'screen'. You could use a lookup table for additional
speed.  As an example, if your video memory was 8 bits deep, and laid out so
that 4 pixels occupied a 32-bit word, you could construct a 16-element
table (each entry 32 bits wide) with the appropriate spaces filled in.  To set
up the table would require 16 memory stores, but your rendering would
be twice as fast (one read and one write, as opposed to 4 writes with the
byte-at-a-time method).  You could remove the read by having all 16 values
stored in registers, and do a branch to the appropriate store, or other
hardware nasties.

The one instruction I keep wishing for when doing this sort of graphic
operation is a rotate instruction. It can be used for line drawing, as
a counter for the above routine (it wouldn't require reloading for each
loop however) and does save instructions... I have found that normal shifts
are fine for bitblts, even those on random boundaries and for random widths.
I would suspect the additional overheads of determining when to use the 88K
type bitfield operations would swamp the usefulness of them.
--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral

peter@ficc.ferranti.com (peter da silva) (04/22/91)

In article <SPOT.91Apr18123711@WOOZLE.GRAPHICS.CS.CMU.EDU>, spot@CS.CMU.EDU (Scott Draves) writes:
> when it does happen, you can use look up tables. to do your 1 -> 2
> bits per pixel example, i would use a table of 256 entries, 16 bits
> each.

Wouldn't this tend to be a cache-basher? Particularly a problem when you're
already bashing the cache with the bitmap anyway?
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

sef@kithrup.COM (Sean Eric Fagan) (04/23/91)

In article <PUWA8R4@xds13.ferranti.com> peter@ficc.ferranti.com (peter da silva) writes:
>Wouldn't this tend to be a cache-basher? Particularly a problem when you're
>already bashing the cache with the bitmap anyway?

Perhaps that's where a 'don't cache' attribute on a load/store might be
useful...  You could probably do it now by telling the OS not to cache the
page your data are in (although that won't work on all systems).

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

henry@zoo.toronto.edu (Henry Spencer) (04/23/91)

In article <1991Apr23.053619.13474@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:
>>Wouldn't this tend to be a cache-basher? Particularly a problem when you're
>>already bashing the cache with the bitmap anyway?
>
>Perhaps that's where a 'don't cache' attribute on a load/store might be
>useful... 

Anybody who caches a frame buffer is crazy.  Especially if the cache isn't
write-through, in which case your frame-buffer updates show up on the screen
some arbitrary time later!

Table-lookup algorithms (the original topic) can be somewhat hard on caches,
although the impact is minimized if the tables are small (i.e. lookup by
bytes, not by shorts!) and the cache is large.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/24/91)

In article <1991Apr23.152155.2298@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

| Anybody who caches a frame buffer is crazy.  Especially if the cache isn't
| write-through, in which case your frame-buffer updates show up on the screen
| some arbitrary time later!

  The second part of this is true, but in a system which writes data to
the bus aggressively, the arbitrary time is in ms, not seconds or
minutes.

  Write-back caches seem to fall into two groups: those which try to unload
the bus by writing to memory only when the cache is getting full, and
aggressive ones which write whenever the bus is not busy. I played
with caching a frame buffer, and it had a positive effect on
performance, since calculation and writing of points were overlapped.

  Obviously a bank selected frame buffer can't (easily) be cached, but
other than that I don't see any reason why cache won't work nicely. I
don't think a write-through cache would make sense because *most*
programs don't read back the buffer.

  I agree with the conclusion that caching a frame buffer is not
cost-effective in most applications, but I'm not quite as firmly set against
it as Henry.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

johnl@iecc.cambridge.ma.us (John R. Levine) (04/24/91)

In article <1991Apr23.152155.2298@zoo.toronto.edu> you write:
>Anybody who caches a frame buffer is crazy. ...

Given how slow some of the frame buffers on PC clones are, upwards of 10
wait states per 16 bit word, caching could be a big performance win.

It should probably be write-through, though I can even imagine situations
where you're redrawing an entire window and, so long as you can flush it
when you're done, a little smudge during the repaint due to a write-back
cache wouldn't be offensive.

Recent models of screen buffer do remap all of the planes at the same
place in the address space, which makes a lot of this moot unless you can
do a page flush without a context switch.  But even that's possible on the
386 -- it does let you enable individual device addresses for user I/O.

Regards,
John Levine, johnl@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl

graeme@labtam.labtam.oz (Graeme Gill) (04/24/91)

In article <2325@cluster.cs.su.oz.au>, rex@cs.su.oz (Rex Di Bona) writes:
> 
> This is true, but misleading. You do not want to convert 0 -> 00000000 and
> 1 -> 11111111, but 0-> some colour, and 1-> some other colour. Both of
> these colours are user selectable.

	The ideal graphics support would be an expand instruction with
foreground and background colour registers, but a 0 -> 00000000 and
1 -> 11111111 instruction would still be very useful when internal
operations are many times faster than memory accesses. The expanded
bitmap is used as a mask to merge the foreground and background colours
together (as well as a plane mask perhaps).  

> You should be able to do all of the rendering in software, keeping the
> intermediate values in registers. This requires a better layout of the
> memory for the graphics 'screen'. You could use a lookup table for additional
> speed.  As an example, if your video memory was 8 bits deep, and laid out so
> that 4 pixels occupied a 32-bit word, you could construct a 16-element
> table (each entry 32 bits wide) with the appropriate spaces filled in.  To set
> up the table would require 16 memory stores, but your rendering would
> be twice as fast (one read and one write, as opposed to 4 writes with the
> byte-at-a-time method).  You could remove the read by having all 16 values
> stored in registers, and do a branch to the appropriate store, or other
> hardware nasties.

	Umm. Packed frame stores are used because they speed up other very
important operations like fill and copy. Currently I do an expand operation
like this:

	Read 32 bits of source (word read)
	Lookup 8 bits of the source in the expand table (double word read)
	Lookup 8 bits of the source in the expand table (double word read)
	Write to the destination (quad word write).
	Lookup 8 bits of the source in the expand table (double word read)
	Lookup 8 bits of the source in the expand table (double word read)
	Write to the destination (quad word write).
	
With expand support I could do this:

	Read 32 bits of source (word read)
	Write to the destination (quad word write).
	Write to the destination (quad word write).

> I would suspect the additional overheads of determining when to use the 88K
> type bitfield operations would swamp the usefulness of them.

	When speed is important, this sort of routine is often pre-compiled
for various cases (e.g. alignment, direction), so one selects the appropriate
routine, not the appropriate instruction.

	Graeme Gill
	Labtam Australia

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/24/91)

In article <9104231527.AA09611@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:

| Recent models of screen buffer do remap all of the planes at the same
| place in the address space, which makes a lot of this moot unless you can
| do a page flush without a context switch.  But even that's possible on the
| 386 -- it does let you enable individual device addresses for user I/O.

  I have been waiting for frame buffers to come out with an option to
map the whole thing in extended memory, say between 15-16MB. I predicted
this as a popular enhancement when EGA first came out, and I was wrong.
I find it hard to believe that no one has done this, since even a 286 can
use it, but I don't know of anyone who has.

  And I'm told that the new XGA from IBM is still bank swapping so they
can use an 8086. Slavish dedication to being backward. Er, backward
compatibility I meant, didn't I?
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

sef@kithrup.COM (Sean Eric Fagan) (04/25/91)

In article <3365@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  I have been waiting for frame buffers to come out with an option to
>map the whole thing in extended memory, say between 15-16MB. I predicted
>this as a popular enhancement when EGA first came out, and I was wrong.
>I find it hard to belive that no one has done this, since even a 286 can
>use it, but I don't know of anyone who has.

I have an old Cornerstone card and monitor, capable of 1600x1280x1.  Really
nice.  It maps into memory as a 512k chunk of memory (I know, I know, that
isn't 512k worth of graphics).  It's got one minor problem making it useless
for me (namely, reads of the odd bytes aren't possible), but they had the
right idea.

There is at least one video card for EISA which maps into 0xd0000000, I
believe.  Takes a megabyte there.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

henry@zoo.toronto.edu (Henry Spencer) (04/25/91)

In article <9104231527.AA09611@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>>Anybody who caches a frame buffer is crazy. ...
>
>Given how slow some of the frame buffers on PC clones are, upwards of 10
>wait states per 16 bit word, caching could be a big performance win.

It's only a win if you are reading that word *repeatedly*.  This isn't
too likely to be a major performance issue.  If you are twiddling around
in a small area, odds are good that your performance is dominated by
overhead and not by frame-buffer accesses.  If you're doing something
big, it's probably being done a word at a time without repeated accesses
to the same word, and it's also probably going to bust your cache if
you cache those accesses.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

scottw@ico.isc.com (Scott Wiesner) (04/25/91)

From article <1991Apr24.173400.13397@zoo.toronto.edu>, by henry@zoo.toronto.edu (Henry Spencer):
> In article <9104231527.AA09611@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>>>Anybody who caches a frame buffer is crazy. ...
>>
>>Given how slow some of the frame buffers on PC clones are, upwards of 10
>>wait states per 16 bit word, caching could be a big performance win.
> 
> It's only a win if you are reading that word *repeatedly*.  This isn't
> too likely to be a major performance issue.  

Unfortunately, there's a bigger problem lurking here.  Most of the
"frame buffers" on PC's aren't really a frame buffer in the traditional
workstation sense.  Most don't map all their video memory into the
CPU's address space.  They usually map 64k, and use some I/O registers
to do bank switching.  Accesses to the same address may thus be looking at
different parts of video memory.

Also, things like EGAs and VGAs are even more complicated, with 4 planes
of memory being shadowed at the same address, with an I/O port selecting
which plane is of interest on a read, and another I/O port selecting
what action should be taken on a write.  (Actions include masking with
a plane mask, a bit mask, 4 choices of raster op, etc.)  You definitely
don't want to be doing standard caching on one of these cards.

Scott Wiesner
Interactive Systems
X Development Group

andreww@uniwa.uwa.oz (Andrew John Williams) (04/26/91)

Not really anything to do with this - but hows this for 'orrible...
I picked up the 1980 Nat Semi MOS book recently, and had a look at the
COPS420 (or whatever it was) specs. No shift instruction. The way they
suggested was to swap the accumulator out to the shift register (also
connected to the outside world) and wait. The shift register was
synchronised to the instruction cycle, so to shift one bit you swapped A
back in the next instruction. For a 2 bit shift, you inserted a NOP (or
other such useful instruction). Don't worry though - the maximum delay
was 4 instructions, 'cos that's how big A was. And people complain about
the 8086!

John West (stealing Andrew's account)
Do fish need aqualungs?