[comp.arch] overview of HP-PA Bitfield insts.

baum@apple.com (Allen Baum) (04/17/91)

With all the postings about bitfield ops, I thought I'd better give a
quick and dirty overview of HP-PA's operations. The rationale for having
bitfield ops at all was partly bitmap data, partly Pascal packed structures,
partly doing FP in software, & a lot of OS packed data structures.....

Extract [Signed] [Variable] - 
   Right justify a field, given a bit Position of the field's LSB as an
   immediate, and the field length as an immediate.
   The [Signed] option sign-extends instead of zero extends.
   The [Variable] option takes the LSB bit position from a special register.

This instruction can be used for right shifts of various flavors. If
(LSB pos + leng) > 32, then the MSB of the reg is used as the MSB of the field.
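
A rough C model of the Extract behavior described above may help. It is only
a sketch: the name extract_field, its parameter names, and the choice to count
bit positions from the least significant end (bit 0 = LSB) are assumptions made
for illustration, and the real instruction encoding may number bits differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of Extract [Signed].
 * pos: position of the field's LSB, counted from bit 0 = LSB (assumed), 0..31
 * len: field length in bits, 1..32
 */
static uint32_t extract_field(uint32_t src, unsigned pos, unsigned len,
                              bool sign_extend)
{
    /* If the field would run past the top of the register, the
     * register's MSB becomes the field's MSB. */
    if (pos + len > 32)
        len = 32 - pos;

    uint32_t field = src >> pos;          /* right-justify the field */
    if (len < 32) {
        uint32_t mask = (1u << len) - 1u;
        field &= mask;                    /* zero-extend */
        if (sign_extend && ((field >> (len - 1)) & 1u))
            field |= ~mask;               /* copy the field's top bit upward */
    }
    return field;
}
```

With an over-long length, the unsigned form behaves like a logical right shift
by pos and the signed form like an arithmetic right shift by pos.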

The rationale for having [variable] take a position out of a register is:
- expected critical path in the ALU (possibly not the case in CMOS)
- when a variable position is used, it's often used over & over, so having
  to move the position to a special register is not a significant overhead.

Note that there is no variable length. We didn't see it happening enough to
put it in, and in the rare cases it occurs, a case branch can be used.

[Zero & ][Variable] Deposit [Immed.] -
   Take a Right justified field and put it back into its (unjustified)
   position, given a bit position of the LSB as an immediate,
   and a Field Length as an immediate.

   The [Zero] option clears the destination register, otherwise bits that
   are not part of the field are left unchanged. 
   The [Variable] option takes the LSB bit position from a special register.
   The [Immed] option uses a five bit signed immediate as a source.

With the [Zero] option, this can be used for left shifts.
With the Immediate option, this can be used as a set or clear bit (or field),
 (fairly common in OS structures).
Using both options, constants can be constructed.
Since something like 80% of immediates are in the range -16:+16,
 this restricted immediate is still quite useful. 
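
The Deposit variants can be modeled in the same spirit. This is a sketch under
assumptions, not the instruction definition: deposit_field and its parameters
are hypothetical names, bit positions are counted from bit 0 = LSB, and the
immediate form is modeled simply by passing a small signed constant as the
source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of [Zero &] Deposit.
 * dest:       current contents of the destination register
 * src:        right-justified field (or a small signed immediate)
 * pos:        position of the field's LSB, bit 0 = LSB (assumed), 0..31
 * len:        field length in bits, 1..32
 * zero_first: models the [Zero] option
 */
static uint32_t deposit_field(uint32_t dest, uint32_t src, unsigned pos,
                              unsigned len, bool zero_first)
{
    if (pos + len > 32)                   /* clip the field at the register's MSB */
        len = 32 - pos;

    uint32_t mask = (len < 32) ? ((1u << len) - 1u) : 0xFFFFFFFFu;
    uint32_t base = zero_first ? 0u : dest;

    /* Clear the field's slot, then drop the low len bits of src into it. */
    return (base & ~(mask << pos)) | ((src & mask) << pos);
}
```

In this model, zero-and-deposit with a long field acts as a left shift by pos,
depositing an immediate of -1 or 0 sets or clears a field, and combining both
options builds a constant such as 0xFF0 from the immediate -1.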

[Variable] Shift Double
  Concatenate two registers, right shift the 64-bit result by an amount
  specified by an immediate, and store the 32 low-order bits into the result.
 
   The [Variable] option takes shift amount from a special register.

This was intended to be used to justify/position fields that crossed word
boundaries.
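
As a sketch of what Shift Double computes, the following C function models the
concatenate-and-shift step. The name shift_double and the argument order (high
word first) are assumptions for illustration, and the shift amount is assumed
to lie in 0..31.

```c
#include <stdint.h>

/* Hypothetical model of Shift Double: treat hi:lo as one 64-bit value,
 * shift it right by amount, and keep the low 32 bits. */
static uint32_t shift_double(uint32_t hi, uint32_t lo, unsigned amount)
{
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> (amount & 31u));   /* amount assumed 0..31 */
}
```

A field that straddles two words can be right-justified by passing the two
words as hi and lo. In this model, passing the same word as both halves gives
a rotate right.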


While this sounds like a lot of instructions and variations, the hardware
required to implement it is quite simple. Maybe someone from the HP-PA
team can comment on whether it is worthwhile (i.e. used much).

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/17/91)

In article <1991Apr17.180036.3459@waikato.ac.nz> ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:

| The only way I can see around this is to update the width value in
| the instruction itself--using self-modifying code. Hmmm, I seem to
| recall from another discussion some time ago that the HP-PA is one
| of those processors that will correctly invalidate its instruction cache
| if you do any writes to program code in memory. So this should work
| quite nicely.

  Excuse me while I call Ralph on the porcelain intercom... this is a
good argument for an "execute" instruction, allowing you to build the
instruction in a register (hopefully) or the stack (if you must) and
then force an instruction fetch on it.

  No matter how you do it you will slow things down to the point where
it's unlikely to be faster than the mask and shift, unfortunately.

  I like the bitfield instructions in the 32x32 series, which really has
an orthogonal register set, rather than general and arithmetic registers.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/18/91)

In article <51584@apple.Apple.COM>, baum@apple.com (Allen Baum) gives
an overview of the HP-PA bitfield instructions. He mentions that there
are versions which take a variable value for the field offset, but none
which take a variable value for the length:

"We didn't see it happening enough to put it in, and in the rare cases it
occurs, a case branch can be used."

Unfortunately, in the GIF compression/decompression example which
inspired me to start this discussion, an instruction which could
handle variable-length bitfields would work very well indeed. This
is because the symbol size grows, as more symbols are added to the
translation table. It doesn't grow very rapidly (it goes to 10 bits
once you've defined 512 symbols, to 11 bits when you've got 1024,
and so on), but I still wouldn't like to see a case branch taking
up execution time within the compression/decompression loop.
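
To make the growth rule concrete, here is a small C sketch of how the symbol
width follows the table size, together with a plain mask-and-shift extract
that takes the width at run time. The names code_width and extract_var, the
starting width of 9 bits, and the 12-bit cap are assumptions for illustration
only.

```c
#include <stdint.h>

/* Width in bits needed once symbols_defined symbols exist:
 * 9 bits below 512 symbols, 10 from 512, 11 from 1024, and so on.
 * The 9-bit start and the 12-bit cap are assumed here. */
static unsigned code_width(unsigned symbols_defined)
{
    unsigned width = 9;
    while (width < 12 && symbols_defined >= (1u << width))
        width++;
    return width;
}

/* Variable-length extract by mask and shift.
 * pos counts from bit 0 = LSB; pos + width is assumed to be <= 32
 * and width < 32. */
static uint32_t extract_var(uint32_t word, unsigned pos, unsigned width)
{
    return (word >> pos) & ((1u << width) - 1u);
}
```

The width changes only when the table crosses a power of two, so it is
constant for long runs of the inner loop.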

The only way I can see around this is to update the width value in
the instruction itself--using self-modifying code. Hmmm, I seem to
recall from another discussion some time ago that the HP-PA is one
of those processors that will correctly invalidate its instruction cache
if you do any writes to program code in memory. So this should work
quite nicely.

Lawrence D'Oliveiro                       fone: +64-71-562-889
Computer Services Dept                     fax: +64-71-384-066
University of Waikato            electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand    37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
To someone with a hammer and a screwdriver, every problem looks
like a nail with threads.

cs450a03@uc780.umd.edu (04/18/91)

Wm E Davidsen Jr    >
Lawrence D'Oliveiro >|

   [ talking about GIF compression ]
>| The only way I can see around this is to update the width value in
>| the instruction itself--using self-modifying code. ...  HP-PA is
>| one of those processors that will correctly invalidate its
>| instruction cache ...

> ... this is a good argument for an "execute" instruction, allowing
>you to build the instruction in a register ...

Oh, God... I'd LOVE a register based "execute" instruction.  It would
make my loop management code so elegant...  (just one instance of each
looping construct)

>  No matter how you do it you will slow things down to the point where
>it's unlikely to be faster than the mask and shift, unfortunately.

Not Lawrence's example.  He only rarely needs to change the field
size.  That's why he didn't want to pay for a case statement inside
his loop.

Of course, he could just write, say, 8 instances of his compression
code (one for each field size).  Of course, then he'd probably take a
hit in cache performance, but he'd not be doing that "horrible self
modifying code stuff" :-/

Raul Rockwell

dik@cwi.nl (Dik T. Winter) (04/18/91)

In article <3354@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
 > In article <1991Apr17.180036.3459@waikato.ac.nz> ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
 > | The only way I can see around this is to update the width value in
 > | the instruction itself--using self-modifying code. Hmmm, I seem to
 > | recall from another discussion some time ago that the HP-PA is one
 > | of those processors that will correctly invalidate its instruction cache
 > | if you do any writes to program code in memory. So this should work
 > | quite nicely.
 >   Excuse me while I call Ralph on the porcelain intercom... this is a
 > good argument for an "execute" instruction, allowing you to build the
 > instruction in a register (hopefully) or the stack (if you must) and
 > then force an instruction fetch on it.
 >   No matter how you do it you will slow things down to the point where
 > it's unlikely to be faster than the mask and shift, unfortunately.
Not so fast!  You can have an execute instruction with delay slots!  And
as the true instruction has to be stuffed into the decoder, I think
we need (at least) three delay slots.  So we need also three annul bits
to specify which of the following three instructions must be annulled.  It
gets a bit hairy of course if the instruction is an execute instruction again.

More seriously, on the original question.  Indeed, in applications like compress
your bit fields vary in length (offhand I do not know of other applications;
e.g. there is no such need for Huffman coding, as you only know the size once
you know the value).  What I would do if I wrote compress in assembler is write
separate streams of code for 9 bits, 10 bits, etc.  When 9 bits is exhausted
you just jump to the 10-bit stream.  Of course the code will be larger, but the
source only marginally so if you use macros (and the HP assembler is a
macro assembler).
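
The separate-streams idea can be sketched in C, with a preprocessor macro
standing in for the assembler macro. Everything named here (DEFINE_READER,
read_code_9, read_code_10, unpack) is hypothetical. The sketch also assumes
codes are packed least-significant-bit first and that the buffer has at least
two bytes of padding after the last code; real code would switch streams when
the table fills rather than after a fixed count.

```c
#include <stddef.h>
#include <stdint.h>

/* Stamp out one reader per code width, so each stream uses a
 * compile-time-constant field length. */
#define DEFINE_READER(WIDTH)                                              \
    static uint32_t read_code_##WIDTH(const uint8_t *buf, size_t bitpos)  \
    {                                                                     \
        size_t byte = bitpos >> 3;                                        \
        /* 24-bit window covers any code of up to 17 bits. */             \
        uint32_t window = (uint32_t)buf[byte]                             \
                        | ((uint32_t)buf[byte + 1] << 8)                  \
                        | ((uint32_t)buf[byte + 2] << 16);                \
        return (window >> (bitpos & 7)) & ((1u << (WIDTH)) - 1u);         \
    }

DEFINE_READER(9)
DEFINE_READER(10)

/* Read n9 codes in the 9-bit stream, then fall into the 10-bit stream.
 * Returns the number of codes written to out. */
static size_t unpack(const uint8_t *buf, size_t n9, size_t n10, uint32_t *out)
{
    size_t bitpos = 0, k = 0;
    for (size_t i = 0; i < n9; i++, bitpos += 9)
        out[k++] = read_code_9(buf, bitpos);
    for (size_t i = 0; i < n10; i++, bitpos += 10)
        out[k++] = read_code_10(buf, bitpos);
    return k;
}
```

The source grows by one DEFINE_READER line and one loop per width, while the
generated code grows by a full copy of the loop body each time.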
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

vitale@hpcupt1.cup.hp.com (Phil Vitale) (04/19/91)

> (Lawrence D'Oliveiro, Waikato University)
> Hmmm, I seem to recall from another discussion some time ago that the
> HP-PA is one of those processors that will correctly invalidate its 
> instruction cache if you do any writes to program code in memory.

You need to execute a Flush Instruction Cache (FIC) to make sure the old
line is out of the instruction cache.

On HP-UX, page protection prevents mortal users from writing to program code
in memory.  However, mortal users can execute from data in memory.  This
means you have to think about the data cache that may be on your machine.

To handle the data cache, the Flush Data Cache (FDC) instruction should
be used to make sure memory has the latest copy of the modified code.

You only need to use the flush pair once per cache line.  Cache line size
is implementation dependent.

Now, having said that, the flushes may not be necessary under certain
conditions:  (1) You are able to write to program code in memory (special 
page permissions), and (2) the HP-PA implementation you were using has a
unified I and D cache (850 for example).  It's a lot more portable across
HP-PA implementations to use the flushes.

On a related note, the ASPLOS-IV proceedings had a paper on a portable
implementation for self-modifying code.


Phil Vitale
vitale@hpda.cup.hp.com

edwardm@hpcuhe.cup.hp.com (Edward McClanahan) (04/24/91)

Phil Vitale responds:

> > (Lawrence D'Oliveiro, Waikato University)
> > Hmmm, I seem to recall from another discussion some time ago that the
> > HP-PA is one of those processors that will correctly invalidate its 
> > instruction cache if you do any writes to program code in memory.

> You need to execute a Flush Instruction Cache (FIC) to make sure the old
> line is out of the instruction cache.

> To handle the data cache, the Flush Data Cache (FDC) instruction should
> be used to make sure memory has the latest copy of the modified code.

And finally, you'd better do a SYNC instruction (which waits for the FIC
and FDC instructions to "complete").
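
The order of operations described in this exchange can be sketched in C. The
functions fdc_line, fic_line, and sync_caches are hypothetical stand-ins that
only count calls; they are not real intrinsics, and on real hardware each would
be the corresponding instruction. The line size is passed in because it is
implementation dependent, and it is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for FDC, FIC and SYNC. They only count calls. */
static unsigned fdc_calls, fic_calls, sync_calls;
static void fdc_line(uintptr_t addr) { (void)addr; fdc_calls++; }
static void fic_line(uintptr_t addr) { (void)addr; fic_calls++; }
static void sync_caches(void)        { sync_calls++; }

/* After writing new code to [start, start + len), flush each affected
 * cache line once from both caches, then wait for the flushes to finish. */
static void flush_modified_code(uintptr_t start, size_t len, size_t line_size)
{
    if (len == 0)
        return;

    uintptr_t line_mask = ~(uintptr_t)(line_size - 1);
    uintptr_t first = start & line_mask;
    uintptr_t last  = (start + len - 1) & line_mask;

    for (uintptr_t a = first; ; a += line_size) {
        fdc_line(a);      /* push the modified data out to memory */
        fic_line(a);      /* drop the stale copy from the I-cache */
        if (a == last)
            break;
    }
    sync_caches();        /* wait for the flushes to complete */
}
```

A patch that stays within one line costs one flush pair, and one that crosses
a line boundary costs two.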

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

  Edward McClanahan
  Hewlett Packard Company     -or-     edwardm@cup.hp.com
  Mail Stop 42UN
  11000 Wolfe Road                     Phone: (480)447-5651
  Cupertino, CA  95014                 Fax:   (408)447-5039