[comp.arch] Help: Hashing on TLB input?

soohong@gondor.cs.psu.edu (Soohong Kim) (09/09/88)

Could someone please enlighten me by explaining the following?

    In Hwang & Briggs (Computer Architecture and Parallel Processing) it is
said that the virtual page number is first randomized by hashing before
being fed to the TLB.  It is also said that the reason is efficient use
of the low-order TLB entries.  And the VAX/750 seems to avoid hashing by
using a number of high-order bits and 5 low-order bits to select the set
in its set-associative TLB.

    My questions are:
(1) Why is it inefficient to use the Virtual page number directly?
(2) How could hashing or bit-combination help in improving the efficiency?

Thank you.
---Soohong

news@amdcad.AMD.COM (Network News) (09/09/88)

In article <3907@psuvax1.cs.psu.edu> soohong@gondor.cs.psu.edu (Soohong Kim) writes:
| Could someone please enlighten me by explaining the following?
| 
|     In Hwang & Briggs (Computer Architecture and Parallel Processing) it is
| said that the virtual page number is first randomized by hashing before
| being fed to the TLB.  It is also said that the reason is efficient use
| of the low-order TLB entries.  And the VAX/750 seems to avoid hashing by
| using a number of high-order bits and 5 low-order bits to select the set
| in its set-associative TLB.
| 
|     My questions are:
| (1) Why is it inefficient to use the Virtual page number directly?
| (2) How could hashing or bit-combination help in improving the efficiency?
| 
| Thank you.
| ---Soohong

If a TLB is fully-associative, then the entire TLB is searched in
parallel for the virtual address, and no hashing is required.  However,
if the TLB is set-associative or direct-mapped, then a mapping must
exist from a virtual address to the set to examine.  This mapping (or
hashing) must make efficient use of the TLB by evenly distributing the
virtual addresses over all of the sets instead of clustering them into a
few sets.  Otherwise the TLB will thrash, causing performance
degradation.
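
For concreteness, here is a minimal sketch in C of such a set-associative
lookup.  The geometry (64 sets, 2 ways, 4KB pages) and the entry layout
are assumptions for illustration, not any particular machine's:

#include <stdint.h>

#define PAGE_SHIFT 12        /* assumed 4KB pages                    */
#define TLB_SETS   64        /* assumed geometry: 64 sets ...        */
#define TLB_WAYS   2         /* ... with 2 ways per set              */

struct tlb_entry { uint32_t vpn; uint32_t pfn; int valid; };
static struct tlb_entry tlb[TLB_SETS][TLB_WAYS];

/* Placeholder for the mapping under discussion. */
static unsigned hash_vpn(uint32_t vpn) { return (unsigned)vpn; }

/* The hash picks one set, and only that set's ways are searched; a
   hash that clusters virtual page numbers into a few sets therefore
   causes constant eviction (thrashing). */
static int
tlb_lookup(uint32_t vaddr, uint32_t *pfn)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    struct tlb_entry *set = tlb[hash_vpn(vpn) % TLB_SETS];

    for (int w = 0; w < TLB_WAYS; w++) {
        if (set[w].valid && set[w].vpn == vpn) {
            *pfn = set[w].pfn;
            return 1;                   /* hit */
        }
    }
    return 0;                           /* miss: refill from page tables */
}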

The most obvious hash is to use the lsb's of the virtual address as an
index into the TLB.  This is better than using the msb's, because
addresses exhibit the principle of locality, so we want sequential pages
to map to different TLB sets.  This scheme can be augmented in many
ways.  For example, if the virtual address space is divided up into 1/2
user and 1/2 supervisor, then the msb of the virtual address can be
included in the hash to allow user and supervisor page mappings to
coexist in the TLB without replacing each other (this is apparently what
is going on in the VAX/750 example).
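
As a rough sketch of that augmentation (the bit positions here are
assumptions for illustration, not the actual VAX layout):

/* Hypothetical hash in the spirit of the VAX/750 scheme described
   above: the msb (user vs. supervisor half of the address space)
   plus 5 low-order bits of the virtual page number, giving user and
   supervisor translations disjoint halves of a 64-set TLB. */
static unsigned
hash_vpn_split(uint32_t vaddr)
{
    uint32_t vpn  = vaddr >> 12;         /* assumed 4KB pages        */
    uint32_t mode = vaddr >> 31;         /* msb: 0 = user, 1 = super */
    return (mode << 5) | (vpn & 0x1f);   /* msb on top of 5 lsb's    */
}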

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

baum@Apple.COM (Allen J. Baum) (09/10/88)

> tim@crackle.amd.com (Tim Olson) writes:
>  soohong@gondor.cs.psu.edu (Soohong Kim) writes:

>|     In Hwang & Briggs (Computer Architecture and Parallel Processing) it is
>| said that the virtual page number is first randomized by hashing before
>| being fed to the TLB.
>| 
>|     My questions are:
>| (1) Why is it inefficient to use the Virtual page number directly?
>| (2) How could hashing or bit-combination help in improving the efficiency?

>.. stuff about how to avoid thrashing the TLB...
>The most obvious hash is to use the lsb's of the virtual address as an
>index into the TLB.  This is better than using the msb's, because
>addresses exhibit the principle of locality, so we want sequential pages
>to map to different TLB sets.  This scheme can be augmented in many ways.

Unfortunately, this is largely affected by how compilers and OSes allocate
pages.  It is often not enough to use the LSBs, because then the first page
of every process would collide, or the heap (allocated in high memory) would
collide with the stack (in low memory), or vice versa, or user and system
pages would collide.  So the hashing that I've seen exclusive-ORs the msb's
of the page number with the lsb's, sometimes reversing the bits of one half
or the other to really get them good and random.
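
A sketch of such a hash (the field widths, 6 bits each for an assumed
64-set TLB, are illustrative):

/* Reverse the low 6 bits of x. */
static uint32_t
reverse6(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 6; i++)
        r = (r << 1) | ((x >> i) & 1);
    return r;
}

/* XOR the low 6 bits of the page number with the bit-reversed next
   6 bits, so pages that differ only in their higher bits still
   spread across the sets. */
static unsigned
hash_vpn_fold(uint32_t vpn)
{
    uint32_t lo = vpn & 0x3f;
    uint32_t hi = (vpn >> 6) & 0x3f;
    return lo ^ reverse6(hi);
}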
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

dre%ember@Sun.COM (David Emberson) (09/13/88)

In article <3907@psuvax1.cs.psu.edu>, soohong@gondor.cs.psu.edu (Soohong Kim) writes:
> Could someone please enlighten me by explaining the following?
> 
>     In Hwang & Briggs (Computer Architecture and Parallel Processing) it is
> said that the virtual page number is first randomized by hashing before
> being fed to the TLB.  It is also said that the reason is efficient use
> of the low-order TLB entries.  And the VAX/750 seems to avoid hashing by
> using a number of high-order bits and 5 low-order bits to select the set
> in its set-associative TLB.
> 
>     My questions are:
> (1) Why is it inefficient to use the Virtual page number directly?
> (2) How could hashing or bit-combination help in improving the efficiency?
> 
> Thank you.
> ---Soohong

Most programs tend to use the same parts of their virtual address spaces, i.e.
low pages for text and data, and high pages for stack.  If you just used the
virtual page number to address the TLB, you would be doing a lot of thrashing
between processes and would not be using the TLB efficiently.

What is typically done is to exclusive-OR the low-order bits of the virtual
page number (the bits most likely to change) with the low-order bits of the
process ID to generate a pseudo-random number which indexes the TLB.  This
ensures that all entries in the TLB are used to the same degree and that
thrashing between contexts is minimized.
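
A sketch of that hash (the 64-entry TLB size is an assumption):

#define TLB_ENTRIES 64

/* Mix the process ID into the index so two processes that both use
   low pages for text/data and high pages for stack don't contend
   for the same TLB entries. */
static unsigned
tlb_index(uint32_t vpn, uint32_t pid)
{
    return (vpn ^ pid) & (TLB_ENTRIES - 1);
}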

			Dave Emberson (dre@sun.com)

casey@admin.cognet.ucla.edu (Casey Leedom) (09/21/88)

In article <16891@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
> >  tim@crackle.amd.com (Tim Olson) writes:
> > .. stuff about how to avoid thrashing the TLB...
> > The most obvious hash is to use the lsb's of the virtual address as an
> > index into the TLB.  This is better than using the msb's, because
> > addresses exhibit the principle of locality, so we want sequential pages
> > to map to different TLB sets.  This scheme can be augmented in many ways.
> 
> It is often not enough to use the LSBs, because then the first page of
> every process would collide, or the heap (allocated in high memory) would
> collide with the stack (in low memory), or vice versa, or user and system
> pages would collide.  So the hashing that I've seen exclusive-ORs the
> msb's of the page number with the lsb's, sometimes reversing the bits of
> one half or the other to really get them good and random.

  Another source of problems with simply using the LSBs arises when an
application manipulates two (or more) arrays whose ``corresponding
addresses'' are 2^N apart (where N is the number of low-order address
bits used to index the TLB).  If a situation like this occurs, any linear
access of the two (or more) arrays will cause *MASSIVE* thrashing.
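
A quick illustration (assuming a direct-mapped TLB indexed by the low
6 bits of the page number and 4KB pages, so corresponding addresses
64 * 4096 = 256KB apart always collide; the base addresses are made up):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint32_t a = 0x00100000;        /* assumed base of first array   */
    uint32_t b = a + 64 * 4096;     /* second array, one TLB period  */
                                    /* (2^18 bytes) away             */
    for (int page = 0; page < 4; page++) {
        unsigned ia = ((a >> 12) + page) % 64;  /* TLB index for a[] */
        unsigned ib = ((b >> 12) + page) % 64;  /* TLB index for b[] */
        printf("page %d: %u vs %u\n", page, ia, ib);  /* always equal */
    }
    return 0;
}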

  This actually happened to a group doing astronomical image processing
at Berkeley using a Sun 3/2XX.  They were copying a 1/4MB image array
30 times a second as part of their application.  They had justified
buying the Sun 3/2XX based on their need for speed.  Much to their
dismay, the Sun 3/2XX performed the memory copy about 3 times slower
than the Sun 3/1XX!

  The problem turned out to be that the Sun 3/2XX cache was 64KB with
16-byte lines, indexed by a formula very dependent on the LSBs of the
address.  The loop would pick up a 4-byte longword, causing a line to be
faulted in.  It would write that into the destination array, causing
that line to be faulted in.  On the next read, the same source line
would have to be faulted in again, but now the line in the cache was
dirty, so a 30-cycle write-back would go down ...
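
(The inner loop isn't shown in the post; it was presumably something
like this reconstruction:)

/* With source and destination mapping to the same cache lines, each
   load faults the source line in, the store faults the conflicting
   destination line in and dirties it, and the next load evicts that
   dirty line at the cost of a write-back. */
void
copy_longwords(long *dst, const long *src, int n)
{
    while (n-- > 0)
        *dst++ = *src++;
}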

  We finally got Sun 3/1XX performance by offsetting the source and
destination arrays by 24 bytes.  Because the array being copied was much
larger than the cache, we got best performance at:

	    | D - S | % 16 == 8
	&&  | D - S | >= 16
	&&  | D - S | <= 64K - 16

with a sine-wave performance curve whose peaks are as above and whose
valleys fall halfway in between, tapering off at the end points (hence
the second and third conditions).
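
(A toy check of those conditions with a 24-byte offset; the base
address is made up, since only the difference matters:)

#include <stdio.h>

int
main(void)
{
    unsigned long S = 0x00040000;   /* assumed source base           */
    unsigned long D = S + 24;       /* destination offset 24 bytes   */
    unsigned long diff = D - S;

    /* Prints "24 8": 16 <= 24 <= 64K - 16 and 24 % 16 == 8, so this
       offset sits on a peak of the performance curve. */
    printf("%lu %lu\n", diff, diff % 16);
    return 0;
}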

  I haven't sufficiently analyzed why the 24-byte offset (and other
similar offsets) had the effect it did.  That would require a better
understanding of the exact indexing algorithm employed by the Sun
3/2XX cache.

Casey