[comp.arch] Life with TLB and no PT

hansen@mips.UUCP (04/23/87)

In article <3027@sdcsvax.UCSD.EDU>, stuart@CS.ROCHESTER.EDU (Stuart Friedberg) writes:
> Prodded by something Avie Tevanian of the MACH research group said,
> I have been considering life with a translation lookaside buffer (TLB)
> but without hardware page tables (PT).  This message is intended to
> spark some discussion of a) what such a system would be like, and
> b) how existing architectures and operating systems can be instrumented
> to get some empirical data to guide TLB design and sizing.  Since it
> involves tradeoffs between hardware and software, I cross-posted to
> comp.os.research and comp.arch.

The MIPS R2000 processor already does it, and with a 4k page size, a
64-entry, fully-associative TLB, and 15-19 cycles to refill the TLB in
software from a simple page table, large benchmarks typically spend about 1%
of their execution time in the software TLB refill handler. There are 6-bit
process identifiers associated with each entry, so that the TLB doesn't need
to be flushed on context switches, and the permanently memory-resident
portion of the kernel doesn't use the TLB at all, which reduces the overall
size requirement of the TLB by a large margin.

An important benefit of the software TLB handler is that it is easily
changeable. We use two different handlers for the two different UNIX ports we
have running on the machine (BSD 4.3 and V.3), because there are two
different page table formats in use.  We've been able to easily add code to
count TLB misses to verify our simulation data under real system conditions.
We have also used it to code around hardware problems in one revision of our
processor chip.
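
For readers who haven't seen a software-refilled TLB, the refill path
amounts to very little code.  Below is the shape of it written out in C for
readability, with a miss counter of the kind mentioned above.  It is a toy
user-level model, not our actual handler (which is a short assembly
sequence) and not either of our page table formats; every structure and
name in it is invented for illustration:

    #include <stdio.h>

    #define NPAGES  256                 /* toy address space, in pages */
    #define NTLB    64                  /* entries, fully associative  */

    /* Toy model only: a linear page table, a trivial replacement
     * policy, and invented names.  The real refill handler does the
     * same three things in a short assembly sequence: look up the
     * page table entry, write it into the TLB, return from the
     * exception. */

    struct pte    { unsigned pfn; int valid; };
    struct tlbent { unsigned vpn; unsigned pfn; int valid; };

    static struct pte    pagetable[NPAGES];  /* the "simple page table" */
    static struct tlbent tlb[NTLB];
    static int           slot;               /* where the next refill lands */
    static long          tlb_miss_count;     /* instrumentation is one line */

    static void tlbmiss(unsigned vpn)        /* invented name */
    {
        tlb_miss_count++;
        if (!pagetable[vpn].valid) {
            fprintf(stderr, "page fault, vpn %u\n", vpn);  /* pager's job */
            return;
        }
        tlb[slot].vpn   = vpn;
        tlb[slot].pfn   = pagetable[vpn].pfn;
        tlb[slot].valid = 1;
        slot = (slot + 1) % NTLB;
    }

    /* every load or store: hit in the TLB, or take the miss path */
    static void reference(unsigned vpn)
    {
        int i;

        for (i = 0; i < NTLB; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return;
        tlbmiss(vpn);
    }

    int main(void)
    {
        unsigned vpn;

        for (vpn = 0; vpn < NPAGES; vpn++) {     /* identity mapping */
            pagetable[vpn].pfn   = vpn;
            pagetable[vpn].valid = 1;
        }
        for (vpn = 0; vpn < NPAGES; vpn++) {     /* touch each page twice */
            reference(vpn);
            reference(vpn);
        }
        printf("%ld misses touching %d pages twice\n",
               tlb_miss_count, NPAGES);
        return 0;
    }

Swapping the body of tlbmiss() is all it takes to walk a different page
table format, which is why the two ports can each use their own.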

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

lamaster@pioneer.arpa (Hugh LaMaster) (04/24/87)

In article <338@dumbo.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>
>The MIPS R2000 processor already does it, and with a 4k page size, a
>64-entry, fully-associative TLB, and 15-19 cycles to refill the TLB in
>software from a simple page table, large benchmarks typically spend about 1%
>of their execution time in the software TLB refill handler. There are 6-bit
:

I don't doubt that the MIPS does well running ls, cat, troff, and awk.  But
what about this Fortran program:

      real a(1000,1000),b(1000,1000)

      :
      :

      do 10 i=1,1000
      do 5  j=1,1000
      b(i,j) = a(j,i)
    5 continue
   10 continue

That looks like one TLB load per statement to me: each column of a
1000x1000 REAL array is 4000 bytes, just under a 4k page, so with j
innermost the store to b(i,j) lands on a different page nearly every
iteration, and the 1000 active columns of b swamp a 64-entry TLB.  (By the
way, this is not a far-fetched example at all.)  Code references are
usually local.  Data references are frequently far apart.

It seems to me that the R2000 is not designed with numerical analysis in mind.
I wish that designers would give more thought to performance on engineering
and scientific applications.  Even IBM found it necessary to provide for
better e&s performance, and such applications are a smaller part of IBM's
market than of most other companies'.  If e&s applications are not "typical"
("according to Dhrystone . . ." :-)   ) then why does every workstation and
mini come with an FPA these days?



  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

crowl@rochester.ARPA (Lawrence Crowl) (04/25/87)

In article <1366@ames.UUCP> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>In article <338@dumbo.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>>The MIPS R2000 processor already does [software TLB refill], and with a 4k
>>page size, a 64-entry, fully-associative TLB, and 15-19 cycles to refill the
>>TLB in software from a simple page table, large benchmarks typically spend
>>about 1% of their execution time in the software TLB refill handler.
>
>I don't doubt that the MIPS does well running ls, cat, troff, and awk.  But
>what about this Fortran program:
>
>      real a(1000,1000),b(1000,1000)
>      do 10 i=1,1000
>      do 5  j=1,1000
>      b(i,j) = a(j,i)
>    5 continue
>   10 continue
>
>That looks like one TLB load per statement to me.  (By the way, this is not
>a far fetched example at all.)  Code references are usually local.  Data
>references are frequently far apart.

You are talking eight megabytes of data for what is presumably a small part
of any real program.  Unless you have a lot of physical memory, this code
looks like one page fault per statement.  The TLB miss is insignificant.

Let's assume you have the physical memory and won't page fault.  I will
guess that each iteration of the inner loop takes roughly 18 cycles, which
is roughly the cost of a TLB refill.  At about one miss per iteration, that
means the code above spends about 50% of its time in TLB refill.

Let's reexamine the code.  What you are really doing is b = transpose( a ).
I feel free to recode the same function in a more efficient manner.  The
approach is to transpose square submatrices instead of rows (or columns).

      real a(1000,1000),b(1000,1000)
c     outer loops pick a 50x50 submatrix
      do 40 i1=1,951,50
      do 30 j1=1,951,50
c     inner loops transpose that submatrix
      do 20 i2=i1,i1+49
      do 10 j2=j1,j1+49
      b(i2,j2) = a(j2,i2)
   10 continue
   20 continue
   30 continue
   40 continue

This looks to be about 100 TLB misses in each submatrix transposition, one
for each column of a and b touched (Fortran stores arrays column-major, and
each column of a 1000x1000 REAL array is just under a 4k page).  That is
about 0.04 misses per iteration of the inner loop (100 misses over 50*50 =
2500 iterations).  This means that the revised code spends roughly 4% of
its time in TLB refill.  While this is worse than the 1% cited by Craig
Hansen, it is certainly not the 50% implied by the first code.
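
If you want to check these back-of-the-envelope numbers, a small model of
the TLB is enough.  The sketch below (in C) assumes 4-byte REALs, 4k pages,
and a 64-entry fully associative TLB with LRU replacement; the real
machine's refill handler may well replace entries differently (at random,
say), so treat the counts as rough.  All of it is mine, not MIPS code.

    #include <stdio.h>

    #define N     1000                /* matrix dimension  */
    #define B     50                  /* blocking factor   */
    #define PAGE  4096                /* page size, bytes  */
    #define NTLB  64                  /* TLB entries       */

    /* A rough model only: fully associative, LRU replacement (the
     * real refill handler may replace at random), 4-byte REALs. */

    static long tlb_vpn[NTLB];        /* virtual page number per entry */
    static long tlb_use[NTLB];        /* last-use time, for LRU        */
    static long now, misses;

    static void touch(long addr)
    {
        long vpn = addr / PAGE;
        int  i, victim = 0;

        now++;
        for (i = 0; i < NTLB; i++)
            if (tlb_vpn[i] == vpn) {
                tlb_use[i] = now;
                return;               /* hit */
            }
        for (i = 1; i < NTLB; i++)    /* miss: evict least recently used */
            if (tlb_use[i] < tlb_use[victim])
                victim = i;
        tlb_vpn[victim] = vpn;
        tlb_use[victim] = now;
        misses++;
    }

    /* Fortran stores arrays column-major: element (r,c) of an NxN
     * REAL array lives 4*((c-1)*N + (r-1)) bytes past its base. */
    static long addr_of(long base, int r, int c)
    {
        return base + 4L * ((long)(c - 1) * N + (r - 1));
    }

    static void reset(void)
    {
        int i;
        for (i = 0; i < NTLB; i++) {
            tlb_vpn[i] = -1;
            tlb_use[i] = 0;
        }
        now = misses = 0;
    }

    int main(void)
    {
        long a = 0, b = 4L * N * N;   /* two disjoint 4-megabyte arrays */
        int  i, j, i1, j1, i2, j2;

        reset();                      /* the original loop nest */
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++) {
                touch(addr_of(a, j, i));       /* load  a(j,i) */
                touch(addr_of(b, i, j));       /* store b(i,j) */
            }
        printf("original: %ld misses in %ld iterations\n",
               misses, (long)N * N);

        reset();                      /* the blocked loop nest */
        for (i1 = 1; i1 <= N - B + 1; i1 += B)
            for (j1 = 1; j1 <= N - B + 1; j1 += B)
                for (i2 = i1; i2 <= i1 + B - 1; i2++)
                    for (j2 = j1; j2 <= j1 + B - 1; j2++) {
                        touch(addr_of(a, j2, i2));
                        touch(addr_of(b, i2, j2));
                    }
        printf("blocked:  %ld misses in %ld iterations\n",
               misses, (long)N * N);
        return 0;
    }

Run over the two loop nests, it should report close to one miss per
iteration for the original code and a few hundredths of a miss per
iteration for the blocked version, in line with the estimates above.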

Is the revised code unreasonable?  Certainly not if it is part of a library.
Given the potential paging problem, programmers may be forced into this type
of approach anyway.
-- 
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York,  14627

lamaster@pioneer.UUCP (04/28/87)

In article <27302@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
:
>
>You are talking eight megabytes of data for what is presumably a small part
>of any real program.  Unless you have a lot of physical memory, this code
>looks like one page fault per statement.  The TLB miss is insignificant.

Well, a medium part actually.  If I had a 4 MIPS machine with 16MB of memory,
typical these days for a Sun, etc., I would probably dimension my arrays about
700x700.  On the other hand, if I could get 32MB, 1Kx1K would be good.  I
think 16-64MB is a very good guess as to the memory size I would expect on a
new architecture in this speed range.  I don't consider 8MB a lot of memory by
today's standards.

>
>Let's assume you have the physical memory and won't page fault.  I will make
>a guess that each iteration of the inner loop takes roughly 18 cycles which
  
18 cycles sounds like a lot to me for a single memory-to-memory word-sized
move, overhead included.

>
>Let's reexamine the code.  What you are really doing is b = transpose( a ).
>I feel free to recode the same function in a more efficient manner.  The
>approach is to transpose square submatricies instead of rows (or columns). 

:
(Description of better transpose deleted)

Indeed, there are better ways to transpose, given that you know something
about the machine.  However, the algorithm proposed is also very bad for some
vector machines that I know of.  It is very difficult to optimize code for
more than one machine at a time.  A classic example is the BLAS library
specified some time ago.  BLAS was expected to be at the correct level for
optimization on many machines.  Strangely enough, at about the same time,
people figured out how to use the fast vector machines that were coming out
then.  It turned out that BLAS was designed around too small a unit of work to
allow the best optimization on vector machines.  Dongarra, who was in on the
BLAS effort, soon proposed an organization for linear algebra problems that
was 3 times faster, in Fortran, than the best assembly language versions based
on BLAS.  Whenever you write specifically optimized code, you are risking
premature optimization. 

But the point of my original posting was that graphics, numerical modeling,
and other engineering and scientific number crunching codes often
reference large amounts of memory in a pseudo-random fashion.  It is not safe
to assume that both code and data have strong locality in reference patterns.
Often, only the code does.




  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")