[comp.sys.m68k] Comparing 68xxx's; really TLB misses

phil@amdcad.UUCP (Phil Ngai) (02/01/87)

In article <1701@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>The 68012 is a version of the 68010 which has a 31-bit address bus,
>packaged in a chip carrier (or pin grid array?).  It's actually the
>same chip, but the package has more wires coming out.  The 68010
>couldn't add more wires because it was designed to be plugged in
>anywhere a 68000 goes.  This was a stopgap for people who wanted a
>large virtual address space (>16 megs) but couldn't wait for the
>68020.  I don't know any popular machine that uses them.

I was going to say, you must mean physical address, not virtual, but
then I remembered the 68K puts out virtual addresses since it has
no MMU.

>The 68030's MMU is the mediocre one that they build for the 68020; it
>does slow Vax-style page table lookup in main memory and has more features
>and options than you ever cared to read about.

Any prediction on how fast a TLB miss is handled? I seem to recall the
VAX 780, which does it in hardware, takes about 4 microseconds while
MIPS, which does it in software, takes from 1-2 microseconds for a
micro-TLB miss. I don't know how long a regular TLB miss takes. Many
people are shocked at the idea but looking at the bottom line, "how
long does it take", software TLB refill doesn't seem like such a bad
idea. 

Anyone know how fast the other chips are at TLB refills? Intel, NSC,
Fairchild, which I assume do it in hardware?

-- 
 They also surf who only stand on waves.

 Phil Ngai +1 408 982 7840
 UUCP: {ucbvax,decwrl,hplabs,allegra}!amdcad!phil
 ARPA: amdcad!phil@decwrl.dec.com

bjorn@alberta.UUCP (02/02/87)

In article <14561@amdcad.UUCP>, phil@amdcad.UUCP (Phil Ngai) writes:
> I was going to say, you must mean physical address, not virtual, but
> then I remembered the 68K puts out virtual addresses since it has
> no MMU.

The processor has nothing to do with the distinction between
virtual and physical addresses in this case.  That distinction
is enforced by the MMU and the operating system.

			Bjorn R. Bjornsson
			alberta!bjorn

mash@mips.UUCP (02/02/87)

In article <14561@amdcad.UUCP> phil@amdcad.UUCP (Phil Ngai) writes:
(regarding 68030)
>Any prediction on how fast a TLB miss is handled? I seem to recall the
>VAX 780, which does it in hardware, takes about 4 microseconds while
>MIPS, which does it in software, takes from 1-2 microseconds for a
>micro-TLB miss. I don't know how long a regular TLB miss takes. Many
>people are shocked at the idea but looking at the bottom line, "how
>long does it take", software TLB refill doesn't seem like such a bad
>idea. 
>Anyone know how fast the other chips are at TLB refills? Intel, NSC,
>Fairchild, which I assume do it in hardware?

1) A MIPS micro-TLB refill is actually 1 cycle: it's a refill of the tiny
on-chip TLB from the 64-entry larger on-chip one.

2) A normal TLB refill is 9-10 instructions (convenient form is slightly
different between 4.3 and V.3), + 0-5 cycles for a data-cache miss,
+ 2-4 cycles of pipeline breakage/time to get into refill routine.
This totals 11-19 cycles, assuming NO I-cache misses in the refill routine.
I-cache misses cost 5 cycles each [on the 5 MIPS board/memory design], so the
worst case is about 60 cycles [7.5 microsecs].  On the average, the actual
I-cache cost is 1-2 cycles, yielding 13-21 cycles.  Anyway, the bottom line is a little
under 2 microseconds total penalty.  In any case, for user level programs,
this all costs about 1-2% of user execution time, even on fairly large
programs, i.e., it's almost down in the noise with regard to performance.
I.e., as long as it's fast enough, you can concentrate on making it have
the behavior desired by the O.S., and then go worry about other things,
like cache design.  For example, cache miss overhead is a much larger
performance issue: cache misses can easily eat up 10-50% of the cycles,
depending on the design and the program.

3) "Many people are shocked at the idea" : I hope this is passing: after
all, the same technique is used on HP Spectrums [for sure], and
on Celerity boxes [I think].  It does depend on having fast exception
handling: if that is not possible, it is probably better to use microcode.

4) Note that data-cache miss penalties for fetching page-table entries
account for 25-30% of the penalty above. This is relevant: in high-performance
systems, even if the microcode is instantaneous, you still have 1-2
memory references, which are there whether you do it in hardware or software.
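
The cycle counts in point 2 can be tallied with a few lines of Python.
This is only a rough sketch of the arithmetic: the component figures are
the ones quoted above, the 125 ns cycle time is an assumption inferred
from "60 cycles [7.5 microsecs]", and the function names are made up for
illustration.

```python
# Sketch of the refill-cost arithmetic. The figures are the ones quoted
# in the post; the function names are hypothetical.

CYCLE_NS = 125  # assumed cycle time, inferred from 60 cycles = 7.5 microseconds


def refill_cycles(instructions, dcache_miss, pipeline, icache_miss=0):
    """Cycles for one software TLB refill, summed from its components."""
    return instructions + dcache_miss + pipeline + icache_miss


def cycles_to_microseconds(cycles, cycle_ns=CYCLE_NS):
    """Convert a cycle count to microseconds at the given cycle time."""
    return cycles * cycle_ns / 1000.0


if __name__ == "__main__":
    # Best and worst case with no I-cache misses in the refill routine.
    best = refill_cycles(instructions=9, dcache_miss=0, pipeline=2)
    worst = refill_cycles(instructions=10, dcache_miss=5, pipeline=4)
    print("no I-cache misses:", best, "to", worst, "cycles")
    # The quoted 13-21 range corresponds to adding 2 cycles of I-cache cost.
    for cycles in (best + 2, worst + 2, 60):
        print(cycles, "cycles =", cycles_to_microseconds(cycles), "microseconds")
```

Changing CYCLE_NS shows how the same cycle counts scale on a faster or
slower clock.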
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

phil@amdcad.UUCP (02/03/87)

In article <108@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>
>1) A MIPS micro-TLB refill is actually 1 cycle: it's a refill of the tiny
>on-chip TLB from the 64-entry larger on-chip one.

When I saw the following text:
"TLB REFILL IN SOFTWARE.
Normal UTLBMISS refills are reasonably fast (10-14 125 nS cycles, or
1.2-1.7 microseconds in an 8MHz, 5 mips system).",
I assumed UTLBMISS was micro TLB but now that I go back I see that's
for a miss in kuseg. Sorry.

henry@utzoo.UUCP (Henry Spencer) (02/04/87)

> 3) "Many people are shocked at the idea" : I hope this is passing: after
> all, the same technique is used on HP Spectrums [for sure], and
> on Celerity boxes [I think]...

Yes, the Celerity boxes use it.  Or at least, they did back when I read
the manual for one, when we were thinking of buying it.

There is some variation in how one handles the problem of making sure that
the TLB-miss handler does not get TLB misses.  I dimly recall that the MIPS
hardware makes it easy to reserve some TLB slots for the system.  On the
Celerity machines the kernel just has to be careful, as I recall.
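
One way to picture reserved TLB slots is the toy model below. It is a
sketch in Python, not the actual MIPS or Celerity mechanism: the 64-entry
size comes from the earlier post, while the count of 8 reserved slots,
the random replacement policy, and every class and method name are
assumptions made for illustration.

```python
import random


class SoftTLB:
    """Toy software-refilled TLB with a few slots reserved for the system."""

    def __init__(self, size=64, wired=8, rng=None):
        self.size = size
        self.wired = wired              # slots 0..wired-1 are never replaced (assumed count)
        self.entries = [None] * size    # each entry is (vpn, pfn) or None
        self.rng = rng or random.Random(0)

    def lookup(self, vpn):
        """Return the frame number for vpn, or None on a TLB miss."""
        for entry in self.entries:
            if entry is not None and entry[0] == vpn:
                return entry[1]
        return None

    def wire(self, slot, vpn, pfn):
        """Install a mapping that refills can never evict."""
        if not 0 <= slot < self.wired:
            raise ValueError("slot is outside the reserved range")
        self.entries[slot] = (vpn, pfn)

    def refill(self, vpn, page_table):
        """Miss handler: fetch the mapping and overwrite a non-reserved slot."""
        pfn = page_table[vpn]
        slot = self.rng.randrange(self.wired, self.size)
        self.entries[slot] = (vpn, pfn)
        return pfn

    def translate(self, vpn, page_table):
        """Translate vpn, running the software refill on a miss."""
        pfn = self.lookup(vpn)
        if pfn is None:
            pfn = self.refill(vpn, page_table)
        return pfn
```

If the pages holding the miss handler and the page tables are installed
with wire(), no amount of user-level refill traffic can evict them, so
the handler cannot miss on itself.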

If you want a really shocking idea, consider the latest notion from Cheriton
and his bunch at Stanford.  The CPU runs out of a cache, which is addressed
by virtual address (virtual address includes a process id to avoid flushing
on every context switch) and has very long cache lines (e.g. 128 bytes --
*bytes*, not *bits*).  There is also a small amount of local memory which
is accessible only in kernel mode.  Cache misses are handled entirely in
software, with a bit of hardware help for fast data moving.  Notice that
there is no MMU!  The virtual->real translation is needed only for handling
cache misses and can be done entirely by the software.  The only real issue
is whether software cache handling will be efficient enough.  Given a large
cache with a big line size, it just might work.  There are some other neat
tweaks to do cache-coherence handling for a multiprocessor system.  Very
new and very experimental, but a nice idea.
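
A toy model of the scheme as described here, for the curious. It is a
sketch in Python and not the Stanford design itself: the 128-byte line
and the process id in the cache tag come from the description above,
while the 4096-byte page size, FIFO eviction, read-only access, and all
names are assumptions.

```python
LINE_BYTES = 128    # line size from the description above
PAGE_BYTES = 4096   # assumed page size


class SoftwareManagedCache:
    """Virtually addressed cache whose misses are handled in software."""

    def __init__(self, memory, page_tables, num_lines=4):
        self.memory = memory            # bytearray standing in for physical memory
        self.page_tables = page_tables  # {pid: {virtual page: physical frame}}
        self.num_lines = num_lines
        self.lines = {}                 # (pid, virtual line number) -> line contents
        self.misses = 0

    def _miss_handler(self, pid, vline):
        """The only place a virtual->real translation ever happens."""
        vaddr = vline * LINE_BYTES
        vpn, offset = divmod(vaddr, PAGE_BYTES)
        pfn = self.page_tables[pid][vpn]
        paddr = pfn * PAGE_BYTES + offset
        if len(self.lines) >= self.num_lines:
            # FIFO eviction: dicts keep insertion order, so drop the oldest line.
            self.lines.pop(next(iter(self.lines)))
        self.lines[(pid, vline)] = bytes(self.memory[paddr:paddr + LINE_BYTES])
        self.misses += 1

    def read_byte(self, pid, vaddr):
        """Hits need no translation; misses fall into the software handler."""
        vline, offset = divmod(vaddr, LINE_BYTES)
        if (pid, vline) not in self.lines:
            self._miss_handler(pid, vline)
        return self.lines[(pid, vline)][offset]
```

Because the tag carries the process id, two processes can use the same
virtual address without a flush in between, and the page tables are
consulted only inside the miss handler.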
-- 
Legalize			Henry Spencer @ U of Toronto Zoology
freedom!			{allegra,ihnp4,decvax,pyramid}!utzoo!henry