[net.arch] 386 info

jnw@mcnc.UUCP (John White) (11/02/85)

I received the 386 documentation from Intel (they sent it very promptly)
and here is some info I extracted.

(Disclaimer: I have only seen the documentation, not the chip. Also,
I probably don't know what I'm talking about. If I make any errors,
feel free to point them out.)

The bus looks like it is well done. Each cycle requires only 2 clocks.
(I am referring to internal clocks; the external clock is twice
that. The 12/16MHz spec refers to the internal clock.) The address->data
time is the full 2 clocks minus a 40ns address delay and 10ns data setup
(for the 16MHz chip). Thus, address->data is 75ns at 16MHz with no waits.
There is an input that allows you to request the next address even if
the data from the last request is not ready yet, essentially pipelining
the bus.
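
As a quick check of the arithmetic above, here is a minimal Python sketch.
The function and parameter names are made up for illustration. The 40ns and
10ns figures are the ones quoted above for the 16MHz part, so only that
speed is evaluated.

```python
def address_to_data_ns(clock_mhz, clocks_per_cycle=2,
                       address_delay_ns=40.0, data_setup_ns=10.0):
    """Time (ns) memory has to respond in a zero-wait bus cycle."""
    # One internal clock lasts 1000/clock_mhz nanoseconds.
    cycle_ns = clocks_per_cycle * 1000.0 / clock_mhz
    return cycle_ns - address_delay_ns - data_setup_ns


print(address_to_data_ns(16))  # 75.0
```

Each wait state would presumably add one more internal clock to
clocks_per_cycle, but that is my assumption, not something taken from the
documentation.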

The registers ax,bx,cx,dx,si,di,sp,bp have been expanded to 32 bits
(called eax,ebx ...). Each of these registers has its own set of
capabilities (as with the 86). (This means that for a given amount of
effort writing an optimizing compiler, code for a clean architecture
like the 68020 will be better optimized than for the 386.)
There is a mode bit associated with each code segment that specifies
whether addresses and operands are 16 or 32 bits. The address and
operand lengths can be individually toggled by prefixes.

There are 4 debug registers. These allow access to particular memory
locations to be trapped, great for data or breakpoints in read-only code.
This is in addition to the breakpoint instruction (int3) and single-stepping.

The 386 can support 86 programs in a protected environment. Interrupts
in 86 mode are caught so interrupts defined for the
IBM-PC don't conflict with the reserved interrupts defined for the 386.
Other instructions like I/O are also caught, allowing their function to
be simulated. There seems to be enough flexibility in the paging to
allow direct screen writes by an 86 program to be handled. (The video
memory locations can be mapped to some location in ordinary memory,
and the dirty bits of the corresponding pages tell if a direct write
occurred. Or, the memory can be marked not-present, forcing a fault.)

There are up to 16k segments associated with a task. When a memory access
is made, it is referenced to some segment. The address from the access
is added to the base of the segment and then compared to the
limit of the segment. There are also some attribute bits associated with the
segment. (restricting access, access and dirty bits, etc.)
The segment registers (cs,ss,ds,es,fs,gs) are essentially pointers to the
actual segment descriptions. (These descriptions are loaded from memory
into the chip when the segment register is changed, of course.)
Segments can be ignored by setting their base to 0 and their limit to 4Gbytes.
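
The segment check described above can be sketched in a few lines of Python.
This is only an illustration: the class and function names are invented, the
limit is treated as the highest valid offset, and only a single "writable"
attribute is modeled (no privilege levels, no expand-down segments).

```python
from dataclasses import dataclass


class SegmentFault(Exception):
    """Raised when an access violates the segment limit or attributes."""


@dataclass
class Segment:
    base: int
    limit: int             # highest valid offset (assumption of this sketch)
    writable: bool = True  # stand-in for the attribute bits


def linear_address(seg, offset, write=False):
    """Check the offset against the segment, then add the segment base."""
    if offset > seg.limit:
        raise SegmentFault("offset %#x beyond limit %#x" % (offset, seg.limit))
    if write and not seg.writable:
        raise SegmentFault("write to read-only segment")
    return (seg.base + offset) & 0xFFFFFFFF


# "Ignoring" segments: base 0 and a 4Gbyte limit make the mapping an identity.
flat = Segment(base=0, limit=0xFFFFFFFF)
print(hex(linear_address(flat, 0x12345678)))  # 0x12345678
```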

Paging can be on or off, and it can be used in addition to segments.
All pages are 4k bytes. Each task has a pointer to one page. This page
(the page directory) has 1k pointers to page tables (each 1 page, or 4k
bytes). Each page table has 1k entries. Each entry in the page table
describes where the physical address of the page is, and has bits for
accessed, dirty, present, etc. There is a cache of 32 entries that hold
the 32 page table entries in current use (for 128k bytes). (This is the
cache that Intel gives the 98% hit rate for. There is no data or code cache.)
The upper 20 bits of the after-segment address specify the page.
If it is one of the 32 cached entries, then 20 bits from the page
table entry are substituted for the upper 20 bits of the address and
the 386 continues full speed ahead. But if it is not one of the 32 cached
entries, then things slow down. First, if all 32 entries in the cache
are used, then one must be freed up. As the access and dirty bits may have
been changed, the entry must be written back to the page table. Then
the upper 10 bits of the address are used as an index into the page directory
to find the location of the correct page table. Then the next 10 bits of
the address must be added to the base of the page table to get the location
of the page entry. This entry must then be read. Then the entry is placed in
the cache and is also used to translate the address. Finally the original
memory access can occur.
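
Here is a rough Python sketch of the lookup just described: a 10/10/12 bit
split of the address, with a 32-entry cache in front of the two-level
tables. The names are invented, the tables are plain dicts rather than
memory, and evicting the oldest entry is my assumption (the replacement
policy is not stated above). The accessed and dirty bits are left out.

```python
from collections import OrderedDict

PAGE_SHIFT = 12    # 4k byte pages
TLB_ENTRIES = 32   # size of the translation cache


class PageFault(Exception):
    """Raised when no page table entry exists for the address."""


class Mmu:
    def __init__(self, page_directory):
        # page_directory: {directory index: {table index: frame number}}
        self.page_directory = page_directory
        self.tlb = OrderedDict()  # page number -> frame number
        self.misses = 0

    def translate(self, linear):
        page = linear >> PAGE_SHIFT   # upper 20 bits
        offset = linear & 0xFFF       # lower 12 bits pass through unchanged
        frame = self.tlb.get(page)
        if frame is None:             # miss: walk the tables
            self.misses += 1
            frame = self._walk(page)
            if len(self.tlb) >= TLB_ENTRIES:
                self.tlb.popitem(last=False)  # evict oldest (assumed policy)
            self.tlb[page] = frame
        return (frame << PAGE_SHIFT) | offset

    def _walk(self, page):
        dir_index = page >> 10        # upper 10 bits pick the page table
        table_index = page & 0x3FF    # next 10 bits pick the entry
        try:
            return self.page_directory[dir_index][table_index]
        except KeyError:
            raise PageFault(hex(page << PAGE_SHIFT)) from None


mmu = Mmu({0: {1: 0x1234}})
print(hex(mmu.translate(0x1ABC)))  # 0x1234abc (miss, table walk)
print(hex(mmu.translate(0x1FFF)))  # 0x1234fff (hit, same page)
print(mmu.misses)                  # 1
```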

Intel claims that either segments or paging can be used to provide
flexible memory management. Let's assume that the 386 is being used in a
high performance workstation. Let this workstation be multitasking and
run programs written in C (with 32-bit ints). Assume those programs
were written without worrying about the architecture that would run them.
(ie, the data structures and algorithms are not tuned to the 386).
Let this workstation have about 8Mbytes of memory, and assume that this is
enough to hold all active tasks. Assume there is one major task being run
and that we are most concerned about its performance. This task will use
over a Mbyte of data area accessed randomly.
Let each task have its own 4Gbyte address space with code+data being in
low memory with a heap growing up, and the stack in high memory growing
down. Although a limited amount of memory is provided for the stack and
heap, the operating system should be able to add more as faults occur,
so the stack and heap can grow without any arbitrary limit.
First, paging will clearly work. But our main task will run slowly because
almost all of the data accesses to the >1Mbyte of data area will be
cache misses, forcing the slow operation described above. Note that this
is a built-in problem with the 386. The page size is 4k, the cache size is 32,
so any program that randomly accesses more than 128kbytes will slow down.
Of course, we could turn paging off, so let's see if we can design the
workstation using only segments. First, data and stack segments must
have the same base because C does not distinguish between pointers to stack,
static, and malloc'ed variables. The stack is in high memory, so the
limit on the stack segment must be high. But as the stack grows there will
be no fault until the base is reached. Thus, all memory between
the stack and heap must be associated with real memory, which is not
what we want. So, we can't do it with segments.
We might add an external MMU. It could have a low and a high limit, causing
a fault if an access is between these limits. While this check is occurring,
the address can be added to a base (ignoring carry). Thus, the task would
appear in physical memory as a single block, with a stack growing down and
a heap growing up. This would fulfill all our requirements nicely.
But I don't see how to generate a 386 bus fault from an external MMU, and
that is needed for the operating system to know when the stack or heap
has exceeded its limit.
(It looks like the only way I'll be able to make my workstation is to use
a 68020 :-)
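
To make the external MMU idea concrete, here is a small Python sketch of the
check and the add. Everything in it is hypothetical: the function name, the
example limits and base, and the choice to treat the low limit as inclusive
and the high limit as exclusive.

```python
class BusFault(Exception):
    """Raised for an access in the hole between heap and stack."""


def external_mmu(addr, low_limit, high_limit, base):
    """Fault between the limits, otherwise add the base ignoring carry."""
    if low_limit <= addr < high_limit:
        raise BusFault(hex(addr))
    return (addr + base) & 0xFFFFFFFF  # the carry out of bit 31 is dropped


BASE = 0x00400000  # where address 0 of the task lands in physical memory
LOW = 0x00010000   # first address past the heap
HIGH = 0xFFFF0000  # lowest address of the stack

print(hex(external_mmu(0x00000010, LOW, HIGH, BASE)))  # 0x400010
print(hex(external_mmu(0xFFFFFFFC, LOW, HIGH, BASE)))  # 0x3ffffc
```

Because the carry is dropped, the top of the stack wraps around to just
below the base, so the stack and heap end up as one contiguous physical
block with the base in the middle.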

Samples of the 386 are available now, but I would advise caution.
As I remember, it took a while for Intel to shake the bugs out of the 286,
and the 386 contains all the complexity of the 286 as a subset.

Overall, the 386 looks like a very high performance processor,
and I am glad to see that kind of power becoming affordable to
ordinary people like me.

Happy hacking
John N. White

sambo@ukma.UUCP (Father of micro-ln) (11/05/85)

In article <965@mcnc.mcnc.UUCP> jnw@mcnc.UUCP (John White) writes:
>The registers ax,bx,cx,dx,si,di,sp,bp have been expanded to 32 bits
>(called eax,ebx ...). Each of these registers has its own set of
>capabilities (as with the 86). (This means that for a given amount of
>effort writing an optimizing compiler, code for a clean architecture
>like the 68020 will be better optimized than for the 386.)

If I understand the literature correctly, each task has a default of
either 16- or 32-bit operands and addresses.  The former would allow
for emulating the 286, for example.  The default can be overridden by
a special prefix, which affects only the next instruction.  Anyway,
if the default is set for 32-bit operands and addresses, all of a
sudden there are also quite a few more addressing options, so that the
registers are almost truly general purpose.  For example, it is
possible to multiply any two registers.  (Actually, this particular
capability is available even if it is set up for the 16-bit default.)
It is also possible to use just about any register as an index register.
(I think SP is the only register that can't be used this way - it remains
almost exclusively as a stack pointer.)  So it should be a lot easier
to write an optimizing compiler - the only thing is that using
these additional addressing options usually lengthens the instruction
by an extra byte.
--
Samuel A. Figueroa, Dept. of CS, Univ. of KY, Lexington, KY  40506-0027
ARPA: ukma!sambo<@ANL-MCS>, or sambo%ukma.uucp@anl-mcs.arpa,
      or even anlams!ukma!sambo@ucbvax.arpa
UUCP: {ucbvax,unmvax,boulder,oddjob}!anlams!ukma!sambo,
      or cbosgd!ukma!sambo

	"Micro-ln is great, if only people would start using it."

mash@mips.UUCP (John Mashey) (11/05/85)

John N. White writes (on the 386):
> Intel claims that either segments or paging can be used to provide
> flexible memory management. Let's assume that the 386 is being used in a
> high performance workstation. ...
> enough to hold all active tasks. Assume there is one major task being run
> and that we are most concerned about its performance. This task will use
> over a Mbyte of data area accessed randomly.
> First, paging will clearly work. But our main task will run slowly because
> almost all of the data accesses to the >1Mbyte of data area will be
> cache misses, forcing the slow operation described above. Note that this
> is a built-in problem with the 386. The page size is 4k, the cache size is 32,
> so any program that randomly accesses more than 128kbytes will slow down....
(remember that the cache referenced here is really the TLB, or cache of
translations, not a data cache).
1) Fortunately for the 386 and similar designs, most real programs don't
randomly access data. Instruction locality is quite good, and even data
locality is not bad, unless you have programs walking arrays in
more than page size steps.
2) Consider the 128Kbyte number.  The equivalent number on a VAX-11/780,
for user space, is 64*512 = 32Kbyte, and the VAX has to flush that on
every context switch.  It seems possible on the 386 to use segment id's
as process/context identifiers to avoid the flush.
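
The numbers in point 2 are easy to reproduce. The Python sketch below
computes the amount of memory each translation cache can cover, plus a crude
hit-rate estimate for the fully random 1Mbyte case from the original post.
The function names are invented, and the hit-rate formula assumes uniformly
random, independent references with a full cache, which (per point 1) real
programs rarely exhibit.

```python
def tlb_reach_bytes(entries, page_bytes):
    """Memory covered when every translation cache entry is in use."""
    return entries * page_bytes


def random_hit_rate(reach_bytes, working_set_bytes):
    """Idealized hit rate for uniformly random accesses (an assumption)."""
    return min(1.0, reach_bytes / working_set_bytes)


reach_386 = tlb_reach_bytes(32, 4096)  # 32 entries of 4k byte pages
reach_vax = tlb_reach_bytes(64, 512)   # 64 user entries of 512 byte pages

print(reach_386 // 1024, reach_vax // 1024)         # 128 32
print(random_hit_rate(reach_386, 1024 * 1024))      # 0.125
```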
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

rfm@x.UUCP (Bob Mabee) (11/06/85)

In article <965@mcnc.mcnc.UUCP> jnw@mcnc.UUCP (John White) writes:
(summarizing Intel release on the new 386)
>First, if all 32 entries in the cache
>are used, then one must be freed up. As the access and dirty bits may have
>been changed, the entry must be written back to the page table.

This is a bug (or misquoted).  The accessed and dirty bits have to be written
back to memory immediately they are set, using a bus-lock read-alter-rewrite
cycle to set just the bit of interest.  Otherwise the chip cannot be used
(reasonably) in a multi-cpu system.  Also, even with a single processor there
would have to be a way to force the dirty bits to get written out before the
system can look for a good page to evict.
-- 
				Bob Mabee @ Charles River Data Systems
				decvax!frog!rfm

peter@graffiti.UUCP (Peter da Silva) (11/06/85)

> also possible to use just about any register as an index register.  (I
> think SP is the only register that can't be used this way - it remains
> almost exclusively as a stack pointer.)  So it should be a lot easier
> to write an optimizing compiler - the only thing is that using
> these additional addressing options usually lengthens the instruction
> by an extra byte.

Yes, but indexing off the stack pointer is something compilers like to do:
it's a very convenient way of accessing local storage. If you can't
index off the SP you have to waste a general purpose register as a base
register for these local variables...
-- 
Name: Peter da Silva
Graphic: `-_-'
UUCP: ...!shell!{graffiti,baylor}!peter
IAEF: ...!kitty!baylor!peter

brownc@utah-cs.UUCP (Eric C. Brown) (11/08/85)

In article <414@graffiti.UUCP> peter@graffiti.UUCP (Peter da Silva) writes:
>> also possible to use just about any register as an index register.  (I
>> think SP is the only register that can't be used this way - it remains
>> almost exclusively as a stack pointer.)  So it should be a lot easier
>
>Yes, but indexing off the stack pointer is something compilers like to do:
>it's a very convenient way of accessing local storage. If you can't
>index off the SP you have to waste a general purpose register as a base
>register for these local variables...

Well, there are two problems here.  First, Intel defines two sorts of 
addressing modes, based and indexed.  ANY register (including sp) can be
used as a based register, but SP cannot be used as an indexed register.

Also, in the 808x architecture, you use BP instead of SP to access
local storage.  That way you can push stuff on the stack without having to
recalculate all the offsets of the locals.
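
A toy Python model shows why BP is the convenient choice. The class is
purely illustrative, and a real function prologue would also save the
caller's BP, which is omitted here.

```python
WORD = 2  # bytes per push on a 16-bit 808x


class Stack:
    def __init__(self, top):
        self.sp = top
        self.bp = top

    def enter(self, local_bytes):
        """Set up a frame: BP marks the frame, SP drops below the locals."""
        self.bp = self.sp
        self.sp -= local_bytes

    def push(self):
        """Push one word; only SP moves."""
        self.sp -= WORD

    def offsets(self, addr):
        """Return (offset from BP, offset from SP) for a stack address."""
        return addr - self.bp, addr - self.sp


stack = Stack(top=0x1000)
stack.enter(local_bytes=8)
local = stack.bp - 4          # address of one local variable
print(stack.offsets(local))   # (-4, 4)
stack.push()
print(stack.offsets(local))   # (-4, 6)
```

The BP-relative offset stays at -4 across the push, while the SP-relative
offset changes from 4 to 6.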

Eric C. Brown
brownc@utah-cs
...!{ihnp4, seismo}!utah-cs!brownc

mash@mips.UUCP (John Mashey) (11/12/85)

Bob Mabee writes:
> In article <965@mcnc.mcnc.UUCP> jnw@mcnc.UUCP (John White) writes:
> (summarizing Intel release on the new 386)
> >First, if all 32 entries in the cache
> >are used, then one must be freed up. As the access and dirty bits may have
> >been changed, the entry must be written back to the page table.
> 
> This is a bug (or misquoted).  The accessed and dirty bits have to be written
> back to memory immediately they are set, using a bus-lock read-alter-rewrite

Must be misquoted.  When the chip decides it needs to change one of these
bits, it reissues the request (with bus-lock), sets the bit, writes it back.
One certainly doesn't want the TLB to have a different idea of these bits
than those kept in memory, not only for the reasons above, but because:
	1) you need at least 1 more bit of state per entry in the TLB, to
	note whether or not the entry has changed.
	2) Even worse, you have to keep track of where the TLB entry CAME
	FROM, so you can write it back, which adds a lot more state per entry.
	(If N is size of cache, you need N physical addresses, whereas the
	way the chip works uses exactly 1).
Changes from non-referenced to referenced don't happen very often;
changes from clean to dirty hardly ever happen, at least measured on the
scale of TLB translations.  An excellent analysis of TLB behavior is:
Douglas Clark, Joel Emer, "Performance of the VAX-11/780 Translation
Buffer: Simulation and Measurement", ACM Trans. on Comp. Syst. 3, 1 (Feb 85), 31-62.
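
The update scheme described above can be sketched in Python as follows. The
names are invented, a threading.Lock stands in for the bus lock, and the
entry address is simply passed in to represent the result of reissuing the
table lookup. Bits 5 and 6 are the accessed and dirty positions as I
understand the 386 page table entry format; treat them as an assumption.

```python
import threading

ACCESSED = 1 << 5  # assumed bit position of the accessed bit
DIRTY = 1 << 6     # assumed bit position of the dirty bit


class PageTableMemory:
    def __init__(self):
        self.entries = {}                 # entry address -> entry value
        self.bus_lock = threading.Lock()  # stand-in for the bus lock

    def locked_set_bits(self, entry_addr, bits):
        """Read-alter-rewrite of one page table entry under the lock."""
        with self.bus_lock:
            value = self.entries[entry_addr] | bits
            self.entries[entry_addr] = value
            return value


def touch(memory, tlb, page, entry_addr, write):
    """Use a cached entry, writing new accessed/dirty bits through at once."""
    wanted = ACCESSED | (DIRTY if write else 0)
    if (tlb[page] & wanted) != wanted:
        # Rare case: update memory first, then refresh the cached copy,
        # so the cache never disagrees with the page table in memory.
        tlb[page] = memory.locked_set_bits(entry_addr, wanted)
    return tlb[page]


memory = PageTableMemory()
memory.entries[0x2004] = 0x5001        # hypothetical entry: frame 5, present
tlb = {7: memory.entries[0x2004]}      # cached copy for page 7
print(hex(touch(memory, tlb, 7, 0x2004, write=False)))  # 0x5021
print(hex(touch(memory, tlb, 7, 0x2004, write=True)))   # 0x5061
print(hex(memory.entries[0x2004]))                      # 0x5061
```

Nothing is ever written back on eviction in this scheme, so a cached entry
needs no "changed" flag and no record of where it came from.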
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

clif@intelca.UUCP (Clif Purkiser) (11/18/85)

> In article <965@mcnc.mcnc.UUCP> jnw@mcnc.UUCP (John White) writes:
> >The registers ax,bx,cx,dx,si,di,sp,bp have been expanded to 32 bits
> >(called eax,ebx ...). Each of these registers has its own set of
> >capabilities (as with the 86). (This means that for a given amount of
> >effort writing an optimizing compiler, code for a clean architecture
> >like the 68020 will be better optimized than for the 386.)
> 
> If I understand the literature correctly, each task has a default of
> either 16- or 32-bit operands and addresses.  The former would allow
> for emulating the 286, for example.  The default can be overridden by
> a special prefix, which affects only the next instruction.  Anyway,
> if the default is set for 32-bit operands and addresses, all of a
> sudden there are also quite a few more addressing options, so that the
> registers are almost truly general purpose.  For example, it is
> possible to multiply any two registers.  (Actually, this particular
> capability is available even if it is set up for the 16-bit default.)
> It is also possible to use just about any register as an index register.
> (I think SP is the only register that can't be used this way - it remains
> almost exclusively as a stack pointer.)  So it should be a lot easier
> to write an optimizing compiler - the only thing is that using
> these additional addressing options usually lengthens the instruction
> by an extra byte.
> --
> Samuel A. Figueroa, Dept. of CS, Univ. of KY, Lexington, KY  40506-0027
> ARPA: ukma!sambo<@ANL-MCS>, or sambo%ukma.uucp@anl-mcs.arpa,
>       or even anlams!ukma!sambo@ucbvax.arpa
> UUCP: {ucbvax,unmvax,boulder,oddjob}!anlams!ukma!sambo,
>       or cbosgd!ukma!sambo

I wondered if we would ever see the end of the
"386 Description:   Commercial hype or Valuable Information" debate.
It looks like the fire has finally stopped (yea).

A very minor correction to Samuel's posting.  Each segment has a default
operand size, not each task.  Therefore a task could have both 32-bit segments
and existing 16-bit 286 segments.  This would be especially useful if you had
to call existing 286 library routines from new 386 applications and didn't have
the source code for the 286 libraries to recompile them.

Intel 386 compilers and linkers support the calling conventions necessary to 
allow the linking of both types of segments together.


-- 
Clif Purkiser, Intel, Santa Clara, Ca.
HIGH PERFORMANCE MICROPROCESSORS
{pur-ee,hplabs,amd,scgvaxd,dual,idi,omsvax}!intelca!clif
	
{standard disclaimer about how these views are mine and may not reflect
the views of Intel, my boss, or USENET goes here. }

kds@intelca.UUCP (Ken Shoemaker) (11/22/85)

> In article <414@graffiti.UUCP> peter@graffiti.UUCP (Peter da Silva) writes:
> >> also possible to use just about any register as an index register.  (I
> >> think SP is the only register that can't be used this way - it remains
> >> almost exclusively as a stack pointer.)  So it should be a lot easier
> >
> >Yes, but indexing off the stack pointer is something compilers like to do:
> 
> Well, there are two problems here.  First, Intel defines two sorts of 
> addressing modes, based and indexed.  ANY register (including sp) can be
> used as a based register, but SP cannot be used as an indexed register.

Also, on the 386 you can use both bases and indexes at the same time.  If
you didn't want to use the EBP register as a frame pointer, but instead wanted
to use the stack pointer (ESP), you could use the ESP register as the base
of the area, and then as need be use any other register as a (scaled) index,
and then add an immediately specified offset.  Of course, the more typical
thing to do is still to use the EBP register as the frame pointer, and just
use the ESP register as the stack pointer.  An example instruction using
these addressing modes might look like:

	add	al,byte ptr [esp + ecx*4 + 24h]
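
For anyone who wants to see what address that operand refers to, here is a
small Python sketch of the base + scaled index + offset calculation. The
function name and the register values are made up for illustration.

```python
def effective_address(base, index=0, scale=1, disp=0):
    """Compute base + index*scale + disp, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF


# [esp + ecx*4 + 24h] with hypothetical values esp=10000h, ecx=3
print(hex(effective_address(0x10000, index=3, scale=4, disp=0x24)))  # 0x10030
```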
-- 
yes, some uncomplicated peoples still believe this myth...

Ken Shoemaker, Santa Clara, Ca.
{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds

---the above views are personal.