[net.arch] I like segmented architectures

freeman@spar.UUCP (Jay Freeman) (05/31/85)

[ libation to line-eater ]

The fuss about 64K segments looks like effort wasted beating a
dead horse:  I haven't heard anyone defending 64K segments
as the right size, or opposing them as too big.  Now that we all
agree, I suggest that there are some virtues of segmentation that
have been overlooked, having to do with memory-management and
task-switching.

In the recent Intel hardware, the full description of a segment
consists of a base address, a length, and some stuff about access
rights.  (Max length is still 64K in the 286, but is rumored to be
4G (32-bit segment registers) on the 386.)  Every memory reference
explicitly involves a segment, and whenever a segment is loaded into
a segment register, all this information about it is brought onto
the chip.  However, the information is not accessible to the user
-- its on-chip location is protected.
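
For concreteness, here is a rough C sketch of the sort of thing a
descriptor carries and the check the chip gets on each reference (the
names are invented for illustration, not Intel's):

	/* Illustrative only -- not Intel's actual descriptor layout. */
	struct seg_descr {
	        unsigned long base;     /* where the segment starts in physical memory */
	        unsigned long limit;    /* its length in bytes (64K max on the 286)    */
	        unsigned char access;   /* read/write/execute permission bits          */
	};

	/* Roughly what the chip does, in parallel, on every reference:
	   check the offset against the limit, then form the address.  */
	unsigned long seg_translate(struct seg_descr *d, unsigned long offset)
	{
	        if (offset >= d->limit)
	                return 0;       /* the hardware raises a protection trap instead */
	        return d->base + offset;
	}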

Thus, in parallel with the address calculations for memory references,
the CPU itself can decide whether the reference is legal, without
having to consult any additional hardware or software.  Given the
expense in silicon real estate to do this in parallel, there is no
speed penalty at all.

Only when the contents of a segment register are changed does an
additional decision about memory protection need to be made:  The
operating system will likely maintain, for each job, a list of segments
(all information -- base, length, access ...) that it is allowed
to load.  The operating system need not concern itself with the
details of what the jobs do with their segments -- the on-chip
operations should ensure that proper restrictions are observed.

The part of task-switching that involves memory management then
boils down to exchanging my list of allowed segments for yours.
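
In a rough sketch (invented names again -- the real mechanism is a
descriptor-table register, not a C pointer):

	struct seg_descr;               /* as sketched above */

	struct task {
	        struct seg_descr *segments;     /* this job's list of allowed segments */
	        int nsegments;
	};

	struct task *current_task;      /* whose list the hardware checks against */

	void switch_memory_context(struct task *next)
	{
	        current_task = next;    /* that's essentially all of it */
	}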

The recent Intel stuff also has a layer of indirection built into the
segment basing, so as to support virtual memory easily.

I suggest that these features are sufficiently useful, that the
allocation of silicon and the provision of software to support them, 
should not be dismissed as a trivial and obvious mistake:  True
fans of the 68XXX and 32XXX would clearly not wish to win merely by 
exploiting public ignorance of their adversary's strengths.

What's more, I love goto's and I hate comments!!! :-)
-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research)(canonical disclaimer)

henry@utzoo.UUCP (Henry Spencer) (06/03/85)

> I suggest that these features are sufficiently useful, that the
> allocation of silicon and the provision of software to support them, 
> should not be dismissed as a trivial and obvious mistake:  True
> fans of the 68XXX and 32XXX would clearly not wish to win merely by 
> exploiting public ignorance of their adversary's strengths.

Unfortunately for Intel, these are also strengths of the 68XXX and the
32XXX; it's just a little less conspicuous, because it's more transparent.
Herewith an explanation, in the context of the 32XXX (because I'm not too
familiar with the 1000 different 68XXX MMUs).

> ...  Every memory reference
> explicitly involves a segment

32XXX:  every memory reference explicitly involves a page number, although
it's not quite so obvious since you don't have to care about your memory
being split up into pages.  (Well, not much.)

> and whenever a segment is loaded into
> a segment register, all this information about it is brought onto
> the chip.  However, the information is not accessible to the user
> -- its on-chip location is protected.

32XXX:  whenever a page is being accessed frequently, all the page-table
information about it is brought into the MMU's cache.  This cache is not
accessible to the user.

> Thus, in parallel with the address calculations for memory references,
> the CPU itself can decide whether the reference is legal, without
> having to consult any additional hardware or software.  Given the
> expense in silicon real estate to do this in parallel, there is no
> speed penalty at all.

32XXX:  thus, in parallel with the address calculations for the memory
reference, the MMU can decide whether the reference is legal.  Whether
it consults anything else is nearly irrelevant; chip size is the only
reason why the 32XXX MMU is not part of the CPU chip.  National decided
that good memory management warranted sufficient silicon real estate
that putting it on the CPU chip wasn't practical for the first version.
The 32XXX does slow down when you turn on the MMU, but this translation
overhead is *always* present in the Intel cpus (although Intel has done
a better job of minimizing it, assisted by the on-chip location of their
MMU).  The only speed penalty on the 32XXX that is specifically the
result of the MMU chip being separate (as opposed to being the result of
separate decisions on bus management etc.) is the small overhead involved
in bringing signals on and off chips.

> Only when the contents of a segment register are changed does an
> additional decision about memory protection need to be made:

32XXX:  Only when the program's reference patterns change does an additional
decision about memory protection (specifically, loading of page-table entries
that are more useful given the new pattern) need to be made.

> The
> operating system will likely maintain, for each job, a list of segments
> (all information -- base, length, access ...) that it is allowed
> to load.  The operating system need not concern itself with the
> details of what the jobs do with their segments -- the on-chip
> operations should ensure that proper restrictions are observed.

32XXX:  The page table constitutes the per-job list of all pages that
the job is allowed to access.  The operating system is not involved in
the details of what the jobs do with their pages -- the MMU hardware
ensures that the proper restrictions are observed.

> The part of task-switching that involves memory management then
> boils down to exchanging my list of allowed segments for yours.

32XXX:  The part of task-switching that involves memory management then
boils down to changing one register, the master page-table pointer in
the MMU, which is essentially the list of allowed pages.
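
In rough C terms (names invented; the page size is picked just for the
example, and the real MMU walks a multi-level table and caches the
results on-chip):

	#define PAGE_SHIFT      9               /* 512-byte pages, say */
	#define PAGE_SIZE       (1 << PAGE_SHIFT)

	struct pte {                            /* one page-table entry */
	        unsigned long frame;            /* physical page-frame number */
	        unsigned int valid, writable;   /* presence/protection bits   */
	};

	struct pte *master_page_table;          /* the one register the OS changes
	                                           at task-switch time           */

	unsigned long page_translate(unsigned long vaddr)
	{
	        struct pte *p = &master_page_table[vaddr >> PAGE_SHIFT];
	        if (!p->valid)
	                return 0;               /* the MMU traps instead */
	        return (p->frame << PAGE_SHIFT) | (vaddr & (PAGE_SIZE - 1));
	}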

> The recent Intel stuff also has a layer of indirection built into the
> segment basing, so as to support virtual memory easily.

32XXX:  the layer of indirection has, of course, been there all along.
The cache on the 32XXX MMU is managed automatically by the hardware,
rather than requiring the programmer (or his compiler) to do the job
itself by explicitly loading segment registers.

> What's more, I love goto's and I hate comments!!! :-)

Spoken like a true 8086 programmer!!! :-)
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

freeman@spar.UUCP (Jay Freeman) (06/05/85)

[oh glorious line-eater, accept this humble sacrifice]

How delightful to see a posting that generates more light than heat!
(Henry Spencer's, cited below).  Pray allow me to continue my perverse
Devil's advocacy of brand I segmented architectures ...

In article <5653@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes:
(responding to an earlier posting of mine)

>> I suggest that these features are sufficiently useful, that the
>> allocation of silicon and the provision of software to support them, 
>> should not be dismissed as a trivial and obvious mistake:  True
>> fans of the 68XXX and 32XXX would clearly not wish to win merely by 
>> exploiting public ignorance of their adversary's strengths.
>
>Unfortunately for Intel, these are also strengths of the 68XXX and the
>32XXX; it's just a little less conspicuous, because it's more transparent.
>Herewith an explanation, in the context of the 32XXX (because I'm not too
>familiar with the 1000 different 68XXX MMUs).

>... [and so on]

Our exchange of comments appears to establish that the X86 and 32XXX have
generally comparable memory-management and task-switching capabilities,
implemented by different approaches to hardware -- segment registers and so
forth on-chip in the case of the Intel stuff, more conventional
memory-management in a separate chip in the case of the 32XXX.

It would be a shame to let a nice argument die for so inadequate a reason as
mutual agreement, so let me pick up on a few points:

>>                                                        Given the
>> expense in silicon real estate to do this in parallel, there is no
>> speed penalty at all.
>
>The 32XXX does slow down when you turn on the MMU, but this translation
>overhead is *always* present in the Intel cpus (although Intel has done
>a better job of minimizing it, assisted by the on-chip location of their
>MMU).  The only speed penalty on the 32XXX that is specifically the
>result of the MMU chip being separate (as opposed to being the result of
>separate decisions on bus management etc.) is the small overhead involved
>in bringing signals on and off chips.

I don't think it is quite that simple:  On-chip data-flow is typically
notably faster than inter-chip information transfer, both because of
more parallelism and because the designer presumably has better control
of what's going on there.  Certainly the 286 comes up to bus bandwidth
with no difficulty -- when doing memory-to-memory string moves a word at
a time, the rate of data flow is one byte per CPU clock (strictly, one
two-byte memory-to-memory move every two CPU clocks) (NB:  That would
have been "every four CPU clocks" for the 8086 -- the 286 really does use
fewer clocks.)  Thus if Intel is in fact presently shipping 12 MHz 286's
(recent ads suggest as much), the corresponding data transmission rate is 12
Mbytes / second, notwithstanding all the segment checking.  How does that 
compare with the speed of the latest 32XXX's, with MMU engaged?  The fact 
that the 286 can achieve this speed suggests that the segment stuff is 
really very fast. (And it sure would be nice if that bus were 32 bits,
wouldn't it?)

(Buried in here is a rather complicated issue pertaining to chip speed,
namely what is bottlenecking the clock:  if the segment-checking logic
should turn out to be the limiting factor, then one would have to look
very carefully at the tradeoff of omitting that feature in exchange for
a higher overall clock rate.  But I don't know, of course.)

(And I trust we all agree that neither the X86 nor the 32XXX nor the 68XXX
are RISC machines; and that the issue of RISC versus more traditional
architectures is worthy of a whole separate tempest in its very own teapot.)

>                          ... managed automatically by the hardware,
>rather than requiring the programmer (or his compiler) to do the job
>itself by explicitly loading segment registers.

This is a two-edged sword:  Segment registers look just like address
registers, and allow richer addressing modes when explicitly manipulated.
That is, an X86 address is obtained by adding the contents of up to three
registers, plus an "immediate" offset.  (Lots of people will say that fancy
addressing modes are a pain, and they may be right -- here is the RISC
issue again.)
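
Roughly, in C terms (names invented, and leaving out the limit check
already described):

	/* X86 effective-address formation, more or less: the segment base
	   plus an optional base register, an optional index register, and
	   an immediate displacement from the instruction itself.           */
	unsigned long x86_address(unsigned long seg_base,       /* from CS/DS/ES/SS */
	                          unsigned long base_reg,       /* e.g. BX or BP    */
	                          unsigned long index_reg,      /* e.g. SI or DI    */
	                          unsigned long disp)           /* immediate offset */
	{
	        unsigned long offset = base_reg + index_reg + disp;
	        return seg_base + offset;       /* offset also gets the limit check */
	}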

And if you don't like the segment registers, surely it's not too much
trouble to set them once and forget them.  (I have indeed admitted, in my first
posting, that 64K segments are too small.)

By the way, where are the 68XXX fans?  Surely the Motorola chip isn't
so bad that no one can find a defense for it. :-)  Why, with all those
pins it makes a great brush for stuffed animals and plush furniture.  :-)

Real programmers can't SPELL quiche!! :-)

-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research)(canonical disclaimer)

bwc@ganehd.UUCP (Brantley Coile) (06/05/85)

One issue I haven't seen mentioned (or perhaps I have missed it) is
that in the '86 family of processors the memory mapping is known
to the user program.  On most systems only the supervisor knows about
the memory mapping, and the user program just thinks it has
so much space.  Even the separate I and D space on a pdp-11
didn't require the user program to know about it.  The iNTEL segments
are just big versions of the 4k addressing problem on the IBM mainframes 
except on the IBM you can use 24bit pointers.


-- 

		Brantley Coile CCP
		..!akgua!ganehd!bwc
		Northeast Health District, Athens, Ga

jer@peora.UUCP (J. Eric Roskos) (06/07/85)

> By the way, where are the 68XXX fans?  Surely the Motorola chip isn't
> so bad that no one can find a defense for it.

It's kind of hard to compare the 68000's MMU, which functions in a very
familiar, traditional way (the same way MMUs on many "mainframe" machines
work), with the very strange segmentation facilities of the 286.

Here you've complained again that "64K segments are too small".  Now, I
have a feeling part of the problem I see here is in our definition of
"segments", which varies widely.  But I don't think it is the smallness of
the "segments" that is the problem.

The 8086's way of handling segmentation is not like that of many more
familiar machines, 68000 included.  In what I will call the "conventional"
memory management units, the address field in the instruction is partitioned
into subfields, like this:

	AABBBB

(where each digit represents, let us say, 4 bits, for concreteness).  The
bits AA are used to select an entry in an address translation table in the
MMU, which replaces the bit string AA in the original ("virtual") address
with some bit string CCCCCC in the generated ("physical") address.  The
result is some address

	CCCCCCBBBB

Now, there is usually also a size associated with the block of physical
memory pointed to by CCCCCC in the AAth entry of this translation table, so
the value of BBBB is checked against this number to be sure it is in range.
Assuming it is, we have generated our physical address, and can go on to
checking the other bits in that table, which tell whether we are allowed to
read, write, or execute the location, whether or not it is in memory at
present, whether it has been modified, etc.

More sophisticated memory management units further partition AA (or add
more bits), so that the high order bits select one or more pointer tables
which themselves point to other translation tables that are used for the
next-lower-order field of bits, etc., but the mechanism is the same.

Notice that in this scheme, the size of BBBB doesn't really matter so much.
The total amount of space you can address in an instruction (without
changing an external register) is the number of distinct addresses that may
be represented by AABBBB, which is the number of bits in the instruction's
address field for the operand.
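
In rough C, with the sizes chosen to match the example (an 8-bit AA
field, a 16-bit BBBB field; the table layout itself is invented for
illustration):

	struct map_entry {
	        unsigned long frame;    /* the CCCCCC bits                */
	        unsigned long size;     /* how much of the block is valid */
	        unsigned int perms;     /* read/write/execute bits, etc.  */
	};

	struct map_entry map[256];      /* indexed by AA */

	/* Assume unsigned long is wide enough for a physical address. */
	unsigned long conventional_translate(unsigned long vaddr)
	{
	        unsigned int aa = (vaddr >> 16) & 0xff;         /* the AA field   */
	        unsigned long bbbb = vaddr & 0xffff;            /* the BBBB field */

	        if (bbbb >= map[aa].size)
	                return 0;                               /* would trap     */
	        return (map[aa].frame << 16) | bbbb;            /* CCCCCCBBBB     */
	}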

Now, on the other hand, we have the Intel approach.  Intel gives us an
instruction address field

	BBBB

and a segmentation register field

	AAAA

We get our physical address from this via

	AAAA0
	+BBBB

Now, in the 286, we have an "improvement" in that AAAA is put into a
memory management unit table, just as in the "conventional" architecture,
but it is still added to the instruction's address field rather than
being concatenated with it.  And where does the index into the memory management unit's
table come from?  Why, it comes from what used to be the segmentation
register!  So, rather than deriving the index from part of the
instruction's address field, it comes from a separate register, which
must be set via a MOV (or via an instruction which loads a segment register
and an index register in one instruction from consecutive memory words)
each time you want to change it.
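
Put side by side in the same rough C (names and the table size are
invented for the sketch):

	/* 8086: the segment register holds AAAA directly; shift it one hex
	   digit and ADD the 16-bit offset BBBB from the instruction.        */
	unsigned long phys_8086(unsigned int aaaa, unsigned int bbbb)
	{
	        return ((unsigned long) aaaa << 4) + bbbb;      /* AAAA0 + BBBB */
	}

	/* 286: the segment register now holds an index into a descriptor
	   table, but the base found there is still ADDED to the offset,
	   and the register still has to be loaded explicitly.              */
	struct descr { unsigned long base, limit; };
	struct descr descr_table[8192];

	unsigned long phys_286(unsigned int selector, unsigned int bbbb)
	{
	        struct descr *d = &descr_table[selector];
	        if (bbbb >= d->limit)
	                return 0;                               /* would trap */
	        return d->base + bbbb;
	}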

This is what the compiler writers have trouble with.  The index into the
memory management tables for the 286 is NOT derived from the instruction's
address field, transparently as part of the "virtual" address.  It has
to be explicitly LOADED into a segmentation register.  And the enhancements
made thus far to the architecture don't improve this; they just add bits to
the field in the memory management unit tables.  The reason this is such
a problem is that if you are generating code that involves data larger than
64K, you have to keep up with what value you last loaded into the segmentation
register, so that you can change it if you have to access something that is
not in range.  And, as the flow of control in the program for which you
are generating code becomes more complex, deciding when you need to change
the contents of the segmentation register becomes enormously difficult.
A compiler that could make this sort of flow analysis for the 8086 family
of machines could also do a substantial amount of optimization, with the
result that the same compiler for a 68000 would also achieve substantially
better results.  But at present, there are not really any compilers out
there like that.  It is rumored that some are in the works; some companies
have claimed to produce them already, but the optimization methods thus far
have been mostly "peephole" optimizations.  The difficulty of writing an
optimizer of the sort required is probably larger than that of writing
the compiler.

And optimization is really the issue here.  A fully unoptimized 8086-family
program would load the segmentation register before EVERY operand access.
The task of the optimizer is to decide when it doesn't have to.
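
A toy sketch of that decision, in C, with the emitted "code" reduced to
print statements (the routine names and output are entirely made up):

	#include <stdio.h>

	static int loaded_segment = -1;         /* -1: contents unknown */

	static void emit_load_segment(int seg)  { printf("\tload segment %d\n", seg); }
	static void emit_access(long offset)    { printf("\taccess offset %ld\n", offset); }

	/* Emit a segment-register load only when the segment needed for this
	   operand differs from the one known to be loaded already.  The hard
	   part (not shown) is keeping loaded_segment honest once the flow of
	   control branches and merges; at a join point a naive compiler must
	   set it back to -1.                                                  */
	void gen_access(int segment, long offset)
	{
	        if (segment != loaded_segment) {
	                emit_load_segment(segment);
	                loaded_segment = segment;
	        }
	        emit_access(offset);
	}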

And that is the basic problem.
-- 
Full-Name:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642

	    "Zl FB vf n xvyyre junyr."

nather@utastro.UUCP (Ed Nather) (06/08/85)

> -- 
> Full-Name:  J. Eric Roskos
> UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
> US Mail:    MS 795; Perkin-Elmer SDC;
> 	    2486 Sand Lake Road, Orlando, FL 32809-7642

Gad!  A reasoned, understandable comparison of architectures without flames,
innuendo or dead-horse-flogging.  What is this newsgroup coming to?

Thank you, Eric.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather%utastro.UTEXAS@ut-sally.ARPA

mann@LaBrea.ARPA (06/08/85)

> > By the way, where are the 68XXX fans?  Surely the Motorola chip isn't
> > so bad that no one can find a defense for it.
> 
> It's kind of hard to compare the 68000's MMU, which functions in a very
> familiar, traditional way (the same way MMUs on many "mainframe" machines
> work), with the very strange segmentation facilities of the 286.

I felt J. Eric Roskos's message was a good comparison of 8086-style
segmented architecture with more conventional linear address space models,
and I'm essentially in agreement that linear address spaces are superior
overall.  But I'm curious as to what he's referring to when he talks about
the 68000's MMU above.  The 680X0 doesn't have an on-chip MMU -- which is in
a sense one of its strengths, since the chips do support a large linear
address space, and those who use them are free to build any style of
outboard MMU for such an address space.  Most have chosen to do fairly
conventional paged MMUs (for instance, Sun).  But there is no such MMU in
the 68XXX family (yet).  The 68451 is utterly ridiculous garbage (please
look into how it works before flaming back at me).  The more recent promised
MMUs from Motorola and Signetics have not come out yet.

	--Tim

freeman@spar.UUCP (Jay Freeman) (06/10/85)

[libation to line-eater]

In article <1031@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:

> [a thoughtful article comparing Intel-style segmentation
>  with more traditional approaches to memory management -- JF]

>The 8086's way of handling segmentation is not like that of many more
>familiar machines, 68000 included.  [discussion of address-calculation by
>concatenation of bit-fields (traditional approach) versus shift-and-add
>(Intel approach) follows -- JF]

If shift-and-add segmentation were made transparent to the user, by putting
it in an inaccessible MMU, how would it then compare with the concatenation
technique?

>This is what the compiler writers have trouble with [...] deciding when
>you need to change the contents of the segmentation register becomes
>enormously difficult.
>
>And optimization is really the issue here.  A fully unoptimized 8086-family
>program would load the segmentation register before EVERY operand access.
>The task of the optimizer is to decide when it doesn't have to.
>
>And that is the basic problem.

Well put.  I had thought that flow-analysis was in fact state-of-the-art
for compilers (though admittedly still a thorny problem).  If not, there is
clearly a problem with all forms of messy addressing.

I view user-controlled segmentation more as a special kind of addressing
than a memory-management scheme (though it clearly can help with the latter).
I see it as providing support for abstract data types and object-oriented
programming, at a very low level -- you put your specialized object in its
own segment, and rely on the hardware to trap if you attempt to access it
incorrectly.  Or you put different instantiations of a particular flavor of
object in their own segments, and just change a segment register to get at
one instantiation instead of another.

Segments can be made much smaller than the typical page size of a more
traditional MMU.  Thus access rights to objects can be specified with much
higher resolution -- down to individual procedures and (small) data
structures.
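
The idea in miniature, with software standing in for the hardware check
(everything here is invented):

	struct instance {                       /* one object, one small segment */
	        unsigned long base;             /* where its segment starts       */
	        unsigned long limit;            /* its size -- can be quite small */
	};

	/* "Loading the segment register" amounts to choosing which instance
	   subsequent offsets refer to; the check below is what the segment
	   hardware would give you for free.                                  */
	int access_ok(struct instance *obj, unsigned long offset)
	{
	        return offset < obj->limit;     /* hardware would trap on failure */
	}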

And I need these tools:  It's so hard to do object-oriented programming in my
favorite languages -- '66 FORTRAN, COBOL and Tiny BASIC.  :-)
-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research)(canonical disclaimer)

jer@peora.UUCP (J. Eric Roskos) (06/11/85)

> But I'm curious as to what he's referring to when he talks about
> the 68000's MMU above.  The 680X0 doesn't have an on-chip MMU ...

I was referring to the separate Motorola MMU part, the 68451, which you
called "utterly ridiculous garbage".  My data sheet on it is at home, so
it is kind of hard to see what you are referring to (all I have is the
68000 User's Manual here, which has a brief summary of the MMU at the back).

What do you feel is wrong with the 68451?  (Besides the fact that, last time
I looked, it cost more than the 68000, I mean.)
-- 
Full-Name:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642

	   "Gnyx gb gur fhayvtug, pnyyre..."

doug@terak.UUCP (Doug Pardee) (06/11/85)

> The iNTEL segments
> are just big versions of the 4k addressing problem on the IBM mainframes 
> except on the IBM you can use 24bit pointers.

Once again I must point out that the IBM 360/370/30xx architecture is
*not* segmented.  Addresses run linearly from 0 to 16M-1 for non-XA
systems, 0 to 2G-1 for XA.

What *was* botched was that there is no PC-relative addressing mode for
branching.  Since there is also no direct addressing mode available,
this means that in order to branch, you first have to load a register
with the address of the destination.  Or, the more common approach, keep
a register loaded with a procedure code address and then use the
(register + offset) addressing mode.  The offset is limited to 0-4095,
giving rise to the erroneous claims that the IBM architecture has 4K
segments.

The "proper" way around this is to limit your routines to 4096 bytes in
length, and when you call a subroutine you should load its address into
a register to address it rather than using the (register + offset) mode.
Then the subroutine can use that register to address all of its own
procedure code using the (register + offset) addressing mode.
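
In outline (rough C, not real 370 code; names invented):

	/* S/360-style operand addressing: a base register plus a 12-bit
	   displacement, so one base register "covers" 4096 bytes of a
	   24-bit linear address space.                                   */
	unsigned long ibm_address(unsigned long base_reg, unsigned int disp)
	{
	        /* disp is limited to 0..4095 by the instruction format */
	        return (base_reg + disp) & 0xffffff;    /* 24-bit address */
	}

	/* The convention described above: the caller loads the callee's
	   entry address into a register; the callee, kept under 4096
	   bytes, then reaches all of its own code through that one base
	   register.                                                      */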

The reason this is important is that while the Intel architecture
is a nuisance for large amounts of data, its procedure code is still
quite manageable; the IBM is the other way around.  Many IBM systems
process unbelievable quantities of in-memory data (that's why they had
to expand the addressing to allow each process to have more than 16 Mb
of data).  Arrays of 1 Gb in size are "no sweat" on an IBM XA machine.
But addressing branch destinations requires planning.
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp

mann@LaBrea.ARPA (06/15/85)

Wow!  I guess it's time for me to pay the price for flaming by trying to
back up my opinion.

First let me say I called the 68451 "utterly ridiculous garbage" in a moment
of weakness.  Normally I don't flame like that, but I've been reading
laser-lovers lately, and suddenly the urge overcame me.  Anyway, my real
opinion of the 68451 is that it's a lot better than no memory management unit
at all, but it has some serious problems.  Most were pointed out in previous
postings, by people who said in effect "well, of course it has the following
problems, but I don't understand why people hate it so much."  I'll
summarize the problems below.

The basic design of the 68451 forces the operating system to manage physical
(as well as virtual) memory in 2^k sized chunks (called segments), where k
is variable (but must be an integer!).  A 2^k sized chunk must start on an
address that is an exact multiple of 2^k.  You can't choose a uniform small
k (like 10) and simulate a page-oriented MMU because you rapidly run out of
segment descriptors (there are only 32).  The operating system code needed
to manage physical memory in this way is much more complex than what is
needed with a paging MMU.  It is also difficult to avoid wasting a lot of
physical memory in fragmentation and/or spending a lot of time copying data
from one part of physical memory to another when an address space grows.

I'll try to make this clear with an example.  Let's say you are implementing
a Unix-style model of process address space, where there is an upper bound
that is allowed to grow dynamically.  If you try to minimize internal
fragmentation by allocating very little more actual memory than the process
is using at any given time, you use up a lot of segment descriptors since
you need one for each 1 bit in the binary representation of the memory size.
Then let's say you try to grow the space.  Each time you grow it, you have
the choice of doing it by tacking on a small chunk to the end (chewing up
another segment descriptor), or replacing one or more of the existing chunks
by larger ones.  Replacing some chunk(s) by a larger one requires copying
their data to a new place in physical memory, unless you had the good fortune
to find that the existing chunks were contiguous and started on the right
boundary in physical space, and the space following them was free.
One could put in some heuristics to make this condition more likely, but
these make the OS memory management code yet more complicated.
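
To put a number on it, here is the counting argument as a little C
sketch (just the arithmetic implied above, not 68451 code):

	/* A 68451 segment must be 2^k bytes long and 2^k-aligned, so covering
	   an address space exactly takes one descriptor per 1 bit in the
	   binary representation of its size.                                  */
	int descriptors_needed(unsigned long size_in_bytes)
	{
	        int n = 0;
	        while (size_in_bytes != 0) {
	                n += size_in_bytes & 1;
	                size_in_bytes >>= 1;
	        }
	        return n;
	}

	/* E.g. a 176K space (128K + 32K + 16K) takes 3 descriptors; grow it
	   by 4K and you either spend a 4th descriptor or copy data around so
	   the space can be re-covered with fewer, larger chunks.             */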

An advertised benefit of the 68451 is that it allows fast context switching
because you can keep the segments for multiple address spaces in the chip at
the same time.  Of course, with only 32 segment descriptors to go around and
each address space needing several, you run out rather quickly if you try to
keep them all in there.  So you need to swap them in and out, perhaps
keeping the N most recently used sets in the chip in hopes of minimizing the
amount of descriptor swapping.

Protection bits are also on a per-segment basis, so if you have different
areas in your process that need different protections, you chew up more
segment descriptors -- each differently-protected area needs several segment
descriptors if its size is not an exact power of two.

The fact that the 68451 introduces TWO wait states into memory access is a
serious black mark against it in my book, too.

Of course, it's possible to live with all of this.  And user programs still
see a nice, simple, linear address space.  But I wouldn't want to try to do
demand paging with the 68010 on top of the 68451 -- demand paging is
complicated enough with a more rational MMU.

What are my qualifications for saying all this?  I've written all the kernel
memory management code for the Sun-1, Sun-2 and Iris PM-II versions of the
Stanford V kernel.  (See IEEE Software, April 1984 for an article about the
V kernel.)  These machines all happen to be 68000/68010-based systems with
custom MMUs built with mostly TTL.  The current V kernel implements multiple
address spaces with variable upper limits, and multiple processes per
address space.  There is currently no demand paging -- all code and data must
be resident at all times -- but I will be implementing that soon as a
byproduct of some of my thesis research.  (Adding swapping to the current
system would be trivial.)  I formed my opinions about the 68451 in the
process of trying to explain the current memory management to some folks who
are trying to port the kernel to a system with a 68451, and helping them
find solutions to the problems they ran into.

	--Tim