freeman@spar.UUCP (Jay Freeman) (05/31/85)
References: [ libation to line-eater ]

The fuss about 64K segments looks like effort wasted beating a dead horse: I haven't heard anyone defending 64K segments as the right size, or opposing them as too big. Now that we all agree, I suggest that there are some virtues of segmentation that have been overlooked, having to do with memory-management and task-switching.

In the recent Intel hardware, the full description of a segment consists of a base address, a length, and some stuff about access rights. (Max length is still 64K in the 286, but is rumored to be 4G (32-bit segment registers) on the 386.) Every memory reference explicitly involves a segment, and whenever a segment is loaded into a segment register, all this information about it is brought onto the chip. However, the information is not accessible to the user -- its on-chip location is protected.

Thus, in parallel with the address calculations for memory references, the CPU itself can decide whether the reference is legal, without having to consult any additional hardware or software. Given the expense in silicon real estate to do this in parallel, there is no speed penalty at all.

Only when the contents of a segment register are changed does an additional decision about memory protection need to be made: the operating system will likely maintain for each job a list of segments (all information -- base, length, access ...) that it is allowed to load. The operating system need not concern itself with the details of what the jobs do with their segments -- the on-chip operations should ensure that proper restrictions are observed. The part of task-switching that involves memory management then boils down to exchanging my list of allowed segments for yours.

The recent Intel stuff also has a layer of indirection built into the segment basing, so as to support virtual memory easily.
I suggest that these features are sufficiently useful, that the allocation of silicon and the provision of software to support them, should not be dismissed as a trivial and obvious mistake: True fans of the 68XXX and 32XXX would clearly not wish to win merely by exploiting public ignorance of their adversary's strengths. What's more, I love goto's and I hate comments!!! :-) -- Jay Reynolds Freeman (Schlumberger Palo Alto Research)(canonical disclaimer)
henry@utzoo.UUCP (Henry Spencer) (06/03/85)
> I suggest that these features are sufficiently useful, that the
> allocation of silicon and the provision of software to support them,
> should not be dismissed as a trivial and obvious mistake: True
> fans of the 68XXX and 32XXX would clearly not wish to win merely by
> exploiting public ignorance of their adversary's strengths.

Unfortunately for Intel, these are also strengths of the 68XXX and the 32XXX; it's just a little less conspicuous, because it's more transparent. Herewith an explanation, in the context of the 32XXX (because I'm not too familiar with the 1000 different 68XXX MMUs).

> ... Every memory reference explicitly involves a segment

32XXX: every memory reference explicitly involves a page number, although it's not quite so obvious since you don't have to care about your memory being split up into pages. (Well, not much.)

> and whenever a segment is loaded into a segment register, all this
> information about it is brought onto the chip. However, the
> information is not accessible to the user -- its on-chip location
> is protected.

32XXX: whenever a page is being accessed frequently, all the page-table information about it is brought into the MMU's cache. This cache is not accessible to the user.

> Thus, in parallel with the address calculations for memory references,
> the CPU itself can decide whether the reference is legal, without
> having to consult any additional hardware or software. Given the
> expense in silicon real estate to do this in parallel, there is no
> speed penalty at all.

32XXX: thus, in parallel with the address calculations for the memory reference, the MMU can decide whether the reference is legal. Whether it consults anything else is nearly irrelevant; chip size is the only reason why the 32XXX MMU is not part of the CPU chip. National decided that good memory management warranted sufficient silicon real estate that putting it on the CPU chip wasn't practical for the first version.

The 32XXX does slow down when you turn on the MMU, but this translation overhead is *always* present in the Intel CPUs (although Intel has done a better job of minimizing it, assisted by the on-chip location of their MMU). The only speed penalty on the 32XXX that is specifically the result of the MMU chip being separate (as opposed to being the result of separate decisions on bus management etc.) is the small overhead involved in bringing signals on and off chips.

> Only when the contents of a segment register are changed, does an
> additional decision about memory protection need to be made:

32XXX: Only when the program's reference patterns change does an additional decision about memory protection (specifically, loading of page-table entries that are more useful given the new pattern) need to be made.

> The operating system will likely maintain for each job, a list of segments
> (all information -- base, length, access ...) that it is allowed
> to load. The operating system need not concern itself with the
> details of what the jobs do with their segments -- the on-chip
> operations should ensure that proper restrictions are observed.

32XXX: The page table constitutes the per-job list of all pages that the job is allowed to access. The operating system is not involved in the details of what the jobs do with their pages -- the MMU hardware ensures that the proper restrictions are observed.

> The part of task-switching that involves memory management then
> boils down to exchanging my list of allowed segments for yours.

32XXX: The part of task-switching that involves memory management then boils down to changing one register, the master page-table pointer in the MMU, which is essentially the list of allowed pages.

> The recent Intel stuff also has a layer of indirection built into the
> segment basing, so as to support virtual memory easily.

32XXX: the layer of indirection has, of course, been there all along.
The cache on the 32XXX MMU is managed automatically by the hardware, rather than requiring the programmer (or his compiler) to do the job itself by explicitly loading segment registers.

> What's more, I love goto's and I hate comments!!! :-)

Spoken like a true 8086 programmer!!! :-)
-- 
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
freeman@spar.UUCP (Jay Freeman) (06/05/85)
[oh glorious line-eater, accept this humble sacrifice]

How delightful to see a posting that generates more light than heat! (Henry Spencer's, cited below.) Pray allow me to continue my perverse Devil's advocacy of brand I segmented architectures ...

In article <5653@utzoo.UUCP>, henry@utzoo.UUCP (Henry Spencer) writes (responding to an earlier posting of mine):

>> I suggest that these features are sufficiently useful, that the
>> allocation of silicon and the provision of software to support them,
>> should not be dismissed as a trivial and obvious mistake: True
>> fans of the 68XXX and 32XXX would clearly not wish to win merely by
>> exploiting public ignorance of their adversary's strengths.
>
>Unfortunately for Intel, these are also strengths of the 68XXX and the
>32XXX; it's just a little less conspicuous, because it's more transparent.
>Herewith an explanation, in the context of the 32XXX (because I'm not too
>familiar with the 1000 different 68XXX MMUs).
>... [and so on]

Our exchange of comments appears to establish that the X86 and 32XXX have generally comparable memory-management and task-switching capabilities, implemented by different approaches to hardware -- segment registers and so forth on-chip in the case of the Intel stuff, more conventional memory management in a separate chip in the case of the 32XXX. It would be a shame to let a nice argument die for so inadequate a reason as mutual agreement, so let me pick up on a few points:

>> Given the
>> expense in silicon real estate to do this in parallel, there is no
>> speed penalty at all.
>
>The 32XXX does slow down when you turn on the MMU, but this translation
>overhead is *always* present in the Intel cpus (although Intel has done
>a better job of minimizing it, assisted by the on-chip location of their
>MMU). The only speed penalty on the 32XXX that is specifically the
>result of the MMU chip being separate (as opposed to being the result of
>separate decisions on bus management etc.) is the small overhead involved
>in bringing signals on and off chips.

I don't think it is quite that simple: on-chip data flow is typically notably faster than inter-chip information transfer, both because of more parallelism and because the designer presumably has better control of what's going on there. Certainly the 286 comes up to bus bandwidth with no difficulty -- when doing memory-to-memory string moves a word at a time, the rate of data flow is one byte per CPU clock (strictly, one two-byte memory-to-memory move every two CPU clocks). (NB: That would have been "every four CPU clocks" for the 8086 -- the 286 really does use fewer clocks.) Thus if Intel is in fact presently shipping 12 MHz 286's (late ads so suggest), the corresponding data transmission rate is 12 Mbytes/second, notwithstanding all the segment checking. How does that compare with the speed of the latest 32XXX's, with MMU engaged? The fact that the 286 can achieve this speed suggests that the segment stuff is really very fast. (And it sure would be nice if that bus were 32 bits, wouldn't it?)

(Buried in here is a quite complicated issue pertaining to chip speed, having to do with what is bottlenecking the clock: if it should turn out that that were the segment-checking stuff, then one would have to look very carefully at the tradeoff obtained by omitting this feature and obtaining higher speed overall. But I don't know, of course.)

(And I trust we all agree that neither the X86 nor the 32XXX nor the 68XXX are RISC machines; and that the issue of RISC versus more traditional architectures is worthy of a whole separate tempest in its very own teapot.)

> ... managed automatically by the hardware,
>rather than requiring the programmer (or his compiler) to do the job
>itself by explicitly loading segment registers.

This is a two-edged sword: segment registers look just like address registers, and allow richer addressing modes when explicitly manipulated.
That is, an X86 address is obtained by adding the contents of up to three registers, plus an "immediate" offset. (Lots of people will say that fancy addressing modes are a pain, and they may be right -- here is the RISC issue again.) And if you don't like the segment registers, surely it's not too much trouble to set them once and forget them. (I have indeed admitted, in my first posting, that 64K segments are too small.)

By the way, where are the 68XXX fans? Surely the Motorola chip isn't so bad that no one can find a defense for it. :-) Why, with all those pins it makes a great brush for stuffed animals and plush furniture. :-) Real programmers can't SPELL quiche!! :-)
-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research) (canonical disclaimer)
bwc@ganehd.UUCP (Brantley Coile) (06/05/85)
One issue I haven't seen mentioned, or I have missed it, is that in the '86 family of processors the memory mapping is known to the user program. On most systems only the supervisor knows about the memory mapping, and the user program just thinks it has so much space. Even the separate I and D space on a pdp-11 didn't require the knowledge of the user program. The iNTEL segments are just big versions of the 4K addressing problem on the IBM mainframes, except on the IBM you can use 24-bit pointers.
-- 
Brantley Coile CCP   ..!akgua!ganehd!bwc
Northeast Health District, Athens, Ga
jer@peora.UUCP (J. Eric Roskos) (06/07/85)
> By the way, where are the 68XXX fans? Surely the Motorola chip isn't
> so bad that no one can find a defense for it.

It's kind of hard to compare the 68000's MMU, which functions in a very familiar, traditional way (the same way MMUs on many "mainframe" machines work), with the very strange segmentation facilities of the 286.

Here you've complained again that "64K segments are too small". Now, I have a feeling part of the problem I see here is in our definition of "segments", which varies widely. But I don't think it is the smallness of the "segments" that is the problem. The 8086's way of handling segmentation is not like that of many more familiar machines, 68000 included.

In what I will call the "conventional" memory management units, the address field in the instruction is partitioned into subfields, like this:

  AABBBB

(where each digit represents, let us say, 4 bits, for concreteness). The bits AA are used to select an entry in an address translation table in the MMU, which replaces the bit string AA in the original ("virtual") address with some bit string CCCCCC in the generated ("physical") address. The result is some address

  CCCCCCBBBB

Now, there is usually also a size associated with the block of physical memory pointed to by CCCCCC in the AAth entry of this translation table, so the value of BBBB is checked against this number to be sure it is in range. Assuming it is, we have generated our physical address, and can go on to checking the other bits in that table, which tell whether we are allowed to read, write, or execute the location, whether or not it is in memory at present, whether it has been modified, etc.

More sophisticated memory management units further partition AA (or add more bits), so that the high-order bits select one or more pointer tables which themselves point to other translation tables that are used for the next-lower-order field of bits, etc., but the mechanism is the same.

Notice that in this scheme, the size of BBBB doesn't really matter so much. The total amount of space you can address in an instruction (without changing an external register) is the number of distinct addresses that may be represented by AABBBB, which is the number of bits in the instruction's address field for the operand.

Now, on the other hand, we have the Intel approach. Intel gives us an instruction address field BBBB and a segmentation register field AAAA. We get our physical address from this via

    AAAA0
  +  BBBB

Now, in the 286, we have an "improvement" in that AAAA is put into a memory management unit table, just like in the "conventional" architecture, but it is still added to the instruction's address field rather than concatenating it. And where does the index into the memory management unit's table come from? Why, it comes from what used to be the segmentation register!

So, rather than deriving the index from part of the instruction's address field, it comes from a separate register, which must be set via a MOV (or via an instruction which loads a segment register and an index register in one instruction from consecutive memory words) each time you want to change it.

This is what the compiler writers have trouble with. The index into the memory management tables for the 286 is NOT derived from the instruction's address field, transparently as part of the "virtual" address. It has to be explicitly LOADED into a segmentation register. And the enhancements made thus far to the architecture don't improve this; they just add bits to the field in the memory management unit tables.

The reason this is such a problem is that if you are generating code that involves data larger than 64K, you have to keep up with what value you last loaded into the segmentation register, so that you can change it if you have to access something that is not in range.
And, as the flow of control in the program for which you are generating code becomes more complex, deciding when you need to change the contents of the segmentation register becomes enormously difficult. A compiler that could make this sort of flow analysis for the 8086 family of machines could also do a substantial amount of optimization, with the result that the same compiler for a 68000 would also achieve substantially better results. But at present, there are not really any compilers out there like that. It is rumored that some are in the works; some companies have claimed to produce them already, but the optimization methods thus far have been mostly "peephole" optimizations. The difficulty of writing an optimizer of the sort required is probably larger than that of writing the compiler.

And optimization is really the issue here. A fully unoptimized 8086-family program would load the segmentation register before EVERY operand access. The task of the optimizer is to decide when it doesn't have to. And that is the basic problem.
-- 
Full-Name:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
            2486 Sand Lake Road, Orlando, FL 32809-7642
"Zl FB vf n xvyyre junyr."
nather@utastro.UUCP (Ed Nather) (06/08/85)
> --
> Full-Name:  J. Eric Roskos
> UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
> US Mail:    MS 795; Perkin-Elmer SDC;
>             2486 Sand Lake Road, Orlando, FL 32809-7642

Gad! A reasoned, understandable comparison of architectures without flames, innuendo or dead-horse-flogging. What is this newsgroup coming to? Thank you, Eric.
-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather%utastro.UTEXAS@ut-sally.ARPA
mann@LaBrea.ARPA (06/08/85)
> > By the way, where are the 68XXX fans? Surely the Motorola chip isn't
> > so bad that no one can find a defense for it.
>
> It's kind of hard to compare the 68000's MMU, which functions in a very
> familiar, traditional way (the same way MMUs on many "mainframe" machines
> work), with the very strange segmentation facilities of the 286.

I felt J. Eric Roskos's message was a good comparison of 8086-style segmented architecture with more conventional linear address space models, and I'm essentially in agreement that linear address spaces are superior overall. But I'm curious as to what he's referring to when he talks about the 68000's MMU above.

The 680X0 doesn't have an on-chip MMU -- which is in a sense one of its strengths, since the chips do support a large linear address space, and those who use them are free to build any style of outboard MMU for such an address space. Most have chosen to do fairly conventional paged MMUs (for instance, Sun). But there is no such MMU in the 68XXX family (yet). The 68451 is utterly ridiculous garbage (please look into how it works before flaming back at me). The more recent promised MMUs from Motorola and Signetics have not come out yet.

--Tim
freeman@spar.UUCP (Jay Freeman) (06/10/85)
[libation to line-eater]

In article <1031@peora.UUCP> jer@peora.UUCP (J. Eric Roskos) writes:

> [a thoughtful article comparing Intel-style segmentation
> with more traditional approaches to memory management -- JF]
>
>The 8086's way of handling segmentation is not like that of many more
>familiar machines, 68000 included. [discussion of address-calculation by
>concatenation of bit-fields (traditional approach) versus shift-and-add
>(Intel approach) follows -- JF]

If shift-and-add segmentation were made transparent to the user, by putting it in an inaccessible MMU, how would it then compare with the concatenation technique?

>This is what the compiler writers have trouble with [...] deciding when
>you need to change the contents of the segmentation register becomes
>enormously difficult.
>
>And optimization is really the issue here. A fully unoptimized 8086-family
>program would load the segmentation register before EVERY operand access.
>The task of the optimizer is to decide when it doesn't have to.
>
>And that is the basic problem.

Well put. I had thought that flow analysis was in fact state-of-the-art for compilers (though admittedly still a thorny problem). If not, there is clearly a problem with all forms of messy addressing.

I view user-controlled segmentation more as a special kind of addressing than a memory-management scheme (though it clearly can help with the latter). I see it as providing support for abstract data types and object-oriented programming, at a very low level -- you put your specialized object in its own segment, and rely on the hardware to trap if you attempt to access it incorrectly. Or you put different instantiations of a particular flavor of object in their own segments, and just change a segment register to get at one instantiation instead of another. Segments can be made much smaller than the typical page size of a more traditional MMU. Thus access rights to objects can be specified with much higher resolution -- down to individual procedures and (small) data structures.

And I need these tools: it's so hard to do object-oriented programming in my favorite languages -- '66 FORTRAN, COBOL and Tiny BASIC. :-)
-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research) (canonical disclaimer)
jer@peora.UUCP (J. Eric Roskos) (06/11/85)
> But I'm curious as to what he's referring to when he talks about
> the 68000's MMU above. The 680X0 doesn't have an on-chip MMU ...

I was referring to the separate Motorola MMU part, the 68451, which you called "utterly ridiculous garbage". My data sheet on it is at home, so it is kind of hard to see what you are referring to (all I have is the 68000 User's Manual here, which has a brief summary of the MMU at the back). What do you feel is wrong with the 68451? (Besides the fact that, last time I looked, it cost more than the 68000, I mean.)
-- 
Full-Name:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
            2486 Sand Lake Road, Orlando, FL 32809-7642
"Gnyx gb gur fhayvtug, pnyyre..."
doug@terak.UUCP (Doug Pardee) (06/11/85)
> The iNTEL segments
> are just big versions of the 4k addressing problem on the IBM mainframes
> except on the IBM you can use 24bit pointers.

Once again I must point out that the IBM 360/370/30xx architecture is *not* segmented. Addresses run linearly from 0 to 16M-1 for non-XA systems, 0 to 2G-1 for XA.

What *was* botched was that there is no PC-relative addressing mode for branching. Since there is also no direct addressing mode available, this means that in order to branch, you first have to load a register with the address of the destination. Or, the more common approach, keep a register loaded with a procedure code address and then use the (register + offset) addressing mode. The offset is limited to 0-4095, giving rise to the erroneous claims that the IBM architecture has 4K segments.

The "proper" way around this is to limit your routines to 4096 bytes in length, and when you call a subroutine you should load its address into a register rather than using the (register + offset) mode. Then the subroutine can use that register to address all of its own procedure code using the (register + offset) addressing mode.

The reason that this is important is that while the Intel architecture is a nuisance for large amounts of data, its procedure code is still quite manageable; the IBM is the other way around. Many IBM systems process unbelievable quantities of in-memory data (that's why they had to expand the addressing to allow each process to have more than 16 Mb of data). Arrays of 1 Gb in size are "no sweat" on an IBM XA machine. But addressing branch destinations requires planning.
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                                     ^^^^^--- soon to be CalComp
mann@LaBrea.ARPA (06/15/85)
Wow! I guess it's time for me to pay the price for flaming by trying to back up my opinion. First let me say I called the 68451 "utterly ridiculous garbage" in a moment of weakness. Normally I don't flame like that, but I've been reading laser-lovers lately, and suddenly the urge overcame me.

Anyway, my real opinion of the 68451 is that it's a lot better than no memory management unit at all, but it has some serious problems. Most were pointed out in previous postings, by people who said in effect "well, of course it has the following problems, but I don't understand why people hate it so much." I'll summarize the problems below.

The basic design of the 68451 forces the operating system to manage physical (as well as virtual) memory in 2^k sized chunks (called segments), where k is variable (but must be an integer!). A 2^k sized chunk must start on an address that is an exact multiple of 2^k. You can't choose a uniform small k (like 10) and simulate a page-oriented MMU because you rapidly run out of segment descriptors (there are only 32). The operating system code needed to manage physical memory in this way is much more complex than what is needed with a paging MMU. It is also difficult to avoid wasting a lot of physical memory in fragmentation and/or spending a lot of time copying data from one part of physical memory to another when an address space grows.

I'll try to make this clear with an example. Let's say you are implementing a Unix-style model of process address space, where there is an upper bound that is allowed to grow dynamically. If you try to minimize internal fragmentation by allocating very little more actual memory than the process is using at any given time, you use up a lot of segment descriptors, since you need one for each 1 bit in the binary representation of the memory size. Then let's say you try to grow the space. Each time you grow it, you have the choice of doing it by tacking on a small chunk to the end (chewing up another segment descriptor), or replacing one or more of the existing chunks by larger ones. Replacing some chunk(s) by a larger one requires copying their data to a new place in physical memory, unless you had the good fortune to find that the existing chunks were contiguous and started on the right boundary in physical space, and the space following them was free. One could put in some heuristics to make this condition more likely, but these make the OS memory management code yet more complicated.

An advertised benefit of the 68451 is that it allows fast context switching because you can keep the segments for multiple address spaces in the chip at the same time. Of course, with only 32 segment descriptors to go around and each address space needing several, you run out rather quickly if you try to keep them all in there. So you need to swap them in and out, perhaps keeping the N most recently used sets in the chip in hopes of minimizing the amount of descriptor swapping.

Protection bits are also on a per-segment basis, so if you have different areas in your process that need different protections, you chew up more segment descriptors -- each differently-protected area needs several segment descriptors if its size is not an exact power of two.

The fact that the 68451 introduces TWO wait states into memory access is a serious black mark against it in my book, too.

Of course, it's possible to live with all of this. And user programs still see a nice, simple, linear address space. But I wouldn't want to try to do demand paging with the 68010 on top of the 68451 -- demand paging is complicated enough with a more rational MMU.

What are my qualifications for saying all this? I've written all the kernel memory management code for the Sun-1, Sun-2 and Iris PM-II versions of the Stanford V kernel. (See IEEE Software, April 1984 for an article about the V kernel.) These machines all happen to be 68000/68010-based systems with custom MMUs built mostly with TTL. The current V kernel implements multiple address spaces with variable upper limits, and multiple processes per address space. There is currently no demand paging -- all code and data must be resident at all times -- but I will be implementing that soon as a byproduct of some of my thesis research. (Adding swapping to the current system would be trivial.) I formed my opinions about the 68451 in the process of trying to explain the current memory management to some folks who are trying to port the kernel to a system with a 68451, and helping them find solutions to the problems they ran into.

--Tim