connors@hplabsz.HP.COM (Tim Connors) (02/20/91)
Now that we've entered the brave new world of 64 bit (flat) address spaces,
is it time to revive the old flame wars on address translation mechanisms?

For 32 bit addresses, Motorola's MC68851 uses a "two level" translation
tree involving 4K pages (12 bits) and two 10 bit indices, one index for
each level. How could this technique be applied to 64 bit addresses?
Would more levels be needed? Should the page size be larger?

More interestingly, could the pointers which link one level to the next be
only 32 bits and thus save on translation table size? This might limit the
placement of the tables in a machine with more than 4Gbytes of RAM. It also
requires switching from 64 to 32 bit mode during TLB miss handling.

What about inverted page tables? Would they be any better for 64 bit
addresses? Does this make life tough for the MACH operating system?

Are 64 bit address spaces more likely to be sparse? What effect does that
have on the translation mechanism?

I can think of a lot more questions, but I'll leave it there except to ask
if anyone from MIPS can tell us how you intend to do address translation on
the R4000?

Sincerely,
Tim
af@spice.cs.cmu.edu (Alessandro Forin) (02/21/91)
In article <6590@hplabsz.HP.COM>, connors@hplabsz.HP.COM (Tim Connors) writes: > ... > Does this make life tough for the MACH operating system? > May I humbly ask which original reading of our code prompts you this question ? sandro-
mash@mips.COM (John Mashey) (02/24/91)
In article <6590@hplabsz.HP.COM> connors@hplabs.hp.com (Tim Connors) writes:
>Now that we've entered the brave new world of 64 bit (flat) address spaces,
>is it time to revive the old flame wars on address translation mechanisms?
...
>I can think of a lot more questions, but I'll leave it there except to ask
>if anyone from MIPS can tell us how you intend to do address translation on
>the R4000?

The general approach (which is about all I can talk about right now) is
essentially identical to the R3000's, although low-level details differ:
1) Generate an address.
2) Send it to the TLB.
3) If it matches, translation is done. Do protection checks as needed.
4) If no match, stuff the offending address and/or appropriate portions
   thereof into coprocessor-0 registers, and invoke a special fast-trap
   kernel routine to refill the TLB.
Different environments use different refill routines, depending on their
preference for PTE layouts and organization.

At this level of discourse, the main difference is the ability for each
entry to map from 4KB -> 16MB. Obviously, it is trivial to use the big
entries for things like frame buffers, but (in my opinion) it will be
useful to end up with OSs that can use larger pages when they need to.
This is already an issue with DBMS and scientific vector code, where it is
quite possible to overpower any reasonable-to-build TLB if it maps only 4K
or 8KB pages, regardless of whether it uses hardware or software refill.
(True regardless of 32 or >32-bit addressing). There are various ways of
coding refill routines that provide mixed page sizes; I did one as an
example a while back that looked like it cost just a few more cycles, so I
believe that it is practical.

Note: I occasionally see marketing-stuff that claims that hardware refill
is much better than software refill. I might believe this if somebody
showed numbers, also....

BTW: back to the opportunity cost of 64-bit integers. I went back to the
plot, and measured the datapath itself a little more carefully, as my
earlier figures included datapath + associated control. Now, it looks like
the die cost of 64-bit integers is more like 4-5%, not 7%.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
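To make those four steps concrete, a toy C simulation of the control flow
might look like the following: a tiny fully associative TLB refilled by a
"software handler" out of a linear page table. The sizes, the round-robin
replacement, and the page-table layout are inventions of this sketch, not
R3000/R4000 details.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT  12            /* 4 KB pages */
#define TLB_ENTRIES 8             /* tiny fully-associative TLB */
#define NPAGES      64            /* size of the toy linear page table */

struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };

static struct tlb_entry tlb[TLB_ENTRIES];
static uint64_t page_table[NPAGES];   /* vpn -> pfn, the "in-memory" table */
static int next_victim;               /* trivial round-robin replacement   */

/* The "fast trap" path: fetch the PTE and stuff it into the TLB. */
static void tlb_refill(uint64_t vpn)
{
    struct tlb_entry *e = &tlb[next_victim];
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    e->vpn = vpn;
    e->pfn = page_table[vpn];         /* one memory reference in this toy  */
    e->valid = 1;
}

/* Translate a virtual address, refilling on a miss. */
static uint64_t translate(uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    for (;;) {
        for (int i = 0; i < TLB_ENTRIES; i++)     /* step 2/3: TLB lookup  */
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].pfn << PAGE_SHIFT) |
                       (va & ((1u << PAGE_SHIFT) - 1));
        tlb_refill(vpn);              /* step 4: miss, software handler    */
    }
}

int main(void)
{
    for (uint64_t i = 0; i < NPAGES; i++)
        page_table[i] = 1000 + i;     /* arbitrary frame numbers           */
    printf("va 0x5123 -> pa 0x%llx\n",
           (unsigned long long)translate(0x5123));
    return 0;
}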
cprice@mips.COM (Charlie Price) (02/24/91)
In article <6590@hplabsz.HP.COM> connors@hplabs.hp.com (Tim Connors) writes:
>Now that we've entered the brave new world of 64 bit (flat) address spaces,
>is it time to revive the old flame wars on address translation mechanisms?
>
>For 32 bit addresses, Motorola's MC68851 uses a "two level" translation
>tree involving 4K pages (12 bits) and two 10 bit indices, one index for
>each level. How could this technique be applied to 64 bit addresses?
>Would more levels be needed? Should the page size be larger?
>
>More interestingly, could the pointers which link one level to the next be
>only 32 bits and thus save on translation table size? This might
>limit the placement of the tables in a machine with more than 4Gbytes of RAM.
>It also requires switching from 64 to 32 bit mode during TLB miss handling.
>
>What about inverted page tables? Would they be any better for 64 bit
>addresses? Does this make life tough for the MACH operating system?
>
>Are 64 bit address spaces more likely to be sparse? What effect does that
>have on the translation mechanism?
>
>I can think of a lot more questions, but I'll leave it there except to ask
>if anyone from MIPS can tell us how you intend to do address translation on
>the R4000?

Address translation for the R4000 is a lot like the R2000/R3000. The short
answer is that the in-memory page-table arrangement is entirely up to the
OS programmers because it is all done by software. These processors have a
fully-associative on-chip TLB. During execution, the hardware looks in the
TLB. If the right information is not in the TLB, the processor takes an
exception and *software* refills the TLB. The software can do whatever it
likes.

DETAILS: Hit "N" now if not interested.

The processors do give the exception handler enough information to do the
TLB refill, and this information is in the right form to make a one-level
page table especially fast. If you use a one-level page table, the R3000
user TLB miss refill routine takes 9 instructions.

The R3000 has a CONTEXT register that looks like:

  -----------------------------------------------
  | PTE base        |       bad VPN          |..|
  -----------------------------------------------
                                              ^----- 2 bits of 0

The PTE-base part is filled in by the OS at context-switch time. The
bad-VPN field is filled at exception time, by the hardware, with the
Virtual Page Number (VPN) of the failed translation. The net intended
effect is that when VPN NNN gets a translation fault, the CONTEXT register
contains the *kernel address* of the 1-word user page-table entry for page
NNN!

On the R2000/R3000 a TLB miss for a user-mode access (i.e. a user program)
vectors to the UTLB miss exception vector so it can be handled quickly.
This routine is 9 instructions for a 1-word PTE (add 1 instruction to
shift the address left one for a 2-word wide PTE like RISC/os uses). For a
kernel-mode TLB miss, the exception sets a fault cause register and
vectors to the common exception vector, and this takes more effort to sort
out (but also presumably happens a lot less often). The R4000 takes all
non-nested TLB misses through the fast exception vector.

There is nothing that *requires* anybody to use a one-level page table.
If you think that memory savings or whatever makes it worth the cost in
CPU cycles to have a more complex TLB refill routine, then you (the OS
hacker) are allowed to make that tradeoff.
--
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650
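The address arithmetic behind that CONTEXT register can be shown in a few
lines of C. The field placement follows the description above (PTE base in
the upper bits, bad VPN shifted by two for one-word PTEs); the particular
constants, and the choice of PTE base, are just assumptions of the sketch.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12u            /* 4 KB pages */
#define PTE_SHIFT   2u            /* one-word (4-byte) PTEs */
#define VPN_BITS   19u            /* 2 GB user space / 4 KB pages (R3000) */

/* What the hardware deposits in CONTEXT when a user TLB miss occurs. */
static uint32_t context_on_miss(uint32_t pte_base, uint32_t bad_va)
{
    uint32_t bad_vpn = (bad_va >> PAGE_SHIFT) & ((1u << VPN_BITS) - 1);
    return pte_base | (bad_vpn << PTE_SHIFT);
}

int main(void)
{
    /* Assume the OS parked this process's page table at this kernel VA. */
    uint32_t pte_base = 0x80400000u;       /* hypothetical, 2 MB aligned   */
    uint32_t bad_va   = 0x00005123u;       /* faulting user address, VPN 5 */

    uint32_t pte_addr = context_on_miss(pte_base, bad_va);
    printf("PTE for VPN %u lives at 0x%08x\n",
           (unsigned)(bad_va >> PAGE_SHIFT),
           (unsigned)pte_addr);            /* 0x80400014 = base + 5*4      */
    return 0;
}

With two-word PTEs the VPN would be shifted by three instead, which is the
extra shift instruction mentioned above.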
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/01/91)
In article <6590@hplabsz.HP.COM> connors@hplabs.hp.com (Tim Connors) writes: >For 32 bit addresses, Motorola's MC68851 uses a "two level" translation >tree involving 4K pages (12 bits) and two 10 bit indices, one index for >each level. How could this technique be applied to 64 bit addresses? >Would more levels be needed? Should the page size be larger? The MC68851 actually allows a variable number of levels - the table search flowchart contains a loop. You may be thinking of its RISC descendant, the 88200. There are lots of ways to skin a cat. You can store a complete flat table in another virtual space, with pages there only being instantiated on demand. Trees work, and there have been tree methods that stored "likely" information (eg the descriptor of the first page of a segment) high up in the tree. You can punt it all to software, if the TLB is big enough, or has big or variable page sizes. Variable page size can be done in software. (Mach's boot can select a software page size - of course, it has to be a multiple of the hardware size.) The CDC Star-100 allowed multiple hardware page sizes, simultaneously, as part of their plans for memory mapped I/O. The 88200's BATC is a 10 entry TLB, each entry mapping an aligned 512 KB region. (This, like the new MIPS stuff, is aimed at big address spaces, not big I/O.) -- Don D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
pcg@cs.aber.ac.uk (Piercarlo Grandi) (03/01/91)
On 25 Feb 91 23:03:30 GMT, connors@hplabsz.HP.COM (Tim Connors) said:

connors> In article <6590@hplabsz.HP.COM>, connors@hplabsz.HP.COM (Tim
connors> Connors) writes:
connors> What about inverted page tables? Would they be any better for
connors> 64 bit addresses? Does this make life tough for the MACH
connors> operating system?

connors> In article <12030@pt.cs.cmu.edu> af@spice.cs.cmu.edu
connors> (Alessandro Forin) writes:
af> May I humbly ask which original reading of our code prompts you this
af> question ?

Oh, I think he is referring to the fact that in the Mach kernel one of the
pager modules is specifically designed to deal with inverted page tables.

connors> [ ... ] HP's PA-RISC machines and the IBM RT and RS6000. My
connors> understanding is that in each of these machines the inverted
connors> page table scheme comes with an assumption of NO address
connors> aliasing. [ ... ] As I understand MACH, address aliasing is of
connors> primary importance.

I can answer here for the Mach people. One of the first Mach
implementations was for the RT. They did make it work on the RT, despite
the aliasing problem. They describe the solution (a variant on the obvious
"remap aliased things when they are used" theme) as not pretty, but it
works quite well.

Actually the Mach page table code is *designed* to work also with inverted
page tables, something that is not that true with other common Unix
kernels (even if 4.3BSD was hacked to work on the RT too). Naturally one
pays a performance penalty.

The problem is historical; there is no conceptual good reason to have
shared memory or copy on write, but if you want to have Unix like
semantics (e.g. fork(2)), then you really want them both, because the
performance is even worse without. Mach's success is largely because it is
Unix compatible, not because it is "better" (the "better" things were in
Accent, and Accent withered). It is Unix that is built around the
assumption of aliasing to a large extent (shared text, SysV IPC, BSD
mmap(2)) or that makes copy-on-write so desirable ("copy" semantics are
prevalent because on a PDP-11 it was simpler and maybe faster to copy
small things than to share them).

Inasmuch as Mach is Unix compatible, one needs to make the best of a
mismatch, like that between inverted page tables (which are at their best
supporting ORSLA/System/38 type systems, that is capability based sparse
address space architectures) and Unix semantics, and the Mach kernel does
it with reasonable efficiency, precisely because one of the first machines
it supported in large numbers was the RT.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
jonah@dgp.toronto.edu (Jeff Lee) (03/01/91)
In <12151@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <6590@hplabsz.HP.COM> connors@hplabs.hp.com (Tim Connors) writes:
>> [...] How could this technique be applied to 64 bit addresses?
>>Would more levels be needed? Should the page size be larger?
> [...] You can store a complete flat table in another virtual space,
> with pages there only being instantiated on demand.

Full-indexing methods have an outrageous worst-case cost: O(n) in the
amount of space [actually the number of pages] indexed. This is manageable
with today's 4 gigabyte (32-bit) address spaces, but is insane with a
16777216 terabyte (64-bit) address space. [What comes after tera-?]

Making the index sparse helps, but you *still* need to create the index
for the portions of the data that you want to address. Putting the index
in a virtual space doesn't help very much; it means you can swap it out,
but you still have to construct and maintain the index (although now in
swap space instead of physical memory). This complicates the process.
Additionally, the full indexes often need to be maintained in a form that
is compatible with your MMU, so they can't easily be shared with
heterogeneous machines. Disk files already have an index of sorts, but it
is seldom compatible with the MMU hardware. Why build another index when
you map that file into memory?

> You can punt it all to software, if the TLB is big enough, or has big
> or variable page sizes. [...] Variable page size can be done in
> software. (Mach's boot can select a software page size - of course, it
> has to be a multiple of the hardware size.)

Ultimately, it comes down to software -- especially if you want to be able
to map arbitrary data resources and large shared remote files. Variable
page sizes can help, but as you point out they can be done in software.

Speculation: In order to deal with growing physical memories we will soon
see TLBs comparable to today's on-chip caches with [tens of] thousands of
entries instead of tens or hundreds of entries as exist today.

Jeff Lee -- jonah@cs.toronto.edu || utai!jonah
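To put numbers on that worst case, a small C program will do, assuming
4-byte PTEs for the 32-bit table and 8-byte PTEs for the 64-bit one (both
PTE sizes are assumptions of this sketch, not figures from the discussion):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t pages32 = 1ULL << (32 - 12);   /* 2^20 pages at 4 KB each */
    uint64_t pages64 = 1ULL << (64 - 12);   /* 2^52 pages at 4 KB each */

    printf("32-bit: %llu PTEs, %llu MB of table\n",
           (unsigned long long)pages32,
           (unsigned long long)(pages32 * 4 >> 20));   /* 4 MB          */
    printf("64-bit: %llu PTEs, %llu TB of table\n",
           (unsigned long long)pages64,
           (unsigned long long)(pages64 * 8 >> 40));   /* 32768 TB      */
    return 0;
}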
peter@ficc.ferranti.com (Peter da Silva) (03/01/91)
In article <1991Feb28.144420.21179@jarvis.csri.toronto.edu> jonah@dgp.toronto.edu (Jeff Lee) writes:
> Making the index sparse helps, but you *still* need to create the
> index for the portions of the data that you want to address. Putting
> the index in a virtual space doesn't help very much; it means you can
> swap it out, but you still have to construct and maintain the index...

Can't you construct the index at page-fault time, if the structure is
regular and predictable?

(gives new depth to the term "virtual memory")
--
Peter da Silva.  `-_-'  peter@ferranti.com +1 713 274 5180.
                 'U`    "Have you hugged your wolf today?"
torek@elf.ee.lbl.gov (Chris Torek) (03/02/91)
In article <PCG.91Feb28183457@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>Actually the Mach page table code is *designed* to work also with
>inverted page tables, something that is not that true with other common
>Unix kernels (even if 4.3BSD was hacked to work on the RT too).

I suspect you have neither read the code nor written a pmap module. (I
have now done some of both....) The Mach code demands two kinds of support
from a `pmap module' (the thing that implements address-space operations
on any given machine):

  pmap_enter, pmap_remove, pmap_protect, pmap_change_wiring, pmap_extract:
     All of these receive a `pmap' pointer (where a `pmap' is something
     you define on your own) and a virtual address (and, for pmap_enter,
     a starting physical address) and require that you map or unmap or
     protect that virtual address, or return the corresponding physical
     address. This requires a way to go from virtual to physical, per
     process (i.e., a forward map).

  pmap_remove_all, pmap_copy_on_write, pmap_clear_modify,
  pmap_clear_reference, pmap_is_modified, pmap_is_referenced, other
  miscellaneous functions:
     These receive a physical address and must return information, or
     remove mappings, or copy, about/to/from the corresponding physical
     address. pmap_remove_all, for instance, must locate each virtual
     address that maps to the given physical page and remove that
     mapping (as if pmap_remove() were called). This requires a way to
     go from physical to virtual, for all processes that share that
     physical page (i.e., an inverted page table, of sorts).

Note that, although the VM layer has all the forward map information in
machine independent form, the pmap module is expected to maintain the same
information in machine-dependent form. This is almost certainly a mistake;
the information should appear in one place or the other. (Which place is
best is dictated by the hardware.) (Note that all pmap_extract does is
obtain a virtual-to-physical mapping that could be found by simulating a
pagein.)

The reverse map maintained in hardware inverted page tables typically
lacks some information the pager will require, and thus cannot be used to
store everything; I suspect (although I have not penetrated this far yet)
that the VM layer code `fits' better with this situation, i.e., keeps just
what it needs and leaves the rest to the pmap module and/or hardware.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA        Domain: torek@ee.lbl.gov
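For readers who have not been inside a pmap module, here is a generic
sketch of the kind of physical-to-virtual chain that makes the second
group of operations possible. These are not Mach's actual structures or
entry points; everything below is invented for the illustration, with only
the spirit of pmap_remove_all borrowed.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef uintptr_t vaddr_t;

struct pmap { int id; };            /* stand-in for a per-address-space map  */

struct pv_entry {                   /* one (pmap, va) mapping of a phys page */
    struct pmap     *pv_pmap;
    vaddr_t          pv_va;
    struct pv_entry *pv_next;
};

struct phys_page {                  /* per-physical-page bookkeeping         */
    struct pv_entry *pv_head;       /* every current mapping of this page    */
    int              modified, referenced;
};

/* Stub for the sketch: a real pmap module would edit page tables and
 * invalidate TLB entries here. */
static void pmap_remove_one(struct pmap *pm, vaddr_t va)
{
    printf("unmap va 0x%lx from pmap %d\n", (unsigned long)va, pm->id);
}

/* pmap_remove_all style: given only a physical page, drop every mapping. */
static void phys_page_remove_all(struct phys_page *pg)
{
    struct pv_entry *pv = pg->pv_head, *next;
    for (; pv != NULL; pv = next) {
        next = pv->pv_next;
        pmap_remove_one(pv->pv_pmap, pv->pv_va);
        free(pv);
    }
    pg->pv_head = NULL;
}

static struct pv_entry *pv_add(struct phys_page *pg, struct pmap *pm, vaddr_t va)
{
    struct pv_entry *pv = malloc(sizeof *pv);
    pv->pv_pmap = pm; pv->pv_va = va; pv->pv_next = pg->pv_head;
    return pg->pv_head = pv;
}

int main(void)
{
    struct pmap p1 = { 1 }, p2 = { 2 };
    struct phys_page pg = { NULL, 0, 0 };
    pv_add(&pg, &p1, 0x10000);      /* two tasks share one physical page */
    pv_add(&pg, &p2, 0x7f000);
    phys_page_remove_all(&pg);
    return 0;
}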
steve@cs.su.oz (Stephen Russell) (03/03/91)
In article <IAT90LB@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>In article <1991Feb28.144420.21179@jarvis.csri.toronto.edu> jonah@dgp.toronto.edu (Jeff Lee) writes:
>> Making the index sparse helps, but you *still* need to create the
>> index for the portions of the data that you want to address...
>
>Can't you construct the index at page-fault time, if the structure is
>regular and predictable?

I doubt this would be a win, as you are still constructing the same index,
although in a piecemeal manner. In fact, you end up performing the inner
loop of the PTE initialisation code along with the setup code for the
loop, on each fault.

Of course, it would win if most program executions only referenced a small
part of their code and data. This is not impossible -- each execution may
only exercise some of the code -- but I'd be surprised if there was a
large amount of code and data that _never_ got used :-).

Anyone got any figures on code and data subset usage for different runs of
the same program(s)?

Cheers, Steve.
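A toy C fragment may make the tradeoff concrete: building the index up
front and building one entry per fault do the same total work if every
page is eventually touched -- the lazy version merely repeats the loop
setup inside each fault -- and only wins when much of the region is never
referenced. All the names and the PTE encoding here are invented for the
illustration.

#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12

/* Eager: the setup plus the whole inner loop, once, at map time. */
static void map_region_eager(uint32_t *pte, size_t npages, uint32_t first_frame)
{
    for (size_t i = 0; i < npages; i++)
        pte[i] = ((first_frame + (uint32_t)i) << 1) | 1u;  /* frame, valid bit */
}

/* Lazy: the same per-entry computation, re-derived inside every fault. */
static void fault_in_one(uint32_t *pte, uint32_t fault_va,
                         uint32_t region_base_va, uint32_t first_frame)
{
    uint32_t idx = (fault_va - region_base_va) >> PAGE_SHIFT; /* per-fault setup */
    pte[idx] = ((first_frame + idx) << 1) | 1u;               /* loop body       */
}

int main(void)
{
    uint32_t eager[16] = { 0 }, lazy[16] = { 0 };
    map_region_eager(eager, 16, 100);
    for (uint32_t va = 0; va < 16u << PAGE_SHIFT; va += 1u << PAGE_SHIFT)
        fault_in_one(lazy, va, 0, 100);         /* touch every page        */
    for (int i = 0; i < 16; i++)
        assert(eager[i] == lazy[i]);            /* same index either way   */
    return 0;
}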
peter@ficc.ferranti.com (Peter da Silva) (03/03/91)
In article <PCG.91Feb28183457@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: > Naturally one pays a performance penalty. The problem is historical; > there is no conceptual good reason to have shared memory... Shared libraries? -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
moss@cs.umass.edu (Eliot Moss) (03/04/91)
One possibility is trace or debugging related information. In most runs of a program this would not be used. Similarly, online help stuff, tutorials, etc., may be rarely used in some environments (i.e., once you've learned it, you refer to such material rarely, though you want it around if possible). Also, code for handling very rare bad situations (e.g., out of disk space) may never be exercised. Just some suggestions that in fact there might reasonably be code and data that is very rarely used .... Eliot -- J. Eliot B. Moss, Assistant Professor Department of Computer and Information Science Lederle Graduate Research Center University of Massachusetts Amherst, MA 01003 (413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu
Richard.Draves@cs.cmu.edu (03/05/91)
> Excerpts from netnews.comp.arch: 1-Mar-91 Re: Translating 64-bit addr.. > Chris Torek@elf.ee.lbl.g (2509) > n article <PCG.91Feb28183457@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk > (Piercarlo Grandi) writes: > >Actually the Mach page table code is *designed* to work also with > >inverted page tables, something that is not that true with other common > >Unix kernels (even if 4.3BSD was hacked to work on the RT too). > I suspect you have neither read the code nor written a pmap module. > (I have now done some of both....) ... > Note that, although the VM layer has all the forward map information in > machine independent form, the pmap module is expected to maintain the > same information in machine-dependent form. This is almost certainly a > mistake; the information should appear in one place or the other. > (Which place is best is dictated by the hardware.) (Note that all > pmap_extract does is obtain a virtual-to-physical mapping that could be > found by simulating a pagein.) The reverse map maintained in hardware > inverted page tables typically lacks some information the pager will > require, and thus cannot be used to store everything; I suspect > (although I have not penetrated this far yet) the VM layer code `fits' > better with this situation, i.e., keeps just what it needs and leaves > the rest to the pmap module and/or hardware. I think this is almost certainly not a mistake. Hardware data structures are not going to represent all the information that the machine-independent VM code must maintain. Hence, the MI VM code must maintain its own data structures for at least some information. I think it would be very difficult to split information between MI and MD data structures. Trying to cope with varying splits would greatly complicate the pmap interface. So I think the Mach approach (making MI data structures the true repository of information and making MD data structures an ephemeral cache) is definitely the right way to go. Rich
pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/06/91)
On 3 Mar 91 15:26:24 GMT, peter@ficc.ferranti.com (Peter da Silva) said:
peter> In article <PCG.91Feb28183457@odin.cs.aber.ac.uk>
peter> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
pcg> Naturally one pays a performance penalty. The problem is historical;
pcg> there is no conceptual good reason to have shared memory...
peter> Shared libraries?
Circular argument: if you are in the situation where shared libraries
are useful, that is because each process is confined in a separate
address space because protection is done via address confinement, shared
libraries are useful. But so is shared memory in general, because
absolute address confinement is not that appealing, and copying is not
that nice (even copy-on-write).
Naturally, except on the AS/400, single address space machines are
pie-in-sky technology (even if more or less research systems abound).
I would however still maintain that even with conventional multiple
address space architectures shared memory is not necessary, as sending
segments back and forth (remapping) gives much the same bandwidth.
In this case I can argue that _irrevocably read only_ segments, useful
for a shared library, are a good idea; if a segment is declared
irrevocably read only no remapping need occur as multiple address spaces
access it; it can de facto, if not logically, be mapped at the same time
in multiple address spaces without aliasing hazards.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
jesup@cbmvax.commodore.com (Randell Jesup) (03/06/91)
In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>pcg> Naturally one pays a performance penalty. The problem is historical;
>pcg> there is no conceptual good reason to have shared memory...
>
>peter> Shared libraries?
>
>Circular argument: if you are in the situation where shared libraries
>are useful, that is because each process is confined in a separate
>address space because protection is done via address confinement, shared
>libraries are useful. But so is shared memory in general, because
>absolute address confinement is not that appealing, and copying is not
>that nice (even copy-on-write).

Shared libraries are very useful on systems with single address spaces.
Very efficient in memory usage, helps improve cache hit rates, reduces the
amount of copying of data from one space to another (see some not-too-old
articles in comp.os.research concerning pipes vs shared memory speed),
reduces needs to flush caches, etc, etc.

>Naturally, except on the AS/400, single address space machines are
>pie-in-sky technology (even if more or less research systems abound).

I wouldn't quite say that. The Amiga (for example) uses a single-
address-space software architecture on 680x0 CPUs. It currently doesn't
support inter-process protection (partially since 80% of Amigas don't have
an MMU), but if it were implemented (it might be eventually), it would
almost certainly continue to use a single address space.

>I would however still maintain that even with conventional multiple
>address space architectures shared memory is not necessary, as sending
>segments back and forth (remapping) gives much the same bandwidth.

It requires far more OS overhead to change the MMU tables, and requires
both large-grained sharing and program knowledge of the MMU boundaries.
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)
peter@ficc.ferranti.com (Peter da Silva) (03/07/91)
In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes: > I would however still maintain that even with conventional multiple > address space architectures shared memory is not necessary, as sending > segments back and forth (remapping) gives much the same bandwidth. I don't think you can really make a good case for this. Consider the 80286, where pretty much all memory access for large programs is done by remapping segments. Loading a segment register is an expensive operation, and is to a large extent the cause of the abysmal behaviour of large programs on that architecture. For an extreme case, the sieve slows down by a factor of 11 once the array size gets over 64K. My own experience with real codes under Xenix 286 bears this out. Think of the 80286 as an extreme case of what you're proposing. I think it's clear from this experience that frequent reloading of segment registers is a bad idea. After your discussion of the inappropriate use of another technology, networks, I would have expected you'd know better. As for single address space machines, my Amiga 1000's exceptional performance... given the slow clock speed and dated CPU (7.14 MHz 68000)... tends to suggest that avoiding MMU tricks might be a good idea here as well. The Sparcstation 2 is the first UNIX workstation I've seen with as good response time to user actions. It's only a 27 MIPS machine... or approximately 40 times faster. -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
martin@adpplz.UUCP (Martin Golding) (03/08/91)
In <PCG.91Mar5193541@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>On 3 Mar 91 15:26:24 GMT, peter@ficc.ferranti.com (Peter da Silva) said:
>peter> In article <PCG.91Feb28183457@odin.cs.aber.ac.uk>
>peter> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>pcg> Naturally one pays a performance penalty. The problem is historical;
>pcg> there is no conceptual good reason to have shared memory...
>peter> Shared libraries?
>Circular argument: if you are in the situation where shared libraries
>are useful, that is because each process is confined in a separate
>address space because protection is done via address confinement, shared
>libraries are useful. But so is shared memory in general, because
>absolute address confinement is not that appealing, and copying is not
>that nice (even copy-on-write).
>Naturally, except on the AS/400, single address space machines are
>pie-in-sky technology (even if more or less research systems abound).

Umm. Pick? Reality? And a fairly large population of pie-in-sky research
machines.

One of the interesting things that people are missing on this
64-bits/too-big issue is sparsely occupied virtual memory. Arbitrary
numbers of objects require arbitrary room to grow...

Martin Golding    | sync, sync, sync, sank ... sunk:
Dod #0236         | He who steals my code steals trash.
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp
linley@hpcuhe.cup.hp.com (Linley Gwennap) (03/09/91)
(Jeff Lee)
>16777216 terabyte (64-bit) address space. [What comes after tera-?]

"Peta". A 64-bit address space contains 16,384 petabytes. So what comes
after peta-?

--Linley Gwennap
Hewlett-Packard Co.
jmaynard@thesis1.hsch.utexas.edu (Jay Maynard) (03/09/91)
In article <32580002@hpcuhe.cup.hp.com> linley@hpcuhe.cup.hp.com (Linley Gwennap) writes: >(Jeff Lee) >16777216 terabyte (64-bit) address space. [What comes after tera-?] >"Peta". A 64-bit address space contains 16,384 petabytes. So what >comes after peta-? exa-. A 64-bit address space contains 16 exabytes...no, not 16 8mm tape drives! -- Jay Maynard, EMT-P, K5ZC, PP-ASEL | Never ascribe to malice that which can jmaynard@thesis1.hsch.utexas.edu | adequately be explained by stupidity. "You can even run GNUemacs under X-windows without paging if you allow about 32MB per user." -- Bill Davidsen "Oink!" -- me
pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/10/91)
On 6 Mar 91 04:41:41 GMT, jesup@cbmvax.commodore.com (Randell Jesup) said:
jesup> In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk>
jesup> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
pcg> there is no conceptual good reason to have shared memory...
And practical as well, let me add, except for historical reasons.
peter> Shared libraries?
pcg> [ ... yes, but only if you have multiple address spaces, and only
pcg> as irrevocably read only data. With 64 bit addresses one had better
pcg> have single address space systems ... ]
jesup> Shared libraries are very useful on systems with single address
^^^^^^
Probably you mean multiple here; in single address space systems all
libraries are "shared".
jesup> spaces. [ ... ]
Ah yes, of course. One of the tragic legacies of the PDP-11 origin of
Unix is that shared libraries have been so slow in appearing. Not all
is a win for shared libraries (one has to be careful about clustering
of functions in shared library), but I agree.
pcg> Naturally, except on the AS/400, single address space machines are
pcg> pie-in-sky technology (even if more or less research systems abound).
jesup> I wouldn't quite say that. The Amiga (for example) uses a
jesup> single- address-space software architecture on 680x0 CPUs. It
jesup> currently doesn't support inter-process protection (partially
jesup> since 80% of amigas don't have an MMU),
This is quite an empty claim. Naturally all machines without an MMU have
a single (real) address space, from the IBM s/360 to the Apple II to the
PC/XT. The point lies precisely in having inter process protection...
jesup> but if it were implemented (it might be eventually), it would
jesup> almost certainly continue to use a single address space.
It will be interesting to see how you do it on a machine with just 32
bits of addressing. You cannot do much better than having 256 regions of
16MB each, which in today's terms seems a bit constricting. Not for an
Amiga probably, admittedly. Amiga users are happy with 68000 addressing,
because its sw is not as bloated as others. There are other problems.
Psyche from Rochester does something like that, when protection is not
deemed terribly important; it maps multiple "processes" in the same
address space, so that they can enjoy very fast communications.
pcg> I would however still maintain that even with conventional multiple
pcg> address space architectures shared memory is not necessary, as
pcg> sending segments back and forth (remapping) gives much the same
pcg> bandwidth.
jesup> It requires far more OS overhead to change the MMU tables,
The Mach people have been able to live with it on the ROMP; not terribly
happy, but the price is not that big; "far more" is a bit excessive.
Point partially acknowledged, though; I had written:
pcg> Naturally one pays a performance penalty. The problem is
pcg> historical;
jesup> and requires both large-grained sharing and program knowlege of
jesup> the MMU boundaries.
But this is really always true; normally you cannot share 13 bytes (in
multiple address space systems). As a rule one can share *segments*.
Some systems allow you to share single *pages*, but I do not like that
(too much overhead, little point in doing so, and most MMUs support
shared segments much more easily than pages; even the VAX has a 2-level
MMU with 64KB shared segments, contrary to common misconceptions).
Here I feel compelled to restate the old argument: if shared memory is
read-only, it can well be *irrevocably* read-only and require no actual
remapping, so no problems.
If it is read-write some sort of synchronization must be provided. In
most cases synchronization must instead be provided via kernel services,
and in that case it is equivalent to remapping, so simultaneous shared
memory is not needed.
Taking turns via semaphores to access a segment that is always mapped at
the same time in multiple address spaces is not any faster than
remapping, even explicitly, the segment in one process at a time. I know
that on machines with suitable hardware instructions and when spin locks
are acceptable one wins with simultaneous shared memory, but this I bet
is not that common for inter address space communication, and there are
better alternatives (multiple threads in the same address space).
Explicit remaps involve a small change in programming paradigm, in that
it is neither simultaneous shared memory, nor message passing (which
implies copying, at least virtually). Its one major implementation, as
far as I know, is in MUSS.
IMNHO it can be easily demonstrated that remapped memory is superior
both to simultaneous shared memory (automatic synchronization, no
problems with reverse MMUs, the distributed case is simple, no address
aliasing hazards) and to message passing (no copying or
copying-on-write, no data aliasing, much simpler to implement).
The only loss I can conceive of is when there are mostly readers and
occasional writers, but this can be handled IMNHO more cleanly with
versioning on irrevocably read only segments.
Consider the two process case; in the left column we have traditional
shared memory, in the right column we have nonshared memory:
p1: create seg0             p1: create seg0
p1: create sem0             p1: map seg0
p1: down sem0               ...
p2: map seg0                p2: map seg0 (wait)
p2: down sem0 (wait)        ...
...                         ...
p1: up sem0                 p1: unmap seg0
...                         ...
p2: (continue)              p2: (continue)
...                         ...
p1: down sem0 (wait)        p1: map seg0 (wait)
...                         ...
p2: up sem0                 p2: unmap seg0
...                         ...
p1: (continue)              p1: (continue)
...                         ...
The contents of the message-passing column are left to the reader.
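For concreteness, here is the left-hand column written out with the System
V shared-memory and semaphore calls a 1991 Unix would offer. Only p1's side
is shown (p2 would attach the same key and bracket its accesses the same
way); the key is arbitrary, error handling is omitted, and the code assumes
a freshly created SysV semaphore starts at zero, as common implementations
arrange.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>

static void sem_change(int semid, int delta)   /* -1 is "down", +1 is "up" */
{
    struct sembuf op = { 0, (short)delta, 0 };
    semop(semid, &op, 1);
}

int main(void)
{
    key_t key = 0x5e60;                              /* arbitrary IPC key  */

    int shmid = shmget(key, 4096, IPC_CREAT | 0600); /* p1: create seg0    */
    int semid = semget(key, 1, IPC_CREAT | 0600);    /* p1: create sem0    */
    sem_change(semid, +1);   /* assuming the new semaphore starts at 0,
                                this makes seg0 available                  */

    char *seg = shmat(shmid, 0, 0);                  /* seg0 stays mapped  */

    sem_change(semid, -1);                           /* p1: down sem0      */
    seg[0] = 'x';                                    /* ... use the data   */
    sem_change(semid, +1);                           /* p1: up sem0        */

    shmdt(seg);
    return 0;
}

The right-hand column would replace the semaphore bracket with explicit
map/unmap requests on seg0; no portable Unix interface of the time offers
those as primitives, which is rather the point of the comparison.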
Final note: I am amazed that a simple and efficient idea like remapped
read-write memory and shared irrevocably read only segments is not
mainstream, and much more complicated or inefficient mechanisms like
shared memory and message passing are.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/10/91)
On 7 Mar 91 13:31:29 GMT, peter@ficc.ferranti.com (Peter da Silva) said:
peter> In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk>
peter> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
pcg> I would however still maintain that even with conventional multiple
pcg> address space architectures shared memory is not necessary, as
pcg> sending segments back and forth (remapping) gives much the same
pcg> bandwidth.
peter> I don't think you can really make a good case for this.
Tell that to the people that ported Mach, and 4.3BSD to the PC/RT... :-)
My estimate is that remapping can be done on demand (lazy remapping),
not on every context switch, and it does not cost more than a page fault
(which is admittedly an extremely expensive operation, but one about
which people don't complain). Also in many cases lazy remapping costs
nothing, given the vagaries of scheduling. Suppose a segment is shared
between processes 1 and 3; if 1 is deactivated, 2 is activated, and 1
is reactivated, no remapping need take place, because 3 has not
accessed it in this sequence.
If 3 had accessed it, the OS would have taken a fault on the attempted
access, found out that the segment was mapped to process 1, unmapped it
from 1, and remapped it onto 3. I correct myself: far less expensive
than a page fault. Probably also less expensive than a process
reschedule, even in a properly designed kernel. You lose a lot only if
you have a very large number of shared segments, which are shared among
a lot but not all processes, and which are all being accessed in every
time slice given to each process that shares them. A very, very, very
unlikely scenario, and one in which after all the cost is proportional
to use, not worse.
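A pseudocode-flavoured C sketch of that lazy policy, with every structure
and helper invented for the illustration: the nominally shared segment is
physically mapped in at most one address space at a time, and the access
fault is what moves it.

#include <stdio.h>
#include <stddef.h>

struct as { int id; };              /* an address space */

struct shared_seg {
    struct as *mapped_in;           /* NULL, or the one space holding it now */
    /* frame list, length, protection ... elided */
};

/* Stubs standing in for the machine-dependent table editing. */
static void md_map_segment(struct as *a, struct shared_seg *s)
{ printf("map seg into as %d\n", a->id); (void)s; }
static void md_unmap_segment(struct as *a, struct shared_seg *s)
{ printf("unmap seg from as %d\n", a->id); (void)s; }

/* Called from the access-fault handler when `faulter` touched `seg`. */
static void remap_fault(struct as *faulter, struct shared_seg *seg)
{
    if (seg->mapped_in == faulter)
        return;                                   /* already ours            */
    if (seg->mapped_in != NULL)
        md_unmap_segment(seg->mapped_in, seg);    /* steal it from last user */
    md_map_segment(faulter, seg);
    seg->mapped_in = faulter;                     /* one mapping at a time   */
}

int main(void)
{
    struct as p1 = { 1 }, p3 = { 3 };
    struct shared_seg seg = { NULL };
    remap_fault(&p1, &seg);         /* p1 touches it first                   */
    remap_fault(&p3, &seg);         /* later p3 touches it: unmap p1, map p3 */
    remap_fault(&p3, &seg);         /* p3 again: no work, as in the text     */
    return 0;
}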
Incidentally, avoiding the scenario above is why I think that sharing
single pages as opposed to sharing segments is a bad idea: if each
process sharing a segment of address space touches more than one page in it,
a remap fault occurs on each page. I think (and some statistics seem to
support my hunch) that this multiple page access in the same shared
segment is a far more frequent phenomenon than multiple shared segment
access.
peter> Consider the 80286, where pretty much all memory access for large
peter> programs is done by remapping segments.
Are you sure? I think that in all OSes that run on the 286, except maybe
iRMX/286, segments are not unmapped and remapped, but stay mapped all the
time, and can be and are shared.
peter> Loading a segment register is an expensive operation,
Around 20-30 cycles if memory serves me right. Compared to a context
switch it is insignificant. And in any case the 286 MMU does support
shared segments directly, so there is no need to do segment remapping to
simulate shared memory.
This said, your comments about the 286 MMU are irrelevant to a
discussion on acceptability of the cost of simulating shared memory by
remapping them on demand or at a context switch in each process that has
them nominally attached. This discussion is important only when
comparing reverse map MMUs with straight map MMUs, and only when the
reverse map MMU does not support (unlike mine) shared segments, and only
when shared segments are deemed useful.
Yet in your discussions of the 286 MMU there are some common fallacies
and myths, and they merit some comment.
Note first that it is only because of a design misconception (not
quite a mistake) of the 286 designers that loading a segment
register is so expensive. The problem is that the shadow segment
registers are not like TLBs, in that they are reloaded every time, even
if the shadowed segment register *value* has not changed.
This could have easily been avoided by simply comparing the old and new
segment register values. It was not, only because conceivably the
segment descriptors could have been altered even if the value of
the segment register had not in fact changed, and the 286 has no distinct
"flush shadow segment registers" instruction.
I guess that the designers assumed that in their "Pascal"/"Algol" model
of process execution each segment register was dedicated to a specific
function (code, stack, global, var parameters), and supposed not to be
reloaded often, so no need to treat the shadows as caches.
peter> and is to a large extent the cause of the abysmal behaviour of
peter> large programs on that architecture. For an extreme case, the
peter> sieve slows down by a factor of 11 once the array size gets over
peter> 64K.
This is only because probably the HUGE model gets used, which implies
funny code to simulate 32 bit address arithmetic (the HUGE model is so
expensive because of the mistake of putting the ring number in the middle
of a pointer instead of in the most significant bits). On less extreme
examples, or if you code the sieve for the LARGE model, the slowdown is
around 20-50%, even for extremely pointer-intensive operations.
Your figure of 11 is plainly ridiculous and warped by the machinations
of the HUGE model; after all, 32 bit pointer dereferences are only about
3 times slower than 16 bit pointer dereferences, so even a program
that consisted *only* of them would be only 3 times slower.
Note again that this point about 32 bit pointer arithmetic on a 286 has
*nothing* to do on the cost of simulating shared memory by remapping
when the MMU does not support it directly.
peter> My own experience with real codes under Xenix 286 bears this out.
Maybe. *My* experience of recompiling large large numbers of Unix
nonfloat utilities on a 286 tells me that the average slowdown is around
30%. A 10 Mhz 286 is about the equivalent of a PDP-11/73 (1 "MIPS") in
the small model or of a VAX-11/750 (0.7 "MIPS") in the large model, to
all practical (nonfloat Unix applications :->) purposes.
peter> Think of the 80286 as an extreme case of what you're proposing.
I seem to have completely failed to explain myself. The 286 is
*irrelevant* to a discussion on shared memory simulation by implicit
OS-supported or explicit application-requested segment remapping (which I
prefer).
peter> I think it's clear from this experience that frequent reloading
peter> of segment registers is a bad idea.
No, the conclusion is not supported by the 286 example; the 286 is
uniquely poor for reloading segment registers because it does not treat
shadow segment registers as a cache and because its pointers have an
unfortunate format.
Properly designed MMUs with properly designed TLBs, even reverse map
ones, do segment remapping with small or insignificant cost, not worse
than the 286 MMU. Moreover the real overhead lies not in reloading some
lines in the MMU or the TLB; it is in taking the remap fault and in
searching the appropriate kernel structures to find which (nominally
shared) segment to map in that region.
peter> After your discussion of the inappropriate use of another
peter> technology, networks, I would have expected you'd know better.
I am sorry I got myself so badly misunderstood.
peter> As for single address space machines, my Amiga 1000's exceptional
peter> performance... given the slow clock speed and dated CPU (7.14
peter> MHz 68000)... tends to suggest that avoiding MMU tricks might be
peter> a good idea here as well.
MMUs are a difficult subject. A lot of vendors have bungled their MMU
designs, the OS code that supports them, and the VM policies that drive
them. Sun is just *one* of the baddies. That a lot of vendors take many
years to get their act together (if ever) on virtual memory does not
mean that it is a bad technology; it means that maybe it is too subtle
for mere Unix kernel hackers.
peter> The Sparcstation 2 is the first UNIX workstation I've seen with
peter> as good response time to user actions. It's only a 27 MIPS
peter> machine... or approximately 40 times faster.
The MIPS-eating sun bogons strike again! :-)
The people that did Tripos (Martin Richards!) and Amiga (and those that
now maintain them at CBM) seem to be quite another story. I am another
Amiga fan :-). Now, if only they could get their act together
commercially... (please redirect the ensuing flame war to the
appropriate newsgroup :->).
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) (03/10/91)
In article <PCG.91Mar9205121@aberdb.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes: >On 7 Mar 91 13:31:29 GMT, peter@ficc.ferranti.com (Peter da Silva) said: > >peter> In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk> >peter> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes: > >pcg> I would however still maintain that even with conventional multiple >pcg> address space architectures shared memory is not necessary, as >pcg> sending segments back and forth (remapping) gives much the same >pcg> bandwidth. > >peter> I don't think you can really make a good case for this. > >Tell that to the people that ported Mach, and 4.3BSD to the PC/RT... :-) > >My estimate is that remapping can be done on demand (lazy remapping), >not on every context switch, and it does not cost more than a page fault >(which is admittedly an extremely expensive operation, but one about >which people don't complain). Also in many cases lazy remapping costs >nothing, given the vagaries of scheduling. Suppose a segment is shared >between processes 1 and 3; if 1 is deactivated, 2 is activated, and 1 >is reactivated, no remapping need take place, because 3 has not >accessed it in this sequence. Actually, handling shared memory doesn't require the kernel to twiddle the page tables on context switch for almost all machines _except_ for those like the RTs where it's got an inverted page table. Normal page table machines may have many virtual addresses mapped to the same physical address. The kernel needs to intervene only if the page is shared copy-on-write -- the page is marked as write protected and the kernel copies the page only when a task writes to the (shared) page. Remapping is a special case of shared memory. -- Bennet S. Yee Phone: +1 412 268-7571 Email: bsy+@cs.cmu.edu School of Computer Science, Carnegie Mellon, Pittsburgh, PA 15213-3890
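A schematic C rendering of that copy-on-write path; the structures are
invented for the sketch, and a real kernel would of course be editing PTEs
and flushing TLB entries rather than C structs.

#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define PAGE_SIZE 4096

struct page {                       /* one physical frame */
    int  refcount;                  /* how many mappings currently share it */
    char data[PAGE_SIZE];
};

struct mapping {                    /* one task's view of one virtual page  */
    struct page *pg;
    int          writable;          /* cleared while the page is shared COW */
};

/* Invoked on a write-protection fault against a copy-on-write mapping. */
static void cow_write_fault(struct mapping *m)
{
    if (m->pg->refcount > 1) {                  /* still shared: copy now   */
        struct page *copy = malloc(sizeof *copy);
        memcpy(copy->data, m->pg->data, PAGE_SIZE);
        copy->refcount = 1;
        m->pg->refcount--;
        m->pg = copy;
    }
    m->writable = 1;                            /* sole owner: just unprotect */
}

int main(void)
{
    struct page *shared = calloc(1, sizeof *shared);
    shared->refcount = 2;                       /* two tasks map it read-only */
    struct mapping t1 = { shared, 0 }, t2 = { shared, 0 };

    cow_write_fault(&t1);                       /* t1 writes: gets a copy     */
    t1.pg->data[0] = 'x';
    assert(t1.pg != t2.pg && t2.pg->data[0] != 'x');
    return 0;
}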
Michael.Marsden@newcastle.ac.uk (Michael Marsden) (03/10/91)
linley@hpcuhe.cup.hp.com (Linley Gwennap) writes: >(Jeff Lee) >>16777216 terabyte (64-bit) address space. [What comes after tera-?] >"Peta". A 64-bit address space contains 16,384 petabytes. So what >comes after peta-? "Eta". The 64-bit address space is 16 EtaBytes. The non-computing value of Eta is 10^18, i.e. approx 2^60. So what comes after Eta, when we start arguing about 96 bits vs. 128 bits in a decades time? -Mike Mars .--------* Mike ________________________________ Michael.Marsden | Grad. /| /| /| /| / "..never write device drivers | @ | Student / |/ | /_| /_| \ while on acid!" -XXXXXXXXXX | Uk.Ac.Newcastle |__________/ |/ |/ \__/ *----NOT-mjd!!-----------'
peter@ficc.ferranti.com (peter da silva) (03/11/91)
In article <PCG.91Mar9205121@aberdb.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes: > peter> I don't think you can really make a good case [that sending segments back and forth gives much the same bandwidth] > Tell that to the people that ported Mach, and 4.3BSD to the PC/RT... :-) Hasn't the PC/RT been found to have surprisingly poor performance once the number of context switches involved get too high? > My estimate is that remapping can be done on demand (lazy remapping), > not on every context switch, and it does not cost more than a page fault > (which is admittedly an extremely expensive operation, Exactly. > but one about which people don't complain). Well, you and I have both complained about excessive paging in the past. > peter> Consider the 80286, where pretty much all memory access for large > peter> programs is done by remapping segments. > Are you sure? I think that in all OSes that run on the 286 maybe except > for iRMX/286 segments are not unmapped and remapped, but stay always > mapped, and can be and are shared. Once you want to access more than 256K (64K for each of DS, SS, CS, and ES) you *have* to reload the segment registers. The machine can *not* directly address more than 64K per segment, and it only has the 4 segment registers. This is a hard limit unless you start reloading segment registers... which is sufficiently expensive to have an exquisitely painful impact on performance. > peter> Loading a segment register is an expensive operation, > Around 20-30 cycles if memory serves me right. Compared to a context > switch it is insignificant. But it happens so much more often. > This said, your comments about the 286 MMU are irrelevant to a > discussion on acceptability of the cost of simulating shared memory by > remapping them on demand or at a context switch in each process that has > them nominally attached. Well, only that in the case you're talking about the cost of remapping the segments is even higher. [ re: costs of segment reloads on the 80286 ] > This is only because probably the HUGE model gets used, which implies > funny code to simulate 32 bit address arithmetic No, just large model. Once you have to operate on more than two data segments at a time (which basically means more than two objects at a time) you have to reload the segment register on each data access. > peter> I think it's clear from this experience that frequent reloading > peter> of segment registers is a bad idea. > No, the conclusion is not supported by the 286 example; the 286 is > uniquely poor for reloading segment registers because it does not treat > shadow segment registers as a cache and because its pointers have an > unfortunate format. True. Your point. Of course page faults are such a cheap operation. :-> -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
efeustel@prime.com (Ed Feustel) (03/11/91)
It is not necessary for segment access to take such a long time. Prime
offers a multi-segment architecture and it performs quite satisfactorily
because it was designed to perform well with segments. Intel could design
their architecture more appropriately for segments if they wished (and
they have with the i960XA). If enough people demanded it, you could see a
substantial improvement now that they have the extra silicon available.

Ed Feustel
Prime Computer
pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/11/91)
On 10 Mar 91 17:50:11 GMT, peter@ficc.ferranti.com (peter da silva) said:

peter> Hasn't the PC/RT been found to have surprisingly poor performance
peter> once the number of context switches involved get too high?

I don't know, but this could be for many other reasons. I remember having
seen hints that the RS/6000 does badly on context switching, but whether
this is due to shared memory simulation or rather to one of a million
probable bogosities in the OS I cannot know.

peter> Once you want to access more than 256K (64K for each of DS, SS,
peter> CS, and ES) you *have* to reload the segment registers. The
peter> machine can *not* directly address more than 64K per segment, and
peter> it only has the 4 segment registers. This is a hard limit unless
peter> you start reloading segment registers... which is sufficiently
peter> expensive to have an exquisitely painful impact on performance.

Maybe you have tired of reading my article before its end, but I maintain
that the 286 large model, even in pointer-intensive programs, has at most
a 50% average slowdown compared with the small model, except for
pathological cases. Such pathological cases are easy to find for every
cache organization, as you will readily concede. Accessing two arrays that
happen to map to the same cache lines kills almost every machine out
there, for one thing...

That the shadow register organization of the 286 is misguided I have been
ready to concede, but it should not reflect on a judgement on the merits
of shared memory simulation via remapping for reverse MMUs, or on the
merits of segmented architectures in general. I also have the impression
that you loathe the 286 two-dimensional addressing scheme so much that you
also detest all segmentation schemes, but the two issues are unrelated.
Most paged and segmented VM systems have linear addressing, e.g. the 370,
or the VAX-11, and so on.

peter> Loading a segment register is an expensive operation,

pcg> Around 20-30 cycles if memory serves me right. Compared to a
pcg> context switch it is insignificant.

peter> But it happens so much more often.

Dereferencing a far pointer costs only about three times as much as
dereferencing a near pointer, and not every instruction is a far pointer
dereference. Also, when one does segment remapping, really one twiddles
the contents of a field in the LDT (the page table), not that of the
segment registers, and at most once per context switch (and this does not
happen on most context switches). The cost of reloading a segment register
and of remapping a segment are therefore totally unrelated.

peter> Well, only that in the case you're talking about the cost of
peter> remapping the segments is even higher.

True... But not tragic. Taking a trap, finding out which segment should be
remapped, fiddling the LDT of the process that had the segment mapped, and
remapping it might cost maybe as much as reading a block off the buffer
cache, i.e. a few hundred instructions. I would think that it is of the
order of a page fault (mind you, I was maybe not clear before: just the
CPU cost of a page fault, not the many milliseconds for the IO time
possibly associated with it), and less frequent. I remember that the BSD
VM subsystem that used a similar technique to simulate a 'referenced' bit
for each page (take a fault and map the page in) cost less than 5% on a
VAX-11/780, and that was for much more frequent faulting.

People do IPC using pipes or System V MSGs or sockets which cost far far
far more. On machines like the 286 that can share segments simultaneously,
pure shared memory is OK.
On those that cannot, like the RT, the cost is not excessive, and probably inferior to that of most alternatives. -- Piercarlo Grandi | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
peter@ficc.ferranti.com (Peter da Silva) (03/11/91)
In article <1991Mar10.154107.10976@newcastle.ac.uk> Michael.Marsden@newcastle.ac.uk (Michael Marsden) writes: > So what comes after Eta, when we start arguing about 96 bits vs. 128 bits in > a decades time? Gotta be 128 bits, since after you use up 64 bits for the IP address 96 bits would only leave 32 bits of address space for each node. -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
rh@craycos.com (Robert Herndon) (03/12/91)
Those 10^(n*3) powers used so often in engineering and the sciences are:

    English prefix    multiplier    symbolic prefix
    exa-              10^18         X?
    peta-             10^15         P
    tera-             10^12         T
    giga-             10^9          G
    mega-             10^6          M
    kilo-             10^3          K
    milli-            10^-3         m
    micro-            10^-6         u (greek mu)
    nano-             10^-9         n
    pico-             10^-12        p
    femto-            10^-15        f?
    atto-             10^-18        a?

The symbolic prefixes are used before units, e.g., Tbyte, GH(ert)z, Kbar,
nH(enry), pF(arad), etc. The symbolic prefixes can also be ambiguous
(T == Tesla, so a TT == a teratesla?; G == unit of acceleration, so
GG = Billion (Milliard for the Europeans) Gravities?), so some care is
required...

I may have the last two reversed, as these are more often used in bio
(femtomolar solutions, etc.) and physics, but I think I've got them right.
Other persons have also corrected my exa- to eka-, but I stand by my usage
(eka- means "like", as in Mendeleev's eka-elements). Perhaps some of the
device physics people can clarify common usage of the low-end units.

Robert
--
Robert Herndon -- not speaking officially for Cray Computer.
Cray Computer Corporation              719/540-4240
1110 Bayfield Dr.                      rh@craycos.com
Colorado Springs, CO 80906             "Ignore these three words."
ccplumb@rose.uwaterloo.ca (Colin Plumb) (03/12/91)
rh@craycos.com (Robert Herndon) wrote:
>Those 10^(n*3) powers used so often in engineering and the
>sciences are:
>
>    English prefix    multiplier    symbolic prefix
>    exa-              10^18         X?    [No, E]
>    peta-             10^15         P
>    tera-             10^12         T
>    giga-             10^9          G
>    mega-             10^6          M
>    kilo-             10^3          K
>    milli-            10^-3         m
>    micro-            10^-6         u (greek mu)
>    nano-             10^-9         n
>    pico-             10^-12        p
>    femto-            10^-15        f?    [Yes, f]
>    atto-             10^-18        a?    [Yes, a]

The letter prefix for exa- is E. EHz, EeV (stop drooling, particle
physicists), etc. Exabyte got their name from something. femto and atto
are indeed 10^-15 and 10^-18, respectively, as you have illustrated them.

Officially, the letter prefix for kilo is lower case k, not upper case K,
but I like the upper case implies >1 pattern. Then there's deci- for 1/10
and centi- for 1/100, as well as numbers (deka and hecto, I think) for 10
and 100, but nobody ever uses those.

As all who have implemented RS232 can attest, standards are meant to be
ignored when convenient... long live the mho!
--
	-Colin
peter@ficc.ferranti.com (Peter da Silva) (03/12/91)
In article <PCG.91Mar11144147@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes: > People do IPC using pipes or System V MSGs or sockets which cost far far > far more. And the performance costs of this, and other results of getting the MMU so intimately involved with things, is the main reason why a 27 MIPS SparcStation 2 doesn't seem any faster than a 0.7 MIPS Amiga 1000. (Of course, that begs the question of why a 2-3 MIPS Mac-II is just as slow with no MMU overhead) You're right... for systems with so much MMU overhead anyway giving up shared memory is probably not such a big deal. -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) (03/13/91)
On 12 Mar 91 14:52:17 GMT, peter@ficc.ferranti.com (Peter da Silva) said:
peter> In article <PCG.91Mar11144147@aberdb.cs.aber.ac.uk>
peter> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
pcg> People do IPC using pipes or System V MSGs or sockets which cost
pcg> far far far more.
peter> And the performance costs of this, and other results of getting
peter> the MMU so intimately involved with things, is the main reason
peter> why a 27 MIPS SparcStation 2 doesn't seem any faster than a 0.7
peter> MIPS Amiga 1000.
I still beg to differ. The cost of IPC is not that enormous, as IPC is
not that inefficient, and even when it is, it is not that frequent. Even
under traditional Unix, where a pipe often implies six copies of the
same data (source->stdio buffer, stdio buffer->system buffer, system
buffer->disk, disk->system buffer, system buffer->stdio buffer, stdio
buffer->target), IPC cannot account for one or two orders of magnitude
of inefficiency. Even more importantly, neither could a very poorly
designed MMU.
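[As a concrete picture of the user-level ends of that path, a minimal C
sketch -- not drawn from any of the systems discussed here, with an
arbitrary record format and count -- of a stdio-buffered producer and
consumer talking over a pipe:]

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

/* A stdio-buffered producer/consumer over a pipe.  Each record is
 * copied source -> stdio buffer -> kernel pipe buffer -> stdio buffer
 * -> target, i.e. the user-visible ends of the path described above
 * (old pipe implementations could add buffer-cache/disk copies). */
int main(void)
{
    int fds[2];
    char record[128];

    if (pipe(fds) == -1) { perror("pipe"); return 1; }

    if (fork() == 0) {                     /* child: consumer */
        FILE *in = fdopen(fds[0], "r");
        close(fds[1]);
        while (fgets(record, sizeof record, in) != NULL)
            ;                              /* "target" copy lands here */
        fclose(in);
        _exit(0);
    }

    FILE *out = fdopen(fds[1], "w");       /* parent: producer */
    close(fds[0]);
    for (int i = 0; i < 100000; i++) {
        snprintf(record, sizeof record, "record %d\n", i);  /* source  */
        fputs(record, out);                /* copy into stdio buffer   */
    }
    fclose(out);                           /* flush: copy into kernel  */
    wait(NULL);
    return 0;
}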
The ultimate difference between a SPARC 2 and an Amiga is not that one
has an MMU and the other has not; it is that for one, inefficiency does
not matter or even increases sales, while the other is sold to people
who purchase it with their own money. This has amazing effects on how
sloppily the system software gets written, and on how easily vendors
can get away with it.
An MMU that is not sloppily designed and used has a negligible effect
on virtually all applications, and possibly substantial benefits.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
jesup@cbmvax.commodore.com (Randell Jesup) (03/14/91)
In article <PCG.91Mar9194211@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>On 6 Mar 91 04:41:41 GMT, jesup@cbmvax.commodore.com (Randell Jesup) said:
>pcg> [ ... yes, but only if you have multiple address spaces, and only
>pcg> as irrevocably read only data. With 64 bit addresses one had better
>pcg> have single address space systems ... ]
>
>jesup> Shared libraries are very useful on systems with single address
>                                                         ^^^^^^
>Probably you mean multiple here; in single address space systems all
>libraries are "shared".

No, I meant "single".  Single address space doesn't mean a single shared
r/w address space, and it doesn't mean that all libraries are public (as
opposed to private).  Libraries can/are used for dynamic run-time binding
as well as for code-sharing.  "Single address space" means to me that
there are no duplicated virtual addresses between different processes
(which of course makes sharing memory/code far easier, since a pointer is
valid to all processes, and libraries do not have to be
position-independent).

>jesup> spaces. [ ... ]
>
>Ah yes, of course. One of the tragic legacies of the PDP-11 origin of
>Unix is that shared libraries have been so slow in appearing. Not all
>is a win for shared libraries (one has to be careful about clustering
>of functions in shared library), but I agree.

Absolutely agreed.  Witness the apocryphal 1Meg XClock.

>jesup> but if it were implemented (it might be eventually), it would
>jesup> almost certainly continue to use a single address space.
>
>It will be interesting to see how you do it on a machine with just 32
>bits of addressing. You cannot do much better than having 256 regions of
>16MB each, which in today's terms seems a bit constricting. Not for an
>Amiga probably, admittedly. Amiga users are happy with 68000 addressing,
>because its sw is not as bloated as others. There are other problems.

256 times 16Mb???!!!!  Constricting??  First, why would we partition on
16Mb boundaries, when the average application uses 50-500K?  Also, I
can't see it being constricting until memory + swapspace is greater than
2 or 4 gigs.  Constricting for a 100 user shared system, for a
mini-supercomputer, but not for a single-user oriented machine.  68000
addressing is restricting (24 bits), but all mmu-capable Amigas are
32-bit machines.

(BTW, due to shared libs, easy interprocess communication, whatever,
Amiga applications are usually far smaller than, say, the X/Unix
versions of the same thing.)

>Psyche from Rochester does something like that, when protection is not
>deemed terribly important; it maps multiple "processes" in the same
>address space, so that they can enjoy very fast communications.

Sounds very much like threads to me.  But the protection and single
address space issues are separate, IMHO.

>pcg> I would however still maintain that even with conventional multiple
>pcg> address space architectures shared memory is not necessary, as
>pcg> sending segments back and forth (remapping) gives much the same
>pcg> bandwidth.
>
>jesup> It requires far more OS overhead to change the MMU tables,
>
>The Mach people have been able to live with it on the ROMP; not terribly
>happy, but the price is not that big; "far more" is a bit excessive.
>Point partially acknowledged, though; I had written:

This depends highly on how frequent you think passing segments is going
to be.  On a highly shared-message-based system like the amiga, the
numbers of messages passed per second can be very high (many thousands).
>pcg> Naturally one pays a performance penalty. The problem is
>pcg> historical;
>
>jesup> and requires both large-grained sharing and program knowledge of
>jesup> the MMU boundaries.
>
>But this is really always true; normally you cannot share 13 bytes (in
>multiple address space systems). As a rule one can share *segments*.
>Some systems allow you to share single *pages*, but I do not like that
>(too much overhead, little point in doing so, most MMUs support much
>more easily shared segments than pages; even the VAX has a 2 level MMU
>with 64KB shared segments, contrarily to common misconceptions).

In any protected amiga, sharing would almost certainly be on the page
level.  Anything else is far too inefficient in memory usage.  Also,
more than one object would reside in the shared memory space (i.e.
AllocMem(size,MEMF_SHARED,...) or some such).  This greatly cuts down on
the amount of shared memory required, and the amount of real memory to
hold shared pages (since the % utilization of the pages is higher).

>Here I feel compelled to restate the old argument: if shared memory is
>read-only, it can well be *irrevocably* read-only and require no actual
>remapping, so no problems.
>
>If it is read-write some sort of synchronization must be provided. In
>most cases synchronization must instead be provided via kernel services,
>and in that case it is equivalent to remapping, so simultaneous shared
>memory is not needed.

But such services can be many orders of magnitude faster than remapping.
An example is a program I wrote and posted to comp.os.research,
concerning pipe speed versus shared memory speed to transfer data.  I
merely sent a shared-memory message from one process to the other saying
the data was in the shared buffer.  Rebuilding process page tables on
each pass would have swamped the actual transfer time.

>Explicit remaps involve a small change in programming paradigm, in that
>it is neither simultaneous shared memory, nor message passing (which
>implies copying, at least virtually). Its one major implementation, as
>far as I know, is in MUSS.

Messages can be shared also (greatly increases message passing speed).

I don't think we disagree that much (partially on terminology).
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)
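[The comp.os.research program itself is not reproduced in the thread.
The following is only a rough sketch of the kind of comparison described,
with made-up sizes and System V shared memory standing in for whatever
mechanism was actually used: one half copies every buffer through a pipe,
the other writes the buffer into a shared segment and passes a one-byte
notice.  Timing the two halves is left to the reader.]

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>

#define BUFSZ  4096
#define PASSES 1000

/* Read exactly n bytes (pipes may return short reads). */
static void readn(int fd, char *p, size_t n)
{
    ssize_t r;
    while (n > 0 && (r = read(fd, p, n)) > 0) { p += r; n -= (size_t)r; }
}

int main(void)
{
    int toc[2], top[2];            /* parent->child and child->parent */
    static char data[BUFSZ];
    char token = 'x';

    /* Shared segment: the "message" only says the data is in here. */
    int shmid = shmget(IPC_PRIVATE, BUFSZ, IPC_CREAT | 0600);
    char *shared = shmat(shmid, NULL, 0);

    pipe(toc); pipe(top);
    if (fork() == 0) {                              /* child           */
        char buf[BUFSZ];
        for (int i = 0; i < PASSES; i++) {          /* pipe version    */
            readn(toc[0], buf, BUFSZ);              /* full copy in    */
            write(top[1], &token, 1);
        }
        for (int i = 0; i < PASSES; i++) {          /* shared version  */
            readn(toc[0], &token, 1);               /* 1-byte notice   */
            memcpy(buf, shared, BUFSZ);             /* read in place   */
            write(top[1], &token, 1);
        }
        _exit(0);
    }

    for (int i = 0; i < PASSES; i++) {              /* pipe version    */
        write(toc[1], data, BUFSZ);                 /* full copy out   */
        readn(top[0], &token, 1);
    }
    for (int i = 0; i < PASSES; i++) {              /* shared version  */
        memcpy(shared, data, BUFSZ);                /* write in place  */
        write(toc[1], &token, 1);                   /* 1-byte notice   */
        readn(top[0], &token, 1);
    }
    wait(NULL);
    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}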
tgg@otter.hpl.hp.com (Tom Gardner) (03/14/91)
|As all who have implemented RS232 can attest, standards are meant to be
|ignored when convenient... long live the mho!

Of course, the SI unit of conductance is the siemens (S), not the mho.
The unit of time is the second (s). Unfortunately people often get it
wrong and specify, for example, DRAM access times in nS. This is
equivalent to specifying their weight as being x metres.

OK, so notes is a quick informal medium where mistakes are tolerated.
But shipping products which get this detail wrong makes me wonder how
many other details they got wrong...
pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) (03/16/91)
In article <PCG.91Mar9194211@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:

pcg> It will be interesting to see how you do it on a machine with just
pcg> 32 bits of addressing. You cannot do much better than having 256
pcg> regions of 16MB each, which in today's terms seems a bit
pcg> constricting. Not for an Amiga probably, admittedly. Amiga users
pcg> are happy with 68000 addressing, because its sw is not as bloated
pcg> as others. There are other problems.

To this you comment repeating exactly what I say here, that is 32 bit
addressability is not constricting for an Amiga because its typical
applications are small and do not rely on a large sparse address space.

This is unfair! You cannot object to what I say by repeating it! I could
fall into the trap, not realize that your reply is just a rewording of
what I have written myself, and try to show that it is wrong, and then
look even more of a fool, especially if I succeed! :-).

pcg> Here I feel compelled to restate the old argument: if shared memory is
pcg> read-only, it can well be *irrevocably* read-only and require no actual
pcg> remapping, so no problems.

pcg> If it is read-write some sort of synchronization must be provided.
pcg> In most cases synchronization must instead be provided via kernel
pcg> services, and in that case it is equivalent to remapping, so
pcg> simultaneous shared memory is not needed.

jesup> But such services can be many orders of magnitude faster than
jesup> remapping.

jesup> An example is a program I wrote and posted to comp.os.research,
jesup> concerning pipe speed versus shared memory speed to transfer
jesup> data. I merely sent a shared-memory message from one process to
jesup> the other saying the data was in the shared buffer.

You only win if you can make assumptions about the hw environment, and
not use OS semaphores, but use hw locks. Again, it would be better
anyhow to have tightly communicating processes in the same address space
than using shared memory.

jesup> Rebuilding process page tables on each pass would have swamped
jesup> the actual transfer time.

Hardly, I think, given a suitable buffer size and a quick way of
modifying two page table entries. I cannot imagine an implementation of
remapping which is significantly more expensive than an implementation
of semaphores. Again, hw locks are faster than sw remapping or sw
semaphores, but of more limited applicability.

And remapping after all is just something you do if you have to run
applications designed for multiple address spaces on a single address
space architecture. What I say is not that it's optimal, but that it is
tolerable.

jesup> Messages can be shared also (greatly increases message passing
jesup> speed).

No, the speed does not increase a lot. Unless the synchronization is
done using hardware primitives, which is nonportable, or at least not
always applicable. Also, my position is that messages are not as good as
simply sharing in a single address space. Yet, if one does messaging for
one reason or another, doing it by remapping is much better than the
alternatives, and is not that much worse than shared memory.

jesup> I don't think we disagree that much (partially on terminology).

No, we actually don't disagree a lot, except for the part where I draw
the attention to the fact that shared memory of any flavour implies some
sort of synchronization, and this can only be cheaply provided under
fairly limiting assumptions (but valid on an Amiga for example).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
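[To make the "hw locks versus sw semaphores" trade-off concrete, a small
sketch in C: GCC's __sync builtins stand in here for whatever test-and-set
the machine offers, and a System V semaphore (semid assumed to come from a
prior semget(), initialized to 1) is the kernel-mediated path.]

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Two ways of guarding the same shared buffer: a spin lock built on a
 * hardware atomic, versus a semaphore that costs a kernel trap on
 * every P and V.  The spin lock only helps under limiting assumptions:
 * short critical sections, and hardware that provides the atomic. */

static volatile int lock_word;                /* lives in shared memory */

static void hw_lock(void)
{
    while (__sync_lock_test_and_set(&lock_word, 1))
        ;                                     /* spin, no kernel entry  */
}

static void hw_unlock(void)
{
    __sync_lock_release(&lock_word);
}

static void sem_p(int semid)                  /* kernel-mediated lock   */
{
    struct sembuf op = { 0, -1, 0 };
    semop(semid, &op, 1);
}

static void sem_v(int semid)                  /* kernel-mediated unlock */
{
    struct sembuf op = { 0, +1, 0 };
    semop(semid, &op, 1);
}

/* Copy n bytes into the shared buffer under each kind of lock. */
void produce_hw(char *shared_buf, const char *src, unsigned n)
{
    hw_lock();
    for (unsigned i = 0; i < n; i++)
        shared_buf[i] = src[i];
    hw_unlock();
}

void produce_sem(int semid, char *shared_buf, const char *src, unsigned n)
{
    sem_p(semid);
    for (unsigned i = 0; i < n; i++)
        shared_buf[i] = src[i];
    sem_v(semid);
}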
peter@ficc.ferranti.com (Peter da Silva) (03/19/91)
In article <PCG.91Mar15184558@aberdb.test.aber.ac.uk>, pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) writes:
> No, the speed does not increase a lot. Unless the synchronization is
> done using hardware primitives, which is nonportable, or at least not
> always applicable.

Why is that any less portable than doing a floating multiply using
hardware primitives, for example? Just make "sendMessage" a primitive
operation (put it in a library, for example). Then when you have a fast
hardware mechanism for synchronising message transactions you can use it.

The performance of the Amiga demonstrates that this is a good idea, and
would be badly hurt by the overhead of remapping because in most cases
you don't do a context switch for each message.
--
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"
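[At the interface level, "make sendMessage a primitive" need be no more
than a declaration like the following; the names are invented and do not
correspond to any particular system's API.  The library behind it is then
free to use whatever the hardware offers.]

/* msgport.h -- illustrative interface only */

struct Message;                 /* opaque to callers                     */
struct MsgPort;                 /* opaque: queue + wakeup state          */

/* Queue msg on port and wake its owner.  The library may use a hardware
 * test-and-set, an interrupt-disable window, or a kernel trap
 * underneath; callers cannot tell and need not care.                   */
void sendMessage(struct MsgPort *port, struct Message *msg);

/* Block until a message arrives, then dequeue and return it.           */
struct Message *waitMessage(struct MsgPort *port);

[A port implementation can then be swapped per machine without touching
the programs built on top of it.]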
jesup@cbmvax.commodore.com (Randell Jesup) (03/23/91)
In article <PCG.91Mar15184558@aberdb.test.aber.ac.uk> pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>To this you comment repeating exactly what I say here, that is 32 bit
>addressability is not constricting for an Amiga because its typical
>applications are small and do not rely on a large sparse address space.
>This is unfair! You cannot object to what I say by repeating it! :-)

Who said I objected?  Actually, I think you missed my point: on something
other than an amiga, even if the software is bloated, it still doesn't
matter until you're a) actually using 4 gig worth of memory space, and
b) have at least 4 gig of swap space.  (well, less if you're
memory-mapping files - 4 gig disk space in that case (but you'd have to
be mapping in almost all your disk!))

>jesup> But such services can be many orders of magnitude faster than
>jesup> remapping.
>
>jesup> An example is a program I wrote and posted to comp.os.research,
>jesup> concerning pipe speed versus shared memory speed to transfer
>jesup> data. I merely sent a shared-memory message from one process to
>jesup> the other saying the data was in the shared buffer.
>
>You only win if you can make assumptions about the hw environment, and
>not use OS semaphores, but use hw locks. Again, it would be better
>anyhow to have tightly communicating processes in the same address space
>than using shared memory.

Who made assumptions about the hardware environment?  I merely called an
OS primitive (PutMsg) which sent a message to a specific port.  How it
accomplishes this is irrelevant.

>jesup> Rebuilding process page tables on each pass would have swamped
>jesup> the actual transfer time.
>
>Hardly, I think, given a suitable buffer size and a quick way of
>modifying two page table entries. I cannot imagine an implementation of
>remapping which is significantly more expensive than an implementation
>of semaphores. Again, hw locks are faster than sw remapping or sw
>semaphores, but of more limited applicability.

I think you underestimate page-table-rebuild time, or overestimate how
fast shared messages can be.  Semaphores expensive??  Hardly (at least
in a single-processor environment).

>And remapping after all is just something you do if you have to run
>applications designed for multiple address spaces on a single address
>space architecture. What I say is not that it's optimal, but that it is
>tolerable.

Point taken.

>jesup> Messages can be shared also (greatly increases message passing
>jesup> speed).
>
>No, the speed does not increase a lot. Unless the synchronization is
>done using hardware primitives, which is nonportable, or at least not
>always applicable.

So long as it's hidden in the OS, I don't care if it uses blue men from
mars.  The OS _can_ use any hardware primitives it likes.  For example,
message passing on the amiga is (essentially) disable interrupts, addtail
the message onto the port's list (easy with doubly-linked lists), enable,
and signal the port owner's process (which may cause a reschedule if the
owner is higher priority).
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)
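[That sequence is easy to sketch.  The following is not the exec.library
source, just the shape of the mechanism described above, with the
interrupt-masking and signalling primitives stubbed out because they are
the machine-specific parts.]

/* Doubly-linked list node; a real message would embed one of these. */
struct Node {
    struct Node *next, *prev;
};

struct MsgPort {
    struct Node  head;             /* circular list of queued messages */
    void        *owner;            /* task to wake; opaque here        */
    unsigned     sigbit;           /* which signal bit wakes it        */
};

/* Machine-specific pieces, stubbed out: on a real system these mask and
 * unmask interrupts and poke the scheduler. */
static void disable(void) { /* e.g. mask interrupts */ }
static void enable(void)  { /* e.g. unmask interrupts */ }
static void signal_task(void *task, unsigned bit)
{
    (void)task; (void)bit;         /* e.g. set signal bit, maybe reschedule */
}

void port_init(struct MsgPort *p, void *owner, unsigned sigbit)
{
    p->head.next = p->head.prev = &p->head;   /* empty circular list */
    p->owner  = owner;
    p->sigbit = sigbit;
}

static void add_tail(struct Node *head, struct Node *n)
{
    n->prev = head->prev;          /* constant time with a doubly-    */
    n->next = head;                /* linked list: no list traversal  */
    head->prev->next = n;
    head->prev = n;
}

void put_msg(struct MsgPort *port, struct Node *msg)
{
    disable();                               /* 1. shut out interrupts  */
    add_tail(&port->head, msg);              /* 2. queue the message    */
    enable();                                /* 3. interrupts back on   */
    signal_task(port->owner, port->sigbit);  /* 4. wake the port owner  */
}

[Because the enqueue is constant time, the disable/enable window stays
tiny on a uniprocessor, which is consistent with the thousands of
messages per second mentioned earlier in the thread.]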