[comp.arch] Translating 64-bit addresses

connors@hplabsz.HP.COM (Tim Connors) (02/20/91)

Now that we've entered the brave new world of 64 bit (flat) address spaces,
is it time to revive the old flame wars on address translation mechanisms?

For 32 bit addresses, Motorola's MC68851 uses a "two level" translation
tree involving 4K pages (12 bits) and two 10 bit indices, one index for
each level.  How could this technique be applied to 64 bit addresses?
Would more levels be needed?  Should the page size be larger?

More interestingly, could the pointers which link one level to the next be
only 32 bits and thus save on translation table size?  This might
limit the placement of the tables in a machine with more than 4Gbytes of RAM.
It also requires switching from 64 to 32 bit mode during TLB miss handling.

What about inverted page tables.  Would they be any better for 64 bit
addresses?  Does this make life tough for the MACH operating system?

Are 64 bit address spaces more likely to be sparse?  What effect does that
have on the translation mechanism?

I can think of a lot more questions, but I'll leave it there except to ask
if anyone from MIPS can tell us how you intend to do address translation on
the R4000?

Sincerely,
Tim

af@spice.cs.cmu.edu (Alessandro Forin) (02/21/91)

In article <6590@hplabsz.HP.COM>, connors@hplabsz.HP.COM (Tim Connors) writes:
> ...
> Does this make life tough for the MACH operating system?
> 

May I humbly ask which original reading of our code prompts this
question?

sandro-

mash@mips.COM (John Mashey) (02/24/91)

In article <6590@hplabsz.HP.COM> connors@hplabs.hp.com (Tim Connors) writes:
>Now that we've entered the brave new world of 64 bit (flat) address spaces,
>is it time to revive the old flame wars on address translation mechanisms?
...
>I can think of a lot more questions, but I'll leave it there except to ask
>if anyone from MIPS can tell us how you intend to do address translation on
>the R4000?

The general approach (which is about all I can talk about right now)
is essentially identical to the R3000's, although low-level details
differ:
	1) Generate an address.
	2) Send it to the TLB
	3) If it matches, translation is done.  Do protection checks
	as needed.
	4) If no match, stuff the offending address and/or appropriate portions
	thereof into coprocessor-0 registers, and invoke a special
	fast-trap kernel routine to refill the TLB.  Different environments
	use different refill routines, depending on their preference
	for PTE layouts and organization.
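For the flavor of it, a minimal C sketch of that flow; the names, the linear
probe, and the trivial replacement policy are purely illustrative, not the
MIPS coprocessor-0 interface:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12                    /* 4 KB pages */
#define TLB_SIZE   64

struct tlb_entry { uint64_t vpn, pfn; int valid; };
static struct tlb_entry tlb[TLB_SIZE];

/* Stand-in for the OS refill routine: walk whatever PTE organization the
 * kernel prefers.  Identity-mapped here just so the sketch is complete. */
static uint64_t refill_pte(uint64_t vpn) { return vpn; }

uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;                   /* 1) the address    */
    uint64_t off = vaddr & ((1ULL << PAGE_SHIFT) - 1);

    for (size_t i = 0; i < TLB_SIZE; i++)                 /* 2) present to TLB */
        if (tlb[i].valid && tlb[i].vpn == vpn)            /* 3) match: done    */
            return (tlb[i].pfn << PAGE_SHIFT) | off;

    /* 4) miss: "trap" to the refill routine, then retry the access */
    size_t slot = (size_t)(vpn % TLB_SIZE);               /* crude replacement */
    tlb[slot].vpn = vpn;
    tlb[slot].pfn = refill_pte(vpn);
    tlb[slot].valid = 1;
    return (tlb[slot].pfn << PAGE_SHIFT) | off;
}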

At this level of discourse, the main difference is the ability for each entry to
map from 4KB -> 16MB.  Obviously, it is trivial to use the big entries
for things like frame buffers, but (in my opinion) it will be useful
to end up with OSs that can use larger pages when they need to.  This is already
an issue with DBMS and scientific vector code, where it is quite possible
to overpower any reasonable-to-build TLB if it maps only 4K or 8KB pages,
regardless of whether it uses hardware or software refill.
(True regardless of 32 or >32-bit addressing).  There are various
ways of coding refill routines that provide mixed-page sizes;
I did one as an example a while back that looked like it cost just
a few more cycles, so I believe that it is practical.
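One way to picture an entry that can map anything from 4KB to 16MB is a
per-entry mask.  A sketch only, with an invented field layout (not the actual
R4000 register format) and bases assumed aligned to their mask:

#include <stdint.h>

/* Each entry carries its own mask: 0xFFF for a 4 KB page, 0xFFFFFF for 16 MB. */
struct big_tlb_entry { uint64_t vbase, mask, pbase; };

int entry_matches(const struct big_tlb_entry *e, uint64_t vaddr)
{
    return (vaddr & ~e->mask) == e->vbase;     /* compare only the tag bits     */
}

uint64_t entry_translate(const struct big_tlb_entry *e, uint64_t vaddr)
{
    return e->pbase | (vaddr & e->mask);       /* offset width follows the mask */
}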

Note: I occasionally see marketing-stuff that claims that hardware
refill is much better than software refill. I might believe this
if somebody showed numbers, also....

BTW: back to the opportunity cost of 64-bit integers.  I went back
to the plot, and measured the datapath itself a little more carefully,
as my earlier figures included datapath + associated control.

Now, it looks like the die cost of 64-bit integers is more like
4-5%, not 7%.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

cprice@mips.COM (Charlie Price) (02/24/91)

In article <6590@hplabsz.HP.COM> connors@hplabs.hp.com (Tim Connors) writes:
>Now that we've entered the brave new world of 64 bit (flat) address spaces,
>is it time to revive the old flame wars on address translation mechanisms?
>
>For 32 bit addresses, Motorola's MC68851 uses a "two level" translation
>tree involving 4K pages (12 bits) and two 10 bit indices, one index for
>each level.  How could this technique be applied to 64 bit addresses?
>Would more levels be needed?  Should the page size be larger?
>
>More interestingly, could the pointers which link one level to the next be
>only 32 bits and thus save on translation table size?  This might
>limit the placement of the tables in a machine with more than 4Gbytes of RAM.
>It also requires switching from 64 to 32 bit mode during TLB miss handling.
>
>What about inverted page tables.  Would they be any better for 64 bit
>addresses?  Does this make life tough for the MACH operating system?
>
>Are 64 bit address spaces more likely to be sparse?  What effect does that
>have on the translation mechanism?
>
>I can think of a lot more questions, but I'll leave it there except to ask
>if anyone from MIPS can tell us how you intend to do address translation on
>the R4000?

Address translation for the R4000 is a lot like the R2000/R3000.
The short answer is that the in-memory page-table arrangement
is entirely up to the OS programmers because it is all done by software.

These processors have a fully-associative on-chip TLB.
During execution, the hardware looks in the TLB.
If the right information is not in the TLB,
the processor takes an exception and *software* refills the TLB.
The software can do whatever it likes.

DETAILS:  Hit "N" now if not interested.

The processors do give the exception handler enough information
to do the TLB refill, and this information is in the right form to
make a one-level page table especially fast.
If you use a one-level page table, the R3000 user TLB miss refill
routine takes 9 instructions.

The R3000 has a CONTEXT register that looks like:
 -----------------------------------------------
 | PTE base    |   bad VPN                  |..|
 -----------------------------------------------
					      ^----- 2 bits of 0

The PTE-base part is filled in by the kernel during a context switch.
The bad-VPN field is filled at exception time, by the hardware,
with the Virtual Page Number (VPN) of the failed translation.
The net intended effect is that when VPN NNN gets a translation fault,
the CONTEXT register contains the *kernel address* of the
1-word user page-table entry for page NNN!
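In C, the effect is roughly the following (the 4KB page size, the 19-bit VPN
field, and the 1-word PTE size are assumptions for illustration; the real
register packs the fields in hardware):

#include <stdint.h>

/* Form the kernel address of the 1-word user PTE for the faulting page,
 * the way the CONTEXT register does it in hardware. */
uint32_t context_value(uint32_t pte_base, uint32_t bad_vaddr)
{
    uint32_t bad_vpn = (bad_vaddr >> 12) & 0x7FFFF;  /* VPN of the failed access */
    return pte_base | (bad_vpn << 2);                /* "..": two low bits of 0  */
}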

On the R2000/R3000 a TLB miss for a user-mode access (i.e. a user program)
vectors to the UTLB miss exception vector so it can be handled quickly.
This routine is 9 instructions for a 1-word PTE (add 1 instruction
to shift the address left one for 2-word wide PTE like RISC/os uses).

For a kernel-mode TLB miss, the exception sets a fault cause register
and vectors to the common exception vector, and this takes more effort
to sort out (but also presumably happens a lot less often).
The R4000 takes all non-nested TLB misses through the fast exception vector.

There is nothing that *requires* anybody to use a one-level page table.
If you think that memory savings or whatever makes it worth the
cost in CPU cycles to have a more complex TLB refill routine,
then you (the OS hacker) are allowed to make that tradeoff.

-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/01/91)

In article <6590@hplabsz.HP.COM> connors@hplabs.hp.com (Tim Connors) writes:
>For 32 bit addresses, Motorola's MC68851 uses a "two level" translation
>tree involving 4K pages (12 bits) and two 10 bit indices, one index for
>each level.  How could this technique be applied to 64 bit addresses?
>Would more levels be needed?  Should the page size be larger?

The MC68851 actually allows a variable number of levels - the table
search flowchart contains a loop.  You may be thinking of its RISC
descendant, the 88200.

There are lots of ways to skin a cat. You can store a complete flat
table in another virtual space, with pages there only being
instantiated on demand. Trees work, and there have been tree methods
that stored "likely" information (eg the descriptor of the first page
of a segment) high up in the tree.  You can punt it all to software,
if the TLB is big enough, or has big or variable page sizes.

Variable page size can be done in software. (Mach's boot can select a
software page size - of course, it has to be a multiple of the
hardware size.) The CDC Star-100 allowed multiple hardware page
sizes, simultaneously, as part of their plans for memory mapped I/O.
The 88200's BATC is a 10 entry TLB, each entry mapping an aligned 512
KB region. (This, like the new MIPS stuff, is aimed at big address
spaces, not big I/O.)


-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

pcg@cs.aber.ac.uk (Piercarlo Grandi) (03/01/91)

On 25 Feb 91 23:03:30 GMT, connors@hplabsz.HP.COM (Tim Connors) said:

connors> In article <6590@hplabsz.HP.COM>, connors@hplabsz.HP.COM (Tim
connors> Connors) writes:

connors> What about inverted page tables.  Would they be any better for
connors> 64 bit addresses?  Does this make life tough for the MACH
connors> operating system?

connors> In article <12030@pt.cs.cmu.edu> af@spice.cs.cmu.edu
connors> (Alessandro Forin) writes:

af> May I humbly ask which original reading of our code prompts this
af> question?

Oh, I think he is referring to the fact that in the Mach kernel one of
the pager modules is specifically designed to deal with inverted page
tables.

connors> [ ... ] HP's PA-RISC machines and the IBM RT and RS6000.  My
connors> understanding is that in each of these machines the inverted
connors> page table scheme comes with an assumption of NO address
connors> aliasing. [ ... ] As I understand MACH, address aliasing is of
connors> primary importance.

I can answer here for the Mach people. One of the first Mach
implementations was for the RT. They did make it work on the RT, despite
the aliasing problem. They describe the solution (a variant on the
obvious "remap aliased things when they are used" theme) as not pretty,
but it works quite well.

Actually the Mach page table code is *designed* to work also with
inverted page tables, something that is not that true with other common
Unix kernels (even if 4.3BSD was hacked to work on the RT too).

Naturally one pays a performance penalty. The problem is historical;
there is no conceptual good reason to have shared memory or copy on
write, but if you want to have Unix like semantics (e.g. fork(2)), then
you really want them both, because the performance is even worse without.

Mach's success is largely because it is Unix compatible, not because it
is "better" (the "better" things were in Accent, and Accent withered).

It is Unix that is built around the assumption of aliasing to a large
extent (shared text, SysV IPC, BSD mmap(2)) or that makes copy-on-write
so desirable ("copy" semantics are prevalent because on a PDP-11 it was
simpler and maybe faster to copy small things than to share them).

Inasmuch as Mach is Unix compatible, one needs to make the best of a
mismatch, like that between inverted page tables (which are at their
best supporting ORSLA/System/38 type systems, that is, capability-based
sparse address space architectures) and Unix semantics, and the Mach
kernel does it with reasonable efficiency, precisely because one of the
first machines it supported in large numbers was the RT.

--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

jonah@dgp.toronto.edu (Jeff Lee) (03/01/91)

In <12151@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <6590@hplabsz.HP.COM> connors@hplabs.hp.com (Tim Connors) writes:
>> [...]  How could this technique be applied to 64 bit addresses?
>>Would more levels be needed?  Should the page size be larger?

> [...] You can store a complete flat table in another virtual space,
> with pages there only being instantiated on demand.

Full-indexing methods have an outrageous worst-case cost: O(n) in the
amount of space [actually the number of pages] indexed.  This is
manageable with today's 4 gigabyte (32-bit) address spaces, but is
insane with a 16777216 terabyte (64-bit) address space.  [What comes
after tera-?]
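To put a number on that worst case, a quick back-of-the-envelope program
(the 4K pages and 8-byte PTEs are my assumed figures):

#include <stdio.h>

int main(void)
{
    /* Flat (fully-indexed) table over an entire 64-bit space,
     * assuming 4 KB pages and 8-byte PTEs. */
    unsigned long long ptes  = 1ULL << (64 - 12);      /* 2^52 entries */
    unsigned long long bytes = ptes * 8;               /* 2^55 bytes   */
    printf("%llu PTEs -> %llu terabytes of table\n", ptes, bytes >> 40);
    /* prints: 4503599627370496 PTEs -> 32768 terabytes of table */
    return 0;
}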

Making the index sparse helps, but you *still* need to create the
index for the portions of the data that you want to address.  Putting
the index in a virtual space doesn't help very much; it means you can
swap it out, but you still have to construct and maintain the index
(although now in swap space instead of physical memory).  This
complicates the process.

Additionally, the full indexes often need to be maintained in a form
that is compatible with your MMU, so they can't easily be shared with
heterogeneous machines.  Disk files already have an index of sorts,
but it is seldom compatible with the MMU hardware.  Why build another
index when you map that file into memory?

> You can punt it all to software, if the TLB is big enough, or has big
> or variable page sizes. [...]  Variable page size can be done in
> software. (Mach's boot can select a software page size - of course, it
> has to be a multiple of the hardware size.)

Ultimately, it comes down to software -- especially if you want to be
able to map arbitrary data resources and large shared remote files.
Variable page sizes can help, but as you point out they can be done in
software.

Speculation: In order to deal with growing physical memories we will
soon see TLBs comparable to today's on-chip caches with [tens of]
thousands of entries instead of tens or hundreds of entries as exist
today.

Jeff Lee -- jonah@cs.toronto.edu || utai!jonah

peter@ficc.ferranti.com (Peter da Silva) (03/01/91)

In article <1991Feb28.144420.21179@jarvis.csri.toronto.edu> jonah@dgp.toronto.edu (Jeff Lee) writes:
> Making the index sparse helps, but you *still* need to create the
> index for the portions of the data that you want to address.  Putting
> the index in a virtual space doesn't help very much; it means you can
> swap it out, but you still have to construct and maintain the index...

Can't you construct the index at page-fault time, if the structure is
regular and predictable?

(gives new depth to the term "virtual memory")
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

torek@elf.ee.lbl.gov (Chris Torek) (03/02/91)

In article <PCG.91Feb28183457@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
(Piercarlo Grandi) writes:
>Actually the Mach page table code is *designed* to work also with
>inverted page tables, something that is not that true with other common
>Unix kernels (even if 4.3BSD was hacked to work on the RT too).

I suspect you have neither read the code nor written a pmap module.
(I have now done some of both....)

The Mach code demands two kinds of support from a `pmap module' (the
thing that implements address-space operations on any given machine):

	pmap_enter, pmap_remove, pmap_protect, pmap_change_wiring,
	pmap_extract:

All of these receive a `pmap' pointer (where a `pmap' is something
you define on your own) and a virtual address (and, for pmap_enter,
a starting physical address) and require that you map or unmap or
protect that virtual address, or return the corresponding physical
address.

This requires a way to go from virtual to physical, per process
(i.e., a forward map).

	pmap_remove_all, pmap_copy_on_write, pmap_clear_modify,
	pmap_clear_reference, pmap_is_modified, pmap_is_referenced,
	other miscellaneous functions:

These receive a physical address and must return information about,
remove mappings of, or copy to/from the corresponding physical
address.  pmap_remove_all, for instance, must locate each virtual
address that maps to the given physical page and remove that mapping
(as if pmap_remove() were called).

This requires a way to go from physical to virtual, for all processes
that share that physical page (i.e., an inverted page table, of sorts).
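For concreteness, the interface has roughly the following shape; the types
and exact argument lists here are my own reconstruction for illustration,
not a copy of the Mach headers:

typedef struct pmap *pmap_t;
typedef unsigned long vm_offset_t;
typedef int           vm_prot_t;
typedef int           boolean_t;

/* Forward map: per-pmap, virtual -> physical. */
void        pmap_enter(pmap_t pmap, vm_offset_t va, vm_offset_t pa,
                       vm_prot_t prot, boolean_t wired);
void        pmap_remove(pmap_t pmap, vm_offset_t start, vm_offset_t end);
void        pmap_protect(pmap_t pmap, vm_offset_t start, vm_offset_t end,
                         vm_prot_t prot);
void        pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired);
vm_offset_t pmap_extract(pmap_t pmap, vm_offset_t va);

/* Reverse map: given a physical page, find/alter every mapping of it. */
void        pmap_remove_all(vm_offset_t pa);
void        pmap_copy_on_write(vm_offset_t pa);
void        pmap_clear_modify(vm_offset_t pa);
void        pmap_clear_reference(vm_offset_t pa);
boolean_t   pmap_is_modified(vm_offset_t pa);
boolean_t   pmap_is_referenced(vm_offset_t pa);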

Note that, although the VM layer has all the forward map information in
machine independent form, the pmap module is expected to maintain the
same information in machine-dependent form.  This is almost certainly a
mistake; the information should appear in one place or the other.
(Which place is best is dictated by the hardware.)  (Note that all
pmap_extract does is obtain a virtual-to-physical mapping that could be
found by simulating a pagein.)  The reverse map maintained in hardware
inverted page tables typically lacks some information the pager will
require, and thus cannot be used to store everything; I suspect
(although I have not penetrated this far yet) the VM layer code `fits'
better with this situation, i.e., keeps just what it needs and leaves
the rest to the pmap module and/or hardware.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

steve@cs.su.oz (Stephen Russell) (03/03/91)

In article <IAT90LB@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>In article <1991Feb28.144420.21179@jarvis.csri.toronto.edu> jonah@dgp.toronto.edu (Jeff Lee) writes:
>> Making the index sparse helps, but you *still* need to create the
>> index for the portions of the data that you want to address...
>
>Can't you construct the index at page-fault time, if the structure is
>regular and predictable?
>

I doubt this would be a win, as you are still constructing the same index,
although in a piecemeal manner. In fact, you end up performing the inner
loop of the PTE initialisation code along with the setup code for the loop,
on each fault.

Of course, it would win if most program executions only referenced a small
part of their code and data. This is not impossible -- each execution may
only exercise some of the code -- but I'd be surprised if there was a large
amount of code and data that _never_ got used :-). Anyone got any figures
on code and data subset usage for different runs of the same program(s)?

Cheers,

Steve.

peter@ficc.ferranti.com (Peter da Silva) (03/03/91)

In article <PCG.91Feb28183457@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
> Naturally one pays a performance penalty. The problem is historical;
> there is no conceptual good reason to have shared memory...

Shared libraries?
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

moss@cs.umass.edu (Eliot Moss) (03/04/91)

One possibility is trace or debugging related information. In most runs of a
program this would not be used. Similarly, online help stuff, tutorials, etc.,
may be rarely used in some environments (i.e., once you've learned it, you
refer to such material rarely, though you want it around if possible). Also,
code for handling very rare bad situations (e.g., out of disk space) may never
be exercised. Just some suggestions that in fact there might reasonably be
code and data that is very rarely used ....			Eliot
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu

Richard.Draves@cs.cmu.edu (03/05/91)

> Excerpts from netnews.comp.arch: 1-Mar-91 Re: Translating 64-bit addr..
> Chris Torek@elf.ee.lbl.g (2509)

> In article <PCG.91Feb28183457@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
> (Piercarlo Grandi) writes:
> >Actually the Mach page table code is *designed* to work also with
> >inverted page tables, something that is not that true with other common
> >Unix kernels (even if 4.3BSD was hacked to work on the RT too).

> I suspect you have neither read the code nor written a pmap module.
> (I have now done some of both....)

...

> Note that, although the VM layer has all the forward map information in
> machine independent form, the pmap module is expected to maintain the
> same information in machine-dependent form.  This is almost certainly a
> mistake; the information should appear in one place or the other.
> (Which place is best is dictated by the hardware.)  (Note that all
> pmap_extract does is obtain a virtual-to-physical mapping that could be
> found by simulating a pagein.)  The reverse map maintained in hardware
> inverted page tables typically lacks some information the pager will
> require, and thus cannot be used to store everything; I suspect
> (although I have not penetrated this far yet) the VM layer code `fits'
> better with this situation, i.e., keeps just what it needs and leaves
> the rest to the pmap module and/or hardware.

I think this is almost certainly not a mistake.  Hardware data
structures are not going to represent all the information that the
machine-independent VM code must maintain.  Hence, the MI VM code must
maintain its own data structures for at least some information.  I think
it would be very difficult to split information between MI and MD data
structures.  Trying to cope with varying splits would greatly complicate
the pmap interface.  So I think the Mach approach (making MI data
structures the true repository of information and making MD data
structures an ephemeral cache) is definitely the right way to go.

Rich

pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/06/91)

On 3 Mar 91 15:26:24 GMT, peter@ficc.ferranti.com (Peter da Silva) said:

peter> In article <PCG.91Feb28183457@odin.cs.aber.ac.uk>
peter> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> Naturally one pays a performance penalty. The problem is historical;
pcg> there is no conceptual good reason to have shared memory...

peter> Shared libraries?

Circular argument: shared libraries are useful precisely in the situation
where each process is confined to a separate address space because
protection is done via address confinement. But so is shared memory in
general, because absolute address confinement is not that appealing, and
copying is not that nice (even copy-on-write).

Naturally, except on the AS/400, single address space machines are
pie-in-sky technology (even if more or less research systems abound).

I would however still maintain that even with conventional multiple
address space architectures shared memory is not necessary, as sending
segments back and forth (remapping) gives much the same bandwidth.

In this case I can argue that _irrevocably read only_ segments, useful
for a shared library, are a good idea; if a segment is declared
irrevocably read only no remapping need occur as multiple address spaces
access it; it can de facto, if not logically, be mapped at the same time
in multiple address spaces without aliasing hazards.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

jesup@cbmvax.commodore.com (Randell Jesup) (03/06/91)

In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>pcg> Naturally one pays a performance penalty. The problem is historical;
>pcg> there is no conceptual good reason to have shared memory...
>
>peter> Shared libraries?
>
>Circular argument: shared libraries are useful precisely in the situation
>where each process is confined to a separate address space because
>protection is done via address confinement. But so is shared memory in
>general, because absolute address confinement is not that appealing, and
>copying is not that nice (even copy-on-write).

	Shared libraries are very useful on systems with single address
spaces.  Very efficient in memory usage, they help improve cache hit rates,
reduce the amount of copying of data from one space to another (see some
not-too-old articles in comp.os.research concerning pipes vs shared memory
speed), reduce the need to flush caches, etc., etc.

>Naturally, except on the AS/400, single address space machines are
>pie-in-sky technology (even if more or less research systems abound).

	I wouldn't quite say that.  The Amiga (for example) uses a single-
address-space software architecture on 680x0 CPUs.  It currently doesn't
support inter-process protection (partially since 80% of amigas don't
have an MMU), but if it were implemented (it might be eventually), it would
almost certainly continue to use a single address space.

>I would however still maintain that even with conventional multiple
>address space architectures shared memory is not necessary, as sending
>segments back and forth (remapping) gives much the same bandwidth.

	It requires far more OS overhead to change the MMU tables, and
requires both large-grained sharing and program knowledge of the MMU 
boundaries.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

peter@ficc.ferranti.com (Peter da Silva) (03/07/91)

In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
> I would however still maintain that even with conventional multiple
> address space architectures shared memory is not necessary, as sending
> segments back and forth (remapping) gives much the same bandwidth.

I don't think you can really make a good case for this. Consider the 80286,
where pretty much all memory access for large programs is done by remapping
segments. Loading a segment register is an expensive operation, and is to
a large extent the cause of the abysmal behaviour of large programs on that
architecture. For an extreme case, the sieve slows down by a factor of 11 once
the array size gets over 64K. My own experience with real codes under Xenix
286 bears this out.

Think of the 80286 as an extreme case of what you're proposing. I think it's
clear from this experience that frequent reloading of segment registers is a
bad idea. After your discussion of the inappropriate use of another technology,
networks, I would have expected you'd know better.

As for single address space machines, my Amiga 1000's exceptional performance...
given the slow clock speed and dated CPU (7.14 MHz 68000)... tends to suggest
that avoiding MMU tricks might be a good idea here as well. The Sparcstation 2
is the first UNIX workstation I've seen with as good response time to user
actions. It's only a 27 MIPS machine... or approximately 40 times faster.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

martin@adpplz.UUCP (Martin Golding) (03/08/91)

In <PCG.91Mar5193541@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:

>On 3 Mar 91 15:26:24 GMT, peter@ficc.ferranti.com (Peter da Silva) said:

>peter> In article <PCG.91Feb28183457@odin.cs.aber.ac.uk>
>peter> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

>pcg> Naturally one pays a performance penalty. The problem is historical;
>pcg> there is no conceptual good reason to have shared memory...

>peter> Shared libraries?

>Circular argument: shared libraries are useful precisely in the situation
>where each process is confined to a separate address space because
>protection is done via address confinement. But so is shared memory in
>general, because absolute address confinement is not that appealing, and
>copying is not that nice (even copy-on-write).

>Naturally, except on the AS/400, single address space machines are
>pie-in-sky technology (even if more or less research systems abound).

Umm. Pick? Reality? A fairly large population of pie-in-sky research machines.
One of the interesting things that people are missing in this 64-bits/too-big
issue is sparsely occupied virtual memory. Arbitrary numbers of
objects require arbitrary room to grow...

Martin Golding    | sync, sync, sync, sank ... sunk:
Dod #0236         |  He who steals my code steals trash.
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp

linley@hpcuhe.cup.hp.com (Linley Gwennap) (03/09/91)

(Jeff Lee)
16777216 terabyte (64-bit) address space.  [What comes after tera-?]

"Peta".  A 64-bit address space contains 16,384 petabytes.  So what
comes after peta-?
						--Linley Gwennap
						  Hewlett-Packard Co.

jmaynard@thesis1.hsch.utexas.edu (Jay Maynard) (03/09/91)

In article <32580002@hpcuhe.cup.hp.com> linley@hpcuhe.cup.hp.com (Linley Gwennap) writes:
>(Jeff Lee)
>16777216 terabyte (64-bit) address space.  [What comes after tera-?]

>"Peta".  A 64-bit address space contains 16,384 petabytes.  So what
>comes after peta-?

exa-. A 64-bit address space contains 16 exabytes...no, not 16 8mm
tape drives!

-- 
Jay Maynard, EMT-P, K5ZC, PP-ASEL | Never ascribe to malice that which can
jmaynard@thesis1.hsch.utexas.edu  | adequately be explained by stupidity.
  "You can even run GNUemacs under X-windows without paging if you allow
          about 32MB per user." -- Bill Davidsen  "Oink!" -- me

pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/10/91)

On 6 Mar 91 04:41:41 GMT, jesup@cbmvax.commodore.com (Randell Jesup) said:

jesup> In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk>
jesup> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:

pcg> there is no conceptual good reason to have shared memory...

And practical as well, let me add, except for historical reasons.

peter> Shared libraries?

pcg> [ ... yes, but only if you have multiple address spaces, and only
pcg> as irrevocably read only data. With 64 bit addresses one had better
pcg> have single address space systems ... ]

jesup> 	Shared libraries are very useful on systems with single address
							 ^^^^^^
Probably you mean multiple here; in single address space systems all
libraries are "shared".

jesup> spaces. [ ... ]

Ah yes, of course. One of the tragic legacies of the PDP-11 origin of
Unix is that shared libraries have been so slow in appearing. Not all
is a win for shared libraries (one has to be careful about clustering
of functions in a shared library), but I agree.

pcg> Naturally, except on the AS/400, single address space machines are
pcg> pie-in-sky technology (even if more or less research systems abound).

jesup> 	I wouldn't quite say that.  The Amiga (for example) uses a
jesup> single- address-space software architecture on 680x0 CPUs.  It
jesup> currently doesn't support inter-process protection (partially
jesup> since 80% of amigas don't have an MMU),

This is quite an empty claim. Naturally all machines without an MMU have
a single (real) address space, from the IBM s/360 to the Apple II to the
PC/XT. The point lies precisely in having inter process protection...

jesup> but if it were implemented (it might be eventually), it would
jesup> almost certainly continue to use a single address space.

It will be interesting to see how you do it on a machine with just 32
bits of addressing. You cannot do much better than having 256 regions of
16MB each, which in today's terms seems a bit constricting. Not for an
Amiga probably, admittedly. Amiga users are happy with 68000 addressing,
because its software is not as bloated as others'. There are other problems.

Psyche from Rochester does something like that, when protection is not
deemed terribly important; it maps multiple "processes" in the same
address space, so that they can enjoy very fast communications.

pcg> I would however still maintain that even with conventional multiple
pcg> address space architectures shared memory is not necessary, as
pcg> sending segments back and forth (remapping) gives much the same
pcg> bandwidth.

jesup> It requires far more OS overhead to change the MMU tables,

The Mach people have been able to live with it on the ROMP; not terribly
happy, but the price is not that big; "far more" is a bit excessive.
Point partially acknowledged, though; I had written:

pcg> Naturally one pays a performance penalty. The problem is
pcg> historical;

jesup> and requires both large-grained sharing and program knowledge of
jesup> the MMU boundaries.

But this is really always true; normally you cannot share 13 bytes (in
multiple address space systems). As a rule one can share *segments*.
Some systems allow you to share single *pages*, but I do not like that
(too much overhead, little point in doing so, most MMUs support shared
segments much more easily than pages; even the VAX has a 2-level MMU
with 64KB shared segments, contrary to common misconception).


Here I feel compelled to restate the old argument: if shared memory is
read-only, it can well be *irrevocably* read-only and require no actual
remapping, so no problems.

If it is read-write some sort of synchronization must be provided.  In
most cases synchronization must instead be provided via kernel services,
and in that case it is equivalent to remapping, so simultaneous shared
memory is not needed.

Taking turns via semaphores to access a segment that is always mapped at
the same time in multiple address spaces is not any faster than
remapping, even explicitly, the segment in one process at a time. I know
that on machines with suitable hardware instructions and when spin locks
are acceptable one wins with simultaneous shared memory, but this I bet
is not that common for inter address space communication, and there are
better alternatives (multiple threads in the same address space).

Explicit remaps involve a small change in programming paradigm, in that
the result is neither simultaneous shared memory nor message passing (which
implies copying, at least virtually). Its one major implementation, as
far as I know, is in MUSS.

IMNHO it can be easily demonstrated that remapped memory is superior
both to simultaneous shared memory (automatic synchronization, no
problems with reverse MMUs, the distributed case is simple, no address
aliasing hazards) and to message passing (no copying or
copying-on-write, no data aliasing, much simpler to implement).

    The only loss I can conceive of is when there are mostly readers and
    occasional writers, but this can be handled IMNHO more cleanly with
    versioning on irrevocably read only segments.

Consider the two process case; in the left column we have traditional
shared memory, in the right column we have nonshared memory:

	p1: create seg0			p1: create seg0
	p1: create sem0			p1: map seg0
	p1: down sem0			...
	p2: map seg0			p2: map seg0 (wait)
	p2: down sem0 (wait)		...
	...				...
	p1: up sem0			p1: unmap seg0
	...				...
	p2: (continue)			p2: (continue)
	...				...
	p1: down sem0 (wait)		p1: map seg0 (wait) 
	...				...
	p2: up sem0			p2: unmap seg0
	...				...
	p1: (continue)			p1: (continue)
	...				...

The contents of the message passing column is left to the reader.
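As a sketch of what the right-hand column might look like as an API -- every
name below is invented for illustration, and the single-process stubs exist
only so the fragment compiles; a real kernel would block in seg_map() while
another address space holds the segment:

#include <stdlib.h>

struct segment { void *mem; int mapped; };

struct segment *seg_create(size_t bytes)             /* p1: create seg0 */
{
    struct segment *s = malloc(sizeof *s);
    s->mem    = malloc(bytes);
    s->mapped = 0;
    return s;
}

void *seg_map(struct segment *s)    { s->mapped = 1; return s->mem; }
void  seg_unmap(struct segment *s)  { s->mapped = 0; }

/* One turn from the right-hand column: map, work, hand the segment over. */
void take_turn(struct segment *seg0)
{
    char *data = seg_map(seg0);   /* pN: map seg0 (wait)                     */
    data[0] ^= 1;                 /* ... work on the data ...                */
    seg_unmap(seg0);              /* pN: unmap seg0; the other side proceeds */
}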

Final note: I am amazed that a simple and efficient idea like remapped
read-write memory and shared irrevocably read only segments is not
mainstream, and much more complicated or inefficient mechanisms liek
shared memory and message passing are.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/10/91)

On 7 Mar 91 13:31:29 GMT, peter@ficc.ferranti.com (Peter da Silva) said:

peter> In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk>
peter> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:

pcg> I would however still maintain that even with conventional multiple
pcg> address space architectures shared memory is not necessary, as
pcg> sending segments back and forth (remapping) gives much the same
pcg> bandwidth.

peter> I don't think you can really make a good case for this.

Tell that to the people that ported Mach, and 4.3BSD to the PC/RT... :-)

My estimate is that remapping can be done on demand (lazy remapping),
not on every context switch, and it does not cost more than a page fault
(which is admittedly an extremely expensive operation, but one about
which people don't complain). Also in many cases lazy remapping costs
nothing, given the vagaries of scheduling. Suppose a segment is shared
between processes 1 and 3; if 1 is deactivated, 2 is activated, and 1
is reactivated, no remapping need take place, because 3 has not
accessed it in this sequence.

If 3 had accessed it, the OS would have taken a fault on the attempted
access, found out that the segment was mapped to process 1, unmapped it
from 1, and remapped it onto 3. I correct myself: far less expensive
than a page fault. Probably also less expensive than a process
reschedule, even in a properly designed kernel. You lose a lot only if
you have a very large number of shared segments, which are shared among
a lot but not all processes, and which are all being accessed in every
time slice given to each process that shares them. A very, very, very
unlikely scenario, and one in which after all the cost is proportional
to use, not worse.
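The fault handler for this lazy scheme is nearly trivial.  A sketch, with all
names invented and the machine-dependent map/unmap operations stubbed out:

struct process;
struct segment { struct process *owner; };   /* who currently has it mapped, or NULL */

/* Stand-ins for the machine-dependent map/unmap operations (no-ops here). */
static void mmu_unmap_segment(struct process *p, struct segment *s) { (void)p; (void)s; }
static void mmu_map_segment(struct process *p, struct segment *s)   { (void)p; (void)s; }

void remap_fault(struct process *faulter, struct segment *seg)
{
    if (seg->owner == faulter)
        return;                              /* scheduling was kind: nothing to do  */
    if (seg->owner != NULL)
        mmu_unmap_segment(seg->owner, seg);  /* pull it out of the old owner        */
    mmu_map_segment(faulter, seg);           /* install it for the faulting process */
    seg->owner = faulter;
}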

Incidentally, avoiding the scenario above is why I think that sharing
single pages as opposed to sharing segments is a bad idea: if each
process sharing a segment of address space touches more than one page in it,
a remap fault occurs on each page. I think (and some statistics seem to
support my hunch) that this multiple page access in the same shared
segment is a far more frequent phenomenon than multiple shared segment
access.

peter> Consider the 80286, where pretty much all memory access for large
peter> programs is done by remapping segments.

Are you sure? I think that in all OSes that run on the 286, except maybe
for iRMX/286, segments are not unmapped and remapped, but stay always
mapped, and can be and are shared.

peter> Loading a segment register is an expensive operation,

Around 20-30 cycles if memory serves me right. Compared to a context
switch it is insignificant. And in any case the 286 MMU does support
shared segments directly, so there is no need to do segment remapping to
simulate shared memory.

This said, your comments about the 286 MMU are irrelevant to a
discussion on acceptability of the cost of simulating shared memory by
remapping them on demand or at a context switch in each process that has
them nominally attached. This discussion is important only when
comparing reverse map MMUs with straight map MMUs, and only when the
reverse map MMU does not support (unlike mine) shared segments, and only
when shared segments are deemed useful.

Yet in your discussions of the 286 MMU there are some common fallacies
and myths, and they merit some comment.

Note first that it is only because of a design misconception (not
quite a mistake) of the 286 designers that loading a segment
register is so expensive. The problem is that the shadow segment
registers are not like TLBs, in that they are reloaded every time, even
if the shadowed segment register *value* has not changed.

This could have easily been avoided by simply comparing the old and new
segment register values. It was not, only because conceivably the
segment descriptors could have been altered even if the value of
the segment register had not in fact changed, and the 286 has no distinct
"flush shadow segment registers" instruction.

I guess that the designers assumed that in their "Pascal"/"Algol" model
of process execution each segment register was dedicated to a specific
function (code, stack, global, var parameters), and was not supposed to be
reloaded often, so there was no need to treat the shadows as caches.

peter> and is to a large extent the cause of the abysmal behaviour of
peter> large programs on that architecture. For an extreme case, the
peter> sieve slows down by a factor of 11 once the array size gets over
peter> 64K.

This is probably only because the HUGE model gets used, which implies
funny code to simulate 32 bit address arithmetic (the HUGE model is so
expensive because of the mistake of putting the ring number in the middle
of a pointer instead of in the most significant bits). On less extreme
examples, or if you code the sieve for the LARGE model, the slowdown is
around 20-50%, even for extremely pointer-intensive operations.

Your figure of 11 is plainly ridiculous and warped by the machinations
of the HUGE model; after all, 32 bit pointer dereferences are only about
3 times slower than 16 bit pointer dereferences, so even a program
that consisted *only* of them would be only 3 times slower.

Note again that this point about 32 bit pointer arithmetic on a 286 has
*nothing* to do on the cost of simulating shared memory by remapping
when the MMU does not support it directly.

peter> My own experience with real codes under Xenix 286 bears this out.

Maybe. *My* experience of recompiling large numbers of Unix
nonfloat utilities on a 286 tells me that the average slowdown is around
30%. A 10 MHz 286 is about the equivalent of a PDP-11/73 (1 "MIPS") in
the small model or of a VAX-11/750 (0.7 "MIPS") in the large model, to
all practical (nonfloat Unix applications :->) purposes.

peter> Think of the 80286 as an extreme case of what you're proposing.

I seem to have completely failed to explain myself. The 286 is
*irrelevant* to a discussion on shared memory simulation by implicit OS
supported or explicit application requested segment remapping (which I
prefer).

peter> I think it's clear from this experience that frequent reloading
peter> of segment registers is a bad idea.

No, the conclusion is not supported by the 286 example; the 286 is
uniquely poor for reloading segment registers because it does not treat
shadow segment registers as a cache and because its pointers have an
unfortunate format.

Properly designed MMUs with properly designed TLBs, even reverse map
ones, do segment remapping with small or insignificant cost, not worse
than the 286 MMU. Moreover the real overhead lies not in reloading some
lines in the MMU or the TLB; it is in taking the remap fault and in
searching the appropriate kernel structures to find which (nominally
shared) segment to map in that region.

peter> After your discussion of the inappropriate use of another
peter> technology, networks, I would have expected you'd know better.

I am sorry I got myself so badly misunderstood.

peter> As for single address space machines, my Amiga 1000's exceptional
peter> performance...  given the slow clock speed and dated CPU (7.14
peter> MHz 68000)... tends to suggest that avoiding MMU tricks might be
peter> a good idea here as well.

MMUs are a difficult subject. A lot of vendors have bungled their MMU
designs, the OS code that supports them, and the VM policies that drive
them.  Sun is just *one* of the baddies. That a lot of vendors take many
years to get their act together (if ever) on virtual memory does not
mean that it is a bad technology; it means that maybe it is too subtle
for mere Unix kernel hackers.

peter> The Sparcstation 2 is the first UNIX workstation I've seen with
peter> as good response time to user actions.  It's only a 27 MIPS
peter> machine... or approximately 40 times faster.

The MIPS-eating sun bogons strike again! :-)

The people that did Tripos (Martin Richards!) and Amiga (and those that
now maintain them at CBM) seem to be quite another story. I am another
Amiga fan :-).  Now, if only they could get their act together
commercially... (please redirect the ensuing flame war to the
appropriate newsgroup :->).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) (03/10/91)

In article <PCG.91Mar9205121@aberdb.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>On 7 Mar 91 13:31:29 GMT, peter@ficc.ferranti.com (Peter da Silva) said:
>
>peter> In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk>
>peter> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>
>pcg> I would however still maintain that even with conventional multiple
>pcg> address space architectures shared memory is not necessary, as
>pcg> sending segments back and forth (remapping) gives much the same
>pcg> bandwidth.
>
>peter> I don't think you can really make a good case for this.
>
>Tell that to the people that ported Mach, and 4.3BSD to the PC/RT... :-)
>
>My estimate is that remapping can be done on demand (lazy remapping),
>not on every context switch, and it does not cost more than a page fault
>(which is admittedly an extremely expensive operation, but one about
>which people don't complain). Also in many cases lazy remapping costs
>nothing, given the vagaries of scheduling. Suppose a segment is shared
>between processes 1 and 3; if 1 is deactivated, 2 is activated, and 1
>is reactivated, no remapping need take place, because 3 has not
>accessed it in this sequence.

Actually, handling shared memory doesn't require the kernel to twiddle the
page tables on context switch for almost all machines, _except_ those
like the RT, which have an inverted page table.  Normal page table
machines may have many virtual addresses mapped to the same physical
address.  The kernel needs to intervene only if the page is shared
copy-on-write -- the page is marked as write protected and the kernel copies
the page only when a task writes to the (shared) page.  Remapping is a
special case of shared memory.
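A sketch of that copy-on-write intervention, with invented names and the
machine-dependent pieces stubbed out:

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page { char data[PAGE_SIZE]; int refcount; };

/* Stand-ins for the machine-dependent pieces. */
static struct page *page_alloc(void)                         { return calloc(1, sizeof(struct page)); }
static void         map_writable(void *va, struct page *pg)  { (void)va; (void)pg; }

/* Called when a task writes to a write-protected, shared page. */
struct page *cow_write_fault(void *va, struct page *shared)
{
    if (shared->refcount == 1) {          /* sole remaining user: unprotect in place */
        map_writable(va, shared);
        return shared;
    }
    struct page *copy = page_alloc();     /* otherwise copy, then remap the writer   */
    memcpy(copy->data, shared->data, PAGE_SIZE);
    copy->refcount = 1;
    shared->refcount--;
    map_writable(va, copy);
    return copy;
}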

-- 
Bennet S. Yee		Phone: +1 412 268-7571		Email: bsy+@cs.cmu.edu
School of Computer Science, Carnegie Mellon, Pittsburgh, PA 15213-3890

Michael.Marsden@newcastle.ac.uk (Michael Marsden) (03/10/91)

linley@hpcuhe.cup.hp.com (Linley Gwennap) writes:
>(Jeff Lee)
>>16777216 terabyte (64-bit) address space.  [What comes after tera-?]

>"Peta".  A 64-bit address space contains 16,384 petabytes.  So what
>comes after peta-?


"Eta". The 64-bit address space is 16 EtaBytes. The non-computing value of
Eta is 10^18, i.e. approx 2^60.

So what comes after Eta, when we start arguing about 96 bits vs. 128 bits in
a decades time?

                                            -Mike Mars




 .--------*  Mike            ________________________________  Michael.Marsden
 |  Grad.     /| /|  /|  /| /  "..never write device drivers |       @
 | Student   / |/ | /_| /_| \   while on acid!"  -XXXXXXXXXX | Uk.Ac.Newcastle
 |__________/     |/  |/ \__/       *----NOT-mjd!!-----------'

peter@ficc.ferranti.com (peter da silva) (03/11/91)

In article <PCG.91Mar9205121@aberdb.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
> peter> I don't think you can really make a good case [that sending
	 segments back and forth gives much the same bandwidth]

> Tell that to the people that ported Mach, and 4.3BSD to the PC/RT... :-)

Hasn't the PC/RT been found to have surprisingly poor performance once the
number of context switches involved gets too high?

> My estimate is that remapping can be done on demand (lazy remapping),
> not on every context switch, and it does not cost more than a page fault
> (which is admittedly an extremely expensive operation,

Exactly.

> but one about which people don't complain).

Well, you and I have both complained about excessive paging in the past.

> peter> Consider the 80286, where pretty much all memory access for large
> peter> programs is done by remapping segments.

> Are you sure? I think that in all OSes that run on the 286, except maybe
> for iRMX/286, segments are not unmapped and remapped, but stay always
> mapped, and can be and are shared.

Once you want to access more than 256K (64K for each of DS, SS, CS, and ES)
you *have* to reload the segment registers. The machine can *not* directly
address more than 64K per segment, and it only has the 4 segment registers.
This is a hard limit unless you start reloading segment registers... which
is sufficiently expensive to have an exquisitely painful impact on performance.

> peter> Loading a segment register is an expensive operation,

> Around 20-30 cycles if memory serves me right. Compared to a context
> switch it is insignificant.

But it happens so much more often.

> This said, your comments about the 286 MMU are irrelevant to a
> discussion on acceptability of the cost of simulating shared memory by
> remapping them on demand or at a context switch in each process that has
> them nominally attached.

Well, only that in the case you're talking about the cost of remapping the
segments is even higher.

[ re: costs of segment reloads on the 80286 ]
> This is probably only because the HUGE model gets used, which implies
> funny code to simulate 32 bit address arithmetic

No, just large model. Once you have to operate on more than two data segments
at a time (which basically means more than two objects at a time) you have
to reload the segment register on each data access.

> peter> I think it's clear from this experience that frequent reloading
> peter> of segment registers is a bad idea.

> No, the conclusion is not supported by the 286 example; the 286 is
> uniquely poor for reloading segment registers because it does not treat
> shadow segment registers as a cache and because its pointers have an
> unfortunate format.

True. Your point. Of course page faults are such a cheap operation. :->
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

efeustel@prime.com (Ed Feustel) (03/11/91)

It is not necessary for segment access to take such a long time.  Prime offers
a multi-segment architecture and it performs quite satisfactorily because it
was designed to perform well with segments.  Intel could design their
architecture more appropriately for segments if they wished (and they have
with the i960XA).  If enough people demanded it, you could see a substantial
improvement now that they have the extra silicon available.

Ed Feustel Prime Computer

pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/11/91)

On 10 Mar 91 17:50:11 GMT, peter@ficc.ferranti.com (peter da silva) said:

peter> Hasn't the PC/RT been found to have surprisingly poor performance
peter> once the number of context switches involved gets too high?

I don't know, but this could be for many other reasons. I remember
having seen hints that the RS/6000 does badly on context switching, but
whether this is due to shared memory simulation or rather one of a
million probable bogosities in the OS I cannot know.

peter> Once you want to access more than 256K (64K for each of DS, SS,
peter> CS, and ES) you *have* to reload the segment registers. The
peter> machine can *not* directly address more than 64K per segment, and
peter> it only has the 4 segment registers.  This is a hard limit unless
peter> you start reloading segment registers... which is sufficiently
peter> expensive to have an exquisitely painful impact on performance.

Maybe you tired of reading my article before its end, but I
maintain that the 286 large model, even in pointer-intensive programs,
has at most a 50% average slowdown compared with the small model, except
for pathological cases. Such pathological cases are easy to find for
every cache organization, as you will readily concede. Accessing two
arrays that happen to map to the same cache lines kills almost every
machine out there, for one thing...

That the shadow register organization of the 286 is misguided I have
been ready to concede, but it should not reflect on a judgement on the
merits of shared memory simulation via remapping for reverse MMUs, or on
the merits of segmented architectures in general.

I also have the impression that you loathe the 286 two-dimensional
addressing scheme so much that you also detest all segmentation
schemes, but the two issues are unrelated. Most paged and segmented VM
systems have linear addressing, e.g. the 370, or the VAX-11, and so on.

peter> Loading a segment register is an expensive operation,

pcg> Around 20-30 cycles if memory serves me right. Compared to a
pcg> context switch it is insignificant.

peter> But it happens so much more often.

Dereferencing a far pointer costs only three times as much as a near pointer, and
not every instruction is a far pointer dereference.

Also, when one does segment remapping, really one twiddles the contents
of a field in the LDT (the page table), not that of the segment
registers, and at most once per context switch (and this does not happen
on most context switches). The cost of reloading a segment register and
of remapping a segment are therefore totally unrelated.

peter> Well, only that in the case you're talking about the cost of
peter> remapping the segments is even higher.

True... But not tragic. Taking a trap, finding out which segment should
be remapped, fiddling the LDT of the process that had the segment mapped,
and remapping it might cost maybe as much as reading a block off the
buffer cache, i.e. a few hundred instructions.

I would think that it is of the order of a page fault (mind you, I was
maybe not clear before: just the CPU cost of a page fault, not the many
milliseconds for the IO time possibly associated with it), and less
frequent.

I remember that the BSD VM subsystem, which used a similar technique to
simulate a 'referenced' bit for each page (take a fault and map the page
in), cost less than 5% on a VAX-11/780, and that was for much more
frequent faulting.

People do IPC using pipes or System V MSGs or sockets which cost far far
far more.

On machines like the 286 that can share segments simultaneously, pure
shared memory is OK. On those that cannot, like the RT, the cost is not
excessive, and probably inferior to that of most alternatives.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

peter@ficc.ferranti.com (Peter da Silva) (03/11/91)

In article <1991Mar10.154107.10976@newcastle.ac.uk> Michael.Marsden@newcastle.ac.uk (Michael Marsden) writes:
> So what comes after Eta, when we start arguing about 96 bits vs. 128 bits in
> a decades time?

Gotta be 128 bits, since after you use up 64 bits for the IP address 96 bits
would only leave 32 bits of address space for each node.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

rh@craycos.com (Robert Herndon) (03/12/91)

Those 10^(n*3) powers used so often in engineering and the
sciences are:

    English prefix	multiplier	symbolic prefix
	exa-		10^18			X?
	peta-		10^15			P
	tera-		10^12			T
	giga-		10^9			G
	mega-		10^6			M
	kilo-		10^3			K
	milli-		10^-3			m
	micro-		10^-6			u (greek mu)
	nano-		10^-9			n
	pico-		10^-12			p
	femto-		10^-15			f?
	atto-		10^-18			a?

  The symbolic prefixes are used before units, e.g., Tbyte, GH(ert)z,
Kbar, nH(enry), pF(arad), etc.  The symbolic prefixes can also be
ambiguous (T == Tesla, so a TT == a teratesla?; G == unit of acceleration,
so GG = Billion (Milliard for the Europeans) Gravities?), so some
care is required...

  I may have the last two reversed, as these are more often used in
bio (femtomolar solutions, etc.) and physics, but I think I've got them
right.  Other persons have also corrected my exa- to eka-, but I stand
by my usage (eka- means "like", as in Mendeleev's eka-elements).  Perhaps
some of the device physics people can clarify common usage of the low-end
units.

					Robert
-- 
Robert Herndon		    --	not speaking officially for Cray Computer.
Cray Computer Corporation			719/540-4240
1110 Bayfield Dr.				rh@craycos.com
Colorado Springs, CO   80906		   "Ignore these three words."

ccplumb@rose.uwaterloo.ca (Colin Plumb) (03/12/91)

rh@craycos.com (Robert Herndon) wrote:
>
>Those 10^(n*3) powers used so often in engineering and the
>sciences are:
>
>    English prefix	multiplier	symbolic prefix
>	exa-		10^18			X?  [No, E]
>	peta-		10^15			P
>	tera-		10^12			T
>	giga-		10^9			G
>	mega-		10^6			M
>	kilo-		10^3			K
>	milli-		10^-3			m
>	micro-		10^-6			u (greek mu)
>	nano-		10^-9			n
>	pico-		10^-12			p
>	femto-		10^-15			f?  [Yes, f]
>	atto-		10^-18			a?  [Yes, a]

The letter prefix for exa- is E.  EHz, EeV (stop drooling, particle physicists),
etc.  Exabyte got their name from something.

femto and atto are indeed 10^-15 and 10^-18, respectively, as you
have illustrated them.

Officially, the letter prefix for kilo is lower case k, not upper case K,
but I like the "upper case implies >1" pattern.

Then there's deci- for 1/10 and centi- for 1/100, as well as numbers
(deka and hecto, I think) for 10 and 100, but nobody ever uses those.

As all who have implemented RS232 can attest, standards are meant to be
ignored when convenient... long live the mho!
-- 
	-Colin

peter@ficc.ferranti.com (Peter da Silva) (03/12/91)

In article <PCG.91Mar11144147@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
> People do IPC using pipes or System V MSGs or sockets which cost far far
> far more.

And the performance costs of this, and other results of getting the MMU
so intimately involved with things, are the main reason why a 27 MIPS
SparcStation 2 doesn't seem any faster than a 0.7 MIPS Amiga 1000. (Of
course, that begs the question of why a 2-3 MIPS Mac-II is just as slow
with no MMU overhead.)

You're right... for systems with so much MMU overhead anyway giving
up shared memory is probably not such a big deal.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) (03/13/91)

On 12 Mar 91 14:52:17 GMT, peter@ficc.ferranti.com (Peter da Silva) said:

peter> In article <PCG.91Mar11144147@aberdb.cs.aber.ac.uk>
peter> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:

pcg> People do IPC using pipes or System V MSGs or sockets which cost
pcg> far far far more.

peter> And the performance costs of this, and other results of getting
peter> the MMU so intimately involved with things, are the main reason
peter> why a 27 MIPS SparcStation 2 doesn't seem any faster than a 0.7
peter> MIPS Amiga 1000.

I still beg to differ. The cost of IPC is not that enormous, as IPC is
not that inefficient, and even when it is, it is not that frequent. Even
under traditional Unix, where a pipe often implies six copies of the
same data (source->stdio buffer, stdio buffer->system buffer, system
buffer->disk, disk->system buffer, system buffer->stdio buffer, stdio
buffer->target), IPC cannot account for one or two orders of magnitude
of inefficiency. Even more importantly, neither could a very poorly
designed MMU.
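
The claim is easy to check for yourself.  A rough, purely illustrative
sketch of the sort of measurement involved (not Grandi's numbers): push
a few megabytes through a pipe between parent and child and time it.
Buffer size and byte count are arbitrary choices.

#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define BUFSZ  4096
#define NBYTES (8L * 1024 * 1024)

int main(void)
{
    int fd[2];
    char buf[BUFSZ];
    struct timeval t0, t1;

    if (pipe(fd) < 0) { perror("pipe"); return 1; }

    if (fork() == 0) {                        /* child: writer           */
        long left = NBYTES;
        close(fd[0]);
        memset(buf, 'x', sizeof buf);
        while (left > 0) {
            ssize_t n = write(fd[1], buf, sizeof buf);
            if (n <= 0) break;
            left -= n;
        }
        close(fd[1]);
        _exit(0);
    }

    close(fd[1]);                             /* parent: reader, timed   */
    gettimeofday(&t0, NULL);
    {
        long got = 0;
        ssize_t n;
        while ((n = read(fd[0], buf, sizeof buf)) > 0)
            got += n;
        gettimeofday(&t1, NULL);
        printf("%ld bytes through the pipe in %.3f s\n", got,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    }
    wait(NULL);
    return 0;
}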

The ultimate difference between a SPARC 2 and an Amiga is not that one
has an MMU and the other has not; it is that for one, inefficiency does
not matter or even increases sales, while the other is sold to people
who purchase it with their own money. This has amazing effects on how
sloppily the system software gets written, and on how easily vendors can
get away with it.

A non-sloppily designed and used MMU has a negligible effect on
virtually all applications, and possibly substantial benefits.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

jesup@cbmvax.commodore.com (Randell Jesup) (03/14/91)

In article <PCG.91Mar9194211@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>On 6 Mar 91 04:41:41 GMT, jesup@cbmvax.commodore.com (Randell Jesup) said:
>pcg> [ ... yes, but only if you have multiple address spaces, and only
>pcg> as irrevocably read only data. With 64 bit addresses one had better
>pcg> have single address space systems ... ]
>
>jesup> 	Shared libraries are very useful on systems with single address
>							 ^^^^^^
>Probably you mean multiple here; in single address space systems all
>libraries are "shared".

	No, I meant "single".  Single address space doesn't mean a single
shared r/w address space, and it doesn't mean that all libraries are
public (as opposed to private).  Libraries can/are used for dynamic run-time
binding as well as for code-sharing.

	"Single address space" means to me that there are no duplicated
virtual addresses between different processes (which of course makes
sharing memory/code far easier, since a pointer is valid to all processes,
and libraries do not have to be position-independant).

>jesup> spaces. [ ... ]
>
>Ah yes, of course. One of the tragic legacies of the PDP-11 origin of
>Unix is that shared libraries have been so slow in appearing. Not all
>is a win for shared libraries (one has to be careful about clustering
>of functions in a shared library), but I agree.

	Absolutely agreed.  Witness the apocryphal 1Meg XClock.

>jesup> but if it were implemented (it might be eventually), it would
>jesup> almost certainly continue to use a single address space.
>
>It will be interesting to see how you do it on a machine with just 32
>bits of addressing. You cannot do much better than having 256 regions of
>16MB each, which in today's terms seems a bit constricting. Not for an
>Amiga probably, admittedly. Amiga users are happy with 68000 addressing,
>because its sw is not as bloated as others. There are other problems.

	256 times 16Mb???!!!!  Constricting??  First, why would we partition
on 16Mb boundaries, when the average application uses 50-500K?  Also, I
can't see it being constricting until memory + swapspace is greater than
2 or 4 gigs.  It might be constricting for a 100-user shared system or a
mini-supercomputer, but not for a single-user oriented machine.

	68000 addressing is restricting (24 bits), but all mmu-capable
Amigas are 32-bit machines.  (BTW, due to shared libs, easy interprocess
communication, whatever, Amiga applications are usually far smaller than,
say, the X/Unix versions of the same thing.)

>Psyche from Rochester does something like that, when protection is not
>deemed terribly important; it maps multiple "processes" in the same
>address space, so that they can enjoy very fast communications.

	Sounds very much like threads to me.  But the protection and 
single address space issues are separate, IMHO.

>pcg> I would however still maintain that even with conventional multiple
>pcg> address space architectures shared memory is not necessary, as
>pcg> sending segments back and forth (remapping) gives much the same
>pcg> bandwidth.
>
>jesup> It requires far more OS overhead to change the MMU tables,
>
>The Mach people have been able to live with it on the ROMP; not terribly
>happy, but the price is not that big; "far more" is a bit excessive.
>Point partially acknowledged, though; I had written:

	This depends highly on how frequently you think segments will be
passed.  On a highly shared-message-based system like the amiga, the
number of messages passed per second can be very high (many thousands).

>pcg> Naturally one pays a performance penalty. The problem is
>pcg> historical;
>
>jesup> and requires both large-grained sharing and program knowlege of
>jesup> the MMU boundaries.
>
>But this is really always true; normally you cannot share 13 bytes (in
>multiple address space systems). As a rule one can share *segments*.
>Some systems allow you to share single *pages*, but I do not like that
>(too much overhead, little point in doing so, most MMUs support shared
>segments much more easily than pages; even the VAX has a 2 level MMU
>with 64KB shared segments, contrary to common misconceptions).

	In any protected amiga, sharing would almost certainly be on the
page level.  Anything else is far too inefficient in memory usage.  Also,
more than one object would reside in the shared memory space (e.g.
AllocMem(size,MEMF_SHARED,...) or some such).  This greatly cuts down on
the amount of shared memory required, and the amount of real memory
needed to hold shared pages (since the % utilization of the pages is
higher).

>Here I feel compelled to restate the old argument: if shared memory is
>read-only, it can well be *irrevocably* read-only and require no actual
>remapping, so no problems.
>
>If it is read-write some sort of synchronization must be provided.  In
>most cases synchronization must instead be provided via kernel services,
>and in that case it is equivalent to remapping, so simultaneous shared
>memory is not needed.

	But such services can be many orders of magnitude faster than
remapping.  An example is a program I wrote and posted to comp.os.research,
concerning pipe speed versus shared memory speed to transfer data.  I merely
sent a shared-memory message from one process to the other saying the
data was in the shared buffer.  Rebuilding process page tables on each
pass would have swamped the actual transfer time.
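
A minimal portable analogue of the pattern being described (System V
shared memory for the bulk data, plus a pipe used only as a one-byte
doorbell).  The original program used Amiga message ports, not this API;
the buffer size and flags below are illustrative guesses.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#define BUFSZ (256 * 1024)

int main(void)
{
    int fd[2];
    int shmid;
    char *shared;

    if (pipe(fd) < 0) { perror("pipe"); return 1; }
    shmid = shmget(IPC_PRIVATE, BUFSZ, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }
    shared = shmat(shmid, NULL, 0);            /* inherited across fork  */
    if (shared == (char *)-1) { perror("shmat"); return 1; }

    if (fork() == 0) {                  /* child: producer               */
        char token = 'R';
        memset(shared, 'x', BUFSZ);     /* bulk data stays in place      */
        write(fd[1], &token, 1);        /* one byte says "data is ready" */
        _exit(0);
    }

    {                                   /* parent: consumer              */
        char token;
        read(fd[0], &token, 1);         /* wait for the doorbell         */
        printf("got '%c'; first byte of shared buffer = '%c'\n",
               token, shared[0]);
    }
    wait(NULL);
    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}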

>Explicit remaps involve a small change in programming paradigm, in that
>it is neither simultaneous shared memory, nor message passing (which
>implies copying, at least virtually). Its one major implementation, as
>far as I know, is in MUSS.

	Messages can be shared also (greatly increases message passing
speed).

	I don't think we disagree that much (partially on terminology).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

tgg@otter.hpl.hp.com (Tom Gardner) (03/14/91)

|As all who have implemented RS232 can attest, standards are meant to be
|ignored when convenient... long live the mho!

Of course, the SI unit of conductance is the siemens (S), not the mho. The
unit of time is the second (s). Unfortunately people often get it wrong
and specify, for example, DRAM access times in nS. This is equivalent
to specifying their weight as being x metres.

OK, so notes is a quick informal medium where mistakes are tolerated. But 
shipping products which get this detail wrong makes me wonder how many
other details they got wrong...

pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) (03/16/91)

In article <PCG.91Mar9194211@aberdb.cs.aber.ac.uk> pcg@cs.aber.ac.uk
(Piercarlo Antonio Grandi) writes:

pcg> It will be interesting to see how you do it on a machine with just
pcg> 32 bits of addressing. You cannot do much better than having 256
pcg> regions of 16MB each, which in today's terms seems a bit
pcg> constricting. Not for an Amiga probably, admittedly. Amiga users
pcg> are happy with 68000 addressing, because its sw is not as bloated
pcg> as others. There are other problems.

To this you comment repeating exactly what I say here, that is 32 bit
addressability is not constricting for an Amiga because its typical
applications are small and do not rely on a large sparse address space.

This is unfair! You cannot object to what I say by repeating it! I could
fall into the trap, not realize that your reply is just a rewording of
what I have written myself, and try to show that it is wrong, and then
look even more of a fool, especially if I succeed! :-).

pcg> Here I feel compelled to restate the old argument: if shared memory is
pcg> read-only, it can well be *irrevocably* read-only and require no actual
pcg> remapping, so no problems.

pcg> If it is read-write some sort of synchronization must be provided.
pcg> In most cases synchronization must instead be provided via kernel
pcg> services, and in that case it is equivalent to remapping, so
pcg> simultaneous shared memory is not needed.

jesup> But such services can be many orders of magnitude faster than
jesup> remapping.

jesup> An example is a program I wrote and posted to comp.os.research,
jesup> concerning pipe speed versus shared memory speed to transfer
jesup> data.  I merely sent a shared-memory message from one process to
jesup> the other saying the data was in the shared buffer.

You only win if you can make assumptions about the hw environment, and
not use OS semaphores, but use hw locks. Again, it would be better
anyhow to have tightly communicating processes in the same address space
than to use shared memory.

jesup> Rebuilding process page tables on each pass would have swamped
jesup> the actual transfer time.

Hardly, I think, given a suitable buffer size and a quick way of
modifying two page table entries. I cannot imagine an implementation of
remapping which is significantly more expensive than an implementation
of semaphores. Again, hw locks are faster than sw remapping or sw
semaphores, but of more limited applicability.
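
For concreteness, this is the sort of "hw lock" being contrasted with OS
semaphores: a spinlock built directly on an atomic test-and-set, with no
system call on the uncontended path.  C11 atomic_flag stands in here for
whatever the machine's native primitive is (TAS, CAS, LL/SC); it is a
sketch, not any particular system's lock.

#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void acquire(void)
{
    /* spin until the previous holder clears the flag */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

static void release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

int main(void)
{
    acquire();
    puts("in the critical section");
    release();
    return 0;
}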

And remapping after all is just something you do if you have to run
applications designed for multiple address spaces on a single address
space architecture. What I say is not that it's optimal, but that it is
tolerable.

jesup> Messages can be shared also (greatly increases message passing
jesup> speed).

No, the speed does not increase a lot. Unless the synchronization is
done using hardware primitives, which is nonportable, or at least not
always applicable.

Also, my position is that messages are not as good as simply sharing in
a single address space. Yet, if one does messaging for one reason or
another, doing it by remapping is much better than the alternatives, and
is not that much worse than shared memory.

jesup>	I don't think we disagree that much (partially on terminology).

No, we actually don't disagree a lot, except for the part where I draw
attention to the fact that shared memory of any flavour implies some
sort of synchronization, and this can only be cheaply provided under
fairly limiting assumptions (which do hold on an Amiga, for example).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

peter@ficc.ferranti.com (Peter da Silva) (03/19/91)

In article <PCG.91Mar15184558@aberdb.test.aber.ac.uk>, pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) writes:
> No, the speed does not increase a lot. Unless the synchronization is
> done using hardware primitives, which is nonportable, or at least not
> always applicable.

Why is that any less portable than doing a floating multiply using
hardware primitives, for example? Just make "sendMessage" a primitive
operation (put it in a library, for example). Then, when you have a
fast hardware mechanism for synchronising message transactions,
you can use it. The performance of the Amiga demonstrates that this
is a good idea, and it would be badly hurt by the overhead of remapping,
because in most cases you don't do a context switch for each message.
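
One way to read "put it in a library": hide the transport behind a
single entry point, so a faster mechanism can be dropped in later
without touching the callers.  The names (msg_send, msg_transport,
copy_send) are invented for illustration; no existing library is
implied.

#include <stdio.h>
#include <string.h>

struct msg_transport {
    int (*send)(void *port, const void *data, unsigned len);
};

/* one possible backend: copy the data and wake the receiver somehow */
static int copy_send(void *port, const void *data, unsigned len)
{
    (void)port; (void)data;
    printf("copy backend: sending %u bytes\n", len);
    return 0;
}

static struct msg_transport current = { copy_send };

/* the only call applications make; the backend is the library's business */
int msg_send(void *port, const void *data, unsigned len)
{
    return current.send(port, data, len);
}

int main(void)
{
    char buf[64];
    memset(buf, 0, sizeof buf);
    return msg_send(NULL, buf, sizeof buf);
}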
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

jesup@cbmvax.commodore.com (Randell Jesup) (03/23/91)

In article <PCG.91Mar15184558@aberdb.test.aber.ac.uk> pcg@test.aber.ac.uk (Piercarlo Antonio Grandi) writes:
>To this you comment repeating exactly what I say here, that is 32 bit
>addressability is not constricting for an Amiga because its typical
>applications are small and do not rely on a large sparse address space.

>This is unfair! You cannot object to what I say by repeating it!

	:-)  Who said I objected?  Actually, I think you missed my point:
on something other than an amiga, even if the software is bloated, it still
doesn't matter until you a) are actually using 4 gig worth of memory space,
and b) have at least 4 gig of swap space.  (well, less if you're memory-
mapping files - 4 gig disk space in that case (but you'd have to be mapping
in almost all your disk!))

>jesup> But such services can be many orders of magnitude faster than
>jesup> remapping.
>
>jesup> An example is a program I wrote and posted to comp.os.research,
>jesup> concerning pipe speed versus shared memory speed to transfer
>jesup> data.  I merely sent a shared-memory message from one process to
>jesup> the other saying the data was in the shared buffer.
>
>You only win if you can make assumptions about the hw environment, and
>not use OS semaphores, but use hw locks. Again, it would be better
>anyhow to have tightly communicating processes in the same address space
>than to use shared memory.

	Who made assumptions about the hardware environment?  I merely
called an OS primitive (PutMsg) which sent a message to a specific port.
How it accomplishes this is irrelevant.

>jesup> Rebuilding process page tables on each pass would have swamped
>jesup> the actual transfer time.
>
>Hardly, I think, given a suitable buffer size and a quick way of
>modifying two page table entries. I cannot imagine an implementation of
>remapping which is significantly more expensive than an implementation
>of semaphores. Again, hw locks are faster than sw remapping or sw
>semaphores, but of more limited applicability.

	I think you underestimate page-table-rebuild time, or overestimate
how fast shared messages can be.  Semaphores expensive??  Hardly (at least
in a single-processor environment).

>And remapping after all is just something you do if you have to run
>applications designed for multiple address spaces on a single address
>space architecture. What I say is not that it's optimal, but that it is
>tolerable.

	Point taken.

>jesup> Messages can be shared also (greatly increases message passing
>jesup> speed).
>
>No, the speed does not increase a lot. Unless the synchronization is
>done using hardware primitives, which is nonportable, or at least not
>always applicable.

	So long as it's hidden in the OS, I don't care if it uses blue men
from Mars.  The OS _can_ use any hardware primitives it likes.  For example,
message passing on the amiga is (essentially) disable interrupts, addtail
the message onto the port's list (easy with doubly-linked lists), enable,
and signal the port owner's process (which may cause a reschedule if the
owner is higher priority).
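
In rough C, that sequence looks something like the sketch below.  It is
emphatically not the exec.library source: disable_interrupts(),
enable_interrupts() and signal_task() are empty placeholders for the
real Disable()/Enable()/Signal() calls, and the node layout is
simplified.

struct node {
    struct node *next, *prev;
};

struct port {
    struct node head;       /* circular doubly-linked list head          */
    void *owner;            /* task to signal                            */
    int sigbit;
};

/* placeholders for the machine/OS primitives */
static void disable_interrupts(void) {}
static void enable_interrupts(void) {}
static void signal_task(void *task, int sigbit) { (void)task; (void)sigbit; }

static void put_msg(struct port *p, struct node *msg)
{
    disable_interrupts();
    msg->next = &p->head;           /* addtail: link just before the head */
    msg->prev = p->head.prev;
    p->head.prev->next = msg;
    p->head.prev = msg;
    enable_interrupts();

    signal_task(p->owner, p->sigbit);   /* may cause a reschedule         */
}

int main(void)
{
    struct port p;
    struct node m;

    p.head.next = p.head.prev = &p.head;    /* empty circular list        */
    p.owner = 0;
    p.sigbit = 0;

    put_msg(&p, &m);
    return p.head.next == &m ? 0 : 1;       /* message is now queued      */
}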

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)