[net.arch] VERY LARGE main memories

jvz@sdcsvax.UUCP (John Van Zandt) (08/25/86)

   I heard the other day about current research into
using massive amounts of main memory (approx 1GB) on minicomputers and 
achieving very high performance. I assume this was mainly due 
to the non-swapping of data and code for large applications.  
It seems to me that a reasonably clever paging scheme (maybe 
with some compiler assistance) would limit the overhead of 
the swapping to the point of making it invisible for large
programs/data.  This is under the assumption of a single-user system.  I grant
the fact that in a multi-user environment the more memory, the better.

   Besides starting a discussion of the pros and cons of 
this (does it really give that much better performance?), I'd 
like some pointers to articles or technical reports on the topic.

John Van Zandt
UCSD
uucp: ...ucbvax!sdcsvax!jvz
arpa: jvz@UCSD

johnson@uiucdcsp.CS.UIUC.EDU (08/26/86)

I believe that the large memory computers are designed for database
applications.  By putting the entire database in main memory, each
transaction can be run to completion without waiting for the disk.
Thus, far fewer locks are needed.  Concurrency control, deadlock,
etc. all become much simpler problems.  Not only is the system faster
because it doesn't wait on disks, it is also faster because there is
much less overhead for locking.

While it is true that minicomputers are used for CPUs, I think that
the large memory computers are considered database supercomputers.
They are not meant to be cheap.  Hector Garcia-Molina of Princeton
is one of the main workers in this area.  The first I learned of
this work was an article he wrote in IEEE Trans. on Computers a few
years ago.

mc68020@gilbbs.UUCP (Thomas J Keller) (08/27/86)

   Someone please correct me if I am wrong, but as I have been led to 
understand the situation, it will prove somewhat difficult to successfully
implement large physical memory systems on the order of 1Gb.  The primary
impediment seems to be propagation delays in the
decoding trees.   Anyone care to enlighten me (us)?

-- 

Disclaimer:  Disclaimer?  DISCLAIMER!? I don't need no stinking DISCLAIMER!!!

tom keller					"She's alive, ALIVE!"
{ihnp4, dual}!ptsfa!gilbbs!mc68020

(* we may not be big, but we're small! *)

kenny@uiucdcsb.CS.UIUC.EDU (08/27/86)

Ralph (johnson@b.cs.uiuc.edu) is right about the huge-memory machines being
intended to serve as multiprogrammed database machines.  Honeywell is
presently marketing one that can be configured with up to 64M 36-bit words
(a tad over 0.25 GB), and Nippon Electric has one twice that size.  On a
typical configuration with one of those behemoths, 75% or more of the memory
is dedicated to disc caching.  With their present technology, they don't
simplify the locking mechanism (since there still is the possibility that
some given page will be on disc, not in main store) but gain substantial
performance improvements because of the reduced number of collisions.

Kevin Kenny
University of Illinois at Urbana-Champaign
UUCP: {ihnp4,pur-ee,convex}!uiucdcs!kenny
CSNET: kenny@UIUC.CSNET
ARPA: kenny@B.CS.UIUC.EDU (kenny@UIUC.ARPA)

guy@sun.uucp (Guy Harris) (08/27/86)

>    Someone please correct me if I am wrong, but as I have been led to 
> understand the situation, it will prove somewhat difficult to successfully
> implement large physical memory systems on the order of 1Gb.  The primary
> impediment seems to be propagation delays in the
> decoding trees.   Anyone care to enlighten me (us)?

"Will prove"?  As *I* have been led to understand the situation, the Cray-2
is *already* offering primary memories of that size.  From your reference to
"decoding trees", I presume you're talking about *single chip* memories of
that size, not memory systems.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

jaw@aurora.UUCP (James A. Woods) (08/28/86)

# "what comes after silicon?  oh, gallium arsenide, i'd guess.
   and after that, there's a thing called indium phosphide." -- seymour cray
								circa 1980
tom keller wonders about the speed of gigabyte memories.

the cray 2 here has 1/2 gigabyte, with a worst-case cycle time of 57 clock
pulses (~234 nsec).  i doubt chip decoding delays take the bulk of this
time (logarithms don't grow fast); the sluggishness is more likely due to
the 2 using conservative (i.e. cheap) 256K dynamic mos technology.

things are a bit better with pseudo-banks (33 clocks/eight-byte word),
and strictly sequential access for vectors is designed to be fast 
(1.1 clocks/word).

but the memory speed for non-vectorized C code (e.g. any unix utility except
for 'cmp') leaves a lot to be desired -- even with local variables in
fast registers ("local memory") [it would be nice to allow arrays here].

for "random stride" computation (unix kernel, compilers, utilities), 
the cray 2 is about the speed of a MIPS board.  indeed, d. ritchie
noted at the recent supercomputer conference held at nasa ames
that because of assembler code expansion risc-style (to ascii, yet),
the compile pipeline makes at&t's faster (scalar-wise) cray x-mp about the
speed of a high-end vax for such applications.

if seymour ever hopes to regain the 'scalar' computing championship title,
he'd better get hip to a transparent data cache.

-- ames!aurora!jaw

bzs@bu-cs.BU.EDU (Barry Shein) (08/28/86)

>From: mc68020@gilbbs.UUCP (Thomas J Keller)
>   Someone please correct me if I am wrong, but as I have been led to 
>understand the situation, it will prove somewhat difficult to successfully
>implement large physical memory systems on the order of 1Gb.  The primary
>impediment seems to be propagation delays in the
>decoding trees.   Anyone care to enlighten me (us)?

>From: johnson@uiucdcsp.CS.UIUC.EDU
>I believe that the large memory computers are designed for database
>applications.

I'm not sure you two are wrong, but I'm not sure you're right either.

The Cray-2 (certainly a number cruncher) comes with around 2GB of
main memory. The recently announced ELXSI (more like a $200K [entry]
machine if I read the article right) boasts a maximum 1GB configuration
(I figure you could buy the ELXSI with the volume discount on the memory
to fill it [~$1M list], but I digress.) Again, a number cruncher I
believe.

So, for what it's worth these are essentially counter-examples of
some value.

As I brought up once before, I still think there may be some constant
N which completes the sentence "never buy more memory than you can
zero out in N seconds" [I call it Shein's law of memory but some have
claimed that Amdahl may have said something similar, great minds run
in the same gutters :-]

The reasoning is if you can't touch it in N seconds you probably can't
use it very effectively either. For some more thoughts on this you
might want to pick up Danny Hillis' "The Connection Machine" where he
paints an interesting vision of modern computers as vast seas of
inactive silicon (the memory) with this (typically) one poor little
CPU touching one or two spots per cycle.

Of course, if you spend most of your time waiting for disk vast
memories may help, but so would (and does) clever memory/disk
scheduling (within limits.)

The point is it is not clear that increasing memories unbounded
produces unbounded performance gains, in fact, it almost certainly
doesn't.

You need a CPU (or more than one) to do something with all this
wonderful stuff you have in memory.

Before you all jump down my throat because you are sure that
if you had 16MB rather than 8MB on your machine things would be
better, and hence more *must* be better, consider:

A one MIP machine zeroing memory in a loop:

		CLRL	R1
	LOOP:
		CLRL	(R1)+
		CMPL	R1,HIMEM
		BNE	LOOP

would (theoretical machine) take 3 * (1G/1M) or 3000 seconds or
a little less than one hour to complete. It's hard to believe
such a machine could make -effective- use of that much memory.

I know, it's debatable, but anyone arguing against that statement
is probably ignoring any rational concern for cost/benefit trade-offs
(eg. spend $1M on the memory and $200K on the processor, or $1M on
the processor and $200K on the memory? or some similar variation.)

Of course, assuming the memory were free and reasonably random behavior
I agree that a huge memory would have some value to a database application
that filled the memory, but I doubt it would be a reasonable thing to
do unless memory prices dropped drastically. You'd probably be better
off putting your money into the processor (the context of the argument
seemed to imply smaller processors, obviously Cray is already up against
that limit.)

	-Barry Shein, Boston University

alan@mn-at1.UUCP (Alan Klietz) (08/28/86)

In article <884@gilbbs.UUCP>, mc68020@gilbbs.UUCP (Thomas J Keller) writes:
> 
>    Someone please correct me if I am wrong, but as I have been led to 
> understand the situation, it will prove somewhat difficult to successfully
> implement large physical memory systems on the order of 1Gb.  The primary
> impediment seems to be propagation delays in the
> decoding trees.   Anyone care to enlighten me (us)?
> 

If you use DRAMs you have access times on the order of 50-200ns.  That is
enough time for fast ECL-type logic to do plenty of decoding.

The CRAY-3 is rumored to have a central memory on the order of 2 Gwords, and
solid-state "disks" are going even higher.
--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415    UUCP:  ..ihnp4!dicome!mn-at1!alan
Ph: +1 612 638 0577              ..caip!meccts!dicome!mn-at1!alan
                          ARPA:  aek@umn-rei-uc.ARPA

(*) An affiliate of the University of Minnesota

hammond@petrus.UUCP (Rich A. Hammond) (08/28/86)

> 
tom keller writes:
>    Someone please correct me if I am wrong, but as I have been led to 
> understand the situation, it will prove somewhat difficult to successfully
> implement large physical memory systems on the order of 1Gb.  The primary
> impediment seems to be propagation delays in the
> decoding trees.   Anyone care to enlighten me (us)?

Well, the Cray 2 supports 256M words (where word = 64 bits), i.e. 2 GB,
so it is not impossible.  The decoding trees grow as the log of the
size of memory, so it isn't bad.  Plus, one rarely treats memory as a
single byte; one selects a larger chunk (1 to n words, each word of 4
or 8 bytes), gets it to the CPU and then uses a barrel shifter or other
select to pick out the individual byte(s) wanted.  Clearly, the decoding
times would get longer if we stayed with the same technology, but the
memory ICs are also getting faster as they shrink the circuit size.
E.g. the 64K RAMs had access times around 100-120 ns, the 256K RAMs had
access times around 80-100 ns, and the 1M RAMs have access times around
60-80 ns, despite the increase in decoding on the chips.  So, going to
the larger chips gives both the fourfold capacity increase from the
chips, plus additional time available for external decoding (assuming
the memory cycle stays constant).

Rich Hammond	hammond@bellcore.com

phil@osiris.UUCP (Philip Kos) (08/29/86)

A similar discussion came up a few months ago, in the context of
"how much memory is it reasonable to try to hang off my type X
machine?"  I expressed some confusion over conclusions drawn by
someone else (it may have been Barry Shein, whose article I am
replying to), who maintained steadfastly that there was a limit
beyond which it was not useful to go.  The argument really only
applies to machines with virtual memory, and it goes like this:

Your memory is organized in pages which are mapped from various
process's virtual spaces into the physical memory addressing.  It
takes a certain number of MM table entries to map a certain amount
of memory (considering a single VM architecture).  You want to
improve performance by reducing paging; it seems logical to do
this by increasing main memory.

Now.  By adding more main memory, you *do* decrease paging, which
is obviously a good thing.  However, you also make memory mapping
more time-consuming (more page table entries to maintain), so
there's a tradeoff here.  You might be able to expand the memory
to, say, 16M without taking any hits - the upper limit depends,
of course, on your VM architecture.  Beyond this limit, the
page table has to be expanded, and maintaining it becomes more
complicated.  Eventually you reach a point where the time saved
by not paging is less than the extra time spent maintaining the
mapping table.  This is the point where you should just give up
trying to speed up your system by adding more memory.

Is this right?  It's sort of off the top of my head, and I may
have gotten some of the details wrong (what details, I hear you
ask?) but it seems to convey the gist of the argument...


Phil Kos			  ...!decvax!decuac
The Johns Hopkins Hospital                         > !aplcen!osiris!phil
Baltimore, MD			...!allegra!umcp-cs

"In the end there's still that song,
Comes crying like the wind
Down every lonely street that's ever been." - Robert Hunter

hammond@petrus.UUCP (Rich A. Hammond) (08/29/86)

> 	Barry Shein, Boston University writes
> ... consider:
> 
> A one MIP machine zeroing memory in a loop:
> 
> 		CLRL	R1
> 	LOOP:
> 		CLRL	(R1)+
> 		CMPL	R1,HIMEM
> 		BNE	LOOP
> 
> would (theoretical machine) take 3 * (1G/1M) or 3000 seconds or
> a little less than one hour to complete. It's hard to believe
> such a machine could make -effective- use of that much memory.

That calculation left out the word size (or were you talking about 1GW
versus 1Gb?)  Thus, for a 32 bit 1 MIPS machine, clearing 1 Gb of memory
would take (3 / 1M) * (1  Gb / 4) = 750 seconds or 12.5 minutes.

Of course, for general purpose computation, Barry's point is correct,
you can't use huge amounts of memory.

However, data base machines benefit greatly from the reduced overhead
if all data is available in memory, even if they don't touch much of it
at any time.

Also, our Convex C-1, when doing vector operations, can touch 8 bytes
every 100 ns, so to zero a 1Gb space  takes (1 Gb / 8) /10M = 12.5 seconds.

Cheers,
Rich Hammond	Bellcore

mat@amdahl.UUCP (Mike Taylor) (08/29/86)

In article <1130@bu-cs.bu-cs.BU.EDU>, bzs@bu-cs.BU.EDU (Barry Shein) writes:
> The point is it is not clear that increasing memories unbounded
> produces unbounded performance gains, in fact, it almost certainly
> doesn't.

As a matter of interest, studies we did some time ago with IBM MVS
indicated that performance reaches a maximum and then begins to
deteriorate slowly. Not a general result, of course, but food for
thought.  BTW, we offer up to 512MB of mainstore on our systems.

> 
> You need a CPU (or more than one) to do something with all this
> wonderful stuff you have in memory.
> 
> Of course, assuming the memory were free and reasonably random behavior
> I agree that a huge memory would have some value to a database application
> that filled the memory, but I doubt it would be a reasonable thing to
> do unless memory prices dropped drastically. You'd probably be better
> off putting your money into the processor (the context of the argument
> seemed to imply smaller processors, obviously Cray is already up against
> that limit.)
> 

But what do you do when you can't put any more money into the processor?
Not only Cray is up against that limit.
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

joel@peora.UUCP (Joel Upchurch) (08/30/86)

	One thing that no one has mentioned so far that you could do
	with very large memories is table lookups. Sure, everyone knows
	that a table lookup is faster than a calculated solution
	almost all the time, but how many people would think to use
	one if the resulting table would be several hundred megabytes
	or even larger? It seems to me that this could be significant
	in some applications, like weather forecasting.
-- 
     Joel Upchurch @ CONCURRENT Computer Corporation (A Perkin-Elmer Company)
     Southern Development Center
     2486 Sand Lake Road/ Orlando, Florida 32809/ (305)850-1031
     {decvax!ucf-cs, ihnp4!pesnta, vax135!petsd, akgua!codas}!peora!joel

bzs@bu-cs.BU.EDU (Barry Shein) (08/30/86)

>A similar discussion came up a few months ago, in the context of
>"how much memory is it reasonable to try to hang off my type X
>machine?"  I expressed some confusion over conclusions drawn by
>someone else (it may have been Barry Shein, whose article I am
>replying to), who maintained steadfastly that there was a limit
>beyond which it was not useful to go.  The argument really only
>applies to machines with virtual memory, and it goes like this:
>
>Your memory is organized in pages which are mapped from various
>process's virtual spaces into the physical memory addressing.  It
>takes a certain number of MM table entries to map a certain amount
>of memory (considering a single VM architecture).  You want to
>improve performance by reducing paging; it seems logical to do
>this by increasing main memory.
>
>Now.  By adding more main memory, you *do* decrease paging, which
>is obviously a good thing.  However, you also make memory mapping
>more time-consuming (more page table entries to maintain), so
>there's a tradeoff here.

'Twas I. This is a slightly different topic (virtual memory systems)
but I am quite sure that you can hit this limit easily on a 750 (e.g.);
I've suspected my 8MB VAX750 (4.2bsd) of having this problem (too much
time spent in the kernel due to virtual memory management) though I've
never had the time to investigate (anyone?)

Of course, there's the bridging argument that no matter how much
memory you have it's just a matter of time before you wish it had
virtual memory management. I won't argue this either way but I believe
there is an amusing anecdote told around a famous institution about a
certain famous individual who proclaimed some years back, upon the
arrival of their 256KW [36-bit] memory, that they now had more memory
than anyone could possibly use.

I already know of people in two applications areas that believe they
could easily fill 1GB main memories with their current needs (a
database and a graphics animation shop.)

	-Barry Shein, Boston University

roy@phri.UUCP (Roy Smith) (09/01/86)

In article <2289@peora.UUCP> joel@peora.UUCP (Joel Upchurch) writes:
> 	One thing that no one has mentioned so far  that you could do
> 	with very large memories is table lookups.

	OK, let's talk bizarre (1/2 :-)).  Imagine the stateword of a
process as all the bits in that process's memory strung end to end.  If you
take this as a memory address, you could implement your CPU as a simple
lookup table; for each possible state there is only one possible next state
that a process can get to.  All you have to do is do a simple table lookup
to find the next state.

	Of course, you need a mighty big memory to build the lookup table
(not to mention the non-trivial amount of CPU time to calculate what values
go where in the table).  How much memory?  Well, I just did a size on
/bin/* (11/750 running 4.2 BSD).  Running the results through colex to get
the decimal total and desc to get descriptive statistics (both from Gary
Perlman's very nice UNIX|STAT package) I get (with a sample size of 60;
this excludes the shell scripts /bin/true and /bin/false):

------------------------------------------------------------
        Mean      Median    Midpoint   Geometric    Harmonic
   35343.000   27394.000   57622.000   32294.059   30172.889
------------------------------------------------------------
          SD   Quart Dev       Range     SE mean
   17635.600    7553.000   77052.000    2276.746
------------------------------------------------------------
     Minimum  Quartile 1  Quartile 2  Quartile 3     Maximum
   19096.000   24372.000   27394.000   39478.000   96148.000
------------------------------------------------------------

	The smallest program was /bin/echo (20k to echo argv!?); the
largest was /bin/as at 96k.  So, assuming most processes will fit into 100
kbytes, that means you need a LUT with an 800k bit long address.  Talk
about BFM's!  As I think I've mentioned before, it is believed that there
are approximately 2^200 electrons in the universe.  Since it is unlikely
that anybody would want to reference more things than there are electrons
in the universe, 200 bits seems like a good upper bound for the length of a
memory address.

-- 
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

aglew@ccvaxa.UUCP (09/01/86)

... > Very large memories - Barry Shein (and C. Gordon Bell, and many
... > others) suggest that memory clearing places an upper limit on
... > useful memory size.

Right. How quickly can you clear a 4TB disk farm?

(No, I'm not quite so dogmatic. Memory clearing is certainly a 
consideration in context switching. Which may be one of the reasons
multiprocessors like ELXSI can effectively use large memory,
since they also report less context switching).

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms

jon@msunix.UUCP (Jonathan Hue) (09/03/86)

In article <145@mn-at1.UUCP>, alan@mn-at1.UUCP (Alan Klietz) writes:
> If you use DRAMs you have access times on the order of 50-200ns.  That is
> enough time for fast ECL-type logic to do plenty of decoding.

I don't design with ECL, but just the buffers to go to/from TTL levels
are going to be way slower than the FAST (that's right, Fast Advanced
Schottky TTL) 74xx series stuff from Fairchild.  With gate delays
around 2ns, and 256K DRAMs at around 150ns, you have plenty of time
to decode address lines.  Heck you can just cram your decoder into
a 20R8 PAL (I think it will fit) and the new ones are 15ns up to
three gates deep.

Of course, bipolar SRAMs with 3ns access times are another story...


"If we did it like everyone else,	  Jonathan Hue
what would distinguish us from		  Via Visuals Inc.
every other company in Silicon Valley?"	  sun!sunncal\
						      >!leadsv!msunix!jon
"A profit?"				amdcad!cae780/

ronc@fai.UUCP (Ronald O. Christian) (09/03/86)

In article <145@mn-at1.UUCP> alan@mn-at1.UUCP (Alan Klietz) writes:
>In article <884@gilbbs.UUCP>, mc68020@gilbbs.UUCP (Thomas J Keller) writes:
>> 
>> [...]  The primary
>> impediment seems to be propagation delays in the
>> decoding trees.   Anyone care to enlighten me (us)?
>> 
>
>If you use DRAMs you have access times on the order of 50-200ns.  That is
>enough time for fast ECL-type logic to do plenty of decoding.

I wonder about this.  I had a design problem awhile back involving 45ns
ram used as a lookup table, where I couldn't decode the address fast enough.
I found some ECL gates that would do the decoding faster, but then found to
my chagrin that you lost so much time in the conversion from ECL back to TTL
that the overall response ended up being slower.  I've had the same problem
in gate arrays, where the propagation delay through the on-chip conversion
to ECL, the array itself, and the conversion back to TTL was greater than
the prop delay through a straight TTL array of similar complexity.  If
you're going ECL for decoding, I think you really need to have ECL memories
to gain anything.

Then there's GaAs...  So fast you can spend a lot of time converting
to a different logic family.  I like GaAs.  Expensive, though.


				Ron
-- 
--
		Ronald O. Christian (Fujitsu America Inc., San Jose, Calif.)
		seismo!amdahl!fai!ronc  -or-   ihnp4!pesnta!fai!ronc

Oliver's law of assumed responsibility:
	"If you are seen fixing it, you will be blamed for breaking it."

john@datacube.UUCP (09/04/86)

/* Written  5:51 pm  Aug 28, 1986 by phil@osiris.UUCP in datacube:net.arch */
.......
Now.  By adding more main memory, you *do* decrease paging, which
is obviously a good thing.  However, you also make memory mapping
more time-consuming (more page table entries to maintain), so
there's a tradeoff here.  You might be able to expand the memory
to, say, 16M without taking any hits - the upper limit depends,
of course, on your VM architecture.  Beyond this limit, the
page table has to be expanded, and maintaining it becomes more
complicated.  Eventually you reach a point where the time saved
by not paging is less than the extra time spent maintaining the
mapping table.  This is the point where you should just give up
trying to speed up your system by adding more memory.

Is this right?  It's sort of off the top of my head, and I may
have gotten some of the details wrong (what details, I hear you
ask?) but it seems to convey the gist of the argument...

Phil Kos			  ...!decvax!decuac
The Johns Hopkins Hospital                         > !aplcen!osiris!phil
Baltimore, MD			...!allegra!umcp-cs
/* End of text from datacube:net.arch */

I agree with this for fixed VM architectures, but for  new machines I
would think that the page size should be increased so that the number
of  pages  of  physical  memory  available  remains roughly constant.
Won't this give ever increasing performance, or does the latency time
of bringing in new pages kill the gains?  

BTW, a quick calculation tells me that with current technology a 128
Mbyte memory board could be designed for the SUN-3 (15" X 15" form
factor).  This would include error correction and detection and all
the other goodies required on big system memories.  Just slap 8 of
these in your SUN3-160: instant gigabytes.

1Gb == 45 seconds color video.			      John Bloomfield

Datacube Inc. 4 Dearborn Rd. Peabody, Ma 01960 	         617-535-6644
	
ihnp4!datacube!john
decvax!cca!mirror!datacube!john
{mit-eddie,cyb0vax}!mirror!datacube!john

srt@duke.UUCP (Stephen R. Tate) (09/04/86)

In article <289@petrus.UUCP>, hammond@petrus.UUCP (Rich A. Hammond) writes:
> > 
> tom keller writes:
> > The primary
> > impediment seems to be propagation delays in the
> > decoding trees.   Anyone care to enlighten me (us)?
> 
> Well, the Cray 2 supports 256M Word (where word = 64 bits) i.e. 2 Gb,
> so it is not impossible.  The decoding trees grow as the log of the
> size of memory, so it isn't bad.

That's an over-complicated (or over-generalized) treatment of address
decoding.  The key phrase you use is "decoding trees", where decoding
does not have to be done by trees at all.  *Regardless* of the memory
size, decoding the unique address of a bank of memory (bank, row, or
word, actually) need never be more than 2 gates deep, meaning propagation
delays of only around 30ns using regular old slow silicon TTL.
And this is completely independent of memory size.  (Within reason...
propagation delays for 40-input AND gates might be a bit higher....  But
who's going to have 40-bit bank addresses?)

-- 
Steve Tate			..!{ihnp4,decvax}!duke!srt

preece@ccvaxa.UUCP (09/05/86)

I don't claim to know where the cost-benefit curve lies,
but there clearly are a lot of problems whose solutions
naturally involve huge address spaces:  databases, number
crunching on huge arrays, image analysis, many kinds of
things that fall into connectionist models, etc.  For
problems like that, solutions on systems with less memory
than the address space of the problem will involve some
kind of mapping between secondary memory and physical
memory.  That means they go slow.  If memory is "cheap
enough" it is obviously preferable to have that memory
be physical memory.  The definition of "cheap enough"
depends on the importance of the problem, the time
pressure on its solution, and your budget.  Saying that
there is a constant that defines the maximum useful amount
of memory is simply stating your interpretation of those
governing factors.

The original note limited the question to single-user
systems.  Even in that context, though, there is an obvious
answer to the complaint that you can't access memory that fast:
get more processors and share the memory among them.  N years in
the future, when processors are available in 1M-processor chips
(that is, one chip=1 million processors), it will seem pretty
silly to be talking about whether a T of memory is too much, just
as now it's pretty silly to be talking about whether a G is
too much.

There's a simple law governing all of computing [don't ping me on
the obvious caveats and quibbles, this is the kind of law that's
supposed to be over-broad but simple]:

There can't be too much memory and no processor is too fast.

-- 
scott preece
gould/csd - urbana
uucp:	ihnp4!uiucdcs!ccvaxa!preece
arpa:	preece@gswd-vms

jlg@lanl.ARPA (Jim Giles) (09/05/86)

In article <1130@bu-cs.bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>...
>As I brought up once before, I still think there may be some constant
>N which completes the sentence "never buy more memory than you can
>zero out in N seconds" [I call it Shein's law of memory but some have
>claimed that Amdahl may have said something similar, great minds run
>in the same gutters :-]
>...
>A one MIP machine zeroing memory in a loop:
>
>		CLRL	R1
>	LOOP:
>		CLRL	(R1)+
>		CMPL	R1,HIMEM
>		BNE	LOOP
>
>would (theoretical machine) take 3 * (1G/1M) or 3000 seconds or
>a little less than one hour to complete. It's hard to believe
>such a machine could make -effective- use of that much memory.
>

The task of zeroing memory on the Cray 2 with 256MW (=2GB) should take
less than 5-10 seconds.  You must remember that a one MIP machine is
pathetically slow compared to a Cray (at least for vector operations
like zeroing memory, searching, sorting, and many scientific
applications).

By the way, why does everyone assume that large memory is for database
applications and such?  I can think of LOTS of scientific applications
for large memory machines - none of which involve databases in any way.

J. Giles
Los Alamos

jlg@lanl.ARPA (Jim Giles) (09/06/86)

In article <7144@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>The task of zeroing memory on the Cray 2 with 256MW (=2GB) should take
>less than 5-10 seconds.  You must remember that a one MIP machine is
>pathetically slow compared to a Cray (at least for vector operations
>like zeroing memory, searching, sorting, and many scientific
>applications).
>

I made a slight error (by an order of magnitude) - The Cray 2 should
take about .5 to 1.0 seconds to clear all its memory.

It seems that large memory machines are obviously desirable.  In
lattice gauge theory a single lattice could easily use all the Cray 2
memory.  3-d hydro codes also use enormous amounts of memory.
Furthermore, applications like these make frequent references to the
entire data structure (i.e. the whole array is updated every single
time-step).

The desirability of paging for such machines is not so obvious.
Consider a code which updates a large array on each step through a
loop (each time-step).  If the central memory is too small to hold the
entire array and you have a virtual memory scheme, some part of the
array will get swapped out on each time step.  Most likely, it will be
the least recently used page that gets swapped - the very one that you
will need first on the subsequent time step!  You are now in a
situation of chasing your tail around memory - losing time all the
while.

Without virtual memory though, your code can anticipate the problem by
initializing asynchronous I/O long before it needs to use the data.
And, since it's not driven by page faults, you can select only a
particular part of the array to be swapped - thus minimizing I/O.
This kind of programming effort is somewhat unfashionable these days,
but it's exactly the sort of thing that most programmers who use
these big machines immediately do.  They bought the machine because
the critical issue was SPEED - and anything that reduces this speed
(like virtual memory) is to be shunned.  (Cyber 205 users usually
turn off the virtual memory when they need speed, Crays don't even
have virtual memory.)
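
The benefit Giles claims for explicit asynchronous I/O can be put in a
toy cost model (the times below are invented, not measurements):
fault-driven paging serializes compute and transfer, while a transfer
started a time step early runs concurrently with useful work.

```python
def step_time_demand(compute, io):
    """Fault-driven paging: the CPU stalls for the whole transfer."""
    return compute + io

def step_time_async(compute, io):
    """Asynchronous I/O initiated a step early: the transfer
    overlaps computation on data that is already resident."""
    return max(compute, io)

# Illustrative numbers: 1.0 s of arithmetic per time step,
# 0.8 s to stage the next slab of the array from disk.
print(step_time_demand(1.0, 0.8))   # -> 1.8
print(step_time_async(1.0, 0.8))    # -> 1.0
```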

On the other hand, a small memory - single user machine (like my SUN
workstation) should never be built without virtual memory.  The
desirability of a feature should always be driven entirely by the
purpose of the machine.

J. Giles
Los Alamos

philip@amdcad.UUCP (Philip Freidin) (09/06/86)

In article <8513@duke.duke.UUCP>, srt@duke.UUCP (Stephen R. Tate) writes:
> 
> That's an over-complicated (or over-generalized) treatment of address
> decoding.  The key phrase you use is "decoding trees", where decoding
> does not have to be done by trees at all.  *Regardless* of the memory
> size, decoding the unique address of a bank of memory (bank, row, or
> word, actually) need never be more than 2 gates deep.  Meaning propagation
> delays of only around 30ns using regular old slow silicon TTL.
> And this is completely independent of memory size.  (within reason...
> propagation delays for 40 input AND gates might be a bit higher....  But
> who's going to have 40 bit bank addresses?)
> 
> -- 
> Steve Tate			..!{ihnp4,decvax}!duke!srt

Unfortunately, at this point I would like to apply some reality to the
discussion.  Rather than talk about your 40-bit address memories, let's
look at something trivial: 64kw.  This needs 16 bits of address.  With
your 2-level decode (one of inverters, and the second of AND gates to
do word select) you have 32 address select lines coming into the second
level, address and address complement.  Each of these must drive 32k AND
gates!  I don't know of any logic family with a drive capability to support
that type of load.  Your typical TTL has a drive capability of from 10 to 20
loads.  Also, another fly in your fast-decode ointment is that the way AND
gates are implemented in many logic families precludes building a 16-input
AND gate as a single level.  CMOS is limited to about 4 inputs per gate, and
TTL and ECL have similar limits.  To build bigger AND gates, you end up with
a tree structure inside your AND gate.
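
Freidin's objection is quantifiable: driving many inputs from one
signal requires a buffer tree whose depth grows logarithmically with
the load.  A sketch (the fanout figure is taken from his 10-to-20 TTL
range):

```python
def buffer_levels(loads, fanout):
    """Depth of the buffer tree one signal needs in order to
    drive `loads` inputs when each output drives only `fanout`."""
    levels, reach = 0, 1
    while reach < loads:
        reach *= fanout
        levels += 1
    return levels

# In a flat 64K-word decode, each address-true/complement line
# drives half the word-select gates: 32768 loads.  Even at a
# generous fanout of 20 that is a 4-deep buffer tree, so the
# "2 gate levels" claim hides most of the real delay.
print(buffer_levels(32768, 20))   # -> 4
```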

--Philip Freidin

bzs@bu-cs.BU.EDU (Barry Shein) (09/07/86)

From: jlg@lanl.ARPA (Jim Giles)
>The task of zeroing memory on the Cray 2 with 256MW (=2GB) should take
>less than 5-10 seconds.  You must remember that a one MIP machine is
>pathetically slow compared to a Cray (at least for vector operations
>like zeroing memory, searching, sorting, and many scientific
>applications).

That's exactly my point; the Cray "deserves" 2GB of memory.  (Was this a
counter-example?  I just said that if a machine can't zero memory in N
seconds for some small N it probably won't use the memory either; your
point is tautological to that.)

Working backwards, if the Cray can zero 2GB in 10 seconds we get
(assuming, as before, 3 instructions (I) per zero per word, 8 bytes
per word for the Cray) 2GB/8B -> 256M * 3I -> 768MI/10s -> 76.8MIPS,
using your 5 second figure, around 150MIPS.
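
Shein's back-of-the-envelope numbers check out directly (working in
his own units of "M" words and instructions):

```python
# 2 GB of 8-byte words is 256M words; at the assumed 3 instructions
# per word zeroed, that is 768M instructions.
mwords = 2 * 1024 // 8     # 256 M words
minstr = mwords * 3        # 768 M instructions

print(minstr / 10)   # 10-second figure  -> 76.8 (MIPS)
print(minstr / 5)    #  5-second figure  -> 153.6 (MIPS)
```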

BUT -- How is that speed being accomplished (and I suspect from the
math above that the Cray is a little faster, no matter)? By
parallelism, vector processors, very non-standard expensive stuff (tho
of course becoming more popular BECAUSE OF THE ABOVE DISCUSSED
PROBLEMS, among others.)  The counter-intuitive argument that emerges
is that increasing memory size for conventional processors into
the GB range is a losing proposition.

	-Barry Shein, Boston University

henry@utzoo.UUCP (Henry Spencer) (09/07/86)

> The desirability of paging for such machines is not so obvious.
> Consider a code which updates a large array on each step through a
> loop (each time-step).  If the central memory is too small to hold the
> entire array and you have a virtual memory scheme, some part of the
> array will get swapped out on each time step.  Most likely, it will be
> the least recently used page that gets swapped - the very one that you
> will need first on the subsequent time step!...

Jim, all that you have established here is that LRU is a thoroughly bad
virtual-memory policy for a scientific program.  Few people will argue that.
You have also more-or-less established that a program which behaves in
the manner you suggest will not benefit much from virtual memory; its
performance will degrade badly when it starts paging.  Few will argue that
either.  Not all programs behave that way, though.

You have *not* established that virtual memory, as such, is a poor idea.
It is quite possible to combine demand fetching with prefetching of things
that are expected to be needed soon.  It's probably even a good idea when
trying to page scientific programs.  It *is* harder to get right, which is
why you don't see it done much.

> Without virtual memory though, your code can anticipate the problem by
> initializing asynchronous I/O long before it needs to use the data.
> And, since it's not driven by page faults, you can select only a
> particular part of the array to be swapped - thus minimizing I/O.
> This kind of programming effort is somewhat unfashionable these days...

With some reason.  What you're saying is that because the operating-system
people are too lazy to devise paging algorithms that are useful for large
scientific programs, the programmers should be required to do it themselves.
Apart from the matter of constantly reinventing the wheel, there is also
the problem that it's a lot of work to get it right -- program reference
patterns are notorious for being hard to predict beforehand, which means
experimenting and then twiddling the code to match the results.  This
may perhaps not be needed for really straightforward array-mashing code,
but I remain a bit skeptical:  historically, the batting average on statements
like "this program obviously has the following reference pattern..." is
close to zero.  I wouldn't be surprised if a lot of scientific code, with
its carefully hand-twiddled asynchronous I/O, is in fact managing its
memory rather inefficiently.  Especially if the code has been revised, or
moved to a new machine (or a new variant of the old one), since the last
tuning job was done.

> ... They bought the machine because
> the critical issue was SPEED - and anything that reduces this speed
> (like virtual memory) is to be shunned.  (Cyber 205 users usually
> turn off the virtual memory when they need speed, Crays don't even
> have virtual memory.)

I think you may be confusing two issues here.  The reason the Crays don't
have virtual memory is not because asynchronous I/O is superior to paging,
but because non-trivial address translation hurts memory-access time.  Are
the Cyber 205 users turning the virtual memory off because they don't trust
the paging algorithm, or because the machine will run even a memory-resident
program faster with it off?  I'd bet it's the latter.  Now *that* is a
legitimate and well-justified reason for not using virtual memory.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

ken@argus.UUCP (Kenneth Ng) (09/08/86)

In article <8513@duke.duke.UUCP>, srt@duke.UUCP (Stephen R. Tate) writes:
> But
> who's going to have 40 bit bank addresses?)
> Steve Tate			..!{ihnp4,decvax}!duke!srt

Check out the load far pointer instruction on the Intel 386.  It is
a full logical address, 48 bits: a selector and an offset.

-- 
Kenneth Ng: Post office: NJIT - CCCC, Newark New Jersey  07102
uucp(for a while) ihnp4!allegra!bellcore!argus!ken
     ***   WARNING:  NOT ken@bellcore.uucp ***
           !psuvax1!cmcl2!ciap!andromeda!argus!ken
bitnet(prefered) ken@orion.bitnet

--- Please resend any mail between 10 Aug and 16 Aug:
--- the mailer broke and we had billions and billions of
--- bits scattered on the floor.

sewilco@mecc.UUCP (Scot E. Wilcoxon) (09/08/86)

>As I think I've mentioned before, it is believed that there are
>approximately 2^200 electrons in the universe.  Since it is unlikely that
>anybody would want to reference more things than there are electrons in the
>universe, 200 bits seems like a good upper bound for the length of a memory
>address.
>
>Roy Smith, {allegra,philabs}!phri!roy

Present technologies all require use of many electrons per bit stored.
Therefore the number of electrons in the universe is an actual upper bound.
Particularly since a few electrons are needed for the computer which will use
the memory :-)

Collecting all the electrons in the universe for the factory is left as an
exercise for the reader.  Simulating this action and the forces involved will
use more memory than the number of electrons.
-- 
Scot E. Wilcoxon    Minn Ed Comp Corp  {quest,dicome,meccts}!mecc!sewilco
45 03 N  93 08 W (612)481-3507 {{caip!meccts},ihnp4,philabs}!mecc!sewilco

	"BOOKS" in five-foot neon letters means pictures are sold there.

eugene@ames.UUCP (Eugene Miya) (09/08/86)

Gawd, you guys on the net!  Come on!

So VAX/Unix oriented, so blinded!

The Cray is a word oriented machine.  Take out the factor of 8 from all
of your calculations.  You can tell a person's thinking by the words they
choose.  The Cray-2 is a 256 MW not a 2 GB machine.  These are not the same
because the conversion is non trivial.  Crays are word oriented machines.
Anyone who says "2 GB" is showing a great deal of naive.  Get off your
VAXen an try Univacs (36-bit), IBM's inverted bit order, and other systems.

>From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  {hplabs,hao,nike,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

jon@msunix.UUCP (09/09/86)

In article <12930@amdcad.UUCP>, philip@amdcad.UUCP (Philip Freidin) writes:
> do word select) you have 32 address select lines coming into the second
> level, address and address complement.  Each of these must drive 32k AND
> gates!  I don't know of any logic family with a drive capability to support
> that type of load.  Your typical TTL has a drive capability of from 10 to 20
> loads.  Also, another fly in your fast-decode ointment is that the way AND
> gates are implemented in many logic families precludes building a 16-input
> AND gate as a single level.  CMOS is limited to about 4 inputs per gate, and
> TTL and ECL have similar limits.  To build bigger AND gates, you end up with
> a tree structure inside your AND gate.

Most people don't worry about the decoders inside DRAMs, but just what the
DRAM looks like from the pins (timing, loads, etc).  As a crude example,
suppose you have a VME bus board with 100 in^2 and the P1 and P2 connectors.
You are using 256K DRAMs and can fit 4Mb on each board.  A1 thru A18 form the
row and column address.  That leaves A19 thru A23.  A19 thru A21 can be the
inputs to a 74AS138, and A22 and A23 can be the enables for the AS138 (it has
two low enables, and one high).  To put 16Mb in this system, you only need
one more gate to enable the AS138 when both A22 and A23 are high.

Okay, now add the P3 connector, 1Mb DRAMs, and twice as much real estate,
so you can put 32Mb on a board.  A2 thru A21 form the row and column
address, A22 thru A24 go into an AS138, and A25 thru A31 go into a
74AS688.  The AS688 can be used as an address comparator, and is nice
because you can stick eight address jumpers next to it to set the board's
base address.  The AS688 has an eight input gate in it, as do a lot
of the AS67x and AS68x parts.  The output of the AS688 along with
A1 control what the board puts on the bus.

You can address 4Gb with this scheme, and none of this looks much like a tree.
And there is a part here that has an eight input gate.  Add 8 more address
lines, another AS688, and you've got 1Tb.  This wouldn't be any slower to
access than the 24 bit example.  The point here is that you don't have to
design to decode n address pins into 2^n signals.  Your DRAMS take care of
18 or 20 of them, and you only need to decode as many as you have banks of
memory on a board.  The other address lines need only form a board select
 - one output only.
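
Hue's point is that the address splits into independent fields, each
handled by a different part: the DRAMs' internal decoders, one
74AS138, and one 74AS688 comparator.  A sketch of that slicing (bit
positions follow his 32Mb-per-board example; the function is mine for
illustration, not his design):

```python
def decode(addr, board_jumpers):
    """Model the two-chip board decode for a 32-bit address:
    a 74AS688 compares A25-A31 against jumpers (board select),
    a 74AS138 turns A22-A24 into one of 8 bank enables, and
    A2-A21 go straight to the 1Mb DRAMs as row/column address."""
    board_sel = (addr >> 25) == board_jumpers  # AS688 comparator
    bank      = (addr >> 22) & 0x7             # AS138 select
    row_col   = (addr >> 2) & 0xFFFFF          # to the DRAM pins
    return board_sel, bank, row_col

# An access 5 MB into the board whose jumpers are set to base 3
# selects that board and lands in bank 1 (banks are 4 MB each):
sel, bank, _ = decode((3 << 25) + 5 * 2**20, 3)
print(sel, bank)   # -> True 1
```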



"If we did it like everyone else,	  Jonathan Hue
what would distinguish us from		  Via Visuals Inc.
every other company in Silicon Valley?"	  sun!sunncal\
						      >!leadsv!msunix!jon
"A profit?"				amdcad!cae780/

guy@sun.uucp (Guy Harris) (09/09/86)

> Gawd, you guys on the net!  Come on!
> 
> So VAX/Unix oriented, so blinded!

Oh, come off it.  You damn well know that most machines out there,
regardless of whether they run UNIX or are VAXes, are byte-oriented.  The
use of "byte" when talking about memory sizes has *NO* relation *WHATSOEVER*
to a VAX or UNIX mindset.

> The Cray is a word oriented machine.  Take out the factor of 8 from all
> of your calculations.  You can tell a person's thinking by the words they
> choose.  The Cray-2 is a 256 MW not a 2 GB machine.  These are not the same
> because the conversion is non trivial.

Why?  Does the Cray-2 store characters one per word?  Does it store
*everything* one per word?  Even if it does, that isn't enough to recommend
that "word" be used in a general discussion of large main memories.  "word"
is a *useless* term for general discussions of memories, because it means a
different amount of memory on different machines.  This discussion is NOT a
discussion of the Cray-2, it is a discussion of large main memories.  As
such, the byte is the appropriate unit to discuss, since it is the same size
on almost all machines out there.

> Crays are word oriented machines.

You already said that.  Merely stating this twice doesn't make it more
interesting.  Explain *why* this renders discussion of Cray memory sizes in
bytes inappropriate.  The PDP-10 is a word-oriented machine also; however, a
9-bit byte is a *very* appropriate unit for comparative discussions of
memory size, since one (9-bit) byte holds a character (which can be
addressed independently given a byte pointer), four bytes holds an integer,
(byte) pointer, or single-precision floating-point number, and eight bytes
holds a double-precision floating-point number - just like the VAX.  The
available ranges of values of these types are different from those on a VAX,
but the difference is not enough to matter in gross discussions
of memory sizes.

> Anyone who says "2 GB" is showing a great deal of naive.

That's "naivete", modulo various diacritical marks, and this statement
needs a lot more defense than you've given it.  If the problem is that a
given data structure (tree, 2D array of floating-point numbers, etc) takes a
different amount of memory on a Cray-2 than on another machine, then the
appropriate unit in the discussion is *NOT* words, it's elements of said
data structure.

> Get off your VAXen an try Univacs (36-bit), IBM's inverted bit order,
> and other systems.

"*BIT* order"?  What relevance has that?  I presume you meant "*byte* order",
in which case which "inverted byte order" do you mean?  The byte order on
the IBM PC is the same as that on the VAX, and I could see some future
80386-based IBM machine having lots of main memory.  The 360/370 family, by
the way, is byte-oriented and memory sizes are given in bytes....

Your assumption that the use of "bytes" in this discussion is an indication
of VAX/UNIX tunnel vision is way off the mark.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

srt@duke.UUCP (Stephen R. Tate) (09/10/86)

In article <12930@amdcad.UUCP>, philip@amdcad.UUCP (Philip Freidin) writes:
> Unfortunately, at this point I would like to apply some reality to the
> discussion.  Rather than talk about your 40-bit address memories, let's
> look at something trivial: 64kw.  This needs 16 bits of address.  With
> your 2-level decode (one of inverters, and the second of AND gates to
> do word select) you have 32 address select lines coming into the second
> level, address and address complement.  Each of these must drive 32k AND
> gates!  I don't know of any logic family with a drive capability to support
> that type of load.  Your typical TTL has a drive capability of from 10 to 20
> loads.  Also, another fly in your fast-decode ointment is that the way AND
> gates are implemented in many logic families precludes building a 16-input
> AND gate as a single level.  CMOS is limited to about 4 inputs per gate, and
> TTL and ECL have similar limits.  To build bigger AND gates, you end up with
> a tree structure inside your AND gate.
> 
> --Philip Freidin

First off, I was talking about decoding *bank* addresses, not individual
word addresses.  If you wanted 1GB of memory, and used 1Mb chips, you would
have, say, 256 banks of 1Mb x 32 bit words.  (If you have this much memory,
I hope memory accesses are done more than a word at a time, but ignore this
for now....)  Now that's only 8 bits for a bank address, and I have seen 8
input NAND gates.  (7430 or something like that....)  Each of these bank
address lines need only drive one input per bank (32 chips), which means
that they only have to drive 256 inputs.  Much less than your 32k figure,
but still unreasonable.  Obviously, the address lines need to be buffered.
Using TTL with a fanout of, say, 16, you only need one level of buffering
(since 16*16 = 256).  Now you're three levels deep for a propagation delay
of about 40-50ns.  Still not a terribly unreasonable time.

Anyway, another problem to consider is buffering all the address lines below
the bank address lines.  These have to be run to every chip, and in the
example above, there are 32*256 = 8192 chips in all.  You're going to have
to be real careful with buffering here.....   So it's not the decode
circuitry that takes time, it's the buffering for reasonable fan-out.
Incidentally, CMOS has a *huge* fanout.  That is, CMOS outputs to CMOS
inputs (no mixing).
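
Tate's two loading figures can be checked with the same arithmetic
(fanout of 16, as in his example):

```python
def buffer_stages(loads, fanout):
    """Levels of buffering between a source (which can itself
    drive `fanout` loads) and `loads` final inputs."""
    stages, reach = 0, fanout
    while reach < loads:
        reach *= fanout
        stages += 1
    return stages

# Bank-select lines: one input per bank, 256 banks:
print(buffer_stages(256, 16))    # -> 1 level, as he says

# But the low-order address lines visit every chip:
# 32 chips/bank * 256 banks = 8192 loads per line.
print(buffer_stages(8192, 16))   # -> 3 levels of buffering
```

So the real cost sits in the buffering of the common address lines,
not in the bank decode itself.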



-- 
Steve Tate			..!{ihnp4,decvax}!duke!srt

jeff@gatech.CSNET (Jeff Lee) (09/10/86)

>Then there's GaAs...  So fast you can spend a lot of time converting
>to a different logic family.  I like GaAs.  Expensive, though.

I know absolutely nothing about GaAs except that Seymour is planning
to do his cray-3 in it. What are the speeds and costs of some "typical"
GaAs chips?  What sort of power do they dissipate?  What is the
difficulty in processing GaAs as opposed to silicon?  Also, is anybody
doing anything with InP (Indium Phosphide) yet?
-- 
Jeff Lee
CSNet:	Jeff @ GATech		ARPA:	Jeff%GATech.CSNet @ CSNet-Relay.ARPA
uucp:	...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!jeff

jlg@lanl.ARPA (Jim Giles) (09/10/86)

In article <7094@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>...
>With some reason.  What you're saying is that because the operating-system
>people are too lazy to devise paging algorithms that are useful for large
>scientific programs, the programmers should be required to do it themselves.
>Apart from the matter of constantly reinventing the wheel, there is also
>the problem that it's a lot of work to get it right -- program reference
>patterns are notorious for being hard to predict beforehand, which means
>experimenting and then twiddling the code to match the results.

It's not just that the operating system or hardware designers are too lazy
to come up with a good scheme.  The problem is that any scheme they DO come
up with must work for general cases.  That is, it can't take advantage of
special knowledge of a specific algorithm.

The individual applications programmer CAN take advantage of such knowledge.
To be sure, this is an expensive and difficult programming project.  But,
if you've just spent $10-$20 million on a fast machine, you aren't going
to balk at a few million more in programmer man-hours to get the speed
that you shelled out so much cash for.

Since hardware (for whatever reason) can run faster without virtual memory,
there will always be a market among the high-end users for machines that
don't have it.  Since these are the sort of machines with the very large
memories that we are talking about, I question the desirability of virtual
memory on them.

As a final note: there are some types of algorithm for which it is
extremely easy to predict the data usage patterns.  Most Finite-Difference
and Finite-Element codes are of this kind.  These applications involve a
small number of very large arrays which are referenced cyclically.  Other
applications, like image manipulation and particle transport, have somewhat
more difficult patterns, but there are known methods for dealing with them.
These are the types of codes which form the predominant workload for
today's large memory supercomputers (whether bought by oil companies or
government).  Now, it might be convenient to implement virtual memory
schemes which are useful in this context, but I doubt that the extra
overhead in the memory interface would be justified - especially since
the explicit methods for dealing with them are fairly easy to implement.

apc@cblpe.UUCP (Alan Curtis) (09/11/86)

In article <884@gilbbs.UUCP> mc68020@gilbbs.UUCP writes:
>
>   Someone please correct me if I am wrong, but as I have been led to 
>understand the situation, it will prove somewhat difficult to successfully
>implement large physical memory systems on the order of 1Gb.  The primary
>impediment seems to be the delays caused by propagation delays in the
>decoding trees.   Anyone care to enlighten me (us)?
>
>
>tom keller					"She's alive, ALIVE!"
>{ihnp4, dual}!ptsfa!gilbbs!mc68020

Today I can buy 16 megabytes on one board given 256k-bit DRAMs.
Tomorrow I should be able to buy the same board with 1M devices,
giving me 64 meg on a card.  20 cards is then 1.28G.

I think it would be possible.

mclase@watdaisy.UUCP (Michael Clase) (09/11/86)

In article <7331@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>In article <7094@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>>...
>>With some reason.  What you're saying is that because the operating-system
>>people are too lazy to devise paging algorithms that are useful for large
>>scientific programs, the programmers should be required to do it themselves.
>>Apart from the matter of constantly reinventing the wheel, there is also
>>the problem that it's a lot of work to get it right -- program reference
>>patterns are notorious for being hard to predict beforehand, which means
>>experimenting and then twiddling the code to match the results.
>
>It's not just that the operating system or hardware designers are too lazy
>to come up with a good scheme.  The problem is that any scheme they DO come
>up with must work for general cases.  That is, it can't take advantage of
>special knowledge of a specific algorithm.
>
Rather than having the user explicitly implement a suitable paging
algorithm for his program, couldn't the operating system have a
facility like the UNIX vadvise system call?  According to the vadvise
man page, the call vadvise(V_ANOM) warns the pager that LRU is not
a suitable algorithm for this particular job.  Perhaps this could be
expanded to include calls like vadvise(V_CYCLIC) to indicate that
the program wants to cyclically reference large arrays.  Of course,
the difficulty would be for the pager to work out which pages corresponded
to these arrays, particularly in the case of two or more arrays which
are traversed in parallel.
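
The size of the win such a hint could buy is easy to sketch with a toy
pager: under a cyclic sweep, strict LRU faults on every access, while
a pager told (via the hypothetical vadvise(V_CYCLIC) above) to evict
the *most* recently used page keeps most of the array resident.  All
counts here are invented for illustration:

```python
def cyclic_misses(frames, pages, passes, policy):
    """Fault count for a cyclic sweep.  policy is 'lru', or 'mru'
    (standing in for what a pager might do given a V_CYCLIC hint)."""
    resident = []                      # ordered oldest -> newest
    misses = 0
    for _ in range(passes):
        for p in range(pages):
            if p in resident:
                resident.remove(p)
                resident.append(p)     # refresh recency
            else:
                misses += 1
                if len(resident) == frames:
                    victim = 0 if policy == "lru" else -1
                    resident.pop(victim)
                resident.append(p)
    return misses

# 64 frames, an 80-page array, 10 sweeps:
print(cyclic_misses(64, 80, 10, "lru"))   # -> 800 (every access faults)
print(cyclic_misses(64, 80, 10, "mru"))   # far fewer: most pages stay put
```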

Michael Clase
mclase@watdaisy.uucp (for one more week)

tuba@ur-tut.UUCP (Jon Krueger) (09/11/86)

In article <7331@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>It's not just that the operating system or hardware designers are too lazy
>to come up with a good scheme.  The problem is that any scheme they DO come
>up with must work for general cases.  That is, it can't take advantage of
>special knowledge of a specific algorithm....The individual applications
>programmer CAN take advantage of such knowledge.  To be sure, this is an
>expensive and difficult programming project...[but] there are some types of
>algorithm for which it is extremely easy to predict the data usage
>patterns.  Now, it might be convenient to implement virtual memory schemes
>which are useful in this context, but I doubt that the extra overhead in the
>memory interface would be justified - especially since the explicit methods
>for dealing with them are fairly easy to implement.

1) Doubtless you can show me cases where it's "extremely easy" to predict
data usage patterns.  Can you show me one where it's easy to predict the
code usage patterns?  I want VM to free me from overlays, or code space
management, not file structuring, or data space management.

>if you've just spent $10-$20 million on a fast machine, you aren't going
>to balk at a few million more in programmer man-hours to get the speed
>that you shelled out so much cash for.
2) Can we measure the win?  Can you provide figures on performance
improvements for either data or code space management by application
programmers over mechanisms provided by operating system or hardware
designers?  Have you any actual examples, how much was the improvement for a
specific application you're familiar with?  Can we state a general rule,
expected returns on applications programmers managing their own code and/or
data spaces?  Can you state the breakeven point, how many millions of
dollars I should be willing to spend before I improve on the operating
system's paging?

					-- jon
-- 
--> Jon Krueger
uucp: {seismo, allegra, decvax, cmcl2, topaz, harvard}!rochester!ur-tut!tuba
Phone: (716) 275-2811 work, 473-4124 home	BITNET: TUBA@UORDBV
USMAIL:  Taylor Hall, University of Rochester, Rochester NY  14627 

phil@amdcad.UUCP (Phil Ngai) (09/11/86)

In article <8546@duke.duke.UUCP> srt@duke.UUCP (Stephen R. Tate) writes:
>Incidentally, CMOS has a *huge* fanout.  That is, CMOS outputs to CMOS
>inputs (no mixing).

CMOS has a huge DC fanout. The AC fanout is somewhat less, depending
on the level of performance you demand.

-- 
 Rain follows the plow.

 Phil Ngai +1 408 749 5720
 UUCP: {ucbvax,decwrl,ihnp4,allegra}!amdcad!phil
 ARPA: amdcad!phil@decwrl.dec.com

stever@videovax.UUCP (Steven E. Rice) (09/12/86)

In article <8546@duke.duke.UUCP>, Stephen R. Tate (srt@duke.UUCP) writes:

>> [ comments by Philip Freidin on decoder tree structure deleted -- S. Rice]
> 
> First off, I was talking about decoding *bank* addresses, not individual
> word addresses.  . . .  Now that's only 8 bits for a bank address, and I
> have seen 8 input NAND gates.  (7430 or something like that....)  . . .

If you're going to design large memories, decode them *fast*.  Delays for
a 74AS30 are 5 ns max over temperature and +/- 10% Vcc (but 50 pf/500 Ohm
load).  Fanning out to 16 gates will push this out a bit (but not much).

> . . .  Each of these bank
> address lines need only drive one input per bank (32 chips), which means
> that they only have to drive 256 inputs.  Much less than your 32k figure,
> but still unreasonable.  Obviously, the address lines need to be buffered.
> Using TTL with a fanout of, say, 16, you only need one level of buffering
> (since 16*16 = 256).  Now you're three levels deep for a propagation delay
> of about 40-50ns.  Still not a terribly unreasonable time.

With currently-available logic, you should be able to go 3 levels in not much
more than 15 ns (oh, Lattice, where are those 10 ns GALs??).

> . . .
> Incidentally, CMOS has a *huge* fanout.  That is, CMOS outputs to CMOS
> inputs (no mixing).

CMOS has a huge fanout at DC. . .  As you try to do things fast, the
capacitive loading of the inputs becomes the dominant factor.  If you hang
a whole bunch of inputs on one CMOS output, the rise and fall times
become seriously degraded.  Some of the newer CMOS is quite capable,
though -- output drive capabilities that equal or exceed those of most
bipolar circuits.
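
Rice's caveat is a plain RC effect: every CMOS input adds capacitance,
so rise time grows roughly linearly with the number of loads.  A rough
sketch with assumed round-number values (not taken from any datasheet):

```python
# Assumed: ~100 ohm effective CMOS output resistance, ~5 pF per input.
r_out = 100.0    # ohms (assumption)
c_in = 5e-12     # farads per driven input (assumption)

for n in (2, 20, 200):
    rise = 2.2 * r_out * n * c_in   # 10%-90% rise time of an RC node
    print(f"{n:4d} loads: ~{rise * 1e9:.1f} ns rise time")
```

The DC fanout limit never enters; the edges simply get too slow.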

					Steve Rice

----------------------------------------------------------------------------
{decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever

jlg@lanl.ARPA (Jim Giles) (09/16/86)

In article <676@ur-tut.UUCP> tuba@ur-tut.UUCP (Jon Krueger) writes:
>...
>1) Doubtless you can show me cases where it's "extremely easy" to predict
>data usage patterns.  Can you show me one where it's easy to predict the
>code usage patterns?  I want VM to free me from overlays, or code space
>management, not file structuring, or data space management.

In this context, it is important to note that code memory is ALWAYS
trivially small compared to data memory for scientific codes.  This is NOT
just conjecture, the memory used by code has been less than a few percent
of the total memory requirement for every scientific program I've ever
seen.  Furthermore, code usage patterns in these programs ARE easy to
determine for the same reason that the data usage patterns are - the
thing is in a large time-loop: it does physics on the grid, then it
rezones (if necessary), then it dumps a graphic description of the
grid (if requested), then it goes back for the next time step...

To be sure, people use overlays to save even the small space required by
code.  But this is fairly trivial - the physics, rezone, and graphics
subroutines are the ones to be overlayed.  And those programs which
can fit in main memory ARE NOT PENALIZED BY THE PRESENCE OF ADDITIONAL
OVERHEAD IN THE MEMORY INTERFACE!!!

There really are applications which don't benefit very much from a VM
system.  This will remain true as long as VM systems add ANY overhead
at all to the memory interface.

J. Giles
Los Alamos

jlg@lanl.ARPA (Jim Giles) (09/16/86)

In article <676@ur-tut.UUCP> tuba@ur-tut.UUCP (Jon Krueger) writes:
>...
>1) Doubtless you can show me cases where it's "extremely easy" to predict
>data usage patterns.  Can you show me one where it's easy to predict the
>code usage patterns?  I want VM to free me from overlays, or code space
>management, not file structuring, or data space management.

Another thing to remember here is that most of the large scale scientific
codes that run on large memory supercomputers have existed for many
years.  As a result, the usage patterns (both of data and code) are well
known.  These already contain sophisticated memory management routines
that are tailored for the specific code.  When moving to a new and larger
machine, these usage patterns don't change - just the scale of the
memory involved.  The main problems in porting these codes to a new
machine tend to be numerical (different arithmetic on the new hardware),
or incompatibilities in the language (the compiler on the new machine
recognizes a different set of Fortran extensions).

The result of all this is that the people who do large scale scientific
computing have already resolved the memory management problems.  A VM
system would not save them any time, on the contrary - it would require
additional work in trying to rearrange code and data usage patterns to
minimize page faulting.

J. Giles
Los Alamos

hutch@sdcsvax.UUCP (Jim Hutchison) (09/16/86)

In article <7331@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>to come up with a good scheme.  The problem is that any scheme they DO come
>up with must work for general cases.  That is, it can't take advantage of
>special knowledge of a specific algorithm.
>
>The individual applications programmer CAN take advantage of such knowledge.

vadvise() ???  I don't think BSD ever got it fully off the ground, but
it has interesting applications for you.  It was a way of sharing with
the OS your specific paging knowledge, which is bought with time.  This will
not save you from page translation cost, but it does allow you to advise
the OS on how your program differs from the general case.  It also allows
the OS to play any strange games that can be played to squeeze those extra
pennies out of your XXXXX.

-- 
    Jim Hutchison   UUCP:	{dcdwest,ucbvax}!sdcsvax!hutch
		    ARPA:	Hutch@sdcsvax.ucsd.edu
	"The fog crept in on little cats feet" -CS

lamaster@nike.uucp (Hugh LaMaster) (09/17/86)

Cray users often like to boast that very large main memories such as the
Cray 2 obviate the need for virtual memory.  J. Giles stated recently that
Cyber 205 users "turn off" virtual memory.  I work at a site which has a
Cray X-MP, a Cyber 205, and a Cray-2 (Ames Research Center).  I would like
to state for the record that there are considerable advantages to a machine
having a memory-mapping virtual memory architecture, even though there
is no apparent single-job performance advantage.  

Picture a very large (almost all of real memory) batch job running on a Cray.
Suppose that there is some interactive debugging going on (The Cyber 205 VSOS
operating system and the Cray CTSS operating system are both interactive and
have very nice interactive debuggers.)  The large batch job will have to be
written out to disk in entirety on the Cray before the small interactive job
can be rolled in.  That might take tens of seconds on a Cray 2.  What if only
a few pages had to be paged out as on the Cyber 205?  The batch job would
proceed with only a few pages missing, and the total swap time in any case
would be only a few tenths of a second.  Many such hypothetical examples can
be imagined, and in my experience they are confirmed in real life:

All other things being equal, virtual memory systems have better throughput
than non virtual memory systems of equivalent memory size and CPU speed.

Virtual memory systems are better suited to mixed large batch and interactive
loads, and are capable of better response time at an equivalent overhead than
non virtual memory systems.

Anyone who has had to deal with setting policies for user memory allocations
on a Seymour Cray machine (from CDC 6600 days to the Cray 2) will know what
kind of trade-offs are necessary, and why these machines have often been run
batch-only with users limited to half of the overall available memory.

Now, a reasonable question to ask on a fast machine is how much CPU real
estate is going to be consumed by dynamic address translation hardware,
because added logic is going to slow a very fast machine down.   Cray has
kept his machines very simple architecturally, and has a much smaller 
number of gates than other comparable machines (e.g. Cray 1's and 2's
run between 600,000 and 700,000 gates per CPU, while the Cyber 205 has about
1.3 million, roughly twice as many), which may help explain why Cray has
always had the fastest clock in the supercomputer business (another factor
is that Cray is obviously a packaging genius).  

Conclusion:  A computer architect always has to trade off design features
to build a real machine, and in some cases virtual memory may have to be
traded off, but in the real world of operating a machine, virtual memory
is a significant advantage, other things being equal.

   Hugh LaMaster, m/s 233-9,   UUCP:  {seismo,hplabs}!nike!pioneer!lamaster 
   NASA Ames Research Center   ARPA:  lamaster@ames-pioneer.arpa
   Moffett Field, CA 94035     ARPA:  lamaster%pioneer@ames.arpa
   Phone:  (415)694-6117       ARPA:  lamaster@ames.arc.nasa.gov

"The reasonable man adapts himself to the world, the unreasonable man
adapts the world to himself, therefore, all progress depends on the
unreasonable man." -- George Bernard Shaw

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

pmontgom@sdcrdcf.UUCP (Peter Montgomery) (09/18/86)

        I once was a system programmer for a CDC 7600 site.  We ran many
scientific programs, but lacked virtual memory.  Some programs used almost
all of memory.  These jobs might run faster than the same ones would with
less memory, but overall system performance suffered.  For example, a
compilation might not fit in memory alongside a huge scientific job.  If
the compilation runs alone, the CPU will be idle much of the time it runs.
Yet the scientific program might have considerable unused code and/or data
(e.g., the input data for this run does not select Calcomp plots, so those
routines are never called and their data areas are never referenced;
another likely possibility is that some dimensions are much larger than
required).  If either job runs alone, then it wants as much memory as it
can get.  With virtual memory, the operating system can simultaneously
load the heavily used parts of BOTH jobs in the machine, for better
overall performance.
-- 
			Peter Montgomery

	{aero,allegra,bmcg,burdvax,hplabs,
	 ihnp4,psivax,randvax,sdcsvax,trwrb}!sdcrdcf!pmontgom

Don't blame me for the crowded freeways - I don't drive.

jlg@lanl.ARPA (Jim Giles) (09/24/86)

In article <2077@sdcsvax.UUCP> hutch@sdcsvax.UUCP (Jim Hutchison) writes:
>...
>vmadvise() ???  I don't think BSD ever got it fully off of the ground, but
>it has interesting applications to you.  It was a way of sharing your
>specific paging knowledge which is bought by time, with the OS.  This will
>not save you from page translation cost, but it does allow you to advise
>the OS on differences your program has from the general case.  It also allows
>the OS to play any strange games that can be played to squeeze those extra
>pennies out of your XXXXX.
>
My documentation lists this as VADVISE().  It has only two settings:
one apparently optimizes for random access order (that is, it ignores
recent usage patterns); the other, the default, optimizes for the most
frequently and recently referenced pages.  The man page contains a note
that this feature was included mainly to support LISP, which has fairly
random access patterns.

This doesn't seem to give the kind of control over memory contents that
I am used to.  It certainly doesn't seem likely to work well for an
algorithm that has some 'randomly' used data, some cyclically used data,
and some frequently used data (or some such mix).

The VADVISE() doc also mentions (under 'BUGS') that this 'Will go away
soon, being replaced by a per page *madvise* facility.' MADVISE looks
like it will have more control over memory contents (but still not
complete control).  It also looks like it will entail the same degree
of work to use it effectively as direct user control would require.
So the bottom line is: we still don't have complete control, we still
need to do a lot of work on our own, and we still have a slower memory
interface than necessary.

J. Giles
Los Alamos

jlg@lanl.ARPA (Jim Giles) (09/24/86)

In article <609@nike.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>Cray users often like to boast that very large main memories such as the
>Cray 2 obviate the need for virtual memory.  J. Giles stated recently that
>Cyber 205 users "turn off" virtual memory.  I work at a site which has a
>Cray X-MP, a Cyber 205, and a Cray-2 (Ames Research Center).  I would like
>to state for the record that there are considerable advantages to having a
>machine have memory mapping virtual memory architecture, even though there
>is no apparent single job performance advantage.  
>
>Picture a very large (almost all of real memory) batch job running on a Cray.
>Suppose that there is some interactive debugging going on (The Cyber 205 VSOS
>operating system and the Cray CTSS operating system are both interactive and
>have very nice interactive debuggers.)  The large batch job will have to be
>written out to disk in entirety on the Cray before the small interactive job
>can be rolled in.  That might take tens of seconds on a Cray 2.  What if only
>a few pages had to be paged out as on the Cyber 205?  The batch job would
>proceed with only a few pages missing, and the total swap time in any case
>would be only a few tenths of a second.  There are many hypothetical examples
>which can be imagined which are confirmed, in my experience, in real life:

You obviously don't have enough Crays!  Now if you had 3 X-MPs (like
we do) you could configure one for interactive use, and the others for
batch use :-).

Seriously: yes there is a trade-off between throughput, interactivity,
and single process speed.  We have to configure our maximum job size
to no more than 3/4 the total memory during the day to allow for
interactive processes.  We also reduce maximum memory residency time
during the day.  We wouldn't have to do this with a virtual memory
system (nor with a total batch operating system).

During the night, when most large scale programs are run, the
operating system is reconfigured to optimize the speed of individual
codes.  On a virtual memory system, this is not possible (at least not
completely - you can't truly rid yourself of the overhead inherent in
the VM interface).  It is the fast turn-around on large night jobs that
is attractive to our users - the ability to also debug interactively
during the day is a useful bonus.

New Crays are coming out now with SSD memory devices (Solid State Disk).
The increased speed of these devices, compared to disk, might make VM
seem attractive once more.  Seymour's attitude is (what else):
if it's made of solid state memory - why not make it part of the central
memory of the machine?  The Cray 3 is not currently expected to have an
SSD, but it is expected to fill the entire 32-bit address space with
central memory (that's 32 GB or 4 GW).  Seymour thinks 32 bits is too
small for an address!

This is another problem with virtual memory - central memory is
starting to get cheaper and bigger than the disk memory.  The biggest
disk drive available with a Cray these days is a DD-49 (holds about
151 MW).  One memory image of the 4GW Cray 3 would fill 26 DD-49s to
capacity.  What are you going to operate virtual memory out of?

J. Giles
Los Alamos

jlg@lanl.ARPA (Jim Giles) (09/24/86)

In article <3013@sdcrdcf.UUCP> pmontgom@sdcrdcf.UUCP (Peter Montgomery) writes:
>
>        I once was a system programmer for a CDC 7600 site.  We ran many
>scientific programs, but lacked virtual memory.  Some programs used almost
>all of memory.  These jobs might run faster than the same ones would with
>less memory, but overall system performance suffered.  For example, a
>compilation might not fit in memory alongside a huge scientific job.  If
>the compilation runs alone, the CPU will be idle much of the time it runs.
>Yet the scientific program might have considerable unused code and/or data
>(e.g., the input data for this run does not select Calcomp plots, so those
>routines are never called and their data areas are never referenced;
>another likely possibility is that some dimensions are much larger than
>required).  If either job runs alone, then it wants as much memory as it
>can get.  With virtual memory, the operating system can simultaneously
>load the heavily used parts of BOTH jobs in the machine, for better
>overall performance.

Yes, I agree.  The 7600 probably would have been better off with
virtual memory.  It had at most 262 KW of memory (the address bus was
18 bits; I never saw a 7600 configured with that much).  These days,
262 KW is only one 64th of the available memory on a 16 MW machine.
You can afford to load 50-100 KW of unused code into memory; it's less
than one percent of memory.

The times have changed: the Calcomp plotting routines still take up
the same amount of space, but the total memory has grown 100 fold
(soon to be 1000 fold).  The code for the scientific part of the
calculation hasn't changed much either.  The extra memory is being
used by data, not code.  And it is the data in a scientific code that
gets the most use (and the largest part of the data gets the largest
use - the grid, mesh, lattice, particle descriptions, etc.).  It doesn't
trouble me to load unused code, it takes up so little space anyway.

To be sure, there is still a trade-off between throughput and
individual job speed.  I claim that there are types of applications in
which this trade-off should be resolved in favor of job speed.  In this
kind of application, virtual memory is not desirable, and it is getting
less desirable as machine speed and central memory size increase.

J. Giles
Los Alamos

lamaster@nike.uucp (Hugh LaMaster) (09/24/86)

One important point that I forgot to mention in previous posting:
On a machine with memory mapping (Cyber-205 for example) there is only a very
small penalty for reclaiming a small amount of memory for another task.  On a
very large main memory system, this is an important point.  A 256 MW Cray-2
(with approx. 4 ns clock) would take 1 second to completely copy main memory.
Unfortunately, this is exactly what is required to reclaim memory on the Cray-2
or any other machine which requires memory to be contiguous.  This is an 
important contribution to system overhead, even in batch mode, but can become
a major bottleneck.  This combines with the problems already mentioned to
produce even poorer effective memory utilization, because there is a limit
on how frequently memory can be packed in order to make the available space
usable.  That there is no such problem on a virtual memory machine is an
additional benefit to virtual memory.  J. Giles is correct in stating
that if single job STEP (or single program) speed is the most important
criterion, there is probably no advantage to a virtual memory machine, as
long as the individual job steps take hundreds of seconds or more.  And, as
stated earlier, there is a price to pay for virtual memory:  the extra 
cpu logic it takes to implement it, which undoubtedly slows the cpu by
some amount.  It may be interesting to note that CDC gave the future 
existence of very large main memories as one of the reasons for designing
the virtual memory architecture of the new Cyber 800 and 900 series (NOS/VE)
machines as they did.  They stated that memory management of main memory 
would be too inefficient without it.

   Hugh LaMaster, m/s 233-9,   UUCP:  {seismo,hplabs}!nike!pioneer!lamaster 
   NASA Ames Research Center   ARPA:  lamaster@ames-pioneer.arpa
   Moffett Field, CA 94035     ARPA:  lamaster%pioneer@ames.arpa
   Phone:  (415)694-6117       ARPA:  lamaster@ames.arc.nasa.gov

"The reasonable man adapts himself to the world, the unreasonable man
adapts the world to himself, therefore, all progress depends on the
unreasonable man." -- George Bernard Shaw

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

cdshaw@alberta.UUCP (Chris Shaw) (09/25/86)

In article <7832@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>So the bottom line is: we still don't have complete control, we still
>need to do a lot of work on our own, and we still have a slower memory
>interface than necessary.
>

..and we still don't have any numbers to back up this or any other position.

Basically, Jim's argument has been "our machines are too big, the users love
performance, and they do weird things, so VM is no go". 

The pro-VM people are saying: Use (or invent) a call to tell VM what to do, 
and thereby solve the strange usage pattern problem. This will give the 
users much more flexibility. The flexibility will come at the price of a small 
performance hit, but the performance cost is worth it. There is the added 
benefit that the cost of the system may be less, since secondary store is 
cheaper than high-speed primary store. 

I suppose the question really is: how much primary memory do you need, versus
how much can you get away with?  A previous article mentioned Cray wanting
a full 16MW of primary memory. Plenty of cash mo-nee, since the memory probably
has to run fast. Is this extra cost worth it? In the infinite-wallet world
of defence research, probably not. In the "real", tight-budget world,
bang per buck matters more, and the marginal speed improvement of (say)
full-address-space core might not pan out in the face of more reasonable 
alternatives.

But then, I'm talking through my hat, too.

In any case, there are two major positions here because there are two
types of budgets to consider. Jim is in the world of "performance at any
cost", while lots of other people are into "performance at a reasonable
price".

>J. Giles
>Los Alamos


Chris Shaw    cdshaw@alberta
University of Alberta
Bogus as HELL !

franka@mmintl.UUCP (Frank Adams) (09/25/86)

In article <7839@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>Seymour [Cray] thinks 32 bits is too small for an address!

He's right, too.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

jlg@lanl.ARPA (Jim Giles) (09/25/86)

In article <627@nike.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>One important point that I forgot to mention in previous posting:
>On a machine with memory mapping (Cyber-205 for example) there is only a very
>small penalty for reclaiming a small amount of memory for another task.  On a
>very large main memory system, this is an important point.  A 256 MW Cray-2
>(with approx. 4 ns clock) would take 1 second to completely copy main memory.
>Unfortunately, this is exactly what is required to reclaim memory on the Cray-2
>or any other machine which requires memory to be contiguous.
>...

This is not really true.  There is no reason inherent in a non-VM
system which requires it to swap the whole large program out in order
to make room for the smaller one.  The only absolute requirement is
that the large program must be entirely resident while it is running.

As it happens, the systems currently running on Cray machines drop
entire memory images when swapping for space.  But this is not a
requirement - only a simplification of the duties of the operating
system.

J. Giles
Los Alamos

jjw@celerity.UUCP (Jim ) (09/30/86)

The "anti-virtual" memory discussions seem to be concentrating on whether
Supercomputers need virtual memory.  Note that these machines are in a sense
special purpose* machines with the following characteristics:

	They are pushing the "state of the art" in memory and processor
	design.

	They are intended for large scale, vectored, mostly floating point
	calculations.

	They are expensive to purchase and operate.

	They are usually sold and purchased to support a few computationally
	intensive applications (which may run for hours or days even on a
	Cray).

In this environment virtual memory is probably a hindrance rather than a help.

However, as time passes, larger memories and faster processors will be
available for more conventional general purpose computers.  I believe that
virtual memory will be essential for the management of the larger memories
in many of the environments in which these systems will be used.


---------
* I know they are special purpose in the sense that they can perform any
  application which any other "general purpose" machine can perform.  But,
  how many people purchase a Cray to do timesharing, text editing or
  business EDP?  And if they do, what do you suppose Seymour Cray's answer
  would be to someone who complains about the number of users who can get
  good emacs response?

jlg@lanl.ARPA (Jim Giles) (10/12/86)

In article <589@celerity.UUCP> jjw@celerity.UUCP (Jim (JJ) Whelan) writes:
>The "anti-virtual" memory discussions seem to be concentrating on whether
>Supercomputers need virtual memory.  Note that these machines are in a sense
>special purpose* machines ...
>...
>However, as time passes, larger memories and faster processors will be
>available for more conventional general purpose computers.  I believe that
>virtual memory will be essential for the management of the larger memories
>in many of the environments in which these systems will be used.
>
This has been my point all along.  I never claimed that virtual memory
was universally bad, only that it is counterproductive in SOME
applications.  The main opposition to this view has come from those
supporters of VM who think that all applications are better with VM.

>---------
>* I know they are special purpose in the sense that they can perform any
>  application which any other "general purpose" machine can perform.  But,
>  how many people purchase a Cray to do timesharing, text editing or
>  business EDP?  And if they do, what do you suppose Seymour Cray's answer
>  would be to someone who complains about the number of users who can get
>  good emacs response?

It is interesting to note however, that Crays have a better price-
performance ratio, even on these tasks, than VAXEN do.  But then,
nearly everything has a better price-performance ratio than a VAX!

J. Giles
Los Alamos