jvz@sdcsvax.UUCP (John Van Zandt) (08/25/86)
I heard the other day about current research into using massive amounts of main memory (approx. 1 GB) on minicomputers and achieving very high performance. I assume this was mainly due to the non-swapping of data and code for large applications. It seems to me that a reasonably clever paging scheme (maybe with some compiler assistance) would limit the overhead of the swapping to the point of making it invisible for large programs/data. This is under the assumption of a single-user system. I grant that in a multi-user environment, the more memory, the better.

Besides starting a discussion of the pros and cons of this (does it really give that much better performance?), I'd like some pointers to articles or technical reports on the topic.

John Van Zandt
UCSD
uucp: ...ucbvax!sdcsvax!jvz
arpa: jvz@UCSD
johnson@uiucdcsp.CS.UIUC.EDU (08/26/86)
I believe that the large memory computers are designed for database applications. By putting the entire database in main memory, each transaction can be run to completion without waiting for the disk. Thus, far fewer locks are needed. Concurrency control, deadlock, etc. all become much simpler problems. Not only is the system faster because it doesn't wait on disks, it is also faster because there is much less overhead for locking. While it is true that minicomputers are used for the CPUs, I think that the large memory computers are considered database supercomputers. They are not meant to be cheap.

Hector Garcia-Molina of Princeton is one of the main workers in this area. The first I learned of this work was an article he wrote in IEEE Trans. on Computers a few years ago.
mc68020@gilbbs.UUCP (Thomas J Keller) (08/27/86)
Someone please correct me if I am wrong, but as I have been led to understand the situation, it will prove somewhat difficult to successfully implement large physical memory systems on the order of 1 GB. The primary impediment seems to be the propagation delays in the decoding trees. Anyone care to enlighten me (us)?
--
Disclaimer:  Disclaimer?  DISCLAIMER!?  I don't need no stinking DISCLAIMER!!!

tom keller				"She's alive, ALIVE!"
{ihnp4, dual}!ptsfa!gilbbs!mc68020

(* we may not be big, but we're small! *)
kenny@uiucdcsb.CS.UIUC.EDU (08/27/86)
Ralph (johnson@b.cs.uiuc.edu) is right about the huge-memory machines being intended to serve as multiprogrammed database machines. Honeywell is presently marketing one that can be configured with up to 64M 36-bit words (a tad over 0.25 GB), and Nippon Electric has one twice that size. On a typical configuration with one of those behemoths, 75% or more of the memory is dedicated to disc caching. With their present technology, they don't simplify the locking mechanism (since there is still the possibility that some given page will be on disc, not in main store) but gain substantial performance improvements because of the reduced number of collisions.

Kevin Kenny
University of Illinois at Urbana-Champaign
UUCP: {ihnp4,pur-ee,convex}!uiucdcs!kenny
CSNET: kenny@UIUC.CSNET
ARPA: kenny@B.CS.UIUC.EDU  (kenny@UIUC.ARPA)
guy@sun.uucp (Guy Harris) (08/27/86)
> Someone please correct me if I am wrong, but as I have been led to
> understand the situation, it will prove somewhat difficult to successfully
> implement large physical memory systems on the order of 1 GB. The primary
> impediment seems to be the propagation delays in the decoding trees.
> Anyone care to enlighten me (us)?

"Will prove"? As *I* have been led to understand the situation, the Cray-2 is *already* offering primary memories of that size. From your reference to "decoding trees", I presume you're talking about *single chip* memories of that size, not memory systems.
--
Guy Harris
{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
guy@sun.com (or guy@sun.arpa)
jaw@aurora.UUCP (James A. Woods) (08/28/86)
# "what comes after silicon?  oh, gallium arsenide, i'd guess.
#  and after that, there's a thing called indium phosphide."
#	-- seymour cray, circa 1980

tom keller wonders about the speed of gigabyte memories. the cray 2 here has 1/2 gigabyte, with a worst-case cycle time of 57 clock pulses (~234 nsec). i doubt chip decoding delays take the bulk of this time (logarithms don't grow fast); the sluggishness is more likely due to the 2 using conservative (i.e. cheap) 256K dynamic mos technology. things are a bit better with pseudo-banks (33 clocks/eight-byte word), and strictly sequential access for vectors is designed to be fast (1.1 clocks/word). but the memory speed for non-vectorized C code (e.g. any unix utility except 'cmp') leaves a lot to be desired -- even with local variables in fast registers ("local memory") [it would be nice to allow arrays here]. for "random stride" computation (unix kernel, compilers, utilities), the cray 2 is about the speed of a MIPS board. indeed, d. ritchie noted at the recent supercomputer conference held at nasa ames that, because of risc-style assembler code expansion (to ascii, yet), the compile pipeline makes at&t's faster (scalar-wise) cray x-mp about the speed of a high-end vax for such applications. if seymour ever hopes to regain the 'scalar' computing championship title, he'd better get hip to a transparent data cache.

-- ames!aurora!jaw
bzs@bu-cs.BU.EDU (Barry Shein) (08/28/86)
From: mc68020@gilbbs.UUCP (Thomas J Keller)
> Someone please correct me if I am wrong, but as I have been led to
> understand the situation, it will prove somewhat difficult to successfully
> implement large physical memory systems on the order of 1 GB. The primary
> impediment seems to be the propagation delays in the decoding trees.
> Anyone care to enlighten me (us)?

From: johnson@uiucdcsp.CS.UIUC.EDU
> I believe that the large memory computers are designed for database
> applications.

I'm not sure you two are wrong, but I'm not sure you're right either. The Cray-2 (certainly a number cruncher) comes with around 2 GB of main memory. The recently announced ELXSI (more like a $200K [entry] machine if I read the article right) boasts a maximum 1 GB configuration (I figure you buy the ELXSI for the volume discount on the memory to fill it [~$1M list], but I wander.) Again, a number cruncher, I believe. So, for what it's worth, these are essentially counter-examples of some value.

As I brought up once before, I still think there may be some constant N which completes the sentence "never buy more memory than you can zero out in N seconds" [I call it Shein's law of memory, but some have claimed that Amdahl may have said something similar; great minds run in the same gutters :-] The reasoning is that if you can't touch it in N seconds, you probably can't use it very effectively either.

For some more thoughts on this you might want to pick up Danny Hillis' "The Connection Machine", where he paints an interesting vision of modern computers as vast seas of inactive silicon (the memory) with this (typically) one poor little CPU touching one or two spots per cycle. Of course, if you spend most of your time waiting for disk, vast memories may help, but so would (and does) clever memory/disk scheduling (within limits.)

The point is that it is not clear that increasing memories unbounded produces unbounded performance gains; in fact, it almost certainly doesn't.
You need a CPU (or more than one) to do something with all this wonderful stuff you have in memory. Before you all jump down my throat because you are sure that if you had 16 MB rather than 8 MB on your machine then more *must* be better, consider a one MIPS machine zeroing memory in a loop:

	CLRL	R1
LOOP:
	CLRL	(R1)+
	CMPL	R1,HIMEM
	BNE	LOOP

This (theoretical machine) would take 3 * (1G/1M) = 3000 seconds, or a little less than one hour, to complete. It's hard to believe such a machine could make -effective- use of that much memory. I know, it's debatable, but anyone arguing against that statement is probably ignoring any rational concern for cost/benefit trade-offs (e.g. spend $1M on the memory and $200K on the processor, or $1M on the processor and $200K on the memory? or some similar variation.)

Of course, assuming the memory were free and reasonably random behavior, I agree that a huge memory would have some value to a database application that filled the memory, but I doubt it would be a reasonable thing to do unless memory prices dropped drastically. You'd probably be better off putting your money into the processor (the context of the argument seemed to imply smaller processors; obviously Cray is already up against that limit.)

-Barry Shein, Boston University
alan@mn-at1.UUCP (Alan Klietz) (08/28/86)
In article <884@gilbbs.UUCP>, mc68020@gilbbs.UUCP (Thomas J Keller) writes:
> Someone please correct me if I am wrong, but as I have been led to
> understand the situation, it will prove somewhat difficult to successfully
> implement large physical memory systems on the order of 1 GB. The primary
> impediment seems to be the propagation delays in the decoding trees.
> Anyone care to enlighten me (us)?

If you use DRAMs you have access times on the order of 50-200 ns. That is enough time for fast ECL-type logic to do plenty of decoding. The CRAY-3 is rumored to have a central memory on the order of 2 Gwords, and solid-state "disks" are going even higher.
--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415
UUCP:  ..ihnp4!dicome!mn-at1!alan	Ph: +1 612 638 0577
       ..caip!meccts!dicome!mn-at1!alan	ARPA:  aek@umn-rei-uc.ARPA
(*) An affiliate of the University of Minnesota
hammond@petrus.UUCP (Rich A. Hammond) (08/28/86)
> tom keller writes:
> Someone please correct me if I am wrong, but as I have been led to
> understand the situation, it will prove somewhat difficult to successfully
> implement large physical memory systems on the order of 1 GB. The primary
> impediment seems to be the propagation delays in the decoding trees.
> Anyone care to enlighten me (us)?

Well, the Cray 2 supports 256 MW (where word = 64 bits), i.e. 2 GB, so it is not impossible. The decoding trees grow as the log of the size of memory, so it isn't bad. Plus, one rarely treats memory as a single byte; one selects a larger chunk (1 to n words, each word of 4 or 8 bytes), gets it to the CPU, and then uses a barrel shifter or other selector to pick out the individual byte(s) wanted.

Clearly, the decoding times would get longer if we stayed with the same technology, but the memory ICs are also getting faster as they shrink the circuit size: the 64K RAMs had access times around 100-120 ns, the 256K RAMs around 80-100 ns, and the 1M RAMs have access times around 60-80 ns, despite the increase in decoding on the chips. So going to the larger chips gives you the fourfold increase in capacity, plus there is additional time available for external decoding (assuming the memory cycle stays constant).

Rich Hammond	hammond@bellcore.com
phil@osiris.UUCP (Philip Kos) (08/29/86)
A similar discussion came up a few months ago, in the context of "how much memory is it reasonable to try to hang off my type X machine?" I expressed some confusion over conclusions drawn by someone else (it may have been Barry Shein, whose article I am replying to), who maintained steadfastly that there was a limit beyond which it was not useful to go. The argument really only applies to machines with virtual memory, and it goes like this:

Your memory is organized in pages which are mapped from various processes' virtual spaces into the physical memory addressing. It takes a certain number of MM table entries to map a certain amount of memory (considering a single VM architecture). You want to improve performance by reducing paging; it seems logical to do this by increasing main memory.

Now. By adding more main memory, you *do* decrease paging, which is obviously a good thing. However, you also make memory mapping more time-consuming (more page table entries to maintain), so there's a tradeoff here. You might be able to expand the memory to, say, 16M without taking any hits - the upper limit depends, of course, on your VM architecture. Beyond this limit, the page table has to be expanded, and maintaining it becomes more complicated. Eventually you reach a point where the time saved by not paging is less than the extra time spent maintaining the mapping table. This is the point where you should just give up trying to speed up your system by adding more memory.

Is this right? It's sort of off the top of my head, and I may have gotten some of the details wrong (what details, I hear you ask?) but it seems to convey the gist of the argument...

Phil Kos			...!decvax!decuac
The Johns Hopkins Hospital	            > !aplcen!osiris!phil
Baltimore, MD			...!allegra!umcp-cs

"In the end there's still that song,
 Comes crying like the wind
 Down every lonely street that's ever been."	- Robert Hunter
hammond@petrus.UUCP (Rich A. Hammond) (08/29/86)
> Barry Shein, Boston University writes:
> ... consider:
>
> A one MIP machine zeroing memory in a loop:
>
>	CLRL	R1
> LOOP:
>	CLRL	(R1)+
>	CMPL	R1,HIMEM
>	BNE	LOOP
>
> would (theoretical machine) take 3 * (1G/1M) or 3000 seconds or
> a little less than one hour to complete. It's hard to believe
> such a machine could make -effective- use of that much memory.

That calculation left out the word size (or were you talking about 1 GW versus 1 GB?) For a 32-bit 1 MIPS machine, clearing 1 GB of memory would take (3 / 1M) * (1 GB / 4) = 750 seconds, or 12.5 minutes. Of course, for general purpose computation, Barry's point is correct: you can't use huge amounts of memory. However, database machines benefit greatly from the reduced overhead if all data is available in memory, even if they don't touch much of it at any one time. Also, our Convex C-1, when doing vector operations, can touch 8 bytes every 100 ns, so zeroing a 1 GB space takes (1 GB / 8) / 10M = 12.5 seconds.

Cheers,
Rich Hammond	Bellcore
mat@amdahl.UUCP (Mike Taylor) (08/29/86)
In article <1130@bu-cs.bu-cs.BU.EDU>, bzs@bu-cs.BU.EDU (Barry Shein) writes:
> The point is it is not clear that increasing memories unbounded
> produces unbounded performance gains, in fact, it almost certainly
> doesn't.

As a matter of interest, studies we did some time ago with IBM MVS indicated that performance reaches a maximum and then begins to deteriorate slowly. Not a general result, of course, but food for thought. BTW, we offer up to 512 MB of mainstore on our systems.

> You need a CPU (or more than one) to do something with all this
> wonderful stuff you have in memory.
>
> Of course, assuming the memory were free and reasonably random behavior
> I agree that a huge memory would have some value to a database application
> that filled the memory, but I doubt it would be a reasonable thing to
> do unless memory prices dropped drastically. You'd probably be better
> off putting your money into the processor (the context of the argument
> seemed to imply smaller processors, obviously Cray is already up against
> that limit.)

But what do you do when you can't put any more money into the processor? Not only Cray is up against that limit.
--
Mike Taylor                ...!{ihnp4,hplabs,amd,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
joel@peora.UUCP (Joel Upchurch) (08/30/86)
One thing that no one has mentioned so far that you could do with very large memories is table lookup. Sure, everyone knows that a table lookup is faster than a calculated solution almost all the time, but how many people would think to use one if the resulting table would be several hundred megabytes or even larger? It seems to me that this could be significant in some applications, like weather forecasting.
--
Joel Upchurch @ CONCURRENT Computer Corporation (A Perkin-Elmer Company)
Southern Development Center
2486 Sand Lake Road/ Orlando, Florida 32809/ (305)850-1031
{decvax!ucf-cs, ihnp4!pesnta, vax135!petsd, akgua!codas}!peora!joel
bzs@bu-cs.BU.EDU (Barry Shein) (08/30/86)
>A similar discussion came up a few months ago, in the context of
>"how much memory is it reasonable to try to hang off my type X
>machine?" I expressed some confusion over conclusions drawn by
>someone else (it may have been Barry Shein, whose article I am
>replying to), who maintained steadfastly that there was a limit
>beyond which it was not useful to go. The argument really only
>applies to machines with virtual memory, and it goes like this:
>
>Your memory is organized in pages which are mapped from various
>processes' virtual spaces into the physical memory addressing. It
>takes a certain number of MM table entries to map a certain amount
>of memory (considering a single VM architecture). You want to
>improve performance by reducing paging; it seems logical to do
>this by increasing main memory.
>
>Now. By adding more main memory, you *do* decrease paging, which
>is obviously a good thing. However, you also make memory mapping
>more time-consuming (more page table entries to maintain), so
>there's a tradeoff here.

'Twas I. This is a slightly different topic (virtual memory systems), but I am quite sure that you can hit this limit easily on a 750 (e.g.); I've suspected my 8 MB VAX 750 (4.2BSD) of having this problem (too much time spent in the kernel due to virtual memory management), though I've never had the time to investigate (anyone?)

Of course, there's the bridging argument that no matter how much memory you have, it's just a matter of time before you wish it had virtual memory management. I won't argue this either way, but I believe there is an amusing anecdote told around a famous institution about a certain famous individual who proclaimed some years back, upon the arrival of their 256 KW [36-bit] memory, that now they had more memory than anyone could possibly use.

I already know of people in two application areas who believe they could easily fill 1 GB main memories with their current needs (a database and a graphics animation shop.)
-Barry Shein, Boston University
roy@phri.UUCP (Roy Smith) (09/01/86)
In article <2289@peora.UUCP> joel@peora.UUCP (Joel Upchurch) writes:
> One thing that no one has mentioned so far that you could do
> with very large memories is table lookups.

OK, let's talk bizarre (1/2 :-)). Imagine the stateword of a process as all the bits in that process's memory strung end to end. If you take this as a memory address, you could implement your CPU as a simple lookup table; for each possible state there is only one possible next state that a process can get to. All you have to do is a simple table lookup to find the next state. Of course, you need a mighty big memory to build the lookup table (not to mention the non-trivial amount of CPU time to calculate what values go where in the table).

How much memory? Well, I just did a size on /bin/* (11/750 running 4.2 BSD). Running the results through colex to get the decimal total and desc to get descriptive statistics (both from Gary Perlman's very nice UNIX|STAT package) I get (with a sample size of 60; this excludes the shell scripts /bin/true and /bin/false):

------------------------------------------------------------
     Mean     Median   Midpoint  Geometric   Harmonic
35343.000  27394.000  57622.000  32294.059  30172.889
------------------------------------------------------------
       SD  Quart Dev      Range    SE mean
17635.600   7553.000  77052.000   2276.746
------------------------------------------------------------
  Minimum Quartile 1 Quartile 2 Quartile 3    Maximum
19096.000  24372.000  27394.000  39478.000  96148.000
------------------------------------------------------------

The smallest program was /bin/echo (20k to echo argv!?); the largest was /bin/as at 96k. So, assuming most processes will fit into 100 kbytes, that means you need a LUT with an 800k-bit-long address. Talk about BFM's! As I think I've mentioned before, it is believed that there are approximately 2^200 electrons in the universe.
Since it is unlikely that anybody would want to reference more things than there are electrons in the universe, 200 bits seems like a good upper bound for the length of a memory address.
--
Roy Smith, {allegra,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
aglew@ccvaxa.UUCP (09/01/86)
> ... Very large memories - Barry Shein (and C. Gordon Bell, and many
> others) suggest that memory clearing places an upper limit on
> useful memory size.

Right. How quickly can you clear a 4 TB disk farm?

(No, I'm not quite so dogmatic. Memory clearing is certainly a consideration in context switching, which may be one of the reasons multiprocessors like ELXSI can effectively use large memory, since they also report less context switching.)

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms
jon@msunix.UUCP (Jonathan Hue) (09/03/86)
In article <145@mn-at1.UUCP>, alan@mn-at1.UUCP (Alan Klietz) writes:
> If you use DRAMs you have access times on the order of 50-200ns. That is
> enough time for fast ECL-type logic to do plenty of decoding.

I don't design with ECL, but just the buffers to go to/from TTL levels are going to be way slower than the FAST (that's right, Fast Advanced Schottky TTL) 74xx series stuff from Fairchild. With gate delays around 2 ns, and 256K DRAMs at around 150 ns, you have plenty of time to decode address lines. Heck, you can just cram your decoder into a 20R8 PAL (I think it will fit), and the new ones are 15 ns up to three gates deep. Of course, bipolar SRAMs with 3 ns access times are another story...

"If we did it like everyone else,		Jonathan Hue
what would distinguish us from			Via Visuals Inc.
every other company in Silicon Valley?"		sun!sunncal\
							   >!leadsv!msunix!jon
"A profit?"					amdcad!cae780/
ronc@fai.UUCP (Ronald O. Christian) (09/03/86)
In article <145@mn-at1.UUCP> alan@mn-at1.UUCP (Alan Klietz) writes:
>In article <884@gilbbs.UUCP>, mc68020@gilbbs.UUCP (Thomas J Keller) writes:
>> [...] The primary impediment seems to be the propagation delays
>> in the decoding trees. Anyone care to enlighten me (us)?
>
>If you use DRAMs you have access times on the order of 50-200ns. That is
>enough time for fast ECL-type logic to do plenty of decoding.

I wonder about this. I had a design problem a while back involving 45 ns RAM used as a lookup table, where I couldn't decode the address fast enough. I found some ECL gates that would do the decoding faster, but then found to my chagrin that you lost so much time in the conversion from ECL back to TTL that the overall response ended up being slower. I've had the same problem in gate arrays, where the propagation delay through the on-chip conversion to ECL, the array itself, and the conversion back to TTL was greater than the prop delay through a straight TTL array of similar complexity. If you're going ECL for decoding, I think you really need ECL memories to gain anything.

Then there's GaAs... so fast you can spend a lot of time converting to a different logic family. I like GaAs. Expensive, though.

		Ron
--
Ronald O. Christian (Fujitsu America Inc., San Jose, Calif.)
seismo!amdahl!fai!ronc  -or-  ihnp4!pesnta!fai!ronc

Oliver's law of assumed responsibility:
	"If you are seen fixing it, you will be blamed for breaking it."
john@datacube.UUCP (09/04/86)
/* Written 5:51 pm Aug 28, 1986 by phil@osiris.UUCP in datacube:net.arch */
.......
Now. By adding more main memory, you *do* decrease paging, which is obviously a good thing. However, you also make memory mapping more time-consuming (more page table entries to maintain), so there's a tradeoff here. You might be able to expand the memory to, say, 16M without taking any hits - the upper limit depends, of course, on your VM architecture. Beyond this limit, the page table has to be expanded, and maintaining it becomes more complicated. Eventually you reach a point where the time saved by not paging is less than the extra time spent maintaining the mapping table. This is the point where you should just give up trying to speed up your system by adding more memory.

Is this right? It's sort of off the top of my head, and I may have gotten some of the details wrong (what details, I hear you ask?) but it seems to convey the gist of the argument...

Phil Kos
/* End of text from datacube:net.arch */

I agree with this for fixed VM architectures, but for new machines I would think that the page size should be increased so that the number of pages of physical memory available remains roughly constant. Won't this give ever-increasing performance, or does the latency of bringing in new pages kill the gains?

BTW, a quick calculation tells me that with current technology a 128 Mbyte memory board could be designed for the SUN-3 (15" x 15" form factor). This would include error correction and detection and all the other goodies required on big system memories. Just slap 8 of these in your SUN3-160: instant gigabyte. 1 GB == 45 seconds of color video.

John Bloomfield
Datacube Inc. 4 Dearborn Rd. Peabody, Ma 01960	617-535-6644
ihnp4!datacube!john
decvax!cca!mirror!datacube!john
{mit-eddie,cyb0vax}!mirror!datacube!john
srt@duke.UUCP (Stephen R. Tate) (09/04/86)
In article <289@petrus.UUCP>, hammond@petrus.UUCP (Rich A. Hammond) writes:
> > tom keller writes:
> > The primary impediment seems to be the propagation delays in the
> > decoding trees. Anyone care to enlighten me (us)?
>
> Well, the Cray 2 supports 256M Word (where word = 64 bits) i.e. 2 Gb,
> so it is not impossible. The decoding trees grow as the log of the
> size of memory, so it isn't bad.

That's an over-complicated (or over-generalized) treatment of address decoding. The key phrase you use is "decoding trees", but decoding does not have to be done by trees at all. *Regardless* of the memory size, decoding the unique address of a bank of memory (bank, row, or word, actually) need never be more than 2 gates deep, meaning propagation delays of only around 30 ns using regular old slow silicon TTL. And this is completely independent of memory size. (Within reason... propagation delays for 40-input AND gates might be a bit higher... but who's going to have 40-bit bank addresses?)
--
Steve Tate			..!{ihnp4,decvax}!duke!srt
preece@ccvaxa.UUCP (09/05/86)
I don't claim to know where the cost-benefit curve lies, but there clearly are a lot of problems whose solutions naturally involve huge address spaces: databases, number crunching on huge arrays, image analysis, many kinds of things that fall into connectionist models, etc. For problems like that, solutions on systems with less memory than the address space of the problem will involve some kind of mapping between secondary memory and physical memory. That means they go slow. If memory is "cheap enough" it is obviously preferable to have that memory be physical memory. The definition of "cheap enough" depends on the importance of the problem, the time pressure on its solution, and your budget. Saying that there is a constant that defines the maximum useful amount of memory is simply stating your interpretation of those governing factors.

The original note limited the question to single-user systems. Even in that context, though, there is an obvious answer to the complaint that you can't access memory that fast: get more processors and share the memory among them. N years in the future, when processors are available in 1M-processor chips (that is, one chip = 1 million processors), it will seem pretty silly to be talking about whether a T of memory is too much, just as now it's pretty silly to be talking about whether a G is too much.

There's a simple law governing all of computing [don't ping me on the obvious caveats and quibbles; this is the kind of law that's supposed to be over-broad but simple]: there can't be too much memory and no processor is too fast.
--
scott preece
gould/csd - urbana
uucp:	ihnp4!uiucdcs!ccvaxa!preece
arpa:	preece@gswd-vms
jlg@lanl.ARPA (Jim Giles) (09/05/86)
In article <1130@bu-cs.bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>...
>As I brought up once before, I still think there may be some constant
>N which completes the sentence "never buy more memory than you can
>zero out in N seconds" [I call it Shein's law of memory but some have
>claimed that Amdahl may have said something similar, great minds run
>in the same gutters :-]
>...
>A one MIP machine zeroing memory in a loop:
>
>	CLRL	R1
> LOOP:
>	CLRL	(R1)+
>	CMPL	R1,HIMEM
>	BNE	LOOP
>
>would (theoretical machine) take 3 * (1G/1M) or 3000 seconds or
>a little less than one hour to complete. It's hard to believe
>such a machine could make -effective- use of that much memory.

The task of zeroing memory on the Cray 2 with 256 MW (= 2 GB) should take less than 5-10 seconds. You must remember that a one MIPS machine is pathetically slow compared to a Cray (at least for vector operations like zeroing memory, searching, sorting, and many scientific applications).

By the way, why does everyone assume that large memory is for database applications and such? I can think of LOTS of scientific applications for large memory machines - none of which involve databases in any way.

J. Giles
Los Alamos
jlg@lanl.ARPA (Jim Giles) (09/06/86)
In article <7144@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>The task of zeroing memory on the Cray 2 with 256MW (=2GB) should take
>less than 5-10 seconds. You must remember that a one MIP machine is
>pathetically slow compared to a Cray (at least for vector operations
>like zeroing memory, searching, sorting, and many scientific
>applications).

I made a slight error (by an order of magnitude) - the Cray 2 should take about 0.5 to 1.0 seconds to clear all its memory.

It seems that large memory machines are obviously desirable. In lattice gauge theory a single lattice could easily use all the Cray 2 memory. 3-D hydro codes also use enormous amounts of memory. Furthermore, applications like these make frequent references to the entire data structure (i.e. the whole array is updated every single time-step).

The desirability of paging for such machines is not so obvious. Consider a code which updates a large array on each step through a loop (each time-step). If the central memory is too small to hold the entire array and you have a virtual memory scheme, some part of the array will get swapped out on each time step. Most likely, it will be the least recently used page that gets swapped - the very one that you will need first on the subsequent time step! You are now in a situation of chasing your tail around memory - losing time all the while.

Without virtual memory, though, your code can anticipate the problem by initiating asynchronous I/O long before it needs to use the data. And, since it's not driven by page faults, you can select only a particular part of the array to be swapped - thus minimizing I/O. This kind of programming effort is somewhat unfashionable these days, but it's exactly the sort of thing that most programmers who use these big machines immediately do. They bought the machine because the critical issue was SPEED - and anything that reduces this speed (like virtual memory) is to be shunned.
(Cyber 205 users usually turn off the virtual memory when they need speed; Crays don't even have virtual memory.) On the other hand, a small-memory, single-user machine (like my SUN workstation) should never be built without virtual memory. The desirability of a feature should always be driven entirely by the purpose of the machine.

J. Giles
Los Alamos
philip@amdcad.UUCP (Philip Freidin) (09/06/86)
In article <8513@duke.duke.UUCP>, srt@duke.UUCP (Stephen R. Tate) writes:
> That's an over-complicated (or over-generalized) treatment of address
> decoding. The key phrase you use is "decoding trees", where decoding
> does not have to be done by trees at all. *Regardless* of the memory
> size, decoding the unique address of a bank of memory (bank, row, or
> word, actually) need never be more than 2 gates deep. Meaning propagation
> delays of only around 30ns using regular old slow silicon TTL.
> And this is completely independent of memory size.

Unfortunately, at this point I would like to apply some reality to the discussion. Rather than talk about your 40-bit-address memories, let's look at something trivial: 64 KW, which needs 16 bits of address. With your 2-level decode (one level of inverters, and a second of AND gates to do word select) you have 32 address select lines coming into the second level: address and address complement. Each of these must drive 32K AND gates! I don't know of any logic family with the drive capability to support that kind of load; your typical TTL has a drive capability of from 10 to 20 loads.

Another fly in your fast-decode ointment is that the way AND gates are implemented in many logic families precludes building a 16-input AND gate as a single level. CMOS is limited to a fan-in of about 4, and TTL and ECL have similar limits. To build bigger AND gates, you end up with a tree structure inside your AND gate.

--Philip Freidin
bzs@bu-cs.BU.EDU (Barry Shein) (09/07/86)
From: jlg@lanl.ARPA (Jim Giles) >The task of zeroing memory on the Cray 2 with 256MW (=2GB) should take >less than 5-10 seconds. You must remember that a one MIP machine is >pathetically slow compared to a Cray (at least for vector operations >like zeroing memory, searching, sorting, and many scientific >applications). That's exactly my point: the Cray "deserves" 2GB of memory (was this a counter-example? I just said that if a machine can't zero memory in N seconds for some small N it probably won't use the memory either; your point is tautological to that.) Working backwards, if the Cray can zero 2GB in 10 seconds we get (assuming, as before, 3 instructions (I) per zero per word, 8 bytes per word for the Cray) 2GB/8B -> 256M * 3I -> 768MI/10s -> 76.8MIPS; using your 5-second figure, around 150MIPS. BUT -- how is that speed being accomplished (and I suspect from the math above that the Cray is a little faster, no matter)? By parallelism, vector processors, very non-standard expensive stuff (though of course becoming more popular BECAUSE OF THE ABOVE DISCUSSED PROBLEMS, among others.) The counter-intuitive conclusion is that increasing memory size for conventional processors into the GB range is a losing proposition. -Barry Shein, Boston University
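Shein's back-of-the-envelope figures check out. A minimal sketch (taking "256M" decimally, as the post's own arithmetic does):

```python
# Shein's numbers: 2 GB = 256 M 8-byte words, 3 instructions per word zeroed.
words = 256 * 10**6
instructions = 3 * words             # 768 M instructions
mips_10s = instructions / 10 / 1e6   # zeroing in 10 seconds
mips_5s = instructions / 5 / 1e6     # Giles's 5-second figure
print(mips_10s, mips_5s)             # 76.8 153.6
```

76.8 MIPS at the slow end, roughly 150 MIPS at the fast end, matching the post.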
henry@utzoo.UUCP (Henry Spencer) (09/07/86)
> The desirability of paging for such machines is not so obvious. > Consider a code which updates a large array on each step through a > loop (each time-step). If the central memory is too small to hold the > entire array and you have a virtual memory scheme, some part of the > array will get swapped out on each time step. Most likely, it will be > the least recently used page that gets swapped - the very one that you > will need first on the subsequent time step!... Jim, all that you have established here is that LRU is a thoroughly bad virtual-memory policy for a scientific program. Few people will argue that. You have also more-or-less established that a program which behaves in the manner you suggest will not benefit much from virtual memory; its performance will degrade badly when it starts paging. Few will argue that either. Not all programs behave that way, though. You have *not* established that virtual memory, as such, is a poor idea. It is quite possible to combine demand fetching with prefetching of things that are expected to be needed soon. It's probably even a good idea when trying to page scientific programs. It *is* harder to get right, which is why you don't see it done much. > Without virtual memory though, your code can anticipate the problem by > initializing asynchronous I/O long before it needs to use the data. > And, since it's not driven by page faults, you can select only a > particular part of the array to be swapped - thus minimizing I/O. > This kind of programming effort is somewhat unfasionable these days... With some reason. What you're saying is that because the operating-system people are too lazy to devise paging algorithms that are useful for large scientific programs, the programmers should be required to do it themselves. 
Apart from the matter of constantly reinventing the wheel, there is also the problem that it's a lot of work to get it right -- program reference patterns are notorious for being hard to predict beforehand, which means experimenting and then twiddling the code to match the results. This may perhaps not be needed for really straightforward array-mashing code, but I remain a bit skeptical: historically, the batting average on statements like "this program obviously has the following reference pattern..." is close to zero. I wouldn't be surprised if a lot of scientific code, with its carefully hand-twiddled asynchronous I/O, is in fact managing its memory rather inefficiently. Especially if the code has been revised, or moved to a new machine (or a new variant of the old one), since the last tuning job was done. > ... They bought the machine because > the critical issue was SPEED - and anything that reduces this speed > (like virtual memory) is to be shunned. (Cyber 205 users usually > turn off the virtual memory when they need speed, Crays don't even > have virtual memory.) I think you may be confusing two issues here. The reason the Crays don't have virtual memory is not because asynchronous I/O is superior to paging, but because non-trivial address translation hurts memory-access time. Are the Cyber 205 users turning the virtual memory off because they don't trust the paging algorithm, or because the machine will run even a memory-resident program faster with it off? I'd bet it's the latter. Now *that* is a legitimate and well-justified reason for not using virtual memory. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,decvax,pyramid}!utzoo!henry
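The LRU pathology both sides concede is easy to demonstrate. A small simulation (Python, illustration only) of pure LRU demand paging over a cyclic sweep, with memory one page short of the working set:

```python
from collections import OrderedDict

def lru_faults(reference_string, frames):
    """Count page faults for a pure LRU-replacement pager."""
    mem = OrderedDict()
    faults = 0
    for page in reference_string:
        if page in mem:
            mem.move_to_end(page)          # mark most recently used
        else:
            faults += 1
            if len(mem) >= frames:
                mem.popitem(last=False)    # evict least recently used
            mem[page] = True
    return faults

# Cyclic sweep over 10 pages with room for only 9: LRU always evicts
# exactly the page needed next, so every single reference faults.
refs = list(range(10)) * 5
print(lru_faults(refs, 9))    # 50 -- all 50 references fault
print(lru_faults(refs, 10))   # 10 -- one extra frame, only cold faults
```

One frame short of the cycle length and the hit rate collapses to zero, which is precisely the behavior Giles describes and Henry concedes.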
ken@argus.UUCP (Kenneth Ng) (09/08/86)
In article <8513@duke.duke.UUCP>, srt@duke.UUCP (Stephen R. Tate) writes: > But > who's going to have 40 bit bank addresses?) > Steve Tate ..!{ihnp4,decvax}!duke!srt Check out the load far pointer instruction on the Intel 386. It is a full logical address, 48 bits: a selector and an offset. -- Kenneth Ng: Post office: NJIT - CCCC, Newark New Jersey 07102 uucp(for a while) ihnp4!allegra!bellcore!argus!ken *** WARNING: NOT ken@bellcore.uucp *** !psuvax1!cmcl2!ciap!andromeda!argus!ken bitnet(preferred) ken@orion.bitnet --- Please resend any mail between 10 Aug and 16 Aug: --- the mailer broke and we had billions and billions of --- bits scattered on the floor.
sewilco@mecc.UUCP (Scot E. Wilcoxon) (09/08/86)
>As I think I've mentioned before, it is believed that there are >approximately 2^200 electrons in the universe. Since it is unlikely that >anybody would want to reference more things than there are electrons in the >universe, 200 bits seems like a good upper bound for the length of a memory >address. > >Roy Smith, {allegra,philabs}!phri!roy Present technologies all require use of many electrons per bit stored. Therefore the number of electrons in the universe is an actual upper bound. Particularly since a few electrons are needed for the computer which will use the memory :-) Collecting all the electrons in the universe for the factory is left as an exercise for the reader. Simulating this action and the forces involved will use more memory than the number of electrons. -- Scot E. Wilcoxon Minn Ed Comp Corp {quest,dicome,meccts}!mecc!sewilco 45 03 N 93 08 W (612)481-3507 {{caip!meccts},ihnp4,philabs}!mecc!sewilco "BOOKS" in five-foot neon letters means pictures are sold there.
eugene@ames.UUCP (Eugene Miya) (09/08/86)
Gawd, you guys on the net! Come on!
So VAX/Unix oriented, so blinded!
The Cray is a word oriented machine. Take out the factor of 8 from all
of your calculations. You can tell a person's thinking by the words they
choose. The Cray-2 is a 256 MW not a 2 GB machine. These are not the same
because the conversion is non trivial. Crays are word oriented machines.
Anyone who says "2 GB" is showing a great deal of naive. Get off your
VAXen an try Univacs (36-bit), IBM's inverted bit order, and other systems.
>From the Rock of Ages Home for Retired Hackers:
--eugene miya
NASA Ames Research Center
eugene@ames-aurora.ARPA
"You trust the `reply' command with all those different mailers out there?"
{hplabs,hao,nike,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene
jon@msunix.UUCP (09/09/86)
In article <12930@amdcad.UUCP>, philip@amdcad.UUCP (Philip Freidin) writes: > do word select) you have 32 address select lines coming into the second > level: address and address complement. Each of these must drive 32K AND > gates! I don't know of any logic family with a drive capability to support > that type of load. Your typical TTL has a drive capability of from 10 to 20 > loads. Also, another fly in your fast-decode ointment is that the way AND > gates are implemented in many logic families precludes building a 16-input > AND gate as a single level. CMOS is limited to about 4 inputs per gate, and TTL and > ECL have similar limits. To build bigger AND gates, you end up with a tree > structure inside your AND gate. Most people don't worry about the decoders inside DRAMs, but just what the DRAM looks like from the pins (timing, loads, etc). As a crude example, suppose you have a VME bus board with 100 in^2 and the P1 and P2 connectors. You are using 256K DRAMs and can fit 4Mb on each board. A1 thru A18 form the row and column address. That leaves A19 thru A23. A19 thru A21 can be the inputs to a 74AS138, and A22 and A23 can be the enables for the AS138 (it has two low enables, and one high). To put 16Mb in this system, you only need one more gate to enable the AS138 when both A22 and A23 are high. Okay, now add the P3 connector, 1Mb DRAMs, and twice as much real estate, so you can put 32Mb on a board. A2 thru A21 form the row and column address, A22 thru A24 go into an AS138, and A25 thru A31 go into a 74AS688. The AS688 can be used as an address comparator, and is nice because you can stick eight address jumpers next to it to set the board's base address. The AS688 has an eight input gate in it, as do a lot of the AS67x and AS68x parts. The output of the AS688 along with A1 control what the board puts on the bus. You can address 4Gb with this scheme, and none of this looks much like a tree. And there is a part here that has an eight input gate.
Add 8 more address lines, another AS688, and you've got 1Tb. This wouldn't be any slower to access than the 24 bit example. The point here is that you don't have to design to decode n address pins into 2^n signals. Your DRAMS take care of 18 or 20 of them, and you only need to decode as many as you have banks of memory on a board. The other address lines need only form a board select - one output only. "If we did it like everyone else, Jonathan Hue what would distinguish us from Via Visuals Inc. every other company in Silicon Valley?" sun!sunncal\ >!leadsv!msunix!jon "A profit?" amdcad!cae780/
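Hue's point is that board select is a single comparison, not a decode tree. A sketch of the idea (Python, for illustration; the field widths below — 20 low bits to the DRAMs, 3 bank bits, the rest matched against a jumper-set board base — are representative, not his exact pin assignments):

```python
def decode(addr, board_base, bank_bits=3, word_bits=20):
    """Comparator-style select: the '688 does one equality test on the
    high bits, the '138 picks a bank, and the DRAMs' internal decoders
    swallow the low bits. No decode tree anywhere on the board."""
    word = addr & ((1 << word_bits) - 1)                  # row/column address
    bank = (addr >> word_bits) & ((1 << bank_bits) - 1)   # '138 outputs
    board_selected = (addr >> (word_bits + bank_bits)) == board_base
    return board_selected, bank, word

# The board jumpered to base 9 responds; the board at base 8 ignores
# the very same address.
print(decode(0x4801234, board_base=9))      # (True, 0, 4660)
print(decode(0x4801234, board_base=8)[0])   # False
```

Widening the address just widens the comparator; the per-access delay stays one comparison plus one 3-to-8 decode, exactly as the post argues.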
guy@sun.uucp (Guy Harris) (09/09/86)
> Gawd, you guys on the net! Come on! > > So VAX/Unix oriented, so blinded! Oh, come off it. You damn well know that most machines out there, regardless of whether they run UNIX or are VAXes, are byte-oriented. The use of "byte" when talking about memory sizes has *NO* relation *WHATSOEVER* to a VAX or UNIX mindset. > The Cray is a word oriented machine. Take out the factor of 8 from all > of your calculations. You can tell a person's thinking by the words they > choose. The Cray-2 is a 256 MW not a 2 GB machine. These are not the same > because the conversion is non trivial. Why? Does the Cray-2 store characters one per word? Does it store *everything* one per word? Even if it does, that isn't enough to recommend that "word" be used in a general discussion of large main memories. "word" is a *useless* term for general discussions of memories, because it means a different amount of memory on different machines. This discussion is NOT a discussion of the Cray-2, it is a discussion of large main memories. As such, the byte is the appropriate unit to discuss, since it is the same size on almost all machines out there. > Crays are word oriented machines. You already said that. Merely stating this twice doesn't make it more interesting. Explain *why* this renders discussion of Cray memory sizes in bytes inappropriate. The PDP-10 is a word-oriented machine also; however, a 9-bit byte is a *very* appropriate unit for comparative discussions of memory size, since one (9-bit) byte holds a character (which can be addressed independently given a byte pointer), four bytes holds an integer, (byte) pointer, or single-precision floating-point number, and eight bytes holds a double-precision floating-point number - just like the VAX. The available range of values of these types are different from that on a VAX, but the difference is not enough to make a difference in gross discussions of memory sizes. > Anyone who says "2 GB" is showing a great deal of naive. 
That's "naivete", modulo various diacritical marks, and this statement needs a lot more defense than you've given it. If the problem is that a given data structure (tree, 2D array of floating-point numbers, etc) takes a different amount of memory on a Cray-2 than on another machine, then the appropriate unit in the discussion is *NOT* words, it's elements of said data structure. > Get off your VAXen an try Univacs (36-bit), IBM's inverted bit order, > and other systems. "*BIT* order"? What relevance has that? I presume you meant "*byte* order, in which case which "inverted byte order" do you mean? The byte order on the IBM PC is the same as that on the VAX, and I could see some future 80386-based IBM machine having lots of main memory. The 360/370 family, by the way, is byte-oriented and memory sizes are given in bytes.... Your assumption that the use of "bytes" in this discussion is an indication of VAX/UNIX tunnel vision is way off the mark. -- Guy Harris {ihnp4, decvax, seismo, decwrl, ...}!sun!guy guy@sun.com (or guy@sun.arpa)
srt@duke.UUCP (Stephen R. Tate) (09/10/86)
In article <12930@amdcad.UUCP>, philip@amdcad.UUCP (Philip Freidin) writes: > Unfortunately, at this point I would like to apply some reality to the > discussion. Rather than talk about your 40-bit address memories, let's > look at something trivial: 64KW. This needs 16 bits of address. With > your 2-level decode (one level of inverters, and a second of AND gates to > do word select) you have 32 address select lines coming into the second > level: address and address complement. Each of these must drive 32K AND > gates! I don't know of any logic family with a drive capability to support > that type of load. Your typical TTL has a drive capability of from 10 to 20 > loads. Also, another fly in your fast-decode ointment is that the way AND > gates are implemented in many logic families precludes building a 16-input > AND gate as a single level. CMOS is limited to about 4 inputs per gate, and TTL and > ECL have similar limits. To build bigger AND gates, you end up with a tree > structure inside your AND gate. > > --Philip Freidin First off, I was talking about decoding *bank* addresses, not individual word addresses. If you wanted 1GB of memory, and used 1Mb chips, you would have, say, 256 banks of 1Mb x 32 bit words. (If you have this much memory, I hope memory accesses are done more than a word at a time, but ignore this for now....) Now that's only 8 bits for a bank address, and I have seen 8-input NAND gates. (7430 or something like that....) Each of these bank address lines need only drive one input per bank (32 chips), which means that they only have to drive 256 inputs. Much less than your 32k figure, but still unreasonable. Obviously, the address lines need to be buffered. Using TTL with a fanout of, say, 16, you only need one level of buffering (since 16*16 = 256). Now you're three levels deep for a propagation delay of about 40-50ns. Still not a terribly unreasonable time. Anyway, another problem to consider is buffering all the address lines below the bank address lines.
These have to be run to every chip, and in the example above, there are 32*256 = 8192 chips in all. You're going to have to be real careful with buffering here..... So it's not the decode circuitry that takes time, it's the buffering for reasonable fan-out. Incidentally, CMOS has a *huge* fanout. That is, CMOS outputs to CMOS inputs (no mixing). -- Steve Tate ..!{ihnp4,decvax}!duke!srt
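Tate's three-level figure tallies like this (Python sketch, for illustration; 15 ns per level is an assumed ordinary-TTL gate delay chosen to land in his 40-50 ns estimate):

```python
import math

def decode_delay(loads_per_line, fanout=16, gate_ns=15):
    """Levels = address inverter + the wide NAND doing the decode, plus
    however many buffer ranks the fan-out limit forces; delay is levels
    times an assumed per-gate figure."""
    levels = 2                       # inverter + the 8-input NAND
    load = loads_per_line
    while load > fanout:
        load = math.ceil(load / fanout)
        levels += 1                  # one more rank of buffers
    return levels, levels * gate_ns

# 256 banks: each bank-address line drives 256 inputs, so one buffer
# rank suffices (16 * 16 = 256).
print(decode_delay(256))   # (3, 45)
```

Three levels and about 45 ns, consistent with the post; the expensive part is the buffering for fan-out, not the decode logic itself.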
jeff@gatech.CSNET (Jeff Lee) (09/10/86)
>Then there's GaAs... So fast you can spend a lot of time converting >to a different logic family. I like GaAs. Expensive, though. I know absolutely nothing about GaAs except that Seymour is planning to do his Cray-3 in it. What are the speeds and costs of some "typical" GaAs chips? What sort of power do they dissipate? What is the difficulty in processing GaAs as opposed to silicon? Also, is anybody doing anything with InP (Indium Phosphide) yet? -- Jeff Lee CSNet: Jeff @ GATech ARPA: Jeff%GATech.CSNet @ CSNet-Relay.ARPA uucp: ...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!jeff
jlg@lanl.ARPA (Jim Giles) (09/10/86)
In article <7094@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes: >... >With some reason. What you're saying is that because the operating-system >people are too lazy to devise paging algorithms that are useful for large >scientific programs, the programmers should be required to do it themselves. >Apart from the matter of constantly reinventing the wheel, there is also >the problem that it's a lot of work to get it right -- program reference >patterns are notorious for being hard to predict beforehand, which means >experimenting and then twiddling the code to match the results. It's not just that the operating system or hardware designers are too lazy to come up with a good scheme. The problem is that any scheme they DO come up with must work for general cases. That is, it can't take advantage of special knowledge of a specific algorithm. The individual applications programmer CAN take advantage of such knowledge. To be sure, this is an expensive and difficult programming project. But, if you've just spent $10-$20 million on a fast machine, you aren't going to balk at a few million more in programmer man-hours to get the speed that you shelled out so much cash for. Since hardware (for whatever reason) can run faster without virtual memory, there will always be a market among the high-end users for machines that don't have it. Since these are the sort of machines with the very large memories that we are talking about, I question the desirability of virtual memory on them. As a final note: there are some types of algorithm for which it is extremely easy to predict the data usage patterns. Most Finite-Difference and Finite-Element codes are of this kind. These applications involve a small number of very large arrays which are referenced cyclically. Other applications, like image manipulation and particle transport, have somewhat more difficult patterns, but there are known methods for dealing with them.
These are the types of codes which form the predominant workload for today's large memory supercomputers (whether bought by oil companies or government). Now, it might be convenient to implement virtual memory schemes which are useful in this context, but I doubt that the extra overhead in the memory interface would be justified - especially since the explicit methods for dealing with them are fairly easy to implement.
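The explicit technique Giles has in mind amounts to double buffering: compute on one chunk of the array while the next chunk is being read in. A toy version (Python threads standing in for 1986-style asynchronous I/O; the names and structure are my illustration, not taken from any real code):

```python
import threading
import queue

def process(chunks, compute):
    """Explicit double buffering: a reader thread fetches the next chunk
    while the main thread computes on the current one -- the hand-coded
    alternative to demand paging that Giles describes."""
    q = queue.Queue(maxsize=1)       # at most one chunk read ahead
    SENTINEL = object()

    def reader():
        for c in chunks:             # the "asynchronous I/O" running ahead
            q.put(c)
        q.put(SENTINEL)

    threading.Thread(target=reader, daemon=True).start()
    results = []
    while (c := q.get()) is not SENTINEL:
        results.append(compute(c))   # CPU works while reader runs ahead
    return results

print(process(iter([[1, 2], [3, 4]]), sum))   # [3, 7]
```

Because the program knows exactly which chunk it needs next, it never pays a demand-fault stall; that predictability is the whole argument.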
apc@cblpe.UUCP (Alan Curtis) (09/11/86)
In article <884@gilbbs.UUCP> mc68020@gilbbs.UUCP writes: > > Someone please correct me if I am wrong, but as I have been lead to >understand the situation, it will prove somewhat difficult to successfully >implement large physical memory systems on the order of 1Gb. The primary >impediment seems to be the delays caused by propagation delays in the >decoding trees. Anyone care to enlight me (us)? > > >tom keller "She's alive, ALIVE!" >{ihnp4, dual}!ptsfa!gilbbs!mc68020 Today I can buy 16 megabytes on one board given 256K-bit DRAMs. Tomorrow I should be able to buy the same board with 1M devices, giving me 64 megabytes on a card. 20 cards is then 1.28G. I think it would be possible.
mclase@watdaisy.UUCP (Michael Clase) (09/11/86)
In article <7331@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes: >In article <7094@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes: >>... >>With some reason. What you're saying is that because the operating-system >>people are too lazy to devise paging algorithms that are useful for large >>scientific programs, the programmers should be required to do it themselves. >>Apart from the matter of constantly reinventing the wheel, there is also >>the problem that it's a lot of work to get it right -- program reference >>patterns are notorious for being hard to predict beforehand, which means >>experimenting and then twiddling the code to match the results. > >It's not just that the operating system or hardware designers are too lazy >to come up with a good scheme. The problem is that any scheme they DO come >up with must work for general cases. That is, it can't take advantage of >special knowledge of a specific algorithm. > Rather than having the user explicitly implement a suitable paging algorithm for his program, couldn't the operating system have a facility like the UNIX vadvise system call? According to the vadvise man page, the call vadvise(V_ANOM) warns the pager that LRU is not a suitable algorithm for this particular job. Perhaps this could be expanded to include calls like vadvise(V_CYCLIC) to indicate that the program wants to cyclically reference large arrays. Of course, the difficulty would be for the pager to work out which pages corresponded to these arrays, particularly in the case of two or more arrays which are traversed in parallel. Michael Clase mclase@watdaisy.uucp (for one more week)
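Clase's hypothetical vadvise(V_CYCLIC) would pay off because, for a cyclic sweep, evicting the *most* recently used page (the one needed latest) beats LRU dramatically. A toy pager (Python, illustration only; V_CYCLIC is the article's proposal, not a real BSD flag):

```python
def faults(refs, frames, evict_mru=False):
    """Tiny pager: LRU by default; with evict_mru=True it behaves as a
    pager might after a hypothetical vadvise(V_CYCLIC) hint, evicting
    the page that a cyclic sweep will need latest (the most recent one)."""
    mem = []                         # ordered least- to most-recently used
    n = 0
    for p in refs:
        if p in mem:
            mem.remove(p)
            mem.append(p)            # hit: becomes most recently used
        else:
            n += 1                   # fault
            if len(mem) >= frames:
                mem.pop(-1 if evict_mru else 0)
            mem.append(p)
    return n

# 5 sweeps over 10 pages in 9 frames: LRU faults on every reference,
# the cyclic-aware policy faults only a handful of times per sweep.
refs = list(range(10)) * 5
print(faults(refs, 9), faults(refs, 9, evict_mru=True))   # 50 14
```

The hard part, as the article notes, is that the pager must know *which* pages belong to the cyclically-swept arrays; per-region advice is exactly what the later madvise interface provides.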
tuba@ur-tut.UUCP (Jon Krueger) (09/11/86)
In article <7331@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes: >It's not just that the operating system or hardware designers are too lazy >to come up with a good scheme. The problem is that any scheme they DO come >up with must work for general cases. That is, it can't take advantage of >special knowledge of a specific algorithm....The individual applications >programmer CAN take advantage of such knowledge. To be sure, this is an >expensive and difficult programming project...[but] there are some types of >algorithm for which it is extremely easy to predict the data usage >patterns. Now, it might be convenient to implement virtual memory schemes >which are useful in this context, but I doubt that the extra overhead in the >memory interface would be justified - especially since the explicit methods >for dealing with them are fairly easy to implement. 1) Doubtless you can show me cases where it's "extremely easy" to predict data usage patterns. Can you show me one where it's easy to predict the code usage patterns? I want VM to free me from overlays, or code space management, not file structuring, or data space management. >if you've just spent $10-$20 million on a fast machine, you aren't going >to balk at a few million more in programmer man-hours to get the speed >that you shelled out so much cash for. 2) Can we measure the win? Can you provide figures on performance improvements for either data or code space management by application programmers over mechanisms provided by operating system or hardware designers? Have you any actual examples, how much was the improvement for a specific application you're familiar with? Can we state a general rule, expected returns on applications programmers managing their own code and/or data spaces? Can you state the breakeven point, how many millions of dollars I should be willing to spend before I improve on the operating system's paging? 
-- jon -- --> Jon Krueger uucp: {seismo, allegra, decvax, cmcl2, topaz, harvard}!rochester!ur-tut!tuba Phone: (716) 275-2811 work, 473-4124 home BITNET: TUBA@UORDBV USMAIL: Taylor Hall, University of Rochester, Rochester NY 14627
phil@amdcad.UUCP (Phil Ngai) (09/11/86)
In article <8546@duke.duke.UUCP> srt@duke.UUCP (Stephen R. Tate) writes: >Incidentally, CMOS has a *huge* fanout. That is, CMOS outputs to CMOS >inputs (no mixing). CMOS has a huge DC fanout. The AC fanout is somewhat less, depending on the level of performance you demand. -- Rain follows the plow. Phil Ngai +1 408 749 5720 UUCP: {ucbvax,decwrl,ihnp4,allegra}!amdcad!phil ARPA: amdcad!phil@decwrl.dec.com
stever@videovax.UUCP (Steven E. Rice) (09/12/86)
In article <8546@duke.duke.UUCP>, Stephen R. Tate (srt@duke.UUCP) writes: >> [ comments by Philip Freidin on decoder tree structure deleted -- S. Rice] > > First off, I was talking about decoding *bank* addresses, not individual > word addresses. . . . Now that's only 8 bits for a bank address, and I > have seen 8 input NAND gates. (7430 or something like that....) . . . If you're going to design large memories, decode them *fast*. Delays for a 74AS30 are 5 ns max over temperature and +/- 10% Vcc (but 50 pf/500 Ohm load). Fanning out to 16 gates will push this out a bit (but not much). > . . . Each of these bank > address lines need only drive one input per bank (32 chips), which means > that they only have to drive 256 inputs. Much less than your 32k figure, > but still unreasonable. Obviously, the address lines need to be buffered. > Using TTL with a fanout of, say, 16, you only need one level of buffering > (since 16*16 = 256). Now you're three levels deep for a propogation delay > of about 40-50ns. Still not a terribly unreasonable time. With currently-available logic, you should be able to go 3 levels in not much more than 15 ns (oh, Lattice, where are those 10 ns GALs??). > . . . > Incidentally, CMOS has a *huge* fanout. That is, CMOS outputs to CMOS > inputs (no mixing). CMOS has a huge fanout at DC. . . As you try to do things fast, the capacitive loading of the inputs becomes the dominant factor. If you hang a whole bunch of inputs on one CMOS output, the rise and fall times become seriously degraded. Some of the newer CMOS is quite capable, though -- output drive capabilities that equal or exceed those of most bipolar circuits. Steve Rice ---------------------------------------------------------------------------- {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever
jlg@lanl.ARPA (Jim Giles) (09/16/86)
In article <676@ur-tut.UUCP> tuba@ur-tut.UUCP (Jon Krueger) writes: >... >1) Doubtless you can show me cases where it's "extremely easy" to predict >data usage patterns. Can you show me one where it's easy to predict the >code usage patterns? I want VM to free me from overlays, or code space >management, not file structuring, or data space management. In this context, it is important to note that code memory is ALWAYS trivially small compared to data memory for scientific codes. This is NOT just conjecture, the memory used by code has been less than a few percent of the total memory requirement for every scientific program I've ever seen. Furthermore, code usage patterns in these programs ARE easy to determine for the same reason that the data usage patterns are - the thing is in a large time-loop: it does physics on the grid, then it rezones (if necessary), then it dumps a graphic description of the grid (if requested), then it goes back for the next time step... To be sure, people use overlays to save even the small space required by code. But this is fairly trivial - the physics, rezone, and graphics subroutines are the ones to be overlayed. And those programs which can fit in main memory ARE NOT PENALIZED BY THE PRESENCE OF ADDITIONAL OVERHEAD IN THE MEMORY INTERFACE!!! There really are applications which don't benefit very much from a VM system. This will remain true as long as VM systems add ANY overhead at all to the memory interface. J. Giles Los Alamos
jlg@lanl.ARPA (Jim Giles) (09/16/86)
In article <676@ur-tut.UUCP> tuba@ur-tut.UUCP (Jon Krueger) writes: >... >1) Doubtless you can show me cases where it's "extremely easy" to predict >data usage patterns. Can you show me one where it's easy to predict the >code usage patterns? I want VM to free me from overlays, or code space >management, not file structuring, or data space management. Another thing to remember here is that most of the large scale scientific codes that run on large memory supercomputers have existed for many years. As a result, the usage patterns (both of data and code) are well known. These already contain sophisticated memory management routines that are tailored for the specific code. When moving to a new and larger machine, these usage patterns don't change - just the scale of the memory involved. The main problems in porting these codes to a new machine tend to be numerical (different arithmetic on the new hardware), or incompatibilities in the language (the compiler on the new machine recognizes a different set of Fortran extensions). The result of all this is that the people who do large scale scientific computing have already resolved the memory management problems. A VM system would not save them any time; on the contrary, it would require additional work in trying to rearrange code and data usage patterns to minimize page faulting. J. Giles Los Alamos
hutch@sdcsvax.UUCP (Jim Hutchison) (09/16/86)
In article <7331@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes: >to come up with a good scheme. The problem is that any scheme they DO come >up with must work for general cases. That is, it can't take advantage of >special knowledge of a specific algorithm. > >The individual applications programmer CAN take advantage of such knowledge. vmadvise() ??? I don't think BSD ever got it fully off of the ground, but it has interesting applications to you. It was a way of sharing your specific paging knowledge which is bought by time, with the OS. This will not save you from page translation cost, but it does allow you to advise the OS on differences your program has from the general case. It also allows the OS to play any strange games that can be played to squeeze those extra pennies out of your XXXXX. -- Jim Hutchison UUCP: {dcdwest,ucbvax}!sdcsvax!hutch ARPA: Hutch@sdcsvax.ucsd.edu "The fog crept in on little cats feet" -CS
lamaster@nike.uucp (Hugh LaMaster) (09/17/86)
Cray users often like to boast that very large main memories such as the Cray 2's obviate the need for virtual memory. J. Giles stated recently that Cyber 205 users "turn off" virtual memory. I work at a site which has a Cray X-MP, a Cyber 205, and a Cray-2 (Ames Research Center). I would like to state for the record that there are considerable advantages to a machine having a memory-mapping virtual-memory architecture, even though there is no apparent single-job performance advantage. Picture a very large (almost all of real memory) batch job running on a Cray. Suppose that there is some interactive debugging going on (the Cyber 205 VSOS operating system and the Cray CTSS operating system are both interactive and have very nice interactive debuggers). The large batch job will have to be written out to disk in its entirety on the Cray before the small interactive job can be rolled in. That might take tens of seconds on a Cray 2. What if only a few pages had to be paged out, as on the Cyber 205? The batch job would proceed with only a few pages missing, and the total swap time in any case would be only a few tenths of a second. There are many hypothetical examples which can be imagined which are confirmed, in my experience, in real life: All other things being equal, virtual memory systems have better throughput than non-virtual-memory systems of equivalent memory size and CPU speed. Virtual memory systems are better suited to mixed large batch and interactive loads, and are capable of better response time at an equivalent overhead than non-virtual-memory systems. Anyone who has had to deal with setting policies for user memory allocations on a Seymour Cray machine (from CDC 6600 days to the Cray 2) will know what kind of trade-offs are necessary and why these machines have often been run batch-only with users limited to half of the overall available memory.
Now, a reasonable question to ask on a fast machine is: how much CPU real estate is going to be consumed by dynamic address translation hardware? Added logic is going to slow a very fast machine down. Cray has kept his machines very simple architecturally, and has a much smaller number of gates than other comparable machines (e.g. Cray 1's and 2's run between 600,000 and 700,000 gates per CPU, while the Cyber 205 has about 1.3 million, roughly twice as many), which may help explain why Cray has always had the fastest clock in the supercomputer business (another factor is that Cray is obviously a packaging genius). Conclusion: a computer architect always has to trade off design features to build a real machine, and in some cases virtual memory may have to be traded off; but in the real world of operating a machine, virtual memory is a significant advantage, other things being equal. Hugh LaMaster, m/s 233-9, UUCP: {seismo,hplabs}!nike!pioneer!lamaster NASA Ames Research Center ARPA: lamaster@ames-pioneer.arpa Moffett Field, CA 94035 ARPA: lamaster%pioneer@ames.arpa Phone: (415)694-6117 ARPA: lamaster@ames.arc.nasa.gov "The reasonable man adapts himself to the world, the unreasonable man adapts the world to himself, therefore, all progress depends on the unreasonable man." -- George Bernard Shaw ("Any opinions expressed herein are solely the responsibility of the author and do not represent the opinions of NASA or the U.S. Government")
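LaMaster's roll-out arithmetic can be sketched with hedged assumptions (the transfer rate and page size below are my guesses for illustration, not Ames figures):

```python
# Rolling out a whole-memory job vs. paging out a handful of pages.
MB = 10**6
job_bytes = 2048 * MB      # a batch job filling a 2 GB Cray-2 memory
page_bytes = 65536         # assumed page size
rate = 100 * MB            # assumed aggregate disk transfer rate, bytes/s

full_swap = job_bytes / rate            # write the entire job to disk
few_pages = 100 * page_bytes / rate     # page out ~100 pages instead
print(f"{full_swap:.1f} s vs {few_pages * 1000:.1f} ms")
```

Even with a generous channel rate, the whole-job roll-out lands in the "tens of seconds" LaMaster cites, while demand-paging a hundred pages is well under his "few tenths of a second".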
pmontgom@sdcrdcf.UUCP (Peter Montgomery) (09/18/86)
I once was a system programmer for a CDC 7600 site. We ran many scientific programs, but lacked virtual memory. Some programs used almost all of memory. These jobs might run faster than the same ones would with less memory, but overall system performance suffered. For example, a compilation might not fit in memory alongside a huge scientific job. If the compilation runs alone, the CPU will be idle much of the time it runs. Yet the scientific program might have considerable unused code and/or data (e.g., the input data for this run does not select Calcomp plots, so those routines are never called and their data areas are never referenced; another likely possibility is that some dimensions are much larger than required). If either job runs alone, then it wants as much memory as it can get. With virtual memory, the operating system can simultaneously load the heavily used parts of BOTH jobs in the machine, for better overall performance.
--
Peter Montgomery
{aero,allegra,bmcg,burdvax,hplabs,ihnp4,psivax,randvax,sdcsvax,trwrb}!sdcrdcf!pmontgom
Don't blame me for the crowded freeways - I don't drive.
jlg@lanl.ARPA (Jim Giles) (09/24/86)
In article <2077@sdcsvax.UUCP> hutch@sdcsvax.UUCP (Jim Hutchison) writes:
>...
>vmadvise() ??? I don't think BSD ever got it fully off of the ground, but
>it has interesting applications to you. It was a way of sharing your
>specific paging knowledge which is bought by time, with the OS. This will
>not save you from page translation cost, but it does allow you to advise
>the OS on differences your program has from the general case. It also allows
>the OS to play any strange games that can be played to squeeze those extra
>pennies out of your XXXXX.
>
My documentation lists this as VADVISE(). It has only two settings: one optimizes for random access order (that is, it ignores recent usage patterns), and the other (the default) optimizes for the most frequently and recently referenced pages. The man page notes that the feature was included mainly to support LISP, which has fairly random access patterns. This doesn't give the kind of control over memory contents that I am used to. It certainly doesn't seem likely to work well for an algorithm that mixes some 'randomly' used data, some cyclically used data, and some frequently used data.

The VADVISE() documentation also mentions (under 'BUGS') that this 'Will go away soon, being replaced by a per page *madvise* facility.' MADVISE looks like it will offer more control over memory contents (though still not complete control). It also looks like it will take the same amount of work to use effectively as direct user control would.

So the bottom line is: we still don't have complete control, we still need to do a lot of work on our own, and we still have a slower memory interface than necessary.

J. Giles
Los Alamos
jlg@lanl.ARPA (Jim Giles) (09/24/86)
In article <609@nike.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>Cray users often like to boast that very large main memories such as the
>Cray 2 obviate the need for virtual memory. J. Giles stated recently that
>Cyber 205 users "turn off" virtual memory. I work at a site which has a
>Cray X-MP, a Cyber 205, and a Cray-2 (Ames Research Center). I would like
>to state for the record that there are considerable advantages to having a
>machine have memory mapping virtual memory architecture, even though there
>is no apparent single job performance advantage.
>
>Picture a very large (almost all of real memory) batch job running on a Cray.
>Suppose that there is some interactive debugging going on (The Cyber 205 VSOS
>operating system and the Cray CTSS operating system are both interactive and
>have very nice interactive debuggers.) The large batch job will have to be
>written out to disk in entirety on the Cray before the small interactive job
>can be rolled in. That might take tens of seconds on a Cray 2. What if only
>a few pages had to be paged out as on the Cyber 205? The batch job would
>proceed with only a few pages missing, and the total swap time in any case
>would be only a few tenths of a second. There are many hypothetical examples
>which can be imagined which are confirmed, in my experience, in real life:

You obviously don't have enough Crays! Now if you had 3 X-MPs (like we do) you could configure one for interactive use and the others for batch use :-).

Seriously: yes, there is a trade-off between throughput, interactivity, and single-process speed. We have to limit our maximum job size to no more than 3/4 of total memory during the day to allow for interactive processes. We also reduce the maximum memory residency time during the day. We wouldn't have to do this with a virtual memory system (nor with a totally batch operating system).
During the night, when most large-scale programs are run, the operating system is reconfigured to optimize the speed of individual codes. On a virtual memory system this is not possible (at least not completely - you can't truly rid yourself of the overhead inherent in the VM interface). It is the fast turn-around on large night jobs that is attractive to our users - the ability to also debug interactively during the day is a useful bonus.

New Crays are coming out now with SSD (Solid State Disk) memory devices. The increased speed of these devices, compared to disk, might make VM seem attractive once more. Seymour's attitude is (what else): if it's made of solid-state memory, why not make it part of the central memory of the machine? The Cray 3 is not currently expected to have an SSD, but it is expected to fill the entire 32-bit address space with central memory (that's 4 GW, or 32 GB of 64-bit words). Seymour thinks 32 bits is too small for an address!

This is another problem with virtual memory - central memory is starting to get cheaper and bigger than disk memory. The biggest disk drive available with a Cray these days is the DD-49 (it holds about 151 MW). One memory image of the 4 GW Cray 3 would fill 27 DD-49s to capacity. What are you going to operate virtual memory out of?

J. Giles
Los Alamos
jlg@lanl.ARPA (Jim Giles) (09/24/86)
In article <3013@sdcrdcf.UUCP> pmontgom@sdcrdcf.UUCP (Peter Montgomery) writes:
>
> I once was a system programmer for a CDC 7600 site. We ran many
>scientific programs, but lacked virtual memory. Some programs used almost
>all of memory. These jobs might run faster than the same ones would with
>less memory, but overall system performance suffered. For example, a
>compilation might not fit in memory alongside a huge scientific job. If
>the compilation runs alone, the CPU will be idle much of the time it runs.
>Yet the scientific program might have considerable unused code and/or data
>(e.g., the input data for this run does not select Calcomp plots, so those
>routines are never called and their data areas are never referenced;
>another likely possibility is that some dimensions are much larger than
>required). If either job runs alone, then it wants as much memory as it
>can get. With virtual memory, the operating system can simultaneously
>load the heavily used parts of BOTH jobs in the machine, for better
>overall performance.

Yes, I agree. The 7600 probably would have been better off with virtual memory. It had at most 262 KW of memory (the address bus was 18 bits; I never saw a 7600 configured with that much). These days 262 KW is only one 64th of the available memory on a 16 MW machine. You can afford to load 50-100 KW of unused code into memory; it's less than one percent of memory.

The times have changed: the Calcomp plotting routines still take up the same amount of space, but the total memory has grown 100-fold (soon to be 1000-fold). The code for the scientific part of the calculation hasn't changed much either. The extra memory is being used by data, not code. And it is the data in a scientific code that gets the most use (and the largest part of the data gets the largest use - the grid, mesh, lattice, particle descriptions, etc.). It doesn't trouble me to load unused code; it takes up so little space anyway.
To be sure, there is still a trade-off between throughput and individual job speed. I claim that there are types of applications in which this trade-off should be resolved in favor of job speed. In this kind of application, virtual memory is not desirable, and it is getting less desirable as machine speeds and central memory sizes increase.

J. Giles
Los Alamos
lamaster@nike.uucp (Hugh LaMaster) (09/24/86)
One important point that I forgot to mention in my previous posting: on a machine with memory mapping (the Cyber 205, for example) there is only a very small penalty for reclaiming a small amount of memory for another task. On a very large main memory system this matters. A 256 MW Cray-2 (with approx. 4 ns clock) would take about 1 second to completely copy main memory. Unfortunately, that is exactly what is required to reclaim memory on the Cray-2, or on any other machine which requires a job's memory to be contiguous. This is an important contribution to system overhead even in batch mode, and it can become a major bottleneck. Combined with the problems already mentioned, it yields even poorer effective memory utilization, because there is a limit on how frequently memory can be compacted to make free space usable. That there is no such problem on a virtual memory machine is an additional benefit of virtual memory.

J. Giles is correct in stating that if single job STEP (or single program) speed is the most important criterion, there is probably no advantage to a virtual memory machine, as long as the individual job steps take hundreds of seconds or more. And, as stated earlier, there is a price to pay for virtual memory: the extra CPU logic it takes to implement, which undoubtedly slows the CPU by some amount.

It may be interesting to note that CDC gave the future existence of very large main memories as one of the reasons for designing the virtual memory architecture of the new Cyber 800 and 900 series (NOS/VE) machines as they did. They stated that management of main memory would be too inefficient without it.
Hugh LaMaster, m/s 233-9, UUCP: {seismo,hplabs}!nike!pioneer!lamaster NASA Ames Research Center ARPA: lamaster@ames-pioneer.arpa Moffett Field, CA 94035 ARPA: lamaster%pioneer@ames.arpa Phone: (415)694-6117 ARPA: lamaster@ames.arc.nasa.gov "The reasonable man adapts himself to the world, the unreasonable man adapts the world to himself, therefore, all progress depends on the unreasonable man." -- George Bernard Shaw ("Any opinions expressed herein are solely the responsibility of the author and do not represent the opinions of NASA or the U.S. Government")
cdshaw@alberta.UUCP (Chris Shaw) (09/25/86)
In article <7832@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>So the bottom line is: we still don't have complete control, we still
>need to do a lot of work on our own, and we still have a slower memory
>interface than necessary.
>
..and we still don't have any numbers to back up this or any other position.

Basically, Jim's argument has been "our machines are too big, the users love performance, and they do weird things, so VM is no go". The pro-VM people are saying: use (or invent) a call to tell the VM system what to do, and thereby solve the strange-usage-pattern problem. This gives the users much more flexibility. The flexibility comes at the price of a small performance hit, but it is worth the cost. There is the added benefit that the system may cost less, since secondary store is cheaper than high-speed primary store.

I suppose the question really is: how much primary memory do you need, versus how much can you get away with? A previous article mentioned Cray wanting a full 16 MW of primary memory. Plenty of cash mo-nee, since the memory probably has to run fast. Is this extra cost worth it? In the infinite-wallet world of defence research, probably. In the "real", tight-budget world, bang per buck matters more, and the marginal speed improvement of (say) full-address-space core might not pan out in the face of more reasonable alternatives. But then, I'm talking through my hat, too.

In any case, there are two major positions here because there are two types of budgets to consider. Jim is in the world of "performance at any cost", while lots of other people are into "performance at a reasonable price".

>J. Giles
>Los Alamos

Chris Shaw cdshaw@alberta
University of Alberta
Bogus as HELL !
franka@mmintl.UUCP (Frank Adams) (09/25/86)
In article <7839@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>Seymour [Cray] thinks 32 bits is too small for an address!

He's right, too.

Frank Adams ihnp4!philabs!pwa-b!mmintl!franka
Multimate International 52 Oakland Ave North E. Hartford, CT 06108
jlg@lanl.ARPA (Jim Giles) (09/25/86)
In article <627@nike.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>One important point that I forgot to mention in previous posting:
>On a machine with memory mapping (Cyber-205 for example) there is only a very
>small penalty for reclaiming a small amount of memory for another task. On a
>very large main memory system, this is an important point. A 256 MW Cray-2
>(with approx. 4 ns clock) would take 1 second to completely copy main memory.
>Unfortunately, this is exactly what is required to reclaim memory on the Cray-2
>or any other machine which requires memory to be contiguous.
>...

This is not really true. Nothing inherent in a non-VM system requires swapping the whole large program out to make room for the smaller one. The only absolute requirement is that the large program be entirely resident while it is running. As it happens, the systems currently running on Cray machines drop entire memory images when swapping for space; but this is not a requirement, only a simplification of the operating system's duties.

J. Giles
Los Alamos
jjw@celerity.UUCP (Jim ) (09/30/86)
The "anti-virtual" memory discussions seem to be concentrating on whether supercomputers need virtual memory. Note that these machines are in a sense special purpose* machines with the following characteristics:

 - They are pushing the "state of the art" in memory and processor design.
 - They are intended for large scale, vectored, mostly floating point calculations.
 - They are expensive to purchase and operate.
 - They are usually purchased to support a few computationally intensive applications (which may run for hours or days even on a Cray).

In this environment virtual memory is probably a hindrance rather than a help. However, as time passes, larger memories and faster processors will be available for more conventional general purpose computers. I believe that virtual memory will be essential for managing the larger memories in many of the environments in which these systems will be used.

---------
* I know they are general purpose in the sense that they can perform any application which any other "general purpose" machine can perform. But how many people purchase a Cray to do timesharing, text editing, or business EDP? And if they do, what do you suppose Seymour Cray's answer would be to someone who complains about the number of users who can get good emacs response?
jlg@lanl.ARPA (Jim Giles) (10/12/86)
In article <589@celerity.UUCP> jjw@celerity.UUCP (Jim (JJ) Whelan) writes:
>The "anti-virtual" memory discussions seem to be concentrating on whether
>Supercomputers need virtual memory. Note that these machines are in a sense
>special purpose* machines ...
>...
>However, as time passes, larger memories and faster processors will be
>available for more conventional general purpose computers. I believe that
>virtual memory will be essential for the management of the larger memories
>in many of the environments in which these systems will be used.
>
This has been my point all along. I never claimed that virtual memory was universally bad, only that it is counterproductive in SOME applications. The main opposition to this view has come from those supporters of VM who think that all applications are better off with VM.

>---------
>* I know they are special purpose in the sense that they can perform any
> application which any other "general purpose" machine can perform. But,
> how many people purchase a Cray to do timesharing, text editing or
> business EDP? And if they do what do you suppose Seymour Cray's answer
> would be to someone who complains about the number of users who can get
> good emacs response?

It is interesting to note, however, that Crays have a better price-performance ratio, even on these tasks, than VAXEN do. But then, nearly everything has a better price-performance ratio than a VAX!

J. Giles
Los Alamos