pardo@june.cs.washington.edu (David Keppel) (07/22/88)
[ Comparisons of Amdahls, Crays, 68Ks, and an 11/750 ]
[ Cray 1Gb, Amdahl 256Mbyte ]

I believe (please correct me if I'm wrong ...) that the Cray's memory
limit is ~750 Mword (* 8 bytes/word = ~6Gb) but that few machines have
anywhere near this much.  More to the point, I also believe that the
Crays don't have virtual memory (because it slows down the computer!)
while the Amdahls do.

Relevant (really?) question: Does it make more sense to buy a little
bit of very fast memory and slow it down with virtual memory, or to
buy a whole bunch of fast physical memory and slow it down by putting
it farther away?  (Assume: $ is no problem.)  Obviously the answer
depends on the access patterns (and dataset size) of the programs
being run.  I wonder if anybody has insight on this?

	;-D on  ( Registers for me )  Pardo
dik@cwi.nl (Dik T. Winter) (07/22/88)
In article <5342@june.cs.washington.edu> pardo@uw-june.UUCP (David Keppel) writes:
 > More to the point, I also believe that the Crays don't have virtual
 > memory (because it slows down the computer!) while the Amdahls do
 > (have virtual memory).
 >
 > Relevant (really?) question: Does it make more sense to buy a little
 > bit of very fast memory and slow it down with virtual memory, or to
 > buy a whole bunch of fast physical memory and slow it down by putting
 > it farther away? (Assume: $ is no problem). Obviously the answer
 > depends on the access patterns (and dataset size) of the programs
 > being run. I wonder if anybody has insight on this?

The major problem with virtual memory on vector machines is that you
get paging interrupts during the execution of an instruction.  The CDC
205 has virtual memory, and there are problems.  Let me explain a bit
how it works on the 205.

The machine (of course) maintains a page table, mapping virtual to real
memory.  You do not want to interrupt a vector instruction when a
memory access crosses a page boundary, so the machine has 16
associative registers that hold the mapping entries for the 16 most
recently accessed pages.  Whenever a vector instruction crosses a page
boundary into a page whose mapping information is in the associative
registers, the next page of real memory is easily found, and the
instruction continues without interrupt (all translation etc. is done
during the buffering and unbuffering the 205 performs in its pipes).
However, if the crossing is into a page whose information is not in the
associative registers, the mapping entry has to be found in memory.
This involves interrupting the instruction, draining the pipes, saving
state, reading the mapping info, and restarting the instruction.  That
takes a lot of time.

The 205 has two different page sizes: large pages of 65536 words (8
bytes/word) and small pages of (site-selectable) 512, 2048 or 8192
words.
The number of associative registers is 16, and these are shared amongst
the jobs on a system.  It appears that the selection of the small page
size is very critical.  I have a small (~10 lines) program that runs 2
times as fast on a 1-pipe 205 with small pages of 2048 words as on the
same machine with small pages of 512 words.  This is all due to page
boundary crossings.  (Oh, we also have a single instruction that takes
90 seconds to complete; too long for the timer to handle.)

So what this amounts to is: having virtual memory requires address
translation.  This in turn requires page tables, and part of these (or
all?) need to be in very fast (associative) registers.  Associative,
because there is no time to do a search if you do not want to drain
the pipes.

The Cray on the other hand addresses all its memory directly, so no
address translation is needed and no vector instruction is ever
interrupted.  Strangely enough, the Cray has a maximal vector length
of 64, and all instructions except load/store go through registers.
The 205 on the other hand has only vector instructions that go from
memory to memory, and the maximal vector size is 65535.  So my
arguments above would imply that you could have a Cray with virtual
memory, but not a 205!

Another point about VM on vector processors: it makes you think the
machine is large enough to handle your problem, while it will mostly
be thrashing pages.  A theoretical example for a 205 with 1 Mword of
memory: try to multiply two 1024*1024 matrices to get a third.  The
program will be accepted; CP time will be something like 60 seconds.
Only the paging will take about 1 year (disk access only, and the
disks are fast).
--
dik t. winter, cwi, amsterdam, nederland
INTERNET : dik@cwi.nl
BITNET/EARN: dik@mcvax
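[A back-of-envelope sketch of Dik's "about 1 year" figure.  The 20 ms
per disk-serviced page fault is my assumption, not from the posting;
the worst case assumed is a naive multiply in which every inner-loop
access to the column-strided operand lands on a different small page.]

```python
# Sanity check of the 1024x1024 matmul-on-1-Mword thrashing example:
# three 1 Mword matrices cannot fit in 1 Mword of memory, and a
# column-strided operand faults on (nearly) every access.

N = 1024                  # matrix dimension
faults = N * N * N        # worst case: one page fault per inner-loop access
disk_access = 0.02        # assumed: ~20 ms to service a fault from disk

seconds = faults * disk_access
years = seconds / (365 * 24 * 3600)
print(f"{faults} faults -> {seconds:.0f} s ~= {years:.2f} years")
```

With these assumed numbers the paging alone comes out on the order of
a year, consistent with the estimate in the post.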
dre%ember@Sun.COM (David Emberson) (07/22/88)
> Relevant (really?) question: Does it make more sense to buy a little
> bit of very fast memory and slow it down with virtual memory, or to
> buy a whole bunch of fast physical memory and slow it down by putting
> it farther away? (Assume: $ is no problem).

If money is no object, it makes sense to buy a ton of memory AND have
VM.

This may be apocryphal, but I have been told that Seymour Cray,
replying to the question of why the Cray 1 did not have virtual memory,
replied, "Because I don't understand it."

With virtual caches, VM does not cause a performance penalty worth
mentioning.  Even on some machines with physical caches, address
translation can take place in parallel with the cache tag access --
thus no penalty.

			Dave Emberson (dre@sun.com)
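[The overlap Emberson mentions works whenever the cache index comes
entirely from the page-offset bits, which are identical in the virtual
and physical address.  A sketch with illustrative sizes of my own
choosing, not any particular machine's:]

```python
# If (line-offset bits + index bits) <= page-offset bits, the cache set
# can be selected from the untranslated low bits while the TLB looks up
# the high bits in parallel -- so translation adds no latency.

PAGE_BITS = 12            # assumed 4 KB pages: low 12 bits untranslated
LINE_BITS = 5             # assumed 32-byte cache lines
SETS = 128                # assumed 128 sets -> 7 index bits

index_bits = SETS.bit_length() - 1
overlap_ok = LINE_BITS + index_bits <= PAGE_BITS   # here: 5 + 7 <= 12

vaddr = 0x0040_3A64
page_offset = vaddr & ((1 << PAGE_BITS) - 1)       # same in vaddr and paddr
cache_set = (page_offset >> LINE_BITS) % SETS
print(overlap_ok, hex(page_offset), cache_set)
```

Grow the cache beyond what fits in the page-offset bits (without more
associativity) and the index starts depending on translated bits, and
the free overlap is lost.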
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (07/22/88)
In article <7588@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>The major problem with virtual memory on vector machines is that you get
>paging interrupts during the execution of an instruction.  The CDC 205
>has virtual memory, and there are problems.

I beg to differ.  I have seen no "problems" with virtual memory on the
Cyber 205, other than those arising from:

1)  The confusion that users sometimes experience when they have a
    larger set of facilities to choose from, or,

2)  Problems which are exactly the same on Crays - namely, there is
    never enough real memory for some people and their programs.

>the instruction.  That takes a lot of time.  The 205 has two different
>page sizes, large pages of 65536 words (8 bytes/word) and small pages
>of (site selectable) 512, 2048 or 8192 words.  The number of
>associative registers is 16, and these are shared amongst the jobs on
>a system.  It appears that the selection of small page size is very
>critical.  I have a small (~10 lines) program that will run 2 times

The Cyber 205 does have a "problem" because of its memory to memory
vector instruction set, as opposed to Cray's vector registers.  The
problem is/was that vector startup time was fairly long on the Cyber
205.  This problem appears to have been solved on the ETA-10 with
better overlap, etc., and short vector performance seems to be much
better.  (Still the same memory to memory architecture.)  The "problem"
with small page sizes is actually an installation option intended to
benefit installations with a relatively small amount of main memory.
The solution, if you have more memory, is to use a larger small page
size.

>The Cray on the other hand addresses all its memory directly, so no
>address translation is needed and no vector instruction interrupt.

It is true that Cray vector instructions are atomic, and those on the
205 are restartable, but the context is saved quite efficiently on the
205.  A complete context switch actually takes fewer cycles on the 205
than it does on the Cray 1/X/Y-MP's, and many fewer than on the Cray-2.
So the assumption that virtual memory OR memory to memory vector
instructions cause long context switch times relative to Cray is not
correct.  As stated above, the "price" of the 205 architecture was poor
short vector performance and, of course, the extra hardware that
virtual memory consumes (a virtual memory MMU takes up a LOT more space
than a non-virtual MMU - I think it is worth it, but others
disagree...).  I note also that a benefit of the 205 architecture is
excellent long vector performance.

>to memory, and the maximal vector size is 65535.  So my arguments
>above would imply that you could have a Cray with virtual memory,
>but not a 205!

There is, in fact, no reason why a Cray-type architecture can't have
virtual memory.  In fact, a number of vector machines built by other
companies have done approximately that.  The real debate, in my
opinion, is not between virtual and non-virtual (virtual won a long
time ago, in my opinion - Cray is an anachronism in this respect) but
between a memory to memory pipeline and a vector register architecture.
But these are only two of many proposed possibilities, and some good
performing machines have been built with other architectures entirely.
None have been commercially successful yet, but I would not assume that
that will always be the case.
--
  Hugh LaMaster, m/s 233-9,    UUCP ames!lamaster
  NASA Ames Research Center    ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035      Phone:  (415)694-6117
sharma@mit-vax.LCS.MIT.EDU (Madhumitra Sharma) (07/23/88)
The hardware complexity also goes up if the vector pipeline is to be
able to take interrupts in the middle of a vector instruction.  This
increased complexity (most probably) implies a longer cycle time for
the machine, too.  Cray wanted to make his pipelines as simple as
possible so that he could run them as fast as possible.  Therefore, he
decided he would not handle any interrupts in the middle of a vector
instruction.  Hence, no virtual memory.

				Madhu Sharma
				sharma@xx.lcs.mit.edu
blu@hall.cray.com (Brian Utterback) (07/23/88)
In article <60952@sun.uucp> dre%ember@Sun.COM (David Emberson) writes:
>If money is no object, it makes sense to buy a ton of memory AND have VM.
>This may be apocryphal, but I have been told that Seymour Cray, replying to
>the question of why the Cray 1 did not have virtual memory, replied, "Because
>I don't understand it."

I have never heard anything of the kind.  Well, sort of the kind.  What
he did say was that the CDC machines used ones-complement arithmetic
instead of twos-complement because he did not understand it.  He went
on to say that he had figured it out by the time he built the Cray-1,
since it is twos-complement.  I tend to think that he was joking.  By
the way, this is anecdotal rather than apocryphal, because the talk
was videotaped and I have seen it.

Crays continue to use only physical memory rather than virtual for one
reason: it's faster.  That's our charter: faster.
--
Brian Utterback      |UUCP:{ihnp4!cray,sun!tundra}!hall!blu | "Aunt Pheobe,
Cray Research Inc.   |ARPA:blu%hall.cray.com@uc.msc.umn.edu |  we looked like
One Tara Blvd. #301  |                                      |  Smurfs!"
Nashua NH. 03062     |Tele:(603) 888-3083                   |
ddb@ns.UUCP (David Dyer-Bennet) (07/23/88)
In article <5342@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
> Relevant (really?) question: Does it make more sense to buy a little
> bit of very fast memory and slow it down with virtual memory, or to
> buy a whole bunch of fast physical memory and slow it down by putting
> it farther away? (Assume: $ is no problem).
                    ^^^^^^^^^^^^^^^^^^^^^^^
Obviously :-) the correct solution is to buy a whole bunch of VERY fast
physical memory.  A cache system (I consider virtual memory to be
essentially a caching system) is never as fast as an entire main memory
made out of that same technology.
--
-- David Dyer-Bennet
...!{rutgers!dayton | amdahl!ems | uunet!rosevax}!umn-cs!ns!ddb
ddb@Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
Fidonet 1:282/341.0, (612) 721-8967 hst/2400/1200/300
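[The "never as fast" claim is just the average-memory-access-time
identity.  A sketch with access times I made up for illustration:]

```python
# Even at very high hit rates the cache system's average access time
# stays strictly above the all-fast-memory figure, because every miss
# pays the fast probe plus the slow fill.

fast = 10    # ns, assumed access time of the fast technology
slow = 200   # ns, assumed access time of the bulk technology behind it

def avg_access(hit_rate):
    # Simple AMAT model: hit cost + miss fraction * miss penalty.
    return fast + (1 - hit_rate) * slow

for h in (0.90, 0.99, 0.999):
    print(f"hit rate {h:.3f}: {avg_access(h):.1f} ns  (all-fast = {fast} ns)")
```

The gap shrinks as the hit rate rises but never reaches zero, which is
exactly Dyer-Bennet's point about building the whole memory from the
fast technology when money is no object.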
dik@cwi.nl (Dik T. Winter) (07/23/88)
In article <12174@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
 > In article <7588@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
 > >the instruction.  That takes a lot of time.  The 205 has two different
 > >page sizes, large pages of 65536 words (8 bytes/word) and small pages
 > >of (site selectable) 512, 2048 or 8192 words.  The number of
 > >associative registers is 16, and these are shared amongst the jobs on
 > >a system.  It appears that the selection of small page size is very
 > >critical.  I have a small (~10 lines) program that will run 2 times
 >
 > The Cyber 205 does have a "problem" because of its memory to memory
 > vector instruction set, as opposed to Cray's vector registers.  The
 > problem is/was vector startup time was fairly long on the Cyber 205.
 > This problem appears to have been solved on the ETA-10 with better
 > overlap, etc., and short vector performance seems to be much better.
 > (Still the same memory to memory architecture.)  The "problem" with
 > small page sizes is actually an installation option intended to benefit
 > installations with a relatively small amount of main memory.  The
 > solution, if you have more memory, is to use a larger small page size.

Well, let me also disagree.  The factor of 2 I mentioned was from
memory, and not substantiated by fact; it is more like a factor of 30.
Anyhow, here is what we experienced when the 205 was installed next
door (1 Mword of memory, small page size 512 words): our program, which
was consuming lots of time on the 205 (in total something like 1
CP-month), ran twice as fast when the small page size was increased to
2048 words.  The original runs gave no indication at all that something
was wrong; we saw typically something like 250 page faults in a 1-hour
run.
The problem was that the program fitted in memory (with lots of memory
to spare), but there were not enough associative registers to cope with
it, so every vector instruction was interrupted at least 3 times for
every 512 elements.  That is quite a lot when you know that vector
startup time is 50 to 70 cycles.  The main problem is of course not
enough associative registers (you ought to have enough to map all of
real memory).  My estimate, from experience: on a 1-pipe 205 set the
small page size to at least 2048, on a 2-pipe 205 use 8192; and what on
a 4-pipe 205?
--
dik t. winter, cwi, amsterdam, nederland
INTERNET : dik@cwi.nl
BITNET/EARN: dik@mcvax
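[A toy simulation of the effect Dik describes, much simplified: a
memory-to-memory triad streams through three operand pages at once, so
shrinking the small page multiplies the number of boundary crossings,
and any crossing into a page not held in the 16 mapping registers
interrupts the instruction.  The LRU model and the three-stream triad
are my simplifications, not the 205's exact behavior.]

```python
from collections import OrderedDict

def count_misses(page_words, vector_len, streams=3, tlb_entries=16):
    # LRU model of the associative mapping registers; each miss stands
    # for an interrupted instruction and drained pipes.
    tlb = OrderedDict()
    misses = 0
    for elem in range(vector_len):
        for s in range(streams):            # a, b, c advance in lockstep
            page = (s, elem // page_words)
            if page in tlb:
                tlb.move_to_end(page)
            else:
                misses += 1
                tlb[page] = True
                if len(tlb) > tlb_entries:
                    tlb.popitem(last=False)  # evict least recently used
    return misses

for pw in (512, 2048, 8192):
    print(f"{pw}-word small pages: {count_misses(pw, 65536)} interrupts")
```

The interrupt count falls in direct proportion to the page size (384
vs. 96 vs. 24 for a 65536-element triad), which matches the flavor of
the observed speedup from raising the small page size.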
jlg@beta.lanl.gov (Jim Giles) (07/23/88)
In article <12174@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> The real debate, in my opinion, is not between virtual and non-virtual
> (virtual won a long time ago, in my opinion- Cray is an anachronism
> in this respect) [...]

In general, this is true.  Most machines and applications are better
off with VM.  But I think Cray did the right thing for his market.
Most buyers of supercomputers have long since figured out how to do
memory management in software.  And, since these users know the data
usage patterns of their programs, their software VM is MUCH more
efficient than existing hardware can supply.  It's no coincidence that
many 205 users still run with VM turned off - their codes run faster
that way!

The problem is that hardware VM isn't flexible enough to deal with a
large variety of data usage patterns.  As a result, most VM machines
just do some variant of demand paging.  This is exactly the WRONG data
usage model for most large-scale scientific codes.  Providing more
sophisticated VM mechanisms would be more expensive and wouldn't really
help unless the user code is able to give 'hints' about the data usage
patterns.  But, if the user is required to give 'hints' in order to get
efficiency, he might as well do VM in software as he's always done
(figuring out what you need next is the hard part - actually reading it
in is easy).

Unless the hardware VM mechanism can look ahead far enough to avoid
page faults entirely (several hundred thousand instructions with the
current difference in disk and memory speed), it will never beat the
clever use of asynchronous I/O that more sophisticated users have been
doing for years.  Of course, if the hardware can somehow divine the
data usage patterns of the code automatically (a channeller perhaps?
:-), then it could maybe even beat the user's software VM.

J. Giles
Los Alamos
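[The "software VM" Giles describes is essentially double buffering:
because the program knows its own access pattern, it starts reading the
next block before it needs it.  A minimal sketch, with Python threads
standing in for asynchronous I/O and a fake read_block() of my own
invention:]

```python
import threading

def read_block(i):
    # Stand-in for an asynchronous disk read of block i.
    return [i] * 4

def process(blocks):
    results = []
    nxt = {"data": read_block(0)}          # block 0 read synchronously
    for i in range(blocks):
        current = nxt["data"]
        t = None
        if i + 1 < blocks:
            # Start reading the NEXT block before computing on this one.
            nxt = {}
            t = threading.Thread(
                target=lambda j=i + 1: nxt.update(data=read_block(j)))
            t.start()
        results.append(sum(current))       # "compute" on the current block
        if t:
            t.join()                       # next block ready when we are
    return results

print(process(3))
```

If the compute phase takes longer than the read, the I/O is fully
hidden and no demand-paging mechanism can do better, which is Giles's
point.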
seanf@sco.COM (Sean Fagan) (07/24/88)
In article <12174@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
>In article <7588@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>
>It is true that Cray vector instructions are atomic, and those on the
>205 are restartable, but the context is saved quite efficiently on the
>205.  A complete context switch actually takes fewer cycles on the 205
>than it does on the Cray 1/X/Y-MP's, and many fewer than the Cray-2.

My $0.02 worth: by a complete context switch, I assume you mean
something which will save the vector registers?  The Cray, when it does
a context switch, will save 24 registers (8 address, 8 index/offset, 8
data), plus the program counter, the starting address of relative word
0, and the limit of the program's size (incidentally, this is, for some
unknown reason, very similar to the CDC Cyber 170 machines' context
switch 8-)).  It does not save vectors (understandable), which the
operating system must then do (if it feels the need; if all it's doing
is OS-type work, then there is probably no reason to save them).  Since
storing things is so *slow* (relatively speaking), you try to avoid
memory like the plague.

Again incidentally, if Seymour designed the Cray 1 et al. as he did the
Cybers, then when the machine does a context switch, the hardware
starts storing the exchange package (described above) and reading the
new one at the same time (the Cybers had a long wire into which the
signal would go; since there was some travel time, it could safely read
into the registers without worrying about whether or not the values
were done being saved), causing *very* fast context switches (without
vector registers, of course).

As a result, there are two tradeoffs between the two architectures:
Crays use vector registers, which are a pain to load and store but very
fast for multiple operations (and very RISCy, of course 8-)), while the
205's (and ETA's) allow for larger vectors, with somewhat faster memory
access.
> Hugh LaMaster, m/s 233-9,    UUCP ames!lamaster
--
Sean Eric Fagan  | "An Amdahl mainframe is lots faster than a Vax 750."
seanf@sco.UUCP   |     -- Charles Simmons (chuck@amdahl.uts.amdahl.com)
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
smryan@garth.UUCP (Steven Ryan) (07/24/88)
>The problem is that hardware VM isn't flexible enough to deal with a
>large variety of data usage patterns.  As a result, most VM machines
>just do some variant of demand paging.

VSOS on the 205 provides a Q5ADVISE call for an asynchronous swap
in/swap out.  Pretty arcane, but it's there.

The real battle is between a development shop with lots of interactive
jobs and a production shop which is dedicated to one program.
jlg@beta.lanl.gov (Jim Giles) (07/26/88)
In article <1070@garth.UUCP>, smryan@garth.UUCP (Steven Ryan) writes:
> VSOS on 205 provides an Q5ADVISE for an asynchronous swap in/swap out.
> Pretty arcane, but its there.

But, as I pointed out, if you have already worked out what you need
next, why not just do the asynchronous I/O yourself?  The hard part is
figuring out what to tell Q5ADVISE.  The part that Q5ADVISE does is
just the I/O initialization.  As I said, sophisticated users have been
doing this stuff in software for years.  What do they need hardware VM
for?

J. Giles
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (07/26/88)
In article <7819@hall.cray.com> blu@hall.UUCP (Brian Utterback) writes:
>Cray's continue to use only physical memory rather than virtual for one reason:
>it's faster. That's our charter: faster.

Help!  I have looked in my trusty computer architecture books (the
latest is 5 years old) and can find very little on just how much
complexity (number of gates, real estate on or off chip, etc.) various
architectural features consume.  Now, I suppose that since the whole
world is interested in "RISC" these days, there must be a whole slew of
books out there which give such information so that correct trade-offs
can be made.  I would like to know how much space a 16-entry MMU
consumes versus an adder with bounds checking.  And also, how many
gates deep the critical path is for each.  And so on.  None of my
hardware books have anything more than the cost of several simple
adders.  Any suggestions on more recent books that contain good
information of this type?
--
  Hugh LaMaster, m/s 233-9,    UUCP ames!lamaster
  NASA Ames Research Center    ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035      Phone:  (415)694-6117
aglew@urbsdc.Urbana.Gould.COM (07/26/88)
>As I said, sophisticated users have been doing this stuff in software
>for years.  What do they need hardware VM for?
>
>J. Giles

There aren't enough sophisticated users in the world.  We have to sell
computers to the unsophisticated ones, too.
gillies@p.cs.uiuc.edu (07/27/88)
In my undergrad systems course we learned to optimize a multi-level
memory design for speed, given a constant number of $$$.  We used:

1.  Paper & pencil
2.  A simple model of caching/paging hit ratio versus cache size (often
    some given linear function hitRate = f(memorySize))
3.  The price/performance of various kinds of memory; for each kind:
    a.  $/K
    b.  access time

For a 2-level memory system (main memory, cache), you could plot a
2-dimensional curve (main memory size versus cache size), then find the
highest-performance point on the curve.

Of course, this analysis is impossible if you don't know your
instruction mix and software paging patterns.  And if the customer
wants to expand main memory, he should probably expand the cache at the
same time (I think this is uncommon).  So I doubt many companies pay
attention to this analysis -- maybe it's mostly academic.

My point is that it's an optimization problem which, if oversimplified,
can even be handled with paper & pencil.  If not, then it can probably
be solved by nonlinear optimization methods.

Don Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801
ARPA: gillies@cs.uiuc.edu   UUCP: {uunet,ihnp4,harvard}!uiucdcs!gillies
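[The paper-and-pencil exercise Gillies describes can be done by brute
force: split a fixed budget between the fast and the bulk technology
and pick the split that minimizes average access time.  Every constant
and both hit-rate curves below are invented for illustration.]

```python
import math

BUDGET = 20000                        # assumed total $ to spend
CACHE_COST, MAIN_COST = 50, 1         # assumed $ per KB of each technology
T_CACHE, T_MAIN, T_DISK = 20, 400, 20_000_000   # assumed ns per access

def cache_hit(kb):
    # Toy diminishing-returns curve for the cache hit rate.
    return min(0.99, 0.55 + 0.05 * math.log2(kb))

def main_hit(kb):
    # Toy curve: how often a cache miss avoids paging to disk.
    return 1 - 1 / (kb / 500 + 1)

def avg_time(cache_kb):
    main_kb = (BUDGET - cache_kb * CACHE_COST) / MAIN_COST
    if main_kb <= 0:
        return math.inf
    miss_cost = T_MAIN + (1 - main_hit(main_kb)) * T_DISK
    return T_CACHE + (1 - cache_hit(cache_kb)) * miss_cost

best = min(range(1, BUDGET // CACHE_COST), key=avg_time)
print(f"best split: {best} KB cache, "
      f"{BUDGET - best * CACHE_COST} KB main memory")
```

With these numbers the optimum is interior: spending everything on
cache starves main memory and causes paging, spending nothing on cache
wastes cycles on every reference, which is exactly the trade-off the
2-dimensional curve is meant to expose.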
jhm@cs.cmu.edu (Jim Morris) (07/28/88)
I heard him say he didn't understand VM at a talk at Livermore in 1975.