aeh@h.cc.purdue.edu (Dale Talcott) (09/08/86)
This is not directly related to the discussion about large main memories, but I thought I would try to supply some information about the CDC Cyber 205, since we have one of these beasties. The context is the comparison between the Cray, which does not use virtual memory, and the 205, which does.

The hardware part of virtual memory on the 205 is implemented using a two-level lookup table to translate virtual page addresses to physical pages. The first level is a 16-entry set of "associative registers" (ARs) which can do this mapping in 1 cycle. These are similar to the "translation lookaside buffer" on other virtual memory machines, except smaller. The second level is the "space table", which has an entry for each occupied real memory page (plus some dummy entries used by the system). This table resides at a fixed address in real memory and its first 16 entries are loaded into the associative registers. That is, the ARs cache the first 16 entries of the space table.

When a memory reference cannot be mapped using the associative registers, the affected instruction is suspended, the ARs are stored into the start of the space table, the space table is searched for the mapping at a rate of two entries per cycle until the match is found, the match is moved to the first entry in the space table, the ARs are reloaded from the space table, and the instruction is resumed. (If a mapping is not found, this constitutes a page fault, which I am ignoring.) The ARs need to be stored and reloaded because they are kept in LRU order by the hardware, and thus rapidly get out of sync with the part of the space table they are caching. The overhead for stopping, saving, reloading, and restarting is about 80 cycles. (Choke!)

When running in monitor mode, all memory references are physical references, but take just as long to execute. I suspect, but have not checked the prints to be sure, that monitor mode just forces the ARs to always respond with the identity map.
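The lookup sequence described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the 205's actual microcode: a single Python list stands in for the space table, its first 16 entries stand in for the ARs, and moving a matched entry to the front models both the hardware LRU ordering and the move-to-first step after a search. The class name `SpaceTable`, the method `translate`, and the choice to start the search just past the AR-cached entries are all hypothetical; the post does not say where the search begins.

```python
AR_COUNT = 16        # associative registers (from the post)
MISS_OVERHEAD = 80   # approx. cycles to stop, save, reload, restart (from the post)
SEARCH_RATE = 2      # space-table entries examined per cycle (from the post)


class SpaceTable:
    """Toy model of the two-level translation; names are hypothetical."""

    def __init__(self, mappings):
        # mappings: list of (virtual_page, physical_page), most recently used first.
        self.entries = list(mappings)

    def translate(self, vpage):
        """Return (physical_page, extra_cycles beyond an AR hit).

        Raises KeyError when no mapping exists (a page fault, not modelled).
        """
        for index, (virt, phys) in enumerate(self.entries):
            if virt == vpage:
                break
        else:
            raise KeyError(f"page fault on virtual page {vpage}")

        # Move the match to the front: LRU order in the ARs on a hit,
        # "match is moved to the first entry" after a search on a miss.
        self.entries.insert(0, self.entries.pop(index))

        if index < AR_COUNT:
            return phys, 0  # found in the ARs, no extra cost

        # ASSUMPTION: the search starts just past the 16 AR-cached entries.
        searched = index + 1 - AR_COUNT
        search_cycles = (searched + SEARCH_RATE - 1) // SEARCH_RATE
        return phys, MISS_OVERHEAD + search_cycles


if __name__ == "__main__":
    table = SpaceTable([(v, v + 1000) for v in range(32)])
    print(table.translate(3))    # AR hit
    print(table.translate(31))   # AR miss, found after searching 16 entries
    print(table.translate(31))   # now at the front, so an AR hit
```

In this model the second call costs 80 + 16/2 = 88 extra cycles, and the repeat reference costs nothing extra because the entry has moved into the AR-cached part of the table.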
With that as background,

1) Whatever programming technique you use on a Cray to fit ten pounds of data into 5 pounds of real memory will also work on the 205. That is, if your program can be run on a Cray with X Mwords of memory, it will run on a 205 with X Mwords of memory WITHOUT FAULTING.

2) Monitor mode on the 205 disables external interrupts, so I seriously doubt that any sites are running their 205's "with virtual memory disabled".

3) Nonetheless, I ran some timing tests to determine the cost of space table searches (which would not happen in monitor mode, due to the faked identity map). Using as a test case a vector add of two arrays into a third array, each array one third the size of available real memory, I came up with a cost of 0.48% with 4 Mwords and 1.5% for 32 Mwords (both assuming a page size of 64k words, which is the largest available on the 205). Over the course of a day, the latter amounts to about 22 minutes. The previous example has about the best AR hit rate for code which still references every word of real memory just once. If the code were a gather/scatter which touched only one word per page, the cost would be a whopping 32800% for a 32 Mword system. Fortunately, most codes have better locality of reference than either of these, so there is little incentive to run a 205 as a single user, real memory system.

4) Note that the previous test compares the 205 with itself. That is, from the results we cannot determine how much faster the 205 would run if it had no virtual memory at all, only how much faster it would run if we don't use what is there. The 205, like the Cray, is very parallel, so it is possible that the AR mapping takes place at the same time as other functions and ends up adding nothing to the total execution time of memory reference instructions which find a hit in the ARs. (Again, I haven't checked this against the prints or microcode.)
5) The 205's ancestor, the STAR-100, had at most 1 Mword of memory, and thus all real memory could be mapped with just the ARs (if you used only the large page size). It would have been nice if CDC could have kept this property as they increased the size of central memory.

(My .signature is copied from someone else's [Hi Tom] so I can't guarantee I am really reachable where it says, nor even that it gets included. So far as I can tell, news is voodoo.)
--
Dale Talcott, Systems programmer      ARPANET: aeh@j.cc.Purdue.EDU
Purdue University Computing Center    BITNET:  AEH@PURCCVM
Mathematical Sciences Bldg.           USENET:  aeh@pucc-j.UUCP
West Lafayette, IN 47907              Phone:   (317) 494-1787
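The overhead percentages quoted in point 3 of the post above can be roughly reproduced with back-of-the-envelope arithmetic. What follows is a reconstruction, not the method the post describes (those figures came from timing tests), and it rests on several assumptions that the post does not state: 1 Mword is 2^20 words, every AR miss scans the whole space table past the 16 AR-cached entries, the vector add misses once per page for each of its three arrays and otherwise produces one result per cycle, and a gather reference otherwise costs one cycle. The function names are hypothetical.

```python
MWORD = 2 ** 20            # ASSUMPTION: 1 Mword = 1,048,576 words
PAGE_WORDS = 64 * 1024     # largest 205 page size (from the post)
AR_COUNT = 16              # associative registers (from the post)
MISS_OVERHEAD = 80         # approx. cycles per AR miss, excluding the search (from the post)
SEARCH_RATE = 2            # space-table entries searched per cycle (from the post)


def miss_cycles(memory_words):
    """Cycles for one AR miss, ASSUMING a full scan past the AR-cached entries."""
    pages = memory_words // PAGE_WORDS
    return MISS_OVERHEAD + (pages - AR_COUNT) // SEARCH_RATE


def vector_add_overhead(memory_words):
    """Fractional overhead for c = a + b streaming through memory.

    ASSUMPTION: each of the three arrays takes one miss per page, and the
    add otherwise delivers one result per cycle.
    """
    return 3 * miss_cycles(memory_words) / PAGE_WORDS


def gather_overhead(memory_words, cycles_per_reference=1):
    """Fractional overhead when every reference touches a new page.

    ASSUMPTION: a reference that hits in the ARs costs cycles_per_reference.
    """
    return miss_cycles(memory_words) / cycles_per_reference


if __name__ == "__main__":
    for mwords in (4, 32):
        pct = 100 * vector_add_overhead(mwords * MWORD)
        print(f"{mwords:2d} Mwords: vector add overhead {pct:.2f}%")
    per_day = 24 * 60 * vector_add_overhead(32 * MWORD)
    print(f"32 Mwords: {per_day:.1f} minutes per day")
    print(f"32 Mwords: gather overhead {100 * gather_overhead(32 * MWORD):.0f}%")
```

Under these assumptions the model gives about 0.48% and 1.50% for the vector add, about 21.6 minutes per day, and 32800% for the one-word-per-page gather, which is close to the measured figures in the post. The agreement suggests the assumptions are plausible, but it does not confirm them.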
jlg@lanl.ARPA (Jim Giles) (09/11/86)
In article <2993@h.cc.purdue.edu> aeh@h.cc.purdue.edu.UUCP (Dale Talcott) writes:
>...
>1) Whatever programming technique you use on a Cray to fit ten pounds
>of data into 5 pounds of real memory will also work on the 205. That
>is, if your program can be run on a Cray with X Mwords of memory, it
>will run on a 205 with X Mwords of memory WITHOUT FAULTING.
>
>2) Monitor mode on the 205 disables external interrupts, so I seriously
>doubt that any sites are running their 205's "with virtual memory
>disabled".
>

Since the trip through the associative registers costs a machine cycle on the 205, every memory reference is delayed by at least that amount. (Actually, I suspect that the cost may be higher, in that the selection of the clock rate may have been partially driven by this memory interface.) Most memory references are, of course, overlapped by other computation - so you might think that the extra clock would be negligible. Unfortunately, as we all found out, a sizable part of execution time in a 'real' application is inherently scalar. And for this code, an extra clock in the memory interface usually means an extra clock overall.

This may be part of the explanation of the fact that Crays seem to outperform Cybers even when small benchmarks indicate the opposite should be true. (The other reason is the MUCH longer setup time for vector operations on the Cyber, which can only be offset if the vector sizes are very long on average. The distribution of vector lengths within a computation is not that simple, but it seems that there tend to be a lot of short vectors in 'real' codes.)

I suspect that the main saving on the Cyber of going into monitor mode is to prevent parts of your data from being swapped out when you don't want them to be. This prevents a page fault later. As I said before, most of the programmers I know who have run codes on Cybers prefer to have the virtual memory off.