gordoni@cs.adelaide.edu.au (Gordon Irlam) (03/27/91)
The Low Level Computational Performance of Sparcstations
========================================================

                 A Guide for Compiler Writers
                 ----------------------------

           Gordon Irlam (gordoni@cs.adelaide.edu.au)
                         March 1991

This document provides information on the Sparcstation architecture
that may be of use in targeting compilers.  It is an attempt to
address the common problem of architectural definitions failing to
provide concrete discussion of performance related issues.  This
document discusses the following aspects of performance: clock rates,
instruction timings, the memory sub-system, register windows, and
address translation.  Issues relating to multiuser performance are
not discussed.  Familiarity with the SPARC processor architecture,
but not the Sparcstation architecture, is assumed.

Overall architecture.
---------------------

The SS1 family (SS1, SLC, SS1+, IPC) use the Fujitsu MB86901A or its
equivalent.  The SS2 uses the Cypress CY7C601.  The SS1 and SLC have
a 20MHz clock.  The SS1+ and IPC have a 25MHz clock.  The SS2 has a
40MHz clock.  All machines have a 64k write-through cache.  The SS2
is able to map twice as much virtual memory as the machines in the
SS1 family.

    Sparcstation model summary

    Machine   Chip      Clock   Cache   Pmegs
    SS1       Fujitsu   20MHz   64k     128
    SLC       Fujitsu   20MHz   64k     128
    SS1+      Fujitsu   25MHz   64k     128
    IPC       Fujitsu   25MHz   64k     128
    SS2       Cypress   40MHz   64k     256

SPARC instruction times.
------------------------

In the vast majority of cases the two chips have identical
instruction timings, with most instructions taking 1 cycle.  The
exceptions are mentioned here.

Single word loads (LD*) take 2 cycles.  Double word loads and single
word stores (LDD*, ST*) take 3 cycles.  Double word stores and the
atomic load-store instructions (STD*, LDSTUB*, and SWAP*) take 4
cycles.

The only common case in which the two chips take different times is
branch instructions.  In most cases branches (B* and FB*) take 1
cycle, but on the Fujitsu chip untaken branches consume 2 cycles.  On
either chip, if the delay slot of a branch is annulled, an additional
1 cycle penalty is incurred, since the annulled delay instruction is
still fetched.

Jumps, indirect calls, returns, and trap returns (JMPL and RETT) take
2 cycles.  Note that direct calls (CALL) take 1 cycle.  For delayed
control transfer instructions (B*, FB*, CALL, JMPL, and RETT), if the
following instruction slot cannot be filled with a useful
instruction, an additional 1 cycle penalty is effectively incurred.
Traps incur a penalty of 3 cycles while the pipeline is drained and
refilled.

In the vast majority of cases bypass paths within the CPU allow an
instruction to use a value computed by the previous instruction in
the pipeline without a stall occurring.  The only exception is the
load instructions (LD*).  A 1 cycle stall penalty is incurred if the
instruction following a load attempts to read the register to which
the load is destined.  In the case of a double word load the penalty
is only incurred if the following instruction accesses the second
register to be loaded.

Floating point instruction timings depend upon the type of floating
point coprocessor used with the chip.  Unfortunately I don't have any
timing values for these.

The Fujitsu chip does not implement the atomic swap instruction,
although it does implement the atomic load-store byte instruction.
Neither chip implements integer multiplication or division, but under
SunOS 4.1 the integer multiplication and division instructions are
emulated in software.
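To make the load-use stall concrete, here is a small scheduling
sketch in Sun assembler syntax.  The register names and the
surrounding computation are invented for illustration; the cycle
counts are those given above.

    ! Unscheduled: the add reads %o1 in the cycle after the load,
    ! so a 1 cycle stall is added to the 2 cycle load.
    ld      [%o0], %o1
    add     %o1, 1, %o1

    ! Scheduled: an independent instruction separates the load from
    ! its use, and the stall disappears.
    ld      [%o0], %o1
    sub     %o2, 1, %o2         ! independent work
    add     %o1, 1, %o1

The same reasoning applies to delay slots: a compiler that can move a
useful instruction into the slot of a delayed control transfer avoids
the effective 1 cycle penalty of filling it with a nop.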
Trivia: on the Cypress chip, but not the Fujitsu chip, loads to
register g0 (which is always 0) followed by reads of g0 do not incur
the above mentioned stall penalty.

    SPARC instruction timings

    Instruction       Cycles
    LD*               2 + stall if next inst. uses dest. reg.
    LDD*              3 + stall if next inst. uses 2nd dest. reg.
    ST*               3
    STD*              4
    LDSTUB*, SWAP*    4
    B*, FB*           1 + delay (2 + delay on Fujitsu if not taken)
    JMPL              2 + delay
    RETT              2 + delay
    CALL              1 + delay
    all others        1
    traps             +3

Sparcstation memory times.
--------------------------

The above times all assume that instruction fetches, load
instructions, and store instructions never have to wait upon the
memory system.  This is not true.  Both the SS1 family and the SS2
use a single level 64k virtually addressed write-through cache with a
line size of 16 bytes.  I currently don't have any information on
memory timings for the SS2.  The following times refer to the SS1
family.

Attempting to measure memory access times on a Sun is very difficult
due to the large number of possible contributing factors - cache
size, cache organization, address translation, the write back queue,
and the DRAM access regime - as well as having to deal with
interference caused by instruction fetches and context switching.
Thus the following data is not exact.  It is an attempt to reconcile
the results of a number of experiments, some of which are
contradictory.

When the referenced line is in the cache no penalty is incurred.
Thus a single word load will take 2 cycles, and a single word store
will take 3 cycles.

Before continuing with what happens when a cache miss occurs,
remember that the SS1 family are low cost machines.  A certain amount
of performance has been traded off to end up with a box that is very
cheap to produce.  The more expensive Suns offer a considerable
improvement in this area, but you have to pay for it.

If the referenced line is not in the cache a stall will occur while
the address is translated and main memory is accessed.  For a load,
or presumably an instruction fetch, the stall lasts for 12 cycles
while the line is loaded into the cache.  Thus a single word load
will take 14 cycles.  For a store the stall lasts 3 cycles, causing a
single word store to take 6 cycles.  In the case of a store the
target line remains uncached.

When several stores occur in sequence a stall will occur once the
memory write buffers fill up.  This will occur even if the target
addresses are all cached, and is in addition to any stalls due to
cache misses.  On the SS1 family even a single store double
instruction causes the write buffer to fill.  The following table
shows this.

    Instruction sequence    Total cycles    Stall cycles
    ST                        3               0
    ST; ST                   11               5
    ST; ST; ST               17.5             8.5
    ST; ST; ST; ST           23.5            11.5
    STD                       8               4
    STD; STD                 21.5            13.5
    STD; STD; STD            33.5            21.5
    STD; STD; STD; STD       46              30

Non-integer cycle times are presumably the result of one or more of
the following: freezing the CPU while address translation is
performed, cycle stealing for DRAM refresh, use of a fancy DRAM
access regime, the presence of other devices on the bus, or
measurement error.

To prevent stalls a minimum of 3 cycles between two store
instructions is necessary.  Depending upon the addresses of
successive stores 4 cycles may be required.  If a write back stall is
triggered it appears to last at least 2 cycles.  Thus if two single
word stores are separated by 3 nops the sequence will take either
2 * 3 + 3 * 1 = 9 cycles or 2 * 3 + 3 * 1 + 2 = 11 cycles, whereas
two single word stores separated by 4 nops will take
2 * 3 + 4 * 1 = 10 cycles.
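To make the store separation rule concrete, here is a sketch in Sun
assembler syntax.  The register names and the intervening arithmetic
are invented for illustration; the cycle arithmetic is as given
above.

    ! Back to back stores: the write buffer fills and the second
    ! store stalls (about 11 cycles for the pair, even when cached).
    st      %o1, [%o0]
    st      %o2, [%o0 + 4]

    ! Separated by 3 independent instructions: 2 * 3 + 3 * 1 = 9
    ! cycles, or 11 if the addresses trigger a write back stall.
    st      %o1, [%o0]
    add     %o3, 1, %o3         ! independent work
    xor     %o4, %o5, %o4       ! independent work
    sll     %o3, 2, %g1         ! independent work
    st      %o2, [%o0 + 4]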
Based on my observations, I would recommend separating stores by 4
cycles only when all 4 intervening cycles can be usefully filled.
Otherwise I would recommend adding nops to give a minimum of 3 cycles
between a double word store and the next store, but only filling as
many cycles as can be usefully filled between a single word store and
the next store.

Note that on an SS1 class machine the double word store instruction
provides very little advantage over a pair of single word store
instructions.  A double word store takes 8 cycles.  Two single word
stores, plus 4 intervening operations, will only require an
additional 2 cycles.  Despite this it is probably sensible to use
double word store instructions when possible, unless it is clear that
the code is likely to be run exclusively on SS1 class machines.

    SS1 family memory sub-system timings

    Operation    Cached cycles    Uncached cycles
    LD            2               14
    LDD           3               15
    ST            3                6
    STD           8               11

    Sequence          Required separation
    ST, .., store     3-4 cycles
    STD, .., store    3 cycles

SunOS register window handlers.
-------------------------------

For compilers that don't have sophisticated register allocation
algorithms, register windows can offer a substantial reduction in the
cost of procedure calls and returns.  How register windows stack up
against sophisticated register allocation algorithms is an open
question.

I believe the Fujitsu chip has 7 register windows.  The Cypress chip
has 8 register windows.  Because one window is reserved for use by
the register window trap handler the user is left with either 6 or 7
register windows.

A register window overflow/underflow trap pair takes about 260 cycles
on an SS1 class machine (Hennessy and Patterson notwithstanding).
This is due to the poor performance of a sequence of 8 double word
store instructions on a write-through cache, the cost of testing to
make sure the required page is present, checking to see if the window
spans a page boundary, and making sure the register window handler is
not used to circumvent system security.  The machines on which
register windows were first developed (RISC I, RISC II, SOAR, and
SPUR), and on which register windows were proclaimed to offer great
performance advantages, did not have to deal with any of these
problems, and consequently an overflow/underflow trap pair took about
40 cycles.

Despite the significant cost of handling register window traps on
Sparcstations, in the absence of heavy recursion register windows
probably still offer some performance advantages for procedure call
intensive languages such as Smalltalk, or for compilers that don't
perform careful register allocation.  For languages such as C,
however, the implementation cost could outweigh the benefits.  That
is, in comparison to the amount of effort required to modify a
compiler/debugger/OS to take advantage of register windows, the gains
are probably not all that significant.  Thus I would recommend
attempting to quantify the expected gains obtained by using register
windows before deciding to use them.  I would expect they offer some
performance advantage, say 10%, far less than the 70% performance
improvement found on SOAR.

One possible counter-argument to some of the doubts about the
significance of register windows that I have expressed is that by
having register windows, Sun has been able to expend less effort on
maximizing the performance of loads and stores.  Because of register
windows, poor memory performance has been acceptable, which in turn
forces compiler writers to make use of register windows to get
acceptable performance.  This results in an interesting circularity
that complicates the evaluation of the significance of register
windows as an architectural feature.
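To give a feel for what the window mechanism costs and saves, here is
a sketch in Sun assembler syntax of the same trivial routine compiled
two ways.  The routine itself is invented for illustration; only the
window behavior is the point.

    ! Windowed version: save/restore switch register windows, and
    ! can trap (roughly 260 cycles per overflow/underflow pair).
_double:
    save    %sp, -96, %sp       ! new window, minimal stack frame
    add     %i0, %i0, %i0       ! result = arg + arg
    ret                         ! return to caller
    restore                     ! pop the window in the delay slot

    ! Leaf version: no window is allocated, so no window trap can
    ! occur; the argument stays in the caller's %o registers.
_double_leaf:
    retl                        ! leaf return via %o7
    add     %o0, %o0, %o0       ! result computed in the delay slot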
Sparcstation address translation.
---------------------------------

Sparcstations have a software managed cache that can hold the address
translations for each of several regions of virtual memory.  The SS1
family can concurrently map up to 128 regions of virtual memory, each
of size 256k and aligned on a 256k boundary.  The SS2 can
concurrently map up to 256 such regions.  Attempting to make active
use of more than 128 (or 256) of these virtually addressed regions at
once will result in a large number of page faults.  This is unlikely
to be a problem except for systems that have very sparse memory
access patterns.  Persistent environments and very large Lisp
applications are the two application areas most likely to suffer from
these problems.  In addition it is important that such applications
make use of the vadvise/madvise system calls to control paging
behavior.

Notes.
------

1. Some of the above information has been empirically determined, and
   is undoubtedly wrong; corrections welcome.

2. I would be interested in low level details of the expected integer
   performance of superscalar SPARC implementations.

3. Working out memory timing values is very difficult.  If anyone
   wants a challenge they might have a go at doing this for the SS2.
   Contact me for some pointers/caveats.

4. I have some additional information on window handlers and address
   translation if anyone is interested.  One day I plan to evaluate
   the effect of register windows on the performance of the SPEC
   benchmarks.  Working out the costs is easy; the hard part is
   working out how many additional loads and stores would be required
   in their absence.

5. Thanks to Peter Van Roy (vanroy@pisces.Berkeley.EDU) for helping
   me clean up an early version of this document.
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.