henry@utzoo.uucp (Henry Spencer) (11/19/88)
John Mashey gave a talk here the other day. Herewith some random notes on it, of possible interest to others. This is not an attempt to record everything (or even everything significant) that he said. All statements are my recollection of what he said, i.e. if something's wrong, it might be because it's his opinion and he's wrong, or because I didn't remember it correctly. "I don't make the news, I just try to report it."

"RISC: there's less to be sorry about." Reducing the instruction set is a means to an end, not an end in itself. Their policy is that a new feature must either be structurally necessary (e.g. Return From Exception) or must demonstrably improve performance by 1% on a significant number of real applications. "Intuition is not allowed." "Never argue about something you can simulate." At midnight their entire computer room suddenly becomes furiously active as the nightly batch of simulations fires up.

Much criticism of tiny benchmarks. Real programs are much better, but even there one has to try to run a reasonable variety. There is an effort underway, jointly among several companies, to try to put together a set of big, "real", freely-redistributable programs for useful benchmarking. "If a vendor can't give you some reasonable backup for the viewgraphs, don't believe the viewgraphs." They try to collect "embarrassing" benchmarks, programs that do something unusual, on the grounds that including these helps to keep the performance numbers honest.

Statistics over a small number of programs can be very misleading. Some of Berkeley's commitment to register windows may have been accidental: two of their major initial benchmarks, the C compiler and nroff, quite by accident are unusually heavy register-save-and-restorers. People have noticed, with some surprise, that nroff does very few byte accesses despite being a text formatter... but TeX is different. Most programs use char and int but make very little use of short... but some CAD stuff is different, and so is TCP/IP. Most programs are maybe 20% loads and 10% stores, but some are very different, e.g. diff does lots of loads and few stores. Compress is a good way to embarrass a machine that relies very heavily on caching.

They are not competing in the low-end embedded-controller market. "It's hard for us to build anything less than 6 MIPS."

The 88000's floating-point hardware looks good in single precision but is *spectacularly* slower in double precision. (These numbers are very new; the floating point basically didn't work until very recently.) Basically, Motorola sacrificed double precision to save silicon. Okay for things like graphics, not so okay for serious floating-point applications.

Making an IEEE double-precision floating-point add run in two cycles is not easy. The tricky part is exceptions, especially since the MIPSco CPU has precise exceptions. It turns out that a quick table lookup based on some of the high bits of the exponents of the two operands can usually assure you that no exception is possible. If the table lookup says "maybe" rather than "no", the FPU just stalls until it can be sure. This almost never happens in reality unless an exception is indeed imminent. (See the first sketch below.)

Multiply and divide are asynchronous, with the CPU stalling if the result is required before it's available. In the current implementation the stall usually does happen; the delay is too long to fill completely very often. One side effect is that the divider doesn't check for a zero divisor, since an explicit check for that can be the first thing in the delay slots! (Also sketched below.) General experience is that filling one load or branch delay slot is easy but filling two or more is much harder.

Roughly 50% of all function calls (dynamically, not statically) are calls to leaf functions. These need not save the return address out of the register it gets parked in, and need not have a stack frame (all the variables can just go in registers, barring funny situations), which makes call and return of such functions one cycle each. (Example below.) MIPSco passes the first four parameters in registers, and has 13 registers which the caller must save if he wants them preserved, so local storage is usually all-register. With 13 scratch registers, register save and restore is pretty uncommon, averaging less than two registers per call. SPARC, with its register windows, actually does more saving and restoring to/from memory, because its big fixed-size windows hurt it. I asked him about languages with highly dynamic call patterns, like C++ and Smalltalk. He said that for those languages, it would indeed be mildly helpful to have a few register windows.

For a commercial machine, the ability to do unaligned accesses can be significant. (MIPSco does it via special instructions; their compilers can be told to use those for all dubious accesses. See the last sketch below.) Ugly things like Cobol and PL/I tend to want them, and there exist major applications in C and Fortran which cannot be fixed easily. Vendors are much happier about getting something running with a compiler option and then tuning the result than about having to get alignment right from the start.
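A few sketches in C, to make the preceding points concrete. These are my reconstructions of the ideas, not Mashey's actual hardware or compiler output; all names and specific numbers in them are made up.

First, the floating-point early-out. Written as explicit comparisons instead of the small table the hardware would index by high exponent bits, the classification might look like this (the margins are illustrative guesses, and the inexact exception is ignored):

    /* Sketch: "can this double-precision add possibly trap?"
     * Classify by the operands' biased exponents alone: operands in
     * the broad middle range can never overflow, underflow, or be
     * invalid, so the FPU can run at full speed.  Anything else gets
     * the conservative answer, and the FPU stalls until it is sure.
     */
    #include <stdint.h>

    enum { NO_TRAP, MAYBE_TRAP };

    static int add_may_trap(uint64_t a, uint64_t b)
    {
        unsigned ea = (unsigned)(a >> 52) & 0x7FF;  /* biased exponents */
        unsigned eb = (unsigned)(b >> 52) & 0x7FF;

        if (ea >= 0x7FE || eb >= 0x7FE)  /* near overflow, or Inf/NaN */
            return MAYBE_TRAP;           /* -> stall until sure */
        if (ea <= 54 || eb <= 54)        /* near underflow, or denormal */
            return MAYBE_TRAP;
        return NO_TRAP;                  /* the common case */
    }

The point is that the "maybe" answer is rare for real data, so precise exceptions cost almost nothing in the common case.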
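Second, the free zero-divide check. The C here is trivial; the scheduling idea from the talk is in the comments (ordinary MIPS mnemonics, but the exact sequence is my guess):

    /* Sketch: integer divide on a machine whose divider never checks
     * for a zero divisor.  The compiler starts the divide first, then
     * tests the divisor; the test executes during the divider's long
     * latency, so it is effectively free.  Roughly:
     *
     *     div   $t0, $t1        # start the asynchronous divide
     *     beq   $t1, $0, trap   # zero check rides in the stall cycles
     *     ...                   # any other useful work that fits
     *     mflo  $v0             # stalls only if the divide isn't done
     */
    long quotient(long n, long d)
    {
        if (d == 0)
            return 0;   /* stand-in for raising a divide-by-zero trap */
        return n / d;
    }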
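Third, leaf functions. A hypothetical pair (names mine) showing where stack frames do and don't come from:

    /* A leaf function: it calls nothing, so the return address stays
     * in the register the jump-and-link put it in ($31 on MIPS), the
     * arguments arrive in registers, and no stack frame is built.
     * Call and return are one cycle each.
     */
    static int clamp(int x, int lo, int hi)
    {
        if (x < lo) return lo;
        if (x > hi) return hi;
        return x;
    }

    static int logit(int x) { return x; }   /* stand-in for real work */

    /* The non-leaf version is where frames come from: the inner calls
     * clobber $31 and the scratch registers, so the prologue must save
     * whatever must survive across them.
     */
    int clamp_logged(int x, int lo, int hi)
    {
        int r = clamp(x, lo, hi);
        logit(r);
        return r;
    }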
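Finally, unaligned access. In portable C the compiler option amounts to something like the byte-wise load below; on MIPS the special unaligned-load instruction pair (lwl/lwr) does the same job in two instructions:

    /* Sketch: loading a 32-bit word from a possibly-misaligned address.
     * A compiler told that the pointer may be unaligned can emit the
     * MIPS unaligned-load pair (lwl/lwr) instead of a plain lw; in
     * portable C the equivalent is a byte-wise copy.
     */
    #include <stdint.h>
    #include <string.h>

    static uint32_t load32_any(const void *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof v);   /* byte loads, or lwl/lwr on MIPS */
        return v;
    }

Now, back to the notes.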
It is hard to win the performance race with pure clock speed, because you can't ship a part if you can't test it, and the test-equipment folks are not going to build a new tester just for you. So clock rates tend to improve simultaneously across the industry.

He put up a slide listing lines of code in the various parts of their various compilers. Out of six compilers and assorted optimizers etc., half the total lines were for Ada.

Their compilers use the "signed" (overflow causes exception) instructions for non-pointer arithmetic in C. This constrains optimization in ANSI C, but they already have to deal with the issue for Ada, so this isn't a big software problem.

MIPSco is mostly content to watch the AT&T/Sun/OSF/etc. scuffle from the sidelines. "When elephants dance, the mice stay off the floor."

CPUs are almost free; what matters in cost is the support hardware. Their new big machine puts most of its money into things like fast I/O. "Heavy iron, with a CPU pasted on."

For most Unix utilities they don't bother with the heavier levels of optimization, e.g. interprocedural. However: "We do have customers who'll commit vile acts to get another 5-10%." Optimizing the Unix kernel is a "thrilling adventure". The new "volatile" keyword in ANSI C helps, but figuring out where to use it is not trivial, and implementing it in an existing optimizing compiler is lots of fun.

"Never debug a new CPU and a new silicon process simultaneously." They (or rather, their silicon-building partners) use essentially static-RAM processes, and let the RAM market do the leading-edge process debugging.

The hard part of designing a CPU is not the instruction set, but the exception handling.
-- 
Sendmail is a bug,        |     Henry Spencer at U of Toronto Zoology
not a feature.            | uunet!attcan!utzoo!henry henry@zoo.toronto.edu