sohail@terak.UUCP (Sohail M. Hussain) (02/01/85)
> Early prototype versions of
> the box ran with 4 and 6MHz parts, because of the unavailability of 10MHz
> ones, and bug-free 10MHz parts (rev N) are still not available in production
> quantities

We have some of those 10MHz rev N parts in our workstation, and what has been puzzling me is that these machines outperform our VAX 750 (not in compiles, of course, but in execution times).

Can someone out there shed some light on why a 32016 runs faster than a 750 in programs that access memory (using pointers or matrix-type operations)?

As background, we have a VAX 750 with 3Mb of memory, running 4.1 BSD. The workstations are of our own manufacture, using a 10MHz 32016 with 4Mb of memory, and running 4.1 Genix. The timings were done with both systems running multiuser, but with only one person logged in.

Looking forward to some answers. I have always thought of our 750 as a fair-size machine, and it supports our development efforts quite well. I am now faced with either having to start respecting the 32016 more, or the 750 less.

sohail

--
Sohail Hussain
uucp: ...{decvax,hao,ihnp4,seismo}!noao!terak!sohail
phone: 602 998 4800
us mail: Terak Corporation, 14151 N 76th street, Scottsdale, AZ 85260
hammond@petrus.UUCP (02/07/85)
> We have some of those 10MHz rev N parts in our workstation, and
> what has been puzzling me is that these machines outperform our
> VAX 750 (not in compiles, of course, but in execution times).
>
> Can someone out there shed some light on why a 32016 runs faster
> than a 750 in programs that access memory (using pointers or matrix-type
> operations)?
>
> Sohail Hussain

Issues:
Does your 32016-based workstation have a 32081?
Are you using the 32082 MMU?
Does your 750 have a floating point accelerator?
Is your benchmark program small enough to fit in memory (i.e., roughly the same number of page faults on both machines)?

Questions:
How much faster, i.e. 5, 10, 20, 30%?

I have an NSC Sys 32 (a 32016-based, 4.1 BSD development system). It runs about the same as an 11/23, or about 1/3 of a 750. My boss has been giving me grief about this, so your info is most encouraging.

Note that a 32032 should give roughly 1.25 times the performance of a 32016. The 32-bit bus doesn't buy you that much more, except in applications such as copying data memory to memory.
hr@uicsl.UUCP (02/08/85)
RE: "Can someone out there shed some light on why a 32016 runs faster than a 750 in programs that access memory (using pointers or matrix-type operations)?"

This might be relevant, or it might not: one must take into consideration the software used. A friend and I recently ran the Dr. Dobb's floating point benchmark on a number of machines. Surprisingly, his S100/286 MS-DOS system (with 80287) was within 10% of our VAX 11/780 running 4.2 BSD. He used the new DRI FORTRAN; I used f77 (C produced similar results).

I recently tried the same program on a 780 running VMS. The VMS machine ran the program 4 times faster (8 times faster if the single precision times are used)! I suspect that we wound up measuring not so much the machines as their libraries. Presumably, your memory-intensive programs would be less susceptible to this, though.

Now if I could just find a 68k or 32016 system with that speed in the $5000 range, I'd have something to look forward to.

harold ravlin
{ihnp4,pur-ee}!uiucdcs!uicsl!hr
henry@utzoo.UUCP (Henry Spencer) (02/08/85)
Another relevant question is, does your memory have zero wait states? People I trust tell me that the 32016's performance deteriorates *SHARPLY* when wait states are introduced -- it's much worse than you would expect, and in particular it's not linear in the number of wait states.

--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
chuqui@nsc.UUCP (Chuq Von Rospach) (02/09/85)
In article <278@petrus.UUCP> hammond@petrus.UUCP writes:
>It runs about the same as an 11/23, or about 1/3 of a 750.
>My boss has been giving me grief about this, so your info is most
>encouraging.

I'll probably get grief for saying this, but there are some quirks in the SYS32 hardware that keep it from performing the way it should. The memory subsystem tends to require an unreasonable number of wait states in certain configurations, and it makes the system sludge out. We've been taking a close look at the SYS32 in the last few months because we realize that its performance makes our chips look a lot worse than they really are.

I don't have anything I can talk about at this time besides pointing out that it IS very possible to get 32xxx-based systems that run MUCH faster than the SYS32. The SYS32 is more of a workhorse than a benchmark system, and people should be aware of that fact.

chuq

--
From the ministry of silly talks:
Chuq Von Rospach
{allegra,cbosgd,hplabs,ihnp4,seismo}!nsc!chuqui
nsc!chuqui@decwrl.ARPA

Life, the Universe, and lots of other stuff is a trademark of AT&T Bell Labs
srm@nsc.UUCP (Richard Mateosian) (02/11/85)
In article <5040@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Another relevant question is, does your memory have zero wait states?
>People I trust tell me that the 32016's performance deteriorates
>*SHARPLY* when wait states are introduced -- it's much worse than
>you would expect, and in particular it's not linear in the number of
>wait states.

Obviously, behavior under wait states depends a lot on the particular program. Here is one data point taken from one of my Wescon papers. The program is a small benchmark that uses memory heavily. Bus use is 82% on the NS32016, 57% on the NS32032, both at 0 wait states. Given below are execution times in seconds at 0, 1 and 2 wait states:

            0 ws    1 ws    2 ws
            ----    ----    ----
NS32032     32.7    36.0    39.3
NS32016     43.7    49.4    55.0
Ratio       1.34    1.37    1.40

Relative to 0 wait states, execution speed of the NS32016 is 88.5% at 1 ws, 79.5% at 2 ws.
Relative to 0 wait states, execution speed of the NS32032 is 90.8% at 1 ws, 83.2% at 2 ws.

In general, programs with lighter bus use show smaller degradation with wait states and smaller ratios of NS32032 to NS32016 execution speed. In fact, a program doing nothing but register-to-register division operations might show no degradation at all under wait states (because of the instruction pipeline) and no difference in execution speed between the NS32016 and NS32032.

--
Richard Mateosian
{allegra,cbosgd,decwrl,hplabs,ihnp4,seismo}!nsc!srm
nsc!srm@decwrl.ARPA
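[Editorial sketch, not part of the original post.] The ratio and percentage figures in the post above can be reproduced directly from the six execution times. The short Python check below assumes only those six numbers; the function and variable names are made up for illustration. It also prints the time added by each extra wait state, which is simply the difference between adjacent columns of the table.

```python
# Execution times in seconds from the post above, indexed by wait states 0, 1, 2.
times = {
    "NS32032": [32.7, 36.0, 39.3],
    "NS32016": [43.7, 49.4, 55.0],
}

def ratios(slow, fast):
    """Ratio of execution times (slower CPU / faster CPU) at each wait-state count."""
    return [round(s / f, 2) for s, f in zip(slow, fast)]

def relative_speed(series):
    """Speed at each wait-state count as a percentage of the 0-wait-state speed."""
    base = series[0]
    return [round(100.0 * base / t, 1) for t in series]

def added_seconds(series):
    """Extra run time contributed by each additional wait state."""
    return [round(b - a, 1) for a, b in zip(series, series[1:])]

ratio_16_to_32 = ratios(times["NS32016"], times["NS32032"])
speed_16 = relative_speed(times["NS32016"])
speed_32 = relative_speed(times["NS32032"])
added_16 = added_seconds(times["NS32016"])
added_32 = added_seconds(times["NS32032"])

print("Ratio           ", ratio_16_to_32)
print("NS32016 speed % ", speed_16)
print("NS32032 speed % ", speed_32)
print("NS32016 added s ", added_16)
print("NS32032 added s ", added_32)
```

For this one benchmark the added time per wait state is nearly constant (about 5.6 to 5.7 seconds on the NS32016 and 3.3 seconds on the NS32032). As the post says, other programs may behave differently.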
jchapman@watcgl.UUCP (john chapman) (02/12/85)
> Another relevant question is, does your memory have zero wait states?
> People I trust tell me that the 32016's performance deteriorates
> *SHARPLY* when wait states are introduced -- it's much worse than
> you would expect, and in particular it's not linear in the number of
> wait states.
>

On the other hand (at least according to my old 16032 manual), you wouldn't need very fast memory to keep up with a 32016. Say, about 400ns access?
doug@terak.UUCP (Doug Pardee) (02/12/85)
> >People I trust tell me that the 32016's performance deteriorates
> >*SHARPLY* when wait states are introduced -- it's much worse than
> >you would expect, and in particular it's not linear in the number of
> >wait states.
>
> In general, programs with lighter bus use show smaller degradation with wait
> states and smaller ratios of NS32032 to NS32016 execution speed.

Wait states are a punch aimed at the 32000's glass jaw -- instruction prefetch.

For those not completely conversant: the 32000 series CPUs use instruction prefetching to try to keep the 8 bytes following the _current_ instruction already loaded into the CPU. These bytes are always the ones located sequentially after the current instruction.

There are two undesirable side effects which can occur. The most obvious occurs when a branch is taken -- the prefetch cycles were a waste of time, and the new instructions have to be fetched. But -- if the CPU had just started a prefetch cycle when the branch is recognized, it has to wait for that cycle to complete before the branch can be executed. Wait states increase the likelihood of this happening as well as make the situation more serious.

Remembering that programs spend most of their time in loops, and that a loop requires at least one branch on every pass, this effect is magnified considerably. This is especially true for concocted benchmark programs, where the contents of the loop tend to be trivial, leaving the branching as the major time consumer.

A second aspect of the 32000 series enters in here as well -- unlike the 68000, instructions are not required to start on word boundaries. If the branch destination is an "odd" address, the CPU requires yet another memory cycle, along with any wait states. Compilers for high-level languages like "C" don't pay any attention to this little detail, so tight loops can suffer just because the top of the loop is on an odd-byte boundary.

The other side effect is less obvious.
The instruction prefetch cycle can also obstruct access to the operands of the current instruction. Again, wait states increase the likelihood of this happening, and make the delay more serious as well.

This process, in turn, is made more likely by the use of high-level languages like "C". Unlike the competition's CPUs, the 32000 series allows essentially all operations to be performed memory-to-memory, without needing a register as an intermediate. The compilers use this feature extensively, with the result that operands require memory access much more often than in the equivalent 32000 assembler code or (e.g.) 68000 "C" code.

Important note: this presumes that if the compiler had been forced to bring the operands into a register, and get the result in a register, it could have done some optimization and re-used that register. It is obvious, is it not, that a simple "Load A, Add B, Store B" is necessarily going to be slower than "Add A to B"?

And to compound the problem even further: the 32000 series is set up to use "indirect addressing" fairly heavily, and the compilers really use it a bunch. Especially the "C" compiler, which uses indirect addressing to implement pointer variables.

But wait, there's more (this is starting to sound like a TV mail-order ad!). Most "C" programmers seem to like to use "external" variables rather than parameters. On the 32000 series, parameters are accessed just as easily as ordinary variables, but externals are a *double-indirect*! For a 32016 to get just the *address* of an external item, it has to do four (4) memory cycles. And if that item is a pointer variable, "C" will require yet another two memory cycles before it even has the *address* of the data.

All of this indirect address and operand fetching puts quite a load on the memory system, and prefetching represents serious competition for memory cycles.
If that prefetching turns out to have been unnecessary because of a branch, the performance suffers more than the number of wait states would imply.

So if you want your 32000 system to hum along, don't use wait states, keep looping and branching to a minimum, program in assembler, and if you simply *must* program in "C", avoid external variables and use register variables (especially for pointer variables).

Oh, BTW, the MMU adds one wait state of its own.

--
Doug Pardee -- Terak Corp. -- !{hao,ihnp4,decvax}!noao!terak!doug
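[Editorial sketch, not part of the original post.] To get a feel for the external-variable cost described in the post above, the Python model below turns the stated memory-cycle counts (four to obtain the address of an external, two more if the external is a pointer, one extra wait state with the MMU) into rough times. The 10MHz clock is the figure quoted earlier in the thread. The four-clock zero-wait bus cycle and the one-clock cost per wait state are assumptions made here for illustration, and the model deliberately ignores contention with prefetch, so real delays could be longer.

```python
CLOCK_NS = 100             # one clock period at 10 MHz
CLOCKS_PER_BUS_CYCLE = 4   # ASSUMPTION: a zero-wait bus cycle takes four clocks

EXTERNAL_CYCLES = 4                             # from the post: address of an external item
EXTERNAL_POINTER_CYCLES = EXTERNAL_CYCLES + 2   # from the post: external that is a pointer

def address_fetch_ns(memory_cycles, wait_states, mmu=False):
    """Bus time in ns for the given number of memory cycles.

    ASSUMPTION: each wait state adds one clock to every bus cycle.
    Per the post, an MMU contributes one wait state of its own.
    Prefetch contention is not modeled.
    """
    effective_ws = wait_states + (1 if mmu else 0)
    return memory_cycles * (CLOCKS_PER_BUS_CYCLE + effective_ws) * CLOCK_NS

for ws in (0, 1, 2):
    print(
        f"{ws} ws: external {address_fetch_ns(EXTERNAL_CYCLES, ws)} ns, "
        f"external pointer {address_fetch_ns(EXTERNAL_POINTER_CYCLES, ws)} ns"
    )
```

Under these assumptions, reaching the data behind an external pointer costs 2400 ns of bus time at zero wait states and 3600 ns at two, before the operand itself is read. That overhead is what the post's advice about register pointer variables aims to avoid.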
mwm@ucbtopaz.CC.Berkeley.ARPA (02/17/85)
>Now if I could just find a 68k or 32016 system with that speed in
>the $5000 range, I'd have something to look forward to.

How many times does it need to be said? If you'll give up running OS's that are resource hogs (any post-v6 Unix), you can get a *lot* of power for a small price.

For instance: take any cheap Z80 CP/M system (<$1000), add the HSC 6MHz 68K system (~$1000) with OS/9 (~$500), plug 4 of the NS FPU chips (whatever they're calling them this week) (and yes, I said *four* of the beasts) into it (<$1400, and going down), and you've got mucho floating point power for less than $4000.

If you'd rather have character-mangling power, try an IBM PC compatible (<$2000), add the HSC 10MHz 68K card (~$1000) and OS/9 (~$500), and maybe one NS FPU ($350?). Of course, you may have to fly to Japan to get this one. Once more, the thing costs less than $4000.

And no, I don't work for HSC. I just like their hardware.

<mike