dave@enmasse.UUCP (Dave Brownell) (01/01/70)
In article <3233@nsc.UUCP> freund@nsc.UUCP (Bob Freund) writes:
> ... it is possible that a task be re-started on a different processor
> than the one that faulted.  If there is any difference in the
> micro-state between cpu revision levels, it could happen that the
> restart would fail due to incompatible micro-state.  Does Motorola
> guarantee micro-state compatibility across revision levels? ...

At least one company that I know of had this problem.  Using MC68010s,
the problem was that some internal state bits might not be set
identically; Motorola gave a few instructions to execute that would
guarantee that all the state bits were identical on all processors.
hull@hao.UUCP (Howard Hull) (07/08/85)
This is one of those "promised summary" things.  I asked how 68000 and
32016/32 users got along without a direct equivalent of the PDP 11's
MOV (SP)+,@(R5)+ for table oriented video updates.  I got three replies.

1.) Rich Altmaier, Integrated Office Systems

He supplied actual NS32 code.  I could not determine if this was only a
single indirect (I think that is the case), or was in effect the same
as the @(R5)+, in which R5 points to a location that contains what the
processor will interpret as an *address* for the destination data word.
This address is assumed to be part of a table of addresses of
modifiable video fields in the PDP11 implementation.

Method: Use movd with tos as the first operand, (r0) single indirect as
the second operand.

Surprise (to me anyway): tos evidently autoincrements!  (Yuh couldn't
tell that by a casual scan through the instruction set mnemonics.
Awards to National for doing such a good job hiding it!)  Nonetheless,
it does require two additional instructions to increment the
destination pointer in double precision.  But since the address range
of a typical PDP11 is so much smaller, only one would be needed for the
NS chip to match equivalent performance.  Convenient, but does take
more bytes on the NS chip than the PDP11.

2.) Andrew Klossner

He provided a method that would work well with a modularized data
structure (it did not, as near as I could tell, use a double indirect
form such as is implied in the @(R5)+ PDP11 instruction, either.)

Method: Use a single MOVSD (move string) to copy the entire table,
followed by an ADJSPx (adjust stack pointer) to pop the table off the
stack.  Moving data a word at a time is passe.

3.) Peter da Silva

He provided some commentary on the virtues of the Motorola 6809 8-bit
microprocessor in executing the "basic Forth loop", the little two-word
chunk of list-linking magic which has caught the notice of all serious
assembly language programmers of the last decade.
Evidently the 6809 is the last of the micros of this particular
mentality, grand as it may be.  Peter indicated that the 6809 can
manage the matter in two instructions, just as is found with the PDP11
execution.  Peter noted that the execution of this famous task on the
68000 was indeed somewhat more awkward.  Peter pointed out that the
68000 has a little less elegance with its instruction set than does the
PDP11, but that it is rather more efficient in its use of T-states.

Summary: It looks very much as though the influence of the profession
of Computer Science has had a definite impact on the former priorities
of hardware and software constraint.  No longer is it considered the
responsibility of the software designer to obtain maximum performance
from a given hardware configuration (e.g. a particular microprocessor)
but rather to obtain maximum economy of effort for the combination of
hardware and software (i.e., cut costs and time from conception to
market emplacement).  Under the circumstances, most programmers will
not worry about the detailed machine code at all, having most likely
not looked any further than the high-level-language compiler output.
Portability is more important than the cleverness of an implementation.
This reflects the fact that while software producers always have to
worry about competitors with more efficient implementations of their
code, most such competitors will not survive late arrival at market,
particularly with a more expensive product, or one that is more
difficult to document or maintain.

The method used to procure this new stance is to modularize tasks at
many levels.  In this case, such modularization is accomplished by
breaking the video map into several parts such that updating is
accomplished on one module at a time, rather than on the entire data
structure as segmented by a suitable table of addresses.  The double
indirect, a concept always difficult to teach to the neophytes, is now
declared dead.
For all practical purposes, it has been buried in the nested-line
formats of structured programming topology.
							Howard Hull
[If yet unproven concepts are outlawed in the range of discussion...
       ...Then only the deranged will discuss yet unproven concepts]
        {ucbvax!hplabs | allegra!nbires | harpo!seismo } !hao!hull
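The MOV (SP)+,@(R5)+ idiom Howard asked about - taking successive data
words and storing each through the next entry in a table of field
addresses - can be sketched in C.  The function and names below are
purely illustrative, not code from any of the replies:

```c
#include <stddef.h>

/* Update scattered video fields through a table of destination
 * addresses, the way MOV (SP)+,@(R5)+ would in a loop: data[i] is
 * stored *through* table[i], one word per field (double indirect). */
void update_fields(short *const *table, const short *data, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        *table[i] = data[i];    /* store through the tabled address */
}
```

The table of pointers plays the role of the @(R5)+ address table; the
data array plays the role of the words popped off the stack.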
jans@mako.UUCP (Jan Steinman) (07/09/85)
In article <1617@hao.UUCP> hull@hao.UUCP (Howard Hull) writes:
>The double indirect, a concept always difficult to teach to the
>neophytes, is now declared dead.  For all practical purposes, it has
>been buried in the nested-line formats of structured programming
>topology.

Not so!  Our NS32000 C compiler regularly produces memory indirect
references whenever a pointer is declared as an automatic variable:

	movd	0(4(fp)),rX	;Get the data pointed to by the data
				; pointed to by the frame pointer plus
				; four.

Although I'm not familiar with PDP11 assembly, it appears that National
lacks the general-purpose autoincrement, but adds the (more useful, in
my opinion) ability to offset from the first base before indirection,
and then you can even throw a scaled index on top of the final address
if you want, which means you no longer have to forget your base when
double-indirecting through a table.

While I agree that most applications are content with the output of a
compiler, writing tight assembly code, using all the features a machine
has to offer, is not dead.  I make heavy use of the NS32000 memory
indirect addressing mode, which is excellent for such things as
traversing linked-lists in dynamically typed languages.  (Get a pointer
to an object, look through the object for a terminator, etc.)  My only
wish is that Nati had given all the general purpose registers this
facility, so I wouldn't have to shuffle the SP or FP around so much!
-- 
:::::: Jan Steinman	Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::
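The C situation Jan describes is simply dereferencing a pointer that
itself lives in memory (in the frame): one reference relative to the
frame pointer fetches the pointer, a second reference goes through it -
which the 0(4(fp)) memory-indirect operand expresses in one step.  A
minimal illustration (the function name is invented):

```c
/* A pointer parameter sits at a positive offset off the frame
 * pointer; fetching what it points to takes two memory references,
 * which is exactly what an operand like 0(4(fp)) denotes on the
 * NS32000. */
int fetch_through(int *p)
{
    return *p;      /* load the pointer from the frame, then load
                     * through it */
}
```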
henry@utzoo.UUCP (Henry Spencer) (07/11/85)
> Method: Use a single MOVSD (move string) to copy the entire table,
> followed by an ADJSPx (adjust stack pointer) to pop the table off the
> stack.  Moving data a word at a time is passe.

Unfortunately, if you study the timings you will find that a
well-optimized (i.e. unrolled) word-at-a-time move loop is faster than
the string-move instructions on the current 32000 processors.  When I
queried my National rep about this, he admitted it.  The string
instructions are very slow on the 32016 and 32032; with any luck the
32332 will be better.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry
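The "well-optimized (i.e. unrolled) word-at-a-time move loop" Henry
mentions looks like this in C - a sketch of the technique, not timed
32000 code (the four-way unroll factor is arbitrary):

```c
#include <stddef.h>

/* Word-at-a-time copy with the loop unrolled four ways, the style
 * claimed to beat MOVSD on the 32016/32032: most iterations do four
 * stores per loop-overhead check, and the leftovers are finished one
 * at a time. */
void copy_words(int *dst, const int *src, size_t n)
{
    while (n >= 4) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        n -= 4;
    }
    while (n--)
        *dst++ = *src++;    /* remaining 0..3 words */
}
```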
gnu@sun.uucp (John Gilmore) (07/16/85)
Speaking of double-indirect, tektronix!tekecs!jans said:
> My only wish is that Nati had given all the general purpose registers
> this facility, so I wouldn't have to shuffle the SP or FP around so
> much!

Hmm.  You mean the totally orthogonal wonderful National architecture
has a few warts after all?  Lemme see, I seem to remember something
about bit fields can't lie across more than four bytes... e.g. you can
only put a 32-bit bitfield on a byte boundary.  Don't tell their
marketing folks...
jon@nsc.UUCP (Jon Ryshpan) (07/18/85)
In article <2422@sun.uucp> you write:
>Speaking of double-indirect, tektronix!tekecs!jans said:
>> My only wish is that Nati had given all the general purpose
>> registers this facility, so I wouldn't have to shuffle the SP or FP
>> around so much!
>
>Hmm.  You mean the totally orthogonal wonderful National architecture
>has a few warts after all?

The 32000 does (in fact) have some warts, but in my humble opinion this
isn't one of them.  The 320xx is not a register machine like the PDP11
or some other well known processors.  It's a p-machine with some
registers added to it for extra speed.  You don't expect to have
symmetry between the registers that make up the p-machine and the other
registers that speed it up.

The basic machine architecture uses these principal registers:

	FP - The Frame Pointer	: These define the current
	SP - The Stack Pointer	: activation record
	SB - The Static Base	: Pointer to own variables
	     Link Base		: Linkage to external procedures

The machine would run perfectly happily without any more registers; but
it would be slow, because all data references would be to main memory.
So some additional registers were added to allow for on-chip data
storage.  These registers have addressing modes the same as a main
memory location, when addressed via one of the principal registers (FP,
SP, or SB).  The addressing modes are:

    o	Y(FP) - Direct addressing: The contents of the location Y
	bytes above the Frame Pointer

    o	X(Y(FP)) - Displaced addressing: The contents of the location
	X bytes above the location that the location Y bytes above the
	frame pointer points to.  (Same for SP and SB.)

The register modes are:

    o	R0 - Register value: The contents of register R0

    o	X(R0) - Register displaced addressing: The contents of the
	location X bytes above the location that R0 points to.  (Same
	for R1..R7)

If you move something from memory to register, you have exactly the
same access capability to it that you had when it was in memory, no
more, no less.
This is what we mean by symmetry.  (There's more, but this is the most
important part.)

One of the most important parts of processor architecture is the
struggle for bits - you want to be able to say as much as you can with
the least number of bits.  The 32000 gets away with a 5 bit mode field,
which does about as much as the 6 bit mode/address field in another
well known processor or the 8 bit mode/address field in some other well
known processors.
-- 
Thanks - Jonathan Ryshpan	{cbosgd,fortune,hplabs,ihnp4,seismo}!nsc!jon
				nsc!jon@decwrl.ARPA

Let justice be done though the heavens fall.
jans@mako.UUCP (Jan Steinman) (07/18/85)
In article <2422@sun.uucp> gnu@sun.uucp (John Gilmore) writes:
>>(me, as quoted by John Gilmore)
>> My only wish is that Nati had given all the general purpose
>> registers this facility, so I wouldn't have to shuffle the SP or FP
>> around so much!
>(John Gilmore)
>Lemme see, I seem to remember something about bit fields can't lie
>across more than four bytes... e.g. you can only put a 32-bit bitfield
>on a byte boundary.

Well since you're from Sun, John, you're probably used to the 68000.
Now how many bits can its bit fields span?  No kidding!  It doesn't
even have bit fields, you say?  (Back to the old shift-and-count loops,
I guess.)

While we're on bits and orthogonality, how many bits can a generic,
single bit operation span on the 68000?  Eight, you say?  On the 32000
it is unlimited: i.e. setting bit 33 at a given pointer sets bit one,
four bytes away from the pointer.  This makes bit-mapped code a breeze
to write.  By the way, a 32 bit "bitfield" (in the sense of an atomic
quantity) on the 68000 must be on a WORD boundary.

No processor is without its warts, but your jibe is grasping at straws.
I suspect the 32 bit restriction makes the MMU simpler.  Since 31 bits
can enjoy arbitrary alignment, and 32 bits are usually an integer, this
doesn't bother me too much.  And the 32000 is still the most enjoyable
processor I've ever written assembly code for.
-- 
:::::: Jan Steinman	Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::
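The unlimited-span bit addressing Jan describes - "setting bit 33 at a
given pointer sets bit one, four bytes away" - is easy to express in C.
A sketch of the semantics (the helper name is invented; this is not the
32000 SBITD microcode, just the arithmetic it implies):

```c
/* Set bit <n> counting upward from a byte pointer, for any n:
 * byte index is n/8, bit-within-byte is n%8.  Bit 33 therefore
 * lands on bit 1 of the byte four bytes past the pointer. */
void set_bit(unsigned char *base, unsigned long n)
{
    base[n >> 3] |= (unsigned char)(1u << (n & 7));
}
```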
guy@sun.uucp (Guy Harris) (07/28/85)
> The 320xx is not a register machine like the PDP11 or some other well
> known processors.  It's a p-machine with some registers added to it
> for extra speed.  You don't expect to have symmetry between the
> registers that make up the p-machine and the other registers that
> speed it up.

What???  I've read the 32xxx data sheet and the machine looks *far*
more like a "warmed-over VAX" (which is no slur, considering how many
machines out there are just warmed-over PDP-11s or warmed-over VAXes)
than a "p-machine" (by "p-machine" do you mean "p-code engine"?).  Yes,
you can use it as a stack machine (make both operands use the TOS
addressing mode) - but then you can do the *exact same thing* on "the
PDP11 (and) some other well known processors" (such as the VAX).  How
many 32xxx instructions use TOS for both operands, and how many use a
register, or register relative/memory space, or...?

If the majority do NOT use TOS, I submit that the 32xxx is not a
"p-machine" in the sense of an engine intended to run P-code, but
instead a VAXish register machine with some addressing modes added to
make stack-oriented expression evaluation slightly simpler (although,
given that any PDP-11 or VAX instruction with two general-addressing-mode
operands can be made to act like a stack machine instruction, I don't
think they even make it any simpler).

	Guy Harris
jans@mako.UUCP (Jan Steinman) (07/31/85)
In article <2506@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>> The 320xx is not a register machine like the PDP11 or some other
>> well known processors.  It's a p-machine with some registers added...
>
>Yes, you can use it as a stack machine (make both operands use the TOS
>addressing mode)...  How many 32xxx instructions use TOS for both
>operands, and how many use a register, or register relative/memory
>space, or...?

Any instruction that uses "general" addressing could care less if the
operands are "TOS..., register, or register relative/memory space,
or..."  Do you have the "Instruction Set Reference Manual"?  Have you
looked at the instructions?

>If the majority do NOT use TOS, I submit that the 32xxx is not a
>"p-machine" in the sense of an engine intended to run P-code, but
>instead a VAXish register machine with some addressing modes added to
>make stack-oriented expression evaluation slightly simpler...

It is easier to count the instructions that do not use TOS for one
operand.  A quick perusal shows "quick" (embedded constant operand),
branches (would anyone want a PC displacement on the stack, anyway?),
block moves and compares, processor directives (ENTER, EXIT, RET,
SETCFG, etc.), and processor register (LPRi, SPRi) instructions.  All
the "mainstream" operations can have any general addressing mode for
either operand.

What's more, Nati has paid a great deal of attention to TOS access
classes and addressing speed.  TOS is the fastest memory addressing (if
the bus could only keep up!) and the SP behaves in a reasonable way,
depending on the access class:

	addd	tos,tos		first operand is popped (SP incremented),
	rd	rmw		second is only modified (SP unchanged)

	jump	tos		tos -> PC (SP unchanged)
	addr

	negd	tos,tos		first operand is popped (SP incremented),
	rd	wr		second operand is pushed (SP decremented)

	acbd	-1,tos,label	operand is decremented, loop until
	q	rmw	disp	operand reaches zero.  (SP unchanged)

As to Nati TOS usefulness to the mythical "p-machine", I'm working on a
Z80 emulator that uses the 32032 SP as the Z80 PC.  Another thing we're
exploring is using the SP as a Smalltalk virtual stack pointer.
-- 
:::::: Jan Steinman	Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::
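The access-class behaviour in Jan's table can be modelled with an
explicit evaluation stack in C: a read (rd) operand is popped, a
read-modify-write (rmw) operand is updated in place, a write (wr)
operand is pushed.  This is a sketch of the *semantics* only, not the
32000 microarchitecture; all names are invented:

```c
/* A small downward-growing stack, like the NS32000 user stack. */
#define STKSZ 64
static int stk[STKSZ];
static int sp = STKSZ;          /* empty descending stack */

static void push(int v) { stk[--sp] = v; }
static int  pop(void)   { return stk[sp++]; }

/* addd tos,tos: first operand popped (rd), second modified in
 * place (rmw) -- net effect, two values become their sum. */
void addd_tos_tos(void)
{
    int src = pop();
    stk[sp] += src;             /* rmw: no further SP change */
}

/* negd tos,tos: operand popped (rd), result pushed (wr). */
void negd_tos_tos(void)
{
    int src = pop();
    push(-src);
}
```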
guy@sun.uucp (Guy Harris) (08/02/85)
> >Yes, you can use it as a stack machine (make both operands use the
> >TOS addressing mode)...  How many 32xxx instructions use TOS for
> >both operands, and how many use a register, or register
> >relative/memory space, or...?
>
> Any instruction that uses "general" addressing could care less if the
> operands are "TOS..., register, or register relative/memory space,
> or..."

Yes, I already knew that.  *That was my entire point.*  Since a large
list of instructions use "general" addressing for both their operands,
and since that means they can all use all the aforementioned addressing
modes, I submit that the NS32xxx isn't a "p-machine".  If it *is* a
"p-machine", so is the CCI Power 6/32; it has auto-increment SP and
auto-decrement SP addressing modes, and you could easily have the
assembler accept a TOS addressing mode and generate (sp)+, -(sp), or
(sp) addressing modes for it.  Somehow, I don't think removing all
auto-incrementing or auto-decrementing addressing modes from a
machine's instruction set makes it a "p-machine", though.

> It is easier to count the instructions that do not use TOS for one
> operand.  A quick perusal shows <list of instructions>.  All the
> "mainstream" operations can have any general addressing mode for
> either operand.

My point exactly.  When I said "how many 32xxx instructions use...", I
didn't mean "how many instructions as listed in the 'Instruction Set
Reference Manual' use...", I meant "how many instructions as written by
assembly-language programmers and as generated by compilers use..."  I
don't have the "Instruction Set Reference Manual"; will "NS16032S-6,
NS16032-4 High Performance Microprocessors (Preliminary - November
1982)" do?
> ...and the SP behaves in a reasonable way, depending on the access
> class:
>
>	addd	tos,tos		first operand is popped (SP incremented),
>	rd	rmw		second is only modified (SP unchanged)
>
>	jump	tos		tos -> PC (SP unchanged)
>	addr
>
>	negd	tos,tos		first operand is popped (SP incremented),
>	rd	wr		second operand is pushed (SP decremented)
>
>	acbd	-1,tos,label	operand is decremented, loop until
>	q	rmw	disp	operand reaches zero.  (SP unchanged)

As I pointed out, a VAX or Power 6 assembler could do that too.  (The
PDP-11 and M68000 don't have enough two-operand instructions with both
operands specified by general addressing modes to make this
worthwhile.)  Turn "tos" into (sp) for most one-operand instructions,
turn the first "tos" into (sp)+ and the second into (sp) for
two-operand instructions that fetch both operands, and the first into
(sp)+ and the second into -(sp) for two-operand instructions that fetch
only the first operand.  (Punt the 3-operand instructions.)

> As to Nati TOS usefulness to the mythical "p-machine", I'm working on
> a Z80 emulator that uses the 32032 SP as the Z80 PC.  Another thing
> we're exploring is using the SP as a Smalltalk virtual stack pointer.

What does this have to do with "Nati TOS usefulness to the mythical
'p-machine'"?  I have no idea what the person had in mind when he
called the 32xxx a "p-machine".  The logical assumption is that he
meant "p-code engine"; P-code engines are generally stack machines,
which the NS32xxx is *not*, any more than the VAX is.  You can treat it
as a stack machine, but you don't *have* to (and probably don't want
to; it'll run faster as a general-register machine).  If you are merely
referring to architectural features that make certain bits of coding
work nicely, then the PDP-11 is a "p-machine" in that sense - note the
use of "jmp @(r4)+" for threaded code.
I presume the TOS addressing mode is useful for the Z80 simulator
because it's the only addressing mode that does any sort of
auto-incrementing of a register - i.e., instruction fetch is done with
"movw tos,<instruction register>".  You can do the same on machines
like the PDP-11, VAX, and M68000 by doing

	move	(reg)+,<instruction register>

where "reg" is a register chosen for use as the simulator's PC.

	Guy Harris
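The fetch-and-bump idiom both posters are describing - a virtual PC
that post-increments on every instruction fetch - is the core of any
such simulator.  A toy version in C (the opcodes and names here are
invented for illustration, not Z80 or 32032 encodings):

```c
/* A three-opcode toy machine: the virtual PC is a pointer that is
 * post-incremented on each fetch, which is what "movw tos,..." or
 * "move (reg)+,..." does in one instruction in hardware. */
enum { OP_HALT, OP_INC, OP_DEC };

int run(const unsigned char *vpc)
{
    int acc = 0;
    for (;;) {
        unsigned char op = *vpc++;      /* fetch, then bump the PC */
        switch (op) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
}
```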
steveg@hammer.UUCP (Steve Glaser) (08/06/85)
The "p-machine" garbage for the 32xxx was probably just early marketing
hype.  It's real difficult to sell anything in a big company unless you
hang it on something familiar.  The "p-machine" is an old system that
was probably familiar to somebody in charge at the time.

Remember that there was a chip known as the 16016 in the family.  That
was a 16032 (aka 32016 nowadays) with Intel 8080 emulation mode added
(not Z-80, just 8080).  (Gee, you could write a CP/M emulator.)  This
should give you some insight into their thinking at the time.  (No, I
wasn't there, but I was an early user of the chipset.)

As for eliminating auto +/- addressing mode, I support that decision.
Given their decision to "back out" instructions that get page faults
rather than dump out the internal microstate like the 68010, National
would have to keep shadow copies of too much internal stuff around in
case a page fault came through.  That's a big hassle and takes chip
real estate.  I think having full memory to memory addressing is more
useful than auto +/-, especially for compiler generated code.  (Well,
maybe not for pcc -- its model seems to be put something in a register,
munch on it, put it back in memory.)

	Steve Glaser
	steveg.tektronix@csnet-relay.arpa or tektronix!steveg
guy@sun.uucp (Guy Harris) (08/08/85)
> The "p-machine" garbage for the 32xxx was probably just early
> marketing hype.

Which means "but it's not a general register machine, it's a
p-machine!" isn't a legitimate reason why the 32xxx's SP, etc. aren't
general registers - which is what the person replying to John Gilmore
said.  (There may be legitimate reasons, but "it's a p-machine" isn't
one of them - because it isn't a p-machine.)

> As for eliminating auto +/- addressing mode, I support that decision.

Not knowing what the exact tradeoffs were, I neither support it nor
oppose it.  The P6/32 doesn't have auto-I/D except on the SP, and it
seems not to have suffered *too* much in performance :-) (~4-7x 11/780
isn't too bad, especially for a TTL machine with an instruction set
with a fair fraction of the VAX's complexity).

It may also be easier to do pipelining if fewer of the addressing modes
have side-effects - you don't have to worry about the (r4)+ two
pipeline stages behind screwing up your movl r4,<location> (or, if you
have multiple copies of the general register set, having to worry about
propagating the change from the auto-increment to the instruction-unit
copy of r4 forward to the execution-unit copy at the right time).

> Given their decision to "back out" instructions that get page faults
> rather than dump out the internal microstate like the 68010, National
> would have to keep shadow copies of too much internal stuff around in
> case a page fault came through.

Well, maybe.  Returning to the original topic, as described by the
subject - the PDP-11 can only modify a maximum of two registers during
the operand preparation, so some models have (or have what amounts to)
a register which remembers the register numbers of the two registers
modified and the amount added to or subtracted from them.  When you
take a fault, the fault handler saves the contents of this register
(which, presumably, freezes until read) and uses it to back up the
faulting instruction.
(This backup could also be mostly simulated in software - see the
routine "backup" in the assembler language support code for UNIX on
PDP-11s lacking this register - 11/40, 11/34, 11/23, 11/60...).

> I think having full memory to memory addressing is more useful than
> auto +/-, especially for compiler generated code.  (Well, maybe not
> for pcc -- its model seems to be put something in a register, munch
> on it, put it back in memory.)

Well, if the RISC people are correct, neither of them is necessarily
useful.  One problem with having auto-I/D modes is you set up your
language to use them; then, when you compile code tuned for machines
with auto-I/D on machines which don't have it, you get code that's not
as good as would be generated by a more straightforward coding.  Also,
at least if you use things like "+=", I think PCC will make use of
memory-to-memory modes in simple expressions; if the expression is more
complicated, it's probably faster to do it in a register anyway.

	Guy Harris
peter@kitty.UUCP (Peter DaSilva) (08/08/85)
> As for eliminating auto +/- addressing mode, I support that decision.
> Given their decision to "back out" instructions that get page faults
> rather than dump out the internal microstate like the 68010, National

Any particular reason to do this rather than restart the instruction
from where it left off?  I hadn't heard of this approach... what does
the Vax do?  What are the tradeoffs?

> would have to keep shadow copies of too much internal stuff around in
> case a page fault came through.  That's a big hassle and takes chip
> real estate.  I think having full memory to memory addressing is more
> useful than auto +/-, especially for compiler generated code.  (well
> maybe not for pcc -- it's model seems to be put something in a
> register, munch on it, put it back in memory).

	main(ac, av)
	register int ac;
	register char **av;
	{
		while(*++av) {
			...
		}
	}

Not an uncommon construction in 'C'.
kds@intelca.UUCP (Ken Shoemaker) (08/08/85)
> It may also be easier to do pipelining if fewer of the addressing
> modes have side-effects - you don't have to worry about the (r4)+ two
> pipeline stages behind screwing up your movl r4,<location> (or, if
> you have multiple copies of the general register set, having to worry
> about propagating the change from the auto-increment to the
> instruction-unit copy of r4 forward to the execution-unit copy at the
> right time).

On the subject of side effects and pipelining, has anyone thought of
the problems of treating the pc as a general register (with
autoincrement, etc.) at the same time as you added some level of
prefetch?  This would seem to me to get very ugly, having to keep track
of things in the prefetch buffer whenever you address/adjust off the
pc.  Indeed, this could limit the amount of instruction
pre-processing/cracking you could do (or dramatically increase the
amount of logic that is required).  Any solutions besides punting?
-- 
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm
{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds

---the above views are personal.  They may not represent those of the
employer of its submitter.
guy@sun.uucp (Guy Harris) (08/11/85)
> > As for eliminating auto +/- addressing mode, I support that
> > decision.  Given their decision to "back out" instructions that get
> > page faults rather than dump out the internal microstate like the
> > 68010, National
>
> Any particular reason to do this rather than restart the instruction
> from where it left off?

Less internal state to dump?  (Which means less microcode/whatever to
do the dumping and restoring, and less code in the kernel to check that
the state, if accessible to the user, hasn't been tampered with.)

> I hadn't heard of this approach... what does the Vax do?  What are
> the tradeoffs?

The VAX has a "first part done" bit in the PSL.  Presumably,
instructions which have side-effects, and where a page fault can occur
after the side-effect occurs, set the "first part done" bit after the
side-effects occur.  This is a simpler version of the "dump the
internal microstate" model.  The PDP-11 (at least the ones with the
fancier MMUs) backs out the instruction in software - it dumps the
numbers of the registers which have been auto-incremented or
auto-decremented, and the amount they've been auto-incremented or
auto-decremented by, into a register which is used by the trap handler
to actually back the auto-I/D out.

Some VAX instructions, like the string instructions, require more hair.
In that case they hijack several registers and store the current
pointers and lengths into them if the instruction takes a fault and is
interrupted.  Presumably the FPD bit says that the pointers and lengths
should be taken from the registers instead of from the instruction's
operands.

> > I think having full memory to memory addressing is more useful than
> > auto +/-, especially for compiler generated code.  (well maybe not
> > for pcc -- it's model seems to be put something in a register,
> > munch on it, put it back in memory).
>
>	register char **av;
>	{
>		while(*++av) {
>
> Not an uncommon construction in 'C'.
Not uncommon, but it doesn't generate autoincrement code on the PDP-11, VAX, or M68000, because they all have only predecrement and postincrement modes - this construct would require a preincrement mode. Do any machines have preincrement addressing modes? Guy Harris
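In C terms, the distinction Guy is drawing is between *p++ (read, then
advance - which maps directly onto a postincrement mode like (r)+) and
*++p (advance, then read - which the PDP-11, VAX, and M68000 must
synthesize with a separate add).  A small demonstration (function names
are invented):

```c
/* *p++ reads through the pointer and then advances it: one
 * postincrement operand on machines that have the mode. */
int post_read(int **pp)
{
    return *(*pp)++;
}

/* *++p advances the pointer and then reads through it: no single
 * addressing mode on the machines under discussion does this. */
int pre_read(int **pp)
{
    return *++(*pp);
}
```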
henry@utzoo.UUCP (Henry Spencer) (08/14/85)
> > Any particular reason to do this rather than restart the
> > instruction from where it left off?
>
> Less internal state to dump?  (Which means less microcode/whatever to
> do the dumping and restoring, and less code in the kernel to check
> that the state, if accessible to the user, hasn't been tampered
> with.)

Motorola obviously :-) views its 68020 line primarily as a way to sell
memory chips.  Between the incredible pile of trash it heaves onto the
stack when you take a page fault, and the huge internal state of the
68881 FPU that has to be shoveled in and out every time you
context-switch (what's the betting Motorola's next FPU chip has DMA?
:-), the memory market is clearly what they're aiming at.  That and the
cache market.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry
peter@baylor.UUCP (Peter da Silva) (08/15/85)
> Not uncommon, but it doesn't generate autoincrement code on the
> PDP-11, VAX, or M68000, because they all have only predecrement and
> postincrement modes - this construct would require a preincrement
> mode.  Do any machines have preincrement addressing modes?

OK.  Bad example.  How about:

	strcpy(to, from)
	register char *to, *from;
	{
		char *hold = to;

		while(*to++ = *from++)
			continue;
		return hold;
	}
-- 
	Peter da Silva (the mad Australian)
	UUCP: ...!shell!neuro1!{hyd-ptd,baylor,datafac}!peter
	MCI: PDASILVA; CIS: 70216,1076
davet@oakhill.UUCP (Dave Trissel) (08/17/85)
In article <5874@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> > Any particular reason to do this rather than restart the
>> > instruction from where it left off?
>
>Motorola obviously :-) views its 68020 line primarily as a way to sell
>memory chips.  Between the incredible pile of trash it heaves onto the
>stack when you take a page fault, and the huge internal state of the
>68881 FPU that has to be shoveled in and out every time you
>context-switch (what's the betting Motorola's next FPU chip has DMA?
>:-), the memory market is clearly what they're aiming at.  That and
>the cache market.

What you don't realize is the amazing performance we can get because of
the "incredible pile of trash" we heave on the stack.

The crux of the problem is that chips which have to back up and redo
instructions pay a nasty penalty in pipeline design.  Consider the
following generic microprocessor code sequence:

	MOVE	something to memory
	SHIFT	Reg by immediate
	MUL	Reg to Reg
	etc.

The MC68020 executes the MOVE and the bus unit schedules a write cycle.
Then the execution unit/pipeline happily continues executing the
instruction stream without regard to the final status of the write.
Even if the write fails (bus errors) there could be several more
instructions executed (in fact any amount until one is hit which
requires the bus again.)

Contrast this to chips which redo instructions.  They must soon stop
dead in their tracks until the write cycle has been verified as
properly done.  Otherwise they would alter the programmer's model and
invalidate retry.

Another thing to consider is that the total operating system code
executed to continue from a page fault (assign an unused page frame and
map it in the MMU, or block the process and schedule a swapped out page
to be read) makes the overhead of writing the internal 020 machine
state seem insignificant.  The stack save equates to about the same
overhead as executing 12 instructions.
Concerning floating-point state saves, we gave a lot of thought to
minimizing latency times.  What we did was give an indication to the OS
of whether any of the FP registers had been used.  If not, the OS could
skip the context save and restore completely.

Intel has a novel approach on their 8087 and 80287 where they let the
process context switch without saving FP state.  If another process
tries using floating-point, an interrupt occurs, letting the OS then
swap context only when necessary.  The trouble with this technique is
that all it takes is for one out of every 20 or so context switches to
require a re-save and you start losing overall processor time over just
saving it unconditionally.  At worst, if you have several processes
constantly sharing the FP chip then you have essentially forced a
complete extra interrupt exception invocation for every change in
context - a massive penalty.

One solution would be to keep multiple contexts on chip.  Ah - if we
only had next decade's technology today.  Lots of exciting things are
going to happen once we can get millions of gates on a single chip
running at 70 MHz.
-- 
Dave Trissel
Motorola Semiconductor Inc., Austin, Texas
{seismo,ihnp4}!ut-sally!oakhill!davet
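The lazy FP context switch Dave attributes to Intel can be sketched in
C: on a context switch the kernel just marks the FP unit "not owned";
the first FP use by a different task traps, and only then is the old
context saved and the new one loaded.  Everything below is a simplified
illustration with invented names, not any actual kernel's code:

```c
#include <string.h>

#define FPSTATE 8                       /* registers per task, arbitrary */

struct task {
    double fp[FPSTATE];                 /* saved FP context */
};

static struct task *fp_owner;           /* whose state is live in the unit */
static double fp_chip[FPSTATE];         /* the "FP register file" */

/* Called from the "FP not available" trap on a task's first FP use
 * after a context switch.  Saves the previous owner's registers and
 * loads the new owner's, deferring all copying until it is needed. */
void fp_trap(struct task *current)
{
    if (fp_owner)
        memcpy(fp_owner->fp, fp_chip, sizeof fp_chip);
    memcpy(fp_chip, current->fp, sizeof fp_chip);
    fp_owner = current;
}
```

As the post notes, the win evaporates if nearly every switch ends up
trapping anyway, since each trap adds a full exception entry and exit
on top of the save it was meant to avoid.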
tmb@talcott.UUCP (Thomas M. Breuel) (08/17/85)
In article <492@oakhill.UUCP>, davet@oakhill.UUCP (Dave Trissel) writes:
|>Motorola obviously :-) views its 68020 line primarily as a way to sell
|>memory chips.  Between the incredible pile of trash it heaves onto the
|>stack when you take a page fault, and the huge internal state of the
|>68881 FPU that has to be shoveled in and out every time you context-switch
|>(what's the betting Motorola's next FPU chip has DMA? :-), the memory
|>market is clearly what they're aiming at.  That and the cache market.
|What you don't realize is the amazing performance we can get because of the
|"incredible pile of trash" we heave on the stack.
|
|The crux of the problem is that chips which have to back-up and redo
|instructions pay a nasty penalty in pipeline design.  Consider the following
|generic microprocessor code sequence:
|
|	MOVE	something to memory
|	SHIFT	Reg by immediate
|	MUL	Reg to Reg
|	etc.
|
|The MC68020 executes the MOVE and the bus unit schedules a write cycle.  Then
|the execution unit/pipeline happily continues executing the instruction
|stream without regard to the final status of the write.  Even if the write
|fails (bus errors) there could be several more instructions executed (in fact
|any amount until one is hit which requires the bus again.)

I find this argument amusing.  You just generated a page fault.  That means
context switch, disk driver, housekeeping, ... .  Compared to all this, the
overhead of your instruction re-start is going to be negligible no matter how
inefficiently you do it.

In addition, I tend not to believe that what you gain in cache performance
makes up for the time required to push a lot onto the stack.  Cache
performance is going to increase in the way you describe it on writes only
anyhow, since if you get a page fault on a read (which is probably the more
common case) you have to wait for the page to be brought in no matter what.
Finally, the thought of having a page fault pending and the CPU happily executing more instructions before the fault is serviced somehow worries me. It may play havoc with simple-minded process synchronisation techniques. Altogether, I don't buy that the 68020 gets 'amazing performance' because it pushes of the order of 20 longwords onto the stack every time it gets a page fault. Thomas.
davet@oakhill.UUCP (Dave Trissel) (08/19/85)
In article <489@talcott.UUCP> tmb@talcott.UUCP (Thomas M. Breuel) writes:
>|
>|	MOVE	something to memory
>|	SHIFT	Reg by immediate
>|	MUL	Reg to Reg
>|	etc.
>|
>|The MC68020 executes the MOVE and the bus unit schedules a write cycle.  Then
>|the execution unit/pipeline happily continues executing the instruction
>|stream without regard to the final status of the write.  Even if the write
>|fails (bus errors) there could be several more instructions executed (in fact
>|any amount until one is hit which requires the bus again.)
>
>I find this argument amusing.  You just generated a page fault.  That
>means context switch, disk driver, housekeeping, ... .  Compared to all
>this, the overhead of your instruction re-start is going to be
>negligible no matter how inefficiently you do it.

You are not getting the point - maybe I did not make it that clear.  Most of
the time instructions execute without a page fault.  The problem is that
microprocessors which back up and redo instructions must ALWAYS halt when a
write is done onto the bus, because there just may possibly be a bus fault
even though there almost always isn't.  The '020 pipeline only halts for
memory operand reads, changes in supervisor state, or locked bus cycle
instructions like TAS and CAS.

Probably the '020 bus averages somewhere around 30 percent write-type cycles.
This means there are many chances for this overlap to increase performance.
The overlap the '020 gains depends on how far along the pipeline can crunch
before another bus cycle is needed.  With a 256-byte cache and a large number
of work registers (15), there is a large percentage of the time that one, two
or more instructions can be executed while a write is being done.  Even if
the next instruction requires an operand read or write from the bus and
therefore stops the pipe there, at least an overlap of instruction decoding
and queueing of another bus cycle is accomplished before the halt.
>In addition, I tend not to believe that what you gain in cache
>performance makes up for the time required to push a lot onto the
>stack.

For the one to three million instructions the '020 may be executing each
second, the 24 extra longwords saved and restored over a bus fault (which
occurs anywhere from zero to, let's say, 10 times a second) don't really make
any difference.

>Cache performance is going to increase in the way you describe
>it on writes only anyhow, since if you get a page fault on a read
>(which is probably the more common case) you have to wait for the
>page to be brought in no matter what.

Maybe I didn't make it clear that I was talking about the majority of the
time, when you don't have a bus fault.  Yes, any operand read from memory
will lock the pipe, since obviously it cannot continue regardless of whether
a bus fault is going to occur or not.

>Finally, the thought of having a page fault pending and the CPU
>happily executing more instructions before the fault is serviced
>somehow worries me.  It may play havoc with simple-minded process
>synchronisation techniques.
>

There are some side-effects, but they don't occur for synchronisation since,
as I mentioned earlier, for semaphore and lock operations the pipe does not
forge ahead.  The side-effects are subtle and relate mostly to exception
handling and asynchronous exit invocations by the OS.  That's the small
penalty you pay for getting higher performance.  Any advanced pipeline
mechanism is going to be executing ahead, whether you're on the '020 or a
supercomputer.

>Altogether, I don't buy that the 68020 gets 'amazing performance'
>because it pushes of the order of 20 longwords onto the stack every
>time it gets a page fault.

The way to tell is simply to look at some assembly code and follow the
instructions after operand writes.  A pretty good estimate can be gotten
from this method.
And remember, even if the very next instruction after a write forces a bus access the '020 pipeline can progress up to the point of that bus cycle request before it halts. -- Dave Trissel Motorola Semiconductor {seismo,ihnp4}!ut-sally!oakhill!davet Austin, Texas
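[Editor's sketch] The overlap being argued about in the preceding posts can
be made concrete with a toy model.  This illustrates the general idea only,
not Motorola's actual microarchitecture; all names are invented:

```python
# Toy model of a posted write: instructions that don't touch the bus keep
# executing while the write completes, and only the next bus access stalls
# the pipe.  A "restart"-style design would instead stall after every write.
WRITE, READ, ALU = "write", "read", "alu"

def free_instructions(trace):
    """Count instructions executed 'for free' behind each posted write."""
    free, i = 0, 0
    while i < len(trace):
        if trace[i] == WRITE:
            i += 1
            while i < len(trace) and trace[i] == ALU:
                free += 1          # register-only work overlaps the write
                i += 1
        else:
            i += 1
    return free
```

Dave's example (MOVE to memory, then register-only SHIFT and MUL, then an
instruction needing the bus) gives two overlapped instructions per write.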
henry@utzoo.UUCP (Henry Spencer) (08/21/85)
The following two Dave Trissel quotes are from the same message:

> [we can continue other instructions in parallel] ... Even if the write
> fails (bus errors) there could be several more instructions executed (in fact
> any amount until one is hit which requires the bus again.)

> The stack save equates to about the same overhead as executing 12
> instructions.

In other words, all you need is 12 contiguous non-memory-referencing
instructions and the 68020's stack puke will actually break even!  This is
stretching it a bit, since on the pdp11 typically every third or fourth
instruction did some sort of memory reference; I doubt that the 68000 family
does much better.  Speaking of the 68000 *family*, note that a 68010 gets the
full performance hit every time, since it doesn't pipeline much.

On the other hand, I'm glad to hear that Motorola did have the sense to put a
floating-point-used flag in the FPU, so at least you don't have to shovel 300
bytes of state around unnecessarily.

> Intel has a novel approach on their 8087 and 80287 where they let the process
> context switch without saving FP state.  If another process tries using
> floating-point an interrupt occurs letting the OS then swap context only
> when necessary.  The trouble with this technique is that all it takes is
> for one out of every 20 or so context switches to require a re-save and you
> start losing overall processor time over just saving it unconditionally.
> At worst, if you have several processes constantly sharing the FP chip then
> you have essentially forced a complete extra interrupt exception invocation
> for every change in context - a massive penalty.

An interesting possibility would be to have the hardware support *both* an
FPU-used flag *and* a trap-on-first-FPU-use bit.  It would not seem too
difficult to set up some code in the kernel that switches between the two
strategies as a function of the number of FPU context switches that have
occurred lately.
-- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
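[Editor's sketch] Henry's adaptive suggestion could look something like this
inside a kernel.  The window size and threshold below are invented for
illustration:

```python
# Track how many of the last N context switches actually needed an FP state
# reload.  While reloads are rare, lazy trap-on-first-use wins; once the FPU
# is "hot", switch to eagerly saving state on every context switch.
from collections import deque

class FPSavePolicy:
    def __init__(self, window=20, threshold=0.05):
        self.history = deque(maxlen=window)   # 1 if the switch needed a reload
        self.threshold = threshold

    def record(self, needed_fp_reload):
        self.history.append(1 if needed_fp_reload else 0)

    def use_lazy(self):
        """Trap-on-first-use only pays off while FP reloads stay rare."""
        if not self.history:
            return True
        return sum(self.history) / len(self.history) < self.threshold
```

The kernel would consult use_lazy() at each switch to decide whether to save
FP state now or arm the trap-on-first-use bit instead.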
thomson@uthub.UUCP (Brian Thomson) (08/21/85)
Dave Trissel (davet@oakhill.UUCP) brags about the Motorola FPU: >Concerning floating-point state saves we gave a lot of thought to minimizing >latency times. What we did was give an indication to the OS of whether any >of the FP registers had been used. If not, the OS could skip the context >save and restore completely. Careful reading of the specs for the National FPU, plus a little experimenting, shows that they also provided this capability, though their implementation has the look of being a fortuitous accident (hint: check the behaviour of the Trap Type field of the Floating Status Register). Unfortunately, whoever wrote the documentation made no mention of this use, which suggests that they don't realize what they have and are in danger of making it not work in future releases of the hardware. -- Brian Thomson, CSRI Univ. of Toronto {linus,ihnp4,uw-beaver,floyd,utzoo}!utcsrgv!uthub!thomson
henry@utzoo.UUCP (Henry Spencer) (08/22/85)
> [discussion about saving floating-point-unit state only when needed] > Careful reading of the specs for the National FPU, plus a little experimenting, > shows that they also provided this capability, though their implementation > has the look of being a fortuitous accident (hint: check the behaviour of the > Trap Type field of the Floating Status Register). > Unfortunately, whoever wrote the documentation made no mention of this > use, which suggests that they don't realize what they have and are in danger > of making it not work in future releases of the hardware. It should also be possible to get a similar effect by using the SETCFG instruction to tell the cpu "no floating point", which will produce a trap when the user tries to use floating point. Save the state and then turn floating point on again. When I asked the local National man about this, he said it would work. Beware that I have *not* tried it on real hardware yet. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
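[Editor's sketch] In pseudo-OS form, the SETCFG trick Henry describes amounts
to lazy FPU switching.  Every name below is invented; this is a sketch of
the scheme, not NS or UNIX source:

```python
# On every context switch, mark the FPU "absent" (the SETCFG step).  The
# first FP instruction by the new task then traps, and only that trap
# handler pays for the save/restore of FP state.
class LazyFPU:
    def __init__(self):
        self.enabled = False
        self.owner = None          # task whose registers are on the chip
        self.regs = {}

    def context_switch(self, new_task):
        self.enabled = False       # SETCFG "no floating point"

    def fp_trap(self, current_task, saved_states):
        # First FP use by current_task since the last context switch.
        if self.owner != current_task:
            if self.owner is not None:
                saved_states[self.owner] = dict(self.regs)   # save old owner
            self.regs = saved_states.get(current_task, {})   # reload ours
        self.owner = current_task
        self.enabled = True        # SETCFG "floating point present" again
```

Tasks that never touch floating point never trap, so they never pay for the
state shuffle at all.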
davet@oakhill.UUCP (Dave Trissel) (08/24/85)
In article <5890@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>In other words, all you need is 12 contiguous non-memory-referencing
>instructions and the 68020's stack puke will actually break even!  This is
>stretching it a bit, since on the pdp11 typically every third or fourth
>instruction did some sort of memory reference; I doubt that the 68000 family
>does much better. ...
>

It's clear that you still don't understand what I'm getting at.  I'll try one
more time.  The '020 averages between 2 and 5 million external bus operations
per second, and that doesn't count the internal bus cycles run from the
on-chip cache.  The overhead for the "puking" as you call it is 46 bus cycles
(23 each way.)  If you insist that those 46 bus cycles are significant
against 2 to 5 million bus cycles then there's nothing more I can say.

>On the other hand, I'm glad to hear that Motorola did have the sense to
>put a floating-point-used flag in the FPU, so at least you don't have to
>shovel 300 bytes of state around unnecessarily.

Your use of the word "state" is ambiguous.  If you mean internal chip context
save state, then Mike Cruess has already brought that into perspective.  If
you mean the user register context size, then that's 208 bytes of state, and
I can go into our analysis at the time for not including DMA (or more
correctly, bus mastership capability).

>> Intel has a novel approach on their 8087 and 80287 where they let the process
>> context switch without saving FP state.  If another process tries using
>> floating-point an interrupt occurs letting the OS then swap context only
>> when necessary.  The trouble with this technique is that all it takes is
>> for one out of every 20 or so context switches to require a re-save and you
>> start losing overall processor time over just saving it unconditionally.

I have since figured the 286/287 overhead out, and it is somewhat less than
what I stated.
It takes 209 clocks to determine that no other task has used the 287 in the
meantime and that there is no state to reload.  If the exception routine
detects that the 287 now has some other task's registers, then the exception
routine's execution takes 765 clocks.  It takes 535 clocks to unconditionally
save and restore the state.

However, the 286 is not smart enough to handle the 287 with its task-switching
capability, which means there really is little alternative but to use the
exception routine route anyway.  So the ratio works out to somewhere around
one in four.

--
Dave Trissel     {seismo,ihnp4}!ut-sally!oakhill!davet
Motorola Semiconductor Inc., Austin, Texas
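[Editor's sketch] The clock counts above admit a simple break-even
calculation on raw clocks alone, ignoring the 286 task-switching constraint
the post mentions (which is why the practical ratio can differ):

```python
# Per-context-switch cost of the lazy scheme, as a function of the fraction
# of switches that find another task's registers on the 287 (a "miss").
LAZY_HIT, LAZY_MISS, EAGER = 209, 765, 535   # clocks, from the figures above

def lazy_cost(miss_rate):
    return (1 - miss_rate) * LAZY_HIT + miss_rate * LAZY_MISS

# On these numbers, lazy saving wins until misses exceed roughly 59 percent.
break_even = (EAGER - LAZY_HIT) / (LAZY_MISS - LAZY_HIT)
```

At one miss in four, the lazy scheme averages 348 clocks against an eager
535, on raw clocks at least.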
geoff@desint.UUCP (Geoff Kuenning) (08/24/85)
In article <489@talcott.UUCP> tmb@talcott.UUCP (Thomas M. Breuel) writes: >Finally, the thought of having a page fault pending and the CPU >happily executing more instructions before the fault is serviced >somehow worries me. It may play havoc with simple-minded process >synchronisation techniques. Which just goes to show that you shouldn't try to do OS-type things in a simple-minded manner on a complicated computer like the '020. Operating systems designers have been dealing with this sort of problem for over two decades; in general we don't really mind a few subtle points in the architecture that require careful attention to detail, as long as they're well documented. -- Geoff Kuenning ...!ihnp4!trwrb!desint!geoff
geoff@desint.UUCP (Geoff Kuenning) (08/24/85)
In article <5890@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes: >The following two Dave Trissel quotes are from the same message: > >> The stack save equates to about the same overhead as executing 12 >> instructions. > >In other words, all you need is 12 contiguous non-memory-referencing >instructions and the 68020's stack puke will actually break even! This >is stretching it a bit, since on the pdp11 typically every third or fourth >instruction did some sort of memory reference; I doubt that the 68000 family >does much better. Speaking of the 68000 *family*, note that a 68010 gets >the full performance hit every time since it doesn't pipeline much. Do I detect just a tiny rabid note here? Henry, I think Dave's point was not that you only have to do 12 non-memory-referencing instructions to break even. Rather, his point was that you only have to eliminate 12 instructions (of any "average" type) from the total code stream executed in response to a bus error to break even. Or, alternatively, that you would get the same performance hit from adding 12 instructions to that stream, which can easily be the result of a single bug fix in trap.c. There is little point in getting excited about a few microseconds in bus-error processing unless (a) you are getting LOTS of bus errors per second, and (b) those microseconds add a SIGNIFICANT percentage to the bus-error processing time. (a) is generally true in virtual-machine OS's; (b) is not true in any operating system I've ever heard of. In any case, Henry, why bring up the red herring of the 68010? This was a discussion of the 020 until now. Or are you just in a flaming-at-Motorola mood? >On the other hand, I'm glad to hear that Motorola did have the sense to >put a floating-point-used flag in the FPU, so at least you don't have to >shovel 300 bytes of state around unnecessarily. Hmm, maybe you *are* in a mood. In article <5883@utzoo.UUCP> you complain that some friends are all upset about the 300 bytes of state. 
Now we find out that said friends maybe didn't even know about the f.p.-used flag? >An interesting possibility would be to have the hardware support *both* an >FPU-used flag *and* a trap-on-first-FPU-use bit. It would not seem too >difficult to set up some code in the kernel that switches between the >two strategies as a function of the number of FPU context switches that >have occurred lately. Henry's got a point here, Dave. Even if you didn't want to do it dynamically, the OS designer would still have the option of picking trap-on-first-use, which is still advantageous if you are certain that most of the time there will only be one f.p. user. Any chance of getting this idea into the next rev? -- Geoff Kuenning ...!ihnp4!trwrb!desint!geoff
jack@boring.UUCP (08/26/85)
In article <5900@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes: > >It should also be possible to get a similar effect by using the SETCFG >instruction to tell the cpu "no floating point", which will produce a >trap when the user tries to use floating point. Save the state and then >turn floating point on again. When I asked the local National man about >this, he said it would work. Beware that I have *not* tried it on real >hardware yet. Well, I didn't try it either, but NS uses the code in their 4.1 release, so I guess it works. -- Jack Jansen, jack@mcvax.UUCP The shell is my oyster.
tim@callan.UUCP (Tim Smith) (08/27/85)
> I find this argument amusing. You just generated a page fault. That > means context switch, disk driver, housekeeping, ... . Compared to all > this, the overhead of your instruction re-start is going to be > negligible no matter how inefficiently you do it. Yes, but how about when you DON'T have a page fault? His point is that the 68020 can go ahead and do a lot of other stuff, cause it don't matter if the write a couple instructions back failed, whereas the instruction restart machine might have to wait to be sure that there will be no page fault. -- Tim Smith ihnp4!{cithep,wlbr!callan}!tim
henry@utzoo.UUCP (Henry Spencer) (08/27/85)
> Do I detect just a tiny rabid note here? Who, me? Just because I think dumping microstate onto the stack when you get a page fault is a wretched botch to cover up the fact that Motorola totally and utterly ignored virtual memory when they designed the 68000? Nah. > Henry, I think Dave's point was > not that you only have to do 12 non-memory-referencing instructions to > break even. The way I read Dave's note (which is still the way I read it, looking back) was "dumping microstate is a big win, because we can execute instructions beyond the one that causes the fault, and not have to redo them, unlike those cruddy architectures that have to stop dead when they hit a fault". My point, somewhat overstated I admit, was that this is near-nonsense, because the number of extra instructions is likely to be very small, not large enough to make up for the greater volume of data that has to go onto the stack at fault time. > In any case, Henry, why bring up the red herring of the 68010? Because Motorola trumpeted microstate dumping as a big win on the 68010 too. "Look at us, we did it right, we don't have to restart the whole instruction from scratch." Feh. > Or are you just in a flaming-at-Motorola mood? I'm never out of flaming-at-the-680x0's-stupid-stack-puke-page-fault mode! > Hmm, maybe you *are* in a mood. In article <5883@utzoo.UUCP> you complain > that some friends are all upset about the 300 bytes of state. Now we > find out that said friends maybe didn't even know about the f.p.-used > flag? No, we find out that *I* didn't know about it. Said friends are disgusted at the need to handle 300 bytes of state even *sometimes*, as it turns out. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
rfm@frog.UUCP (Bob Mabee) (08/30/85)
Dave Trissel of Motorola explains why the 68020 stores a large state on faults:

> 	MOVE	something to memory
> 	SHIFT	Reg by immediate
> 	MUL	Reg to Reg
> 	etc.
> The MC68020 executes the MOVE and the bus unit schedules a write cycle.  Then
> the execution unit/pipeline happily continues executing the instruction
> stream without regard to the final status of the write.  Even if the write
> fails (bus errors) there could be several more instructions executed (in fact
> any amount until one is hit which requires the bus again.)
>
> Contrast this to chips which redo instructions.  They must soon stop dead in
> their tracks until the write cycle has been verified as properly done.
> Otherwise they would alter the programmer's model and invalidate the retry.

Quite a few responses seemed to miss the point that this makes the 68020 run
a lot faster all the time, not just when the reference causes a fault.  The
alternative requires that the CPU store just a PC value that can be jumped to
to restart the program; that means there can be no visible effects from
instructions executed after the one that started the write that got the
error.  In the example, either the shift can't happen until the write is
acknowledged, or the processor has to keep multiple register sets so it can
back up far enough to recreate the state that goes with the bus cycle.

However, there is a big problem with the 68020 fault state on UNIX-like
systems: the state is (potentially) writeable by malicious users, but
Motorola has not provided enough information so we can detect bad states.
We need

  1) Motorola's assurance that no combination of bits fed to RTE can damage
     or hang up the chip, allow users to enter supervisor mode, or set a
     booby-trap that will harm the OS or another process.

or

  2) a (small) set of checks that will reject all combinations that might do
     any of those things, while allowing all combinations actually stored by
     the CPU.
If the OS boils the state down to a PS and PC, which can be easily validated, then it is lying to the user, because it will allow restarting but the program will misbehave (in the example, shift a register twice). If the OS prevents restarting such cases, it will kill programs that merely happen to get signals in the middle of 68881 instructions. The state gets to be writeable when a user instruction faults or (with the 68881) takes a mid-instruction interrupt, and the kernel then decides to signal the process. The signal handler runs like a user-level version of a trap handler, and can return, which should make the stopped instruction resume. Signal handlers can themselves be interrupted by other signals, so there can be a lot of sets of fault data around. The easiest way for the kernel to handle this is to put the data on the user stack as part of calling the signal handler. (Implementing a parallel, growable stack accessible only by the kernel to hold the fault data is going to be a big pain.) So, how about it, Dave? Can you give us #2 above? -- Bob Mabee @ Charles River Data Systems decvax!frog!rfm
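[Editor's sketch] A check of the kind asked for in #2 might take this shape.
The frame layout and the set of format codes below are illustrative only;
Motorola would have to supply the authoritative list:

```python
# Hypothetical validation of a user-supplied exception frame before RTE:
# reject anything whose status register claims supervisor state, and any
# frame format the CPU itself never generates.  Format values are examples,
# not a vetted list.
SR_SUPERVISOR = 0x2000
KNOWN_FORMATS = {0x0, 0x1, 0x2, 0x9, 0xA, 0xB}

def frame_acceptable(sr, format_code):
    if sr & SR_SUPERVISOR:
        return False               # a user frame must not re-enter supervisor mode
    return format_code in KNOWN_FORMATS
```

The hard part, as the post says, is knowing that the rest of the microstate
words can't booby-trap the chip; only the vendor can certify that.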
dws@tolerant.UUCP (Dave W. Smith) (09/01/85)
>> [discussion about saving floating-point-unit state only when needed] > > It should also be possible to get a similar effect by using the SETCFG > instruction to tell the cpu "no floating point", which will produce a > trap when the user tries to use floating point. Save the state and then > turn floating point on again. When I asked the local National man about > ... Beware that I have *not* tried it on real > hardware yet. > -- > Henry Spencer @ U of Toronto Zoology > {allegra,ihnp4,linus,decvax}!utzoo!henry We do it on real hardware, and it works nicely. -- David W. Smith {ucbvax}!tolerant!dws Tolerant Systems, Inc. 408/946-5667
freund@nsc.UUCP (Bob Freund) (09/09/85)
Now that the subject of internal state-dumping has been discussed for awhile,
I have a question that has as yet not been addressed.

In a multiprocessor system designed for transparent operation and which has
the ability to allocate processors to tasks dynamically, it is possible that
a task be re-started on a different processor than the one that faulted.  If
there is any difference in the micro-state between cpu revision levels, it
could happen that the restart would fail due to incompatible micro-state.

Does Motorola guarantee micro-state compatibility across revision levels?
Does it guarantee compatibility of micro-state across implementations?

If the answer to the first question is false, then all processors in a
multiprocessor must be at the same cpu revision level.  If the answer to the
second question is false, then it will not be possible to design
multiprocessors unless they are constituted of homogeneous cpu types.

What effect does this have on the types of multiprocessor systems that can be
designed based on the part?  What is the effect on distributed systems that
allow task migration across a network?  What about paging across a network?

Have fun
 -bob
guy@sun.uucp (Guy Harris) (09/11/85)
> In a multiprocessor system designed for transparent operation and
> which has the ability to allocate processors to tasks dynamically, it
> is possible that a task be re-started on a different processor than
> the one that faulted.  If there is any difference in the micro-state
> between cpu revision levels, it could happen that the restart would
> fail due to incompatible micro-state.

Another problem with making information like the format of internal state
dumped on page faults "private" to the particular chip is that you make it
difficult for user-mode code in a protected system to catch faults of that
sort and handle them itself.  I'm sure you all know one infamous UNIX utility
which this fouls up (I ran afoul of this one recently).  Other programs which
might want to do this include programs maintaining multiple subtasks within a
process - they could catch SIGSEGV (or your OS's equivalent) and grow a
process's stack if it goes beyond its boundary.

Unless the architectural spec for the machine states that there is *nothing*
a user-mode program can do to the internal state that will do anything more
than wedge the process doing it, you can't keep the state information in
user-writable space; this means you can't keep it on the user stack.
Unfortunately, that's where a user-mode RTE or whatever instruction will
usually restore it from.  You thus have to keep it somewhere like the kernel
stack or (in UNIX) the user page.  This means you have to limit the number of
such exceptions that remain outstanding, or dynamically allocate space to
hold them.  Saving one such lump of state information, so that user-mode code
handling the exception is not allowed to incur another such exception if that
exception is also to be handled in user mode, should handle most of the cases
you're likely to see, but it's still a kludge.

	Guy Harris
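[Editor's sketch] The single-slot compromise in Guy's last paragraph can be
modeled like this (names invented; a real kernel would keep the slot in the
u area or on the kernel stack):

```python
# Keep the unforgeable fault state in kernel space, one outstanding lump per
# process.  A second user-handled exception while one is pending is refused,
# which bounds the storage without dynamic allocation.
class KernelFaultSlot:
    def __init__(self):
        self.pending = None

    def save(self, state):
        if self.pending is not None:
            raise RuntimeError("nested user-handled fault: kill the process")
        self.pending = state

    def restore(self):
        """Hand the state back to the RTE path and free the slot."""
        state, self.pending = self.pending, None
        return state
```

This covers the common case of one fault handled in user mode at a time,
while making the nested case an explicit, detectable error rather than a
security hole.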