[net.micro.68k] PDP11s vs the micros

dave@enmasse.UUCP (Dave Brownell) (01/01/70)

In article <3233@nsc.UUCP> freund@nsc.UUCP (Bob Freund) writes:
> ... it
> is possible that a task be re-started on a different processor than
> the one that faulted.  If there is any difference in the micro-state
> between cpu revision levels, it could happen that the restart would
> fail due to incompatible micro-state.  Does Motorola guarantee
> micro-state compatibility across revision levels?  ...

At least one company that I know of had this problem.  Using MC68010s,
they found that some internal state bits might not be set
identically; Motorola gave them a few instructions to execute that would
guarantee that all the state bits were identical on all processors.

hull@hao.UUCP (Howard Hull) (07/08/85)

This is one of those "promised summary" things.  I asked how 68000 and 32016/32
users got along without a direct equivalent of the PDP 11's MOV (SP)+,@(R5)+
for table oriented video updates.  I got three replies.
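
(For readers who don't speak PDP-11 assembler, here is a rough C rendering
of the idiom being asked about; every name below is invented purely for
illustration.)

	/* Pop a new value and store it through the next entry in a table
	 * of field addresses -- a double-indirect store with both
	 * pointers autoincremented, i.e. MOV (SP)+,@(R5)+ in a loop. */
	unsigned short *field_addr[64];	/* addresses of video fields   */
	unsigned short  new_value[64];	/* values "popped" off a stack */

	void
	update_fields(int nfields)
	{
		unsigned short **dst = field_addr;
		unsigned short  *src = new_value;

		while (nfields-- > 0)
			**dst++ = *src++;
	}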

1.)  Rich Altmaier, Integrated Office Systems
	He supplied actual NS32 code.  I could not determine whether this was
	only a single indirect (I think that is the case) or was in effect the
	same as @(R5)+, in which R5 points to a location containing what the
	processor will interpret as the *address* of the destination data
	word.  That address is assumed to be part of a table of addresses of
	modifiable video fields in the PDP11 implementation.

	Method: Use movd with tos as the first operand, (r0) single indirect
		as the second operand.  Surprise (to me anyway): tos evidently
		autoincrements!  (Yuh couldn't tell that by a casual scan
		through the instruction set mnemonics.  Awards to National for
		doing such a good job hiding it!)
		Nonetheless, it does require two additional instructions to
		increment the destination pointer in double precision.  But
		since the address range of a typical PDP11 is so much smaller,
		only one would be needed for the NS chip to match equivalent
		performance.  Convenient, but does take more bytes on the NS
		chip than the PDP11.

2.)  Andrew Klossner
	He provided a method that would work well with a modularized data
	structure (it did not, as near as I could tell, use a double indirect
	form such as is implied in the @(R5)+ PDP11 instruction, either.)

	Method: Use a single MOVSD (move string) to copy the entire table,
		followed by an ADJSPx (adjust stack pointer) to pop the table
		off the stack.  Moving data a word at a time is passe.

3.) Peter da Silva
	He provided some commentary on the virtues of the Motorola 6809 8-bit
	microprocessor in executing the "basic Forth loop", the little two-
	word chunk of list-linking magic which has caught the notice of all
	serious assembly language programmers of the last decade.  Evidently
	the 6809 is the last of the micros of this particular mentality, grand
	as it may be.  Peter indicated that the 6809 can manage the matter
	in two instructions, just as is found with the PDP11 execution.
	Peter noted that the execution of this famous task on the 68000 was
	indeed somewhat more awkward.  Peter pointed out that the 68000 has a
	little less elegance with its instruction set than does the PDP11, but
	that it is rather more efficient in its use of T-states.

Summary:
	It looks very much as though the influence of the profession of
	Computer Science has had a definite impact on the former priorities
	of hardware and software constraint.  No longer is it considered the
	responsibility of the software designer to obtain maximum performance
	from a given hardware configuration (e.g. a particular microprocessor)
	but rather to obtain maximum economy of effort for the combination of
	hardware and software (i.e., cut costs and time from conception to
	market emplacement).  Under the circumstances, most programmers will
	not worry about the detailed machine code at all, having most likely
	not looked any further than the high-level-language compiler output.
	Portability is more important than the cleverness of an implementation.
	This reflects the fact that while software producers always have to
	worry about competitors with more efficient implementations of their
	code, most such competitors will not survive late arrival at market,
	particularly with a more expensive product, or one that is more
	difficult to document or maintain.
	
	The method used to procure this new stance is to modularize tasks at
	many levels.  In this case, such modularization is accomplished by
	breaking the video map into several parts such that updating is
	accomplished on one module at a time, rather than on the entire data
	structure as segmented by a suitable table of addresses.  The double
	indirect, a concept always difficult to teach to the neophytes, is
	now declared dead.  For all practical purposes, it has been buried
	in the nested-line formats of structured programming topology.

								     Howard Hull
[If yet unproven concepts are outlawed in the range of discussion...
                   ...Then only the deranged will discuss yet unproven concepts]
        {ucbvax!hplabs | allegra!nbires | harpo!seismo } !hao!hull

jans@mako.UUCP (Jan Steinman) (07/09/85)

In article <1617@hao.UUCP> hull@hao.UUCP (Howard Hull) writes:
>The double indirect, a concept always difficult to teach to the neophytes, is
>now declared dead.  For all practical purposes, it has been buried in the
>nested-line formats of structured programming topology.

Not so!  Our NS32000 C compiler regularly produces memory indirect references
whenever a pointer is declared as an automatic variable:

	movd	0(4(fp)),rX	;Get the data pointed to by the data pointed
				;  to by the frame pointer plus four.

Although I'm not familiar with PDP11 assembly, it appears that National lacks
the general-purpose autoincrement but adds the (more useful, in my opinion)
ability to offset from the first base before indirection.  You can even throw
a scaled index on top of the final address if you want, which means you no
longer have to forget your base when double-indirecting through a table.
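
In C terms, the sort of access this buys you is (a hypothetical sketch, not
actual output from our compiler or anyone else's):

	/* Index a table of pointers, indirect through the chosen entry,
	 * then displace to a member -- the combination that offset-before-
	 * indirection plus a scaled index covers in one operand. */
	struct obj {
		int	tag;
		int	value;
	};

	struct obj *table[100];		/* table of pointers to objects */

	int
	fetch(int i)
	{
		return table[i]->value;
	}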

While I agree that most applications are content with the output of a
compiler, writing tight assembly code, using all the features a machine has to
offer, is not dead.  I make heavy use of the NS32000 memory indirect
addressing mode, which is excellent for such things as traversing linked-lists
in dynamically typed languages.  (Get a pointer to an object, look through the
object for a terminator, etc.)  My only wish is that Nati had given all the
general purpose registers this facility, so I wouldn't have to shuffle the
SP or FP around so much!
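
For instance, something like this (a made-up sketch of the kind of traversal
I mean, not real product code):

	/* Walk a chain of tagged objects until the terminator is found.
	 * Each step loads through the pointer and then reloads the pointer
	 * through itself -- just what memory indirect addressing is for. */
	struct node {
		struct node	*next;
		int		tag;	/* tag == 0 marks the terminator */
	};

	struct node *
	find_end(struct node *p)
	{
		while (p->tag != 0)
			p = p->next;
		return p;
	}
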
-- 
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

henry@utzoo.UUCP (Henry Spencer) (07/11/85)

> 	Method: Use a single MOVSD (move string) to copy the entire table,
> 		followed by an ADJSPx (adjust stack pointer) to pop the table
> 		off the stack.  Moving data a word at a time is passe.

Unfortunately, if you study the timings you will find that a well-optimized
(i.e. unrolled) word-at-a-time move loop is faster than the string-move
instructions on the current 32000 processors.  When I queried my National
rep about this, he admitted it.  The string instructions are very slow on
the 32016 and 32032; with any luck the 32332 will be better.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

gnu@sun.uucp (John Gilmore) (07/16/85)

Speaking of double-indirect, tektronix!tekecs!jans said:
> 				  My only wish is that Nati had given all the
> general purpose registers this facility, so I wouldn't have to shuffle the
> SP or FP around so much!

Hmm.  You mean the totally orthogonal wonderful National architecture
has a few warts after all?

Lemme see, I seem to remember something about bit fields not being able to
lie across more than four bytes...e.g. you can only put a 32-bit bitfield on a
byte boundary.

Don't tell their marketing folks...

jon@nsc.UUCP (Jon Ryshpan) (07/18/85)

In article <2422@sun.uucp> you write:
>Speaking of double-indirect, tektronix!tekecs!jans said:
>> 				  My only wish is that Nati had given all the
>> general purpose registers this facility, so I wouldn't have to shuffle the
>> SP or FP around so much!
>
>Hmm.  You mean the totally orthogonal wonderful National architecture
>has a few warts after all?
>

The 32000 does (in fact) have some warts, but in my humble opinion
this isn't one of them.

The 320xx is not a register machine like the PDP11 or some other well
known processors.  It's a p-machine with some registers added to it for
extra speed.  You don't expect to have symmetry between the registers
that make up the p-machine and the other registers that speed it up.

The basic machine architecture uses these principal registers:

	FP - The Frame Pointer : These define the current
	SP - The Stack Pointer : activation record

	SB - The Static Base   : Pointer to own variables

	Link Base	       : Linkage to external procedures

The machine would run perfectly happily without any more registers; but
it would be slow, because all data references would be to main memory.
So some additional registers were added to allow for on-chip data
storage.  These registers have the same addressing modes as a main
memory location addressed via one of the principal registers (FP,
SP, or SB).

The addressing modes are:

      o Y(FP) - Direct addressing: The contents of the location Y
	bytes above the Frame Pointer

      o X(Y(FP)) - Displaced addressing: The contents of the
	location X bytes above the location that the location
	Y bytes above the frame pointer points to.

(Same for SP and SB.)

The register modes are:

      o R0 - Register value: The contents of register R0

      o X(R0) - Register displaced addressing: The contents of the
	location X bytes above the location that R0 points to.

(Same for R1..R7)

If you move something from memory to register, you have exactly the
same access capability to it that you had when it was in memory, no
more, no less.  This is what we mean by symmetry.  (There's more, but
this is the most important part.)
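
A small C illustration of that symmetry (purely hypothetical; the operand
forms in the comment are only meant to be suggestive):

	/* The same source-level dereference works whether the pointer
	 * lives in the stack frame or in a register: roughly 0(disp(FP))
	 * in the first case, 0(Rn) in the second. */
	int
	deref_both(int *in_frame)
	{
		register int *in_reg = in_frame;

		return *in_frame + *in_reg;
	}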

One of the most important parts of processor architecture is the
struggle for bits - you want to be able to say as much as you can
with the least number of bits.  The 32000 gets away with a 5 bit
mode field, which does about as much as the 6 bit mode/address field
in another well known processor or the 8 bit mode/address field
in some other well known processors.
-- 

				Thanks - Jonathan Ryshpan

{cbosgd,fortune,hplabs,ihnp4,seismo}!nsc!jon     nsc!jon@decwrl.ARPA

Let justice be done though the heavens fall.

jans@mako.UUCP (Jan Steinman) (07/18/85)

In article <2422@sun.uucp> gnu@sun.uucp (John Gilmore) writes:
>>(me, as quoted by John Gilmore)
>> My only wish is that Nati had given all the general purpose registers this
>> facility, so I wouldn't have to shuffle the SP or FP around so much!

>(John Gilmore)
>Lemme see, I seem to remember something about bit fields not being able to
>lie across more than four bytes...e.g. you can only put a 32-bit bitfield on a
>byte boundary.

Well since you're from Sun, John, you're probably used to the 68000.  Now
how many bits can its bit fields span?  No kidding!  It doesn't even have
bit fields, you say?  (Back to the old shift-and-count loops, I guess.)  While
we're on bits and orthogonality, how many bits can a generic, single bit
operation span on the 68000?  Eight, you say?  On the 32000 it is unlimited:
i.e. setting bit 33 at a given pointer sets bit one, four bytes away from the
pointer.  This makes bit-mapped code a breeze to write.  By the way, a 32 bit
"bitfield" (in the sense of an atomic quantity) on the 68000 must be on a WORD
boundary.

No processor is without its warts, but your jibe is grasping at straws.  I
suspect the 32 bit restriction makes the MMU simpler.  Since 31 bits can
enjoy arbitrary alignment, and 32 bits are usually an integer, this doesn't
bother me too much.  And the 32000 is still the most enjoyable processor I've
ever written assembly code for.

-- 
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

guy@sun.uucp (Guy Harris) (07/28/85)

> The 320xx is not a register machine like the PDP11 or some other well
> known processors.  It's a p-machine with some registers added to it for
> extra speed.  You don't expect to have symmetry between the registers
> that make up the p-machine and the other registers that speed it up.

What???  I've read the 32xxx data sheet and the machine looks *far* more
like a "warmed-over VAX" (which is no slur, considering how many machines
out there are just warmed-over PDP-11s or warmed-over VAXes) than a
"p-machine" (by "p-machine" do you mean "p-code engine"?).  Yes, you can use
it as a stack machine (make both operands use the TOS addressing mode) - but
then you can do the *exact same thing* on "the PDP11 (and) some other well
known processors" (such as the VAX).  How many 32xxx instructions use TOS
for both operands, and how many use a register, or register relative/memory
space, or...?  If the majority do NOT use TOS, I submit that the 32xxx is
not a "p-machine" in the sense of an engine intended to run P-code, but
instead a VAXish register machine with some addressing modes added to make
stack-oriented expression evaluation slightly simpler (although, given that
any PDP-11 or VAX instruction with two general-addressing-mode operands can
be made to act like a stack machine instruction, I don't think they even
make it any simpler).

	Guy Harris

jans@mako.UUCP (Jan Steinman) (07/31/85)

In article <2506@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>> The 320xx is not a register machine like the PDP11 or some other well
>> known processors.  It's a p-machine with some registers added...
>
>Yes, you can use it as a stack machine (make both operands use the TOS 
>addressing mode)...  How many 32xxx instructions use TOS for both operands,
>and how many use a register, or register relative/memory space, or...?

Any instruction that uses "general" addressing could care less if the operands
are "TOS..., register, or register relative/memory space, or..."  Do you
have the "Instruction Set Reference Manual"?  Have you looked at the
instructions?

>If the majority do NOT use TOS, I submit that the 32xxx is not a "p-machine"
>in the sense of an engine intended to run P-code, but instead a VAXish
>register machine with some addressing modes added to make stack-oriented
>expression evaluation slightly simpler...

It is easier to count the instructions that do not use TOS for one operand.
A quick perusal shows "quick" (embedded constant operand), branches (would
anyone want a PC displacement on the stack, anyway?), block moves and
compares, processor directives (ENTER, EXIT, RET, SETCFG, etc.), and processor
register (LPRi, SPRi) instructions.  All the "mainstream" operations can have
any general addressing mode for either operand.

What's more, Nati has paid a great deal of attention to TOS access classes and
addressing speed.  TOS is the fastest memory addressing (if the bus could only
keep up!) and the SP behaves in a reasonable way, depending on the access
class:

	addd	tos,tos		first operand is popped (SP incremented),
		rd  rmw		second is only modified (SP unchanged)

	jump	tos		tos -> PC (SP unchanged)
		addr

	negd	tos,tos		first operand is popped (SP incremented),
		rd  wr		second operand is pushed (SP decremented)

	acbd	-1,tos,label	operand is decremented, loop until operand
		q  rmw disp	reaches zero. (SP unchanged)

As to Nati TOS usefulness to the mythical "p-machine", I'm working on a Z80
emulator that uses the 32032 SP as the Z80 PC.  Another thing we're exploring
is using the SP as a Smalltalk virtual stack pointer.
-- 
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

guy@sun.uucp (Guy Harris) (08/02/85)

> >Yes, you can use it as a stack machine (make both operands use the TOS 
> >addressing mode)...  How many 32xxx instructions use TOS for both operands,
> >and how many use a register, or register relative/memory space, or...?
> 
> Any instruction that uses "general" addressing could care less if the
> operands are "TOS..., register, or register relative/memory space, or..."

Yes, I already knew that.  *That was my entire point.*  Since a large
list of instructions use "general" addressing for both their operands, and
since that means they can all use all the aforementioned addressing modes, I
submit that the NS32xxx isn't a "p-machine".  If it *is* a "p-machine", so
is the CCI Power 6/32; it has auto-increment SP and auto-decrement SP
addressing modes, and you could easily have the assembler accept a TOS
addressing mode and generate (sp)+, -(sp), or (sp) addressing modes for it.
Somehow, I don't think removing all auto-incrementing or auto-decrementing
addressing modes from a machine's instruction set makes it a "p-machine",
though.

> It is easier to count the instructions that do not use TOS for one operand.
> A quick perusal shows <list of instructions>.  All the "mainstream"
> operations can have any general addressing mode for either operand.

My point exactly.

When I said "how many 32xxx instructions use...", I didn't mean "how many
instructions as listed in the 'Instruction Set Reference Manual' use...", I
meant "how many instructions as written by assembly-language programmers and
as generated by compilers use..."  I don't have the "Instruction Set
Reference Manual"; will "NS16032S-6, NS16032-4 High Performance
Microprocessors (Preliminary - November 1982)" do?

> ...and the SP behaves in a reasonable way, depending on the access
> class:
> 
> 	addd	tos,tos		first operand is popped (SP incremented),
> 		rd  rmw		second is only modified (SP unchanged)
> 
> 	jump	tos		tos -> PC (SP unchanged)
> 		addr
> 
> 	negd	tos,tos		first operand is popped (SP incremented),
> 		rd  wr		second operand is pushed (SP decremented)
> 
> 	acbd	-1,tos,label	operand is decremented, loop until operand
> 		q  rmw disp	reaches zero. (SP unchanged)

As I pointed out, a VAX or Power 6 assembler could do that too.  (The PDP-11
and M68000 don't have enough two-operand instructions with both operands
specified by general addressing modes to make this worthwhile.) Turn "tos"
into (sp) for most one-operand instructions, turn the first "tos" into (sp)+
and the second into (sp) for two-operand instructions that fetch both
operands, and the first into (sp)+ and the second into -(sp) for two-operand
instructions that fetch only the first operand.  (Punt the 3-operand
instructions.)

> As to Nati TOS usefulness to the mythical "p-machine", I'm working on a Z80
> emulator that uses the 32032 SP as the Z80 PC.  Another thing we're
> exploring is using the SP as a Smalltalk virtual stack pointer.

What does this have to do with "Nati TOS usefulness to the mythical
'p-machine'"?  I have no idea what the person had in mind when he called the
32xxx a "p-machine".  The logical assumption is that he meant "p-code
engine"; P-code engines are generally stack machines which the NS32xxx is
*not*, any more than the VAX is.  You can treat it as a stack machine, but
you don't *have* to (and probably don't want to; it'll run faster as a
general-register machine).  If you are merely referring to architectural
features that make certain bits of coding work nicely, then the PDP-11 is a
"p-machine" in that sense - note the use of "jmp @(r4)+" for threaded code.
I presume the TOS addressing mode is useful for the Z80 simulator because
it's the only addressing mode that does any sort of auto-incrementing of a
register - i.e., instruction fetch is done with "movw tos, <instruction
register>".  You can do the same on machines like the PDP-11, VAX, and
M68000 by doing

	move (reg)+,<instruction register>

where "reg" is a register chosen for use as the simulator's PC.
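
In C, the step both tricks speed up is the emulator's fetch-and-advance; a
made-up skeleton (not Jan's actual emulator):

	/* The single "*pc++" below is the point: with TOS (or (reg)+)
	 * addressing, fetching the next emulated opcode and bumping the
	 * emulated PC collapse into one instruction. */
	void
	emulate(unsigned char *pc)	/* pc: the emulated program counter */
	{
		unsigned char op;

		for (;;) {
			op = *pc++;	/* fetch next opcode, advance PC */
			switch (op) {
			case 0x00:	/* NOP */
				break;
			default:	/* unimplemented opcode: stop */
				return;
			}
		}
	}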

	Guy Harris

steveg@hammer.UUCP (Steve Glaser) (08/06/85)

The "p-machine" garbage for the 32xxx was probably just early marketing
hype.  It's real difficult to sell anything in a big company unless you
hang it on something familiar.  The "p-machine" is an old system that
was probably familiar to somebody in charge at the time.

Remember that there was a chip known as the 16016 in the family.  That
was a 16032 (aka 32016 nowadays) with Intel 8080 emulation mode added
(not Z-80, just 8080).  (Gee you could write a CP/M emulator.)  This
should give you some insight into their thinking at the time.  (No I
wasn't there, but I was an early user of the chipset).

As for eliminating auto +/- addressing mode, I support that decision.
Given their decision to "back out" instructions that get page faults
rather than dump out the internal microstate like the 68010, National
would have to keep shadow copies of too much internal stuff around in
case a page fault came through.  That's a big hassle and takes chip
real estate.  I think having full memory to memory addressing is more
useful than auto +/-, especially for compiler generated code.  (well
maybe not for pcc -- its model seems to be put something in a
register, munch on it, put it back in memory).
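
As a rough illustration (the commented code sequences are invented, not real
output from either style of compiler):

	int	a, b;			/* both live in memory */

	void
	bump(void)
	{
		a += b;
		/* memory-to-memory style:	addd	b,a
		 * load/store (pcc) style:	movd	a,r0
		 *				addd	b,r0
		 *				movd	r0,a	*/
	}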

	Steve Glaser
	steveg.tektronix@csnet-relay.arpa       or tektronix!steveg

guy@sun.uucp (Guy Harris) (08/08/85)

> The "p-machine" garbage for the 32xxx was probably just early marketing
> hype.

Which means "but it's not a general register machine, it's a p-machine!"
isn't a legitimate reason why the 32xxx's SP, etc. aren't general registers
- which is what the person replying to John Gilmore said.  (There may be
legitimate reasons, but "it's a p-machine" isn't one of them - because it
isn't a p-machine.)

> As for eliminating auto +/- addressing mode, I support that decision.

Not knowing what the exact tradeoffs were, I neither support it nor oppose
it.  The P6/32 doesn't have auto-I/D except on the SP, and it seems not to
have suffered *too* much in performance :-) (~4-7x 11/780 isn't too bad,
especially for a TTL machine with an instruction set which has a fair fraction
of the VAX's complexity).

It may also be easier to do pipelining if fewer of the addressing modes have
side-effects - you don't have to worry about the (r4)+ two pipeline stages
behind screwing up your movl r4,<location> (or, if you have multiple copies
of the general register set, having to worry about propagating the change
from the auto-increment to the instruction-unit copy of r4 forward to the
execution-unit copy at the right time).

> Given their decision to "back out" instructions that get page faults
> rather than dump out the internal microstate like the 68010, National
> would have to keep shadow copies of too much internal stuff around in
> case a page fault came through.

Well, maybe.  Returning to the original topic, as described by the subject -
the PDP-11 can only modify a maximum of two registers during the operand
preparation, so some models have (or have what amounts to) a register which
remembers the register numbers of the two registers modified and the amount
added to or subtracted from them.  When you take a fault, the fault handler
saves the contents of this register (which, presumably, freezes until read)
and uses it to back up the faulting instruction.  (This backup could also be
mostly simulated in software - see the routine "backup" in the assembler
language support code for UNIX on PDP-11s lacking this register - 11/40,
11/34, 11/23, 11/60...).

> I think having full memory to memory addressing is more useful than auto
> +/-, especially for compiler generated code.  (well maybe not for pcc --
> its model seems to be put something in a register, munch on it, put it
> back in memory).

Well, if the RISC people are correct, neither of them is necessarily
useful.  One problem with having auto-I/D modes is that you set up your
language to use them; then, when you compile code tuned for machines with
auto-I/D on machines which don't have it, you get code that's not as good as
would be generated by a more straightforward coding.
things like "+=", I think PCC will make use of memory-to-memory modes in
simple expressions; if the expression is more complicated, it's probably
faster to do it in a register anyway.

	Guy Harris

peter@kitty.UUCP (Peter DaSilva) (08/08/85)

> As for eliminating auto +/- addressing mode, I support that decision.
> Given their decision to "back out" instructions that get page faults
> rather than dump out the internal microstate like the 68010, National

Any particular reason to do this rather than restart the instruction from
where it left off? I hadn't heard of this approach... what does the Vax
do? What are the tradeoffs?

> would have to keep shadow copies of too much internal stuff around in
> case a page fault came through.  That's a big hassle and takes chip
> real estate.  I think having full memory to memory addressing is more
> useful than auto +/-, especially for compiler generated code.  (well
> maybe not for pcc -- its model seems to be put something in a
> register, munch on it, put it back in memory).

main(ac, av)
register int ac;
register char **av;
{
	while(*++av) {
		...
	}
}

Not an uncommon construction in 'C'.

kds@intelca.UUCP (Ken Shoemaker) (08/08/85)

> It may also be easier to do pipelining if fewer of the addressing modes have
> side-effects - you don't have to worry about the (r4)+ two pipeline stages
> behind screwing up your movl r4,<location> (or, if you have multiple copies
> of the general register set, having to worry about propagating the change
> from the auto-increment to the instruction-unit copy of r4 forward to the
> execution-unit copy at the right time).

On the subject of side effects and pipelining, has anyone thought of
the problems of treating the pc as a general register (with autoincrement,
etc.) at the same time as you added some level of prefetch?  This would
seem to me to get very ugly, having to keep track of things in the
prefetch buffer whenever you address/adjust off the pc.  Indeed, this
could limit the amount of instruction pre-processing/cracking you could
do (or dramatically increase the amount of logic that is required).
Any solutions besides punting?
-- 
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm

{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds
	
---the above views are personal.  They may not represent those of the
	employer of its submitter.

guy@sun.uucp (Guy Harris) (08/11/85)

> > As for eliminating auto +/- addressing mode, I support that decision.
> > Given their decision to "back out" instructions that get page faults
> > rather than dump out the internal microstate like the 68010, National
> 
> Any particular reason to do this rather than restart the instruction from
> where it left off?

Less internal state to dump?  (Which means less microcode/whatever to do the
dumping and restoring, and less code in the kernel to check that the state,
if accessible to the user, hasn't been tampered with.)

> I hadn't heard of this approach... what does the Vax do? What are the
> tradeoffs.

The VAX has a "first part done" bit in the PSL.  Presumably, instructions
which have side-effects, and in which a page fault can occur after the
side-effect, set the "first part done" bit once the side-effects have
occurred.  This is a simpler version of the "dump the internal microstate"
model.  The PDP-11 (at least the ones with the fancier MMUs) backs out the
instruction in software - it dumps the numbers of the registers which have
been auto-incremented or auto-decremented, and the amount they've been
auto-incremented or auto-decremented by, into a register which is used by
the trap handler to actually back the auto-ID out.

Some VAX instructions, like the string instructions, require more hair.  In
that case they hijack several registers and store the current pointers and
lengths into them if the instruction takes a fault and is interrupted.
Presumably the FPD bit says that the pointers and lengths should be taken
from the registers instead of from the instruction's operands.

> > I think having full memory to memory addressing is more
> > useful than auto +/-, especially for compiler generated code.  (well
> > maybe not for pcc -- its model seems to be put something in a
> > register, munch on it, put it back in memory).

> register char **av;
> {
> 	while(*++av) {
> 
> Not an uncommon case construction in 'C'.

Not uncommon, but it doesn't generate autoincrement code on the PDP-11, VAX,
or M68000, because they all have only predecrement and postincrement modes -
this construct would require a preincrement mode.  Do any machines have
preincrement addressing modes?

	Guy Harris

henry@utzoo.UUCP (Henry Spencer) (08/14/85)

> > Any particular reason to do this rather than restart the instruction from
> > where it left off?
> 
> Less internal state to dump?  (Which means less microcode/whatever to do the
> dumping and restoring, and less code in the kernel to check that the state,
> if accessible to the user, hasn't been tampered with.)

Motorola obviously :-) views its 68020 line primarily as a way to sell
memory chips.  Between the incredible pile of trash it heaves onto the
stack when you take a page fault, and the huge internal state of the
68881 FPU that has to be shoveled in and out every time you context-switch
(what's the betting Motorola's next FPU chip has DMA? :-), the memory
market is clearly what they're aiming at.  That and the cache market.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

peter@baylor.UUCP (Peter da Silva) (08/15/85)

> Not uncommon, but it doesn't generate autoincrement code on the PDP-11, VAX,
> or M68000, because they all have only predecrement and postincrement modes -
> this construct would require a preincrement mode.  Do any machines have
> preincrement addressing modes?

OK. Bad example. How about:

strcpy(to, from)
register char *to, *from;
{
	char *hold = to;
	while(*to++ = *from++)
		continue;
	return hold;
}
-- 
	Peter da Silva (the mad Australian)
		UUCP: ...!shell!neuro1!{hyd-ptd,baylor,datafac}!peter
		MCI: PDASILVA; CIS: 70216,1076

davet@oakhill.UUCP (Dave Trissel) (08/17/85)

In article <5874@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:

>> > Any particular reason to do this rather than restart the instruction from
>> > where it left off?
>> 
>
>Motorola obviously :-) views its 68020 line primarily as a way to sell
>memory chips.  Between the incredible pile of trash it heaves onto the
>stack when you take a page fault, and the huge internal state of the
>68881 FPU that has to be shoveled in and out every time you context-switch
>(what's the betting Motorola's next FPU chip has DMA? :-), the memory
>market is clearly what they're aiming at.  That and the cache market.

What you don't realize is the amazing performance we can get because of the
"incredible pile of trash" we heave on the stack.

The crux of the problem is that chips which have to back-up and redo
instructions pay a nasty penalty in pipeline design.  Consider the following
generic microprocessor code sequence:

		MOVE   something to memory
		SHIFT  Reg by immediate
		MUL    Reg to Reg
		etc.

The MC68020 executes the MOVE and the bus unit schedules a write cycle.  Then
the execution unit/pipeline happily continues executing the instruction
stream without regard to the final status of the write.  Even if the write
fails (bus errors) there could be several more instructions executed (in fact
any amount until one is hit which requires the bus again.)

Contrast this to chips which redo instructions.  They must soon stop dead in
their tracks until the write cycle has been verified as properly done.
Otherwise they would alter the programmer's model and invalidate retry.

Another thing to consider is that the total operating system code executed
to continue from a page fault (assign an unused page frame and map it in the
MMU, or block the process and schedule a swapped out page to be read) makes
the overhead of writing the internal 020 machine state seem insignificant.
The stack save equates to about the same overhead as executing 12
instructions.

Concerning floating-point state saves we gave a lot of thought to minimizing
latency times.  What we did was give an indication to the OS of whether any
of the FP registers had been used.  If not, the OS could skip the context
save and restore completely.

Intel has a novel approach on their 8087 and 80287 where they let the process
context switch without saving FP state.  If another process tries using
floating-point an interrupt occurs letting the OS then swap context only
when necessary.  The trouble with this technique is that all it takes is
for one out of every 20 or so context switches to require a re-save and you
start losing overall processor time over just saving it unconditionally.
At worst, if you have several processes constantly sharing the FP chip then
you have essentially forced a complete extra interrupt exception invocation
for every change in context - a massive penalty.

One solution would be to keep multiple contexts on chip.  Ah - if we only
had next decade's technology today.  Lots of exciting things are going to
happen once we can get millions of gates on a single chip running at 70 MHz.

 --  Dave Trissel
     Motorola Semiconductor Inc.
     Austin, Texas              {seismo,ihnp4}!ut-sally!oakhill!davet

tmb@talcott.UUCP (Thomas M. Breuel) (08/17/85)

In article <492@oakhill.UUCP>, davet@oakhill.UUCP (Dave Trissel) writes:
|>Motorola obviously :-) views its 68020 line primarily as a way to sell
|>memory chips.  Between the incredible pile of trash it heaves onto the
|>stack when you take a page fault, and the huge internal state of the
|>68881 FPU that has to be shoveled in and out every time you context-switch
|>(what's the betting Motorola's next FPU chip has DMA? :-), the memory
|>market is clearly what they're aiming at.  That and the cache market.
|What you don't realize is the amazing performance we can get because of the
|"incredible pile of trash" we heave on the stack.
|
|The crux of the problem is that chips which have to back-up and redo
|instructions pay a nasty penalty in pipeline design.  Consider the following
|generic microprocessor code sequence:
|
|		MOVE   something to memory
|		SHIFT  Reg by immediate
|		MUL    Reg to Reg
|		etc.
|
|The MC68020 executes the MOVE and the bus unit schedules a write cycle.  Then
|the execution unit/pipeline happily continues executing the instruction
|stream without regard to the final status of the write.  Even if the write
|fails (bus errors) there could be several more instructions executed (in fact
|any amount until one is hit which requires the bus again.)

I find this argument amusing. You just generated a page fault.  That
means context switch, disk driver, housekeeping, ... .  Compared to all
this, the overhead of your instruction re-start is going to be
negligible no matter how inefficiently you do it.

In addition, I tend not to believe that what you gain in cache
performance makes up for the time required to push a lot onto the
stack.  Cache performance is going to increase in the way you describe
it on writes only anyhow, since if you get a page fault on a read
(which is probably the more common case) you have to wait for the
page to be brought in no matter what.

Finally, the thought of having a page fault pending and the CPU
happily executing more instructions before the fault is serviced
somehow worries me. It may play havoc with simple-minded process
synchronisation techniques.

Altogether, I don't buy that the 68020 gets 'amazing performance'
because it pushes of the order of 20 longwords onto the stack every
time it gets a page fault.

						Thomas.

davet@oakhill.UUCP (Dave Trissel) (08/19/85)

In article <489@talcott.UUCP> tmb@talcott.UUCP (Thomas M. Breuel) writes:
>|
>|		MOVE   something to memory
>|		SHIFT  Reg by immediate
>|		MUL    Reg to Reg
>|		etc.
>|
>|The MC68020 executes the MOVE and the bus unit schedules a write cycle.  Then
>|the execution unit/pipeline happily continues executing the instruction
>|stream without regard to the final status of the write.  Even if the write
>|fails (bus errors) there could be several more instructions executed (in fact
>|any amount until one is hit which requires the bus again.)
>
>I find this argument amusing. You just generated a page fault.  That
>means context switch, disk driver, housekeeping, ... .  Compared to all
>this, the overhead of your instruction re-start is going to be
>negligible no matter how inefficiently you do it.

You are not getting the point - maybe I did not make it that clear. Most of
the time instructions execute without a page fault interrupt. The problem is
that microprocessors which back up and redo instructions must ALWAYS halt
when a write is done onto the bus because there just may possibly be a bus
fault even though there almost always isn't.

The '020 pipeline only halts for memory operand reads, changes in supervisor
state or locked bus cycle instructions like TAS and CAS.
Probably the '020 bus averages somewhere around 30 percent write type cycles.
This means there are many chances for this overlap to increase performance.

The overlap the '020 gains is dependent on how far along the pipeline can
crunch before another bus cycle is needed.  With a 256 byte cache and large
number of work registers (15) there is a large percentage of the time that
one, two or more instructions can be executed while a write is being done.
Even if the next instruction requires an operand read or write from the bus
and therefore stops the pipe there, at least an overlap of instruction
decoding and queueing of another bus cycle is accomplished before the halt.

>In addition, I tend not to believe that what you gain in cache
>performance makes up for the time required to push a lot onto the
>stack.

For the average one to three million instructions the '020 may be doing each
second, the 24 extra longwords saved and restored over a bus
fault (which occurs anywhere from zero to, let's say, 10 times a second)
don't really make any difference.

>Cache performance is going to increase in the way you describe
>it on writes only anyhow, since if you get a page fault on a read
>(which is probably the more common case) you have to wait for the
>page to be brought in no matter what.

Maybe I didn't make it clear that I was getting at the majority of the
time that you don't have a bus fault.  Yes, any operand read from memory
will lock the pipe since obviously it cannot continue regardless of whether
a bus fault is going to occur or not.

>Finally, the thought of having a page fault pending and the CPU
>happily executing more instructions before the fault is serviced
>somehow worries me. It may play havoc with simple-minded process
>synchronisation techniques.
>

There are some side-effects but they don't occur for synchronisation since,
as I mentioned earlier, for semaphore and lock operations the pipe does not
forge ahead.  The side-effects are subtle and relate mostly to exception
handling and asynchronous exit invocations by the OS. That's the small penalty
you pay for getting higher performance.  Any advanced pipeline mechanism is
going to be executing ahead whether you're on the '020 or a supercomputer.

>Altogether, I don't buy that the 68020 gets 'amazing performance'
>because it pushes of the order of 20 longwords onto the stack every
>time it gets a page fault.

The way to tell is to simply look at some assembly code and follow the
instructions after operand writes.  A pretty good estimate can be gotten
from this method.  And remember, even if the very next instruction after a
write forces a bus access the '020 pipeline can progress up to the point of
that bus cycle request before it halts.

  -- Dave Trissel
     Motorola Semiconductor           {seismo,ihnp4}!ut-sally!oakhill!davet
     Austin, Texas

henry@utzoo.UUCP (Henry Spencer) (08/21/85)

The following two Dave Trissel quotes are from the same message:

> [we can continue other instructions in parallel] ... Even if the write
> fails (bus errors) there could be several more instructions executed (in fact
> any amount until one is hit which requires the bus again.)

> The stack save equates to about the same overhead as executing 12
> instructions.

In other words, all you need is 12 contiguous non-memory-referencing
instructions and the 68020's stack puke will actually break even!  This
is stretching it a bit, since on the pdp11 typically every third or fourth
instruction did some sort of memory reference; I doubt that the 68000 family
does much better.  Speaking of the 68000 *family*, note that a 68010 gets
the full performance hit every time since it doesn't pipeline much.

On the other hand, I'm glad to hear that Motorola did have the sense to
put a floating-point-used flag in the FPU, so at least you don't have to
shovel 300 bytes of state around unnecessarily.

> Intel has a novel approach on their 8087 and 80287 where they let the process
> context switch without saving FP state.  If another process tries using
> floating-point an interrupt occurs letting the OS then swap context only
> when necessary.  The trouble with this technique is that all it takes is
> for one out of every 20 or so context switches to require a re-save and you
> start losing overall processor time over just saving it unconditionally.
> At worst, if you have several processes constantly sharing the FP chip then
> you have essentially forced a complete extra interrupt exception invocation
> for every change in context - a massive penalty.

An interesting possibility would be to have the hardware support *both* an
FPU-used flag *and* a trap-on-first-FPU-use bit.  It would not seem too
difficult to set up some code in the kernel that switches between the
two strategies as a function of the number of FPU context switches that
have occurred lately.
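
Something along these lines, say (a hypothetical kernel sketch, not code from
any real system; the window and threshold are arbitrary):

	#define WINDOW		64	/* context switches per sample        */
	#define THRESHOLD	16	/* FP state moves that make lazy lose */

	static int nswitch, nfpmove;
	static int lazy = 1;		/* start out trapping on first use   */

	int				/* called once per context switch     */
	fp_policy(int fp_state_moved)	/* 1 if FP state had to be moved      */
	{
		nswitch++;
		if (fp_state_moved)
			nfpmove++;
		if (nswitch >= WINDOW) {
			lazy = (nfpmove < THRESHOLD);
			nswitch = nfpmove = 0;
		}
		return lazy;		/* 1: arm the FPU-use trap;
					 * 0: save/restore unconditionally   */
	}
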
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

thomson@uthub.UUCP (Brian Thomson) (08/21/85)

Dave Trissel (davet@oakhill.UUCP) brags about the Motorola FPU:

>Concerning floating-point state saves we gave a lot of thought to minimizing
>latency times.  What we did was give an indication to the OS of whether any
>of the FP registers had been used.  If not, the OS could skip the context
>save and restore completely.

Careful reading of the specs for the National FPU, plus a little experimenting,
shows that they also provided this capability, though their implementation
has the look of being a fortuitous accident (hint: check the behaviour of the
Trap Type field of the Floating Status Register).
Unfortunately, whoever wrote the documentation made no mention of this
use, which suggests that they don't realize what they have and are in danger
of making it not work in future releases of the hardware.
-- 
		    Brian Thomson,	    CSRI Univ. of Toronto
		    {linus,ihnp4,uw-beaver,floyd,utzoo}!utcsrgv!uthub!thomson

henry@utzoo.UUCP (Henry Spencer) (08/22/85)

> [discussion about saving floating-point-unit state only when needed]
> Careful reading of the specs for the National FPU, plus a little experimenting,
> shows that they also provided this capability, though their implementation
> has the look of being a fortuitous accident (hint: check the behaviour of the
> Trap Type field of the Floating Status Register).
> Unfortunately, whoever wrote the documentation made no mention of this
> use, which suggests that they don't realize what they have and are in danger
> of making it not work in future releases of the hardware.

It should also be possible to get a similar effect by using the SETCFG
instruction to tell the cpu "no floating point", which will produce a
trap when the user tries to use floating point.  Save the state and then
turn floating point on again.  When I asked the local National man about
this, he said it would work.  Beware that I have *not* tried it on real
hardware yet.
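
In outline (and equally untried), the trap handler might look like this;
every name below is made up, and the SETCFG behaviour is assumed to be as
the National man described:

	struct proc;				/* opaque process structure */
	extern void save_fpu_state(struct proc *);
	extern void restore_fpu_state(struct proc *);
	extern void enable_fpu(void);		/* re-enable FP via SETCFG  */

	static struct proc *fpu_owner;		/* whose registers the FPU holds */

	void					/* "no floating point" trap */
	fpu_disabled_trap(struct proc *curproc)
	{
		if (fpu_owner != 0)
			save_fpu_state(fpu_owner);	/* spill old owner  */
		restore_fpu_state(curproc);		/* load the faulter */
		fpu_owner = curproc;
		enable_fpu();		/* turn FP back on and retry the
					 * instruction that trapped      */
	}
	/* At context-switch time nothing is saved; the switch code just marks
	 * the FPU "absent" again, so the next process to use it traps here. */
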
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

davet@oakhill.UUCP (Dave Trissel) (08/24/85)

In article <5890@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:


>In other words, all you need is 12 contiguous non-memory-referencing
>instructions and the 68020's stack puke will actually break even!  This
>is stretching it a bit, since on the pdp11 typically every third or fourth
>instruction did some sort of memory reference; I doubt that the 68000 family
>does much better.        ...
>

It's clear that you still don't understand what I'm getting at.  I'll try one
more time.

The '020 averages between 2 and 5 million external bus operations per second
and that doesn't count the internal bus cycles run from the on-chip cache.
The overhead for the "puking" as you call it is 46 bus cycles (23 each way.)

If you insist that those 46 bus cycles are significant against 2 to 5 million
bus cycles then there's nothing more I can say.

>On the other hand, I'm glad to hear that Motorola did have the sense to
>put a floating-point-used flag in the FPU, so at least you don't have to
>shovel 300 bytes of state around unnecessarily.

Your use of the word "state" is ambiguous.  If you mean internal chip context
save state then Mike Cruess has already brought that into perspective.  If you
mean the user register context size then that's 208 bytes of state, and I
can go into our analysis at the time for not including DMA (or more correctly,
bus mastership capability).

>> Intel has a novel approach on their 8087 and 80287 where they let the process
>> context switch without saving FP state.  If another process tries using
>> floating-point an interrupt occurs letting the OS then swap context only
>> when necessary.  The trouble with this technique is that all it takes is
>> for one out of every 20 or so context switches to require a re-save and you
>> start losing overall processor time over just saving it unconditionally.

I have since figured the 286/287 overhead out and it is somewhat less than
what I stated.  It takes 209 clocks to determine that no other task has used
the 287 in the meantime and that there is no state to reload.  If the
exception routine detects the 287 now has some other task's registers then
the exception routine's execution takes 765 clocks.

It takes 535 clocks to unconditionally save and restore the state.  However,
the 286 is not smart enough to handle the 287 with its task switching
capability which means there really is little alternative but to use the
exception routine route anyway.

So the ratio works out to somewhere around one in four.

  --  Dave Trissel            {seismo,ihnp4}!ut-sally!oakhill!davet
      Motorola Semiconductor Inc.
      Austin, Texas

geoff@desint.UUCP (Geoff Kuenning) (08/24/85)

In article <489@talcott.UUCP> tmb@talcott.UUCP (Thomas M. Breuel) writes:

>Finally, the thought of having a page fault pending and the CPU
>happily executing more instructions before the fault is serviced
>somehow worries me. It may play havoc with simple-minded process
>synchronisation techniques.

Which just goes to show that you shouldn't try to do OS-type things in
a simple-minded manner on a complicated computer like the '020.  Operating
systems designers have been dealing with this sort of problem for over
two decades;  in general we don't really mind a few subtle points in the
architecture that require careful attention to detail, as long as they're
well documented.
-- 

	Geoff Kuenning
	...!ihnp4!trwrb!desint!geoff

geoff@desint.UUCP (Geoff Kuenning) (08/24/85)

In article <5890@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:

>The following two Dave Trissel quotes are from the same message:
>
>> The stack save equates to about the same overhead as executing 12
>> instructions.
>
>In other words, all you need is 12 contiguous non-memory-referencing
>instructions and the 68020's stack puke will actually break even!  This
>is stretching it a bit, since on the pdp11 typically every third or fourth
>instruction did some sort of memory reference; I doubt that the 68000 family
>does much better.  Speaking of the 68000 *family*, note that a 68010 gets
>the full performance hit every time since it doesn't pipeline much.

Do I detect just a tiny rabid note here?  Henry, I think Dave's point was
not that you only have to do 12 non-memory-referencing instructions to
break even.  Rather, his point was that you only have to eliminate 12
instructions (of any "average" type) from the total code stream executed
in response to a bus error to break even.  Or, alternatively, that you would
get the same performance hit from adding 12 instructions to that stream,
which can easily be the result of a single bug fix in trap.c.  There is
little point in getting excited about a few microseconds in bus-error
processing unless (a) you are getting LOTS of bus errors per second, and
(b) those microseconds add a SIGNIFICANT percentage to the bus-error
processing time.  (a) is generally true in virtual-machine OS's;  (b)
is not true in any operating system I've ever heard of.

In any case, Henry, why bring up the red herring of the 68010?  This was
a discussion of the 020 until now.  Or are you just in a flaming-at-Motorola
mood?

>On the other hand, I'm glad to hear that Motorola did have the sense to
>put a floating-point-used flag in the FPU, so at least you don't have to
>shovel 300 bytes of state around unnecessarily.

Hmm, maybe you *are* in a mood.  In article <5883@utzoo.UUCP> you complain
that some friends are all upset about the 300 bytes of state.  Now we
find out that said friends maybe didn't even know about the f.p.-used
flag?

>An interesting possibility would be to have the hardware support *both* an
>FPU-used flag *and* a trap-on-first-FPU-use bit.  It would not seem too
>difficult to set up some code in the kernel that switches between the
>two strategies as a function of the number of FPU context switches that
>have occurred lately.

Henry's got a point here, Dave.  Even if you didn't want to do it dynamically,
the OS designer would still have the option of picking trap-on-first-use,
which is still advantageous if you are certain that most of the time there
will only be one f.p. user.  Any chance of getting this idea into the
next rev?
-- 

	Geoff Kuenning
	...!ihnp4!trwrb!desint!geoff

jack@boring.UUCP (08/26/85)

In article <5900@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>
>It should also be possible to get a similar effect by using the SETCFG
>instruction to tell the cpu "no floating point", which will produce a
>trap when the user tries to use floating point.  Save the state and then
>turn floating point on again.  When I asked the local National man about
>this, he said it would work.  Beware that I have *not* tried it on real
>hardware yet.
Well, I didn't try it either, but NS uses the code in their 4.1
release, so I guess it works.
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

tim@callan.UUCP (Tim Smith) (08/27/85)

> I find this argument amusing. You just generated a page fault.  That
> means context switch, disk driver, housekeeping, ... .  Compared to all
> this, the overhead of your instruction re-start is going to be
> negligible no matter how inefficiently you do it.

Yes, but how about when you DON'T have a page fault?  His point is that
the 68020 can go ahead and do a lot of other stuff, cause it don't matter
if the write a couple instructions back failed, whereas the instruction
restart machine might have to wait to be sure that there will be no
page fault.
-- 
					Tim Smith
				ihnp4!{cithep,wlbr!callan}!tim

henry@utzoo.UUCP (Henry Spencer) (08/27/85)

> Do I detect just a tiny rabid note here?

Who, me?  Just because I think dumping microstate onto the stack when you
get a page fault is a wretched botch to cover up the fact that Motorola
totally and utterly ignored virtual memory when they designed the 68000?
Nah.

> Henry, I think Dave's point was
> not that you only have to do 12 non-memory-referencing instructions to
> break even.

The way I read Dave's note (which is still the way I read it, looking back)
was "dumping microstate is a big win, because we can execute instructions
beyond the one that causes the fault, and not have to redo them, unlike
those cruddy architectures that have to stop dead when they hit a fault".
My point, somewhat overstated I admit, was that this is near-nonsense,
because the number of extra instructions is likely to be very small, not
large enough to make up for the greater volume of data that has to go
onto the stack at fault time.

> In any case, Henry, why bring up the red herring of the 68010?

Because Motorola trumpeted microstate dumping as a big win on the 68010 too.
"Look at us, we did it right, we don't have to restart the whole instruction
from scratch."  Feh.

> Or are you just in a flaming-at-Motorola mood?

I'm never out of flaming-at-the-680x0's-stupid-stack-puke-page-fault mode!

> Hmm, maybe you *are* in a mood.  In article <5883@utzoo.UUCP> you complain
> that some friends are all upset about the 300 bytes of state.  Now we
> find out that said friends maybe didn't even know about the f.p.-used
> flag?

No, we find out that *I* didn't know about it.  Said friends are disgusted
at the need to handle 300 bytes of state even *sometimes*, as it turns out.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

rfm@frog.UUCP (Bob Mabee) (08/30/85)

Dave Trissel of Motorola explains why the 68020 stores a large state on faults:
>		MOVE   something to memory
>		SHIFT  Reg by immediate
>		MUL    Reg to Reg
>		etc.
> The MC68020 executes the MOVE and the bus unit schedules a write cycle.  Then
> the execution unit/pipeline happily continues executing the instruction
> stream without regard to the final status of the write.  Even if the write
> fails (bus errors) there could be several more instructions executed (in fact
> any amount until one is hit which requires the bus again.)
> 
> Contrast this to chips which redo instructions.  They must soon stop dead in
> their tracks until the write cycle has been verified as properly done.
> Otherwise they would alter the programmer's model and invalidate retry.

Quite a few responses seemed to miss the point that this makes the 68020
run a lot faster all the time, not just when the reference causes a fault.
The alternative requires that the CPU store just a PC value that can be
jumped to to restart the program; that means there can be no visible effects
from instructions executed after the one that started the write that got the
error.  In the example, either the shift can't happen until the write is
acknowledged, or the processor has to keep multiple register sets so it
can back up far enough to recreate the state that goes with the bus cycle.

However, there is a big problem with the 68020 fault state on UNIX-like
systems:  the state is (potentially) writeable by malicious users but Motorola
has not provided enough information so we can detect bad states.  We need
	1) Motorola's assurance that no combination of bits fed to RTE can
	   damage or hang up the chip, allow users to enter supervisor mode,
	   or set a booby-trap that will harm the OS or another process.
or	2) a (small) set of checks that will reject all combinations that
	   might do any of those things, while allowing all combinations
	   actually stored by the CPU.

If the OS boils the state down to a PS and PC, which can be easily validated,
then it is lying to the user, because it will allow restarting but the program
will misbehave (in the example, shift a register twice).  If the OS prevents
restarting such cases, it will kill programs that merely happen to get signals
in the middle of 68881 instructions.

The state gets to be writeable when a user instruction faults or (with the
68881) takes a mid-instruction interrupt, and the kernel then decides to
signal the process.  The signal handler runs like a user-level version of
a trap handler, and can return, which should make the stopped instruction
resume.  Signal handlers can themselves be interrupted by other signals, so
there can be a lot of sets of fault data around.  The easiest way for the
kernel to handle this is to put the data on the user stack as part of calling
the signal handler.  (Implementing a parallel, growable stack accessible
only by the kernel to hold the fault data is going to be a big pain.)

So, how about it, Dave?  Can you give us #2 above?

--
				Bob Mabee @ Charles River Data Systems
				decvax!frog!rfm

dws@tolerant.UUCP (Dave W. Smith) (09/01/85)

>> [discussion about saving floating-point-unit state only when needed]
> 
> It should also be possible to get a similar effect by using the SETCFG
> instruction to tell the cpu "no floating point", which will produce a
> trap when the user tries to use floating point.  Save the state and then
> turn floating point on again.  When I asked the local National man about
> ... Beware that I have *not* tried it on real
> hardware yet.
> -- 
> 				Henry Spencer @ U of Toronto Zoology
> 				{allegra,ihnp4,linus,decvax}!utzoo!henry

We do it on real hardware, and it works nicely.
-- 
  David W. Smith             {ucbvax}!tolerant!dws
  Tolerant Systems, Inc.
  408/946-5667

freund@nsc.UUCP (Bob Freund) (09/09/85)

Now that the subject of internal state-dumping has been discussed for
awhile, I have a question that has as yet not been addressed.
In a multiprocessor system designed for transparent operation and
which has the ability to allocate processors to tasks dynamically, it
is possible that a task be re-started on a different processor than
the one that faulted.  If there is any difference in the micro-state
between cpu revision levels, it could happen that the restart would
fail due to incompatible micro-state.  Does Motorola guarantee
micro-state compatibility across revision levels?  Does it guarantee
compatibility of micro-state across implementations?
If the answer to the first question is no, then all processors in a
multiprocessor must be at the same cpu revision level.  If the answer to the
second question is no, then it will not be possible to design multiprocessors
unless they are constituted of homogeneous types.  What effect
does this have on the types of multiprocessor systems that can be designed
based on the part?  What is the effect on distributed systems that
allow task migration across the network?  What about paging across the network?

Have fun
-bob

guy@sun.uucp (Guy Harris) (09/11/85)

> In a multiprocessor system designed for transparent operation and
> which has the ability to allocate processors to tasks dynamically, it
> is possible that a task be re-started on a different processor than
> the one that faulted.  If there is any difference in the micro-state
> between cpu revision levels, it could happen that the restart would
> fail due to incompatible micro-state.

Another problem with making information like the format of internal state
dumped on page faults and the like "private" to the particular chip is that
you make it difficult for user-mode code in a protected system to catch
faults of that sort and handle them itself.  I'm sure you all know one
infamous UNIX utility which this fouls up (I ran afoul of this one
recently).  Other programs which might want to do this include programs
maintaining multiple subtasks within a process - they could catch SIGSEGV
(or your OS's equivalent) and grow a process' stack if it goes beyond
the stack's boundary.

Unless the architectural spec for the machine states that there is *nothing*
a user-mode program can do to the internal state that will do anything more
than wedge the process doing it, you can't keep the state information in
user-writable space; this means you can't keep it on the user stack.
Unfortunately, that's where a user-mode RTE or whatever instruction will
usually restore it from.  You thus have to keep it somewhere like the kernel
stack or (in UNIX) the user page.  This means you have to limit the number
of such exceptions that remain outstanding or dynamically allocate space to
hold them.  Saving one such lump of state information, so that user-mode
code handling the exception is not allowed to incur another such exception
if that exception is also to be handled in user mode, should handle most of
the cases you're likely to see, but it's still a kludge.
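
For concreteness, the one-deep scheme might look like this (purely
hypothetical, not any real kernel; the fault frame is treated as opaque
bytes):

	#include <string.h>

	#define FRAMEMAX	128	/* enough for the largest fault frame */

	struct excsave {			/* one per process, kept in   */
		int	valid;			/* kernel (or u-page) space   */
		int	length;
		char	frame[FRAMEMAX];	/* opaque CPU-format state    */
	};

	/* Park a fault frame before posting the signal; refuse a second
	 * user-handled fault while one is already outstanding. */
	int
	park_frame(struct excsave *es, char *fp, int len)
	{
		if (es->valid || len > FRAMEMAX)
			return -1;
		memcpy(es->frame, fp, len);
		es->length = len;
		es->valid = 1;
		return 0;
	}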

	Guy Harris