[comp.arch] i860 overview

jangr@microsoft.UUCP (Jan Gray) (03/06/89)

				i860 Overview

Here is an overview of the i860 (what I consider the interesting features
of the part), taken from the "i860(tm) 64-bit Microprocessor Programmer's
Reference Manual", Order Number 240329-001, (C) Intel Corp. 1989.

Overview

* 64 bit external data/instruction bus
* 128 bit on-chip data bus
* 64 bit on-chip instruction bus
* 8K data cache, virtual addressed, write-back, two-way "set associative",
  2x128 lines of 32 bytes
* 4K instruction cache, virtual addressed
* 64 entry TLB
* core integer RISC unit
* floating-point unit with pipelined multiply and add units (can also be
  used "unpipelined")
* some multiply-accumulate type floating point instructions
* dual instruction mode can simultaneously dispatch a 32-bit core instruction
  and a 32-bit floating-point instruction

Data Types

* BE bit in epsr (extended processor status register) selects big/little
  endian format in memory, instructions always little-endian
* 32 bit signed/unsigned integers
* IEEE 754 format single (32-bit) and double (64-bit) precision floating
  point numbers
* pixels:
  * stored as 8, 16, or 32 bits (always operates on 64 bits of pixels at a
    time)
  * colour intensity shading instructions divide pixels into fields:
    pixel size	colour 1 bits	colour 2 bits	colour 3 bits	other bits
	8	....................N........................	8 - N
	16		6		6		4	0
	32		8		8		8	8
    [These particular field assignments are a result of the pixel add
     instructions described below.]

Memory Management

* NO SEGMENTS!
* 32 bit virtual addresses (translation can be disabled)
* translated identically to 386 virtual addresses: two-level address
  translation, with fields of the address selecting:
  * dirbase register specifies page directory
  * 1st level: addr[31..22] specifies page directory entry, yielding 
    permissions and address of the second level page table
  * 2nd level: addr[21..12] specifies page table entry, yielding additional
    permissions and address of the physical page
  * addr[11..0]  specifies byte offset within physical page (4K pages)
* page table bits:
  * P  - page is present
  * CD - cache disable: page is not cacheable
  * WT - page is write-through.  disables internal caching.  Either CD or WT
         can be passed through to the external PTB pin, depending upon PBM
	 bit in epsr.
  * U  - user: if 0, page is inaccessible in user mode.
  * W  - writable: if 0, page is not writable in user mode, and may be writable
	 in supervisor mode depending upon WP bit in epsr.
  * A  - accessed: automatically set first time page is accessed
  * D  - dirty: traps when D=0 and page is written
  * two bits reserved, three bits user-definable
  * page directory PTE bits and second level PTE bits are combined in the most
    restrictive fashion
* 64 entry TLB

Caches

* Flush instruction forces a dirty data cache line (32 bytes) back to memory.
  Intel supplies suggested code to flush entire data cache.
* Storing to dirbase register with ITI bit set invalidates TLB and instruction
  caches; must flush data cache first!  [Remember, the data cache is virtually
  addressed.]
 
Core Unit

* Standard 32 bit RISC architecture:
  * 32 32-bit integer registers
  * fault instruction, psr, epsr, dirbase, data breakpoint registers
  * r0 always reads as 0
  * 8, 16, 32 bit integer load/store insns, operands must be appropriately
    aligned; byte or word values are sign extended on load.  [I hope you
    don't use "unsigned char" too much...]
  * 2 source, 1 destination add/subtract/logical (and, andnot, or, xor)
  * No integer multiply/divide instructions.  To multiply, you move the
    operands to floating point registers, use multiply (four insns plus
    five free delay slots).  To divide, you move the dividend to a floating
    point register and multiply by the reciprocal.  This can be very slow
    (59 clocks) if the divisor is a variable (hopefully infrequent).
* 32 bit shift left/right/right-arithmetic, plus 64 bit funnel shift
  ("shift right double").  They ran out of bits to specify two 32 bit sources
  plus destination plus shift count, so the shift count of the last 32 bit
  shift right (automatically stored in the 5 bit SC field of the psr) is used.
* Similar to MIPS Rx000 architecture in some ways:
  * load/store addressing mode is src1(src2), src1 is a register or 16 bit
    immediate constant.
  * form 32 bit constants using andh/andnoth/orh/xorh on upper 16 bits of
    a register
* Only one condition code bit (CC), set in various ways by signed/unsigned
  add/subtract/logical operations, unaffected by shift ops
* Delayed and non-delayed branches on CC set/not set (bc[.t], bnc[.t])
* Non-delayed branch on src1 ==/!= src2 (bte, btne)
* Strange delayed branch "bla" instruction, for one instruction looping.
  useful for aoblss/dsz/isg type looping.  Uses its own special LCC condition
  code bit.  "Programs should avoid calling subroutines while within a bla
  loop, because a subroutine may use bla also and change LCC".  [Ug.]
* Trap, trap on integer overflow instructions
* Call/call indirect, stores return address in r1.
* Unconditional branch, branch indirect, latter also used for return and
  return from trap.
* Core unit loads and stores floating point operands of 32, 64, and 128 bits
* Pipelined floating load instruction (32/64 bits) queues an address of an
  operand not expected to be in cache, and stores the result of the third
  previous pipelined floating load into the destination floating register.
  [This is the data-loading component of the i860 "vector" support.]
* Bus lock/unlock instructions for flexible indivisible read-modify-write
  sequences.  Interrupts are disabled while the bus is locked.  "If ...
  the processor does not encounter a load or store following an unlock
  instruction by the time it has executed 32 instructions, it triggers
  an instruction fault...".
  For example: locked test and set is:
	// r22 <- semaphore, semaphore <- r23
	lock				// next cache miss load/store locks bus
	ld.b	semaphore, r22
	unlock				// next load/store unlocks bus
	st.b	r23, semaphore
* Pixel store instructions for selectively updating particular masked pixels
  in a 64-bit memory location, used for Z-buffer hidden surface elimination.
  Pixel mask is set by fzchk instructions (in floating point/graphics unit)

Floating Point Unit

* 32 32 bit single precision floating point registers, can also be treated
  as 16 64 bit double precision registers.
* graphics operands also stored in the fp registers
* f0/f1 reads as 0
* pipelined multiply and add units
* floating point instructions can be non-pipelined, or pipelined
* Similar to the pipelined load above, in a pipelined multiply or add
  instruction, the source operands go into the pipeline, and the result of
  the 3rd (or so) previous pipelined multiply or add is stored in the
  destination register(s).
* Pipeline lengths
  * adder:     3 stages
  * multiplier: 2 or 3 stages (2 double precision, 3 single(!))
  * graphics:  1 
  * load:      3 (loads issued from core unit above)
* IEEE status bits percolate through the fp pipelines, and can be reloaded,
  along with the pipeline contents, after traps
* Divide?  Ha!  If Seymour can do it with reciprocals, so can the i860.
  The frcp and frsqr insns return the approximate reciprocal and reciprocal
  square root "with absolute significand error < 2^-7".  Intel supplies routines
  for Newton-Raphson approximations that take 22 clocks (*almost* single
  precision) or 38 clocks (*almost* double precision), and the Intel i860
  library provides true IEEE divide.  [RISC design principles at work:
  divides are infrequent enough not to slow down/drop some other feature
  to provide divide hardware.]
* Dual operation instructions (not "dual mode"): Some pipelined instructions
  cause both a pipelined add and a multiply operation to take place.  Since
  the instruction can only encode two source operands, the others are taken
  from temporary holding registers and busses connecting the two units
  in various topologies, depending upon the data path control field of the
  instruction opcode.  [Many real world computations e.g. dot product can
  make use of these instructions.]

Dual Instruction Mode

* DIM allows the i860 to run both a core and a floating/graphics unit insn
  on each cycle.  The resulting 64 bit "wide instruction" must be 64
  bit aligned.
* There is a two cycle latency: two cycles after a floating instruction with
  the D bit set, both a core and a floating insn will be issued.  Similarly,
  if the D bit is clear, there will be no DIM two cycles (two instruction
  pairs) later.
* There are various sensible rules for determining the result of insn pairs
  which set/use common registers, control registers, etc.

Graphics Unit

* Pipelined and non-pipelined 64 bit integer add and subtract.
* 16/32 bit non/pipelined Z buffer check instructions:
  "fzchks src1, src2, rdest (16 bit Z-Buffer Check)
   Consider src1, src2, and rdest as arrays of four 16 bit fields
   src1(0..3), src2(0..3), rdest(0..3), where zero denotes the
   least-significant field.

   PM <- PM >> 4
   FOR i = 0 to 3
   DO
     PM[i+4] <- src2(i) <= src1(i) (unsigned)
     rdest(i) <- smaller of src2(i) and src1(i)
   OD
   MERGE <- 0"
  This particular instruction merges four (arbitrary sized) pixels whose
  16 bit Z-buffer values are in one of the (64 bit) sources, and the current
  Z-buffer value in the other source, setting pixel mask bits (controlling
  the pixel store insn described above), and updating the Z-buffer depth
  values.  [Neat!  Just what my (personal) graphics package ordered!]
* Pixel add instructions, which add fixed point values, the results
  accumulating in a special MERGE register.  You can use these to interpolate
  between (for instance) two colours as you scan convert a polygon.
* Z-buffer add instructions, for the analogous case of distance interpolation.

Traps

Briefly, there are instruction, floating point, instruction access, data
access, interrupt, and reset traps.  On a trap, the i860 enters supervisor
mode, saves/modifies various psr bits, saves the faulting instruction address,
and jumps to the trap handler which must be at 0xFFFFFF00.  There are various
complications for dual instruction mode, bus lock mode, and for saving/
restoring the various pipeline states.

Interlocks

The i860 is fully interlocked, so no need to insert nops.  You can, of course,
increase performance by reordering insns with dependencies.  For instance,
in the current implementation, referencing the result of a ld in the next
instruction can cause a one clock delay.

Other interesting timings:
* TLB miss: five clocks plus the number of clocks to finish two reads plus
  the number of clocks to set A (accessed) bit, if necessary.  [I guess Intel
  found Mips' and others' software TLB lookup unworthy...]
* ld/fld following st/fst hit: one clock.
* delayed branch not taken: one clock [to skip/annul the delay slot instruction]
* nondelayed branch taken: bc, bnc: one clock; bte, btne: two clocks
* st.c (store to a control register): two clocks.


Comments

Well, that about does it.  Quite a neat part.  I think Intel has done
themselves proud with a very clean and well-balanced design; I guess they've
been reading comp.arch... :-)  I had read rumours that this was to be a
floating point coprocessor for the x86, and had feared that it would be
burdened with lots of slave-processor crap, but that is not the case.

If I could change one thing, it would be to add Mips' on-chip external cache
control hardware.  Why hasn't anyone else picked up on this idea?  I'm
afraid that for some code (not *mine*, of course) the 4K on-chip insn cache
will be too small; a cache controller would allow you to add big external
caches with a minimum of heartache.  "I guess there's no pleasing some
people!"


Any typos/misinterpretations are my own.  I speak only for myself.

Jan Gray  uunet!microsoft!jangr  Microsoft Corp., Redmond Wash.  206-882-8080

w-colinp@microsoft.UUCP (Colin Plumb) (03/06/89)

Well, I just got Jan's copy of the "i860 64-bit Microprocessor Programmer's
Reference Manual" and am going to post an even longer summary.  I'll try
to avoid too much duplication.

Personal flames:
The exception handling is a disaster.  On any sort of exception, the processor
switches to supervisor mode and jumps to virtual address 0xFFFFFF00.  Then
you have to stare at the bits in the status register to figure out what
happened, handle it, and do arcane things to get the processor in a state
such that it can restart.  This involves looking at the instruction that
faulted and the one just before and parsing them a bit.  Bleah!
Since the processor doesn't handle denormalised, infinity, or NaN values
in the floating-point unit, instead causing a trap, and the business of
sticking the right value into the pipeline is so tedious, you basically
need to avoid these things altogether.

Also, interrupt return is weird.  It's overloaded onto the branch indirect
instruction.  If the status register indicates that you're inside a trap
handler, it does some interrupt-return things in addition to branching
to an address specified in a register.

Integer divide is done by converting to floating point, doing the Newton-
Raphson bit, and converting back.  The sample code they give requires 62
clocks (59 without remainder).  Can you say "divide step" boys and girls?

The instruction in the delay slot of a control transfer must not be another
control transfer instruction, including a trap.  This makes me wonder if
putting a trap there could sufficiently confuse the processor that I'd
end up in my code in supervisor mode.  No reason to believe so, just a
nasty idea that popped into my head.  (The first rule of root-crackers:
look for something which says "do not do x."  Try as many variations of x
as possible.)

System registers may be read in user mode, and writes are simply ignored.
So much for virtual machines!

Some floating-point instructions can't be pipelined; others must be.
Annoying.

This is Intel order number 240329-001, copyright Intel 1989.
Other related documents:
i860 64-bit Microprocessor (data sheet), order number 240296
i860 Microprocessor Assembler and Linker Reference Manual, 240436
i860 Microprocessor Simulator-Debugger Reference Manual, 240437

The manual I have says absolutely nothing about pinout, timing, or any such
electrical thing.

Anyway, on to the meat:

>> Introduction and Register Summary <<

There are 32 32-bit integer registers and 32 32-bit fp registers.
The fp registers are used in even/odd pairs for double-precision
operations.  The even register appears in memory at the lower address.

Other registers are the psr (processor status register, 32 bits), and epsr
(32 more bits), the db register (debugging, specifies an address to
breakpoint; reads and writes can be trapped), the dirbase register (root
pointer for page tables), fir (fault instruction register, saved PC
on a fault), fsr (floating-point status register), three special-purpose
registers (64 bits) for use in pipelined floating-point mode: KR, KI, and T,
and a 64-bit MERGE register used in pixel operations.

The psr bits, lowest to highest are:
 0: BR, Break Read
 1: BW, Break Write - these bits control breakpoints used with the db register.
	When one is set, the corresponding access to the address specified in
	the db register causes a trap.  The db register specifies a byte
	address; any access touching that byte will be trapped.
 2: CC, Condition Code - there is only one CC bit, set by the add and subtract
	instructions as a greater than/less-than flag, and by the logical
	(and, or, xor, andnot) instructions as a zero flag.
 3: LCC, Loop Condition Code - this is used by the bla instruction only
	to do add-compare-and-branch type things.
 4: IM, Interrupt Mode - external interrupt enable bit.
 5: PIM, Previous Interrupt Mode - state of the IM bit before the last trap.
 6: U, User - set if the processor is in user mode.
 7: PU, Previous User - copy of U bit as of before last trap.
 8: IT, Instruction Trap - set by the processor when a trap occurs if the
	current instruction caused a trap.  Breakpoints and the like.
 9: IN, INterrupt - set when a trap occurs if an external interrupt is
	a contributing factor.
10: IAT, Instruction Access Trap - as above, set if there was an address
	translation problem during instruction fetch.  There is no mention
	of a BERR-like pin.
11: DAT, Data Access Trap - as above, but for data accesses.  This bit is
	also set by unaligned load/stores and BR/BW exceptions.
12: FT, Floating-point Trap - set if a floating-point error contributed to
	a trap.  Note that any combination of the trap bits may be set on
	entry to the interrupt handler at 0xFFFFFF00.  No bits set indicates
	reset/power-up.  Multiple bits set indicates multiple simultaneous
	exceptions.
13: DS, Delayed Switch - there is a 2-cycle latency between the first
	instruction in the stream with the dual-instruction mode bit set
	and the processor starting to execute two instructions per cycle,
	or between the first instruction with the bit clear and the cessation
	of dual-instruction mode.  This bit is set when the first cycle of
	latency has passed, but not the second.  Note that it is set for
	both switching to dual-instruction mode and away.  The direction
	is given by the DIM bit.
14: DIM, Dual-Instruction Mode - set if the processor is in dual-instruction
	mode, executing an integer ("core") instruction and a floating-point
	one in a single cycle.  It is only set when a trap occurs; it does not
	reflect the current state of the processor, but the one before the
	trap.  The same goes for the DS bit.
15: KNF, Kill Next Floating - on trap return, if this bit is set, the next
	floating-point instruction is ignored (except for its dual-instruction
	bit).  Useful when emulating a floating-point instruction that trapped
	in dual-instruction mode, when you want to retry the "core" integer
	instruction but not the fp one.
16: X - Unused.  Undefined when read, write with 0 or saved value.
17..21: SC, Shift Count - remembers the shift used in the last SHR (shift
	right logical) instruction.  Specifies the shift count for the
	SHRD (shift right double - extract 32 bits from 64) instruction.
	Equivalent to the 29000's FC register.
22..23: PS, Pixel Size - specifies the size of a pixel for graphics operations.
	0 through 3 mean 8, 16, 32, or <undefined> bit pixels.
24..31: PM, Pixel Mask - the pixel store instruction stores those pixels in
	the current 64-bit word specified by the low-order bits of this field.
	These bits can be set by various z-buffer instructions.

The PM, PS, SC, CC, and LCC fields can be set from user mode; writes to the
other fields are ignored in user mode.

The epsr bits, lowest to highest are:
 0..7: Processor type - specifies the type of the current processor.  1
	for the i860. (Hardwired, may not be changed even by supervisor)
 8..12: Stepping number - specifies the revision. (Also hardwired)
13: IL, InterLock - set if a trap occurs in the middle of a lock/unlock
	sequence.
14: WP, Write Protect - if clear, supervisor-mode accesses ignore the write
	protect bit of a TLB entry.  If set, even supervisor-mode writes
	are disallowed.
15..16: X, unused.
17: INT, INTerrupt - the value of the INT input pin.  It looks like there
	is only one, and this bit is unqualified.  Writes are ignored.
18..21: DCS, Data Cache Size - the size of the data cache.  2^(DCS+12)
	bytes.  Currently 1, meaning 8Kbytes.  Hardwired.
22: PBM, Page-table Bit Mode - determines which of two bits in the page table
	entry is reflected on the PTB pin.  If 0, the CD bit; if 1, the WT bit.
23: BE, Big-Endian - set if the processor is in big-endian mode.  Causes the
	low 3 bits of the address bus to be complemented.
24: OF, OverFlow - set or cleared by the add and subtract instructions if
	signed or unsigned overflow (depending on the instruction) occurs.  There is
	an instruction analogous to the 68000's TRAPV which traps if this bit
	is set.
25..31: X, unused.

OF is user-writable; the other fields are only writeable from supervisor mode.

The db, Data Breakpoint register contains a byte address which is watched
for accesses.  If any access touches this byte and the corresponding
bit in the psr is set, a data access trap occurs.

The dirbase, directory base register points to the root of the page table
tree.  Standard two-level page table with 4K pages.

The bits, lowest to highest, are:
 0: ATE, Address Translation Enable - if set, address translation is enabled.
	You must flush the data cache before fiddling with this bit.
 1..3: DPS, DRAM Page Size - the i860 has support for page-mode or
	static-column DRAMs.  If two accesses differ only in the low-order
	12+DPS bits, the NENE# pin is asserted.  Zero is used for one bank
	of 256Kxn DRAMs.
 4: BL, Bus Lock - echoed to the outside world on the LOCK# pin, after one
	cycle of latency.  Controlled by the lock and unlock instructions.
	Copied to the IL bit of the epsr and cleared on a trap.
 5: ITI, Instruction cache and TLB Invalidate - when a 1 is written, the
	instruction cache and page table cache are invalidated.  Always
	reads as zero.
 6: X, unused.
 7: CS8, Code Size 8 bits - when set, instruction fetches are done from
	8-bit-wide memory instead of 64.  Used for bootstrapping from
	a ROM.  Once cleared, cannot be reset.
 8..9: RB, Replacement Block - can control which block (set) of a cache
	is replaced on a miss.  Used by the data-cache flush instruction,
	and for testing in conjunction with the next field.  For the data
	and instruction caches, which are 2-way set-associative, only the
	low bit is used.  For the TLB, both bits are used.
10..11: RC, Replacement Control - the i860 normally uses random replacement
	on all of its caches.  For testing, you can replace this with a
	deterministic algorithm.  00 is normal, 01 causes all cache
	replacements to use the set specified in the RB field, 10 causes
	the data cache to obey the RB field, and 11 disables data cache
	replacement.  I think this means hits can still occur, but no
	new information will be added to the cache.
12..31: DTB, Directory Table Base - this, with 12 low bits of 0, is the address
	of the first-level page table.

The fir (Fault Instruction Register) holds the (virtual, I think) address of
the instruction that caused the trap.  The first time it is read, it is
unfrozen, and subsequent reads will just get the address of the load
instruction.

The fsr contains all the floating-point flags:
 0: FZ, Flush Zero - if set, underflow is flushed to zero instead of raising
	a result-exception trap.
 1: TI, Trap Inexact - if set, inexact results cause a trap.
 2..3: RM, Rounding Mode - 0 through 3 mean round towards nearest, -inf, +inf,
	and 0.
 4: U, Update - always reads as zero; if set on a write of this register,
	bits 9 through 15 and 22 through 24 are written.  If clear, the
	data written to them is ignored and they are unchanged.
 5: FTE, Floating-point Trap Enable - if clear, floating-point traps are
	never reported.  Used when mucking with the pipeline in various
	ways, and when sticking software-emulated values into the fp unit.
 6: X, unused.
 7: SI, Sticky Inexact - set whenever an inexact result is generated,
	regardless of the state of the TI bit.  Cleared only by explicit
	write.
 8: SE, Source Exception - set when one of the inputs to a FLOP is invalid
	(infinity, denormal, or NaN).
 9: MU, Multiplier Underflow - on read, indicates the last multiply operation
	to come out of the pipeline underflowed.  On write (note only written
	if the U bit is set), this forces the flag on the operation in the
	first stage of the multiply pipeline.  When that operation reaches
	the end of the multiply pipeline, this is the MU bit that will
	come out.  Used for reloading the pipeline.
10: MO, Multiplier Overflow - similar to above.
11: MI, Multiplier Inexact - similar to above.
12: MA, Multiplier Add-one - similar to above, but indicates that the multiplier
	rounded up instead of down.  I'm not sure of the effects in the
	presence of sign bits.
13: AU - Adder Underflow - similar to MU.
14: AO - Adder Overflow - similar to MO.
15: AI - Adder Inexact - similar to MI.
16: AA - Adder Add-one - similar to MA.
17..21: RR, Result Register - holds the destination of a FLOP when an exception
	occurs due to a scalar op.
22..24: AE, Adder Exponent - holds the high 3 bits of the exponent coming out
	of the adder.  Used to handle exceptions involving double-precision
	inputs and single-precision outputs properly.
25: X, unused.
26: LRP, Load pipe Result Precision - holds the precision of the value in
	the last stage of the load pipeline (more about this later); set on
	dp, clear on sp.  It cannot be set by software (except by stuffing
	things into the pipe) and is provided to help save the state of the
	pipe.
27: IRP, Integer pipe Result Precision - holds the precision of the
	value in the graphics pipeline.
28: MRP, Multiplier Result Precision - holds the precision of the value
	in the last stage of the multiplier pipeline.
29: ARP, Adder Result Precision - holds the precision of the value in the
	last stage of the adder pipeline.
30..31: X, unused.

The KI and KR registers hold constant inputs to the multiplier for use in
the multiply-accumulate instructions.  The T register is between the
multiplier and the adder, again for multiply-accumulate instructions.

The MERGE register is used by graphics operations.

The bits in a page table entry (on either level) are:
 0: P, Present - if clear, the entry is invalid and the high 31 bits are
	available to the programmer.
 1: W, Writeable - if set, the page is writeable.  This is always enforced
	in user mode, and enforcement in supervisor mode is controlled
	by the WP bit of the epsr.  The effective value of this bit
	for any page table entry is the AND of the bits at the two
	levels of page tables.
 2: U, User - if set, this page is accessible at user level.  Must be
	set at both levels of page table for the page to be accessible.
 3: WT, Write-Through - if set, this page will not be cached by the internal
	data cache.  This bit can also be echoed externally, if the PBM bit
	of the epsr is set.  This bit is only used on the second (lower)
	level of page tables.  On the first level, it is reserved.
 4: CD, Cache Disable - if set, this page will not be cached by either internal
	cache.  This bit can also be echoed externally, if the PBM bit of the
	epsr is clear.  This bit is only used on the second (lower) level of
	page tables.  On the first level, it is reserved.
 5: A, Accessed - only used in the second level of page tables, and is set
	whenever the page is loaded into the TLB.
 6: D, Dirty - only used on the second level of page tables.  If clear, and
	the page is being written to, a data access fault is generated.
	(I.e. must be maintained by software.)
 7..8: X, unused.
 9..11: Reserved for use by the OS.
12..31: high-order bits of the physical address of the page or next level of
	page tables.

There is an external pin, KEN#, that must be asserted to enable caching
of instructions and/or data.  If not asserted, the data is not put into the
cache.  The data cache must be explicitly flushed by software before you
may change page tables.

>> Integer ("core") instructions <<

There are a few instruction formats.  The most common has, high to low,
6 bits of opcode (of which the low bit, bit 25, is usually clear), 5
bits of src2, 5 bits of dest, 5 bits of src1, and 11 bits of immediate
offset.  Call this format A.

Next most common has 6 bits of opcode (low bit usually set), 5 bits of
src2, 5 bits of dest, and 16 bits of immediate constant (frequently
ignored; set to 0).  Call this format B.

A few instructions have the high 5 bits of a 16-bit immediate offset in
the dest field (bits 16..20) rather than the src1 field.  Store and a few
branch instructions.  Call this format C.

A few instructions are like the above, and also have a 5-bit immediate
constant in the src1 field.  Call this format D.

The largest branch offsets are handled by instructions having 6 bits of
opcode and 26 bits of (signed) offset.  Call this format E.

Floating-point instructions have 010010 in the high 6 bits, 5 bits of
src2, 5 bits of dest, 5 bits of src1, 4 magic bits (P - pipeline, D -
dual-instruction mode, S - source precision, and R - result precision),
and 7 bits of opcode.  Call this format F.

Some special operations have 010011 in the 6-bit opcode field, 5 bits of
src1 in bits 11..15, and 5 bits of opcode in bits 0..4.  The other bits
are unused.  Call this format G.

> Load
The load instruction (ld) has 3 variants: ld.b, ld.s, and ld.l.
dest = mem[src1 + src2].  The opcode is 000L0I.  I controls src1.
0 if it's a register (format A), 1 if it's a 16-bit signed offset
(format B).  L is 0 if it's a byte load, 1 if it's a 16 or 32-bit
load.  In the latter case, bit 0 of the instruction is stolen from
the immediate offset and indicates 16 (0) or 32 (1) bit loads.
This bit is considered to be 0 for the addition.  Loads are always
sign-extended.

BTW, the Intel-suggested format is ld.x src1(src2), dest.  No more
mov dest, src nonsense!

There is 1 cycle of latency on loads, even if they hit the cache.  This
is interlocked.

> Store
Store is similar (st.b, st.s, or st.l), but it must use a 16-bit
immediate offset, and uses format C.  mem[src2 + immediate] = src1.
The opcode is 000L11.  (Again, st.x src1, imm(src2).)

Note that r0 hardwired to zero comes in handy here.  32-bit absolute
addresses must be formed by loading the high 16 bits into a register and
taking an offset from there.  Intel suggests r31 for this purpose.
Because the offsets are signed, you may have to diddle the high 16 bits
a bit to make things work properly.

> Move int to fp
ixfr src, fdest moves 32 bits (no format conversion) from an integer
to an fp register.  The opcode is 000010.  There are two cycles of latency
(interlocked) until the data appears in the destination.  Format A.

000110 is reserved.

> Fp load
fld.y src1(src2), fdest and fld.y src1(src2)++, fdest are floating-point
load instructions.  They are similar to the integer load instructions,
except the data sizes are .l (32 bits), .d (64), or .q (128 bits).
The ++ autoincrement mode stores src1+src2 back into src2.  Again, the
low-order bits of the immediate offset (formats A or B) are used in
the instruction.  Bit 0 is set for autoincrement addressing, bit 1 is
set for 32-bit loads, and bit 2 is set for 128-bit loads.  Bits 1 and
2 are zero for 64-bit loads.  The opcode is 00100I, I selecting format
A or B.  For 64 or 128-bit loads, the destination register must be
even or a multiple of 4.

> Fp store
fst.y fdest, src1(src2) and fst.y fdest, src1(src2)++ are similar.
src1 can be a register here.  The opcode is 00101I.

> Pipelined fp load
pfld.z src1(src2)[++], fdest, pipelined load, has the same addressing
modes, except they use the 3-deep load pipeline (which operates
independently of scalar loads; the two may be arbitrarily interleaved).
The destination register specifies where to put the result of the 3rd
previous pfld instruction.  128-bit pipelined loads are not allowed.
Pipelined accesses do not place the data in the cache, although they
do handle cache hits properly.  The opcode is 01100I, I selecting format
A or B.

> Pixel store
pst.d freg, imm(src2)[++] stores the pixels specified by the PM field
of the psr from the 64-bit freg into memory.  If you have 16-bit pixels
and the low bits of the PM field are 0110, only the middle 4 byte strobes
will be asserted on write.  Bit 0 corresponds to the lowest address.
See the fzchks and fzchkl instructions for uses for this.  The opcode
is 001111, and it uses format B.  After the store, the PM field is shifted
right by the number of bits used, so the next pst instruction will
have access to the next bits.

> Add, Subtract, Compare
The add/subtract instructions also double as compare instructions.
There's addu src1, src2, dest, adds, subu, and subs.  They do the
obvious things, also setting the OF flag on overflow, and setting
the CC flag as follows:
addu: CC gets the carry from bit 31.
adds: CC gets (src2 < -src1)
subu: CC gets the carry from bit 31 (src2 <= src1)
subs: CC gets (src2 > src1)

This uses formats A and B.  If the 16-bit immediate is used, it is sign-
extended.  To get the one's complement, use subs -1, src2, dest.

The opcode is 100UAI, where U is 0 for unsigned, 1 for signed; A is
0 for add, 1 for subtract, and I selects format A or B.  (0 for A, 1 for B.)
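
For the record, here's a little C model of those CC rules (my function
names, not Intel's; the adds case uses 64-bit math to dodge overflow on
-src1):

```c
#include <stdint.h>

/* Models of the i860 add/subtract condition-code results described above. */
int cc_addu(uint32_t src1, uint32_t src2) {
    return (uint32_t)(src1 + src2) < src1;   /* carry out of bit 31 */
}
int cc_adds(int32_t src1, int32_t src2) {
    return (int64_t)src2 < -(int64_t)src1;   /* src2 < -src1 */
}
int cc_subu(uint32_t src1, uint32_t src2) {
    return src2 <= src1;                     /* carry = no borrow */
}
int cc_subs(int32_t src1, int32_t src2) {
    return src2 > src1;
}
```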

> Shift, Rotate
The shift instructions are shl, shr, shra, and shrd.  The first three do
the obvious things, shr zero-filling high-order bits and shra sign-extending.
shr *also* copies its src1 (register or 16-bit immediate, with the high 11
bits ignored) to the SC field of the PSR.  shrd uses this to compute
dest = (((src1<<32) | src2) >> SC) & 0xFFFFFFFF.

The opcodes are:
10100I for shl, I selects format A or B.
10101I for shr, I selects format A or B.
10111I for shra, I selects format A or B.
101100 for shrd, format A only.

None of the shifts set the condition code bit, so the assembler uses the
following macros:

mov src2, dest == shl r0, src2, rdest
nop == shl r0, r0, r0
fnop == shrd r0, r0, r0

To do a rotate, shr count, r0, r0 and shrd src, src, dest.
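
Here's a C sketch of the shr/shrd pairing and the rotate idiom (SC is
modeled as a plain variable; names are mine):

```c
#include <stdint.h>

static unsigned sc;                       /* models the SC field of the psr */

uint32_t i860_shr(unsigned count, uint32_t src2) {
    sc = count & 31;                      /* count is saved in SC... */
    return src2 >> sc;                    /* ...and the shift happens too */
}
uint32_t i860_shrd(uint32_t src1, uint32_t src2) {
    uint64_t pair = ((uint64_t)src1 << 32) | src2;
    return (uint32_t)(pair >> sc);        /* funnel shift right by SC */
}
uint32_t rotr32(uint32_t x, unsigned count) {
    i860_shr(count, 0);                   /* shr count, r0, r0: set SC */
    return i860_shrd(x, x);               /* shrd src, src, dest */
}
```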

> Trap
There is a trap instruction, which uses format B, although the source
operands are not interpreted.  The destination register is "undefined,"
so it's a good idea to use register 0.  The opcode is 010001.  This
causes an IT trap (see the psr).  The source bits can be used for
whatever.

> And, Or, Xor, Andnot
There are 4 logical instructions, and, or, xor, and andnot.  They do the
obvious things, dest = src1 & src2, dest = src1 | src2, dest = src1 ^ src2,
dest = ~src1 & src2.  The opcodes are of the form 11OPHI, where OP
specifies one of {and, andnot, or, xor}.  I specifies A or B format,
and H can be set in B format to indicate that the immediate constant
should be shifted up 16 bits before use.  Thus, to load the high 16 bits
of a register, orh immediate, r0, dest will do the trick.  Or xor.
H bit set and I bit clear is reserved.  16-bit immediate values are
zero-extended, if used.

The CC flag is set if the result is zero, otherwise it is cleared.  The
opcodes with the H bit set are andh, andnoth, orh, and xorh.
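
The orh trick behaves like this C sketch (remember the logical
immediates are zero-extended, so any 32-bit constant can be built in
two instructions):

```c
#include <stdint.h>

/* orh HI, r0, dest; or LO, dest, dest -- builds an arbitrary constant. */
uint32_t orh_imm(uint32_t imm16, uint32_t src2) {
    return ((imm16 & 0xFFFFu) << 16) | src2;  /* immediate shifted up 16 */
}
uint32_t or_imm(uint32_t imm16, uint32_t src2) {
    return (imm16 & 0xFFFFu) | src2;          /* zero-extended immediate */
}
```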

> Control register modification
ld.c and st.c are used to modify control registers.  Format A is used,
although only src2 and one of src1 (for st.c) or dest (for ld.c)
are interpreted.  The opcode is 0011L0, where L is 0 for load
(dest = special[src2]) and 1 for store (special[src2] = src1).
The src2 field holds 0 through 5 for the fir, psr, dirbase, db, fpsr,
and epsr registers.  These instructions are legal in user mode, although
many writes will be ignored.

> Branches
Most of the branch instructions use format E, taking a 26-bit offset.  I assume
the offset is a word offset (in case I forgot to mention it, instructions
must be 32-bit word-aligned), but I can't find it explicitly stated.  br
(opcode 011010) is a straight branch, with 1 delay slot.

> Call
call (opcode 011011)
is similar, but also puts a return address in register r1.
bc and bnc branch if the CC flag is set or clear, respectively.  They
come in non-delayed (bc, bnc) and delayed (bc.t, bnc.t) versions.  The
opcodes are 01110T and 01111T, respectively.  T is set in the .t (delayed)
forms.

bri is an indirect branch, delayed, using opcode 010000 and format A,
I think, although only src1 is used.  bri [src1] branches to the address
specified in src1.  The low two bits of src1 are ignored.  If any of the
trap bits are set when this instruction is executed (see the psr), this
also performs an interrupt return, clearing the trap bits, copying PU
to U and PIM to IM, and doing strange things with DS and DIM.

> Loop
There's also bla, a looping-type instruction.  It's a bit weird.
First of all, if the LCC flag is set, it does a delayed branch
with a 16-bit offset, then it computes "adds src1, src2, src2" and
sets the LCC flag to what the complement of the CC flag would be for
a real adds.  This uses format C, with opcode 101101.

Intel gives the example of clearing an array of 16 single-precision
numbers to zero, starting at the address in r4:

	adds	-1, r0, r5	// r5 holds loop increment
	or	15, r0, r6	// r6 holds loop count
	bla	r5, r6, CLEAR_LOOP	// clear LCC; it doesn't matter
					// if we jump or not
	addu	-4, r4, r4	// compensate for preincrement (delay slot)
CLEAR_LOOP:
	bla	r5, r6, CLEAR_LOOP
	fst.l	f0, 4(r4)++	// delay slot

I've never seen a looping instruction quite like it.  Be careful not to
trash LCC during the loop (shades of jcxz!).
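
In C, the example boils down to this (the point being that the
delay-slot store executes once per bla, including the final not-taken
one, for 16 stores in all):

```c
/* C equivalent of Intel's CLEAR_LOOP example: clear 16 single-precision
   values starting at the address in r4. */
void clear16(float *r4) {
    int r5 = -1;          /* loop increment */
    int r6 = 15;          /* loop count */
    r4 -= 1;              /* addu -4, r4, r4: compensate for preincrement */
    do {
        *++r4 = 0.0f;     /* fst.l f0, 4(r4)++ in the delay slot */
        r6 += r5;         /* the adds side of bla */
    } while (r6 >= 0);    /* the LCC test, roughly */
}
```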

Other core instructions:

> Compare-and-branch
bte src1, src2, offset and btne src1, src2, offset branch (no delay)
if src1 == src2 or src1 != src2, respectively.  They have opcode 0101EI,
where E = 1 branches on equal, and I selects format C or D.  I.e. src1
can be an immediate value, but only in the range 0..31.

> Flush
flush - flush the data cache. "In user mode, execution of flush is suppressed,"
whatever that means.  What it seems to do is force a fake load, one that fills
the data cache with garbage.  "When flushing the cache before a task switch,
the addresses used by the flush instruction should reference non-user-
accessible memory to ensure [Will wonders never cease?  A book written in the
U.S. actually got ensure/insure straight!] that cached data from the old task
is [oh, well... can't win them all] not transferred to the new task.  These
addresses must be valid and writeable in both the old and the new tasks'
space."

The sample code reserves a 4K hunk of memory, and does this:

// Rw, Rx, Ry, and Rz are registers
// FLUSH_P_H and FLUSH_P_L are two halves of the address of the 4K hunk,
// less 32.

	ld.c	dirbase, Rz	// assuming RB and RC fields clear
	or	0x800, Rz, Rz	// Set RC field to 2 (obey RB for data cache)
	adds	-1, r0, Rx	// Loop increment
	call	D_FLUSH	
	st.c	Rz, dirbase	// Store new RC field (in delay slot of call!)

	or	0x900, Rz, Rz	// Set RB field to 2 (was assumed 0)
	call	D_FLUSH
	st.c	Rz, dirbase	// Store new RB field (in delay slot of call!)
	xor	0x900, Rz, Rz	// Clear RB and RC fields
// Pound on DTB, ATE, or ITI fields here
	st.c	Rz, dirbase	// Store cleared values
// continue...

D_FLUSH:
	orh	FLUSH_P_H, r0, Rw
	or	FLUSH_P_L, r0, Rw	// Rw gets address of flush area
	or	127, r0, Ry		// loop counter
	bla	Rx, Ry, D_FLUSH_LOOP	// set up LCC
	ld.l	32(Rw), r0		// clear pending bus writes
D_FLUSH_LOOP:
	bla	Rx, Ry, D_FLUSH_LOOP	// Loop
	flush	32(Rw)++		// Hit every 32 bytes (cache line size)
	bri	r1			// Return - branch to (r1)
	ld.l	-512(Rw), r0		// Load from flush area to clear pending
					// writes (guaranteed cache hit).
	
I don't quite understand the bit about clearing pending writes.  I guess
it puts off address translation until the last possible moment (the write
queue uses virtual addresses), and a load to r0 is an idiom which always
generates an interlock.

The flush instruction uses opcode 001101; format B.  Bit 0 of the immediate
field selects autoincrement mode.

That's everything in formats A through E; now for format G.  (High 6 bits
opcode = 010011, low 5 bits give secondary opcode; only one 5-bit register
field defined.)

The defined operations are:
calli: opcode 00010, performs an indirect (delayed) call via the address 
specified in the register operand.  I don't know if it reads the source
register before or after storing the return address in r1.  Could be a
way to play with coroutines.

intovr: opcode 00100, traps if the OF flag in the epsr is set.  trapv.

> Lock
lock: opcode 00001.  This is interesting.  This begins an interlocked
sequence on the next data access that misses the cache, setting the BL bit.
Interrupts are disabled and the bus is locked until explicitly unlocked.
The sequence must be restartable from the lock instruction in case a
trap occurs.  If there is more than one store, you must ensure there are
no traps after the first non-idempotent store.  I.e. keep the code on one
page and make sure all the data addresses are valid.

There is a similar unlock instruction (opcode 00111), that unlocks the bus
on the first data access that misses the cache after it.

These instructions *are* executable from user mode, but there is a 32-
instruction counter that traps if you spend too long with the bus locked.

I like those instructions.  An RTOS might like to be able to set the timeout,
but 32 instructions is a reasonable value.

Now, for the interesting part:

>> Floating-Point <<

These are all in the F format, with a 010010 opcode in the high 6 bits,
then 5 bits of src2, dest, and src1, then 4 magic bits, then 7 bits
of fp opcode.  Two of the magic bits control the source and destination
precisions.  S=0 for single and S=1 for double sources.  R=0 for
single and R=1 for double results.

> Pipelines
Here comes time to explain the pipeline concepts used by the 80860.

There are 4 pipelines on the i860: multiplier, adder, graphics unit,
and floating-point loads.  These are 2/3, 3, 1, and 3 stages deep.
The multiplier is 2 stages deep for double-precision sources and 3
stages [sic] for single.  The destination format is unimportant.

Changing the FZ (flush zero), RM (rounding mode) and/or RR (result register)
bits of the fsr while there are results in the adder or multiplier pipelines
is a bad idea.

One of the magic bits in each fp instruction is the P, pipeline bit.  If
this bit is clear, the operation goes straight through the floating-point
unit.  Any results in the pipeline are lost, but the result is available by
the next instruction.  This is *not* the next cycle, but it's scoreboarded.
(This doesn't apply to the load pipeline, which is not used by scalar load
instructions.)

If the pipeline bit is set, though, then the specified dest is for the
result at the end of the pipeline and the requested operation goes in the
front.  The store is completed before the load of the source operands.
(At least conceptually.)

So initially, you must stick a few operations into the pipeline, throwing
away whatever was there (writing it to f0), then you can pump through
lots of data, then you have to stick in a few junk computations to get the
last few results.
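
A toy C model of one of these 3-deep pipelines makes the priming and
draining dance obvious (illustrative only; the real pipelines also carry
precision and result-status bits):

```c
/* Each pipelined op pushes a new operation in the front and delivers
   the result of the 3rd-previous one. */
#define DEPTH 3
static double stage[DEPTH];   /* pipeline contents (garbage at startup) */
static int head;

double pipe_op(double x) {
    double out = stage[head]; /* result from three ops ago */
    stage[head] = x;          /* new operation enters the pipe */
    head = (head + 1) % DEPTH;
    return out;
}
```

The first three results are whatever was in the pipe (write them to f0);
to drain, feed three junk operations and keep the results.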

The load pipeline, the pfld instruction, is the most straightforward,
and works as described above.

On the multiply pipeline, switching source precisions with the pipeline
half-full does strange things: if you started out in double (2-stage) mode
with B and A in the pipeline (A one stage from completion, B two), and added
single-precision computation C, you'd store A and end up with C, B and 0.0
in the pipeline.

If you started out with C, B, A, and added double-precision computation D,
you'd end up with A stored and D, C in the pipeline.  B would get lost.

Both inputs to an operation must be of the same precision.  There are odd,
not fully explained problems with taking double source operands and
returning a single result, so the precision suffixes on floating-point
operations should generally be restricted to .ss, .sd, and .dd.

> Fmul, Fadd, Fsub
Anyway, here's a list of the simple floating-point operations:
[p]fmul.p	src1, src2, dest (opcode 0100000)
[p]fadd.p	src1, src2, dest (opcode 0110000)
[p]fsub.p	src1, src2, dest (opcode 0110001) // result = src1 - src2

The fadd or pfadd instruction may have a .ds precision suffix, as long as
one of the sources is f0.  This is used for format conversion.  The [p]fadd
instructions are used in the [p]fmov macros.

> Float to integer
[p]ftrunc.p	src1, dest (opcode 0111010)
The result of this operation is 64 bits, whose low 32 bits are the integer
(truncated) part of the floating-point src1.  It uses the adder.
[p]fix.p	src1, dest (opcode 0110010)
Same as above, but the integer part is rounded.  For both of these, the
integer is two's complement, signed.
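
Modeled in C (a sketch; C's integer cast already truncates toward zero,
and I've used round-half-away-from-zero for fix, though the real rounding
follows the RM field of the fsr):

```c
#include <stdint.h>

int32_t ftrunc_model(double x) {
    return (int32_t)x;                  /* truncation toward zero */
}
int32_t fix_model(double x) {
    /* round to nearest; ties away from zero here for simplicity */
    return (int32_t)(x >= 0 ? x + 0.5 : x - 0.5);
}
```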

pfmul3.dd	src1, src2, dest (opcode 0100100)
This forces a dp multiply to use the 3-stage pipeline.  It's only intended
for reloading a pipeline.

> Multiply (integer)
fmlow.dd	src1, src2, dest (opcode 0100001)
This multiplies only the low-order bits of its operands.  dest gets the
low-order 53 bits of the product of the significands of src1 and src2.
Bit 53 of dest gets the MSB of the product.  This instruction cannot be
used in pipelined mode, does not affect the result-status bits in the fpsr,
and does not cause any traps.

> Divide, Reciprocal
frcp.p		src2, dest (opcode 0100010)
dest = 1/src2, approximately.  Absolute significand error < 2^-7.
src1 must be zero.  Use as a starting point for Newton-Raphson.
This instruction may not be pipelined.  It causes a source-exception
trap if src2 is zero.  It uses the multiplier.
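
The refinement loop frcp is meant to seed looks like this in C (the seed
is faked here by perturbing the exact reciprocal by roughly 2^-7; each
Newton-Raphson step squares the relative error, which is why two steps
suffice for single precision and three for double):

```c
/* Newton-Raphson reciprocal: y' = y * (2 - a*y). */
double recip(double a) {
    double y = (1.0 / a) * (1.0 + 0.005);  /* stand-in for frcp's ~7-bit seed */
    for (int i = 0; i < 3; i++)
        y = y * (2.0 - a * y);             /* error: 5e-3 -> 2.5e-5 -> ... */
    return y;
}
```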

> Square root
frsqr.p		src2, dest (opcode 0100011)
As above, but dest = 1/sqrt(src2), approximately, and it also traps if
src2 < 0.

> Fcmp
pfgt.p		src1, src2, dest (opcode 0110100, R bit clear)
pfle.p		src1, src2, dest (opcode 0110100, R bit set)
pfeq.p		src1, src2, dest (opcode 0110101)
These instructions perform floating-point comparison using the adder.
They begin with "p" because they advance the pipeline one stage
(the value they insert is undefined, but not an error), but they
place the result of the comparison (src1 > src2, src1 <= src2,
src1 == src2) in the CC bit immediately.  There is no pipeline delay.
(Actually, there is one cycle of latency, but it's scoreboarded.)
They do trap on invalid inputs.

> Multiply-accumulate
The following instructions are called dual-operation instructions, since
they use both the adder and multiplier.  Not to be confused with dual-
instruction mode.  Combining both of these gives the claimed 150 MOPS.

pfam.p	 	src1, src2, dest (opcode 000xxxx)
pfmam.p	 	src1, src2, dest (opcode 000xxxx)
pfsm.p	 	src1, src2, dest (opcode 001xxxx)
pfmsm.p	 	src1, src2, dest (opcode 001xxxx)

These instructions are really complex families of instructions.
They perform variations on multiply-accumulate.  The xxxx is the DPC
(Data-Path Control) field.

The precision specifies the input and output precisions of the multiplier;
the adder takes inputs and outputs of the destination precision.

Here is where the KI, KR, and T registers come in.  The possible data flows
are complex, but:

The value written into dest can be the result of either the adder or
multiplier pipeline.

The multiplier's src1 can be the instruction's src1, KI, or KR.  If it is one
of the K registers, the instruction's src1 can be copied into it prior to
use, or you can use its current value.

The multiplier's src2 can be the given src2 or the value written into
dest.

The multiplier's result can be written into the T register as well as sent
to the destination register.

The adder's src1 can be the instruction's src1 (if the multiplier hasn't
usurped it), the value written into dest (again, if nobody else has it),
or the value in the T register (which can be whatever it used to be or the
value written by the multiplier).

The adder's src2 can be the result of the multiplier, the value written into
the dest register, or the given src2 (assuming the multiplier hasn't
stolen it).

When you add the fact that the adder can compute src1+src2 or src1-src2,
you have a total of 64 possibilities.

A bit in the opcode specifies whether the adder adds or subtracts, and the
P bit is used to specify which output goes to the dest register (0 = adder,
1 = multiplier (and the adder's result is thrown away)).

After this factoring, there are 16 cases; 8 can be represented by the
DPC field values 0XYZ, where:
X controls whether "K" means KR (X=0) or KI (X=1),
Y controls whether the adder's src2 is the result of the multiplier (Y=0)
or the result of the multiplier goes into T and the adder's src2 is
the result that gets written into the dest register (Y=1), and
Z controls whether the instruction's src1 goes to the adder's src1
(Z=0) or the instruction's src1 goes to K (and thence to the multiplier)
and the adder's src1 comes from T (which may have come from the
multiplier).

DPC values of the form 1XY0 cover cases where the multiplier's inputs are
K and the result written to dest (K is controlled by X, as above) and the
adder's inputs are the instruction's src1 and src2.  Y controls whether
T is loaded with the result of the multiplier (Y=0) or not (Y=1).

DPC values of the form 1XY1 cover cases where the multiplier's inputs are
the instruction's src1 and src2.  If X is 1, then the adder's src1
is T (which is not loaded from the multiplier's result) and Y controls
whether the adder's src2 is the multiplier's result or the value written
to the dest register.  (Note that these may be the same value.)

If X is 0, then the adder's second input is the result of the multiplier
(which is not written into T), and its first input is controlled by Y.
If 0, it's the value written into the dest register; if 1, it's the
T register.

Are you suitably confused?  Pictures do help somewhat.  Intel supplies
transliteration rules for producing mnemonics from these various
connections, but I won't go into them here.

Scoreboard alert: when the multiplier's src1 is the instruction's src1,
this must not be the same as rdest.  Something screws up.

>> Graphics operations <<

These also use the fp instruction encoding and register set.  But they use
a separate graphics pipeline which is only one stage deep - i.e. when you
start one instruction, you get the result of the previous one out.
As with the floating-point instructions, most have pipelined and non-
pipelined versions, which behave analogously.
(The graphics operations use fp opcodes 1xxxxxx; I've already covered
everything of the form 0xxxxxx.)

> Long long
The basic ones are long-integer operations:
[p]fiadd.w	src1, src2, dest (opcode 1001001)
.w is .ss or .dd for 32 or 64-bit adds.  The CC is not set, and no traps
are signalled.

[p]fisub.w	src1, src2, dest (opcode 1001101)
dest = src1-src2

There are move macros that use these instructions with f0.

> Z-buffer
[p]fzchks	src1, src2, dest (opcode 1011111)
[p]fzchkl	src1, src2, dest (opcode 1011011)
These instructions do z-buffer operations.  The short form takes the sources
as 4 fields of 16 bits each, and does 4 simultaneous compares, with the
results written to the PM (Pixel Mask) field of the psr.  In fact, what
happens is that the PM is shifted right 4 bits and the most significant 4
bits are set with the results of (src2 <= src1), for each of the 4 fields.
The value produced by the operation is the result of 4 parallel minimum
operations, i.e. the updated z-buffer.

The long form, [p]fzchkl, does the same, except it uses 2 32-bit wide
fields, shifts PM by 2 bits, and updates the high 2 bits.  The shift
allows you to rapidly compute 8 bits worth of z-buffer values.
The size of the z-buffer is independent of the pixel size set in the PS
field of the psr.
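
A C model of the short form (the field-to-PM-bit ordering below is my
guess from the text; the manual pins down the exact correspondence):

```c
#include <stdint.h>

static uint32_t pm;                        /* models the PM field of the psr */

/* [p]fzchks: four 16-bit compares; minima go to dest, compare results
   are shifted into the top of the PM mask. */
uint64_t fzchks_model(uint64_t src1, uint64_t src2) {
    uint64_t dest = 0;
    pm >>= 4;                              /* PM shifts right 4 bits */
    for (int i = 0; i < 4; i++) {
        uint16_t a = (uint16_t)(src1 >> (16 * i));
        uint16_t b = (uint16_t)(src2 >> (16 * i));
        if (b <= a)
            pm |= 1u << (4 + i);           /* result of (src2 <= src1) */
        dest |= (uint64_t)(a < b ? a : b) << (16 * i);  /* parallel minimum */
    }
    return dest;
}
```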

> Phong shading
[p]faddp	src1, src2, rdest (opcode 1010000)
This instruction does pixel interpolation into the MERGE register.
I don't quite understand how this instruction is useful, but it
does something unusual.  Assume 8-bit pixels specified in the PS field
of the PSR.

faddp takes src1 and src2 as consisting of 4 16-bit words, adds each
field together, and writes the high bytes of each word (if you consider
the words to be fixed-point 8.8 bit numbers, it writes the integer
parts) to the MERGE register.  The MERGE register has been shifted
down 8 bits at the same time, so two of these instructions will fill
it with pixel values.

If the pixels are 16 bits wide, it will do the same, except the fields
are considered to be 6.10 bit fixed-point numbers, with the high 6 bits
loaded into the MERGE register, which has been shifted down 6 bits.
(After two shifts, two bits won't fit and get truncated from one of
the fields - thus the 6/6/4 RGB format you see flying around.  This is
the only place it appears.)

If the pixels are 32 bits wide, the fields are taken to be 32 bits wide,
with the high bytes of each of the two copied to the MERGE register, which
has been shifted down 8 bits.

There is also a similar [p]faddz instruction (opcode 1010001), which
does the same thing with 16.16 bit fields, shifting the MERGE register
16 bits at a time.  Intel seems to be really keen on this sort of
operation.  I wish I knew what it was good for.

You can do the same thing with 32.32 bit fields, by doing two long adds
on the corresponding parts of src1 and src2, then using a single-precision
move to copy the destination parts into a register pair.

[p]form		src1, dest (opcode 1011010)
dest = src1 | MERGE
MERGE = 0
This instruction lets you read the MERGE register after you've pounded on
it a while, setting any last bits you need to tweak and clearing it
for future action.
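
Here's a C model of the 8-bit-pixel faddp together with form (the exact
byte placement within MERGE is my reading of the text, so double-check
it against the manual):

```c
#include <stdint.h>

static uint64_t merge;                     /* models the MERGE register */

/* faddp, 8-bit pixels: add four 8.8 fixed-point words, shift MERGE down
   8 bits, and deposit each sum's integer part in the high byte of the
   corresponding 16-bit field.  Two faddps leave 8 packed pixels. */
void faddp8_model(uint64_t src1, uint64_t src2) {
    merge >>= 8;
    for (int i = 0; i < 4; i++) {
        uint16_t sum = (uint16_t)(src1 >> (16 * i))
                     + (uint16_t)(src2 >> (16 * i));
        merge &= ~(0xFF00ULL << (16 * i));           /* replace, don't OR */
        merge |= (uint64_t)(sum & 0xFF00u) << (16 * i);
    }
}

uint64_t form_model(uint64_t src1) {       /* dest = src1 | MERGE; MERGE = 0 */
    uint64_t dest = src1 | merge;
    merge = 0;
    return dest;
}
```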

> Move fp to int
fxfr		src1, dest (opcode 1000000)
This moves single-precision floating-point register src1 to integer
register dest.  The opposite of ixfr.  [These mnemonics aren't very mnemonic.]

>> Dual-Instruction Mode <<
One of the magic bits in each fp instruction is the D, dual-instruction
bit.  Intel suggests using either a d. prefix to the mnemonic or
assembler directives .dual and .enddual.

If the processor comes across an instruction (which must be aligned on a
64-bit boundary) with the D bit set, then it executes the next instruction
(integer ("core") op or fp op with D bit set) and starts reading instructions
64 bits at a time.  The low-order instruction must be an FP op, and the high-
order must be an integer ("core") op.  Exception: the fnop (shrd r0, r0, r0)
instruction is allowed in the fp slot.  Both these instructions are
executed simultaneously.

To get out of dual-instruction mode, have an fp op (FLOP) without the D
bit set.  This pair, and the next, will still be executed in dual-instruction
mode, but after that you're back to single.  A degenerate case is a single
FLOP in a stream with the D bit set, followed by one with it clear.
The next two instructions will be executed as a pair, and them back to single
mode.

Executing two instructions at once requires some extra rules:

- If a branch on CC is paired with a floating-point compare, the branch tests
  CC before the compare sets it.
- If an ixfr, fld, or pfld instruction is paired with a FLOP, the FLOP
  gets the register value before the other instruction updates it (or
  marks it as pending in the scoreboard, really).
- If an fst or pst operation stores a register which is written to by
  the instruction it's paired with, the new value is stored to memory.
- An fxfr instruction that conflicts with a source operand in the
  core operation paired with it will store after the core op has read
  the register.  "The destination of the core operation will not be
  updated if it is any if the integer register.  Likewise, if the core
  instruction uses autoincrement addressing, the index register will not
  be updated."  Typo?  I think this means the fxfr steals the write bus
  from the core processor, and the core processor's write goes to the
  bit bucket.
- If both instructions set the CC, the FLOP will win.

- If the FLOP is scalar and the core operation is fst or pst, it should
  not store the result of the FLOP.  When the core OP is pst, the FLOP
  must not be [p]fzchks or [p]fzchkl.  Conflict over the PM field, y'know.
- When the core op is ld.c or st.c (diddles control registers), it must
  be paired with fnop.
- You cannot use the return-from-interrupt functionality of bri in dual-
  instruction mode.
- A FLOP which sets CC cannot be paired with a compare-and-branch core
  instruction.  I.e. pfeq and pfgt conflict with b[n]c.t.  b[n]c.t
  also conflict with a pfeq or pfgt instruction in the next pair, too.
- "When the FLOP is fxfr, the core operation cannot be ld, ld.c, st, st.c,
  call, ixfr, or any instruction that updates an integer register
  (including autoincrement indexing)."

- You can't start to exit from dual-instruction mode on an instruction paired
  with a control-transfer instruction.  I.e. if the FLOP before had D set,
  so must the FLOP paired with the branch.
- You can't start to switch to or from dual-instruction mode on the instruction
  following a bri (in its delay slot).

Enough rules?  Well, you should have known it was gonna be a bit ugly.

>> Traps, Interrupts, Exceptions, etc. <<

As I mentioned, this is not well done.

When a trap occurs, bits are set in the psr (and maybe fpsr, if the FT bit
in the psr is set) to indicate contributing factors, and then the U and IM
bits are copied to the PU and PIM bits, then cleared (disabling interrupts
and switching to supervisor mode), the DIM and DS flags are set as needed,
and the fir is set up.

(In dual-instruction mode, the fir will point to the FLOP in the low-order
half of the pair.  If the problem was just a data-access fault, the FLOP
(unless it was fxfr) completes, and you should not reexecute it on
interrupt return.  Instruction and data-access faults are always the fault
of the core instruction.)

After this setup, the processor jumps to virtual address 0xFFFFFF00.
Then you have to figure out what's going on and fix it.  The state of
the processor consists of:

- The register files
- The four pipelines
- The KI, KR, and T registers
- The MERGE register
- The psr, epsr, and fsr.
- The fir, and
- The dirbase register (with its dependencies on the data cache)

A simple interrupt return consists of
- Restoring the register files, pipelines, KI, KR, T, and MERGE registers
  (not necessary for simple interrupt handlers), except for one register
  which holds the return address from the fir.
- Undoing the effect of an autoincrement instruction which must be
  reexecuted (parse the instruction at [fir] to figure this out)
- See if you need to back up the return address by one instruction
- Set up the psr, possibly setting the KNF bit, and definitely setting
  at least one trap bit.
- Execute an indirect branch (bri) to do the interrupt return, and in its
  delay slot,
- Restore the register that holds the resumption address.  The processor
  is still in supervisor mode here, so you don't need to pollute the
  user's address space.

> Backing up the return address
If the instruction before the one pointed to by the fir is a delayed branch,
you should back up and re-execute it.  If it is a bla, you need to undo its
add instruction.

There is an exception to this where you bombed out on a floating-point
compare instruction you need to emulate and the instruction before is
a conditional delayed branch.  Here, you need to leave the CC alone so
the branch will do the right thing, and set it so the fp compare
will seem to have done the right thing.  You need to compute where the
conditional branch would put you and resume there.

If you are backing up, and in dual-instruction mode, you should set
things up (DS set, DIM clear) so the core instruction will be executed
in single-instruction mode, then DIM will be re-entered.  If DS was
originally set, clear it.

Plus, you have to worry about the case that the instruction at fir-4
might not exist.  Intel suggests that you begin each code segment with
a nop instruction to avoid this problem.

> Setting KNF
KNF should be set if you have emulated a floating-point instruction that
trapped, or if you got only a data-access fault in dual-instruction mode
and the FLOP was not fxfr.  [Is that perfectly clear?]

> Saving the pipeline
Doing this is messy.  Basically, you need to read out all the results
(and the associated error codes for the adder and multiplier pipelines)
to store them, and then push operations with the equivalent answers back
on restore.  For the load pipeline, store the values read in memory
somewhere and reload it from there afterwards.  For the graphics pipeline,
you can just read it with a pfiadd, and restore it the same way (add 0
to the recalled value).  The MERGE register also needs to be stored.

For the floating-point pipelines, you need to get all the values out,
including error conditions, and the KI, KR, and T registers.  To put
them back, first stuff the KR, KI, and T registers, then place value+0
and value*1 computations into the various pipelines, along with the
proper error bits.  There's sample code to do this in the data book,
and it's not particularly pretty.

>> Calling Conventions <<

Intel has a suggested calling convention.  Although the border is still
fuzzy, the manual suggests r0-r15 and f0-f15 as callee-saves, and the
other half as caller-saves.  r1 is the return address, r2 is the stack
pointer, and r3 is the frame pointer.  Parameters are passed in r16 through
r27 and f16 through f27, and the others are used for scratch.  r31 is reserved
for address computations.

They suggest that even single-precision float arguments be passed in a
register pair, and anything that won't fit into registers be passed
on the stack C-style.  The stack pointer should always be 16-byte
aligned so the 128-bit loads can be used easily.

> Memory map
They also suggest a memory map.
It starts with 4K of unreadable memory (NULL-catcher), then user data,
and heap.  Then empty space until you hit the stack, then shared-memory
frames, and OS data, topping out at 0xF0000000.  Then comes a jump table to
standard library routines until 0xF0400000, then user code (text), blank
space, and then the OS up at the top of memory.

>> Sample code <<

The manual gives a bunch of sample code.  I won't reproduce it, but will
list what's there:

- Sign-extending a value in a register (shl, shra)
- Loading unsigned integers (ld, and)
- single-precision FP divide (approximate, two iterations Newton-Raphson
  unpipelined, 22 cycles, 2 ulp worst-case error)
- DP fp divide (three iterations Newton-Raphson, also 2 ulp, 38 cycles)
- Integer multiply (move to fp, use fmlow, move back; 9 clocks, five
  of which can be overlapped)
- Signed int to double (7 cycles; 3 can be overlapped)
- Signed integer divide (62 cycles, 59 without remainder)
- Null-terminated string copy (byte-at-a-time, simple)
- Example of pipelined adds
- Example of pipelined multiply-accumulate
- Example of dual-instruction mode
- Cache strategies for matrix dot product (e.g. keep both matrices in
  cache; keep one and use pipelined loads on the other)

>> Pipeline Interlocks <<

Everything's single-cycle, but here's what can interlock:

i-cache miss: given in terms of pin timing, plus two cycles if d-cache miss
in progress simultaneously.

d-cache miss (on load): again, pin timing, but it seems to be "clocks from
ADS asserted to READY# asserted"

fld miss: d-cache miss plus one clock

call,calli,ixfr,fxfr,ld,ld.c,st,st.c,pfld,fld or fst
with data cache miss in progress - stalls until miss satisfied, plus one cycle

ld, call, calli, fxfr and ld.c have 1 cycle of latency (next instruction
will stall if scoreboard hits)

fld, pfld and ixfr have 2 cycles of latency.

addu, adds, subu, subs, pfeq, pfgt, and pfle have 1 cycle of latency to
update the CC bit.  A branch on that bit will stall.

The multiplier's src1 must be in the register file; if it is the result of
the previous instr, you get a 1-cycle stall.

Scalar FLOPS fadd, fix, fmlow, fmul.ss, fmul.sd, ftrunc and fsub have 3
cycles of latency.  fmul.dd has four.  If the input and output precisions
differ (e.g. fmul.sd), add one cycle.  Plus one if the following FLOP is
pipelined and has dest <> f0.

TLB miss takes 5 cycles plus two reads, plus setting the A bit (if necessary).

if three pfld's are outstanding and you execute one more, you will
stall until the first completes, plus one cycle

a pfld data-cache hit costs two clocks

if the store pipe is full (one on bus plus two pending internally), another
access will delay until the current access completes, plus one cycle

a load (or fld) following a store cache hit - one clock

delayed branch not taken - costs one clock

nondelayed branch taken - one clock for bc, bnc; two for bte, btne.

bri - one clock

st.c - two clocks

there is no forwarding from the graphics unit to the adder, multiplier,
or itself, so there is one cycle of latency there

a flush has two cycles of latency

an fst takes one cycle to get the value out of the register, so if the
next instruction overwrites the register being stored, it will stall

>> The End <<

And that, boys and girls, is basically the complete contents of the
programmer's reference manual.  Enjoy!

(52K ug... let's see if we can bomb any mailers!)
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do." - The Doctor

jangr@microsoft.UUCP (Jan Gray) (03/06/89)

In article <807@microsoft.UUCP>, I write:
>  * 8, 16, 32 bit integer load/store insns, operands must be appropriately
>    aligned; byte or word values are sign extended on load.  [I hope you
>    don't use "unsigned char" too much...]
What an ultra-maroon.  I don't know what I was thinking, but you need merely
"and" the result of the load with 0xFF to load a byte as an unsigned.

Jan Gray  uunet!microsoft!jangr  Microsoft Corp., Redmond Wash.  206-882-8080

jeff@Alliant.COM (Jeff Collins) (03/07/89)

In article <807@microsoft.UUCP> jangr@microsoft.UUCP (Jan Gray) writes:
>
>				i860 Overview
>
> <deleted lots of interesting and useful data on the i860>
>Caches
>
>* 8K data cache, virtual addressed, write-back, two-way "set associative",
>  2x128 lines of 32 bytes
>* Flush instruction forces a dirty data cache line (32 bytes) back to memory.
>  Intel supplies suggested code to flush entire data cache.
>* Storing to dirbase register with ITI bit set invalidates TLB and instruction
>  caches; must flush data cache first!  [Remember, the data cache is virtually
>  addressed.]

	Coming from a multiprocessor background, I personally judge the
desirability of a chip by the ability to put it into an MP architecture.  One
of the most important features necessary for this is the ability to invalidate
any internal data caches from external hardware.  The discussions that I have
seen on the i860 have not made it clear whether this is possible or not.  Given
that the internal data cache is virtual, write-back, and two-way set
associative, I would guess this is not possible.  Does anyone know for
certain?

	If it is impossible to invalidate the cache from external logic, the
next question is how does the chip perform with the internal data cache
disabled?  Also, is there any way to disable the cache without using the PTE
bit?

dgh%dgh@Sun.COM (David Hough) (03/07/89)

> The frcp and frsqr insns return approximate reciprocal and 1/square
> root "with absolute significand error < 2^-7".  Intel supplies routines
> for Newton-Raphson approximations that take 22 clocks (*almost* single
> precision) or 38 clocks (*almost* double precision), and the Intel i860
> library provides true IEEE divide.  [RISC design principles at work:
> divides are infrequent enough not to slow down/drop some other feature
> to provide divide hardware.]

Another RISC design principle that will be discovered by whoever tries to
build an engineering work station out of this chip:  you base your design
on few or small benchmarks at your peril.

Not every engineering computation can be reduced to linpack-style memory
intensive adds and multiplies.  Division and sqrt are important in a lot
of realistic applications - spice for starters.  And spice convergence
sometimes is a function of how clean the arithmetic is (at least for the
inappropriately popular mosamp2 benchmark).   A system with a TI 8847
running at the same clock should beat an i860 on a number of realistic
applications.

None of which should be construed to imply that the i860 won't be very 
good on the applications for which it was designed, such as graphics
processors.

David Hough

dhough@sun.com   
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough

brooks@vette.llnl.gov (Eugene Brooks) (03/07/89)

In article <808@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
>So initially, you must stick a few operations into the pipeline, throwing
>away whatever was there (writing it to f0), then you can pump through
>lots of data, then you have to stick in a few junk computations to get the
>last few results.
It would appear that this exposed pipeline would straitjacket future
i860 implementations which might want to change the pipeline latency of
the floating point units, or perhaps get the double multiply to do one
result per clock cycle.  Is this true?

Is the news software incompatible with your mailer too?
brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (03/08/89)

One problem with any chip which requires aligned data is that
performance suffers when addressing bytes, to the point that a program
may become impractical. One of the people here checked his Sun-3
(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
This doesn't mean RISC is bad in a workstation, but it can have
performance problems in software we take for granted.

A question: has anyone benchmarked nroff/troff on the VAXstation 3100
(VAX) and DECstation 3100 (MIPS)? Would some of the MIPS readers like to
comment on performance in this area vs. 68020 or 80386?
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

tim@crackle.amd.com (Tim Olson) (03/08/89)

In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
| One problem with any chip which requires aligned data is that
| performance suffers when addressing bytes, to the point that a program
| may become impractical.

Bytes don't have alignment restrictions -- they are already byte-aligned ;-)

Most new processors require data to be aligned on "natural" boundaries,
i.e. bytes on byte boundaries, half-words on 16-bit boundaries, words on
32-bit boundaries, etc.  This is simply to avoid having to read more
than 1 word of memory on a load (with the associated trap headaches) and
build up the requested data.

|  One of the people here checked his Sun-3
| (68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
| This doesn't mean RISC is bad in a workstation, but it can have
| performance problems in software we take for granted.

What was the dataset and the specific machine types?  On "normal"
looking input, I find:

Sun 3/160:
snap3 time troff -t 2.t > /dev/null
5.6u 0.1s 0:05 99% 0+184k 0+8io 0pf+0w

Sun 4/110:
crackle2 time troff -t 2.t > /dev/null
2.2u 0.1s 0:03 69% 0+408k 1+8io 0pf+0w

Which shows the 4/110 2.5x faster than the 3/160.


	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

mash@mips.COM (John Mashey) (03/08/89)

In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical. One of the people here checked his Sun-3
>(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
>This doesn't mean RISC is bad in a workstation, but it can have
>performance problems in software we take for granted.
>
>A question: has anyone benchmarked nroff/troff on the VAXstation 3100
>(VAX) and DECstation 3100 (MIPS)? Would some of the MIPS readers like to
>comment on performance in this area vs. 68020 or 80386?

Something's wrong somewhere: I'd expect a Sun4/200 to be 2X or so faster than
a Sun3/200; I've never seen anything where a Sun-3 would be 5X faster.

The MIPS-based things run them at the rates you'd expect; there's an
nroff benchmark that's been in the Performance Brief forever; and it also
has Sun3/4 numbers.

I do NOT think there's a penalty for addressing bytes: both of these
machines have full support for storing and [signed/unsigned] loading bytes.
There is a penalty for accessing unaligned words, in both cases,
although R2000s have special instructions to mitigate the penalty.

Anyway, please get your friend to supply some data, because it sounds wrong.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

guy@auspex.UUCP (Guy Harris) (03/08/89)

>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical.

I don't think that's true.  My handy-dandy Cypress CY7C600 Family Users
Guide, for the Cypress SPARC implementation, says that LDSB (LoaD Signed
Byte), LDSH (LoaD Signed Halfword - 16 bits), LDUB (LoaD Unsigned Byte),
LDUH (obvious), and LD (LoaD word - 32 bits), all take 2 cycles.

My handy-dandy MIPS R2000 RISC Architecture manual, alas, has no timings
such as that - after all, it's an *architecture* manual, not a manual
for some particular *implementation* - but I'd be *very* surprised if
byte load/store operations were so much slower that "a program (such as
'troff') may become impractical".  (My expectation is that they're no
slower, just as on SPARC.)

Are you, perhaps, thinking of word-addressable machines, and under the
impression that not only do RISC machines tend to require, say, 4-byte
alignment of 4-byte quantities, but that they can't deal with quantities
shorter than 4 bytes?  That's simply not true of the RISC machines with
which I'm familiar.

BTW, there exist CISC machines that require alignment, as well; as I
remember, all but the most recent AT&T WE32K chips require it.

>One of the people here checked his Sun-3 (68020) against his Sun-4
>(SPARC). The Sun-3 ran troff about 5x faster.

The only three explanations I can imagine for that, offhand, are:

	1) he's got the two figures backwards; the Sun-4 was ~5x faster
	   than the Sun-3;

	2) the figures are real time, not CPU time, and something else
	   is interfering;

	3) "troff" is floating-point intensive, and the Sun-4 in
	   question has no FPU (e.g., a 4/110 with no FPU).

Explanation 3) falls by the wayside rather quickly; I grepped for
"float" and "double" throughout the code and didn't find it.

This leaves 1) or 2); is there one I missed?

I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a
Sun-4/260 with 32MB memory, both running 4.0.  Here are the times:

Sun-4/260:
	auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
	auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
	24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w

Sun-3/50:

	bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
	bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
	120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w

The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that
example!  Could 1) be the correct explanation?

peter@ficc.uu.net (Peter da Silva) (03/08/89)

In article <13322@steinmetz.ge.com>, davidsen@steinmetz.ge.com (William E. Davidsen Jr) writes:
> One problem with any chip which requires aligned data is that
> performance suffers when addressing bytes, to the point that a program
> may become impractical.

I guess I'm a bit dense today, but why? Takes the same amount of work to
fetch <x> bits over an <n> x <x> bit-wide bus either way. Are you confusing
alignment requirements with word addressing?

After all, I don't recall the PDP-11 or 68000 having problems dealing with
bytes. At best you lose a little data compression...
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

henry@utzoo.uucp (Henry Spencer) (03/09/89)

In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical. One of the people here checked his Sun-3
>(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.

This is curious, since troff does little byte addressing.  Doubly so
since the SPARC does have byte addressing and byte-access instructions.

More generally, are you not confusing alignment with accessing?  It is
quite possible to require aligned data (e.g. 32-bit quantities on 32-bit
boundaries) while still having efficient byte addressing and accessing.
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bcase@cup.portal.com (Brian bcase Case) (03/09/89)

>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical. One of the people here checked his Sun-3
>(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
>This doesn't mean RISC is bad in a workstation, but it can have
>performance problems in software we take for granted.

Now wait a minute; can anyone substantiate this claim?  It seems that
something else must have been wrong:  I might believe that the 020
could be faster for *unaligned* data, but not 5 times faster.  The
SPARC has load/store byte, halfword, and word instructions that work
just fine as long as data is properly aligned.  And I can't imagine
the compiler unaligning things on purpose.  Could you clarify this?

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (03/09/89)

Several comments on my earlier posting:

(a) I was talking about the general case of machines which don't have
byte addressing and fetch a byte using a three step "load, shift, and"
and store a byte doing something like "load, and, shift, or, store". I
didn't mean to imply that any particular RISC architecture uses that
method. The Honeywell 6000 series was an example of not having direct
byte addressing (yes, you can use a tally word, but building it is as
slow as the ANDs and ORs). If the load and store time for an arbitrary
byte are the same as for an aligned int, and if load/store byte is a
single operation, then I would consider it byte addressable for the
purposes of any program I want to run.

(b) several people posted results using big Sun4's and little Sun3's. I
believe that the tests were done on a Sun3/280 with FPU and a small Sun4
(I don't know the model number). Since troff is FP intensive (at least
on a VAX) that may be the difference.

(c) the point I was trying to make was that a RISC processor which is "N
times faster" than some CISC processor is not going to have the same
improvement in all cases.

I'll post the actual results if I can get them from a rerun.

Boy are RISC people defensive! 
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

suitti@haddock.ima.isc.com (Stephen Uitti) (03/10/89)

In article <1133@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>>One problem with any chip which requires aligned data is that
>>performance suffers when addressing bytes, to the point that a program
>>may become impractical.
>
> [talk about instruction times being the same for byte/word/long
> accesses on SPARC, MIPS].

Byte accesses on the PDP-10 were slower - one had to set up a byte
pointer and do special load-byte or load-byte-and-increment-the-pointer
instructions.  Still, bytes were any size from 1 bit to 36...

Also remember that even if an 8 bit byte access takes (about) the
same time as a 32 bit word access, it still moves less data.
I've had some code do its work using larger quantities for just
this reason.  Usually, the code is #ifdef'ed, so that the easier
version can at least be read if not used.  One can often do
"vector bit" operations a word at a time.  The whole "duff's
device" bcopy & memcpy discussions of a few months ago are at
least partly based on this idea.

>BTW, there exist CISC machines that require alignment, as well; as I
>remember, all but the most recent AT&T WE32K chips require it.

The VAX doesn't require it - but don't do it.  A 32 bit word
reference to an odd address is real slow.  That's why the C
compiler there does so much word alignment.  Even so, one would
see a program that worked on a VAX that would die on a machine
which would just plain forbid the operation.  Data became
unaligned, typically by writing them to disk and then reading
them back in.  The VAX would be slow for the operation (nobody
cared), but other machines would yield bus errors.

It seems to me that if an architecture traps unaligned data
references, the kernel can look at the instruction that faulted
and make it appear to work via software.  uVAX IIs implement all
sorts of VAX instructions that just aren't in the hardware.  Both
VMS & flavors of UNIX do this (sometimes even correctly).
(Remember, DEC said these things would work, even though there
are billions of them & the uVAX II CPU fits on a QBus board...
and with a MB of RAM.)  Almost no one uses these instructions, so
who cares?  If the compilers try to make things aligned, and if
the Operating System fixes things when botched, and if the
Operating System provides a way for the user (programmer) to
detect that it happened, and how much, then everyone should be
happy.  I'd be willing to have unaligned data fetches work 100x
slower if the overall architecture could be otherwise, say, twice
as fast (because there was enough chip space for an I cache or
FPU or something).

>>One of the people here checked his Sun-3 (68020) against his Sun-4
>>(SPARC). The Sun-3 ran troff about 5x faster.

> [attempted explanations]
>This leaves 1) or 2); is there one I missed?

I had one VAX 780 outperform another due to the system binaries
for the program being different.  Recompilation & cross running
showed that the hardware was the same.  Of course, the Sun 3
and Sun 4 are not binary compatible, and the original user
probably doesn't have sources...

I had one VAX 780 outperform another by 20% due to a ringing
9600 BAUD tty line.  It had been that way for months - no one
noticed...

I ran various "benchmarks" between uVAX IIs and Sun 4s.  The
range was about 2x to over 8x, averaging about 4x.  I never got
the 10 (VAX) MIPS figures that were commonly quoted.  VAX 780s
really are a little faster than uVAX IIs.

(aside:) In the olden days when 68000s were brand new, the EE
dept at Purdue was considering getting a bunch of 68000s, with
troff in ROM & some communication gear, and have troff run on the
dedicated boxes.  The 68000 could run troff at something like 90%
the speed of the 780, which was likely to be much more CPU than a
user could get out of the 780s there.  I remember wondering if
the I/O would kill the 780s making the whole exercise moot...
Remote execution (load sharing) on the local ethernet was
implemented and it did work pretty well, technically (politically
was another matter).  I had thought that having a pre-built
(buildcore) "troff -ms", etc., would save them more.  I recall it
taking troff something like 20 seconds to do the initialization
for the first .PP for the "-ms" macros.  Pretty gross if you ask
me (don't ask).

>I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a
>Sun-4/260 with 32MB memory, both running 4.0.  Here are the times:
>
>Sun-4/260:
>	auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>	24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
>	auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>	24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w
>
>Sun-3/50:
>
>	bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>	118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
>	bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>	120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w
>
>The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that
>example!  Could 1) be the correct explanation?

The VAX 780 here running 4.3 BSD had this to say:

	haddock% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	troff: unrecognized -t option
	0.1u 0.0s...

This is much faster than the Suns.  It just optimized the
operation a bit, being an "experienced VAX" (as opposed to a
"used VAX").  The Compaq 386/25 sitting here was even faster,
saying something like "troff command not found".  I'm unfamiliar
with the "-t" option.

	haddock% time troff -man /usr/man/man1/csh.1 >/dev/null
	90.8u 6.4s 36% 95+201k 59+15io 24pf+0w

I thought Sun 3's were lots faster than 780s.  Maybe more
expensive Sun 3s are faster...  Of course, my /usr/man/man1/csh.1
could be different, though it is probably at least real similar.
Also, I think 'troff' is one of those applications that has odd
behaviour compared to just about anything else one would run.

It should be pointed out (if it hasn't been already) that troff
doesn't do nearly the byte accesses that one would think it
should do.  Still, troff is a great benchmark for sites that do
a lot of troff.

Stephen Uitti, suitti@ima.ima.isc.com (near harvard.harvard.edu)

rob@tolerant.UUCP (Rob Kleinschmidt) (03/10/89)

In article <12000@haddock.ima.isc.com>, suitti@haddock.ima.isc.com (Stephen Uitti) writes:
> In article <1133@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
> 
> > [attempted explanations]
> >This leaves 1) or 2); is there one I missed?
> 
For the sake of argument...	Under some very weird circumstances, one
might be able to demonstrate better cache utilization on a byte aligned vs.
"naturally" aligned machine. Assuming multi-byte cache lines, one could
argue some small improvement because of the lack of padding bytes within
structures etc. I don't believe this for a minute, and assume that any
small gain made would be offset by the extra cpu access cycles, but it
seemed like a thought worth mentioning. 

mash@mips.COM (John Mashey) (03/10/89)

In article <13328@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:

>(b) several people posted results using big Sun4's and little Sun3's. I
>believe that the tests were done on a Sun3/280 with FPU and a small Sun4
>(I don't know the model number). Since troff is FP intensive (at least
>on a VAX) that may be the difference.

troff never used to use FP;  has it changed recently?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

henry@utzoo.uucp (Henry Spencer) (03/11/89)

In article <13328@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>(a) I was talking about the general case of machines which don't have
>byte addressing and fetch a byte using a three step "load, shift, and"
>and store a byte doing something like "load, and, shift, or, store"...

You're still confusing two issues, or at least using confusing terminology.
There are, in fact, three separate issues here:

Alignment: Are N-byte objects required to be aligned on N-byte boundaries?

Byte addressing: Do the pointers point to bytes, as opposed to words?

Byte accessing: Can the processor read/write bytes, as opposed to words?

It is quite possible to have byte addressing without byte accessing, as on
the current 29000:  the pointers do point to bytes, but without external
hardware help, the current processor only does word memory accesses.

Nobody in his right mind designs a processor without byte addressing today.
Byte accessing is a tradeoff that depends on hardware constraints and how
the processor is to be used (Cray, for example, considers it unimportant).
Alignment is coming back in fashion for several reasons, notably simplicity
of hardware and easier exception handling (because aligned objects cannot
span page boundaries).

>(b) several people posted results using big Sun4's and little Sun3's. I
>believe that the tests were done on a Sun3/280 with FPU and a small Sun4...

Did the Sun 4 have enough memory not to page itself to death?  Especially
if it was running SunOS 4, that's not a trivial issue.

Were the troffs compiled the same way?  Modern troffs have an option for
keeping the temporary file in memory, which can have a major effect on
performance.

Did you think to normalize for memory-system performance?  The 280 has
a big fast cache.

>... Since troff is FP intensive (at least on a VAX)...

Are you sure???  That doesn't sound like the troff I know.

>(c) the point I was trying to make was that a RISC processor which is "N
>times faster" than some CISC processor is not going to have the same
>improvement in all cases.

Nobody would argue with this; why are you making a big deal out of it?

>Boy are RISC people defensive! 

Even non-RISC people think that apparent absurdities are worth challenging.
-- 
Welcome to Mars!  Your         |     Henry Spencer at U of Toronto Zoology
passport and visa, comrade?    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

guy@auspex.UUCP (Guy Harris) (03/11/89)

>(a) I was talking about the general case of machines which don't have
>byte addressing and fetch a byte using a three step "load, shift, and"
>and store a byte doing something like "load, and, shift, or, store". I
>Didn't mean to imply that any particular RISC architecture used that
>method.

Most of them *don't*.  Therefore, the speculation about them in your
posting was incorrect.

>The Honeywell 6000 series was an example of not having direct
>byte addressing

...at least if it didn't have the extended instruction box.

>If the load and store time for an arbitrary byte are the same as an
>aligned int, and if load/store byte is a single operation, then I
>would consider it byte addressable for the purposes of any program I
>want to run.

Well, then, the SPARC and almost certainly the MIPS are byte-addressable.

>(b) several people posted results using big Sun4's and little Sun3's. I
>believe that the tests were done on a Sun3/280 with FPU and a small Sun4
>(I don't know the model number). Since troff is FP intensive (at least
>on a VAX) that may be the difference.

How can it be FP-intensive if it has no floating point numbers in it?  I
found no occurrences of "float" or "double" in the 4.3BSD "nroff"/"troff"
source, and the SunOS 4.0 "nroff"/"troff" is basically derived from the
4.3BSD version.

>(c) the point I was trying to make was that a RISC processor which is "N
>times faster" than some CISC processor is not going to have the same
>improvement in all cases.

I think everybody realizes that there isn't always a single number "N"
that can represent the speed difference between one processor and
another - regardless of whether one is a RISC and one a CISC, both are
RISCs, or both are CISCs.  Read a MIPS Performance Brief sometime.

>I'll post the actual results if I can get them from a rerun.

Post both user-mode and system-mode CPU time.  If, by some chance, the
SunOS 4.0 "troff" has had floating-point computations inserted into it
for some reason, and the Sun-4 test was done on a 4/110 with no FPU,
then the system time should be higher since the floating-point emulation
is done in the kernel (to avoid the overhead of going back to user mode).

>Boy are RISC people defensive! 

What do you expect?  You made a completely inappropriate speculation
about RISCs (namely that they're generally word-addressable, and do
byte operations inefficiently), and made a hard-to-believe claim
about the relative performance of a Sun-3 and a Sun-4 on "troff".  I'd
certainly expect them to reply, and point out that the first is simply
untrue and the second doesn't match their experiences.

bb@wjh12.harvard.edu (Brent Byer) (03/11/89)

In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
> .... One of the people here checked his Sun-3
>(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
>
One of the signs of competency in a good engineer is that, while (s)he
might not always know the right answer, (s)he should always be able
to detect an obviously wrong one.  Bill Davidsen has presented data
which is bogus by at least a factor of 10; a question, Bill:
	Don't you think you should have verified this data *before*
	trying to use it to support your claims?

>A question: has anyone benchmarked nroff/troff on the VAXstation 3100
>(VAX) and DECstation 3100 (MIPS)? Would some of the MIPS readers like to
>comment on performance in this area vs. 68020 or 80386?
>
n/troff could be very good benchmark candidates, but there are too
many versions, all based on proprietary source code.  I frequently
use our version of troff as a benchmark of various systems, but I can
be certain that I use the *same* sources & test data on each target.
The only surprising results I have seen are that the VAX architecture
seems to perform about 15% poorer with troff, and the i286 does about 10%
better (small model), than their respective results with other tests
would suggest.
[ E.G.:  an 8MHz, 0ws PC/AT clone ran troff 25% faster than a 780 ]

	Brent Byer  (att!ihesa!textware!brent   bb@wjh12.harvard.edu)

bb@wjh12.harvard.edu (Brent Byer) (03/12/89)

In article <12000@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes:
>In article <1133@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>
>>> [Provoked by this bogus claim by Bill Davidsen:]
>>>One of the people here checked his Sun-3 (68020) against his Sun-4
>>>(SPARC). The Sun-3 ran troff about 5x faster.
>
>> [ from Guy ]
>>I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a
>>Sun-4/260 with 32MB memory, both running 4.0.  Here are the times:
>>
>>Sun-4/260:
>>	auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>>	24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
>>	auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>>	24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w
>>
>>Sun-3/50:
>>	bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>>	118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
>>	bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>>	120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w
>>
>>The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that example!
>
> [ Stephen Uitti : ]
>The VAX 780 here running 4.3 BSD had this to say:
>
>	haddock% time troff -man /usr/man/man1/csh.1 >/dev/null
>	90.8u 6.4s 36% 95+201k 59+15io 24pf+0w
>
>I thought Sun 3's were lots faster than 780s.  Maybe more
>expensive Sun 3s are faster... 

All Sun-3's will run troff faster than any 780, providing that
one starts with the *same* troff sources and the *same* test data.

In the comparison (sic) alluded to above, Guy is using the
"old" troff (otroff, from the C/A/T era), but Stephen used a
DWB-based troff.  The "-t" option is the giveaway.

For comparison, if we rate otroff as 1.0, a DWBv1 troff will get
about 1.15, and DWBv2 troff gets 1.7 .  [ We have a souped-up
DWBv2-based troff that gets 2.2 ; anybody wanna race? ]

>  ....   Of course, my /usr/man/man1/csh.1 could be differant, ... 

And, Stephen is also using different, and less demanding, input data.
The csh.1 with SunOS4.0 is typographically more complex than either
of those in SunOS3.x or 4.3BSD.  All other things equal, it will
require about 80-90% more CPU time.

>Also, I think 'troff' is one of those applications that has odd
>behaviour compared to just about anything else one would run.

This is a *false* supposition.

>It should be pointed out (if it hasn't been already) that troff
>doesn't do nearly the byte accesses that one would think it should do.

True.  Other than fetches from its input buffer and stores into its
output buffer, there are few.  Remember, a troff "character" has a
much richer personality than just its code value.  In otroff,
it was a 16-bit datum, in DWB a 32-bit.

>Still, troff is a great benchmark for sites that do a lot of troff.

Here, it is better to generalize:
	XXX is a great benchmark for sites that do a lot of XXX.
(But, you all know that.)

------
	Brent Byer  (bb@wjh12.harvard.edu  or  att!ihesa!textware!brent)

12-year old nephew:	Uncle Bill, that steamboat race was the biggest
			  gamble in the world.
W. C. Fields:		Son, that was nothing.  I remember when Lady Godiva
			  put everything she had on a horse.

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (03/14/89)

In article <340@wjh12.harvard.edu> bb@wjh12.UUCP (Brent Byer) writes:
| In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
| > .... One of the people here checked his Sun-30
| >(68020) against his Sun-4 (SPARC). The three ran troff about 5x faster.
| >
| One of the signs of competency in a good engineer is that, while (s)he
| might not always know the right answer, (s)he should always be able
| to detect an obviously wrong one.  Bill Davidsen has presented data
| which is bogus by at least a factor of 10; a question, Bill:
| 	Don't you think you should have verified this data *before*
| 	trying to use it to support your claims?

  As far as I know, the only claim I made was that the results were
reported to me as I stated, which led to an interest in the ability to
do byte accesses.  Someone pointed out that byte addressing and byte
access aren't quite the same thing; thanks.

  I mentioned the case in which a byte is extracted by doing a word
load and isolating the byte within the word.  Ten people told me no one
does that any more; four people said "yes, the 29000 does just that."
At this point I really don't care.  I mentioned the Honeywell 6000, and
someone immediately pointed out the EIS instruction set.  By adding a
coprocessor I can make any CPU have any instructions, but I'm not sure
that justifies claiming something as a feature.

  As for floating point in troff, I timed it on a VAX with and without
an FPU, and it certainly runs faster with one.  One person claimed that
the FPU makes integer divide faster, and two others said it improves
context switching.  I checked four troff sources; the BSD and V7 troffs
don't have f.p., though one vendor-supplied version does.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

elind@ircam.UUCP (Eric Lindemann) (03/14/89)

Can somebody clarify the following?

In the "i860 overview (very very long)" w-colin writes:

> Everything's single-cycle, but here's what can interlock:
>
> i-cache miss: given in terms of pin timing, plus two cycles if d-cache miss
> in progress simultaneously.
>
> d-cache miss (on load): again, pin timing, but it seems to be "clocks from
> ADS asserted to READY# asserted"
>
> fld miss: d-cache miss plus one clock

I don't think this is exactly what the Intel literature says. I read
it more like this:

The following "freeze conditions" exist which will cause a delay:

* Reference to destination of load instruction that misses

* fld (load to float register) miss 

In other words, you can fire off a "load" instruction (which here must mean
a load to an integer register) and continue executing without delay, as long
as you don't reference the destination register of that "load" instruction.
The fact that there may or may not be a cache miss should only delay the
availability of the data in the register, without necessarily interrupting
execution.

A cache miss on an FLD instruction, however, will apparently always cause a
"freeze" (a delay in execution), whether or not the FLD destination register
is referenced by a subsequent instruction.

Can this be?  Is there some basic difference in interlock behavior between
the integer and floating-point register files?  If so, it could make a big
difference in throughput.

friedman@rd1632.Dayton.NCR.COM (Lee G. Friedman) (03/24/89)

                    CALL FOR PAPERS AND REFEREES
      HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES - 23
                                  
  Processors and Systems Architecture: The Converging Design Space
                                  
              KAILUA-KONA, HAWAII - JANUARY 2-5, 1990
                                  
 The Architecture  track of  HICSS-23 will  contain a  special set of
 papers focusing  on a  broad selection  of topics  in  the  area  of
 Processor  and   Systems  Architectures.    Given  the  current  and
 predicted future state of technology available for computers,  there
 is an  explosion of  new architectural  features being  employed  in
 processors to serve the needs of the systems architecture community.
 Furthermore, systems architects, taking advantage of these features,
 look to  add other features currently not available in off-the-shelf
 silicon, and the cycle continues.
 
 There are several questions which arise:
 
    1. Does  the behavior of the built-in features fulfill the
     requirements as expected from the systems architect's point
     of view?
    
    2. How  does a  processor architect  design processors for
     systems?
       a.How does the processor architect decide on the
         features to be included?
       b.What system level assumptions, both hardware and
         software, does the processor architect make in
         developing a processor with a set of features?
       c.What
    
    3. How does the systems architect employ the processor?
       a.How does the system architect evaluate a processor
         for use in a particular system?
       b.What tradeoffs are made in using certain processor
         technology?
       c.What impact does a processor with features have on
         the system?
    
 The goal of this day-long program (called a minitrack) at the HICSS-
 23 conference  is to  show, through  the papers, the convergence and
 divergence of  processor and  systems architecture.   That is, where
 things go  right and  where there  is still  a disparity between the
 chips, boards,  boxes, and software (what we call the semantic gap).
 The format for the minitrack is as follows:
 
     The intention  is to  get  three  good  papers  on  three
     important processor  architectures.   These papers should
     be a detailed description of the architecture of the chip
      (or chips), as  well as  addressing the questions above.
     Each of  the three  processor architecture papers will be
     followed by  two  systems  architecture  papers.    These
     papers should  describe the  system and how the processor
     in question was used to provide the end solution.  Again,
     this should  also address  the questions above.  Thus, in
     total, we  will have nine papers.  This is followed by an
     open discussion  of the  convergence  and  divergence  of
     processor and systems architecture.
 
 
 Papers are  invited describing  practical  applications,  research
 machines, or  theoretical work.   Papers can  deal with  systems and VLSI
 technologies.  Those papers selected for presentation will appear in
 the Conference  Proceedings which  are  published  by  the  Computer
 Society of  the IEEE.   HICSS-23  is sponsored  by the University of
 Hawaii in  cooperation with  the ACM,  the Computer Society, and the
 Pacific Research  Institute for  Information Sciences and Management
 (PRIISM).
 
 
 INSTRUCTIONS FOR  SUBMITTING PAPERS:   Manuscripts  should be  22-26
 typewritten, double-spaced pages in length.  Do not send submissions
 which  are  significantly  shorter.    Papers  must  not  have  been
 previously presented  or  published,  nor  currently  submitted  for
 journal publication.  Each manuscript will be put through a rigorous
 refereeing process.  Manuscripts  should have  a title  page  that
 includes the  title of  the  paper,  full  name  of  the  author(s),
 affiliation(s),  complete   physical  and   electronic  address(es),
 telephone number(s) and a 300-word abstract of the paper.
 
 DEADLINES
 
 *   A 300-word abstract is due by April 15, 1989
 *   Feedback to author concerning abstract by May 5, 1989
 *   Six copies of the manuscript are due by June 1, 1989
 *   Notification of accepted papers by August 15, 1989
 *   Accepted manuscripts, camera-ready, due by September 23, 1989
 
 SEND SUBMISSIONS AND QUESTIONS TO
 Lee G. Friedman
 NCR Corporation
 1601 S. Main Street  MS PCD-5
 Dayton, OH 45479
 (513) 445-3594
 e-mail: lee.friedman@dayton.ncr.com