jangr@microsoft.UUCP (Jan Gray) (03/06/89)
i860 Overview (what I consider interesting features of the part), taken from
the "i860(tm) 64-bit Microprocessor Programmer's Reference Manual", Order
Number 240329-001, (C) Intel Corp. 1989.

Overview

* 64 bit external data/instruction bus
* 128 bit on-chip data bus
* 64 bit on-chip instruction bus
* 8K data cache, virtual addressed, write-back, two-way "set associative",
  2x128 lines of 32 bytes
* 4K instruction cache, virtual addressed
* 64 entry TLB
* core integer RISC unit
* floating-point unit with pipelined multiply and add units (can also be
  used "unpipelined")
* some multiply-accumulate type floating point instructions
* dual instruction mode can simultaneously dispatch a 32-bit core
  instruction and a 32-bit floating-point instruction

Data Types

* BE bit in epsr (extended processor status register) selects big/little
  endian format in memory; instructions are always little-endian
* 32 bit signed/unsigned integers
* IEEE 754 format single (32-bit) and double (64-bit) precision floating
  point numbers
* pixels:
  * stored as 8, 16, or 32 bits (always operates on 64 bits of pixels at
    a time)
  * colour intensity shading instructions divide pixels into fields:

	pixel size  colour 1 bits  colour 2 bits  colour 3 bits  other bits
	     8      .............. N ...........................    8 - N
	    16            6              6              4             0
	    32            8              8              8             8

    [These particular field assignments are a result of the pixel add
    instructions described below.]

Memory Management

* NO SEGMENTS!
* 32 bit virtual addresses (translation can be disabled)
* translated identically to 386 virtual addresses: two level address
  translation:
  * dirbase register specifies the page directory
  * 1st level: addr[31..22] specifies a page directory entry, yielding
    permissions and the address of the second level page table
  * 2nd level: addr[21..12] specifies a page table entry, yielding
    additional permissions and the address of the physical page
  * addr[11..0] specifies the byte offset within the physical page
    (4K pages)
* page table bits:
  * P - page is present
  * CD - cache disable: page is not cacheable
  * WT - page is write-through; disables internal caching. Either CD or
    WT can be passed through to the external PTB pin, depending upon the
    PBM bit in epsr.
  * U - user: if 0, page is inaccessible in user mode
  * W - writable: if 0, page is not writable in user mode, and may be
    writable in supervisor mode depending upon the WP bit in epsr
  * A - accessed: automatically set the first time the page is accessed
  * D - dirty: traps when D=0 and the page is written
  * two bits reserved, three bits user-definable
* page directory PTE bits and second level PTE bits are combined in the
  most restrictive fashion
* 64 entry TLB

Caches

* Flush instruction forces a dirty data cache line (32 bytes) back to
  memory. Intel supplies suggested code to flush the entire data cache.
* Storing to the dirbase register with the ITI bit set invalidates the
  TLB and instruction caches; must flush the data cache first!
  [Remember, the data cache is virtually addressed.]

Core Unit

* Standard 32 bit RISC architecture:
  * 32 32-bit integer registers
  * fault instruction, psr, epsr, dirbase, data breakpoint registers
  * r0 always reads as 0
* 8, 16, 32 bit integer load/store insns, operands must be appropriately
  aligned; byte or halfword values are sign extended on load. [I hope
  you don't use "unsigned char" too much...]
* 2 source, 1 destination add/subtract/logical (and, andnot, or, xor)
* No integer multiply/divide instructions.
  To multiply, you move the operands to floating point registers and use
  multiply (four insns plus five free delay slots). To divide, you move
  the dividend to a floating point register and multiply by the
  reciprocal. This can be very slow (59 clocks) if the divisor is a
  variable (hopefully infrequent).
* 32 bit shift left/right/right-arithmetic, plus 64 bit funnel shift
  ("shift right double"). They ran out of bits to specify two 32 bit
  sources plus destination plus shift count, so the shift count of the
  last 32 bit shift right (automatically stored in the 5 bit SC field of
  the psr) is used.
* Similar to the MIPS Rx000 architecture in some ways:
  * load/store addressing mode is src1(src2); src1 is a register or a
    16 bit immediate constant
  * form 32 bit constants using andh/andnoth/orh/xorh on the upper 16
    bits of a register
* Only one condition code bit (CC), set in various ways by
  signed/unsigned add/subtract/logical operations, unaffected by shift
  ops
* Delayed and non-delayed branches on CC set/not set (bc[.t], bnc[.t])
* Non-delayed branch on src1 ==/!= src2 (bte, btne)
* Strange delayed branch "bla" instruction, for one-instruction looping;
  useful for aoblss/dsz/isg type looping. Uses its own special LCC
  condition code bit. "Programs should avoid calling subroutines while
  within a bla loop, because a subroutine may use bla also and change
  LCC". [Ug.]
* Trap, trap-on-integer-overflow instructions
* Call/call indirect, stores return address in r1
* Unconditional branch, branch indirect; the latter is also used for
  return and return from trap
* Core unit loads and stores floating point operands of 32, 64, and 128
  bits
* Pipelined floating load instruction (32/64 bits) queues an address of
  an operand not expected to be in cache, and stores the result of the
  third previous pipelined floating load into the destination floating
  register. [This is the data-loading component of the i860 "vector"
  support.]
* Bus lock/unlock instructions for flexible indivisible
  read-modify-write sequences.
  Interrupts are disabled while the bus is locked. "If ... the processor
  does not encounter a load or store following an unlock instruction by
  the time it has executed 32 instructions, it triggers an instruction
  fault...". For example, a locked test-and-set is:

	// r22 <- semaphore, semaphore <- r23
	lock			// next cache miss load/store locks bus
	ld.b	semaphore, r22
	unlock			// next load/store unlocks bus
	st.b	r23, semaphore

* Pixel store instructions for selectively updating particular masked
  pixels in a 64-bit memory location, used for Z-buffer hidden surface
  elimination. The pixel mask is set by the fzchk instructions (in the
  floating point/graphics unit).

Floating Point Unit

* 32 32-bit single precision floating point registers, which can also
  be treated as 16 64-bit double precision registers
* graphics operands are also stored in the fp registers
* f0/f1 read as 0
* pipelined multiply and add units
* floating point instructions can be non-pipelined or pipelined
* Similar to the pipelined load above, in a pipelined multiply or add
  instruction the source operands go into the pipeline, and the result
  of the 3rd (or so) previous pipelined multiply or add is stored in the
  destination register(s).
* Pipeline lengths:
  * adder: 3 stages
  * multiplier: 2 or 3 stages (2 double precision, 3 single(!))
  * graphics: 1
  * load: 3 (loads issued from the core unit above)
* IEEE status bits percolate through the fp pipelines, and can be
  reloaded, along with the pipeline contents, after traps
* Divide? Ha! If Seymour can do it with reciprocals, so can the i860.
  The frcp and frsqr insns return an approximate reciprocal and
  reciprocal square root "with absolute significand error < 2^-7".
  Intel supplies routines for Newton-Raphson approximations that take
  22 clocks (*almost* single precision) or 38 clocks (*almost* double
  precision), and the Intel i860 library provides true IEEE divide.
  [RISC design principles at work: divides are infrequent enough not to
  slow down/drop some other feature to provide divide hardware.]
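To make the reciprocal trick concrete, here is a rough Python sketch (my
model, not Intel's routine): start from a low-precision seed like the one
frcp provides, then refine it with Newton-Raphson steps x' = x*(2 - d*x),
each of which roughly squares the error. The approx_recip function below
is a stand-in that fakes an 8-bit-accurate seed.

```python
import math

def approx_recip(d):
    """Stand-in for frcp (a fake, not the hardware): a reciprocal seed
    accurate to roughly 2^-8, made by rounding the true reciprocal's
    significand to 8 bits."""
    m, e = math.frexp(1.0 / d)            # 1/d = m * 2^e, 0.5 <= m < 1
    return math.ldexp(round(m * 256) / 256.0, e)

def nr_divide(n, d, iters=3):
    """n/d via Newton-Raphson refinement of the reciprocal seed; each
    x' = x*(2 - d*x) step roughly doubles the number of correct bits."""
    x = approx_recip(d)
    for _ in range(iters):
        x = x * (2.0 - d * x)
    return n * x
```

With a seed good to about 2^-8, two iterations pass single precision and
three pass double, which is consistent with the shape of the 22- and
38-clock routines quoted above.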
* Dual operation instructions (not "dual mode"): some pipelined
  instructions cause both a pipelined add and a multiply operation to
  take place. Since the instruction can only encode two source operands,
  the others are taken from temporary holding registers and busses
  connecting the two units in various topologies, depending upon the
  data path control field of the instruction opcode. [Many real world
  computations, e.g. dot product, can make use of these instructions.]

Dual Instruction Mode

* DIM allows the i860 to run both a core and a floating/graphics unit
  insn on each cycle. The resulting 64 bit "wide instruction" must be
  64 bit aligned.
* There is a two cycle latency: two cycles after a floating instruction
  with the D bit set, both a core and a floating insn will be issued.
  Similarly, if the D bit is clear, there will be no DIM two cycles (two
  instruction pairs) later.
* There are various sensible rules for determining the result of insn
  pairs which set/use common registers, control registers, etc.

Graphics Unit

* Pipelined and non-pipelined 64 bit integer add and subtract.
* 16/32 bit non-pipelined/pipelined Z buffer check instructions:

	fzchks src1, src2, rdest	(16 bit Z-Buffer Check)

	Consider src1, src2, and rdest as arrays of four 16 bit fields
	src1(0..3), src2(0..3), rdest(0..3), where zero denotes the
	least-significant field.

	PM <- PM >> 4
	FOR i = 0 to 3 DO
	    PM[i+4] <- src2(i) <= src1(i) (unsigned)
	    rdest(i) <- smaller of src2(i) and src1(i)
	OD
	MERGE <- 0

  This particular instruction merges four (arbitrary sized) pixels whose
  16 bit Z-buffer values are in one of the (64 bit) sources with the
  current Z-buffer values in the other source, setting pixel mask bits
  (controlling the pixel store insn described above), and updating the
  Z-buffer depth values. [Neat! Just what my (personal) graphics package
  ordered!]
* Pixel add instructions, which add fixed point values, the results
  accumulating in a special MERGE register.
  You can use these to interpolate between (for instance) two colours as
  you scan convert a polygon.
* Z-buffer add instructions, for the analogous case of distance
  interpolation.

Traps

Briefly, there are instruction, floating point, instruction access, data
access, interrupt, and reset traps. On a trap, the i860 enters
supervisor mode, saves/modifies various psr bits, saves the faulting
instruction address, and jumps to the trap handler, which must be at
0xFFFFFF00. There are various complications for dual instruction mode,
bus lock mode, and for saving/restoring the various pipeline states.

Interlocks

The i860 is fully interlocked, so there is no need to insert nops. You
can, of course, increase performance by reordering insns with
dependencies. For instance, in the current implementation, referencing
the result of a ld in the next instruction can cause a one clock delay.

Other interesting timings:

* TLB miss: five clocks plus the number of clocks to finish two reads,
  plus the number of clocks to set the A (accessed) bit, if necessary.
  [I guess Intel found Mips' and others' software TLB lookup
  unworthy...]
* ld/fld following st/fst hit: one clock.
* delayed branch not taken: one clock [to skip/annul the delay slot
  instruction]
* nondelayed branch taken: bc, bnc: one clock; bte, btne: two clocks
* st.c (store to a control register): two clocks.

Comments

Well, that about does it. Quite a neat part; I think Intel has done
themselves proud with a very clean and well-balanced design; I guess
they've been reading comp.arch... :-)

I had read rumours that this was to be a floating point coprocessor for
the x86, and had feared that it would be burdened with lots of
slave-processor crap, but that is not the case.

If I could change one thing, it would be to add Mips' on-chip external
cache control hardware. Why hasn't anyone else picked up on this idea?
I'm afraid that for some code (not *mine*, of course) the 4K on-chip
insn cache will be too small; a cache controller would allow you to add
big external caches with a minimum of heartache. "I guess there's no
pleasing some people!"

Any typos/misinterpretations are my own. I speak only for myself.

Jan Gray  uunet!microsoft!jangr
Microsoft Corp., Redmond Wash.  206-882-8080
w-colinp@microsoft.UUCP (Colin Plumb) (03/06/89)
Well, I just got Jan's copy of the "i860 64-bit Microprocessor
Programmer's Reference Manual" and am going to post an even longer
summary. I'll try to avoid too much duplication.

Personal flames:

The exception handling is a disaster. On any sort of exception, the
processor switches to supervisor mode and jumps to virtual address
0xFFFFFF00. Then you have to stare at the bits in the status register to
figure out what happened, handle it, and do arcane things to get the
processor into a state such that it can restart. This involves looking
at the instruction that faulted and the one just before, and parsing
them a bit. Bleah!

Since the processor doesn't handle denormalised, infinity, or NaN values
in the floating-point unit, causing a trap instead, and the business of
sticking the right value into the pipeline is so tedious, you basically
need to avoid these things altogether.

Also, interrupt return is weird. It's overloaded onto the branch
indirect instruction. If the status register indicates that you're
inside a trap handler, it does some interrupt-return things in addition
to branching to an address specified in a register.

Integer divide is done by converting to floating point, doing the
Newton-Raphson bit, and converting back. The sample code they give
requires 62 clocks (59 without remainder). Can you say "divide step",
boys and girls?

The instruction in the delay slot of a control transfer must not be
another control transfer instruction, including a trap. This makes me
wonder if putting a trap there could sufficiently confuse the processor
that I'd end up in my code in supervisor mode. No reason to believe so,
just a nasty idea that popped into my head. (The first rule of
root-crackers: look for something which says "do not do x." Try as many
variations of x as possible.)

System registers may be read in user mode, and writes are simply
ignored. So much for virtual machines!

Some floating-point instructions can't be pipelined; others must be.
Annoying.
This is Intel order number 240329-001, copyright Intel 1989. Other
related documents:

	i860 64-bit Microprocessor (data sheet), order number 240296
	i860 Microprocessor Assembler and Linker Reference Manual, 240436
	i860 Microprocessor Simulator-Debugger Reference Manual, 240437

The manual I have says absolutely nothing about pinout, timing, or any
such electrical thing. Anyway, on to the meat:

>> Introduction and Register Summary <<

There are 32 32-bit integer registers and 32 32-bit fp registers. The fp
registers are used in even/odd pairs for double-precision operations.
The even register appears in memory at the lower address.

Other registers are the psr (processor status register, 32 bits), the
epsr (32 more bits), the db register (debugging; specifies an address to
breakpoint - reads and writes can be trapped), the dirbase register
(root pointer for page tables), the fir (fault instruction register,
saved PC on a fault), the fsr (floating-point status register), three
special-purpose 64-bit registers for use in pipelined floating-point
mode: KR, KI, and T, and a 64-bit MERGE register used in pixel
operations.

The psr bits, lowest to highest, are:

0: BR, Break Read
1: BW, Break Write - these bits control breakpoints used with the db
   register. When one is set, the corresponding access to the address
   specified in the db register causes a trap. The db register specifies
   a byte address; any access touching that byte will be trapped.
2: CC, Condition Code - there is only one CC bit, set by the add and
   subtract instructions as a greater-than/less-than flag, and by the
   logical (and, or, xor, andnot) instructions as a zero flag.
3: LCC, Loop Condition Code - this is used by the bla instruction only,
   to do add-compare-and-branch type things.
4: IM, Interrupt Mode - external interrupt enable bit.
5: PIM, Previous Interrupt Mode - state of the IM bit before the last
   trap.
6: U, User - set if the processor is in user mode.
7: PU, Previous User - copy of the U bit as of before the last trap.
8: IT, Instruction Trap - set by the processor when a trap occurs if the
   current instruction caused a trap. Breakpoints and the like.
9: IN, INterrupt - set when a trap occurs if an external interrupt is a
   contributing factor.
10: IAT, Instruction Access Trap - as above, set if there was an address
   translation problem during instruction fetch. There is no mention of
   a BERR-like pin.
11: DAT, Data Access Trap - as above, but for data accesses. This bit is
   also set by unaligned loads/stores and BR/BW exceptions.
12: FT, Floating-point Trap - set if a floating-point error contributed
   to a trap. Note that any combination of the trap bits may be set on
   entry to the interrupt handler at 0xFFFFFF00. No bits set indicates
   reset/power-up. Multiple bits set indicates multiple simultaneous
   exceptions.
13: DS, Delayed Switch - there is a 2-cycle latency between the first
   instruction in the stream with the dual-instruction mode bit set and
   the processor starting to execute two instructions per cycle, or
   between the first instruction with the bit clear and the cessation of
   dual-instruction mode. This bit is set when the first cycle of
   latency has passed, but not the second. Note that it is set both for
   switching to dual-instruction mode and away from it. The direction is
   given by the DIM bit.
14: DIM, Dual-Instruction Mode - set if the processor is in
   dual-instruction mode, executing an integer ("core") instruction and
   a floating-point one in a single cycle. It is only set when a trap
   occurs; it does not reflect the current state of the processor, but
   the one before the trap. The same goes for the DS bit.
15: KNF, Kill Next Floating - on trap return, if this bit is set, the
   next floating-point instruction is ignored (except for its
   dual-instruction bit). Useful when emulating a floating-point
   instruction that trapped in dual-instruction mode, when you want to
   retry the "core" integer instruction but not the fp one.
16: X - Unused. Undefined when read; write with 0 or the saved value.
17..21: SC, Shift Count - remembers the shift used in the last SHR
   (shift right logical) instruction. Specifies the shift count for the
   SHRD (shift right double - extract 32 bits from 64) instruction.
   Equivalent to the 29000's FC register.
22..23: PS, Pixel Size - specifies the size of a pixel for graphics
   operations. 0 through 3 mean 8, 16, 32, or <undefined> bit pixels.
24..31: PM, Pixel Mask - the pixel store instruction stores those pixels
   in the current 64-bit word specified by the low-order bits of this
   field. These bits can be set by various z-buffer instructions.

The PM, PS, SC, CC, and LCC fields can be set from user mode; writes to
the other fields are ignored in user mode.

The epsr bits, lowest to highest, are:

0..7: Processor type - specifies the type of the current processor. 1
   for the i860. (Hardwired; may not be changed even by the supervisor.)
8..12: Stepping number - specifies the revision. (Also hardwired.)
13: IL, InterLock - set if a trap occurs in the middle of a lock/unlock
   sequence.
14: WP, Write Protect - if clear, supervisor-mode accesses ignore the
   write protect bit of a TLB entry. If set, even supervisor-mode writes
   are disallowed.
15..16: X, unused.
17: INT, INTerrupt - the value of the INT input pin. It looks like there
   is only one, and this bit is unqualified. Writes are ignored.
18..21: DCS, Data Cache Size - the size of the data cache, 2^(DCS+12)
   bytes. Currently 1, meaning 8Kbytes. Hardwired.
22: PBM, Page-table Bit Mode - determines which of two bits in the page
   table entry is reflected on the PTB pin. If 0, the CD bit; if 1, the
   WT bit.
23: BE, Big-Endian - set if the processor is in big-endian mode. Causes
   the low 3 bits of the address bus to be complemented.
24: OF, OverFlow - set or cleared by the add and subtract instructions
   if signed or unsigned overflow (depending on the instruction) occurs.
   There is an instruction analogous to the 68000's TRAPV which traps if
   this bit is set.
25..31: X, unused.
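A rough model of that BE bit (my reading of "the low 3 bits of the
address bus are complemented", not code from the manual): complementing
the low 3 bits of a byte address mirrors byte positions within each
64-bit word, so byte 0 swaps places with byte 7, 1 with 6, and so on.

```python
def be_byte_addr(addr):
    """Byte address as driven on the bus when epsr.BE is set (my model):
    the low 3 bits are complemented, mirroring byte positions within
    each 64-bit word. Bits 3 and up are untouched."""
    return addr ^ 7

# Byte 0 of each 64-bit word swaps places with byte 7, 1 with 6, etc.
assert [be_byte_addr(a) for a in range(8)] == [7, 6, 5, 4, 3, 2, 1, 0]
```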
OF is user-writable; the other fields are only writeable from supervisor
mode.

The db (Data Breakpoint) register contains a byte address which is
watched for accesses. If any access touches this byte and the
corresponding bit in the psr is set, a data access trap occurs.

The dirbase (directory base) register points to the root of the page
table tree. Standard two-level page table with 4K pages. The bits,
lowest to highest, are:

0: ATE, Address Translation Enable - if set, address translation is
   enabled. You must flush the data cache before fiddling with this bit.
1..3: DPS, DRAM Page Size - the i860 has support for page-mode or
   static-column DRAMs. If two accesses differ only in the low-order
   12+DPS bits, the NENE# pin is asserted. Zero is used for one bank of
   256Kxn DRAMs.
4: BL, Bus Lock - echoed to the outside world on the LOCK# pin, after
   one cycle of latency. Controlled by the lock and unlock instructions.
   Copied to the IL bit of the epsr and cleared on a trap.
5: ITI, Instruction cache and TLB Invalidate - when a 1 is written, the
   instruction cache and page table cache are invalidated. Always reads
   as zero.
6: X, unused.
7: CS8, Code Size 8 bits - when set, instruction fetches are done from
   8-bit-wide memory instead of 64. Used for bootstrapping from a ROM.
   Once cleared, cannot be set again.
8..9: RB, Replacement Block - can control which block (set) of a cache
   is replaced on a miss. Used by the data-cache flush instruction, and
   for testing in conjunction with the next field. For the data and
   instruction caches, which are 2-way set-associative, only the low bit
   is used. For the TLB, both bits are used.
10..11: RC, Replacement Control - the i860 normally uses random
   replacement on all of its caches. For testing, you can replace this
   with a deterministic algorithm. 00 is normal, 01 causes all cache
   replacements to use the set specified in the RB field, 10 causes only
   the data cache to obey the RB field, and 11 disables data cache
   replacement.
I think this means hits can still occur, but no new information will be
added to the cache.

12..31: DTB, Directory Table Base - this, with 12 low bits of 0, is the
   address of the first-level page table.

The fir (Fault Instruction Register) holds the (virtual, I think)
address of the instruction that caused the trap. The first time it is
read, it is unfrozen, and subsequent reads will just get the address of
the load instruction.

The fpsr contains all the floating-point flags:

0: FZ, Flush Zero - if set, underflow is flushed to zero instead of
   raising a result-exception trap.
1: TI, Trap Inexact - if set, inexact results cause a trap.
2..3: RM, Rounding Mode - 0 through 3 mean round towards nearest, -inf,
   +inf, and 0.
4: U, Update - always reads as zero; if set on a write of this register,
   bits 9 through 15 and 22 through 24 are written. If clear, the data
   written to them is ignored and they are unchanged.
5: FTE, Floating-point Trap Enable - if clear, floating-point traps are
   never reported. Used when mucking with the pipeline in various ways,
   and when sticking software-emulated values into the fp unit.
6: X, unused.
7: SI, Sticky Inexact - set whenever an inexact result is generated,
   regardless of the state of the TI bit. Cleared only by explicit
   write.
8: SE, Source Exception - set when one of the inputs to a FLOP is
   invalid (infinity, denormal, or NaN).
9: MU, Multiplier Underflow - on read, indicates that the last multiply
   operation to come out of the pipeline underflowed. On write (note:
   only written if the U bit is set), this forces the flag on the
   operation in the first stage of the multiply pipeline. When that
   operation reaches the end of the multiply pipeline, this is the MU
   bit that will come out. Used for reloading the pipeline.
10: MO, Multiplier Overflow - similar to above.
11: MI, Multiplier Inexact - similar to above.
12: MA, Multiplier Add-one - similar to above, but indicates that the
   multiplier rounded up instead of down.
I'm not sure of the effects in the presence of sign bits.

13: AU, Adder Underflow - similar to MU.
14: AO, Adder Overflow - similar to MO.
15: AI, Adder Inexact - similar to MI.
16: AA, Adder Add-one - similar to MA.
17..21: RR, Result Register - holds the destination of a FLOP when an
   exception occurs due to a scalar op.
22..24: AE, Adder Exponent - holds the high 3 bits of the exponent
   coming out of the adder. Used to handle exceptions involving
   double-precision inputs and single-precision outputs properly.
25: X, unused.
26: LRP, Load pipe Result Precision - holds the precision of the value
   in the last stage of the load pipeline (more about this later); set
   on dp, clear on sp. It cannot be set by software (except by stuffing
   things into the pipe) and is provided to help save the state of the
   pipe.
27: IRP, Integer pipe Result Precision - holds the precision of the
   value in the graphics pipeline.
28: MRP, Multiplier Result Precision - holds the precision of the value
   in the last stage of the multiplier pipeline.
29: ARP, Adder Result Precision - holds the precision of the value in
   the last stage of the adder pipeline.
30..31: X, unused.

The KI and KR registers hold constant inputs to the multiplier for use
in the multiply-accumulate instructions. The T register sits between the
multiplier and the adder, again for multiply-accumulate instructions.
The MERGE register is used by graphics operations.

The bits in a page table entry (at either level) are:

0: P, Present - if clear, the entry is invalid and the high 31 bits are
   available to the programmer.
1: W, Writeable - if set, the page is writeable. This is always enforced
   in user mode, and enforcement in supervisor mode is controlled by the
   WP bit of the epsr. The effective value of this bit for any page
   table entry is the AND of the bits at the two levels of page tables.
2: U, User - if set, this page is accessible at user level. Must be set
   at both levels of the page table for the page to be accessible.
3: WT, Write-Through - if set, this page will not be cached by the
   internal data cache. This bit can also be echoed externally, if the
   PBM bit of the epsr is set. This bit is only used on the second
   (lower) level of page tables. On the first level, it is reserved.
4: CD, Cache Disable - if set, this page will not be cached by either
   internal cache. This bit can also be echoed externally, if the PBM
   bit of the epsr is clear. This bit is only used on the second (lower)
   level of page tables. On the first level, it is reserved.
5: A, Accessed - only used in the second level of page tables, and is
   set whenever the page is loaded into the TLB.
6: D, Dirty - only used on the second level of page tables. If clear,
   and the page is written to, a data access fault is generated. (I.e.
   it must be maintained by software.)
7..8: X, unused.
9..11: Reserved for use by the OS.
12..31: high-order bits of the physical address of the page or next
   level of page tables.

There is an external pin, KEN#, that must be asserted to enable cacheing
of instructions and/or data. If not asserted, the data is not put into
the cache. The data cache must be explicitly flushed by software before
you may change page tables.

>> Integer ("core") instructions <<

There are a few instruction formats. The most common has, high to low, 6
bits of opcode (of which the low bit, bit 25, is usually clear), 5 bits
of src2, 5 bits of dest, 5 bits of src1, and 11 bits of immediate
offset. Call this format A.

Next most common has 6 bits of opcode (low bit usually set), 5 bits of
src2, 5 bits of dest, and 16 bits of immediate constant (frequently
ignored; set to 0). Call this format B.

A few instructions - store and a few branch instructions - have the high
5 bits of a 16-bit immediate offset in the dest field (bits 16..20)
rather than the src1 field. Call this format C.

A few instructions are like the above, and also have a 5-bit immediate
constant in the src1 field. Call this format D.
The largest branch offsets are handled by instructions having 6 bits of
opcode and 26 bits of (signed) offset. Call this format E.

Floating-point instructions have 010010 in the high 6 bits, 5 bits of
src2, 5 bits of dest, 5 bits of src1, 4 magic bits (P - pipeline, D -
dual-instruction mode, S - source precision, and R - result precision),
and 7 bits of opcode. Call this format F.

Some special operations have 010011 in the 6-bit opcode field, 5 bits of
src1 in bits 11..15, and 5 bits of opcode in bits 0..4. The other bits
are unused. Call this format G.

> Load

The load instruction (ld) has 3 variants: ld.b, ld.s, and ld.l. dest =
mem[src1 + src2]. The opcode is 000L0I. I controls src1: 0 if it's a
register (format A), 1 if it's a 16-bit signed offset (format B). L is 0
if it's a byte load, 1 if it's a 16 or 32-bit load. In the latter case,
bit 0 of the instruction is stolen from the immediate offset and
indicates 16 (0) or 32 (1) bit loads. This bit is considered to be 0 for
the addition. Loads are always sign-extended.

BTW, the Intel-suggested format is ld.x src1(src2), dest. No more
mov dest, src nonsense!

There is 1 cycle of latency on loads, even if they hit the cache. This
is interlocked.

> Store

Store is similar (st.b, st.s, or st.l), but it must use a 16-bit
immediate offset, and uses format C. mem[src2 + immediate] = src1. The
opcode is 000L11. (Again, st.x src1, imm(src2).) Note that r0 hardwired
to zero comes in handy here.

32-bit absolute addresses must be formed by loading the high 16 bits
into a register and taking an offset from there. Intel suggests r32 for
this purpose. Because the offsets are signed, you may have to diddle the
high 16 bits a bit to make things work properly.

> Move int to fp

ixfr src, fdest moves 32 bits (no format conversion) from an integer to
an fp register. The opcode is 000010. There are two cycles of latency
(interlocked) until the data appears in the destination. Format A.
000110 is reserved.
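The "diddling" mentioned under Store above can be made concrete. This is
my arithmetic sketch of the standard split-immediate fixup, not code
from the manual: because the 16-bit offset is sign-extended, the high
half you load with orh must be bumped by one whenever bit 15 of the low
half is set.

```python
def split_address(addr):
    """Split a 32-bit absolute address into (high16, low16) such that
    (high16 << 16) + sign_extend(low16) == addr (mod 2^32)."""
    lo = addr & 0xFFFF
    hi = (addr >> 16) & 0xFFFF
    if lo & 0x8000:                # low half will sign-extend negative,
        hi = (hi + 1) & 0xFFFF     # so bump the high half to compensate
    return hi, lo

def recombine(hi, lo):
    """What orh-then-signed-offset computes: high half placed in the
    upper 16 bits, plus the sign-extended low half."""
    sext = lo - 0x10000 if lo & 0x8000 else lo
    return ((hi << 16) + sext) & 0xFFFFFFFF
```

For example, split_address(0x1234ABCD) gives (0x1235, 0xABCD) rather
than (0x1234, 0xABCD), since 0xABCD sign-extends to a negative offset.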
> Fp load

fld.y src1(src2), fdest and fld.y src1(src2)++, fdest are
floating-point load instructions. They are similar to the integer load
instructions, except the data sizes are .l (32 bits), .d (64), or .q
(128 bits). The ++ autoincrement mode stores src1+src2 back into src2.
Again, the low-order bits of the immediate offset (formats A or B) are
used in the instruction. Bit 0 is set for autoincrement addressing, bit
1 is set for 32-bit loads, and bit 2 is set for 128-bit loads. Bits 1
and 2 are zero for 64-bit loads. The opcode is 00100I, I selecting
format A or B. For 64 or 128-bit loads, the destination register must be
even or a multiple of 4, respectively.

> Fp store

fst.y fdest, src1(src2) and fst.y fdest, src1(src2)++ are similar. src1
can be a register here. The opcode is 00101I.

> Pipelined fp load

pfld.z fdest, src1(src2)[++], pipelined load, has the same addressing
modes, except that it uses the 3-deep load pipeline (which operates
independently of scalar loads; the two may be arbitrarily interleaved).
The destination register specifies where to put the result of the 3rd
previous pfld instruction. 128-bit pipelined loads are not allowed.
Pipelined accesses do not place the data in the cache, although they do
handle cache hits properly. The opcode is 01100I, I selecting format A
or B.

> Pixel store

pst.d freg, imm(src2)[++] stores pixels (whose size is given by the PS
field of the psr) from the 64-bit freg into memory, selected by the PM
field. If you have 16-bit pixels and the low bits of the PM field are
0110, only the middle 4 byte strobes will be asserted on the write. Bit
0 corresponds to the lowest address. See the fzchks and fzchkl
instructions for uses for this. The opcode is 001111, and it uses format
B. After the store, the PM field is shifted right by the number of bits
used, so the next pst instruction will have access to the next bits.

> Add, Subtract, Compare

The add/subtract instructions also double as compare instructions.
There's addu src1, src2, dest, adds, subu, and subs.
They do the obvious things, also setting the OF flag on overflow, and
setting the CC flag as follows:

	addu:	CC gets the carry from bit 31
	adds:	CC gets (src2 < -src1)
	subu:	CC gets the carry from bit 31 (src2 <= src1)
	subs:	CC gets (src2 > src1)

This uses formats A and B. If the 16-bit immediate is used, it is
sign-extended. To get the one's complement, use subs -1, src2, dest. The
opcode is 100UAI, where U is 0 for unsigned, 1 for signed; A is 0 for
add, 1 for subtract; and I selects format A or B (0 for A, 1 for B).

> Shift, Rotate

The shift instructions are shl, shr, shra, and shrd. The first three do
the obvious things, shr zero-filling high-order bits and shra
sign-extending. Only shr copies its src1 (register or 16-bit immediate,
with the high 11 bits ignored) to the SC field of the psr. shrd uses
this to compute dest = ((src1<<32 + src2) >> SC) & 0xFFFFFFFF. The
opcodes are:

	10100I for shl, I selects format A or B
	10101I for shr, I selects format A or B
	10111I for shra, I selects format A or B
	101100 for shrd, format A only

None of the shifts set the condition code bit, so the assembler uses the
following macros:

	mov src2, dest	== shl r0, src2, dest
	nop		== shl r0, r0, r0
	fnop		== shrd r0, r0, r0

To do a rotate, shr count, r0, r0 and then shrd src, src, dest.

> Trap

There is a trap instruction, which uses format B, although the source
operands are not interpreted. The destination register is "undefined,"
so it's a good idea to use register 0. The opcode is 010001. This causes
an IT trap (see the psr). The source bits can be used for whatever.

> And, Or, Xor, Andnot

There are 4 logical instructions: and, or, xor, and andnot. They do the
obvious things: dest = src1 & src2, dest = src1 | src2, dest = src1 ^
src2, dest = ~src1 & src2. The opcodes are of the form 11OPHI, where OP
specifies one of {and, andnot, or, xor}, I specifies A or B format, and
H can be set in B format to indicate that the immediate constant should
be shifted up 16 bits before use.
Thus, to load the high 16 bits of a register, orh immediate, r0, dest will
do the trick. (Or xorh.) H bit set and I bit clear is reserved. 16-bit
immediate values are zero-extended, if used. The CC flag is set if the result
is zero, otherwise it is cleared. The opcodes with the H bit set are andh,
andnoth, orh, and xorh.

> Control register modification

ld.c and st.c are used to modify control registers. Format A is used,
although only src2 and one of src1 (for st.c) or dest (for ld.c) are
interpreted. The opcode is 0011L0, where L is 0 for load (dest =
special[src2]) and 1 for store (special[src2] = src1). The src2 field holds
0 through 5 for the fir, psr, dirbase, db, fpsr, and epsr registers. These
instructions are legal in user mode, although many writes will be ignored.

> Branches

Most of the branch instructions use format E, taking a 26-bit offset. I
assume the offset is a word offset (in case I forgot to mention it,
instructions must be 32-bit word-aligned), but I can't find it explicitly
stated. br (opcode 011010) is a straight branch, with 1 delay slot.

> Call

call (opcode 011011) is similar, but also puts a return address in
register r1.

bc and bnc branch if the CC flag is set or clear, respectively. They come in
non-delayed (bc, bnc) and delayed (bc.t, bnc.t) versions. The opcodes are
01110T and 01111T, respectively. T is set in the .t (delayed) forms.

bri is an indirect branch, delayed, using opcode 010000 (trap already has
010001) and format A, I think, although only src1 is used. bri [src1]
branches to the address specified in src1. The low two bits of src1 are
ignored. If any of the trap bits are set when this instruction is executed
(see the psr), this also performs an interrupt return, clearing the trap
bits, copying PU to U and PIM to IM, and doing strange things with DS and
DIM.

> Loop

There's also bla, a looping-type instruction. It's a bit weird.
First of all, if the LCC flag is set, it does a delayed branch with a 16-bit
offset; then it computes "adds src1, src2, src2" and sets the LCC flag to
the complement of what the CC flag would be for a real adds. This uses
format C, with opcode 101101. Intel gives the example of clearing an array
of 16 single-precision numbers to zero, starting at the address in r4:

        adds -1, r0, r5         // r5 holds loop increment
        or 15, r0, r6           // r6 holds loop count
        bla r5, r6, CLEAR_LOOP  // clear LCC; it doesn't matter
                                // if we jump or not
        addu -4, r4, r4         // compensate for preincrement (delay slot)
CLEAR_LOOP:
        bla r5, r6, CLEAR_LOOP
        fst.l f0, 4(r4)++       // delay slot

I've never seen a looping instruction quite like it. Be careful not to trash
LCC during the loop (shades of jcxz!).

Other core instructions:

> Compare-and-branch

bte src1, src2, offset and btne src1, src2, offset branch (no delay) if
src1 == src2 or src1 != src2, respectively. They have opcode 0101EI, where
E = 1 branches on equal, and I selects format C or D. I.e. src1 can be an
immediate value, but only in the range 0..31.

> Flush

flush - flush the data cache. "In user mode, execution of flush is
suppressed," whatever that means. What it seems to do is force a fake load,
one that fills the data cache with garbage. "When flushing the cache before
a task switch, the addresses used by the flush instruction should reference
non-user-accessible memory to ensure [Will wonders never cease? A book
written in the U.S. actually got ensure/insure straight!] that cached data
from the old task is [oh, well... can't win them all] not transferred to the
new task. These addresses must be valid and writeable in both the old and
the new tasks's space." The sample code reserves a 4K hunk of memory, and
does this:

// Rw, Rx, Ry, and Rz are registers
// FLUSH_P_H and FLUSH_P_L are two halves of the address of the 4K hunk,
// less 32.
        ld.c dirbase, Rz        // assuming RB and RC fields clear
        or 0x800, Rz, Rz        // Set RC field to 2 (obey RB for data cache)
        adds -1, r0, Rx         // Loop increment
        call D_FLUSH
        st.c Rz, dirbase        // Store new RC field (in delay slot of call!)
        or 0x900, Rz, Rz        // Set RB field to 2 (was assumed 0)
        call D_FLUSH
        st.c Rz, dirbase        // Store new RB field (in delay slot of call!)
        xor 0x900, Rz, Rz       // Clear RB and RC fields
        // Pound on DTB, ATE, or ITI fields here
        st.c Rz, dirbase        // Store cleared values
        // continue...

D_FLUSH:
        orh FLUSH_P_H, r0, Rw
        or FLUSH_P_L, Rw, Rw    // Rw gets address of flush area
        or 127, r0, Ry          // loop counter
        bla Rx, Ry, D_FLUSH_LOOP // set up LCC
        ld.l 32(Rw), r0         // clear pending bus writes
D_FLUSH_LOOP:
        bla Rx, Ry, D_FLUSH_LOOP // Loop
        flush 32(Rw)++          // Hit every 32 bytes (cache line size)
        bri r1                  // Return - branch to (r1)
        ld.l -512(Rw), r0       // Load from flush area to clear pending
                                // writes (guaranteed cache hit).

I don't quite understand the bit about clearing pending writes. I guess it
puts off address translation until the last possible moment (the write queue
uses virtual addresses), and a load to r0 is an idiom which always generates
an interlock.

The flush instruction uses opcode 001101, format B. Bit 0 of the immediate
field selects autoincrement mode.

That's everything in formats A through E; now for format G. (High 6 bits of
opcode = 010011, low 5 bits give a secondary opcode; only one 5-bit register
field is defined.) The defined operations are:

calli: opcode 00010, performs an indirect (delayed) call via the address
specified in the register operand. I don't know if it reads the source
register before or after storing the return address in r1. Could be a way
to play with coroutines.

intovr: opcode 00100, traps if the OF flag in the epsr is set. A trapv, in
other words.

> Lock

lock: opcode 00001. This is interesting. This begins an interlocked sequence
on the next data access that misses the cache, setting the BL bit.
Interrupts are disabled and the bus is locked until explicitly unlocked. The
sequence must be restartable from the lock instruction in case a trap occurs.
If there is more than one store, you must ensure there are no traps after the
first non-idempotent store. I.e. keep the code on one page and make sure all
the data addresses are valid. There is a similar unlock instruction (opcode
00111), which unlocks the bus on the first data access that misses the cache
after it. These instructions *are* executable from user mode, but there is a
32-instruction counter that traps if you spend too long with the bus locked.
I like those instructions. An RTOS might like to be able to set the timeout,
but 32 instructions is a reasonable value.

Now, for the interesting part:

>> Floating-Point <<

These are all in the F format, with a 010010 opcode in the high 6 bits, then
5 bits each of src2, dest, and src1, then 4 magic bits, then 7 bits of fp
opcode. Two of the magic bits control the source and destination precisions:
S=0 for single and S=1 for double sources; R=0 for single and R=1 for double
results.

> Pipelines

Now it's time to explain the pipeline concepts used by the 80860. There are
4 pipelines on the i860: multiplier, adder, graphics unit, and floating-point
loads. These are 2/3, 3, 1, and 3 stages deep. The multiplier is 2 stages
deep for double-precision sources and 3 stages [sic] for single. The
destination format is unimportant. Changing the FZ (flush zero), RM (rounding
mode), and/or RR (result register) bits of the fpsr while there are results
in the adder or multiplier pipelines is a bad idea.

One of the magic bits in each fp instruction is the P, pipeline bit. If this
bit is clear, the operation goes straight through the floating-point unit.
Any results in the pipeline are lost, but the result is available by the next
instruction. This is *not* the next cycle, but it's scoreboarded. (This
doesn't apply to the load pipeline, which is not used by scalar load
instructions.)
If the pipeline bit is set, though, then the specified dest is for the
result at the end of the pipeline, and the requested operation goes in the
front. The store is completed before the load of the source operands. (At
least conceptually.) So initially, you must stick a few operations into the
pipeline, throwing away whatever was there (writing it to f0), then you can
pump through lots of data, then you have to stick in a few junk computations
to get the last few results.

The load pipeline, the pfld instruction, is the most straightforward, and
works as described above.

On the multiply pipeline, switching source precisions with the pipeline
half-full gets interesting. If you started out in double (2-stage) mode with
B and A in the pipeline (A one stage from completion, B two), and added
single-precision computation C, you'd store A and end up with C, B, and 0.0
in the pipeline. If you started out with C, B, A, and added double-precision
computation D, you'd end up with A stored and D, C in the pipeline. B would
get lost.

Both inputs to an operation must be of the same precision. There are odd,
not fully explained problems with taking double source operands and
returning a single result, so the precision suffixes on floating-point
operations should generally be restricted to .ss, .sd, and .dd.

> Fmul, Fadd, Fsub

Anyway, here's a list of the simple floating-point operations:

[p]fmul.p src1, src2, dest (opcode 0100000)
[p]fadd.p src1, src2, dest (opcode 0110000)
[p]fsub.p src1, src2, dest (opcode 0110001) // result = src1 - src2

The fadd or pfadd instruction may have a .ds precision suffix, as long as
one of the sources is f0. This is used for format conversion. The [p]fadd
instructions are used in the [p]fmov macros.

> Float to integer

[p]ftrunc.p src1, dest (opcode 0111010)

The result of this operation is 64 bits, whose low 32 bits are the integer
(truncated) part of the floating-point src1. It uses the adder.
[p]fix.p src1, dest (opcode 0110010)

Same as above, but the integer part is rounded. For both of these, the
integer is two's complement, signed.

pfmul3.dd src1, src2, dest (opcode 0100100)

This forces a dp multiply to use the 3-stage pipeline. It's only intended
for reloading a pipeline.

> Multiply (integer)

fmlow.dd src1, src2, dest (opcode 0100001)

This multiplies only the low-order bits of its operands. dest gets the
low-order 53 bits of the product of the significands of src1 and src2.
Bit 53 of dest gets the MSB of the product. This instruction cannot be used
in pipelined mode, does not affect the result-status bits in the fpsr, and
does not cause any traps.

> Divide, Reciprocal

frcp.p src2, dest (opcode 0100010)

dest = 1/src2, approximately. Absolute significand error < 2^-7. src1 must
be zero. Use as a starting point for Newton-Raphson. This instruction may
not be pipelined. It causes a source-exception trap if src2 is zero. It
uses the multiplier.

> Square root

frsqr.p src2, dest (opcode 0100011)

As above, but dest = 1/sqrt(src2), approximately, and it also traps if
src2 < 0.

> Fcmp

pfgt.p src1, src2, dest (opcode 0110100, R bit clear)
pfle.p src1, src2, dest (opcode 0110100, R bit set)
pfeq.p src1, src2, dest (opcode 0110101)

These instructions perform floating-point comparison using the adder. They
begin with "p" because they advance the pipeline one stage (the value they
insert is undefined, but not an error), but they place the result of the
comparison (src1 > src2, src1 <= src2, src1 = src2) in the CC bit
immediately. There is no pipeline delay. (Actually, there is one cycle of
latency, but it's scoreboarded.) They do trap on invalid inputs.

> Multiply-accumulate

The following instructions are called dual-operation instructions, since
they use both the adder and multiplier. Not to be confused with
dual-instruction mode. Combining both of these gives the claimed 150 MOPS.
pfam.p src1, src2, dest (opcode 000xxxx)
pfmam.p src1, src2, dest (opcode 000xxxx)
pfsm.p src1, src2, dest (opcode 001xxxx)
pfmsm.p src1, src2, dest (opcode 001xxxx)

These instructions are really complex families of instructions. They perform
variations on multiply-accumulate. The xxxx is the DPC (Data-Path Control)
field. The precision specifies the input and output precisions of the
multiplier; the adder takes inputs and outputs of the destination precision.
Here is where the KI, KR, and T registers come in. The possible data flows
are complex, but:

The value written into dest can be the result of either the adder or
multiplier pipeline.

The multiplier's src1 can be the instruction's src1, KI, or KR. If it is one
of the K registers, the instruction's src1 can be copied into it prior to
use, or you can use its current value. The multiplier's src2 can be the
given src2 or the value written into dest. The multiplier's result can be
written into the T register as well as sent to the destination register.

The adder's src1 can be the instruction's src1 (if the multiplier hasn't
usurped it), the value written into dest (again, if nobody else has it), or
the value in the T register (which can be whatever it used to be or the
value written by the multiplier). The adder's src2 can be the result of the
multiplier, the value written into the dest register, or the given src2
(assuming the multiplier hasn't stolen it).

When you add the fact that the adder can compute src1+src2 or src1-src2,
you have a total of 64 possibilities. A bit in the opcode specifies whether
the adder adds or subtracts, and the P bit is used to specify which output
goes to the dest register (0 = adder, 1 = multiplier (and the adder's result
is thrown away)).
After this factoring, there are 16 cases. 8 can be represented by the DPC
field values 0XYZ, where:

X controls whether "K" means KR (X=0) or KI (X=1),

Y controls whether the adder's src2 is the result of the multiplier (Y=0)
or the result of the multiplier goes into T and the adder's src2 is the
result that gets written into the dest register (Y=1), and

Z controls whether the instruction's src1 goes to the adder's src1 (Z=0) or
the instruction's src1 goes to K (and thence to the multiplier) and the
adder's src1 comes from T (which may have come from the multiplier) (Z=1).

DPC values of the form 1XY0 cover cases where the multiplier's inputs are K
and the result written to dest (K is controlled by X, as above) and the
adder's inputs are the instruction's src1 and src2. Y controls whether T is
loaded with the result of the multiplier (Y=0) or not (Y=1).

DPC values of the form 1XY1 cover cases where the multiplier's inputs are
the instruction's src1 and src2. If X is 1, then the adder's src1 is T
(which is not loaded from the multiplier's result) and Y controls whether
the adder's src2 is the multiplier's result or the value written to the dest
register. (Note that these may be the same value.) If X is 0, then the
adder's second input is the result of the multiplier (which is not written
into T), and its first input is controlled by Y. If 0, it's the value
written into the dest register; if 1, it's the T register.

Are you suitably confused? Pictures do help somewhat. Intel supplies
transliteration rules for producing mnemonics from these various
connections, but I won't go into them here.

Scoreboard alert: when the multiplier's src1 is the instruction's src1, this
must not be the same as rdest. Something screws up.

>> Graphics operations <<

These also use the fp instruction encoding and register set. But they use a
separate graphics pipeline which is only one stage deep - i.e. when you
start one instruction, you get the result of the previous one out.
As with the floating-point instructions, most have pipelined and
non-pipelined versions, which behave analogously. (The graphics operations
use fp opcodes 1xxxxxx; I've already covered everything of the form
0xxxxxx.)

> Long long

The basic ones are long-integer operations:

[p]fiadd.w src1, src2, dest (opcode 1001001)

.w is .ss or .dd for 32 or 64-bit adds. The CC is not set, and no traps are
signalled.

[p]fisub.w src1, src2, dest (opcode 1001101) // dest = src1 - src2

There are move macros that use these instructions with f0.

> Z-buffer

[p]fzchks src1, src2, dest (opcode 1011111)
[p]fzchkl src1, src2, dest (opcode 1011011)

These instructions do z-buffer operations. The short form takes the sources
as 4 fields of 16 bits each, and does 4 simultaneous compares, with the
results written to the PM (Pixel Mask) field of the psr. In fact, what
happens is that the PM is shifted right 4 bits and the most significant
4 bits are set with the results of (src2 <= src1), for each of the 4 fields.
The value produced by the operation is the result of 4 parallel minimum
operations, i.e. the updated z-buffer.

The long form, [p]fzchkl, does the same, except it uses 2 32-bit wide
fields, shifts PM by 2 bits, and updates the high 2 bits. The shift allows
you to rapidly compute 8 bits worth of z-buffer values. The size of the
z-buffer is independent of the pixel size set in the PS field of the psr.

> Phong shading

[p]faddp src1, src2, rdest (opcode 1010000)

This instruction does pixel interpolation into the MERGE register. I don't
quite understand how this instruction is useful, but it does something
unusual. Assume 8-bit pixels specified in the PS field of the psr. faddp
takes src1 and src2 as consisting of 4 16-bit words, adds each field
together, and writes the high bytes of each word (if you consider the words
to be fixed-point 8.8 bit numbers, it writes the integer parts) to the MERGE
register.
The MERGE register has been shifted down 8 bits at the same time, so two of
these instructions will fill it with pixel values. If the pixels are 16 bits
wide, it will do the same, except the fields are considered to be 6.10 bit
fixed-point numbers, with the high 6 bits loaded into the MERGE register,
which has been shifted down 6 bits. (After two shifts, two bits won't fit
and get truncated from one of the fields - thus the 6/6/4 RGB format you see
flying around. This is the only place it appears.) If the pixels are 32 bits
wide, the fields are taken to be 32 bits wide, with the high bytes of each
of the two copied to the MERGE register, which has been shifted down 8 bits.

There is also a similar [p]faddz instruction (opcode 1010001), which does
the same thing with 16.16 bit fields, shifting the MERGE register 16 bits at
a time. Intel seems to be really keen on this sort of operation. I wish I
knew what it was good for. You can do the same thing with 32.32 bit fields,
by doing two long adds on the corresponding parts of src1 and src2, then
using a single-precision move to copy the destination parts into a register
pair.

[p]form src1, dest (opcode 1011010)

dest = src1 | MERGE
MERGE = 0

This instruction lets you read the MERGE register after you've pounded on it
a while, setting any last bits you need to tweak and clearing it for future
action.

> Move fp to int

fxfr src1, dest (opcode 1000000)

This moves single-precision floating-point register src1 to integer register
dest. The opposite of ixfr. [These mnemonics aren't very mnemonic.]

>> Dual-Instruction Mode <<

One of the magic bits in each fp instruction is the D, dual-instruction bit.
Intel suggests using either a d. prefix to the mnemonic or assembler
directives .dual and .enddual.
If the processor comes across an instruction (which must be aligned on a
64-bit boundary) with the D bit set, then it executes the next instruction
(integer ("core") op or fp op with D bit set) and starts reading
instructions 64 bits at a time. The low-order instruction must be an fp op,
and the high-order must be an integer ("core") op. Exception: the fnop
(shrd r0, r0, r0) instruction is allowed in the fp slot. Both these
instructions are executed simultaneously.

To get out of dual-instruction mode, have an fp op (FLOP) without the D bit
set. This pair, and the next, will still be executed in dual-instruction
mode, but after that you're back to single. A degenerate case is a single
FLOP in a stream with the D bit set, followed by one with it clear. The next
two instructions will be executed as a pair, and then back to single mode.

Executing two instructions at once requires some extra rules:

- If a branch on CC is paired with a floating-point compare, the branch
  tests CC before the compare sets it.

- If an ixfr, fld, or pfld instruction is paired with a FLOP, the FLOP gets
  the register value before the other instruction updates it (or marks it as
  pending in the scoreboard, really).

- If an fst or pst operation stores a register which is written by the
  instruction it's paired with, the new value is stored to memory.

- An fxfr instruction that conflicts with a source operand in the core
  operation paired with it will store after the core op has read the
  register. "The destination of the core operation will not be updated if it
  is any if the integer register. Likewise, if the core instruction uses
  autoincrement addressing, the index register will not be updated." Typo?
  I think this means the fxfr steals the write bus from the core processor,
  and the core processor's write goes to the bit bucket.

- If both instructions set the CC, the FLOP will win.

- If the FLOP is scalar and the core operation is fst or pst, it should not
  store the result of the FLOP.
- When the core op is pst, the FLOP must not be [p]fzchks or [p]fzchkl.
  Conflict over the PM field, y'know.

- When the core op is ld.c or st.c (diddles control registers), it must be
  paired with fnop.

- You cannot use the return-from-interrupt functionality of bri in
  dual-instruction mode.

- A FLOP which sets CC cannot be paired with a compare-and-branch core
  instruction. I.e. pfeq and pfgt conflict with b[n]c.t. b[n]c.t also
  conflicts with a pfeq or pfgt instruction in the next pair.

- "When the FLOP is fxfr, the core operation cannot be ld, ld.c, st, st.c,
  call, ixfr, or any instruction that updates an integer register (including
  autoincrement indexing)."

- You can't start to exit from dual-instruction mode on an instruction
  paired with a control-transfer instruction. I.e. if the FLOP before had D
  set, so must the FLOP paired with the branch.

- You can't start to switch to or from dual-instruction mode on the
  instruction following a bri (in its delay slot).

Enough rules? Well, you should have known it was gonna be a bit ugly.

>> Traps, Interrupts, Exceptions, etc. <<

As I mentioned, this is not well done. When a trap occurs, bits are set in
the psr (and maybe fpsr, if the FT bit in the psr is set) to indicate
contributing factors; then the U and IM bits are copied to the PU and PIM
bits, then cleared (disabling interrupts and switching to supervisor mode),
the DIM and DS flags are set as needed, and the fir is set up. (In
dual-instruction mode, the fir will point to the FLOP in the low-order half
of the pair. If the problem was just a data-access fault, the FLOP (unless
it was fxfr) completes, and you should not reexecute it on interrupt return.
Instruction and data-access faults are always the fault of the core
instruction.) After this setup, the processor jumps to virtual address
0xFFFFFF00. Then you have to figure out what's going on and fix it.
The state of the processor consists of:

- The register files
- The four pipelines
- The KI, KR, and T registers
- The MERGE register
- The psr, epsr, and fsr
- The fir, and
- The dirbase register (with its dependencies on the data cache)

A simple interrupt return consists of:

- Restoring the register files, pipelines, KI, KR, T, and MERGE registers
  (not necessary for simple interrupt handlers), except for one register
  which holds the return address from the fir.
- Undoing the effect of an autoincrement instruction which must be
  reexecuted (parse the instruction at [fir] to figure this out)
- Seeing if you need to back up the return address by one instruction
- Setting up the psr, possibly setting the KNF bit, and definitely setting
  at least one trap bit.
- Executing an indirect branch (bri) to do the interrupt return, and in its
  delay slot,
- Restoring the register that holds the resumption address.

The processor is still in supervisor mode here, so you don't need to pollute
the user's address space.

> Backing up the return address

If the instruction before the one pointed to by the fir is a delayed branch,
you should back up and re-execute it. If it is a bla, you need to undo its
add instruction. There is an exception to this where you bombed out on a
floating-point compare instruction you need to emulate and the instruction
before is a conditional delayed branch. Here, you need to leave the CC alone
so the branch will do the right thing, and set it so the fp compare will
seem to have done the right thing. You need to compute where the conditional
branch would put you and resume there.

If you are backing up, and in dual-instruction mode, you should set things
up (DS set, DIM clear) so the core instruction will be executed in
single-instruction mode, then DIM will be re-entered. If DS was originally
set, clear it.

Plus, you have to worry about the case that the instruction at fir-4 might
not exist.
Intel suggests that you begin each code segment with a nop instruction to
avoid this problem.

> Setting KNF

KNF should be set if you have emulated a floating-point instruction that
trapped, or if you got only a data-access fault in dual-instruction mode and
the FLOP was not fxfr. [Is that perfectly clear?]

> Saving the pipeline

Doing this is messy. Basically, you need to read out all the results (and
the associated error codes for the adder and multiplier pipelines) to store
them, and then push operations with the equivalent answers back on restore.
For the load pipeline, store the values read in memory somewhere and reload
it from there afterwards. For the graphics pipeline, you can just read it
with a pfiadd, and restore it the same way (add 0 to the recalled value).
The MERGE register also needs to be stored. For the floating-point
pipelines, you need to get all the values out, including error conditions,
and the KI, KR, and T registers. To put them back, first stuff the KR, KI,
and T registers, then place value+0 and value*1 computations into the
various pipelines, along with the proper error bits. There's sample code to
do this in the data book, and it's not particularly pretty.

>> Calling Conventions <<

Intel has a suggested calling convention. Although the border is still
fuzzy, the manual suggests r0-r15 and f0-f15 as callee-saves, and the other
half as caller-saves. r1 is the return address, r2 is the stack pointer, and
r3 is the frame pointer. Parameters are passed in r16 through r27 and f16
through f27, and the others are used for scratch. r31 is reserved for
address computations. They suggest that even single-precision float
arguments be passed in a register pair, and anything that won't fit into
registers be passed on the stack C-style. The stack pointer should always be
16-byte aligned so the 128-bit loads can be used easily.

> Memory map

They also suggest a memory map. It starts with 4K of unreadable memory
(NULL-catcher), then user data, and heap.
Then empty space until you hit the stack, then shared-memory frames, and OS
data, topping out at 0xF0000000. Then comes a jump table to standard library
routines until 0xF0400000, then user code (text), blank space, and then the
OS up at the top of memory.

>> Sample code <<

The manual gives a bunch of sample code. I won't reproduce it, but will list
what's there:

- Sign-extending a value in a register (shl, shra)
- Loading unsigned integers (ld, and)
- Single-precision FP divide (approximate, two iterations Newton-Raphson
  unpipelined, 22 cycles, 2 ulp worst-case error)
- DP fp divide (three iterations Newton-Raphson, also 2 ulp, 38 cycles)
- Integer multiply (move to fp, use fmlow, move back; 9 clocks, five of
  which can be overlapped)
- Signed int to double (7 cycles; 3 can be overlapped)
- Signed integer divide (62 cycles, 59 without remainder)
- Null-terminated string copy (byte-at-a-time, simple)
- Example of pipelined adds
- Example of pipelined multiply-accumulate
- Example of dual-instruction mode
- Cache strategies for matrix dot product (e.g. keep both matrices in cache;
  keep one and use pipelined loads on the other)

>> Pipeline Interlocks <<

Everything's single-cycle, but here's what can interlock:

i-cache miss: given in terms of pin timing, plus two cycles if a d-cache
miss is in progress simultaneously.

d-cache miss (on load): again, pin timing, but it seems to be "clocks from
ADS asserted to READY# asserted"

fld miss: d-cache miss plus one clock

call, calli, ixfr, fxfr, ld, ld.c, st, st.c, pfld, fld or fst with data
cache miss in progress - stalls until miss satisfied, plus one cycle

ld, call, calli, fxfr and ld.c have 1 cycle of latency (next instruction
will stall if scoreboard hits)

fld, pfld and ixfr have 2 cycles of latency.

addu, adds, subu, subs, pfeq, pfgt, and pfle have 1 cycle of latency to
update the CC bit. A branch on that bit will stall.
The multiplier's src1 must be in the register file; if it is the result of
the previous instruction, you get a 1-cycle stall.

Scalar FLOPs fadd, fix, fmlow, fmul.ss, fmul.sd, ftrunc and fsub have
3 cycles of latency. fmul.dd has four. If the input and output precisions
differ (e.g. fmul.sd), add one cycle. Plus one if the following FLOP is
pipelined and has dest <> f0.

A TLB miss takes 5 cycles plus two reads, plus setting the A bit (if
necessary).

If three pfld's are outstanding and you execute one more, you will stall
until the first completes, plus one cycle.

A pfld data-cache hit costs two clocks.

If the store pipe is full (one on bus plus two pending internally), another
access will delay until the current access completes, plus one cycle.

A load (or fld) following a store cache hit - one clock.

Delayed branch not taken - costs one clock.

Nondelayed branch taken - one clock for bc, bnc; two for bte, btne.

bri - one clock.

st.c - two clocks.

There is no forwarding from the graphics unit to the adder, multiplier, or
itself, so there is one cycle of latency there.

A flush has two cycles of latency.

An fst takes one cycle to get the value out of the register, so if the next
instruction overwrites the register being stored, it will stall.

>> The End <<

And that, boys and girls, is basically the complete contents of the
programmer's reference manual. Enjoy! (52K ug... let's see if we can bomb
any mailers!)
-- 
-Colin (uunet!microsoft!w-colinp)
"Don't listen to me. I never do." - The Doctor
jangr@microsoft.UUCP (Jan Gray) (03/06/89)
In article <807@microsoft.UUCP>, I write:
> * 8, 16, 32 bit integer load/store insns, operands must be appropriately
>   aligned; byte or word values are sign extended on load. [I hope you
>   don't use "unsigned char" too much...]

What an ultra-maroon. I don't know what I was thinking, but you need merely
"and" the result of the load with 0xFF to load a byte as an unsigned.

Jan Gray  uunet!microsoft!jangr  Microsoft Corp., Redmond Wash.  206-882-8080
jeff@Alliant.COM (Jeff Collins) (03/07/89)
In article <807@microsoft.UUCP> jangr@microsoft.UUCP (Jan Gray) writes:
>
> i860 Overview
>
> <deleted lots of interesting and useful data on the i860>
>
>Caches
>
>* 8K data cache, virtual addressed, write-back, two-way "set associative",
>  2x128 lines of 32 bytes
>* Flush instruction forces a dirty data cache line (32 bytes) back to memory.
>  Intel supplies suggested code to flush entire data cache.
>* Storing to dirbase register with ITI bit set invalidates TLB and instruction
>  caches; must flush data cache first! [Remember, the data cache is virtually
>  addressed.]

Coming from a multiprocessor background, I personally judge the desirability
of a chip by the ability to put it into an MP architecture. One of the most
important features necessary for this is the ability to invalidate any
internal data caches from external hardware. The discussions that I have
seen on the i860 have not made it clear whether this is possible or not.
Given that the internal data cache is virtual, write-back, and two-way set
associative, I would guess that it is not possible. Does anyone know for
certain?

If it is impossible to invalidate the cache from external logic, the next
question is: how does the chip perform with the internal data cache
disabled? Also, is there any way to disable the cache without using the PTE
bit?
dgh%dgh@Sun.COM (David Hough) (03/07/89)
> The frcp and frsqr insns return an approximate reciprocal and reciprocal
> square root "with absolute significand error < 2^-7". Intel supplies
> routines for Newton-Raphson approximations that take 22 clocks (*almost*
> single precision) or 38 clocks (*almost* double precision), and the Intel
> i860 library provides true IEEE divide. [RISC design principles at work:
> divides are infrequent enough not to slow down/drop some other feature
> to provide divide hardware.]
Another RISC design principle that will be discovered by whoever tries to
build an engineering workstation out of this chip: you base your design
on few or small benchmarks at your peril.
Not every engineering computation can be reduced to linpack-style memory
intensive adds and multiplies. Division and sqrt are important in a lot
of realistic applications - spice for starters. And spice convergence
sometimes is a function of how clean the arithmetic is (at least for the
inappropriately popular mosamp2 benchmark). A system with a TI 8847
running at the same clock should beat an i860 on a number of realistic
applications.
None of which should be construed to imply that the i860 won't be very
good on the applications for which it was designed, such as graphics
processors.
David Hough
dhough@sun.com
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough
brooks@vette.llnl.gov (Eugene Brooks) (03/07/89)
In article <808@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:

>So initially, you must stick a few operations into the pipeline, throwing
>away whatever was there (writing it to f0), then you can pump through
>lots of data, then you have to stick in a few junk computations to get the
>last few results.

It would appear that this exposed pipeline would straightjacket future i860 implementations which might want to change the pipeline latency of the floating point units, or perhaps get the double multiply to do one result per clock cycle. Is this true?

Is the news software incompatible with your mailer too? brooks@maddog.llnl.gov, brooks@maddog.uucp, uunet!maddog.llnl.gov!brooks
davidsen@steinmetz.ge.com (William E. Davidsen Jr) (03/08/89)
One problem with any chip which requires aligned data is that performance suffers when addressing bytes, to the point that a program may become impractical. One of the people here checked his Sun-3 (68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster. This doesn't mean RISC is bad in a workstation, but it can have performance problems in software we take for granted.

A question: has anyone benchmarked nroff/troff on the VAXstation 3100 (VAX) and DECstation 3100 (MIPS)? Would some of the MIPS readers like to comment on performance in this area vs. 68020 or 80386?

-- bill davidsen (wedu@ge-crd.arpa) {uunet | philabs}!steinmetz!crdos1!davidsen "Stupidity, like virtue, is its own reward" -me
tim@crackle.amd.com (Tim Olson) (03/08/89)
In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:

| One problem with any chip which requires aligned data is that
| performance suffers when addressing bytes, to the point that a program
| may become impractical.

Bytes don't have alignment restrictions -- they are already byte-aligned ;-) Most new processors require data to be aligned on "natural" boundaries, i.e. bytes on byte boundaries, half-words on 16-bit boundaries, words on 32-bit boundaries, etc. This is simply to avoid having to read more than 1 word of memory on a load (with the associated trap headaches) and build up the requested data.

| One of the people here checked his Sun-3
| (68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
| This doesn't mean RISC is bad in a workstation, but it can have
| performance problems in software we take for granted.

What was the dataset and the specific machine types? On "normal" looking input, I find:

Sun 3/160:
	snap3 time troff -t 2.t > /dev/null
	5.6u 0.1s 0:05 99% 0+184k 0+8io 0pf+0w

Sun 4/110:
	crackle2 time troff -t 2.t > /dev/null
	2.2u 0.1s 0:03 69% 0+408k 1+8io 0pf+0w

Which shows the 4/110 2.5x faster than the 3/160.

-- Tim Olson  Advanced Micro Devices  (tim@crackle.amd.com)
mash@mips.COM (John Mashey) (03/08/89)
In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:

>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical. One of the people here checked his Sun-3
>(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
>This doesn't mean RISC is bad in a workstation, but it can have
>performance problems in software we take for granted.
>
>A question: has anyone benchmarked nroff/troff on the VAXstation 3100
>(VAX) and DECstation 3100 (MIPS)? Would some of the MIPS readers like to
>comment on performance in this area vs. 68020 or 80386?

Something's wrong somewhere: I'd expect a Sun4/200 to be 2X or so faster than a Sun3/200; I've never seen anything where a Sun-3 would be 5X faster. The MIPS-based things run them at the rates you'd expect; there's an nroff benchmark that's been in the Performance Brief forever, and it also has Sun3/4 numbers.

I do NOT think there's a penalty for addressing bytes: both of these machines have full support for storing and [signed/unsigned] loading bytes. There is a penalty for accessing unaligned words, in both cases, although R2000s have special instructions to mitigate the penalty. Anyway, please get your friend to supply some data, because it sounds wrong.

-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
guy@auspex.UUCP (Guy Harris) (03/08/89)
>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical.

I don't think that's true. My handy-dandy Cypress CY7C600 Family Users Guide, for the Cypress SPARC implementation, says that LDSB (LoaD Signed Byte), LDSH (LoaD Signed Halfword - 16 bits), LDUB (LoaD Unsigned Byte), LDUH (obvious), and LD (LoaD word - 32 bits) all take 2 cycles. My handy-dandy MIPS R2000 RISC Architecture manual, alas, has no timings such as that - after all, it's an *architecture* manual, not a manual for some particular *implementation* - but I'd be *very* surprised if byte load/store operations were so much slower that "a program (such as 'troff') may become impractical". (My expectation is that they're no slower, just as on SPARC.)

Are you, perhaps, thinking of word-addressable machines, and under the impression that not only do RISC machines tend to require, say, 4-byte alignment of 4-byte quantities, but that they can't deal with quantities shorter than 4 bytes? That's simply not true of the RISC machines with which I'm familiar. BTW, there exist CISC machines that require alignment, as well; as I remember, all but the most recent AT&T WE32K chips require it.

>One of the people here checked his Sun-3 (68020) against his Sun-4
>(SPARC). The Sun-3 ran troff about 5x faster.

The only three explanations I can imagine for that, offhand, are:

1) he's got the two figures backwards; the Sun-4 was ~5x faster than the Sun-3;

2) the figures are real time, not CPU time, and something else is interfering;

3) "troff" is floating-point intensive, and the Sun-4 in question has no FPU (e.g., a 4/110 with no FPU).

Explanation 3) falls by the wayside rather quickly; I grepped for "float" and "double" throughout the code and didn't find it. This leaves 1) or 2); is there one I missed?

I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a Sun-4/260 with 32MB memory, both running 4.0. Here are the times:

Sun-4/260:
	auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
	auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
	24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w

Sun-3/50:
	bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
	bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
	120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w

The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that example! Could 1) be the correct explanation?
peter@ficc.uu.net (Peter da Silva) (03/08/89)
In article <13322@steinmetz.ge.com>, davidsen@steinmetz.ge.com (William E. Davidsen Jr) writes:

> One problem with any chip which requires aligned data is that
> performance suffers when addressing bytes, to the point that a program
> may become impractical.

I guess I'm a bit dense today, but why? Takes the same amount of work to fetch <x> bits over an <n> x <x> bit-wide bus either way. Are you confusing alignment requirements with word addressing? After all, I don't recall the PDP-11 or 68000 having problems dealing with bytes. At best you lose a little data compression...

-- Peter da Silva, Xenix Support, Ferranti International Controls Corporation. Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180. Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.
henry@utzoo.uucp (Henry Spencer) (03/09/89)
In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:

>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical. One of the people here checked his Sun-3
>(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.

This is curious, since troff does little byte addressing. Doubly so since the SPARC does have byte addressing and byte-access instructions. More generally, are you not confusing alignment with accessing? It is quite possible to require aligned data (e.g. 32-bit quantities on 32-bit boundaries) while still having efficient byte addressing and accessing.

-- Welcome to Mars! Your | Henry Spencer at U of Toronto Zoology passport and visa, comrade? | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
bcase@cup.portal.com (Brian bcase Case) (03/09/89)
>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical. One of the people here checked his Sun-3
>(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
>This doesn't mean RISC is bad in a workstation, but it can have
>performance problems in software we take for granted.

Now wait a minute; can anyone substantiate this claim? It seems that something else must have been wrong: I might believe that the 020 could be faster for *unaligned* data, but not 5 times faster. The SPARC has load/store byte, halfword, and word instructions that work just fine as long as data is properly aligned. And I can't imagine the compiler unaligning things on purpose. Could you clarify this?
davidsen@steinmetz.ge.com (William E. Davidsen Jr) (03/09/89)
Several comments on my earlier posting:

(a) I was talking about the general case of machines which don't have byte addressing and fetch a byte using a three-step "load, shift, and" and store a byte doing something like "load, and, shift, or, store". I didn't mean to imply that any particular RISC architecture used that method. The Honeywell 6000 series was an example of not having direct byte addressing (yes, you can use a tally word, but building it is as slow as the ands & ors). If the load and store time for an arbitrary byte are the same as an aligned int, and if load/store byte is a single operation, then I would consider it byte addressable for the purposes of any program I want to run.

(b) several people posted results using big Sun4's and little Sun3's. I believe that the tests were done on a Sun3/280 with FPU and a small Sun4 (I don't know the model number). Since troff is FP intensive (at least on a VAX) that may be the difference.

(c) the point I was trying to make was that a RISC processor which is "N times faster" than some CISC processor is not going to have the same improvement in all cases. I'll post the actual results if I can get them from a rerun.

Boy are RISC people defensive!

-- bill davidsen (wedu@ge-crd.arpa) {uunet | philabs}!steinmetz!crdos1!davidsen "Stupidity, like virtue, is its own reward" -me
suitti@haddock.ima.isc.com (Stephen Uitti) (03/10/89)
In article <1133@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:

>>One problem with any chip which requires aligned data is that
>>performance suffers when addressing bytes, to the point that a program
>>may become impractical.
>
> [talk about instruction times being the same for byte/word/long
> accesses on SPARC, MIPS].

Byte accesses on the PDP-10 were slower - one had to set up a byte pointer and do special load-byte or load-byte-and-increment-the-pointer instructions. Still, bytes were any size from 1 bit to 36...

Also remember that even if an 8 bit byte access takes (about) the same time as a 32 bit word access, it still moves less data. I've had some code do its work using larger quantities for just this reason. Usually, the code is #ifdef'ed, so that the easier version can at least be read if not used. One can often do "vector bit" operations a word at a time. The whole "Duff's device" bcopy & memcpy discussions of a few months ago are at least partly based on this idea.

>BTW, there exist CISC machines that require alignment, as well; as I
>remember, all but the most recent AT&T WE32K chips require it.

The VAX doesn't require it - but don't do it. A 32 bit word reference to an odd address is real slow. That's why the C compiler there does so much word alignment. Even so, one would see a program that worked on a VAX that would die on a machine which would just plain forbid the operation. Data became unaligned, typically by writing them to disk and then reading them back in. The VAX would be slow for the operation (nobody cared), but other machines would yield bus errors.

It seems to me that if an architecture traps unaligned data references, the kernel can look at the instruction that faulted and make it appear to work via software. uVAX IIs implement all sorts of VAX instructions that just aren't in the hardware. Both VMS & flavors of UNIX do this (sometimes even correctly).
(Remember, DEC said these things would work, even though there are billions of them & the uVAX II CPU fits on a QBus board... and with a MB of RAM.) Almost no one uses these instructions, so who cares? If the compilers try to make things aligned, and if the operating system fixes things when botched, and if the operating system provides a way for the user (programmer) to detect that it happened, and how much, then everyone should be happy. I'd be willing to have unaligned data fetches work 100x slower if the overall architecture could be otherwise, say, twice as fast (because there was enough chip space for an I cache or FPU or something).

>>One of the people here checked his Sun-3 (68020) against his Sun-4
>>(SPARC). The Sun-3 ran troff about 5x faster.
> [attempted explanations]
>This leaves 1) or 2); is there one I missed?

I had one VAX 780 outperform another due to the system binaries for the program being different. Recompilation & cross running showed that the hardware was the same. Of course, the Sun 3 and Sun 4 are not binary compatible, and the original user probably doesn't have sources... I had one VAX 780 outperform another by 20% due to a ringing 9600 baud tty line. It had been that way for months - no one noticed...

I ran various "benchmarks" between uVAX IIs and Sun 4s. The range was about 2x to over 8x, averaging about 4x. I never got the 10 (VAX) MIPS figures that were commonly quoted. VAX 780s really are a little faster than uVAX IIs.

(aside:) In the olden days when 68000s were brand new, the EE dept at Purdue was considering getting a bunch of 68000s, with troff in ROM & some communication gear, and have troff run on the dedicated boxes. The 68000 could run troff at something like 90% the speed of the 780, which was likely to be much more CPU than a user could get out of the 780s there. I remember wondering if the I/O would kill the 780s, making the whole exercise moot...
Remote execution (load sharing) on the local ethernet was implemented and it did work pretty well, technically (politically was another matter). I had thought that having a pre-built (buildcore) "troff -ms", etc., would save them more. I recall it taking troff something like 20 seconds to do the initialization for the first .PP for the "-ms" macros. Pretty gross if you ask me (don't ask).

>I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a
>Sun-4/260 with 32MB memory, both running 4.0. Here are the times:
>
>Sun-4/260:
>	auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>	24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
>	auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>	24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w
>
>Sun-3/50:
>	bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>	118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
>	bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>	120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w
>
>The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that
>example! Could 1) be the correct explanation?

The VAX 780 here running 4.3 BSD had this to say:

	haddock% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	troff: unrecognized -t option
	0.1u 0.0s...

This is much faster than the Suns. It just optimized the operation a bit, being an "experienced VAX" (as opposed to a "used VAX"). The Compaq 386/25 sitting here was even faster, saying something like "troff command not found". I'm unfamiliar with the "-t" option.

	haddock% time troff -man /usr/man/man1/csh.1 >/dev/null
	90.8u 6.4s 36% 95+201k 59+15io 24pf+0w

I thought Sun 3's were lots faster than 780s. Maybe more expensive Sun 3s are faster... Of course, my /usr/man/man1/csh.1 could be different, though it is probably at least real similar. Also, I think 'troff' is one of those applications that has odd behaviour compared to just about anything else one would run.
It should be pointed out (if it hasn't been already) that troff doesn't do nearly the byte accesses that one would think it should. Still, troff is a great benchmark for sites that do a lot of troff. Stephen Uitti, suitti@ima.ima.isc.com (near harvard.harvard.edu)
rob@tolerant.UUCP (Rob Kleinschmidt) (03/10/89)
In article <12000@haddock.ima.isc.com>, suitti@haddock.ima.isc.com (Stephen Uitti) writes:

> In article <1133@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
> > [attempted explanations]
> >This leaves 1) or 2); is there one I missed?

For the sake of argument... Under some very weird circumstances, one might be able to demonstrate better cache utilization on a byte aligned vs. "naturally" aligned machine. Assuming multi-byte cache lines, one could argue some small improvement because of the lack of padding bytes within structures etc. I don't believe this for a minute, and assume that any small gain made would be offset by the extra cpu access cycles, but it seemed like a thought worth mentioning.
mash@mips.COM (John Mashey) (03/10/89)
In article <13328@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:

>(b) several people posted results using big Sun4's and little Sun3's. I
>believe that the tests were done on a Sun3/280 with FPU and a small Sun4
>(I don't know the model number). Since troff is FP intensive (at least
>on a VAX) that may be the difference.

troff never used to use FP; has it changed recently?

-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
henry@utzoo.uucp (Henry Spencer) (03/11/89)
In article <13328@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:

>(a) I was talking about the general case of machines which don't have
>byte addressing and fetch a byte using a three step "load, shift, and"
>and store a byte doing something like "load, and, shift, or, store"...

You're still confusing two issues, or at least using confusing terminology. There are, in fact, three separate issues here:

Alignment:	Are N-byte objects required to be aligned on N-byte boundaries?
Byte addressing:	Do the pointers point to bytes, as opposed to words?
Byte accessing:	Can the processor read/write bytes, as opposed to words?

It is quite possible to have byte addressing without byte accessing, as on the current 29000: the pointers do point to bytes, but without external hardware help, the current processor only does word memory accesses. Nobody in his right mind designs a processor without byte addressing today. Byte accessing is a tradeoff that depends on hardware constraints and how the processor is to be used (Cray, for example, considers it unimportant). Alignment is coming back in fashion for several reasons, notably simplicity of hardware and easier exception handling (because aligned objects cannot span page boundaries).

>(b) several people posted results using big Sun4's and little Sun3's. I
>believe that the tests were done on a Sun3/280 with FPU and a small Sun4...

Did the Sun 4 have enough memory not to page itself to death? Especially if it was running SunOS 4, that's not a trivial issue. Were the troffs compiled the same way? Modern troffs have an option for keeping the temporary file in memory, which can have a major effect on performance. Did you think to normalize for memory-system performance? The 280 has a big fast cache.

>... Since troff is FP intensive (at least on a VAX)...

Are you sure??? That doesn't sound like the troff I know.

>(c) the point I was trying to make was that a RISC processor which is "N
>times faster" than some CISC processor is not going to have the same
>improvement in all cases.

Nobody would argue with this; why are you making a big deal out of it?

>Boy are RISC people defensive!

Even non-RISC people think that apparent absurdities are worth challenging.

-- Welcome to Mars! Your | Henry Spencer at U of Toronto Zoology passport and visa, comrade? | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
guy@auspex.UUCP (Guy Harris) (03/11/89)
>(a) I was talking about the general case of machines which don't have
>byte addressing and fetch a byte using a three step "load, shift, and"
>and store a byte doing something like "load, and, shift, or, store". I
>didn't mean to imply that any particular RISC architecture used that
>method.

Most of them *don't*. Therefore, the speculation about them in your posting was incorrect.

>The Honeywell 6000 series was an example of not having direct
>byte addressing

...at least if it didn't have the extended instruction box.

>If the load and store time for an arbitrary byte are the same as an
>aligned int, and if load/store byte is a single operation, then I
>would consider it byte addressable for the purposes of any program I
>want to run.

Well, then, the SPARC and almost certainly the MIPS are byte-addressable.

>(b) several people posted results using big Sun4's and little Sun3's. I
>believe that the tests were done on a Sun3/280 with FPU and a small Sun4
>(I don't know the model number). Since troff is FP intensive (at least
>on a VAX) that may be the difference.

How can it be FP-intensive if it has no floating point numbers in it? I found no occurrences of "float" or "double" in the 4.3BSD "nroff"/"troff" source, and the SunOS 4.0 "nroff"/"troff" is basically derived from the 4.3BSD version.

>(c) the point I was trying to make was that a RISC processor which is "N
>times faster" than some CISC processor is not going to have the same
>improvement in all cases.

I think everybody realizes that there isn't always a single number "N" that can represent the speed difference between one processor and another - regardless of whether one is a RISC and one a CISC, both are RISCs, or both are CISCs. Read a MIPS Performance Brief sometime.

>I'll post the actual results if I can get them from a rerun.

Post both user-mode and system-mode CPU time. If, by some chance, the SunOS 4.0 "troff" has had floating-point computations inserted into it for some reason, and the Sun-4 test was done on a 4/110 with no FPU, then the system time should be higher, since the floating-point emulation is done in the kernel (to avoid the overhead of going back to user mode).

>Boy are RISC people defensive!

What do you expect? You made a completely inappropriate speculation about RISCs (namely that they're generally word-addressable, and do byte operations inefficiently), and made a hard-to-believe claim about the relative performance of a Sun-3 and a Sun-4 on "troff". I'd certainly expect them to reply, and point out that the first is simply untrue and the second doesn't match their experiences.
bb@wjh12.harvard.edu (Brent Byer) (03/11/89)
In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:

> .... One of the people here checked his Sun-3
>(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.

One of the signs of competence in a good engineer is that, while (s)he might not always know the right answer, (s)he should always be able to detect an obviously wrong one. Bill Davidsen has presented data which is bogus by at least a factor of 10; a question, Bill: don't you think you should have verified this data *before* trying to use it to support your claims?

>A question: has anyone benchmarked nroff/troff on the VAXstation 3100
>(VAX) and DECstation 3100 (MIPS)? Would some of the MIPS readers like to
>comment on performance in this area vs. 68020 or 80386?

n/troff could be very good benchmark candidates, but there are too many versions, all based on proprietary source code. I frequently use our version of troff as a benchmark of various systems, but I can be certain that I use the *same* sources & test data on each target. The only surprising results I have seen are that the VAX architecture seems to perform about 15% poorer with troff, and the i286 does about 10% better (small model), than their respective results with other tests would suggest. [E.G.: an 8MHz, 0ws PC/AT clone ran troff 25% faster than a 780]

Brent Byer (att!ihesa!textware!brent  bb@wjh12.harvard.edu)
bb@wjh12.harvard.edu (Brent Byer) (03/12/89)
In article <12000@haddock.ima.isc.com> suitti@haddock.ima.isc.com (Stephen Uitti) writes:

>In article <1133@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>
>>> [Provoked by this bogus claim by Bill Davidsen:]
>>>One of the people here checked his Sun-3 (68020) against his Sun-4
>>>(SPARC). The Sun-3 ran troff about 5x faster.
>
>> [from Guy]
>>I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a
>>Sun-4/260 with 32MB memory, both running 4.0. Here are the times:
>>
>>Sun-4/260:
>>	auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>>	24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
>>	auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>>	24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w
>>
>>Sun-3/50:
>>	bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>>	118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
>>	bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>>	120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w
>>
>>The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that example!
>
> [Stephen Uitti:]
>The VAX 780 here running 4.3 BSD had this to say:
>
>	haddock% time troff -man /usr/man/man1/csh.1 >/dev/null
>	90.8u 6.4s 36% 95+201k 59+15io 24pf+0w
>
>I thought Sun 3's were lots faster than 780s. Maybe more
>expensive Sun 3s are faster...

All Sun-3's will run troff faster than any 780, providing that one starts with the *same* troff sources and the *same* test data. In the comparison (sic) alluded to above, Guy is using the "old" troff (otroff, from the C/A/T era), but Stephen used a DWB-based troff. The "-t" option is the giveaway. For comparison, if we rate otroff as 1.0, a DWBv1 troff will get about 1.15, and DWBv2 troff gets 1.7. [We have a souped-up DWBv2-based troff that gets 2.2; anybody wanna race?]

> .... Of course, my /usr/man/man1/csh.1 could be different, ...

And, Stephen is also using different, and less demanding, input data. The csh.1 with SunOS4.0 is typographically more complex than either of those in SunOS3.x or 4.3BSD. All other things equal, it will require about 80-90% more CPU time.

>Also, I think 'troff' is one of those applications that has odd
>behaviour compared to just about anything else one would run.

This is a *false* supposition.

>It should be pointed out (if it hasn't been already) that troff
>doesn't do nearly the byte accesses that one would think it should do.

True. Other than fetches from its input buffer and stores into its output buffer, there are few. Remember, a troff "character" has a much richer personality than just its code value. In otroff, it was a 16-bit datum; in DWB, a 32-bit.

>Still, troff is a great benchmark for sites that do a lot of troff.

Here, it is better to generalize: XXX is a great benchmark for sites that do a lot of XXX. (But, you all know that.)

------ Brent Byer (bb@wjh12.harvard.edu or att!ihesa!textware!brent)

12-year old nephew: Uncle Bill, that steamboat race was the biggest gamble in the world.
W. C. Fields: Son, that was nothing. I remember when Lady Godiva put everything she had on a horse.
davidsen@steinmetz.ge.com (William E. Davidsen Jr) (03/14/89)
In article <340@wjh12.harvard.edu> bb@wjh12.UUCP (Brent Byer) writes:

| In article <13322@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
| > .... One of the people here checked his Sun-3
| >(68020) against his Sun-4 (SPARC). The Sun-3 ran troff about 5x faster.
|
| One of the signs of competency in a good engineer is that, while (s)he
| might not always know the right answer, (s)he should always be able
| to detect an obviously wrong one. Bill Davidsen has presented data
| which is bogus by at least a factor of 10; a question, Bill:
| Don't you think you should have verified this data *before*
| trying to use it to support your claims?

As far as I know, the only claim I made was that the results were reported to me as I stated, leading to an interest in the ability to access bytes. Someone pointed out that byte addressing and byte access aren't quite the same thing; thanks. I mentioned the case in which a byte is extracted by doing a word load and isolating a byte in the word. Ten people told me no one does that any more; four people said "yes, the 29000 does just that". At this point I really don't care.

I mentioned the Honeywell 6000; someone immediately pointed out the EIS instruction set. By adding a coprocessor I can make any CPU have any instructions, but I'm not sure that justifies claiming something as a feature.

As for floating point in troff, I timed it on a VAX with and without FPU and it sure runs faster with. One person claimed that the FPU makes integer divide faster, and two others said it improves context switching. I checked four troff sources; the BSD/V7 troffs don't have f.p., one vendor-supplied version does.

-- bill davidsen (wedu@ge-crd.arpa) {uunet | philabs}!steinmetz!crdos1!davidsen "Stupidity, like virtue, is its own reward" -me
elind@ircam.UUCP (Eric Lindemann) (03/14/89)
Can somebody clarify the following? In the "i860 overview (very very long)" w-colin writes:

> Everything's single-cycle, but here's what can interlock:
>
> i-cache miss: given in terms of pin timing, plus two cycles if d-cache miss
> in progress simultaneously.
>
> d-cache miss (on load): again, pin timing, but it seems to be "clocks from
> ADS asserted to READY# asserted"
>
> fld miss: d-cache miss plus one clock

I don't think this is exactly what the Intel literature says. I read it more like this: the following "freeze conditions" exist which will cause a delay:

* Reference to destination of load instruction that misses
* fld (load to float register) miss

In other words, you can fire off a "load" instruction (which must mean a load to an integer register) and continue executing without delay as long as you don't reference the destination register of this "load" instruction. The fact that there may or may not be a cache miss should only delay the availability of the data in the register without necessarily interrupting execution. A cache miss on an fld instruction, however, will apparently always cause a "freeze", or delay in execution, whether or not the fld destination register is referenced by a subsequent instruction.

Can this be? Is there some basic difference in interlock behavior between integer and floating point register files? If so, this can make a big difference in throughput.
friedman@rd1632.Dayton.NCR.COM (Lee G. Friedman) (03/24/89)
                      CALL FOR PAPERS AND REFEREES

        HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES - 23

    Processors and Systems Architecture: The Converging Design Space

               KAILUA-KONA, HAWAII - JANUARY 2-5, 1990

The Architecture track of HICSS-23 will contain a special set of papers
focusing on a broad selection of topics in the area of Processor and
Systems Architectures.

Given the current and predicted future state of technology available for
computers, there is an explosion of new architectural features being
employed in processors to serve the needs of the systems architecture
community. Furthermore, systems architects, taking advantage of these
features, look to add other features currently not available in
off-the-shelf silicon, and the cycle continues. Several questions arise:

1. Does the behavior of the built-in features fulfill the requirement as
   expected from the systems architect's point of view?
2. How does a processor architect design processors for systems?
   a. How does the processor architect decide on the features to be
      included?
   b. What system-level assumptions, both hardware and software, does the
      processor architect make in developing a processor with a set of
      features?
   c. What
3. How does the systems architect employ the processor?
   a. How does the system architect evaluate a processor for use in a
      particular system?
   b. What tradeoffs are made in using certain processor technology?
   c. What impact does a processor with features have on the system?

The goal of this day-long program (called a minitrack) at the HICSS-23
conference is to show, through the papers, the convergence and divergence
of processor and systems architecture: that is, where things go right and
where there is still a disparity between the chips, boards, boxes, and
software (what we call the semantic gap).

The format for the minitrack is as follows: the intention is to get three
good papers on three important processor architectures.
These papers should be a detailed description of the architecture of the
chip (or chips), as well as addressing the questions above. Each of the
three processor architecture papers will be followed by two systems
architecture papers. These papers should describe the system and how the
processor in question was used to provide the end solution; again, they
should also address the questions above. Thus, in total, we will have nine
papers. This is followed by an open discussion of the convergence and
divergence of processor and systems architecture.

Papers are invited that may cover practical applications, research
machines, or theoretical work. Papers can deal with systems and VLSI
technologies. Those papers selected for presentation will appear in the
Conference Proceedings, which are published by the Computer Society of the
IEEE. HICSS-23 is sponsored by the University of Hawaii in cooperation
with the ACM, the Computer Society, and the Pacific Research Institute for
Information Sciences and Management (PRIISM).

INSTRUCTIONS FOR SUBMITTING PAPERS:

Manuscripts should be 22-26 typewritten, double-spaced pages in length. Do
not send submissions that are significantly shorter. Papers must not have
been previously presented or published, nor currently submitted for
journal publication. Each manuscript will be put through a rigorous
refereeing process. Manuscripts should have a title page that includes the
title of the paper, full name of the author(s), affiliation(s), complete
physical and electronic address(es), telephone number(s), and a 300-word
abstract of the paper.

DEADLINES

* A 300-word abstract is due by April 15, 1989
* Feedback to authors concerning abstracts by May 5, 1989
* Six copies of the manuscript are due by June 1, 1989
* Notification of accepted papers by August 15, 1989
* Accepted manuscripts, camera-ready, due by September 23, 1989

SEND SUBMISSIONS AND QUESTIONS TO

Lee G. Friedman
NCR Corporation
1601 S. Main Street
MS PCD-5
Dayton, OH 45479
(513) 445-3594
e-mail: lee.friedman@dayton.ncr.com