[comp.sys.intel] i860 CPU information

clif@intelca.intel.com (Ken Shoemaker) (03/04/89)

The following information is taken from the i860 TM 64-Bit Microprocessor
data sheet order number 240296-001.  I hope that this posting does not
generate a meta-discussion about appropriateness of the posting.  I
believe that it contains more technical information than a typical
comp.arch posting.  

We, Intel, will try to answer questions regarding the architecture.
However, due to work pressures and the need for approval prior to posting
non-technical information, there will probably be a delay.


			i860 64-bit Microprocessor

Highlights:

Parallel Architecture:  3 Instructions per Clock
	- one integer or control instruction
	- up to two Floating-Point Instructions

High Performance Design
	- 33.3/40 MHz Clock Rate
	- 80 Peak Single Precision MFLOPs
	- 60 Peak Double Precision MFLOPs
	- 64-bit External Data Bus
	- 64-bit Internal Instruction Cache Bus
	- 128-bit Internal Data Cache Bus	

Measured Performance with Current Compilers
	- 24 Megawhetstones (40 MHz)
	- 83K Dhrystones (40 MHz)

Highly Integrated
	- 32/64-bit Pipelined Floating-Point Adder and Multiplier
	- 32-bit Integer and Control Unit
	- 64-Bit 3-D Graphics Unit
	- Paging Unit with TLB
	- 4K Byte Instruction Cache
	- 8K Byte Data Cache

The core execution unit controls overall operation of the i860 TM CPU.
The core unit executes load, store, integer, bit, and control-transfer 
operations, and fetches instructions for the floating-point unit 
as well.  A set of 32 32-bit general-purpose registers is provided 
for the manipulation of integer data.  Load and store instructions move 
8-, 16-, and 32-bit data to and from these registers.  Its full set of
integer, logical, and control-transfer instructions gives the core
unit the ability to execute complete systems software and
applications programs.  A trap mechanism provides rapid response
to exceptions and external interrupts.  Debugging is supported by
the ability to trap on data or instruction reference. 

The floating-point hardware is connected to a separate set of 
floating-point registers, which can be accessed as 16 64-bit registers,
or 32 32-bit registers.  Special load and store instructions can 
also access these same registers as  8 128-bit registers.  All 
floating-point instructions use  these registers as their source and 
destination operands.
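
As a rough illustration of the three register views described above, the
sketch below models the floating-point register file as 128 raw bytes and
slices it three ways.  The names regfile and view are hypothetical, and the
little-endian packing and the byte offsets are assumptions made for the
example; the data sheet excerpt does not say how the hardware numbers or
pairs the registers.

```python
import struct

# Hypothetical model: the floating-point register file as 128 raw bytes.
# Little-endian packing is an assumption made for this illustration.
REGFILE_BYTES = 32 * 4  # 32 registers x 4 bytes each

regfile = bytearray(REGFILE_BYTES)

# Write a double-precision value into the second 64-bit slot (bytes 8..15).
struct.pack_into("<d", regfile, 8, 1.5)


def view(width_bytes):
    """Return the register file split into chunks of width_bytes each."""
    return [bytes(regfile[i:i + width_bytes])
            for i in range(0, REGFILE_BYTES, width_bytes)]


views = {bits: view(bits // 8) for bits in (32, 64, 128)}
counts = {bits: len(chunks) for bits, chunks in views.items()}
print(counts)  # {32: 32, 64: 16, 128: 8}

# The same 8 bytes are visible through every view.
as_64 = views[64][1]
as_32_pair = views[32][2] + views[32][3]
as_128_half = views[128][0][8:16]
print(as_64 == as_32_pair == as_128_half)
```

The point is only that the three views share one storage area, so a value
written through one view is visible through the others.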

The floating-point control unit controls both the floating-point adder 
and the floating-point multiplier, issuing instructions, handling 
all source and result exceptions, and updating status bits in the 
floating-point status register.  The adder and multiplier can operate 
in parallel, producing up to two results per clock.  The 
floating-point data types, floating-point instructions, and exception 
handling all support the IEEE Standard for Binary 
Floating-Point Arithmetic (ANSI/IEEE Std 754-1985).  

The floating-point adder performs addition, subtraction, comparison, 
and conversions on 64- and 32-bit floating-point values.  An adder 
instruction executes in three to four clocks; however, in pipelined mode,
a new result is generated every clock.

The floating-point multiplier performs floating-point and 
integer multiply and floating-point reciprocal operations on 64- and 
32-bit floating-point values.  A multiplier instruction executes 
in three to four clocks; however, in pipelined mode, a new result
can be generated every clock for single-precision and every other
clock for double precision.
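
The peak MFLOP figures in the highlights appear to follow from these
pipelined issue rates.  The sketch below works through the arithmetic; the
function names are hypothetical, and it assumes the peak is simply the
clock rate times the combined adder and multiplier results per clock at
40 MHz.

```python
def peak_mflops(clock_mhz, adder_per_clock, multiplier_per_clock):
    """Peak rate when the adder and multiplier both run fully pipelined."""
    return clock_mhz * (adder_per_clock + multiplier_per_clock)


def pipelined_clocks(n_ops, latency, issue_interval=1):
    """Clocks to finish n_ops back-to-back operations in one pipeline."""
    return latency + (n_ops - 1) * issue_interval


CLOCK_MHZ = 40.0

# Single precision: one add result and one multiply result per clock.
single = peak_mflops(CLOCK_MHZ, 1.0, 1.0)

# Double precision: one add per clock, one multiply every other clock.
double = peak_mflops(CLOCK_MHZ, 1.0, 0.5)

print(single, double)  # 80.0 60.0

# 100 pipelined adds with a three-clock latency finish in 102 clocks,
# against 300 clocks if each add waited for the previous one.
print(pipelined_clocks(100, 3))  # 102
```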

The graphics unit has special integer logic that supports
three-dimensional drawing in a graphics frame buffer, with color
intensity shading and hidden surface elimination via the Z-buffer
algorithm.  The graphics unit recognizes the pixel as an 8-, 16-,
or 32-bit data type.  It can compute individual red, blue, and
green color intensity values within a pixel; but it does so with
parallel operations that take advantage of the 64-bit internal
word size and 64-bit external bus.  The graphics features of the
i860 microprocessor assume that the surface of a solid object is drawn 
with polygon patches whose shapes approximate the original object.
The color intensities of the vertices of the polygon and their
distances from the viewer are known, but the distances and
intensities of the other points must be calculated by 
interpolation.  The graphics instructions of the 860 CPU directly
aid such interpolation.
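
To make the interpolation and Z-buffer idea concrete, here is a minimal
software sketch for one horizontal span of a polygon.  It is not the i860
instruction sequence: draw_span and lerp are hypothetical names, and the
convention that a smaller z means nearer to the viewer is an assumption.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def draw_span(x0, x1, z0, z1, c0, c1, zbuf, frame):
    """Interpolate depth and (r, g, b) intensity from x0 to x1 inclusive."""
    n = x1 - x0
    for x in range(x0, x1 + 1):
        t = 0.0 if n == 0 else (x - x0) / n
        z = lerp(z0, z1, t)
        if z < zbuf[x]:  # nearer than what is stored, so the pixel is visible
            zbuf[x] = z
            frame[x] = tuple(round(lerp(a, b, t)) for a, b in zip(c0, c1))


WIDTH = 8
zbuf = [float("inf")] * WIDTH
frame = [(0, 0, 0)] * WIDTH

# A shaded span at depth 10, then a white span at depth 5 that overlaps it.
draw_span(0, 4, 10.0, 10.0, (0, 0, 0), (200, 100, 40), zbuf, frame)
draw_span(2, 6, 5.0, 5.0, (255, 255, 255), (255, 255, 255), zbuf, frame)

print(frame[1])  # (50, 25, 10)
print(frame[2])  # (255, 255, 255)
```

Pixel 1 keeps the interpolated shade from the far span, while pixels 2
through 6 are overwritten by the nearer span.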

The paging unit implements protected, paged, virtual memory via a
64-entry, four-way set-associative memory called the TLB
(Translation Lookaside Buffer).  The paging unit uses the TLB to
perform the translation of logical address to physical address,
and to check for access violations.  The access protection scheme
employs two levels of privilege:  user and supervisor.

{Editor's note: the i860 CPU's paging mechanism is the same
as the 386 CPU's.}
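
Taking the editor's note at face value, a 386-style translation splits a
32-bit address into a 10-bit page directory index, a 10-bit page table
index, and a 12-bit offset within a 4 Kbyte page.  The sketch below shows
that split and the set count implied by a 64-entry, four-way TLB.  The
function names are hypothetical, and choosing the TLB set from the low bits
of the virtual page number is an assumption, not something the data sheet
excerpt states.

```python
PAGE_SHIFT = 12                      # 4 Kbyte pages, as on the 386
TLB_ENTRIES = 64
TLB_WAYS = 4
TLB_SETS = TLB_ENTRIES // TLB_WAYS   # 16 sets of 4 entries each


def split_address(addr):
    """Split a 32-bit address into (directory index, table index, offset)."""
    directory_index = (addr >> 22) & 0x3FF
    table_index = (addr >> 12) & 0x3FF
    offset = addr & 0xFFF
    return directory_index, table_index, offset


def tlb_set(addr):
    """Assumed set selection: low bits of the virtual page number."""
    return (addr >> PAGE_SHIFT) % TLB_SETS


print(TLB_SETS)                                       # 16
print([hex(f) for f in split_address(0x12345678)])    # ['0x48', '0x345', '0x678']
print(tlb_set(0x12345678))                            # 5
```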

The instruction cache is a two-way set-associative memory of four
Kbytes, with 32-byte blocks.  It transfers up to 64 bits per
clock (266 Mbyte/sec at 33.3 MHz). 

The data cache is a two-way set-associative memory of eight
Kbytes, with 32-byte blocks.  It transfers up to 128 bits per
clock (533 Mbyte/sec at 33.3 MHz).  The 860 CPU normally uses writeback 
caching, i.e. memory writes update the cache (if applicable) 
without necessarily updating memory immediately;
however, caching can be inhibited by software where necessary. 
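
The cache figures above can be checked with a little arithmetic.  The
sketch below derives the number of sets from size, associativity, and block
size, and the transfer rates from bus width and clock; the function names
are hypothetical, and 1 Mbyte is taken to mean 10^6 bytes.

```python
def cache_sets(size_bytes, ways, block_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * block_bytes)


def bandwidth_mbytes_per_sec(bits_per_clock, clock_mhz):
    """Peak transfer rate, with 1 Mbyte taken as 10**6 bytes."""
    return bits_per_clock / 8 * clock_mhz


icache_sets = cache_sets(4 * 1024, 2, 32)   # 64 sets
dcache_sets = cache_sets(8 * 1024, 2, 32)   # 128 sets

icache_bw = bandwidth_mbytes_per_sec(64, 33.3)    # about 266
dcache_bw = bandwidth_mbytes_per_sec(128, 33.3)   # about 533

print(icache_sets, dcache_sets)
print(round(icache_bw), round(dcache_bw))
```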

The bus and cache control unit performs data and instruction
accesses for the core unit.  It receives cycle requests and
specifications from the core unit, performs the data-cache or
instruction-cache miss processing, controls TLB translation, and
provides the interface to the external bus.  Its pipelined
structure supports up to three outstanding bus cycles.  

Clif Purkiser
Intel Corp, Santa Clara Microcomputer Division

mark@mips.COM (Mark G. Johnson) (03/05/89)

Thanks to Clif Purkiser for an informative posting! <208@intelca.intel.com>
did raise a question, though:

>Highly Integrated
>	- 32/64-bit Pipelined Floating-Point Adder and Multiplier
>	- 32-bit Integer and Control Unit
>	- 64-Bit 3-D Graphics Unit
>	- Paging Unit with TLB
>	- 4K Byte Instruction Cache
>	- 8K Byte Data Cache


Perhaps the list above is simply incomplete; by an omission it leads to
speculations like:
	1.  Is there a Floating-Point Divider in hardware?
	2.  Are there Floating-Point Divide instructions (IEEE 32b & 64b)
		in the 80860 architecture?
	3.  How many clocks does it take to do an IEEE 32b divide?  64b?
Thanks.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	...!decwrl!mips!mark	(408) 991-0208

mash@mips.COM (John Mashey) (03/05/89)

In article <208@intelca.intel.com> clif@intelca.intel.com (Ken Shoemaker) writes:
>The following information is taken from the i860 TM 64-Bit Microprocessor
>data sheet order number 240296-001.  I hope that this posting does not
>generate a meta-discussion about appropriateness of the posting....

Appropriate posting; thanx; it's much better than seeing random rumors and
misinformation, and there's plenty of technical content.

There are a few questions though: I suspect this was just an oversight,
as somebody MUST know the answers, but 2 of the numbers need clarification,
or they are almost meaningless:

>Measured Performance with Current Compilers
I assume this was measured on real hardware, so are you allowed to say what
the memory system looks like?  i.e., read latency and write retirement rates,
for example?  (of course, for these particular benchmarks it probably doesn't
matter too much, since their cache miss rates are negligible :-)

>	- 24 Megawhetstones (40 MHz)
		1)  Was this single precision or double precision?
		2)  Whichever it was, what was the other one?
>	- 83K Dhrystones (40 MHz)
		1) Which version: 1.1 or 2.1?
			I assume this wasn't 1.0, whose numbers are 15% better
			than 1.1.
		2) What level of optimization?
			any inlining?
			any unusual options? (like, for example: the manual
			shows normal use of a frame pointer, which costs
			4 cycle/call, but could be suppressed if you know
			things like alloca won't be used.  Since a typical
			32-bit RISC would use 30-40 cycles/call, suppressing
			the fp-manipulation gains about 10%.)
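
The "about 10%" figure can be reproduced from the numbers quoted above.
This is only the arithmetic as read from the post: call_saving is a
hypothetical helper, and the result is the saving per call, not a
prediction for the whole benchmark.

```python
def call_saving(frame_pointer_cycles, cycles_per_call):
    """Fraction of per-call cost removed by dropping the frame pointer."""
    return frame_pointer_cycles / cycles_per_call


low = call_saving(4, 40)    # 0.10 at 40 cycles/call
high = call_saving(4, 30)   # about 0.133 at 30 cycles/call

print(f"{low:.1%} to {high:.1%}")
```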
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

seanf@sco.COM (Sean Fagan) (03/06/89)

In article <14616@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Perhaps the list above is simply incomplete; by an omission it leads to
>speculations like:
>	1.  Is there a Floating-Point Divider in hardware?

No.

>	2.  Are there Floating-Point Divide instructions (IEEE 32b & 64b)
>		in the 80860 architecture?

No.

>	3.  How many clocks does it take to do an IEEE 32b divide?  64b?

Depends.  I think it might be somewhere around 30-40 (40-50?), but I'm not
sure.  It doesn't have divide in hardware; what it has is reciprocal
approximations (1.0/x), so you do that (plus a little bit to get rid of the
errors), then multiply.
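
The general technique can be sketched in software: start from a crude
reciprocal, refine it with Newton-Raphson steps, then multiply.  This is
not the i860 instruction sequence.  crude_reciprocal is a hypothetical
stand-in for the hardware approximation, its 8-bit accuracy is an
assumption, the divisor is assumed finite and nonzero, and the result is
not guaranteed to be correctly rounded in the IEEE sense.

```python
import math


def crude_reciprocal(b, bits=8):
    """Stand-in for a hardware approximation: 1/b to roughly `bits` bits."""
    mantissa, exponent = math.frexp(b)   # b = mantissa * 2**exponent
    scale = 2.0 ** bits
    truncated = math.floor(scale / mantissa) / scale
    return math.ldexp(truncated, -exponent)


def divide(a, b, steps=3):
    """Compute a / b as a * (1 / b), refining the reciprocal first."""
    x = crude_reciprocal(b)
    for _ in range(steps):
        # Newton-Raphson step: the relative error roughly squares each time.
        x = x * (2.0 - b * x)
    return a * x


print(crude_reciprocal(3.0))   # 0.3330078125, good to about 3 digits
print(divide(1.0, 3.0))
```

With an 8-bit seed, the error drops to roughly 16, 32, and then 64 bits
over three steps, which is why a handful of multiplies and subtracts is
enough for double precision.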

Kinda like a Cray, right? 8-)

-- 
Sean Eric Fagan  | "What the caterpillar calls the end of the world,
seanf@sco.UUCP   |  the master calls a butterfly."  -- Richard Bach
(408) 458-1422   | Any opinions expressed are my own, not my employers'.