[comp.arch] N-10 info

neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (02/27/89)

 
 
Ok, I have a 5 page article describing the beast in front of me:
 
The chip is approximately 1 x 1.5 cm and contains slightly more
than 1 million transistors. Intended speed is 50 MHz, which they seem
to achieve in the lab at room temperature. The technology used is called
CHMOS-IV with 1 micron linewidths.
 
There are three major parts to the machine:
 
	- Caches and MMU/bus interface
	- a 32-bit Integer-RISC-CPU
	- a floating point vector engine with special graphics hardware
 
each covering approximately one third of the chip.
 
Both the instruction and data cache are 2-way set-associative
and driven by logical addresses, with a line size of 32 bytes. The instruction
cache is 4 KByte, the data cache 8 KByte with a copy-back strategy.
The external data bus is 64 bits wide and has special on-chip logic
to support fast page mode DRAMs ("Next near pin"). The MMU
has a 64-entry, 4-way set-associative TLB.
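
For what it's worth, the geometry implied by those cache and TLB numbers
can be worked out in a few lines of Python. This is only a sketch: the
helper name cache_sets is made up for illustration, and the set counts are
derived from the figures above, not quoted from the article.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    lines = size_bytes // line_bytes  # total number of cache lines
    return lines // ways              # each set holds `ways` lines

# Figures from the post: 2-way caches with 32-byte lines.
icache_sets = cache_sets(4 * 1024, ways=2, line_bytes=32)  # instruction cache
dcache_sets = cache_sets(8 * 1024, ways=2, line_bytes=32)  # data cache

# The TLB is described in entries rather than bytes: 64 entries, 4-way.
tlb_sets = 64 // 4

print(icache_sets, dcache_sets, tlb_sets)
```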
 
The integer RISC unit has 32 32-bit registers and separate busses for
data and instructions (4 32-bit busses). The instruction cache can either
deliver a single 32-bit (integer) instruction at a time or an integer
and a floating point instruction (via a second 32-bit instruction data path
to the vector floating point unit) simultaneously. The same applies to the
data cache, which can be accessed either as 32 bits from the integer RISC
unit or as 128 bits (yes, TWO doubles at a time!) from the floating point
control unit. Loads and stores each take a single cycle, with the load delay
slot supported by hardware interlocks. Pipeline depth is 4.
 
The FPCU has a 5-port register file of unknown (to me) size, with
3 ports feeding the two 64-bit inputs to the vector unit and one port receiving
from there. The two other ports can be used by pipelined load and store
instructions at the same time without interference with the data cache.
 
The vector unit contains a floating point multiplier and an adder. Adder and
multiplier can each produce a result every cycle, but a double precision
multiply takes 2 cycles. There are 3 temporary registers and chaining data
paths to allow for the peak rate of 100 MFLOPS when adder and multiplier
are both running. The pipeline depth is 3 cycles.
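
The 100 MFLOPS peak follows directly from the clock rate and the two
units issuing in parallel. A small sketch of that arithmetic is below. The
function peak_mflops is a hypothetical helper, and the double precision
figure it produces is my own inference (it assumes the adder still delivers
one result per cycle in double precision), not a number from the article.

```python
CLOCK_MHZ = 50  # intended clock speed from the article

def peak_mflops(clock_mhz, add_cycles, mul_cycles):
    """Peak MFLOPS with adder and multiplier running in parallel.

    add_cycles / mul_cycles are the cycles between successive results
    of each unit once its pipeline is full.
    """
    return clock_mhz / add_cycles + clock_mhz / mul_cycles

# Single precision: both units produce one result per cycle.
single = peak_mflops(CLOCK_MHZ, add_cycles=1, mul_cycles=1)

# Double precision (assumption): adder every cycle, multiplier every 2 cycles.
double = peak_mflops(CLOCK_MHZ, add_cycles=1, mul_cycles=2)

print(single, double)
```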
 
On the same data path is a graphics unit that handles 32-bit Z-buffering for
hidden surface elimination and Gouraud and Phong shading of surfaces. They
claim 21 million Gouraud-shaded pixels per second (heck, what a meaningless
measure...).
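
One way to make the pixel figure slightly less meaningless is to convert it
into clock cycles per pixel. The sketch below does only that arithmetic,
assuming the claimed rate refers to the 50 MHz part; the article does not
say what the measurement conditions were.

```python
CLOCK_HZ = 50_000_000          # intended clock speed
PIXELS_PER_SECOND = 21_000_000  # claimed Gouraud-shaded pixel rate

# Average number of clock cycles available per shaded pixel.
cycles_per_pixel = CLOCK_HZ / PIXELS_PER_SECOND

print(round(cycles_per_pixel, 2))
```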
 
As to positioning:
 
The article is entitled: "The CRAY on a desktop - a vision becomes a
reality". They claim more than 50% CRAY-1 performance for all levels of
vectorization in single precision (single precision performance on a CRAY ?)
and 25% in double precision, with 50-75% performance for the 50-80%
vectorization levels.
 
The article's author (from Intel) positions it as a single-chip Ardent TITAN
or Stellar, but he also mentions Convex and Alliant as possible victims.
They clearly see this as a chip for single user systems, exemplified
by touting it as "PC" in the examples.
 
My opinion is that this is an impressive chip (the on-chip caches can
deliver 1.2 GIGA-Bytes/second!) but I don't think they can build
it in volume before 1990. It is an attempt to scare people away from other
architectures before the door has closed for Intel. What I really like is
that it gives you a glimpse of what you will be able to do with a single
chip in a year's time frame :-).
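
The 1.2 GByte/s cache figure is not broken down anywhere, but one
decomposition that reproduces it is a 128-bit data access plus two 32-bit
instructions every cycle at 50 MHz. The sketch below checks that
arithmetic; the split between data and instruction traffic is my
assumption, and the constant names are made up for illustration.

```python
CLOCK_HZ = 50_000_000  # intended clock speed

# Assumed per-cycle traffic from the on-chip caches:
DATA_BYTES_PER_CYCLE = 128 // 8       # one 128-bit data cache access
INSTR_BYTES_PER_CYCLE = 2 * 32 // 8   # integer + floating point instruction

bytes_per_second = CLOCK_HZ * (DATA_BYTES_PER_CYCLE + INSTR_BYTES_PER_CYCLE)
gbytes_per_second = bytes_per_second / 1e9

print(gbytes_per_second)
```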
 
 
	Burkhard Neidecker-Lutz, Digital CEC-Karlsruhe, Project NESTOR
 
PS: All this was extracted from 4 schematic drawings of the data paths
    of the chip, so please correct me if I went wrong.