neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (02/27/89)
Ok, I have a 5-page article describing the beast in front of me:

The chip is approximately 1 x 1.5 cm and contains slightly more than 1 million transistors. Intended speed is 50 MHz, which they seem to achieve in the lab at room temperature. The technology used is called CHMOS-IV, with 1 micron linewidths.

There are three major parts to the machine:

- Caches and MMU/bus interface
- a 32-bit integer RISC CPU
- a floating point vector engine with special graphics hardware

each covering approximately one third of the chip.

Both the instruction and data caches are 2-way set-associative and driven by logical addresses, with a line size of 32 bytes. The instruction cache is 4 KByte, the data cache 8 KByte with a copy-back strategy. The external data bus is 64 bits wide and has special on-chip logic to support fast page mode DRAMs ("Next near pin"). The MMU has a 64-entry, 4-way set-associative TLB.

The integer RISC unit has 32 32-bit registers and separate busses for data and instructions (4 32-bit busses). The instruction cache can either deliver one 32-bit (integer) instruction at a time or deliver an integer and a floating point instruction simultaneously (the latter via a second 32-bit instruction data path to the vector floating point unit). The same applies to the data cache, which can be accessed either as 32 bits from the integer RISC unit or as 128 bits (yes, TWO doubles at a time!) from the floating point control unit. Loads and stores each take a single cycle, with the load delay slot supported by hardware interlocks. Pipeline depth is 4.

The FPCU has a 5-port register file of unknown (to me) size, with 3 ports feeding the two 64-bit inputs to the vector unit and one port receiving from there. The two other ports can be used by pipelined load and store instructions at the same time without interference with the data cache.

The vector unit contains a floating point multiplier and an adder.
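
As a quick sanity check on the cache and TLB figures above, here is a small Python sketch that works out the number of sets for each structure and splits an address into tag, index and line offset. The names (`SetAssociative`, `cache`, `split_address`) are made up for illustration, 1 KByte is taken as 1024 bytes, and the tag/index/offset split is the textbook scheme; the article does not say how the chip actually indexes its caches, so treat that part as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetAssociative:
    """Geometry of a set-associative structure (cache or TLB)."""
    entries: int  # total lines (cache) or entries (TLB)
    ways: int     # associativity

    @property
    def sets(self) -> int:
        return self.entries // self.ways

    @property
    def index_bits(self) -> int:
        # Valid because every set count used below is a power of two.
        return self.sets.bit_length() - 1


def cache(size_bytes: int, line_bytes: int, ways: int) -> SetAssociative:
    """Build the geometry of a cache from its size, line size and ways."""
    return SetAssociative(entries=size_bytes // line_bytes, ways=ways)


def split_address(addr: int, line_bytes: int, geom: SetAssociative):
    """Textbook tag/index/offset split (an assumption, not from the article)."""
    offset_bits = line_bytes.bit_length() - 1
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (geom.sets - 1)
    tag = addr >> (offset_bits + geom.index_bits)
    return tag, index, offset


LINE_BYTES = 32
icache = cache(4 * 1024, LINE_BYTES, ways=2)  # 4 KByte instruction cache
dcache = cache(8 * 1024, LINE_BYTES, ways=2)  # 8 KByte data cache
tlb = SetAssociative(entries=64, ways=4)      # 64-entry, 4-way TLB

if __name__ == "__main__":
    print("icache sets:", icache.sets)
    print("dcache sets:", dcache.sets)
    print("tlb sets:   ", tlb.sets)
    print("split:", split_address(0x12345678, LINE_BYTES, dcache))
```

Under these assumptions the instruction cache comes out at 64 sets, the data cache at 128 sets and the TLB at 16 sets.
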
Adder and multiplier can each produce a result every cycle, but a double precision multiply takes 2 cycles. There are 3 temporary registers and chaining data paths to allow for the peak rate of 100 MFLOPS when adder and multiplier are both running. The pipeline depth is 3 cycles. On the same data path is a graphics unit that handles 32-bit Z-buffering for hidden surface elimination and Gouraud and Phong shading of surfaces. They claim 21 million Gouraud-shaded pixels per second (heck, what a meaningless measure...).

As to positioning: the article is entitled "The CRAY on a desktop - a vision becomes a reality". They claim more than 50% of CRAY-1 performance for all levels of vectorization in single precision (single precision performance on a CRAY?) and 25% in double precision, with 50-75% performance for the 50-80% vectorization levels. The article's author (from Intel) positions it as a single-chip Ardent TITAN or Stellar, but he also mentions Convex and Alliant as possible victims. They clearly see this as a chip for single-user systems, exemplified by touting it as a "PC" in the examples.

My opinion is that this is an impressive chip (the on-chip caches can deliver 1.2 GIGA-Bytes/second!) but I don't think they can build it in volume before 1990. It is an attempt to scare people off other architectures before the door has closed for Intel. What I really like is that it gives you a glimpse of what you will be able to do with a single chip in a year's time frame :-).

Burkhard Neidecker-Lutz, Digital CEC-Karlsruhe, Project NESTOR

PS: All this was extracted from 4 schematic drawings of the data paths of the chip, so please correct me if I went wrong.
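
For anyone who wants to check the two peak figures quoted above (100 MFLOPS and 1.2 GBytes/second), here is a rough Python sketch of the arithmetic. The function names are invented for illustration. The bandwidth line assumes that the 1.2 GBytes/second figure is the sum of the dual-instruction fetch (2 x 32 bits) and the 128-bit data access in every cycle at 50 MHz; the article does not spell out how the number was derived, so that breakdown is my reading, not a stated fact.

```python
CLOCK_HZ = 50_000_000  # intended clock rate, 50 MHz


def peak_mflops(clock_hz: int, results_per_cycle: int) -> float:
    """Peak floating point rate in MFLOPS for a given number of results per cycle."""
    return clock_hz * results_per_cycle / 1e6


def cache_bandwidth_bytes_per_s(clock_hz: int, instr_bits: int, data_bits: int) -> int:
    """Combined on-chip cache bandwidth, assuming both caches are read every cycle."""
    return clock_hz * (instr_bits + data_bits) // 8


# Adder and multiplier each deliver one result per cycle when chained.
single_precision_peak = peak_mflops(CLOCK_HZ, results_per_cycle=2)

# Assumption: 2 x 32-bit instruction fetch plus one 128-bit data access per cycle.
bandwidth = cache_bandwidth_bytes_per_s(CLOCK_HZ, instr_bits=2 * 32, data_bits=128)

if __name__ == "__main__":
    print(f"peak rate:       {single_precision_peak:.0f} MFLOPS")
    print(f"cache bandwidth: {bandwidth / 1e9:.1f} GBytes/second")
```

With those assumptions the two results match the quoted 100 MFLOPS and 1.2 GBytes/second.
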