Cell microprocessor implementations

Cell microprocessors are multi-core processors that use a cellular architecture for high-performance distributed computing. The first commercial Cell microprocessor, the Cell BE, was designed for the Sony PlayStation 3. IBM designed the PowerXCell 8i for use in the Roadrunner supercomputer.^[1]

Implementation

First edition Cell on 90 nm CMOS

IBM has published information concerning two different versions of Cell in this process: an early engineering sample designated DD1, and an enhanced version designated DD2 that was intended for production.

The main enhancement in DD2 was a small lengthening of the die to accommodate a larger PPE core, which is reported to "contain more SIMD/vector execution resources". Some preliminary information released by IBM references the DD1 variant. As a result, some early journalistic accounts of the Cell's capabilities now differ from production hardware.

SPE floorplan

Additional details concerning the internal SPE implementation have been disclosed by IBM engineers, including Peter Hofstee, IBM's chief architect of the synergistic processing element, in a scholarly IEEE publication.

This document includes a photograph of the 2.54 mm × 5.81 mm SPE, as implemented in 90 nm SOI. In this technology, the SPE contains 21 million transistors, of which 14 million are in arrays (a term presumably designating register files and the local store) and 7 million are logic. The photograph is overlaid with functional unit boundaries, each captioned by name, which reveal the breakdown of silicon area by function unit listed in the SPU function unit table.
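The figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch, not anything published by IBM: it derives the SPE area from the stated dimensions, the transistor density, and the share of the die one SPE would occupy for each of the two die areas given in the designation table. The variable names are illustrative only.

```python
# Figures quoted in the text for the 90 nm SOI SPE.
SPE_WIDTH_MM = 2.54
SPE_HEIGHT_MM = 5.81
TRANSISTORS_TOTAL = 21_000_000
TRANSISTORS_ARRAYS = 14_000_000
TRANSISTORS_LOGIC = 7_000_000

# Die areas from the designation table (DD1 and DD2).
DIE_AREA_MM2 = {"DD1": 221.0, "DD2": 235.0}

# Area of one SPE from its quoted dimensions.
spe_area_mm2 = SPE_WIDTH_MM * SPE_HEIGHT_MM

# Average transistor density over the whole SPE.
density_per_mm2 = TRANSISTORS_TOTAL / spe_area_mm2

# Fraction of the transistors that sit in arrays rather than logic.
array_fraction = TRANSISTORS_ARRAYS / TRANSISTORS_TOTAL

# Share of the full die taken by one SPE, for each die revision.
spe_share = {name: spe_area_mm2 / area for name, area in DIE_AREA_MM2.items()}

print(f"SPE area: {spe_area_mm2:.2f} mm^2")
print(f"Density:  {density_per_mm2 / 1e6:.2f} million transistors per mm^2")
print(f"Arrays:   {array_fraction:.0%} of transistors")
for name, share in spe_share.items():
    print(f"One SPE on {name}: {share:.1%} of the die")
```

The per-die shares can be compared with the "SPE (each)" row of the Cell function unit table. The source does not state which die revision that table was measured on, so the comparison is only a consistency check.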

Understanding the dispatch pipes is important for writing efficient code. In the SPU architecture, two instructions can be dispatched (started) in each clock cycle using dispatch pipes designated even and odd. The two pipes provide different execution units, as shown in the Pipe column of the SPU function unit table. As IBM partitioned this, most of the arithmetic instructions execute on the even pipe, while most of the memory instructions execute on the odd pipe. The permute unit is closely associated with memory instructions, as it serves to pack and unpack data structures located in memory into the SIMD multiple-operand format that the SPU computes on most efficiently.

Unlike other processor designs providing distinct execution pipes, each SPU instruction can only dispatch on one designated pipe. In competing designs, more than one pipe might be designed to handle extremely common instructions such as add, permitting two or more of these instructions to be executed concurrently, which can increase efficiency on unbalanced workloads. In keeping with the SPU's extremely Spartan design philosophy, no execution unit is provisioned more than once.
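The effect of the one-pipe-per-instruction rule can be illustrated with a toy dispatch model. This is a deliberately simplified sketch, not the real SPU issue logic. It assumes that two adjacent instructions dispatch in the same cycle only when they need different pipes, and it ignores latencies, data dependencies and any alignment requirements. The unit-to-pipe mapping is taken from the SPU function unit table; the function name `count_issue_cycles` is hypothetical.

```python
# Pipe used by each execution unit, per the Pipe column of the
# SPU function unit table ("local store" stands for LS0-LS3).
PIPE_OF_UNIT = {
    "single precision": "even",
    "double precision": "even",
    "simple fixed": "even",
    "permute": "odd",
    "branch": "odd",
    "channel": "odd",
    "local store": "odd",
}


def count_issue_cycles(units):
    """Count cycles needed to dispatch an in-order instruction stream.

    Toy model (assumption): the next two instructions dispatch together
    only if they use different pipes; otherwise one dispatches alone.
    """
    cycles = 0
    i = 0
    while i < len(units):
        first_pipe = PIPE_OF_UNIT[units[i]]
        has_next = i + 1 < len(units)
        if has_next and PIPE_OF_UNIT[units[i + 1]] != first_pipe:
            i += 2  # one even-pipe and one odd-pipe instruction this cycle
        else:
            i += 1  # pipe conflict or end of stream: single dispatch
        cycles += 1
    return cycles


# Eight instructions alternating arithmetic and memory access.
balanced = ["single precision", "local store"] * 4
# Eight arithmetic instructions that all compete for the even pipe.
arithmetic_only = ["single precision"] * 8

print(count_issue_cycles(balanced))
print(count_issue_cycles(arithmetic_only))
```

Under this model the alternating stream dispatches in half the cycles of the arithmetic-only stream, because the odd pipe sits idle whenever every instruction needs the even pipe.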

Understanding the limitations of the restrictive two-pipeline design is one of the key concepts a programmer must grasp to write efficient SPU code at the lowest level of abstraction. For programmers working at higher levels of abstraction, a good compiler will automatically balance pipeline concurrency where possible.
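As a rough illustration of what balancing pipeline concurrency means, the sketch below reorders a list of instructions so that even-pipe and odd-pipe work alternates. It assumes every instruction is independent, so any order is legal. A real compiler must also respect data dependencies and latencies, and nothing here reflects how any particular SPU compiler works. The instruction names and both function names are hypothetical.

```python
from itertools import zip_longest


def interleave_pipes(instructions):
    """Reorder (name, pipe) tuples so even and odd instructions alternate.

    Assumption: all instructions are independent, so reordering is safe.
    """
    even = [ins for ins in instructions if ins[1] == "even"]
    odd = [ins for ins in instructions if ins[1] == "odd"]
    ordered = []
    for even_ins, odd_ins in zip_longest(even, odd):
        if even_ins is not None:
            ordered.append(even_ins)
        if odd_ins is not None:
            ordered.append(odd_ins)
    return ordered


def min_dispatch_cycles(instructions):
    """Lower bound on dispatch cycles: the busier pipe sets the pace."""
    n_even = sum(1 for _, pipe in instructions if pipe == "even")
    n_odd = len(instructions) - n_even
    return max(n_even, n_odd)


# Hypothetical program: three arithmetic operations followed by three
# memory-side operations, written in an order that never pairs well.
program = [
    ("fadd_a", "even"), ("fadd_b", "even"), ("fadd_c", "even"),
    ("load_a", "odd"), ("load_b", "odd"), ("shuffle_a", "odd"),
]

for name, pipe in interleave_pipes(program):
    print(f"{pipe:>4}  {name}")
print("lower bound on cycles:", min_dispatch_cycles(program))
```

The lower bound shows why unbalanced code is costly: a stream with many more instructions for one pipe than the other cannot dispatch faster than the busier pipe allows, however it is ordered.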

This article uses material from the Wikipedia article Cell_microprocessor_implementations, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

References

[1] Kevin J. Barker, Kei Davis, Adolfy Hoisie, Darren J. Kerbyson, Mike Lang, Scott Pakin, Jose C. Sancho. "Entering the Petaflop Era: The Architecture and Performance of Roadrunner".

Designation	Die area	First disclosed	Enhancement
DD1	221 mm²	ISSCC 2005
DD2	235 mm²	Cool Chips April 2005	Enhanced PPE core

Cell floorplan

Cell function unit	Area	Description
XDR interface	5.7%	Interface to Rambus system memory
memory controller	4.4%	Manages external memory and L2 cache
512 KiB L2 cache	10.3%	Cache memory for the PPE
PPE core	11.1%	PowerPC processor
test	2.0%	Unspecified "test and decode logic"
EIB	3.1%	Element interconnect bus linking processors
SPE (each) × 8	6.2%	Synergistic processing element
I/O controller	6.6%	External I/O logic
Rambus FlexIO	5.7%	External signalling for I/O pins

SPU function unit	Area	Description	Pipe
single precision	10.0%	single precision FP execution unit	even
double precision	4.4%	double precision FP execution unit	even
simple fixed	3.25%	fixed point execution unit	even
issue control	2.5%	feeds execution units
forward macro	3.75%	feeds execution units
GPR	6.25%	general purpose register file
permute	3.25%	permute execution unit	odd
branch	2.5%	branch execution unit	odd
channel	6.75%	channel interface (three discrete blocks)	odd
LS0–LS3	30.0%	four 64 KiB blocks of local store	odd
MMU	4.75%	memory management unit
DMA	7.5%	direct memory access unit
BIU	9.0%	bus interface unit
RTB	2.5%	array built-in test block (ABIST)
ATO	1.6%	atomic unit for atomic DMA updates
HB	0.5%	obscure

SPE power and performance

Voltage	Frequency	Power	Die temperature
0.9 V	2.0 GHz	1 W	25 °C
0.9 V	3.0 GHz	2 W	27 °C
1.0 V	3.8 GHz	3 W	31 °C
1.1 V	4.0 GHz	4 W	38 °C
1.2 V	4.4 GHz	7 W	47 °C
1.3 V	5.0 GHz	11 W	63 °C
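
One way to read the operating points in the table above is in terms of energy spent per clock cycle, which is simply power divided by frequency. The sketch below performs that conversion. It assumes the quoted power is the total dissipation at each operating point, and the function name `energy_per_cycle_nj` is illustrative.

```python
# Operating points from the table above:
# (voltage in V, frequency in GHz, power in W, die temperature in deg C)
OPERATING_POINTS = [
    (0.9, 2.0, 1, 25),
    (0.9, 3.0, 2, 27),
    (1.0, 3.8, 3, 31),
    (1.1, 4.0, 4, 38),
    (1.2, 4.4, 7, 47),
    (1.3, 5.0, 11, 63),
]


def energy_per_cycle_nj(power_w, freq_ghz):
    """Energy per clock cycle in nanojoules.

    Watts divided by gigahertz is (J/s) / (1e9 cycles/s), i.e. nJ per cycle.
    """
    return power_w / freq_ghz


# One (voltage, frequency, power, energy-per-cycle) row per operating point.
rows = [
    (volts, ghz, watts, energy_per_cycle_nj(watts, ghz))
    for volts, ghz, watts, _temp_c in OPERATING_POINTS
]

for volts, ghz, watts, nj in rows:
    print(f"{volts:.1f} V  {ghz:.1f} GHz  {watts:>2} W  {nj:.2f} nJ/cycle")
```

By this measure the energy per cycle rises at every step up the table, so the highest clock rates are bought with a steadily worse energy cost per unit of work.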
