Introduction to the Cell Broadband Engine Architecture

C. R. Johns
D. A. Brokenshire
This paper provides an overview of the Cell Broadband Engine
Architecture (CBEA). The CBEA defines a revolutionary
extension to a more conventional processor organization and serves
as the basis for the development of microprocessors targeted at
the computer entertainment, multimedia, and real-time market
segments. In this paper, the organization of the architecture is
described, as well as the instruction set, commands, and facilities
defined in the architecture. In many cases, the motivation for these
facilities is explained and examples are provided to illustrate their
intended use. In addition, this paper introduces the Software
Development Kit and the software standards for a
CBEA-compliant processor.
Overview
The Cell Broadband Engine Architecture (CBEA)
defines a family of heterogeneous microprocessors that
target multimedia and compute-intensive applications [1].
The CBEA resulted from a joint effort among the Sony
Group, Toshiba, and IBM to develop the next-generation
processor. The following motivations shaped the
development of the architecture [2]:
• Provide outstanding performance on computer entertainment and multimedia applications.
• Develop an architecture applicable to a wide range of platforms.
• Enable real-time response to the user and the network.
• Address the three design challenges facing traditional processors: memory latency, power, and frequency.
The CBEA extends the IBM PowerPC* 64-bit
architecture with loosely coupled, cooperative offload
processors. The CBEA is set apart from other processor
architectures by the use of two independent instruction
sets: the PowerPC and the synergistic processor unit
(SPU) instruction sets. For a processor to be considered
CBEA compliant, the processor must contain one or
more PowerPC processor elements (PPEs), one or more
synergistic processor elements (SPEs), and the required
feature set defined by the CBEA. Figure 1 is a block
diagram of a CBEA processor.
The PPE is compliant with the IBM PowerPC
Architecture* [3]. It is intended to perform the system
management and application control functions, or
‘‘control plane’’ processing. Whereas other architectures
have augmented the instruction set of the processor with
tightly coupled extensions such as the vector/single-
instruction multiple-data (SIMD) multimedia extension,
the CBEA employs an independent SPU that is compliant
with the SPU instruction set architecture [4] to perform
the compute-intensive, or ‘‘data plane,’’ processing. This
allows the data processing and control functions to be
decoupled, enabling more application parallelism.
Another distinction between a traditional processor
and the CBEA is the definition of two storage domains:
main and local. The main storage domain is the same as
that commonly found in most processors. This domain
contains the address space for system memory and
memory-mapped I/O (MMIO) registers and devices.
Associated with each SPU is a local storage address
space, or domain, containing instructions and data for
that SPU. Each local storage domain is also assigned an
address range in the main storage domain called the local
storage alias. The SPU can address memory directly only
within the associated local storage. To access data in the
main storage domain and maintain synchronization with
other processors, each SPU and local storage pair is
coupled with a memory flow controller (MFC). The
combined SPU, associated local storage, and MFC make
up the SPE. In addition to the PPE and the SPE, the CBEA
includes many features for real-time applications not
typically found in conventional processor architectures.
The ‘‘Cell Broadband Engine Architecture’’ document,
which defines the architecture, refers to the ‘‘PowerPC
Architecture Book’’ and the ‘‘SPU Instruction Set
Architecture’’ document to avoid duplicating
information. For a complete definition of the CBEA,
the reader must have access to all of these documents,
which are publicly available on IBM Web pages [1, 3, 4].
The CBEA document is divided into two parts: the
user-mode environment (UME) and the privileged-mode
environment (PME). The UME defines the instructions
and facilities required for application portability. This
section is similar to Books I and II of the PowerPC
Architecture. For the PowerPC Architecture, only Book I
compliance is necessary for application portability. Since
the CBEA defines a heterogeneous multiprocessor, the
synchronization features found in Book II of the
PowerPC are necessary for cooperative processing. For
this reason, Books I and II were collapsed into a single
UME section. The PME, which defines the instructions
and facilities required for operating system and
hypervisor development, is similar to Book III of the
PowerPC Architecture. (A hypervisor is a layer of software
running on the processor that allows multiple ‘‘guest’’
operating systems to run concurrently; it virtualizes the
processor and the system resources.)
PowerPC processor element
Within a CBEA-compliant processor, the PPE performs
the control plane functions that typically require the
more general-purpose computing provided by the PPE.
The PPE is based on Version 2.02 of the PowerPC
Architecture, which offers many of the features required
for the application spaces targeted by the CBEA.

Figure 1: CBEA processor block diagram. (The figure shows SPE groups SG_0 through SG_n and PPE groups PG_0 through PG_p attached to the element interconnect bus (EIB), together with the bus interface controller (BIC), internal interrupt controller (IIC), and memory interface controller (MIC) connecting to I/O and memory. Each SPE contains an SPU, local storage (LS), and an MFC with MMU, RMT, and SL1; each PPE contains a PPU with L1 and L2 caches and a BIU. Abbreviations: BIC, bus interface controller; BIU, bus interface unit; IIC, internal interrupt controller; L1, memory cache internal to the CPU; L2, memory cache external to the CPU; LS, local storage; MFC, memory flow controller; MIC, memory interface controller; MMU, memory management unit; PPE, PowerPC processor element; PPU, PowerPC processor unit; RMT, replacement management table; SL1, first-level cache; SPE, synergistic processor element; SPU, synergistic processor unit.)

The
PPE is based on the PowerPC Architecture for four
reasons: 1) The PowerPC Architecture is a mature
architecture that is applicable to a wide variety of
platforms. 2) It supports multiple simultaneous operating
environments through logical partitioning. 3) It contains
proven microarchitectures that meet the frequency
and power challenges of the targeted market segment.
4) Use of the PowerPC Architecture leverages the IBM
investment in the PowerPC ecosystem.
A key advantage of the PowerPC Architecture is its
unique ability to support multiple concurrent operating
environments through the use of logical partitioning.
This feature is critical to executing a real-time
operating system (RTOS) for the user interface while
simultaneously executing a non-RTOS, such as the
Linux** operating system, for management of the system.
Concurrent execution of an RTOS and the Linux
operating system allows non–real-time processes to
be performed in the background without affecting the
performance of an application running under the RTOS.
Although the PowerPC is a mature architecture for
more traditional platforms, concurrently supporting both
an RTOS and a non-RTOS within the heterogeneous
multiprocessor presented many architectural challenges.
To meet these challenges and optimize the architecture
for media-rich applications, the PowerPC Architecture
required enhancements to the vector/SIMD multimedia
extension, as well as the introduction of the mediated external
exception, the multiple concurrent large-page support,
software management of translation lookaside buffers
(TLBs), and the cache replacement management extensions.
The following subsections describe the mediated
external exception extension, the vector/SIMD multimedia
extension, and the multiple concurrent large-page
extension. Software management of TLBs and cache
replacement management are described in separate
sections because they apply to both the PPEs and the SPEs.
See the CBEA document [1] for more information on each
extension.
Mediated external exception extension
A key attribute of an RTOS is a guaranteed interrupt
latency. A single RTOS can control the interrupt latency
by not allowing interrupts to be disabled for more than a
predetermined amount of time. However, when multiple
operating systems are running, the presentation of an
interrupt can be delayed by another partition for an
arbitrary amount of time. The mediated external
exception extension to the PowerPC Architecture
overcomes this deficiency by allowing an interrupt to be
presented to the processor even if interrupts are disabled.
In the PowerPC Architecture, the term exception
describes the interrupt condition, and the term interrupt
refers to the processor acting on the exception condition or
jumping to the interrupt handler. The critical exception
conditions for an RTOS are those generated by SPEs and
external devices. The current PowerPC Architecture allows
an external exception to invoke a hypervisor-privileged
interrupt handler. However, an operating system still
controls the external exception enable in the machine
state register of the processor. The mediated
external exception extension of the CBEA enables the interrupt handler
to be invoked whenever the processor is not operating
in hypervisor state or external exceptions are enabled.
This prevents an operating system from delaying the
presentation of an exception to the processor and allows
a hypervisor-privileged interrupt handler to be invoked.
If the exception is for the RTOS, the hypervisor can
immediately pass control to the interrupt handler of the
RTOS regardless of which partition is currently active.
However, an interrupt can occur when the RTOS is in a
critical section with external exceptions disabled, and this
is where the mediated portion of the extension comes into
play. When external exceptions are disabled by the
partition that is expected to handle the interrupt, the
hypervisor-privileged interrupt handler sets a mediated
external exception request bit in the logical partitioning
control register and returns from the interrupt. When the
partition enables external exceptions, the hypervisor-
privileged interrupt handler is once again invoked with a
mediated external exception. The interrupt handler then
sets the state of the processor to mimic the original
external exception.
Vector/SIMD multimedia extension
Although the CBEA contains offload processors for
vector and streaming media processing, the architecture
team included the vector/SIMD multimedia extension for
the PPE. This multimedia extension was included to run
software developed for the vector/SIMD multimedia
extension, to make it easier to develop and port
applications to the SPE, and to allow applications
to be parallelized across the PPEs and SPEs.
The vector/SIMD multimedia extension defined
by the CBEA is very similar to the PowerPC 970*
implementation. The major difference between the two is
the rounding mode support. For compatibility with SPU
applications, the vector/SIMD multimedia extension
unit in the PPE supports the rounding modes
defined by the SPU instruction set architecture.
Multiple concurrent large-page extension
The current PowerPC Architecture supports the base
4-KB page size plus one additional large-page size to be used
concurrently. The large-page size is implementation
dependent. In the CBEA, many types of data structures
are located in main storage, e.g., MMIO registers for
the SPEs, local storage aliases, streaming data, and video
buffers. The limitation of only one large-page size places
a burden on the TLBs. If a large-page size of 64 KB is
selected, the number of translations needed for MMIO
registers and local storage aliases is lower than that for
the base 4-KB page size. However, more translations are
required for the relatively large streaming buffers and
video buffers in main memory. In contrast, a large-page
size of 1 MB or 16 MB reduces the number of translations
required for the streaming and video buffers, but it is too
large for mapping the MMIO registers and local storage
aliases.
To improve the efficiency of the TLBs, the CBEA
augments the PowerPC Architecture by providing
support for multiple concurrent large-page sizes. The
memory management units (MMUs) in the SPE also
support the multiple concurrent large-pages extension.
Synergistic processor element
The SPE is the cooperative offload processor in the
CBEA intended for the computational, or data plane,
processing functions. The SPE consists of three tightly
coupled units: the SPU, the associated local storage,
and the MFC. The SPU is a SIMD processor with an
instruction set architecture optimized for compute-
intensive and media applications; it operates only on
instructions and data in the associated local storage.
Decoupling the SPU from other aspects of the system
provides a very deterministic processing environment
for the programmer.
The SPU and associated local storage are coupled
to the main storage domain and other processors by
the MFC. The MFC enables software to move data
between the storage domains and to synchronize with
other processors in the system. Data movement and
synchronization are initiated by using MFC commands.
Either the SPU or another processor in the system, such
as the PPE, can issue these commands. A direct memory
access (DMA) controller in the MFC unit performs the
data movement. All main storage accesses performed by
the DMA controller adhere to the PowerPC Architecture
for address translation and protection. They are
performed asynchronously with respect to the SPU
and all other units in the system.
Synergistic processor unit
The SPU provides the programmer with 128 registers,
each of which is a 128-bit SIMD register. The large
number of architected registers facilitates efficient
instruction scheduling and also enables important
optimization techniques such as loop unrolling. All SPU
instructions are inherently SIMD operations that process
data in one of four granules: sixteen 8-bit integers, eight
16-bit integers, four 32-bit integers or single-precision
floating-point numbers, or two 64-bit double-precision
floating-point numbers. The SIMD registers in the SPU
are unified and can be an operand of either an integer
or a floating-point instruction, unlike the split set of
registers in the PowerPC Architecture [5].
To obtain the best performance from an SPU, the data
structures of the program should be defined around the
SIMD data flow. In addition, techniques such as double
buffering should be employed to overlap the computation
with the data movement. This technique insulates the
SPU application from the latency of the system memory
accesses. More programming tips can be found in [6].
Although the SPU is optimized for SIMD, scalar
operations can be performed; they use the preferred
slot (the upper word, or 32 bits, for 32-bit scalars) of the
SIMD register.
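To make the SIMD data flow concrete, the following sketch uses the C language extensions for SPU programming (the vector keyword and intrinsics such as spu_splats, spu_madd, and spu_extract from spu_intrinsics.h, as shipped with the SDK); the function names and loop structure are illustrative, not part of the architecture:

    #include <spu_intrinsics.h>

    /* Process four single-precision values per instruction: y[i] = a*x[i] + y[i]. */
    void saxpy_simd(vector float *x, vector float *y, float a, int nvec)
    {
        vector float va = spu_splats(a);      /* replicate the scalar into all four slots */
        int i;
        for (i = 0; i < nvec; i++)
            y[i] = spu_madd(va, x[i], y[i]);  /* fused multiply-add across all slots */
    }

    /* A 32-bit scalar result is carried in the preferred slot (word element 0). */
    float preferred_slot(vector float v)
    {
        return spu_extract(v, 0);
    }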
Channel interface
As mentioned earlier, the SPU is decoupled from the
system and the MFC provides the linkage to the main
storage domain and other processors. A channel interface
supplies the communication path between the SPU and
the MFC. The CBEA defines multiple unique channels
for issuing MFC commands and accessing MFC facilities.
These channels are accessed by using the SPU channel
instructions.
The SPU instruction set architecture defines the
channels as four words wide, but the CBEA uses only one
word, which is the preferred slot of the SIMD data flow.
Each channel has the following set of attributes defined
by the CBEA: direction, capacity, and the ability to be
blocking or nonblocking.
Channels are unidirectional, with the direction defined
by the CBEA. Channels used to retrieve information
from or transfer information to the MFC are accessed,
respectively, by using an SPU read channel (rdch) or write
channel (wrch) instruction. Accessing a read channel by
using a wrch instruction or a write channel by using an
rdch instruction is not allowed and results in an invalid
channel error.
Channel capacity (depth) defines the number of words
that can be contained within the channel. The CBEA
defines the depth for some channels; for others, it is an
implementation-specific parameter. A channel count,
which is accessed by using an SPU read channel count
(rdchcnt) instruction, is used to track the amount of
information in the channel. The value returned by the
rdchcnt instruction indicates the number of valid words
contained in a read channel and the available free space
or number of words that can be written to a write
channel.
The CBEA defines each channel as either blocking or
nonblocking. A nonblocking channel essentially has an
infinite depth, and a value of one is always returned when
reading the channel count. Blocking channels have a
depth of one or more, with an active channel count.
Accessing a blocking channel whose count is zero stalls
the instruction processing of the SPU until the channel
count becomes nonzero. The blocking attribute relieves
the programmer from having to check the channel
capacity before accessing the channel. When no other
useful work can be performed, this attribute allows the
program to enter a very low-power state while waiting
for the channel to become available. If useful work can
be performed, the programmer can check the channel
capacity with the rdchcnt instruction before accessing
the channel and can then occasionally poll the channel
count while performing other useful work.
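As a minimal sketch of this polling pattern, the generic channel intrinsics from the SDK (spu_readch and spu_readchcnt, with channel enumerants defined in spu_mfcio.h) map directly onto the rdch and rdchcnt instructions; here the SPU checks the inbound mailbox channel count so that the read never stalls:

    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>      /* channel names such as SPU_RdInMbox */

    unsigned int read_mailbox_without_stalling(void)
    {
        /* rdchcnt: a nonzero count means a read of this blocking channel will not stall */
        while (spu_readchcnt(SPU_RdInMbox) == 0) {
            /* ... perform other useful work here ... */
        }
        return spu_readch(SPU_RdInMbox);    /* rdch: guaranteed not to block now */
    }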
An SPU interrupt offers an alternative to polling.
The presence of an SPU interrupt can affect the SPU
instruction sequencing in one of two ways. If interrupts
are disabled, a special SPU branch instruction (bisled) can
be executed to branch to a target address if an interrupt is
present. If interrupts are enabled, the SPU executes the
next instruction from address zero in local storage and
disables interrupts. The address of the next instruction
that would have been executed if the interrupt were not
present is saved in the SPU state save and restore register
(SRR0). SRR0 can also be accessed by using two
channels, one to read SRR0 and one to write SRR0. The
SPU iret instruction can be used to return to the address
stored in SRR0. SPU interrupts should not be confused
with interrupts generated by an SPE that are targeted for
the PPE (SPE interrupts).
The SPU event facility can be programmed to detect
key SPE conditions and other system events. The event
facility is accessed by using channel instructions to specify
the conditions of interest and determine the event status.
Software can use the event facility to stall the SPU while
waiting for multiple different events or to generate an
interrupt when an enabled event occurs. The SPU event
facility is described in more detail later in the paper.
Memory flow controller
The MFC provides the communication path from the
SPU and the local storage domain to the main storage
domain and other processors in the system. The MFC
essentially decouples the SPU from the main storage
domain; an application can view the SPU as having an
additional asynchronous load–store processor. If used
properly, the MFC can insulate an SPU application
from the latency of system memory.
Figure 2 is a high-level block diagram of the MFC
unit. The bus interface unit (BIU) provides the interface
to the element interconnect bus (EIB) [7]. The CBEA does
not define the EIB protocol. The first-level cache (SL1)
is a caching structure for MFC accesses to the main
storage domain, and it provides an architectural entity
for performance enhancements. In most implementations,
the SL1 either is omitted or is very small because of the
streaming nature of the MFC commands. The MFC SPU
command queue is dedicated to MFC commands issued
by the SPU using the channel interface. Other processors
use the MFC proxy command queue to transfer data
between the storage domains on behalf of the associated
SPE. The MFC proxy commands, which are issued by
using the MFC MMIO registers, are used primarily to
efficiently initialize the local storage before an SPU
program is executed. They are typically used by
software executing in a PPE.
The DMA controller transfers instructions and data
between SPU local storage and main storage. Programs
running on the associated SPU, a PPE, or another device
can issue MFC DMA commands. Table 1 shows the MFC
DMA commands supported by the DMA controller.
A single MFC DMA command can transfer up to
16 KB of sequential data between the storage domains. A
DMA transfer is typically performed as a series of smaller
bus transfers, usually a cache line or less. If required,
coherency is maintained for these transfers as specified
by the storage attributes in the page table.
The parameters listed below affect the operation of
MFC DMA commands:

• CL: MFC class ID.
• TG: MFC command tag identification.
• TS: MFC transfer size.
• LSZ: MFC list size.
• LSA: MFC local storage address.
• EAH: MFC effective address high.
• EAL: MFC effective address low.
• LA: MFC list local storage address.
• LTS: List element transfer size.
• LEAL: List element effective address low.

Figure 2: MFC block diagram. (The figure shows the DMA controller, with its DMA request unit, atomic request unit, and the MFC proxy and MFC SPU command queues; the MMU and RMT; the MFC registers with their MMIO interface; the SPU LS interface and SPU channel interface, with LS-address and real-address paths; an optional SL1 with RMT; and the BIU with a data path to the element interconnect bus.)
Each command includes a classID parameter. The
parameter contains a replacement management class
identifier (RclassID) that controls the cache replacement
management facility. The cache replacement management
facility is described later.
Software uses a TclassID identifier to classify the type
of storage being accessed. The TclassID provides a hint to
the MFC and to the system about the way the storage
request should be treated. For example, the TclassID can
be used by the bus interface unit in the MFC to prevent
accesses to a slow device from blocking accesses to the
higher-bandwidth main memory. To accomplish this, the
EIB transfer request queue in the MFC (not shown in
Figure 2) is segmented into two or more areas. The
number of entries in each segment is sized according to
the latency and bandwidth characteristics of the storage
being accessed. EIB transfer requests created for a
command are placed in only the transfer request queue
segment that corresponds to the TclassID of the
command. By issuing a command with the appropriate
TclassID, software can prevent the transfer request queue
from filling with transfer requests to slower devices. Since
the transfers to slower devices are limited to one area of
the request queue, transfers accessing higher-bandwidth
devices are guaranteed a minimum number of request
queue entries. This minimizes the latency impact created
by accesses to slower devices. Without this feature, the
transfer request queue would fill up with bus requests
for slower devices and prevent the MFC from issuing
requests for higher-bandwidth devices.

Table 1  MFC DMA commands supported by the DMA controller.

get<f,b>[s]: Moves data from the effective address within main storage to local storage.
getl<f,b>[s]: Same as get<f,b>, except that the effective address and size for multiple transfers are specified by list elements in local storage.
put[r]<f,b>[s]: Moves data from local storage to the effective address within main storage. The optional ‘‘r’’ modifier provides a hint to the system that the data is likely to be accessed soon by the PPE; the data transferred by the command is a candidate for copying into the PPE cache.
put[r]l<f,b>: Same as put[r]<f,b>[s], except that the effective address and size for multiple transfers are specified by list elements in local storage.
sdcrt: SL1 data cache range touch. Brings a range of effective addresses into the SL1 (performance hint for get commands). Similar to the PowerPC dcbt instruction.
sdcrtst: SL1 data cache range touch for store. Brings a range of effective addresses into the SL1 (performance hint for put commands). Similar to the PowerPC dcbtst instruction.
sdcrz: SL1 data cache range zero. Writes zeros to the contents of a range of effective addresses. Similar to the PowerPC dcbz instruction.
sdcrst: SL1 data cache range store. Stores the modified contents of a range of effective addresses. Similar to the PowerPC dcbst instruction.
sdcrf: SL1 data cache range flush. Stores the modified contents of a range of effective addresses and invalidates the block. Similar to the PowerPC dcbf instruction.
sndsig<f,b>: Sends a signal to another SPE. Updates the signal notification registers in another SPE.
barrier: Orders all commands issued prior to the barrier command with respect to all subsequent commands.
mfceieio: Orders the storage transactions caused by get and put commands. Similar to the PowerPC eieio instruction.
mfcsync: Orders DMA put and get operations within the specified tag group with respect to other processing units and devices on the system. Similar to the PowerPC sync instruction.
getllar: Get lock line and reserve (immediate). Similar in function to the PowerPC lwarx and ldarx instructions.
putllc: Put lock line conditional (immediate). Similar in function to the PowerPC stwcx and stdcx instructions.
putlluc: Put lock line unconditional (immediate). Main storage location updated regardless of reservation ownership.
putqlluc: Same as putlluc except queued with other DMA commands and has an implied fence modifier.

Note: The optional ‘‘s’’ modifier starts the SPU when the command completes. The fence (<f>) and barrier (<b>) modifiers are used for command ordering. See the section on command ordering for the definition of these modifiers.
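As an illustration, the sketch below shows how an SPU program might pull a block of main storage into local storage and wait for it; it uses the composite intrinsics from the SDK's spu_mfcio.h rather than raw channel writes, the buffer name and size are hypothetical, and the tag-group completion facility used at the end is described later in this paper:

    #include <stdint.h>
    #include <spu_mfcio.h>

    /* Destination in local storage; 128-byte alignment gives the best transfer efficiency. */
    static volatile char buf[16384] __attribute__((aligned(128)));

    void fetch_block(uint64_t ea, unsigned int size, unsigned int tag)
    {
        mfc_get(buf, ea, size, tag, 0, 0);   /* queue a get command (size up to 16 KB) */
        mfc_write_tag_mask(1 << tag);        /* select this tag group */
        mfc_read_tag_status_all();           /* stall until all commands in the group complete */
    }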
List commands
The CBEA provides a list modifier <l> for MFC DMA
commands to move data that is scattered in the main
storage domain. A command with a list modifier is
called a list command. List commands are converted to
a series of commands, each of which is described by a
list element in local storage. Each list element in local
storage contains an effective address low parameter and
a transfer size parameter. When a list command is
executed, the MFC reads the list elements from local
storage and creates a series of DMA commands. Each
DMA command that is generated has the same set of
parameters as the original list command, with the
exception of the transfer size and effective address low
parameters. Each MFC DMA list command can contain
up to 2,048 list elements, each transferring up to
16 KB of data. Only an SPU can issue list commands, and
such commands cannot be issued to the proxy MFC
command queue.
The list element includes a stall-and-notify flag. When
set, the flag causes the MFC unit to stop executing the list
command and notify the SPU after all of the DMA
transfers for the list element and all previous list elements
are complete. Subsequently, software can resume the list
command. Among other uses, this feature allows the
programmer to specify a data transfer larger than the
available buffer space in local storage. The stall-and-notify
flag is placed on the last element that fills the buffer.
When the data transferred by the previous elements is
processed and buffer space made available, the program
can resume the MFC DMA list command. The CBEA
defines one read channel that identifies the list commands
that are stalled. It defines a write channel that is used to
acknowledge the stall and allow the MFC to continue
executing the list command.
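A sketch of a gather operation built on a list command follows, again using the SDK's spu_mfcio.h definitions (including the mfc_list_element_t layout and the mfc_ea2l macro); the element count and transfer sizes are hypothetical:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define NELEM 4

    static mfc_list_element_t list[NELEM] __attribute__((aligned(8)));
    static volatile char buf[NELEM * 4096] __attribute__((aligned(128)));

    void gather(uint64_t ea[NELEM], unsigned int tag)
    {
        int i;
        for (i = 0; i < NELEM; i++) {
            list[i].notify = 0;                 /* stall-and-notify flag not used here */
            list[i].size   = 4096;              /* transfer size for this element */
            list[i].eal    = mfc_ea2l(ea[i]);   /* effective address low (32 bits) */
        }
        /* All list elements share the high 32 bits of the effective address given here. */
        mfc_getl(buf, ea[0], list, NELEM * sizeof(mfc_list_element_t), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }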
Memory management unit
The MMU allows an SPU application to use the same
effective address as that used by a PPE application to
access main storage.
The effective address for an MFC command is
provided in two parameters, the EAH parameter and the
EAL parameter. The EAH parameter contains the high-
order 32 bits of the 64-bit effective address, and the EAL
parameter contains the low-order 32 bits of the address.
The EAH parameter is optional; if it is omitted, the
high-order 32 address bits are set to zero.
The MMU translates the effective address into the real
address of main storage. This translation is compatible
with the virtual address translation mechanism defined by
the PowerPC Architecture. The PPE handles all MFC
translation faults. An MFC translation fault occurs when
a translation cannot be found for an effective address.
The MMU can use either the same page table as
that used by the PPEs or an independent page table; it
also supports the multiple concurrent large-page
extension described earlier in this paper. In addition, the
MMU supports software management of the TLBs,
which is presented subsequently.
Command ordering
As MFC commands are issued, they are placed in the
appropriate command queue. MFC commands can be
executed and completed in any order, regardless of
the order in which they were issued. The out-of-order
execution of commands allows the MFC to use system
resources efficiently to achieve the best performance. For
example, if the EIB supports simultaneous reads and
writes, the MFC can simultaneously execute one get
and one put command. If ordering were implied by
the issue order, this would not be possible.
While out-of-order command execution can help
performance, in some cases, software may require strict
command ordering. To achieve command ordering, the
CBEA provides two command modifiers (fence <f> and
barrier <b>) and a barrier command. The command
modifiers order commands only within the same tag
group (that is, all commands issued with the same tag
parameter to the same queue). Hence, these are called
tag-specific modifiers. A command with a fence modifier
is performed after all previously issued commands within
the same tag group. Commands issued after a command
with a fence modifier are not affected. A command with
the barrier modifier is performed after all previously
issued commands within the same tag group. Commands
that are issued after a command with a tag-specific barrier
are also performed after previously issued commands.
Some commands have an implied tag-specific barrier
modifier.
When command ordering is required to be independent
of the tag group, a barrier command can be issued.
Regardless of the tag identifier, this command orders
all previously issued commands with respect to all
subsequently issued commands within the same
command queue. The CBEA does not provide a
mechanism to order commands with respect to commands
in the other command queue.
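For example, a fenced put can guarantee that a flag is never written before the data it guards, without constraining any other tag group. This is a sketch that assumes the mfc_put and mfc_putf composite intrinsics from the SDK's spu_mfcio.h; the names and sizes are hypothetical, and ordering with respect to other processors may additionally require the storage ordering commands described next:

    #include <stdint.h>
    #include <spu_mfcio.h>

    void publish(volatile void *data_ls, uint64_t data_ea,
                 volatile void *flag_ls, uint64_t flag_ea, unsigned int tag)
    {
        mfc_put(data_ls, data_ea, 4096, tag, 0, 0);   /* the payload */
        /* Fenced put: performed only after all previously issued
           commands in this tag group have been performed. */
        mfc_putf(flag_ls, flag_ea, 16, tag, 0, 0);
    }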
Storage access ordering
In addition to command ordering, software can also
require the EIB transfers generated by the DMA
command to be ordered with respect to other processors
and devices in the system. Although the command
modifiers and barrier command provide for command
ordering, they do not provide ordering of the bus
transfers with respect to other processors and devices in
the system. For this type of storage ordering, the mfceieio
and mfcsync commands are used in combination with the
command modifiers and barrier command. The mfceieio
command provides storage ordering that is similar to the
PowerPC eieio instruction; the mfcsync command
provides storage ordering that is similar to the PowerPC
sync instruction.
The ordering of storage accesses performed by two or
more processors with respect to another processor or
device is called cumulative ordering. The CBEA follows
the PowerPC rules for cumulative ordering when all
accesses are performed within the main storage and the
proper synchronization instructions and commands are
performed. Standard PowerPC rules do not apply when
the storage accesses are performed within the main
and local storage domains. The CBEA multisource
synchronization facility addresses this case. This facility
provides software with a mechanism to ensure that all
accesses from the main storage domain, targeting the
associated local storage domain, are complete with
respect to the SPU. Figure 3 illustrates the use of
the multisource synchronization facility.

Figure 3: Example of multisource synchronization.
PPE 1: (1) Store 0x1 to local storage location A. (2) sync. (3) Store 0x2 to main storage location B.
PPE 2: (1) Load from main storage location B returns 0x2. (2) sync. (3) Store 0x3 to local storage location C.
SPU 3: (1) Load from local storage location C returns 0x3. (2) Use the multisource synchronization facility. (3) Load from local storage location A is guaranteed to return 0x1.
Memory-mapped I/O space and channels
The CBEA defines several SPE facilities, some of which
have already been mentioned. A facility is a set of MMIO
registers or channels that provide a specialized function.
Facilities in the CBEA are accessed from the main storage
domain using MMIO registers or are accessed using the
channels defined by the CBEA. The facilities defined by
the CBEA provide a wide variety of functions ranging
from support of context save and restore to providing
synchronization with other SPEs and processors in the
system. The MMIO space and channels also provide
access to other miscellaneous functions.
This paper does not include all of the facilities provided
by the MMIO registers and channels, but it attempts to
give the reader an indication of what is available. Some
of the facilities provided are listed below:

• Local storage alias.
• Command issue.
• Tag-group completion facility.
• Multisource synchronization facility.
• Mailbox facility.
• Signal notification facility.
• SPU event facility and decrementer.
• Software management of TLBs.
Local storage alias facility
Within the MMIO space of the SPE is an area called the
local storage alias that is dedicated to local storage. It
provides direct access to the local storage domain of the
corresponding SPE from the main storage domain. The
local storage alias is used primarily to allow the direct
transfer of data from the local storage of one SPE to the
local storage of another SPE. For example, if an SPE
issues an MFC get command with an effective address
that maps to the local storage alias area of another SPE,
the data is transferred directly from the local storage of
the target SPE to the local storage of the issuing SPE.
Without an alias of the local storage in the main storage
domain, software would be forced to copy data through
system memory.
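From the PPE side, the alias can be located with the SDK's libspe2 library. The sketch below assumes spe_ls_area_get returns the local storage mapping on the platform in use and hands its effective address to another SPE through the inbound mailbox described later in this paper; the helper name is hypothetical:

    #include <stdint.h>
    #include <libspe2.h>

    /* Hand the effective address of one SPE's local storage to another SPE,
       which can then issue get/put commands against the local storage alias. */
    void share_local_storage(spe_context_ptr_t producer, spe_context_ptr_t consumer)
    {
        uint64_t ls_ea = (uint64_t)(uintptr_t)spe_ls_area_get(producer);
        unsigned int msg[2] = { (unsigned int)(ls_ea >> 32), (unsigned int)ls_ea };
        spe_in_mbox_write(consumer, msg, 2, SPE_MBOX_ALL_BLOCKING);
    }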
Command issue facility
The memory flow controller section of this paper
describes the MFC DMA commands. Commands are
issued from other processors in the system using a
sequence of MMIO register stores and loads. Commands
issued using MMIO registers are called MFC proxy
commands. Code running on an SPU issues commands
using a sequence of channel writes; these commands are
called MFC SPU commands.
MFC proxy commands are typically issued by the PPE
to initialize local storage efficiently before the SPU is
started. To issue an MFC proxy command, software
writes each parameter to the corresponding MMIO
register. Some parameters can be omitted, and the last
parameter written is the MFC command opcode. After
writing this parameter, software reads from the MFC
command status register to determine whether the
command was successfully placed in the MFC proxy
command queue. The MFC command opcode register
and the MFC command status register are mapped to
the same address. Therefore, software need not issue a
PowerPC eieio instruction between the last store of the
opcode and the read of the command status, since the
PowerPC Architecture requires that two accesses to the
same location be performed in program order. Queuing
a command can fail because of a command sequence
error or because of insufficient room in the MFC proxy
command queue. A command sequence error is typically
caused when the command issue sequence is interrupted
and the next process or interrupt handler issues an MFC
proxy command. The command sequence error response
eliminates the need to require a lock for the MFC proxy
queue when issuing a command. The error due to
insufficient room occurs when there is no available space
in the MFC proxy command queue. In most cases,
software can avoid this error by reading the MFC
command queue status register to ensure that there is
available space in the queue before issuing the command.
If either error is reported, software must repeat the
command issue sequence.
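The SDK's libspe2 library wraps this MMIO store-and-load sequence, including the status check and retry. A sketch (spe_mfcio_get and spe_mfcio_tag_status_read are libspe2 calls; the surrounding helper and its parameters are hypothetical):

    #include <libspe2.h>

    /* Preload an SPU program's data into local storage with an MFC proxy get. */
    void preload_local_storage(spe_context_ptr_t spe, unsigned int lsa,
                               void *src_ea, unsigned int size, unsigned int tag)
    {
        unsigned int status;
        /* Issues the proxy command, repeating the MMIO sequence on failure. */
        spe_mfcio_get(spe, lsa, src_ea, size, tag, 0, 0);
        /* Block until all commands in the selected tag group complete. */
        spe_mfcio_tag_status_read(spe, 1 << tag, SPE_TAG_ALL, &status);
    }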
The MFC SPU command sequence is a series of
channel write instructions. The last channel written in the
sequence is the MFC command opcode channel. This
channel is blocking and has a depth equal to the size of
the MFC SPU command queue. If the queue is full, the
channel write results in an SPU stall until space in the
queue is available. The stall is a very efficient mechanism
for power savings when useful work cannot be performed
until after the command has been queued. Software can
avoid the stall by using the procedure described in the
channel interface section or by using the event facility
in the SPE to interrupt the SPU when space is available.
The event facility is described in a subsequent section.
Tag-group completion facility
Data movement is performed asynchronously with
respect to the execution of the SPU program. Therefore,
the CBEA provides a tag-group completion facility so
that an application can determine when a specific
command or set of commands completes. Independent
facilities are available for both the MFC proxy and MFC
SPU command queues. A tag ID is provided with each
DMA command, and commands issued with the same
tag ID to the same queue are called a tag group. To
determine when tag groups are complete, a mask that
selects the tag groups of interest is written to the
tag-group mask channel or register. The status of the
selected groups is reported in a status register for the
MFC proxy command queue. For the MFC SPU
command queue, a query type is written to the MFC
write tag-group status update request channel. The
request can be that either ‘‘all selected tag groups
are complete’’ or ‘‘any one of the selected groups is
complete.’’ Once the query type is set, reading from the
blocking tag-group status channel causes the SPU to stall
until the query condition is met. Like the command issue
sequence, the stall can be avoided by first determining
the channel count value of the channel.
The CBEA also provides a way to generate an interrupt
when a tag group completes. For the MFC proxy
command queue, a similar query type is provided to
determine when the interrupt is generated. The interrupt
is typically routed to the PPE for processing. For the
MFC SPU command queue, the event facility provides
the interrupt when the condition is met.
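On the SPU side, the query types map onto intrinsics in the SDK's spu_mfcio.h. A sketch of waiting for ‘‘any one of the selected groups is complete’’ across two tag groups (the tag values are hypothetical):

    #include <spu_mfcio.h>

    /* Returns a bit mask identifying which of the selected tag groups completed. */
    unsigned int wait_for_either(unsigned int tag_a, unsigned int tag_b)
    {
        mfc_write_tag_mask((1 << tag_a) | (1 << tag_b));  /* select groups of interest */
        /* 'any' query: the read stalls until at least one selected group completes */
        return mfc_read_tag_status_any();
    }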
Multisource synchronization facility
In any multiprocessor system, the synchronization of
software processes is critical. The PowerPC Architecture
provides four instructions (lwarx, ldarx, stwcx, stdcx)
for performing atomic updates of system memory.
The atomic update of system memory is used to create
software locks, which are then used to achieve the desired
synchronization. Since the SPUs do not have direct access
to the main storage domain, the CBEA defines four MFC
commands (getllar, putllc, putlluc, putqlluc) that allow
an SPU to participate in the atomic update of system
memory. These commands are known as the MFC atomic
update commands. They can be issued only by an SPU to
the MFC SPU command queue; they cannot be issued
by other processors or devices. Unlike the PowerPC
instructions, these commands transfer the full reservation
granule. In most processors, the reservation granule is
typically the same size as the cache line.
The get lock line and reserve (getllar) command is
similar to the PowerPC lwarx and ldarx instructions, and
the put lock line conditional (putllc) command is similar
to the stwcx and stdcx PowerPC instructions. These
commands are not queued in the MFC like other MFC
commands; they are executed immediately but still
require an MFC command queue slot. Since these
commands are executed immediately, they do not support
the tag parameter, and only one of these commands can
be outstanding. The CBEA provides a blocking atomic
command status channel for determining when these
commands complete and the results of the commands.
The put lock line unconditional (putlluc) and put
queued lock line unconditional (putqlluc) commands are
similar to a cacheable PowerPC store instruction. These
commands are used to unconditionally release a lock.
The putlluc command is executed immediately, whereas
the putqlluc command has an implied fence and is placed
in the MFC command queue along with other MFC
commands. Since the putqlluc command is queued,
the tag parameter is supported, and software uses the
previously described tag-group completion facility to
determine when this command is complete.
A common synchronization method is for software
to wait for the value of the lock in system memory
to change. In the PowerPC, this requires software to
continually poll the memory location associated with the
lock. In the CBEA, an SPU event can be generated when
a reservation is lost. (The reservation is lost when another
processor modifies the memory location associated with
the lock.) SPU software can avoid constantly polling the
lock by simply obtaining a reservation on the memory
location and then waiting for the reservation to be lost
using the event facility. Additionally, an SPU interrupt
can be generated when the reservation is lost. While
the reservation lost event does not guarantee that the
lock value has changed, it very effectively eliminates
the constant polling of the lock using MFC DMA
commands.
Since the performance of atomic updates can suffer
when there is a high contention for the lock, this may not
be the optimal solution for all synchronization scenarios.
Therefore, the CBEA offers additional options for process
synchronization, such as the mailbox facility and the
signal notification facility. These facilities can be used as
alternatives to the atomic update MFC commands, or
they can be used in conjunction with them for process
synchronization.
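A sketch of a simple SPU spin lock built on these commands follows, using the atomic composite intrinsics from the SDK's spu_mfcio.h; it assumes the lock occupies the first word of a 128-byte reservation granule, the granule size of the Cell/B.E. processor:

    #include <stdint.h>
    #include <spu_mfcio.h>

    /* Local storage image of one reservation granule (128 bytes, 128-byte aligned). */
    static volatile unsigned int granule[32] __attribute__((aligned(128)));

    void lock_acquire(uint64_t lock_ea)
    {
        for (;;) {
            mfc_getllar(granule, lock_ea, 0, 0);    /* load granule and set reservation */
            (void)mfc_read_atomic_status();         /* blocking: wait for getllar to finish */
            if (granule[0] == 0) {                  /* lock currently free */
                granule[0] = 1;                     /* mark it taken */
                mfc_putllc(granule, lock_ea, 0, 0); /* conditional store of the granule */
                if ((mfc_read_atomic_status() & MFC_PUTLLC_STATUS) == 0)
                    return;                         /* reservation held: lock acquired */
            }
            /* Reservation lost or lock busy: retry (or wait on the
               lock line reservation lost event instead of spinning). */
        }
    }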
Mailbox facility
A key advantage of the CBEA is the independent nature
of the SPEs. Since the SPEs are decoupled from the PPE
and other SPEs, the CBEA provides a mailbox facility
to assist in process-to-process communication and
synchronization. The mailbox facility provides a simple,
unidirectional communication mechanism typically used
by the PPE to send short commands to the SPE and
to receive status in return. This facility consists of
an inbound, an outbound, and an outbound interrupt
mailbox. Each of these mailboxes can have a depth of one
or more entries. The direction of the mailbox is relative to
the SPU. For example, the inbound mailbox is written by
the PPE and read by the SPU. The SPU accesses the
mailboxes using the SPU channel instructions. The
mailbox channels are blocking, causing the SPU to stall if
the outbound mailbox is full or the inbound mailbox is
empty. For the PPE and other devices, MMIO registers
provide access to the mailboxes and the mailbox status.
As the name implies, the outbound interrupt mailbox
generates an interrupt when written by the SPU. The
interrupt is typically routed to the PPE for processing
and is presented as an external exception.
For example, the mailbox facility can be used for a
command-driven SPU application. In this example, the
SPU is typically stalled waiting for a command to be
placed in the inbound mailbox. When a command is
received, the SPU performs the requested operation given
in the mailbox data or sequence of mailbox data. Once
the operation is complete, the SPU places a return code
in the outbound interrupt mailbox. The write of the
outbound interrupt mailbox generates an interrupt for
the PPE that indicates the completion of the requested
operation. The SPU code then reads the next command
from the inbound mailbox and stalls if a new command is
not available. In addition to the application given in this
simple example, many other uses were contemplated
during the development of this facility.
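A sketch of the SPU side of such a command loop, using the mailbox intrinsics from the SDK's spu_mfcio.h; the shutdown code and handler are hypothetical:

    #include <spu_mfcio.h>

    extern unsigned int handle_command(unsigned int cmd);  /* hypothetical handler */

    void command_loop(void)
    {
        for (;;) {
            unsigned int cmd = spu_read_in_mbox();   /* SPU stalls until the PPE writes a command */
            if (cmd == 0)                            /* hypothetical shutdown code */
                return;
            unsigned int rc = handle_command(cmd);
            spu_write_out_intr_mbox(rc);             /* return code; raises a PPE interrupt */
        }
    }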
Signal notification facility
The signal notification facility is very similar to the
mailbox facility. It consists of two signal notification
registers. From the SPU, these registers are accessed using
channels; for other processors, these registers are mapped
into the MMIO space. In contrast to the mailbox facility,
the signal notification facility is inbound only, and the
corresponding channels are only one deep and are
blocking. Each of the signal notification registers has two
modes of operation, overwrite and logical OR. In the
overwrite mode, the contents of the register are replaced
with the new value written to the MMIO register even if
the SPU has not yet read the current value. In the logical
OR mode, the current contents of the signal notification
register are ORed with the new value written to the register.
In either mode, the signal notification register is set to
zero after it is read by the SPU.
Normal MFC ‘‘put’’ commands can be used to access
the signal notification facility in another SPE. However,
the MFC ‘‘send signal’’ commands are defined for this
purpose, allowing an implementation to use a more
efficient mechanism than MMIO for signaling between
SPUs. The signal notification facility offers an alternative
synchronization mechanism between SPUs and,
potentially, other devices. It can also be used in
conjunction with the other synchronization mechanisms.
For example, the logical OR mode can be used in a task
completion notification mechanism. In this example,
an SPU assigns a task to another processor; along
with each task, the SPU assigns a tag. When the task
completes, the processor or device performs the proper
synchronization to ensure that the results are visible. It
then writes a binary ‘‘1’’ in the signal notification
register bit corresponding to the tag. The overwrite mode
provides a fast communication when a single bit is not
sufficient for synchronization.
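A sketch of the logical OR task-completion scheme described above, using spu_read_signal1 from the SDK's spu_mfcio.h on the SPU side and spe_signal_write from libspe2 on the PPE side; the tag assignment is hypothetical:

    #include <spu_mfcio.h>

    /* SPU side: stalls until at least one task-completion bit is set;
       the read returns the accumulated OR of all signals sent since
       the last read and resets the register to zero. */
    unsigned int wait_for_completions(void)
    {
        return spu_read_signal1();
    }

    /* PPE side (libspe2), signaling completion of the task with tag n:
           spe_signal_write(spe, SPE_SIG_NOTIFY_REG_1, 1u << n);        */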
SPU event facility and decrementer facility
The SPU event facility provides software with a
convenient mechanism to wait for a selected set of
conditions. The facility can also interrupt the SPU if
one of these conditions occurs. The facility consists of an
SPU read event status channel, an SPU write event mask
channel, an SPU read event mask channel for reading back
the mask, and an SPU write event
acknowledgment channel. To use the facility, software
first writes the SPU write event mask channel to select the
conditions of interest. Next, the SPU read event status
channel is read. If none of the selected conditions have
occurred, the SPU stalls until at least one of the conditions
exists. The stall can be avoided either by reading the
channel count associated with the SPU read event status
channel before reading the channel or by enabling an
interrupt to occur when an event status is available. After
being notified that an event has occurred, the SPU
program should acknowledge the event by writing a
binary ‘‘1’’ to the corresponding bit in the SPU write event
acknowledgment channel. Acknowledging the event
enables the SPU to receive subsequent events for this
condition. The SPU read event mask channel is provided
for context save and restore operations. Software can also
use this channel to eliminate the need for a shadow copy of
the currently selected events.
As is apparent in the description of the other facilities,
events are defined for many of the stall conditions. The
event facility allows software to wait for multiple
conditions using a single blocking channel and can cause
the SPU to be interrupted. The SPU read event status
channel doubles as the interrupt status. The SPU event
facility supports the following events:

• MFC tag-group status update event.
• MFC DMA list command stall-and-notify event.
• MFC SPU command queue available event.
• SPU inbound mailbox event.
• SPU outbound mailbox event.
• SPU outbound interrupt mailbox event.
• SPU signal notification 1 event.
• SPU signal notification 2 event.
• Lock line reservation lost event.
• Multisource synchronization event.
• Privileged attention event.
• SPU decrementer event.
Except for the privileged attention event and the SPU
decrementer event, these events have been described in
the previous sections.
The privileged attention event allows another processor
or device to get the attention of an SPU program. Setting
the privileged attention bit in the SPU privileged control
register causes the privileged attention event in the SPU
event facility to be set. The SPU decrementer event is set
when the most significant bit of the SPU decrementer
changes from zero to one. The SPU software can either
wait for these events by reading the SPU read event status
channel or enable an interrupt to be generated on the
occurrence of any enabled event.
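A sketch of waiting on two conditions at once, using the event intrinsics and event mask bits from the SDK's spu_mfcio.h; the decrementer interval is hypothetical:

    #include <spu_mfcio.h>

    void wait_for_tick_or_command(void)
    {
        spu_write_decrementer(75000);    /* hypothetical interval, in decrementer ticks */
        spu_write_event_mask(MFC_DECREMENTER_EVENT | MFC_IN_MBOX_AVAILABLE_EVENT);

        unsigned int events = spu_read_event_status();  /* SPU stalls until a selected event occurs */
        spu_write_event_ack(events);                    /* acknowledge so the condition can recur */

        if (events & MFC_DECREMENTER_EVENT) {
            /* ... periodic work ... */
        }
        if (events & MFC_IN_MBOX_AVAILABLE_EVENT) {
            unsigned int cmd = spu_read_in_mbox();      /* will not stall: data is available */
            (void)cmd;
        }
    }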
Software management of TLBs facility
Almost all modern processor architectures support a
virtual address space that requires the use of an address
translation table. The PowerPC, which is no exception,
implements virtual addressing using an architected
hardware page table in system memory. For performance
reasons, processors usually implement a cache of the
translations in an on-chip array. In the PowerPC
Architecture and the CBEA, this array is called the TLB.
Processor architectures usually handle a miss of the
TLB with software or hardware. In the PowerPC
Architecture, the TLB is hardware managed.
In general, caches introduce uncertainty into the
performance of a system. For example, a program
accessing a cache that does not contain any of the
program data (a cold cache) executes more slowly than
when accessing a cache that contains the program data (a
warm cache). Architectures have addressed this problem
for data caches by providing cache management
instructions; some even provide software access to the
TLBs. Since the PowerPC Architecture does not provide
a method to control the contents of the translation cache,
there is typically a start-up penalty when a program
is first executed that is not acceptable for real-time
applications. The CBEA addresses this issue by adding
a software TLB management facility. This facility is
specified for both the PPE and the SPE. When used in
conjunction with hardware management of the TLBs, an
operating system can preload or restore an application’s
translations to reduce the start-up effects.
A second issue with hardware-managed translations is
the fixed size and structure of the hardware page table.
In addition, all processors within the same operating
system partition typically share the same physical
hardware page table. With software management, an
operating system can choose to manage the TLBs in
software for any number of the SPEs or PPEs in a
CBEA-compliant processor. This allows an operating
system to have a specialized page table structure for an
application that is independent of the structure required
by the architecture and hardware.
Cache replacement management facility
As mentioned in the previous section, caches introduce
uncertainty into the performance of a system. Not only is
this uncertainty due to a cold cache at start-up, but it can
also be caused by a characteristic of the application. For
example, if a streaming application reads a very long
sequential data structure, the contents of the cache are
replaced with the infrequently accessed streaming data,
thereby creating a cold cache. The cache management
instructions offered by architectures cannot prevent the
streaming data effects on the cache. Other processor
architectures have offered methods to lock data into
a cache, but they still do not address the effects of
streaming. Since streaming media is a major application
area for a CBEA-compliant processor, this issue had to
be addressed.
Caches are usually not an architectural data element;
thus, the ‘‘Cell Broadband Engine Architecture’’
document [1] focuses only on the need for such a cache
replacement management facility. An example in this
document guides the processor designers with their
implementation choices.
Using the example implementation outlined in [1],
an operating system creates a set of replacement
management class identifiers (RclassIDs) and an
associated replacement management table (RMT). For
the PPE, the RclassID is generated from the address of
the data accessed using a set of address-range registers.
Separate ranges are provided for instruction and data
fetches in the PPE. The RclassID is provided as an
explicit parameter in the MFC commands.
The RclassID is used as an index into the RMT. The
RMT entry selects the sets in the cache (assuming a
set-associative cache) that can be replaced if all of the
entries in the congruence class are valid and the requested
data does not reside in the cache. If the data resides in the
cache in a different set, the data simply is accessed. If the
data does not exist in the cache and there is a nonvalid
entry in the congruence class, an implementation can
choose to place the data in the invalid entry even if it does
not correspond to the set selected by the RMT.
The cache replacement management facility provides
an operating system with two key methods for managing
both the data and translation caches, i.e., the TLBs. The
first method prevents streaming data from flushing a
cache. Accesses to streaming data are tagged with an
RclassID that allows the replacement of only a few sets,
typically one. This means that the data can be streamed
through the cache without replacing older data or
translations that may be required by the operating system
or application.
The second method allows an operating system to lock
data or translations in a cache. This method is beneficial
for many types of data and helps to reduce the caching
effects inherent in hardware-managed caches. For
example, an RTOS can use this facility for a critical PPE
interrupt handler.
As mentioned, controlling the interrupt latency is very
important for an RTOS. By locking the translations to
the key interrupt handlers in the TLBs and locking the
interrupt code itself into the caches, an RTOS can reduce
the uncertainty of processing an interrupt. Locking data
or translations in a cache is achieved by ensuring that
only one RMT entry allows for a given cache set to be
replaced. In the example above, software should ensure
that only the address range of the interrupt routine can
select the RMT entry. If these conditions are met, no
other access will cause the interrupt handler instructions
or translations to be removed from the cache.
To eliminate the start-up cost of the very first interrupt,
the RTOS can warm the cache state by pre-touching
the interrupt handler when it is first loaded by using
the cache management instructions. When used in
conjunction with the mediated external exception
extension and cache management instructions, the cache
replacement management facility essentially eliminates
the uncertainty in the execution of an interrupt handler.
The cache replacement management facility is very
flexible, providing for both locking data in a cache and
reducing the effects of streaming data on caches. While
this facility is implementation dependent, the CBEA
recommends that an implementation provide an RMT
for each major caching structure, including the TLBs.
Internal interrupt controller
Processor architectures typically do not define an
interrupt controller. However, in the CBEA all exceptions
generated by an SPE are typically handled by the PPEs in
the system. The architecture would not be complete if it
did not define how these interrupts are presented to the
PPE. Therefore, the CBEA defines an internal interrupt
controller (IIC) that provides a method to present an
interrupt generated by an SPE to the PPE as quickly as
possible. The IIC has a sufficient set of features to act
as the interrupt controller for the system. If a different
feature set is required, the IIC can coexist with a separate
system interrupt controller.
The SPU-generated exceptions include the following:
internal SPU errors, errors created by the execution of an
MFC command, address translation faults, address
compare matches, mailbox interrupts, interrupts
generated by the execution of an SPU stop-and-signal
instruction, and tag-group completion. SPU exceptions
cause a PPE external interrupt to occur. The CBEA
defines a set of routing registers, interrupt mask registers,
and interrupt status registers for each SPE. The routing
registers are used to route the interrupt to the PPE that
will handle the interrupt. The interrupt mask and status
registers are used to enable interrupt conditions and to
record the cause of the interrupt. The IIC allows software
to prioritize SPE exceptions with other external interrupts
generated by the system.
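The following sketch suggests how PPE privileged code might service an SPE interrupt through these registers. The class-based mask and status register names follow the CBEA, but the MMIO layout, the base address, and the write-one-to-clear semantics shown here are assumptions of this example, not architectural definitions.

    #include <stdint.h>

    typedef struct {
        volatile uint64_t int_mask_class[3];   /* enable conditions */
        volatile uint64_t int_stat_class[3];   /* record the cause  */
    } spe_int_regs_t;   /* hypothetical layout of the per-SPE registers */

    void service_spe_interrupt(spe_int_regs_t *spe, int cls)
    {
        uint64_t cause = spe->int_stat_class[cls];
        /* ...dispatch on the cause bits: MFC command error,
           translation fault, mailbox, stop-and-signal, or
           tag-group completion... */
        spe->int_stat_class[cls] = cause;   /* assumed write-one-to-clear */
    }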
The interrupt structure in the CBEA is flexible,
allowing SPE exceptions to be routed to an IIC or
directly to the system interrupt controller. Routing an
SPE exception to the system interrupt controller allows
for an alternative priority to be employed by the system
designer. In addition to handling SPE exceptions, the
IIC provides an efficient mechanism for generating
interprocessor interrupts. Interprocessor interrupts are
generated by a PPE, targeting another PPE in the system.
Resource allocation management
Real-time operating systems were a key consideration
during the development of the CBEA and were the
primary reason for including features such as cache
replacement management and software management of
TLBs in the architecture. These features help software
maintain real-time guarantees in a single-processor
environment, but they do little for the effects created by
having multiple processors. In a multiprocessor system,
critical resources such as system memory can be
consumed by relatively few devices, which can create
latency and bandwidth issues for the remaining devices.
To address this issue, the CBEA includes a resource
allocation management (RAM) facility.
With RAM enabled, each requester in a CBEA-compliant
processor is assigned to a resource allocation group
(RAG). A portion of each critical resource is allocated to
each RAG. The definition provided in the CBEA ends
here. The remaining details of the facility, such as the
resources controlled and the controlling mechanism, are
highly implementation specific. The rest of this section
describes the Cell/B.E. processor implementation of
RAM. The Cell/B.E. processor is the first implementation
of the CBEA.
In the Cell/B.E. processor, the resources controlled by
RAM are system memory and the I/O interfaces. A
token-based control mechanism allocates a percentage
of the managed resources to each RAG. Before a
requester can access a managed resource, a token must be
requested from a central token manager. The hypervisor
initializes the token manager to generate tokens for each
RAG at a rate equivalent to the allocated percentage
of the resource. Requesters compete for the tokens
generated for their RAG using a round-robin priority
scheme. During certain intervals, the frequency of
requests from a given RAG may be lower than the
allocated token rate; this can result in unused tokens for
some RAGs, while other RAGs are throttled by their
allocated token rate. The Cell/B.E. design allows the
unused tokens to be given to another RAG, which
prevents waste or underutilization of a resource.
Typically, software never allocates 100% of a resource in
order to account for variations in system performance.
This is another source of potential waste. To assist in
reducing this waste, the Cell/B.E. processor design allows
any RAG to use the unallocated percentage of a resource.
The allocation of unused tokens and the unallocated
portion of a resource is programmable and can be
independently disabled by an operating system.
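As a rough illustration of the rate computation, with every number hypothetical except the 128-byte token granule described later in this section:

    /* Tokens generated per second for a RAG that is allocated pct
       percent of a resource with the given raw bandwidth, where
       one token corresponds to a 128-byte line transfer. */
    static double tokens_per_second(double pct, double bytes_per_second)
    {
        return (pct / 100.0) * bytes_per_second / 128.0;
    }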
As mentioned above, 100% of a resource is never
allocated in order to account for unexpected performance
variations. Allowing unallocated and unused tokens to be
used can result in overutilization of a resource. The
Cell/B.E. processor implementation addresses this issue
by automatically suspending the allocation of these
tokens when a resource becomes congested. This is
known as backpressure from the resource.
At first glance, RAM appears to be a bandwidth
management system. However, the bandwidth and
latency of system memory are very dependent on
the access size and access pattern. For example, the
bandwidth is considerably lower if an application
continually accesses addresses corresponding to the same
physical bank of memory. To avoid this problem, the
RAM facility requires a resource to obtain a token for
each physical memory bank. Software sets up a single
percentage allocation for memory, and the token
manager distributes the allocation equally among all
memory banks. Therefore, if the memory access pattern
of an application is evenly distributed, the achievable
memory bandwidth is higher than if the accesses are
poorly distributed. An application can also cause poor
utilization of system memory if it uses small accesses.
Typically, caches prevent small accesses, so they
are not a concern for most systems. However, in a
CBEA-compliant processor, MFC DMA commands can
transfer blocks smaller than a cache line. The Cell/B.E.
processor addresses this issue in two ways. First, each
token is defined as the transfer of a full cache line, or 128
bytes, regardless of the actual transfer size. Second, the
requester is required to obtain two tokens if the transfer
size requires a read–modify–write of system memory
to update the error-correction code (ECC)—one
for the read and one for the write.
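A simplified reading of these two rules is sketched below for a put (write) command; the ECC granule size is an assumption of the example.

    /* Tokens needed for one DMA write under the Cell/B.E. RAM
       rules above: one token per 128-byte line touched, doubled
       when the transfer cannot update whole ECC granules and thus
       forces a read-modify-write. The 16-byte granule is assumed. */
    static unsigned tokens_for_put(unsigned size, unsigned offset)
    {
        unsigned first = offset & ~127u;
        unsigned last  = (offset + size - 1) & ~127u;   /* size >= 1 */
        unsigned lines = (last - first) / 128 + 1;
        int needs_rmw  = (size % 16) != 0 || (offset % 16) != 0;
        return needs_rmw ? 2 * lines : lines;
    }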
Thus, the effective system memory bandwidth is a
function of the access pattern and access size. Without the
RAM, the accesses performed by one application can
affect the system memory bandwidth for all applications.
RAM provides an effective mechanism that prevents
applications with poor access patterns from affecting
the performance of other applications. RAM does not
provide a guarantee of memory bandwidth, but it does
guarantee access to a resource. Because the effective
bandwidth achieved is still dependent on the access
pattern, the facility is known as RAM instead of
bandwidth reservation.
Power management
Power management is always a key consideration in the
development of any architecture. Some channels in the
SPE are defined as blocking, which essentially allows an
SPU to enter a very low-power state while waiting for the
channel to become available. The CBEA also defines the
following set of more traditional power management
states: active, slow(n), pause(n), state retained and
isolated (SRI), and state lost and isolated (SLI).
Each element in a CBEA-compliant processor can be
in any one of the five defined states. In the active state,
the element runs at full performance level. No power
management is applied except for the dynamic power
management provided by blocking channels or the
gating of nonactive circuits found in most advanced
designs.
Several levels are allowed for the slow(n) state, with a
higher value of n representing more aggressive power
savings. As the name implies, the element executes at a
slower rate; thus, the performance is degraded but still
guaranteed to make forward progress. An example of
a slow state is running an element at a slower clock
frequency to save power. The pause(n) state is similar
to the slow(n) state except that forward progress is not
guaranteed. A simple example of a pause state is the
situation in which an SPU is stopped but the MFC is
still active.
In the three power management states described above,
the context of the element is maintained. For example, the
TLBs and caches remain coherent with system memory
and other elements in the system. In the SRI and SLI
states, the element no longer interacts with other elements
in the system; thus, the full context for the element cannot
be maintained. (These power management states should
not be confused with the SPU isolation facility described
in the next section.) Software must prepare the element
before entering or leaving these states. The SRI state
allows software to keep the bulk of the context while
saving most of the power of the element. An example of
the SRI state is the case in which the SPE is isolated from
the system and the contents of the local storage and
MMIO registers are maintained. Coherency of the TLBs
and caches is not maintained. The SLI state is the most
aggressive power-saving state. In this state, nothing is
maintained, including the local storage, registers, and
caches. An example of this state is the situation in which
power is removed from an SPE or PPE. The actual
implementation and supported power management states
are implementation specific. Therefore, the CBEA does
not define the details.
Isolated execution environment
The CBEA includes an SPU isolation facility that is used
to create a secure execution environment within an SPE.
Figure 4 is a state diagram illustrating the transitions into
and out of the SPU isolated execution environment.
To enter an SPU isolated execution environment,
a pointer to the code to execute in the secure execution
environment is loaded into the SPU signal notification
registers 1 and 2, and the SPU is started using a Load
request. When it receives the Load request, the SPE first
sets up the isolated execution environment by removing
access to a portion of the local storage. The remaining
portion of local storage remains open. All accesses to
the local storage alias and MFC DMA commands are
forced into the open area. After the isolated environment
is initialized, the SPE loads the application into the
isolated area of local storage. As it is loaded, the code
is authenticated and decrypted by using a hardware
root key. If the load succeeds, the application is executed.
Once in the SPU isolated execution environment, the only
way out is to stop the SPU and restart it with an
Exit request. If the Load fails, the SPU remains halted in
a transition state. The only way out of the transition state
is to restart the SPU by using either another Load
request or an Exit request.
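From the PPE, the SDK's SPE runtime management library (libspe2) wraps this Load sequence behind its SPE_ISOLATE flag. A hedged sketch follows: cell_isolated_app is a placeholder for a signed, encrypted image prepared for the hardware root key.

    #include <libspe2.h>
    #include <stdio.h>

    extern spe_program_handle_t cell_isolated_app;  /* placeholder image */

    int run_isolated(void)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop;

        /* SPE_ISOLATE requests the secure-load sequence: the image
           is authenticated and decrypted into the isolated region
           of local storage before it runs. */
        spe_context_ptr_t ctx = spe_context_create(SPE_ISOLATE, NULL);
        if (ctx == NULL || spe_program_load(ctx, &cell_isolated_app) != 0) {
            perror("isolated load");
            return -1;
        }
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, &stop) < 0)
            perror("isolated run");  /* a failed Load leaves the SPU
                                        halted in a transition state */
        spe_context_destroy(ctx);
        return 0;
    }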
In addition to the isolated execution environment, the
SPU isolation facility provides an application with a
random number generator and persistent storage.
[Figure 4: SPU isolated state transitions. The diagram shows the SPU nonisolated execution environment (SPU stopped, SPU running), the transition states, and the SPU isolated execution environment, in which the SPU and local storage are isolated from the system (SPU isolated and stopped, SPU isolated and running); Load, Run, Stop, and Exit requests drive the transitions, and a failed Load leaves the SPU in a transition state.]
The persistent storage maintains its state between isolation
sessions and can be accessed only by an application
executing in the SPU isolated execution environment.
Software development toolkit
To provide software developers exposure to the CBEA
with and without access to hardware, a Software
Development Kit (SDK) is available on the IBM
alphaWorks* Web site (www.alphaworks.ibm.com/topics/
cell) with open-source content distributed on the
Barcelona Supercomputing Center Web site (www.bsc.es/
projects/deepcomputing/linuxoncell).
The SDK is continuously being enhanced with
additional components and features. The key
foundational components of the SDK are as follows:
 A Cell/B.E. processor–enabled Linux kernel
supporting the unique CBEA features.
 The IBM full system simulator for the Cell/B.E.
processor that supports simulation of either
uniprocessor or dual-processor systems. Both
functional and cycle-accurate simulation modes are
provided so that programmers can maximize either
interactivity or timing accuracy.
 A system root image (sysroot) that provides a
standard set of Linux operating system utilities and
services for use in the simulated Linux operating
system environment.
 Standard CBEA-specific libraries including the newlib
C language standard library, SIMD math libraries,
and an SPE runtime management library that exposes
the SPEs as heterogeneous POSIX threads (a usage
sketch follows this list).
 Code-generation tools including GNU and IBM XL
C/C++ compilers. The compilers support the
automatic generation of SIMD code for both the PPE
and the SPE.
 Productivity tools such as an Eclipse-based
integrated development environment (IDE). The
IDE provides an integration framework for building,
deploying, and managing CBEA software.
 Cell/B.E. processor–enabled debug tools including the
GNU debugger (gdb). The debugger has been
enhanced to enable combined PPU and SPU
debugging and to provide user access to the unique
states of the Cell/B.E. processor.
 Performance analysis tools including the OProfile
system profiling tool, the Cell/B.E. processor
performance counter, which provides access to
Cell/B.E. processor performance monitoring facilities,
a code analyzer, the Feedback-Directed Program
Restructuring (FDPR-pro) technology, and a
static performance analysis tool called spu_timing.
The output from many of the tools can be visualized
using the Eclipse-based visual performance analyzer
(VPA).
 An SPE executive for invoking SPU executables
directly from a Linux shell. The executive provides
standard system services including file I/O, shared
memory, memory map, and time of day.
 Many code samples that demonstrate programming
techniques, optimized libraries with source code to
jump-start application development, and workloads
to demonstrate the computational capabilities of the
Cell/B.E. processor.
 Programming model frameworks such as the
Accelerated Library Framework (ALF), which
provides services to assist in the development of
parallel applications and libraries on multicore
architectures with hierarchical memory such as
the CBEA.
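The usage sketch below shows the libspe2 pattern referenced in the list: each SPE context is driven by an ordinary POSIX thread on the PPE. The embedded SPU program handle spu_kernel is a placeholder.

    #include <libspe2.h>
    #include <pthread.h>

    extern spe_program_handle_t spu_kernel;   /* assumed embedded image */

    static void *spe_thread(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (ctx == NULL || spe_program_load(ctx, &spu_kernel) != 0)
            return 1;
        pthread_create(&tid, NULL, spe_thread, ctx);  /* SPE runs asynchronously */
        pthread_join(tid, NULL);
        spe_context_destroy(ctx);
        return 0;
    }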
The SDK is supported on the x86, PowerPC 64, and
Cell/B.E. processor blade systems running Fedora** core
for the Linux operating system. These configurations
minimize system costs, ensuring wide access to anyone
who wants to evaluate Cell/B.E. technology, learn
Cell/B.E. programming, or develop Cell/B.E. applications
and tools.
Software standards for CBEA
Critical to the success of the CBEA is the adoption of
software standards and conformance to these standards.
From the outset, the Sony Group, Toshiba, and IBM
worked together to establish an initial set of standards
including the following:
 Application binary interface (ABI) specifications
including the SPU Application Binary Interface
Specification and the CBE Linux Reference
Implementation ABI specification.
 Language specifications including the SPU Assembly
Language Specification and the C/C++ Language
Extensions for Cell Broadband Engine Architecture
specification (see the example after this list).
 Standardized library specifications including the SPE
Runtime Management Library and the SIMD Vector
Math Library specifications.
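The example promised above gives a flavor of the C/C++ Language Extensions: SIMD vector types and generic SPU intrinsics, portable across the SDK's GNU and XL compilers.

    #include <spu_intrinsics.h>

    /* a * x + y over four single-precision lanes, using the
       fused multiply-add intrinsic from the language extensions */
    vector float saxpy4(float a, vector float x, vector float y)
    {
        return spu_madd(spu_splats(a), x, y);
    }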
Adherence to these specifications ensures that
application code is portable and that tools produce
binaries that will coexist. To ensure compliance with the
standards, conformance tests have been developed and
made generally available.
Conclusions: The power of the Cell/B.E.
processor
The real power behind the CBEA is the optimization
made possible by the heterogeneous design and by the
DMA-provided memory movement that hides memory
latency. While initially directed toward media
and streaming applications, the flexibility of the
architecture allows for a multitude of application and
programming techniques. For example, several papers
describe CBEA applications that include features such as
software-managed caches implemented in the local
storage and software threading of the SPEs [8].
Additionally, a generalized implementation of a software
cache is provided in the SDK.
A formal request for change (RFC) process governs
changes and additions to the architecture. All RFCs
require unanimous
acceptance by the three partner companies—the Sony
Group, Toshiba, and IBM. Each of the partner
companies serves as an advocate to ensure that its
business partners are represented in the RFC process.
The CBEA offers a wide variety of features to improve
the performance of systems requiring RTOS, streaming,
and computing applications. In addition, the CBEA
defines a unique approach to security by offering an
SPU isolation facility. When used properly, this facility
provides an isolated execution environment that is not
accessible from other elements or by external means. The
features and flexibility of the CBEA combined with the
new development environment and software standards
create a processor architecture and ecosystem poised
to set the standard for application acceleration.
Acknowledgments
The CBEA and the Cell/B.E. processor (the first
implementation of the architecture) resulted from a
deep collaboration of engineers from Sony Computer
Entertainment Incorporated, Toshiba, and IBM. Each
of the partners brought to the team a unique set of
requirements, experience, and perspectives. The diversity
of the team enabled it to look beyond the traditional
processor architectures and develop the heterogeneous
processing structure provided by the CBEA. More than
five years and countless person-hours went into the
development of the CBEA and the Cell/B.E. processor. The
authors also wish to thank the anonymous reviewers for
their suggestions for improving the content of this paper.
*Trademark, service mark, or registered trademark of
International Business Machines Corporation in the United States,
other countries, or both.
**Trademark, service mark, or registered trademark of
Linus Torvalds or Red Hat, Inc., in the United States, other
countries, or both.
Cell Broadband Engine is a trademark of Sony Computer
Entertainment, Inc., in the United States, other countries, or both.
References
1. Cell Broadband Engine Architecture; see http://www.ibm.
com/chips/techlib/techlib.nsf/techdocs/
1AEEE1270EA2776387257060006E61BA.
2. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R.
Maeurer, and D. Shippy, ‘‘Introduction to the Cell
Multiprocessor,’’ IBM J. Res. & Dev. 49, No. 4/5, 589–604
(2005).
3. PowerPC Architecture Book, Version 2.02; see http://
www.ibm.com/developerworks/power/library/pa-archguidev2/.
4. Synergistic Processor Unit Instruction Set Architecture; see
http://www.ibm.com/chips/techlib/techlib.nsf/techdocs/
76CA6C7304210F3987257060006F2C44.
5. B. Flachs, S. Asano, S. H. Dhong, H. P. Hofstee, G. Gervais,
R. Kim, T. Le, et al., ‘‘The Microarchitecture of the Synergistic
Processor for a Cell Processor,’’ IEEE J. Solid-State Circuits
41, No. 1, 63–70 (2006).
6. D. A. Brokenshire, ‘‘Maximizing the Power of the Cell
Broadband Engine Processor: 25 Tips to Optimal Application
Performance’’; see http://www.ibm.com/developerworks/power/
library/pa-celltips1/.
7. D. Krolak, ‘‘Just Like Being There: Papers from the Fall
Processor Forum 2005: Unleashing the Cell Broadband Engine
Processor: The Element Interconnect Bus’’; see http://
www.ibm.com/developerworks/power/library/pa-fpfeib/
index.html.
8. C. Benthin, I. Wald, M. Scherbaum, and H. Friedrich, ‘‘Ray
Tracing on the Cell Processor’’; see http://graphics.cs.uni-sb.de/
~benthin/cellrt06.pdf.
Received July 22, 2006; accepted for publication
March 1, 2007; Internet publication August 9, 2007
Charles R. Johns IBM Systems and Technology Group,
11400 Burnet Road, Austin, Texas 78758 (crjohns@us.ibm.com).
Mr. Johns is a Senior Technical Staff Member in the
Sony/Toshiba/IBM Design Center. He received his B.S. degree in
electrical engineering from the University of Texas at Austin in
1984. After joining IBM Austin in 1984, Mr. Johns worked on
various disk, memory, voice communication, and graphics
adapters for the IBM Personal Computer. From 1988 until he
transferred to the STI project in 2000, he was part of the graphics
organization and was responsible for the architecture and
development of entry and midrange 3D graphics adapters and
raster engines. Mr. Johns is now responsible for the CBEA; he
participated in the development of the Cell/B.E. processor, which is
the first implementation of the CBEA. Mr. Johns is an IBM Master
Inventor.
Daniel A. Brokenshire IBM Systems and Technology Group,
11400 Burnet Road, Austin, Texas 78758 (brokensh@us.ibm.com).
Mr. Brokenshire is a Senior Technical Staff Member with six years
of experience in the Cell/B.E. Processor Design Center. He
currently serves as a senior member of the IBM Cell/B.E. Processor
Systems Enablement team working on the Cell/B.E. processor
SDK. His responsibilities include the development of programming
standards, language extensions, reusable software libraries, and
software documentation. Mr. Brokenshire received a B.S. degree
in computer science and B.S. and M.S. degrees in electrical
engineering, all from Oregon State University. Prior to his work on
the Cell/B.E. processor, he enjoyed a productive career developing
3D graphics products for Tektronix, Inc., and IBM.
More Related Content

Similar to IBM: Introduction to the Cell Broadband Engine Architecture (20)

PDF
Synergistic processing in cell's multicore architecture
Michael Gschwind
 
PPT
Low cost embedded system
ece svit
 
PDF
Power overview 2018 08-13b
Ganesan Narayanasamy
 
PPTX
Power8 sales exam prep
Jason Wong
 
PDF
IBM POWER8: The first OpenPOWER processor
Slide_N
 
PPT
Parallel processing
Syed Zaid Irshad
 
PDF
Ch2 embedded processors-i
Ankit Shah
 
PPT
Technology
Nishant Rayan
 
PPT
Technology
Nishant Rayan
 
PPT
Technology
Nishant Rayan
 
PPTX
Operating System introduction topics for beginners
divyammo
 
PDF
Cell Technology for Graphics and Visualization
Slide_N
 
PDF
Advanced High-Performance Computing Features of the OpenPOWER ISA
Ganesan Narayanasamy
 
PDF
IBM POWER8 Systems Technology Group Development
Slide_N
 
PPT
12429908.ppt
ssuser4ca1eb
 
PDF
Os Madsen Block
oscon2007
 
PDF
Heterogeneous Computing : The Future of Systems
Anand Haridass
 
PDF
POWER ISA introduction and what’s new in ISA V3.1 (Overview)
Ganesan Narayanasamy
 
PPT
Ibm cell
Vyanktesh Dorlikar
 
PDF
Chip Multiprocessing and the Cell Broadband Engine.pdf
Slide_N
 
Synergistic processing in cell's multicore architecture
Michael Gschwind
 
Low cost embedded system
ece svit
 
Power overview 2018 08-13b
Ganesan Narayanasamy
 
Power8 sales exam prep
Jason Wong
 
IBM POWER8: The first OpenPOWER processor
Slide_N
 
Parallel processing
Syed Zaid Irshad
 
Ch2 embedded processors-i
Ankit Shah
 
Technology
Nishant Rayan
 
Technology
Nishant Rayan
 
Technology
Nishant Rayan
 
Operating System introduction topics for beginners
divyammo
 
Cell Technology for Graphics and Visualization
Slide_N
 
Advanced High-Performance Computing Features of the OpenPOWER ISA
Ganesan Narayanasamy
 
IBM POWER8 Systems Technology Group Development
Slide_N
 
12429908.ppt
ssuser4ca1eb
 
Os Madsen Block
oscon2007
 
Heterogeneous Computing : The Future of Systems
Anand Haridass
 
POWER ISA introduction and what’s new in ISA V3.1 (Overview)
Ganesan Narayanasamy
 
Chip Multiprocessing and the Cell Broadband Engine.pdf
Slide_N
 

More from Slide_N (20)

PDF
IBM: Introduction to the Cell Multiprocessor
Slide_N
 
PDF
AMD: The Next Generation of Microprocessors
Slide_N
 
PDF
Cryptologic Applications of the PlayStation 3: Cell SPEED
Slide_N
 
PDF
Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...
Slide_N
 
PDF
Roadrunner: Heterogeneous Petascale Computing for Predictive Simulation
Slide_N
 
PDF
Driving a Hybrid in the Fast-lane: The Petascale Roadrunner System at Los Alamos
Slide_N
 
PDF
Petascale Visualization: Approaches and Initial Results
Slide_N
 
PDF
The Cell at Los Alamos: From Ray Tracing to Roadrunner
Slide_N
 
PDF
Roadrunner and hybrid computing - Conference on High-Speed Computing
Slide_N
 
PDF
Roadrunner Tutorial: An Introduction to Roadrunner and the Cell Processor
Slide_N
 
PDF
Deferred Pixel Shading on the PlayStation 3
Slide_N
 
PDF
POWER9: IBM’s Next Generation POWER Processor
Slide_N
 
PDF
Efficient Usage of Compute Shaders on Xbox One and PS4
Slide_N
 
PDF
Future Commodity Chip Called CELL for HPC
Slide_N
 
PDF
Common Software Models and Platform for Cell and SpursEngine
Slide_N
 
PDF
Toshiba's Approach to Consumer Product Applications by Cell and Desire/Challe...
Slide_N
 
PDF
Towards Cell Broadband Engine - Together with Playstation
Slide_N
 
PDF
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
Slide_N
 
PDF
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Slide_N
 
PDF
Experiences with PlayStation VR - Sony Interactive Entertainment
Slide_N
 
IBM: Introduction to the Cell Multiprocessor
Slide_N
 
AMD: The Next Generation of Microprocessors
Slide_N
 
Cryptologic Applications of the PlayStation 3: Cell SPEED
Slide_N
 
Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...
Slide_N
 
Roadrunner: Heterogeneous Petascale Computing for Predictive Simulation
Slide_N
 
Driving a Hybrid in the Fast-lane: The Petascale Roadrunner System at Los Alamos
Slide_N
 
Petascale Visualization: Approaches and Initial Results
Slide_N
 
The Cell at Los Alamos: From Ray Tracing to Roadrunner
Slide_N
 
Roadrunner and hybrid computing - Conference on High-Speed Computing
Slide_N
 
Roadrunner Tutorial: An Introduction to Roadrunner and the Cell Processor
Slide_N
 
Deferred Pixel Shading on the PlayStation 3
Slide_N
 
POWER9: IBM’s Next Generation POWER Processor
Slide_N
 
Efficient Usage of Compute Shaders on Xbox One and PS4
Slide_N
 
Future Commodity Chip Called CELL for HPC
Slide_N
 
Common Software Models and Platform for Cell and SpursEngine
Slide_N
 
Toshiba's Approach to Consumer Product Applications by Cell and Desire/Challe...
Slide_N
 
Towards Cell Broadband Engine - Together with Playstation
Slide_N
 
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
Slide_N
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Slide_N
 
Experiences with PlayStation VR - Sony Interactive Entertainment
Slide_N
 
Ad

Recently uploaded (20)

PPTX
Farrell_Programming Logic and Design slides_10e_ch02_PowerPoint.pptx
bashnahara11
 
PPTX
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
PDF
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
PDF
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
PDF
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
PDF
Responsible AI and AI Ethics - By Sylvester Ebhonu
Sylvester Ebhonu
 
PDF
Per Axbom: The spectacular lies of maps
Nexer Digital
 
PPTX
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PDF
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
PDF
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
PPTX
The Future of AI & Machine Learning.pptx
pritsen4700
 
PDF
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PDF
RAT Builders - How to Catch Them All [DeepSec 2024]
malmoeb
 
PDF
Brief History of Internet - Early Days of Internet
sutharharshit158
 
PDF
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
PDF
Peak of Data & AI Encore - Real-Time Insights & Scalable Editing with ArcGIS
Safe Software
 
PDF
State-Dependent Conformal Perception Bounds for Neuro-Symbolic Verification
Ivan Ruchkin
 
PDF
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
PDF
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
Farrell_Programming Logic and Design slides_10e_ch02_PowerPoint.pptx
bashnahara11
 
OA presentation.pptx OA presentation.pptx
pateldhruv002338
 
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
Make GenAI investments go further with the Dell AI Factory
Principled Technologies
 
Presentation about Hardware and Software in Computer
snehamodhawadiya
 
Responsible AI and AI Ethics - By Sylvester Ebhonu
Sylvester Ebhonu
 
Per Axbom: The spectacular lies of maps
Nexer Digital
 
What-is-the-World-Wide-Web -- Introduction
tonifi9488
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
How Open Source Changed My Career by abdelrahman ismail
a0m0rajab1
 
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
The Future of AI & Machine Learning.pptx
pritsen4700
 
Data_Analytics_vs_Data_Science_vs_BI_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
RAT Builders - How to Catch Them All [DeepSec 2024]
malmoeb
 
Brief History of Internet - Early Days of Internet
sutharharshit158
 
Tea4chat - another LLM Project by Kerem Atam
a0m0rajab1
 
Peak of Data & AI Encore - Real-Time Insights & Scalable Editing with ArcGIS
Safe Software
 
State-Dependent Conformal Perception Bounds for Neuro-Symbolic Verification
Ivan Ruchkin
 
A Strategic Analysis of the MVNO Wave in Emerging Markets.pdf
IPLOOK Networks
 
CIFDAQ's Market Wrap : Bears Back in Control?
CIFDAQ
 
Ad

IBM: Introduction to the Cell Broadband Engine Architecture

  • 1. Introduction to the Cell Broadband Engine Architecture C. R. Johns D. A. Brokenshire This paper provides an overview of the Cell Broadband Enginee Architecture (CBEA). The CBEA defines a revolutionary extension to a more conventional processor organization and serves as the basis for the development of microprocessors targeted at the computer entertainment, multimedia, and real-time market segments. In this paper, the organization of the architecture is described, as well as the instruction set, commands, and facilities defined in the architecture. In many cases, the motivation for these facilities is explained and examples are provided to illustrate their intended use. In addition, this paper introduces the Software Development Kit and the software standards for a CBEA-compliant processor. Overview The Cell Broadband Engine Architecture (CBEA) defines a family of heterogeneous microprocessors that target multimedia and compute-intensive applications [1]. The CBEA resulted from a joint effort among the Sony Group, Toshiba, and IBM to develop the next-generation processor. The following motivations shaped the development of the architecture [2]: Provide outstanding performance on computer entertainment and multimedia applications. Develop an architecture applicable to a wide range of platforms. Enable real-time response to the user and the network. Address the three design challenges facing traditional processors: memory latency, power, and frequency. The CBEA extends the IBM PowerPC* 64-bit architecture with loosely coupled, cooperative offload processors. The CBEA is set apart from other processor architectures by the use of two independent instruction sets: the PowerPC and the synergistic processor unit (SPU) instruction sets. For a processor to be considered CBEA compliant, the processor must contain one or more PowerPC processor elements (PPEs), one or more synergistic processor elements (SPEs), and the required feature set defined by the CBEA. Figure 1 is a block diagram of a CBEA processor. The PPE is compliant with the IBM PowerPC Architecture* [3]. It is intended to perform the system management and application control functions, or ‘‘control plane’’ processing. Whereas other architectures have augmented the instruction set of the processor with tightly coupled extensions such as the vector/single- instruction multiple-data (SIMD) multimedia extension, the CBEA employs an independent SPU that is compliant with the SPU instruction set architecture [4] to perform the compute-intensive, or ‘‘data plane,’’ processing. This allows the data processing and control functions to be decoupled, enabling more application parallelism. Another distinction between a traditional processor and the CBEA is the definition of two storage domains: main and local. The main storage domain is the same as that commonly found in most processors. This domain contains the address space for system memory and memory-mapped I/O (MMIO) registers and devices. Associated with each SPU is a local storage address space, or domain, containing instructions and data for that SPU. Each local storage domain is also assigned an address range in the main storage domain called the local Copyright 2007 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. 
The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. IBM J. RES. DEV. VOL. 51 NO. 5 SEPTEMBER 2007 C. R. JOHNS AND D. A. BROKENSHIRE 503 0018-8646/07/$5.00 ª 2007 IBM
  • 2. storage alias. The SPU can address memory directly only within the associated local storage. To access data in the main storage domain and maintain synchronization with other processors, each SPU and local storage pair is coupled with a memory flow controller (MFC). The combined SPU, associated local storage, and MFC make up the SPE. In addition to the PPE and the SPE, the CBEA includes many features for real-time applications not typically found in conventional processor architectures. The ‘‘Cell Broadband Engine Architecture’’ document, which defines the architecture, refers to the ‘‘PowerPC Architecture Book’’ and the ‘‘SPU Instruction Set Architecture’’ document to avoid duplicating information. For a complete definition of the CBEA, the reader must have access to all of these documents, which are publicly available on IBM Web pages [1, 3, 4]. The CBEA document is divided into two parts: the user-mode environment (UME) and the privileged-mode environment (PME). The UME defines the instructions and facilities required for application portability. This section is similar to Books I and II of the PowerPC Architecture. For the PowerPC Architecture, only Book I compliance is necessary for application portability. Since the CBEA defines a heterogeneous multiprocessor, the synchronization features found in Book II of the PowerPC are necessary for cooperative processing. For this reason, Books I and II were collapsed into a single UME section. The PME, which defines the instructions and facilities required for operating system and hypervisor1 development, is similar to Book III of the PowerPC Architecture. PowerPC processor element Within a CBEA-compliant processor, the PPE performs the control plane functions that typically required the more general-purpose computing provided by the PPE. The PPE is based on Version 2.02 of the PowerPC Architecture, which offers many of the features required for the application spaces targeted by the CBEA. The Figure 1 CBEA processor block diagram. SL1 BIU Element interconnect bus (EIB) SPE group 0 (SG_0) SPE group n (SG_n) PPE group 0 (PG_0) PPE group p (PG_p) LS MMU RMT SPE_0 SPE_g MMU RMT BIU RMT SL1 LS MMU RMT SPE_0 SPE_g MMU RMT BIU RMT RMT RMT RMT BIU RMT RMT RMT Bus interface controller (BIC) Internal interrupt controller (IIC) Memory interface controller (MIC) LS LS L2 L2 BIC Bus interface controller BIU Bus interface unit IIC Internal interrupt controller L1 Memory cache internal to the CPU L2 Memory cache external to the CPU LS Local storage MFC Memory flow controller MIC Memory interface controller MMU Memory management unit PPE PowerPC processor element PPU PowerPC processor unit RMT Replacement management table SL1 First-level cache SPE Synergistic processor element SPU Synergistic processor unit I/O Memory PPUg SPU0 SPUg PPU0 SPU0 SPUg PPUg PPU0 MFC MFC MFC MFC L1 L1 L1 L1 1 A hypervisor is a layer of software running on the processor that allows multiple ‘‘guest’’ operating systems to run concurrently. The hypervisor virtualizes the processor and the system resources. C. R. JOHNS AND D. A. BROKENSHIRE IBM J. RES. DEV. VOL. 51 NO. 5 SEPTEMBER 2007 504
  • 3. PPE is based on the PowerPC Architecture for four reasons: 1) The PowerPC Architecture is a mature architecture that is applicable to a wide variety of platforms. 2) It supports multiple simultaneous operating environments through logical partitioning. 3) It contains proven microarchitectures that meet the frequency and power challenges of the targeted market segment. 4) Use of the PowerPC Architecture leverages the IBM investment in the PowerPC ecosystem. A key advantage of the PowerPC Architecture is its unique ability to support multiple concurrent operating environments through the use of logical partitioning. This feature is critical to executing a non–real-time operating system (RTOS) for the user interface while simultaneously executing a non-RTOS, such as the Linux** operating system, for management of the system. Concurrent execution of an RTOS and the Linux operating system allows non–real-time processes to be performed in the background without affecting the performance of an application running under the RTOS. Although the PowerPC is a mature architecture for more traditional platforms, concurrently supporting both an RTOS and a non-RTOS within the heterogeneous multiprocessor presented many architectural challenges: To meet these challenges and optimize the architecture for media-rich applications, the PowerPC Architecture required enhancements to the vector/SIMD multimedia extension, as well as introduction of the mediated external exception, the multiple concurrent large-page support, software management of translation lookaside buffers (TLBs),and the cachereplacementmanagementextensions. The following subsections describe the mediated external exception extension, the vector/SIMD multimedia extension, and the multiple concurrent large-page extension. Software management of TLBs and cache replacement management are described in separate sections because they apply to both the PPEs and the SPEs. See the CBEA document [1] for more information on each extension. Mediated external exception extension A key attribute of an RTOS is a guaranteed interrupt latency. A single RTOS can control the interrupt latency by not allowing interrupts to be disabled for more than a predetermined amount of time. However, when multiple operating systems are running, the presentation of an interrupt can be delayed by another partition for an arbitrary amount of time. The mediated external exception extension to the PowerPC Architecture overcomes this deficiency by allowing an interrupt to be presented to the processor even if interrupts are disabled. In the PowerPC Architecture, the term exception describes the interrupt condition, and the term interrupt refers to the processor acting on the exception condition or jumping to the interrupt handler. The critical exception conditions for an RTOS are those generated by SPEs and external devices. The current PowerPC Architecture allows an external exception to invoke a hypervisor-privileged interrupt handler. However, an operating system still has control of the external exception enabled in the machine state register of the processor. The CBEA-mediated external exception extension enables the interrupt handler to be invoked as long as the processor is not operating in hypervisor state or if external exceptions are enabled. This prevents an operating system from delaying the presentation of an exception to the processor and allows a hypervisor-privileged interrupt handler to be invoked. 
If the exception is for the RTOS, the hypervisor can immediately pass control to the interrupt handler of the RTOS regardless of which partition is currently active. However, an interrupt can occur when the RTOS is in a critical section with external exceptions disabled, and this is where the mediated portion of the extension comes into play. When external exceptions are disabled by the partition that is expected to handle the interrupt, the hypervisor-privileged interrupt handler sets a mediated external exception request bit in the logical partitioning control register and returns from the interrupt. When the partition enables external exceptions, the hypervisor- privileged interrupt handler is once again invoked with a mediated external exception. The interrupt handler then sets the state of the processor to mimic the original external exception. Vector/SIMD multimedia extension Although the CBEA contains offload processors for vector and streaming media processing, the architecture team included the vector/SIMD multimedia extension for the PPE. This multimedia extension was included to run software developed for the vector/SIMD multimedia extension, to make it easier to develop and port applications to the SPE, and to allow applications to be parallelized across the PPEs and SPEs. The vector/SIMD multimedia extension defined by the CBEA is very similar to the PowerPC 970* implementation. The major difference between the two is the rounding mode support. For compatibility with SPU applications, the vector/SIMD multimedia extension unit in the PPE supports the rounding modes defined by the SPU instruction set architecture. Multiple concurrent large-page extension The current PowerPC Architecture supports the base 4-KB page plus one additional large page to be used concurrently. The large-page size is implementation dependent. In the CBEA, many types of data structures are located in main storage, e.g., MMIO registers for the SPEs, local storage aliases, streaming data, and video IBM J. RES. DEV. VOL. 51 NO. 5 SEPTEMBER 2007 C. R. JOHNS AND D. A. BROKENSHIRE 505
  • 4. buffers. The limitation of only one large-page size places a burden on the TLBs. If a large-page size of 64 KB is selected, the number of translations needed for MMIO registers and local storage aliases is lower than that for the base 4-KB page size. However, more translations are required for the relatively large streaming buffers and video buffers in main memory. In contrast, a large-page size of 1 MB or 16 MB reduces the number of translations required for the streaming and video buffers, but it is too large for mapping the MMIO registers and local storage aliases. To improve the efficiency of the TLBs, the CBEA augments the PowerPC Architecture by providing support for multiple concurrent large-page sizes. The memory management units (MMUs) in the SPE also support the multiple concurrent large-pages extension. Synergistic processor element The SPE is the cooperative offload processor in the CBEA intended for the computational, or data plane, processing functions. The SPE consists of three tightly coupled units: the SPU, the associated local storage, and the MFC. The SPU is a SIMD processor with an instruction set architecture optimized for compute- intensive and media applications: It operates only on instructions and data in the associated local storage. Decoupling the SPU from other aspects of the system provides a very deterministic processing environment for the programmer. The SPU and associated local storage are coupled to the main storage domain and other processors by the MFC. The MFC enables software to move data between the storage domains and to synchronize with other processors in the system. Data movement and synchronization are initiated by using MFC commands. Either the SPU or another processor in the system, such as the PPE, can issue these commands. A direct memory access (DMA) controller in the MFC unit performs the data movement. All main storage accesses performed by the DMA controller adhere to the PowerPC Architecture for address translation and protection. They are performed asynchronously with respect to the SPU and all other units in the system. Synergistic processor unit The SPU provides the programmer with 128 registers, each of which is a 128-bit SIMD register. The large number of architected registers facilitates efficient instruction scheduling and also enables important optimization techniques such as loop unrolling. All SPU instructions are inherently SIMD operations that process data in one of four granules: sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers or single-precision floating-point numbers, or two 64-bit double-precision floating-point numbers. The SIMD registers in the SPU are unified and can be an operand of either an integer or a floating-point instruction, unlike the split set of registers in the PowerPC Architecture [5]. To obtain the best performance from an SPU, the data structures of the program should be defined around the SIMD data flow. In addition, techniques such as double buffering should be employed to overlap the computation with the data movement. This technique insulates the SPU application from the latency of the system memory accesses. More programming tips can be found in [6]. Although the SPU is optimized for SIMD, scalar operations can be performed; they use the preferred slot (the upper word, or 32 bits, for 32-bit scalars) of the SIMD register. 
Channel interface As mentioned earlier, the SPU is decoupled from the system and the MFC provides the linkage to the main storage domain and other processors. A channel interface supplies the communication path between the SPU and the MFC. The CBEA defines multiple unique channels for issuing MFC commands and accessing MFC facilities. These channels are accessed by using the SPU channel instructions. The SPU instruction set architecture defines the channels as four words wide, but the CBEA uses only one word, which is the preferred slot of the SIMD data flow. Each channel has the following set of attributes defined by the CBEA: direction, capacity, and the ability to be blocking or nonblocking. Channels are unidirectional, with the direction defined by the CBEA. Channels used to retrieve information from or transfer information to the MFC are accessed, respectively, by using an SPU read channel (rdch) or write channel (wrch) instruction. Accessing a read channel by using a wrch instruction or a write channel by using an rdch instruction is not allowed and results in an invalid channel error. Channel capacity (depth) defines the number of words that can be contained within the channel. The CBEA defines the depth for some channels; for others, it is an implementation-specific parameter. A channel count, which is accessed by using an SPU read channel count (rdchcnt) instruction, is used to track the amount of information in the channel. The value returned by the rdchcnt instruction indicates the number of valid words contained in a read channel and the available free space or number of words that can be written to a write channel. The CBEA defines each channel as either blocking or nonblocking. A nonblocking channel essentially has an infinite depth, and a value of one is always returned when reading the channel count. Blocking channels have a C. R. JOHNS AND D. A. BROKENSHIRE IBM J. RES. DEV. VOL. 51 NO. 5 SEPTEMBER 2007 506
  • 5. depth of one or more, with an active channel count. Accessing a blocking channel whose count is zero stalls the instruction processing of the SPU until the channel count becomes nonzero. The blocking attribute relieves the programmer from having to check the channel capacity before accessing the channel. When no other useful work can be performed, this attribute allows the program to enter a very low-power state while waiting for the channel to become available. If useful work can be performed, the programmer can check the channel capacity with the rdchcnt instruction before accessing the channel and can then occasionally poll the channel count while performing other useful work. An SPU interrupt offers an alternative to polling. The presence of an SPU interrupt can affect the SPU instruction sequencing in one of two ways. If interrupts are disabled, a special SPU branch instruction (bisled) can be executed to branch to a target address if an interrupt is present. If interrupts are enabled, the SPU executes the next instruction from address zero in local storage and disables interrupts. The address of the next instruction that would have been executed if the interrupt were not present is saved in the SPU state save and restore register (SRR0). SRR0 can also be accessed by using two channels, one to read SRR0 and one to write SRR0. The SPU iret instruction can be used to return to the address stored in SRR0. SPU interrupts should not be confused with interrupts generated by an SPE that are targeted for the PPE (SPE interrupts). The SPU event facility can be programmed to detect key SPE conditions and other system events. The event facility is accessed by using channel instructions to specify the conditions of interest and determine the event status. Software can use the event facility to stall the SPU while waiting for multiple different events or to generate an interrupt when an enabled event occurs. The SPU event facility is described in more detail later in the paper. Memory flow controller The MFC provides the communication path from the SPU and the local storage domain to the main storage domain and other processors in the system. The MFC essentially decouples the SPU from the main storage domain; an application can view the SPU as having an additional asynchronous load–store processor. If used properly, the MFC can insulate an SPU application from the latency of system memory. Figure 2 is a high-level block diagram of the MFC unit. The bus interface unit (BIU) provides the interface to the element interconnect bus (EIB) [7]. The CBEA does not define the EIB protocol. The first-level cache (SL1) is a caching structure for MFC accesses to the main storage domain, and it provides an architectural entity for performance enhancements. In most implementations, the SL1 either is omitted or is very small because of the streaming nature of the MFC commands. The MFC SPU command queue is dedicated to MFC commands issued by the SPU using the channel interface. Other processors use the MFC proxy command queue to transfer data between the storage domains on behalf of the associated SPE. The MFC proxy commands, which are issued by using the MFC MMIO registers, are used primarily to efficiently initialize the local storage before an SPU program is executed. They are typically used by software executing in a PPE. The DMA controller transfers instructions and data between SPU local storage and main storage. 
Programs running on the associated SPU, a PPE, or another device can issue MFC DMA commands. Table 1 shows the MFC DMA commands supported by the DMA controller. A single MFC DMA command can transfer up to 16 KB of sequential data between the storage domains. A DMA transfer is typically performed as a series of smaller bus transfers, usually a cache line or less. If required, coherency is maintained for these transfers as the storage attributes in the page table. The parameters listed below affect the operation of MFC DMA commands: CL MFC class ID. TG MFC command tag identification. Figure 2 MFC block diagram. MFC SL1 (optional) BIU Data path RMT DMA request unit To element interconnect bus LS address Real address MFC registers MMIO interface MMU RMT SPU LS interface SPU channel interface DMA controller Atomic request MFC proxy command queue MFC SPU command queue IBM J. RES. DEV. VOL. 51 NO. 5 SEPTEMBER 2007 C. R. JOHNS AND D. A. BROKENSHIRE 507
  • 6. TS MFC transfer size. LSZ MFC list size. LSA MFC local storage address. EAH MFC effective address high. EAL MFC effective address low. LA MFC list local storage address. LTS List element transfer size. LEAL List element effective address low. Each command includes a classID parameter. The parameter contains a replacement management class identifier (RclassID) that controls the cache replacement management facility. The cache replacement management facility is described later. Software uses a TclassID identifier to classify the type of storage being accessed. The TclassID provides a hint to the MFC and to the system about the way the storage request should be treated. For example, the TclassID can be used by the bus interface unit in the MFC to prevent accesses to a slow device from blocking accesses to the higher-bandwidth main memory. To accomplish this, the EIB transfer request queue in the MFC (not shown in Figure 2) is segmented into two or more areas. The number of entries in each segment is sized according to the latency and bandwidth characteristics of the storage being accessed. EIB transfer requests created for a command are placed in only the transfer request queue segment that corresponds to the TclassID of the command. By issuing a command with the appropriate TclassID, software can prevent the transfer request queue from filling with transfer requests to slower devices. Since the transfers to slower devices are limited to one area of the request queue, transfers accessing higher-bandwidth Table 1 MFC DMA commands supported by the DMA controller. Command Description get,f,b.[s] Moves data from the effective address within main storage to local storage. getl,f,b.[s] Same as get,f,b., except that the effective address and size for multiple transfers are specified by list elements in local storage. put[r],f,b.[s] Moves data from local storage to the effective address within main storage. The optional ‘‘r’’ modifier provides a hint to the system that the data is likely to be accessed soon by the PPE. The data transferred by the command is a candidate for copying into the PPE cache. put[r]l,f,b. Same as put[r],f,b.[s], except that the effective address and size for multiple transfers are specified by list elements in local storage. sdcrt SL1 data cache range touch. Brings a range of effective addresses into the SL1 (performance hint for get commands). Similar to the PowerPC dcbt instruction. sdcrtst SL1 data cache range touch for store. Brings a range of effective addresses into the SL1 (performance hint for put commands). Similar to the PowerPC dcbtst instruction. sdcrz SL1 data cache range zero. Writes zeros to the contents of a range of effective addresses. Similar to the PowerPC dcbz instruction. sdcrst SL1 data cache range store. Stores the modified contents of a range of effective addresses. Similar to the PowerPC dcbst instruction. sdcrf SL1 data cache range flush. Stores the modified contents of a range of effective addresses and invalidates the block. Similar to the PowerPC dcbf instruction. sndsig,f,b. Sends a signal to another SPE. Updates the signal notification registers in another SPE. barrier Orders all commands issued prior to the barrier command with respect to all subsequent commands. mfceieio Orders the storage transactions caused by get and put commands. Similar to the PowerPC eieio instruction. mfcsync Orders DMA put and get operations within the specified tag group with respect to other processing units and devices on the system. 
Similar to the PowerPC sync instruction. getllar Get lock line and reserve–immediate. Similar in function to the PowerPC lwarx and ldarx instructions. putllc Put lock line conditional–immediate. Similar in function to the PowerPC stwcx and stdcx instructions. putlluc Put lock line unconditional–immediate. Main storage location updated regardless of reservation ownership. putqlluc Same as putlluc except queued with other DMA commands and has an implied fence modifier. Note: The optional ‘‘s’’ modifier starts the SPU when the command completes. The fence (,f.) and barrier (,b.) modifiers are used for command ordering. See the section on command ordering for the definition of these modifiers. C. R. JOHNS AND D. A. BROKENSHIRE IBM J. RES. DEV. VOL. 51 NO. 5 SEPTEMBER 2007 508
List commands
The CBEA provides a list modifier <l> for MFC DMA commands to move data that is scattered in the main storage domain. A command with a list modifier is called a list command. List commands are converted to a series of commands, each of which is described by a list element in local storage. Each list element contains an effective address low parameter and a transfer size parameter. When a list command is executed, the MFC reads the list elements from local storage and creates a series of DMA commands. Each DMA command that is generated has the same set of parameters as the original list command, with the exception of the transfer size and effective address low parameters. Each MFC DMA list command can contain up to 2,048 list elements, each transferring up to 16 KB of data. Only an SPU can issue list commands, and such commands cannot be issued to the proxy MFC command queue. The list element includes a stall-and-notify flag. When set, the flag causes the MFC to stop executing the list command and notify the SPU after all of the DMA transfers for the list element and all previous list elements are complete. Subsequently, software can resume the list command. Among other uses, this feature allows the programmer to specify a data transfer larger than the available buffer space in local storage: the stall-and-notify flag is placed on the last element that fills the buffer, and when the data transferred by the previous elements has been processed and buffer space made available, the program resumes the MFC DMA list command. The CBEA defines one read channel that identifies the list commands that are stalled and a write channel that is used to acknowledge the stall and allow the MFC to continue executing the list command.
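A minimal sketch of a gather using a list command follows, assuming the mfc_list_element_t layout and mfc_getl intrinsic from the SDK's spu_mfcio.h; the element count, transfer sizes, and addresses are illustrative, and the stall-and-notify flag is left clear.

#include <stdint.h>
#include <spu_mfcio.h>

#define NELEM 4

/* List elements reside in local storage; each element is 8 bytes. */
static mfc_list_element_t list[NELEM] __attribute__((aligned(8)));
static volatile uint8_t buf[NELEM * 4096] __attribute__((aligned(128)));

void gather(const uint32_t leal[NELEM], uint32_t eah)
{
    unsigned int i, tag = 1;

    for (i = 0; i < NELEM; i++) {
        list[i].notify = 0;        /* stall-and-notify flag          */
        list[i].size   = 4096;     /* LTS: per-element transfer size */
        list[i].eal    = leal[i];  /* LEAL: element address, low 32  */
    }

    /* LA = list, LSZ = sizeof(list); the EAH parameter supplies the
       high-order 32 address bits common to all list elements. */
    mfc_getl(buf, (uint64_t)eah << 32, list, sizeof(list), tag, 0, 0);

    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}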
Memory management unit
The MMU allows an SPU application to use the same effective address as that used by a PPE application to access main storage. The effective address for an MFC command is provided in two parameters, EAH and EAL. The EAH parameter contains the high-order 32 bits of the 64-bit effective address, and the EAL parameter contains the low-order 32 bits. The EAH parameter is optional; if it is omitted, the high-order 32 address bits are set to zero. The MMU translates the effective address into the real address of main storage. This translation is compatible with the virtual address translation mechanism defined by the PowerPC Architecture. The PPE handles all MFC translation faults; an MFC translation fault occurs when a translation cannot be found for an effective address. The MMU can use either the same page table as that used by the PPEs or an independent page table; it also supports the multiple concurrent large-page extension described earlier in this paper. In addition, the MMU supports software management of the TLBs, which is presented subsequently.

Command ordering
As MFC commands are issued, they are placed in the appropriate command queue. MFC commands can be executed and completed in any order, regardless of the order in which they were issued. The out-of-order execution of commands allows the MFC to use system resources efficiently to achieve the best performance. For example, if the EIB supports simultaneous reads and writes, the MFC can simultaneously execute one get and one put command; if ordering were implied by the issue order, this would not be possible. While out-of-order command execution can help performance, in some cases software may require strict command ordering. To achieve command ordering, the CBEA provides two command modifiers (fence <f> and barrier <b>) and a barrier command. The command modifiers order commands only within the same tag group (that is, all commands issued with the same tag parameter to the same queue); hence, they are called tag-specific modifiers. A command with a fence modifier is performed after all previously issued commands within the same tag group; commands issued after a command with a fence modifier are not affected. A command with the barrier modifier is performed after all previously issued commands within the same tag group, and commands issued after a command with a tag-specific barrier are also performed after previously issued commands. Some commands have an implied tag-specific barrier modifier. When command ordering is required to be independent of the tag group, a barrier command can be issued. Regardless of the tag identifier, this command orders all previously issued commands with respect to all subsequently issued commands within the same command queue. The CBEA does not provide a mechanism to order commands with respect to commands in the other command queue.
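As an illustrative sketch of the fence modifier (function names and addresses are hypothetical; the intrinsics are those of the SDK's spu_mfcio.h), a result block is written with put, and a fenced send-signal in the same tag group then notifies a consumer SPE. The fence guarantees that the signal cannot be performed before the data transfer completes, without constraining commands in other tag groups.

#include <stdint.h>
#include <spu_mfcio.h>

static volatile uint8_t out[16384] __attribute__((aligned(128)));
static volatile uint32_t token __attribute__((aligned(16))) = 1;

/* dst: main-storage address of the result buffer.
   sig: main-storage address of the consumer's signal notification
        register (see the signal notification facility, later). */
void publish(uint64_t dst, uint64_t sig)
{
    unsigned int tag = 5;

    mfc_put(out, dst, sizeof(out), tag, 0, 0);
    mfc_sndsigf(&token, sig, tag, 0, 0);   /* sndsig<f> */

    /* For ordering across all tag groups in the queue, the barrier
       command is available as mfc_barrier(tag). */
}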
Storage access ordering
In addition to command ordering, software can also require the EIB transfers generated by the DMA commands to be ordered with respect to other processors and devices in the system. Although the command modifiers and barrier command provide command ordering, they do not order the bus transfers with respect to other processors and devices. For this type of storage ordering, the mfceieio and mfcsync commands are used in combination with the command modifiers and barrier command. The mfceieio command provides storage ordering similar to that of the PowerPC eieio instruction; the mfcsync command provides storage ordering similar to that of the PowerPC sync instruction. The ordering of storage accesses performed by two or more processors with respect to another processor or device is called cumulative ordering. The CBEA follows the PowerPC rules for cumulative ordering when all accesses are performed within main storage and the proper synchronization instructions and commands are performed. Standard PowerPC rules do not apply when the storage accesses span the main and local storage domains. The CBEA multisource synchronization facility addresses this case: it provides software with a mechanism to ensure that all accesses from the main storage domain, targeting the associated local storage domain, are complete with respect to the SPU. Figure 3 illustrates the use of the multisource synchronization facility.

[Figure 3  Example of multisource synchronization. Locations A and C are in local storage, and location B is in main storage; all three initially contain 0x0. PPE 1: (1) store 0x1 to local storage location A; (2) sync; (3) store 0x2 to main storage location B. PPE 2: (1) a load from main storage location B returns 0x2; (2) sync; (3) store 0x3 to local storage location C. SPU 3: (1) a load from local storage location C returns 0x3; (2) use the multisource synchronization facility; (3) a load from local storage location A is then guaranteed to return 0x1.]
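The SPU side of step 2 in the Figure 3 example might look like the following sketch, assuming the multisource synchronization intrinsics of the SDK's spu_mfcio.h; polling the channel count, as shown, is the non-stalling variant of the completion test.

#include <spu_mfcio.h>

/* Ensure that all transfers targeting this SPE's local storage
   that were started before the request (such as PPE 1's store to
   location A) are complete with respect to this SPU. */
void multisource_sync(void)
{
    mfc_write_multi_src_sync_request();

    /* The request channel's count returns to 1 once all prior
       transfers into this local storage have completed. */
    while (mfc_stat_multi_src_sync_request() == 0)
        ;
}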
Memory-mapped I/O space and channels
The CBEA defines several SPE facilities, some of which have already been mentioned. A facility is a set of MMIO registers or channels that provides a specialized function. Facilities in the CBEA are accessed from the main storage domain using MMIO registers or from the SPU using the channels defined by the CBEA. The facilities defined by the CBEA provide a wide variety of functions, ranging from support of context save and restore to synchronization with other SPEs and processors in the system; the MMIO space and channels also provide access to other miscellaneous functions. This paper does not cover all of the facilities provided by the MMIO registers and channels, but it attempts to give the reader an indication of what is available. Some of the facilities provided are listed below:

• Local storage alias.
• Command issue.
• Tag-group completion facility.
• Multisource synchronization facility.
• Mailbox facility.
• Signal notification facility.
• SPU event facility and decrementer.
• Software management of TLBs.

Local storage alias facility
Within the MMIO space of the SPE is an area called the local storage alias that is dedicated to local storage. It provides direct access to the local storage domain of the corresponding SPE from the main storage domain. The local storage alias is used primarily to allow the direct transfer of data from the local storage of one SPE to the local storage of another SPE. For example, if an SPE issues an MFC get command with an effective address that maps to the local storage alias area of another SPE, the data is transferred directly from the local storage of the target SPE to the local storage of the issuing SPE. Without an alias of the local storage in the main storage domain, software would be forced to copy data through system memory.

Command issue facility
The memory flow controller section of this paper describes the MFC DMA commands. Commands are issued from other processors in the system using a sequence of MMIO register stores and loads; commands issued using MMIO registers are called MFC proxy commands. Code running on an SPU issues commands using a sequence of channel writes; these commands are called MFC SPU commands. MFC proxy commands are typically issued by the PPE to initialize local storage efficiently before the SPU is started. To issue an MFC proxy command, software writes each parameter to the corresponding MMIO register. Some parameters can be omitted, and the last parameter written is the MFC command opcode. After writing this parameter, software reads from the MFC command status register to determine whether the command was successfully placed in the MFC proxy command queue. The MFC command opcode register and the MFC command status register are mapped to the same address; therefore, software need not issue a PowerPC eieio instruction between the last store of the opcode and the read of the command status, since the PowerPC Architecture requires that two accesses to the same location be performed in program order. Queuing a command can fail because of a command sequence error or because of insufficient room in the MFC proxy command queue. A command sequence error is typically caused when the command issue sequence is interrupted and the next process or interrupt handler issues an MFC proxy command; this error response eliminates the need for a lock on the MFC proxy queue when issuing a command. The insufficient-room error occurs when there is no available space in the MFC proxy command queue. In most cases, software can avoid this error by reading the MFC command queue status register to ensure that there is available space in the queue before issuing the command. If either error is reported, software must repeat the command issue sequence. The MFC SPU command sequence is a series of channel write instructions. The last channel written in the sequence is the MFC command opcode channel. This channel is blocking and has a depth equal to the size of the MFC SPU command queue. If the queue is full, the channel write stalls the SPU until space in the queue is available. The stall is a very efficient mechanism for saving power when useful work cannot be performed until after the command has been queued. Software can avoid the stall by using the procedure described in the channel interface section or by using the event facility in the SPE to interrupt the SPU when space is available. The event facility is described in a subsequent section.
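The following sketch writes out the MFC SPU command issue sequence as explicit channel writes, using the spu_writech intrinsic and the channel names defined by the SDK's spu_mfcio.h; the composite intrinsics used earlier, such as mfc_get, expand to essentially this sequence. The function name and parameters are illustrative.

#include <stdint.h>
#include <spu_mfcio.h>

/* Enqueue a get command one parameter channel at a time; the write
   of the opcode channel is last and stalls the SPU if the MFC SPU
   command queue is full. */
void issue_get(volatile void *lsa, uint64_t ea, uint32_t size,
               uint32_t tag)
{
    spu_writech(MFC_LSA,   (uint32_t)(uintptr_t)lsa);
    spu_writech(MFC_EAH,   mfc_ea2h(ea));
    spu_writech(MFC_EAL,   mfc_ea2l(ea));
    spu_writech(MFC_Size,  size);
    spu_writech(MFC_TagID, tag);
    spu_writech(MFC_Cmd,   MFC_GET_CMD);   /* opcode written last */
}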
Tag-group completion facility
Data movement is performed asynchronously with respect to the execution of the SPU program; therefore, the CBEA provides a tag-group completion facility so that an application can determine when a specific command or set of commands completes. Independent facilities are available for the MFC proxy and MFC SPU command queues. A tag ID is provided with each DMA command, and commands issued with the same tag ID to the same queue are called a tag group. To determine when tag groups are complete, a mask that selects the tag groups of interest is written to the tag-group mask channel or register. The status of the selected groups is reported in a status register for the MFC proxy command queue. For the MFC SPU command queue, a query type is written to the MFC write tag-group status update request channel. The request can be that either ‘‘all selected tag groups are complete’’ or ‘‘any one of the selected groups is complete.’’ Once the query type is set, reading from the blocking tag-group status channel causes the SPU to stall until the query condition is met. As with the command issue sequence, the stall can be avoided by first determining the channel count value of the channel. The CBEA also provides a way to generate an interrupt when a tag group completes. For the MFC proxy command queue, a similar query type is provided to determine when the interrupt is generated; the interrupt is typically routed to the PPE for processing. For the MFC SPU command queue, the event facility provides the interrupt when the condition is met.
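A sketch of both the stalling and the non-stalling forms on the SPU side follows, assuming the tag-status intrinsics of the SDK's spu_mfcio.h; the choice of tags 0 and 1 is illustrative.

#include <spu_mfcio.h>

/* Stall until every command issued with tag 0 or tag 1 completes. */
void wait_tags_0_and_1(void)
{
    mfc_write_tag_mask((1 << 0) | (1 << 1)); /* select tag groups   */
    mfc_read_tag_status_all();   /* "all selected groups complete"  */
}

/* Non-stalling poll: request "any" completion status and read it
   only when the channel count shows that a result is available. */
unsigned int poll_tags_0_and_1(void)
{
    mfc_write_tag_mask((1 << 0) | (1 << 1));
    mfc_write_tag_update_any();
    return mfc_stat_tag_status() ? mfc_read_tag_status() : 0;
}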
MFC atomic update commands
In any multiprocessor system, the synchronization of software processes is critical. The PowerPC Architecture provides four instructions (lwarx, ldarx, stwcx, stdcx) for performing atomic updates of system memory. The atomic update of system memory is used to create software locks, which in turn are used to achieve the desired synchronization. Because the SPUs do not have direct access to the main storage domain, the CBEA defines four MFC commands (getllar, putllc, putlluc, putqlluc) that allow an SPU to participate in the atomic update of system memory. These commands are known as the MFC atomic update commands. They can be issued only by an SPU to the MFC SPU command queue; they cannot be issued by other processors or devices. Unlike the PowerPC instructions, these commands transfer the full reservation granule, which in most processors is the same size as a cache line. The get lock line and reserve (getllar) command is similar to the PowerPC lwarx and ldarx instructions, and the put lock line conditional (putllc) command is similar to the stwcx and stdcx PowerPC instructions. These commands are not queued in the MFC like other MFC commands; they are executed immediately but still require an MFC command queue slot. Since they are executed immediately, they do not support the tag parameter, and only one of them can be outstanding at a time. The CBEA provides a blocking atomic command status channel for determining when these commands complete and what their results are. The put lock line unconditional (putlluc) and put queued lock line unconditional (putqlluc) commands are similar to a cacheable PowerPC store instruction; they are used to unconditionally release a lock. The putlluc command is executed immediately, whereas the putqlluc command has an implied fence and is placed in the MFC command queue along with other MFC commands. Since the putqlluc command is queued, the tag parameter is supported, and software uses the previously described tag-group completion facility to determine when the command is complete.

A common synchronization method is for software to wait for the value of a lock in system memory to change. In the PowerPC, this requires software to poll the memory location associated with the lock continually. In the CBEA, an SPU event can be generated when a reservation is lost. (The reservation is lost when another processor modifies the memory location associated with the lock.) SPU software can avoid constantly polling the lock by simply obtaining a reservation on the memory location and then waiting for the reservation to be lost using the event facility. Additionally, an SPU interrupt can be generated when the reservation is lost. While the reservation lost event does not guarantee that the lock value has changed, it very effectively eliminates the constant polling of the lock using MFC DMA commands. Since the performance of atomic updates can suffer when there is high contention for a lock, this may not be the optimal solution for all synchronization scenarios. Therefore, the CBEA offers additional options for process synchronization, such as the mailbox facility and the signal notification facility. These facilities can be used as alternatives to the atomic update MFC commands, or they can be used in conjunction with them.
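A sketch of a spin lock built from these commands follows, modeled on the atomic intrinsics of the SDK's spu_mfcio.h. The lock is assumed to occupy the first word of a 128-byte reservation granule; a production version would wait on the lock line reservation lost event instead of spinning.

#include <stdint.h>
#include <spu_mfcio.h>

/* Local-storage image of the reservation granule (cache line). */
static volatile uint32_t line[32] __attribute__((aligned(128)));

void acquire_lock(uint64_t lock_ea) /* EA of the lock's cache line */
{
    for (;;) {
        /* Load the line and set the reservation; the atomic status
           channel read blocks until the getllar completes. */
        mfc_getllar(line, lock_ea, 0, 0);
        (void)mfc_read_atomic_status();

        if (line[0] != 0)
            continue;               /* lock held: reload and retry */

        line[0] = 1;                /* mark the lock taken         */
        mfc_putllc(line, lock_ea, 0, 0);
        if ((mfc_read_atomic_status() & MFC_PUTLLC_STATUS) == 0)
            return;                 /* conditional store succeeded */
    }                               /* reservation lost: retry     */
}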
Mailbox facility
A key advantage of the CBEA is the independent nature of the SPEs. Since the SPEs are decoupled from the PPE and from one another, the CBEA provides a mailbox facility to assist in process-to-process communication and synchronization. The mailbox facility provides a simple, unidirectional communication mechanism typically used by the PPE to send short commands to the SPE and to receive status in return. The facility consists of an inbound mailbox, an outbound mailbox, and an outbound interrupt mailbox, each of which can have a depth of one or more entries. The direction of a mailbox is relative to the SPU; for example, the inbound mailbox is written by the PPE and read by the SPU. The SPU accesses the mailboxes using the SPU channel instructions. The mailbox channels are blocking, causing the SPU to stall if the outbound mailbox is full or the inbound mailbox is empty. For the PPE and other devices, MMIO registers provide access to the mailboxes and the mailbox status. As the name implies, the outbound interrupt mailbox generates an interrupt when written by the SPU; the interrupt is typically routed to the PPE for processing and is presented as an external exception. For example, the mailbox facility can be used for a command-driven SPU application. In this example, the SPU is typically stalled waiting for a command to be placed in the inbound mailbox. When a command is received, the SPU performs the requested operation given in the mailbox data or sequence of mailbox data. Once the operation is complete, the SPU places a return code in the outbound interrupt mailbox; this write generates an interrupt for the PPE that indicates the completion of the requested operation. The SPU code then reads the next command from the inbound mailbox, stalling if a new command is not yet available. In addition to the application given in this simple example, many other uses were contemplated during the development of this facility.
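A sketch of the command-driven SPU application described above, using the SDK's mailbox intrinsics; the command encoding and the do_command handler are hypothetical.

#include <spu_mfcio.h>

#define CMD_EXIT 0   /* hypothetical command encoding */

static unsigned int do_command(unsigned int cmd)
{
    /* ... perform the operation requested by cmd ... */
    return cmd;      /* placeholder return code */
}

int main(void)
{
    unsigned int cmd, rc;

    for (;;) {
        cmd = spu_read_in_mbox();     /* stalls while mailbox empty */
        if (cmd == CMD_EXIT)
            break;
        rc = do_command(cmd);
        spu_write_out_intr_mbox(rc);  /* raises a PPE interrupt     */
    }
    return 0;
}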
Signal notification facility
The signal notification facility is very similar to the mailbox facility. It consists of two signal notification registers. From the SPU, these registers are accessed using channels; for other processors, they are mapped into the MMIO space. In contrast to the mailbox facility, the signal notification facility is inbound only, and the corresponding channels are only one entry deep and are blocking. Each signal notification register has two modes of operation: overwrite and logical OR. In overwrite mode, the contents of the register are replaced with the new value written to the MMIO register, even if the SPU has not yet read the current value. In logical OR mode, the current contents of the signal notification register are ORed with the new value written to the register. In either mode, the signal notification register is set to zero after it is read by the SPU. Normal MFC put commands can be used to access the signal notification facility in another SPE; however, the MFC send signal (sndsig) commands are defined for this purpose, allowing an implementation to use a more efficient mechanism than MMIO for signaling between SPUs. The signal notification facility offers an alternative synchronization mechanism between SPUs and, potentially, other devices, and it can be used in conjunction with the other synchronization mechanisms. For example, the logical OR mode can be used in a task completion notification mechanism: an SPU assigns a task to another processor and, along with each task, assigns a tag. When the task completes, the processor or device performs the proper synchronization to ensure that the results are visible and then writes a binary ‘‘1’’ to the signal notification register bit corresponding to the tag. The overwrite mode provides fast communication when a single bit is not sufficient for synchronization.
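A sketch of the logical OR task-completion scheme on the waiting SPU, assuming signal notification register 1 is configured for OR mode and that eight tasks were assigned bits 0 through 7 (the bit assignment is illustrative):

#include <spu_mfcio.h>

/* Collect completion bits until all eight tasks have reported. */
void wait_for_tasks(void)
{
    unsigned int done = 0;

    while (done != 0xFF) {
        /* Stalls until at least one signal arrives; the register
           is reset to zero by the read, so completion bits are
           accumulated locally. */
        done |= spu_read_signal1();
    }
}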
SPU event facility and decrementer facility
The SPU event facility provides software with a convenient mechanism for waiting on a selected set of conditions; the facility can also interrupt the SPU when one of these conditions occurs. The facility consists of the SPU read event status channel, the SPU write event mask channel, the SPU read event mask channel, and the SPU write event acknowledgment channel. To use the facility, software first writes the SPU write event mask channel to select the conditions of interest and then reads the SPU read event status channel. If none of the selected conditions has occurred, the SPU stalls until at least one of the conditions exists. The stall can be avoided either by reading the channel count associated with the SPU read event status channel before reading the channel or by enabling an interrupt to occur when event status is available. After being notified that an event has occurred, the SPU program should acknowledge the event by writing a binary ‘‘1’’ to the corresponding bit in the SPU write event acknowledgment channel; acknowledging the event enables the SPU to receive subsequent events for that condition. The SPU read event mask channel is provided for context save and restore operations; software can also use this channel to eliminate the need for a shadow copy of the currently selected events. As is apparent from the descriptions of the other facilities, events are defined for many of the stall conditions. The event facility allows software to wait for multiple conditions using a single blocking channel and can cause the SPU to be interrupted; the SPU read event status channel doubles as the interrupt status. The SPU event facility supports the following events:

• MFC tag-group status update event.
• MFC DMA list command stall-and-notify event.
• MFC SPU command queue available event.
• SPU inbound mailbox event.
• SPU outbound mailbox event.
• SPU outbound interrupt mailbox event.
• SPU signal notification 1 event.
• SPU signal notification 2 event.
• Lock line reservation lost event.
• Multisource synchronization event.
• Privileged attention event.
• SPU decrementer event.

Except for the privileged attention event and the SPU decrementer event, these events have been described in the previous sections. The privileged attention event allows another processor or device to get the attention of an SPU program: setting the privileged attention bit in the SPU privileged control register causes the privileged attention event in the SPU event facility to be set. The SPU decrementer event is set when the most significant bit of the SPU decrementer changes from zero to one. SPU software can either wait for these events by reading the SPU read event status channel or enable an interrupt to be generated on the occurrence of any enabled event.
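The basic sequence is sketched below for a wait on either a new inbound mailbox entry or a decrementer timeout, assuming the event intrinsics and event mask names of the SDK's spu_mfcio.h.

#include <spu_mfcio.h>

/* Returns the event bits that ended the wait. */
unsigned int wait_mbox_or_timeout(unsigned int ticks)
{
    unsigned int events;

    spu_write_decrementer(ticks);
    spu_write_event_mask(MFC_IN_MBOX_AVAILABLE_EVENT |
                         MFC_DECREMENTER_EVENT);

    events = spu_read_event_status();  /* stalls until an event   */
    spu_write_event_ack(events);       /* re-arm the conditions   */
    return events;
}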
Software management of TLBs facility
Almost all modern processor architectures support a virtual address space that requires the use of an address translation table. The PowerPC is no exception: it implements virtual addressing using an architected hardware page table in system memory. For performance reasons, processors usually implement a cache of the translations in an on-chip array; in the PowerPC Architecture and the CBEA, this array is called the TLB. A TLB miss can be handled in either software or hardware; in the PowerPC Architecture, the TLB is hardware managed. In general, caches introduce uncertainty into the performance of a system. For example, a program accessing a cache that does not contain any of the program data (a cold cache) executes more slowly than when accessing a cache that contains the program data (a warm cache). Architectures have addressed this problem for data caches by providing cache management instructions, and some even provide software access to the TLBs. Since the PowerPC Architecture does not provide a method to control the contents of the translation cache, there is typically a start-up penalty when a program is first executed, which is not acceptable for real-time applications. The CBEA addresses this issue by adding a software TLB management facility, specified for both the PPE and the SPE. When used in conjunction with hardware management of the TLBs, an operating system can preload or restore an application's translations to reduce the start-up effects. A second issue with hardware-managed translations is the fixed size and structure of the hardware page table; in addition, all processors within the same operating system partition typically share the same physical hardware page table. With software management, an operating system can choose to manage the TLBs in software for any number of the SPEs or PPEs in a CBEA-compliant processor. This allows an operating system to use a specialized page table structure for an application that is independent of the structure required by the architecture and hardware.

Cache replacement management facility
As mentioned in the previous section, caches introduce uncertainty into the performance of a system. Not only is this uncertainty due to a cold cache at start-up, but it can
also be caused by a characteristic of the application. For example, if a streaming application reads a very long sequential data structure, the contents of the cache are replaced with the infrequently accessed streaming data, thereby creating a cold cache. The cache management instructions offered by architectures cannot prevent these streaming-data effects on the cache. Other processor architectures have offered methods to lock data into a cache, but they still do not address the effects of streaming. Since streaming media is a major application area for a CBEA-compliant processor, this issue had to be addressed. Caches are usually not an architectural data element; thus, the ‘‘Cell Broadband Engine Architecture’’ document [1] focuses only on the need for such a cache replacement management facility, and an example in the document guides processor designers in their implementation choices. Using the example implementation outlined in [1], an operating system creates a set of replacement management class identifiers (RclassIDs) and an associated replacement management table (RMT). For the PPE, the RclassID is generated from the address of the data accessed, using a set of address-range registers; separate ranges are provided for instruction and data fetches. For the MFC, the RclassID is provided as an explicit parameter of each MFC command. The RclassID is used as an index into the RMT. The RMT entry selects the sets in the cache (assuming a set-associative cache) that can be replaced when all of the entries in the congruence class are valid and the requested data does not reside in the cache. If the data resides in the cache in a different set, the data is simply accessed. If the data does not exist in the cache and there is a nonvalid entry in the congruence class, an implementation can choose to place the data in the invalid entry even if that entry does not correspond to a set selected by the RMT. The cache replacement management facility provides an operating system with two key methods for managing both the data and translation caches, i.e., the TLBs. The first method prevents streaming data from flushing a cache: accesses to streaming data are tagged with an RclassID that allows the replacement of only a few sets, typically one. This means that the data can be streamed through the cache without replacing older data or translations that may be required by the operating system or application. The second method allows an operating system to lock data or translations in a cache. This method is beneficial for many types of data and helps to reduce the caching effects inherent in hardware-managed caches. For example, an RTOS can use this facility for a critical PPE interrupt handler. As mentioned, controlling interrupt latency is very important for an RTOS. By locking the translations for the key interrupt handlers in the TLBs and locking the interrupt code itself into the caches, an RTOS can reduce the uncertainty of processing an interrupt. Locking data or translations in a cache is achieved by ensuring that only one RMT entry allows a given cache set to be replaced. In the example above, software should ensure that only the address range of the interrupt routine can select the RMT entry. If these conditions are met, no other access will cause the interrupt handler instructions or translations to be removed from the cache.
To eliminate the start-up cost of the very first interrupt, the RTOS can warm the cache state by pre-touching the interrupt handler, using the cache management instructions, when it is first loaded. When used in conjunction with the mediated external exception extension and the cache management instructions, the cache replacement management facility essentially eliminates the uncertainty in the execution of an interrupt handler. The cache replacement management facility is very flexible, providing both for locking data in a cache and for reducing the effects of streaming data on caches. While this facility is implementation dependent, the CBEA recommends that an implementation provide an RMT for each major caching structure, including the TLBs.

Internal interrupt controller
Processor architectures typically do not define an interrupt controller. In the CBEA, however, all exceptions generated by an SPE are typically handled by the PPEs in the system, and the architecture would not be complete if it did not define how these interrupts are presented to the PPE. Therefore, the CBEA defines an internal interrupt controller (IIC) that provides a method to present an interrupt generated by an SPE to the PPE as quickly as possible. The IIC has a sufficient set of features to act as the interrupt controller for the system; if a different feature set is required, the IIC can coexist with a separate system interrupt controller. The SPU-generated exceptions include the following: internal SPU errors, errors created by the execution of an MFC command, address translation faults, address compare matches, mailbox interrupts, interrupts generated by the execution of an SPU stop-and-signal instruction, and tag-group completion. SPU exceptions cause a PPE external interrupt to occur. The CBEA defines a set of routing registers, interrupt mask registers, and interrupt status registers for each SPE. The routing registers are used to route an interrupt to the PPE that will handle it; the interrupt mask and status registers are used to enable interrupt conditions and to record the cause of the interrupt. The IIC allows software
to prioritize SPE exceptions with other external interrupts generated by the system. The interrupt structure in the CBEA is flexible, allowing SPE exceptions to be routed to an IIC or directly to the system interrupt controller; routing an SPE exception to the system interrupt controller allows an alternative prioritization to be employed by the system designer. In addition to handling SPE exceptions, the IIC provides an efficient mechanism for generating interprocessor interrupts, which are generated by a PPE and target another PPE in the system.

Resource allocation management
Real-time operating systems were a key consideration during the development of the CBEA and were the primary reason for including features such as cache replacement management and software management of TLBs in the architecture. These features help software maintain real-time guarantees in a single-processor environment, but they do little to address the effects created by having multiple processors. In a multiprocessor system, critical resources such as system memory can be consumed by relatively few devices, which can create latency and bandwidth issues for the remaining devices. To address this issue, the CBEA includes a resource allocation management (RAM) facility. With RAM enabled, each requester in a CBEA-compliant processor is assigned to a resource allocation group (RAG), and a portion of each critical resource is allocated to each RAG. The definition provided in the CBEA ends here; the remaining details of the facility, such as the resources controlled and the controlling mechanism, are highly implementation specific. The rest of this section describes the Cell/B.E. processor implementation of RAM. The Cell/B.E. processor is the first implementation of the CBEA. In the Cell/B.E. processor, the resources controlled by RAM are system memory and the I/O interfaces. A token-based control mechanism allocates a percentage of the managed resources to each RAG. Before a requester can access a managed resource, it must obtain a token from a central token manager. The hypervisor initializes the token manager to generate tokens for each RAG at a rate equivalent to the allocated percentage of the resource. Requesters compete for the tokens generated for their RAG using a round-robin priority scheme. During certain intervals, the frequency of requests from a given RAG may be lower than the allocated token rate; this can result in unused tokens for some RAGs while other RAGs are throttled by their allocated token rate. The Cell/B.E. design allows the unused tokens to be given to another RAG, which prevents a resource from being wasted or underutilized. Typically, software never allocates 100% of a resource, in order to account for variations in system performance; this is another source of potential waste. To reduce this waste, the Cell/B.E. processor design allows any RAG to use the unallocated percentage of a resource. The allocation of unused tokens and of the unallocated portion of a resource is programmable and can be independently disabled by an operating system. Because 100% of a resource is never allocated, allowing unallocated and unused tokens to be used can result in overutilization of a resource. The Cell/B.E. processor implementation addresses this issue by automatically suspending the allocation of these tokens when a resource becomes congested; this is known as backpressure from the resource.
At first glance, RAM appears to be a bandwidth management system. However, the bandwidth and latency of system memory are highly dependent on the access size and access pattern. For example, the bandwidth is considerably lower if an application continually accesses addresses corresponding to the same physical bank of memory. To avoid this problem, the RAM facility requires a requester to obtain a token for each physical memory bank. Software sets up a single percentage allocation for memory, and the token manager distributes the allocation equally among all memory banks. Therefore, if the memory access pattern of an application is evenly distributed, the achievable memory bandwidth is higher than if the accesses are poorly distributed. An application can also cause poor utilization of system memory if it uses small accesses. Typically, caches prevent small accesses, so they are not a concern for most systems. However, in a CBEA-compliant processor, MFC DMA commands can transfer blocks smaller than a cache line. The Cell/B.E. processor addresses this issue in two ways. First, each token is defined as the transfer of a full cache line, or 128 bytes, regardless of the actual transfer size. Second, the requester is required to obtain two tokens if the transfer size requires a read-modify-write of system memory to update the error-correction code (ECC): one token for the read and one for the write. Thus, the effective system memory bandwidth is a function of the access pattern and access size. Without RAM, the accesses performed by one application can affect the system memory bandwidth available to all applications. RAM provides an effective mechanism that prevents applications with poor access patterns from affecting the performance of other applications. RAM does not guarantee memory bandwidth, but it does guarantee access to a resource. Because the effective bandwidth achieved is still dependent on the access pattern, the facility is known as resource allocation management rather than bandwidth reservation.
Power management
Power management is always a key consideration in the development of any architecture. Some channels in the SPE are defined as blocking, which allows an SPU to enter a very low-power state while waiting for the channel to become available. The CBEA also defines the following set of more traditional power management states: active, slow(n), pause(n), state retained and isolated (SRI), and state lost and isolated (SLI). Each element in a CBEA-compliant processor can be in any one of the five defined states. In the active state, the element runs at its full performance level; no power management is applied except for the dynamic power management provided by blocking channels and the gating of nonactive circuits found in most advanced designs. Several levels are allowed for the slow(n) state, with a higher value of n representing more aggressive power savings. As the name implies, the element executes at a slower rate; performance is degraded, but forward progress is still guaranteed. An example of a slow state is running an element at a lower clock frequency to save power. The pause(n) state is similar to the slow(n) state except that forward progress is not guaranteed. A simple example of a pause state is the situation in which an SPU is stopped but the MFC is still active. In the three power management states described above, the context of the element is maintained; for example, the TLBs and caches remain coherent with system memory and the other elements in the system. In the SRI and SLI states, the element no longer interacts with other elements in the system, so the full context of the element cannot be maintained. (These power management states should not be confused with the SPU isolation facility described in the next section.) Software must prepare the element before it enters or leaves these states. The SRI state allows software to keep the bulk of the context while saving most of the power of the element. An example of the SRI state is the case in which the SPE is isolated from the system and the contents of the local storage and MMIO registers are maintained, but coherency of the TLBs and caches is not. The SLI state is the most aggressive power-saving state: nothing is maintained, including the local storage, registers, and caches. An example of this state is the situation in which power is removed from an SPE or PPE. The actual implementation and the supported power management states are implementation specific; therefore, the CBEA does not define the details.

Isolated execution environment
The CBEA includes an SPU isolation facility that is used to create a secure execution environment within an SPE. Figure 4 is a state diagram illustrating the transitions into and out of the SPU isolated execution environment.

[Figure 4  SPU isolated state transitions. The diagram shows the nonisolated SPU stopped and SPU running states, the transition states (in which the SPU and local storage are isolated from the system), and the SPU isolated (stopped) and SPU isolated (running) states, with Run, Stop, Load, Load success, Load failed, Exit, and Exit complete transitions between them.]

To enter an SPU isolated execution environment, a pointer to the code to be executed in the secure environment is loaded into SPU signal notification registers 1 and 2, and the SPU is started using a Load request. When it receives the Load request, the SPE first sets up the isolated execution environment by removing access to a portion of the local storage; the remaining portion of local storage remains open, and all accesses through the local storage alias and all MFC DMA commands are forced into the open area. After the isolated environment is initialized, the SPE loads the application into the isolated area of local storage.
As it is loaded, the code is authenticated and decrypted using a hardware root key. If the load succeeds, the application is executed. Once in the SPU isolated execution environment, the only way out is to stop the SPU and restart it with an Exit request. If the Load fails, the SPU remains halted in a transition state; the only way out of the transition state is to restart the SPU using either another Load request or an Exit request. In addition to the isolated execution environment, the SPU isolation facility provides an application with a random number generator and persistent storage. The
persistent storage maintains its state between isolation sessions and can be accessed only by an application executing in the SPU isolated execution environment.

Software development toolkit
To provide software developers exposure to the CBEA with and without access to hardware, a Software Development Kit (SDK) is available on the IBM alphaWorks* Web site (www.alphaworks.ibm.com/topics/cell), with open-source content distributed on the Barcelona Supercomputing Center Web site (www.bsc.es/projects/deepcomputing/linuxoncell). The SDK is continuously being enhanced with additional components and features. The key foundational components of the SDK are as follows:

• A Cell/B.E. processor–enabled Linux kernel supporting the unique CBEA features.
• The IBM full system simulator for the Cell/B.E. processor, which supports simulation of either uniprocessor or dual-processor systems. Both functional and cycle-accurate simulation modes are provided so that programmers can maximize either interactivity or timing accuracy.
• A system root image (sysroot) that provides a standard set of Linux operating system utilities and services for use in the simulated Linux operating system environment.
• Standard CBEA-specific libraries, including the newlib C language standard library, SIMD math libraries, and an SPE runtime management library that exposes the SPEs as heterogeneous POSIX threads.
• Code-generation tools, including the GNU and IBM XL C/C++ compilers. The compilers support the automatic generation of SIMD code for both the PPE and the SPE.
• Productivity tools such as an Eclipse-based integrated development environment (IDE). The IDE provides an integration framework for building, deploying, and managing CBEA software.
• Cell/B.E. processor–enabled debug tools, including the GNU debugger (gdb). The debugger has been enhanced to enable combined PPU and SPU debugging and to provide user access to the unique states of the Cell/B.E. processor.
• Performance analysis tools, including the OProfile system profiling tool; the Cell/B.E. processor performance counter, which provides access to the Cell/B.E. processor performance monitoring facilities; a code analyzer; the Feedback-Directed Program Restructuring (FDPR-pro) tool; and a static performance analysis tool called spu_timing. The output from many of these tools can be visualized using the Eclipse-based visual performance analyzer (VPA).
• An SPE executive for invoking SPU executables directly from a Linux shell. The executive provides standard system services, including file I/O, shared memory, memory map, and time of day.
• Many code samples that demonstrate programming techniques, optimized libraries with source code to jump-start application development, and workloads that demonstrate the computational capabilities of the Cell/B.E. processor.
• Programming model frameworks such as the Accelerated Library Framework (ALF), which provides services to assist in the development of parallel applications and libraries on multicore architectures with hierarchical memory, such as the CBEA.

The SDK is supported on x86, PowerPC 64, and Cell/B.E. processor blade systems running the Fedora** Core Linux operating system. These configurations minimize system costs, ensuring wide access for anyone who wants to evaluate Cell/B.E. technology, learn Cell/B.E. programming, or develop Cell/B.E. applications and tools.

Software standards for CBEA
Critical to the success of the CBEA is the adoption of software standards and conformance to these standards.
From the outset, the Sony Group, Toshiba, and IBM worked together to establish an initial set of standards, including the following:

• Application binary interface (ABI) specifications, including the SPU Application Binary Interface Specification and the CBE Linux Reference Implementation ABI specification.
• Language specifications, including the SPU Assembly Language Specification and the C/C++ Language Extensions for Cell Broadband Engine Architecture specification.
• Standardized library specifications, including the SPE Runtime Management Library and the SIMD Vector Math Library specifications.

Adherence to these specifications ensures that application code is portable and that tools produce binaries that will coexist. To ensure compliance with the standards, conformance tests have been developed and made generally available.
Conclusions: The power of the Cell/B.E. processor
The real power behind the CBEA is the optimization made possible by the heterogeneous design and by the memory movement, provided by the DMA, for hiding memory latency. While initially directed toward media and streaming applications, the flexibility of the architecture allows for a multitude of applications and programming techniques. For example, several papers describe CBEA applications that include features such as software-managed caches implemented in the local storage and software threading of the SPEs [8]. Additionally, a generalized implementation of a software cache is provided in the SDK.

A formal request for change (RFC) process governs changes and additions to the architecture. All RFCs require unanimous acceptance by the three partner companies: the Sony Group, Toshiba, and IBM. Each of the partner companies serves as an advocate to ensure that its business partners are represented in the RFC process.

The CBEA offers a wide variety of features to improve the performance of systems running RTOS, streaming, and computing applications. In addition, the CBEA defines a unique approach to security by offering an SPU isolation facility. When used properly, this facility provides an isolated execution environment that is not accessible from other elements or by external means. The features and flexibility of the CBEA, combined with the new development environment and software standards, create a processor architecture and ecosystem poised to set the standard for application acceleration.

Acknowledgments
The CBEA and the Cell/B.E. processor (the first implementation of the architecture) resulted from a deep collaboration of engineers from Sony Computer Entertainment Incorporated, Toshiba, and IBM. Each of the partners brought to the team a unique set of requirements, experience, and perspectives. The diversity of the team enabled it to look beyond traditional processor architectures and develop the heterogeneous processing structure provided by the CBEA. More than five years and countless person-hours went into the development of the CBEA and the Cell/B.E. processor. The authors also wish to thank the anonymous reviewers for their suggestions for improving the content of this paper.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of Linus Torvalds or Red Hat, Inc., in the United States, other countries, or both.

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc., in the United States, other countries, or both.

References
1. Cell Broadband Engine Architecture; see http://www.ibm.com/chips/techlib/techlib.nsf/techdocs/1AEEE1270EA2776387257060006E61BA.
2. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, ‘‘Introduction to the Cell Multiprocessor,’’ IBM J. Res. Dev. 49, No. 4/5, 589–604 (2005).
3. PowerPC Architecture Book, Version 2.02; see http://www.ibm.com/developerworks/power/library/pa-archguidev2/.
4. Synergistic Processor Unit Instruction Set Architecture; see http://www.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44.
5. B. Flachs, S. Asano, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, et al., ‘‘The Microarchitecture of the Synergistic Processor for a Cell Processor,’’ IEEE J. Solid-State Circuits 41, No. 1, 63–70 (2006).
6. D. A. Brokenshire, ‘‘Maximizing the Power of the Cell Broadband Engine Processor: 25 Tips to Optimal Application Performance’’; see http://www.ibm.com/developerworks/power/library/pa-celltips1/.
7. D. Krolak, ‘‘Just Like Being There: Papers from the Fall Processor Forum 2005: Unleashing the Cell Broadband Engine Processor: The Element Interconnect Bus’’; see http://www.ibm.com/developerworks/power/library/pa-fpfeib/index.html.
8. C. Benthin, I. Wald, M. Scherbaum, and H. Friedrich, ‘‘Ray Tracing on the Cell Processor’’; see http://graphics.cs.uni-sb.de/~benthin/cellrt06.pdf.

Received July 22, 2006; accepted for publication March 1, 2007; Internet publication August 9, 2007.
Charles R. Johns  IBM Systems and Technology Group, 11400 Burnet Road, Austin, Texas 78758 ([email protected]). Mr. Johns is a Senior Technical Staff Member in the Sony/Toshiba/IBM Design Center. He received his B.S. degree in electrical engineering from the University of Texas at Austin in 1984. After joining IBM Austin in 1984, Mr. Johns worked on various disk, memory, voice communication, and graphics adapters for the IBM Personal Computer. From 1988 until he transferred to the STI project in 2000, he was part of the graphics organization and was responsible for the architecture and development of entry and midrange 3D graphics adapters and raster engines. Mr. Johns is now responsible for the CBEA; he participated in the development of the Cell/B.E. processor, which is the first implementation of the CBEA. Mr. Johns is an IBM Master Inventor.

Daniel A. Brokenshire  IBM Systems and Technology Group, 11400 Burnet Road, Austin, Texas 78758 ([email protected]). Mr. Brokenshire is a Senior Technical Staff Member with six years of experience in the Cell/B.E. Processor Design Center. He currently serves as a senior member of the IBM Cell/B.E. Processor Systems Enablement team working on the Cell/B.E. processor SDK. His responsibilities include the development of programming standards, language extensions, reusable software libraries, and software documentation. Mr. Brokenshire received a B.S. degree in computer science and B.S. and M.S. degrees in electrical engineering, all from Oregon State University. Prior to his work on the Cell/B.E. processor, he enjoyed a productive career developing 3D graphics products for Tektronix, Inc., and IBM.