Challenges in GPU compilers
Anastasia Stulova, ARM
Guest lecture on Modern Compiler Design
Cambridge University
November 9th, 2016
Intro
- About me:
  - I work on OpenCL/GPU compilers in the Media Processing Group (MPG) at ARM
  - ARM (www.arm.com) is a leader in low-power microprocessor IP
  - MPG's main product is the mobile Mali GPU
    http://www.arm.com/products/graphics-and-multimedia/mali-gpu
  - Among others, we support the OpenCL and OpenGL standards
  - We work with the open-source LLVM framework (llvm.org)
  - I am the code maintainer of OpenCL in the Clang frontend (clang.llvm.org)
Abbreviations
- GPU – graphics processing unit
- GPGPU – general-purpose GPU
- ISA – instruction set architecture
- PC – program counter
- IPC – instructions per cycle
- AST – abstract syntax tree
- SIMD – single instruction, multiple data
- SIMT – single instruction, multiple threads
- SPMD – single program, multiple data
- CSE – common sub-expression elimination
- CFG – control flow graph
- FSM – finite state machine
Outline
- Brief history of GPUs
- What makes GPUs special?
- How to program GPUs?
- Writing a GPU compiler
  - Supporting new language features – Front-end
  - Optimizing for GPUs – Middle-end
  - Mapping to GPU HW – Back-end
- A few words about graphics
- Summary
Brief GPU history
- Initially created in the 1970s as highly compute-capable HW accelerators to achieve the required rendering speed in FPS (frames per second)
- GPU programmability grew to support different needs, e.g. various games, GUIs, etc.
- GPUs started offering programmability similar to regular processors
- In 2007 Nvidia launched CUDA to support fully programmable GPUs from a C-like syntax (exposing ISA, registers, pipeline), marking the beginning of the widespread GPGPU era
- Nowadays any program that contains a massive amount of data-level parallelism can be mapped efficiently to a GPU
What makes GPUs special?
- Massive data-level parallelism
- Comparison CPU vs GPU
- Lockstep execution
Types of data-level parallelism
- Classical CPU pipeline [DLX]
- SIMD (vector) pipeline – operates on closely located data
- SIMT pipeline with a bundle of 4 threads
- SPMD contains a combination of SIMD & SIMT
[Pipeline diagrams: the classical FE–DE–EX–MEM–WB pipeline; a SIMD pipeline where one FE/DE feeds several EX units; a SIMT pipeline where a bundle of four threads shares FE/DE and each thread occupies its own EX lane]
Typically all are present in GPUs
Parallelism in CPU vs GPU
- On a CPU, different cores execute completely different, unrelated sets of instructions (dozens of threads)
- On a GPU, all cores execute the same set of instructions, but not necessarily at the same point in time – the SPMD model (separate pipelines don't synchronize); thousands to millions of threads
[Diagram: a program for the CPU, where Core1 and Core2 run different instruction ranges (instr1, instr2, … instrN-1 vs instrN, instrN+1, …), versus a program for the GPU, where every thread runs the same instruction stream instr1 … instrN+1 …]
CPU vs GPU performance

LD1: r0 = load
LD2: r1 = load
ADD: r1 = r0 + r1
…

[Timing diagram, CPU pipeline with 4 stages (S1–S4), time flowing down: the two loads enter the pipeline back to back, but ADD:T1 cannot issue until both loads have completed, so the single thread T1 stalls]
CPU vs GPU performance
- A GPU can switch quickly between threads; this is achieved by adding extra resources (e.g. thousands of registers)
- Switching threads on a CPU costs thousands to millions of cycles
- Recall the GPU pipeline picture: each Tn is typically a bundle of threads (each thread travels along one lane of the pipeline)

LD1: r0 = load
LD2: r1 = load
ADD: r1 = r0 + r1
…

[Timing diagrams, both pipelines with 4 stages (S1–S4), time flowing down: in the CPU pipeline, T1's LD1/LD2 stall the pipeline until ADD:T1 can issue – throughput = 1/3 IPC; in the GPU pipeline, instructions from threads T1, T2, T3 are interleaved so an instruction completes every cycle – throughput = 1 IPC. T1 resumes once its loads are done (if more threads were available, the GPU could decide to schedule new threads instead)]
Lockstep execution (SIMT)
Divergence issue
- A bundle (subgroup in OpenCL) of threads has to execute the same PC in lockstep
- All threads in the bundle have to go through the instructions of divergent branches (even when the condition evaluates to false)
  - Similar to predicated execution
  - Inefficient, because more time is spent executing the whole program
- Do all GPUs execute instructions in lockstep? Mali-T6xx has a separate PC for each thread [IWOCL]

if (thread_id == 0)
  r3 = r1 + r2;
else
  r3 = r1 - r2;
r4 = r0 * r3;

Bundle of 4 threads executing in lockstep (time flows down); each thread executes 3 instructions instead of 2:

            Thread 0   Thread 1   Thread 2   Thread 3
ADD r1 r2   ACTIVE     INACTIVE   INACTIVE   INACTIVE
SUB r1 r2   INACTIVE   ACTIVE     ACTIVE     ACTIVE
MUL r0 r3   ACTIVE     ACTIVE     ACTIVE     ACTIVE
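For reference, a hedged OpenCL C kernel (buffer and kernel names are illustrative, not from the slides) that produces exactly this divergence pattern within a bundle:

// Work-item 0 takes the ADD side, all others the SUB side.
kernel void diverge(global int *out, global const int *in1,
                    global const int *in2, int r0) {
  size_t tid = get_global_id(0);
  int r1 = in1[tid], r2 = in2[tid];
  int r3 = (tid == 0) ? r1 + r2 : r1 - r2;  // divergent branch inside a bundle
  out[tid] = r0 * r3;
}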
How to program for GPGPUs?
- GPGPU programming model (OpenCL terminology)
  - www.khronos.org/opencl
- OpenCL language
GPGPU programming (OpenCL terminology)
[Diagram: a C/C++ program running on the Host (CPU), with its hotspot identified as the kernel to offload]
GPGPU programming (OpenCL terminology)
- WI – work-item: a single sequential unit of work with its private memory (PM), defined by a kernel in the OpenCL language
- WG – work-group: a collection of WIs that can run in parallel on the same core (sharing the same pipeline) and access local memory shared among all WIs in the same WG
- Different memories can have different access times
[Diagram: the Host (CPU), running C/C++ with the OpenCL API – creates the program for parallel computation on the Device, cross-compiles for the Device, sends work to the Device, and copies data to/from Device global memory; the Device (GPU), running the OpenCL language – WGs, each with its own Local Mem and WIs with their PM, all sharing Global Memory]
OpenCL program
Starting from C99:
+ Keywords (specify kernels, i.e. parallel work-items, and address spaces (AS) to allocate objects to memory, e.g. global, local, private, …)
+ Builtin functions and types
- Disallowed features (function pointers, recursion, C std includes)

The OpenCL kernel:

kernel void foo(global int* buf) {
  int tid = get_global_id(0) *
            get_global_size(1) +
            get_global_id(1);
  buf[tid] = cos(tid);
}

corresponds roughly to this C99/C++ host-threads loop over a DIM1 x DIM2 grid:

for (int i = 0; i < DIM1; i++)
  for (int j = 0; j < DIM2; j++)
    std::thread([i, j, buf]() {
      int tid = i * DIM2 + j;
      buf[tid] = cos(tid);
    }).detach();
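For completeness, a hedged host-side sketch (standard OpenCL C API calls; setup and error handling omitted; the variable names are illustrative) that launches foo over the same DIM1 x DIM2 grid:

size_t global[2] = {DIM1, DIM2};  // one work-item per (i, j) pair
cl_int err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
err = clEnqueueNDRangeKernel(queue, kernel, 2 /* work dims */, NULL,
                             global, NULL /* runtime picks the WG size */,
                             0, NULL, NULL);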
Compilation for GPUs
- General strategy
- Supporting new languages – OpenCL example
- Optimising for GPUs
- Mapping to GPU HW
General compiler philosophy
- OpenCL has a lot in common with C
  - Syntax, types, operators, functions
- GPU has some similarity to CPU
  - ISA, registers, pipeline
- Can we reuse existing compiler technologies?
  - Extend the C parser
  - Take existing IR formats
  - Borrow functionality from other backends
  - Integrate all the components in the same way
Supporting new languages – OpenCL example
Frontend aspects
- Supporting many new builtin types and functions
- Compiler extensions for builtin types and functions – pipe example
- Adding address space qualifiers
Builtin functions and types
- Many builtin functions have multiple prototypes, i.e. overloads (supported in C via __attribute__((overloadable)))
- Example:
  float cos(float x) __attribute__((overloadable));
  double cos(double x) __attribute__((overloadable));
- The OpenCL header is 20K lines of declarations
  - Automatically loaded by the compiler (no #include needed)
  - Affects parse time
- To speed up parsing – Precompiled Headers (PCHs) [PCH]
  - Large size, ~2 MB for the OpenCL header
  - Size affects memory consumption and disk space
- BUT some types and functions require special compiler support…
Compiler support – Pipe
- Classical streaming pattern
- Host (C/C++) code sets up the pipe and the connections to kernels
- OpenCL code specifies how elements are written/read
- A standard C parser won't offer functionality to support pipe

Device:
kernel void producer(write_only pipe int p) {
  int i = …;
  write_pipe(p, &i);
}
kernel void consumer(read_only pipe int p) {
  int i;
  read_pipe(p, &i);
}

Host:
pipe = clCreatePipe(context, 0, sizeof(int), 10 /* # packets */, …);
producer = clCreateKernel(program, "producer", &err);
consumer = clCreateKernel(program, "consumer", &err);
// Set up connections
err = clSetKernelArg(producer, 0, sizeof(pipe), pipe);
err = clSetKernelArg(consumer, 0, sizeof(pipe), pipe);
// Send work for execution on the device
err = clEnqueueNDRangeKernel(queue, producer, …);
err = clEnqueueNDRangeKernel(queue, consumer, …);
Pipe type implementation in Clang
- The pipe type specifier requires special compiler support
- Pipe builtin functions: int read_pipe(read_only pipe gentype p, gentype *ptr)
  - gentype is any builtin or user-defined type
  - Generic-programming-style behaviour is needed in C99
  - Implemented as an internal Clang builtin function with custom generic programming support

Parser (for file.cl containing: pipe int p;):
void Parser::ParseDeclarationSpecifiers(DeclSpec &DS…) {
  …
  switch (Tok.getKind()) {
  …
  case tok::kw_pipe:
    …
    isInvalid = DS.SetTypePipe(true, …);
    break;

AST – new type node:
PipeType 0xa4e41b0 'pipe int'
`-BuiltinType 0xa48df20 'int'

Sema – semantic analysis for pipe:
error: type 'pipe' can only be used as a function parameter in OpenCL

CodeGen – IR (file.ll):
%opencl.pipe_t = type opaque
…
%opencl.pipe_t* %p
Address spaces in Clang
- An OpenCL feature that affects most of the parser (being part of every type)
- Many parts were not written with this feature in mind

Parser (for file.cl containing: global int *gptr;):
void Parser::ParseDeclarationSpecifiers(...)
{
  switch (Tok.getKind()) {
  …
  case tok::kw_global:
    ParseOpenCLQualifiers(DS.getAttributes());
  …
}

AST – represented in qualifiers:
PointerType 0xa4e41b0 '__global int *'
`-QualType 0xa4e4198 '__global int' __global
`-BuiltinType 0xa48df20 'int'

Sema – added diagnostics:
- Conversion rules
  error: casting '__local int *' to type '__global int *' changes address space of pointer
- Operation validity
  error: comparison between ('__constant int *' and '__global int *') which are pointers to non-overlapping address spaces

CodeGen – IR (file.ll):
%gptr = alloca i32 addrspace(2)*
- Address spaces are attributes of types in IR
- Each number represents a memory segment ID
- The segments do not have to be identical to OpenCL's (e.g. private and local can be mapped to the same physical segment)
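As a hedged illustration, the kind of OpenCL C these added diagnostics reject (the kernel and variable names are made up for the example):

kernel void bar(global int *g, local int *l, constant int *c) {
  global int *p = (global int *)l;  // error: cast changes the pointer's address space
  if (g == c) { }                   // error: pointers to non-overlapping address spaces
}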
Optimising for GPUs
Middle-end aspects
- Do traditional optimisations always work for GPUs?
- Optimisation for SPMD/SIMT
- Exploring different parallelism granularity via vectorization
Optimizing for GPUs
- Shorter programs are still faster => sequential code optimizations are still important
- BUT some single-thread optimizations might reduce the number of concurrent threads that can run in parallel, or might prevent parallel access to memory
- A good trade-off needs to be found, because we do not always need each thread to run as fast as possible (the HW can be filled with other threads at nearly zero cost)
- There is still an overhead to maintaining threads
  - Reducing the thread count by coarsening might be beneficial in some situations (see the sketch below)
- Increased complexity and the need to weigh different considerations while optimizing – heuristics are typically good at optimizing one thing
  - Hence iterative compilation or machine learning might be helpful…
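A hedged sketch of the coarsening mentioned above, in OpenCL C (the factor of two and the indexing are illustrative):

// Original: one element per work-item.
kernel void add(global int *a, global const int *b) {
  int tid = get_global_id(0);
  a[tid] += b[tid];
}

// Coarsened by 2: half as many threads, each doing twice the work.
kernel void add_coarse2(global int *a, global const int *b) {
  int tid = get_global_id(0);
  a[2 * tid]     += b[2 * tid];
  a[2 * tid + 1] += b[2 * tid + 1];
}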
Erroneous sequential optimizations for SIMT

Original program: sub_group_shuffle moves data from r1 of the thread with id r2 into r3, across the bundle [diagram: the r1 values of threads T0–T3 are permuted into the r3 values of T0–T3]:

r1 = …
if (r0)
  r1 = computeA();
// shuffle data from r1 into r3
// of the thread with id r2
r3 = sub_group_shuffle(r1, r2);
if (r0)
  r3 = computeB();

After a sequentially valid code transformation that merges the two branches on r0 to reduce the number of branches:

r1 = …
if (!r0)
  // Incorrect functionality! The data in r1
  // have not been computed by all threads yet.
  r3 = sub_group_shuffle(r1, r2);
else {
  r1 = computeA();
  r3 = sub_group_shuffle(r1, r2);
  r3 = computeB();
}

- Observation: transformations should maintain the control flow around sub_group_shuffle exactly as in the original program
- LLVM solution: mark such functions with the special "convergent" attribute
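A hedged sketch of how such a builtin can be declared so the optimizer preserves its control dependencies (Clang does provide a convergent attribute; this particular declaration is illustrative):

// Calls to a convergent function must not gain or lose control-flow
// dependencies, which rules out the branch-merging shown above.
int sub_group_shuffle(int value, int lane) __attribute__((convergent));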
Fixing wrong optimizations
Control flow equivalence
- Recall the definition of domination/post-domination of nodes in a CFG
  - Example: A dom B, C post-dom B
- CFG G′ is equivalent to G wrt Ni iff, ∀ Nj (j≠i), the dom and post-dom relations wrt Ni remain the same in both G and G′
- Example: a valid transformation can swap A1 and A2 if it is beneficial to do so (no other valid transformations exist, since B is a convergent operation)
[CFG diagrams: original graph A1 → A2 → B → C, with B a convergent operation; after the valid transformation A2 → A1 → B → C, B remains control-equivalent to all other nodes; a wrong transformation moves A2 so that A2 no longer dominates B]
Optimizations for SPMD
Speculative execution

1: if (tid == 0) c = a[i + 1]; // computing the address of the array element = load the address of a and compute the offset (a + 1) => 2 instr, only one thread active
2: if (tid != 0) e = a[i + 2]; // as in line 1 – 2 instr for the address computation (a + 2), all threads other than thread 0

Transformed into:

1: p = &a[i + 1]; // load the address of a and compute the offset => 2 instr, all threads
2: if (tid == 0) c = *p; // only thread 0 will perform this computation
3: q = &a[i + 2]; // the address of a was already loaded on line 1, so only the offset has to be computed (CSE optimisation) => 1 instr, all threads
4: if (tid != 0) e = *q; // all threads other than thread 0 will run this

- In SPMD all lines of code will be executed by the bundle, therefore the transformed program runs 1 instr less overall (although the transformed code runs longer for a single thread)
Exploring different types of parallelism
SPMD vs SIMD

Scalar version – one element per work-item, so many threads, each executing:
r0 = load
r1 = load
r1 = r0 + r1

int *a, *b, c;
…
a[tid] = b[tid] + c;

Vectorized version – a vector operation is performed on multiple elements at once, so fewer threads:
int2 *a, *b, c;
…
a[tid] = b[tid] + c;

- Speed-up if fewer threads are created while each thread's execution time remains the same
- Is it always good to vectorize? What if the computation of c (calling some builtin function) results in many instructions that have to be duplicated in the vectorized kernel? E.g. with
  c = builtin100cycles(x)
  the vectorized kernel needs
  c.s1 = builtin100cycles(x1);
  c.s2 = builtin100cycles(x2);
  so the vectorized application makes each thread longer by 100 cycles
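A hedged OpenCL C sketch of the two variants (int2 is the builtin 2-wide vector type; the kernel names are illustrative):

// Scalar SPMD: one element per work-item.
kernel void add_scalar(global int *a, global const int *b, int c) {
  int tid = get_global_id(0);
  a[tid] = b[tid] + c;
}

// SIMD within SPMD: one int2 per work-item, so half the threads.
kernel void add_vec2(global int2 *a, global const int2 *b, int c) {
  int tid = get_global_id(0);
  a[tid] = b[tid] + (int2)(c, c);  // one vector op covers two elements
}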
Mapping to GPU HW
Backend aspects
- HW resources, occupancy and thread lifecycles
- Occupancy in register allocation
- Mapping to instructions
Occupancy and thread categorization
- HW will try to keep as many threads in flight as possible, if there is enough HW available for their execution to completion
- If all available resources are occupied, threads will wait to be accepted for execution on the GPU
- Thus thread states can be described by an FSM:
  - Waiting – waiting to be admitted for execution
  - Ready – waiting to be scheduled
  - Active – in the GPU pipeline
  - Blocked – waiting for resources to become available
  - Completed – ended execution
Occupancy in Register Allocation
- How many registers to allocate for the kernel?
  - Using more registers reduces spilling (storing to/loading from memory) by keeping data in registers
- For GPUs, #REGs can affect the number of threads that can be created, because all registers are shared and allocated to threads before scheduling them (Waiting -> Ready transition)
  - Thread interleaving reduces idle time while waiting for results
- Example RA trade-off – 1024 registers in total (see the sketch below):
  - Allocate 4 REGs per thread with 20 spills and run at most 256 threads to mask blocking
  - Allocate 16 REGs per thread with 5 spills and run at most 64 threads to mask blocking
- More information in [OPT]
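A toy back-of-envelope in C, under the slide's simplifying assumption that the 1024-entry register file is the only limit on in-flight threads (spill counts from the slide):

#include <stdio.h>

// More registers per thread => fewer in-flight threads.
enum { TOTAL_REGS = 1024 };
static int max_threads(int regs_per_thread) { return TOTAL_REGS / regs_per_thread; }

int main(void) {
  printf("%d\n", max_threads(4));   // 256 threads, ~20 spills per thread
  printf("%d\n", max_threads(16));  // 64 threads,  ~5 spills per thread
  return 0;
}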
Mapping to instructions and compilation time
- Code is often inlined and unrolled (to exploit SIMD), and therefore big basic blocks are produced
- The IR uses a large number of intrinsics to implement builtin functions and custom operations
- Exotic and complex instructions lead to many more possibilities in the instruction selection phase
- The above factors have a negative impact on compilation time (critical when compiling at runtime)

A common case from matrix multiplication – a (*, +) expression tree can be selected as either
r3 = ADD r1 r2
r4 = MUL r3 r5
OR
r4 = FMA r1 r2 r3
What about graphics?
- Fixed pipeline with programmable functionality
- Evolved from very fixed functionality to highly programmable components by writing shaders (in a C-like syntax – OpenGL)
- Evolving towards unifying graphics and compute – Vulkan [VUL]
- Sometimes requires less accuracy, as long as the visualization is correct, and therefore alternative optimizations are possible
- Harder to estimate the workload
[Pipeline diagram: Geometry (vertices) -> Vertex Shader -> Rasterizer -> Fragment Shader -> Blender; the rasterizer is the boundary between vertex and pixel processing]
Summary
- Reusing existing compilers helps shorten development time and avoid reinventing the same techniques
- Compilation approaches suitable for GPUs are emerging
  - Many grew in a very ad hoc way, combining many concepts and moving rapidly in different directions
- New challenges to come:
  - Supporting new standards (including C++) and new use cases (computer vision, deep learning)
  - Convergence between graphics and compute
  - Increasing compilation speed
  - Exploring alternative optimization strategies that take both single- and multi-threaded performance into account
References
[DLX] https://en.wikipedia.org/wiki/DLX
[IWOCL] http://www.iwocl.org/wp-content/uploads/iwocl-2014-workshop-Tim-Hartley.pdf
[PCH] http://clang.llvm.org/docs/PCHInternals.html
[OPT] https://www.cvg.ethz.ch/teaching/2011spring/gpgpu/GPU-Optimization.pdf
[VUL] https://www.khronos.org/vulkan/
Backup slides below
OpenCL evolution
- v1.0 – core part of the language and API (operations and types taken from C; special types – images, samplers, events; builtin functions – math and many more; address spaces)
- v1.2 – allows linking of separate modules + adds more types and builtin functions
- v2.0 – improves interaction between host and device: dynamic parallelism (the GPU creates new workloads), program-scope variables, streaming communication among kernels – pipes (bypassing the host), shared virtual memory among the host and all devices
- v2.1 – C++ support
- CL Next
Frontend
What does it do?
- Transforms the program from the source language into a representation suitable for compiler analysis and transformations:

d = a*b + c
=>
t1 = load a
t2 = load b
t3 = load c
t4 = mul t1, t2
t5 = add t3, t4

- Interfaces with programmers about wrong use of the language and flags potential errors in the source code:
error: use of undeclared identifier 'c'
Default address space
- The Clang parser is not always suitable for the AS rules of OpenCL
- Example: the default AS, by scope and type:

Type \ Scope   global             local
pointer        LangAS::generic    LangAS::generic
scalar         LangAS::global     LangAS::private

- The NULL literal has to be AS-agnostic to be used interchangeably: NULL = (void*)0
- Recursive-descent parsing of NULL:
  ParseExpr()
    ParseCastExpr()
      ParseType(), ParseExpr()
      …
  (there is no ParseNullLiteral(); it is not known that a NULL object is being parsed)
- Hence NULL is assigned the default AS
- Worked around by custom checks for NULL in the semantic checks
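A hedged example of why the AS of NULL matters (illustrative kernel):

kernel void k(global int *g, local int *l) {
  // (void*)0 must compare against pointers in any address space:
  if (g == NULL) { /* ... */ }
  if (l == NULL) { /* ... */ }
}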
Atomic types
- Map CL to C11 atomic types, in Sema.cpp – Sema::Initialize():
  // typedef _Atomic int atomic_int
  addImplicitTypedef("atomic_int", Context.getAtomicType(Context.IntTy));
- Only a subset of types is allowed
- Added Sema checks to restrict operations (only allowed through builtin functions):
  atomic_int a, b; a + b; // disallowed in CL
  _Atomic int a, b; a + b; // allowed in C11

  if (getLangOpts().OpenCL)
    // The only legal unary operation for atomics is '&'.
    if (Opc != UO_AddrOf && InputExpr->getType()->isAtomicType())
      return ExprError(…);

- Use C11 builtin functions to implement CL 2.0 functions
  - The memory visibility scope is missing, as LLVM doesn't have this construct:
    C atomic_exchange_explicit(volatile A *obj, C desired, memory_order order, memory_scope scope); // CL
    C atomic_exchange_explicit(volatile A *obj, C desired, memory_order order); // C11
  - It can be added as metadata or an IR extension
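A hedged sketch of the usage these checks enforce (atomic_fetch_add_explicit, memory_order_relaxed and memory_scope_device are standard CL 2.0 names; the kernel is illustrative):

kernel void count(volatile global atomic_int *ctr) {
  // Incrementing via operators is rejected; updates go through builtins:
  atomic_fetch_add_explicit(ctr, 1, memory_order_relaxed, memory_scope_device);
}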
Dynamic parallelism support
- OpenCL builtin function:
  enqueue_kernel(…, void (^block)(local void *, ...))
- block uses ObjC block syntax (a lambda function with captures)
- block can have any number of local void* arguments
  - A kind of variadic prototype
  - No standard compilation approach
- To diagnose it correctly, it was added as a Clang builtin function with a custom check
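A hedged device-side sketch of such a call in OpenCL 2.0 C (the ndrange size and flags are illustrative):

// Enqueue a child kernel from the device; the block captures x.
kernel void parent(global int *x) {
  queue_t q = get_default_queue();
  enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(64),
                 ^{ x[get_global_id(0)] += 1; });
}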
Middleend
What does it do?
- Optimizations, e.g. folding the constants below into one addition:

t1 = add a 2
t2 = sub b 1
t3 = add t1 t2
=>
t1 = add a 1
t2 = add b t1

- Transformation (preparing/making the program more suitable for the backend – lowering)
- Ideally target-agnostic, but not in reality!
IR for SPMD (LLVM example)
LLVM IR is linear-format three-address code:

define i32 @foo(i32 %i) {
entry:
  %i.addr = alloca i32
  store i32 %i, i32* %i.addr
  %0 = load i32, i32* %i.addr
  %add = add nsw i32 %0, 1
  ret i32 %add
}

In a kernel, everything OpenCL-specific is layered on top: OpenCL types are all unknown to the IR (opaque); the kernel is marked by metadata, and additional information about it is kept in metadata; the IR has an address space attribute on types; the special addrspacecast node casts between objects in different ASes (it can be eliminated for a flat-AS architecture, or mapped to an address calculation sequence); intrinsics represent special operations unavailable in IR, lowered to instructions by the backend:

%opencl.image2d_ro_t = type opaque
define void @foo(i32 addrspace(1)* %i, %opencl.image2d_ro_t addrspace(1)* %img)
    !kernel_arg_addr_space !1 !kernel_arg_access_qual !2 !kernel_arg_type !3
    !kernel_arg_base_type !3 !kernel_arg_type_qual !4 {
…
%2 = addrspacecast i32 addrspace(1)* %1 to i32 addrspace(4)*
…
%0 = call i32 @llvm.gpu.read.image(%img)
}
!1 = !{i32 1, i32 1}
!2 = !{!"none", !"read_only"}
!3 = !{!"int*", !"image2d_t"}
IR for SPMD (LLVM example)
- Ad hoc – a combination of historically grown tricks
- Do we need other formats that have explicit kernel nodes and known special types, as well as functions (i.e. barrier)?
- Some alternatives could be SPIR-V, PENCIL, …
Divergent execution support
- Multiple threads execute a single PC in lockstep
- What if threads have to go through different PCs, i.e. diverge?

r0 = …
if (r0)
  r2 = cos(r1);
else
  r2 = sin(r1);

What the compiler might do in order to support lockstep execution:

r0 = …
pred0 = r0 != 0
pred1 = r0 == 0
if (pred0) r2 = cos(r1) // result in r2 is written conditionally
if (pred1) r2 = sin(r1) // result in r2 is written conditionally

- Now all threads in the subgroup are able to execute in lockstep
- Inefficient, because more time is spent executing the whole program (each thread computes both cos and sin, but only those with a valid pred make the result effective)
- Do all GPUs execute instructions in lockstep? Mali-T6xx has a separate PC for each thread
Backend
What does it do?
- Lowering, i.e. mapping into instructions that make use of registers:

(* a b)  =>  MUL r3, r1, r2

- Can perform final optimisations too (peephole optimisations)
Backend general challenges (2/2)
- Non-technical aspects: closed source can be good because it allows better synergy with other SW components, but can be bad because it prevents knowledge sharing
- Resource occupancy…

Overall
- Closed source facilitates better interplay between SW components, but there are a lot of experimentally grown features that need re-evaluation
- Close integration with the runtime environment
- Compilation time is an issue:
  - Online compilation requires compilation speed to be minimised
  - GPU code is normally inlined and unrolled, so very large basic blocks are produced => increased compilation time, because the compiler needs to traverse more nodes
  - Many possible instructions => an increased number of choices
- Need to fit the needs of both Graphics and Compute…
Graphics vs Compute
- Different workloads:
  - Compute has more complex algorithms and uses more features (branching, loops, function calls, pointers)
  - Graphics workloads are harder to predict (each vertex can be computed in a slightly different way, while all compute threads truly go through exactly the same program flow)
- Conflicting accuracy/performance trade-off:
  - Example: DIV is expensive => many techniques (using subtract, multiply, reciprocal, etc., in HW or SW or hybrid)… rounding issues
  - CL: the exact number matters
  - GL: as long as the visualisation is correct, we don't care about the exact number
  - => different optimisations are possible with different implementations of DIV with respect to accuracy and speed