Introduction to GPU Computing

               Jeff Larkin
   Cray Supercomputing Center of Excellence
               larkin@cray.com
Goals for this tutorial
• Understand the architectural differences
  between GPUs and CPUs and the associated
  trade-offs
• Recognize several GPU programming models
  and how/when to use each
• Understand how to analyze GPU performance
• Recognize very basic GPU optimizations
This tutorial is not…
• A deep-dive on GPU programming
• The be all and end all on GPU optimization
• A recipe for getting 10, 100, 1000X speed-ups
  for your application
GPU ARCHITECTURE BASICS
Section Goals
• Recognize the differences between CPU/GPU
  architectures
• Identify when one architecture may be better
  suited than the other.
CPU/GPU Architectures
(Block diagram: the CPU side shows a few large ALUs with per-core control logic, large caches, and RAM; the GPU side shows many small ALUs grouped under shared control logic, small caches, and its own RAM.)
CPU/GPU Architectures
CPU
• Large memory, directly accessible
• Each core has its own, independent control logic
   – Allows independent execution
• Coherent caches between cores
   – Can share & synchronize

GPU
• Relatively small memory, must be managed by CPU
• Groups of compute cores share control logic
   – Saves space, power, …
• Shared cache & synchronization within groups
   – None between groups
Play to your strengths
CPU
• Tuned for serial execution with short vectors
• Multiple independent threads of execution
• Branch prediction
• Memory latency hidden by cache & prefetching
   – Requires regular data access patterns

GPU
• Tuned for highly parallel execution
• Threads work in lockstep within groups
   – Much like vectors
• Serializes branchy code
• Memory latency hidden by swapping away stalled threads
   – Requires 1000s of concurrent threads
GPU Glossary
Hardware                               Software
(CUDA) Core                            Thread/Work Unit
Streaming Multiprocessor (SM)          Thread Block/Work Group

• A Grid is a group of related Thread Blocks running the same kernel
• A Warp is Nvidia’s term for 32 Threads running in lock-step
• Warp Divergence is what happens when threads within a warp take different
  branch paths, so some of the threads stall while the others execute
• Shared Memory is a user-managed cache within a Thread Block
• Occupancy is the degree to which all of the GPU hardware can be
  used in a Kernel
    – Heavily influenced by registers/thread and threads/block
• Stream is a series of data transfers and kernel launches that happen
  in series
GPU PROGRAMMING MODELS
Section Goals
• Introduce several GPU programming models
• Discuss why someone may choose one
  programming paradigm over the others.
Explicit/Implicit GPU Programming
Explicit
• Bottom-up approach
• Explicit kernel written from threads’ perspective
• Memory management controlled by programmer
• Thread Blocks & Grid defined by programmer
• GPU code usually distinct from CPU code

Implicit
• Traditional top-down programming
   – Big picture
• Compiler handles memory and thread management
   – May be guided by programmer
• CPU & GPU may use the same code
   – Easier code maintenance
GPU Programming Models
• Explicit
   – CUDA C (Free from Nvidia)
   – CUDA Fortran (Commercial from PGI)
   – OpenCL (Free from Multiple Vendors)
• Implicit
   – Proposed OpenMP Directives (Multiple Vendors)
   – PGI Directives (Commercial from PGI)
   – HMPP Directives (Commercial from CAPS)
   – Libraries (CUBLAS, MAGMA, etc.)
Multi-node Programming
• GPU papers & tutorials usually focus on a single node; what about the rest
  of the machine?

• High-level MPI parallelism between nodes
   – You’re probably already doing this
• Loose, on-node parallelism via threads
   – Most codes today are using MPI, but threading is becoming more
     important
• Tight, on-node, vector parallelism
   – SSE/AVX on CPUs
   – GPU threaded parallelism

Programmers need to expose the same parallelism with/without GPUs
Using the Machine Efficiently
So-So Hybridization (timeline: MPI ranks on CPU 0 and CPU 1 hand work to GPU 0 and GPU 1, with one side idle while the other works)
• Neglects the CPU
• Suffers from Amdahl’s Law

Better Hybridization (timeline: MPI ranks on CPU 0 and CPU 1 stay busy while work is spread across streams 0–3 on G0 and G1)
• Overlap CPU/GPU work and data movement.
• Even better if you can overlap communication too!
Original S3D
RHS – called 6 times for each time step (Runge-Kutta iterations):
• Calculate primary variables – point-wise mesh loops within 5 different routines
• Perform derivative computation – high-order differencing
• Calculate diffusion – 3 different routines with some derivative computation
• Perform derivative computation for forming RHS – lots of communication
• Perform point-wise chemistry computation

All major loops are at a low level of the call tree.
Green – major computation, point-wise; Yellow – major computation, halos 5 zones thick.
Restructured S3D for multi-core systems
RHS – called 6 times for each time step (Runge-Kutta iterations):
• OMP loop over grid: calculate primary variables – point-wise mesh loops within 3 different routines; perform derivative computation – high-order differencing
• OMP loop over grid (overlapped): calculate primary variables – point-wise mesh loops within 2 different routines; calculate diffusion – 3 different routines with some derivative computation; perform derivative computation
• OMP loop over grid (overlapped): perform point-wise chemistry computation (1); perform derivative computation for forming RHS – lots of communication
• OMP loop over grid (overlapped): perform point-wise chemistry computation (2)
The Hybridization of S3D
Explicit: CUDA C/Fortran & OpenCL
• Programmer writes a kernel in C/Fortran that will be run on
  the GPU
   – This is essentially the loop body from original CPU code
• GPU memory must be explicitly allocated, freed, and filled
  from CPU memory over PCIe
   – Generally results in 2 variables referring to every pertinent array,
     one in each memory domain (hostA, devA)
• Programmer declares how to decompose into thread blocks
  and grid
   – Must understand limits of thread block size and how to
     maximize occupancy
• CPU code launches kernel on device.
   – May continue to work while GPU executes kernel(s)
CUDA C Example
Host Code
  double a[1000], *d_a;
  dim3 block( 1000, 1, 1 );
  dim3 grid( 1, 1, 1 );

  /* Allocate & copy to GPU */
  cudaMalloc((void**)&d_a, 1000*sizeof(double));
  cudaMemcpy(d_a, a, 1000*sizeof(double), cudaMemcpyHostToDevice);

  /* Launch */
  scaleit_kernel<<<grid,block>>>(d_a,n);

  /* Copy back & free */
  cudaMemcpy(a, d_a, 1000*sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(d_a);

GPU Code
  __global__
  void scaleit_kernel(double *a, int n)
  {
    int i = threadIdx.x;       /* my index */
    if (i < n)
      a[i] = a[i] * 2.0l;      /* calculate my element */
  }
CUDA Fortran Example
Host Code
  subroutine scaleit(a,n)
    real(8),intent(inout) :: a(n)
    real(8),device        :: d_a(n)   ! declare on device
    integer,intent(in)    :: n
    type(dim3)            :: blk, grd

    blk = dim3(1000,1,1)
    grd = dim3(1,1,1)

    d_a = a                                    ! copy to device
    call scaleit_kernel<<<grd,blk>>>(d_a,n)    ! launch
    a = d_a                                    ! copy back
  end subroutine scaleit

GPU Code
  attributes(global)&
  subroutine scaleit_kernel(a,n)
    real(8),intent(inout) :: a(n)
    integer,intent(in),value :: n
    integer i

    i = threadIdx%x            ! my index
    if (i.le.n) then           ! calculate my element
      a(i) = 2.0 * a(i)
    endif
  end subroutine scaleit_kernel
Implicit: Directives
• Programmer adds directives to existing CPU
  code
• Compiler determines
  – Memory management
  – Thread management
• Programmer adds directives to guide compiler
  – Higher-level data regions
  – Partial array updates
  – Improved thread blocking
Proposed OpenMP Directives Example

  real*8 a(1000)
  integer i

  ! Build for device; copy a on and off
  !$omp acc_region_loop acc_copy(a)
  do i=1,1000
    a(i) = 2 * a(i)
  enddo
  !$omp end acc_region_loop
Implicit: Libraries
• Calls to existing Math libraries replaced with
  accelerated libraries
  – BLAS, LAPACK
  – FFT
  – Sparse kernels
• Unless application spends very high % of
  runtime in library calls, this will need to be
  combined with other methods
Libraries Example
info = cublas_set_matrix(lda, na, sizeof_Z, a, lda, devA, lda)

info = cula_device_zgetrf(m,m,devA+idx2f(ioff+1,ioff+1,lda)*sizeof_Z,lda,devIPVT)
info = cula_device_zgetrs('n',m,ioff,devA+idx2f(ioff+1,ioff+1,lda)*sizeof_Z,lda,devIPVT,
    & devA+idx2f(ioff+1,1,lda)*sizeof_Z,lda)
call cublas_zgemm('n','n',n,ioff-k+1,na-ioff,cmone,devA+idx2f(joff+1,ioff+1,lda)*sizeof_Z,lda,
    & devA+idx2f(ioff+1,k,lda)*sizeof_Z,lda,cone,devA+idx2f(joff+1,k,lda)*sizeof_Z,lda)
call cublas_zgemm('n','n',blk_sz(1),blk_sz(1)-k+1,na-blk_sz(1),
    & cmone,devA+idx2f(1,blk_sz(1)+1,lda)*sizeof_Z,lda,
    & devA+idx2f(blk_sz(1)+1,k,lda)*sizeof_Z,lda,cone,devA,lda)

info = cublas_get_matrix(lda, na, sizeof_Z, devA, lda, a, lda)
PERFORMANCE ANALYSIS
Section Goals
• Understand multiple options for gathering
  GPU performance metrics

• An increasing number of tools is available; I’ll cover 3 methods
  – Explicit event instrumentation
  – CUDA Profiler
  – CrayPAT Preview
CUDA Event API
• Most CUDA API calls are asynchronous: explicit
  CPU timers won’t work
• CUDA allows inserting events into the stream
  – Insert an event before and after what needs to be
    timed
  – Synchronize with events
  – Calculate time between events
• Introduces small driver overhead and may
  synchronize asynchronous calls
  – Don’t use in production
CUDA Event Example
  ! (events st0..st5 created earlier with cudaEventCreate)
  ierr = cudaEventRecord(st0,0)              ! Event st0
  allocate(d_a(n))                           ! Allocate
  ierr = cudaEventRecord(st1,0)              ! Event st1
  d_a = a                                    ! Copy-in
  ierr = cudaEventRecord(st2,0)              ! Event st2
  call scaleit_kernel<<<grd,blk>>>(d_a,n)    ! Run kernel
  ierr = cudaEventRecord(st3,0)              ! Event st3
  a = d_a                                    ! Copy-out
  ierr = cudaEventRecord(st4,0)              ! Event st4
  deallocate(d_a)                            ! Deallocate
  ierr = cudaEventRecord(st5,0)              ! Event st5
  ...
  ! Synchronize on the bracketing events, then take the elapsed time
  ierr = cudaEventSynchronize(st2)
  ierr = cudaEventSynchronize(st3)
  ierr = cudaEventElapsedTime(et, st2, st3)
  write(*,*) 'Kernel Time', et
CUDA Profiler
• Built into the CUDA driver and silently enabled via an environment
  variable
   – Works with both CUDA and Directives programs
• Returns time of memory copies and kernel
  launches by default
   – Also reports kernel occupancy
   – Can be configured to report many other metrics
• All metrics are recorded at driver level and high
  resolution
   – May add small kernel overhead and synchronize
     asynchronous operations.
CUDA Profiler Example
#   Enable Profiler
$   export CUDA_PROFILE=1
$   aprun ./a.out
$   cat cuda_profile_0.log

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla M2090
# TIMESTAMPFACTOR fffff6f3e9b1f6c0
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 2.304 ] cputime=[ 23.000 ]
method=[ _Z14scaleit_kernelPdi ] gputime=[ 4.096 ] cputime=[ 15.000 ] occupancy=[ 0.667 ]
method=[ memcpyDtoH ] gputime=[ 3.072 ] cputime=[ 34.000 ]
CUDA Profiler Example
# Customize Experiment
$ cat exp.txt
l1_global_load_miss
l1_global_load_hit
$ export CUDA_PROFILE_CONFIG=exp.txt
$ aprun ./a.out
$ cat cuda_profile_0.log

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla M2090
# TIMESTAMPFACTOR fffff6f4318519c8
method,gputime,cputime,occupancy,l1_global_load_miss,l1_global_load_hit
method=[ memcpyHtoD ] gputime=[ 2.240 ] cputime=[ 23.000 ]
method=[ _Z14scaleit_kernelPdi ] gputime=[ 4.000 ] cputime=[ 36.000 ] occupancy=[ 0.667 ] l1_global_load_miss=[ 63 ] l1_global_load_hit=[ 0 ]
method=[ memcpyDtoH ] gputime=[ 3.008 ] cputime=[ 33.000 ]
CrayPAT Prototype
• Luiz DeRose is giving a tutorial on CrayPAT future
  work at CUG (you’re missing it right now)
• The goal of the CrayPAT team is to make
  instrumenting applications and understanding
  the results as simple as possible
   –   No code modification
   –   Derived metrics
   –   Optimization suggestions
   –   …
• Several new tools are being developed that will
  help with accelerator development
CrayPAT Preview: Performance Stats
5||||    1.3% | 21.836221 | 21.630958 | 6760.318 | 6760.318 | 3201 |collisionb_
||||||-------------------------------------------------------------------------------
6|||||    1.1% | 18.888240 | 18.708450 |      0.000 | 6507.596 | 1400 |collisionb_(exclusive)
|||||||------------------------------------------------------------------------------
7||||||    0.4% |   7.306387 |   7.291820 |    0.000 |    0.000 |   200 |collisionb_.ASYNC_KERNEL@li.599
7||||||    0.4% |   7.158172 |   7.156827 |    0.000 |    0.000 |   200 |collisionb_.ASYNC_KERNEL@li.568
7||||||    0.2% |   3.799065 |   3.799065 |    0.000 | 6507.596 |   200 |collisionb_.SYNC_COPY@li.593
7||||||    0.0% |   0.527203 |   0.376397 |    0.000 |    0.000 |   200 |lbm3d2p_d_.ASYNC_COPY@li.129
7||||||    0.0% |   0.073654 |   0.064766 |    0.000 |    0.000 |   200 |collisionb_.ASYNC_COPY@li.703
7||||||    0.0% |   0.013917 |   0.011082 |    0.000 |    0.000 |   199 |grad_exchange_.ASYNC_COPY@li.428
7||||||    0.0% |   0.009707 |   0.008366 |    0.000 |    0.000 |   200 |collisionb_.ASYNC_KERNEL@li.581
7||||||    0.0% |   0.000134 |   0.000127 |    0.000 |    0.000 |     1 |collisionb_.ASYNC_COPY@li.566
6|||||    0.2% |   2.947981 |   2.922508 | 6760.318 | 252.722 | 1801 |grad_exchange_
|||||||------------------------------------------------------------------------------
7||||||    0.1% |   2.485119 |   2.485119 | 6507.596 |    0.000 |   200 |collisionb_.SYNC_COPY@li.596
7||||||    0.0% |   0.107396 |   0.107396 |    0.000 | 126.361 |    200 |grad_exchange_.SYNC_COPY@li.472
7||||||    0.0% |   0.103009 |   0.103009 |  126.361 |    0.000 |   200 |grad_exchange_.SYNC_COPY@li.452
7||||||    0.0% |   0.065731 |   0.065731 |    0.000 | 126.361 |    200 |grad_exchange_.SYNC_COPY@li.439
7||||||    0.0% |   0.061754 |   0.061754 |  126.361 |    0.000 |   200 |grad_exchange_.SYNC_COPY@li.485
7||||||    0.0% |   0.056946 |   0.045612 |    0.000 |    0.000 |   200 |grad_exchange_.ASYNC_KERNEL@li.453
7||||||    0.0% |   0.029640 |   0.028101 |    0.000 |    0.000 |   200 |grad_exchange_.ASYNC_KERNEL@li.430
7||||||    0.0% |   0.025947 |   0.014719 |    0.000 |    0.000 |   200 |grad_exchange_.ASYNC_KERNEL@li.486
7||||||    0.0% |   0.012368 |   0.011011 |    0.000 |    0.000 |   200 |grad_exchange_.ASYNC_COPY@li.496
7||||||    0.0% |   0.000070 |   0.000056 |    0.000 |    0.000 |     1 |grad_exchange_.ASYNC_COPY@li.428



         This example is taken from a real user application and
             “ported” using proposed OpenMP extensions.
CrayPAT Preview: Data Transfer Stats
Host | Host Time | Acc Time | Acc Copy | Acc Copy | Calls |Group='ACCELERATOR'
Time % |             |        | In (MB) | Out (MB) |                | PE
100.0% | 42.763019 | 42.720514 | 21877.192 | 20076.420 | 703 |Total
|-----------------------------------------------------------------------------------
| 100.0% | 42.763019 | 42.720514 | 21877.192 | 20076.420 | 703 |ACCELERATOR
||----------------------------------------------------------------------------------
5|||| 4.6% | 31.319188 | 31.318755 | 19425.659 | 19425.659 | 140 |recolor_
||||||------------------------------------------------------------------------------
6||||| 4.5% | 30.661050 | 30.660616 | 18454.376 | 19425.659 | 139 |recolor_(exclusive)
|||||||-----------------------------------------------------------------------------
7|||||| 2.4% | 16.761967 | 16.761967 | 0.000 | 19425.659 | 20 |recolor_.SYNC_COPY@li.790
7|||||| 1.9% | 13.227889 | 13.227889 | 18454.376 | 0.000 | 19 |recolor_.SYNC_COPY@li.793
7|||||| 0.1% | 0.668515 | 0.668480 | 0.000 | 0.000 | 20 |recolor_.ASYNC_KERNEL@li.781
7|||||| 0.0% | 0.002122 | 0.002059 | 0.000 | 0.000 | 20 |lbm3d2p_d_.ASYNC_COPY@li.118
7|||||| 0.0% | 0.000332 | 0.000105 | 0.000 | 0.000 | 20 |recolor_.ASYNC_COPY@li.794
7|||||| 0.0% | 0.000116 | 0.000057 | 0.000 | 0.000 | 20 |recolor_.ASYNC_COPY@li.789
7|||||| 0.0% | 0.000110 | 0.000060 | 0.000 | 0.000 | 20 |recolor_.ASYNC_COPY@li.781
|||||||=============================================================================
6||||| 0.1% | 0.658138 | 0.658138 | 971.283 | 0.000 | 1 |streaming_exchange_
7|||||         |         |        |         |         |      | recolor_.SYNC_COPY@li.793
||||||==============================================================================

         Full PCIe data transfer information without any code
                             modifications.
Cray Tools: More Information
• Cray is developing a lot of tools that deserve
  more time than this tutorial allows, so…

• Go to “Cray GPU Programming Tools” BOF at
  4:15 on Wednesday (Track 15B)
• Talk to Luiz DeRose and/or Heidi Poxon while
  you’re here.
BASIC OPTIMIZATIONS
Basic Optimizations

OCCUPANCY
Calculating Occupancy
• Occupancy is the degree to which the hardware is
  saturated by your kernel
  – Generally higher occupancy results in higher
    performance
• Heavily affected by
  – Thread decomposition
  – Register usage
  – Shared memory use
• Nvidia provides an “occupancy calculator”
  spreadsheet as part of the SDK
  – Live example to follow
Calculating Occupancy
1. Get the register count
ptxas info    : Compiling entry function 'laplace_sphere_wk_kernel3' for 'sm_20'
ptxas info    : Used 36 registers, 7808+0 bytes smem, 88 bytes cmem[0], 768 bytes cmem[2]
2. Get the thread decomposition
blockdim = dim3( 4, 4, 26)
griddim = dim3(101, 16, 1)
3. Enter into occupancy calculator

Result: 54%
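
As a sanity check, that 54% can be roughly reproduced by hand. A back-of-the-envelope sketch, assuming the Fermi (sm_20) limits of 32,768 registers and 48 resident warps per SM and ignoring register-allocation granularity (the occupancy calculator handles those details):

   Threads per block:   4 × 4 × 26 = 416  →  13 warps
   Registers per block: 416 × 36 = 14,976
   Blocks per SM (register-limited): floor(32,768 / 14,976) = 2
   Active warps per SM: 2 × 13 = 26 of 48  →  26/48 ≈ 54%

Shared memory (7808 bytes/block) and the 1536-thread limit would each allow more blocks per SM, so registers are the limiter here.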
Improving the Results
• Varying #threads or shared memory use has little effect.
• Reducing registers per thread may increase occupancy.
(Occupancy-calculator charts showing these trends accompanied this slide.)
Reducing Registers/Thread
• The maximum number of registers/thread can be set via a compiler flag
  (example flags below)
• Reducing the number of registers/thread to 18 increases occupancy to 81%
• Time before: 924us
• Time after: 837us
• Improvement: ~10%
• Occupancy isn’t a silver bullet
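
The exact spelling of the register-limit flag depends on the compiler; the forms below are the ones I believe the usual toolchains accept, so treat them as a hint and check your compiler documentation:

   nvcc  -arch=sm_20 --maxrregcount=18 kernel.cu        (CUDA C)
   pgf90 -Mcuda=maxregcount:18 kernel.cuf               (CUDA Fortran, PGI)

In CUDA C, a per-kernel alternative is the __launch_bounds__() qualifier, which lets the compiler derive a register budget from the intended block size instead of a global cap.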
Occupancy Case Study
• Results from a Finite Difference Kernel,
  provided by Paulius Micikevicius of Nvidia
• Default compilation
  – 46 registers, no spills to lmem
  – runs a single 32x16 threadblock per SM
    concurrently
  – Occupancy: 33%
  – 3,395 MCells/s throughput (39.54ms)
Occupancy Case Study cont.
• Reducing Maximum Registers to 32
  – Set maximum register count via compiler flag
  – 32 registers, 44 bytes spilled to lmem
  – runs two 32x16 threadblocks per SM concurrently
  – Occupancy: 67%
  – 4,275 MCells/s (31.40ms)


• Improvement: ~26%
Basic Optimizations

ASYNCHRONICITY
Asynchronous Execution
• Most GPU Operations are Asynchronous from
  the CPU code
  – Hint: The CPU can be busy doing other things
• Current hardware can handle 1 copy-in, 1 kernel, and 1 copy-out
  simultaneously, if they are in separate streams
  – Hint: Data transfer costs can be hidden by running multiple streams and
    asynchronous transfers
Asynchronous Execution with Streams
• Synchronous execution (1 stream):
   In → Run → Out | In → Run → Out | In → Run → Out | In → Run → Out
• Asynchronous execution (3 streams):
   Stream 1: In → Run → Out
   Stream 2:      In → Run → Out
   Stream 3:           In → Run → Out
   Stream 1:                In → Run → Out
  (each chunk’s copies overlap with other chunks’ kernels)
• If data cannot remain resident on device,
  streaming may allow GPU to offset transfer costs
Asynchronous Execution: Example
• Add some number of streams to existing code
• Use asynchronous memory copies to copy part of the data to/from the device
  – GOTCHA: Host arrays must be “pinned” in order to use async copies
• Add a stream parameter to the kernel launch
• Sync time:  0.6987200
• Async time: 0.2472000

  integer :: streams(3)
  integer :: ierr,j,mystream

  do j=1,3
    ierr = cudaStreamCreate(streams(j))
  enddo

  do j=1,m
    mystream = mod(j,3) + 1     ! cycle through streams 1..3
    ierr = cudaMemcpyAsync(d_a(:,j),a(:,j),size(a(:,j)),streams(mystream))
    call scaleit_kernel<<<grd,blk,0,streams(mystream)>>>(d_a(:,j),n)
    ierr = cudaMemcpyAsync(a(:,j),d_a(:,j),size(a(:,j)),streams(mystream))
  enddo

  ierr = cudaStreamSynchronize(streams(1))
  ierr = cudaStreamSynchronize(streams(2))
  ierr = cudaStreamSynchronize(streams(3))
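
The “pinned” GOTCHA above is worth spelling out: cudaMemcpyAsync only overlaps with kernel execution when the host buffer is page-locked. A minimal CUDA C sketch of the same idea, written from scratch here rather than taken from the slides (kernel and sizes are illustrative):

  #include <cuda_runtime.h>

  __global__ void scaleit_kernel(double *a, int n)
  {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0;
  }

  int main(void)
  {
    const int N = 1000000;
    double *h_a, *d_a;
    cudaStream_t stream;

    cudaMallocHost((void**)&h_a, N*sizeof(double));  /* pinned host allocation */
    cudaMalloc((void**)&d_a, N*sizeof(double));
    cudaStreamCreate(&stream);
    for (int i = 0; i < N; i++) h_a[i] = (double)i;  /* fill host buffer */

    /* Copy-in, kernel, and copy-out all queued into one stream */
    cudaMemcpyAsync(d_a, h_a, N*sizeof(double), cudaMemcpyHostToDevice, stream);
    scaleit_kernel<<<(N+255)/256, 256, 0, stream>>>(d_a, N);
    cudaMemcpyAsync(h_a, d_a, N*sizeof(double), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   /* wait only for work queued in this stream */

    cudaStreamDestroy(stream);
    cudaFree(d_a);
    cudaFreeHost(h_a);
    return 0;
  }

In CUDA Fortran the equivalent is, I believe, the pinned attribute on the allocatable host array.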
Asynchronous Case Study




CAVEAT: The above kernel over-emphasizes data transfer, thus necessitating
                               streaming.
Basic Optimizations

SHARED MEMORY
Shared Memory
• Much like CPU cache, shared memory is much faster
  than global memory (up to 100X lower latency)
   – Staging Area
   – Scratch Pad
• 64KB of memory sits on each SM
   – With Fermi, this is split between user-managed shared memory and L1:
     48/16 or 16/48
   – The split can be chosen kernel by kernel (see the sketch after this list)
• If data is shared between threads in a thread block or
  reused well, staging it into shared memory may be
  beneficial
   – Think: Cache Prefetching
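
Selecting the split per kernel is done through the CUDA runtime; a minimal CUDA C sketch (my_kernel is a hypothetical placeholder, not from the slides):

  #include <cuda_runtime.h>

  __global__ void my_kernel(double *a) { /* hypothetical kernel body */ }

  void configure_cache(void)
  {
    /* Prefer 48KB user-managed shared memory / 16KB L1 for this kernel... */
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    /* ...or flip it: 16KB shared / 48KB L1. */
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
  }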
Simple Matrix Multiply
ptxas info    : Compiling entry function 'mm1_kernel' for 'sm_20'
ptxas info    : Used 22 registers, 60 bytes cmem[0]

  attributes(global)&
  subroutine mm1_kernel(C,A,B,N)
    integer, value, intent(in) :: N
    real(8), intent(in)    :: A(N,N),B(N,N)
    real(8), intent(inout) :: C(N,N)

    integer i,j,k
    real(8) :: val

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    j = (blockIdx%y - 1) * blockDim%y + threadIdx%y

    val = C(i,j)
    do k=1,N
      val = val + A(i,k) * B(k,j)
    enddo
    C(i,j) = val
  end

• No shared memory use; relies entirely on the hardware L1

Kernel     Time (ms)   Occupancy
Simple     269.0917    67%
Tiled Matrix Multiply
ptxas info    : Compiling entry function 'mm2_kernel' for 'sm_20'
ptxas info    : Used 18 registers, 16384+0 bytes smem, 60 bytes cmem[0], 4 bytes cmem[16]

  integer,parameter :: M = 32
  real(8),shared :: AS(M,M),BS(M,M)
  real(8) :: val

  val = C(i,j)
  do blk=1,N,M
    AS(threadIdx%x,threadIdx%y) = &
      A(blk+threadIdx%x-1,blk+threadIdx%y-1)
    BS(threadIdx%x,threadIdx%y) = &
      B(blk+threadIdx%x-1,blk+threadIdx%y-1)
    call syncthreads()
    do k=1,M
      val = val + AS(threadIdx%x,k) * BS(k,threadIdx%y)
    enddo
    call syncthreads()
  enddo
  C(i,j) = val

• Now uses 16K of shared memory

Kernel     Time (ms)   Occupancy
Simple     269.0917    67%
Tiled      213.7160    67%
What if we increase the occupancy?
•   With 32x32 blocks, we’ll never get above 67%
•   Reduce block size from 32x32 to 16x16?
                Kernel                    Time (ms)      Occupancy
                Simple (32x32)            269.0917       67%
                Tiled (32x32)             213.7160       67%
                Simple (16x16)            371.7050       83%
                Tiled (16x16)             209.8233       83%

•   Reduce Max Registers to 18?
                 Kernel                               Time (ms)   Occupancy
                 Simple (16x16)                       371.7050    83%

                 Tiled (16x16)                        209.8233    83%

                 Simple (16x16) 18 registers          345.7340    100%

                 Tiled (16x16) 18 registers           212.2826    100%

•   It turns out that 16 registers is even worse.
Basic Optimizations

MEMORY COALESCING
Coalescing Memory Accesses
• The GPU will try to load needed memory in as
  few memory transactions as possible.
  – 128 B if possible
  – If not, 2 X 64 B
  – If not, 64 B may be split to 32 B
  – Continue until every thread has needed data
• Coalescing is possible if:
  – 128B aligned
  – All threads access elements in same segment
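
As a concrete illustration (a CUDA C sketch written for this write-up, not taken from the slides), the first kernel below generates one 128B transaction per warp, while the second’s stride-2 pattern touches twice as many segments for the same amount of useful data:

  /* Coalesced: 32 consecutive 4-byte loads from a warp fall in a single
     128B segment (assuming a 128B-aligned base, which cudaMalloc provides,
     and a block size that is a multiple of 32). */
  __global__ void copy_coalesced(const float *in, float *out, int n)
  {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
  }

  /* Strided: the same warp now spans two 128B segments and uses only
     half of each fetched line. */
  __global__ void copy_strided(const float *in, float *out, int n)
  {
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n) out[i] = in[i];
  }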
Why is coalescing important?
• Issuing 1 128B transaction reduces memory
  latency and better utilizes memory bandwidth
• L1/Shared Memory cache lines are 128B
  – Not using all fetched addresses wastes bandwidth
• Nvidia Guide: “Because of this possible
  performance degradation, memory coalescing
  is the most critical aspect of performance
  optimization of device memory.”
Coalescing Examples
Simple, stride-1 (figure: a warp of threads mapped onto consecutive elements,
with Segment 0 and Segment 1 marked and all of the warp’s accesses falling in
one segment):

Every thread accesses memory within the same 128B-aligned memory segment, so
the hardware will coalesce the accesses into 1 transaction.
Will This Coalesce?

(Figure: stride-1 accesses that fit entirely within one aligned 128B segment.)
Yes! Every thread is still accessing memory within a single 128B segment and
the segment is 128B aligned.

(Figure: stride-1 accesses shifted so they straddle two 128B segments.)
No. Although this is stride-1, it is misaligned, accessing 2 128B segments.
2 64B transactions will result.
Will This Coalesce?
Stride-2, half warp (figure: a warp touching every other element, spanning two
128B segments):

Yes, but…
• Half of the memory transaction is wasted.
• Poor utilization of the memory bus.
Striding
• Striding results in more memory transactions and wastes cache-line entries

(Chart: “Striding: Relative Bandwidth” – relative bandwidth, 1/Time(s),
plotted against strides from 0 to 32.)

  attributes(global)&
  subroutine stride_kernel(datin, datout, st)
    integer,value :: st
    real(8) :: datin(n), datout(n)
    integer i

    i = (blockIdx%x * blockDim%x) + (threadIdx%x * st)
    datout(i) = datin(i)
  end subroutine stride_kernel
Offsets (Not 128B-aligned)
• Memory offsets result in more memory transactions by crossing segment
  boundaries

(Chart: “Offset: Relative Bandwidth” – relative bandwidth, 1/Time(ms),
plotted against offsets from 0 to 32, with the 128B boundaries marked.)

  attributes(global)&
  subroutine offset_kernel(datin, datout, st)
    integer,value :: st
    real(8) :: datin(n), datout(n)
    integer i

    i = (blockIdx%x * blockDim%x) + threadIdx%x + st
    datout(i) = datin(i)
  end subroutine offset_kernel
ADDITIONAL RESOURCES
On The Web
• GTC 2010 Tutorials:
  http://www.nvidia.com/object/gtc2010-presentation-archive.html
• Nvidia CUDA online resources:
  http://developer.nvidia.com/cuda-education-training
• PGI CUDA Fortran:
  http://www.pgroup.com/resources/cudafortran.htm

More Related Content

What's hot (20)

ODP
Apache con 2013-hadoop
Steve Watt
 
PPSX
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
AMD Developer Central
 
PPT
Multi core-architecture
Piyush Mittal
 
PDF
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
Shinya Takamaeda-Y
 
PPTX
Stream Processing
arnamoy10
 
PDF
What is simultaneous multithreading
Fraboni Ec
 
PDF
05 defense
Ruben Vandamme
 
PPT
Multi threaded rtos
James Wong
 
PPT
Hardware multithreading
Fraboni Ec
 
PDF
Omp tutorial cpugpu_programming_cdac
Ganesan Narayanasamy
 
PPTX
Gpu with cuda architecture
Dhaval Kaneria
 
PDF
Monte Carlo G P U Jan2010
John Holden
 
PDF
Cpu Cache and Memory Ordering——并发程序设计入门
frogd
 
PPSX
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
AMD Developer Central
 
PDF
Cuda tutorial
Mahesh Khadatare
 
PPTX
Modeling & design multi-core NUMA simulator
Abed Maatalla
 
PPTX
Multicore processor by Ankit Raj and Akash Prajapati
Ankit Raj
 
PDF
Parallel and Distributed Computing on Low Latency Clusters
Vittorio Giovara
 
PDF
PostgreSQL with OpenCL
Muhaza Liebenlito
 
PPTX
IP Address Lookup By Using GPU
Jino Antony
 
Apache con 2013-hadoop
Steve Watt
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
AMD Developer Central
 
Multi core-architecture
Piyush Mittal
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
Shinya Takamaeda-Y
 
Stream Processing
arnamoy10
 
What is simultaneous multithreading
Fraboni Ec
 
05 defense
Ruben Vandamme
 
Multi threaded rtos
James Wong
 
Hardware multithreading
Fraboni Ec
 
Omp tutorial cpugpu_programming_cdac
Ganesan Narayanasamy
 
Gpu with cuda architecture
Dhaval Kaneria
 
Monte Carlo G P U Jan2010
John Holden
 
Cpu Cache and Memory Ordering——并发程序设计入门
frogd
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
AMD Developer Central
 
Cuda tutorial
Mahesh Khadatare
 
Modeling & design multi-core NUMA simulator
Abed Maatalla
 
Multicore processor by Ankit Raj and Akash Prajapati
Ankit Raj
 
Parallel and Distributed Computing on Low Latency Clusters
Vittorio Giovara
 
PostgreSQL with OpenCL
Muhaza Liebenlito
 
IP Address Lookup By Using GPU
Jino Antony
 

Similar to CUG2011 Introduction to GPU Computing (20)

PPTX
CPU Caches
shinolajla
 
PPTX
network ram parallel computing
Niranjana Ambadi
 
PDF
CPU Caches - Jamie Allen
jaxconf
 
PDF
Multithreaded processors ppt
Siddhartha Anand
 
PDF
Lecture 7 cuda execution model
Vajira Thambawita
 
PPTX
Hardware-aware thread scheduling: the case of asymmetric multicore processors
Achille Peternier
 
PDF
Preparing Codes for Intel Knights Landing (KNL)
AllineaSoftware
 
PPT
Current Trends in HPC
Putchong Uthayopas
 
PDF
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
Edge AI and Vision Alliance
 
PPTX
Throughput oriented aarchitectures
Nomy059
 
PDF
General Purpose GPU Computing
GlobalLogic Ukraine
 
PPTX
High performance computing
punjab engineering college, chandigarh
 
PDF
GPU Computing for Data Science
Domino Data Lab
 
PPT
module4.ppt
Subhasis Dash
 
PDF
High Performance Computer Architecture
Subhasis Dash
 
PDF
lecture_GPUArchCUDA02-CUDAMem.pdf
Tigabu Yaya
 
PDF
fundamentals of digital communication Unit 5_microprocessor.pdf
shubhangisonawane6
 
PDF
Shak larry-jeder-perf-and-tuning-summit14-part1-final
Tommy Lee
 
PDF
Pgopencl
Tim Child
 
CPU Caches
shinolajla
 
network ram parallel computing
Niranjana Ambadi
 
CPU Caches - Jamie Allen
jaxconf
 
Multithreaded processors ppt
Siddhartha Anand
 
Lecture 7 cuda execution model
Vajira Thambawita
 
Hardware-aware thread scheduling: the case of asymmetric multicore processors
Achille Peternier
 
Preparing Codes for Intel Knights Landing (KNL)
AllineaSoftware
 
Current Trends in HPC
Putchong Uthayopas
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
Edge AI and Vision Alliance
 
Throughput oriented aarchitectures
Nomy059
 
General Purpose GPU Computing
GlobalLogic Ukraine
 
High performance computing
punjab engineering college, chandigarh
 
GPU Computing for Data Science
Domino Data Lab
 
module4.ppt
Subhasis Dash
 
High Performance Computer Architecture
Subhasis Dash
 
lecture_GPUArchCUDA02-CUDAMem.pdf
Tigabu Yaya
 
fundamentals of digital communication Unit 5_microprocessor.pdf
shubhangisonawane6
 
Shak larry-jeder-perf-and-tuning-summit14-part1-final
Tommy Lee
 
Pgopencl
Tim Child
 
Ad

More from Jeff Larkin (17)

PDF
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Jeff Larkin
 
PDF
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
Jeff Larkin
 
PDF
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Jeff Larkin
 
PDF
Performance Portability Through Descriptive Parallelism
Jeff Larkin
 
PDF
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
Jeff Larkin
 
PDF
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
Jeff Larkin
 
PDF
SC13: OpenMP and NVIDIA
Jeff Larkin
 
PDF
Refactoring Applications for the XK7 and Future Hybrid Architectures
Jeff Larkin
 
PDF
Optimizing GPU to GPU Communication on Cray XK7
Jeff Larkin
 
PPTX
Progress Toward Accelerating CAM-SE
Jeff Larkin
 
PDF
HPCMPUG2011 cray tutorial
Jeff Larkin
 
PDF
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Jeff Larkin
 
PDF
May2010 hex-core-opt
Jeff Larkin
 
PDF
A Comparison of Accelerator Programming Models
Jeff Larkin
 
PDF
Cray XT Porting, Scaling, and Optimization Best Practices
Jeff Larkin
 
PDF
XT Best Practices
Jeff Larkin
 
PDF
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Jeff Larkin
 
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Jeff Larkin
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
Jeff Larkin
 
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUs
Jeff Larkin
 
Performance Portability Through Descriptive Parallelism
Jeff Larkin
 
GTC16 - S6510 - Targeting GPUs with OpenMP 4.5
Jeff Larkin
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
Jeff Larkin
 
SC13: OpenMP and NVIDIA
Jeff Larkin
 
Refactoring Applications for the XK7 and Future Hybrid Architectures
Jeff Larkin
 
Optimizing GPU to GPU Communication on Cray XK7
Jeff Larkin
 
Progress Toward Accelerating CAM-SE
Jeff Larkin
 
HPCMPUG2011 cray tutorial
Jeff Larkin
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Jeff Larkin
 
May2010 hex-core-opt
Jeff Larkin
 
A Comparison of Accelerator Programming Models
Jeff Larkin
 
Cray XT Porting, Scaling, and Optimization Best Practices
Jeff Larkin
 
XT Best Practices
Jeff Larkin
 
Practical Examples for Efficient I/O on Cray XT Systems (CUG 2009)
Jeff Larkin
 
Ad

Recently uploaded (20)

PPTX
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PDF
NASA A Researcher’s Guide to International Space Station : Physical Sciences ...
Dr. PANKAJ DHUSSA
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PDF
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PDF
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
PDF
Future-Proof or Fall Behind? 10 Tech Trends You Can’t Afford to Ignore in 2025
DIGITALCONFEX
 
PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
NASA A Researcher’s Guide to International Space Station : Physical Sciences ...
Dr. PANKAJ DHUSSA
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
POV_ Why Enterprises Need to Find Value in ZERO.pdf
darshakparmar
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
Future-Proof or Fall Behind? 10 Tech Trends You Can’t Afford to Ignore in 2025
DIGITALCONFEX
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 

CUG2011 Introduction to GPU Computing

  • 1. Introduction to GPU Computing Jeff Larkin Cray Supercomputing Center of Excellence [email protected]
  • 2. Goals for this tutorial • Understand the architectural differences between GPUs and CPUs and the associated trade-offs • Recognize several GPU programming models and how/when to use each • Understand how to analyze GPU performance • Recognize very basic GPU optimizations
  • 3. This tutorial is not… • A deep-dive on GPU programming • The be all and end all on GPU optimization • A recipe for getting 10, 100, 1000X speed-ups for your application
  • 5. Section Goals • Recognize the differences between CPU/GPU architectures • Identify when one architecture may be better suited than the other.
  • 6. CPU/GPU Architectures CPU GPU ALU ALU Control ALU Control ALU ALU ALU Cache Cache Cache Cache RAM RAM
  • 7. CPU/GPU Architectures CPU GPU • Large memory, directly • Relatively small memory, accessible must be managed by CPU • Each core has own, • Groups of compute cores independent control logic share control logic – Allows independent – Saves space, power, … execution • Shared cache & • Coherent caches between synchronization within cores groups – Can share & synchronize – None between groups
  • 8. Play to your strengths CPU GPU • Tuned for serial execution • Tuned for highly parallel with short vectors execution • Multiple independent • Threads work in lockstep threads of execution within groups – Much like vectors • Branch-prediction • Serializes branchy code • Memory latency hidden by • Memory latency hidden by cache & prefetching swapping away stalled – Requires regular data access threads patterns – Requires 1000s of concurrent threads
  • 9. GPU Glossary Hardware Software (CUDA) Core Thread/Work Unit Streaming Multiprocessor (SM) Thread Block/Work Group • A Grid is a group of related Thread Blocks running the same kernel • A Warp is Nvidia’s term for 32 Threads running in lock-step • Warp Diversion is what happens when some threads within a warp stall due to a branch • Shared Memory is a user-managed cache within a Thread Block • Occupancy is the degree to which all of the GPU hardware can be used in a Kernel – Heavily influenced by registers/thread and threads/block • Stream is a series of data transfers and kernel launches that happen in series
  • 11. Section Goals • Introduce several GPU programming models • Discuss why someone may choose one programming paradigm over the others.
  • 12. Explicit/Implicit GPU Programming Explicit Implicit • Bottom-up approach • Traditional Top-down • Explicit Kernel written from programming threads’ perspective – Big Picture • Memory management • Compiler handles memory controlled by programmer and thread management – May be guided by • Thread Blocks & Grid programmer defined by programmer • CPU & GPU may use the • GPU code usually distinct same code from CPU code – Easier code maintenance
  • 13. GPU Programming Models • Explicit – CUDA C (Free from Nvidia) – CUDA Fortran (Commercial from PGI) – OpenCL (Free from Multiple Vendors) • Implicit – Proposed OpenMP Directives (Multiple Vendors) – PGI Directives (Commercial from PGI) – HMPP Directives (Commercial from CAPS) – Libraries (CUBLAS, MAGMA, etc.)
  • 14. Multi-node Programming • GPU papers & tutorials usually focus on 1 node, what about the rest of the machine? • High-level MPI parallelism between nodes – You’re probably already doing this • Loose, on-node parallelism via threads – Most codes today are using MPI, but threading is becoming more important • Tight, on-node, vector parallelism – SSE/AVX on CPUs – GPU threaded parallelism Programmers need to expose the same parallelism with/without GPUs
  • 15. Using the Machine Efficiently So-So Hybridization Better Hybridization MPI MPI MPI MPI CPU 0 CPU 1 G0 0 1 2 3 G1 0 1 2 3 Time GPU 0 GPU 1 MPI MPI CPU 0 CPU 1 MPI MPI • Overlap CPU/GPU work and CPU 0 CPU 1 data movement. • Neglects the CPU • Even better if you can • Suffers from Amdahl’s Law overlap communication too!
  • 16. Original S3D RHS – Called 6 times for each time step – Runge Kutta iterations Calculate Primary Variable – point wise All major loops are at low level of the Mesh loops within 5 different routines Call tree Green – major computation – point-wise Yellow – major computation – Halos 5 zones Perform Derivative computation – High thick order differencing Calculate Diffusion – 3 different routines with some derivative computation Perform Derivative computation for forming rhs – lots of communication Perform point-wise chemistry computation 5/24/2011 16
  • 17. Restructured S3D for multi-core systems RHS – Called 6 times for each time step – Runge Kutta iterations Calculate Primary Variable – point wise OMP loop over grid Mesh loops within 3 different routines Perform Derivative computation – High order differencing Calculate Primary Variable – point Overlapped OMP loop over grid wise Mesh loops within 2 different routines Calculate Diffusion – 3 different routines with some derivative computation Perform derivative computation Overlapped OMP loop over grid Perform point-wise chemistry computation (1) Perform Derivative computation for forming rhs – lots of communication Overlapped OMP loop over grid Perform point-wise chemistry computation (2) 5/24/2011
  • 18. The Hybridization of S3D 5/24/2011 18
  • 19. Explicit: CUDA C/Fortran & OpenCL • Programmer writes a kernel in C/Fortran that will be run on the GPU – This is essentially the loop body from original CPU code • GPU memory must be explicitly allocated, freed, and filled from CPU memory over PCIe – Generally results in 2 variables referring to every pertinent array, one in each memory domain (hostA, devA) • Programmer declares how to decompose into thread blocks and grid – Must understand limits of thread block size and how to maximize occupancy • CPU code launches kernel on device. – May continue to work while GPU executes kernel(s)
  • 20. CUDA C Example Host Code GPU Code Allocate & double a[1000], *d_a; dim3 block( 1000, 1, 1 ); Copy to GPU __global__ dim3 grid( 1, 1, 1 ); void scaleit_kernel(double *a,int n) cudaMalloc((void**)&d_a, 1000*sizeof(double)); cudaMemcpy(d_a, a, { 1000*sizeof(double),cudaMemcpyHostToDev ice); int i = threadIdx.x; My Index scaleit_kernel<<<grid,block>>>(d_a,n); Launch cudaMemcpy(a, d_a, if (i < n) Calculate 1000*sizeof(double),cudaMemcpyDeviceToH ost); a[i] = a[i] * 2.0l; Myself cudaFree(d_a); } Copy Back & Free
  • 21. CUDA Fortran Example Host Code GPU Code subroutine scaleit(a,n) attributes(global)& real(8),intent(inout) :: a(n) Declare on subroutine scaleit_kernel(a,n) real(8),device :: d_a(n) Device real(8),intent(inout) :: a(n) integer,intent(in) :: n type(dim3) :: blk, grd integer,intent(in),value :: n integer I blk = dim3(1000,1,1) grd = dim3(1,1,1) i = threadIdx%x My Index d_a = a Copy To Device call scaleit_kernel<<<grd,blk>>>(d_a,n) if (i.le.n) then Calculate a = d_a a(i) = 2.0 * a(i) Myself end subroutine scaleit Launch & Copy endif Back end subroutine scaleit_kernel
  • 22. Implicit: Directives • Programmer adds directives to existing CPU code • Compiler determines – Memory management – Thread management • Programmer adds directives to guide compiler – Higher-level data regions – Partial array updates – Improved thread blocking
  • 23. Proposed OpenMP Directives Example real*8 a(1000) integer i Build for device, Copy a on and off !$omp acc_region_loop acc_copy(a) do i=1,1000 a(i) = 2 * a(i) enddo !$omp end acc_region_loop
  • 24. Implicit: Libraries • Calls to existing Math libraries replaced with accelerated libraries – BLAS, LAPACK – FFT – Sparse kernels • Unless application spends very high % of runtime in library calls, this will need to be combined with other methods
  • 25. Libraries Example info = cublas_set_matrix(lda, na, sizeof_Z, a, lda, devA, lda) info = cula_device_zgetrf(m,m,devA+idx2f(ioff+1,ioff+1,lda)*sizeof_Z,lda,devIPVT) info = cula_device_zgetrs('n',m,ioff,devA+idx2f(ioff+1,ioff+1,lda)*sizeof_Z,lda,devIPVT, & devA+idx2f(ioff+1,1,lda)*sizeof_Z,lda) call cublas_zgemm('n','n',n,ioff-k+1,na-ioff,cmone,devA+idx2f(joff+1,ioff+1,lda)*sizeof_Z,lda, & devA+idx2f(ioff+1,k,lda)*sizeof_Z,lda,cone,devA+idx2f(joff+1,k,lda)*sizeof_Z,lda) call cublas_zgemm('n','n',blk_sz(1),blk_sz(1)-k+1,na-blk_sz(1), & cmone,devA+idx2f(1,blk_sz(1)+1,lda)*sizeof_Z,lda, & devA+idx2f(blk_sz(1)+1,k,lda)*sizeof_Z,lda,cone,devA,lda) info = cublas_get_matrix(lda, na, sizeof_Z, devA, lda, a, lda)
  • 27. Section Goals • Understand multiple options for gathering GPU performance metrics • Increasing number of tools available, I’ll cover 3 methods – Explicit event instrumentation – CUDA Profiler – CrayPAT Preview
  • 28. CUDA Event API • Most CUDA API calls are asynchronous: explicit CPU timers won’t work • CUDA allows inserting events into the stream – Insert an event before and after what needs to be timed – Synchronize with events – Calculate time between events • Introduces small driver overhead and may synchronize asynchronous calls – Don’t use in production
  • 29. CUDA Event Example Event st0 Allocate ierr = cudaEventRecord(st0,0) Event st1 allocate(d_a(n)) ierr = cudaEventRecord(st1,0) Copy-in d_a = a ierr = cudaEventRecord(st2,0) Event st2 call & scaleit_kernel<<<grd,blk>>>& (d_a,n) Run Kernel ierr = cudaEventRecord(st3,0) a = d_a Event st3 ierr = cudaEventRecord(st4,0) deallocate(d_a) Copy-out ierr = cudaEventRecord(st5,0) ... Event st4 ierr = cudaEventSynchronize(st2) Deallocate ierr = cudaEventSynchronize(st3) ierr = cudaEventElapsedTime & Event st5 (et, st2, st3) Synchronize write(*,*)‘Kernel Time',et
  • 30. CUDA Profiler • Silently built-in to CUDA driver and enabled via environment variable – Works with both CUDA and Directives programs • Returns time of memory copies and kernel launches by default – Also reports kernel occupancy – Can be configured to report many other metrics • All metrics are recorded at driver level and high resolution – May add small kernel overhead and synchronize asynchronous operations.
  • 31. CUDA Profiler Example # Enable Profiler $ export CUDA_PROFILE=1 $ aprun ./a.out $ cat cuda_profile_0.log # CUDA_PROFILE_LOG_VERSION 2.0 # CUDA_DEVICE 0 Tesla M2090 # TIMESTAMPFACTOR fffff6f3e9b1f6c0 method,gputime,cputime,occupancy method=[ memcpyHtoD ] gputime=[ 2.304 ] cputime=[ 23.000 ] method=[ _Z14scaleit_kernelPdi ] gputime=[ 4.096 ] cputime=[ 15.000 ] occupancy=[ 0.667 ] method=[ memcpyDtoH ] gputime=[ 3.072 ] cputime=[ 34.000 ]
  • 32. CUDA Profiler Example # Customize Experiment $ cat exp.txt l1_global_load_miss l1_global_load_hit $ export CUDA_PROFILE_CONFIG=exp.txt $ aprun ./a.out $ cat cuda_profile_0.log # CUDA_PROFILE_LOG_VERSION 2.0 # CUDA_DEVICE 0 Tesla M2090 # TIMESTAMPFACTOR fffff6f4318519c8 method,gputime,cputime,occupancy,l1_global_load_miss,l1_global_load_hit method=[ memcpyHtoD ] gputime=[ 2.240 ] cputime=[ 23.000 ] method=[ _Z14scaleit_kernelPdi ] gputime=[ 4.000 ] cputime=[ 36.000 ] occupancy=[ 0.667 ] l1_global_load_miss=[ 63 ] l1_global_load_hit=[ 0 ] method=[ memcpyDtoH ] gputime=[ 3.008 ] cputime=[ 33.000 ]
  • 33. CrayPAT Prototype • Luiz DeRose is giving a tutorial on CrayPAT future work at CUG (you’re missing it right now) • The goal of the CrayPAT team is to make instrumenting applications and understanding the results as simple as possible – No code modification – Derived metrics – Optimization suggestions – … • Several new tools are being developed that will help with accelerator development
  • 34. CrayPAT Preview: Performance Stats 5|||| 1.3% | 21.836221 | 21.630958 | 6760.318 | 6760.318 | 3201 |collisionb_ ||||||------------------------------------------------------------------------------- 6||||| 1.1% | 18.888240 | 18.708450 | 0.000 | 6507.596 | 1400 |collisionb_(exclusive) |||||||------------------------------------------------------------------------------ 7|||||| 0.4% | 7.306387 | 7.291820 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.4% | 7.158172 | 7.156827 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.2% | 3.799065 | 3.799065 | 0.000 | 6507.596 | 200 |[email protected] 7|||||| 0.0% | 0.527203 | 0.376397 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.073654 | 0.064766 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.013917 | 0.011082 | 0.000 | 0.000 | 199 |[email protected] 7|||||| 0.0% | 0.009707 | 0.008366 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.000134 | 0.000127 | 0.000 | 0.000 | 1 |[email protected] 6||||| 0.2% | 2.947981 | 2.922508 | 6760.318 | 252.722 | 1801 |grad_exchange_ |||||||------------------------------------------------------------------------------ 7|||||| 0.1% | 2.485119 | 2.485119 | 6507.596 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.107396 | 0.107396 | 0.000 | 126.361 | 200 |[email protected] 7|||||| 0.0% | 0.103009 | 0.103009 | 126.361 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.065731 | 0.065731 | 0.000 | 126.361 | 200 |[email protected] 7|||||| 0.0% | 0.061754 | 0.061754 | 126.361 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.056946 | 0.045612 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.029640 | 0.028101 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.025947 | 0.014719 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.012368 | 0.011011 | 0.000 | 0.000 | 200 |[email protected] 7|||||| 0.0% | 0.000070 | 0.000056 | 0.000 | 0.000 | 1 |[email protected] This example is taken from a real user application and “ported” using proposed OpenMP extensions.
  • 35. CrayPAT Preview: Data Transfer Stats Host | Host Time | Acc Time | Acc Copy | Acc Copy | Calls |Group='ACCELERATOR' Time % | | | In (MB) | Out (MB) | | PE 100.0% | 42.763019 | 42.720514 | 21877.192 | 20076.420 | 703 |Total |----------------------------------------------------------------------------------- | 100.0% | 42.763019 | 42.720514 | 21877.192 | 20076.420 | 703 |ACCELERATOR ||---------------------------------------------------------------------------------- 5|||| 4.6% | 31.319188 | 31.318755 | 19425.659 | 19425.659 | 140 |recolor_ ||||||------------------------------------------------------------------------------ 6||||| 4.5% | 30.661050 | 30.660616 | 18454.376 | 19425.659 | 139 |recolor_(exclusive) |||||||----------------------------------------------------------------------------- 7|||||| 2.4% | 16.761967 | 16.761967 | 0.000 | 19425.659 | 20 |[email protected] 7|||||| 1.9% | 13.227889 | 13.227889 | 18454.376 | 0.000 | 19 |[email protected] 7|||||| 0.1% | 0.668515 | 0.668480 | 0.000 | 0.000 | 20 |[email protected] 7|||||| 0.0% | 0.002122 | 0.002059 | 0.000 | 0.000 | 20 |[email protected] 7|||||| 0.0% | 0.000332 | 0.000105 | 0.000 | 0.000 | 20 |[email protected] 7|||||| 0.0% | 0.000116 | 0.000057 | 0.000 | 0.000 | 20 |[email protected] 7|||||| 0.0% | 0.000110 | 0.000060 | 0.000 | 0.000 | 20 |[email protected] |||||||============================================================================= 6||||| 0.1% | 0.658138 | 0.658138 | 971.283 | 0.000 | 1 |streaming_exchange_ 7||||| | | | | | | [email protected] ||||||============================================================================== Full PCIe data transfer information without any code modifications.
  • 36. Cray Tools: More Information • Cray is developing a lot of tools that deserve more time than this tutorial allows, so… • Go to “Cray GPU Programming Tools” BOF at 4:15 on Wednesday (Track 15B) • Talk to Luiz DeRose and/or Heidi Poxon while you’re here.
  • 39. Calculating Occupancy • Occupancy is the degree to which the hardware is saturated by your kernel – Generally higher occupancy results in higher performance • Heavily affected by – Thread decomposition – Register usage – Shared memory use • Nvidia provides an “occupancy calculator” spreadsheet as part of the SDK – Live example to follow
• 40. Calculating Occupancy
  1. Get the register count from the ptxas output:
     ptxas info : Compiling entry function 'laplace_sphere_wk_kernel3' for 'sm_20'
     ptxas info : Used 36 registers, 7808+0 bytes smem, 88 bytes cmem[0], 768 bytes cmem[2]
  2. Get the thread decomposition:
     blockdim = dim3( 4, 4, 26)
     griddim  = dim3(101, 16, 1)
  3. Enter these into the occupancy calculator. Result: 54%
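To make step 3 concrete, the arithmetic behind that 54% can be sketched in plain Fortran. The per-SM limits below (48 resident warps, 1536 resident threads, 8 resident blocks, 32768 registers, 48 KB shared memory) are assumed Fermi/sm_20 values, and the simple integer divisions ignore the calculator's allocation-granularity rounding, so this is an estimate rather than the calculator itself:

  program occupancy_sketch
    implicit none
    ! Assumed Fermi (sm_20) per-SM limits
    integer, parameter :: max_warps = 48, max_threads = 1536, max_blocks = 8
    integer, parameter :: regs_per_sm = 32768, smem_per_sm = 49152
    ! Numbers reported above: 36 registers/thread, 7808 bytes smem/block, dim3(4,4,26) blocks
    integer, parameter :: regs_per_thread = 36, smem_per_block = 7808
    integer :: threads_per_block, warps_per_block
    integer :: by_regs, by_smem, by_threads, blocks, active_warps

    threads_per_block = 4 * 4 * 26                                     ! 416 threads
    warps_per_block   = (threads_per_block + 31) / 32                  ! 13 warps
    by_regs    = regs_per_sm / (regs_per_thread * threads_per_block)   ! 2 blocks fit by registers
    by_smem    = smem_per_sm / smem_per_block                          ! 6 blocks fit by shared memory
    by_threads = max_threads / threads_per_block                       ! 3 blocks fit by thread count
    blocks       = min(by_regs, by_smem, by_threads, max_blocks)       ! registers are the limiter: 2
    active_warps = blocks * warps_per_block                            ! 26 active warps
    print '(a,f5.1,a)', 'Estimated occupancy: ', &
          100.0 * real(active_warps) / real(max_warps), ' %'           ! ~54%
  end program occupancy_sketch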
• 41. Improving the Results • Reducing registers per thread may increase occupancy • Varying #threads or shared memory use has little effect
• 42. Reducing Registers/Thread • The maximum number of registers per thread can be capped with a compiler flag (e.g., nvcc's --maxrregcount; PGI CUDA Fortran has a comparable maxregcount sub-option of -Mcuda) • Reducing the number of registers/thread to 18 increases occupancy to 81% • Time Before: 924us • Time After: 837us • Improvement: ~10% • Occupancy isn't a silver bullet
  • 43. Occupancy Case Study • Results from a Finite Difference Kernel, provided by Paulius Micikevicius of Nvidia • Default compilation – 46 registers, no spills to lmem – runs a single 32x16 threadblock per SM concurrently – Occupancy: 33% – 3,395 MCells/s throughput (39.54ms)
  • 44. Occupancy Case Study cont. • Reducing Maximum Registers to 32 – Set maximum register count via compiler flag – 32 registers, 44 bytes spilled to lmem – runs two 32x16 threadblocks per SM concurrently – Occupancy: 67% – 4,275 MCells/s (31.40ms) • Improvement: ~26%
• 46. Asynchronous Execution • Most GPU operations are asynchronous with respect to the CPU code – Hint: The CPU can be busy doing other things (see the sketch below) • Current hardware can handle 1 copy-in, 1 kernel, and 1 copy-out simultaneously, if they are placed in separate streams – Hint: Data transfer costs can be hidden by running multiple streams and asynchronous transfers
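A minimal sketch of the first bullet (the kernel, sizes, and module names are assumptions, not code from the tutorial): the launch returns immediately, so the host can keep working until it explicitly synchronizes.

  module sketch_kernels
    use cudafor
  contains
    attributes(global) subroutine scale_kernel(a, n)
      real(8) :: a(n)
      integer, value :: n
      integer :: i
      i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      if (i <= n) a(i) = 2.0d0 * a(i)
    end subroutine scale_kernel
  end module sketch_kernels

  program async_launch
    use cudafor
    use sketch_kernels
    implicit none
    integer, parameter :: n = 1024*1024
    real(8), device, allocatable :: d_a(:)
    integer :: ierr
    allocate(d_a(n))
    d_a = 1.0d0
    call scale_kernel<<<(n+255)/256, 256>>>(d_a, n)  ! returns to the host immediately
    ! ... the CPU is free to do unrelated host work here ...
    ierr = cudaDeviceSynchronize()                   ! block until the kernel has finished
  end program async_launch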
• 47. Asynchronous Execution with Streams • Synchronous execution (1 stream): each In → Run → Out sequence must finish before the next one starts • Asynchronous execution (3 streams): the In, Run, and Out stages of different streams overlap in time • If data cannot remain resident on the device, streaming may allow the GPU to offset transfer costs
• 48. Asynchronous Execution: Example
  • Add some number of streams to the existing code
  • Use asynchronous memory copies to move part of the data to/from the device
    – GOTCHA: Host arrays must be "pinned" in order to use async copies (a declaration sketch follows below)
  • Add a stream parameter to the kernel launch
  • Sync Time: 0.6987200 • Async Time: 0.2472000

  integer :: streams(3)
  integer :: ierr, j, mystream

  do j=1,3
    ierr = cudaStreamCreate(streams(j))
  enddo

  do j=1,m
    mystream = mod(j,3) + 1    ! round-robin over streams 1..3 (the +1 keeps the index in bounds)
    ierr = cudaMemcpyAsync &
      (d_a(:,j), a(:,j), size(a(:,j)), streams(mystream))
    call scaleit_kernel<<<grd,blk,0,streams(mystream)>>>(d_a(:,j), n)
    ierr = cudaMemcpyAsync &
      (a(:,j), d_a(:,j), size(a(:,j)), streams(mystream))
  enddo
  ierr = cudaStreamSynchronize(streams(1))
  ierr = cudaStreamSynchronize(streams(2))
  ierr = cudaStreamSynchronize(streams(3))
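About the "pinned" gotcha: in CUDA Fortran, page-locked host memory is requested with the pinned attribute on an allocatable array. A declaration sketch, with array names and sizes assumed:

  real(8), pinned, allocatable :: a(:,:)    ! page-locked host buffer; needed for truly asynchronous copies
  real(8), device, allocatable :: d_a(:,:)  ! matching device array
  integer :: istat
  allocate(a(n,m), d_a(n,m), stat=istat)    ! if pinning fails, the runtime may fall back to pageable memory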
  • 49. Asynchronous Case Study CAVEAT: The above kernel over-emphasizes data transfer, thus necessitating streaming.
• 51. Shared Memory • Much like CPU cache, shared memory is much faster than global memory (up to 100X lower latency) – Staging area – Scratch pad • 64KB of shared memory sits on each SM – With Fermi, this is split between user-managed shared memory and L1 cache: 48/16 or 16/48 – The split can be chosen kernel by kernel (see the sketch below) • If data is shared between threads in a thread block or reused well, staging it into shared memory may be beneficial – Think: cache prefetching
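Because the split can be chosen per kernel, a host-side sketch of requesting the 48 KB-shared configuration is shown below. It assumes the cudafor module exposes the usual cudaDeviceSetCacheConfig runtime call and its cudaFuncCachePrefer* constants (per-kernel control via cudaFuncSetCacheConfig is analogous):

  program cache_split
    use cudafor
    implicit none
    integer :: ierr
    ! Prefer 48 KB user-managed shared memory / 16 KB L1 for shared-memory-heavy kernels
    ierr = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared)
    ! ... launch kernels that stage data in shared memory ...
    ! Prefer 16 KB shared / 48 KB L1 for kernels that lean on the hardware cache instead
    ierr = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)
  end program cache_split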
• 52. Simple Matrix Multiply
  attributes(global)&
  subroutine mm1_kernel(C,A,B,N)
    integer, value, intent(in) :: N
    real(8), intent(in) :: A(N,N), B(N,N)
    real(8), intent(inout) :: C(N,N)
    integer :: i, j, k
    real(8) :: val
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    j = (blockIdx%y - 1) * blockDim%y + threadIdx%y
    val = C(i,j)
    do k=1,N
      val = val + A(i,k) * B(k,j)
    enddo
    C(i,j) = val
  end subroutine mm1_kernel

  ptxas info : Compiling entry function 'mm1_kernel' for 'sm_20'
  ptxas info : Used 22 registers, 60 bytes cmem[0]

  • No shared memory use; relies entirely on the hardware L1

  Kernel | Time (ms) | Occupancy
  Simple | 269.0917  | 67%
• 53. Tiled Matrix Multiply
  integer, parameter :: M = 32
  real(8), shared :: AS(M,M), BS(M,M)
  real(8) :: val

  val = C(i,j)
  do blk=1,N,M
    AS(threadIdx%x,threadIdx%y) = &
      A(blk+threadIdx%x-1, blk+threadIdx%y-1)
    BS(threadIdx%x,threadIdx%y) = &
      B(blk+threadIdx%x-1, blk+threadIdx%y-1)
    call syncthreads()
    do k=1,M
      val = val + AS(threadIdx%x,k) * BS(k,threadIdx%y)
    enddo
    call syncthreads()
  enddo
  C(i,j) = val

  ptxas info : Compiling entry function 'mm2_kernel' for 'sm_20'
  ptxas info : Used 18 registers, 16384+0 bytes smem, 60 bytes cmem[0], 4 bytes cmem[16]

  • Now uses 16K of shared memory

  Kernel | Time (ms) | Occupancy
  Simple | 269.0917  | 67%
  Tiled  | 213.7160  | 67%
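For reference, a host-side launch of the tiled kernel might look like the sketch below; it assumes mm2_kernel takes the same (C, A, B, N) argument list as mm1_kernel, that d_A, d_B, d_C are device arrays, and that N is a multiple of the 32x32 tile size:

  type(dim3) :: blk, grd
  blk = dim3(32, 32, 1)          ! one thread per element of a 32x32 tile
  grd = dim3(N/32, N/32, 1)      ! assumes N is a multiple of 32
  call mm2_kernel<<<grd, blk>>>(d_C, d_A, d_B, N)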
• 54. What if we increase the occupancy?
  • With 32x32 blocks, we'll never get above 67%
  • Reduce the block size from 32x32 to 16x16?
    Kernel          | Time (ms) | Occupancy
    Simple (32x32)  | 269.0917  | 67%
    Tiled  (32x32)  | 213.7160  | 67%
    Simple (16x16)  | 371.7050  | 83%
    Tiled  (16x16)  | 209.8233  | 83%
  • Reduce max registers to 18?
    Kernel                       | Time (ms) | Occupancy
    Simple (16x16)               | 371.7050  | 83%
    Tiled  (16x16)               | 209.8233  | 83%
    Simple (16x16), 18 registers | 345.7340  | 100%
    Tiled  (16x16), 18 registers | 212.2826  | 100%
  • It turns out the 16x16 version can be even worse, despite the higher occupancy
  • 56. Coalescing Memory Accesses • The GPU will try to load needed memory in as few memory transactions as possible. – 128 B if possible – If not, 2 X 64 B – If not, 64 B may be split to 32 B – Continue until every thread has needed data • Coalescing is possible if: – 128B aligned – All threads access elements in same segment
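For contrast with the stride and offset experiments a few slides below, here is a sketch (not from the slides) of a unit-stride kernel that meets both conditions; with 4-byte elements, each warp of 32 threads touches exactly one 128B-aligned segment:

  attributes(global)&
  subroutine copy_kernel(datin, datout, n)
    integer, value :: n
    real(4) :: datin(n), datout(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x   ! consecutive threads -> consecutive elements
    if (i <= n) datout(i) = datin(i)                  ! 32 threads x 4 B = one 128 B segment per warp
  end subroutine copy_kernel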
  • 57. Why is coalescing important? • Issuing 1 128B transaction reduces memory latency and better utilizes memory bandwidth • L1/Shared Memory cache lines are 128B – Not using all fetched addresses wastes bandwidth • Nvidia Guide: “Because of this possible performance degradation, memory coalescing is the most critical aspect of performance optimization of device memory.”
• 58. Coalescing Examples • Simple, stride-1 case (figure: the threads of one warp mapped onto 128B segments labeled Segment 0 and Segment 1): every thread accesses memory within the same 128B-aligned segment, so the hardware will coalesce the accesses into 1 transaction.
• 59. Will This Coalesce? • First example: Yes! Every thread is still accessing memory within a single 128B segment, and the segment is 128B-aligned. • Second example: No. Although the access is stride-1, it is misaligned and touches two 128B segments, so two 64B transactions will result.
  • 60. Will This Coalesce? Stride-2, half warp: Yes, but.. • Half of the memory transaction is wasted. • Poor utilization of the memory bus.
• 61. Striding • Striding results in more memory transactions and wastes cache-line entries (figure: "Striding: Relative Bandwidth", 1/Time(s) plotted against stride)

  attributes(global)&
  subroutine stride_kernel(datin, datout, st)
    integer, value :: st
    real(8) :: datin(n), datout(n)
    integer :: i
    i = (blockIdx%x * blockDim%x) &
        + (threadIdx%x * st)
    datout(i) = datin(i)
  end subroutine stride_kernel
• 62. Offsets (Not 128B-aligned) • Memory offsets result in more memory transactions by crossing 128B segment boundaries (figure: "Offset: Relative Bandwidth", 1/Time(ms) plotted against offset, with the 128B boundaries marked)

  attributes(global)&
  subroutine offset_kernel(datin, datout, st)
    integer, value :: st
    real(8) :: datin(n), datout(n)
    integer :: i
    i = (blockIdx%x * blockDim%x) &
        + threadIdx%x + st
    datout(i) = datin(i)
  end subroutine offset_kernel
• 64. On The Web • GTC 2010 Tutorials: http://www.nvidia.com/object/gtc2010-presentation-archive.html • Nvidia CUDA online resources: http://developer.nvidia.com/cuda-education-training • PGI CUDA Fortran: http://www.pgroup.com/resources/cudafortran.htm