Computing using GPUs

Computing Using
Graphics Cards

Shree Kumar, Hewlett Packard
http://www.shreekumar.in/

Speaker Intro

• High Performance Computing @ Hewlett‐Packard
– VizStack (http://vizstack.sourceforge.net)
– GPU Computing
• Big 3D enthusiast
• Travels a lot
• Blogs at http://www.shreekumar.in/

What we will cover

• GPUs and their history
• Why use GPUs
• Architecture
• Getting Started with GPU Programming
• Challenges, Techniques & Pitfalls
• Where not to use GPUs?
• Resources
• The Future

What is a GPU

• Graphics Processing Unit
– Coined in 1999 by NVidia
– Specialized add‐on board
• Accelerates interactive 3D rendering
– 60 image updates per second (or more) on large data
– Solves embarrassingly parallel problem
– Game driven volume economics
• NVidia v/s ATI, just like Intel v/s AMD
• Demand for better effects led to
– programmable GPUs
– floating point capabilities
• These, in turn, led to General Purpose GPU (GPGPU) computation

History of GPUs : a GPGPU Perspective
Date  Product          Transistors  Cores  Flops           Technology
1997  RIVA 128         3 M          -      -               Rasterization
1999  GeForce 256      25 M         -      -               Transform & Lighting
2001  GeForce 3        60 M         -      -               Programmable shaders
2002  GeForce FX       125 M        -      -               16, 32 bit FP, long shaders
2004  GeForce 6800     222 M        -      -               Infinite length shaders, branching
2006  GeForce 8800     681 M        128    -               Unified graphics & compute, CUDA
2008  GeForce GTX 280  1.4 B        240    933 G, 78 G     64 bit FP, IEEE FP, CUDA C, OpenCL and DirectCompute, PCI-express Gen 2
2009  Tesla M2050      3.0 B        512    1.03 T, 515 G   Improved 64 bit perf, caching, ECC memory, 64-bit unified addressing, asynchronous bidirectional data transfer, multiple kernels
Source: Nickolls J., Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010

The GPU Advantage

• 30x CPU FLOPS on latest GPUs
• 10x memory bandwidth
• Add to these a 3x performance/$
• Energy efficient: 5x performance/Watt
All graphs from GPU4Vision: http://gpu4vision.icg.tugrz.at/

People use GPUs for…

Source: Nickolls J., Dally W.J. “The GPU Computing Era”, IEEE Micro, March-April 2010

More “why to use GPUs”

• Proliferation of GPUs
– Mobile devices will have capable GPUs soon!
• Make more things possible
– Make things real‐time
• From seconds to real‐time interactive performance
– Reduce offline processing overhead
• Research Opportunities
– New & efficient algorithms
– Pairing multi-core CPUs with massively multi-threaded GPUs

GPU Computing 1‐2‐3

A GPU isn’t a CPU replacement!


There ain’t no such thing as a FREE Lunch!


You don’t always “port” a CPU algorithm to a GPU!

CPU versus GPU

• CPU
– Optimized for latency
– Speedup techniques
• Vectorization (MMX, SSE, …)
• Coarse Grained Parallelism using multiple CPUs and cores
– Memory approaching a TB
• GPU
– Optimized for throughput
– Speedup techniques
• Massive multithreading
• Fine grained parallelism
– A few GBs of memory max

Getting Started

• Software
– CUDA (NVidia specific)
– OpenCL (Cross‐platform, GPU/CPU)
– DirectCompute (MS specific)
• Hardware
– A system equipped with GPU
• Any OS will do
– But Windows and Red Hat Enterprise Linux seem better supported

CUDA
• Compute Unified Device Architecture
• Most popular GPGPU toolkit
• CUDA C extends C with constructs
– Easy to write programs
• Lower level “driver” API is available
– Provides more control
– Use multiple GPUs in the same application
– Mix graphics & compute code
• Language bindings available
– PyCUDA, Java, .NET
• Toolkit provides conveniences
Source: NVIDIA CUDA Architecture, Introduction and Overview

CUDA Toolkit

CUDA Architecture
• 1 or more streaming multiprocessors (“cores”)
• Thread Blocks
– Single Instruction, Multiple Thread (SIMT)
– Hide latency by parallelism
• Memory Hierarchy
– Fermi GPUs can access system memory
• Primitives for
– Thread synchronization
– Atomic operations on memory

Source : The GPU Computing Era

Simple Example : Vector Addition
C/C++ ‐ serial code
void VecAdd(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
VecAdd(A, B, C, N);

C/C++ with OpenMP – thread level parallelism
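void VecAdd(const float *A, const float *B, float *C, int N) {
    // "parallel for" creates the thread team and splits the loop across it;
    // a bare "omp for" outside a parallel region runs on a single thread.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
VecAdd(A, B, C, N);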

Vector Addition using CUDA
CUDA C – element level parallelism
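__global__ void VecAdd(const float *A, const float *B, float *C, int N) {
    // One thread per element: global index from block and thread IDs.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)  // the last block may have more threads than remaining elements
        C[i] = A[i] + B[i];
}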

Invoking the function
// size is the array size in bytes (N floats per array)
size_t size = N * sizeof(float);
float *d_A, *d_B, *d_C;
// Allocate memory on GPU
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy arrays to GPU
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
// Invoke function
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result back to main memory
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
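
The launch arithmetic above can be checked on the host. What follows is a minimal C++ sketch, not CUDA code: blocksFor, coveredOnce and surplusThreads are hypothetical helper names, and the nested loops merely stand in for blockIdx.x and threadIdx.x. It is meant to show why the grid size is rounded up and why the kernel needs the i < N guard.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * threadsPerBlock >= n
// (integer ceiling division, same formula as blocksPerGrid).
int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Host-side emulation of the 1D launch: visit every (block, thread) pair,
// compute the same global index as the kernel, and count how often each
// element is touched. Returns true if every element is touched exactly once.
bool coveredOnce(int n, int threadsPerBlock) {
    std::vector<int> hits(n, 0);
    int blocks = blocksFor(n, threadsPerBlock);
    for (int blockIdx = 0; blockIdx < blocks; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < threadsPerBlock; ++threadIdx) {
            int i = threadsPerBlock * blockIdx + threadIdx;
            if (i < n)  // same guard as the kernel: surplus threads do nothing
                ++hits[i];
        }
    }
    for (int h : hits)
        if (h != 1) return false;
    return true;
}

// Threads launched beyond the end of the array; these rely on the guard.
int surplusThreads(int n, int threadsPerBlock) {
    return blocksFor(n, threadsPerBlock) * threadsPerBlock - n;
}
```

For N = 1000 and 256 threads per block, blocksFor gives 4 blocks, so 1024 threads are launched and the last 24 fall outside the arrays. Without the guard those threads would write past the end of C.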

Compilation
# nvcc vectorAdd.cu -I ../../common/inc

GPU Programming Challenges

• Need high “occupancy” for best performance
• Extracting parallelism with limited resources
– Limited Registers
– Limited Shared Memory
• Preferred Approach
– Small Kernels
– Multiple Passes if needed
• Decompose Problem into Parallel Pieces
– Write once, scale performance everywhere!
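
How limited registers and shared memory cap occupancy can be estimated with simple arithmetic. The sketch below is a deliberately simplified C++ model: the struct and function names are hypothetical, real hardware also rounds allocations to a granularity that is ignored here, and the per-multiprocessor limits must come from the actual device (for example via cudaGetDeviceProperties), not from this example.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. Values are supplied by the caller; none are
// hard-coded here because they differ from GPU to GPU.
struct SmLimits {
    int maxThreads;      // resident threads per multiprocessor
    int maxBlocks;       // resident blocks per multiprocessor
    int registers;       // registers per multiprocessor
    int sharedMemBytes;  // shared memory per multiprocessor
};

// Resource needs of one kernel launch configuration.
struct KernelUsage {
    int threadsPerBlock;
    int registersPerThread;
    int sharedMemPerBlock;  // bytes
};

// Blocks that fit on one multiprocessor: the tightest of the four limits.
int residentBlocks(const SmLimits& sm, const KernelUsage& k) {
    assert(k.threadsPerBlock > 0 && k.registersPerThread > 0);
    int byThreads = sm.maxThreads / k.threadsPerBlock;
    int byRegs = sm.registers / (k.registersPerThread * k.threadsPerBlock);
    int byShared = k.sharedMemPerBlock > 0
                       ? sm.sharedMemBytes / k.sharedMemPerBlock
                       : sm.maxBlocks;  // no shared memory used: not a limit
    return std::min({sm.maxBlocks, byThreads, byRegs, byShared});
}

// Occupancy: resident threads as a fraction of the maximum.
double occupancy(const SmLimits& sm, const KernelUsage& k) {
    return double(residentBlocks(sm, k) * k.threadsPerBlock) / sm.maxThreads;
}
```

With made-up limits of 1024 threads, 8 blocks, 16384 registers and 16 KB of shared memory, a 256-thread block using 16 registers per thread and 4 KB of shared memory fills the multiprocessor, while the same block at 32 registers per thread reaches only half occupancy. This is one reason small kernels are preferred.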

GPU Programming

• Use Shared Memory when possible
– Cooperation between threads in a block
– Reduce access to global memory
• Reduce Data Transfer over the Bus
• It’s still a GPU!
– use textures to your advantage
– use vector data types if you can
• Watch out for GPU capability differences!
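
Whether bus traffic dominates can be estimated before writing any kernel. A rough C++ sketch for the vector addition example follows; the function names are hypothetical, and the bandwidth is a parameter the reader has to measure or look up, since these slides give no figure for it.

```cpp
#include <cassert>
#include <cstddef>

// Bytes crossing the bus for the vector addition example:
// A and B go to the GPU, C comes back.
std::size_t vecAddBusBytes(std::size_t n) {
    return 3 * n * sizeof(float);
}

// Floating-point operations per byte transferred.
// Vector addition performs one add per element.
double vecAddFlopsPerBusByte(std::size_t n) {
    assert(n > 0);
    return double(n) / double(vecAddBusBytes(n));
}

// Time to move `bytes` over a link with the given sustained bandwidth.
double transferSeconds(std::size_t bytes, double bytesPerSecond) {
    assert(bytesPerSecond > 0.0);
    return double(bytes) / bytesPerSecond;
}
```

With 4-byte floats, vector addition performs one operation for every 12 bytes moved, so on its own it is likely to be limited by the bus rather than by the GPU. Keeping data on the GPU across several kernels improves that ratio.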

Enough Theory!

Demo Time
&
Let’s do some programming 

Watch out for

• Portability of programs across GPUs
– Capabilities vary from GPU to GPU
– Memory usage
• Arithmetic differences in the result
• Pay careful attention to demos…
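
One common cause of arithmetic differences is that floating-point addition is not associative, so a parallel reduction that adds values in a different order than a serial loop can round differently. The C++ sketch below runs entirely on the CPU and uses hypothetical function names; it illustrates the effect only and does not reproduce the behaviour of any particular GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial left-to-right sum, as a plain CPU loop would compute it.
float sumSerial(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Pairwise (tree) sum over v[lo, hi), the order a parallel reduction
// typically uses: each half is summed independently, then combined.
float sumPairwise(const std::vector<float>& v, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo == 0) return 0.0f;
    if (hi - lo == 1) return v[lo];
    std::size_t mid = lo + (hi - lo) / 2;
    float left = sumPairwise(v, lo, mid);
    float right = sumPairwise(v, mid, hi);
    return left + right;
}
```

For the four values 1e8, 1, -1e8, 1, and assuming float arithmetic is evaluated in IEEE single precision, the serial loop returns 1 and the pairwise sum returns 0, while the exact answer is 2. Comparing CPU and GPU results therefore calls for a tolerance rather than an exact match.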

Resources

• CUDA
– Tools on NVIDIA Developer Site
http://developer.nvidia.com/object/gpucomputing.html
– CUDPP
http://code.google.com/p/cudpp/
• OpenCL
• Google Search!

The Future

• Better throughput
– More GPU cores, scaling by Moore’s law
– PCIe Gen 3
• Easier to program
• Arbitrary control and data access patterns

Questions?

shree.shree@gmail.com
