Computing Using
Graphics Cards

Shree Kumar, Hewlett Packard
http://www.shreekumar.in/
Speaker Intro

• High Performance Computing @ Hewlett‐Packard
  – VizStack (http://vizstack.sourceforge.net)
  – GPU Computing
• Big 3D enthusiast
• Travels a lot
• Blogs at http://www.shreekumar.in/
What we will cover

•   GPUs and their history
•   Why use GPUs
•   Architecture
•   Getting Started with GPU Programming
•   Challenges, Techniques & Pitfalls
•   Where not to use GPUs?
•   Resources
•   The Future
What is a GPU

• Graphics Processing Unit
   – Term coined in 1999 by NVidia
   – Specialized add‐on board
• Accelerates interactive 3D rendering
   – 60 (or more) image updates per second on large data
   – Solves an embarrassingly parallel problem
   – Game-driven volume economics
       • NVidia v/s ATI, just like Intel v/s AMD
• Demand for better effects led to
   – programmable GPUs
   – floating point capabilities
   – this led to General Purpose GPU (GPGPU) Computation
History of GPUs : a GPGPU Perspective
Date  Product           Transistors  Cores  FLOPS (32-bit / 64-bit)  Technology
1997  RIVA 128          3 M                                          Rasterization
1999  GeForce 256       25 M                                         Transform & Lighting
2001  GeForce 3         60 M                                         Programmable shaders
2002  GeForce FX        125 M                                        16, 32 bit FP, long shaders
2004  GeForce 6800      222 M                                        Infinite length shaders, branching
2006  GeForce 8800      681 M        128                             Unified graphics & compute, CUDA, 64 bit FP
2008  GeForce GTX 280   1.4 B        240    933 G / 78 G             IEEE FP, CUDA C, OpenCL and DirectCompute, PCI‐express Gen 2
2009  Tesla M2050       3.0 B        512    1.03 T / 515 G           Improved 64 bit perf, caching, ECC memory, 64‐bit unified addressing, asynchronous bidirectional data transfer, multiple kernels
Source: Nickolls J., Dally W.J., “The GPU Computing Era”, IEEE Micro, March-April 2010
The GPU Advantage

• 30x CPU FLOPS on the latest GPUs
• 10x memory bandwidth
• Add to these 3x performance/$
• Energy efficient: 5x performance/Watt
[Graphs comparing GPUs and CPUs on each of these measures; not reproduced]
All Graphs From: GPU4Vision : http://gpu4vision.icg.tugrz.at/
People use GPUs for…
[Figure not reproduced]
Source: Nickolls J., Dally W.J., “The GPU Computing Era”, IEEE Micro, March-April 2010
More “why to use GPUs”

• Proliferation of GPUs
   – Mobile devices will soon have capable GPUs!
• Make more things possible
   – Make things real‐time
      • From seconds to real‐time interactive performance
   – Reduce offline processing overhead
• Research Opportunities
   – New & efficient algorithms
   – Pairing Multi‐core CPUs and massively multi‐threaded 
     GPUs
GPU Computing 1‐2‐3


A GPU isn’t a CPU replacement!
GPU Computing 1‐2‐3


There ain’t no such thing as a FREE Lunch!
GPU Computing 1‐2‐3


You don’t always “port” a CPU algorithm to a GPU!
CPU versus GPU

• CPU
  – Optimized for latency
  – Speedup techniques
     • Vectorization (MMX, SSE, …)
     • Coarse Grained Parallelism using multiple CPUs and cores
  – Memory approaching a TB
• GPU
  – Optimized for throughput
  – Speedup techniques
     • Massive multithreading
     • Fine grained parallelism
  – A few GBs of memory max
Getting Started

• Software
  – CUDA (NVidia specific)
  – OpenCL (Cross‐platform, GPU/CPU)
  – DirectCompute (MS specific)
• Hardware
  – A system equipped with a supported GPU
• Any OS will do
  – But Windows and Red Hat Enterprise Linux seem better supported
CUDA
• Compute Unified Device Architecture
• Most popular GPGPU toolkit
• CUDA C extends C with a small set of constructs (e.g. __global__ functions and <<<...>>> kernel launches)
     – Makes programs easy to write
• Lower level “driver” API is available
     – Provides more control
     – Use multiple GPUs in the same application
     – Mix graphics & compute code
• Language bindings available
     – PyCUDA, Java, .NET
• Toolkit provides conveniences
[Figure: CUDA Toolkit]
Source: NVIDIA CUDA Architecture, Introduction and Overview
CUDA Architecture
• One or more streaming multiprocessors (“cores”)
• Thread Blocks
   – Single Instruction, Multiple Thread (SIMT)
   – Hide latency through parallelism
• Memory Hierarchy
   – Fermi GPUs can access system memory
• Primitives for
   – Thread synchronization
   – Atomic operations on memory
[Figure: CUDA architecture]
Source : The GPU Computing Era
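To make the thread-block idea concrete, here is a minimal CPU sketch in plain C (no GPU required) of how a one-dimensional grid of thread blocks is laid over N data elements. The function `emulate_vecadd_grid` is hypothetical; `blockIdx`, `threadIdx` and `blockDim` are ordinary loop variables standing in for the CUDA built-ins of the same names, and the threads of a block run one after another here rather than in SIMT fashion.

```c
/* Hypothetical helper: emulates on the CPU how a 1-D grid of
 * blocksPerGrid x threadsPerBlock threads maps onto N elements.
 * blockIdx, threadIdx and blockDim are plain loop variables here,
 * not the CUDA built-ins. Returns how many threads did useful work. */
int emulate_vecadd_grid(const float *A, const float *B, float *C, int N,
                        int blocksPerGrid, int threadsPerBlock)
{
    int busy = 0;
    int blockDim = threadsPerBlock;
    for (int blockIdx = 0; blockIdx < blocksPerGrid; blockIdx++) {
        /* On a GPU the threads of a block execute concurrently;
         * here they simply run one after another. */
        for (int threadIdx = 0; threadIdx < threadsPerBlock; threadIdx++) {
            int i = blockDim * blockIdx + threadIdx;  /* global element index */
            if (i < N) {                              /* last block may overhang N */
                C[i] = A[i] + B[i];
                busy++;
            }
        }
    }
    return busy;
}
```

With N = 1000 and 256 threads per block, four blocks are needed; the last block has 24 threads that fail the `i < N` test and do nothing.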
Simple Example : Vector Addition
C/C++ ‐ serial code
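void VecAdd(const float *A, const float *B, float *C, int N) {
  for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
}
VecAdd(A, B, C, N);

C/C++ with OpenMP – thread level parallelism
void VecAdd(const float *A, const float *B, float *C, int N) {
  // "parallel for" creates a team of threads and splits the loop among them;
  // a bare "omp for" outside a parallel region runs on a single thread
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
}
VecAdd(A, B, C, N);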
Vector Addition using CUDA
CUDA C – element level parallelism
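__global__ void VecAdd(const float *A, const float *B, float *C, int N) {
  // global index of this thread: one thread per vector element
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  // the last block may contain threads beyond the end of the arrays
  if (i < N)
    C[i] = A[i] + B[i];
}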


Invoking the function
// Allocate memory on the GPU (d_A, d_B, d_C are device pointers)
size_t size = N * sizeof(float);
float *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy input arrays to the GPU
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
// Invoke the kernel with enough 256-thread blocks to cover all N elements
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy the result back to main memory (waits for the kernel to finish)
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

Compilation
# nvcc vectorAdd.cu -I ../../common/inc -o vectorAdd
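For a kernel as light as VecAdd, the copies over the bus dominate. A back-of-the-envelope sketch in C, assuming the theoretical 8 GB/s per direction of a PCI Express 2.0 x16 link and the 933 GFLOPS single-precision peak of the GeForce GTX 280 from the history table; both are peak figures, not measurements, and the helper names are hypothetical. Upload and download times are simply added, as with the synchronous cudaMemcpy calls above.

```c
/* Rough, hypothetical cost model for the vector addition: bus transfer
 * time vs. arithmetic time. Assumed peak figures, not measurements. */
#define PCIE_BYTES_PER_SEC 8.0e9    /* PCIe 2.0 x16, per direction */
#define GPU_FLOPS          933.0e9  /* GTX 280 single-precision peak */

/* Two input arrays go to the GPU and one result array comes back. */
double vecadd_transfer_seconds(long n)
{
    double bytes = 3.0 * (double)n * sizeof(float);
    return bytes / PCIE_BYTES_PER_SEC;
}

/* One floating-point addition per element. */
double vecadd_compute_seconds(long n)
{
    return (double)n / GPU_FLOPS;
}
```

For N = 1,000,000 this model gives about 1.5 ms of transfers against roughly 1 µs of arithmetic, a ratio of over a thousand. A single vector addition is therefore not worth shipping to the GPU on its own; the payoff comes when data stays on the GPU across many kernels.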
GPU Programming Challenges

• Need high “occupancy” (active warps per multiprocessor relative to the hardware maximum) for best performance
• Extracting parallelism with limited resources
  – Limited Registers
  – Limited Shared Memory
• Preferred Approach
  – Small Kernels
  – Multiple Passes if needed
• Decompose Problem into Parallel Pieces
  – Write once, scale performance everywhere!
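“Small kernels, multiple passes” is easiest to see with a sum. The sketch below, in plain C, runs the passes on the CPU: each pass collapses every hypothetical block of `blockSize` elements into one partial sum, and the partial sums become the input of the next pass. On a GPU each pass would be a separate kernel launch, and the boundary between launches acts as the synchronization point between blocks. `multipass_sum` is an illustrative name, not a CUDA API.

```c
/* CPU sketch of a multi-pass reduction. Each pass reduces every block of
 * blockSize elements to one partial sum stored at data[b]; the partials
 * feed the next pass until one value is left. Overwrites data in place.
 * Returns 0 for n <= 0 or blockSize < 2 (size 1 would never shrink n). */
long multipass_sum(long *data, int n, int blockSize, int *passes)
{
    *passes = 0;
    if (n <= 0 || blockSize < 2)
        return 0;
    while (n > 1) {
        int numBlocks = (n + blockSize - 1) / blockSize;
        for (int b = 0; b < numBlocks; b++) {
            int start = b * blockSize;
            int end = (start + blockSize < n) ? start + blockSize : n;
            long partial = 0;
            for (int i = start; i < end; i++)
                partial += data[i];
            /* safe: b <= start, so no unread input is overwritten */
            data[b] = partial;
        }
        n = numBlocks;        /* the partial sums are the next pass's input */
        (*passes)++;          /* on a GPU: one kernel launch per pass */
    }
    return data[0];
}
```

Summing 1 to 1000 with 256-element blocks takes two passes: 1000 values become 4 partial sums, and those become 1.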
GPU Programming

• Use Shared Memory when possible
   – Cooperation between threads in a block
   – Reduce access to global memory
• Reduce Data Transfer over the Bus (PCI Express)
• It’s still a GPU!
   – Use textures to your advantage
   – Use vector data types (e.g. float4) if you can
• Watch out for GPU capability differences!
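The shared-memory advice can be illustrated with a 3-point moving sum. In this plain-C sketch each hypothetical block of `TILE` threads first copies its tile, plus one halo element on each side, from the input array (standing in for global memory) into a small local buffer (standing in for shared memory), and then computes every output from that buffer. `TILE`, `stencil_tiled` and the zero padding at the edges are assumptions made for the example.

```c
/* CPU sketch of the shared-memory tiling pattern for
 * out[i] = in[i-1] + in[i] + in[i+1], with out-of-range neighbours as 0.
 * Returns the number of reads from in[] ("global memory"). */
#define TILE 8   /* assumed block size */

long stencil_tiled(const float *in, float *out, int n)
{
    long globalReads = 0;
    float shared[TILE + 2];               /* tile plus left/right halo */
    for (int start = 0; start < n; start += TILE) {
        /* Cooperative load: on a GPU each thread would load one or two of
         * these, then __syncthreads() before anyone reads the buffer. */
        for (int j = 0; j < TILE + 2; j++) {
            int g = start + j - 1;        /* global index held in shared[j] */
            if (g >= 0 && g < n) {
                shared[j] = in[g];
                globalReads++;
            } else {
                shared[j] = 0.0f;
            }
        }
        /* Compute: each "thread" t reads only the shared buffer. */
        for (int t = 0; t < TILE && start + t < n; t++)
            out[start + t] = shared[t] + shared[t + 1] + shared[t + 2];
    }
    return globalReads;
}
```

For 32 inputs and TILE = 8 this performs 38 reads of the input array, compared with 94 for a version that reads all three neighbours directly for every output.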
Enough Theory!

          Demo Time
              &
Let’s do some programming 
Watch out for

• Portability of programs across GPUs
   – Capabilities vary from GPU to GPU
   – Memory usage
• Arithmetic differences in the result (floating-point operation order and precision can differ from the CPU)
• Pay careful attention to demos…
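Part of the arithmetic difference comes from evaluation order: floating-point addition is not associative, and a GPU reduction adds values in a different order from a simple CPU loop. A plain-C sketch with two hypothetical helpers, one summing sequentially and one in the tree order typical of GPU reduction kernels:

```c
/* Sequential order, as in a simple CPU loop. */
float sequential_sum(const float *x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Tree order: halve the active range each step, adding element i + stride
 * into element i, as a typical GPU reduction does. n must be a power of
 * two; works on a copy in scratch so x is left unchanged. */
float tree_sum(const float *x, float *scratch, int n)
{
    for (int i = 0; i < n; i++)
        scratch[i] = x[i];
    for (int stride = n / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)
            scratch[i] += scratch[i + stride];
    return scratch[0];
}
```

With 1.0e8f followed by fifteen 1.0f values, each 1.0f added to 1.0e8f in the sequential loop is rounded away (adjacent floats near 10^8 are 8 apart), so the loop returns 100000000, while the tree combines the small values first and returns a larger sum. The exact total, 100000015, is not representable as a float, so neither order is exact; they are just different.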
Resources

• CUDA
  – Tools on NVIDIA Developer Site 
    http://developer.nvidia.com/object/gpucomputing.html
  – CUDPP (CUDA Data Parallel Primitives library)
    http://code.google.com/p/cudpp/
• OpenCL
• Google Search!
The Future

• Better throughput
   – More GPU cores, scaling by Moore’s law
   – PCIe Gen 3
• Easier to program
• Arbitrary control and data access patterns
Questions?

shree.shree@gmail.com
