The Open Standard for Heterogeneous Parallel Programming
The Khronos Group
https://www.khronos.org
Open means…
Many languages…
• C/C++ - https://www.khronos.org/opencl/
• .NET - http://openclnet.codeplex.com/
• Python - http://mathema.tician.de/software/pyopencl/
• Java - http://www.jocl.org/
• Julia - https://github.com/JuliaGPU/OpenCL.jl
Many platforms…
• AMD - CPUs, APUs, GPUs
• NVIDIA - GPUs
• INTEL - CPUs, GPUs
• APPLE - CPUs
• SAMSUNG - ARM processors
• OTHERS - https://www.khronos.org/conformance/adopters/conformant-products#opencl
Why GPUs?
• Designed for Parallelism - Supports thousands of threads with very little thread-management overhead
• High Speed
• Low Cost
• Availability
How does it work?
• Host code - Runs on CPU
  • Serial code (data pre-processing, sequential algorithms)
  • Reads data from input (files, databases, streams)
  • Transfers data from host to device (GPU)
  • Calls device code (kernels)
  • Copies data back from device to host
• Device code - Runs on GPU
  • Independent parallel tasks called kernels
  • Same task acts on different pieces of data - SIMD - Data Parallelism
  • Different tasks act on different pieces of data - MIMD - Task Parallelism
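As a rough illustration of the data-parallel case, the sketch below emulates it in plain C. The function `add_item` plays the role of a kernel (the work of a single work-item), and `launch_1d` is a serial stand-in for the device launching one work-item per element. Both names are hypothetical and not part of OpenCL; a real device would run the items concurrently rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "kernel": the work done by ONE work-item, identified by gid. */
void add_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Hypothetical serial stand-in for a 1D launch: on a device the n calls
   would run in parallel; here they simply run one after another. */
void launch_1d(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t gid = 0; gid < n; gid++) {
        add_item(gid, a, b, c);
    }
}
```

Because each call touches only its own element, the calls are independent, which is what makes it safe for the device to run them in any order or all at once.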
Speed up - Amdahl’s Law
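The figure for this slide is not part of the transcript. As a reminder, Amdahl's law bounds the overall speedup by the part of the program that stays serial: speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the run time and n the number of processing elements. A minimal sketch in C, where the function name `amdahl_speedup` is an assumption made for illustration:

```c
#include <assert.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / n)
   p = fraction of the run time that can be parallelized (0..1)
   n = number of parallel processing elements (>= 1) */
double amdahl_speedup(double p, double n)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.9, for instance, the speedup stays below 10 however many cores are used, because the serial 10% of the run time remains.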
Computing Model
• Compute Device = GPU
• Compute Unit = Processor
• Compute/Processing Element = Processor Core
• A GPU can contain from hundreds to thousands of cores
Memory Model
Work-items/Work-groups
• Work-item = Thread
• Work-items are grouped into Work-groups
• Work-items in the same Work-group can:
  • Share Data
  • Synchronize
• Map work-items to better match the data structure
Work-items 1D Mapping
Work-items 2D Mapping
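The mapping figures are not part of the transcript, so here is a small sketch of the arithmetic behind them, in plain C. With a zero global offset, a work-item's global index along one dimension is its group index times the group size plus its index inside the group. In 2D, a work-item (x, y) is commonly mapped to a row-major buffer position. The function names are hypothetical helpers, not OpenCL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Global index of a work-item along one dimension, assuming a zero global
   offset: group index times group size plus the index inside the group. */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Row-major position of the 2D work-item (x, y) in a buffer `width` wide. */
size_t linear_index(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}
```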
Matrix Multiplication
• Matrix A[4,2]
• Matrix B[2,3]
• Matrix C[4,3] = A * B
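For reference, a serial version of this product can be sketched in plain C. It assumes row-major storage, and the function name `matmul` and its parameter names are chosen here for illustration only. Each element of C is the dot product of one row of A and one column of B, so for A[4,2] and B[2,3] there are 12 independent dot products of length 2.

```c
#include <assert.h>
#include <stddef.h>

/* C[rowsA x colsB] = A[rowsA x colsA] * B[colsA x colsB], all row-major. */
void matmul(const float *A, const float *B, float *C,
            size_t rowsA, size_t colsA, size_t colsB)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t row = 0; row < rowsA; row++) {
        for (size_t col = 0; col < colsB; col++) {
            /* Dot product of row `row` of A and column `col` of B. */
            float sum = 0.0f;
            for (size_t k = 0; k < colsA; k++) {
                sum += A[row * colsA + k] * B[k * colsB + col];
            }
            C[row * colsB + col] = sum;
        }
    }
}
```

The two outer loops are the part a GPU version replaces with one work-item per element of C.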
Matrix Multiplication
• For matrices A[128,128] and B[128,128]
• Matrix C will have 16384 elements
• We can launch 16384 work-items (threads)
• The work-group size can be set to [16,16]
• So we end up with 64 work-groups of 256 work-items each
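The arithmetic above can be checked with a short sketch in C. The helper names are hypothetical, and the global size is assumed to be an exact multiple of the work-group size in each dimension, as it is for 128 and 16.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups along one dimension; the global size is assumed
   to be an exact multiple of the local (work-group) size. */
size_t groups_per_dim(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Total number of work-groups in a 2D range. */
size_t groups_2d(const size_t gsize[2], const size_t lsize[2])
{
    return groups_per_dim(gsize[0], lsize[0]) *
           groups_per_dim(gsize[1], lsize[1]);
}
```

For a global size of [128,128] and a work-group size of [16,16], this gives 8 * 8 = 64 work-groups, each holding 16 * 16 = 256 work-items.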
Kernel Code
__kernel void matrixMultiplication(__global float* A, __global float* B,
                                   __global float* C, int widthA, int widthB)
{
    // Column of C: ranges from 0 to 127 in this example
    int i = get_global_id(0);
    // Row of C: ranges from 0 to 127 in this example
    int j = get_global_id(1);

    // Dot product of row j of A and column i of B (row-major storage)
    float value = 0.0f;
    for (int k = 0; k < widthA; k++)
    {
        value += A[j * widthA + k] * B[k * widthB + i];
    }
    // C has widthB columns, so its row stride is widthB, not widthA
    C[j * widthB + i] = value;
}
Host Code
/* Create the program object from the kernel source */
program = clCreateProgramWithSource(context, 1, (const char **)&source_str,
                                    (const size_t *)&source_size, &ret);
/* Build (compile and link) the program for the device */
ret = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
/* Create the kernel object */
kernel = clCreateKernel(program, "matrixMultiplication", &ret);
/* Set the kernel arguments, in the order of the kernel signature */
ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjA);
ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjB);
ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjC);
/* Arguments 3 and 4 are the kernel's widthA and widthB (assumed to be int) */
ret = clSetKernelArg(kernel, 3, sizeof(int), (void *)&widthA);
ret = clSetKernelArg(kernel, 4, sizeof(int), (void *)&widthB);
/* Execute the kernel: dimension 0 spans the columns of C, dimension 1 its rows */
size_t globalThreads[2] = {widthB, heightC};
size_t localThreads[2] = {16, 16};
ret = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, globalThreads,
                             localThreads, 0, NULL, NULL);
/* Blocking read of C (widthB x heightC floats) back into host memory */
ret = clEnqueueReadBuffer(command_queue, memobjC, CL_TRUE, 0,
                          widthB * heightC * sizeof(float), Res, 0, NULL, NULL);
Limitations
• Number of work-items (threads)
• Group size (# of work-items, memory size)
• Data transfer bandwidth
• Device memory size
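These limits are device-specific. In OpenCL they can be queried with `clGetDeviceInfo` (for example `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_MAX_WORK_ITEM_SIZES`). The sketch below, in plain C, shows one way a launch configuration might be checked against such limits before enqueueing a kernel. The struct `device_limits` and the function `launch_fits` are hypothetical, and the limit values are assumed to have been queried from the device beforehand. The divisibility check reflects the OpenCL 1.x requirement that the global size be a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for limits that would be queried from the device. */
struct device_limits {
    size_t max_work_group_size;    /* max work-items in one work-group */
    size_t max_work_item_sizes[2]; /* max work-group size per dimension */
};

/* Returns 1 if a 2D launch fits the limits, 0 otherwise. */
int launch_fits(const struct device_limits *lim,
                const size_t gsize[2], const size_t lsize[2])
{
    assert(lim != NULL);
    for (int d = 0; d < 2; d++) {
        if (lsize[d] == 0 || lsize[d] > lim->max_work_item_sizes[d]) return 0;
        if (gsize[d] % lsize[d] != 0) return 0; /* uneven split */
    }
    return lsize[0] * lsize[1] <= lim->max_work_group_size;
}
```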
Be careful with…
• Uncoalesced memory access
• Branch divergence
• Access to global memory
• Data transfer between host and device
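To make the first point concrete: memory access is typically coalesced when neighbouring work-items along dimension 0 touch neighbouring addresses. The sketch below, in plain C with hypothetical helper names, compares the address stride between two adjacent work-items under two indexing schemes for the same 2D data.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index: work-items with consecutive x read consecutive elements. */
size_t index_row_major(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}

/* Column-major index: work-items with consecutive x are `height` elements
   apart, so their accesses cannot be merged into one contiguous read. */
size_t index_col_major(size_t x, size_t y, size_t height)
{
    assert(y < height);
    return x * height + y;
}
```

For a 128-wide matrix, work-items x = 0 and x = 1 are 1 element apart in the first scheme and 128 elements apart in the second.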
Demo
Thanks!

OpenCL Heterogeneous Parallel Computing
