PyCUDA:
Harnessing the power of GPU with Python
Talk Structure




                    1. Why a GPU?
                    2. How does it work?
                    3. How do I program it?
                    4. Can I use Python?

PyCon 4 – Florence 2010 – Fabrizio Milo
WHY A GPU?

APPLICATIONS & DEMOS

Why GPU?

Talk Structure

                    1. Why a GPU?
                    2. How does it work?
                    3. How do I program it?
                    4. Can I use Python?

How does it work?

[Diagram: a CPU, drawn as a Control unit, four ALUs and a Cache, attached to its DRAM]

[Diagram: a GPU attached to its own DRAM]

[Diagram: CPU and GPU side by side, each with its own DRAM]

CUDA

Compute Unified Device Architecture

                      A Parallel Computing Architecture for NVIDIA GPUs

[Diagram label: DirectX Compute]

CUDA
                        Execution Model
                        Device Model

EXECUTION MODEL

Thread
                            Smallest unit of logic

A Block
                            A group of threads

A Grid
                            A group of blocks

One block can have many threads

One grid can have many blocks

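As a rough illustration of the hierarchy above, the following pure-Python sketch enumerates a tiny grid. The sizes (2 blocks of 4 threads) and the function name `enumerate_grid` are made up for this example, and a real GPU runs the threads in parallel rather than in Python loops.

```python
# Toy model of the CUDA execution hierarchy: a grid is a group of blocks,
# a block is a group of threads. Sizes here are illustrative assumptions.

def enumerate_grid(num_blocks, threads_per_block):
    """Return one (block_index, thread_index) pair per thread in the grid."""
    return [
        (block, thread)
        for block in range(num_blocks)           # one grid has many blocks
        for thread in range(threads_per_block)   # one block has many threads
    ]

threads = enumerate_grid(num_blocks=2, threads_per_block=4)
print(len(threads))   # total threads = blocks * threads per block
print(threads[:5])
```
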
The hardware

     DEVICE MODEL

Scalar Processor

Many Scalar Processors
+ Register File
+ Shared Memory

Multiprocessor

Device

Real Example: 10-Series Architecture

•   240 Scalar Processor (SP) cores execute kernel threads
•   30 Streaming Multiprocessors (SMs), each containing:
         •  8 scalar processors
         •  1 double-precision unit
         •  Shared memory

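The figures on this slide are consistent with each other: 30 multiprocessors with 8 scalar processors each give the 240 cores. A quick check in Python, with variable names chosen only for illustration:

```python
# Numbers taken from the slide; variable names are illustrative.
streaming_multiprocessors = 30
scalar_processors_per_sm = 8
double_precision_units_per_sm = 1

total_scalar_processors = streaming_multiprocessors * scalar_processors_per_sm
total_double_precision_units = (
    streaming_multiprocessors * double_precision_units_per_sm
)

print(total_scalar_processors)        # 240
print(total_double_precision_units)   # 30
```
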
Software          Hardware

Thread            Scalar Processor
Thread Block      Multiprocessor
Grid              Device

Global Memory

[Diagram: Host - Device. The host side shows the CPU and its RAM; the device side shows Global Memory]

[Diagram: Host - Multi Device. One host (CPU and RAM) with multiple devices]

Talk Structure

                    1. Why a GPU?
                    2. How does it work?
                    3. How do I program it?
                    4. Can I use Python?

Kernel

__global__ void multiply_them( float *dest, float *a, float *b )
{
   // each thread multiplies the one element selected by its own index
   const int i = threadIdx.x;
   dest[i] = a[i] * b[i];
}

Thread: the kernel body is what a single thread executes.
Block: every thread of the block runs the same body, each with its own threadIdx.x.

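To make the per-thread view concrete, here is a sequential emulation of `multiply_them` for a single block, written in plain Python. It is only a sketch: `launch_single_block` is a hypothetical helper, and the loop stands in for the hardware, which would run all threads in parallel.

```python
# Sequential emulation of multiply_them for a single block.
# On a GPU the threads of the block run in parallel; here a loop
# plays the role of the hardware handing out threadIdx.x values.

def multiply_them(dest, a, b, thread_idx_x):
    """Body of the kernel as seen by one thread."""
    i = thread_idx_x
    dest[i] = a[i] * b[i]

def launch_single_block(kernel, block_dim_x, *args):
    """Run the kernel once per thread index in a one-dimensional block."""
    for thread_idx_x in range(block_dim_x):
        kernel(*args, thread_idx_x)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
dest = [0.0] * len(a)

launch_single_block(multiply_them, len(a), dest, a, b)
print(dest)
```
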
Kernel

__global__ void kernel( … )
{
   const int idx = blockIdx.x * blockDim.x + threadIdx.x;
   …
}

                                          Grid

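The index formula can be checked without a GPU. The sketch below emulates a one-dimensional grid in Python and lists the `idx` value each thread would compute. The element count, the block size and the bounds check are assumptions added for the example; the check matters whenever the element count is not a multiple of the block size.

```python
# Emulation of idx = blockIdx.x * blockDim.x + threadIdx.x.
# The bounds check (idx < n) is an addition: it is needed whenever
# n is not a multiple of the block size.

def global_indices(grid_dim_x, block_dim_x):
    """Return the global index computed by every thread of a 1-D grid."""
    return [
        block_idx_x * block_dim_x + thread_idx_x
        for block_idx_x in range(grid_dim_x)
        for thread_idx_x in range(block_dim_x)
    ]

n = 10                      # number of elements (illustrative)
block_dim_x = 4             # threads per block (illustrative)
grid_dim_x = (n + block_dim_x - 1) // block_dim_x   # round up to whole blocks

indices = global_indices(grid_dim_x, block_dim_x)
in_range = [idx for idx in indices if idx < n]

print(indices)    # every thread gets a distinct index
print(in_range)   # only these map to real elements
```

Each thread ends up with a distinct index, and the last block has two threads that must do nothing because their index falls past the end of the data.
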
How do I program it?

[Diagram: the main logic is compiled by GCC into a .bin for the CPU; the kernel is compiled by NVCC into a .cubin for the GPU]

[Diagram, continued: the .bin and the .cubin are shown together on the CPU side, with the GPU as the target of the .cubin]

[Diagram: Host - Device. The host side shows the CPU and its RAM; the device side shows Global Memory]

Allocate Memory

cudaMalloc( pointer, size )

Copy to device

cudaMemcpy( dest, src, size, direction )

Kernel Launch

Kernel<<< # blocks, # threads >>>( params )

Get Back the Results

cudaMemcpy( dest, src, size, direction )

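The four steps can be walked through with a toy model in plain Python. `ToyDevice` below is not a real API: it is a made-up stand-in whose only purpose is to show that device memory is separate from host memory, so data has to be copied in before the launch and copied out afterwards.

```python
# Toy model of the host/device workflow. ToyDevice is NOT a real API:
# it only mimics the idea that device memory is separate from host memory.

class ToyDevice:
    def __init__(self):
        self._memory = {}      # handle -> list standing in for global memory
        self._next_handle = 0

    def malloc(self, n_items):
        """Reserve space on the 'device' and return an opaque handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._memory[handle] = [0.0] * n_items
        return handle

    def memcpy_host_to_device(self, handle, host_data):
        self._memory[handle][:] = host_data

    def memcpy_device_to_host(self, handle):
        return list(self._memory[handle])

    def launch(self, kernel, n_threads, *handles):
        """Run kernel once per thread index on the device buffers."""
        buffers = [self._memory[h] for h in handles]
        for i in range(n_threads):
            kernel(i, *buffers)

def multiply_them(i, dest, a, b):
    dest[i] = a[i] * b[i]

host_a = [1.0, 2.0, 3.0]
host_b = [4.0, 5.0, 6.0]

device = ToyDevice()
d_a, d_b, d_dest = (device.malloc(3) for _ in range(3))   # 1. allocate
device.memcpy_host_to_device(d_a, host_a)                 # 2. copy to device
device.memcpy_host_to_device(d_b, host_b)
device.launch(multiply_them, 3, d_dest, d_a, d_b)         # 3. launch kernel
host_dest = device.memcpy_device_to_host(d_dest)          # 4. copy results back

print(host_dest)
```
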
Error Handling

if( cudaMalloc( pointer, size ) != cudaSuccess ){
   handle_error();
}

And soon it becomes …

if( cudaMalloc( pointer, size ) != cudaSuccess ){
   handle_error();
}

if( cudaMemcpy( dest, src, size, direction ) != cudaSuccess ){
   handle_error();
}

// a kernel launch returns no status; query it afterwards
Kernel<<< # blocks, # threads >>>( params );
if( cudaGetLastError() != cudaSuccess ){
   handle_error();
}

if( cudaMemcpy( dest, src, size, direction ) != cudaSuccess ){
   handle_error();
}

[Next slide: the same block of checks repeated until it fills the whole slide]

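One common way out of this repetition is to check the status in a single place and raise an exception on failure. The sketch below shows the idea in plain Python. The status codes, `ToyCudaError`, `checked` and `fake_malloc` are all invented for illustration and do not correspond to the CUDA runtime or to PyCUDA internals.

```python
# Sketch of centralised error checking. The status codes and the
# fake_malloc function are invented for illustration; they stand in
# for CUDA runtime calls that return a status value.

SUCCESS = 0
ERROR_OUT_OF_MEMORY = 2

class ToyCudaError(RuntimeError):
    """Raised when a wrapped call reports a non-success status."""

def checked(call):
    """Wrap a status-returning function so that failures raise instead."""
    def wrapper(*args, **kwargs):
        status = call(*args, **kwargs)
        if status != SUCCESS:
            raise ToyCudaError(f"{call.__name__} failed with status {status}")
        return status
    return wrapper

@checked
def fake_malloc(size_bytes, available_bytes=1024):
    return SUCCESS if size_bytes <= available_bytes else ERROR_OUT_OF_MEMORY

fake_malloc(512)                 # succeeds silently
try:
    fake_malloc(4096)            # fails: reported as an exception
except ToyCudaError as error:
    message = str(error)
    print(message)
```

With this pattern the calling code contains no `if` statements at all; a failure anywhere surfaces as one exception.
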
Talk Structure

                    1. Why a GPU?
                    2. How does it work?
                    3. How do I program it?
                    4. Can I use Python?

[Python logo] + [CUDA logo]

    & ANDREAS KLÖCKNER

    = PYCUDA

PyCuda Philosophy

•   Provide complete access
•   Automatically manage resources
•   Check and report errors
•   Cross-platform
•   Allow interactive use
•   NumPy integration

NUMPY - ARRAY

import numpy

my_array = numpy.array([1,] * 100)
# 100 elements, indices 0 to 99, all equal to 1:   1 1 1 1 1 1 …

my_array[3] = 0
# the element at index 3 is now 0:                 1 1 1 0 1 1 …

PyCuda: Workflow

Memory Allocation

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

gpu_mem = cuda.mem_alloc( size_bytes )

Memory Copy

cuda.memcpy_htod( gpu_mem, cpu_mem )

Kernel

mod = SourceModule("""
__global__ void multiply_them( float *dest, float *a, float *b )
{
   const int i = threadIdx.x;
   dest[i] = a[i] * b[i];
}""")

Kernel Launch

multiply_them = mod.get_function("multiply_them")
# one thread per element along x; 400 is an assumed array length
multiply_them( *args, block=(400, 1, 1) )

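When choosing the `block` argument, keep in mind that the number of threads in a block is the product of its three dimensions, and that the hardware caps it. The helper below is a sketch in plain Python: the function names are hypothetical and not part of PyCUDA. The default limit of 512 is the per-block maximum on compute capability 1.x devices, and newer devices allow 1024. A launch with more than one block also requires the kernel to use the global index (`blockIdx.x * blockDim.x + threadIdx.x`) rather than `threadIdx.x` alone.

```python
# Helper for choosing a launch configuration. The default limit of 512
# threads per block matches compute capability 1.x devices; newer devices
# allow 1024. Function names are illustrative, not part of PyCUDA.

def threads_in_block(block):
    """Number of threads in a block given as an (x, y, z) tuple."""
    x, y, z = block
    return x * y * z

def launch_config(n_elements, threads_per_block, max_threads_per_block=512):
    """Return (block, grid) tuples for a 1-D launch covering n_elements."""
    if not 0 < threads_per_block <= max_threads_per_block:
        raise ValueError("threads_per_block must be between 1 and the device limit")
    n_blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return (threads_per_block, 1, 1), (n_blocks, 1)

block, grid = launch_config(n_elements=1000, threads_per_block=256)
print(block, grid)
print(threads_in_block((16, 16, 1)))     # small enough for one block
print(threads_in_block((64, 64, 1)))     # too many for one block
```
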
Hello GPU

     DEMO

GPUARRAY

PyCuda: GpuArray

   import pycuda.gpuarray as gpuarray

   a_gpu = gpuarray.to_gpu(numpy_array)

   numpy_array = a_gpu.get()

     +, -, *, /, fill, sin, exp, rand, basic
     indexing, norm, inner product …

PyCuda: GpuArray: ElementWise

from pycuda.elementwise import ElementwiseKernel

lincomb = ElementwiseKernel(
      "float a, float *x, float b, float *y, float *z",
      "z[i] = a*x[i] + b*y[i]"
)

c_gpu = gpuarray.empty_like(a_gpu)
lincomb(5, a_gpu, 6, b_gpu, c_gpu)

# la is numpy.linalg
assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
Meta-Programming

__kernel_template__ = """
__global__ void kernel( args )
{

for (int i = 0; i < {{ iterations }}; i++){
 {{operations}}
}

}"""

  See for example jinja2

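The template idea can be tried with the standard library alone. The sketch below uses `string.Template` as a stand-in for jinja2, so the placeholders are written `$name` instead of `{{ name }}`. The kernel signature, the loop body and the value 8 are assumptions made up for the example.

```python
# Generating kernel source from a template with the standard library.
# string.Template is used here as a stand-in for jinja2; it uses
# $name placeholders, so the C braces need no escaping.
from string import Template

kernel_template = Template("""
__global__ void kernel(float *data)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < $iterations; i++) {
        $operations
    }
}
""")

source = kernel_template.substitute(
    iterations=8,
    operations="data[idx] = data[idx] * 2.0f;",
)
print(source)
```

The resulting string is ordinary CUDA C source, which could then be handed to `SourceModule` for compilation at run time.
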
Meta-Programming

         Generate Source !

Performance?

mandelbrot

     DEMO

PyCuda: Documentation

PyCuda

WebSite:
http://mathema.tician.de/software/pycuda

License:
X Consortium License
  (no warranty, free for all use)

Dependencies:
  Python 2.4+, numpy, Boost

In the Future …

    OPENCL

THANK YOU & HAVE FUN !

?
