Dmitri Nesteruk
dmitri@activemesa.com
http://activemesa.com
http://spbalt.net
http://devtalk.net
“Premature optimization is the root of all evil.”

                                                    Donald Knuth
              Structured Programming with go to Statements, ACM
         Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.


“In practice, it is often necessary to keep performance goals in mind
when first designing software, but the programmer balances the goals
of design and optimization.”

                                                      Wikipedia
               http://en.wikipedia.org/wiki/Program_optimization
Brief intro
   Why unmanaged code?
   Why parallelize?
P/Invoke
SIMD
OpenMP
Intel stack: TBB, MKL, IPP
GPGPU: CUDA, Accelerator
Miscellanea
Today
 Threads & ThreadPool
 Sync structures
   Monitor.(Try)Enter/Exit
   ReaderWriterLock(Slim)
   Mutex
   Semaphore
 Wait handles
   Manual/AutoResetEvent
 Pulse & wait
 Async delegates
 Async simplifications
   F# async workflow
   AsyncEnumerator (PowerThreading)
Tomorrow
 Tasks and TaskManager
   Task
   Future
 Data-level parallelism
   Parallel.For/ForEach
 Parallel LINQ
   AsParallel()
Performance
Low-level (fine-tuning) framework
Instruction-level parallelism
GPU SIMD
General vectorization
Simple cross-machine framework
Managed interfaces for SIMD/MPI-optimized libraries
Threading tools
   Debugging
   Profiling
   Inferencing
Cross-machine debugging
Task management UI
WTF?!? Isn’t C# 5% faster than C?
   It depends.
Why is there a difference?
   More safety (e.g., CLR array bounds checking)
   JIT: No auto-parallelization
   JIT: No SIMD
   Lack of fine control
IL can be every bit as fast as C/C++
   But this is only true for simple problems
   The code is only as good as the JITter
Libraries (MKL, IPP)

  OpenMP

     Intel TBB, Microsoft PPL

        SIMD (CPU & GPGPU)
Part I
A way of calling unmanaged C++ from .Net
   Not the same as C++/CLI
For interaction with ‘legacy’ systems
Can pass data between managed and
unmanaged code
   Literals (int, string)
   Pointers (e.g., pointer to array)
   Structures
Marshalling is taken care of by the runtime
Make a Win32 C++ DLL
   MYLIB_API int Add(int first, int second)
   {
     return first + second;
   }
   Specify a post-build step to copy the DLL to the .Net application's output folder
     Important: by default the DLL is built under the solution root, not next to the .Net assembly
   Build the DLL
Make a .Net application
   [DllImport("MyLib.dll")]
   public static extern int Add(
     int first, int second);
Call the method
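The native side of the steps above can be sketched as follows. The slide uses MYLIB_API without showing its definition, so the macro below is an assumption: it combines extern "C" (so the export name is not C++-mangled) with the compiler's export attribute.

```cpp
// mylib.cpp: a minimal sketch of the native half of the Add example.
// The definition of MYLIB_API is an assumption; the slide only shows its use.
#if defined(_WIN32)
  #define MYLIB_API extern "C" __declspec(dllexport)
#else
  #define MYLIB_API extern "C" __attribute__((visibility("default")))
#endif

// Exported with C linkage, so the export name is not C++-mangled.
MYLIB_API int Add(int first, int second)
{
    return first + second;
}
```

The [DllImport("MyLib.dll")] declaration shown above would bind to this export. In 32-bit builds the calling convention may also need to be stated on the managed side, because C++ functions default to cdecl.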
Basic C# ↔ C++ Interop
DLL not found
  Make sure post-build step copies DLL to target folder
  Or that DLL is in PATH
An attempt was made to load a program with an incorrect format
  DLL relies on other DLLs which are not found
    Open Visual Studio command prompt
    Use dumpbin /dependents mylib.dll to find out
    Copy files to target dir
    This is common in Debug mode
  32-bit/64-bit mismatch
Entry point not found
  Make sure method names and signatures are equivalent
  Make sure calling convention matches
    [DllImport(…, CallingConvention = …)]
  If the exported name is mangled, specify the entry name explicitly
    Use dumpbin /exports
    [DllImport(…, EntryPoint = "?Add@@YAHHH@Z")]
    Or export the function as extern "C" so that the name is not mangled
It all works
  Congratulations!
Special cases
  String handling
    Unicode vs. ANSI
    LP(C)WSTR
  Arrays
    fixed
  Memory allocation
  Calling convention
  “Bitness” issues
  … and lots more!
Handling them
  Marshal
  MarshalAsAttribute
  [In] and [Out]
  StructLayout
  IntPtr
  … and lots more
Handle on a case-by-case basis
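As an illustration of two of the special cases above (arrays and structures), here is a hedged sketch of what the native side could look like. The names SumArray, ScalePoint and Point3 are hypothetical, and the MYLIB_API definition is an assumption. The managed declarations would need matching types and a sequential struct layout.

```cpp
#include <cstdint>

// Assumed export macro (not shown on the slides).
#if defined(_WIN32)
  #define MYLIB_API extern "C" __declspec(dllexport)
#else
  #define MYLIB_API extern "C" __attribute__((visibility("default")))
#endif

// Hypothetical plain struct; its fields are laid out in declaration order.
struct Point3
{
    float x;
    float y;
    float z;
};

// Arrays cross the boundary as a pointer plus an explicit element count.
MYLIB_API double SumArray(const double* values, int32_t count)
{
    double total = 0.0;
    for (int32_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}

// Structures can be passed by pointer and modified in place.
MYLIB_API void ScalePoint(Point3* point, float factor)
{
    point->x *= factor;
    point->y *= factor;
    point->z *= factor;
}
```

Passing the length explicitly matters because a raw pointer carries no size information once it leaves managed code.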
Make sure signatures match
  Including return types!
To debug
  If your OS is 64-bit, make sure .Net assemblies compile in 32-bit mode
  Make sure unmanaged code debugging is turned on
In 64-bit
  Launch target DLL with the .Net assembly as target
    Good luck! :)
Visit the P/Invoke wiki @ http://pinvoke.net
Part II
An API for multi-platform
shared-memory parallel
programming in C/C++
and Fortran.
Uses #pragma statements
to decorate code
Easy!!!
  Syntax can be learned very
  quickly
  Can be turned off and on
  in project settings
Enable it (disabled by default)
   In Visual C++, via the OpenMP support project setting (the /openmp compiler switch)
Use it!
   No further action necessary

To use configuration API
   #include <omp.h>
   Call methods, e.g., omp_get_num_procs()
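A minimal sketch of the configuration API in use. The function names AvailableProcessors and SumTo are made up for illustration. The _OPENMP guard is there so the same file still builds when OpenMP is switched off, in which case the pragma is ignored and the loop runs serially.

```cpp
#ifdef _OPENMP
  #include <omp.h>
#endif

// Number of processors OpenMP reports, or 1 when OpenMP is disabled.
int AvailableProcessors()
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

// Sums 1..n. The reduction clause gives each thread a private partial
// sum and combines the partial sums when the loop finishes.
long long SumTo(int n)
{
    long long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; ++i)
        total += i;
    return total;
}
```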
void MultiplyMatricesDoubleOMP(
  int size, double* m1[], double* m2[], double* result[])
{
  int i, j, k;
  // Split the iterations of the outer loop across threads.
  #pragma omp parallel for shared(size,m1,m2,result) private(i,j,k)
  for (i = 0; i < size; i++)
  {
    for (j = 0; j < size; j++)
    {
      result[i][j] = 0;
      for (k = 0; k < size; k++)
      {
        result[i][j] += m1[i][k] * m2[k][j];
      }
    }
  }
}
#pragma omp parallel for
   Tells the compiler to run the iterations of the following loop in parallel
shared(size,m1,m2,result)
   Variables shared between all threads
private(i,j,k)
   Variables that have a separate copy, and so differing values, in each thread
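A variation on the same idea, offered as a sketch rather than as the presenter's code. It stores the matrices in flat row-major vectors (an assumption, not the layout used above) and declares the loop variables inside the loops. Variables declared inside a parallel loop are private to each thread automatically, so no private clause is needed.

```cpp
#include <cstddef>
#include <vector>

// Multiplies two size x size row-major matrices stored in flat vectors.
std::vector<double> MultiplyFlat(int size,
                                 const std::vector<double>& m1,
                                 const std::vector<double>& m2)
{
    std::vector<double> result(static_cast<std::size_t>(size) * size, 0.0);
    // Each outer iteration writes a distinct row of result,
    // so threads never write to the same element.
    #pragma omp parallel for
    for (int i = 0; i < size; ++i)
    {
        for (int j = 0; j < size; ++j)
        {
            double sum = 0.0;  // declared in the loop: one copy per thread
            for (int k = 0; k < size; ++k)
                sum += m1[i * size + k] * m2[k * size + j];
            result[i * size + j] = sum;
        }
    }
    return result;
}
```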
Using OpenMP in your C++ app
Homepage
http://openmp.org
cOMPunity (community of OMP users)
http://www.compunity.org/
OpenMP debug/optimization article (Russian)
http://bit.ly/BJbPU
VivaMP (static analyzer for OpenMP code)
http://viva64.com/vivamp-tool
Part III
Libraries save you from reinventing the wheel
   Tested
   Optimized (e.g., for multi-core, SIMD)
These typically have C++ and Fortran
interfaces
   Some also have MPI support
   Of course, there are .Net libraries too :)
The ‘trick’ is to use these libraries from C#
   Fortran-compatible API is tricky!
   Data structure passing can be quite arcane!
Intel makes multi-core processors
   Multi-core know-how
Intel Parallel Studio
   Parallel Composer
     C++ Compiler (autoparallelization, OpenMP 3.0)
     Libraries (Math Kernel Library, Integrated Performance
     Primitives, Threading Building Blocks)
     Parallel debugger extension
   Parallel Inspector (memory/threading errors)
   Parallel Amplifier (hotspots, concurrency, locks and waits)
   Parallel Advisor Lite
Low-level parallelization
framework from Intel
Lets you fine-tune code
for multi-core
Is a library
  Uses a set of primitives
Has OSS license
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;
// Functor: the loop body that is applied to each sub-range
struct Average {
    float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end( ); ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
    }
};
// Note: The input must be padded such that input[-1] and input[n]
// can be used to calculate the first and last output values.
void ParallelAverage( float* output, float* input, size_t n ) {
    Average avg;
    avg.input = input;
    avg.output = output;
    // Library call: process [0, n) in parallel with a grain size of 1000
    parallel_for(blocked_range<int>( 0, n, 1000 ), avg);
}
Integrated Performance Primitives
 High-performance libraries for
   Signal processing
   Image processing
   Computer vision
   Speech recognition
   Data compression
   Cryptography
   String manipulation
   Audio processing
   Video coding
   Realistic rendering
 Also support codec construction
Math Kernel Library
 Optimized, multithreaded library for math
 Support for
   BLAS
   LAPACK
   ScaLAPACK
   Sparse Solvers
   Fast Fourier Transforms
   Vector Math
   … and lots more
Part IV
CPU support for performing operations on
large registers
Normal-size data is loaded into 128-bit
registers
Operation on multiple elements with a single
instruction
   E.g., add 4 numbers at once
Requires special CPU instructions
   Less portable
Supported in C++ via ‘intrinsics’
SSE is an instruction set
   Preceded by MMX (MMX intrinsics are n/a in 64-bit Visual C++ builds)
   Now SSE and SSE2
Compiler intrinsics
   C++ functions that map to one or more SSE
   assembly instructions
Determining support
   Use cpuid
   Non-issue if you
     Are a systems integrator
     Run your own servers (e.g., Asp.Net)
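One way to perform the cpuid check, as a sketch. It relies on compiler-specific facilities: the __cpuid intrinsic in Visual C++ and __builtin_cpu_supports in GCC/Clang. The function name HasSse2 is made up for illustration, and builds for non-x86 targets simply report false.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>
#endif

// Returns true when the CPU reports SSE2 support.
bool HasSse2()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int info[4] = {0, 0, 0, 0};
    __cpuid(info, 1);                    // leaf 1: feature flags
    return (info[3] & (1 << 26)) != 0;   // EDX bit 26 = SSE2
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;                        // not an x86 build
#endif
}
```

A caller would typically check this once at start-up and fall back to a scalar code path when it returns false.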
128-bit data types
   __m128 (four floats, SSE)
   __m128i (integer intrinsics, SSE2)
   __m128d (double intrinsics, SSE2)
Operations for load and set
   __m128 a = _mm_set_ps(1,2,3,4);
     Arguments run from the highest element to the lowest, so element 0 is 4
   To get at data, pick the union member of the right type (Visual C++ only)
     E.g., myValue.m128_f32[0] gets first float
Perform operations (add, multiply, etc.)
   E.g., _mm_mul_ps(first, second) multiplies two
   values element by element, yielding a third
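A small sketch of these operations on an x86 target. It reads results back with _mm_storeu_ps instead of the m128_f32 member, since that member is specific to Visual C++. The function names are made up for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128 and the _mm_*_ps functions

// Multiplies four pairs of floats with a single SSE multiply.
// Unaligned loads and stores are used so any float pointers are accepted.
void Multiply4(const float* first, const float* second, float* result)
{
    __m128 a = _mm_loadu_ps(first);
    __m128 b = _mm_loadu_ps(second);
    __m128 product = _mm_mul_ps(a, b);
    _mm_storeu_ps(result, product);
}

// Shows the element order of _mm_set_ps: the last argument ends up in
// element 0 when the register is written back to memory.
void SetOrder(float* out4)
{
    __m128 v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    _mm_storeu_ps(out4, v);  // out4 = {4, 3, 2, 1}
}
```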
Make or get data
   Either create with initialized values
   static __m128 factor =
      _mm_set_ps(1.0f, 0.3f, 0.59f, 0.11f);
   Or load it into a SIMD-sized memory location
   __m128 pixel;
   pixel.m128_f32[0] = s->m128i_u8[(p<<2)];
   Or reinterpret an existing pointer (the data must be 16-byte aligned floats)
   __m128* pixelPtr = (__m128*)(&my_array[p]);
Perform a SIMD operation
   pixel = _mm_mul_ps(pixel, factor);
Get the data
   const BYTE sum = (BYTE)(pixel.m128_f32[0]);
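Put together, the fragments above amount to a weighted grey-level conversion. The sketch below handles one pixel end to end. It assumes a 32-bit pixel stored as blue, green, red, alpha bytes, ignores the alpha channel, and uses a made-up function name (ToGrey). The three colour weights are the ones used above.

```cpp
#include <xmmintrin.h>

// Converts one 32-bit BGRA pixel to a grey level using the weights
// 0.11 (blue), 0.59 (green) and 0.3 (red). The byte order is an assumption.
unsigned char ToGrey(const unsigned char* bgra)
{
    // _mm_set_ps takes the highest element first, so blue is element 0.
    const __m128 factor = _mm_set_ps(0.0f, 0.3f, 0.59f, 0.11f);
    const __m128 pixel = _mm_set_ps(0.0f, bgra[2], bgra[1], bgra[0]);

    // One multiply weights all channels at once.
    float weighted[4];
    _mm_storeu_ps(weighted, _mm_mul_ps(pixel, factor));

    // Add the three weighted channels and round to the nearest byte.
    const float sum = weighted[0] + weighted[1] + weighted[2];
    return static_cast<unsigned char>(sum + 0.5f);
}
```

A production version would more likely process several pixels per iteration. This form only shows the multiply-then-sum step.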
Image processing with SIMD
Part V
Graphics cards have GPUs
These are highly parallelized
   Pipelining
   Useful for graphics
GPUs are programmable
   We can do math ops on vectors
   Mainly float, with double support emerging
GPUs have programmable parts
   Vertex shader (vertex position)
   Pixel shader (pixel colour)
Treat data as texture (render target)
   Load inputs as texture
   Use pixel shader
   Get data from result texture
Special languages used to program them
   HLSL (DirectX)
   GLSL (OpenGL)
High-level wrappers (CUDA, Accelerator)
A Microsoft Research project
   Not for commercial use
Uses a managed API
Employs data-parallel arrays
   Int
   Float
   Bool
   Bitmap-aware
Requires Pixel Shader (PS) 2.0
Sorry! No demo.
Accelerator does not work on 64-bit :(
Unmanaged Parallelization via P/Invoke
If a library already exists, use it
If C# is fast enough, use it
To speed things up, try
   TPL/PLINQ
   Manual Parallelization
   unsafe (can be combined with TPL)
If you are unhappy, then
   Write in C++
   Speculatively add OpenMP directives
   Fine-tune with TBB as needed
System.Drawing.Bitmap is slow
  Has very slow GetPixel()/SetPixel()
  methods
Can lock (pin) the bitmap in memory and manipulate it in
unmanaged code
What we need to pass in
  Pointer to bitmap image data
  Bitmap width and height
  Bitmap stride (bytes per row of pixels, including any padding)
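A sketch of an unmanaged function that takes exactly these inputs. The name InvertBitmap, the 32-bit BGRA pixel format and the MYLIB_API definition are all assumptions. The OpenMP pragma parallelizes the work row by row and is ignored if OpenMP is disabled.

```cpp
#include <cstddef>

// Assumed export macro (not shown on the slides).
#if defined(_WIN32)
  #define MYLIB_API extern "C" __declspec(dllexport)
#else
  #define MYLIB_API extern "C" __attribute__((visibility("default")))
#endif

// Inverts the colour channels of a 32-bit BGRA bitmap in place.
// stride is the distance in bytes between the starts of consecutive rows,
// which can be larger than width * 4 when rows are padded.
MYLIB_API void InvertBitmap(unsigned char* data, int width, int height,
                            int stride)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
    {
        unsigned char* row = data + static_cast<std::ptrdiff_t>(y) * stride;
        for (int x = 0; x < width; ++x)
        {
            unsigned char* pixel = row + x * 4;
            pixel[0] = static_cast<unsigned char>(255 - pixel[0]);  // blue
            pixel[1] = static_cast<unsigned char>(255 - pixel[1]);  // green
            pixel[2] = static_cast<unsigned char>(255 - pixel[2]);  // red
            // pixel[3] (alpha) is left unchanged
        }
    }
}
```

Stepping by stride rather than by width * 4 is what keeps the padding bytes at the end of each row untouched.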
Image-rendered headings with subpixel
postprocessing (http://bit.ly/10x0G8)
  WPF FlowDocument for initial generation
  C++/OpenMP for postprocessing
  Asp.Net for serving the result
Freeform rendered text with OpenType
features (http://bit.ly/1cCP50)
  Bitmap rendering in Direct2D (C++/lightweight
  COM API)
  OpenType markup language
THE END

More Related Content

What's hot (20)

PDF
Fun with Lambdas: C++14 Style (part 1)
Sumant Tambe
 
PPTX
Summary of C++17 features
Bartlomiej Filipek
 
PDF
2018 cosup-delete unused python code safely - english
Jen Yee Hong
 
PDF
C++11
ppd1961
 
PDF
Modern c++ (C++ 11/14)
Geeks Anonymes
 
PPTX
C++ Presentation
Carson Wilber
 
PDF
C++11 & C++14
CyberPlusIndia
 
PPTX
C++ 11 Features
Jan Rüegg
 
PDF
Modern C++
Michael Clark
 
PDF
Basic c++ 11/14 for python programmers
Jen Yee Hong
 
PPTX
C++11
Quang Trần Duy
 
PDF
C++20 the small things - Timur Doumler
corehard_by
 
PDF
C++11 concurrency
xu liwei
 
PDF
C++17 introduction - Meetup @EtixLabs
Stephane Gleizes
 
PPT
Gentle introduction to modern C++
Mihai Todor
 
PDF
Golang and Eco-System Introduction / Overview
Markus Schneider
 
PDF
Introduction to Go programming language
Slawomir Dorzak
 
PPTX
Fun with Lambdas: C++14 Style (part 2)
Sumant Tambe
 
PPT
Csdfsadf
Atul Setu
 
Fun with Lambdas: C++14 Style (part 1)
Sumant Tambe
 
Summary of C++17 features
Bartlomiej Filipek
 
2018 cosup-delete unused python code safely - english
Jen Yee Hong
 
C++11
ppd1961
 
Modern c++ (C++ 11/14)
Geeks Anonymes
 
C++ Presentation
Carson Wilber
 
C++11 & C++14
CyberPlusIndia
 
C++ 11 Features
Jan Rüegg
 
Modern C++
Michael Clark
 
Basic c++ 11/14 for python programmers
Jen Yee Hong
 
C++20 the small things - Timur Doumler
corehard_by
 
C++11 concurrency
xu liwei
 
C++17 introduction - Meetup @EtixLabs
Stephane Gleizes
 
Gentle introduction to modern C++
Mihai Todor
 
Golang and Eco-System Introduction / Overview
Markus Schneider
 
Introduction to Go programming language
Slawomir Dorzak
 
Fun with Lambdas: C++14 Style (part 2)
Sumant Tambe
 
Csdfsadf
Atul Setu
 

Similar to Unmanaged Parallelization via P/Invoke (20)

PPTX
25-MPI-OpenMP.pptx
GopalPatidar13
 
PDF
Skiron - Experiments in CPU Design in D
Mithun Hunsur
 
PPT
Migration To Multi Core - Parallel Programming Models
Zvi Avraham
 
PDF
Peyton jones-2011-parallel haskell-the_future
Takayuki Muranushi
 
PDF
Simon Peyton Jones: Managing parallelism
Skills Matter
 
PPT
Virtual platform
sean chen
 
PDF
parallel-computation.pdf
Jayanti Prasad Ph.D.
 
PPT
Intro dotnet
shuklagirish
 
PPT
Intro dotnet
shuklagirish
 
PPT
Intro dotnet
shuklagirish
 
PPT
Intro dotnet
shuklagirish
 
PDF
Parallel computation
Jayanti Prasad Ph.D.
 
PDF
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Intel® Software
 
PPT
Multicore
Birgit Plötzeneder
 
PPT
មេរៀនៈ Data Structure and Algorithm in C/C++
Ngeam Soly
 
PPT
Os Worthington
oscon2007
 
PPTX
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...
Maarten Balliauw
 
ODP
Parallel Programming on the ANDC cluster
Sudhang Shankar
 
PPTX
CSE 116 OOP Educational Materials of United International University
MdMirajulIslam21
 
25-MPI-OpenMP.pptx
GopalPatidar13
 
Skiron - Experiments in CPU Design in D
Mithun Hunsur
 
Migration To Multi Core - Parallel Programming Models
Zvi Avraham
 
Peyton jones-2011-parallel haskell-the_future
Takayuki Muranushi
 
Simon Peyton Jones: Managing parallelism
Skills Matter
 
Virtual platform
sean chen
 
parallel-computation.pdf
Jayanti Prasad Ph.D.
 
Intro dotnet
shuklagirish
 
Intro dotnet
shuklagirish
 
Intro dotnet
shuklagirish
 
Intro dotnet
shuklagirish
 
Parallel computation
Jayanti Prasad Ph.D.
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Intel® Software
 
មេរៀនៈ Data Structure and Algorithm in C/C++
Ngeam Soly
 
Os Worthington
oscon2007
 
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...
Maarten Balliauw
 
Parallel Programming on the ANDC cluster
Sudhang Shankar
 
CSE 116 OOP Educational Materials of United International University
MdMirajulIslam21
 
Ad

More from Dmitri Nesteruk (20)

PDF
Good Ideas in Programming Languages
Dmitri Nesteruk
 
PDF
Design Pattern Observations
Dmitri Nesteruk
 
PDF
CallSharp: Automatic Input/Output Matching in .NET
Dmitri Nesteruk
 
PDF
Design Patterns in Modern C++
Dmitri Nesteruk
 
PPTX
C# Tricks
Dmitri Nesteruk
 
PPTX
Introduction to Programming Bots
Dmitri Nesteruk
 
PDF
Converting Managed Languages to C++
Dmitri Nesteruk
 
PDF
Monte Carlo C++
Dmitri Nesteruk
 
PDF
Tpl DataFlow
Dmitri Nesteruk
 
PDF
YouTrack: Not Just an Issue Tracker
Dmitri Nesteruk
 
PPTX
Проект X2C
Dmitri Nesteruk
 
PPTX
Domain Transformations
Dmitri Nesteruk
 
PDF
Victor CG Erofeev - Metro UI
Dmitri Nesteruk
 
PDF
Developer Efficiency
Dmitri Nesteruk
 
PPTX
Distributed Development
Dmitri Nesteruk
 
PDF
Dynamics CRM Data Integration
Dmitri Nesteruk
 
PDF
Web mining
Dmitri Nesteruk
 
PDF
Data mapping tutorial
Dmitri Nesteruk
 
PDF
Reactive Extensions
Dmitri Nesteruk
 
PDF
Design Patterns in .Net
Dmitri Nesteruk
 
Good Ideas in Programming Languages
Dmitri Nesteruk
 
Design Pattern Observations
Dmitri Nesteruk
 
CallSharp: Automatic Input/Output Matching in .NET
Dmitri Nesteruk
 
Design Patterns in Modern C++
Dmitri Nesteruk
 
C# Tricks
Dmitri Nesteruk
 
Introduction to Programming Bots
Dmitri Nesteruk
 
Converting Managed Languages to C++
Dmitri Nesteruk
 
Monte Carlo C++
Dmitri Nesteruk
 
Tpl DataFlow
Dmitri Nesteruk
 
YouTrack: Not Just an Issue Tracker
Dmitri Nesteruk
 
Проект X2C
Dmitri Nesteruk
 
Domain Transformations
Dmitri Nesteruk
 
Victor CG Erofeev - Metro UI
Dmitri Nesteruk
 
Developer Efficiency
Dmitri Nesteruk
 
Distributed Development
Dmitri Nesteruk
 
Dynamics CRM Data Integration
Dmitri Nesteruk
 
Web mining
Dmitri Nesteruk
 
Data mapping tutorial
Dmitri Nesteruk
 
Reactive Extensions
Dmitri Nesteruk
 
Design Patterns in .Net
Dmitri Nesteruk
 
Ad

Recently uploaded (20)

PDF
"Beyond English: Navigating the Challenges of Building a Ukrainian-language R...
Fwdays
 
PDF
New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
PDF
What Makes Contify’s News API Stand Out: Key Features at a Glance
Contify
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PDF
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
PDF
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PPTX
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PDF
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
July Patch Tuesday
Ivanti
 
PDF
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
PDF
Transcript: New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
PDF
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
PDF
Biography of Daniel Podor.pdf
Daniel Podor
 
PDF
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
"Beyond English: Navigating the Challenges of Building a Ukrainian-language R...
Fwdays
 
New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
What Makes Contify’s News API Stand Out: Key Features at a Glance
Contify
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
July Patch Tuesday
Ivanti
 
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
Transcript: New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
Biography of Daniel Podor.pdf
Daniel Podor
 
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 

Unmanaged Parallelization via P/Invoke

  • 2. “Premature optimization is the root of all evil.” Donald Knuth Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268. “In practice, it is often necessary to keep performance goals in mind when first designing software, but the programmer balances the goals of design and optimization.” Wikipedia http://en.wikipedia.org/wiki/Program_optimization
  • 3. Brief intro Why unmanaged code? Why parallelize? P/Invoke SIMD OpenMP Intel stack: TBB, MKL, IPP GPGPU: Cuda, Accelerator Miscellanea
  • 4. Today Tomorrow Threads & ThreadPool Tasks and TaskManager Sync structures Task Monitor.(Try)Enter/Exit Future ReaderWriterLock(Slim) Mutex Data-level parallelism Semaphore Parallel.For/ForEach Wait handles Manual/AutoResetEvent Parallel LINQ AsParallel() Pulse & wait Async delegates Async simplifications F# async workflow AsyncEnumerator (PowerThreading)
  • 5. Performance Managed interfaces for Low-level (fine-tuning) SIMD/MPI-optimized framework libraries Instruction-level Threading tools parallelism Debugging GPU SIMD Profiling Inferencing General vectorization Cross-machine Simple cross-machine debugging framework Task management UI
  • 6. WTF?!? Isn’t C# 5% faster than C? It depends. Why is there a difference? More safety (e.g., CLR array bound checking) JIT: No auto-parallelization JIT: No SIMD Lack of fine control IL can be every bit as fast as C/C++ But this is only true for simple problems The code is only as good as the JITter
  • 7. Libraries (MKL, IPP) OpenMP Intel TBB, Microsoft PPL SIMD (CPU & GPGPU)
  • 9. A way of calling unmanaged C++ from .Net Not the same as C++/CLI For interaction with ‘legacy’ systems Can pass data between managed and unmanaged code Literals (int, string) Pointers (e.g., pointer to array) Structures Marshalling is taken care of by the runtime
  • 10. Make a Win32 C++ DLL MYLIB_API int Add(int first, int second) { return first + second; } Specify a post-build step to copy DLL to .Net assembly Important: default DLL location is solution root Build the DLL Make a .Net application [DllImport("MyLib.dll")] public static extern int Add( int first, int second); Call the method
  • 11. Basic C# ↔ C++ Interop
  • 12. DLL not found Entry point not found Make sure post-build step Make sure method names and copies DLL to target folder signatures are equivalent Or that DLL is in PATH Make sure calling convention An attempt was made to load matches [DllImport(…, DLL with incorrect format CallingConvention=)) DLL relies on other DLLs which On 64-bit systems, specify entry are not found name explicitly Open Visual Studio command Use dumpbin /exports prompt [DllImport(…, Use dumpbin /dependents EntryPoint = "?Add@@YAHHH@Z" mylib.dll to find out No, extern "C " does not help Copy files to target dir This is common in Debug mode It all works 32-bit/64-bit mismatch Congratulations!
  • 13. Special cases Handling them String handling Marshal Unicode vs. ANSI MarshalAsAttribute LP(C)WSTR [In] and [Out] Arrays StructLayout fixed IntPtr Memory allocation … and lots more Calling convention “Bitness” issues Handle on a case-by-case … and lots more! basis
  • 14. Make sure signatures match Including return types! To debug If your OS is 64-bit, make sure .Net assemblies compile in 32-bit mode Make sure unmanaged code debugging is turned on In 64-bit Visit the P/Invoke wiki @ Launch target DLL with http://pinvoke.net the .Net assembly as target Good luck! :)
  • 16. An API for multi-platform shared-memory parallel programming in C/C++ and Fortran. Uses #pragma statements to decorate code Easy!!! Syntax can be learned very quickly Can be turned off and on in project settings
  • 17. Enable it (disabled by default) Use it! No further action necessary To use configuration API #include <omp.h> Call methods, e.g., omp_get_num_procs()
  • 18. void MultiplyMatricesDoubleOMP( int size, double* m1[], double* m2[], double* result[]) { int i, j, k; #pragma omp parallel for shared(size,m1,m2,result) private (i,j,k) for (i = 0; i < size; i++) { #pragma omp parallel for for (j = 0; j < size; j++) Hints to the compiler that it’s { worth parallelizing loop result[i][j] = 0; for (k = 0; k < size; k++) shared(size,m1,m2,result) { Variables shared between all result[i][j] += m1[i][k] * m2[k][j]; threads } } private(i,j,k) Variables which have differing } values in different threads }
  • 19. Using OpenMP in your C++ app
  • 20. Homepage http://openmp.org cOMPunity (community of OMP users) http://www.compunity.org/ OpenMP debug/optimization article (Russian) http://bit.ly/BJbPU VivaMP (static analyzer for OpenMP code) http://viva64.com/vivamp-tool
  • 22. Libraries save you from reinventing the wheel Tested Optimized (e.g., for multi-core, SIMD) These typically have C++ and Fortran interfaces Some also have MPI support Of course, there are .Net libraries too :) The ‘trick’ is to use these libraries from C# Fortran-compatible API is tricky! Data structure passing can be quite arcane!
  • 23. Intel makes multi-core processors Multi-core know-how Parallel Composer C++ Compiler (autoparallelization, OpenMP 3.0) Intel Parallel Studio Libraries (Math Kernel Library, Integrated Performance Primitives, Threading Building Blocks) Parallel debugger extension Parallel inspector (memory/threading errors) Parallel amplifier (hotspots, concurrency, locks and waits) Parallel Advisor Lite
  • 24. Low-level parallelization framework from Intel Lets you fine-tune code for multi-core Is a library Uses a set of primitives Has OSS license
  • 25. #include "tbb/parallel_for.h" #include "tbb/blocked_range.h" using namespace tbb; Functor struct Average { float* input; float* output; void operator()( const blocked_range<int>& range ) const { for( int i=range.begin(); i!=range.end( ); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); } }; // Note: The input must be padded such that input[-1] and input[n] // can be used to calculate the first and last output values. void ParallelAverage( float* output, float* input, size_t n ) { Average avg; avg.input = input; avg.output = output; Library parallel_for(blocked_range<int>( 0, n, 1000 ), avg); } call
  • 26. Integrated Performance Math Kernel Library Primitives High-performance libraries for Optimized, multithreaded Signal processing library for math Image processing Support for Computer vision BLAS Speech recognition LAPACK Data compression ScaLAPACK Cryptography Sparse Solvers String manipulation Fast Fourier Transforms Audio processing Vector Math Video coding … and lots more Realistic rendering Also support codec construction
  • 28. CPU support for performing operations on large registers Normal-size data is loaded into 128-bit registers Operation on multiple elements with a single instruction E.g., add 4 numbers at once Requires special CPU instructions Less portable Supported in C++ via ‘intrinsics’
  • 29. SSE is an instruction set. Preceded by MMX (MMX intrinsics are n/a in 64-bit MSVC builds). Now SSE and SSE2. Compiler intrinsics: C++ functions that map to one or more SSE assembly instructions. Determining support: use cpuid. A non-issue if you are a systems integrator or run your own servers (e.g., ASP.NET).
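A sketch of "use cpuid", assuming GCC or Clang on x86, where <cpuid.h> provides __get_cpuid (MSVC offers __cpuid in <intrin.h> instead). SseSupport and detect_sse are hypothetical names. CPUID leaf 1 reports SSE in EDX bit 25 and SSE2 in EDX bit 26.

```cpp
#include <cpuid.h>  // GCC/Clang on x86 only (assumption)

// Hypothetical result type for the feature check.
struct SseSupport {
    bool sse;
    bool sse2;
};

// Queries CPUID leaf 1 and reads the SSE/SSE2 feature bits from EDX.
SseSupport detect_sse() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    SseSupport s = {false, false};
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        s.sse  = (edx & (1u << 25)) != 0;  // EDX bit 25: SSE
        s.sse2 = (edx & (1u << 26)) != 0;  // EDX bit 26: SSE2
    }
    return s;
}
```

A caller would branch once at startup and pick the SIMD or the plain code path.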
  • 30. 128-bit data types: __m128, __m128i (integer intrinsics, SSE2), __m128d (double intrinsics, SSE2). Operations for load and set: __m128 a = _mm_set_ps(1,2,3,4); To get at data, dereference and choose type, e.g., myValue.m128_f32[0] gets the first float. Perform operations (add, multiply, etc.), e.g., _mm_mul_ps(first, second) multiplies two values yielding a third.
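A sketch of set, multiply and read-back. Two things worth knowing: _mm_set_ps takes its arguments from the highest lane down, so _mm_set_ps(1,2,3,4) puts 4 in lane 0; and the m128_f32 member is MSVC-specific, so this sketch reads lanes through a store instead. The names lane and scaled_example are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical helper: returns lane i (0..3) of v by storing it to memory.
// Portable alternative to the MSVC-only v.m128_f32[i].
float lane(__m128 v, int i) {
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[i];
}

// Multiplies (4,3,2,1 in lanes 0..3) by 2 in every lane.
__m128 scaled_example() {
    const __m128 first  = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);  // lane 0 holds 4.0f
    const __m128 second = _mm_set1_ps(2.0f);                   // 2.0f in all lanes
    return _mm_mul_ps(first, second);
}
```

Here lane(scaled_example(), 0) is 8.0f and lane(scaled_example(), 3) is 2.0f, which shows the reversed argument order of _mm_set_ps.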
  • 31. Make or get data. Either create with initialized values: static __m128 factor = _mm_set_ps(1.0f, 0.3f, 0.59f, 0.11f); Or load it into a SIMD-sized memory location: __m128 pixel; pixel.m128_f32[0] = s->m128i_u8[(p<<2)]; Or convert an existing pointer (the address must be 16-byte aligned): __m128* pixel = (__m128*)(my_array + p); Perform a SIMD operation: pixel = _mm_mul_ps(pixel, factor); Get the data: const BYTE sum = (BYTE)(pixel.m128_f32[0]);
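The steps above can be put together as a grayscale conversion of one pixel. This is a sketch under assumptions, not the deck's actual code: the pixel is taken to be 4 bytes in B, G, R, A order, the alpha weight is set to 0 rather than the slide's 1.0f so that alpha does not enter the sum, and gray_from_bgra is a hypothetical name.

```cpp
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical helper: weighted gray value of one BGRA pixel (assumed layout).
std::uint8_t gray_from_bgra(const std::uint8_t* px) {
    // Lanes 0..3 = B, G, R, A weights (arguments run from lane 3 down to lane 0).
    const __m128 factor = _mm_set_ps(0.0f, 0.3f, 0.59f, 0.11f);
    // Make the data: widen the four bytes to floats in matching lanes.
    const __m128 pixel = _mm_set_ps(px[3], px[2], px[1], px[0]);
    // Perform the SIMD operation: four multiplies at once.
    const __m128 weighted = _mm_mul_ps(pixel, factor);
    // Get the data: store the lanes and sum the colour channels.
    float lanes[4];
    _mm_storeu_ps(lanes, weighted);
    const float sum = lanes[0] + lanes[1] + lanes[2];
    return static_cast<std::uint8_t>(sum + 0.5f);  // round to nearest
}
```

In a real loop the factor would be created once and whole rows processed, which is where the SIMD version starts to pay off.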
  • 34. Graphics cards have GPUs. These are highly parallelized (pipelining) and useful for graphics. GPUs are programmable: we can do math ops on vectors, mainly float, with double support emerging.
  • 35. GPUs have programmable parts: vertex shader (vertex position), pixel shader (pixel colour). Treat data as texture (render target): load inputs as texture, use pixel shader, get data from result texture. Special languages are used to program them: HLSL (DirectX), GLSL (OpenGL). High-level wrappers: CUDA, Accelerator.
  • 36. Accelerator: a Microsoft Research project. Not for commercial use. Uses a managed API. Employs data-parallel arrays (Int, Float, Bool). Bitmap-aware. Requires PS 2.0.
  • 37. Sorry! No demo. Accelerator does not work on 64-bit :(
  • 39. If a library already exists, use it. If C# is fast enough, use it. To speed things up, try TPL/PLINQ, manual parallelization, or unsafe code (can be combined with TPL). If you are still unhappy, then write in C++, speculatively add OpenMP directives, and fine-tune with TBB as needed.
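A sketch of what "speculatively add OpenMP directives" can look like: one pragma on an otherwise ordinary loop. The function name sum_of_squares is hypothetical. Built with -fopenmp (GCC/Clang) or /openmp (MSVC) the loop is split across threads; without the flag the pragma is ignored and the loop simply runs serially.

```cpp
#include <vector>

// Hypothetical example: sums the squares of all elements.
double sum_of_squares(const std::vector<double>& v) {
    double total = 0.0;
    const int n = static_cast<int>(v.size());  // signed index keeps older OpenMP versions happy
    // Each thread accumulates a private total; OpenMP combines them at the end.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += v[i] * v[i];
    }
    return total;
}
```

Because the directive is optional, it costs little to try it, measure, and remove it again if the loop is too small to benefit.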
  • 40. System.Drawing.Bitmap is slow: it has very slow GetPixel()/SetPixel() methods. Can fix (pin) the bitmap in memory and manipulate it in unmanaged code. What we need to pass in: pointer to bitmap image data, bitmap width and height, bitmap stride (bytes per row in memory, including any padding).
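A sketch of the unmanaged side, assuming a 32-bit-per-pixel bitmap in B, G, R, A byte order. InvertBitmap32 is a hypothetical export name; on the C# side the pointer and stride would typically come from Bitmap.LockBits (BitmapData.Scan0 and BitmapData.Stride). Rows are addressed through the stride, not width * 4, so row padding is skipped correctly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical export: inverts the colour channels of a 32bpp bitmap in place.
// data   - pointer to the first byte of the first row
// stride - bytes from the start of one row to the start of the next
extern "C" void InvertBitmap32(std::uint8_t* data, int width, int height, int stride) {
    for (int y = 0; y < height; ++y) {
        // Step by stride, which may be larger than width * 4 because of padding.
        std::uint8_t* row = data + static_cast<std::ptrdiff_t>(y) * stride;
        for (int x = 0; x < width; ++x) {
            std::uint8_t* px = row + x * 4;
            px[0] = static_cast<std::uint8_t>(255 - px[0]);  // blue
            px[1] = static_cast<std::uint8_t>(255 - px[1]);  // green
            px[2] = static_cast<std::uint8_t>(255 - px[2]);  // red
            // px[3] (alpha) is left untouched
        }
    }
}
```

A function of this shape can be called from C# via P/Invoke while the bitmap is locked, so no pixel data is copied.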
  • 41. Image-rendered headings with subpixel postprocessing (http://bit.ly/10x0G8): WPF FlowDocument for initial generation, C++/OpenMP for postprocessing, ASP.NET for serving the result. Freeform rendered text with OpenType features (http://bit.ly/1cCP50): bitmap rendering in Direct2D (C++/lightweight COM API), OpenType markup language.
  • 42. THE END