The R-Stream High-Level Program Transformation Tool
      N. Vasilache, B. Meister, M. Baskaran, A. Hartono, R. Lethin

Reservoir Labs, Harvard, 04/12/2011
Outline


        • R-Stream Overview

        • Compilation Walk-through

        • Performance Results

        • Getting R-Stream
Power efficiency driving architectures




[Architecture diagram: a fabric of GPP + FPGA + SIMD units with DMA engines and local memories, annotated with the traits below.]

        • Heterogeneous processing
        • Distributed local memories
        • Explicitly managed architecture
        • Bandwidth starved
        • Multiple spatial dimensions
        • NUMA
        • Hierarchical (including board, chassis, cabinet)
        • Multiple execution models
        • Mixed parallelism types
Computation choreography


        • Expressing it
          – Annotations and pragma dialects for C
          – Explicitly (e.g., new languages like CUDA and OpenCL)

        • But before expressing it, how can programmers find it?
          – Manually: constructive procedures, art, sweat, time
            (artisans get complete control over every detail)
          – Automatically – our focus:
            an operations research problem, like scheduling trucks to save fuel;
            model, solve, implement; faster, and sometimes better, than a human
How to do automatic scheduling?


        • Naïve approach:
          – Model: tasks, dependences, resource use, latencies, and a machine
            model with connectivity and resource capacities
          – Solve with ILP: minimize overall task length, subject to dependences
            and resource use
          – Problems:
            · Complexity: the task graph is huge!
            · Dynamics: loop trip counts are unknown at compile time

        • So we do something much more cool.
Program Transformations Specification


[Diagram: the iteration space of a statement S(i,j), mapped by Θ : Z² → Z² from iteration coordinates (i, j) to time coordinates (t1, t2).]

        • Schedule maps iterations to multi-dimensional time: Θ : Z² → Z²
          – a feasible schedule preserves dependences
        • Placement maps iterations to multi-dimensional space
          – UHPC in progress, partially done
        • Layout maps data elements to multi-dimensional space
          – UHPC in progress
        • Hierarchical by design; tiling serves separation of concerns
Loop transformations

 for(i=0; i<N; i++)
   for(j=0; j<N; j++)
     S(i,j);

 Each transformation below is a schedule Θ(i,j) = U·(i,j)ᵀ (permutation,
 reversal, and skewing are unimodular):

 permutation   Θ(i,j) = (j, i)          U = [ 0 1 ]    for(j=0; j<N; j++)
                                            [ 1 0 ]      for(i=0; i<N; i++)
                                                            S(i,j);

 reversal      Θ(i,j) = (-i, j)         U = [ -1 0 ]   for(i=N-1; i>=0; i--)
                                            [  0 1 ]     for(j=0; j<N; j++)
                                                            S(i,j);

 skewing       Θ(i,j) = (i, α·i + j)    U = [ 1 0 ]    for(i=0; i<N; i++)
                                            [ α 1 ]      for(j=α*i; j<N+α*i; j++)
                                                            S(i,j-α*i);

 scaling       Θ(i,j) = (α·i, j)        U = [ α 0 ]    for(i=0; i<α*N; i+=α)
                                            [ 0 1 ]      for(j=0; j<N; j++)
                                                            S(i/α,j);
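 As a concrete check of the skewing row above (a minimal, self-contained
 sketch; S and N are placeholders, with α fixed to 1):

    #include <stdio.h>

    #define N 4

    /* Placeholder statement: prints the original (i, j) it executes. */
    static void S(int i, int j) { printf("S(%d,%d)\n", i, j); }

    int main(void) {
        /* Skewing with alpha = 1: time coordinates (t1, t2) = (i, i + j).
           Scanning (t1, t2) and recovering j = t2 - t1 visits exactly the
           iterations of the original square i/j nest, in skewed order. */
        for (int t1 = 0; t1 < N; t1++)
            for (int t2 = t1; t2 < N + t1; t2++)
                S(t1, t2 - t1);
        return 0;
    }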


Loop fusion and distribution

 Distributed:                                 Fused:

 for(i=0; i<N; i++) {                         for(i=0; i<N; i++)
   for(j=0; j<N; j++)                           for(j=0; j<N; j++) {
     S1(i,j);                                     S1(i,j);
   for(j=0; j<N; j++)                             S2(i,j);
     S2(i,j);                                   }
 }

 With a scalar "position" dimension interleaved between the loop dimensions,
 fusion and distribution are pure schedule changes on (i, j, 1):

 distributed:   Θ1(i,j) = (0, i, 0, j, 0)     Θ2(i,j) = (0, i, 1, j, 0)
 fused:         Θ1(i,j) = (0, i, 0, j, 0)     Θ2(i,j) = (0, i, 0, j, 1)

Enabling technology is new compiler math

 Uniform Recurrence Equations [Karp et al. 1970]

 Loop Transformations and Parallelization [1970-]
   – Many contributors: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam,
     Pugh, Pingali, etc.
   – Vectorization, SMP, locality optimizations
   – Dependence summary: direction/distance vectors
   – Unimodular transformations

 Systolic Array Mapping
   – Mostly linear-algebraic

 Polyhedral Model [1980-]
   – Many contributors: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
   – Exact dependence analysis
   – General affine transformations
   – Loop synthesis via polyhedral scanning
   – New computational techniques based on polyhedral representations
R-Stream model: polyhedra

n = f();
for (i=5; i<=n; i+=2) {
  A[i][i] = A[i][i]/B[i];
  for (j=0; j<=i; j++) {
    if (j<=10) {
      … A[i+2j+n][i+3] …
    }
  }
}

Iteration domain of the inner statement:
  { (i,j) ∈ Z² | ∃k ∈ Z : 5 ≤ i ≤ n; 0 ≤ j ≤ i; j ≤ 10; i = 2k+1 }

Access function for A[i+2j+n][i+3], affine in (i, j, n, 1):
  A0 = (1 2 1 0)  →  i + 2j + n
  A1 = (1 0 0 3)  →  i + 3

Affine and non-affine transformations
Order and place of operations and data

Loop code is represented (exactly or conservatively) with polyhedra:
  – a high-level, mathematical view of a mapping
  – but targeting concrete properties: parallelism, locality, memory footprint
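Written out, that domain is the integer solution set of an affine system (a
reconstruction for illustration; the exact internal encoding of the parity
constraint may differ):

  i − 5 ≥ 0;  n − i ≥ 0;  j ≥ 0;  i − j ≥ 0;  10 − j ≥ 0;  i − 2k − 1 = 0  (k existential)

Each constraint is one row of a matrix over (i, j, n, k, 1), which is the
form the polyhedral operations consume.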
Polyhedral slogans


        • Parametric imperfect loop nests

        • Subsumes classical transformations

        • Compacts the transformation search space

        • Parallelization, locality optimization (communication avoiding)

        • Preserves semantics

        • Analytic joint formulations of optimizations

        • Not just for affine static control programs
Polyhedral model – challenges in building a compiler


        • Killer math

        • Scalability of optimizations / code generation

        • Mostly confined to dependence-preserving transformations

        • Code can be radically transformed – outputs can look wildly different

        • Modeling indirections, pointers, non-affine code

        • Many of these challenges are solved
R-Stream blueprint




[Blueprint diagram: the EDG C front end parses into a scalar representation;
"raising" lifts it into the polyhedral mapper, which is driven by the machine
model; "lowering" returns the result to the scalar representation, which the
pretty printer emits.]
Inside the polyhedral mapper




[Diagram: the GDG representation feeds a tactics module, which orchestrates
the optimization modules – parallelization / locality optimization, tiling,
placement, communication generation, memory promotion, synchronization
generation, layout optimization, polyhedral scanning, … – on top of Jolylib
and other libraries.]
Inside the polyhedral mapper
Optimization modules are engineered to expose "knobs" that can be used by an auto-tuner.



Driving the mapping: the machine model


• Target machine characteristics that influence how the mapping should be done:
  – Local memory / cache sizes
  – Communication facilities: DMA, cache(s)
  – Synchronization capabilities
  – Symmetrical or not
  – SIMD width
  – Bandwidths

• Currently a two-level model (host and accelerators)
• Described by an XML schema, with graphical rendering (a rough sketch of the
  kind of content it captures follows)
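Purely as an illustration of that content (hypothetical field names in C, not
the actual XML schema):

    /* Hypothetical sketch of machine-model content; the actual R-Stream
       model is a two-level (host + accelerators) XML description. */
    typedef struct {
        long   local_memory_bytes;   /* scratchpad / cache size          */
        int    simd_width;           /* SIMD lanes                       */
        int    has_dma;              /* DMA engines vs. cache hierarchy  */
        int    symmetric;            /* symmetrical processors or not    */
        double bandwidth_gbs;        /* bandwidth to the next level      */
    } AcceleratorModel;

    typedef struct {
        AcceleratorModel accel[8];   /* accelerator level                */
        int              num_accels; /* host level holds the array       */
    } MachineModel;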



Machine model example: multi-Tesla


[Machine model diagram: a host driving multiple Tesla GPUs, one thread per
GPU. The XML machine-model file drives both the OpenMP morph (host side) and
the CUDA morph (GPU side).]
Mapping process

[Diagram: the mapping process, starting from dependencies.]

 1. Scheduling: parallelism, locality, tilability
 2. Task formation: coarse-grain atomic tasks; master/slave side operations
 3. Placement: assign tasks to blocks/threads

 Followed by:
   – local / global data layout optimization
   – multi-buffering (explicitly managed)
   – synchronization (barriers)
   – bulk communications
   – thread generation → master/slave
   – CUDA-specific optimizations
Program Transformations Specification


(Recap: schedules map iterations to multi-dimensional time, placements map
iterations to multi-dimensional space, layouts map data elements to
multi-dimensional space; the mapping is hierarchical by design, with tiling
serving separation of concerns.)
Model for scheduling trades 3 objectives jointly


[Diagram: loop fission moves the mapping toward more parallelism and
sufficient occupancy; loop fusion moves it toward more locality and fewer
global memory accesses. Adding successive-thread contiguity on either side
enables memory coalescing, and hence better effective bandwidth.]

                                                                Patent pending
Optimization with BLAS vs. global optimization


 /* Optimization with BLAS */
 for loop {                // outer loop(s) stay sequential
   …
   BLAS call 1             // retrieves data Z, stores Z back
   …
   BLAS call 2             // retrieves Z again! (numerous cache misses)
   …
   BLAS call n
   …
 }

 /* Global optimization */
 doall loop {              // can parallelize outer loop(s)
   …
   for loop {
     …
     [read from Z]
     …
     [write to Z]          // loop fusion can improve locality
     …
     [read from Z]
   }
   …
 }

 Global optimization can expose better parallelism and locality.
Tradeoffs between parallelism and locality

        • Significant parallelism is needed to fully utilize all resources
        • Locality is also critical, to minimize communication
        • Parallelism can come at the expense of locality
          – bandwidth is limited at the chip border, while on-chip parallelism
            is high
          – reusing data once it is loaded on chip = locality

        • Our approach: the R-Stream compiler exposes parallelism via affine
          scheduling that simultaneously improves locality using loop fusion
Parallelism/locality tradeoff example
 /*
  * Original code:
  * Simplified CSLC LMS
  */
 for (k=0; k<400; k++) {
   for (i=0; i<3997; i++) {
     z[i]=0;
     for (j=0; j<4000; j++)
       z[i]=z[i]+B[i][j]*x[k][j];
   }
   for (i=0; i<3997; i++)
     w[i]=w[i]+z[i];
 }

 Maximum parallelism (no fusion): array z is expanded (z_e) to introduce
 another level of parallelism, but maximum distribution destroys locality:

 doall (i=0; i<400; i++)
   doall (j=0; j<3997; j++)
     z_e[j][i]=0;
 doall (i=0; i<400; i++)
   doall (j=0; j<3997; j++)
     for (k=0; k<4000; k++)
       z_e[j][i]=z_e[j][i]+B[j][k]*x[i][k];
 doall (i=0; i<3997; i++)
   for (j=0; j<400; j++)
     w[i]=w[i]+z_e[i][j];
 doall (i=0; i<3997; i++)   /* data accumulation */
   z[i]=z_e[i][399];

 2 levels of parallelism, but poor data reuse (on array z_e)
Parallelism/locality tradeoff example (cont.)


 /*
  * Original code:
  * Simplified CSLC LMS
  */
 for (k=0; k<400; k++) {
   for (i=0; i<3997; i++) {
     z[i]=0;
     for (j=0; j<4000; j++)
       z[i]=z[i]+B[i][j]*x[k][j];
   }
   for (i=0; i<3997; i++)
     w[i]=w[i]+z[i];
 }

 Maximum fusion: aggressive loop fusion destroys parallelism (only 1 degree
 of parallelism remains):

 doall (i=0; i<3997; i++)
   for (j=0; j<400; j++) {
     z[i]=0;
     for (k=0; k<4000; k++)
       z[i]=z[i]+B[i][k]*x[j][k];
     w[i]=w[i]+z[i];
   }

 Very good data reuse (on array z), but only 1 level of parallelism
Parallelism/locality tradeoff example (cont.)

 /*
  * Original code:
  * Simplified CSLC LMS
  */
 for (k=0; k<400; k++) {
   for (i=0; i<3997; i++) {
     z[i]=0;
     for (j=0; j<4000; j++)
       z[i]=z[i]+B[i][j]*x[k][j];
   }
   for (i=0; i<3997; i++)
     w[i]=w[i]+z[i];
 }

 Parallelism with partial fusion: array z is expanded (z_e), and partial
 fusion doesn't decrease parallelism:

 doall (i=0; i<3997; i++) {
   doall (j=0; j<400; j++) {
     z_e[i][j]=0;
     for (k=0; k<4000; k++)
       z_e[i][j]=z_e[i][j]+B[i][k]*x[j][k];
   }
   for (j=0; j<400; j++)
     w[i]=w[i]+z_e[i][j];
 }
 doall (i=0; i<3997; i++)   /* data accumulation */
   z[i]=z_e[i][399];

 2 levels of parallelism with good data reuse (on array z_e)
Parallelism/locality tradeoffs: performance numbers

[Performance chart.]

 Code with a good balance between parallelism and fusion performs best;
 on explicitly managed memory / scratchpad architectures this is even more true.
R-Stream: affine scheduling and fusion

    • R-Stream uses a heuristic based on an objective function with two kinds
      of cost coefficients:
      – w_l: the slowdown in execution if loop l is executed sequentially
        rather than in parallel
      – u_e: the cost in performance if the two loops joined by edge e remain
        unfused rather than fused

            minimize   Σ_{l ∈ loops} w_l · p_l  +  Σ_{e ∈ loop edges} u_e · f_e

      (p_l = 1 when loop l ends up sequential; f_e = 1 when edge e ends up
      unfused)

    • These two cost coefficients address parallelism and locality in a
      unified and unbiased manner (as opposed to traditional compilers)
    • Fine-grained parallelism, such as SIMD, can also be modeled with a
      similar formulation
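      As a simplified two-loop sanity check (an illustration, not from the
      deck): for a single edge e between loops l1 and l2, suppose fusing them
      forces l2 to run sequentially. The fused mapping then costs w_l2
      (p_l2 = 1, f_e = 0), while the unfused one costs u_e (p_l2 = 0, f_e = 1),
      so the scheduler fuses exactly when u_e > w_l2: the modeled locality
      benefit must outweigh the lost parallelism.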

                                                                                       Patent Pending
Parallelism + locality + spatial locality


 Hypothesis: auto-tuning should adjust these parameters.

            Σ_{l ∈ loops} w_l · p_l  +  Σ_{e ∈ loop edges} u_e · f_e

      – the w_l · p_l terms capture the benefits of parallel execution
      – the u_e · f_e terms capture the benefits of improved locality

 A new algorithm (unpublished) balances contiguity to enhance coalescing for
 GPUs and SIMDization, modulo data-layout transformations.


Outline


        • R-Stream Overview

        • Compilation Walk-through

        • Performance Results

        • Getting R-Stream
What R-Stream does for you – in a nutshell

     • Input
       – Sequential: short and simple textbook C code
       – Just add a "#pragma map" and R-Stream figures out the rest
What R-Stream does for you – in a nutshell

     • Input
       – Sequential: short and simple textbook C code
       – Just add a "#pragma map" and R-Stream figures out the rest
     • Output (produced by R-Stream)
       – OpenMP + CUDA code: hundreds of lines of tightly optimized GPU-side
         CUDA code, plus a few lines of host-side OpenMP C code

     • Example: the Gauss-Seidel 9-point stencil (a sketch of such an input
       follows)
       – Used in iterative PDE solvers – scientific modeling (heat, fluid
         flow, waves, etc.)
       – A building block for faster iterative solvers like Multigrid or AMR
       – Very difficult to hand-optimize; not available in any standard library
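     Roughly, the sequential input has this shape (a hedged sketch: function
     name, bounds, and coefficients are illustrative, not the benchmark
     source; only the `#pragma rstream map` idiom is taken from the deck's
     RTM example later on):

        #define N 1024   /* array extent (illustrative) */
        #define T 100    /* number of sweeps (illustrative) */

        #pragma rstream map
        void gs9(double (*A)[N], int pX, int pY) {
          int t, i, j;
          /* Gauss-Seidel 9-point stencil: sweeps update A in place, so each
             point reads already-updated west/north neighbors (weights are
             illustrative). */
          for (t = 0; t < T; t++)
            for (i = 1; i < pX - 1; i++)
              for (j = 1; j < pY - 1; j++)
                A[i][j] = 0.2 * A[i][j]
                        + 0.1 * (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1]
                               + A[i][j-1]   + A[i][j+1]
                               + A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]);
        }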



What R-Stream does for you – in a nutshell

     • Input
       – Sequential: short and simple textbook C code
       – Just add a "#pragma map" and R-Stream figures out the rest
     • Output (produced by R-Stream)
       – OpenMP + CUDA code: hundreds of lines of tightly optimized GPU-side
         CUDA code, plus a few lines of host-side OpenMP C code

     • Achieving up to (illustrated in the next few slides):
       – 20 GFLOPS on a GTX 285
       – 25 GFLOPS on a GTX 480
Finding and utilizing available parallelism

[GPU diagram: SMs containing SPs with registers and shared memory, an
instruction unit, constant and texture caches, and off-chip device memory
(global, constant, texture).]

Excerpt of automatically generated code: R-Stream AUTOMATICALLY finds and
forms parallelism, extracting and mapping parallel loops.
Memory compaction on GPU scratchpad


[Same GPU diagram, highlighting the on-chip shared memory.]

Excerpt of automatically generated code: R-Stream AUTOMATICALLY manages the
local scratchpad.
GPU DRAM to scratchpad coalesced communication

[Same GPU diagram, highlighting off-chip device memory.]

Excerpt of automatically generated code: R-Stream AUTOMATICALLY chooses
parallelism to favor coalescing – coalesced GPU DRAM accesses.
Host-to-GPU communication

[Diagram: CPU and host memory connected to the GPU over PCI Express.]

Excerpt of automatically generated code: R-Stream AUTOMATICALLY chooses the
partition and sets up host-to-GPU communication.
Multi-GPU mapping
[Diagram: multiple CPUs/host memories, each driving its own GPU and GPU
memory; the mapping spans all GPUs.]

Excerpt of automatically generated code: R-Stream AUTOMATICALLY finds
another level of parallelism, across GPUs.
Multi-GPU mapping
[Diagram: the same multi-GPU system, with multi-streaming of host-GPU
communication.]

Excerpt of automatically generated code: R-Stream AUTOMATICALLY creates
n-way software pipelines for communications. The host-side idiom is
sketched below.
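As a rough illustration of that idiom (a hand-written sketch, not R-Stream
output; run_on_all_gpus, the chunk sizes, and the elided kernel are
placeholders):

    #include <omp.h>
    #include <cuda_runtime.h>

    /* One OpenMP thread per GPU; per-GPU CUDA streams let host<->device
       copies overlap kernel execution, i.e., a software pipeline for
       communications. */
    void run_on_all_gpus(const float *host_data, size_t chunk_floats, int nchunks) {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);
    #pragma omp parallel num_threads(ngpus)
        {
            int g = omp_get_thread_num();
            cudaSetDevice(g);                  /* bind this thread to GPU g */
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            float *dev;
            cudaMalloc((void **)&dev, chunk_floats * sizeof(float));
            for (int c = g; c < nchunks; c += ngpus) {
                /* Asynchronous copy-in; the kernel launch and copy-out would
                   be enqueued on the same stream. NOTE: true multi-buffering
                   would rotate between two or more device buffers; a single
                   buffer is used here for brevity. */
                cudaMemcpyAsync(dev, host_data + (size_t)c * chunk_floats,
                                chunk_floats * sizeof(float),
                                cudaMemcpyHostToDevice, stream);
                /* ... launch kernel on `stream`, then copy results back ... */
            }
            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
            cudaFree(dev);
        }
    }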
Future capabilities – mapping to CPU-GPU clusters

[Diagram: CPU + GPU nodes connected by a high-speed interconnect (e.g.,
InfiniBand). The program becomes one MPI process per node; each MPI process
is an OpenMP process launching CUDA, with node DRAM on the CPU side and
per-GPU DRAM on the device side.]
Outline


        • R-Stream Overview

        • Compilation Walk-through

        • Performance Results

        • Getting R-Stream
Experimental evaluation

   Configuration 1 (MKL):                  radar code + MKL calls
   Configuration 2 (low-level compilers):  radar code → GCC / ICC
   Configuration 3 (R-Stream):             radar code → R-Stream →
                                           optimized radar code → GCC / ICC

   • Main comparisons:
     – R-Stream High-Level C Transformation Tool 3.1.2
     – Intel MKL 10.2.1


Experimental evaluation (cont.)

  • Intel Xeon workstation:
    – dual quad-core E5405 Xeon processors (8 cores total)
    – 9 GB memory
  • 8 OpenMP threads
  • Single-precision floating-point data
  • Low-level compilers and the flags used:
    – GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
    – ICC: -fast -openmp




Radar benchmarks


     • Beamforming algorithms:
       – MVDR-SER: Minimum Variance Distortionless Response using Sequential
         Regression
       – CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean Square
       – CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least Square
     • Expressed in sequential ANSI C
     • 400 radar iterations
     • 3 radar sidelobes computed (for CSLC-LMS and CSLC-RLS)




MVDR-SER




[Performance chart.]
CSLC-LMS




[Performance chart.]
CSLC-RLS




[Performance chart.]
3D Discretized wave equation input code (RTM)
#pragma rstream map
void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
            int pX, int pY, int pZ) {
  double temp;
  int i, j, k;

  for (k=4; k<pZ-4; k++) {
    for (j=4; j<pY-4; j++) {
      for (i=4; i<pX-4; i++) {
        /* 25-point, 8th-order (in space) stencil */
        temp = C0 * U2[k][j][i] +
           C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                 U2[k][j-1][i] + U2[k][j+1][i] +
                 U2[k][j][i-1] + U2[k][j][i+1]) +
           C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                 U2[k][j-2][i] + U2[k][j+2][i] +
                 U2[k][j][i-2] + U2[k][j][i+2]) +
           C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                 U2[k][j-3][i] + U2[k][j+3][i] +
                 U2[k][j][i-3] + U2[k][j][i+3]) +
           C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                 U2[k][j-4][i] + U2[k][j+4][i] +
                 U2[k][j][i-4] + U2[k][j][i+4]);

        U1[k][j][i] =
          2.0f * U2[k][j][i] - U1[k][j][i] +
          V[k][j][i] * temp;
} } } }

3D Discretized wave equation input code (RTM)




[Excerpt of the automatically generated CUDA – not so naïve:
   • communication autotuning knobs
   • ThreadIdx.x divergence is expensive]
Outline


        • R-Stream Overview

        • Compilation Walk-through

        • Performance Results

        • Getting R-Stream
Current status


        • Ongoing development, also supported by DOE and Reservoir
          – improvements in scope, stability, performance

        • Installations/evaluations at US government laboratories

        • Forward collaboration with Georgia Tech on Keeneland
          – HP SL390: 3 Fermi GPUs, 2 Westmeres per node

        • Basis of the compiler for the DARPA UHPC Intel Corporation team
Availability



        • Per-developer-seat licensing model

        • Support (releases, bug fixes, services)

        • Available with commercial-grade external solvers

        • Government has limited rights / SBIR data rights

        • Academic source licenses with collaborators

        • Professional team, continuity, software engineering
R-Stream Gives More Insight About Programs


        • Teaches you about parallelism and the polyhedral model:
          – generates correct code for any transformation
            (the transformation may be incorrect if specified by the user)
          – imperative code generation was the bottleneck until 2005
            · 3 theses have been written on the topic, and it's still not
              completely covered
            · it is good to have an intuition of how these things work
        • R-Stream has meaningful metrics to represent your program:
          – maximal amount of parallelism given minimal expansion
          – tradeoffs between coarse-grained and fine-grained parallelism
          – loop types (doall, red, perm, seq)
        • Helps with algorithm selection
        • Tiling of imperfectly nested loops
        • Generates code for explicitly managed memory
How to Use R-Stream Successfully


        • R-Stream is a great transformation tool, and also a great learning tool:
          – takes "simple C" input code
          – applies multiple transformations and lists them explicitly
          – can dump code at any step in the process
          – it's the tool I wish I had during my PhD
            (to be fair, I already had a great set of tools)
        • R-Stream can be used in multiple modes:
          – fully automatic + compile-flag options
          – autotuning mode (more than just tile sizes and unrolling …)
          – scripting / programmatic mode (BeanShell + interfaces)
          – a mix of these modes + manual post-processing
Use Case: Fully Automatic Mode + Compile Flag Options


        • Akin to traditional gcc / icc compilation with flags:
          – predefined transformations can be parameterized
            (except with fewer / no phase-ordering issues)
          – except you can see what each transformation does at each step
          – except you can generate compilable and executable code at (almost)
            any step in the process
          – except you can control code generation for compactness or
            performance
Use Case: Autotuning Mode


        • More advanced than traditional approaches:
          – knobs go far beyond loop unrolling + unroll-and-jam + tiling
          – knobs are based on well-understood models
          – knobs target high-level properties of the program
            · amount of parallelism, amount of memory expansion, depth of
              pipelining of communications and computations …
          – knobs depend on the target machine, the program, and the state of
            the mapping process: the tool has introspection
        • We are really building a hierarchical autotuning transformation tool
Use Case: “Power user” interactive interface


        • BeanShell access to optimizations

        • Can direct and review the process of compilation
          – automatic tactics (affine scheduling)

        • Can direct code generation

        • Access to "tuning parameters"

        • All options / commands are available on the command-line interface
Conclusion

       • R-Stream simplifies software development and maintenance

       • Porting: reduces expense and delivery delays

       • Does this by automatically parallelizing loop code
         – while optimizing for data locality, coalescing, etc.

       • Addresses dense, loop-intensive computations

       • Extensions:
         – data-parallel programming idioms
         – sparse representations
         – dynamic runtime execution
Contact us


        • Per-developer seat, floating, and cloud-based licensing

        • Discounts for academic users

        • Research collaborations with academic partners

        • For more information:
          – call us at 212-780-0527, or
          – see Rich or Ann, or
          – e-mail us at {sales,lethin,johnson}@reservoir.com

More Related Content

PDF
Embedded System Microcontroller Interactive Course using BASCOM-AVR - Lecture3
AL-AWAIL for Electronic Engineering
 
PPTX
Ee600 lab3 hal9000_grp
Loren Schwappach
 
PDF
Evidence Of Bimodal Crystallite Size Distribution In Microcrystalline Silico...
Sanjay Ram
 
PDF
C++ material
vamshi batchu
 
PDF
Yang greenstein part_2
Obsidian Software
 
PDF
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
PDF
Presentation hawgscuff fall2012
markcenter
 
PDF
CWCAS X-ISCKER Poster
Jose Pinilla
 
Embedded System Microcontroller Interactive Course using BASCOM-AVR - Lecture3
AL-AWAIL for Electronic Engineering
 
Ee600 lab3 hal9000_grp
Loren Schwappach
 
Evidence Of Bimodal Crystallite Size Distribution In Microcrystalline Silico...
Sanjay Ram
 
C++ material
vamshi batchu
 
Yang greenstein part_2
Obsidian Software
 
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
Presentation hawgscuff fall2012
markcenter
 
CWCAS X-ISCKER Poster
Jose Pinilla
 

What's hot (6)

PPTX
study Domain Transform for Edge-Aware Image and Video Processing
Chiamin Hsu
 
PPTX
GQSAR presentation
VLife Sciences Tech. Pvt. Ltd.
 
PDF
Cascading[1]
btomasette
 
PDF
Consulting design presentation
Kevin Lomeli
 
PDF
200081003 Friday Food@IBBT
imec.archive
 
PDF
A fast implementation of matrix-matrix product in double-double precision on ...
Maho Nakata
 
study Domain Transform for Edge-Aware Image and Video Processing
Chiamin Hsu
 
GQSAR presentation
VLife Sciences Tech. Pvt. Ltd.
 
Cascading[1]
btomasette
 
Consulting design presentation
Kevin Lomeli
 
200081003 Friday Food@IBBT
imec.archive
 
A fast implementation of matrix-matrix product in double-double precision on ...
Maho Nakata
 
Ad

Viewers also liked (14)

PDF
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
npinto
 
PDF
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
npinto
 
PDF
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
npinto
 
PDF
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
npinto
 
PDF
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
npinto
 
PDF
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
npinto
 
PDF
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
npinto
 
PDF
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
npinto
 
PDF
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
npinto
 
PDF
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
npinto
 
PDF
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
npinto
 
PDF
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
npinto
 
PDF
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
npinto
 
PDF
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
npinto
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
npinto
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
npinto
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
npinto
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
npinto
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
npinto
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
npinto
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
npinto
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
npinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
npinto
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
npinto
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
npinto
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
npinto
 
Ad

Similar to [Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs) (20)

KEY
A compiler approach_to_fast_hardware_design_exploration_in_fpga-based-systems
Takefumi MIYOSHI
 
PDF
High-Level Synthesis with GAUT
AdaCore
 
PDF
Simulation Informatics
David Gleich
 
PDF
Architectures for parallel
Sanjivani Sontakke
 
PPTX
SmB café 13 sep '12 - Compaan Design
Christiaan van Gorkum
 
PDF
GPU programming
Roberto Bonvallet
 
PDF
MDE based FPGA physical Design Fast prototyping with Smalltalk
ESUG
 
PDF
libHPC: Software sustainability and reuse through metadata preservation
SoftwarePractice
 
PDF
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
npinto
 
PPT
Advanced computer architecture
Md. Mahedi Mahfuj
 
PDF
VAST-Tree, EDBT'12
Takeshi Yamamuro
 
PPTX
DAC 2012
FlexTiles Team
 
PPTX
Introduction To Parallel Computing
Jörn Dinkla
 
PDF
Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...
ICSM 2011
 
PDF
My Ph.D. Research
Po-Ting Wu
 
PDF
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
Shinya Takamaeda-Y
 
PDF
Diseño digital
Danny Mena Enamorado
 
PDF
A Methodology for the Emulation of Boolean Logic that Paved the Way for the S...
ricky_pi_tercios
 
PDF
Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures ─A...
Fahad Cheema
 
PDF
Designing Architecture-aware Library using Boost.Proto
Joel Falcou
 
A compiler approach_to_fast_hardware_design_exploration_in_fpga-based-systems
Takefumi MIYOSHI
 
High-Level Synthesis with GAUT
AdaCore
 
Simulation Informatics
David Gleich
 
Architectures for parallel
Sanjivani Sontakke
 
SmB café 13 sep '12 - Compaan Design
Christiaan van Gorkum
 
GPU programming
Roberto Bonvallet
 
MDE based FPGA physical Design Fast prototyping with Smalltalk
ESUG
 
libHPC: Software sustainability and reuse through metadata preservation
SoftwarePractice
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
npinto
 
Advanced computer architecture
Md. Mahedi Mahfuj
 
VAST-Tree, EDBT'12
Takeshi Yamamuro
 
DAC 2012
FlexTiles Team
 
Introduction To Parallel Computing
Jörn Dinkla
 
Components - Crossing the Boundaries while Analyzing Heterogeneous Component-...
ICSM 2011
 
My Ph.D. Research
Po-Ting Wu
 
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
Shinya Takamaeda-Y
 
Diseño digital
Danny Mena Enamorado
 
A Methodology for the Emulation of Boolean Logic that Paved the Way for the S...
ricky_pi_tercios
 
Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures ─A...
Fahad Cheema
 
Designing Architecture-aware Library using Boost.Proto
Joel Falcou
 

More from npinto (15)

PDF
"AI" for Blockchain Security (Case Study: Cosmos)
npinto
 
PDF
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
npinto
 
PDF
[Harvard CS264] 05 - Advanced-level CUDA Programming
npinto
 
PDF
[Harvard CS264] 04 - Intermediate-level CUDA Programming
npinto
 
PDF
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
npinto
 
PDF
[Harvard CS264] 01 - Introduction
npinto
 
PDF
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
npinto
 
PDF
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
npinto
 
PDF
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
npinto
 
PDF
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
npinto
 
PDF
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
npinto
 
PDF
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
npinto
 
PDF
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
npinto
 
PDF
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
npinto
 
PDF
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
npinto
 
"AI" for Blockchain Security (Case Study: Cosmos)
npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
npinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
npinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
npinto
 
[Harvard CS264] 01 - Introduction
npinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
npinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
npinto
 
MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 03: CUDA Basics #2 (Nicolas Pinto, MIT)
npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 02: CUDA Basics #1 (Nicolas Pinto, MIT)
npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NV...
npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 01: High-Throughput Scientific Computing (Hans...
npinto
 

[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Programming GPUs without Writing a Line of CUDA (Nicolas Vasilache, Reservoir Labs)

  • 6. Program Transformations Specification
    [Figure: iteration space of a statement S(i,j) on the (i,j) axes, mapped to time axes (t1,t2)]
    •• Schedule maps iterations to multi-dimensional time: Θ : Z² → Z²
    • A feasible schedule preserves dependences
    •• Placement maps iterations to multi-dimensional space:
    • UHPC in progress, partially done
    •• Layout maps data elements to multi-dimensional space:
    • UHPC in progress
    •• Hierarchical by design, tiling serves separation of concerns
    Reservoir Labs Harvard 04/12/2011 6
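    To make "feasible" precise, the standard polyhedral legality condition (stated here in our notation; the deck only asserts it informally) is:

    \[
    \Theta_T(\vec{x}_t) \;\succ\; \Theta_S(\vec{x}_s)
    \]

    for every dependence carrying a value from iteration $\vec{x}_s$ of statement $S$ to iteration $\vec{x}_t$ of statement $T$, where $\succ$ is the lexicographic order on multi-dimensional time stamps.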
  • 7. Loop transformations

    for(i=0; i<N; i++)
      for(j=0; j<N; j++)
        S(i,j);

    Unimodular schedules and the code they generate:

    permutation: Θ(i,j) = (j, i), matrix [[0 1], [1 0]]
      for(j=0; j<N; j++)
        for(i=0; i<N; i++)
          S(i,j);

    reversal: Θ(i,j) = (−i, j), matrix [[−1 0], [0 1]]
      for(i=N-1; i>=0; i--)
        for(j=0; j<N; j++)
          S(i,j);

    skewing: Θ(i,j) = (i, j + α·i), matrix [[1 0], [α 1]]
      for(i=0; i<N; i++)
        for(j=α*i; j<N+α*i; j++)
          S(i, j-α*i);

    scaling: Θ(i,j) = (β·i, j), matrix [[β 0], [0 1]]
      for(i=0; i<β*N; i+=β)
        for(j=0; j<N; j++)
          S(i/β, j);

    Reservoir Labs Harvard 04/12/2011 7
  • 8. Loop fusion and distribution

    for(i=0; i<N; i++) {            fusion          for(i=0; i<N; i++)
      for(j=0; j<N; j++)           ------->           for(j=0; j<N; j++) {
        S1(i,j);                   <-------             S1(i,j);
      for(j=0; j<N; j++)         distribution            S2(i,j);
        S2(i,j);                                       }
    }

    fusion:        Θ1(i,j) = (0, i, 0, j, 0)   Θ2(i,j) = (0, i, 0, j, 1)
    distribution:  Θ1(i,j) = (0, i, 0, j, 0)   Θ2(i,j) = (0, i, 1, j, 0)

    Reservoir Labs Harvard 04/12/2011 8
  • 9. Enabling technology is new compiler math
    Uniform Recurrence Equations [Karp et al. 1970]
    Loop Transformations and Parallelization [1970-]
    • Many: Lamport, Allen/Kennedy, Banerjee, Irigoin, Wolfe/Lam, Pugh, Pingali, etc.
    • Vectorization, SMP, locality optimizations
    • Dependence summary: direction/distance vectors
    • Unimodular transformations
    • Systolic array mapping
    • Mostly linear-algebraic
    Polyhedral Model [1980-]
    • Many: Feautrier, Darte, Vivien, Wilde, Rajopadhye, etc.
    • Exact dependence analysis
    • General affine transformations
    • Loop synthesis via polyhedral scanning
    • New computational techniques based on polyhedral representations
    Reservoir Labs Harvard 04/12/2011 9
  • 10. R-Stream model: polyhedra

    n = f();
    for (i=5; i<=n; i+=2) {
      A[i][i] = A[i][i]/B[i];
      for (j=0; j<=i; j++) {
        if (j<=10) {
          …… A[i+2*j+n][i+3] ……
        }
      }
    }

    Iteration domain: { (i,j) ∈ Z² | ∃k ∈ Z : 5 ≤ i ≤ n, 0 ≤ j ≤ i, j ≤ 10, i = 2k+1 }
    Access function for A[i+2*j+n][i+3], applied to (i, j, n, 1):
      A0 = (1 2 1 0)·(i, j, n, 1) = i + 2j + n
      A1 = (1 0 0 3)·(i, j, n, 1) = i + 3

    • Affine and non-affine transformations
    • Order and place of operations and data
    • Loop code represented (exactly or conservatively) with polyhedrons
    • High-level, mathematical view of a mapping
    • But targets concrete properties: parallelism, locality, memory footprint
    Reservoir Labs Harvard 04/12/2011 10
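    A second, minimal illustration of the representation (our example, not from the deck): a triangular loop nest and the polyhedron it defines.

    /* Iterations of S form the polyhedron
     *   D = { (i,j) in Z^2 | 0 <= i < N, 0 <= j <= i },
     * and the access C[i][j] has the affine access function
     * (i,j,N,1) -> (i,j), i.e. rows (1 0 0 0) and (0 1 0 0). */
    void lower_triangle(int N, double C[N][N]) {
      for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++)
          C[i][j] = 0.5 * (C[i][j] + C[j][i]);   /* S(i,j) */
    }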
  • 11. Polyhedral slogans
    •• Parametric imperfect loop nests
    •• Subsumes classical transformations
    •• Compacts the transformation search space
    •• Parallelization, locality optimization (communication avoiding)
    •• Preserves semantics
    •• Analytic joint formulations of optimizations
    •• Not just for affine static control programs
    Reservoir Labs Harvard 04/12/2011 11
  • 12. Polyhedral model – challenges in building a compiler
    •• Killer math
    •• Scalability of optimizations / code generation
    •• Mostly confined to dependence-preserving transformations
    •• Code can be radically transformed – outputs can look wildly different
    •• Modeling indirections, pointers, non-affine code
    •• Many of these challenges are solved
    Reservoir Labs Harvard 04/12/2011 12
  • 13. R-Stream blueprint
    [Figure: compilation pipeline; the EDG C Front End feeds a Scalar Representation, which is Raised into the Polyhedral Mapper (driven by a Machine Model) and Lowered back for the Pretty Printer]
    Reservoir Labs Harvard 04/12/2011 13
  • 14. Inside the polyhedral mapper
    [Figure: the mapper operates on a GDG representation, driven by a Tactics Module, through optimization modules: Parallelization, Tiling, Placement, Communication Generation, Locality Optimization, Memory Promotion, Sync Generation, Layout Optimization, and Polyhedral Scanning (Jolylib, ……)]
    Reservoir Labs Harvard 04/12/2011 14
  • 15. Inside the polyhedral mapper
    Optimization modules are engineered to expose "knobs" that can be used by an auto-tuner.
    [Figure: same mapper diagram as the previous slide]
    Reservoir Labs Harvard 04/12/2011 15
  • 16. Driving the mapping: the machine model
    •• Target machine characteristics that have an influence on how the mapping should be done
    • Local memory / cache sizes
    • Communication facilities: DMA, cache(s)
    • Synchronization capabilities
    • Symmetrical or not
    • SIMD width
    • Bandwidths
    •• Currently: two-level model (Host and Accelerators)
    •• XML schema and graphical rendering
    Reservoir Labs Harvard 04/12/2011 16
  • 17. Machine model example: multi-Tesla
    [Figure: machine model rendered from an XML file; a Host (OpenMP morph) runs 1 thread per GPU, each driving a CUDA morph]
    Reservoir Labs Harvard 04/12/2011 17
  • 18. Mapping process
    1- Scheduling: parallelism, locality, tilability
    2- Task formation (from the dependencies):
    – Coarse-grain atomic tasks
    – Master/slave side operations
    3- Placement: assign tasks to blocks/threads
    – Local / global data layout optimization
    – Multi-buffering (explicitly managed)
    – Synchronization (barriers)
    – Bulk communications
    – Thread generation -> master/slave
    – CUDA-specific optimizations
    Reservoir Labs Harvard 04/12/2011 18
  • 19. Program Transformations Specification (recap of slide 6)
    [Figure: iteration space of a statement S(i,j) on the (i,j) axes, mapped to time axes (t1,t2)]
    •• Schedule maps iterations to multi-dimensional time: Θ : Z² → Z²
    • A feasible schedule preserves dependences
    •• Placement maps iterations to multi-dimensional space:
    • UHPC in progress, partially done
    •• Layout maps data elements to multi-dimensional space:
    • UHPC in progress
    •• Hierarchical by design, tiling serves separation of concerns
    Reservoir Labs Harvard 04/12/2011 19
  • 20. Model for scheduling trades 3 objectives jointly
    [Figure: Loop Fission vs. Loop Fusion trade off Fewer Global Memory Accesses, More Locality, More Parallelism, and Sufficient Occupancy; adding successive-thread contiguity yields Memory Coalescing and thus Better Effective Bandwidth]
    Patent pending
    Reservoir Labs Harvard 04/12/2011 20
  • 21. Optimization with BLAS vs. global optimization

    /* Optimization with BLAS */
    for loop {                        // outer loop(s)
      ……
      BLAS call 1   [write to Z]      // store data Z back to disk
      ……                              // retrieve data Z from disk !!!
      BLAS call 2   [read from Z]     // numerous cache misses
      ……
      BLAS call n   [read from Z]
    }

    /* Global optimization */
    doall loop {                      // can parallelize outer loop(s)
      for loop {
        …… [write to Z]
        …… [read from Z]              // loop fusion can improve locality
      }
      ……
    }

    Global optimization can expose better parallelism and locality
    Reservoir Labs Harvard 04/12/2011 21
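    A minimal C sketch of the contrast (our illustration; the two functions stand in for the staged BLAS calls and the globally optimized loop, and are not R-Stream or MKL code):

    /* Staged: two library-style passes; the intermediate z is written
     * in full by the first pass and re-read by the second, so z makes
     * a round trip through memory between the "calls". */
    void staged(int n, const double *a, const double *b, double *z, double *w) {
      for (int i = 0; i < n; i++)     /* "BLAS call 1": z = a + b */
        z[i] = a[i] + b[i];
      for (int i = 0; i < n; i++)     /* "BLAS call 2": w += z */
        w[i] += z[i];
    }

    /* Globally optimized: the two passes fused; each z value is consumed
     * while still in a register, and the i loop is a doall. */
    void fused(int n, const double *a, const double *b, double *w) {
      for (int i = 0; i < n; i++) {
        double zi = a[i] + b[i];
        w[i] += zi;
      }
    }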
  • 22. Tradeoffs between parallelism and locality
    • Significant parallelism is needed to fully utilize all resources
    • Locality is also critical to minimize communication
    • Parallelism can come at the expense of locality
    [Figure: high on-chip parallelism vs. limited bandwidth at the chip border; reusing data once loaded on chip = locality]
    •• Our approach: the R-Stream compiler exposes parallelism via affine scheduling that simultaneously augments locality using loop fusion
    Reservoir Labs Harvard 04/12/2011 22
  • 23. Parallelism/locality tradeoff example

    /* Original code: simplified CSLC LMS */
    for (k=0; k<400; k++) {
      for (i=0; i<3997; i++) {
        z[i] = 0;
        for (j=0; j<4000; j++)
          z[i] = z[i] + B[i][j]*x[k][j];
      }
      for (i=0; i<3997; i++)
        w[i] = w[i] + z[i];
    }

    Max. parallelism (no fusion): array z gets expanded (z_e) to introduce
    another level of parallelism; maximum distribution destroys locality:
    doall (i=0; i<400; i++)
      doall (j=0; j<3997; j++)
        z_e[j][i] = 0;
    doall (i=0; i<400; i++)
      doall (j=0; j<3997; j++)
        for (k=0; k<4000; k++)
          z_e[j][i] = z_e[j][i] + B[j][k]*x[i][k];
    doall (i=0; i<3997; i++)
      for (j=0; j<400; j++)          /* data accumulation */
        w[i] = w[i] + z_e[i][j];
    doall (i=0; i<3997; i++)
      z[i] = z_e[i][399];

    2 levels of parallelism, but poor data reuse (on array z_e)
    Reservoir Labs Harvard 04/12/2011 23
  • 24. Parallelism/locality tradeoff example (cont.)

    Max. fusion: aggressive loop fusion destroys parallelism
    (i.e., only 1 degree of parallelism):
    for (i=0; i<3997; i++) {
      for (j=0; j<400; j++) {
        z[i] = 0;
        for (k=0; k<4000; k++)
          z[i] = z[i] + B[i][k]*x[j][k];
        w[i] = w[i] + z[i];
      }
    }

    Very good data reuse (on array z), but only 1 level of parallelism
    Reservoir Labs Harvard 04/12/2011 24
  • 25. Parallelism/locality tradeoff example (cont.)

    Partial fusion doesn't decrease parallelism; array z is expanded (z_e):
    doall (i=0; i<3997; i++) {
      doall (j=0; j<400; j++) {
        z_e[i][j] = 0;
        for (k=0; k<4000; k++)
          z_e[i][j] = z_e[i][j] + B[i][k]*x[j][k];
      }
      for (j=0; j<400; j++)          /* data accumulation */
        w[i] = w[i] + z_e[i][j];
    }
    doall (i=0; i<3997; i++)
      z[i] = z_e[i][399];

    2 levels of parallelism with good data reuse (on array z_e)
    Reservoir Labs Harvard 04/12/2011 25
  • 26. Parallelism/locality tradeoffs: performance numbers
    Code with a good balance between parallelism and fusion performs best.
    In explicitly managed memory/scratchpad architectures this is even more true.
    [Performance charts omitted in the transcript]
    Reservoir Labs Harvard 04/12/2011 26
  • 27. R-Stream: affine scheduling and fusion
    •• R-Stream uses a heuristic based on an objective function with several cost coefficients:
    • p_l: slowdown in execution if a loop l is executed sequentially rather than in parallel
    • f_e: cost in performance if the two loops joined by edge e remain unfused rather than fused

    minimize  Σ_{l ∈ loops} w_l p_l  +  Σ_{e ∈ loop edges} u_e f_e
              (slowdown in sequential execution)  (cost of unfusing two loops)

    •• These two cost coefficients address parallelism and locality in a unified and unbiased manner (as opposed to traditional compilers)
    •• Fine-grained parallelism, such as SIMD, can also be modeled using a similar formulation
    Patent pending
    Reservoir Labs Harvard 04/12/2011 27
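    Written out, the slide's objective is (our transcription; reading $p_l$ and $f_e$ as 0/1 decision indicators is our assumption):

    \[
    \min \;\; \sum_{l \,\in\, \text{loops}} w_l\, p_l \;+\; \sum_{e \,\in\, \text{loop edges}} u_e\, f_e
    \]

    with $p_l = 1$ if loop $l$ ends up sequential rather than parallel, $f_e = 1$ if the two loops joined by edge $e$ remain unfused, and the weights $w_l$ (slowdown when sequential) and $u_e$ (performance cost of unfusing) derived from the program and the machine model.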
  • 28. Parallelism + locality + spatial locality
    Hypothesis: auto-tuning should adjust these parameters.

    minimize  Σ_{l ∈ loops} w_l p_l  +  Σ_{e ∈ loop edges} u_e f_e
              (w_l: benefits of parallel execution; u_e: benefits of improved locality)

    A new algorithm (unpublished) balances contiguity to enhance coalescing for GPUs and SIMDization modulo data-layout transformations.
    Reservoir Labs Harvard 04/12/2011 28
  • 29. Outline
    •• R-Stream Overview
    •• Compilation Walk-through
    •• Performance Results
    •• Getting R-Stream
    Reservoir Labs Harvard 04/12/2011 29
  • 30. What R-Stream does for you – in a nutshell
    •• Input
    • Sequential
    – Short and simple textbook C code
    – Just add a "#pragma map" and R-Stream figures out the rest
    Reservoir Labs Harvard 04/12/2011 30
  • 31. What R-Stream does for you – in a nutshell
    •• Input
    • Sequential
    – Short and simple textbook C code
    – Just add a "#pragma map" and R-Stream figures out the rest
    • Gauss-Seidel 9-point stencil
    – Used in iterative PDE solvers: scientific modeling (heat, fluid flow, waves, etc.)
    – Building block for faster iterative solvers like Multigrid or AMR
    – Very difficult to hand-optimize
    – Not available in any standard library
    •• Output
    • OpenMP + CUDA code
    – Hundreds of lines of tightly optimized GPU-side CUDA code
    – Few lines of host-side OpenMP C code
    Reservoir Labs Harvard 04/12/2011 31
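    A minimal sketch of the kind of textbook input meant here (our reconstruction: the deck shows the pragma as "#pragma map" on this slide and "#pragma rstream map" on slide 47; the bounds and the averaging coefficient are illustrative):

    #pragma rstream map
    void gauss_seidel_9pt(int N, int T, double A[N][N]) {
      /* T sweeps of an in-place 9-point relaxation: each interior point
       * becomes the average of itself and its 8 neighbors, reusing values
       * already updated in the current sweep (Gauss-Seidel ordering). */
      for (int t = 0; t < T; t++)
        for (int i = 1; i < N - 1; i++)
          for (int j = 1; j < N - 1; j++)
            A[i][j] = (A[i-1][j-1] + A[i-1][j] + A[i-1][j+1] +
                       A[i][j-1]   + A[i][j]   + A[i][j+1]   +
                       A[i+1][j-1] + A[i+1][j] + A[i+1][j+1]) / 9.0;
    }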
  • 32. What R-Stream does for you – in a nutshell
    •• Input: sequential, short and simple textbook C code; just add a "#pragma map" and R-Stream figures out the rest
    •• Output: OpenMP + CUDA code; hundreds of lines of tightly optimized GPU-side CUDA code, few lines of host-side OpenMP C code
    • Achieving up to
    – 20 GFLOPS on a GTX 285
    – 25 GFLOPS on a GTX 480
    (Will be illustrated in the next few slides)
    Reservoir Labs Harvard 04/12/2011 32
  • 33. Finding and utilizing available parallelism
    R-Stream AUTOMATICALLY finds and forms parallelism: extracting and mapping parallel loops.
    [Figure: excerpt of automatically generated code next to a GPU block diagram, with SMs 1..N, each holding shared memory, registers, SPs 1..M and an instruction unit, plus constant and texture caches and off-chip device memory (global, constant, texture)]
    Reservoir Labs Harvard 04/12/2011 33
  • 34. Memory compaction on GPU scratchpad
    R-Stream AUTOMATICALLY manages the local scratchpad.
    [Figure: excerpt of automatically generated code next to the same GPU block diagram]
    Reservoir Labs Harvard 04/12/2011 34
  • 35. GPU DRAM to scratchpad coalesced communication
    R-Stream AUTOMATICALLY chooses parallelism to favor coalescing (coalesced GPU DRAM accesses).
    [Figure: excerpt of automatically generated code next to the same GPU block diagram]
    Reservoir Labs Harvard 04/12/2011 35
  • 36. Host-to-GPU communication
    R-Stream AUTOMATICALLY chooses the partition and sets up host-to-GPU communication.
    [Figure: excerpt of automatically generated code; a CPU and host memory connected over PCI Express to the GPU and its off-chip device memory]
    Reservoir Labs Harvard 04/12/2011 36
  • 37. Multi-GPU mapping
    R-Stream AUTOMATICALLY finds another level of parallelism, across GPUs (mapping across all GPUs).
    [Figure: excerpt of automatically generated code; two CPUs with host memories, each driving multiple GPUs with their own GPU memories]
    Reservoir Labs Harvard 04/12/2011 37
  • 38. Multi-GPU mapping
    R-Stream AUTOMATICALLY creates n-way software pipelines for communications (multi-streaming of host-GPU communication).
    [Figure: same multi-GPU diagram as the previous slide]
    Reservoir Labs Harvard 04/12/2011 38
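    A minimal sketch of the kind of n-way pipeline meant here (our illustration using plain CUDA runtime calls, not R-Stream output; process_chunk is a hypothetical kernel, and host is assumed to point to pinned memory from cudaHostAlloc):

    #include <cuda_runtime.h>

    #define NSTREAMS 4

    __global__ void process_chunk(float *d, int n);   /* hypothetical kernel */

    /* Cycle chunks through NSTREAMS CUDA streams so that while one chunk
     * computes, the next chunk's host-to-GPU copy is already in flight. */
    void pipelined(float *host, float *dev[NSTREAMS], int nchunks, int chunk) {
      cudaStream_t s[NSTREAMS];
      for (int i = 0; i < NSTREAMS; i++) cudaStreamCreate(&s[i]);
      for (int c = 0; c < nchunks; c++) {
        int k = c % NSTREAMS;   /* operations within stream k serialize,
                                   so reusing dev[k] across chunks is safe */
        cudaMemcpyAsync(dev[k], host + (size_t)c * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        process_chunk<<<(chunk + 255) / 256, 256, 0, s[k]>>>(dev[k], chunk);
        cudaMemcpyAsync(host + (size_t)c * chunk, dev[k],
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
      }
      for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
      }
    }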
  • 39. Future capabilities – mapping to CPU-GPU clusters
    [Figure: a program mapped across CPU+GPU nodes over a high-speed interconnect (e.g. InfiniBand); on each node an MPI process runs an OpenMP process that launches CUDA on the GPUs, with DRAM at both the host and GPU levels]
    Reservoir Labs Harvard 04/12/2011 39
  • 40. Outline
    •• R-Stream Overview
    •• Compilation Walk-through
    •• Performance Results
    •• Getting R-Stream
    Reservoir Labs Harvard 04/12/2011 40
  • 41. Experimental evaluation
    Configuration 1: MKL; radar code with MKL calls, compiled with GCC / ICC
    Configuration 2: low-level compilers; radar code compiled directly with GCC / ICC
    Configuration 3: R-Stream; radar code mapped by R-Stream into optimized radar code, then compiled with GCC / ICC
    •• Main comparisons:
    • R-Stream High-Level C Transformation Tool 3.1.2
    • Intel MKL 10.2.1
    Reservoir Labs Harvard 04/12/2011 41
  • 42. Experimental evaluation (cont.)
    •• Intel Xeon workstation:
    • Dual quad-core E5405 Xeon processors (8 cores total)
    • 9 GB memory
    •• 8 OpenMP threads
    •• Single-precision floating-point data
    •• Low-level compilers and the flags used:
    • GCC: -O6 -fno-trapping-math -ftree-vectorize -msse3 -fopenmp
    • ICC: -fast -openmp
    Reservoir Labs Harvard 04/12/2011 42
  • 43. Radar benchmarks
    •• Beamforming algorithms:
    • MVDR-SER: Minimum Variance Distortionless Response using Sequential Regression
    • CSLC-LMS: Coherent Sidelobe Cancellation using Least Mean Square
    • CSLC-RLS: Coherent Sidelobe Cancellation using Robust Least Square
    •• Expressed in sequential ANSI C
    •• 400 radar iterations
    •• Compute 3 radar sidelobes (for CSLC-LMS and CSLC-RLS)
    Reservoir Labs Harvard 04/12/2011 43
  • 44. MVDR-SER
    [Performance chart omitted in the transcript]
    Reservoir Labs Harvard 04/12/2011 44
  • 45. CSLC-LMS
    [Performance chart omitted in the transcript]
    Reservoir Labs Harvard 04/12/2011 45
  • 46. CSLC-RLS
    [Performance chart omitted in the transcript]
    Reservoir Labs Harvard 04/12/2011 46
  • 47. 3D Discretized wave equation input code (RTM)

    #pragma rstream map
    void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                int pX, int pY, int pZ) {
      double temp;
      int i, j, k;
      for (k=4; k<pZ-4; k++) {
        for (j=4; j<pY-4; j++) {
          for (i=4; i<pX-4; i++) {
            temp = C0 * U2[k][j][i] +
                   C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                         U2[k][j-1][i] + U2[k][j+1][i] +
                         U2[k][j][i-1] + U2[k][j][i+1]) +
                   C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                         U2[k][j-2][i] + U2[k][j+2][i] +
                         U2[k][j][i-2] + U2[k][j][i+2]) +
                   C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                         U2[k][j-3][i] + U2[k][j+3][i] +
                         U2[k][j][i-3] + U2[k][j][i+3]) +
                   C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                         U2[k][j-4][i] + U2[k][j+4][i] +
                         U2[k][j][i-4] + U2[k][j][i+4]);
            U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] +
                          V[k][j][i] * temp;
    } } } }

    A 25-point, 8th-order (in space) stencil
    Reservoir Labs Harvard 04/12/2011 47
  • 48. 3D Discretized wave equation input code (RTM)
    [Figure: excerpt of the mapped code, annotated "Not so naïve …", "Communication autotuning knobs", and "ThreadIdx.x divergence is expensive"]
    Reservoir Labs Harvard 04/12/2011 48
  • 49. Outline
    •• R-Stream Overview
    •• Compilation Walk-through
    •• Performance Results
    •• Getting R-Stream
    Reservoir Labs Harvard 04/12/2011 49
  • 50. Current status
    •• Ongoing development also supported by DOE, Reservoir
    • Improvements in scope, stability, performance
    •• Installations/evaluations at US government laboratories
    •• Forward collaboration with Georgia Tech on Keeneland
    • HP SL390: 3 Fermi GPUs, 2 Westmeres per node
    •• Basis of the compiler for the DARPA UHPC Intel Corporation team
    Reservoir Labs Harvard 04/12/2011 50
  • 51. Availability
    •• Per-developer seat licensing model
    •• Support (releases, bug fixes, services)
    •• Available with commercial-grade external solvers
    •• Government has limited rights / SBIR data rights
    •• Academic source licenses with collaborators
    •• Professional team, continuity, software engineering
    Reservoir Labs Harvard 04/12/2011 51
  • 52. R-Stream Gives More Insight About Programs
    •• Teaches you about parallelism and the polyhedral model:
    • Generates correct code for any transformation
    – The transformation may be incorrect if specified by the user
    • Imperative code generation was the bottleneck until 2005
    – 3 theses written on the topic and it's still not completely covered
    – It is good to have an intuition of how these things work
    •• R-Stream has meaningful metrics to represent your program:
    • Maximal amount of parallelism given minimal expansion
    • Tradeoffs between coarse-grained and fine-grained parallelism
    • Loop types (doall, red, perm, seq); see the sketch after this slide
    •• Helps with algorithm selection
    •• Tiling of imperfectly nested loops
    •• Generates code for explicitly managed memory
    Reservoir Labs Harvard 04/12/2011 52
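    To make the loop types concrete, a small sketch (our example; the seq/doall/red labels in the comments follow the slide's vocabulary, and perm, a band of loops that can be legally interchanged and tiled, is not illustrated here):

    /* Repeated matrix-vector products: y = A*x, then x = y. */
    void mv_iter(int N, int T, double A[N][N], double x[N], double y[N]) {
      for (int t = 0; t < T; t++) {      /* seq: sweep t+1 reads sweep t  */
        for (int i = 0; i < N; i++) {    /* doall: each y[i] independent  */
          double acc = 0.0;
          for (int j = 0; j < N; j++)    /* red: sum reduction into acc   */
            acc += A[i][j] * x[j];
          y[i] = acc;
        }
        for (int i = 0; i < N; i++)      /* doall */
          x[i] = y[i];
      }
    }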
  • 53. How to Use R-Stream Successfully
    •• R-Stream is a great transformation tool; it is also a great learning tool:
    • Takes "simple C" input code
    • Applies multiple transformations and lists them explicitly
    • Can dump code at any step in the process
    • It's the tool I wish I had during my PhD:
    – To be fair, I already had a great set of tools
    •• R-Stream can be used in multiple modes:
    • Fully automatic + compile flag options
    • Autotuning mode (more than just tile size and unrolling …)
    • Scripting / programmatic mode (BeanShell + interfaces)
    • Mix of these modes + manual post-processing
    Reservoir Labs Harvard 04/12/2011 53
  • 54. Use Case: Fully Automatic Mode + Compile Flag Options
    •• Akin to traditional gcc / icc compilation with flags:
    • Predefined transformations can be parameterized
    – Except with less/no phase-ordering issues
    • Except you can see what each transformation does at each step
    • Except you can generate compilable and executable code at (almost) any step in the process
    • Except you can control code generation for compactness or performance
    Reservoir Labs Harvard 04/12/2011 54
  • 55. Use Case: Autotuning Mode
    •• More advanced than traditional approaches:
    • Knobs go far beyond loop unrolling + unroll-and-jam + tiling
    • Knobs are based on well-understood models
    • Knobs target high-level properties of the program
    – Amount of parallelism, amount of memory expansion, depth of pipelining of communications and computations …
    • Knobs are dependent on the target machine, the program, and the state of the mapping process:
    – Our tool has introspection
    •• We are really building a hierarchical autotuning transformation tool
    Reservoir Labs Harvard 04/12/2011 55
  • 56. Use Case: "Power user" interactive interface
    •• BeanShell access to optimizations
    •• Can direct and review the process of compilation
    • Automatic tactics (affine scheduling)
    •• Can direct code generation
    •• Access to "tuning parameters"
    •• All options / commands available on the command-line interface
    Reservoir Labs Harvard 04/12/2011 56
  • 57. Conclusion
    •• R-Stream simplifies software development and maintenance
    •• Porting: reduces expense and delivery delays
    •• Does this by automatically parallelizing loop code
    • While optimizing for data locality, coalescing, etc.
    •• Addresses
    • Dense loop-intensive computations
    •• Extensions
    • Data-parallel programming idioms
    • Sparse representations
    • Dynamic runtime execution
    Reservoir Labs Harvard 04/12/2011 57
  • 58. Contact us
    •• Per-developer seat, floating, cloud-based licensing
    •• Discounts for academic users
    •• Research collaborations with academic partners
    •• For more information:
    • Call us at 212-780-0527, or
    • See Rich, Ann
    • E-mail us at {sales,lethin,johnson}@reservoir.com
    Reservoir Labs Harvard 04/12/2011 58