PG-Strom
GPU Accelerated Asynchronous
Super-Parallel Query Execution

KaiGai Kohei <kaigai@kaigai.gr.jp>
(Tw: @kkaigai)
Self Introduction

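• Name          KaiGai Kohei (海外 浩平)
• Affiliation   NEC Europe - SAP Global Competence Center
• Work          Creating innovation through the use of OSS (20-30%)
                Alliance with SAP, sales expansion of platform products (70-80%)
   • Certification work for SAP's in-memory DB “SAP HANA”
   • Obtaining SAP certification for CLUSTERPRO and expanding its sales
   (The alliance work takes up an especially large share.)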




20130218 - PG-Strom Workshop, Tokyo                                  2
Everyone is talking about BIG-DATA

   [Figure: everybody and their dog (“猫も杓子も”) casting eager looks at BIG DATA]
Big-Data Database?




            ¥¥¥¥¥¥
            $$$$$$
            €€€€€€
Homogeneous / Heterogeneous computing


   Directions of scaling
     • Homogeneous Scale-Up
     • Heterogeneous Scale-Up
     + Scale-out (not a topic of today’s talk)

   KPIs
     • Computing Performance
     • Power Consumption
     • System Cost (HW/SW)
     • Variety of Applications
     • Vendor Support
     • Software Development
     :
Design concept of PG-Strom


              The world's cheapest
            The most Cost-Effective
              Big-Data Database
• Utilization of open source technology
• Utilization of commodity hardware
  • up to ?? CPUs
  • up to ??? GB RAM
  • up to ??? data size
  (It is still too early to give concrete numbers for these.)
• Utilization of heterogeneous computing with GPU
Characteristics of GPU (1/2)

                          Nvidia          AMD             Intel
  Architecture            Kepler          GCN             SandyBridge
  Model                   Tesla K20X      FirePro S9000   Xeon E5-2690
  Release                 Q4/2012         Q3/2012         Q1/2012
  Number of Transistors   7.1 billion     4.3 billion     2.26 billion
  Number of Cores         2688 (simple)   1792 (simple)   16 (functional)
  Core clock              732MHz          925MHz          2.9GHz
  Peak FLOPS              3.95TFlops      3.23TFlops      185.6GFlops
  Memory Size / Type      6GB, GDDR5      6GB, GDDR5      384GB/socket, DDR3
  Memory Bandwidth        ~192GB/s        ~264GB/s        ~51.2GB/s
  Power Consumption       ~235W           ~225W           ~135W
  Price                   $3199?          $2499?          $2061
Characteristics of GPU (2/2)

  Example)

  Zi = Xi + Yi             (0 <= i <= n)


   X0       X1        X2      ...     Xn
   +        +         +               +
   Y0       Y1        Y2      ...     Yn
   ↓        ↓         ↓               ↓
   Z0       Z1        Z2      ...     Zn

   Each element is assigned to a particular “core”


                                            Nvidia’s GeForce GTX 680 Block Diagram
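
The element-wise example above can be sketched in plain C. This is only a CPU stand-in under stated assumptions: the names add_one and vector_add are hypothetical, and on a real GPU the index i would come from the runtime (for example get_global_id(0) in OpenCL) instead of a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one "core": it handles exactly one index i. */
static void add_one(size_t i, const float *x, const float *y, float *z)
{
    z[i] = x[i] + y[i];
}

/* CPU stand-in for the parallel launch. No iteration depends on another,
 * so every call to add_one() could run on a separate core. */
void vector_add(size_t n, const float *x, const float *y, float *z)
{
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t i = 0; i < n; i++)
        add_one(i, x, y, z);
}
```

Because the iterations are independent, the loop can be replaced by one work item per element without changing the result.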
Play with GPU (1/3)
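   [Figure: data path between the host and a non-integrated GPGPU]
     • HDD → HBA: SAS 2.0 (600MB/s)
     • HBA → IO HUB → CPU
     • CPU ↔ Memory (on-host buffer): DDR3-1600 (~51.2GB/s)
     • IO HUB ↔ GPU: PCI-E 3.0 x16 (~32GB/s)
     • GPU ↔ device DRAM (on-device buffer, GPU code): GDDR5 (192.2GB/s)

     • Ordinary I/O load: the CPU's job
     • Computational load: the GPU's job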

 Asynchronous Execution of CPU, GPU and PCI-E
 Minimization of data transfer between host and device
Play with GPU (2/3)
   Host code example
/* cxt, cmdq, kernel, g_itemsz and l_itemsz are assumed to be initialized elsewhere */
void sqrt_float4(int n, float v[])
{
  cl_mem dev_v;
  cl_int rv;

  /* Acquire device memory */
  dev_v = clCreateBuffer(cxt, CL_MEM_READ_WRITE,
                         sizeof(float) * n, NULL, &rv);
  /* Enqueue data transfer host to device (blocking) */
  clEnqueueWriteBuffer(cmdq, dev_v, CL_TRUE, 0, sizeof(float) * n,
                       v, 0, NULL, NULL);
  /* Set arguments of kernel code */
  clSetKernelArg(kernel, 0, sizeof(int), (void *)&n);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&dev_v);
  /* Enqueue invocation of device kernel */
  clEnqueueNDRangeKernel(cmdq, kernel, 1, NULL, &g_itemsz, &l_itemsz,
                         0, NULL, NULL);
  /* Enqueue data transfer device to host (blocking) */
  clEnqueueReadBuffer(cmdq, dev_v, CL_TRUE, 0, sizeof(float) * n,
                      v, 0, NULL, NULL);
  /* Release device memory */
  clReleaseMemObject(dev_v);
}


Play with GPU (3/3)
    Device code example
__kernel void dev_sqrt_float(int length, __global float *x)
{
  /* one work item per array element */
  int i = get_global_id(0);

  if (i < length)
    x[i] = sqrt(x[i]);
}


    Host code to load kernel
/* Load source code of the program */
program = clCreateProgramWithSource(cxt, 1,
                        (const char **)&kernel_source,
                        (const size_t *)&kernel_source_len, &rv);
/* Run-time build of the program */
rv = clBuildProgram(program, 1, &device_id, NULL, NULL, NULL);
/* Create a device kernel object */
kernel = clCreateKernel(program, "dev_sqrt_float", &rv);


Comparison - CPU vs CPU + GPU

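   [Figure: processing timelines]
     • CPU:       Storage → Host buffer → Loop & Calculation → Host buffer → output
     • CPU + GPU: Storage → Host buffer → DMA: host → device → Parallel Calculation
                  → DMA: device → host → Host buffer → output
                  (the next Storage → Host buffer load runs while the GPU calculates)
     • Which one finishes earlier depends on the workload

   • Advantage
     • Massively parallel processing of simple operations
     • Asynchronous execution & pipelining
   • Disadvantage
     • Cost of DMA transfer between host and device
     • More complex code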
Architecture of PG-Strom

PG-Strom’s Asynchronous Execution model
   [Figure: execution timelines]
     • vanilla PostgreSQL (CPU only):
         iteration of scanning tuples and evaluating the qualifiers
     • PostgreSQL with PG-Strom (CPU + GPU):
         • the CPU scans a larger “chunk” of the database at once
         • memory transfer and execution on the GPU run asynchronously
         • after synchronization, the scan completes earlier than the “Only CPU” scan
     Legend
         : Scan tuples on shared-buffers
         : Execution of the qualifiers
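
A minimal C sketch of the chunk-wise evaluation described above, under stated assumptions: CHUNK_SIZE, eval_chunk and count_matches are hypothetical names, the qualifier is the distance test used in the benchmark later in this talk, and the "GPU" step is an ordinary function call here, so nothing actually runs asynchronously. Squared distances are compared, which gives the same answer as the sqrt() form for non-negative values.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SIZE 4   /* hypothetical; a real chunk would be far larger */

/* Evaluate sqrt((x-256)^2 + (y-128)^2) < 40 for each row of one chunk
 * and record the outcome in rowmap[] (1 = the row passes). */
static void eval_chunk(size_t nitems, const float *x, const float *y,
                       unsigned char *rowmap)
{
    for (size_t i = 0; i < nitems; i++) {
        float dx = x[i] - 256.0f;
        float dy = y[i] - 128.0f;

        rowmap[i] = (dx * dx + dy * dy) < (40.0f * 40.0f);
    }
}

/* Scan the table chunk by chunk and count the rows that pass. */
size_t count_matches(size_t nrows, const float *x, const float *y)
{
    unsigned char rowmap[CHUNK_SIZE];
    size_t count = 0;

    assert(x != NULL && y != NULL);
    for (size_t base = 0; base < nrows; base += CHUNK_SIZE) {
        size_t left = nrows - base;
        size_t nitems = left < CHUNK_SIZE ? left : CHUNK_SIZE;

        eval_chunk(nitems, x + base, y + base, rowmap); /* the GPU's part */
        for (size_t i = 0; i < nitems; i++)             /* the CPU's part */
            count += rowmap[i];
    }
    return count;
}
```

In the model on this slide, the call to eval_chunk would be queued for the GPU so that the CPU can already load the next chunk.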
Re-definition of SQL/MED (1/2)

• SQL/MED (Management of External Data)
   • External data sources behave as if they were regular tables
   • Not only “management” of data, but also use of external computing resources

   [Figure: SQL Query → Query Parser → Query Planner → Query Executor]
     The Query Executor accesses:
     • Regular Table → storage (Exec)
     • Foreign Table → MySQL FDW → Exec
     • Foreign Table → Oracle FDW → Exec
     • Foreign Table → PG-Strom FDW → Exec, backed by a Regular Table → storage

Re-definition of SQL/MED (2/2)
   [Figure: Query Tree → Query Parser → Query Planner → Query Executor → Result Set]
     The SQL/MED API connects the core to the FDW (extension module):
     • Query Planner → FDW Planner: construction of remote SQL
     • Query Executor → FDW Executor, talking to the remote pgsql:
         connection open → remote query → result set → connection close

PG-Strom as SQL/MED driver
   [Figure: Query Tree → Query Parser → Query Planner → Query Executor → Result Set,
    with the FDW Planner and FDW Executor provided by the PG-Strom module]
     ① Automatically generate GPU kernel code from the qualifiers,
        e.g. WHERE log(x) < 10                          (FDW Planner)
     ② JIT-compile the kernel code
     ③ Load from the shadow tables into the chunk buffer (FDW Executor)
     ④ Asynchronous execution of the GPU kernel

Overall architecture
   World of CPU
     • PostgreSQL Backend
        • Query Executor → Result
        • SeqScan, etc... on regular tables, via shared_buffer
        • ForeignScan (PG-Strom) on shadow tables, via chunk_buffer
          (chunk data + GPU code)
     • PG-Strom GPU Server (background worker, launched by the Postmaster)
        • request handler
        • JIT compile
        • Event monitor
   World of GPU
     • GPU Device Memory: chunk data
     • GPU Kernel Function: Super Parallel Execution

So, how fast is it?
  postgres=# SELECT COUNT(*) FROM rtbl
             WHERE sqrt((x-256)^2 + (y-128)^2) < 40;
   count
  --------
   100467
  (1 row)
  Time: 7668.684 ms
  postgres=# SELECT COUNT(*) FROM ftbl
             WHERE sqrt((x-256)^2 + (y-128)^2) < 40;
   count
  --------
   100467
  (1 row)
  Time: 857.298 ms        Accelerated! (about 8.9x faster)
   CPU: Xeon E5-2670 (2.60GHz), GPU: NVIDIA GeForce GT640, RAM: 384GB
   Both the regular table rtbl and the PG-Strom table ftbl contain 20 million rows with the same values
Key Technologies

• Automatic GPU code generation & JIT compile
• Column-oriented data structure
• Asynchronous Execution




Automatic “pseudo” code generation

          SELECT * FROM ftbl WHERE
          c like '%xyz%' AND sqrt((x-256)^2+(y-100)^2) < 10;

   • c like '%xyz%' contains unsupported operators / functions
   • The remaining qualifier is translated to pseudo code:

                                      xreg10 = $(ftbl.x)
                                      xreg12 = 256.000000::double
                                      xreg8 = (xreg10 - xreg12)
                                      xreg10 = 2.000000::double
                                      xreg6 = pow(xreg8, xreg10)
                                      xreg12 = $(ftbl.y)
                                      xreg14 = 128.000000::double
                                               :

   The pseudo-code based implementation will soon be replaced by a
   native code and JIT-compile approach.

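
To illustrate how register-style pseudo code like the listing above could be evaluated, here is a small interpreter sketch in C. Everything in it is an assumption made for illustration: the opcode set, the insn_t layout and run_pseudo_code are hypothetical and are not PG-Strom's actual instruction format. Squaring is written as a multiplication instead of pow() to keep the sketch free of library dependencies.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_COLUMN, OP_CONST, OP_SUB, OP_ADD, OP_MUL, OP_LT } opcode_t;

typedef struct {
    opcode_t op;
    int      dst;    /* destination register */
    int      src1;   /* source register, or column index for OP_COLUMN */
    int      src2;   /* second source register */
    double   value;  /* constant for OP_CONST */
} insn_t;

#define NUM_REGS 16

/* Run the program against one row; the result is left in register 0. */
double run_pseudo_code(const insn_t *code, size_t ninsns, const double *row)
{
    double xreg[NUM_REGS] = {0};

    for (size_t pc = 0; pc < ninsns; pc++) {
        const insn_t *in = &code[pc];

        assert(in->dst >= 0 && in->dst < NUM_REGS);
        switch (in->op) {
        case OP_COLUMN: xreg[in->dst] = row[in->src1]; break;
        case OP_CONST:  xreg[in->dst] = in->value; break;
        case OP_SUB:    xreg[in->dst] = xreg[in->src1] - xreg[in->src2]; break;
        case OP_ADD:    xreg[in->dst] = xreg[in->src1] + xreg[in->src2]; break;
        case OP_MUL:    xreg[in->dst] = xreg[in->src1] * xreg[in->src2]; break;
        case OP_LT:     xreg[in->dst] = xreg[in->src1] < xreg[in->src2]; break;
        }
    }
    return xreg[0];
}
```

A program for (x - 256)^2 < 100 would load the column, load the constant, subtract, multiply the difference by itself and finish with OP_LT into register 0.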
Automatic native code generation - WIP

          SELECT * FROM ftbl WHERE
          c like '%xyz%' AND sqrt((x-256)^2+(y-100)^2) < 10;

   The OpenCL run-time builds a native GPU binary from generated source:

                                      __kernel void
                                      pgstrom_qual(int nitems, __global char *result,
                                                   __global const float *x,
                                                   __global const float *y)
                                      {
                                          int i = get_global_id(0);

                                          /* skip work items beyond the chunk */
                                          if (i >= nitems)
                                              return;
                                          if (sqrt(pow(x[i] - 256.0f, 2.0f) +
                                                   pow(y[i] - 100.0f, 2.0f)) < 10.0f)
                                              result[i] = 1;
                                          else
                                              result[i] = 0;
                                      }

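
As a rough idea of what the generation step could look like on the host side, the sketch below embeds an already-translated qualifier expression into kernel source text with snprintf(). The function name build_kernel_source and the fixed kernel template are assumptions for illustration; the translation of the SQL expression tree into the expression string is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write OpenCL kernel source that evaluates qual_expr for each row.
 * Returns the length of the source, or -1 if buf is too small. */
int build_kernel_source(char *buf, size_t buflen, const char *qual_expr)
{
    int len;

    assert(buf != NULL && qual_expr != NULL);
    len = snprintf(buf, buflen,
                   "__kernel void\n"
                   "pgstrom_qual(int nitems, __global char *result,\n"
                   "             __global const float *x,\n"
                   "             __global const float *y)\n"
                   "{\n"
                   "    int i = get_global_id(0);\n"
                   "\n"
                   "    if (i < nitems)\n"
                   "        result[i] = (%s) ? 1 : 0;\n"
                   "}\n",
                   qual_expr);
    /* snprintf reports the untruncated length; treat truncation as failure */
    if (len < 0 || (size_t) len >= buflen)
        return -1;
    return len;
}
```

The resulting string could then be handed to clCreateProgramWithSource() and clBuildProgram(), as in the host code shown earlier.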
Save bandwidth & shared-buffer usage
   E.g.) SELECT name, tel, email, address FROM address_book
          WHERE sqrt((pos_x - 24.5)^2 + (pos_y - 52.3)^2) < 10;
    It makes no sense to fetch columns that are not in use

   [Figure: CPU / GPU timelines with synchronization, row-oriented vs. column-oriented]
     Legend
     : Scan tuples on the shared-buffers
     : Execution of the qualifiers
     : Columns not used in the qualifiers

     • Save the bandwidth of the PCI-E bus
     • Save the shared-buffer usage
Column-oriented data structure (1/3)

   FOREIGN TABLE ft (X int, Y float, Z text)

   (shadow) TABLE “public.ft.rowid”
     rowid   nitems   isnull
      4000     2000   {0,0,0,1,0,0,…}
      6000     2000   {0,0,0,0,0,0,…}
         :        :          :
     14000      400   {0,0,1,0,0,0,…}

   (shadow) TABLE “public.ft.z.cs”
     rowid   nitems   isnull    values
      4000       15   {0,0,…}   { ‘hello’, ‘world’, … }
      4015       20   {0,0,…}   { ‘aaa’, ‘bbb’, ‘ccc’, … }
         :        :      :              :
     14275       25   {0,0,…}   { ‘xxx’, ‘yyy’, ‘zzz’, … }

   (shadow) TABLE “public.ft.y.cs”
     rowid   nitems   isnull    values
      4000      250   {0,0,…}   { 1.38, 6.45, 2.15, … }
      4250      250   {0,1,…}   { 4.32, 5.46, 3.14, … }
         :        :      :              :
     14200      100   {0,0,…}   …

   (shadow) TABLE “public.ft.a.cs”
     rowid   nitems   isnull    values
      4000      500   {0,0,…}   {10, 20, 30, 40, 50, …}
      4500      500   {0,1,…}   {11, 0, 31, 41, 51, …}
         :        :      :              :
     14200      200   {0,0,…}   {19, 29, 39, 49, 59, …}
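
One row of such a column-store shadow table can be modeled in C as below. This is a sketch under assumptions: column_chunk and column_chunk_get are hypothetical names, the column is taken to be an int column, and isnull is stored as one byte per value for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* One row of a "<table>.<column>.cs" shadow table: nitems values of a
 * single int column, starting at rowid. */
typedef struct {
    long                 rowid;   /* rowid of the first value */
    size_t               nitems;  /* number of values in this array */
    const unsigned char *isnull;  /* isnull[i] != 0: values[i] is NULL */
    const int           *values;
} column_chunk;

/* Fetch the value that belongs to rowid.
 * Returns 1 and stores the value in *out, 0 if the value is NULL,
 * or -1 if rowid is not covered by this chunk. */
int column_chunk_get(const column_chunk *cc, long rowid, int *out)
{
    size_t i;

    assert(cc != NULL && out != NULL);
    if (rowid < cc->rowid || (size_t) (rowid - cc->rowid) >= cc->nitems)
        return -1;
    i = (size_t) (rowid - cc->rowid);
    if (cc->isnull[i])
        return 0;
    *out = cc->values[i];
    return 1;
}
```

Since the values of one column are contiguous, only the arrays of the referenced columns need to be copied to the device.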
Column-oriented data structure (2/3)

 postgres=# CREATE FOREIGN TABLE example
              (a int, b text) SERVER pg_strom;
 CREATE FOREIGN TABLE

 postgres=# SELECT * FROM pgstrom_shadow_relations;
    oid  |       relname        | relkind | relsize
  -------+----------------------+---------+---------
   16446 | public.example.rowid | r       |       0
   16449 | public.example.idx   | i       |    8192
   16450 | public.example.a.cs  | r       |       0
   16453 | public.example.a.idx | i       |    8192
   16454 | public.example.b.cs  | r       |       0
   16457 | public.example.b.idx | i       |    8192
   16462 | public.example.seq   | S       |    8192
 (9 rows)

 postgres=# SELECT * FROM pg_strom."public.example.a.cs" ;
  rowid | nitems | isnull | values
 -------+--------+--------+--------
 (0 rows)


Column-oriented data structure (3/3)
   [Figure: only the referenced columns travel between host and GPU]
     ① Transfer: PgStromChunkBuffer → GPU
     ② Calculation: the GPU runs the opcode (Pseudo Code)
     ③ Write-Back: GPU → PgStromChunkBuffer

   PgStromChunkBuffer
     rowmap
     value a[]   <not used>
     value b[]   ← Table: my_schema.ft1.b.cs
     value c[]   ← Table: my_schema.ft1.c.cs
     value d[]   <not used>

   Table: my_schema.ft1.b.cs
     10100   {2.4, 5.6, 4.95, … }
     10300   {10.23, 7.54, 5.43, … }

   Table: my_schema.ft1.c.cs
     10100   {‘2010-10-21’, …}
     10200   {‘2011-01-23’, …}
     10300   {‘2011-08-17’, …}

• Less bandwidth consumption of PCI-Express bus
• Less usage of buffer-cache
• Suitable for data compression


PGconf.EU 2012 / PGStrom - GPU Accelerated Asynchronous Execution Module                                       26
Asynchronous Execution (1/2)
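   [Figure: flow of IterateForeignScan]
     IterateForeignScan
       • No more rows on current chunk?
          • No:  return the next TupleTableSlot from the current chunk
          • Yes: release the current chunk to free_chunk_list and take the
                 next chunk from ready_chunk_list as the current chunk
                 (if no chunks are ready yet, load a chunk first)
       • Load chunk from shadow tables
          • the loaded chunk is put on the Job Queue together with the GPU code
     GPU Management Server
       • takes chunks from the Job Queue and puts processed chunks
         on ready_chunk_list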
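
The hand-over of chunks between the lists in this flow can be sketched with a small FIFO in C. This is a single-threaded illustration under assumptions: chunk_t, chunk_list, load_next_chunk and process_one_job are hypothetical names, the GPU work is replaced by simply moving the chunk, and a real implementation would need locking across processes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHUNKS 4

typedef struct { int id; int nrows; } chunk_t;

/* Minimal FIFO of chunk pointers. It stands in for free_chunk_list,
 * the job queue and ready_chunk_list. Not thread-safe. */
typedef struct {
    chunk_t *items[MAX_CHUNKS];
    size_t   head;
    size_t   count;
} chunk_list;

void chunk_list_push(chunk_list *l, chunk_t *c)
{
    assert(l->count < MAX_CHUNKS);
    l->items[(l->head + l->count) % MAX_CHUNKS] = c;
    l->count++;
}

chunk_t *chunk_list_pop(chunk_list *l)
{
    chunk_t *c;

    if (l->count == 0)
        return NULL;
    c = l->items[l->head];
    l->head = (l->head + 1) % MAX_CHUNKS;
    l->count--;
    return c;
}

/* Backend side: take a free chunk, "load" nrows rows, queue it as a job. */
int load_next_chunk(chunk_list *free_list, chunk_list *job_queue, int nrows)
{
    chunk_t *c = chunk_list_pop(free_list);

    if (c == NULL)
        return 0;               /* no free chunk available */
    c->nrows = nrows;
    chunk_list_push(job_queue, c);
    return 1;
}

/* Server side: take one job, pretend to run the kernel, mark it ready. */
int process_one_job(chunk_list *job_queue, chunk_list *ready_list)
{
    chunk_t *c = chunk_list_pop(job_queue);

    if (c == NULL)
        return 0;               /* nothing to do */
    chunk_list_push(ready_list, c);
    return 1;
}
```

The scan would then pop from the ready list, return its rows one by one, and push the chunk back to the free list when it is exhausted.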
Asynchronous Execution (2/2) - in the future
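   [Figure: flow of IterateForeignScan with asynchronous data load]
     IterateForeignScan
       • No more rows on current chunk?
          • No:  return the next TupleTableSlot from the current chunk
          • Yes: release the current chunk to free_chunk_list and take the
                 next chunk from ready_chunk_list as the current chunk
     Asynchronous Data Load
       • free chunks and the GPU code are put on a Job Queue
       • multiple Parallel I/O Servers load the chunks from the shadow tables
     Asynchronous Calculation
       • loaded chunks are put on the Job Queue of the GPU Management Server
       • processed chunks are put on ready_chunk_list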
Eco-System in PostgreSQL Development

PostgreSQL developer’s community




   [Figure: cycle around the PostgreSQL developer’s community]
     • To the community:   contribution, feedback, donation, ...
     • From the community: software, documentation, knowledge, ...
     • Service:            infrastructure, support, consulting, ...
PostgreSQL development cycle
   [Figure: timeline 2011 - 2013]
     • v9.2 cycle: CommitFest 1st~4th, PGconf2011 & developer meeting,
                   then v9.2 Release (2012)
     • v9.3 cycle: CommitFest 1st~4th, PGconf2012 & developer meeting

   v9.3 development schedule
     • 17th-May developer meeting
     • 15th-Jun CommitFest:1st
     • 15th-Sep CommitFest:2nd
     • 15th-Nov CommitFest:3rd
     • 15th-Jan CommitFest:4th

   [Photo: PostgreSQL developer meeting (17th-May-2012)]
PostgreSQL CommitFest




Key features towards upcoming v9.3 (1/3)

• Background Worker
   • It enables extensions to manage their own background worker processes
   • Prerequisite for PG-Strom’s GPU control server
   • KaiGai implemented 1st version, then Alvaro revised and committed

   [Figure: processes under the postmaster, all attached to the
    Shared Resources (DB cluster, shared mem, IPC, ...)]
     • PostgreSQL Backends
     • Built-in background processes (autovacuum, bgwriter...)
     • Extra daemon with its own main(), managed by an Extension

Key features towards upcoming v9.3 (2/3)

• Writable-FDW
   •   It allows FDW drivers to modify external data sources via foreign tables.
   •   Several new APIs shall be added.
   •   Helpful for PG-Strom to modify shadow tables using standard DML.
   •   KaiGai submitted a patch; it is now in “ready-for-committer” status.

   [Figure: SELECT / INSERT / UPDATE / DELETE
            → SQL Parser → SQL Planner → SQL Executor
            → FDW driver → External Data Source]
     • INSERT, UPDATE and DELETE reach the FDW driver through the New API

Key features towards upcoming v9.3 (3/3)

• Writable-FDW (Pseudo-column support)
   • A write must identify the particular remote row to be modified.
   • A “rowid” is carried from the scan stage to the modify stage as
      the value of a pseudo-column.
   • The pseudo-column concept can also be used to push down complex
      calculations to an external computing resource.

       SELECT X, Y, (X-Y)^2 FROM ftable;
                    ↓
       SELECT X, Y, Pcol_1 FROM ftable;

      Pcol_1 just references the result calculated on the remote side
      by the remote query:
       (SELECT X, Y, (X-Y)^2 AS Pcol_1 FROM remote_data_source)
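
The two roles of a pseudo-column described above can be sketched in plain C. This is only a toy model under stated assumptions: RemoteRow, ScanTuple, remote_scan() and remote_update_y() are hypothetical names invented for illustration, and an in-memory array stands in for the external data source. It is not the PostgreSQL FDW API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a remote data source. */
typedef struct {
    long   rowid;   /* identifies the remote row */
    double x;
    double y;
} RemoteRow;

/* One scanned tuple: rowid travels along as a pseudo-column, and
 * pcol_1 holds (x - y)^2 already computed on the "remote" side. */
typedef struct {
    long   rowid;
    double x;
    double y;
    double pcol_1;
} ScanTuple;

/* Scan stage: emit rows together with their pseudo-columns. */
size_t remote_scan(const RemoteRow *rows, size_t nrows, ScanTuple *out)
{
    assert(rows != NULL && out != NULL);
    for (size_t i = 0; i < nrows; i++) {
        double d = rows[i].x - rows[i].y;
        out[i].rowid  = rows[i].rowid;
        out[i].x      = rows[i].x;
        out[i].y      = rows[i].y;
        out[i].pcol_1 = d * d;      /* pushed-down calculation */
    }
    return nrows;
}

/* Modify stage: use the carried rowid to locate the row to update.
 * Returns 1 when the row was found and updated, 0 otherwise. */
int remote_update_y(RemoteRow *rows, size_t nrows, long rowid, double new_y)
{
    assert(rows != NULL);
    for (size_t i = 0; i < nrows; i++) {
        if (rows[i].rowid == rowid) {
            rows[i].y = new_y;
            return 1;
        }
    }
    return 0;
}
```

The local side never recomputes (X-Y)^2; it only reads pcol_1, and an update hands the rowid obtained during the scan back to the modify stage.
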
Direction of the Future Development

Move to OpenCL - WIP

• OpenCL support, instead of CUDA
   • multiplatform support
   • built-in JIT compiler

                   OpenCL Source
                     (CString)

    clCreateProgramWithSource()       ○   ○
                      cl_program

              clCreateKernel()        ○   ×
                       cl_kernel
                                      ○   ×
       clEnqueueNDRangeKernel
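
The first box of the flow above, the OpenCL source held as a C string, could be produced along the following lines. This is a minimal sketch, not PG-Strom's actual generator: build_qual_source() and its qual parameter are hypothetical, the kernel name pgstrom_qual is borrowed from the earlier native-code slide, and the translation of a WHERE clause into the expression text is assumed to have happened already. The result buffer is declared as char because OpenCL does not accept bool pointers as kernel arguments.

```c
#include <assert.h>
#include <stdio.h>

/* Writes OpenCL C source for a qualifier kernel into buf.
 * "qual" is a hypothetical, pre-rendered boolean expression over
 * x[i] and y[i]; deriving it from a WHERE clause is not shown.
 * Returns the source length, or -1 if buf is too small. */
int build_qual_source(char *buf, size_t buflen, const char *qual)
{
    assert(buf != NULL && qual != NULL);
    int n = snprintf(buf, buflen,
        "__kernel void pgstrom_qual(int nitems,\n"
        "                           __global char *result,\n"
        "                           __global const float *x,\n"
        "                           __global const float *y)\n"
        "{\n"
        "    int i = get_global_id(0);\n"
        "    if (i < nitems)\n"
        "        result[i] = (%s) ? 1 : 0;\n"
        "}\n", qual);
    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return n;
}
```

The generated string is the kind of input that clCreateProgramWithSource() takes; the run-time build and clCreateKernel() steps then follow as in the flow above.
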

Variable Length Data Support - WIP

• The data layout of the chunk-buffer is being revised to accept variable-length data.
• The older format assumed a fixed number of items per chunk.
• The newer format assumes a fixed-size chunk that is consumed from both head and tail.

[Diagram: Older chunk-buffer layout: consecutive arrays for fixed-length values X, Y and Z. Newer chunk-buffer layout: consumed from the head by the fixed-length variable A and the index of the variable-length value B (e.g. offset: 123), and from the tail by the contents of variable-length values (e.g. text: ‘hello world’).]

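
A fixed-size chunk consumed from both ends can be sketched as follows. This is an illustration under assumptions, not the PG-Strom layout: Chunk, CHUNK_SIZE, chunk_put_varlena() and chunk_offset() are hypothetical names, the 256-byte size is arbitrary, and for brevity the head area holds only the offset index (fixed-length columns and value lengths are left out).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 256          /* assumed size; the slides give no figure */

typedef struct {
    unsigned char data[CHUNK_SIZE];
    size_t head;                /* next free byte, growing upward      */
    size_t tail;                /* start of the used tail, moving down */
} Chunk;

void chunk_init(Chunk *c)
{
    assert(c != NULL);
    c->head = 0;
    c->tail = CHUNK_SIZE;
}

/* Append one variable-length value: its bytes go to the tail and a
 * fixed-size offset entry goes to the head.
 * Returns 1 on success, 0 when the chunk has no room left. */
int chunk_put_varlena(Chunk *c, const void *val, uint32_t len)
{
    assert(c != NULL && val != NULL);
    size_t need = sizeof(uint32_t) + (size_t)len;
    if (c->tail - c->head < need)
        return 0;
    c->tail -= len;
    memcpy(c->data + c->tail, val, len);
    uint32_t off = (uint32_t)c->tail;
    memcpy(c->data + c->head, &off, sizeof(off));
    c->head += sizeof(off);
    return 1;
}

/* Read back the offset stored in the idx-th index entry. */
uint32_t chunk_offset(const Chunk *c, size_t idx)
{
    uint32_t off;
    assert(c != NULL && (idx + 1) * sizeof(off) <= c->head);
    memcpy(&off, c->data + idx * sizeof(off), sizeof(off));
    return off;
}
```

The chunk is full when head and tail meet, so the number of items per chunk depends on how long the values are, which is the difference from the older fixed-item-count format.
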
Procedural Language Support

• The idea is to let users write complicated logic in a procedural
   language that is executed on the GPU.
• Expected usage: image processing, genome matching, ...

 CREATE FUNCTION genome_similarity(text,text) RETURNS float AS
 $$
     varlena *genome1 = ARG1;
     varlena *genome2 = ARG2;
         :
      <some complicated logic>
         :
     return similarity;
 $$ LANGUAGE pg_strom;

 SELECT id, label FROM genome_db
     WHERE genome_similarity(data, 'ATGCAGGT....') > 0.9;
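
The slide leaves the function body open. Purely as a hypothetical stand-in, and not the logic the author has in mind, the following plain C function shows the kind of per-row computation such a procedural function might carry: a position-wise match ratio between two sequences. The signature and the metric are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical similarity: the fraction of positions, measured over
 * the longer sequence, at which both sequences hold the same base. */
double genome_similarity(const char *g1, size_t n1,
                         const char *g2, size_t n2)
{
    assert(g1 != NULL && g2 != NULL);
    size_t longer  = n1 > n2 ? n1 : n2;
    size_t shorter = n1 < n2 ? n1 : n2;
    size_t same = 0;

    if (longer == 0)
        return 1.0;             /* two empty sequences are identical */
    for (size_t i = 0; i < shorter; i++)
        if (g1[i] == g2[i])
            same++;
    return (double)same / (double)longer;
}
```

Each row's result is independent of every other row, which is what makes this style of function a candidate for one-work-item-per-row execution on a GPU.
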

Getting you involved

I’d like to know ...
• How PG-Strom runs on
     real-world datasets and workloads
• How PG-Strom should evolve
• Which application areas and problems it fits
                                      ↓
All co-development / co-evaluation
projects are always welcome!

Summary

• Characteristics of GPU & OpenCL
  • Inflexible instructions but much higher parallelism
  • Low cost and low power consumption per unit of computing capability
• PG-Strom - towards the most cost-effective database
  • Utilization of GPU to off-load CPU jobs
  • Automatic code generation and JIT compilation
  • Asynchronous execution
  • Column-oriented data structure
• Upcoming development
  • Move to OpenCL rather than CUDA
  • Support for variable-length values
  • Support for procedural language
• Your involvement will lead future evolution!

Source

• Source code
   • https://github.com/kaigai/pg_strom
• Wiki page
   • http://wiki.postgresql.org/wiki/PGStrom
      (needs maintenance...)

Any Questions?


