HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
Usatyuk Vasiliy
2013
Channel coding:
Speed-up optimization
and simulation using GPU
Simple goal:
Invent a ‘New Code’
or improve an existing one
Main reason to speed up
channel coding
computation and simulation
Problem statement
Estimate the error-floor lower bound of graph codes under a given decoder
(MSA or SPA, floating or fixed point, with a given quantization, etc.):
Find the trapping sets in the Tanner graph.
This amounts to decoding a subgraph of the bipartite graph with some error
variance.
Example of a (4,2) trapping set with 4 variable nodes and syndrome Hamming weight* 2 in
the Tanner graph of an LDPC code.
[Figure: Tanner graph fragment with variable nodes v1, v2, v3, v4]
* If all variable nodes in the trapping set are set to 1 (and all others to 0), only check
nodes 2 and 3 are connected to an odd number of ones, so the syndrome has Hamming weight 2.
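The footnote's syndrome argument can be checked numerically. The sketch below is only an illustration: the 5 x 6 parity-check matrix `H` is hypothetical (the graph on the slide is not recoverable from the text) and was chosen so that the four trapping-set variable nodes leave exactly check nodes 2 and 3 unsatisfied.

```python
import numpy as np

# Hypothetical 5 x 6 parity-check matrix (NOT taken from the slides).
# Columns 0..3 are the trapping-set variable nodes v1..v4,
# columns 4..5 are variable nodes outside the trapping set.
H = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
], dtype=int)


def syndrome(H, x):
    """Syndrome of word x: parity of each check node's neighbours."""
    return (H @ x) % 2


trapping_set = [0, 1, 2, 3]          # indices of v1..v4
x = np.zeros(H.shape[1], dtype=int)  # all other variable nodes stay 0
x[trapping_set] = 1                  # trapping-set nodes set to 1

s = syndrome(H, x)
unsatisfied_checks = (np.flatnonzero(s) + 1).tolist()  # 1-based check indices
a, b = len(trapping_set), int(s.sum())                 # the (a, b) label

print(f"syndrome = {s.tolist()}, unsatisfied checks = {unsatisfied_checks}")
print(f"({a},{b}) trapping set")
```

Here `a` counts the variable nodes in the set and `b` is the syndrome weight, which is how the (4,2) label on the slide is read.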
Using the simulation platform
For an LDPC code with a 6 x 48 parity-check matrix and expansion factor 320, we used the
LP approach of Misha Chertkov and Misha Stepanov to weigh all pseudocodewords of weight below 330.
To estimate the error floor and most of the waterfall region, we need to compute 48 nodes,
each with twelve iterations of floating-point layered NOMS.
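For readers unfamiliar with min-sum decoding, here is a minimal sketch of a single check-node update. It assumes that NOMS denotes a min-sum variant that applies both a normalization factor and an offset; the slides do not spell this out. The function name `check_node_update` and the values of `alpha` and `beta` are illustrative, not the ones used in the simulations.

```python
import numpy as np


def check_node_update(msgs, alpha=0.75, beta=0.5):
    """One min-sum check-node update with normalization and offset.

    msgs  : incoming variable-to-check LLRs on the edges of one check node
    alpha : normalization factor (assumed value)
    beta  : offset (assumed value)
    Returns the outgoing check-to-variable message for every edge.
    """
    msgs = np.asarray(msgs, dtype=float)
    mags = np.abs(msgs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]
    # Each edge sees the smallest magnitude among the OTHER edges:
    # the edge holding the overall minimum gets the second minimum.
    others_min = np.where(np.arange(len(msgs)) == order[0], min2, min1)
    signs = np.where(msgs < 0, -1.0, 1.0)
    # Product of the other edges' signs = total sign product times own sign.
    others_sign = np.prod(signs) * signs
    return alpha * others_sign * np.maximum(others_min - beta, 0.0)


plain = check_node_update([2.0, -1.0, 3.0], alpha=1.0, beta=0.0)
scaled = check_node_update([2.0, -1.0, 3.0], alpha=0.75, beta=0.5)
print(plain, scaled)
```

A layered schedule would apply this update one block row of the parity-check matrix at a time and refresh the variable-node totals after each layer. Because all check nodes within a layer are independent, this is the part that maps naturally onto GPU threads.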
Decoding one node takes approximately:
sequential CPU algorithm: 19.2 days
parallel CPU algorithm: 2.37 days
parallel GPU algorithm (without profiling): 2 hours
parallel GPU algorithm (with profiling): 70 minutes
Using the GPU, we sped up the error-floor estimation algorithm
about 395 times compared with the sequential algorithm and
48.75 times compared with the parallel CPU algorithm:
113 days (all 48 nodes on the parallel CPU) become about 2.33 days.
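The quoted speedups follow directly from the per-node timings. The short check below uses only the numbers on this slide; the variable names are ours.

```python
MIN_PER_DAY = 24 * 60

# Per-node decoding times from the slide, converted to minutes.
seq_cpu = 19.2 * MIN_PER_DAY
par_cpu = 2.37 * MIN_PER_DAY
gpu_unprofiled = 2 * 60.0
gpu_profiled = 70.0

speedup_vs_seq = seq_cpu / gpu_profiled          # about 395
speedup_vs_par = par_cpu / gpu_profiled          # about 48.75
profiling_gain = gpu_unprofiled / gpu_profiled   # about 1.7

# Total time for all 48 nodes.
nodes = 48
total_par_cpu_days = nodes * 2.37                    # about 113 days
total_gpu_days = nodes * gpu_profiled / MIN_PER_DAY  # 2.333... days

print(f"{speedup_vs_seq:.0f}x vs sequential, {speedup_vs_par:.2f}x vs parallel CPU")
print(f"{total_par_cpu_days:.0f} days -> {total_gpu_days:.2f} days")
```

The "2.(3) days" on the slide is the repeating decimal 2.333..., that is 56 hours for all 48 nodes on the GPU.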
Our simulation platform is based on the GPU
NVIDIA Tesla C2075 (2011 model), $2.5K:
448 CUDA cores
Peak performance:
515 GFLOPS in double precision
1030 GFLOPS in single precision
Memory: 6 GB GDDR5 (12.5% for ECC)
Memory clock: 1.5 GHz
Memory interface: 384-bit
Memory bandwidth: 144 GB/s (12.5% for ECC)
Power consumption: 225 W TDP
OS: Ubuntu 10.04 LTS
CPU: Intel i7-3820 @ 3.60 GHz (4 cores)
RAM: 64 GB DDR3
HDD: Western Digital RE4 1 TB
Overall platform cost: less than $5K
Moreover, it can be upgraded by installing
a second GPU.
Compared with a modern GPU (2013):
NVIDIA Tesla K40 (2013 model), $4.8K:
2880 CUDA cores
Peak performance:
1430 GFLOPS in double precision
4290 GFLOPS in single precision
Memory: 12 GB (6.25% for ECC)
Memory bandwidth: 288 GB/s (6.25% for ECC)
Memory clock: 3 GHz
Power consumption: 235 W TDP
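To make the comparison concrete, the sketch below computes the K40-to-C2075 ratios from the figures quoted in the two spec lists, plus the memory and bandwidth left when ECC is enabled. These are ratios of peak figures only; the speedup of a real decoder depends on the kernel and is not implied by them.

```python
# Peak figures as quoted on the slides (prices in thousands of USD).
c2075 = {"dp_gflops": 515, "sp_gflops": 1030, "bandwidth_gbs": 144, "price_kusd": 2.5}
k40 = {"dp_gflops": 1430, "sp_gflops": 4290, "bandwidth_gbs": 288, "price_kusd": 4.8}

# How many times larger each K40 figure is.
ratios = {key: k40[key] / c2075[key] for key in c2075}

# Peak double-precision GFLOPS per thousand dollars.
dp_per_kusd = {
    "C2075": c2075["dp_gflops"] / c2075["price_kusd"],
    "K40": k40["dp_gflops"] / k40["price_kusd"],
}

# K40 capacity and bandwidth with 6.25% set aside for ECC.
ecc_share = 0.0625
k40_mem_ecc_gb = 12 * (1 - ecc_share)
k40_bw_ecc_gbs = k40["bandwidth_gbs"] * (1 - ecc_share)

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
print(dp_per_kusd, k40_mem_ecc_gb, k40_bw_ecc_gbs)
```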
The Tesla K40 GPU (2880 cores, $4.8K) is 10 times faster than
two Xeon E5-2687W CPUs (16 cores in total, $4.4K) and
around 200 times faster than sequential execution:
2880 CUDA cores with peak performance of
1430 GFLOPS in double-precision floating point (about 15 decimal digits)
4290 GFLOPS in single-precision floating point (about 7 decimal digits)
Memory: 12 GB (ECC off), 11.25 GB (ECC on, 6.25% for ECC)
Memory bandwidth: 288 GB/s (ECC off), 270 GB/s (ECC on)
Memory clock: 3 GHz
Power consumption: 235 W TDP
Compute capability: 3.5
Threads per warp: 32
Max warps per multiprocessor: 64
Max threads per multiprocessor: 2048
Max thread blocks per multiprocessor: 16
Single-precision registers per multiprocessor: 65536
Max grid dimension: 2^32 - 1
Supports Hyper-Q (32 simultaneous MPI tasks)
Up to 8 Tesla K40 cards can be installed in one server.
We profiled the code using the nvcc compiler output
(-Xptxas -v) and the NVIDIA Visual Profiler.
Choosing the optimal thread model gave a 1.7x speedup.
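The register count per thread that `-Xptxas -v` reports feeds directly into the choice of thread model. The sketch below is a simplified occupancy estimate using the per-multiprocessor limits listed for the Tesla K40 in this deck (64 warps, 32 threads per warp, 16 blocks, 65536 registers). It ignores shared memory and register allocation granularity, and the block sizes and register counts in the example calls are hypothetical, not measurements from the decoder.

```python
def occupancy(threads_per_block, regs_per_thread,
              max_warps=64, warp_size=32, max_blocks=16, regs_per_sm=65536):
    """Fraction of the multiprocessor's warp slots a kernel can keep resident.

    Simplified model: only the warp, block and register limits are applied.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    blocks = min(
        max_warps // warps_per_block,   # limit from warp slots
        regs_per_sm // regs_per_block,  # limit from the register file
        max_blocks,                     # limit from resident blocks
    )
    return blocks * warps_per_block / max_warps


full = occupancy(threads_per_block=256, regs_per_thread=32)
reg_limited = occupancy(threads_per_block=256, regs_per_thread=64)
block_limited = occupancy(threads_per_block=32, regs_per_thread=16)
print(full, reg_limited, block_limited)
```

In this model, doubling the registers per thread halves the resident warps, and very small blocks hit the 16-block limit long before the warp limit. Higher occupancy does not guarantee a faster kernel, so estimates like this are a starting point for the profiler rather than a replacement for it.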
Amazing control,
but hard to program.
The benefit of fast simulation can be weakened by slow
implementation (programming).
If that is not enough and you want to finish a half-year simulation in
one day, just speed up by increasing the number of GPUs.
Increasing the number of GPUs on a highly parallel task gives a speedup close to linear.
A ‘New Code’ must have high throughput => it must be highly parallel =>
its simulation can be run on several HPC platforms with multiple GPUs.
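One standard way to reason about "close to linear" scaling is Amdahl's law. It is our choice of model here, not something the slides state. The parallel fractions in the sketch are assumed values for illustration.

```python
def amdahl_speedup(parallel_fraction, n_gpus):
    """Amdahl's law: speedup on n_gpus when only part of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_gpus)


ideal = amdahl_speedup(1.0, 8)         # fully parallel task
nearly_ideal = amdahl_speedup(0.99, 8)  # assumed 1% serial work

for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(0.99, n), 2))
```

With 1% serial work, eight GPUs give roughly a 7.5x speedup in this model. Independent decoding runs, such as separate codewords or separate noise levels, keep the serial share small, which is why such simulations scale almost linearly with the number of GPUs.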
One week of simulation done in one day.
This type of implementation was done in MATLAB without
hardcore programming -> faster simulation development.
IS CML: Iterative Solutions Coded Modulation Library
6.3x speedup compared with the CPU
Thank you for your attention
www.huawei.com

Solving channel coding simulation and optimization problems using GPU
