Solving channel coding simulation and optimization problems using GPU

HUAWEI TECHNOLOGIES CO., LTD.
www.huawei.com
Usatyuk Vasiliy
2013
Channel coding:
Speed-up optimization
and simulation using GPU

Simple goal:
Invent a ‘New Code’
or improve an existing one

Main reason to speed up
channel coding
computation and simulation

Problem Statements
Estimate the error-floor lower bound of graph codes under a given decoder
(MSA, SPA, in floating and fixed point):
Find trapping sets in the Tanner graph.
Example of a (4,2) trapping set with 4 variable nodes and syndrome Hamming weight* 2 in
the Tanner graph of an LDPC code.
[Figure: Tanner graph with variable nodes v1, v2, v3, v4]
* If all variable nodes are set to 1, only check nodes 2 and 3 are connected to an odd number of 1s,
so the syndrome has Hamming weight 2.

Problem Statements
Estimate the error-floor lower bound of graph codes under a given decoder
(MSA, SPA, in floating and fixed point, with a given quantization, etc.):
Find trapping sets in the Tanner graph.
This amounts to decoding a subgraph of the bipartite graph under some error
variance.
Example of a (4,2) trapping set with 4 variable nodes and syndrome Hamming weight* 2 in
the Tanner graph of an LDPC code.
[Figure: Tanner graph with variable nodes v1, v2, v3, v4]
* If all variable nodes in the trapping set equal 1 (and all others equal 0), only check nodes 2 and 3 are
connected to an odd number of 1s, so the syndrome has Hamming weight 2.
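
The syndrome check in the footnote can be sketched in a few lines of Python. The parity-check matrix `H` below is a hypothetical toy example, not the code from these slides; it is chosen only so that setting the four trapping-set variable nodes to 1 leaves check nodes 2 and 3 unsatisfied.

```python
def syndrome(H, x):
    """Return the syndrome H * x over GF(2) as a list of bits."""
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]

# Hypothetical 4 x 6 parity-check matrix (rows = check nodes c1..c4,
# columns = variable nodes v1..v6). Assumed for illustration only.
H = [
    [1, 1, 0, 0, 1, 0],  # c1 sees v1, v2 from the set: even
    [1, 0, 1, 1, 0, 0],  # c2 sees v1, v3, v4 from the set: odd
    [0, 1, 0, 0, 0, 1],  # c3 sees v2 from the set: odd
    [0, 0, 1, 1, 1, 1],  # c4 sees v3, v4 from the set: even
]

trapping_set = [0, 1, 2, 3]  # zero-based indices of v1..v4

# Set the trapping-set variable nodes to 1 and all others to 0.
x = [1 if j in trapping_set else 0 for j in range(len(H[0]))]

s = syndrome(H, x)
unsatisfied = [i + 1 for i, bit in enumerate(s) if bit]  # one-based check nodes

# (a, b) = (number of variable nodes, syndrome Hamming weight)
print((len(trapping_set), sum(s)), "unsatisfied checks:", unsatisfied)
```

For this toy matrix the pair (a, b) comes out as (4, 2), matching the notation used on the slide.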
...
...
...
...

Using simulation platform
For an LDPC code with a 6 x 48 parity-check matrix and expansion factor 320, we used the
LP approach of Misha Chertkov and Misha Stepanov to weigh all pseudocodewords of weight below 330.
Estimating the error floor and most of the waterfall region requires computing 48 nodes,
each with twelve iterations of floating-point layered NOMS.
Decoding with a sequential CPU algorithm: one node takes around 19.2 days
Decoding with a parallel CPU algorithm: one node takes around 2.37 days
Decoding with a parallel GPU algorithm (without profiling): one node takes around 2 hours
Decoding with a parallel GPU algorithm (with profiling): one node takes around 70
minutes
Using the GPU, we sped up the error-floor estimation algorithm
around 395 times compared to the sequential algorithm
48.75 times compared to the parallel CPU algorithm
113 days become 2.(3) days
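
The speedup figures above follow from the per-node timings. The short Python check below reproduces them; it assumes that the "113 days" refers to the parallel CPU time for all 48 nodes and the "2.(3) days" to the profiled GPU time for the same 48 nodes, which is an interpretation consistent with the arithmetic rather than something the slide states explicitly.

```python
MIN_PER_DAY = 24 * 60

# Per-node timings taken from the slide.
seq_cpu_days = 19.2        # sequential CPU
par_cpu_days = 2.37        # parallel CPU
gpu_unprofiled_min = 120   # parallel GPU, before profiling (2 hours)
gpu_profiled_min = 70      # parallel GPU, after profiling
nodes = 48

# Speedup of the profiled GPU version over each CPU version.
speedup_vs_seq = seq_cpu_days * MIN_PER_DAY / gpu_profiled_min
speedup_vs_par = par_cpu_days * MIN_PER_DAY / gpu_profiled_min

# Gain from profiling alone.
profiling_gain = gpu_unprofiled_min / gpu_profiled_min

# Total time for all nodes (assumed meaning of "113 days become 2.(3) days").
total_par_cpu_days = nodes * par_cpu_days
total_gpu_days = nodes * gpu_profiled_min / MIN_PER_DAY

print(f"GPU vs sequential CPU: {speedup_vs_seq:.0f}x")
print(f"GPU vs parallel CPU:   {speedup_vs_par:.2f}x")
print(f"profiling gain:        {profiling_gain:.1f}x")
print(f"all nodes, parallel CPU: {total_par_cpu_days:.2f} days")
print(f"all nodes, GPU:          {total_gpu_days:.3f} days")
```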

Our simulation platform is based on the GPU
NVIDIA Tesla C2075 (2011 model), about $2.5K:
448 CUDA cores
Peak performance
515 Gflops in double precision
1030 Gflops in single precision
Memory: 6 GB GDDR5 (12.5% for ECC)
Memory clock: 1.5 GHz
Memory interface: 384-bit
Memory bandwidth: 144 GB/sec (12.5% for ECC)
Power consumption: 225 W TDP
OS: Ubuntu 10.04 LTS
CPU: Intel i7-3820 @ 3.60 GHz (4 cores)
RAM: 64 GB DDR3
HDD: Western Digital RE4 1 TB
Overall platform cost: less than $5K
Moreover, it can be upgraded by installing
a second GPU.

Our simulation platform is based on the GPU
NVIDIA Tesla C2075 (2011 model), about $2.5K:
448 CUDA cores
Peak performance
515 Gflops in double precision
Memory: 6 GB GDDR5 (12.5% for ECC)
Memory clock: 1.5 GHz
Memory interface: 384-bit
Memory bandwidth: 144 GB/sec (12.5% for ECC)
Compared to a modern GPU (2013):
Tesla K40 (2013 model), about $4.8K:
2880 CUDA cores
Peak performance
1430 Gflops in double precision
Memory: 12 GB (6.25% for ECC)
Memory bandwidth: 288 GB/sec (6.25% for ECC)
Memory clock: 3 GHz

A Tesla K40 GPU (2880 cores, about $4.8K) is 10 times faster than
two Xeon E5-2687W CPUs (16 cores in total, about $4.4K)
and around 200 times faster than sequential execution:
2880 CUDA cores
Peak performance
1430 Gflops in double precision (about 15 decimal digits)
4290 Gflops in single precision (about 7 decimal digits)
Memory: 12 GB (ECC off), 11.25 GB (ECC on, 6.25% for ECC)
Memory bandwidth: 288 GB/sec (ECC off), 270 GB/sec (ECC on)
Memory clock: 3 GHz
Compute capability: 3.5
Threads/warp: 32
Max warps/multiprocessor: 64
Max threads/multiprocessor: 2048
Max thread blocks/multiprocessor: 16
32-bit registers/multiprocessor: 65536
Max grid dimension: 2^31 - 1 (x-dimension)
Supports Hyper-Q (32 simultaneous MPI tasks)
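
The per-multiprocessor limits listed above determine how many thread blocks can be resident at once, which is what the choice of thread model trades off. The Python sketch below is a simplified occupancy estimate under stated assumptions: it uses only the warp, block, and register limits from this slide and ignores shared memory and register allocation granularity, so it is an illustration rather than a replacement for NVIDIA's occupancy calculator. The function names are hypothetical.

```python
# Limits for compute capability 3.5, as listed on the slide.
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 64
MAX_BLOCKS_PER_SM = 16
REGISTERS_PER_SM = 65536

max_threads_per_sm = THREADS_PER_WARP * MAX_WARPS_PER_SM  # 64 warps * 32 threads


def warps_per_block(threads_per_block):
    """Number of warps a block occupies (rounded up to whole warps)."""
    return -(-threads_per_block // THREADS_PER_WARP)


def resident_blocks(threads_per_block, regs_per_thread):
    """Blocks that fit on one multiprocessor under the three limits above."""
    wpb = warps_per_block(threads_per_block)
    by_warps = MAX_WARPS_PER_SM // wpb
    by_regs = REGISTERS_PER_SM // (regs_per_thread * wpb * THREADS_PER_WARP)
    return min(by_warps, by_regs, MAX_BLOCKS_PER_SM)


def occupancy(threads_per_block, regs_per_thread):
    """Fraction of the 64 warp slots that are in use."""
    wpb = warps_per_block(threads_per_block)
    return resident_blocks(threads_per_block, regs_per_thread) * wpb / MAX_WARPS_PER_SM


for tpb, regs in [(256, 32), (256, 64), (64, 32)]:
    print(f"{tpb} threads/block, {regs} regs/thread -> occupancy {occupancy(tpb, regs):.2f}")
```

In this simplified model, small blocks are capped by the 16-block limit and register-heavy kernels by the 65536-register limit; per-thread register usage is the kind of information reported by the compiler's `-Xptxas -v` output.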

Up to 8 Tesla K40 cards
can be installed in one
server

We profiled using the nvcc compiler output
(-Xptxas -v) and the NVIDIA Visual Profiler.
Choosing the optimal thread model gave a 1.7x speedup.


Amazing control,
but hard to program.

Fast simulation can be undermined by slow
implementation (programming).

If this is not enough and you want to finish a half-year simulation in
one day, just speed up by increasing the number of GPUs.

[Equation residue: a matrix expression in A, B, C over R with dimensions n, m; not recoverable from the source]
Increasing the number of GPUs on a highly parallel task gives a speedup close to linear.
A ‘New Code’ must have high throughput => it must be highly parallel =>
its simulation can be run on several HPC platforms with multiple GPUs

One week of simulation done in one day
This type of implementation is done in Matlab without
hardcore programming -> faster simulation development.

IS CML: Iterative Solutions Coded Modulation Library
6.3x speedup compared to CPU

Thank you
for your attention
