Accelerating TensorFlow with RDMA for High-Performance Deep Learning
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at DataWorks Summit Berlin 2018
by
Xiaoyi Lu
The Ohio State University
E-mail: luxi@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
Why is Deep Learning so hot?
• Deep Learning is a subset of Machine Learning
– But it is perhaps the most radical and revolutionary subset
• Deep Learning is going through a resurgence
– Model: excellent accuracy for deep/convolutional neural networks
– Data: public availability of versatile datasets like MNIST, CIFAR, and ImageNet
– Capability: unprecedented computing and communication capabilities: Multi-/Many-Core, GPGPUs, Xeon Phi, InfiniBand, RoCE, etc.
• Big Data has become one of the most important elements in business analytics
– Increasing demand for extracting Big Value out of Big Data to drive continuous revenue growth
[Figure: MNIST handwritten digits and a Deep Neural Network]
Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/
Application Example of DL: Flickr’s Magic View Photo Filtering
Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g
• Image recognition to divide pictures into surprisingly accurate categories
• Magic of AI/DL: Generate accurate tags for billions of pictures
Examples of Deep Learning Stacks
• TensorFlow
• Caffe/Caffe2
• Torch
• SparkNet
• TensorFrame
• DeepLearning4J
• BigDL
• CNTK
• mmlspark
• Many others…
Trends of Deep Learning Stacks
• Google TensorFlow
• Microsoft CNTK
• Facebook Caffe2
• Google Search Trend (March 23, 2018)
Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
Convergence of HPC, Big Data, and Deep Learning!!!
Highly-Optimized Underlying Libraries with HPC Technologies
• BLAS Libraries – the heart of math operations
– Atlas/OpenBLAS
– NVIDIA cuBLAS
– Intel Math Kernel Library (MKL)
• DNN Libraries – the heart of Convolutions!
– NVIDIA cuDNN (already in its 7th iteration – cuDNN v7)
– Intel MKL-DNN (MKL 2017) – recent but a very promising development
• Communication Libraries – the heart of model parameter updating
– RDMA
– GPUDirect RDMA
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
Architecture Overview of gRPC
Key Features:
• Simple service definition
• Works across languages and platforms
• C++, Java, Python, Android Java, etc.
• Linux, Mac, Windows
• Start quickly and scale
• Bi-directional streaming and integrated authentication
• Used by Google (several of Google’s cloud products and externally facing APIs, TensorFlow), Netflix, Docker, Cisco, Juniper Networks, etc.
• Uses sockets for communication!
Source: http://www.grpc.io/
Built for large-scale distributed systems composed of microservices
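As a concrete reference, a minimal gRPC Python sketch using the stock Greeter example service; the helloworld_pb2 / helloworld_pb2_grpc modules are an assumption here — they must first be generated from gRPC's example helloworld.proto with grpcio-tools:

```python
# Minimal gRPC Python sketch: simple service definition + cross-language stub
# model described above; the transport underneath is plain sockets.
from concurrent import futures
import grpc
import helloworld_pb2        # assumed: generated from gRPC's helloworld.proto
import helloworld_pb2_grpc


class Greeter(helloworld_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        return helloworld_pb2.HelloReply(message="Hello, %s!" % request.name)


# Server side: register the servicer and listen on a socket.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
helloworld_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
server.add_insecure_port("[::]:50051")
server.start()

# Client side: open a channel and invoke the RPC through a generated stub.
channel = grpc.insecure_channel("localhost:50051")
stub = helloworld_pb2_grpc.GreeterStub(channel)
print(stub.SayHello(helloworld_pb2.HelloRequest(name="DataWorks")).message)
server.stop(0)
```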
Architecture Overview of Google TensorFlow
Key Features:
• Widely used for Deep Learning
• Open-source software library for numerical computation using data flow graphs
• Graph edges represent the multidimensional data arrays
• Nodes in the graph represent mathematical operations
• Flexible architecture allows computation to be deployed to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API
• Used by Google, Airbnb, DropBox, Snapchat, Twitter
• Communication and Computation intensive
Source: https://www.tensorflow.org/
[Figure: Architecture of TensorFlow]
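To make the data-flow-graph model concrete, a tiny TensorFlow 1.x example (matching the TensorFlow 1.4 used later in this talk); the runtime transparently places the ops on the available CPUs/GPUs:

```python
# A tiny TensorFlow 1.x data-flow graph: nodes are operations, edges carry
# multidimensional arrays (tensors).
import tensorflow as tf

a = tf.constant([[1.0, 2.0]])     # a 1x2 tensor flows along this edge
b = tf.constant([[3.0], [4.0]])   # a 2x1 tensor
c = tf.matmul(a, b)               # a MatMul node

with tf.Session() as sess:        # runtime places ops on CPU/GPU
    print(sess.run(c))            # [[11.]]
```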
Overview of gRPC with TensorFlow
Worker services communicate among each other using gRPC, or gRPC+X!
[Diagram: a Client connects to a Master, which dispatches work to Worker services (/job:PS/task:0 and /job:Worker/task:0); each worker node holds CPU and GPU devices and runs an in-process gRPC server/client]
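A minimal sketch of this layout in TensorFlow 1.x (hostnames are placeholders): variables live on the PS task, and workers pull them over gRPC at every step.

```python
# Sketch of the PS/worker layout above (TensorFlow 1.x; hostnames assumed).
# Each process runs an in-process gRPC server.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0:2222"],                      # /job:ps/task:0
    "worker": ["worker0:2222", "worker1:2222"],  # /job:worker/task:0 and 1
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device("/job:ps/task:0"):        # parameters hosted by the PS
    w = tf.Variable(tf.zeros([1000, 256]))

with tf.device("/job:worker/task:0"):    # worker fetches w over gRPC
    y = tf.matmul(tf.ones([1, 1000]), w)

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y)
```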
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Figure: modern cluster building blocks — high-performance interconnects such as InfiniBand with SR-IOV (<1 µsec latency, 200 Gbps bandwidth); multi-/many-core processors; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM; cloud systems such as SDSC Comet and TACC Stampede]
Interconnects and Protocols in OpenFabrics Stack
[Diagram: interconnect/protocol options exposed to applications and middleware — kernel-space TCP/IP sockets over an Ethernet driver (1/10/25/40/100 GigE), hardware-offloaded TCP/IP (10/40 GigE-TOE), IPoIB over InfiniBand, user-space RSockets, SDP, user-space iWARP over Ethernet, RDMA over RoCE adapters, and native InfiniBand verbs (RDMA); each protocol is shown with its adapter and switch]
Large-scale InfiniBand Installations
• 163 IB clusters (32.6%) in the Nov’17 Top500 list (http://www.top500.org)
• Installations in the Top 50 (17 systems):
19,860,000 cores (Gyoukou) in Japan (4th)
241,108 cores (Pleiades) at NASA/Ames (17th)
220,800 cores (Pangea) in France (21st)
144,900 cores (Cheyenne) at NCAR/USA (24th)
155,150 cores (Jureca) in Germany (29th)
72,800 cores Cray CS-Storm in US (30th)
72,800 cores Cray CS-Storm in US (31st)
78,336 cores (Electra) at NASA/USA (33rd)
124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (34th)
60,512 cores (NVIDIA DGX-1/Relion) at Facebook in USA (35th)
60,512 cores (DGX SATURN V) at NVIDIA/USA (36th)
72,000 cores (HPC2) in Italy (37th)
152,692 cores (Thunder) at AFRL/USA (40th)
99,072 cores (Mistral) at DKRZ/Germany (42nd)
147,456 cores (SuperMUC) in Germany (44th)
86,016 cores (SuperMUC Phase 2) in Germany (45th)
74,520 cores (Tsubame 2.5) at Japan/GSIC (48th)
66,000 cores (HPC3) in Italy (51st)
194,616 cores (Cascade) at PNNL (53rd)
and many more!
Open Standard InfiniBand Networking Technology
• Introduced in Oct 2000
• High-Performance Data Transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), high bandwidth (up to 25 GigaBytes/sec -> 200 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
– Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
– Provides flexibility to develop upper layers
• Multiple Operations
– Send/Recv
– RDMA Read/Write
– Atomic Operations (very unique)
• enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ….
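One-sided RDMA Read/Write is the key departure from Send/Recv: the initiator accesses remote memory without involving the target’s CPU. A conceptual, runnable analogy in Python (shared memory standing in for a registered memory region; this is not real IB verbs code):

```python
# Conceptual analogy for a one-sided RDMA write (NOT real verbs code):
# the initiator writes into a memory region the target has exposed, and the
# target's CPU never executes a recv(). Python 3.8+.
from multiprocessing import shared_memory

# Target "registers" a region and publishes its name (like exchanging an rkey).
region = shared_memory.SharedMemory(create=True, size=4096, name="mr0")

# Initiator attaches to the region and performs the "RDMA write".
peer = shared_memory.SharedMemory(name="mr0")
peer.buf[:5] = b"hello"

print(bytes(region.buf[:5]))  # target sees b'hello' without any receive call
peer.close()
region.close()
region.unlink()
```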
RDMA over Converged Enhanced Ethernet (RoCE)
[Diagram: network stack comparison — InfiniBand: IB Verbs / IB Transport / IB Network / InfiniBand Link Layer; RoCE: IB Verbs / IB Transport / IB Network / Ethernet Link Layer; RoCE v2: IB Verbs / IB Transport / UDP+IP / Ethernet Link Layer]
• Takes advantage of IB and Ethernet
– Software written with IB Verbs
– Link layer is Converged (Enhanced) Ethernet (CE)
– 100 Gb/s support from latest EDR and ConnectX-3 Pro adapters
• Pros of RoCE vs. IB:
– Works natively in Ethernet environments
• Entire Ethernet management ecosystem is available
– Has all the benefits of IB Verbs
– Link layer is very similar to the link layer of native IB, so there are no missing features
• RoCE v2: Additional Benefits over RoCE
– Traditional network management tools apply
– ACLs (metering, accounting, firewalling)
– IGMP snooping for optimized multicast
– Network monitoring tools
Courtesy: OFED, Mellanox
[Diagram: packet header comparison — RoCE: ETH L2 header (Ethertype) + IB GRH L3 header + IB BTH+ L4 header; RoCE v2: ETH L2 header (Ethertype) + IP L3 header (Proto#) + UDP header (Port#) + IB BTH+ L4 header]
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 280 organizations from 34 countries
• More than 26,000 downloads from the project site
Available for InfiniBand and RoCE (also runs on Ethernet); available for x86 and OpenPOWER; support for Singularity and Docker
HiBD Release Timeline and Downloads
[Chart: cumulative downloads (0–30,000) over the project timeline, annotated with releases: RDMA-Hadoop 1.x 0.9.0 / 0.9.8 / 0.9.9; RDMA-Hadoop 2.x 0.9.1 / 0.9.6 / 0.9.7 / 1.0.0 / 1.3.0; RDMA-Memcached 0.9.1 & OHB-0.7.1, 0.9.4, 0.9.6 & OHB-0.9.3; RDMA-Spark 0.9.1 / 0.9.4 / 0.9.5; RDMA-HBase 0.9.1]
Motivation
• Can similar designs be done for gRPC and TensorFlow to achieve significant performance benefits by taking advantage of native RDMA support?
• How do we benchmark gRPC and TensorFlow for both deep learning and system researchers?
• What kind of performance benefits can we get through native RDMA-based designs in gRPC and TensorFlow?
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
Tensor Communication over gRPC channel
• Rendezvous protocol
• TensorFlow worker (tensor-receiving process) actively requests tensors from the parameter server (tensor-sending process)
• Worker issues a Tensor RPC request to the Parameter Server (PS)
• PS finds the requested tensor and responds to the worker
• gRPC core uses recvmsg and sendmsg primitives for receiving and sending payloads (illustrated below)
• Tensor transmission uses iovec structures
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
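The sendmsg/iovec pattern can be illustrated with Python’s socket API — a toy stand-in for gRPC’s internals, with illustrative (assumed) buffer sizes:

```python
# Toy illustration of scatter-gather (iovec-style) I/O as used by gRPC's
# sendmsg/recvmsg path; buffer sizes are illustrative, not TensorFlow's.
import socket

left, right = socket.socketpair()
iovec = [b"hdr", b"m" * 1024, b"t" * 65536]   # small / medium / large buffers
total = sum(len(b) for b in iovec)

left.sendmsg(iovec)             # one syscall gathers all three buffers

received = bytearray()
while len(received) < total:    # stream socket: drain until complete
    chunk, _, _, _ = right.recvmsg(total - len(received))
    received += chunk
assert bytes(received[:3]) == b"hdr"
```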
High Performance Tensor Communication Channel
• gRPC + Verbs
– Dedicated verbs channel for tensor communication
– gRPC channel for administrative task communication
• gRPC + MPI
– Dedicated MPI channel for tensor communication
– gRPC channel for administrative task communication
• Uber Horovod
– Uber’s approach to MPI-based distributed TensorFlow
• Baidu TensorFlow-Allreduce
– Baidu’s approach to MPI-based distributed TensorFlow
(Channel selection is sketched below.)
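A minimal sketch of selecting among these channels in TensorFlow 1.x; the "grpc+verbs" and "grpc+mpi" values assume TensorFlow was built with the corresponding contrib transports (Horovod instead replaces the PS design with MPI allreduce):

```python
# Selecting the tensor transport in TensorFlow 1.x (hostnames are placeholders).
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"], "worker": ["worker0:2222"]})

# protocol="grpc"        -> tensors and admin traffic both over gRPC sockets
# protocol="grpc+verbs"  -> dedicated IB verbs channel for tensors
# protocol="grpc+mpi"    -> dedicated MPI channel for tensors
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
server.join()
```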
TensorFlow Workload via gRPC
iovec buffer distribution observed for TensorFlow training over gRPC:
• Small, Medium, and Large indicate buffers a few bytes, KBytes, and MBytes in length
• A gRPC payload may contain a uniform distribution of such Small buffers
• Many Large buffers and a few Small buffers create a skewed distribution of buffers within one gRPC payload (both schemes are sketched below)
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
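A toy sketch of the two payload-generation schemes; buffer sizes and counts are assumptions for illustration, not the exact values used by the benchmark suite:

```python
# Toy payload generators mirroring the uniform vs. skewed iovec distributions
# described above (sizes are illustrative assumptions).
SMALL, MEDIUM, LARGE = 64, 64 * 1024, 4 * 1024 * 1024  # bytes

def uniform_payload(n=12):
    """Equal mix of Small/Medium/Large buffers in one gRPC payload."""
    return [bytes(s) for s in [SMALL, MEDIUM, LARGE] * (n // 3)]

def skew_payload(n=12):
    """Mostly Large buffers plus a few Small ones (skewed distribution)."""
    return [bytes(s) for s in [LARGE] * (n - 2) + [SMALL] * 2]

for name, payload in [("uniform", uniform_payload()), ("skew", skew_payload())]:
    print(name, len(payload), "buffers,", sum(map(len, payload)), "bytes")
```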
OSU AR-gRPC Architecture
R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? (Under Review)
• Adaptive RDMA gRPC
• Features:
– Hybrid communication engine
• Adaptive protocol selection between eager and rendezvous (see the sketch after this list)
– Message pipelining and coalescing
• Adaptive chunking and accumulation
• Intelligent threshold detection
– Zero-copy transmission
• Zero-copy send/recv
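A minimal sketch of the idea behind the hybrid engine — the threshold and chunk size are assumptions, and this is not OSU’s implementation:

```python
# Idea sketch of adaptive eager/rendezvous selection with chunked pipelining
# (threshold/chunk values are assumptions, not AR-gRPC's actual parameters).
EAGER_THRESHOLD = 8 * 1024        # AR-gRPC detects this adaptively
CHUNK = 1 * 1024 * 1024

def send_tensor(buf, eager_send, rendezvous_send):
    view = memoryview(buf)
    if len(view) <= EAGER_THRESHOLD:
        eager_send(view)          # e.g., copy into a pre-registered RDMA buffer
    else:
        for off in range(0, len(view), CHUNK):
            rendezvous_send(view[off:off + CHUNK])  # pipelined zero-copy chunks

# Example wiring with stub transports:
sent = []
send_tensor(b"x" * (3 * 1024 * 1024), sent.append, sent.append)
print(len(sent), "chunks")   # 3 chunks via the rendezvous path
```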
AR-gRPC Enhanced TensorFlow
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
Available Benchmarks, Models, and Datasets

| | MNIST | CIFAR-10 | ImageNet |
|---|---|---|---|
| Category | Digit Classification | Object Classification | Object Classification |
| Resolution | 28 × 28 B&W | 32 × 32 Color | 256 × 256 Color |
| Classes | 10 | 10 | 1000 |
| Training Images | 60 K | 50 K | 1.2 M |
| Testing Images | 10 K | 10 K | 100 K |

| Model | Layers (Conv. / Fully-connected) | Dataset | Framework |
|---|---|---|---|
| LeNet | 2 / 2 | MNIST | CaffeOnSpark, TensorFlowOnSpark |
| SoftMax Regression | NA / NA | MNIST | TensorFlowOnSpark |
| CIFAR-10 Quick | 3 / 1 | CIFAR-10 | CaffeOnSpark, TensorFlowOnSpark, MMLSpark |
| VGG-16 | 13 / 3 | CIFAR-10 | BigDL |
| AlexNet | 5 / 3 | ImageNet | CaffeOnSpark |
| GoogLeNet | 22 / 0 | ImageNet | CaffeOnSpark |
| ResNet-50 | 53 / 1 | Synthetic | TensorFlow |
Are Current Benchmarks Sufficient?
• Current DL models and benchmarks are deep-learning-research oriented
• Example: Facebook Caffe2 takes 1 hour to train on ImageNet data (state of the art) [1]
• However, many system researchers are focused on improving the communication engine of deep learning frameworks
• A fast benchmark that models deep learning characteristics is highly desirable
[1] Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv preprint arXiv:1706.02677 (2017).
TensorFlow DL Micro-benchmarks for gRPC
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
Experimental Setup
• We have used two different clusters:

| | A: OSU-RI2-IB-EDR | B: SDSC-Comet-IB-FDR |
|---|---|---|
| Processors | Intel Broadwell, dual fourteen-core | Intel Haswell, dual twelve-core |
| Memory | 512 GB RAM | 128 GB RAM |
| Storage | 370 GB local NVMe-SSD | 320 GB local SSD |
| Network | InfiniBand EDR (some nodes with 40G Ethernet) | InfiniBand FDR and 10G Ethernet |

• Software stack used:

| Stack | Version | Cluster |
|---|---|---|
| gRPC | 1.5.0 | A, B |
| AR-gRPC (OSU RDMA gRPC) | Based on 1.5.0 | A, B |
| TensorFlow | 1.4, Python 2.7 | A |
Performance Benefits for AR-gRPC with Micro-Benchmark: Point-to-Point Latency
• AR-gRPC (OSU design) latency on SDSC-Comet-FDR:
– Up to 2.7x speedup in latency over Default gRPC (IPoIB) for small messages
– Up to 2.8x speedup in latency over Default gRPC (IPoIB) for medium messages
– Up to 2.5x speedup in latency over Default gRPC (IPoIB) for large messages
[Figure: point-to-point latency (µs) vs. payload size in three panels (2 B–8 KB, 16 KB–512 KB, 1 MB–8 MB), Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps)]
R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? (Under Review)
TF-gRPC-P2P-Latency
[Figure: point-to-point latency (ms) for Uniform, Random, and Skew payload generation schemes — Cluster A: Default gRPC (Ethernet 40G), Default gRPC (IPoIB-100Gbps), AR-gRPC (RDMA-100Gbps); Cluster B: Default gRPC (Ethernet 10G), Default gRPC (IPoIB-56Gbps), AR-gRPC (RDMA-56Gbps)]
• Cluster A: AR-gRPC (RDMA) reduces latency by 59% and 56% compared to Default gRPC over 40G Ethernet and IPoIB, respectively
• Cluster B: AR-gRPC (RDMA) reduces latency by 78% compared to Default gRPC over 10G Ethernet and by 69% compared to Default gRPC over IPoIB
TF-gRPC-P2P-Bandwidth
[Figure: point-to-point bandwidth (MB/s) for Uniform, Random, and Skew payload generation schemes on Clusters A and B, Default gRPC (Ethernet, IPoIB) vs. AR-gRPC (RDMA)]
• Cluster A: AR-gRPC (RDMA) achieves a 2.14x bandwidth increase compared to Default gRPC (IPoIB and Ethernet)
• Cluster B: AR-gRPC (RDMA) achieves 3.2x bandwidth compared to Default gRPC (IPoIB) for skewed data
TF-gRPC-PS-Throughput
[Figure: parameter-server throughput (RPCs/second) for Uniform, Random, and Skew payload generation schemes on Clusters A and B, Default gRPC (Ethernet, IPoIB) vs. AR-gRPC (RDMA)]
• Cluster A: AR-gRPC (RDMA) achieves a 3.4x throughput speedup compared to Default gRPC over IPoIB for the uniform scheme
• Cluster B: AR-gRPC (RDMA) achieves 3.6x throughput compared to Default gRPC over IPoIB for the uniform scheme
Performance Benefits for AR-gRPC with Micro-Benchmark: Fully-Connected Architecture (Mimics TensorFlow communication)
• AR-gRPC (OSU design) TensorFlow-mimic test on SDSC-Comet-FDR:
– Up to 60% reduction in average latency over Default gRPC (IPoIB)
– Up to 2.68x speedup over Default gRPC (IPoIB)
[Figure: average latency (ms) and calls/second for 2 MB–8 MB payloads, Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps)]
Performance Benefit for TensorFlow (ResNet-50)
• TensorFlow ResNet-50 performance evaluation on an IB EDR cluster:
– Up to 30% speedup over Default gRPC (IPoIB) for 8 GPUs
– Up to 34% speedup over Default gRPC (IPoIB) for 16 GPUs
– Up to 44% speedup over Default gRPC (IPoIB) for 24 GPUs
[Figure: images/second vs. batch size (16/32/64) on 4 nodes (8 GPUs), 8 nodes (16 GPUs), and 12 nodes (24 GPUs), comparing gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]
Performance Benefit for TensorFlow (Inception3)
• TensorFlow Inception3 performance evaluation on an IB EDR cluster:
– Up to 20% speedup over Default gRPC (IPoIB) for 8 GPUs
– Up to 34% speedup over Default gRPC (IPoIB) for 16 GPUs
– Up to 37% speedup over Default gRPC (IPoIB) for 24 GPUs
[Figure: images/second vs. batch size (16/32/64) on 4 nodes (8 GPUs), 8 nodes (16 GPUs), and 12 nodes (24 GPUs), comparing gRPC (IPoIB-100Gbps), Verbs (RDMA-100Gbps), MPI (RDMA-100Gbps), and AR-gRPC (RDMA-100Gbps)]
Outline
• Overview of gRPC, TensorFlow, and RDMA Technologies
• Accelerating gRPC and TensorFlow with RDMA
• Benchmarking gRPC and TensorFlow
• Performance Evaluation
• Conclusion
Concluding Remarks
• Presented an overview of gRPC and TensorFlow
• Presented an overview of RDMA and high-performance networks
• Discussed challenges in accelerating and benchmarking TensorFlow and gRPC
• RDMA can benefit DL workloads, as shown by our AR-gRPC and the corresponding enhanced TensorFlow
• Many other open issues need to be solved
• Will enable the Deep Learning community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
The 4th International Workshop on
High-Performance Big Data Computing (HPBDC)
HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, May 2018
Workshop Date: May 21st, 2018
Keynote Talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment
Six Regular Research Papers and Two Short Research Papers
Panel Topic: Which Framework is the Best for High-Performance Deep Learning:
Big Data Framework or HPC Framework?
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS’17
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS’16
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
{panda, luxi}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~luxi
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/