Spark on Supercomputers: A Tale of the Storage Hierarchy
Costin Iancu, Khaled Ibrahim – LBNL
Nicholas Chaimov – U. Oregon
Apache Spark
• Developed for cloud environments
• Specialized runtime provides for
– Performance, elastic parallelism, resilience
• Programming productivity through
– HLL front-ends (Scala, R, SQL), multiple domain-specific libraries:
Streaming, SparkSQL, SparkR, GraphX, Splash, MLLib, Velox
• HPC has huge datasets, but Spark has little penetration there
Apache Spark
• In-memory Map-Reduce framework
• The central abstraction is the Resilient Distributed Dataset (RDD)
• Data movement is important
– Lazy, on-demand
– Horizontal (node-to-node): shuffle/Reduce
– Vertical (node-to-storage): Map/Reduce
[Diagram: example lineage graph over three partitions (p1–p3): textFile → flatMap → map → reduceByKey (local) form Stage 0; reduceByKey (global) forms Stage 1; together they make up Job 0]
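Not from the slides — a minimal Scala sketch of a pipeline with exactly this shape (a word count); the paths and app name are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    val counts = sc.textFile("input.txt")   // illustrative path; yields partitions p1..pN
      .flatMap(line => line.split(" "))     // flatMap: lines -> words
      .map(word => (word, 1))               // map: word -> (word, 1)
      .reduceByKey(_ + _)                   // local (map-side) combine, shuffle, global reduce

    counts.saveAsTextFile("counts")         // action: triggers Job 0
    sc.stop()
  }
}
```

Everything up to reduceByKey is pipelined over each partition within Stage 0; the shuffle introduced by reduceByKey starts Stage 1, and the final action launches Job 0.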
Data Centers/Clouds
– Node-local storage; assumes all disk operations are equal
– Disk I/O optimized for latency
– Network optimized for bandwidth
HPC
– Global file system; asymmetry expected
– Disk I/O optimized for bandwidth
– Network optimized for latency
[Diagram: three node architectures — Cloud: commodity CPU, memory, HDD/SSD, NIC; Data appliance: server CPU, large fast memory, fast SSD; HPC: server CPU, fast memory, a combination of fast intermediate storage and slower backend storage]
[Diagram: the two evaluation systems mapped onto these architectures]
Comet (Dell): 2.5 GHz Intel Haswell, 24 cores/node; 128 GB/1.5 TB DDR4; 320 GB of local SSD; 56 Gbps FDR InfiniBand
Cori (Cray XC40): 2.3 GHz Intel Haswell, 32 cores/node; 128 GB DDR4; Cray Aries interconnect; Cray DataWarp burst buffer (1.8 PB at 1.7 TB/s) as intermediate storage; Sonexion Lustre (30 PB) as backend storage
Scaling Spark on Cray XC40
(It’s all about file system metadata)
Not ALL I/O is Created Equal
[Chart: GroupByTest I/O components on Cori — time per operation (microseconds) vs. nodes (1–16) for Lustre, striped burst buffer, and private burst buffer open, read, and write operations; shuffle opens grow from 9,216 to 36,864; 147,456; 589,824; and 2,359,296 as nodes scale from 1 to 16]
# shuffle opens = # shuffle reads = O(cores²)
Time per open increases with scale, unlike read/write
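A quick sanity check of the quadratic claim (our arithmetic, under the standard assumption that every reduce task opens one shuffle segment per map task and that both task counts grow linearly with the core count):

$$
\#\text{opens} = M \times R = \Theta(\text{cores}^2),
$$

so doubling the node count quadruples the number of opens — matching the measured sequence above, where each value is exactly 4x the previous one.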
I/O Variability is HIGH
fopen is a problem:
• Mean time is 23X larger than on local SSD
• Variability is 14,000X higher
[Box plots: READ and fopen time distributions]
Improving I/O Performance
Eliminate file metadata operations
1. Keep files open (cache fopen) — see the sketch after this list
• Surprising 10%–20% improvement on the data appliance
• Argues for user-level file systems; eliminates serialized system calls
2. Use a file system backed by a single Lustre file for shuffle
• This should also help on systems with local SSDs
3. Use containers
• Speeds up startup, up to 20% end-to-end performance improvement
• Solutions need to be used in conjunction
– e.g., fopen calls issued by the Parquet reader
Plenty of details in “Scaling Spark on HPC Systems,” HPDC 2016
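Option 1 in sketch form — not the authors' implementation, just an illustration of caching file handles so shuffle reads skip the metadata-heavy open; ShuffleFileCache and the example path are hypothetical:

```scala
import java.io.RandomAccessFile
import scala.collection.concurrent.TrieMap

// Hypothetical helper: keep shuffle files open instead of paying a
// metadata-server round trip on every read.
object ShuffleFileCache {
  private val handles = TrieMap.empty[String, RandomAccessFile]

  // Return a cached handle, opening the file only on first use.
  def get(path: String): RandomAccessFile =
    handles.getOrElseUpdate(path, new RandomAccessFile(path, "r"))

  // Close everything at the end of the stage/job.
  def closeAll(): Unit = {
    handles.values.foreach(_.close())
    handles.clear()
  }
}

// Usage (illustrative): read a shuffle block without re-opening the file.
// val f = ShuffleFileCache.get("/mnt/shuffle/shuffle_0_12_0.data")
// f.seek(blockOffset); f.readFully(buffer)
```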
Scalability
[Chart: Cori, GroupBy, weak scaling — time to job completion (s) vs. cores (32 to 10,240) for Ramdisk, mounted file, and Lustre; annotated ratios of 6x, 12x, 14x, 19x, 33x, and 61x]
At 10,240 cores, only 1.6x slower than RAMdisk (in-memory execution)
We scaled Spark from O(100) up to O(10,000) cores
File-Backed Filesystems
• NERSC Shifter (container infrastructure for HPC)
– Compatible with Docker images
– Integrated with Slurm scheduler
– Can control mounting of filesystems within container
• Per-Node Cache
– File-backed filesystem mounted within each node’s container instance at a common path (/mnt)
– --volume=$SCRATCH/backingFile:/mnt:perNodeCache=size=100G
– A file for each node is created and stored on the backend Lustre filesystem
– Single file open on Lustre — intermediate data file opens are kept local (see the configuration sketch below)
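One way to point Spark’s intermediate data at the per-node cache — a minimal sketch assuming the file-backed filesystem is mounted at /mnt as above; this is ordinary Spark configuration (spark.local.dir), not a Shifter-specific API:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Direct shuffle and spill files to the node-local, file-backed mount so
// their open/create operations never hit the Lustre metadata server.
val conf = new SparkConf()
  .setAppName("ShuffleOnPerNodeCache")
  .set("spark.local.dir", "/mnt")         // per-node cache mounted by Shifter
  .set("spark.shuffle.compress", "true")  // optional: reduce shuffle volume

val sc = new SparkContext(conf)
// ... run jobs as usual; intermediate files now live under /mnt on each node
```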
Now the fun part
Architectural Performance Considerations
The Supercomputer (Cori) vs The Data Appliance (Comet)
[Diagram: the Comet/Cori hardware comparison shown earlier]
CPU, Memory, Network, Disk?
• Multiple extensions to Blocked Time Analysis (Ousterhout, 2015)
• BTA indicated that CPU dominates
– Network 2%, disk 19%
• Concentrate on scaling out, weak scaling studies
– Spark-perf, BigDataBenchmark, TPC-DS, TeraSort
• Interested in determining the right ratio and machine balance for
– CPU, memory, network, disk …
• Spark 2.0.2 & Spark-RDMA 0.9.4 from Ohio State University,
Hadoop 2.6
Storage hierarchy and performance
Global Storage Matches Local Storage
[Chart: Cray XC40 – TeraSort (100 GB/node) — time (ms) split into App, JVM, RW Input, RW Shuffle, Open Input, and Open Shuffle for Lustre, Mount+Pool, and SSD+IB configurations at 1, 5, and 20 nodes (32 cores each); annotations mark disk+network latency/bandwidth and metadata overhead]
• Variability matters more than advertised latency and bandwidth numbers
• Storage performance is obscured/mitigated by the network, due to the client/server design of the BlockManager
• At small scale, local storage is slightly faster
• At large scale, global storage is faster
Global Storage Matches Local Storage
[Chart: average across MLLib benchmarks — normalized time split into App, Fetch, and JVM at 1 and 16 nodes on Comet (RDMA, Singularity, 24 cores) and Cori (Shifter, 24 cores); Fetch accounts for 11.8% and 12.5% of time on the two systems]
Intermediate Storage Hurts Performance
[Chart: TPC-DS weak scaling — time (s) split into App, Fetch, and JVM at 1–64 nodes for Cori Shifter on Lustre, the striped burst buffer, and the private burst buffer; the burst-buffer configurations are 19.4% and 86.8% slower on average than Lustre]
(Without our optimizations, intermediate storage scaled better)
Networking performance
Latency or Bandwidth?
[Chart: Singular Value Decomposition — time (s) split into App, Fetch, and JVM at 1–64 nodes on Comet (Singularity), Comet (RDMA, Singularity), and Cori (Shifter, 24 cores)]
• 10X in bandwidth; latency differences matter
• Can hide 2X differences
• Average message size for spark-perf is 43 B
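A rough back-of-envelope check (our estimate, not from the slides; the microsecond latency figure is a typical order of magnitude assumed here): at FDR InfiniBand’s 56 Gbps, putting a 43-byte message on the wire costs only nanoseconds, so small-message time is set by per-message latency and software overhead rather than bandwidth:

$$
t_{\text{wire}} = \frac{43 \times 8\ \text{bits}}{56 \times 10^{9}\ \text{bit/s}} \approx 6\ \text{ns}
\;\ll\; t_{\text{latency}} \approx 1\text{–}10\ \mu\text{s}.
$$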
Network Matters at Scale
[Chart: average across benchmarks — time (s) split into App, Fetch, and JVM at 1–64 nodes on Cori Shifter with 24 and 32 cores per node; annotation: 44%]
CPU
More cores or better memory?
• Need more cores to hide disk and network latency at scale
• Preliminary experiences with Intel KNL are bad
– Too much concurrency
– Not enough integer throughput
• Execution does not seem to be memory-bandwidth limited
[Chart: the same average-across-benchmarks breakdown (App, Fetch, JVM) on Cori Shifter with 24 vs. 32 cores per node]
Summary/Conclusions
• Latency and bandwidth are important, but not dominant
– Variability more important than marketing numbers
• Network time dominates at scale
– Network and disk time is misattributed as CPU time
• Comet matches Cori up to 512 cores; Cori is twice as fast at 2,048 cores
– Spark can run well on global storage
• Global storage opens the possibility of a global name space — no more client/server
Acknowledgement
Work partially supported by
Intel Parallel Computing Center: Big Data Support
for HPC
Thank You.
Questions, collaborations, free software
cciancu@lbl.gov
kzibrahim@lbl.gov
nchaimov@uoregon.edu
Burst Buffer Setup
• Cray XC30 at NERSC (Edison): 2.4 GHz IvyBridge - Global
• Cray XC40 at NERSC (Cori): 2.3 GHz Haswell + Cray
DataWarp
• Comet at SDSC: 2.5GHz Haswell, InfiniBand FDR, 320 GB
SSD, 1.5TB memory - LOCAL
