12th ANNUAL WORKSHOP 2016
HPC STORAGE AND IO TRENDS
AND WORKFLOWS
Gary Grider, Division Leader, HPC Division
April 4, 2016
Los Alamos National Laboratory
LA-UR-16-20184
OpenFabrics Alliance Workshop 2016
EIGHT DECADES OF PRODUCTION WEAPONS
COMPUTING TO KEEP THE NATION SAFE
CM-2, IBM Stretch, CDC, Cray 1, Cray X/Y, Maniac, CM-5, SGI Blue Mountain, DEC/HP Q, IBM Cell Roadrunner, Cray XE Cielo, Cray Intel KNL Trinity, Ziggy D-Wave, Crossroads
OpenFabrics Alliance Workshop 2016
ECONOMICS HAVE SHAPED OUR WORLD
The beginning of storage layer proliferation circa 2009
§  Economic modeling for a large burst of data from memory shows bandwidth / capacity better matched for solid-state storage near the compute nodes
[Chart: archive hardware/media cost projection, 2012-2025 ("Hdwr/media cost 3 mem/mo 10% FS"); legend: new servers, new disk, new cartridges, new drives, new robots; y-axis $0-$25,000,000]
§  Economic modeling for archive shows bandwidth / capacity better matched for disk (a rough cost sketch comparing the two cases follows)
OpenFabrics Alliance Workshop 2016
THE HOOPLA PARADE CIRCA 2014
DataWarp
OpenFabrics Alliance Workshop 2016
WHAT ARE ALL THESE STORAGE LAYERS?
WHY DO WE NEED ALL THESE STORAGE LAYERS?
§  Why
• BB: Economics (disk bw/iops too expensive)
• PFS: Maturity and BB capacity too small
• Campaign: Economics (tape bw too expensive)
• Archive: Maturity and we really do need a “forever”
HPC before Trinity: Memory, Parallel File System, Archive
HPC after Trinity: Memory (DRAM), Burst Buffer, Parallel File System (Lustre), Campaign Storage, Archive (HPSS parallel tape)
•  Memory: 1-2 PB/sec; residence – hours; overwritten – continuous
•  Burst Buffer: 4-6 TB/sec; residence – hours; overwritten – hours
•  Parallel File System: 1-2 TB/sec; residence – days/weeks; flushed – weeks
•  Campaign Storage: 100-300 GB/sec; residence – months-year; flushed – months-year
•  Archive: 10s of GB/sec (parallel tape); residence – forever
(A drain-time sketch based on these figures follows below.)
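Those bandwidth figures imply very different drain times as a full-memory checkpoint moves down the stack, which is what drives the residence times above. A small sketch of that arithmetic, using round numbers from this deck (2 PB of Trinity-class memory and the per-tier bandwidths quoted above) as assumptions:

```python
# Rough drain-time arithmetic for a checkpoint moving down the storage stack.
# Sizes and bandwidths are round numbers taken from the slides, not measurements.

hops = [
    # (hop, data moved in TB, bandwidth in TB/s)
    ("memory -> burst buffer",     2000.0, 5.0),    # ~2 PB DRAM at 4-6 TB/s
    ("burst buffer -> PFS",        2000.0, 1.5),    # 1-2 TB/s
    ("PFS -> campaign storage",    2000.0, 0.2),    # 100-300 GB/s
    ("campaign -> archive (tape)", 2000.0, 0.02),   # 10s of GB/s
]

for hop, size_tb, bw_tbs in hops:
    minutes = size_tb / bw_tbs / 60.0
    print(f"{hop:30s} ~{minutes:8.1f} min")
# The further down the stack, the longer a full-memory image takes to land,
# which is why residence stretches from hours near memory to forever at tape.
```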
OpenFabrics Alliance Workshop 2016
BURST BUFFERS ARE AMONG US
§  Many instances in the world now, some at multi-PB / multi-TB/s scale
§  Uses
•  Checkpoint/Restart, Analysis, Out of core
§  Access
•  Largely POSIX like, with JCL for stage in, stage out based on job exit health, etc.
§  A little about Out of Core
•  Before HBM we were thinking DDR ~50-100 GB/sec and NVM 2-10 GB/sec (10X
penalty and durability penalty)
•  Only 10X penalty from working set speed to out of core speed
•  After HBM we have HBM 500-1000 GB/sec, DDR 50-100 GB/sec, and NVS 2-10 GB/sec (100X penalty from HBM to NVS; the arithmetic is sketched after this list)
•  Before HBM, out of core seemed like it might help some read mostly apps
•  After HBM, using DDR for working set seems limiting, but useful for some
•  Using NVS for read mostly out of core seems limiting too, but useful for some
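A minimal sketch of the penalty arithmetic in the bullets above, using the low ends of the GB/sec ranges quoted on this slide as assumed round numbers:

```python
# Bandwidth-penalty arithmetic for out-of-core working sets.
# Rates are the low ends of the ranges quoted on this slide (assumed round numbers).
rates_gbs = {"HBM": 500.0, "DDR": 50.0, "NVM": 5.0}

for working_set in ("DDR", "HBM"):                  # before HBM vs. after HBM
    for backing, bw in rates_gbs.items():
        if bw < rates_gbs[working_set]:
            penalty = rates_gbs[working_set] / bw
            print(f"working set in {working_set}, spill to {backing}: ~{penalty:.0f}x penalty")
# Before HBM: DDR -> NVM is ~10x, so read-mostly out-of-core looked tolerable.
# After HBM: HBM -> NVM is ~100x (and HBM -> DDR ~10x), which narrows the cases
# where out-of-core still pays off.
```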
OpenFabrics Alliance Workshop 2016
CAMPAIGN STORAGE SYSTEMS ARE AMONG US TOO – MARFS
MASSIVELY SCALABLE POSIX-LIKE FILE SYSTEM NAME SPACE
OVER CLOUD STYLE ERASURE PROTECTED OBJECTS
§  Background
•  Object Systems provide massive scaling and efficient erasure
•  Friendly to applications, not to people. People need a name space.
•  Huge Economic appeal (erasure enables use of inexpensive storage)
•  POSIX name space is powerful but has issues scaling
§  The challenges
•  Mismatch of POSIX and Object metadata, security, read/write size/semantics
•  No update in place with Objects
•  Scale to Trillions of files/directories and Billions of files in a directory
•  100’s of GB/sec but with years data longevity
§  Looked at
•  GPFS, Lustre, Panasas, OrangeFS, Cleversafe/Scality/EMC ViPR/Ceph/Swift, Glusterfs,
Nirvana/Storage Resource Broker/IRODS, Maginatics, Camlistore, Bridgestore, Avere, HDFS
§  Experiences
•  Pilot scaled to 3PB and 3 GB/sec,
•  First real deployment scaling to 30PB and 30 GB/sec
§  Next a demo of scale to trillions and billions
Current deployment uses N GPFSs for MDS and a Scality software-only erasure object store.
Be nice to the Object system: pack many small files into one object, break up huge files into multiple objects (a hypothetical packing sketch follows).
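The packing rule above amounts to simple bucketing. The sketch below is a hypothetical illustration of that policy; the thresholds, names, and return format are invented and are not MarFS internals.

```python
# Hypothetical file-to-object packing policy in the spirit of the MarFS rule:
# pack small files together, chunk huge files. Thresholds are invented.

PACK_BELOW = 1 << 20          # files smaller than 1 MiB get packed together
CHUNK_SIZE = 1 << 30          # files larger than 1 GiB get split into 1 GiB objects

def plan_objects(files):
    """files: list of (name, size_bytes). Returns a list of (object_id, member_names)."""
    objects, pack, pack_bytes = [], [], 0
    for name, size in files:
        if size < PACK_BELOW:
            pack.append(name)
            pack_bytes += size
            if pack_bytes >= CHUNK_SIZE:             # seal a full pack object
                objects.append((f"pack.{len(objects)}", list(pack)))
                pack, pack_bytes = [], 0
        elif size > CHUNK_SIZE:                      # break huge files into chunks
            nchunks = -(-size // CHUNK_SIZE)         # ceiling division
            objects += [(f"{name}.chunk{i}", [name]) for i in range(nchunks)]
        else:
            objects.append((name, [name]))           # medium files map 1:1
    if pack:
        objects.append((f"pack.{len(objects)}", pack))
    return objects

print(plan_objects([("tiny.dat", 4096), ("huge.h5", 5 << 30), ("mid.nc", 200 << 20)]))
```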
OpenFabrics Alliance Workshop 2016
MARFS
METADATA SCALING NXM
DATA SCALING BY X
OpenFabrics Alliance Workshop 2016
ISN’T THAT TOO MANY LAYERS JUST FOR
STORAGE?
§  If the Burst Buffer does its job very well (and indications are
capacity of in system NV will grow radically) and campaign
storage works out well (leveraging cloud), do we need a
parallel file system anymore, or an archive? Maybe just a
bw/iops tier and a capacity tier.
§  Too soon to say, seems feasible longer term
Today: Memory, Burst Buffer, Parallel File System (PFS), Campaign Storage, Archive
Possible future: Memory, IOPS/BW Tier, Capacity Tier (with the PFS and Archive perhaps dropping away)
(Diagram courtesy of John Bent, EMC)
Factoids (times are changing!):
•  LANL HPSS = 53 PB and 543 M files
•  Trinity: 2 PB memory, 4 PB flash (11% of HPSS) and 80 PB PFS (150% of HPSS)
•  Crossroads may have 5-10 PB memory, 40 PB solid state (100% of HPSS) with data residency measured in days or weeks
•  We would never have contemplated more in-system storage than our archive a few years ago
OpenFabrics Alliance Workshop 2016
BURST BUFFER -> PERF/IOPS TIER
WHAT WOULD NEED TO CHANGE?
§ Burst Buffers are designed for data durations of hours to days. If in-system solid-state storage is to be used for months-long durations, many things are missing.
• Protecting Burst Buffers with RAID/Erasure is NOT economical for checkpoint and short-term use because you can always go back to a lower-tier copy, but longer-duration data requires protection. Likely you would need a software RAID/Erasure that is distributed on the supercomputer over its fabric (overhead arithmetic is sketched after this list)
• Long term protection of the name space is also needed
• QoS issues are also more acute
• With much more longer term data, locality based scheduling is also
perhaps more important
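A distributed software RAID/erasure layer buys durability at the cost of extra capacity and extra fabric traffic on every write. The toy sketch below shows that overhead arithmetic; the k+m geometries are arbitrary examples, not a proposed design.

```python
# Toy overhead arithmetic for protecting a long-lived in-system flash tier
# with distributed k+m erasure coding; geometries are arbitrary examples.

def erasure_overhead(k, m):
    """k data shards + m parity shards per stripe."""
    capacity_overhead = (k + m) / k          # raw bytes stored per user byte
    write_amplification = (k + m) / k        # extra fabric traffic on every write
    tolerated_failures = m
    return capacity_overhead, write_amplification, tolerated_failures

for k, m in [(10, 2), (8, 3), (4, 2)]:
    cap, wamp, f = erasure_overhead(k, m)
    print(f"{k}+{m}: {cap:.2f}x capacity, {wamp:.2f}x write traffic, survives {f} shard losses")
# Checkpoints can skip this cost (a lower-tier copy exists); months-resident data
# cannot, so the protection, and its fabric traffic, must live on the machine itself.
```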
OpenFabrics Alliance Workshop 2016
CAMPAIGN -> CAPACITY TIER
WHAT WOULD NEED TO CHANGE CAMPAIGN?
§ Campaign data duration is targeted at a few years, but a true capacity tier may require more flexible, perhaps much longer, retention periods.
• Probably need power-managed storage devices that match the parallel BW needed to move around PB-sized data sets (a rough sizing sketch follows this slide's bullets)
§ Scaling out to Exabytes, Trillions of files, etc.
• Much of this work is underway
§ Maturing of the solution space with multiple at least
partially vendor supported solutions is needed as
well.
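As a rough feel for the power-managed bandwidth requirement, the sketch below estimates how many drives must be spinning to move a PB-scale data set within a target window; the per-drive streaming rate is an assumption for illustration.

```python
# Rough sizing: how many power-managed disks must be spun up to move a
# PB-scale data set within a target window. Per-drive rate is an assumption.

def drives_to_spin(dataset_pb, window_hours, per_drive_mbs=200.0):
    dataset_mb = dataset_pb * 1e9                 # 1 PB ~ 1e9 MB (decimal units)
    needed_mbs = dataset_mb / (window_hours * 3600)
    return needed_mbs, needed_mbs / per_drive_mbs

for pb, hours in [(1, 24), (5, 24), (5, 72)]:
    bw, n = drives_to_spin(pb, hours)
    print(f"{pb} PB in {hours:3d} h: ~{bw/1000:6.1f} GB/s aggregate, ~{n:5.0f} drives spinning")
# A capacity tier can keep most devices powered down, but it still needs to wake
# enough of them in parallel to stream PB-sized data sets in and out.
```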
OpenFabrics Alliance Workshop 2016
OTHER CONSIDERATIONS
§  New interfaces to storage that preserve/leverage structure/
semantics of data (much of this is being/was explored in the
DOE Storage FFwd)
•  DAOS like concepts with name spaces friendly to science application needs
•  Async/transactional/Versioning to match better with future async programming
models
§  The concept of a loadable storage stack (being worked in MarFS
and EMC)
•  It would be nice if the Perf/IOPS tier could be directed to “check out” a “problem” to
work on for days/weeks. Think PBs of data and billions of metadata entries of
various shapes. (EMC calls this dynamically loadable name spaces)
•  MarFS metadata demonstration of Trillions of files in a file system and Billions of files
in a directory will be an example of a “loadable” name space
•  CMU’s IndexFS->BatchFS->DeltaFS stores POSIX metadata in thousands of distributed KVSs, which makes moving/restarting with billions of metadata entries simple and easy
•  Check out a billion metadata entries, make modifications, check back in as a new version or merge back into the Capacity Tier master name space (a toy check-out/check-in sketch follows)
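A toy model of the check-out / modify / check-in idea as a versioned batch of name-space entries is sketched below; the class, method names, and metadata fields are hypothetical and do not reflect the MarFS, IndexFS, or DeltaFS interfaces.

```python
# Toy model of "check out a slice of the namespace, mutate it, check it back in
# as a new version". Structures and names are hypothetical illustrations only.
import copy

class NamespaceStore:
    def __init__(self):
        self.versions = {0: {}}              # version -> {path: metadata dict}
        self.head = 0

    def checkout(self, prefix):
        """Return a mutable snapshot of all entries under prefix."""
        snap = self.versions[self.head]
        return {p: copy.deepcopy(md) for p, md in snap.items() if p.startswith(prefix)}

    def checkin(self, batch):
        """Merge a modified batch back in as a new immutable version."""
        new = dict(self.versions[self.head])
        new.update(batch)
        self.head += 1
        self.versions[self.head] = new
        return self.head

store = NamespaceStore()
store.checkin({"/campaign/run42/out.h5": {"size": 1 << 40, "mode": 0o640}})
work = store.checkout("/campaign/run42/")    # the "problem" handed to the perf/IOPS tier
work["/campaign/run42/out.h5"]["mode"] = 0o440
print("new version:", store.checkin(work))
```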
OpenFabrics Alliance Workshop 2016
BUT, THAT IS JUST ECONOMIC ARM WAVING.
How will the economics combine with the apps/machine/environmental needs?
Enter Workflows
OpenFabrics Alliance Workshop 2016
WORKFLOWS TO THE RESCUE?
§  What did I learn from the workflow-fest circa 04/2015?
• There are 57 ways to interpret in situ ☺
• There are more workflow tools than requirements documents
• There is no common taxonomy that can be used to reason about data flow/workflow for architects or programmers ☹
§  What did I learn from FY15 Apex Vendor meetings
• Where do you want your flash, how big, how fast, how durable
• Where do you want your SCM, how big, how fast
• Do you want it near the node or near the disk or in the network
• --- YOU REALLY DON’T WANT ME TO TELL YOU WHERE TO
PUT YOUR NAND/SCM ---
§ Can workflows help us beyond some automation tools?
OpenFabrics Alliance Workshop 2016
IN SITU / POST / ACTIVE STORAGE / ACTIVE ARCHIVE
ANALYSIS WORK FLOWS IN WORK FLOWS
OpenFabrics Alliance Workshop 2016
WORKFLOWS: POTENTIAL TAXONOMY
Derived from Dave Montoya (Circa 05/15)
V2 taxonomy attempt
OpenFabrics Alliance Workshop 2016
WORKFLOWS CAN HELP US BEYOND SOME AUTOMATION TOOLS:
WORKFLOWS ENTER THE REALM OF RFP/PROCUREMENT
§  Trinity/Cori
• We didn’t specify flops, we specified running a bigger app faster
• We wanted it to go forward 90% of the time
• We didn’t specify how much burst buffer, or speeds/feeds
• Vendors didn’t like this at first but realized it was degrees of freedom we
were giving them
§  Apex Crossroads/NERSC+1
• Still no flops ☺
• Still want it to go forward a large % of the time
• Vendors ask: where and how much flash/nvram/pixie dust do we put on node, in network, in ionode, near storage, blah blah
• We don’t care; we want to get the most of these workflows through the system in 6 months
V3 Taxonomy Attempt: the next slides represent work done by the APEX Workflow Team
OpenFabrics Alliance Workshop 2016
V3 TAXONOMY FROM APEX PROCUREMENT DOCS
A SIMULATION PIPELINE
OpenFabrics Alliance Workshop 2016
V3 TAXONOMY FROM APEX PROCUREMENT DOCS
A HIGH THROUGHPUT/UQ PIPELINE
OpenFabrics Alliance Workshop 2016
WORKFLOW DATA THAT GOES WITH THE
WORKFLOW DIAGRAMS
OpenFabrics Alliance Workshop 2016
SUMMARY
§  Economic modeling/analysis is a powerful tool for guiding our
next steps
§  Given the growing footprint of data management/movement in the cost and pain of HPC, workflows may grow in importance and may be more useful than ever in planning for new machine architectures, procurement, and integration.
§  Combining Economics and Workflows helps paint a picture of
the future for all of us.
12th ANNUAL WORKSHOP 2016
THANK YOU
AND
RIP PFS
