12th ANNUAL WORKSHOP 2016
HPC STORAGE AND IO TRENDS
AND WORKFLOWS
Gary Grider, Division Leader, HPC Division
April 4, 2016
Los Alamos National Laboratory
LA-UR-16-20184
OpenFabrics Alliance Workshop 2016
EIGHT DECADES OF PRODUCTION WEAPONS
COMPUTING TO KEEP THE NATION SAFE
CM-2, IBM Stretch, CDC, Cray 1, Cray X/Y, Maniac, CM-5, SGI Blue Mountain, DEC/HP Q, IBM Cell Roadrunner, Cray XE Cielo, Cray Intel KNL Trinity, Ziggy D-Wave, Crossroads
OpenFabrics Alliance Workshop 2016
ECONOMICS HAVE SHAPED OUR WORLD
The beginning of storage layer proliferation circa 2009
§  Economic modeling for a large burst of data from memory shows bandwidth / capacity better matched for solid-state storage near the compute nodes
[Chart: archive hardware/media cost projection, 2012-2025 ("Hdwr/media cost 3 mem/mo 10% FS"); legend: new servers, new disk, new cartridges, new drives, new robots; y-axis $0-$25,000,000]
§  Economic modeling for archive shows bandwidth / capacity better matched for disk (a rough cost sketch comparing the two cases follows)
OpenFabrics Alliance Workshop 2016
THE HOOPLA PARADE CIRCA 2014
DataWarp
OpenFabrics Alliance Workshop 2016
WHAT ARE ALL THESE STORAGE LAYERS?
WHY DO WE NEED ALL THESE STORAGE LAYERS?
§  Why
• BB: Economics (disk bw/iops too expensive)
• PFS: Maturity and BB capacity too small
• Campaign: Economics (tape bw too expensive)
• Archive: Maturity and we really do need a “forever”
HPC before Trinity: Memory, Parallel File System, Archive
HPC after Trinity: Memory (DRAM), Burst Buffer, Parallel File System (Lustre), Campaign Storage, Archive (HPSS parallel tape)
•  Memory: 1-2 PB/sec; residence – hours; overwritten – continuous
•  Burst Buffer: 4-6 TB/sec; residence – hours; overwritten – hours
•  Parallel File System: 1-2 TB/sec; residence – days/weeks; flushed – weeks
•  Campaign Storage: 100-300 GB/sec; residence – months-year; flushed – months-year
•  Archive: 10s of GB/sec (parallel tape); residence – forever
(A drain-time sketch based on these figures follows below.)
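Those bandwidth figures imply very different drain times as a full-memory checkpoint moves down the stack, which is what drives the residence times above. A small sketch of that arithmetic, using round numbers from this deck (2 PB of Trinity-class memory and the per-tier bandwidths quoted above) as assumptions:

```python
# Rough drain-time arithmetic for a checkpoint moving down the storage stack.
# Sizes and bandwidths are round numbers taken from the slides, not measurements.

hops = [
    # (hop, data moved in TB, bandwidth in TB/s)
    ("memory -> burst buffer",     2000.0, 5.0),    # ~2 PB DRAM at 4-6 TB/s
    ("burst buffer -> PFS",        2000.0, 1.5),    # 1-2 TB/s
    ("PFS -> campaign storage",    2000.0, 0.2),    # 100-300 GB/s
    ("campaign -> archive (tape)", 2000.0, 0.02),   # 10s of GB/s
]

for hop, size_tb, bw_tbs in hops:
    minutes = size_tb / bw_tbs / 60.0
    print(f"{hop:30s} ~{minutes:8.1f} min")
# The further down the stack, the longer a full-memory image takes to land,
# which is why residence stretches from hours near memory to forever at tape.
```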
OpenFabrics Alliance Workshop 2016
BURST BUFFERS ARE AMONG US
§  Many instances in the world now, some at multi-PB / multi-TB/s scale
§  Uses
•  Checkpoint/Restart, Analysis, Out of core
§  Access
•  Largely POSIX like, with JCL for stage in, stage out based on job exit health, etc.
§  A little about Out of Core
•  Before HBM we were thinking DDR ~50-100 GB/sec and NVM 2-10 GB/sec (10X
penalty and durability penalty)
•  Only 10X penalty from working set speed to out of core speed
•  After HBM we have HBM 500-1000 GB/sec, DDR 50-100 GB/sec, and NVS 2-10 GB/sec (100X penalty from HBM to NVS; the arithmetic is sketched after this list)
•  Before HBM, out of core seemed like it might help some read mostly apps
•  After HBM, using DDR for working set seems limiting, but useful for some
•  Using NVS for read mostly out of core seems limiting too, but useful for some
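A minimal sketch of the penalty arithmetic in the bullets above, using the low ends of the GB/sec ranges quoted on this slide as assumed round numbers:

```python
# Bandwidth-penalty arithmetic for out-of-core working sets.
# Rates are the low ends of the ranges quoted on this slide (assumed round numbers).
rates_gbs = {"HBM": 500.0, "DDR": 50.0, "NVM": 5.0}

for working_set in ("DDR", "HBM"):                  # before HBM vs. after HBM
    for backing, bw in rates_gbs.items():
        if bw < rates_gbs[working_set]:
            penalty = rates_gbs[working_set] / bw
            print(f"working set in {working_set}, spill to {backing}: ~{penalty:.0f}x penalty")
# Before HBM: DDR -> NVM is ~10x, so read-mostly out-of-core looked tolerable.
# After HBM: HBM -> NVM is ~100x (and HBM -> DDR ~10x), which narrows the cases
# where out-of-core still pays off.
```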
OpenFabrics Alliance Workshop 2016
CAMPAIGN STORAGE SYSTEMS ARE AMONG US TOO – MARFS
MASSIVELY SCALABLE POSIX-LIKE FILE SYSTEM NAME SPACE
OVER CLOUD STYLE ERASURE PROTECTED OBJECTS
§  Background
•  Object Systems provide massive scaling and efficient erasure
•  Friendly to applications, not to people. People need a name space.
•  Huge Economic appeal (erasure enables use of inexpensive storage)
•  POSIX name space is powerful but has issues scaling
§  The challenges
•  Mismatch of POSIX and Object metadata, security, read/write size/semantics
•  No update in place with Objects
•  Scale to Trillions of files/directories and Billions of files in a directory
•  100’s of GB/sec but with years data longevity
§  Looked at
•  GPFS, Lustre, Panasas, OrangeFS, Cleversafe/Scality/EMC ViPR/Ceph/Swift, Glusterfs,
Nirvana/Storage Resource Broker/IRODS, Maginatics, Camlistore, Bridgestore, Avere, HDFS
§  Experiences
•  Pilot scaled to 3PB and 3 GB/sec,
•  First real deployment scaling to 30PB and 30 GB/sec
§  Next a demo of scale to trillions and billions
Current deployment uses N GPFSs for MDS and a Scality software-only erasure object store.
Be nice to the Object system: pack many small files into one object, break up huge files into multiple objects (a hypothetical packing sketch follows).
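The packing rule above amounts to simple bucketing. The sketch below is a hypothetical illustration of that policy; the thresholds, names, and return format are invented and are not MarFS internals.

```python
# Hypothetical file-to-object packing policy in the spirit of the MarFS rule:
# pack small files together, chunk huge files. Thresholds are invented.

PACK_BELOW = 1 << 20          # files smaller than 1 MiB get packed together
CHUNK_SIZE = 1 << 30          # files larger than 1 GiB get split into 1 GiB objects

def plan_objects(files):
    """files: list of (name, size_bytes). Returns a list of (object_id, member_names)."""
    objects, pack, pack_bytes = [], [], 0
    for name, size in files:
        if size < PACK_BELOW:
            pack.append(name)
            pack_bytes += size
            if pack_bytes >= CHUNK_SIZE:             # seal a full pack object
                objects.append((f"pack.{len(objects)}", list(pack)))
                pack, pack_bytes = [], 0
        elif size > CHUNK_SIZE:                      # break huge files into chunks
            nchunks = -(-size // CHUNK_SIZE)         # ceiling division
            objects += [(f"{name}.chunk{i}", [name]) for i in range(nchunks)]
        else:
            objects.append((name, [name]))           # medium files map 1:1
    if pack:
        objects.append((f"pack.{len(objects)}", pack))
    return objects

print(plan_objects([("tiny.dat", 4096), ("huge.h5", 5 << 30), ("mid.nc", 200 << 20)]))
```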
OpenFabrics Alliance Workshop 2016
MARFS
METADATA SCALING NXM
DATA SCALING BY X
OpenFabrics Alliance Workshop 2016
ISN’T THAT TOO MANY LAYERS JUST FOR
STORAGE?
§  If the Burst Buffer does its job very well (and indications are
capacity of in system NV will grow radically) and campaign
storage works out well (leveraging cloud), do we need a
parallel file system anymore, or an archive? Maybe just a
bw/iops tier and a capacity tier.
§  Too soon to say, seems feasible longer term
Today: Memory, Burst Buffer, Parallel File System (PFS), Campaign Storage, Archive
Possible future: Memory, IOPS/BW Tier, Capacity Tier (with the PFS and Archive perhaps dropping away)
(Diagram courtesy of John Bent, EMC)
Factoids (times are changing!):
•  LANL HPSS = 53 PB and 543 M files
•  Trinity: 2 PB memory, 4 PB flash (11% of HPSS) and 80 PB PFS (150% of HPSS)
•  Crossroads may have 5-10 PB memory, 40 PB solid state (100% of HPSS) with data residency measured in days or weeks
•  We would never have contemplated more in-system storage than our archive a few years ago
OpenFabrics Alliance Workshop 2016
BURST BUFFER -> PERF/IOPS TIER
WHAT WOULD NEED TO CHANGE?
§ Burst Buffers are designed for data durations of hours to days. If in-system solid-state storage is to be used for months-long durations, many things are missing.
• Protecting Burst Buffers with RAID/Erasure is NOT economical for checkpoint and short-term use because you can always go back to a lower-tier copy, but longer-duration data requires protection. Likely you would need a software RAID/Erasure that is distributed on the supercomputer over its fabric (overhead arithmetic is sketched after this list)
• Long term protection of the name space is also needed
• QoS issues are also more acute
• With much more longer term data, locality based scheduling is also
perhaps more important
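A distributed software RAID/erasure layer buys durability at the cost of extra capacity and extra fabric traffic on every write. The toy sketch below shows that overhead arithmetic; the k+m geometries are arbitrary examples, not a proposed design.

```python
# Toy overhead arithmetic for protecting a long-lived in-system flash tier
# with distributed k+m erasure coding; geometries are arbitrary examples.

def erasure_overhead(k, m):
    """k data shards + m parity shards per stripe."""
    capacity_overhead = (k + m) / k          # raw bytes stored per user byte
    write_amplification = (k + m) / k        # extra fabric traffic on every write
    tolerated_failures = m
    return capacity_overhead, write_amplification, tolerated_failures

for k, m in [(10, 2), (8, 3), (4, 2)]:
    cap, wamp, f = erasure_overhead(k, m)
    print(f"{k}+{m}: {cap:.2f}x capacity, {wamp:.2f}x write traffic, survives {f} shard losses")
# Checkpoints can skip this cost (a lower-tier copy exists); months-resident data
# cannot, so the protection, and its fabric traffic, must live on the machine itself.
```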
OpenFabrics Alliance Workshop 2016
CAMPAIGN -> CAPACITY TIER
WHAT WOULD NEED TO CHANGE CAMPAIGN?
§ Campaign data duration is targeted at a few years, but a true capacity tier may require more flexible, perhaps much longer, retention periods.
• Probably need power-managed storage devices that match the parallel BW needed to move around PB-sized data sets (a rough sizing sketch follows this slide's bullets)
§ Scaling out to Exabytes, Trillions of files, etc.
• Much of this work is underway
§ Maturing of the solution space with multiple at least
partially vendor supported solutions is needed as
well.
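As a rough feel for the power-managed bandwidth requirement, the sketch below estimates how many drives must be spinning to move a PB-scale data set within a target window; the per-drive streaming rate is an assumption for illustration.

```python
# Rough sizing: how many power-managed disks must be spun up to move a
# PB-scale data set within a target window. Per-drive rate is an assumption.

def drives_to_spin(dataset_pb, window_hours, per_drive_mbs=200.0):
    dataset_mb = dataset_pb * 1e9                 # 1 PB ~ 1e9 MB (decimal units)
    needed_mbs = dataset_mb / (window_hours * 3600)
    return needed_mbs, needed_mbs / per_drive_mbs

for pb, hours in [(1, 24), (5, 24), (5, 72)]:
    bw, n = drives_to_spin(pb, hours)
    print(f"{pb} PB in {hours:3d} h: ~{bw/1000:6.1f} GB/s aggregate, ~{n:5.0f} drives spinning")
# A capacity tier can keep most devices powered down, but it still needs to wake
# enough of them in parallel to stream PB-sized data sets in and out.
```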
OpenFabrics Alliance Workshop 2016
OTHER CONSIDERATIONS
§  New interfaces to storage that preserve/leverage structure/
semantics of data (much of this is being/was explored in the
DOE Storage FFwd)
•  DAOS like concepts with name spaces friendly to science application needs
•  Async/transactional/Versioning to match better with future async programming
models
§  The concept of a loadable storage stack (being worked in MarFS
and EMC)
•  It would be nice if the Perf/IOPS tier could be directed to “check out” a “problem” to
work on for days/weeks. Think PBs of data and billions of metadata entries of
various shapes. (EMC calls this dynamically loadable name spaces)
•  MarFS metadata demonstration of Trillions of files in a file system and Billions of files
in a directory will be an example of a “loadable” name space
•  CMU’s IndexFS->BatchFS->DeltaFS stores POSIX metadata in thousands of distributed KVSs, which makes moving/restarting with billions of metadata entries simple and easy
•  Check out a billion metadata entries, make modifications, check back in as a new version or merge back into the Capacity Tier master name space (a toy check-out/check-in sketch follows)
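A toy model of the check-out / modify / check-in idea as a versioned batch of name-space entries is sketched below; the class, method names, and metadata fields are hypothetical and do not reflect the MarFS, IndexFS, or DeltaFS interfaces.

```python
# Toy model of "check out a slice of the namespace, mutate it, check it back in
# as a new version". Structures and names are hypothetical illustrations only.
import copy

class NamespaceStore:
    def __init__(self):
        self.versions = {0: {}}              # version -> {path: metadata dict}
        self.head = 0

    def checkout(self, prefix):
        """Return a mutable snapshot of all entries under prefix."""
        snap = self.versions[self.head]
        return {p: copy.deepcopy(md) for p, md in snap.items() if p.startswith(prefix)}

    def checkin(self, batch):
        """Merge a modified batch back in as a new immutable version."""
        new = dict(self.versions[self.head])
        new.update(batch)
        self.head += 1
        self.versions[self.head] = new
        return self.head

store = NamespaceStore()
store.checkin({"/campaign/run42/out.h5": {"size": 1 << 40, "mode": 0o640}})
work = store.checkout("/campaign/run42/")    # the "problem" handed to the perf/IOPS tier
work["/campaign/run42/out.h5"]["mode"] = 0o440
print("new version:", store.checkin(work))
```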
OpenFabrics Alliance Workshop 2016
BUT, THAT IS JUST ECONOMIC ARM WAVING.
How will the economics combine with the apps/machine/environmental needs?
Enter Workflows
OpenFabrics Alliance Workshop 2016
WORKFLOWS TO THE RESCUE?
§  What did I learn from the workflow-fest circa 04/2015?
• There are 57 ways to interpret in situ ☺
• There are more workflow tools than requirements documents
• There is no common taxonomy that can be used to reason about data flow/workflow for architects or programmers ☹
§  What did I learn from FY15 Apex Vendor meetings
• Where do you want your flash, how big, how fast, how durable
• Where do you want your SCM, how big, how fast
• Do you want it near the node or near the disk or in the network
• --- YOU REALLY DON’T WANT ME TO TELL YOU WHERE TO
PUT YOUR NAND/SCM ---
§ Can workflows help us beyond some automation tools?
OpenFabrics Alliance Workshop 2016
IN SITU / POST / ACTIVE STORAGE / ACTIVE ARCHIVE
ANALYSIS WORK FLOWS IN WORK FLOWS
OpenFabrics Alliance Workshop 2016
WORKFLOWS: POTENTIAL TAXONOMY
Derived from Dave Montoya (Circa 05/15)
V2 taxonomy attempt
OpenFabrics Alliance Workshop 2016
WORKFLOWS CAN HELP US BEYOND SOME AUTOMATION TOOLS:
WORKFLOWS ENTER THE REALM OF RFP/PROCUREMENT
§  Trinity/Cori
• We didn’t specify flops, we specified running a bigger app faster
• We wanted it to go forward 90% of the time
• We didn’t specify how much burst buffer, or speeds/feeds
• Vendors didn’t like this at first but realized it was degrees of freedom we
were giving them
§  Apex Crossroads/NERSC+1
• Still no flops ☺
• Still want it to go forward a large % of the time
• Vendors ask: where and how much flash/nvram/pixie dust do we put on node, in network, in ionode, near storage, blah blah
• We don’t care; we want to get the most of these workflows through the system in 6 months
V3 Taxonomy Attempt: the next slides represent work done by the APEX Workflow Team
OpenFabrics Alliance Workshop 2016
V3 TAXONOMY FROM APEX PROCUREMENT DOCS
A SIMULATION PIPELINE
OpenFabrics Alliance Workshop 2016
V3 TAXONOMY FROM APEX PROCUREMENT DOCS
A HIGH THROUGHPUT/UQ PIPELINE
OpenFabrics Alliance Workshop 2016
WORKFLOW DATA THAT GOES WITH THE
WORKFLOW DIAGRAMS
OpenFabrics Alliance Workshop 2016
SUMMARY
§  Economic modeling/analysis is a powerful tool for guiding our
next steps
§  Given the growing footprint of data management/movement in the cost and pain of HPC, workflows may grow in importance and may be more useful than ever in planning for new machine architectures, procurement, and integration.
§  Combining Economics and Workflows helps paint a picture of
the future for all of us.
12th ANNUAL WORKSHOP 2016
THANK YOU
AND
RIP PFS
