HDFS Tiered Storage
Chris Douglas, Virajith Jalaparti
Microsoft CISL
>id
Microsoft Cloud and Information Services Lab (CISL)
Applied research group in large-scale systems and machine learning
Contributions to Apache Hadoop YARN
Preemption, reservations/planning, federation, distributed sched.
Apache REEF: control-plane for big data systems
Chris Douglas (cdoug@microsoft.com)
Contributor to Apache Hadoop since 2007, member of its PMC
Virajith Jalaparti (vijala@microsoft.com)
Data in Hadoop
All data in one place
Tools written against abstractions
Compatible FileSystems (Azure/S3/etc.)
Multi-tenant
Management APIs
Quotas, auth, encryption, media
Works well if all data is in one cluster
In most cases, we have multiple clusters…
Multiple storage clusters
Production/research partitioning
Compliance and regulatory restrictions
Datasets can be shared
Geographically distributed clusters
Disaster recovery
Cloud backup/Hybrid clouds
Heterogeneous storage tiers in a cluster
[Diagram: two compute + storage clusters, hdfs://a/ and hdfs://b/, alongside cloud storage at wasb://…]
Managing multiple clusters: Today
Using the framework
Copy data (distcp) between clusters
(+) Clients process local copies, no visible partial copies
(-) Uses compute resources, requires capacity planning
Using the application
Directly access data in multiple clusters
(+) Consistency managed at client
(-) Auth to all data sources, consistency is hard, no opportunities for transparent caching
[Diagrams: with the framework approach, a dataset D is copied from hdfs://a/ to hdfs://b/ and the application A reads/writes its local copy; with the application approach, A reads/writes hdfs://a/ and hdfs://b/ directly.]
Managing multiple clusters: Our proposal
Tiering: Using the platform
Synchronize storage with remote namespace
(+) Transparent to users, caching/prefetching, unified namespace
(-) Conflicts may be unresolvable
Use HDFS to coordinate external storage
No capability or performance gap
Support for heterogeneous media (RAM/SSD/DISK), rebalancing, security, quotas, etc.
[Diagram: the application A reads/writes hdfs://a/, which mounts hdfs://b/.]
Challenges
Synchronize metadata without copying data
Dynamically page in “blocks” on demand
Define policies to prefetch and evict local replicas
Mirror changes in remote namespace
Handle out-of-band churn in remote storage
Avoid dropping valid, cached data (e.g., rename)
Handle writes consistently
Writes committed to the backing store must “make sense”
Proposal: Provided Storage Type
Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
Data in external store mapped to HDFS blocks
Each block associated with an Alias = (REF, nonce)
Used to map blocks to external data
Nonce used to detect changes on backing store
E.g.: REF = (file URI, offset, length); nonce = GUID
Mapping stored in a BlockMap
KV store accessible by NN and all DNs
ProvidedVolume on Datanodes reads/writes data from/to the external store (see the sketch below)
[Diagram: inside the NN, the FSNamesystem maps /a/foo → b_i, …, b_j and /adl/bar → b_k, …, b_l; the BlockManager maps b_i → {s1, s2, s3} and b_k → {s_PROVIDED}; the BlockMap maps b_k → Alias_k. DN1 and DN2 expose RAM_DISK, SSD, DISK, and PROVIDED storages, with PROVIDED backed by the external store.]
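To make the Alias and BlockMap concrete, here is a minimal Java sketch of the two data structures. The class and method names (BlockAlias, BlockMap.resolve) are illustrative only, not the names used by the HDFS-9806 implementation; they just follow the shape above: REF = (file URI, offset, length), nonce = an opaque change-detection token.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative alias: a reference into the external store plus a nonce. */
final class BlockAlias {
  final URI file;      // e.g. ext://nn/c/d/f/z1
  final long offset;   // byte offset of this block within the external file
  final long length;   // block length
  final String nonce;  // e.g. a GUID, inode id, or modification time

  BlockAlias(URI file, long offset, long length, String nonce) {
    this.file = file;
    this.offset = offset;
    this.length = length;
    this.nonce = nonce;
  }
}

/** Illustrative BlockMap: a KV store from block id to alias, shared by NN and DNs. */
final class BlockMap {
  private final Map<Long, BlockAlias> aliases = new ConcurrentHashMap<>();

  void put(long blockId, BlockAlias alias) {
    aliases.put(blockId, alias);
  }

  /** Resolve a PROVIDED block to its external reference, if known. */
  Optional<BlockAlias> resolve(long blockId) {
    return Optional.ofNullable(aliases.get(blockId));
  }
}
```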
Example: Using an immutable cloud store
[Diagram: the external namespace ext://nn has root / with children a, b, c; under c sits d, whose children are e, f, g. The subtree rooted at d is mounted into the HDFS cluster (NN, DN1, DN2). A client issues read(/d/e) against HDFS; a DN issues read(/c/d/e) against the external store, receives the file data, and streams it back to the client.]
Example: Using an immutable cloud store
FSImage (example entries):
/d/e → {b1, b2, …}
/d/f/z1 → {b_i, b_i+1, …}
…
b_i → {rep = 1, PROVIDED}
…
BlockMap (example entries):
b_i → {(ext://nn/c/d/f/z1, 0, L), inodeId1}
b_i+1 → {(ext://nn/c/d/f/z1, L, 2L), inodeId1}
…
Create FSImage and BlockMap
Block StoragePolicy can be set as required
E.g. {rep=2, PROVIDED, DISK }
[Diagram: the same external namespace ext://nn (subtree /c/d with children e, f, g) and its external store, as on the previous slide.]
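A rough sketch of how the image-generation step could walk an external FileSystem and emit one entry per fixed-size block, reusing the illustrative BlockAlias/BlockMap types from the earlier sketch. The real HDFS-9806 image writer is different; here the file's modification time stands in for the nonce, and block IDs are assigned locally purely for illustration.

```java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ProvidedImageSketch {
  /** Walk the mounted subtree and record one alias per blockSize-byte block of every file. */
  static void buildBlockMap(URI extStore, Path subtree, long blockSize,
                            BlockMap blockMap) throws IOException {
    FileSystem fs = FileSystem.get(extStore, new Configuration());
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(subtree, true);
    long nextBlockId = 1L;                                    // real block ids come from the NN
    while (files.hasNext()) {
      LocatedFileStatus f = files.next();
      String nonce = Long.toString(f.getModificationTime());  // stand-in nonce
      for (long off = 0; off < f.getLen(); off += blockSize) {
        long len = Math.min(blockSize, f.getLen() - off);
        blockMap.put(nextBlockId++,
            new BlockAlias(f.getPath().toUri(), off, len, nonce));
        // The FSImage side would record (path -> block ids) and a
        // {rep, PROVIDED[, DISK]} StoragePolicy for each block.
      }
    }
  }
}
```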
Example: Using an immutable cloud store
Start NN with the FSImage
Replication > 1 starts copying to local media
All blocks reachable from NN when a DN with PROVIDED storage heartbeats in
In contrast to READ_ONLY_SHARED (HDFS-5318)
[Diagram: the NN (BlockManager), loaded from the FSImage and BlockMap, exposes the mounted subtree d (children e, f, g); DN1 and DN2 have PROVIDED storage configured; the full external namespace is shown alongside.]
Example: Using an immutable cloud store
Block locations stored as a composite DN
Contains all DNs with the storage configured
Resolved in getBlockLocation() to a single DN
DN looks up block in BlockMap, uses Alias to read from external store (see the read-path sketch below)
Data can be cached locally as it is read (read-through cache)
[Diagram: the DFSClient calls getBlockLocation("/d/f/z1", 0, L) on the NN (BlockManager, with the FSImage and BlockMap) and receives LocatedBlocks {{DN2, b_i, PROVIDED}}; DN2 performs lookup(b_i) in the BlockMap, resolving it to ("/c/d/f/z1", 0, L, GUID1), and reads the data from the external store.]
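Below is a minimal sketch of the DN-side read path for a PROVIDED replica, written against the public org.apache.hadoop.fs API and the illustrative BlockAlias/BlockMap types above: resolve the alias, open the external file, seek to the block offset, and stream the requested range. The actual ProvidedVolume plugs into the DN's volume and replica interfaces rather than a bare OutputStream, and the bytes it reads can also be written to a local replica to act as the read-through cache.

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProvidedReadSketch {
  /** Stream one PROVIDED block from the external store to the client. */
  static void readProvidedBlock(long blockId, BlockMap blockMap,
                                Configuration conf, OutputStream toClient)
      throws IOException {
    BlockAlias alias = blockMap.resolve(blockId)
        .orElseThrow(() -> new IOException("Unknown provided block " + blockId));
    FileSystem ext = FileSystem.get(alias.file, conf);
    try (FSDataInputStream in = ext.open(new Path(alias.file))) {
      in.seek(alias.offset);                     // position at the block's offset
      byte[] buf = new byte[64 * 1024];
      long remaining = alias.length;
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) {
          throw new IOException("External file shorter than alias; nonce check needed");
        }
        toClient.write(buf, 0, n);               // could additionally be written to a
        remaining -= n;                          // local replica (read-through cache)
      }
    }
  }
}
```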
Benefits of the PROVIDED design
Use existing HDFS features to enforce quotas, limits on storage tiers
Simpler implementation, no mismatch between HDFS invariants and framework
Supports different types of back-end stores
org.apache.hadoop.fs.FileSystem implementations, blob stores, etc.
Enables several policies to improve performance
Set replication in FSImage to pre-fetch
Read-through cache
Actively pre-fetch while cluster is running
Set StoragePolicy for the file to prefetch
Credentials hidden from client
Only NN and DNs require credentials of external store
HDFS can be used to enforce access controls for remote store
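As a hedged example of the prefetch policies listed above, a client or admin tool could raise the replication factor of a mounted file, or attach a storage policy, to have the NN materialize local replicas. setReplication and setStoragePolicy are standard HDFS APIs; the policy name used below is hypothetical, not one shipped with HDFS.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class PrefetchSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://a/"), conf);

    // Raising replication asks the NN to create local replicas of the
    // mounted file, i.e. to prefetch it from the external store.
    fs.setReplication(new Path("/d/f/z1"), (short) 2);

    // A storage policy can express the same intent declaratively; the policy
    // name below is hypothetical, not one that ships with HDFS.
    if (fs instanceof DistributedFileSystem) {
      ((DistributedFileSystem) fs).setStoragePolicy(
          new Path("/d/f/z1"), "PROVIDED_AND_DISK");
    }
  }
}
```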
Handling out-of-band changes
Nonce for correctness (see the nonce-check sketch below)
Asynchronously poll external store
Integrate detected changes into the NN
Update BlockMap on file creation/deletion
Consensus, shared log, etc.
Tighter namespace integration complements the provided store abstraction
Operations like rename can cause unnecessary evictions
Heuristics based on common rename scenarios (e.g., output promotion) to assign block IDs
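A small sketch of the nonce check a poller or a DN could apply before trusting a recorded alias, assuming, as an example, a nonce derived from the external file's length and modification time; a real deployment would use whatever versioning the external store exposes, such as ETags, snapshots, or inode IDs.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NonceCheckSketch {
  /** An example nonce derived from attributes every Hadoop FileSystem exposes. */
  static String nonceOf(FileStatus s) {
    return s.getLen() + ":" + s.getModificationTime();
  }

  /** Returns true if the external file still matches the alias recorded in the
   *  BlockMap; if not, the provided/cached replica must be evicted or refreshed
   *  rather than served. */
  static boolean aliasStillValid(BlockAlias alias, Configuration conf)
      throws IOException {
    FileSystem ext = FileSystem.get(alias.file, conf);
    try {
      FileStatus current = ext.getFileStatus(new Path(alias.file));
      return nonceOf(current).equals(alias.nonce);
    } catch (FileNotFoundException e) {
      return false;  // file deleted or renamed out-of-band
    }
  }
}
```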
Assumptions
Churn is rare and relatively predictable
Analytic workloads, ETL into external/cloud storage, compute in cluster
Clusters are either consumers/producers for a subtree/region
FileSystem has too little information to resolve conflicts
Clients can recognize/ignore inconsistent states
External stores can tighten these semantics
Independent of PROVIDED storage
Implementation roadmap
Read-only image (with periodic, naive refresh)
ViewFS-based: NN configured to refresh from root
Mount within an existing NN
Refresh view of remote cluster and sync
Write-through
Cloud backup: no namespace in external store, replication only
Return to writer only when data are committed to external store
Write-back
Lazily replicate to external store
Resources
Tiered Storage HDFS-9806 [issues.apache.org]
Design documentation
List of subtasks – take one!
Discussion of scope, implementation, and feedback
Read-only replicas HDFS-5318 [issues.apache.org]
Related READ_ONLY_SHARED work; excellent design doc
{cdoug,vijala}@microsoft.com
Alternative approaches: Client-driven tiering
Existing solutions: ViewFS/HADOOP-12077
Challenges
Maintain synchronized client views
Enforcing storage quotas, rate-limiting reads, etc. falls upon the client
Clients need sufficient privileges to read/write data
Client is responsible for maintaining the system in a consistent state
Need to recover partially completed operations from other clients
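For comparison, this is roughly what the client-driven alternative looks like with ViewFS: every client carries a mount table in its own configuration, which is exactly the synchronization, quota, and credential burden listed above. The cluster name ClusterX and the mount points are examples only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side mount table: each client must carry (and keep in sync)
    // these links, and must hold credentials for every backing cluster.
    conf.set("fs.defaultFS", "viewfs://ClusterX/");
    conf.set("fs.viewfs.mounttable.ClusterX.link./data", "hdfs://a/data");
    conf.set("fs.viewfs.mounttable.ClusterX.link./backup", "hdfs://b/backup");

    FileSystem fs = FileSystem.get(conf);
    // Paths under /data resolve to hdfs://a, paths under /backup to hdfs://b.
    System.out.println(fs.getFileStatus(new Path("/data")).getPath());
  }
}
```

Keeping this table consistent across every client, and giving each of them credentials for hdfs://a/ and hdfs://b/, is the burden that the PROVIDED approach moves into the platform.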

Editor's Notes

  • #2: Welcome. Thanks for coming. We’re discussing a proposal for implementing tiering in HDFS, building on its support for heterogeneous storage.
  • #3: We’re members of the Microsoft C.I.S.L., an applied research lab that publishes papers, builds prototypes, writes production code for Microsoft clusters... but who cares about that? We work in open source, particularly Apache projects, particularly Apache Hadoop. REEF is out of CISL which is like a stdlib for resource management frameworks, including YARN and Mesos. [CD] intro [VJ] intro
  • #4: Hadoop gained traction by putting all of an org’s data in one place, in common formats, to be processed by common tools. Different applications get a consistent view of their data from HDFS. Data is protected and managed by a set of user and operator invariants that assign quotas, authenticate users, encrypt data, and distribute it across heterogeneous media. If you have only one source of data to process using that abstraction, then you get to enjoy nice things and the rest of us will sullenly resent you.
  • #5: However, reality is far removed from this. Most companies that deal with data, big or small, have multiple clusters storing it. You typically have multiple production clusters, either owned by different groups or kept separate due to compliance, privacy, or regulatory restrictions, and some datasets are shared across them. For scenarios like BCP or cloud backup, we also have to deal with geographically separate stores, which might be different systems altogether -- for example, you might run HDFS locally but back up to Azure Blob storage. Further, many clusters today have different storage devices or tiers, like RAM/SSD/DISK, within a single cluster. In such cases, we would like to make efficient and performant use of these tiers, for example by placing the hottest data in RAM and the cold data on DISK or tape.
  • #6: In most cases, these multiple clusters and different tiers of storage are managed today using two main techniques. The first is to use the framework: for example, people run distcp jobs to copy data over from one storage cluster to another. While this lets clients process local copies of data and leaves no visible intermediate state, it needs compute resources and manual capacity planning. The second is to use the application to handle multiple clusters: the application can be made aware that data lives in multiple clusters and can read from each one separately while reasoning about the data's consistency. However, each application must then implement these reads itself and authenticate to the different sources, and this leaves us with no opportunities for transparent caching or prefetching to improve performance.
  • #7: Our proposal is to use the platform to manage multiple storage clusters: we propose to use the storage layer itself to manage the external stores. This lets us expose different stores to multiple applications and users transparently, use local storage to cache data from remote storage, and present a single uniform namespace across storage systems that may be in the same building or on the other side of the world, in the cloud. In this talk, we describe how we can enable HDFS to do this – how we can mount external storage systems in HDFS. This lets us exploit the capabilities and features HDFS already supports, such as quotas and security, when accessing the different storage systems.
  • #9: XXX CUT XXX We introduce a new provided storage type which will be a peer to existing storage types. In HDFS today, the NN is partitioned into a namespace (FSNamesystem) that maps files to block IDs, and block lifecycle management in the BlockManager. Each file is a sequence of block IDs in the namespace. Each block ID is mapped to a list of replicas resident on a storage attached to a Datanode. A storage is a device (DISK, SSD, memory region) attached to a Datanode. Because HDFS understands blocks, even for files in provided storage we have a similar mapping. However, we also need some mapping between these blocks and how the data is laid out in the provided store. For this, replicas of a block in “provided” storage are mapped to an alias. An alias is simply a tuple: a reference resolvable in the namespace of the external store, and a nonce to verify that the reference still locates the data matching that block. If my external store is another FileSystem, then my reference may be a URI, offset, and length, and the nonce includes an inode/fileID, modification time, etc. Finally, we have provided volumes in Datanodes which are used to read and write data from the external store. The provided volume essentially implements a client capable of talking to the external store.
  • #10: To understand how this would work in practice, let's look at a simple example where we want to access an external cloud store through HDFS. Let's ignore writes for now. -> Now suppose this is the part of the namespace we want to -> mount in HDFS. -> If the mount is successful, we should be able to access data in the cloud through HDFS. That is, -> if a client requests a particular file, say /d/e, from HDFS, then HDFS should be -> able to read the file from the external store, -> get the data back from the external store, and -> stream the data back to the client. Now, I will hand it over to Chris to explain how we make all of this happen using the PROVIDED abstraction I just introduced.
  • #11: Let’s drill down into an example. Assume we want to mount this external namespace into HDFS. Rather, this subtree. [] We can generate a mirror of the metadata as an FSImage (checkpoint of NN state). For every file, we also partition it into blocks, and store the reference in the blockmap with the corresponding nonce. [] Note that the image contains only the block IDs and storage policy, while the blockmap stores the block alias. So if file /c/d/e were 1GB, the image could record 4 logical blocks. For each block, the blockmap would record the reference (URI,offset,len) and a nonce (inodeId, LMT) sufficient to detect inconsistency.
  • #12: A quick note on block reports, if those are unfamiliar. By the way: if any of this is unfamiliar, please speak up The NN persists metadata about blocks, but their location in the cluster is reported by DNs. Each DN reports the volumes (HDD, SSD) attached to it, and a list of block IDs stored in each. At startup, the NN comes out of safe mode (starts accepting writes) when some fraction of its namespace is available. [] When a DN reports its provided storage, it does not send a full block report for the provided storage (which is, recall, a peer of its local media). It only reports that any block stored therein is reachable through it. As long as the NN has at least one DN reporting that provided storage, it considers all the blocks in the block map as reachable. The NN scans the block map to discover DN blocks in that provided storage. This is in contrast to some existing work supporting read-only replicas, where every DN sends a block report of the shared data, as when multiple DNs mount a filer.
  • #13: ORIG Inside the NN, blocks are stored as a composite.
  • #14: Inside the NN, we relax the invariant that a storage- a HDD/SDD- belongs to only one DN. So when a client requests the block locations for a file (here z1) [] the NN will report all the local replicas, and NN will select a single PROV replica, say closest to the client. This avoids reporting every DN as a valid target, which is accurate, but not helpful for applications. [] When the client requests the PROV block from the DN, the DN will lookup the block in the blockmap [] find the block alias, resolve the reference [] request the block data from the external store [] and return the data to the client, having verified the nonce [] because the block is read through the DN, we can also cache the data as a local block.
  • #15: There are a few points worth calling out, here. * First, this is a relatively small change to HDFS. The only client-visible change adds a new storage type. As a user, this is simpler than coordinating with copying jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data into local media. * Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios- where the NN is the only writer to the namespace- then we only need the block to object ID map and NN metadata to mount a prefix/snapshot of the cluster. * Third, because the client reads through the DN, it can cache a copy of the block on read. Pointedly, the NN can direct the client to any DN that should cache a copy on read, opening some interesting combinations of placement policies and read-through caching. The DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy. * Finally, in our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use features of HDFS that aren’t directly supported by the backing store if we can define the mapping.
  • #16: It’s imperative that we never return the wrong data. If a file were overwritten in the backing store, we will never return part of the first file, and part of the second. The nonce is what we use to protect ourselves from that. But there needs to be some way to ingest new data into HDFS. If our external store has a namespace compatible with FS, then we can always scan it, but... while refresh is limited to scans, the view to the client can be inconsistent. A client may see some unpromoted output, some promoted output, and a sentinel file declaring it completely promoted. Better cooperation with external stores can tighten the namespace integration, to expose meaningful states. For example, if the external store could expose meaningful snapshots, then HDFS could move from one to the next, maintaining a read-only version while it updates. If the path remains valid while the NN updates itself, we have clearer guarantees. For anyone familiar with CORFU and Tango (MSR, Dahlia Malkhi, Mahesh Balakrishnan, Ted Wobber), or with WANdisco’s integration of their Paxos engine with the NN, we can make the metadata sync tight and meaningful. We still need the logic at the block layer we’re adding as provided storage. After correctness, we also need to be mindful of efficiency. Output is often promoted by renaming it, and if the NN were to interpret that as a deletion and creation, our HDFS cluster would discard blocks just to recopy them, right at the moment they are consumed. One of our goals is to conservatively identify these cases based on actual workloads.
  • #17: Since I mentioned strong consensus engines, this isn’t a “real” shared namespace. Even the read-only case is eventually consistent; in the base case we’re scanning the entire subtree in the external store. That’s obviously not workable, but most bigdata workloads don’t hit pathological cases. The typical year/month/day/hour layouts common to analytics clusters are mostly additive, and this is sufficient for that case. * When writes conflict, there is only so much the FS can do to merge conflicts. Set aside really complex cases like compactions; even simple cases may not merge cleanly. If a user creates a directory that is also present in the external store, can that be merged? Maybe not; successful creation might be gating access; many frameworks in the Hadoop ecosystem follow conventions that rely on atomicity of operations in HDFS. * The permissions, timestamps, or storage policy may not match, and there isn’t a “correct” answer for the merged result (absent application semantics). * So we assume that, generally (or by construction), clusters will be either producers or consumers for some part of the shared namespace. Fundamentally: no magic, here. We haven’t made any breakthroughs in consensus, but provided storage is a tractable solution that happens to cover some common cases/deployments in its early incarnations, and from a R&D perspective, some very interesting problems in the policy space. Please find us after the talk, we love to talk about this.
  • #18: The implementation will be staged. The read-only case is relatively straightforward; we implemented a proof-of-concept spread over a few weeks. A link is posted to JIRA. We will start with a NN managing an external store, merged using federation (ViewFS). This lets us defer the mounting logic, which would otherwise interfere with NN operation. We will then explore strategies for creating and maintaining mounts in the primary NN, alongside other data. For those familiar with the NN locking and the formidable challenge of relaxing it, note that most of the invariants we’d enforce don’t apply inside the mount. Quotas aren’t enforced, renames outside can be disallowed, etc. So it may be possible to embed this in the NN. Refresh will start as naive scans, then improve. Identifying subtrees that change and/or are accessed more frequently could improve the general case, but polling is fundamentally limited. Given some experience, we can recognize the common abstractions when tiering over stores that expose version information, snapshots, etc. and write some tighter integrations. Writes are complex, so we will move from working system to working system. We’re wiring the PROV type into the DN state machines, so the write-through case should be tractable, particularly when the external store behind the provided abstraction is an object store. Ultimately, we’d like to use local storage to batch- or even prioritize- writes to the external store. Because HDFS sits between the client and the external store: if we have limited bandwidth, want to apply cost or priority models, etc. these can be embedded in HDFS.
  • #19: Please join us. We have a design document posted to JIRA, an active discussion of the implementation choices, and we’ll be starting a branch to host these changes. The existing work on READ_ONLY_SHARED replicas has a superlative design doc, if you want to contribute but need some orientation in the internal details. We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple questions.