HDFS Tiered Storage
Chris Douglas, Virajith Jalaparti
Microsoft CISL
>id
Microsoft Cloud and Information Services Lab (CISL)
Applied research group in large-scale systems and machine learning
Contributions to Apache Hadoop YARN
Preemption, reservations/planning, federation, distributed sched.
Apache REEF: control-plane for big data systems
Chris Douglas (cdoug@microsoft.com)
Contributor to Apache Hadoop since 2007, member of its PMC
Virajith Jalaparti (vijala@microsoft.com)
Data in Hadoop
All data in one place
Tools written against abstractions
Compatible FileSystems (Azure/S3/etc.)
Multi-tenant
Management APIs
Quotas, auth, encryption, media
Works well if all data is in one cluster
In most cases, we have multiple clusters…
Multiple storage clusters
Production/research partitioning
Compliance and regulatory restrictions
Datasets can be shared
Geographically distributed clusters
Disaster recovery
Cloud backup/Hybrid clouds
Heterogeneous storage tiers in a cluster
[Diagram: two compute + storage clusters, hdfs://a/ and hdfs://b/, alongside cloud storage at wasb://…]
Managing multiple clusters: Today
Using the framework
Copy data (distcp) between clusters
(+) Clients process local copies, no visible partial copies
(-) Uses compute resources, requires capacity planning
Using the application
Directly access data in multiple clusters
(+) Consistency managed at client
(-) Auth to all data sources, consistency is hard, no opportunities for transparent caching
[Diagrams: with the framework approach, a dataset D is copied from hdfs://a/ to hdfs://b/ and the application A reads/writes its local copy; with the application approach, A reads/writes hdfs://a/ and hdfs://b/ directly.]
Managing multiple clusters: Our proposal
Tiering: Using the platform
Synchronize storage with remote namespace
(+) Transparent to users, caching/prefetching, unified namespace
(-) Conflicts may be unresolvable
Use HDFS to coordinate external storage
No capability or performance gap
Support for heterogeneous media (RAM/SSD/DISK), rebalancing, security, quotas, etc.
[Diagram: the application A reads/writes hdfs://a/, which mounts hdfs://b/.]
Challenges
Synchronize metadata without copying data
Dynamically page in “blocks” on demand
Define policies to prefetch and evict local replicas
Mirror changes in remote namespace
Handle out-of-band churn in remote storage
Avoid dropping valid, cached data (e.g., rename)
Handle writes consistently
Writes committed to the backing store must “make sense”
Proposal: Provided Storage Type
Peer to RAM, SSD, DISK in HDFS (HDFS-2832)
Data in external store mapped to HDFS blocks
Each block associated with an Alias = (REF, nonce)
Used to map blocks to external data
Nonce used to detect changes on backing store
E.g.: REF = (file URI, offset, length); nonce = GUID
Mapping stored in a BlockMap
KV store accessible by NN and all DNs
ProvidedVolume on Datanodes reads/writes data from/to the external store (see the sketch below)
[Diagram: inside the NN, the FSNamesystem maps /a/foo → b_i, …, b_j and /adl/bar → b_k, …, b_l; the BlockManager maps b_i → {s1, s2, s3} and b_k → {s_PROVIDED}; the BlockMap maps b_k → Alias_k. DN1 and DN2 expose RAM_DISK, SSD, DISK, and PROVIDED storages, with PROVIDED backed by the external store.]
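To make the Alias and BlockMap concrete, here is a minimal Java sketch of the two data structures. The class and method names (BlockAlias, BlockMap.resolve) are illustrative only, not the names used by the HDFS-9806 implementation; they just follow the shape above: REF = (file URI, offset, length), nonce = an opaque change-detection token.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative alias: a reference into the external store plus a nonce. */
final class BlockAlias {
  final URI file;      // e.g. ext://nn/c/d/f/z1
  final long offset;   // byte offset of this block within the external file
  final long length;   // block length
  final String nonce;  // e.g. a GUID, inode id, or modification time

  BlockAlias(URI file, long offset, long length, String nonce) {
    this.file = file;
    this.offset = offset;
    this.length = length;
    this.nonce = nonce;
  }
}

/** Illustrative BlockMap: a KV store from block id to alias, shared by NN and DNs. */
final class BlockMap {
  private final Map<Long, BlockAlias> aliases = new ConcurrentHashMap<>();

  void put(long blockId, BlockAlias alias) {
    aliases.put(blockId, alias);
  }

  /** Resolve a PROVIDED block to its external reference, if known. */
  Optional<BlockAlias> resolve(long blockId) {
    return Optional.ofNullable(aliases.get(blockId));
  }
}
```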
Example: Using an immutable cloud store
[Diagram: the external namespace ext://nn has root / with children a, b, c; under c sits d, whose children are e, f, g. The subtree rooted at d is mounted into the HDFS cluster (NN, DN1, DN2). A client issues read(/d/e) against HDFS; a DN issues read(/c/d/e) against the external store, receives the file data, and streams it back to the client.]
Example: Using an immutable cloud store
FSImage (example entries):
/d/e → {b1, b2, …}
/d/f/z1 → {b_i, b_i+1, …}
…
b_i → {rep = 1, PROVIDED}
…
BlockMap (example entries):
b_i → {(ext://nn/c/d/f/z1, 0, L), inodeId1}
b_i+1 → {(ext://nn/c/d/f/z1, L, 2L), inodeId1}
…
Create FSImage and BlockMap
Block StoragePolicy can be set as required
E.g. {rep=2, PROVIDED, DISK }
[Diagram: the same external namespace ext://nn (subtree /c/d with children e, f, g) and its external store, as on the previous slide.]
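A rough sketch of how the image-generation step could walk an external FileSystem and emit one entry per fixed-size block, reusing the illustrative BlockAlias/BlockMap types from the earlier sketch. The real HDFS-9806 image writer is different; here the file's modification time stands in for the nonce, and block IDs are assigned locally purely for illustration.

```java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ProvidedImageSketch {
  /** Walk the mounted subtree and record one alias per blockSize-byte block of every file. */
  static void buildBlockMap(URI extStore, Path subtree, long blockSize,
                            BlockMap blockMap) throws IOException {
    FileSystem fs = FileSystem.get(extStore, new Configuration());
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(subtree, true);
    long nextBlockId = 1L;                                    // real block ids come from the NN
    while (files.hasNext()) {
      LocatedFileStatus f = files.next();
      String nonce = Long.toString(f.getModificationTime());  // stand-in nonce
      for (long off = 0; off < f.getLen(); off += blockSize) {
        long len = Math.min(blockSize, f.getLen() - off);
        blockMap.put(nextBlockId++,
            new BlockAlias(f.getPath().toUri(), off, len, nonce));
        // The FSImage side would record (path -> block ids) and a
        // {rep, PROVIDED[, DISK]} StoragePolicy for each block.
      }
    }
  }
}
```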
Example: Using an immutable cloud store
Start NN with the FSImage
Replication > 1 starts copying to local media
All blocks reachable from NN when a DN with PROVIDED storage heartbeats in
In contrast to READ_ONLY_SHARED (HDFS-5318)
[Diagram: the NN (BlockManager), loaded from the FSImage and BlockMap, exposes the mounted subtree d (children e, f, g); DN1 and DN2 have PROVIDED storage configured; the full external namespace is shown alongside.]
Example: Using an immutable cloud store
Block locations stored as a composite DN
Contains all DNs with the storage configured
Resolved in getBlockLocation() to a single DN
DN looks up block in BlockMap, uses Alias to read from external store (see the read-path sketch below)
Data can be cached locally as it is read (read-through cache)
[Diagram: the DFSClient calls getBlockLocation("/d/f/z1", 0, L) on the NN (BlockManager, with the FSImage and BlockMap) and receives LocatedBlocks {{DN2, b_i, PROVIDED}}; DN2 performs lookup(b_i) in the BlockMap, resolving it to ("/c/d/f/z1", 0, L, GUID1), and reads the data from the external store.]
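Below is a minimal sketch of the DN-side read path for a PROVIDED replica, written against the public org.apache.hadoop.fs API and the illustrative BlockAlias/BlockMap types above: resolve the alias, open the external file, seek to the block offset, and stream the requested range. The actual ProvidedVolume plugs into the DN's volume and replica interfaces rather than a bare OutputStream, and the bytes it reads can also be written to a local replica to act as the read-through cache.

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProvidedReadSketch {
  /** Stream one PROVIDED block from the external store to the client. */
  static void readProvidedBlock(long blockId, BlockMap blockMap,
                                Configuration conf, OutputStream toClient)
      throws IOException {
    BlockAlias alias = blockMap.resolve(blockId)
        .orElseThrow(() -> new IOException("Unknown provided block " + blockId));
    FileSystem ext = FileSystem.get(alias.file, conf);
    try (FSDataInputStream in = ext.open(new Path(alias.file))) {
      in.seek(alias.offset);                     // position at the block's offset
      byte[] buf = new byte[64 * 1024];
      long remaining = alias.length;
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) {
          throw new IOException("External file shorter than alias; nonce check needed");
        }
        toClient.write(buf, 0, n);               // could additionally be written to a
        remaining -= n;                          // local replica (read-through cache)
      }
    }
  }
}
```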
Benefits of the PROVIDED design
Use existing HDFS features to enforce quotas, limits on storage tiers
Simpler implementation, no mismatch between HDFS invariants and framework
Supports different types of back-end stores
org.apache.hadoop.fs.FileSystem implementations, blob stores, etc.
Enables several policies to improve performance
Set replication in FSImage to pre-fetch
Read-through cache
Actively pre-fetch while cluster is running
Set StoragePolicy for the file to prefetch
Credentials hidden from client
Only NN and DNs require credentials of external store
HDFS can be used to enforce access controls for remote store
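As a hedged example of the prefetch policies listed above, a client or admin tool could raise the replication factor of a mounted file, or attach a storage policy, to have the NN materialize local replicas. setReplication and setStoragePolicy are standard HDFS APIs; the policy name used below is hypothetical, not one shipped with HDFS.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class PrefetchSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://a/"), conf);

    // Raising replication asks the NN to create local replicas of the
    // mounted file, i.e. to prefetch it from the external store.
    fs.setReplication(new Path("/d/f/z1"), (short) 2);

    // A storage policy can express the same intent declaratively; the policy
    // name below is hypothetical, not one that ships with HDFS.
    if (fs instanceof DistributedFileSystem) {
      ((DistributedFileSystem) fs).setStoragePolicy(
          new Path("/d/f/z1"), "PROVIDED_AND_DISK");
    }
  }
}
```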
Handling out-of-band changes
Nonce for correctness (see the nonce-check sketch below)
Asynchronously poll external store
Integrate detected changes into the NN
Update BlockMap on file creation/deletion
Consensus, shared log, etc.
Tighter namespace integration complements the provided store abstraction
Operations like rename can cause unnecessary evictions
Heuristics based on common rename scenarios (e.g., output promotion) to assign block IDs
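A small sketch of the nonce check a poller or a DN could apply before trusting a recorded alias, assuming, as an example, a nonce derived from the external file's length and modification time; a real deployment would use whatever versioning the external store exposes, such as ETags, snapshots, or inode IDs.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NonceCheckSketch {
  /** An example nonce derived from attributes every Hadoop FileSystem exposes. */
  static String nonceOf(FileStatus s) {
    return s.getLen() + ":" + s.getModificationTime();
  }

  /** Returns true if the external file still matches the alias recorded in the
   *  BlockMap; if not, the provided/cached replica must be evicted or refreshed
   *  rather than served. */
  static boolean aliasStillValid(BlockAlias alias, Configuration conf)
      throws IOException {
    FileSystem ext = FileSystem.get(alias.file, conf);
    try {
      FileStatus current = ext.getFileStatus(new Path(alias.file));
      return nonceOf(current).equals(alias.nonce);
    } catch (FileNotFoundException e) {
      return false;  // file deleted or renamed out-of-band
    }
  }
}
```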
Assumptions
Churn is rare and relatively predictable
Analytic workloads, ETL into external/cloud storage, compute in cluster
Clusters are either consumers/producers for a subtree/region
FileSystem has too little information to resolve conflicts
Clients can recognize/ignore inconsistent states
External stores can tighten these semantics
Independent of PROVIDED storage
Implementation roadmap
Read-only image (with periodic, naive refresh)
ViewFS-based: NN configured to refresh from root
Mount within an existing NN
Refresh view of remote cluster and sync
Write-through
Cloud backup: no namespace in external store, replication only
Return to writer only when data are committed to external store
Write-back
Lazily replicate to external store
Resources
Tiered Storage HDFS-9806 [issues.apache.org]
Design documentation
List of subtasks – take one!
Discussion of scope, implementation, and feedback
Read-only replicas HDFS-5318 [issues.apache.org]
Related READ_ONLY_SHARED work; excellent design doc
{cdoug,vijala}@microsoft.com
Alternative approaches: Client-driven tiering
Existing solutions: ViewFS/HADOOP-12077
Challenges
Maintain synchronized client views
Enforcing storage quotas, rate-limiting reads, etc. falls upon the client
Clients need sufficient privileges to read/write data
Client is responsible for maintaining the system in a consistent state
Need to recover partially completed operations from other clients
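For comparison, this is roughly what the client-driven alternative looks like with ViewFS: every client carries a mount table in its own configuration, which is exactly the synchronization, quota, and credential burden listed above. The cluster name ClusterX and the mount points are examples only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side mount table: each client must carry (and keep in sync)
    // these links, and must hold credentials for every backing cluster.
    conf.set("fs.defaultFS", "viewfs://ClusterX/");
    conf.set("fs.viewfs.mounttable.ClusterX.link./data", "hdfs://a/data");
    conf.set("fs.viewfs.mounttable.ClusterX.link./backup", "hdfs://b/backup");

    FileSystem fs = FileSystem.get(conf);
    // Paths under /data resolve to hdfs://a, paths under /backup to hdfs://b.
    System.out.println(fs.getFileStatus(new Path("/data")).getPath());
  }
}
```

Keeping this table consistent across every client, and giving each of them credentials for hdfs://a/ and hdfs://b/, is the burden that the PROVIDED approach moves into the platform.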

Editor's Notes

  • #2: Welcome. Thanks for coming. We’re discussing a proposal for implementing tiering in HDFS, building on its support for heterogeneous storage.
  • #3: We’re members of the Microsoft C.I.S.L., an applied research lab that publishes papers, builds prototypes, writes production code for Microsoft clusters... but who cares about that? We work in open source, particularly Apache projects, particularly Apache Hadoop. REEF is out of CISL which is like a stdlib for resource management frameworks, including YARN and Mesos. [CD] intro [VJ] intro
  • #4: Hadoop gained traction by putting all of an org’s data in one place, in common formats, to be processed by common tools. Different applications get a consistent view of their data from HDFS. Data is protected and managed by a set of user and operator invariants that assign quotas, authenticate users, encrypt data, and distribute it across heterogeneous media. If you have only one source of data to process using that abstraction, then you get to enjoy nice things and the rest of us will sullenly resent you.
  • #5: However, reality is far removed from this. Most companies that deal with data, big or small, have multiple clusters storing it. You typically have multiple production clusters, either owned by different groups or kept separate due to compliance, privacy, or regulatory restrictions, and some datasets are shared across them. For scenarios like BCP or cloud backup, we also have to deal with geographically separate stores, which might be different systems altogether -- for example, you might run HDFS locally but back up to Azure Blob storage. Further, many clusters today have different storage devices or tiers, like RAM/SSD/DISK, within a single cluster. In such cases, we would like to make efficient and performant use of these tiers, for example by placing the hottest data in RAM and the cold data on DISK or tape.
  • #6: In most cases, these multiple clusters and different tiers of storage are managed today using two main techniques. The first is to use the framework: for example, people run distcp jobs to copy data over from one storage cluster to another. While this lets clients process local copies of data and leaves no visible intermediate state, it needs compute resources and manual capacity planning. The second is to use the application to handle multiple clusters: the application can be made aware that data lives in multiple clusters and can read from each one separately while reasoning about the data's consistency. However, each application must then implement these reads itself and authenticate to the different sources, and this leaves us with no opportunities for transparent caching or prefetching to improve performance.
  • #7: Our proposal is to use the platform to manage multiple storage clusters: we propose to use the storage layer itself to manage the external stores. This lets us expose different stores to multiple applications and users transparently, use local storage to cache data from remote storage, and present a single uniform namespace across storage systems that may be in the same building or on the other side of the world, in the cloud. In this talk, we describe how we can enable HDFS to do this – how we can mount external storage systems in HDFS. This lets us exploit the capabilities and features HDFS already supports, such as quotas and security, when accessing the different storage systems.
  • #9: XXX CUT XXX We introduce a new provided storage type which will be a peer to existing storage types. In HDFS today, the NN is partitioned into a namespace (FSNamesystem) that maps files to block IDs, and block lifecycle management in the BlockManager. Each file is a sequence of block IDs in the namespace. Each block ID is mapped to a list of replicas resident on a storage attached to a Datanode. A storage is a device (DISK, SSD, memory region) attached to a Datanode. Because HDFS understands blocks, even for files in provided storage we have a similar mapping. However, we also need some mapping between these blocks and how the data is laid out in the provided store. For this, replicas of a block in “provided” storage are mapped to an alias. An alias is simply a tuple: a reference resolvable in the namespace of the external store, and a nonce to verify that the reference still locates the data matching that block. If my external store is another FileSystem, then my reference may be a URI, offset, and length, and the nonce includes an inode/fileID, modification time, etc. Finally, we have provided volumes in Datanodes which are used to read and write data from the external store. The provided volume essentially implements a client capable of talking to the external store.
  • #10: To understand how this would work in practice, let's look at a simple example where we want to access an external cloud store through HDFS. Let's ignore writes for now. -> Now suppose this is the part of the namespace we want to -> mount in HDFS. -> If the mount is successful, we should be able to access data in the cloud through HDFS. That is, -> if a client requests a particular file, say /d/e, from HDFS, then HDFS should be -> able to read the file from the external store, -> get the data back from the external store, and -> stream the data back to the client. Now, I will hand it over to Chris to explain how we make all of this happen using the PROVIDED abstraction I just introduced.
  • #11: Let’s drill down into an example. Assume we want to mount this external namespace into HDFS. Rather, this subtree. [] We can generate a mirror of the metadata as an FSImage (checkpoint of NN state). For every file, we also partition it into blocks, and store the reference in the blockmap with the corresponding nonce. [] Note that the image contains only the block IDs and storage policy, while the blockmap stores the block alias. So if file /c/d/e were 1GB, the image could record 4 logical blocks. For each block, the blockmap would record the reference (URI,offset,len) and a nonce (inodeId, LMT) sufficient to detect inconsistency.
  • #12: A quick note on block reports, if those are unfamiliar. By the way: if any of this is unfamiliar, please speak up The NN persists metadata about blocks, but their location in the cluster is reported by DNs. Each DN reports the volumes (HDD, SSD) attached to it, and a list of block IDs stored in each. At startup, the NN comes out of safe mode (starts accepting writes) when some fraction of its namespace is available. [] When a DN reports its provided storage, it does not send a full block report for the provided storage (which is, recall, a peer of its local media). It only reports that any block stored therein is reachable through it. As long as the NN has at least one DN reporting that provided storage, it considers all the blocks in the block map as reachable. The NN scans the block map to discover DN blocks in that provided storage. This is in contrast to some existing work supporting read-only replicas, where every DN sends a block report of the shared data, as when multiple DNs mount a filer.
  • #13: ORIG Inside the NN, blocks are stored as a composite.
  • #14: Inside the NN, we relax the invariant that a storage- a HDD/SDD- belongs to only one DN. So when a client requests the block locations for a file (here z1) [] the NN will report all the local replicas, and NN will select a single PROV replica, say closest to the client. This avoids reporting every DN as a valid target, which is accurate, but not helpful for applications. [] When the client requests the PROV block from the DN, the DN will lookup the block in the blockmap [] find the block alias, resolve the reference [] request the block data from the external store [] and return the data to the client, having verified the nonce [] because the block is read through the DN, we can also cache the data as a local block.
  • #15: There are a few points worth calling out, here. * First, this is a relatively small change to HDFS. The only client-visible change adds a new storage type. As a user, this is simpler than coordinating with copying jobs. In our cloud example, all the cluster’s data is immediately available once it’s in the namespace, even if the replication policy hasn’t prefetched data into local media. * Second, particularly for read-only mounts, this is a narrow API to implement. For cloud backup scenarios- where the NN is the only writer to the namespace- then we only need the block to object ID map and NN metadata to mount a prefix/snapshot of the cluster. * Third, because the client reads through the DN, it can cache a copy of the block on read. Pointedly, the NN can direct the client to any DN that should cache a copy on read, opening some interesting combinations of placement policies and read-through caching. The DN isn’t necessarily the closest to the client, but it may follow another objective function or replication policy. * Finally, in our example the cloud credentials are hidden from the client. S3/WAS both authenticate clients to containers using a single key. Because HDFS owns and protects the external store’s credentials, the client only accesses data permitted by HDFS. Generally, we can use features of HDFS that aren’t directly supported by the backing store if we can define the mapping.
  • #16: It’s imperative that we never return the wrong data. If a file were overwritten in the backing store, we will never return part of the first file, and part of the second. The nonce is what we use to protect ourselves from that. But there needs to be some way to ingest new data into HDFS. If our external store has a namespace compatible with FS, then we can always scan it, but... while refresh is limited to scans, the view to the client can be inconsistent. A client may see some unpromoted output, some promoted output, and a sentinel file declaring it completely promoted. Better cooperation with external stores can tighten the namespace integration, to expose meaningful states. For example, if the external store could expose meaningful snapshots, then HDFS could move from one to the next, maintaining a read-only version while it updates. If the path remains valid while the NN updates itself, we have clearer guarantees. For anyone familiar with CORFU and Tango (MSR, Dahlia Malkhi, Mahesh Balakrishnan, Ted Wobber), or with WANdisco’s integration of their Paxos engine with the NN, we can make the metadata sync tight and meaningful. We still need the logic at the block layer we’re adding as provided storage. After correctness, we also need to be mindful of efficiency. Output is often promoted by renaming it, and if the NN were to interpret that as a deletion and creation, our HDFS cluster would discard blocks just to recopy them, right at the moment they are consumed. One of our goals is to conservatively identify these cases based on actual workloads.
  • #17: Since I mentioned strong consensus engines, this isn’t a “real” shared namespace. Even the read-only case is eventually consistent; in the base case we’re scanning the entire subtree in the external store. That’s obviously not workable, but most bigdata workloads don’t hit pathological cases. The typical year/month/day/hour layouts common to analytics clusters are mostly additive, and this is sufficient for that case. * When writes conflict, there is only so much the FS can do to merge conflicts. Set aside really complex cases like compactions; even simple cases may not merge cleanly. If a user creates a directory that is also present in the external store, can that be merged? Maybe not; successful creation might be gating access; many frameworks in the Hadoop ecosystem follow conventions that rely on atomicity of operations in HDFS. * The permissions, timestamps, or storage policy may not match, and there isn’t a “correct” answer for the merged result (absent application semantics). * So we assume that, generally (or by construction), clusters will be either producers or consumers for some part of the shared namespace. Fundamentally: no magic, here. We haven’t made any breakthroughs in consensus, but provided storage is a tractable solution that happens to cover some common cases/deployments in its early incarnations, and from a R&D perspective, some very interesting problems in the policy space. Please find us after the talk, we love to talk about this.
  • #18: The implementation will be staged. The read-only case is relatively straightforward; we implemented a proof-of-concept spread over a few weeks. A link is posted to JIRA. We will start with a NN managing an external store, merged using federation (ViewFS). This lets us defer the mounting logic, which would otherwise interfere with NN operation. We will then explore strategies for creating and maintaining mounts in the primary NN, alongside other data. For those familiar with the NN locking and the formidable challenge of relaxing it, note that most of the invariants we’d enforce don’t apply inside the mount. Quotas aren’t enforced, renames outside can be disallowed, etc. So it may be possible to embed this in the NN. Refresh will start as naive scans, then improve. Identifying subtrees that change and/or are accessed more frequently could improve the general case, but polling is fundamentally limited. Given some experience, we can recognize the common abstractions when tiering over stores that expose version information, snapshots, etc. and write some tighter integrations. Writes are complex, so we will move from working system to working system. We’re wiring the PROV type into the DN state machines, so the write-through case should be tractable, particularly when the external store behind the provided abstraction is an object store. Ultimately, we’d like to use local storage to batch- or even prioritize- writes to the external store. Because HDFS sits between the client and the external store: if we have limited bandwidth, want to apply cost or priority models, etc. these can be embedded in HDFS.
  • #19: Please join us. We have a design document posted to JIRA, an active discussion of the implementation choices, and we’ll be starting a branch to host these changes. The existing work on READ_ONLY_SHARED replicas has a superlative design doc, if you want to contribute but need some orientation in the internal details. We have a few minutes for questions, but please find us after the talk. There are far more details than we can possibly cover in a single presentation and we’re still setting the design, so we’re very open to collaboration. Thanks, and... let’s take a couple questions.