HDFS Erasure Coding
Zhe Zhang
zhezhang@cloudera.com
Replication is Expensive
§ HDFS inherits 3-way replication from Google File System
- Simple, scalable and robust
§ 200% storage overhead
§ Secondary replicas rarely accessed
(Diagram: the NameNode tracks a Block stored as three Replicas on DataNode0, DataNode1, and DataNode2)
Erasure Coding Saves Storage
§ Simplified Example: storing 2 bits
§ Same data durability
- can lose any 1 bit
§ Half the storage overhead
§ Slower recovery
Replication: 1 0 → stored as 1 0, 1 0 (2 extra bits)
XOR Coding: 1 0 → stored as 1 0 plus parity 1 ⊕ 0 = 1 (1 extra bit)
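To make the XOR example concrete, here is a minimal sketch in Java (illustrative only, not HDFS code): two data bits are stored with one parity bit, and either lost bit can be rebuilt from the surviving bit plus the parity.

    // Minimal sketch of the 2-bit XOR example above; illustrative only, not HDFS code.
    public class XorExample {
        public static void main(String[] args) {
            int d0 = 1, d1 = 0;            // the two data bits being stored
            int parity = d0 ^ d1;          // 1 extra parity bit instead of 2 extra replica bits

            // Suppose d1 is lost: rebuild it from the surviving bit and the parity.
            int recovered = d0 ^ parity;
            System.out.println(recovered); // prints 0, the original value of d1
        }
    }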
Erasure Coding Saves Storage
§ Facebook
- f4 stores 65PB of BLOBs in EC
§ Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
§ Google File System
- Large portion of data stored in EC
Roadmap
§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
§ Hardware-accelerated Codec Framework
Durability and Efficiency
Data Durability = How many simultaneous failures can be tolerated?
Storage Efficiency = What fraction of storage holds useful data?
3-way Replication: Data Durability = 2, Storage Efficiency = 1/3 (33%)
(Diagram: the NameNode tracks a Block with Replicas on DataNode0, DataNode1, and DataNode2; one replica is useful data, the other two are redundant data)
XOR: Data Durability = 1, Storage Efficiency = 2/3 (67%)
X Y | X ⊕ Y
0 0 |   0
0 1 |   1
1 0 |   1
1 1 |   0
Recovering a lost cell: Y = X ⊕ (X ⊕ Y) = 0 ⊕ 1 = 1
(X and Y are useful data; X ⊕ Y is the redundant parity cell)
Reed-Solomon (RS): Data Durability = 2, Storage Efficiency = 4/6 (67%)
Very flexible!
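In general terms (writing k for the number of data cells and m for the number of parity cells, notation added here rather than taken from the slides), a Reed-Solomon code tolerates the loss of any m cells:

    \text{Data Durability} = m, \qquad \text{Storage Efficiency} = \frac{k}{k+m}

The slide's example uses k = 4 and m = 2, giving durability 2 and efficiency 4/6; the (6,3) and (10,4) schemes in the table below follow the same rule.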
Scheme                 | Data Durability | Storage Efficiency
Single Replica         | 0               | 100%
3-way Replication      | 2               | 33%
XOR with 6 data cells  | 1               | 86%
RS (6,3)               | 3               | 67%
RS (10,4)              | 4               | 71%
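The table follows one rule of thumb: durability equals the number of redundant units, and efficiency is data units over total units. A small illustrative sketch (not HDFS code) that reproduces the table:

    // Reproduces the table above. Durability = number of redundant units a scheme adds;
    // efficiency = data units / total units. Illustrative only, not HDFS code.
    public class RedundancyTable {
        static void row(String name, int dataUnits, int redundantUnits) {
            int durability = redundantUnits;   // each extra replica/parity unit tolerates one more failure
            double efficiency = 100.0 * dataUnits / (dataUnits + redundantUnits);
            System.out.printf("%-22s durability=%d efficiency=%.0f%%%n", name, durability, efficiency);
        }
        public static void main(String[] args) {
            row("Single Replica", 1, 0);          // 0, 100%
            row("3-way Replication", 1, 2);       // 2, 33%
            row("XOR with 6 data cells", 6, 1);   // 1, 86%
            row("RS (6,3)", 6, 3);                // 3, 67%
            row("RS (10,4)", 10, 4);              // 4, 71%
        }
    }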
EC in Distributed Storage
Contiguous Layout:
(Diagram: a 768MB file is split into 128MB blocks; block0 (0~128M) is stored on DataNode 0, block1 (128~256M) on DataNode 1, …, block5 (640~768M) on DataNode 5; parity blocks go to DataNode 6, …)
Data Locality ✓
Small Files ✗

EC in Distributed Storage
Striped Layout:
(Diagram: the file is divided into 1MB cells (0~1M, 1~2M, …, 5~6M, then 6~7M, …) striped round-robin across block0…block5 on DataNodes 0–5, with parity cells on DataNodes 6–8)
Data Locality ✗
Small Files ✓
Parallel I/O ✓
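To illustrate the striped layout, the sketch below (not HDFS code; the 1 MB cell size and 6 data blocks per group are taken from the diagram) maps a logical file offset within one block group to its cell, its block within the group, and its offset inside that block.

    // Maps a logical offset (within a single block group) to its striped location.
    // Illustrative only; cell size and group width follow the diagram above.
    public class StripedLayout {
        static final long CELL_SIZE = 1L << 20;   // 1 MB cells
        static final int DATA_BLOCKS = 6;         // data blocks per block group

        public static void main(String[] args) {
            long fileOffset = 7L * CELL_SIZE + 12345;               // example offset in the first group

            long cellIndex = fileOffset / CELL_SIZE;                // which 1 MB cell of the group
            int blockIndex = (int) (cellIndex % DATA_BLOCKS);       // which data block (and DataNode) holds it
            long offsetInBlock = (cellIndex / DATA_BLOCKS) * CELL_SIZE + fileOffset % CELL_SIZE;

            // cell 7 is the second cell of block1: round-robin striping across the 6 data blocks
            System.out.println("cell " + cellIndex + ", block " + blockIndex + ", offset " + offsetInBlock);
        }
    }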
EC in Distributed Storage
Spectrum:
(Diagram: systems placed along two axes, Replication vs. Erasure Coding and Contiguous vs. Striped layout; Ceph and Quantcast File System appear in more than one region, alongside HDFS, Facebook f4, and Windows Azure)
Roadmap
§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
§ Hardware-accelerated Codec Framework
Choosing Block Layout
• Assuming (6,3) coding • Small files: < 1 block • Medium: 1~6 blocks • Large: > 6 blocks (1 group)
(Pie charts: file count vs. space usage by small/medium/large files for three production clusters)
Cluster A Profile: top 2% of files occupy ~65% of space
Cluster B Profile: top 2% of files occupy ~40% of space
Cluster C Profile: dominated by small files
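A minimal sketch of the size buckets used in the profiles above, assuming the default 128 MB block size and a (6,3) group of 6 data blocks (the thresholds are implied by the bullet points; the code itself is illustrative):

    // Size buckets from the slide: small < 1 block, medium 1~6 blocks, large > 6 blocks,
    // assuming 128 MB blocks and (6,3) coding. Illustrative only.
    public class FileSizeBucket {
        static final long BLOCK = 128L << 20;     // 128 MB
        static final long GROUP = 6 * BLOCK;      // data capacity of one (6,3) block group

        static String bucket(long fileSize) {
            if (fileSize < BLOCK) return "small";
            if (fileSize <= GROUP) return "medium";
            return "large";
        }

        public static void main(String[] args) {
            System.out.println(bucket(10L << 20));    // small
            System.out.println(bucket(300L << 20));   // medium
            System.out.println(bucket(2L << 30));     // large
        }
    }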
Choosing Block Layout
(Diagram: the layout/redundancy spectrum (Replication vs. Erasure Coding, Contiguous vs. Striped) annotated with the project phases: current HDFS, Phase 1.1, Phase 1.2, Phase 2 (future work), and Phase 3 (future work))
NameNode — Generalizing the Block Concept
Mapping Logical and Storage Blocks
Too Many Storage Blocks?
Hierarchical Naming Protocol (see the sketch below)
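The hierarchical naming idea keeps NameNode memory in check: the NameNode stores one entry per logical block group, and the IDs of the internal storage blocks are derived from the group ID plus a small index instead of being tracked individually. A rough sketch of that idea follows; the 4-bit index field is an illustrative choice, not necessarily the exact HDFS bit layout.

    // Sketch of hierarchical block naming: one ID per block group, internal block IDs
    // derived from it. The 4-bit index field is illustrative, not the exact HDFS layout.
    public class BlockGroupNaming {
        static final int INDEX_BITS = 4;                        // room for up to 16 blocks per group
        static final long INDEX_MASK = (1L << INDEX_BITS) - 1;

        // Group IDs are allocated with their low INDEX_BITS set to zero.
        static long internalBlockId(long groupId, int indexInGroup) {
            return groupId | indexInGroup;                      // RS(6,3): indices 0..5 data, 6..8 parity
        }

        static long groupIdOf(long internalBlockId) {
            return internalBlockId & ~INDEX_MASK;               // recover the group from any member block
        }

        public static void main(String[] args) {
            long groupId = 0x1000;                              // hypothetical group ID
            long firstParity = internalBlockId(groupId, 6);
            System.out.println(groupIdOf(firstParity) == groupId);  // true
        }
    }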
Client Parallel Writing
(Diagram: the client keeps one queue and one streamer per internal block; each streamer writes its block to a different DataNode, and a Coordinator manages the streamers)
Client Parallel Reading
(Diagram: the client reads from several DataNodes in parallel; when a data block is unavailable, it falls back to reading a parity block)
Reconstruction on DataNode
§ Important to avoid delay on the critical path
- Especially if original data is lost
§ Integrated with Replication Monitor
- Under-protected EC blocks scheduled together with under-replicated blocks
- New priority algorithms
§ New ErasureCodingWorker component on DataNode
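To show the mechanics of background reconstruction, here is a self-contained sketch that rebuilds one lost storage block with the simple XOR code (the real ErasureCodingWorker decodes with Reed-Solomon through the pluggable codec framework; this is illustrative, not HDFS code).

    // Rebuilds one lost block from the surviving blocks using the XOR code, so the sketch
    // stays self-contained; the real ErasureCodingWorker decodes with Reed-Solomon, reading
    // the survivors from other DataNodes and writing the rebuilt block to a new DataNode.
    import java.util.Arrays;

    public class ReconstructXor {
        // XOR code: parity = data0 ^ data1 ^ ... , so any single lost block (data or parity)
        // equals the XOR of all surviving blocks.
        static byte[] xorAll(byte[][] blocks, int len) {
            byte[] out = new byte[len];
            for (byte[] b : blocks)
                for (int i = 0; i < len; i++) out[i] ^= b[i];
            return out;
        }

        public static void main(String[] args) {
            byte[] d0 = {1, 2, 3}, d1 = {4, 5, 6}, d2 = {7, 8, 9};
            byte[] parity = xorAll(new byte[][] {d0, d1, d2}, 3);        // encoding step

            // Pretend d1 was lost: rebuild it from the other data blocks plus the parity.
            byte[] rebuilt = xorAll(new byte[][] {d0, d2, parity}, 3);
            System.out.println(Arrays.equals(rebuilt, d1));              // true
        }
    }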
Roadmap
§ Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
§ HDFS-EC architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
§ Hardware-accelerated Codec Framework
Acceleration with Intel ISA-L
§ 1 legacy coder
- From Facebook’s HDFS-RAID project
§ 2 new coders
- Pure Java — code improvement over HDFS-RAID
- Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)
Microbenchmark: Codec Calculation
Microbenchmark: HDFS I/O
Conclusion
§ Erasure coding expands effective storage space by ~50%!
§ HDFS-EC phase I implements erasure coding in striped block layout
§ Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC change
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
§ Phase II will support contiguous block layout for better locality
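A note for readers on current releases: in Hadoop 3.x, which shipped after this talk, striped erasure coding is enabled per directory by attaching a policy with the hdfs ec subcommand, for example "hdfs ec -setPolicy -path /data/warehouse -policy RS-6-3-1024k" (the path here is only an example); "hdfs ec -getPolicy -path ..." reports the policy in effect.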
Acknowledgements
§ Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
§ Intel
- Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
§ Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
§ Huawei
- Walter Su, Rakesh R, Xinwei Qin
§ Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
Just merged to trunk!
Questions?
Erasure Coding: A type of Error Correction Coding
Client Parallel Writing (detail)
(Diagram: DFSStripedOutputStream feeds dataQueue 0–4, each drained by DataStreamer 0–4 writing one internal block, blk_1009 … blk_1013, of the current blockGroup; the Coordinator allocates new blockGroups)
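A rough sketch of the write path in the diagram: the client-side output stream chops outgoing bytes into cells, round-robins them onto per-block queues, computes parity cells once a full stripe is buffered, and a streamer per internal block drains its queue to one DataNode. Class and method names here are illustrative, not the actual DFSStripedOutputStream internals, and XOR stands in for the real Reed-Solomon encoder.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    // Illustrative striped writer: data cells are distributed round-robin onto per-block
    // queues; when a full stripe is buffered, parity cells are queued as well. A streamer
    // thread per queue (not shown) would drain it to that block's DataNode.
    public class StripedWriterSketch {
        static final int DATA = 6, PARITY = 3, CELL = 1 << 20;   // RS(6,3) layout, 1 MB cells

        final List<Queue<byte[]>> queues = new ArrayList<>();    // one queue per internal block
        final byte[][] stripe = new byte[DATA][];
        int nextCell = 0;

        StripedWriterSketch() {
            for (int i = 0; i < DATA + PARITY; i++) queues.add(new ArrayDeque<>());
        }

        void writeCell(byte[] cell) {                            // one 1 MB cell of user data
            stripe[nextCell] = cell;
            queues.get(nextCell).add(cell);                      // data cell goes to its block's queue
            if (++nextCell == DATA) {                            // full stripe buffered: emit parity cells
                byte[][] parity = encode(stripe);
                for (int p = 0; p < PARITY; p++) queues.get(DATA + p).add(parity[p]);
                nextCell = 0;
            }
        }

        // Placeholder encoder: every parity cell is the XOR of the data cells. The real
        // coder is Reed-Solomon, which makes the parity cells independent of each other.
        static byte[][] encode(byte[][] dataCells) {
            byte[][] parity = new byte[PARITY][CELL];
            for (int p = 0; p < PARITY; p++)
                for (byte[] c : dataCells)
                    for (int i = 0; i < CELL; i++) parity[p][i] ^= c[i];
            return parity;
        }
    }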
Client Parallel Reading (detail)
(Diagram: Stripe 0–2 read across the group's data blocks and parity blocks on DataNodes 0–3; cells are "requested" in parallel from the data blocks, "recovery reads" fetch cells from the remaining blocks, including parity, when a requested cell cannot be read, and cells missing from the last partial stripe are treated as all zero)
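And the read path in the diagram, as a sketch: each needed data cell is requested directly and in parallel; if one cannot be read, recovery reads fetch enough other cells of the same stripe (data or parity) to decode it. The interfaces below are hypothetical placeholders, not the real DFSStripedInputStream API.

    // Illustrative read-with-recovery for one cell of a stripe in an RS(6,3) group.
    // "CellSource" and "Decoder" are hypothetical placeholders, not HDFS classes.
    public class StripedReadSketch {
        static final int DATA = 6, PARITY = 3;

        interface CellSource { byte[] readCell(int blockIndex, int stripeIndex) throws Exception; }
        interface Decoder    { byte[] decode(byte[][] cells, int erasedIndex); }

        static byte[] readDataCell(CellSource dataNodes, Decoder rs, int stripe, int wanted) {
            try {
                return dataNodes.readCell(wanted, stripe);            // normal case: direct parallel read
            } catch (Exception lost) {
                // Recovery read: gather any DATA other cells of this stripe (data or parity),
                // then decode the one that could not be read.
                byte[][] cells = new byte[DATA + PARITY][];
                int have = 0;
                for (int i = 0; i < DATA + PARITY && have < DATA; i++) {
                    if (i == wanted) continue;
                    try { cells[i] = dataNodes.readCell(i, stripe); have++; } catch (Exception ignored) { }
                }
                return rs.decode(cells, wanted);
            }
        }
    }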