Facebook’s Approach to Big Data
Storage Challenge


Weiyan Wang
Software Engineer (Data Infrastructure – HStore)
March 1, 2013
Agenda
1   Data Warehouse Overview and Challenge

2   Smart Retention

3   Sort Before Compression

4   HDFS Raid

5   Directory XOR & Compaction

6   Q&A
Life of a tag in Data Warehouse
(Flow of a photo-tag log line, with typical latencies:)
• www.facebook.com: a user tags a photo and a log line is
  generated: <user_id, photo_id>
• Scribe Log Storage: the log line reaches Scribeh (10s)
• Realtime Analytics (puma): count users tagging photos in the
  last hour (1min)
• copier/loader: the log line reaches the warehouse (1hr)
• Warehouse scrapes: user info reaches the warehouse from the
  UDB (1day)
• Periodic Analysis (nocron): daily report on count of photo
  tags by country (1day)
• Adhoc Analysis (hipal): count photos tagged by females age
  20-25 yesterday
History (2008/03-2012/03)
Data, Data, and more Data

Growth over these four years:
• Facebook Users: 14X
• Queries/Day: 60X
• Scribe Data/Day: 250X
• Nodes in Warehouse: 260X
• Size (Total): 2500X
Directions to handle the data growth problem
• Improve the software
•   HDFS Federation
•   Prism

• Improve storage efficiency
•   Store more data without increasing capacity
•   Increasingly important; translates into millions of
    dollars in savings
Ways to Improve Storage Efficiency
• Better capacity management

• Reduce space usage of Hive tables

• Reduce replication factor of data
Smart Retention – Motivation
• Hive table “retention” metadata
 •   Partitions older than the retention value are automatically
     purged by the system

• Table owners are unaware of table usage
 •   Difficult to set the retention value right at the beginning

• An improper retention setting may waste space
 •   e.g., users only accessed the most recent 30 days of
     partitions in a table with a 3-month retention
Smart Retention
• Add a post-execute hook that logs table/partition
  names and query start time to MySQL.

• Calculate the “empirical retention” per table
  Given a partition P whose creation time is CT_P:
      Data_age_at_last_query(P) = max{ StartTime_Q - CT_P | ∀ query Q that accesses P }
  Given a table T:
      Empirical_retention(T) = max{ Data_age_at_last_query(P) | ∀ partition P ∈ T }
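A minimal SQL sketch of this computation, assuming the post-execute
hook writes one row per (query, partition) access into a hypothetical
MySQL table query_log(table_name, partition_name,
partition_create_time, query_start_time); all names here are
illustrative, not from the talk:

    -- Oldest age (in days, at query time) of any partition ever
    -- queried, per table: this is the empirical retention.
    SELECT table_name,
           MAX(DATEDIFF(query_start_time, partition_create_time))
               AS empirical_retention_days
    FROM query_log
    GROUP BY table_name;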
Smart Retention
• Send Empirical_retention(T) to table owners with a
  call to action:
 •   Accept the empirical value and change the retention, or
 •   Review the table's query history and figure out a better setting

• After 2 weeks, the system archives partitions
  that are older than Empirical_retention(T)
 •   Space is freed up after partitions get archived
 •   Users need to restore archived data before querying it
Smart Retention – Things Learned
• Table query history enables table owners to
  identify outliers:
 •   e.g., a table's queries mostly touched data less than 32 days
     old, but one query accessed a 42-day-old partition

• Prioritize tables with the most space savings
 •   Saved 8PB from the top 100 tables!
Sort Before Compression – Motivation
• In the RCFile format, data are stored column-wise inside
  every row block
 •   Sorting by one or two columns with many duplicate values
     reduces the final compressed data size

• Trade extra computation for space savings
Sort Before Compression
• Identify the best column to sort by (see the sketch after
  this list)
 •   Take a sample of the table and sort it by every column;
     pick the column with the most space saving

• Transfer target partitions from service clusters to
  compute clusters
• Sort them into compressed RCFile format
• Transfer the sorted partitions back to the service
  clusters to replace the originals
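A minimal HiveQL sketch of the column-selection step, assuming a 1%
random sample is representative; hive_table, the ds value, and the
sample table name are illustrative, not from the talk:

    -- Materialize a small random sample of one partition.
    CREATE TABLE hive_table_sample LIKE hive_table;
    INSERT OVERWRITE TABLE hive_table_sample PARTITION (ds='2012-08-06')
      SELECT `(ds)?+.+`
      FROM hive_table TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand())
      WHERE ds='2012-08-06';

    -- For each candidate column c: rewrite the sample with SORT BY c
    -- into a compressed RCFile table, compare the output sizes
    -- (hadoop fs -du), and keep the column that compresses best.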
How we sort
-- Allow up to 1024 reducers and use a 64MB RCFile record buffer
set hive.exec.reducers.max=1024;
set hive.io.rcfile.record.buffer.size=67108864;

INSERT OVERWRITE TABLE hive_table PARTITION
    (ds='2012-08-06', source_type='mobile_sort')
  SELECT `(ds|source_type)?+.+` FROM hive_table
  WHERE ds='2012-08-06' AND source_type='mobile'
  -- Rows with a real userid hash to the reducer for that userid, so
  -- duplicate values end up adjacent after sorting; null/zero userids
  -- get a random key so they spread out instead of skewing one reducer
  DISTRIBUTE BY IF(userid <> 0 AND userid IS NOT NULL,
                   userid, CAST(RAND() AS STRING))
  SORT BY userid, ip_address;
Sort Before Compression – Things Learned
• Sorting achieves >40% space saving!

• It's important to verify data correctness
 •   Compare hash values of the original and sorted partitions
 •   This comparison uncovered a Hive bug

• Sort cold data first, then gradually move to hot
  data
HDFS Raid
In HDFS, data are 3X replicated:
• The NameNode holds the metadata; clients send meta operations
  to the NameNode and read/write block data directly on the
  DataNodes
• Example: /warehouse/file1 has 3 blocks (1, 2, 3); each block is
  stored on 3 of the 5 DataNodes, so every block exists in 3
  replicas
HDFS Raid – File-level XOR (10, 1)
• Before: /warehouse/file1 has 10 blocks, each 3X replicated
• After: one parity block (block 11) is computed by XOR-ing the 10
  source blocks and stored as parity file /raid/warehouse/file1;
  source and parity blocks are then kept at replication 2
• Effective replication drops from 3X to 2.2X
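The 2.2X figure is straightforward block accounting: any single lost
block can be rebuilt from the parity, so replication can drop from 3
to 2, giving (10 source blocks * 2 replicas + 1 parity block * 2
replicas) / 10 source blocks = 22/10 = 2.2X.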
HDFS Raid
• What if a file has 15 blocks?
•   Treat it as 20 blocks (i.e., two XOR (10, 1) stripes) and
    generate one parity file with 2 blocks
•   Replication factor = (15*2+2*2)/15 = 2.27

• Reconstruction
•   Online reconstruction – DistributedRaidFileSystem
•   Offline reconstruction – RaidNode

• Block Placement
HDFS Raid – File-level Reed Solomon (10, 4)
• Before: /warehouse/file1 has 10 blocks, each 3X replicated
• After: 4 Reed-Solomon parity blocks (blocks 11-14) are stored as
  parity file /raidrs/warehouse/file1; source and parity blocks are
  kept at replication 1
• Effective replication drops from 3X to 1.4X
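The same accounting explains 1.4X: Reed-Solomon (10, 4) tolerates any
4 missing blocks per stripe, so source and parity can each drop to a
single replica, giving (10*1 + 4*1)/10 = 1.4X.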
HDFS Raid – Hybrid Storage
Life of file /warehouse/facebook.jpg:
• Born: 3X replicated
• 1 day old: XOR raided, 2.2X
• 3 months old: RS raided, 1.4X
HDFS Raid – Things Learned
• Replication factor 3 -> 2.65 (12% space saving)

• Avoid flooding the namenode with requests
•   A daily pipeline scans the fsimage to pick raidable files
    rather than recursively searching via the namenode

• Small files prevent further replication reduction
•   50% of files in the warehouse have only 1 or 2
    blocks; they are too small to be raided
Raid Warm Small Files: Directory-level XOR
• Before: each small file under /data (file1..file4, 10 blocks in
  total) is raided individually, with per-file parity files such as
  /raid/data/file1 and /raid/data/file3; effective replication only
  drops to about 2.7X
• After: the 10 blocks of all four files form a single stripe with
  one shared parity block (block 11) in parity file /dir-raid/data;
  source and parity are kept at replication 2, for 2.2X
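The gap is easy to see on a 3-block file: XOR-raided on its own it
costs (3*2 + 1*2)/3 ≈ 2.7X, whereas pooling the directory's 10 blocks
under one shared parity block costs (10*2 + 1*2)/10 = 2.2X.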
Handle Directory Change
Directory change happens very infrequently in the warehouse.
Example: /namespace/infra/ds=2013-07-07 originally holds file1,
file2, and file3 plus a parity file; later file3 is moved to trash
and a new file4 appears.
1. The RaidNode records which blocks belong to which stripe in a
   stripe store (MySQL):
       Block id       Stripe id
       Blk_file_1     Strp_1
       Blk_file_2     Strp_1
       Blk_file_3     Strp_1
       Blk_parity     Strp_1
2. The directory changes.
3. A client tries to read file2 and encounters missing blocks. It
   looks at the stripe table, figures out that file4 does not belong
   to the stripe and that file3 is in trash, and reconstructs file2!
4. The RaidNode re-raids the directory before file3 is actually
   deleted from the cluster.
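A minimal sketch of the stripe-store lookup in step 3, assuming a
MySQL table stripe_store(block_id, stripe_id) as in the diagram;
names are illustrative:

    -- Find every block in the same stripe as file2's block, so the
    -- reader knows exactly which source and parity blocks to use
    -- for reconstruction.
    SELECT s2.block_id
    FROM stripe_store s1
    JOIN stripe_store s2 ON s1.stripe_id = s2.stripe_id
    WHERE s1.block_id = 'Blk_file_2';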
Raid Cold Small Files: Compaction
• Compact cold small files into large files and apply
  file-level RS
•       No need to handle directory changes with file-level RS
    •    Re-raiding a directory-RS-raided directory is expensive
•       Raid-aware compaction achieves the best space saving
    •    Change the block size to produce files whose block counts
         are multiples of ten
•       Reduces the amount of metadata (fewer files)
Raid-Aware Compaction
▪       Compaction settings:
         -- force every split (and thus every output file) to cover
         -- exactly 39 blocks of input
         set mapred.min.split.size = 39*blockSize;
         set mapred.max.split.size = 39*blockSize;
         set mapred.min.split.size.per.node = 39*blockSize;
         set mapred.min.split.size.per.rack = 39*blockSize;
         -- write output using the per-partition best block size
         set dfs.block.size = blockSize;
         -- combine many small input files into few large splits
         set hive.input.format =
              org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

▪       Calculate the best block size for a partition
    ▪    Make sure bestBlockSize * N ≈ partition size, where
         N = 39p + q (p ∈ N+, q ∈ {10, 20, 30})
    ▪    Compaction will then generate p 40-block files and one
         q-block file (see the worked example below)
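A hypothetical worked example: for a partition of roughly 32GB,
choosing N = 127 = 39*3 + 10 (p = 3, q = 10) gives
bestBlockSize ≈ 32GB / 127 ≈ 258MB; compaction then emits three
40-block files and one 10-block file, all of which RS (10, 4) can
raid with no partial stripes.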
Raid-Aware Compaction
▪       Compact a SeqFile format partition
         INSERT OVERWRITE TABLE seq_table
         PARTITION (ds = "2012-08-17")
           SELECT `(ds)?+.+` FROM seq_table
           WHERE ds = "2012-08-17";
▪       Compact an RCFile format partition
         ALTER TABLE rc_table
           PARTITION (ds = "2009-08-31") CONCATENATE;
Directory XOR & Compaction – Things Learned
 • Replication factor 2.65 -> 2.35 (an additional 12% space
   saving)! Still rolling out.

 • Bookkeeping blocks' checksums can catch data
   corruption caused by bugs

 • HDFS's unawareness of Raid causes some issues
  •   Operational error can cause data loss (e.g., forgetting to
      move parity data along with source data)

 • Directory XOR & Compaction only work for warehouse
   data
Questions?
