Hadoop Backup and Disaster Recovery
Jai Ranganathan
Cloudera Inc
What makes Hadoop different?

Not much

EXCEPT

• Tera- to peta-bytes of data
• Commodity hardware
• Highly distributed
• Many different services
What needs protection?

Data Sets: Data & meta-data about your data (Hive)
Applications: System applications (JT, NN, Region Servers, etc) and user applications
Configuration: Knobs and configurations necessary to run applications
We will focus on….


Data Sets

but not because the others aren’t important..

Existing systems & processes can help manage Apps & Configuration (to some extent)
Classes of Problems to Plan For
Hardware Failures
 • Data corruption on disk
 • Disk/Node crash
 • Rack failure


User/Application Error
 • Accidental or malicious data deletion
 • Corrupted data writes


Site Failures
 • Permanent site loss – fire, ice, etc
 • Temporary site loss – Network, Power, etc (more common)
Business goals must drive solutions
RPOs and RTOs are awesome…
But plan for what you care about – how much is this data worth?

Failure mode         Risk     Cost
Disk failure         High     Low
Node failure         High     Low
Rack failure         Medium   Medium
Accidental deletes   Medium   Medium
Site loss            Low      High
Basics of HDFS*

[HDFS architecture diagram]

* From Hadoop documentation
Hardware failures – Data Corruption
Data corruption on disk


• Checksum metadata for each block is stored with the file
• If checksums do not match, the name node discards the block and replaces it with a fresh copy
• The name node can write its metadata to multiple copies for safety – write to different file systems and make backups (see the sketch below)
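A minimal sketch of that last bullet, assuming the Hadoop 2.x property name dfs.namenode.name.dir (older releases call it dfs.name.dir) and made-up paths; in practice this lives in hdfs-site.xml, and the Configuration API is used here only to keep the example self-contained:

import org.apache.hadoop.conf.Configuration;

public class NameNodeDirs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Write NameNode metadata to a local disk AND an NFS mount so that one failed
    // disk cannot take out the file system image (paths are hypothetical).
    conf.set("dfs.namenode.name.dir",
        "file:///data/1/dfs/nn,file:///mnt/nfs/backup/dfs/nn");
    System.out.println(conf.get("dfs.namenode.name.dir"));
  }
}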
Hardware Failures - Crashes
Disk/Node crash


• Synchronous replication saves the day – first two replicas always on different hosts
• Hardware failure detected by heartbeat loss
• Name node HA for meta-data
• HDFS automatically re-replicates blocks without enough replicas through a periodic process
Hardware Failures – Rack failure
 Rack failure


• Configure at least 3 replicas and provide rack information
  (topology.node.switch.mapping.impl or topology.script.file.name) – see the sketch below
• 3rd replica always in a different rack
• 3rd is important – allows for a time window between failure and detection to safely exist
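A minimal sketch of the class-based option (topology.node.switch.mapping.impl), assuming the Hadoop 2.x DNSToSwitchMapping interface; the hostname-to-rack convention is invented for illustration. Most clusters instead point topology.script.file.name at a small script that performs the same lookup:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class SimpleRackMapping implements DNSToSwitchMapping {

  // Maps each host name/IP handed over by the NameNode to a rack path.
  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>();
    for (String name : names) {
      // Hypothetical naming convention: hosts like "dn-r2-07" live in rack 2.
      if (name.contains("-r1-")) {
        racks.add("/dc1/rack1");
      } else if (name.contains("-r2-")) {
        racks.add("/dc1/rack2");
      } else {
        racks.add("/default-rack");  // Hadoop's fallback rack for unknown hosts
      }
    }
    return racks;
  }

  @Override
  public void reloadCachedMappings() {
    // Nothing cached in this sketch.
  }

  @Override
  public void reloadCachedMappings(List<String> names) {
    // Nothing cached in this sketch.
  }
}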
Don’t forget metadata


• Your data is defined by Hive metadata
• But this is easy! SQL backups as per usual for Hive safety
Cool.. Basic hardware is under control
Not quite

• Employ monitoring to track node health
• Examine data node block scanner reports (http://datanode:50075/blockScannerReport)
• Hadoop fsck is your friend


Of course, your friendly neighborhood Hadoop vendor
  has tools – Cloudera Manager health checks FTW!
Phew.. Past the easy stuff
              One more small detail…

   Upgrades for HDFS should be treated with care
         On-disk layout changes are risky!

• Save name node meta-data offsite
• Test the upgrade on a smaller cluster before pushing it out
• Data layout upgrades support roll-back, but be safe
• Make backups of all or important data to a remote location before the upgrade!
Application or user errors

Apply the principle of least privilege

Permissions scope:
  Users only have access to data they must have access to

Quota management (see the sketch below):
  Name quota: limits the number of files rooted at a directory
  Space quota: limits the bytes of files rooted at a directory
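A minimal sketch of setting both quotas programmatically, assuming the Hadoop 2.x DistributedFileSystem API; the path and limits are made up, and the same effect is normally achieved with hdfs dfsadmin -setQuota / -setSpaceQuota:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class QuotaExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at HDFS, so the cast below is safe.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path projectDir = new Path("/user/analytics/project-x");  // hypothetical directory
    long nameQuota  = 1000000L;                               // max files + dirs under the path
    long spaceQuota = 10L * 1024 * 1024 * 1024 * 1024;        // 10 TB, counted against replicated bytes

    dfs.setQuota(projectDir, nameQuota, spaceQuota);
  }
}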
Protecting against accidental deletes

Trash server

When enabled, files are deleted into trash
Enable using fs.trash.interval to set the trash interval

Keep in mind:
• Trash deletion only works through the fs shell – programmatic deletes will not employ Trash (see the sketch below)
• .Trash is a per-user directory for restores
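A minimal sketch of a trash-aware programmatic delete, assuming the Hadoop 2.x org.apache.hadoop.fs.Trash API; the path is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class SafeDelete {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster-wide this belongs in core-site.xml; the value is minutes to keep trash.
    conf.set("fs.trash.interval", "1440");  // 24 hours

    FileSystem fs = FileSystem.get(conf);
    Path victim = new Path("/user/etl/staging/2012-10-01");  // hypothetical path

    // Move the path into the calling user's .Trash instead of deleting it outright;
    // moveToAppropriateTrash returns false when trash is disabled.
    boolean movedToTrash = Trash.moveToAppropriateTrash(fs, victim, conf);
    if (!movedToTrash) {
      fs.delete(victim, true);  // recursive, permanent delete as a last resort
    }
  }
}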
Accidental deletes – don’t forget
           metadata



  • Again, regular SQL backups are key
HDFS Snapshots
             What are snapshots?
Snapshots represent state of the system at a point
                    in time
Often implemented using copy-on-write semantics



• In HDFS, an append-only file system means only deletes have to be managed
• Many of the problems with COW are gone!
HDFS Snapshots – coming to a distro
            near you

 Community is hard at work on HDFS snapshots
Expect availability in major distros within the year


Some implementation details – NameNode snapshotting:
• Very fast snapping capability
• Consistency guarantees
• Restores need to perform a data copy
• .snapshot directories for access to individual files (see the sketch below)
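A minimal sketch of the snapshot API as it eventually shipped in later Hadoop 2.x releases (an assumption relative to this talk, which predates the feature landing); the directory and snapshot names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path dataDir = new Path("/data/warehouse");  // hypothetical directory

    dfs.allowSnapshot(dataDir);                        // admin step: mark the dir snapshottable
    dfs.createSnapshot(dataDir, "daily-2012-10-01");   // cheap, NameNode-only operation

    // Snapshotted files are readable under the .snapshot directory.
    Path oldCopy = new Path("/data/warehouse/.snapshot/daily-2012-10-01/part-00000");
    System.out.println("Read-only copy available at " + oldCopy);

    // Restoring means copying data back out of the snapshot (e.g. FileUtil.copy or distcp);
    // delete the snapshot once it is no longer needed.
    dfs.deleteSnapshot(dataDir, "daily-2012-10-01");
  }
}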
What can HDFS Snapshots do for you?


  • Handles user/application data corruption
         • Handles accidental deletes
   • Can also be used for Test/Dev purposes!
HBase snapshots

            Oh hello, HBase!
Very similar construct to HDFS snapshots
               COW model

               • Fast snaps
        • Consistent snapshots
      • Restores still need a copy
    (hey, at least we are consistent)
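A minimal sketch of taking and cloning an HBase snapshot, assuming the later HBase 1.x+ Admin API rather than the client API current at the time of this talk; table and snapshot names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseSnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {

      TableName orders = TableName.valueOf("orders");  // hypothetical table

      // Fast, consistent snapshot of the table's current state.
      admin.snapshot("orders-snap-20121001", orders);

      // Materialize the snapshot as a new table for a restore or for test/dev use.
      admin.cloneSnapshot("orders-snap-20121001", TableName.valueOf("orders_restored"));
    }
  }
}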
Hive metadata
   The recurring theme of data + meta-data

Ideally, metadata backed up in the same flow as the
                      core data
     Consistency of data and metadata is really
                     important
Management of snapshots
Space considerations:

• % of cluster for snapshots
• Number of snapshots
• Alerting on space issues

Scheduling backups:

• Time based
• Workflow based
Great… Are we done?

        Don’t forget Roger Duronio!

Principle of least privilege still matters…
Disaster Recovery


[Diagram: Datacenter A replicating to Datacenter B – HDFS, Hive, HBase]
Teeing vs Copying
Teeing: send data during the ingest phase to both the production and replica clusters
• Time delay is minimal between clusters
• Bandwidth required could be larger
• Requires re-processing data on both sides
• No consistency between sites

Copying: data is copied from production to replica as a separate step after processing
• Consistent data between both sites
• Process once only
• Time delay for RPO objectives to do incremental copy
• More bandwidth needed
Recommendations?


       Scenario dependent
                But
Generally prefer copying over teeing
How to replicate – per service


HDFS
  Teeing: Flume and Sqoop support teeing
  Copying: DistCP for copying

HBase
  Teeing: Application-level teeing
  Copying: HBase replication

Hive
  Teeing: NA
  Copying: Database import/export*

* Database import/export isn’t the full story
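For the HDFS copying row above, a minimal sketch of driving DistCP from Java, assuming the Hadoop 2.x org.apache.hadoop.tools.DistCp API (Hadoop 3 replaced DistCpOptions with a builder); cluster addresses and paths are made up, and the equivalent CLI is roughly hadoop distcp -update -m 20 <src> <dst>:

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class ReplicateWithDistCp {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Path source = new Path("hdfs://nn-prod:8020/data/warehouse");  // hypothetical source cluster
    Path target = new Path("hdfs://nn-dr:8020/data/warehouse");    // hypothetical replica cluster

    DistCpOptions options = new DistCpOptions(Collections.singletonList(source), target);
    options.setSyncFolder(true);  // like -update: copy only new or changed files
    options.setMaxMaps(20);       // like -m: cap map tasks to protect the WAN link

    Job job = new DistCp(conf, options).execute();  // runs as a MapReduce job and blocks
    System.exit(job.isSuccessful() ? 0 : 1);
  }
}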
Hive metadata
   The recurring theme of data + meta-data

Ideally, metadata backed up in the same flow as the
                      core data
     Consistency of data and metadata is really
                     important
Key considerations for large data
                   movement
•   Is your data compressed?
     – None of the systems support compression on the wire natively
     – WAN accelerators can help but cost $$

•   Do you know your bandwidth needs?
     – Initial data load
     – Daily ingest rate – Maintain historical information

•   Do you know your network security setup?
     – Data nodes & Region Servers talk to each other – they need network connectivity

•   Have you configured security appropriately?
     – Kerberos support for cross-realm trust is challenging

•   What about cross-version copying?
     – Can’t always have both clusters be same version – but this is not trivial
Management of replications
Scheduling replication jobs

• Time based
• Workflow based – Kicked off from Oozie script?

Prioritization

• Keep replications in a separate scheduler group and
  dedicate capacity to replication jobs
• Don’t schedule more map tasks than the available network bandwidth
  between sites can handle – a quick sizing sketch follows
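A back-of-the-envelope sizing sketch for that map-task cap; every number below is illustrative, not a recommendation:

public class ReplicationSizing {
  public static void main(String[] args) {
    double linkGbps        = 10.0;  // WAN link between the sites
    double linkUtilization = 0.5;   // leave headroom for other traffic
    double perMapMBps      = 30.0;  // observed throughput of one copy map task

    double usableMBps = linkGbps * 1000.0 / 8.0 * linkUtilization;  // Gbit/s -> MB/s
    int maxMaps = (int) Math.floor(usableMBps / perMapMBps);

    System.out.printf("Cap replication jobs at roughly %d map tasks%n", maxMaps);
  }
}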
Secondary configuration and usage
Hardware considerations
• Denser disk configurations acceptable on remote site
  depending on workload goals – 4 TB disks vs 2 TB disks, etc
• Fewer nodes are typical – consider replicating only critical
  data. Be careful playing with replication factors (see the sketch below)
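A minimal sketch of lowering the replication factor for less-critical data on the replica cluster, assuming the standard FileSystem.setReplication API; the path and factor are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaSideReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Keep 2 copies of archival data on the denser, smaller replica cluster instead of
    // the default 3 – only for data you can afford to re-copy from the primary site.
    boolean applied = fs.setReplication(new Path("/archive/clickstream"), (short) 2);
    System.out.println("Replication change applied: " + applied);
  }
}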

Usage considerations
• Physical partitioning means a great place for ad-hoc
  analytics
• Production workloads continue to run on core cluster but
  ad-hoc analytics on replica cluster
• For HBase, all clusters can be used for data serving!
What about external systems?

• Backing up to external systems is a one-way street with large data volumes

• Can’t do useful processing on the other side

• Cost of Hadoop storage is fairly low, especially if you can drive work on it
Summary
• It can be done!

• Lots of gotchas and details to track in the process

• We haven’t even talked about applications and
  configuration!

• Failure workflows are important too – testing,
  testing, testing
Cloudera Enterprise BDR

[Architecture diagram: Cloudera Enterprise]

CLOUDERA MANAGER: SELECT – CONFIGURE – SYNCHRONIZE – MONITOR
DISASTER RECOVERY MODULE

CDH:
• HDFS distributed replication – high-performance replication using MapReduce
• Hive metastore replication – the only disaster recovery solution for metadata


Editor's Notes

  • #3: Data movement is expensive. Hardware is more likely to fail. More complex interactions in a distributed environment. Each service requires different hand-holding.
  • #5: Keep in mind that configuration may not even make sense to replicate – the remote side may have different configuration options.
  • #8: Data is split into blocks (default: 128 MB). Blocks are replicated (default: 3 times). HDFS is rack aware.
  • #14: Cloudera Manager helps with replication by managing versions as well.
  • #35: Cross-version management. Improved distcp. Hive export/import with updates. Simple UI.