Hadoop Backup and Disaster Recovery
Jai Ranganathan
Cloudera Inc
What makes Hadoop different?

Not much

EXCEPT

• Tera- to peta-bytes of data
• Commodity hardware
• Highly distributed
• Many different services
What needs protection?

Data Sets: Data & meta-data about your data (Hive)
Applications: System applications (JT, NN, Region Servers, etc) and user applications
Configuration: Knobs and configurations necessary to run applications
We will focus on….


Data Sets

but not because the others aren’t important..

Existing systems & processes can help manage Apps & Configuration (to some extent)
Classes of Problems to Plan For
Hardware Failures
 • Data corruption on disk
 • Disk/Node crash
 • Rack failure


User/Application Error
 • Accidental or malicious data deletion
 • Corrupted data writes


Site Failures
 • Permanent site loss – fire, ice, etc
 • Temporary site loss – Network, Power, etc (more common)
Business goals must drive solutions
RPOs and RTOs are awesome…
But plan for what you care about – how much is this data worth?

Failure mode         Risk     Cost
Disk failure         High     Low
Node failure         High     Low
Rack failure         Medium   Medium
Accidental deletes   Medium   Medium
Site loss            Low      High
Basics of HDFS*

[HDFS architecture diagram]

* From Hadoop documentation
Hardware failures – Data Corruption
Data corruption on disk


• Checksum metadata for each block is stored with the file
• If checksums do not match, the name node discards the block and replaces it with a fresh copy
• The name node can write its metadata to multiple copies for safety – write to different file systems and make backups (see the sketch below)
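A minimal sketch of that last bullet, assuming the Hadoop 2.x property name dfs.namenode.name.dir (older releases call it dfs.name.dir) and made-up paths; in practice this lives in hdfs-site.xml, and the Configuration API is used here only to keep the example self-contained:

import org.apache.hadoop.conf.Configuration;

public class NameNodeDirs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Write NameNode metadata to a local disk AND an NFS mount so that one failed
    // disk cannot take out the file system image (paths are hypothetical).
    conf.set("dfs.namenode.name.dir",
        "file:///data/1/dfs/nn,file:///mnt/nfs/backup/dfs/nn");
    System.out.println(conf.get("dfs.namenode.name.dir"));
  }
}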
Hardware Failures - Crashes
Disk/Node crash


• Synchronous replication saves the day – first two replicas always on different hosts
• Hardware failure detected by heartbeat loss
• Name node HA for meta-data
• HDFS automatically re-replicates blocks without enough replicas through a periodic process
Hardware Failures – Rack failure
 Rack failure


• Configure at least 3 replicas and provide rack information
  (topology.node.switch.mapping.impl or topology.script.file.name) – see the sketch below
• 3rd replica always in a different rack
• 3rd is important – allows for a time window between failure and detection to safely exist
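A minimal sketch of the class-based option (topology.node.switch.mapping.impl), assuming the Hadoop 2.x DNSToSwitchMapping interface; the hostname-to-rack convention is invented for illustration. Most clusters instead point topology.script.file.name at a small script that performs the same lookup:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class SimpleRackMapping implements DNSToSwitchMapping {

  // Maps each host name/IP handed over by the NameNode to a rack path.
  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>();
    for (String name : names) {
      // Hypothetical naming convention: hosts like "dn-r2-07" live in rack 2.
      if (name.contains("-r1-")) {
        racks.add("/dc1/rack1");
      } else if (name.contains("-r2-")) {
        racks.add("/dc1/rack2");
      } else {
        racks.add("/default-rack");  // Hadoop's fallback rack for unknown hosts
      }
    }
    return racks;
  }

  @Override
  public void reloadCachedMappings() {
    // Nothing cached in this sketch.
  }

  @Override
  public void reloadCachedMappings(List<String> names) {
    // Nothing cached in this sketch.
  }
}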
Don’t forget metadata


• Your data is defined by Hive metadata
• But this is easy! SQL backups as per usual for Hive safety
Cool.. Basic hardware is under control
Not quite

• Employ monitoring to track node health
• Examine data node block scanner reports (http://datanode:50075/blockScannerReport)
• Hadoop fsck is your friend


Of course, your friendly neighborhood Hadoop vendor
  has tools – Cloudera Manager health checks FTW!
Phew.. Past the easy stuff
              One more small detail…

   Upgrades for HDFS should be treated with care
         On-disk layout changes are risky!

• Save name node meta-data offsite
• Test the upgrade on a smaller cluster before pushing it out
• Data layout upgrades support roll-back, but be safe
• Make backups of all or important data to a remote location before the upgrade!
Application or user errors

Apply the principle of least privilege

Permissions scope:
  Users only have access to data they must have access to

Quota management (see the sketch below):
  Name quota: limits the number of files rooted at a directory
  Space quota: limits the bytes of files rooted at a directory
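A minimal sketch of setting both quotas programmatically, assuming the Hadoop 2.x DistributedFileSystem API; the path and limits are made up, and the same effect is normally achieved with hdfs dfsadmin -setQuota / -setSpaceQuota:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class QuotaExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at HDFS, so the cast below is safe.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path projectDir = new Path("/user/analytics/project-x");  // hypothetical directory
    long nameQuota  = 1000000L;                               // max files + dirs under the path
    long spaceQuota = 10L * 1024 * 1024 * 1024 * 1024;        // 10 TB, counted against replicated bytes

    dfs.setQuota(projectDir, nameQuota, spaceQuota);
  }
}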
Protecting against accidental deletes

Trash server

When enabled, files are deleted into trash
Enable using fs.trash.interval to set the trash interval

Keep in mind:
• Trash deletion only works through the fs shell – programmatic deletes will not employ Trash (see the sketch below)
• .Trash is a per-user directory for restores
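A minimal sketch of a trash-aware programmatic delete, assuming the Hadoop 2.x org.apache.hadoop.fs.Trash API; the path is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class SafeDelete {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster-wide this belongs in core-site.xml; the value is minutes to keep trash.
    conf.set("fs.trash.interval", "1440");  // 24 hours

    FileSystem fs = FileSystem.get(conf);
    Path victim = new Path("/user/etl/staging/2012-10-01");  // hypothetical path

    // Move the path into the calling user's .Trash instead of deleting it outright;
    // moveToAppropriateTrash returns false when trash is disabled.
    boolean movedToTrash = Trash.moveToAppropriateTrash(fs, victim, conf);
    if (!movedToTrash) {
      fs.delete(victim, true);  // recursive, permanent delete as a last resort
    }
  }
}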
Accidental deletes – don’t forget
           metadata



  • Again, regular SQL backups are key
HDFS Snapshots
             What are snapshots?
Snapshots represent state of the system at a point
                    in time
Often implemented using copy-on-write semantics



• In HDFS, an append-only file system means only deletes have to be managed
• Many of the problems with COW are gone!
HDFS Snapshots – coming to a distro
            near you

 Community is hard at work on HDFS snapshots
Expect availability in major distros within the year


Some implementation details – NameNode snapshotting:
• Very fast snapping capability
• Consistency guarantees
• Restores need to perform a data copy
• .snapshot directories for access to individual files (see the sketch below)
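A minimal sketch of the snapshot API as it eventually shipped in later Hadoop 2.x releases (an assumption relative to this talk, which predates the feature landing); the directory and snapshot names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path dataDir = new Path("/data/warehouse");  // hypothetical directory

    dfs.allowSnapshot(dataDir);                        // admin step: mark the dir snapshottable
    dfs.createSnapshot(dataDir, "daily-2012-10-01");   // cheap, NameNode-only operation

    // Snapshotted files are readable under the .snapshot directory.
    Path oldCopy = new Path("/data/warehouse/.snapshot/daily-2012-10-01/part-00000");
    System.out.println("Read-only copy available at " + oldCopy);

    // Restoring means copying data back out of the snapshot (e.g. FileUtil.copy or distcp);
    // delete the snapshot once it is no longer needed.
    dfs.deleteSnapshot(dataDir, "daily-2012-10-01");
  }
}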
What can HDFS Snapshots do for you?


  • Handles user/application data corruption
         • Handles accidental deletes
   • Can also be used for Test/Dev purposes!
HBase snapshots

            Oh hello, HBase!
Very similar construct to HDFS snapshots
               COW model

               • Fast snaps
        • Consistent snapshots
      • Restores still need a copy
    (hey, at least we are consistent)
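A minimal sketch of taking and cloning an HBase snapshot, assuming the later HBase 1.x+ Admin API rather than the client API current at the time of this talk; table and snapshot names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseSnapshotExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {

      TableName orders = TableName.valueOf("orders");  // hypothetical table

      // Fast, consistent snapshot of the table's current state.
      admin.snapshot("orders-snap-20121001", orders);

      // Materialize the snapshot as a new table for a restore or for test/dev use.
      admin.cloneSnapshot("orders-snap-20121001", TableName.valueOf("orders_restored"));
    }
  }
}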
Hive metadata
   The recurring theme of data + meta-data

Ideally, metadata backed up in the same flow as the
                      core data
     Consistency of data and metadata is really
                     important
Management of snapshots
Space considerations:

• % of cluster for snapshots
• Number of snapshots
• Alerting on space issues

Scheduling backups:

• Time based
• Workflow based
Great… Are we done?

        Don’t forget Roger Duronio!

Principle of least privilege still matters…
Disaster Recovery


[Diagram: Datacenter A replicating to Datacenter B – HDFS, Hive, HBase]
Teeing vs Copying
Teeing: send data during the ingest phase to both the production and replica clusters
• Time delay is minimal between clusters
• Bandwidth required could be larger
• Requires re-processing data on both sides
• No consistency between sites

Copying: data is copied from production to replica as a separate step after processing
• Consistent data between both sites
• Process once only
• Time delay for RPO objectives to do incremental copy
• More bandwidth needed
Recommendations?


       Scenario dependent
                But
Generally prefer copying over teeing
How to replicate – per service


HDFS
  Teeing: Flume and Sqoop support teeing
  Copying: DistCP for copying

HBase
  Teeing: Application-level teeing
  Copying: HBase replication

Hive
  Teeing: NA
  Copying: Database import/export*

* Database import/export isn’t the full story
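For the HDFS copying row above, a minimal sketch of driving DistCP from Java, assuming the Hadoop 2.x org.apache.hadoop.tools.DistCp API (Hadoop 3 replaced DistCpOptions with a builder); cluster addresses and paths are made up, and the equivalent CLI is roughly hadoop distcp -update -m 20 <src> <dst>:

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class ReplicateWithDistCp {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Path source = new Path("hdfs://nn-prod:8020/data/warehouse");  // hypothetical source cluster
    Path target = new Path("hdfs://nn-dr:8020/data/warehouse");    // hypothetical replica cluster

    DistCpOptions options = new DistCpOptions(Collections.singletonList(source), target);
    options.setSyncFolder(true);  // like -update: copy only new or changed files
    options.setMaxMaps(20);       // like -m: cap map tasks to protect the WAN link

    Job job = new DistCp(conf, options).execute();  // runs as a MapReduce job and blocks
    System.exit(job.isSuccessful() ? 0 : 1);
  }
}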
Hive metadata
   The recurring theme of data + meta-data

Ideally, metadata backed up in the same flow as the
                      core data
     Consistency of data and metadata is really
                     important
Key considerations for large data
                   movement
•   Is your data compressed?
     – None of the systems support compression on the wire natively
     – WAN accelerators can help but cost $$

•   Do you know your bandwidth needs?
     – Initial data load
     – Daily ingest rate – Maintain historical information

•   Do you know your network security setup?
     – Data nodes & Region Servers talk to each other – they need network connectivity

•   Have you configured security appropriately?
     – Kerberos support for cross-realm trust is challenging

•   What about cross-version copying?
     – Can’t always have both clusters be same version – but this is not trivial
Management of replications
Scheduling replication jobs

• Time based
• Workflow based – Kicked off from Oozie script?

Prioritization

• Keep replications in a separate scheduler group and
  dedicate capacity to replication jobs
• Don’t schedule more map tasks than the available network bandwidth
  between sites can handle – a quick sizing sketch follows
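A back-of-the-envelope sizing sketch for that map-task cap; every number below is illustrative, not a recommendation:

public class ReplicationSizing {
  public static void main(String[] args) {
    double linkGbps        = 10.0;  // WAN link between the sites
    double linkUtilization = 0.5;   // leave headroom for other traffic
    double perMapMBps      = 30.0;  // observed throughput of one copy map task

    double usableMBps = linkGbps * 1000.0 / 8.0 * linkUtilization;  // Gbit/s -> MB/s
    int maxMaps = (int) Math.floor(usableMBps / perMapMBps);

    System.out.printf("Cap replication jobs at roughly %d map tasks%n", maxMaps);
  }
}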
Secondary configuration and usage
Hardware considerations
• Denser disk configurations acceptable on remote site
  depending on workload goals – 4 TB disks vs 2 TB disks, etc
• Fewer nodes are typical – consider replicating only critical
  data. Be careful playing with replication factors (see the sketch below)
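A minimal sketch of lowering the replication factor for less-critical data on the replica cluster, assuming the standard FileSystem.setReplication API; the path and factor are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicaSideReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Keep 2 copies of archival data on the denser, smaller replica cluster instead of
    // the default 3 – only for data you can afford to re-copy from the primary site.
    boolean applied = fs.setReplication(new Path("/archive/clickstream"), (short) 2);
    System.out.println("Replication change applied: " + applied);
  }
}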

Usage considerations
• Physical partitioning means a great place for ad-hoc
  analytics
• Production workloads continue to run on core cluster but
  ad-hoc analytics on replica cluster
• For HBase, all clusters can be used for data serving!
What about external systems?

• Backing up to external systems is a one-way street with large data volumes

• Can’t do useful processing on the other side

• Cost of Hadoop storage is fairly low, especially if you can drive work on it
Summary
• It can be done!

• Lots of gotchas and details to track in the process

• We haven’t even talked about applications and
  configuration!

• Failure workflows are important too – testing,
  testing, testing
Cloudera Enterprise BDR

[Architecture diagram: Cloudera Enterprise]

CLOUDERA MANAGER: SELECT – CONFIGURE – SYNCHRONIZE – MONITOR
DISASTER RECOVERY MODULE

CDH:
• HDFS distributed replication – high-performance replication using MapReduce
• Hive metastore replication – the only disaster recovery solution for metadata


Editor's Notes

  • #3: Data movement is expensive. Hardware is more likely to fail. More complex interactions in a distributed environment. Each service requires different hand-holding.
  • #5: Keep in mind that configuration may not even make sense to replicate – the remote side may have different configuration options.
  • #8: Data is split into blocks (default: 128 MB). Blocks are replicated (default: 3 times). HDFS is rack aware.
  • #14: Cloudera Manager helps with replication by managing versions as well.
  • #35: Cross-version management. Improved distcp. Hive export/import with updates. Simple UI.