Hadoop - Disk Fail In Place (DFIP)

Hadoop Disk Fail Inplace

Bharath Mundlapudi
(Email: mundlapudi@yahoo.com)

Core Hadoop Engineer

About Me!
•
Current: Hadoop Engineering, Yahoo!
- Performance, Utilization & HDFS core group.

•
Recent past: JavaSoft & J2EE Group, Sun
- JVM Performance, SIP container,
XML & Web Services.

My contribution to Hadoop
•
Namenode memory improvements
•
Developed tools to understand cluster
utilization and performance at scale.
•
NameNode and JobTracker garbage
collector tuning.
•
Disk Fail Inplace

Agenda
•
Disk Fail Inplace
•
Methodology
•
Issues found
•
Operational Changes
•
Hadoop Changes
•
Lessons learned

Disk Failures

Isn’t Hadoop already handling disk failures?

Where are we today?

In Hadoop, if a single disk in a node fails,
the entire node is blacklisted for the
TaskTracker, and the DataNode process
fails to start up.

Trends in commodity nodes
•
More Storage
–
12 * 3TB
•
More Compute power
–
24 core
•
RAM
–
48GB

Impact of a single disk failure

                                  Old generation grids   New grids
Drives per node                   6 x 1.5TB              12 x 3TB
Slots per node                    12                     24
Nodes in a 10PB, 3 replica grid   3777                   944
Storage lost per disk failure     0.02% of grid          0.1% of grid (5 times magnified)
Compute lost per disk failure     0.02% of grid          0.1% of grid (5 times magnified)
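
The percentages above follow from the fact that, without DFIP, one failed disk takes the whole node out of service, so the loss is one node's share of the grid. A minimal sketch of that arithmetic, using the node counts from the slide (the class and method names are illustrative only):

```java
class DiskFailureImpact {
    /** Percentage of grid capacity lost when one node out of nodeCount goes away. */
    static double lossPercent(int nodeCount) {
        return 100.0 / nodeCount;
    }

    public static void main(String[] args) {
        int oldGridNodes = 3777; // 6 x 1.5TB drives per node (from the slide)
        int newGridNodes = 944;  // 12 x 3TB drives per node (from the slide)

        double oldLoss = lossPercent(oldGridNodes);
        double newLoss = lossPercent(newGridNodes);

        System.out.printf("old grid: %.3f%% lost per failed node%n", oldLoss);
        System.out.printf("new grid: %.3f%% lost per failed node%n", newLoss);
        System.out.printf("magnification: %.1fx%n", newLoss / oldLoss);
    }
}
```

Computed without rounding, the magnification is about 4x; the slide's 5x comes from comparing the rounded figures 0.02% and 0.1%.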

Node Statistics

Total nodes   Active        Blacklisted   Excluded
30242         28436 (94%)   65 (0.2%)     1741 (6%)

Breakout of blacklisted nodes in all grids

Ethernet Link Failure   Disk Failure
11 (16% of failures)    54 (83% of failures)

What is DFIP?
•
DFIP – Disk Fail Inplace
•
We want Hadoop to keep running when
disks fail, up to a threshold.
•
Primarily – DataNode and TaskTracker
•
We took a holistic approach to solve this
disk failure problem.

Why now?
•
Trend toward high-density storage (36TB per node)
–
Cost of losing a node is high

•
To increase operational efficiency
–
Utilization
–
Scaling data
–
Various other benefits

Where to inject a failure?
•
Complete stack analysis for disk failures.

DataNode TaskTracker

JVM

Linux

SCSI Device Driver

Lab Setup
•
40 node cluster on two racks
•
Kickstart and TFTP Server
•
Kerberos Server

Lab Setup(Cont…)
•
PXE Boot, TFTP Server, DHCP Server &
Kerberos Server.

Kerberos Server

PXE Server

Hadoop Nodes

Operational Improvement
•
With DFIP, we completely changed the Hadoop
deployment layout.
•
Linux re-imaging took 4 hours
on a 12-disk system.
Improvement:
We reduced the re-image time to
20 minutes (12x faster).

Analysis Phase
•
Which files are used?
–
Use Linux system commands to identify
these.
•
Identified all the files used by the DataNode
and TaskTracker: logs, tmp, conf,
system libraries, jars, etc.

Methodology
•
umount -l (lazy unmount)
•
chmod 000, 400, etc.
•
SystemTap
–
Similar to DTrace on Solaris.
–
Probes the modules of interest.
–
Wrote probes for the SCSI and CCISS modules.

Failure Framework
•
System Tap (stap) based framework
•
Requires root privileges
•
Time duration based injection
•
Developed for SCSI and CCISS drivers.

Hadoop Changes
•
Umbrella Jira – Hadoop Disk Fail Inplace

HADOOP-7123

TaskTracker Datanode
HADOOP-7124 HADOOP-7125

File Management
•
Separate out user and system files
•
RAID1 on system files
•
System files
–
Kernel files, Hadoop binaries, pids and logs
& JDK
•
User files
–
HDFS data, Task logs and output &
Distributed cache etc.

Datanode impact
•
Separation of system and user files
•
Datanode logs on RAID1
•
DataNode doesn’t honor volumes
tolerated.
–
Jira – HDFS-1592
•
DataNode process doesn’t exit when
disks fail
–
Jira – HDFS-1692

Datanode: HDFS-1592

•
DataNode doesn’t honor volumes tolerated.
–
Startup failure.
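
One way to read "honoring volumes tolerated" at startup: probe every configured data directory, count the ones that fail, and refuse to start only when that count exceeds the configured tolerance (`dfs.datanode.failed.volumes.tolerated` in Hadoop). The sketch below illustrates that rule only. The class, the method names, and the simple directory probe are assumptions made for illustration, not the code from the HDFS-1592 patch.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VolumeToleranceCheck {
    /** Hypothetical probe: a volume is usable if it is a readable, writable directory. */
    static boolean isUsable(File dir) {
        return dir.isDirectory() && dir.canRead() && dir.canWrite();
    }

    /**
     * Returns the usable volumes, or throws if more volumes have failed
     * than the tolerated count allows.
     */
    static List<File> checkVolumes(List<File> configured, int failedVolumesTolerated) {
        List<File> good = new ArrayList<>();
        for (File dir : configured) {
            if (isUsable(dir)) {
                good.add(dir);
            }
        }
        int failed = configured.size() - good.size();
        if (failed > failedVolumesTolerated) {
            throw new IllegalStateException(
                failed + " volume(s) failed, only " + failedVolumesTolerated + " tolerated");
        }
        return good;
    }

    public static void main(String[] args) {
        // Directories to probe are taken from the command line; tolerate one failure.
        List<File> configured = new ArrayList<>();
        for (String path : args) {
            configured.add(new File(path));
        }
        System.out.println("usable volumes: " + checkVolumes(configured, 1));
    }
}
```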

Datanode: HDFS-1692

•
DataNode process doesn’t exit when disks
fail
–
Runtime issue (Secure Mode).

TaskTracker Impact
•
Separation of system and user files
•
Tasktracker logs on RAID1
•
Tasktracker should handle disk failures at both startup and
runtime.
–
Jira: MAPREDUCE-2413
•
Distribute task userlogs on multiple disks.
–
Jira: MAPREDUCE-2415
•
Components impacted:
- Linux task controller, Default task controller, Health check
script, Security and most of the components in Tasktracker.

Tasktracker: MAPREDUCE-2413
•
Tasktracker should handle disk failures at
both startup and runtime.
–
Keep track of good disks all the time.
–
Pass the good disks to all the components
like DefaultTaskController and
LinuxTaskController.
–
Periodically check for disk failures
–
If a disk failure happens, re-initialize the
TaskTracker.
–
Modified Health Check Scripts.
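
The steps above (track the good disks, hand only those to other components, re-check periodically, re-initialize on a new failure) can be sketched roughly as follows. `GoodDirTracker` and its methods are hypothetical names, and the directory probe is a simplification; this is not the code from MAPREDUCE-2413.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class GoodDirTracker {
    private final List<File> configuredDirs;
    private List<File> goodDirs;

    GoodDirTracker(List<File> configuredDirs) {
        this.configuredDirs = new ArrayList<>(configuredDirs);
        this.goodDirs = probe();
    }

    /** Simplified health probe: keep directories that exist and are readable and writable. */
    private List<File> probe() {
        List<File> good = new ArrayList<>();
        for (File dir : configuredDirs) {
            if (dir.isDirectory() && dir.canRead() && dir.canWrite()) {
                good.add(dir);
            }
        }
        return good;
    }

    /** Snapshot of the good directories, to be passed to other components. */
    synchronized List<File> getGoodDirs() {
        return new ArrayList<>(goodDirs);
    }

    /** Re-probes all directories; returns true if a previously good one has failed. */
    synchronized boolean checkForNewFailures() {
        List<File> now = probe();
        boolean newFailure = !now.containsAll(goodDirs);
        goodDirs = now;
        return newFailure;
    }

    public static void main(String[] args) {
        List<File> dirs = new ArrayList<>();
        for (String path : args) {
            dirs.add(new File(path));
        }
        GoodDirTracker tracker = new GoodDirTracker(dirs);
        // A real daemon would call this from a periodic timer.
        if (tracker.checkForNewFailures()) {
            System.out.println("disk failure detected, re-initializing with " + tracker.getGoodDirs());
        } else {
            System.out.println("good dirs: " + tracker.getGoodDirs());
        }
    }
}
```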

TaskTracker: MAPREDUCE-2415
•
Distribute task userlogs on multiple disks.
–
Single point of failure.
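
To illustrate why spreading userlogs removes the single point of failure, the sketch below picks a log directory for each task attempt from the currently good disks, so losing one disk affects only the attempts placed on it. The hash-based placement, the `userlogs` subdirectory layout, and all names here are assumptions for illustration; the MAPREDUCE-2415 patch may place logs differently.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

class UserLogDirChooser {
    /** Picks a per-attempt log directory on one of the good disks (hypothetical layout). */
    static File logDirFor(String attemptId, List<File> goodDirs) {
        if (goodDirs.isEmpty()) {
            throw new IllegalStateException("no good disks left");
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(attemptId.hashCode(), goodDirs.size());
        return new File(new File(goodDirs.get(index), "userlogs"), attemptId);
    }

    public static void main(String[] args) {
        List<File> disks = Arrays.asList(new File("/disk1"), new File("/disk2"), new File("/disk3"));
        for (String attempt : new String[] {"attempt_1", "attempt_2", "attempt_3", "attempt_4"}) {
            System.out.println(attempt + " -> " + logDirFor(attempt, disks));
        }
    }
}
```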

Rigorous Testing
•
Random writer benchmark (With failures)
•
Terasort benchmark (With failures)
•
Gridmixv3 benchmark (With failures)
•
Passed 950 QA tests
•
Tested with Valgrind for Memory leaks

Read JDK APIs carefully
•
What is the problem with this code?

File[] fileList = dir.listFiles();
for (File f : fileList) {
…
}
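
The catch in the loop above is that `File.listFiles()` returns `null`, rather than throwing, when the path is not a directory or an I/O error occurs, which is what a failed disk can look like. The for-each loop then fails with a `NullPointerException`. A minimal sketch of a defensive wrapper (the class and method names are illustrative):

```java
import java.io.File;
import java.io.IOException;

class SafeListing {
    /** Lists a directory, turning the null return of listFiles() into an IOException. */
    static File[] listFilesOrThrow(File dir) throws IOException {
        File[] files = dir.listFiles();
        if (files == null) {
            // null means "not a directory" or "I/O error", e.g. the disk went away
            throw new IOException("Cannot list " + dir + ": not a directory or I/O error");
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : listFilesOrThrow(dir)) {
            System.out.println(f.getName());
        }
    }
}
```

Callers can then treat the `IOException` as a disk failure signal instead of crashing on an unexpected `NullPointerException`.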

Exception Handling
•
ServerSocket.accept() can throw
AsynchronousCloseException when another thread
closes the listening socket's channel
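
The behavior can be reproduced with a `ServerSocketChannel`, whose `accept()` is documented to throw `AsynchronousCloseException` if another thread closes the channel while the accept is in progress. The sketch below assumes the slide refers to a channel-backed listening socket; the class name, the loopback address, and the 300 ms delay are illustrative choices.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ServerSocketChannel;

class AcceptCloseDemo {
    /** Blocks in accept() while another thread closes the channel; returns the exception seen. */
    static IOException acceptUntilClosed() throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port, nobody connects

        Thread closer = new Thread(() -> {
            try {
                Thread.sleep(300); // give the caller time to block in accept()
                server.close();
            } catch (IOException | InterruptedException ignored) {
                // nothing useful to do in this demo
            }
        });
        closer.start();

        try {
            server.accept();
            return null; // only reached if something unexpectedly connected
        } catch (ClosedChannelException e) {
            // AsynchronousCloseException is a subclass of ClosedChannelException
            return e;
        }
    }

    public static void main(String[] args) throws IOException {
        IOException seen = acceptUntilClosed();
        System.out.println("accept() ended with: "
            + (seen == null ? "a connection" : seen.getClass().getName()));
    }
}
```

An accept loop that must survive or cleanly exit on shutdown needs to catch this exception explicitly rather than treat it as an unknown I/O error.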

Future Work
•
Disk Hot Swap.
•
More kinds of failures – Timeouts, CRC
errors, network, CPU, Memory etc
•
And more :-)

Thank you
Contacts:
Email: mundlapudi@yahoo.com
LinkedIn: http://www.linkedin.com/pub/bharath-mundlapudi/2/148/501
