Hadoop Disk Fail Inplace

Bharath Mundlapudi (Email: mundlapudi@yahoo.com)
Core Hadoop Engineer
About Me!
• Current: Hadoop Engineering, Yahoo!
    – Performance, Utilization & HDFS core group.
• Recent past: JavaSoft & J2EE Group, Sun
    – JVM performance, SIP container, XML & Web Services.
My contribution to Hadoop
• NameNode memory improvements
• Developed tools to understand cluster utilization and performance at scale.
• NameNode & JobTracker garbage collector tuning.
• Disk Fail Inplace
Agenda
• Disk Fail Inplace
• Methodology
• Issues found
• Operational Changes
• Hadoop Changes
• Lessons learned
Disk Failures



Isn’t Hadoop already handling disk failures?
Where are we today?


In Hadoop, if a single disk in a node fails,
the entire node is blacklisted for the
TaskTracker, and the DataNode process
fails to start up.
Trends in commodity nodes
• More storage
    – 12 x 3TB
• More compute power
    – 24 cores
• RAM
    – 48GB
Siteops Tickets
Impact of a single disk failure
                                     Old generation grids          New grids
    Node configuration               6 x 1.5TB drives, 12 slots    12 x 3TB drives, 24 slots
    Nodes in a 10PB, 3-replica grid  3777 nodes                    944 nodes
    Storage lost when one disk fails about 0.03% of the grid       about 0.1% of the grid (roughly 4x)
    Compute lost when one disk fails about 0.03% of the grid       about 0.1% of the grid (roughly 4x)

Because one failed disk takes the whole node out of service, each failure
costs 1 of 3777 nodes on the old grids versus 1 of 944 nodes on the new grids.
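The per-failure loss follows directly from the node counts, since one failed disk removes one whole node from the grid. A minimal sketch of that arithmetic is below; the class and method names are made up for illustration, and the values are computed from the unrounded node counts, so they can differ slightly from rounded slide figures.

```java
final class SingleDiskImpact {
    /** Share of the grid, in percent, lost when one node out of nodeCount goes offline. */
    static double percentLostPerNode(int nodeCount) {
        return 100.0 / nodeCount;
    }

    public static void main(String[] args) {
        double oldGrid = percentLostPerNode(3777); // old generation grid
        double newGrid = percentLostPerNode(944);  // new grid
        // Ratio of the two shares equals 3777 / 944.
        System.out.printf("old: %.3f%%, new: %.3f%%, ratio: %.1fx%n",
                oldGrid, newGrid, newGrid / oldGrid);
    }
}
```

The same fraction applies to compute capacity, because the node's task slots go away together with its disks.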
Node Statistics


    Total nodes    Active         Blacklisted    Excluded
    30242          28436 (94%)    65 (0.2%)      1741 (6%)

Breakout of blacklisted nodes in all grids

    Ethernet link failure    Disk failure
    11 (17% of failures)     54 (83% of failures)
What is DFIP?
• DFIP – Disk Fail Inplace
• Keep Hadoop running on a node when disks fail, until the number of
  failed disks reaches a threshold.
• Primarily affects the DataNode and TaskTracker.
• We took a holistic approach to solving the disk failure problem.
Why now?
• Trend toward high-density storage (36TB per node)
    – Cost of losing a node is high
• To increase operational efficiency
    – Utilization
    – Scaling data
    – Various other benefits
Where to inject a failure?
• Complete stack analysis for disk failures.

    DataNode        TaskTracker
               JVM
              Linux
       SCSI Device Driver
Operational Changes
Lab Setup
• 40-node cluster on two racks
• Kickstart and TFTP Server
• Kerberos Server
Lab Setup(Cont…)
• PXE Boot, TFTP Server, DHCP Server & Kerberos Server.

    PXE Server          Kerberos Server
             Hadoop Nodes
Operational Improvement
• With DFIP, we completely changed the Hadoop deployment layout.
• Re-imaging Linux on a 12-disk system took 4 hours.
    – Improvement: we reduced the re-image time to 20 minutes (12x better).
Hadoop Changes
Analysis Phase
• Which files are used?
    – Use Linux system commands to identify them.
• Identified all the files used by the DataNode and TaskTracker:
  logs, tmp, conf, system libraries, jars, etc.
Methodology
• umount -l (lazy unmount)
• chmod 000, 400, etc.
• SystemTap
    – Similar to DTrace on Solaris.
    – Probes the modules of interest.
    – Wrote probes for the SCSI and CCISS modules.
Failure Framework
• SystemTap (stap) based framework
• Requires root privileges
• Time-duration-based injection
• Developed for SCSI and CCISS drivers.
Hadoop Changes
• Umbrella Jira – Hadoop Disk Fail Inplace: HADOOP-7123
    – TaskTracker: HADOOP-7124
    – DataNode: HADOOP-7125
File Management
• Separate out user and system files
• RAID1 on system files
• System files
    – Kernel files, Hadoop binaries, pids and logs & JDK
• User files
    – HDFS data, task logs and output, distributed cache, etc.
Datanode impact
• Separation of system and user files
• DataNode logs on RAID1
• DataNode doesn’t honor volumes tolerated.
    – Jira – HDFS-1592
• DataNode process doesn’t exit when disks fail
    – Jira – HDFS-1692
DataNode: HDFS-1592
• DataNode doesn’t honor volumes tolerated.
    – Startup failure.
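The startup rule this issue is about can be sketched as a simple count: start as long as the number of failed volumes does not exceed the tolerated number. The sketch below is an illustration under assumptions, not the DataNode's actual code; the class and method names are hypothetical, and the tolerated count is assumed to come from the `dfs.datanode.failed.volumes.tolerated` setting.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class VolumeToleranceSketch {
    /** Returns the configured data directories that are currently usable. */
    static List<File> usableVolumes(List<File> configured) {
        List<File> good = new ArrayList<>();
        for (File dir : configured) {
            // A failed or unmounted disk typically shows up as a missing or inaccessible directory.
            if (dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute()) {
                good.add(dir);
            }
        }
        return good;
    }

    /** Hypothetical startup rule: at least one usable volume, and failures within the tolerated count. */
    static boolean mayStart(int configuredCount, int usableCount, int toleratedFailures) {
        int failed = configuredCount - usableCount;
        return usableCount > 0 && failed <= toleratedFailures;
    }
}
```

With 12 configured volumes, 11 usable and a tolerated count of 1, `mayStart(12, 11, 1)` is true; with a tolerated count of 0 the same node would refuse to start.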
DataNode: HDFS-1692
• DataNode process doesn’t exit when disks fail
    – Runtime issue (Secure Mode).
TaskTracker Impact
• Separation of system and user files
• TaskTracker logs on RAID1
• TaskTracker should handle disk failures at both startup and runtime.
    – Jira: MAPREDUCE-2413
• Distribute task userlogs on multiple disks.
    – Jira: MAPREDUCE-2415
• Components impacted: Linux task controller, default task controller,
  health check script, security, and most other TaskTracker components.
Tasktracker: MAPREDUCE-2413
• TaskTracker should handle disk failures at both startup and runtime.
    – Keep track of good disks all the time.
    – Pass the good disks to all the components, such as
      DefaultTaskController and LinuxTaskController.
    – Periodically check for disk failures.
    – If a disk failure happens, re-initialize the TaskTracker.
    – Modified the health check scripts.
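The steps above (track good disks, hand them to other components, re-check periodically, re-initialize on change) can be sketched roughly as follows. This is an illustration under assumptions, not the TaskTracker's code: `GoodDiskTracker` and its methods are hypothetical names, and "good" is simplified to "is a writable directory".

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class GoodDiskTracker {
    private final List<File> configured;
    private List<File> good;

    GoodDiskTracker(List<File> configuredDirs) {
        this.configured = new ArrayList<>(configuredDirs);
        this.good = scan();
    }

    /** Current view of healthy local directories, to be passed to task controllers. */
    synchronized List<File> goodDirs() {
        return new ArrayList<>(good);
    }

    /** Re-scans the disks; returns true when the set changed, i.e. a re-init is needed. */
    synchronized boolean checkAndUpdate() {
        List<File> now = scan();
        boolean changed = !now.equals(good);
        good = now;
        return changed;
    }

    /** Simplified health check: the directory must exist and be writable. */
    private List<File> scan() {
        List<File> ok = new ArrayList<>();
        for (File dir : configured) {
            if (dir.isDirectory() && dir.canWrite()) {
                ok.add(dir);
            }
        }
        return ok;
    }
}
```

A caller could invoke `checkAndUpdate()` from a scheduled task (for example a `ScheduledExecutorService`) and trigger its own re-initialization whenever the method returns true.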
TaskTracker: MAPREDUCE-2415
• Distribute task userlogs on multiple disks.
    – Userlogs kept on a single disk are a single point of failure.
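One simple way to spread userlogs over several disks is to hash the task attempt id across the healthy directories. The sketch below only illustrates the idea; the placement policy, the class name `UserLogDirPicker` and the `userlogs` subdirectory layout are assumptions, and the actual MAPREDUCE-2415 change may place logs differently.

```java
import java.io.File;
import java.util.List;

final class UserLogDirPicker {
    /** Picks a log directory for a task attempt by hashing its id over the healthy disks. */
    static File pick(List<File> goodDirs, String attemptId) {
        if (goodDirs.isEmpty()) {
            throw new IllegalStateException("no healthy disk available for user logs");
        }
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(attemptId.hashCode(), goodDirs.size());
        return new File(new File(goodDirs.get(index), "userlogs"), attemptId);
    }
}
```

Because the choice depends only on the attempt id and the current list of good disks, losing one disk affects only the logs that were placed on it.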
Rigorous Testing
• Random writer benchmark (with failures)
• Terasort benchmark (with failures)
• Gridmix v3 benchmark (with failures)
• Passed 950 QA tests
• Tested with Valgrind for memory leaks
Some Code lessons
Read JDK APIs carefully
• What is the problem with this code?

File[] fileList = dir.listFiles();
for (File f : fileList) {
…
}
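The catch is that `File.listFiles()` returns `null`, not an empty array, when the path is not a directory or an I/O error occurs, which is exactly what happens on a failed disk, so the loop throws a `NullPointerException`. A minimal sketch of a guarded version follows; the helper name `listOrThrow` is made up for illustration.

```java
import java.io.File;
import java.io.IOException;

final class SafeListing {
    /** Lists a directory, turning the null that File.listFiles() returns on failure into an exception. */
    static File[] listOrThrow(File dir) throws IOException {
        File[] files = dir.listFiles();
        if (files == null) {
            // null means "not a directory" or "I/O error", never "empty directory".
            throw new IOException("cannot list " + dir + " (not a directory, or an I/O error occurred)");
        }
        return files;
    }
}
```

Callers can then iterate over `listOrThrow(dir)` and treat the `IOException` as a disk failure signal instead of crashing on a null array.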
Exception Handling
• ServerSocket.accept() can throw AsynchronousCloseException when a
  channel-backed socket is closed by another thread while accept() is blocked.
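The behaviour can be reproduced in isolation. The sketch below is an assumption-laden illustration: it uses `ServerSocketChannel.accept()` directly rather than a `ServerSocket`, binds to an ephemeral loopback port, and the class name is hypothetical. A worker thread blocks in `accept()` while the caller closes the channel.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

final class AcceptCloseSketch {
    /** Blocks a worker in accept(), closes the channel from the caller, and returns what accept() threw. */
    static IOException closeDuringAccept() {
        AtomicReference<IOException> thrown = new AtomicReference<>();
        CountDownLatch aboutToAccept = new CountDownLatch(1);
        try {
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            Thread acceptor = new Thread(() -> {
                try {
                    aboutToAccept.countDown();
                    server.accept(); // blocks: no client ever connects
                } catch (IOException e) {
                    thrown.set(e); // reached once the channel is closed
                }
            });
            acceptor.start();
            aboutToAccept.await();
            Thread.sleep(200); // give the worker time to block inside accept()
            server.close();    // closing from another thread ends the blocked accept()
            acceptor.join(5000);
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return thrown.get();
    }
}
```

The returned exception is normally an `AsynchronousCloseException`; if the close happens to win the race before `accept()` starts, it is its parent type `ClosedChannelException`. Either way, an accept loop needs to decide explicitly whether this means an orderly shutdown or a failure that should stop the process.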
Future Work
• Disk hot swap.
• More kinds of failures: timeouts, CRC errors, network, CPU, memory, etc.
• And more :-)
Thank you
Contacts:
Email: mundlapudi@yahoo.com
LinkedIn: http://www.linkedin.com/pub/bharath-mundlapudi/2/148/501
