1© Cloudera, Inc. All rights reserved.
Apache Hadoop 3
Andrew Wang (andrew.wang@cloudera.com)
Daniel Templeton (daniel@cloudera.com)
Who We Are
Andrew Wang
● HDFS @ Cloudera
● Hadoop PMC Member
● Release Manager for Hadoop 3.0
Daniel Templeton
● YARN @ Cloudera
● Hadoop PMC Member
An Abbreviated History of Hadoop Releases
Date        Release  Major Notes
2007-11-04  0.14.1   First release at the ASF
2011-12-27  1.0.0    Security, HBase support
2012-05-23  2.0.0    YARN, NameNode HA, wire compatibility
2014-11-18  2.6.0    HDFS encryption, rolling upgrade, node labels
2015-04-21  2.7.0    Truncate, variable-length blocks, YARN global caching
2017-03-22  2.8.0    Cloud improvements, Azure Data Lake, etc.
2017-11-17  2.9.0    Stability improvements
2017-12-13  3.0.0    Java 8, erasure coding, S3Guard, YARN Timeline Service
Motivation for Hadoop 3
● Upgrade minimum Java version to Java 8
○ Java 7 end-of-life in April 2015
○ Many Java libraries now only support Java 8
● HDFS erasure coding
○ Major feature that refactored core pieces of HDFS
○ Too big to backport to 2.x
● Classpath isolation
○ Potentially impacts all clients
● Other miscellaneous incompatible bugfixes and improvements
○ Hadoop 2.x was branched in 2011
○ 6 years of changes waiting for 3.0
Hadoop 3 Status and Release Plan
● After four alphas and one beta, 3.0.0 is out!
● Took close to two years from inception
● 3.0.1 and 3.1.0 are already in progress
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+3.0.0+release
Release       Date
3.0.0-alpha1  2016-09-03  ✔
3.0.0-alpha2  2017-01-25  ✔
3.0.0-alpha3  2017-05-26  ✔
3.0.0-alpha4  2017-07-07  ✔
3.0.0-beta1   2017-10-03  ✔
3.0.0 GA      2017-12-13  ✔
3.0.1         2018 Mar
HDFS & Hadoop Features
3x replication vs. Erasure coding
/foo.csv - 3 block file
b1 b2 b3
b1 b2 b3
b1 b2 b3
3 replicas x 3 blocks = 9 total replicas
(9 - 3) / 3 = 200% overhead!
3x replication vs. Erasure coding
/foo.csv - 3 block file
b1 b2 b3
p1 p2
3 data blocks, 2 parity blocks
3 + 2 = 5 replicas
(5 - 3) / 3 = 67% overhead!
/bigfoo.csv - 10 block file
b1 b2 ... b10
p1 ... p4
10 data blocks, 4 parity blocks
10 + 4 = 14 replicas
(14 - 10) / 10 = 40% overhead!
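The overhead arithmetic on these slides can be checked with a short sketch. The function names below are illustrative, not part of any Hadoop API; overhead is taken to mean extra storage beyond the raw data, expressed as a fraction of the raw data.

```python
from fractions import Fraction


def replication_overhead(replicas: int) -> Fraction:
    """Extra copies stored per block of raw data (3 replicas -> 2 extra)."""
    return Fraction(replicas - 1, 1)


def rs_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Parity blocks stored per data block for a Reed-Solomon(k, m) layout."""
    return Fraction(parity_blocks, data_blocks)


def as_percent(overhead: Fraction) -> str:
    """Format an overhead fraction as a whole-number percentage."""
    return f"{float(overhead) * 100:.0f}%"


print("3x replication:", as_percent(replication_overhead(3)))
for k, m in [(3, 2), (6, 3), (10, 4)]:
    print(f"RS({k},{m}):", as_percent(rs_overhead(k, m)))
```

Wider layouts spend a smaller share of storage on parity, which is why RS(10,4) has lower overhead than RS(3,2).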
EC Reconstruction
/foo.csv - 3 block file, Reed-Solomon (3,2)
b1 b2 b3 p1 p2
X (b3 is lost)
● Read 3 of the 4 remaining blocks
● Run RS to recover b3
● New copy of b3 recovered
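The idea of rebuilding a lost block from the survivors can be illustrated with a much simpler code than Reed-Solomon. The sketch below uses a single XOR parity block, which tolerates only one lost block; it is a stand-in for illustration and not the RS(3,2) code HDFS uses. The block contents are made-up sample bytes.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte blocks standing in for b1, b2, b3.
b1, b2, b3 = b"\x01\x02\x03\x04", b"hdfs", b"yarn"

# Encode: one parity block computed over all data blocks.
parity = xor_blocks([b1, b2, b3])

# b3 is lost: read the surviving blocks and XOR them to rebuild it.
recovered = xor_blocks([b1, b2, parity])
print(recovered == b3)
```

Reed-Solomon generalizes this: with m parity blocks, any m lost blocks can be rebuilt from any k of the surviving ones.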
Erasure coding (HDFS-7285)
● Motivation: improve storage efficiency of HDFS
○ ~2x the storage efficiency compared to 3x replication
○ Reduction of overhead from 200% to 40%
● Uses Reed-Solomon(k,m) erasure codes instead of replication
○ Support for multiple erasure coding policies
○ RS(3,2), RS(6,3), RS(10,4)
● Can improve data durability
○ RS(6,3) can tolerate 3 failures
○ RS(10,4) can tolerate 4 failures
● Missing blocks reconstructed from remaining blocks
EC implications
● File data is striped across multiple nodes and racks
● Reads and writes are remote and cross-rack
● Reconstruction is network-intensive: rebuilding one RS(k,m) block reads k blocks cross-rack
● Important to use Intel’s optimized ISA-L for performance
○ 1+ GB/s encode/decode speed, much faster than Java implementation
● Combine data into larger files to avoid an explosion in # replicas
○ Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas)
○ Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas)
● Works best for archival / cold data use cases
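The small-file warning above can be reproduced with a rough count. This sketch mirrors the slide's arithmetic under a simplifying assumption: every group of up to k blocks' worth of data is striped into k + m EC blocks. The function names are illustrative only.

```python
import math


def replicated_block_count(file_blocks: int, replicas: int = 3) -> int:
    """Block replicas stored for a file under plain replication."""
    return file_blocks * replicas


def striped_block_count(file_blocks: int, k: int = 10, m: int = 4) -> int:
    """EC blocks stored, assuming each group of up to k blocks becomes k + m blocks."""
    groups = math.ceil(file_blocks / k)
    return groups * (k + m)


def ec_to_replication_ratio(file_blocks: int, k: int = 10, m: int = 4,
                            replicas: int = 3) -> float:
    """How many EC blocks exist per replicated block for the same file."""
    return (striped_block_count(file_blocks, k, m)
            / replicated_block_count(file_blocks, replicas))


# 1 x 1 GB block: 14 EC blocks vs. 3 replicas.
print(round(ec_to_replication_ratio(1), 2))
# 10 x 1 GB blocks: 14 EC blocks vs. 30 replicas.
print(round(ec_to_replication_ratio(10), 2))
```

The exact ratios are 14/3 and 14/30, which the slide truncates to 4.6x and 0.46x.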
EC performance
[Three figure-only slides]
Erasure coding status
● Massive development effort by the Hadoop community
○ 20+ contributors from many companies
■ Cloudera, Intel, Hortonworks, Huawei, Yahoo! JAPAN, …
○ Hundreds of commits over more than three years (started in 2014)
● Erasure coding is ready in 3.0.0 GA!
● Current focus is on testing and integration efforts
○ Want the complete Hadoop stack to work with HDFS erasure coding enabled
○ Ongoing stress / endurance testing to ensure stability at scale
Classpath isolation (HADOOP-11656)
● Hadoop leaks lots of dependencies onto the application’s classpath
○ Known offenders: Guava, Protobuf, Jackson, Jetty, …
● No separate HDFS client jar means server jars are leaked
● YARN / MR clients not shaded
● HDFS-6200: Split HDFS client into separate JAR
● HADOOP-11804: Shaded hadoop-client dependency
● YARN-6466: Shade the task umbilical for a clean YARN container environment (ongoing)
Miscellaneous
● Supportability improvements
○ Shell script rewrite
○ Intra-DataNode balancer
○ Move default ports out of the ephemeral range
● Support for multiple Standby NameNodes
● Cloud enhancements
○ Support for Microsoft Azure Data Lake and Aliyun OSS
○ S3 consistency and performance improvements
● Tightened Hadoop compatibility policy
YARN Features
Job History Server
[Diagram built up over six slides. Labels: Resource Manager, jobs, Node Manager, HDFS, Job History Server, Spark History Server, ?]
Application Timeline Service v2
● Store for application and system events and data
○ Distributed
○ Scalable
○ Structured Data Model
● Updated in real time
○ Application status
○ Application metrics
○ System metrics
● Fed by resource manager, node manager, and application masters
● REST API
Application Timeline Service v2
[Two architecture diagrams. Labels: Resource Manager, Node Manager, Application Master, jobs, Timeline Collector, Timeline Reader, Application Timeline Service, HBase]
Application Timeline Service v2 Flows
[Three figure-only slides]
Old YARN UI
New YARN UI
● Rich client application
○ Built on Node.js and Ember
● Improved visibility into cluster usage
○ Memory, CPU
○ By queues and applications
○ Sunburst graphs for hierarchical queues
○ NodeManager heatmap
● ATSv2 integration
○ Plot container start/stop events
○ Easy to capture delays in app execution
New YARN UI: Cluster Overview
New YARN UI: Queues
Resource Types
● Before Hadoop 3, memory and CPU were the only managed resources
● Resource Types allows adding new managed resources
○ Countable resources: GPUs, disks, etc.
○ Static resources: Java version, Python version, hardware profile, ...
■ Still in proposal stage
● Resource profiles
○ Conceptually similar to EC2 instance types
○ Capture complex resource requests
● Dominant Resource Fairness (DRF) for scheduling
● Current virtual CPU cores and memory resources work as before
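DRF (Dominant Resource Fairness) is the policy named above for scheduling across several resource types. The sketch below is a textbook version of the idea, not YARN's scheduler code: each user's dominant share is the largest fraction of any one resource they hold, and the next task goes to the user with the smallest dominant share. The resource names, users, and demands are hypothetical.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Return tasks granted per user under a simple DRF loop.

    capacity: resource name -> total units in the cluster.
    demands:  user -> (resource name -> units needed per task).
    """
    for user, need in demands.items():
        if not any(need.get(r, 0) > 0 for r in capacity):
            raise ValueError(f"{user} must request at least one resource")

    used = {r: 0 for r in capacity}
    allocated = {u: {r: 0 for r in capacity} for u in demands}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        return max(Fraction(allocated[user][r], capacity[r]) for r in capacity)

    while True:
        # Offer the next task to users in order of increasing dominant share.
        for user in sorted(demands, key=dominant_share):
            need = demands[user]
            if all(used[r] + need.get(r, 0) <= capacity[r] for r in capacity):
                for r in capacity:
                    used[r] += need.get(r, 0)
                    allocated[user][r] += need.get(r, 0)
                tasks[user] += 1
                break
        else:
            # No user's next task fits: the cluster is full.
            return tasks


cluster = {"vcores": 9, "memory_gb": 18}
per_task = {
    "A": {"vcores": 1, "memory_gb": 4},  # memory-heavy tasks
    "B": {"vcores": 3, "memory_gb": 1},  # CPU-heavy tasks
}
result = drf_allocate(cluster, per_task)
print(result)
```

In this run both users end with a dominant share of 2/3: A holds 12 of 18 GB and B holds 6 of 9 vcores. A new resource type such as GPUs would simply be one more key in `cluster` and in each demand.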
YARN Federation
● YARN scalability
○ Twitter runs a 10k node cluster with fair scheduler
○ Yahoo! runs 4k node cluster with capacity scheduler
● Federation
○ Restrict users to sub-clusters based on policy
○ Scalability to 100k nodes and beyond
○ Independent cluster scheduling
YARN Federation
[Diagram. Labels: Router, Policy, Admin, and two sub-clusters, each with one Resource Manager and four Node Managers]
Opportunistic Containers
● Scheduler’s job is to keep all resources busy
● Scheduling gaps
○ Nothing to run
○ Resource contention
○ Resource reservations
● Opportunistic containers fill those gaps
○ Requested explicitly
○ Dedicated scheduler
○ Queued at the node managers
○ Scheduled locally when resources are available
○ Preempted when guaranteed containers need to run
● Coming in 2.9 and 3.0
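The queue-and-preempt behavior listed above can be pictured with a toy model of a single node. This is an illustration under simplifying assumptions (one resource dimension, FIFO queue, preempted containers are dropped rather than requeued) and does not reflect the actual NodeManager implementation; the class and method names are invented for the sketch.

```python
from collections import deque


class ToyNode:
    """One node with a single resource dimension, e.g. vcores."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.running = []      # (name, size, kind) tuples
        self.queue = deque()   # queued opportunistic (name, size)

    def free(self):
        return self.capacity - sum(size for _, size, _ in self.running)

    def start_guaranteed(self, name, size):
        """Start a guaranteed container, preempting opportunistic ones if needed."""
        guaranteed = sum(s for _, s, kind in self.running if kind == "guaranteed")
        if guaranteed + size > self.capacity:
            raise RuntimeError("node cannot fit this guaranteed container")
        preempted = []
        while self.free() < size:
            # Evict the most recently started opportunistic container.
            victim = next(c for c in reversed(self.running) if c[2] == "opportunistic")
            self.running.remove(victim)
            preempted.append(victim[0])
        self.running.append((name, size, "guaranteed"))
        return preempted

    def submit_opportunistic(self, name, size):
        """Queue an opportunistic container; it starts once resources are free."""
        self.queue.append((name, size))
        self._drain()

    def finish(self, name):
        self.running = [c for c in self.running if c[0] != name]
        self._drain()

    def _drain(self):
        # Start queued containers in FIFO order while the head of the queue fits.
        while self.queue and self.queue[0][1] <= self.free():
            name, size = self.queue.popleft()
            self.running.append((name, size, "opportunistic"))


node = ToyNode(capacity=4)
node.start_guaranteed("g1", 2)
node.submit_opportunistic("o1", 2)   # fits in the gap, starts immediately
node.submit_opportunistic("o2", 1)   # node is full, waits in the queue
preempted = node.start_guaranteed("g2", 2)
node.finish("g1")                    # frees room, so o2 starts
print(preempted, [name for name, _, _ in node.running])
```

The point of the model is the asymmetry: guaranteed work never waits behind opportunistic work, while opportunistic work only ever uses capacity that would otherwise sit idle.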
Oversubscription
● Resource utilization is typically low in most clusters (20-50%)
○ Provision for peak usage
● Usage < Allocation
○ Mean Usage = ½ Peak Usage
Oversubscription
● Oversubscription
○ Allocate opportunistic containers to use allocated-but-unused resources
○ Jobs automatically use these unless they opt out
○ Threshold to control aggressiveness of oversubscription
○ Threshold to trigger preemption
● Currently in progress
Other YARN Improvements
● Long Running Services
○ Slider merging into YARN
○ Docker support
● Scheduler improvements
○ Capacity scheduler
■ Performance and preemption improvements
■ Online scheduling (“global scheduler”)
■ Queue management
○ Fair scheduler
■ Performance and preemption improvements
● High availability improvements
○ Better handling of transient network issues
○ ZK-store scalability: Limit number of children under a znode
● MapReduce Native Collector (MAPREDUCE-2841)
○ Native implementation of the map output collector
○ Up to 30% faster for shuffle-intensive jobs
Summary: What’s new in Hadoop 3.0?
● Storage Optimization
○ HDFS: Erasure codes
● Improved Visibility into Cluster Operations
○ YARN: ATSv2
○ YARN: New UI
● Scalability & Multi-tenancy
○ YARN: Federation
● Improved Utilization
○ YARN: Opportunistic Containers
○ YARN: Oversubscription
● Refactor Base
○ Lots of Trunk content
○ JDK8 and newer dependent libraries
Compatibility and Testing
Compatibility
● Strong feedback from large users on the need for compatibility
● Preserves wire-compatibility with Hadoop 2 clients
○ Impossible to coordinate upgrading off-cluster Hadoop clients
● Will support rolling upgrade from Hadoop 2 to Hadoop 3
○ Can’t take downtime to upgrade a business-critical cluster
● Not fully preserving API compatibility!
○ Dependency version bumps
○ Removal of deprecated APIs and tools
○ Shell script rewrite, rework of Hadoop tools scripts
○ Incompatible bug fixes
Testing and Validation
● Cloudera CDH 6 is based on upstream Hadoop 3.0.0
○ Running full test suite
○ Integration of Hadoop 3 with all components in CDH stack
○ Same integration tests used to validate CDH5
● Plans for extensive HDFS EC testing by Cloudera and Intel
● Happy synergy between 2.8.x and 3.0.x lines
○ Shares much of the same code, fixes flow into both
○ Yahoo! doing scale testing of 2.8.0
Conclusion
● Hadoop 3.0.0 GA is out!
● Shiny new features
○ HDFS erasure coding
○ Client classpath isolation
○ YARN ATSv2
○ YARN Federation
○ Opportunistic containers and oversubscription
● Great time to get involved in testing and validation
Thank you
Andrew Wang (andrew.wang@cloudera.com)
Daniel Templeton (daniel@cloudera.com)

More Related Content

What's hot (20)

PDF
API Security Best Practices & Guidelines
Prabath Siriwardena
 
PDF
Cluster-as-code. The Many Ways towards Kubernetes
QAware GmbH
 
PPTX
Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...
Vietnam Open Infrastructure User Group
 
PDF
Introduction to Vault
Knoldus Inc.
 
PDF
Domain driven design and model driven development
Dmitry Geyzersky
 
PDF
Karpenter
Knoldus Inc.
 
PDF
Introduction to Event Driven Architecture
CitiusTech
 
PDF
Cloud Native Applications on OpenShift
Serhat Dirik
 
PDF
OpenStack Swift
openstackindia
 
PPTX
Apache Kafka at LinkedIn
Discover Pinterest
 
PPTX
Cloud Native: what is it? Why?
Juan Pablo Genovese
 
PPTX
Room 2 - 1 - Phạm Quang Minh - A real DevOps culture in practice
Vietnam Open Infrastructure User Group
 
PPTX
Hashicorp Vault ppt
Shrey Agarwal
 
PDF
Istio : Service Mesh
Knoldus Inc.
 
PDF
Kubernetes 101
Crevise Technologies
 
PPTX
Survey of High Performance NoSQL Systems
ScyllaDB
 
PPTX
Write smart contract with solidity on Ethereum
Murughan Palaniachari
 
PDF
Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...
Edureka!
 
PDF
Adopting HashiCorp Vault
Nicolas Corrarello
 
PDF
Rust Embedded Development on ESP32 and basics of Async with Embassy
Juraj Michálek
 
API Security Best Practices & Guidelines
Prabath Siriwardena
 
Cluster-as-code. The Many Ways towards Kubernetes
QAware GmbH
 
Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...
Vietnam Open Infrastructure User Group
 
Introduction to Vault
Knoldus Inc.
 
Domain driven design and model driven development
Dmitry Geyzersky
 
Karpenter
Knoldus Inc.
 
Introduction to Event Driven Architecture
CitiusTech
 
Cloud Native Applications on OpenShift
Serhat Dirik
 
OpenStack Swift
openstackindia
 
Apache Kafka at LinkedIn
Discover Pinterest
 
Cloud Native: what is it? Why?
Juan Pablo Genovese
 
Room 2 - 1 - Phạm Quang Minh - A real DevOps culture in practice
Vietnam Open Infrastructure User Group
 
Hashicorp Vault ppt
Shrey Agarwal
 
Istio : Service Mesh
Knoldus Inc.
 
Kubernetes 101
Crevise Technologies
 
Survey of High Performance NoSQL Systems
ScyllaDB
 
Write smart contract with solidity on Ethereum
Murughan Palaniachari
 
Blockchain 101 | Blockchain Tutorial | Blockchain Smart Contracts | Blockchai...
Edureka!
 
Adopting HashiCorp Vault
Nicolas Corrarello
 
Rust Embedded Development on ESP32 and basics of Async with Embassy
Juraj Michálek
 

Similar to Apache Hadoop 3 (20)

PDF
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
PDF
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
PPTX
Hadoop 3 (2017 hadoop taiwan workshop)
Wei-Chiu Chuang
 
PPTX
What's new in hadoop 3.0
Heiko Loewe
 
PPTX
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
Newton Alex
 
PDF
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
PDF
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
PDF
Yarns About Yarn
Cloudera, Inc.
 
PPTX
Apache Spark Operations
Cloudera, Inc.
 
PPTX
Yarns about YARN: Migrating to MapReduce v2
DataWorks Summit
 
PDF
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Cloudera, Inc.
 
PPTX
Getting Apache Spark Customers to Production
Cloudera, Inc.
 
PDF
Troubleshooting Hadoop: Distributed Debugging
Great Wide Open
 
PPTX
Hadoop 2.0 yarn arch training
Nandan Kumar
 
PPTX
What's new in Hadoop Common and HDFS
DataWorks Summit/Hadoop Summit
 
PPTX
Five Tips for Running Cloudera on AWS
Cloudera, Inc.
 
PPTX
Scale-Out Resource Management at Microsoft using Apache YARN
DataWorks Summit/Hadoop Summit
 
PDF
The State of HBase Replication
HBaseCon
 
PPTX
Introduction to YARN and MapReduce 2
Cloudera, Inc.
 
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Junping Du
 
Hadoop 3 (2017 hadoop taiwan workshop)
Wei-Chiu Chuang
 
What's new in hadoop 3.0
Heiko Loewe
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
Newton Alex
 
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
Yarns About Yarn
Cloudera, Inc.
 
Apache Spark Operations
Cloudera, Inc.
 
Yarns about YARN: Migrating to MapReduce v2
DataWorks Summit
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Cloudera, Inc.
 
Getting Apache Spark Customers to Production
Cloudera, Inc.
 
Troubleshooting Hadoop: Distributed Debugging
Great Wide Open
 
Hadoop 2.0 yarn arch training
Nandan Kumar
 
What's new in Hadoop Common and HDFS
DataWorks Summit/Hadoop Summit
 
Five Tips for Running Cloudera on AWS
Cloudera, Inc.
 
Scale-Out Resource Management at Microsoft using Apache YARN
DataWorks Summit/Hadoop Summit
 
The State of HBase Replication
HBaseCon
 
Introduction to YARN and MapReduce 2
Cloudera, Inc.
 
Ad

More from Cloudera, Inc. (20)

PPTX
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
PPTX
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
PPTX
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
PPTX
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
PPTX
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
PPTX
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
PPTX
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
PPTX
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
PPTX
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
PPTX
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
PPTX
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
PPTX
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
PPTX
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
PPTX
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
PPTX
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
PPTX
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 
Ad

Recently uploaded (20)

PDF
Technical-Careers-Roadmap-in-Software-Market.pdf
Hussein Ali
 
PPTX
Comprehensive Risk Assessment Module for Smarter Risk Management
EHA Soft Solutions
 
PDF
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
PPTX
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
PDF
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
PPTX
Milwaukee Marketo User Group - Summer Road Trip: Mapping and Personalizing Yo...
bbedford2
 
PPTX
OpenChain @ OSS NA - In From the Cold: Open Source as Part of Mainstream Soft...
Shane Coughlan
 
PPTX
Tally software_Introduction_Presentation
AditiBansal54083
 
PDF
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
PDF
Driver Easy Pro 6.1.1 Crack Licensce key 2025 FREE
utfefguu
 
PDF
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
PDF
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
PDF
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
PDF
[Solution] Why Choose the VeryPDF DRM Protector Custom-Built Solution for You...
Lingwen1998
 
PDF
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
PPTX
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
PDF
How to Hire AI Developers_ Step-by-Step Guide in 2025.pdf
DianApps Technologies
 
PPTX
ChiSquare Procedure in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
PPTX
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
PPTX
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
Technical-Careers-Roadmap-in-Software-Market.pdf
Hussein Ali
 
Comprehensive Risk Assessment Module for Smarter Risk Management
EHA Soft Solutions
 
SAP Firmaya İade ABAB Kodları - ABAB ile yazılmıl hazır kod örneği
Salih Küçük
 
Agentic Automation Journey Session 1/5: Context Grounding and Autopilot for E...
klpathrudu
 
SciPy 2025 - Packaging a Scientific Python Project
Henry Schreiner
 
Milwaukee Marketo User Group - Summer Road Trip: Mapping and Personalizing Yo...
bbedford2
 
OpenChain @ OSS NA - In From the Cold: Open Source as Part of Mainstream Soft...
Shane Coughlan
 
Tally software_Introduction_Presentation
AditiBansal54083
 
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
Driver Easy Pro 6.1.1 Crack Licensce key 2025 FREE
utfefguu
 
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
The 5 Reasons for IT Maintenance - Arna Softech
Arna Softech
 
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
[Solution] Why Choose the VeryPDF DRM Protector Custom-Built Solution for You...
Lingwen1998
 
Wondershare PDFelement Pro Crack for MacOS New Version Latest 2025
bashirkhan333g
 
In From the Cold: Open Source as Part of Mainstream Software Asset Management
Shane Coughlan
 
How to Hire AI Developers_ Step-by-Step Guide in 2025.pdf
DianApps Technologies
 
ChiSquare Procedure in IBM SPSS Statistics Version 31.pptx
Version 1 Analytics
 
Foundations of Marketo Engage - Powering Campaigns with Marketo Personalization
bbedford2
 
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 

Apache Hadoop 3

  • 1. 1© Cloudera, Inc. All rights reserved. Apache Hadoop 3 Andrew Wang Daniel Templeton [email protected] [email protected]
  • 2. 2© Cloudera, Inc. All rights reserved. Who We Are Andrew Wang ● HDFS @ Cloudera ● Hadoop PMC Member ● Release Manager for Hadoop 3.0 Daniel Templeton ● YARN @ Cloudera ● Hadoop PMC Member
  • 3. 3© Cloudera, Inc. All rights reserved. An Abbreviated History of Hadoop Releases Date Release Major Notes 2007-11-04 0.14.1 First release at the ASF 2011-12-27 1.0.0 Security, HBase support 2012-05-23 2.0.0 YARN, NameNode HA, wire compat 2014-11-18 2.6.0 HDFS encryption, rolling upgrade, node labels 2015-04-21 2.7.0 Truncate, Variable-length blocks, YARN Global Caching, 2017-03-22 2.8.0 Cloud improvement, Azure Data Lake, and etc. 2017-11-17 2.9.0 Stability Improvement 2017-12-13 3.0.0 Java 8, Erasure Coding, S3Guard, YARN Timeline Service
  • 4. 4© Cloudera, Inc. All rights reserved. Motivation for Hadoop 3 ● Upgrade minimum Java version to Java 8 ○ Java 7 end-of-life in April 2015 ○ Many Java libraries now only support Java 8 ● HDFS erasure coding ○ Major feature that refactored core pieces of HDFS ○ Too big to backport to 2.x ● Classpath isolation ○ Potentially impacts all clients ● Other miscellaneous incompatible bugfixes and improvements ○ Hadoop 2.x was branched in 2011 ○ 6 years of changes waiting for 3.0
  • 5. 5© Cloudera, Inc. All rights reserved. Hadoop 3 Status and Release Plan ● After four alphas and one beta, 3.0.0 is out! ● Took close to two years from inception ● 3.0.1 and 3.1.0 are already in progress https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+3.0.0+release Release Date 3.0.0-alpha1 2016-09-03 ✔ 3.0.0-alpha2 2017-01-25 ✔ 3.0.0-alpha3 2017-05-26 ✔ 3.0.0-alpha4 2017-07-07 ✔ 3.0.0-beta1 2017-10-03 ✔ 3.0.0 GA 2017-12-13 ✔ 3.0.1 2017 Mar
  • 6. 6© Cloudera, Inc. All rights reserved. HDFS & Hadoop Features
  • 7. 7© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file
  • 8. 8© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file b1 b2 b3 b1 b2 b3
  • 9. 9© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file b1 b2 b3 b1 b2 b3 3 replicas 3 blocks 3 x 3 = 9 total replicas 9 / 3 = 200% overhead!
  • 10. 10© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file
  • 11. 11© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2
  • 12. 12© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2 3 data blocks 2 parity blocks 3 + 2 = 5 replicas 5 / 3 = 67% overhead!
  • 13. 13© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2 3 data blocks 2 parity blocks 3 + 2 = 5 replicas 5 / 3 = 67% overhead! b1 b2 b10 /bigfoo.csv - 10 block file p1 p4 10 data blocks 4 parity blocks 10 + 4 = 14 replicas 14 / 10 = 40% overhead! ... ...
  • 14. 14© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2)
  • 15. 15© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2) X
  • 16. 16© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2) Read 3 remaining blocks b3 Run RS to recover b3 New copy of b3 recovered X
  • 17. 17© Cloudera, Inc. All rights reserved. Erasure coding (HDFS-7285) ● Motivation: improve storage efficiency of HDFS ○ ~2x the storage efficiency compared to 3x replication ○ Reduction of overhead from 200% to 40% ● Uses Reed-Solomon(k,m) erasure codes instead of replication ○ Support for multiple erasure coding policies ○ RS(3,2), RS(6,3), RS(10,4) ● Can improves data durability ○ RS(6,3) can tolerate 3 failures ○ RS(10,4) can tolerate 4 failures ● Missing blocks reconstructed from remaining blocks
  • 18. 18© Cloudera, Inc. All rights reserved. EC implications ● File data is striped across multiple nodes and racks ● Reads and writes are remote and cross-rack ● Reconstruction is network-intensive, reads m blocks cross-rack ● Important to use Intel’s optimized ISA-L for performance ○ 1+ GB/s encode/decode speed, much faster than Java implementation ● Combine data into larger files to avoid an explosion in # replicas ○ Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas) ○ Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas) ● Works best for archival / cold data use cases
  • 19. 19© Cloudera, Inc. All rights reserved. EC performance
  • 20. 20© Cloudera, Inc. All rights reserved. EC performance
  • 21. 21© Cloudera, Inc. All rights reserved. EC performance
  • 22. 22© Cloudera, Inc. All rights reserved. Erasure coding status ● Massive development effort by the Hadoop community ○ 20+ contributors from many companies ■ Cloudera, Intel, Hortonworks, Huawei, Y! JP, … ○ 100s of commits over more than three years (started in 2014) ● Erasure coding is ready in 3.0.0 GA! ● Current focus is on testing and integration efforts ○ Want the complete Hadoop stack to work with HDFS erasure coding enabled ○ Ongoing stress / endurance testing to ensure stability at scale
  • 23. 23© Cloudera, Inc. All rights reserved. ● Hadoop leaks lots of dependencies onto the application’s classpath ○ Known offenders: Guava, Protobuf, Jackson, Jetty, … ● No separate HDFS client jar means server jars are leaked ● YARN / MR clients not shaded ● HDFS-6200: Split HDFS client into separate JAR ● HADOOP-11804: Shaded hadoop-client dependency ● YARN-6466: Shade the task umbilical for a clean YARN container environment (ongoing) Classpath isolation (HADOOP-11656)
  • 24. 24© Cloudera, Inc. All rights reserved. Miscellaneous ● Supportability improvements ○ Shell script rewrite ○ Intra-DataNode balancer ○ Move default ports out of the ephemeral range ● Support for multiple Standby NameNodes ● Cloud enhancements ○ Support for Microsoft Azure Data Lake and Aliyun OSS ○ S3 consistency and performance improvements ● Tightened Hadoop compatibility policy
  • 25. 25© Cloudera, Inc. All rights reserved. YARN Features
  • 26. 26© Cloudera, Inc. All rights reserved. Job History Server Resource Manager
  • 27. 27© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs
  • 28. 28© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server
  • 29. 29© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server HDFS Node Manager
  • 30. 30© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server Spark History Server
  • 31. 31© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server Spark History Server ?
  • 32. 32© Cloudera, Inc. All rights reserved. Application Timeline Service v2 ● Store for application and system events and data ○ Distributed ○ Scalable ○ Structured Data Model ● Updated in real time ○ Application status ○ Application metrics ○ System metrics ● Fed by resource manager, node manager, and application masters ● REST API
  • 33. 33© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Resource Manager jobs Application Timeline Service HBase
  • 34. 34© Cloudera, Inc. All rights reserved. Timeline Reader Timeline Reader Application Timeline Service v2 Resource Manager Timeline Collecter HBase Node Manager Application Master Timeline Collecter Timeline Reader
  • 35. 35© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 36. 36© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 37. 37© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 38. 38© Cloudera, Inc. All rights reserved. Old YARN UI
  • 39. 39© Cloudera, Inc. All rights reserved. New YARN UI ● Rich client application ○ Built on Node.js and Ember ● Improved visibility into cluster usage ○ Memory, CPU ○ By queues and applications ○ Sunburst graphs for hierarchical queues ○ NodeManager heatmap ● ATSv2 integration ○ Plot container start/stop events ○ Easy to capture delays in app execution
  • 40. New YARN UI: Cluster Overview
  • 41. New YARN UI: Queues
  • 42. Resource Types
    ● Before Hadoop 3, memory and CPU were the only managed resources
    ● Resource Types allows adding new managed resources
      ○ Countable resources: GPUs, disks, etc.
      ○ Static resources: Java version, Python version, hardware profile, ...
        ■ Still in proposal stage
    ● Resource profiles
      ○ Conceptually similar to EC2 instance types
      ○ Capture complex resource requests
    ● Dominant Resource Fairness (DRF) for scheduling
    ● Current virtual CPU cores and memory resources work as before
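To make the DRF bullet concrete, here is a minimal sketch of Dominant Resource Fairness over an arbitrary set of resource types. It is not YARN's scheduler code: the function names, the resource names, and the cluster sizes are all hypothetical. The idea is that each application's "dominant share" is its largest usage fraction across all resource types, and the next allocation goes to the application whose dominant share is smallest.

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(usage: Resources, capacity: Resources) -> float:
    """Largest fraction of any single cluster resource consumed by `usage`."""
    return max(usage.get(name, 0.0) / total for name, total in capacity.items())


def pick_next(apps: Dict[str, Resources], capacity: Resources) -> str:
    """Return the app with the smallest dominant share (served next under DRF)."""
    return min(apps, key=lambda app: dominant_share(apps[app], capacity))


if __name__ == "__main__":
    # Hypothetical cluster with a third, countable resource type (GPUs).
    capacity = {"vcores": 100, "memory_gb": 400, "gpu": 8}
    apps = {
        "etl": {"vcores": 40, "memory_gb": 80},                  # dominated by vcores
        "training": {"vcores": 8, "memory_gb": 40, "gpu": 2},    # dominated by GPUs
    }
    for name, usage in apps.items():
        print(name, dominant_share(usage, capacity))
    print("next allocation goes to:", pick_next(apps, capacity))
```

Because the computation iterates over whatever resource names the cluster declares, adding a new countable resource changes the input data only, which is the property the Resource Types feature relies on.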
  • 43. YARN Federation
    ● YARN scalability
      ○ Twitter runs a 10k-node cluster with the fair scheduler
      ○ Yahoo! runs a 4k-node cluster with the capacity scheduler
    ● Federation
      ○ Restrict users to sub-clusters based on policy
      ○ Scalability to 100k nodes and beyond
      ○ Independent cluster scheduling
  • 44. YARN Federation (diagram: Router, Policy Admin, and two Resource Managers, each with four Node Managers)
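The routing idea behind federation can be sketched as follows. This is an illustrative toy, not the YARN Router: the policy format (queue name mapped to allowed sub-clusters, with `"*"` as a fallback), the tie-breaking rule (most free capacity wins), and all names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SubCluster:
    name: str
    total_nodes: int
    busy_nodes: int

    @property
    def free_fraction(self) -> float:
        """Share of this sub-cluster's nodes that are currently idle."""
        return (self.total_nodes - self.busy_nodes) / self.total_nodes


def route(queue: str,
          policy: Dict[str, List[str]],
          sub_clusters: Dict[str, SubCluster]) -> str:
    """Pick a sub-cluster for a submission to `queue`.

    The policy restricts which sub-clusters a queue may use; among the
    allowed ones, this sketch picks the one with the most free capacity.
    """
    allowed = policy.get(queue, policy["*"])
    candidates = [sub_clusters[name] for name in allowed]
    return max(candidates, key=lambda sc: sc.free_fraction).name


if __name__ == "__main__":
    clusters = {
        "sc1": SubCluster("sc1", total_nodes=4000, busy_nodes=3000),
        "sc2": SubCluster("sc2", total_nodes=4000, busy_nodes=1000),
    }
    policy = {"prod": ["sc1"], "*": ["sc1", "sc2"]}
    print(route("prod", policy, clusters))    # restricted by policy
    print(route("adhoc", policy, clusters))   # falls back to the "*" rule
```

Each sub-cluster keeps scheduling independently once the submission has been routed; the router only chooses where the application lands.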
  • 45. Opportunistic Containers
    ● Scheduler’s job is to keep all resources busy
    ● Scheduling gaps
      ○ Nothing to run
      ○ Resource contention
      ○ Resource reservations
    ● Opportunistic containers fill those gaps
      ○ Requested explicitly
      ○ Dedicated scheduler
      ○ Queued at the node managers
      ○ Scheduled locally when resources are available
      ○ Preempted when guaranteed containers need to run
    ● Coming in 2.9 and 3.0
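The queue-then-preempt behavior in the list above can be sketched with a toy single-resource node. This is a simplified model, not Node Manager code: the class name, the single integer "size" per container, and the choice to kill the most recently started opportunistic container first are all assumptions for illustration.

```python
from collections import deque
from typing import Deque, List


class NodeSketch:
    """Toy node that runs guaranteed and opportunistic containers."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.guaranteed: List[int] = []       # sizes of running guaranteed containers
        self.opportunistic: List[int] = []    # sizes of running opportunistic containers
        self.queued: Deque[int] = deque()     # opportunistic containers waiting locally
        self.preempted = 0                    # count of killed opportunistic containers

    def used(self) -> int:
        return sum(self.guaranteed) + sum(self.opportunistic)

    def _start_queued(self) -> None:
        # Start queued opportunistic containers, in order, while they fit.
        while self.queued and self.used() + self.queued[0] <= self.capacity:
            self.opportunistic.append(self.queued.popleft())

    def submit_opportunistic(self, size: int) -> None:
        # Opportunistic containers are always accepted and queued at the node.
        self.queued.append(size)
        self._start_queued()

    def start_guaranteed(self, size: int) -> None:
        # Guaranteed containers win: preempt opportunistic ones until it fits.
        while self.opportunistic and self.used() + size > self.capacity:
            self.opportunistic.pop()
            self.preempted += 1
        if self.used() + size > self.capacity:
            raise ValueError("not enough capacity for guaranteed container")
        self.guaranteed.append(size)
        self._start_queued()

    def finish_guaranteed(self, size: int) -> None:
        # Freed resources are offered to the local opportunistic queue.
        self.guaranteed.remove(size)
        self._start_queued()


if __name__ == "__main__":
    node = NodeSketch(capacity=8)
    node.start_guaranteed(4)
    node.submit_opportunistic(3)   # fits in the gap, starts immediately
    node.submit_opportunistic(3)   # does not fit, waits in the node's queue
    node.start_guaranteed(4)       # preempts the running opportunistic container
    node.finish_guaranteed(4)      # queued opportunistic container now starts
    print(node.guaranteed, node.opportunistic, list(node.queued), node.preempted)
```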
  • 46. Oversubscription
    ● Resource utilization is typically low in most clusters (20-50%)
      ○ Provision for peak usage
    ● Usage < Allocation
      ○ Mean Usage = ½ Peak Usage
  • 47. Oversubscription
    ● Oversubscription
      ○ Allocate opportunistic containers to use allocated-but-unused resources
      ○ Jobs automatically use these unless they opt out
      ○ Threshold to control aggressiveness of oversubscription
      ○ Threshold to trigger preemption
    ● Currently in progress
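The two thresholds can be illustrated with a small arithmetic sketch. The formulas and names below are assumptions chosen to match the bullets, not the feature's actual configuration keys or implementation: one threshold caps how much measured utilization opportunistic containers may push a node to, and a second, higher threshold triggers preemption.

```python
def opportunistic_headroom(capacity: float,
                           utilization: float,
                           allocation_threshold: float) -> float:
    """Resources that may be handed to opportunistic containers.

    `utilization` is what running containers actually use, which can be far
    below what they were allocated. Oversubscription fills the node up to
    `allocation_threshold` (a fraction of capacity) of real usage.
    """
    if not 0.0 <= allocation_threshold <= 1.0:
        raise ValueError("allocation_threshold must be between 0 and 1")
    return max(0.0, allocation_threshold * capacity - utilization)


def needs_preemption(capacity: float,
                     utilization: float,
                     preemption_threshold: float) -> bool:
    """True when real usage is high enough that opportunistic work must yield."""
    return utilization > preemption_threshold * capacity


if __name__ == "__main__":
    # Hypothetical node: 100 GB fully allocated, but only 40 GB actually in use.
    print(opportunistic_headroom(100, 40, allocation_threshold=0.8))
    print(needs_preemption(100, 95, preemption_threshold=0.9))
```

A lower allocation threshold makes oversubscription more conservative; keeping the preemption threshold above it leaves a buffer before opportunistic containers are killed.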
  • 48. Other YARN Improvements
    ● Long Running Services
      ○ Slider merging into YARN
      ○ Docker support
    ● Scheduler improvements
      ○ Capacity scheduler
        ■ Performance and preemption improvements
        ■ Online scheduling (“global scheduler”)
        ■ Queue management
      ○ Fair scheduler
        ■ Performance and preemption improvements
    ● High availability improvements
      ○ Better handling of transient network issues
      ○ ZK-store scalability: Limit number of children under a znode
    ● MapReduce Native Collector (MAPREDUCE-2841)
      ○ Native implementation of the map output collector
      ○ Up to 30% faster for shuffle-intensive jobs
  • 49. Summary: What’s new in Hadoop 3.0?
    ● Storage Optimization
      ○ HDFS: Erasure codes
    ● Improved Visibility into Cluster Operations
      ○ YARN: ATSv2
      ○ YARN: New UI
    ● Scalability & Multi-tenancy
      ○ YARN: Federation
    ● Improved Utilization
      ○ YARN: Opportunistic Containers
      ○ YARN: Oversubscription
    ● Refactor Base
      ○ Lots of Trunk content
      ○ JDK8 and newer dependent libraries
  • 50. Compatibility and Testing
  • 51. Compatibility
    ● Strong feedback from large users on the need for compatibility
    ● Preserves wire-compatibility with Hadoop 2 clients
      ○ Impossible to coordinate upgrading off-cluster Hadoop clients
    ● Will support rolling upgrade from Hadoop 2 to Hadoop 3
      ○ Can’t take downtime to upgrade a business-critical cluster
    ● Not fully preserving API compatibility!
      ○ Dependency version bumps
      ○ Removal of deprecated APIs and tools
      ○ Shell script rewrite, rework of Hadoop tools scripts
      ○ Incompatible bug fixes
  • 52. Testing and Validation
    ● Cloudera CDH 6 is based on upstream Hadoop 3.0.0
      ○ Running full test suite
      ○ Integration of Hadoop 3 with all components in CDH stack
      ○ Same integration tests used to validate CDH5
    ● Plans for extensive HDFS EC testing by Cloudera and Intel
    ● Happy synergy between 2.8.x and 3.0.x lines
      ○ Shares much of the same code, fixes flow into both
      ○ Yahoo! doing scale testing of 2.8.0
  • 53. Conclusion
    ● Hadoop 3.0.0 GA is out!
    ● Shiny new features
      ○ HDFS erasure coding
      ○ Client classpath isolation
      ○ YARN ATSv2
      ○ YARN Federation
      ○ Opportunistic containers and oversubscription
    ● Great time to get involved in testing and validation
  • 54. Thank you
    Andrew Wang (andrew.wang@cloudera.com)
    Daniel Templeton (daniel@cloudera.com)