Apache Hadoop YARN: Present and Future

Vinod Kumar Vavilapalli
Hortonworks

© Hortonworks Inc. 2014
vinodkv [at] apache.org
@tshooter

A quick show of hands..
• Hadoop 2
Architecting the Future of Big Data
Real life Hadoop Logo

Who am I?
• 6.75 Hadoop-years old
• Last thing at school – a two-node Tomcat cluster. Three months
later, first thing on the job, brought down an 800-node cluster ;)
• Previously @Yahoo!
• Now @Hortonworks
• Two hats
– Hortonworks: Hadoop MapReduce and YARN Development lead
– Apache: Apache Hadoop YARN lead. Apache Hadoop PMC, Apache Member
• Worked/working on
– YARN, Hadoop MapReduce, HadoopOnDemand, CapacityScheduler, Hadoop
security
– Apache Ambari: Kickstarted the project and its first release
– Stinger: High performance data processing with Hadoop/Hive
• Lots of troubleshooting on clusters
• 99% + code in Apache, Hadoop

Agenda
• Apache Hadoop 2 : Overview
• Past
• Present
• Future

Apache Hadoop 2
Next Generation Architecture

What is YARN?
• Resource Management Platform
– MapReduce v2
– Beyond MapReduce with Tez, Storm, Spark; in Hadoop!
– Did I mention Services like HBase, Accumulo on YARN with HoYA/Slider?
• How is it different from Hadoop 1? ..

Hadoop 1 vs Hadoop 2
HADOOP 1.0: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
• MapReduce (data processing) and Others
• YARN (cluster resource management)
• HDFS2 (redundant, highly-available & reliable storage)

Key Benefits of YARN
• Scale
• New Programming Models & Services
• Improved cluster utilization
• Agility
• To infinity and beyond ..

Why Migrate?
• 2.0 >= 2 * 1.0
– HDFS: Lots of ground-breaking features
– YARN: Next generation architecture
• Return on Investment: 2x throughput on same hardware!
• Ready for improvements in hardware
• Not convinced? Let’s see what others are saying!

Yahoo!
• Leader/Visionary on all things Hadoop!
• On YARN (0.23.x)
• Moving fast to 2.x
http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html

Twitter

eBay
• Has one of the largest Hadoop clusters in the industry with many
petabytes of data
• Migrated production clusters to Hadoop-2
• Go to Mayank’s talk
– “Hadoop-2 @ ebay”!
– Thursday, April 3
– Track : Deployment and Operations
• You should be convinced by now... No?

YARN: the Data Operating System

Present

Apache Hadoop releases
Apache Hadoop 2.2
• 15 October, 2013
• The 1st GA release of Apache Hadoop 2.x
• YARN
– First stable and supported release of YARN
– Binary Compatibility for MapReduce applications built on hadoop-1.x
– YARN level APIs solidified for the future
– Performance
– Scale!
• HDFS
– High Availability for HDFS
– HDFS Federation
– HDFS Snapshots
– NFSv3 access to data in HDFS
• Support for running Hadoop on Microsoft Windows
• Substantial amount of integration testing with the rest of the projects in the
ecosystem

Apache Hadoop releases (contd)
Apache Hadoop 2.3
• 24 February, 2014
• First post-GA release for the year 2014
• Alpha features in YARN
– ResourceManager HA
– Application History
– Covered with the 2.4 content
• HDFS
– Details follow..
• A number of bug fixes and enhancements

HDFS: Heterogeneous Storage

HDFS: DataNode caching

Apache Hadoop releases (contd)
Apache Hadoop 2.4
• Very soon!
• YARN
– Details follow..
– ResourceManager restart and fail-over for high availability
– Preemption
– Application History and timeline
• HDFS
– FileSystem ACLs
– Rolling upgrades

ResourceManager Restart and fail-over
ZooKeeper

Capacity Scheduler Preemption

Application History and Timeline
• A few MR-specific implementations: history and web UI
• Not just MR anymore!
• History
– MapReduce specific Job History Server
– Beyond ResourceManager Restart
• Timeline
– Framework specific event collection and UIs
• Run analytics on historical apps!

Future

Future: Operational enhancements
• Rolling upgrades
– No/minimal impact to users
– Ideal: Always rolling!
• HDFS in
• YARN

Future: Enabling more apps
• Beyond MR
• Discussing next
– Long running services
– Isolation
– Multi-dimensional resource scheduling

Future: Long running services
• You can run them already!
• A few enhancements needed
– Logs
– Security
– Management/monitoring
• Resource sharing across workload types
• Project Slider

Fine-grain isolation for multi-tenancy
• Custom memory-monitoring
• Cgroups
• Linux Containers
• VMs
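
The "custom memory-monitoring" option above can be pictured as a periodic check of each container's memory usage against its limits. The following is only a minimal sketch under stated assumptions: the `ContainerUsage` record, the `containers_to_kill` helper, and the ratio value of 2.1 are hypothetical names and numbers chosen for illustration, not YARN's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContainerUsage:
    """Hypothetical snapshot of one container's memory usage (all values in MB)."""
    container_id: str
    pmem_limit_mb: int   # physical memory granted to the container
    pmem_used_mb: int    # physical memory currently used by its process tree
    vmem_used_mb: int    # virtual memory currently used by its process tree


def containers_to_kill(usages: List[ContainerUsage],
                       vmem_pmem_ratio: float = 2.1) -> List[str]:
    """Return ids of containers exceeding their physical or virtual memory limit.

    The virtual limit is derived from the physical limit via an assumed ratio.
    """
    offenders = []
    for u in usages:
        vmem_limit_mb = u.pmem_limit_mb * vmem_pmem_ratio
        if u.pmem_used_mb > u.pmem_limit_mb or u.vmem_used_mb > vmem_limit_mb:
            offenders.append(u.container_id)
    return offenders


usages = [
    ContainerUsage("c1", pmem_limit_mb=1024, pmem_used_mb=900, vmem_used_mb=1500),
    ContainerUsage("c2", pmem_limit_mb=1024, pmem_used_mb=1100, vmem_used_mb=1500),
    ContainerUsage("c3", pmem_limit_mb=1024, pmem_used_mb=500, vmem_used_mb=2200),
]
offenders = containers_to_kill(usages)
```

Polling of this kind only reacts after a limit is crossed, which is one reason the slide lists kernel-level mechanisms such as cgroups and Linux Containers as stronger forms of isolation.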

Multi-resource scheduling
• Today – memory & CPU
– Physical memory / virtual memory
– CPU cores / virtual cores
• CPU scheduling: needs more bake-in
• Disks
– Space
– IOPS
• Network
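
To make "multi-resource" concrete: a request has to fit on a node in every dimension at once, not just in memory. The sketch below is a toy model, assuming a hypothetical two-dimensional `Resource` type (memory and virtual cores) and a first-fit `place` helper; it is not the scheduler's real algorithm, and disks or network would simply be further fields.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical two-dimensional resource vector: memory in MB and virtual cores."""
    memory_mb: int
    vcores: int

    def fits_in(self, available: "Resource") -> bool:
        # A request fits only if every dimension fits, not just memory.
        return (self.memory_mb <= available.memory_mb
                and self.vcores <= available.vcores)

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb - other.memory_mb,
                        self.vcores - other.vcores)


def place(request: Resource, free: Dict[str, Resource]) -> Optional[str]:
    """Place the request on the first node with room in all dimensions.

    Deducts the granted resources from `free` and returns the node name,
    or None when no node can host the request.
    """
    for node, available in free.items():
        if request.fits_in(available):
            free[node] = available.minus(request)
            return node
    return None


free = {
    "node-a": Resource(memory_mb=8192, vcores=1),
    "node-b": Resource(memory_mb=4096, vcores=4),
}
# node-a has plenty of memory but too few vcores, so node-b is chosen.
first = place(Resource(memory_mb=2048, vcores=2), free)
# Now node-a still lacks vcores and node-b lacks memory: nothing fits.
second = place(Resource(memory_mb=4096, vcores=2), free)
```

The second request shows why a memory-only view is misleading: the cluster has more than enough free memory in total, yet no single node satisfies both dimensions.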

Other features
• Application SLAs
• Node labels
• Node affinity/anti-affinity
• Better online queue-management

YARN Ecosystem
Beyond the core YARN project: Briefly

Eco-system
Applications Powered by YARN
• Apache Giraph – Graph Processing
• Apache Hama – BSP
• Apache Hadoop MapReduce – Batch
• Apache Tez – Batch/Interactive
• Apache S4 – Stream Processing
• Apache Samza – Stream Processing
• Apache Storm – Stream Processing
• Apache Spark – Iterative applications
• HOYA – HBase on YARN
YARN Frameworks
• Apache Twill
• REEF by Microsoft
• Spring support for Hadoop 2
There's an app for that... YARN App Marketplace!

Apache TEZ
• Moving beyond MR
• A data processing framework that can execute a complex DAG of tasks.
• “Apache Tez - A New Chapter in Hadoop Data Processing”
– By Siddharth Seth: YARN & Tez Committer/PMC Member
– Thursday, April 3 (4:20-5:00pm)
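
What "executing a DAG of tasks" means can be sketched in a few lines. This is a toy illustration using Python's standard `graphlib` module (Python 3.9+), not the Tez API; the `run_dag` helper and the task names are hypothetical. Each task runs only after all of its upstream tasks have finished, and tasks that become ready together form a "wave" that a real engine could run in parallel.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, tasks):
    """Run tasks in dependency order.

    dependencies: maps a task name to the set of task names it depends on.
    tasks: maps a task name to a callable taking the list of upstream results
           (ordered by upstream task name).
    Returns (waves, results), where each wave lists tasks that were ready together.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves, results = [], {}
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        for name in ready:
            inputs = [results[d] for d in sorted(dependencies.get(name, ()))]
            results[name] = tasks[name](inputs)
            sorter.done(name)  # unblocks downstream tasks
        waves.append(ready)
    return waves, results


# Hypothetical pipeline: two scans feed a join, which feeds an aggregate.
dependencies = {"join": {"scan_a", "scan_b"}, "aggregate": {"join"}}
tasks = {
    "scan_a": lambda inputs: [1, 2, 3],
    "scan_b": lambda inputs: [10, 20],
    "join": lambda inputs: inputs[0] + inputs[1],
    "aggregate": lambda inputs: sum(inputs[0]),
}
waves, results = run_dag(dependencies, tasks)
```

Expressing the whole pipeline as one graph, rather than as a chain of separate MapReduce jobs, is what lets an engine schedule independent stages together and pass intermediate results directly between them.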

Recap

• Apache Hadoop 2 is, at least, twice as good!
• Exciting journey with Hadoop for this decade…
– Hadoop is no longer a one-trick pony, err, elephant
– Beyond just HDFS & MapReduce
• Architecture for the future
– Centralized data
– Exciting spectrum of application types, workloads and use cases

Couple more things..

The Book is out!


Thank you!
Download Sandbox: Experience Apache Hadoop
Both 2.x and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/
Question time!
