Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Tez
Bikas Saha @bikassaha
Apache Hadoop YARN and HDFS
Flexible
Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient
Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared
Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
The Data Operating System for Hadoop 2.x
Data Processing Engines Run Natively IN Hadoop
BATCH: MapReduce
LOG STORE: Kafka
STREAMING: Storm
IN-MEMORY: Spark
GRAPH: Giraph
SAS: LASR, HPA
ONLINE: HBase, Accumulo
OTHERS
HDFS: Redundant, Reliable Storage
YARN: Cluster Resource Management
Tez
•APIs and libraries to create data processing applications on YARN
•Customizable and adaptable DAG definition
•Orchestration framework to execute the DAG in a Hadoop cluster
•NOT a general purpose execution engine
Open Source
Apache Project
Tez – Goals
• Tez solves the hard problems of running on a distributed Hadoop environment
• Apps can focus on solving their domain specific problems
• Tez instantiates the physical execution structure. App fills in logic and behavior
• API targets data processing specified as a data flow graph
App
• Custom application logic
• Custom data format
• Custom data transfer technology
Tez
• Distributed parallel execution
• Negotiating resources from the Hadoop framework
• Fault tolerance and recovery
• Shared library of ready-to-use components
• Built-in performance optimizations
• Hadoop Security
Tez – Adoption
• Apache Hive
– Most popular SQL-like interface for data in Hadoop
• Apache Pig
– Scripting language used in some of the largest Hadoop installations
• Apache Flink (Stratosphere project from TU Berlin)
– General purpose engine with language integrated data processing API
• Cascading + Scalding
– Language integrated data processing API in Java/Scala
• Commercial Products
– Datameer, Syncsort and others in progress
Tez – Performance benefits
• Apache Hive
– Order of magnitude improvement in performance
– Speed up mainly from flexible DAG definition and runtime graph reconfiguration
– Performance oriented orchestration layer and shared library components
[Figure: logical DAG of Hive TPC-DS Query 64]
Tez – Scale and Reliability
• Apache Pig
– Predominant number of data processing jobs at Yahoo with up to 5000 node clusters
– Multi-Petabyte jobs
– On track for using Pig with Tez for all production Pig jobs
– Already use Hive with Tez for large scale analytics
• Hortonworks customers
– All new customers default on Hive with Tez
• Cascading + Scalding
– Cascading 3.0 released with Tez integration
– Very promising results with beta users
http://scalding.io/2015/05/scalding-cascading-tez-♥/
© Hortonworks Inc. 2013
Tez – DAG API
// Define DAG
DAG dag = DAG.create();
// Define Vertex
Vertex Scan1 = Vertex.create(Processor.class);
// Define Edge
Edge edge = Edge.create(Scan1, Partition1,
    SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    Output.class, Input.class);
// Connect them
dag.addVertex(Scan1).addEdge(edge)….
Defines the global logical processing flow
[Figure: logical DAG with vertices Scan1, Scan2, Partition1, Partition2 and Join, connected by scatter-gather edges]
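The logical flow on this slide can be modelled without any Tez dependency. The sketch below is a toy stand-in, not the Tez DAG API: `ToyDag`, `slideExample` and `topologicalOrder` are hypothetical names, and only the Scan1 to Partition1 edge comes from the slide code; the other three edges are assumed from the vertex layout in the figure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a logical DAG; this is NOT the Tez DAG class.
final class ToyDag {
    // vertex name -> names of downstream vertices, in insertion order
    private final Map<String, List<String>> downstream = new LinkedHashMap<String, List<String>>();
    // vertex name -> number of incoming edges
    private final Map<String, Integer> inDegree = new LinkedHashMap<String, Integer>();

    ToyDag addVertex(String name) {
        if (downstream.containsKey(name)) {
            throw new IllegalArgumentException("duplicate vertex: " + name);
        }
        downstream.put(name, new ArrayList<String>());
        inDegree.put(name, 0);
        return this;
    }

    ToyDag addEdge(String from, String to) {
        if (!downstream.containsKey(from) || !downstream.containsKey(to)) {
            throw new IllegalArgumentException("unknown vertex in edge " + from + " -> " + to);
        }
        downstream.get(from).add(to);
        inDegree.put(to, inDegree.get(to) + 1);
        return this;
    }

    // Kahn's algorithm: every producer vertex precedes its consumers in the result.
    List<String> topologicalOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<String, Integer>(inDegree);
        Deque<String> ready = new ArrayDeque<String>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<String>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                int left = remaining.get(next) - 1;
                remaining.put(next, left);
                if (left == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph has a cycle");
        }
        return order;
    }

    // The five-vertex join graph from the slide (edges other than Scan1 -> Partition1 are assumed).
    static ToyDag slideExample() {
        return new ToyDag()
            .addVertex("Scan1").addVertex("Scan2")
            .addVertex("Partition1").addVertex("Partition2")
            .addVertex("Join")
            .addEdge("Scan1", "Partition1").addEdge("Scan2", "Partition2")
            .addEdge("Partition1", "Join").addEdge("Partition2", "Join");
    }
}
```

Calling `ToyDag.slideExample().topologicalOrder()` gives one valid order in which the vertices could be started, with Join last.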
Tez – Logical DAG expansion at Runtime
[Figure: the logical vertices Scan1, Scan2, Partition1, Partition2 and Join expanded into parallel tasks]
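As a rough illustration of runtime expansion, a scatter-gather edge between two logical vertices becomes a connection from every producer task to every consumer task. The sketch below only enumerates those connections; `ScatterGatherExpansion` is a hypothetical helper and the task counts are made up for the example, not taken from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: enumerates the physical connections behind one scatter-gather edge.
final class ScatterGatherExpansion {
    // Returns {producerTask, consumerTask} index pairs; each producer feeds each consumer.
    static List<int[]> connections(int producerTasks, int consumerTasks) {
        if (producerTasks < 1 || consumerTasks < 1) {
            throw new IllegalArgumentException("both vertices need at least one task");
        }
        List<int[]> pairs = new ArrayList<int[]>();
        for (int p = 0; p < producerTasks; p++) {
            for (int c = 0; c < consumerTasks; c++) {
                pairs.add(new int[] {p, c});
            }
        }
        return pairs;
    }
}
```

With three producer tasks and two consumer tasks, `connections(3, 2)` yields six pairs, which is why reducing consumer parallelism also shrinks the amount of routing the framework has to manage.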
Tez – Task Composition
[Figure: logical DAG in which V-A feeds V-B over Edge AB and V-C over Edge AC. Task A runs Processor-A with Output-1 and Output-3; Task B runs Processor-B with Input-2; Task C runs Processor-C with Input-4]
V-A = { Processor-A.class }
V-B = { Processor-B.class }
V-C = { Processor-C.class }
Edge AB = { V-A, V-B, Output-1.class, Input-2.class }
Edge AC = { V-A, V-C, Output-3.class, Input-4.class }
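The composition above (a task is a processor plus the inputs and outputs contributed by its edges) can be sketched with a few plain interfaces. Everything here is hypothetical: `ToyInput`, `ToyOutput`, `ToyProcessor` and `ToyTask` are simplified stand-ins, and the real Tez input, processor and output interfaces are richer than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical minimal interfaces, not the Tez runtime API.
interface ToyInput { List<String> read(); }
interface ToyOutput { void write(String record); }
interface ToyProcessor { void run(List<ToyInput> inputs, List<ToyOutput> outputs); }

// A task is just a processor wired to the inputs and outputs its edges supply.
final class ToyTask {
    private final List<ToyInput> inputs;
    private final ToyProcessor processor;
    private final List<ToyOutput> outputs;

    ToyTask(List<ToyInput> inputs, ToyProcessor processor, List<ToyOutput> outputs) {
        this.inputs = inputs;
        this.processor = processor;
        this.outputs = outputs;
    }

    void execute() {
        processor.run(inputs, outputs);
    }
}

// Input backed by a fixed list of records.
final class InMemoryInput implements ToyInput {
    private final List<String> records;
    InMemoryInput(String... records) { this.records = Arrays.asList(records); }
    public List<String> read() { return records; }
}

// Output that remembers everything written to it.
final class CollectingOutput implements ToyOutput {
    final List<String> written = new ArrayList<String>();
    public void write(String record) { written.add(record); }
}

// Upper-cases every record from every input and sends it to every output.
final class UpperCaseProcessor implements ToyProcessor {
    public void run(List<ToyInput> inputs, List<ToyOutput> outputs) {
        for (ToyInput in : inputs) {
            for (String record : in.read()) {
                for (ToyOutput out : outputs) {
                    out.write(record.toUpperCase(Locale.ROOT));
                }
            }
        }
    }
}
```

Task A on the slide would correspond to a `ToyTask` with one processor and two outputs (one per outgoing edge). Swapping an output implementation changes how data moves without touching the processor.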
Tez – Composable Task Model
Adopt: Hive Processor with HDFS Input, Remote File Server Input, HDFS Output, Local Disk Output
Evolve: Custom Processor with HDFS Input, Remote File Server Input, HDFS Output, Local Disk Output
Optimize: Custom Processor with RDMA Input, Native DB Input, Kafka Pub-Sub Output, Amazon S3 Output
Tez – Customizable Core Engine
[Figure: the DAG starts Vertex-1 and Vertex-2; each Vertex Manager starts tasks, which get a priority from the DAG Scheduler and a container from the Task Scheduler]
• Vertex Manager
– Determines task parallelism
– Determines when tasks in a vertex can start
• DAG Scheduler
– Determines priority of task
• Task Scheduler
– Allocates containers from YARN and assigns them to tasks
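One decision a vertex manager can make is when the tasks of a consumer vertex may start. The sketch below shows a "slow start" style rule: consumers start once a chosen fraction of producer tasks has finished. `SlowStartPolicy` and its threshold parameter are hypothetical names for illustration, not a Tez class or configuration key.

```java
// Hypothetical start policy for the tasks of a consumer vertex.
final class SlowStartPolicy {
    private final double minCompletedFraction;

    SlowStartPolicy(double minCompletedFraction) {
        if (minCompletedFraction < 0.0 || minCompletedFraction > 1.0) {
            throw new IllegalArgumentException("fraction must be within [0, 1]");
        }
        this.minCompletedFraction = minCompletedFraction;
    }

    // True once enough producer tasks have completed for consumers to begin.
    boolean canStartConsumers(int completedProducerTasks, int totalProducerTasks) {
        if (totalProducerTasks <= 0) {
            throw new IllegalArgumentException("producer vertex needs at least one task");
        }
        if (completedProducerTasks < 0 || completedProducerTasks > totalProducerTasks) {
            throw new IllegalArgumentException("completed count out of range");
        }
        double completedFraction = (double) completedProducerTasks / totalProducerTasks;
        return completedFraction >= minCompletedFraction;
    }
}
```

A threshold of 0.0 starts consumers immediately, while 1.0 waits for every producer. Because the rule is pluggable, an application can pick the trade-off between early data fetching and holding containers idle.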
Tez – Customizable core engine: graph reconfiguration
[Figure: Vertex 1 tasks send filtering values to the Input Initializer and Vertex Manager in the App Master, which apply the filter to prune the input partitions of Vertex 2 and reconfigure the vertex through the vertex state machine]
Event Model
Map tasks send data statistics events to the Reduce Vertex Manager.
Vertex Manager
Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism.
Hive – Dynamic Partition Pruning
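The pruning step itself is simple once the filtering values have arrived: keep only the input partitions whose key matches. The sketch below assumes partitions are identified by string keys and that the filter is a set of such keys; `PartitionPruning.prune` is a hypothetical helper, not Hive or Tez code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical pruning step, applied before the consumer vertex's tasks are created.
final class PartitionPruning {
    // Keeps only the partitions whose key appears among the filtering values.
    static List<String> prune(List<String> partitionKeys, Set<String> filterValues) {
        List<String> kept = new ArrayList<String>();
        for (String key : partitionKeys) {
            if (filterValues.contains(key)) {
                kept.add(key);
            }
        }
        return kept;
    }
}
```

Fewer surviving partitions means fewer input splits, so the vertex can be reconfigured with fewer tasks before any of them launch.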
Tez – Engineering optimizations
•Container re-use
•Support for user sessions
•Event-based control flow
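To see why container re-use matters, the toy simulation below counts how many containers have to be launched to run a batch of tasks with a fixed number of slots. It is a sketch under simplifying assumptions (tasks run in equal-length waves, and `ContainerReuseSim` is a hypothetical name), not a model of the actual Tez scheduler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy simulation: counts container launches with and without re-use.
final class ContainerReuseSim {
    static int containerLaunches(int tasks, int slots, boolean reuse) {
        if (tasks < 0 || slots < 1) {
            throw new IllegalArgumentException("tasks >= 0 and slots >= 1 required");
        }
        Deque<Integer> idle = new ArrayDeque<Integer>(); // ids of containers free for re-use
        int launched = 0;
        int remaining = tasks;
        while (remaining > 0) {
            int wave = Math.min(remaining, slots); // tasks that run concurrently
            List<Integer> busy = new ArrayList<Integer>();
            for (int i = 0; i < wave; i++) {
                Integer container = reuse ? idle.poll() : null;
                if (container == null) {
                    container = launched; // no idle container, so launch a new one
                    launched++;
                }
                busy.add(container);
            }
            if (reuse) {
                idle.addAll(busy); // finished tasks hand their containers back
            }
            remaining -= wave;
        }
        return launched;
    }
}
```

For 10 tasks and 4 slots the simulation launches 4 containers with re-use and 10 without, so the launch overhead no longer grows with the number of tasks.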
Tez – Developer tools – Local Mode
• Fast prototyping – no Hadoop setup required
• Quick turnaround in unit testing – no overheads for allocating resources or launching JVMs
• Easy debuggability – Single JVM
• Scheduling / RPC invocations skipped
Tez – Developer Tools - Tez UI
• View Status and progress of DAG/Vertex
• Diagnostics on failure
• View counters for DAG/Vertex
• View and compare counters across tasks/attempts
• View app specific information
Tez – Job Analysis tools - Swimlanes
• “$TEZ_HOME/tez-tools/swimlanes/yarn-swimlanes.sh <app_id>”
Tez – Job Analysis tools – Shuffle performance
• View shuffle performance between nodes
Tez – Hybrid Execution
• Run “compute where it's most efficient”
• Building on the pluggable design of Tez, different vertices in the DAG can run in different execution environments
• Hive LLAP daemons can run initial scans, map joins etc. while large joins can run in YARN containers
• Best of both worlds and the pattern can be repeated for Apache Phoenix or your MPP database
[Figure: a three-vertex DAG (Vertex 1, Vertex 2, Vertex 3) in which Scan/Filter work runs in MPP daemons and the Join runs in YARN containers]
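The placement rule described on this slide can be written down as a small lookup. The sketch below is only an illustration of the idea: `VertexKind`, `ExecutionEnvironment` and `HybridPlacement` are hypothetical names, and a real planner would decide from data sizes and plan shape rather than a fixed label.

```java
// Hypothetical labels for the kinds of work named on the slide.
enum VertexKind { SCAN, MAP_JOIN, LARGE_JOIN }

// Hypothetical labels for where a vertex's tasks run.
enum ExecutionEnvironment { LLAP_DAEMON, YARN_CONTAINER }

final class HybridPlacement {
    // Scans and map joins go to long-running daemons; large joins go to YARN containers.
    static ExecutionEnvironment placementFor(VertexKind kind) {
        switch (kind) {
            case SCAN:
            case MAP_JOIN:
                return ExecutionEnvironment.LLAP_DAEMON;
            case LARGE_JOIN:
                return ExecutionEnvironment.YARN_CONTAINER;
            default:
                throw new IllegalArgumentException("unknown vertex kind: " + kind);
        }
    }
}
```

Because the choice is made per vertex, one DAG can mix both environments, which is the "best of both worlds" point made above.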
Tez – How can you help?
•Improve core Tez infrastructure
– Apache open source project. Your use cases and code are welcome
•Port DB ideas to Hive+Tez world
– Evolve distributed query optimization and execution
•Use Tez hybrid execution
– Use the Hive-LLAP pattern to get the best of both worlds with your execution environment
•Integrate your project with Tez
– Get benefits similar to Hive, Pig, Cascading, Flink. Takes between 1-6 months depending on the complexity of the target project
Tez – How to contribute
•Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/tez
– Developer list: dev@tez.apache.org
– User list: user@tez.apache.org
– Issues list: issues@tez.apache.org
Tez
Thanks for your time and attention!
Video with Deep Dive on Tez
http://goo.gl/BL67o7
http://www.infoq.com/presentations/apache-tez
Questions?
@bikassaha

Apache Tez - A unifying Framework for Hadoop Data Processing


Editor's Notes

  • #3: TODO: Rohit compile list of current apps out there and 1-2 sentences on what they do for the notes here The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it. The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”. [CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. With predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
  • #14: For anyone who has been working on MapReduce, there is this age-old problem around “how do I figure out the correct number of reducers?”. We guess some number at compile-time and usually that turns out to be incorrect at run-time. Let’s see how we can use the Tez model to fix that. So here is this Map Vertex and this Reduce Vertex, which have these tasks running and you have the Vertex Manager running inside the framework … [CLICK] The Map Tasks can send Data Size Statistics to the Vertex Manager, which can then extrapolate those statistics to figure out “what would be the final size of the data when all of these Maps finish?”. Based on that, it can realize that the data size is actually smaller than expected, and I can actually run two reduce tasks instead of three. [CLICK] The Vertex Manager sends a Set Parallelism command to the framework which changes the routing information in-between these two tasks and also cancels the last task.
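The extrapolation described in this note can be sketched in a few lines. The numbers and names below are assumptions for illustration: `ReducerParallelism`, the linear extrapolation, and the "desired bytes per reducer" target are hypothetical, not the formula Tez itself uses. In this sketch the reducer count is only ever lowered from the configured value, matching the "two instead of three" case in the note.

```java
// Hypothetical sketch of the vertex manager's reducer-count decision.
final class ReducerParallelism {
    // Linearly extrapolates the final map output size from the maps that have reported so far.
    static long estimateTotalBytes(long bytesReported, int mapsReported, int totalMaps) {
        if (mapsReported < 1 || totalMaps < mapsReported || bytesReported < 0) {
            throw new IllegalArgumentException("need 1 <= mapsReported <= totalMaps and bytes >= 0");
        }
        return Math.round((double) bytesReported / mapsReported * totalMaps);
    }

    // Picks enough reducers to stay near the per-reducer target, never more than configured.
    static int chooseReducers(long estimatedTotalBytes, long desiredBytesPerReducer, int configuredReducers) {
        if (desiredBytesPerReducer < 1 || configuredReducers < 1 || estimatedTotalBytes < 0) {
            throw new IllegalArgumentException("invalid sizing parameters");
        }
        // Ceiling division: the number of reducers needed to hold the estimated data.
        long needed = (estimatedTotalBytes + desiredBytesPerReducer - 1) / desiredBytesPerReducer;
        return (int) Math.min((long) configuredReducers, Math.max(1L, needed));
    }
}
```

For example, if 2 of 4 map tasks have reported 500 MB in total, the estimate is 1000 MB; with a 512 MB per-reducer target and 3 configured reducers, `chooseReducers` returns 2, so the third reduce task can be cancelled.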