Graphene – Microsoft SCOPE on Tez
Hitesh Sharma (Principal Software Eng. Manager)
Anupam (Senior Software Engineer)
Agenda
• Overview of SCOPE and Cosmos
• SCOPE Job Manager responsibilities
• Design of Graphene
• Features required in Tez
Cosmos Environment
• A Microsoft-internal platform for building big-data applications
• Available externally as Azure Data Lake Analytics
• Enables customers to transform data of any scale into new business assets easily and at low cost in the cloud
Cosmos: World's Biggest YARN Cluster!
• Single DC > 40K machines
• Multiple DCs
• > 500,000 jobs/day
• ~3 billion containers/day
• High avg. CPU utilization
• Three Nines
• Exabytes in storage
• 100s of PB processed/day
• Exabytes of data moved
SCOPE
• Scripting language for Cosmos
• Influenced by SQL and relational concepts
• Works great with C# and .NET
• Very extensible
• Auto scale
• Naturally parallelizable computation
• Lowers the barrier to writing efficient programs
RawData =
    EXTRACT
        Clicks:int,
        Domain:string
    FROM @"RAWWEBDATA.TSV"
    USING DefaultTextExtractor();

WebData =
    SELECT *,
        Domain.Trim().ToUpper()
            AS NormalizedDomain
    FROM RawData;

OUTPUT WebData
    TO "WEBDATA.TSV"
    USING DefaultTextOutputter();
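For readers who do not know SCOPE, the script above can be read roughly as the single-machine Python sketch below. It is only an analogy: the tab delimiter, the column order, and the names `normalize_domains` and `run` are assumptions made for illustration, and nothing here reflects how SCOPE actually compiles or executes the script.

```python
import csv
import io


def normalize_domains(rows):
    """Mimic the SELECT: keep both columns and add NormalizedDomain."""
    for clicks, domain in rows:
        # Domain.Trim().ToUpper() in the script maps to strip().upper() here.
        yield int(clicks), domain, domain.strip().upper()


def run(src, dst):
    """Read tab-separated rows from src, transform them, write them to dst."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in normalize_domains(reader):
        writer.writerow(row)


if __name__ == "__main__":
    # In-memory stand-ins for RAWWEBDATA.TSV and WEBDATA.TSV.
    src = io.StringIO("3\t example.com \n7\tBing.com\n")
    dst = io.StringIO()
    run(src, dst)
    print(dst.getvalue(), end="")
```

The point of the comparison is that the script author writes only the per-row logic; partitioning the input and running the work in parallel is left to the platform.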
SCOPE platform
[Diagram components: Cosmos Front-End Service, Compiler, Optimizer, Job Manager, Runtime, Engine]
Job scale
• Single job can consume > 1 PB of data
• > 15,000 concurrent tasks (degree of parallelism)
• Thousands of vertices
• DAGs can be very wide, very deep, or both
• Millions of tasks in a job
• Billions of edges
Job Manager
• DAG execution
• Builds execution graph
• Topologically executes the DAG
• Keep track of state of the job/vertices
• Dynamic DAG updates
• Rack level aggregation
• Broadcast tree
• Fault tolerance
• Handle failures and do revocations
• Detect and mitigate outliers
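The DAG execution and fault tolerance duties listed above can be sketched as follows. This is a minimal single-process illustration, not the Job Manager's or Tez's implementation: the function name `execute_dag`, the `run_vertex` callback, and the default of three attempts are all assumptions. A vertex runs once every upstream vertex has finished, a failing vertex is retried a bounded number of times, and the job is terminated when the retries run out.

```python
from collections import deque


def execute_dag(vertices, edges, run_vertex, max_attempts=3):
    """Run vertices in topological order, retrying each failure a few times."""
    pending_inputs = {v: 0 for v in vertices}
    downstream = {v: [] for v in vertices}
    for src, dst in edges:
        pending_inputs[dst] += 1
        downstream[src].append(dst)

    # Vertices with no inputs are ready immediately.
    ready = deque(v for v in vertices if pending_inputs[v] == 0)
    completed = []
    while ready:
        vertex = ready.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                run_vertex(vertex, attempt)
                break
            except Exception as err:
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"vertex {vertex!r} failed {max_attempts} times; "
                        "terminating job"
                    ) from err
        completed.append(vertex)
        # A child becomes ready once all of its inputs have been produced.
        for child in downstream[vertex]:
            pending_inputs[child] -= 1
            if pending_inputs[child] == 0:
                ready.append(child)

    if len(completed) != len(vertices):
        raise ValueError("graph contains a cycle; it is not a DAG")
    return completed


if __name__ == "__main__":
    def flaky(name, attempt):
        # Simulate one transient failure on the middle vertex.
        if name == "select" and attempt == 1:
            raise IOError("simulated machine failure")

    print(execute_dag(
        ["extract", "select", "output"],
        [("extract", "select"), ("select", "output")],
        flaky,
    ))
```

A real job manager runs ready vertices concurrently across the cluster; the sequential loop here only keeps the ordering and retry logic easy to follow.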
Job Manager
• Scheduling
• Keep track of cluster resources
• Distributed scheduler
• Requests bonus or opportunistic containers to increase utilization
• Can upgrade opportunistic containers
• Container reuse
• Opportunistic containers present some interesting choices for reuse
• Tricky to implement
[Figure: parallel containers over time, with the maximum containers allowed marked]
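The editor's notes for this slide say that users give a job some tokens, that each token amounts to 2 cores and 6 GB, and that the Job Manager keeps the job within its allocation. The arithmetic can be sketched as below. The rounding-up rule in `tokens_needed` and the `TokenBudget` class are assumptions for illustration, not the actual Cosmos accounting.

```python
import math

CORES_PER_TOKEN = 2       # from the editor's notes: one token = 2 cores
MEMORY_GB_PER_TOKEN = 6   # ... and 6 GB of memory


def tokens_needed(cores, memory_gb):
    """Smallest whole number of tokens that covers a container request.

    Assumption: a request is charged for whichever resource needs more tokens.
    """
    by_cores = math.ceil(cores / CORES_PER_TOKEN)
    by_memory = math.ceil(memory_gb / MEMORY_GB_PER_TOKEN)
    return max(by_cores, by_memory, 1)


class TokenBudget:
    """Tracks tokens in use so a job never exceeds its allocation."""

    def __init__(self, allocated):
        self.allocated = allocated
        self.in_use = 0

    def try_acquire(self, tokens):
        """Reserve tokens if the allocation allows it; report success."""
        if self.in_use + tokens > self.allocated:
            return False
        self.in_use += tokens
        return True

    def release(self, tokens):
        """Return tokens when a container finishes."""
        self.in_use = max(0, self.in_use - tokens)


if __name__ == "__main__":
    budget = TokenBudget(allocated=2)
    print(tokens_needed(cores=3, memory_gb=6))  # 3 cores need two tokens
    print(budget.try_acquire(1), budget.try_acquire(2))
```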
Job Manager
• Finalization
• Concatenate final outputs
• Metadata operations
• Tooling
• Near real-time feedback
• Finding the critical path
• Structured error reporting
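"Finding the critical path" can be illustrated with a longest-path computation over the job DAG. The sketch below assumes each vertex has a single known duration and ignores scheduling delays; the function name `critical_path` and the sample durations are hypothetical, and the slides do not say how the actual tooling computes it.

```python
from collections import deque


def critical_path(durations, edges):
    """Return (total_time, path) for the longest duration-weighted path."""
    downstream = {v: [] for v in durations}
    indegree = {v: 0 for v in durations}
    for src, dst in edges:
        downstream[src].append(dst)
        indegree[dst] += 1

    finish = dict(durations)            # earliest finish time per vertex
    pred = {v: None for v in durations}  # predecessor on the longest path
    ready = deque(v for v in durations if indegree[v] == 0)
    while ready:
        v = ready.popleft()
        for child in downstream[v]:
            candidate = finish[v] + durations[child]
            if candidate > finish[child]:
                finish[child] = candidate
                pred[child] = v
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Walk back from the vertex that finishes last.
    last = max(finish, key=finish.get)
    path = []
    node = last
    while node is not None:
        path.append(node)
        node = pred[node]
    return finish[last], path[::-1]


if __name__ == "__main__":
    durations = {"a": 2, "b": 5, "c": 1, "d": 3}
    edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    print(critical_path(durations, edges))
```

The traversal is iterative rather than recursive so that very deep DAGs, which the deck mentions, do not hit Python's recursion limit.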
Current Challenges
• Higher cost of ownership
• No AM recovery
• Tied to Cosmos infrastructure
• Memory inefficient
• Native support for interactive workload
Status and Roadmap
• Prototype
• Run benchmarks like TPC-X, TeraSort, etc.
• Offline flighting of customer jobs
• First stage of production deployment
[Timeline labels on the slide: Late 2017, Early 2018, Mid 2018, Late 2018]
Design and Implementation of Graphene
Guiding Principles
• Minimal changes in SCOPE stack
• Work with community
• Use Tez extensibility
• Maintain compatibility
Graphene – Integration Points
• Algebra: Consume output of compilation to generate DAG
• Engine: Launch and communicate with ScopeEngine
• Tooling: Produce status, debugging, and error details for existing tooling
• Store: Interact with storage layer
Graphene – Application Master
[Diagram of the GRAPHENE AM. Labeled elements: GrapheneDAGAppMaster, DAG Converter, Store Client, Input Initializer, Custom Edge and Vertex Mgr, DAGAppMaster, DAGImpl, Algebra, DAG, Task, "Tez Magic". Legend: Tez Component; Uses Tez API; External Component.]
Graphene – Task Execution
[Diagram of a Task Container and an AM Container. Task Container elements: SCOPE Engine, SCOPE Task, SCOPE Processor, SCOPE Input, SCOPE Output, "Tez Magic!". AM Container element: GRAPHENE AM. Connection labels: Launch Container; InputFailedEvent/DataMovementEvent; InputDataInformationEvent or DataMovementEvent; Task Command; Status & Error. Legend: Tez Component; Uses Tez API; External Component.]
Graphene – Tooling Integration
[Diagram of a Task Container and an AM Container. Task Container elements: SCOPE Engine, SCOPE Task. AM Container elements: GRAPHENE AM, "Tez Magic", JobProfiler: EventListener. Connection labels: Periodic Stats and Diag; Statistics & Diag. Outputs: Real Time Stats, Historic Stats, Task Level Stats, Vertex Level Stats. Legend: Tez Component; Uses Tez API; External Component.]
Experience So Far
• Reliability: as expected from production-ready software; no major bugs or reliability issues
• Onboarding: modular and tested code; documentation is an opportunity to contribute
• Community: very responsive; special thanks to Bikas Saha, Kuhu Shukla, Jonathan Eagles
Scaling Tez
• Existing Cosmos workloads can have > 15k parallel tasks
• Acquiring and managing these containers
• Managing communications with these tasks
• Providing real time progress for all the tasks
Scaling Tez
• Optimize AM memory
• Metadata management for large inputs
• Memory pressure under large event throughput
• Large DAGs with > 2000 vertices and > 1 million tasks
• Optimizations for deep DAGs
Integrating with YARN Opportunistic containers
• Mechanism to drive up utilization of cluster
• AM has deep understanding of the capability
• Effectively using opportunistic containers in scheduler
• Harder scheduling choices with container reuse
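To make the scheduling choice concrete, here is one purely illustrative policy. It is not the Cosmos, YARN, or Tez scheduler. The assumption is that ready tasks fill the job's guaranteed capacity first and any overflow is requested as opportunistic containers, which are upgraded in launch order when guaranteed capacity frees up. The function names and the ordering rule are hypothetical.

```python
def split_requests(ready_tasks, guaranteed_free):
    """Split ready tasks into guaranteed and opportunistic requests."""
    free = max(0, guaranteed_free)
    return ready_tasks[:free], ready_tasks[free:]


def upgrade(opportunistic_running, guaranteed_free):
    """Promote the earliest-launched opportunistic containers first.

    Returns (upgraded, still_opportunistic).
    """
    free = max(0, guaranteed_free)
    return opportunistic_running[:free], opportunistic_running[free:]


if __name__ == "__main__":
    guaranteed, opportunistic = split_requests(["t1", "t2", "t3", "t4"], 2)
    print(guaranteed, opportunistic)
    # Later, one guaranteed slot frees up.
    print(upgrade(opportunistic, 1))
```

Container reuse complicates this picture because a freed container could be guaranteed or opportunistic, which is the "harder scheduling choices" the slide points to.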
AM Recovery
• High priority customer ask
• Need to plug Graphene into this AM resiliency
• Deterministic and reliable recovery with dynamic behavior
Conclusion
Microsoft SCOPE analytics running on Apache YARN and Tez!
Our journey has just started.
We invite you to collaborate.
References
• Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications [SIGMOD, 2015]
• SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets [VLDB, 2008]
• Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing [OSDI, 2014]
• Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks [EuroSys, 2007]
• Lessons Learned from Scaling YARN to 40k Machines in a Multi-Tenancy Environment [DataWorks Summit, 2017]


Editor's Notes

  • #2: We are here to talk about how we are looking to power SCOPE with Tez.
  • #3: We will do a quick overview of Cosmos and SCOPE. Then we will talk about the role the Job Manager plays in the system and how we are looking to fix some of the problems that we have by leveraging Tez. Anupam will dig a little deeper into the design of Graphene. He will talk about the challenges in front of us and why we need your help to take Tez to the next level.
  • #4: A Microsoft-internal platform for building big-data applications. Used across Microsoft by Bing, Azure, Windows, and Office for data mining and analysis. Available externally as Azure Data Lake Analytics. Lets users focus on transforming data to gain insights while we focus on operating the platform at lower COGS.
  • #6: SCOPE is the main scripting language for Cosmos. It is targeted at large-scale data analysis: you could run a script over a GB, a TB, or a PB and we handle scaling that. It is a SQL-like language that allows C# and .NET devs to get started easily. On the right is a sample SCOPE script. In this case we read a TSV file, run a select statement on it that adds a new column, and output the result as a new file. Users can easily define their own functions and even implement their own versions of operators like extractors, processors, and outputters. Users write their scripts as if they were going to run on a single machine, and we scale them out on the cluster. This means that the nitty-gritty of dealing with failures and retries is not something a user should worry about.
  • #7: Users submit a SCOPE script from VS using Scope studio plugin. The script goes through Cosmos Job Service and FE where it is compiled by Scope compiler. Compiler produces an AST representation of the script along with the codegen DLLs for user code and other artifacts. Optimizer makes decisions about execution plan, parallelism, and generates an algebra. Job manager, which is us, parses the algebra and starts executing the DAG on the cluster. As part of the execution JM launches Scope engine on the tasks which provides implementations of many standard physical operators. JM gives the Scope engine input paths to read and outputs to produce. Typically outputs of one vertex become input to some other vertex and DAG execution continues. --- SCOPE compiler and optimizer are responsible for generating an efficient execution plan and the runtime.
  • #9: So what are the responsibilities of the Job Manager? DAG execution: The JM is the central coordinating process for all processing vertices within an application. Its primary function is to construct the runtime DAG from the compile-time representation of the DAG and execute it. The JM schedules a DAG vertex onto the cluster nodes when all of its inputs are ready. The JM can also make dynamic updates to the graph, such as adding a pod-level aggregation or building a broadcast tree. Fault tolerance: The Job Manager monitors the progress of all executing vertices. Failing vertices are re-executed a limited number of times, and if there are too many failures the job is terminated. The JM also detects slower tasks in a vertex and re-executes them elsewhere on the cluster.
  • #10: Scheduling: When a task is ready, the JM looks for a machine in the cluster to run it on. The global cluster load information used by each JM is provided through the cooperation of two additional entities in the system: a Resource Monitor (RM) for each cluster and a Process Node (PN) on each server. The RM continuously aggregates load information from PNs across the cluster, providing a global view of the cluster status so that each JM can make informed scheduling decisions. The JM also enforces token limits. Users typically give a job some tokens to run. Each token amounts to 2 cores and 6 GB. The JM ensures that the resources used by the job never exceed the allocated number of tokens.
  • #11: When the job finishes, the JM finalizes the outputs so they become visible to the user. It also supports some custom metadata operations like catalog updates.
  • #14: Go through details. Explains design and implementation decisions for Graphene and how we use Tez
  • #15: 1m Once we decided to implement the SCOPE AM using Tez, we settled on certain ground rules, or guiding principles, for accomplishing this goal. Hitesh already gave us an idea of the scale of Cosmos and SCOPE workloads and how critical they are for Microsoft's business. The compiler, optimizer, execution engine, and tooling will be changed as little as possible, in order to allow for a staged transition. Tez has a very powerful set of APIs that allow any system to plug in. We will use these extensibility points as much as possible. Finally, for features that we feel the need to add to Tez, we will work with the community and make them work generally for all Tez users as much as possible. With these ground rules set, we started porting SCOPE to run on top of Tez.
  • #16: 3m The need to upgrade seamlessly from the current job manager to Graphene implies that Graphene should be a drop-in replacement for the current job manager. As Hitesh showed, doing this at Cosmos scale, while being the backbone of Microsoft's analytics needs, calls for the least possible perturbation. This meant that the SCOPE AM on Tez had to mimic the behavior of the existing job manager. Graphene has 4 unique integration points in the Cosmos SCOPE stack that are not native to Tez. This introduction to our guiding principles and integration points will help in understanding our implementation and the rationale behind our design choices.
  • #17: 5m
  • #18: 6m
  • #19: 7m
  • #20: 8m
  • #21: 10m Bring learnings from Job Manager back to Tez.
  • #22: 12m
  • #23: 13m
  • #24: 15m
  • #25: 16m
  • #26: 16m