Graphene – Microsoft SCOPE on Tez

Hitesh Sharma (Principal Software Eng. Manager)
Anupam (Senior Software Engineer)

Agenda
• Overview of SCOPE and Cosmos
• SCOPE Job Manager responsibilities
• Design of Graphene
• Features required in Tez

Cosmos Environment
• A Microsoft-internal platform for building big-data applications
• Available externally as Azure Data Lake Analytics
• Enables customers to transform data of any scale into new business assets easily and at low cost in the cloud

Cosmos: World’s Biggest YARN Cluster!
• Single DC: > 40K machines
• Multiple DCs
• > 500,000 jobs/day
• ~3 billion containers/day
• High avg. CPU utilization
• Three nines
• Exabytes in storage
• 100s of PB processed/day
• Exabytes of data moved

SCOPE
• Scripting language for Cosmos
• Influenced by SQL and relational concepts
• Works great with C# and .NET
• Very extensible
• Auto scale
• Naturally parallelizable computation
• Lower the barrier to write efficient programs
RawData =
    EXTRACT Clicks:int,
            Domain:string
    FROM @"RAWWEBDATA.TSV"
    USING DefaultTextExtractor();

WebData =
    SELECT *,
           Domain.Trim().ToUpper() AS NormalizedDomain
    FROM RawData;

OUTPUT WebData
TO "WEBDATA.TSV"
USING DefaultTextOutputter();
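
For readers who do not know SCOPE, the script above can be read as roughly the following Python. This is only a sketch of the row-level logic: it assumes the input is tab-separated with exactly two columns (Clicks, Domain), and the function name `normalize_web_data` is made up for illustration. It says nothing about how SCOPE parallelizes the work.

```python
import csv
import io


def normalize_web_data(tsv_text):
    """Read (Clicks, Domain) rows and add a NormalizedDomain column."""
    rows = []
    for clicks, domain in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        rows.append({
            "Clicks": int(clicks),                       # Clicks:int
            "Domain": domain,                            # Domain:string
            "NormalizedDomain": domain.strip().upper(),  # Trim().ToUpper()
        })
    return rows


# Two sample rows standing in for RAWWEBDATA.TSV (made-up data).
raw = "10\t bing.com \n3\tExample.org\n"
web_data = normalize_web_data(raw)
```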

SCOPE platform
[Diagram: Cosmos Front-End Service, Compiler, Optimizer, Runtime Engine, Job Manager]

Job scale
• Single job can consume > 1PB of data
• > 15000 concurrent tasks (degree of parallelism)
• Thousands of vertices
• DAGs can be very wide, very deep, or both
• Millions of tasks in a job
• Billions of edges

Job Manager
• DAG execution
  • Builds execution graph
  • Topologically executes the DAG
  • Keeps track of state of the job/vertices
• Dynamic DAG updates
  • Rack level aggregation
  • Broadcast tree
• Fault tolerance
  • Handle failures and do revocations
  • Detect and mitigate outliers
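
As an illustration of "topologically executes the DAG", here is a minimal sketch using Kahn's algorithm: a vertex becomes ready once all of its upstream vertices have completed. The vertex names and the `topological_order` helper are hypothetical, and this is not the Job Manager's actual implementation, which also handles tasks within a vertex, failures, and dynamic updates.

```python
from collections import deque


def topological_order(vertices, edges):
    """Return vertices so that each one appears after all its upstream vertices."""
    downstream = {v: [] for v in vertices}
    pending_inputs = {v: 0 for v in vertices}
    for src, dst in edges:
        downstream[src].append(dst)
        pending_inputs[dst] += 1

    # Vertices with no upstream dependencies can start immediately.
    ready = deque(v for v in vertices if pending_inputs[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)  # a real job manager would launch the vertex's tasks here
        for nxt in downstream[v]:
            pending_inputs[nxt] -= 1
            if pending_inputs[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(vertices):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


# Hypothetical four-vertex job shaped like a diamond.
vertices = ["extract", "filter", "join", "output"]
edges = [("extract", "filter"), ("extract", "join"),
         ("filter", "output"), ("join", "output")]
order = topological_order(vertices, edges)
```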

Job Manager
• Scheduling
  • Keep track of cluster resources
  • Distributed scheduler
  • Requests bonus or opportunistic containers to increase utilization
  • Can upgrade opportunistic containers
• Container reuse
  • Opportunistic containers present some interesting choices for reuse
  • Tricky to implement
[Figure: number of parallel containers over time, with the max containers allowed marked]
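
The interplay between the container cap and opportunistic containers can be sketched with a toy policy. Everything here is an assumption made for illustration: the function names, the idea of a fixed `max_guaranteed` cap, and the promotion rule. It is not the Cosmos or YARN scheduling algorithm.

```python
def plan_containers(pending_tasks, max_guaranteed, idle_cluster_slots):
    """Split a request into guaranteed and opportunistic containers."""
    guaranteed = min(pending_tasks, max_guaranteed)
    remaining = pending_tasks - guaranteed
    # Beyond the cap, only use capacity that would otherwise sit idle.
    opportunistic = min(remaining, idle_cluster_slots)
    return {
        "guaranteed": guaranteed,
        "opportunistic": opportunistic,
        "waiting": remaining - opportunistic,
    }


def upgrade_after_completion(plan, finished_guaranteed, max_guaranteed):
    """Promote opportunistic containers into guaranteed slots that freed up."""
    guaranteed = plan["guaranteed"] - finished_guaranteed
    promoted = min(plan["opportunistic"], max_guaranteed - guaranteed)
    return {
        "guaranteed": guaranteed + promoted,
        "opportunistic": plan["opportunistic"] - promoted,
        "waiting": plan["waiting"],
    }


# 120 runnable tasks, a cap of 100 guaranteed containers, 15 idle slots.
plan = plan_containers(120, 100, 15)
# Ten guaranteed containers finish, so ten opportunistic ones are upgraded.
upgraded = upgrade_after_completion(plan, 10, 100)
```

Even in this toy form the reuse question shows up: a freed container could be reused for a waiting task or used to upgrade an opportunistic one, and the sketch simply picks the upgrade.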

Job Manager
• Finalization
  • Concatenate final outputs
  • Metadata operations
• Tooling
  • Near real-time feedback
  • Finding the critical path
  • Structured error reporting
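
"Finding the critical path" can be illustrated as a longest-path computation over the DAG, weighted by how long each vertex ran. The sketch below assumes one non-negative duration per vertex and uses made-up vertex names. The actual tooling presumably works from per-task timings, which the slides do not detail.

```python
from collections import deque


def critical_path(durations, edges):
    """Return (total_time, path) for the longest duration-weighted path in a DAG."""
    downstream = {v: [] for v in durations}
    indegree = {v: 0 for v in durations}
    for src, dst in edges:
        downstream[src].append(dst)
        indegree[dst] += 1

    finish = dict(durations)             # finish time if nothing runs upstream
    pred = {v: None for v in durations}  # previous vertex on the longest path
    ready = deque(v for v in durations if indegree[v] == 0)
    while ready:
        v = ready.popleft()              # finish[v] is final at this point
        for nxt in downstream[v]:
            candidate = finish[v] + durations[nxt]
            if candidate > finish[nxt]:
                finish[nxt], pred[nxt] = candidate, v
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    end = max(finish, key=finish.get)
    total = finish[end]
    path = []
    while end is not None:
        path.append(end)
        end = pred[end]
    return total, path[::-1]


# Hypothetical durations in minutes for a diamond-shaped job.
durations = {"extract": 4, "filter": 2, "join": 5, "output": 1}
edges = [("extract", "filter"), ("extract", "join"),
         ("filter", "output"), ("join", "output")]
total, path = critical_path(durations, edges)
```

The loop is iterative rather than recursive so that very deep DAGs do not run into Python's recursion limit.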

Current Challenges
• Higher cost of ownership
• No AM recovery
• Tied to Cosmos infrastructure
• Memory inefficient
• Native support for interactive workload

Status and Roadmap
Stages, in order:
• Prototype
• Run benchmarks like TPC-X, TeraSort, etc.
• Offline flighting of customer jobs
• First stage of production deployment
Timeline markers: Late 2017, Early 2018, Mid 2018, Late 2018

Design and Implementation of Graphene

Guiding Principles
• Minimal changes in SCOPE stack
• Work with community
• Use Tez extensibility
• Maintain compatibility

Graphene – Integration Points
• Algebra: consume output of compilation to generate DAG
• Engine: launch and communicate with ScopeEngine
• Tooling: produce status, debugging, and error details for existing tooling
• Store: interact with storage layer

Graphene – Application Master
[Diagram. Boxes shown: GRAPHENE AM, GrapheneDAGAppMaster, DAG Converter, Algebra, DAG, Store Client, Input Initializer, DAGAppMaster, DAGImpl, Custom Edge and Vertex Mgr, Task, "Tez Magic". Legend: Tez Component, Uses Tez API, External Component.]

Graphene – Task Execution
[Diagram. Boxes shown: AM Container (GRAPHENE AM), Task Container (SCOPE Task, SCOPE Input, SCOPE Processor, SCOPE Output, "Tez Magic!"), SCOPE Engine. Arrow labels: Launch Container; InputFailedEvent/DataMovementEvent; InputDataInformationEvent or DataMovementEvent; Task Command; Status & Error. Legend: Tez Component, Uses Tez API, External Component.]

Graphene – Tooling Integration
[Diagram. Boxes shown: Task Container (SCOPE Task, "Tez Magic"), SCOPE Engine, AM Container (GRAPHENE AM, JobProfiler: EventListener), Real Time Stats, Historic Stats, Task Level Stats, Vertex Level Stats. Arrow labels: Periodic Stats and Diag; Statistics & Diag. Legend: Tez Component, Uses Tez API, External Component.]

Experience So Far
• Reliability
  • As expected from production-ready software
  • No major bugs or reliability issues
• Onboarding
  • Modular and tested code
  • Documentation: opportunity to contribute
• Community
  • Very responsive
  • Special thanks to Bikas Saha, Kuhu Shukla, Jonathan Eagles

Scaling Tez
• Existing Cosmos workloads can have > 15k parallel tasks
  • Acquiring and managing these containers
  • Managing communications with these tasks
  • Providing real time progress for all the tasks

Scaling Tez
• Optimize AM memory
• Metadata management for large inputs
• Memory pressure under large event throughput
• Large DAGs with > 2000 vertices and > 1 million tasks
• Optimizations for deep DAGs

Integrating with YARN Opportunistic Containers
• Mechanism to drive up utilization of cluster
• AM has deep understanding of the capability
• Effectively using opportunistic containers in scheduler
• Harder scheduling choices with container reuse

AM Recovery
• High priority customer ask
• Need to plug Graphene into AM resiliency
• Deterministic and reliable recovery with dynamic behavior
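
One common way to make recovery deterministic is to rebuild state by replaying an ordered event log, so that a restarted AM always reaches the same view of the job. The sketch below shows that general idea only. The event names and helper functions are hypothetical, and it does not describe how Tez or Graphene actually implements recovery.

```python
def recover_state(event_log):
    """Rebuild task states by replaying logged events in their original order."""
    state = {}
    for task, event in event_log:
        if event == "started":
            state[task] = "running"
        elif event == "succeeded":
            state[task] = "done"
        elif event == "failed":
            state[task] = "pending"
    return state


def tasks_to_rerun(state, all_tasks):
    """Anything not known to be done is rescheduled by the new AM attempt."""
    return sorted(t for t in all_tasks if state.get(t) != "done")


# Hypothetical log written before the AM crashed.
log = [("t1", "started"), ("t1", "succeeded"),
       ("t2", "started"),
       ("t3", "started"), ("t3", "failed")]
state = recover_state(log)
rerun = tasks_to_rerun(state, ["t1", "t2", "t3", "t4"])
```

This sketch reruns tasks that were still running at crash time, which is a simplification. Dynamic DAG updates make the real problem harder, because the log must also capture every change to the graph itself.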

Conclusion
Microsoft SCOPE analytics running on Apache YARN and Tez!
Our journey has just started.
We invite you to collaborate.

