Oozie sweet

Oozie: A Workflow Scheduling For Hadoop

Mohammad Islam

What is Hadoop?

•  A framework for very large scale data
processing in distributed environment
•  Main Idea came from Google
•  Implemented at Yahoo and open sourced
to Apache.
•  Hadoop is free!

What is Hadoop? Contd.

•  Main components:
–  HDFS
–  Map-Reduce
•  Highly scalable, fault-tolerant system
•  Built on commodity hardware.

Yet Another WF System?
•  Workflow management is a matured field.
•  A lot of WF systems are already available
•  Existing WF system could be hacked to use
for Hadoop
•  But no WF system built-for Hadoop.
•  Hadoop has some benefits and shortcomings.
•  Oozie was designed only for Hadoop
considering those features

Oozie in Hadoop Eco-System

Oozie

HCatalog
Pig Sqoop Hive
Oozie

Map-Reduce

HDFS

A Workflow Engine
•  Oozie executes workflow defined as DAG of jobs
•  The job type includes: Map-Reduce/Pig/Hive/Any script/
Custom Java Code etc
M/R
streaming
job

M/R
start fork join
job

Pig MORE
decision
job

M/R ENOUGH
job

FS
end Java
job

A Scheduler
•  Oozie executes workflow based on:
–  Time Dependency (Frequency)
–  Data Dependency

Oozie Server
Check
WS API Oozie Data Availability
Coordinator

Oozie
Oozie Workflow
Client Hadoop

Bundle
•  A new abstraction layer on top of Coordinator.
•  Users can define and execute a bunch of coordinator
applications.
•  Bundle is optional

Tomcat
Check
WS API Data Availability
Bundle

Coordinator

Oozie Workflow
Client Hadoop

Oozie Abstraction Layers
Bundle Layer 1

Coord Job Coord Job

Layer 2
Coord Coord Coord Coord
Action Action Action Action

WF Job WF Job WF Job WF Job

Layer 3
M/R PIG M/R PIG
Job Job Job Job

Access to Oozie Service

•  Four different ways:
– Using Oozie CLI client
– Java API
– REST API
– Web Interface (read-only)

Installing Oozie

Step 1: Download the Oozie tarball
curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/
oozie-3.1.3-incubating-distro.tar.gz

Step 2: Unpack the tarball
tar –xzvf <PATH_TO_OOZIE_TAR>

Step 3: Run the setup script
bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip

Step 4: Start oozie
bin/oozie-start.sh

Step 5: Check status of oozie
bin/oozie admin -oozie http://localhost:11000/oozie -status

Running an Example

•  Standalone Map-Reduce job
$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir

•  Using Oozie

MapReduce OK <workflow –app name =..>
Start End <start..>
wordcount
<action>
<map-reduce>
ERROR ……
……
</workflow>
Kill

Example DAG Workflow.xml

Example Workflow
<action name=’wordcount'>
<map-reduce>
<configuration>
<property>
<name>mapred.mapper.class</name> mapred.mapper.class =
<value>org.myorg.WordCount.Map</value> org.myorg.WordCount.Map
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.myorg.WordCount.Reduce</value> mapred.reducer.class =
</property> org.myorg.WordCount.Reduce
<property>
<name>mapred.input.dir</name>
<value>usr/joe/inputDir </value> mapred.input.dir = inputDir
</property>
<property>
<name>mapred.output.dir</name>
<value>/usr/joe/outputDir</value> mapred.output.dir = outputDir
</property>
</configuration>
</map-reduce>
</action>

A Workflow Application
Three components required for a Workflow:

1)  Workflow.xml:
Contains job definition

2) Libraries:
optional ‘lib/’ directory contains .jar/.so files

3) Properties file:
•  Parameterization of Workflow xml
•  Mandatory property is oozie.wf.application.path

Workflow Submission
Deploy Workflow to HDFS

$ hadoop fs –put wf_job hdfs://bar.com:9000/usr/abc/wf_job

Run Workflow Job

$ oozie job –run -config job.properties -oozie http://localhost:11000/oozie/
Workflow ID: 00123-123456-oozie-wrkf-W

Check Workflow Job Status

$ oozie job –info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/
oozie/

Oozie Web Console

18

Oozie Web Console: Job Details

19

Oozie Web Console: Action
Details

20

Hadoop Job Details

21

Use Case : Time Triggers
•  Execute your workflow every 15 minutes
(CRON)

00:15 00:30 00:45 01:00

22

Run Workflow every 15 mins

23

Use Case: Time and Data
Triggers
•  Execute your workflow every hour, but only run
them when the input data is ready.
Hadoop
Input Data
Exists?

01:00 02:00 03:00 04:00

24

Data Triggers
<coordinator-app name=“coord1” frequency=“${1*HOURS}”…>
<datasets>
<dataset name="logs" frequency=“${1*HOURS}”

initial-instance="2009-01-01T23:59Z">
<uri-template>
hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}
</uri-template>
</dataset>
</datasets>
<input-events>
<data-in name=“inputLogs” dataset="logs">
<instance>${current(0)}</instance>
</data-in>
</input-events>
25

</coordinator-app>

Running an Coordinator Job
Application Deployment
$ hadoop fs –put coord_job hdfs://bar.com:9000/usr/abc/coord_job

Coordinator Job Parameters:
$ cat job.properties
oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job

Job Submission
$ oozie job –run -conﬁg job.properties
job: 000001-20090525161321-oozie-xyz-C

26

Debugging an Coordinator Job
Coordinator Job Information
$ oozie job -info 000001-20090525161321-oozie-xyz-C
Job Name : wordcount-coord
App Path : hdfs://bar.com:9000/usr/abc/coord_job
Status : RUNNING

Coordinator Job Log
$ oozie job –log 000001-20090525161321-oozie-xyz-C

Coordinator Job Deﬁnition
$ oozie job –deﬁnition 000001-20090525161321-oozie-xyz-C

27

Three Questions …
Do you need Oozie?

Q1 : Do you have multiple jobs with
dependency?
Q2 : Does your job start based on time or data
availability?
Q3 : Do you need monitoring and operational
support for your jobs?
If any one of your answers is YES,
then you should consider Oozie!

What Oozie is NOT

•  Oozie is not a resource scheduler

•  Oozie is not for off-grid scheduling
o  Note: Off-grid execution is possible through
SSH action.

•  If you want to submit your job occasionally,
Oozie is NOT a must.
o  Oozie provides REST API based submission.

Oozie in Apache
Main Contributors

Oozie in Apache

•  Y! internal usages:
–  Total number of user : 375
–  Total number of processed jobs ≈ 750K/
month
•  External downloads:
–  2500+ in last year from GitHub
–  A large number of downloads maintained by
3rd party packaging.

Oozie Usages Contd.

•  User Community:
–  Membership
•  Y! internal - 286
•  External – 163
–  No of Messages (approximate)
•  Y! internal – 7/day
•  External – 10+/day

Oozie: Towards a Scalable Workflow
Management System for Hadoop

Mohammad Islam

Key Features and Design
Decisions
•  Multi-tenant
•  Security
–  Authenticate every request
–  Pass appropriate token to Hadoop job
•  Scalability
–  Vertical: Add extra memory/disk
–  Horizontal: Add machines

Oozie Job Processing

Oozie Security

Hadoop
Access
Secure
Job Kerberos
Oozie Server

End
user

Oozie-Hadoop Security

Oozie Security

Hadoop
Access
Secure
Job Kerberos
Oozie Server

End user

Oozie-Hadoop Security

•  Oozie is a multi-tenant system
•  Job can be scheduled to run later
•  Oozie submits/maintains the hadoop jobs
•  Hadoop needs security token for each
request

Question: Who should provide the security
token to hadoop and how?

Oozie-Hadoop Security Contd.

•  Answer: Oozie
•  How?
– Hadoop considers Oozie as a super-user
– Hadoop does not check end-user
credential
– Hadoop only checks the credential of
Oozie process

•  BUT hadoop job is executed as end-user.
•  Oozie utilizes doAs() functionality of Hadoop.

User-Oozie Security

Oozie Security

Hadoop
Access
Secure
Job Kerberos
Oozie Server

End user

Why Oozie Security?

•  One user should not modify another user’s
job
•  Hadoop doesn’t authenticate end–user
•  Oozie has to verify its user before passing
the job to Hadoop

How does Oozie Support Security?

•  Built-in authentication
–  Kerberos
–  Non-secured (default)
•  Design Decision
–  Pluggable authentication
–  Easy to include new type of authentication
–  Yahoo supports 3 types of authentication.

Job Submission to Hadoop

•  Oozie is designed to handle thousands of
jobs at the same time

•  Question : Should Oozie server
–  Submit the hadoop job directly?
–  Wait for it to finish?

•  Answer: No

Job Submission Contd.
•  Reason
–  Resource constraints: A single Oozie process
can’t simultaneously create thousands of thread
for each hadoop job. (Scaling limitation)
–  Isolation: Running user code on Oozie server
might de-stabilize Oozie
•  Design Decision
–  Create a launcher hadoop job
–  Execute the actual user job from the launcher.
–  Wait asynchronously for the job to finish.

Job Submission to Hadoop

Hadoop Cluster
5 Job
Actual
Tracker
M/R Job
Oozie 3
Server 1 4
Launcher
2 Mapper

Job Submission Contd.

•  Advantages
–  Horizontal scalability: If load increases, add
machines into Hadoop cluster
–  Stability: Isolation of user code and system
process
•  Disadvantages
–  Extra map-slot is occupied by each job.

Production Setup

•  Total number of nodes: 42K+
•  Total number of Clusters: 25+
•  Data presented from two clusters
•  Each of them have nearly 4K nodes
•  Total number of users /cluster = 50

Oozie Usage Pattern @ Y!
Distribution of Job Types On Production Clusters
50

45

40

35
Percentage

30

25

20 #1 Cluster
15 #2 Cluster
10

5

0

fs java map-reduce pig
Job type

Experimental Setup

•  Number of nodes: 7
•  Number of map-slots: 28
•  4 Core, RAM: 16 GB
•  64 bit RHEL
•  Oozie Server
–  3 GB RAM
–  Internal Queue size = 10 K
–  # Worker Thread = 300

Job Acceptance
Workflow Acceptance Rate
workflows Accepted/Min

1400
1200
1000
800
600
400
200
0
2 6 10 14 20 40 52 100 120 200 320 640
Number of Submission Threads

Observation: Oozie can accept a large number of jobs

Time Line of a Oozie Job

User Oozie Job Job
submits submits to completes completes
Job Hadoop at Hadoop at Oozie

Time

Preparation Completion
Overhead Overhead

Total Oozie Overhead = Preparation + Completion

Oozie Overhead
Per Action Overhead
Overhead in millisecs

1800
1600
1400
1200
1000
800
600
400
200
0
1 Action 5 Actions 10 Actions 50 Actions
Number of Actions/Workflow

Observation: Oozie overhead is less when multiple
actions are in the same workflow.

Future Work

•  Scalability
•  Hot-Hot/Load balancing service
•  Replace SQL DB with Zookeeper
•  Event-based data processing
•  Asynchronous Data processing
•  User-defined event
•  Extend the benchmarking scope
•  Monitoring WS API

Take Away ..

•  Oozie is
– Apache project
– Scalable, Reliable and multi-tenant
– Deployed in production and growing
fast.

Q&A

Mohammad K Islam
kamrul@yahoo-inc.com
http://incubator.apache.org/oozie/

Coordinator Application Lifecycle
Coordinator Job

0*f 1*f 2*f … … N*f
start end

action … …
create

Action Action Action Action
0 1 2 N

action
start Coordinator Engine
Workflow Engine
A
WF WF WF WF

B C

56

Vertical Scalability

•  Oozie asynchronously processes the user
request.
•  Memory resident
–  An internal queue to store any sub-task.
–  Thread pool to process the sub-tasks.
•  Both items
–  Fixed in size
–  Don’t change due to load variations
–  Size can be configured if needed. Might need
extra memory

Challenges in Scalability

•  Centralized persistence storage
–  Currently supports any SQL DB (such as
Oracle, MySQL, derby etc)
–  Each DBMS has respective limitations
–  Oozie scalability is limited by the underlying
DBMS
–  Using of Zookeeper might be an option

Oozie Workflow Application
•  Contents
–  A workflow.xml file
–  Resource files, config files and Pig scripts
–  All necessary JAR and native library files

•  Parameters
–  The workflow.xml, is parameterized,
parameters can be propagated to map-reduce,
pig & ssh jobs

•  Deployment
–  In a directory in the HDFS of the Hadoop cluster
where the Hadoop & Pig jobs will run

59

Oozie
Running a Workflow Job cmd

Workflow Application Deployment

$ hadoop fs –mkdir hdfs://usr/tucu/wordcount-wf

$ hadoop fs –mkdir hdfs://usr/tucu/wordcount-wf/lib

$ hadoop fs –copyFromLocal workflow.xml wordcount.xml hdfs://usr/tucu/wordcount-wf

$ hadoop fs –copyFromLocal hadoop-examples.jar hdfs://usr/tucu/wordcount-wf/lib

$

Workflow Job Execution

$ oozie run -o http://foo.corp:8080/oozie
-a hdfs://bar.corp:9000/usr/tucu/wordcount-wf  
input=/data/2008/input output=/data/2008/output

Workflow job id [1234567890-wordcount-wf]

$ 

Workflow Job Status

$ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf

Workflow job status [RUNNING]
...

$

60

Big Features (1/ 2)

•  Integration with Hadoop 0.23
•  Event-based data processing (3.3)
•  Asynchronous Data processing (3.3)
•  User-defined event (Next)
•  HCatalog integration (3.3)
–  Non-polling approach

Big Features (2/ 2)

•  DAG AM (Next)
–  WF will run as AM on hadoop cluster
–  Higher scalability
•  Extend coordinator’s scope (3.3)
–  Currently supports only WF
–  Any job type such as pig, hive, MR, java, distcp
can be scheduled without WF.
•  Dataflow-based processing in Hadoop(Next)
–  Create an abstraction to make things easier.

Usability (1/2)

•  Easy adoption
–  Modeling tool (3.3)
–  IDE integration (3.3)
–  Modular Configurations (3.3)
•  Monitoring API (3.3)
•  Improved Oozie application management
(next+)
–  Automate upload the application to hdfs.
–  Application versioning.
–  Seamless upgrade

Usability (2/2)

•  Shell and Distcp Action
•  Improved UI for coordinator
•  Mini-Oozie for CI
–  Like Mini-cluster
•  Support multiple versions (3.3)
–  Pig, Distcp, Hive etc.
•  Allow job notification through JMS (Next ++)
•  Prioritization (Next++)
–  By user, system level.

Reliability

•  Auto-Retry in WF Action level

•  High-Availability
–  Hot-Warm through ZooKeeper (3.3)
–  Hot-Hot with Load Balancing (Next)

Manageability

•  Email action

•  Query Pig Stats/Hadoop Counters
–  Runtime control of Workflow based on stats
–  Application-level control using the stats

REST-API for Hadoop Components

•  Direct access to Hadoop components
–  Emulates the command line through REST
API.
•  Supported Products:
–  Pig
–  Map Reduce

Oozie sweet

More Related Content

Viewers also liked (20)

Similar to Oozie sweet (20)

Recently uploaded (20)

Oozie sweet