Data science lifecycle with Apache Flink
and
Apache Zeppelin (incubating)
Flink Forward
Moon moon@nflabs.com
NFLabs www.nflabs.com
Content
1. Data science lifecycle
2. Zeppelin for data science
3. Zeppelin and Flink
4. Project Roadmap
Data science lifecycle
Data Science: process
https://en.wikipedia.org/wiki/Data_analysis
Data Science: tools
MLlib
Data Science: people
Engineer Data Scientist
DevOps Business
http://aarondavis.design/
Zeppelin for data scientist
Project Timeline
12.2012 Commercial Product for data analysis
10.2013 Open sourced a single feature
08.2014 Started getting adoption
12.2014 ASF Incubation
http://zeppelin.incubator.apache.org
Hadoop Landscape
Cloudera-ML
ML-base
MRQL
Shark
?
12.2012 Commercial Product
10.2013 Zeppelin
08.2014 Zeppelin
10.2014 Third-party Products
11.2014 Apache Incubation Proposal
23.12.2014 Acceptance by Incubator
Current Status
1 Release
68 Contributors worldwide
722 Stars on GH
300/900 Emails at users/dev @i.a.o
Interactive Notebooks
Interactive Visualization
Multiple Backends
Zeppelin & Friends
Z-Manager
ZeppelinHub
…
Collaboration/Sharing
Packaging & Deployment Zeppelin + Full stack on a cloud
Packages Backend Integration
Online Viewer
Deployment
https://github.com/hortonworks-gallery/ambari-zeppelin-service
Deployment
As a Service
Before
Cloudera-ML
ML-base
MRQL
Shark
?
After
Cloudera-ML
ML-base
MRQL
Shark
Content
1. Data science lifecycle
2. Zeppelin for data science
3. Zeppelin and Flink
4. Project Roadmap
Flink integration
Integrated through Interpreter 

Data processing system abstraction in Zeppelin
Interpreter
http://zeppelin.incubator.apache.org/docs/development/writingzeppelininterpreter.html
Writing an Interpreter
public abstract void open();
public abstract void close();
public abstract InterpreterResult interpret(String st, InterpreterContext context);
public abstract void cancel(InterpreterContext context);
public abstract int getProgress(InterpreterContext context);
public abstract List<String> completion(String buf, int cursor);
public abstract FormType getFormType();
public Scheduler getScheduler();
Must have
Good to have
Advanced
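A minimal sketch of what implementing the methods listed above might look like. The `SketchInterpreter` base class is a hypothetical, simplified stand-in: it drops `InterpreterContext`, `FormType` and `Scheduler`, and returns plain strings instead of `InterpreterResult`, so that the example compiles without Zeppelin on the classpath. It is not the real Zeppelin base class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-in for the interpreter base class (assumption).
abstract class SketchInterpreter {
    abstract void open();
    abstract void close();
    abstract String interpret(String st);
    abstract void cancel();
    abstract int getProgress();
    abstract List<String> completion(String buf, int cursor);
}

// Toy interpreter: upper-cases every paragraph it is given.
class UpperCaseInterpreter extends SketchInterpreter {
    private static final String[] KEYWORDS = {"help", "hello"};
    private boolean opened = false;
    private int progress = 0;

    @Override void open() { opened = true; }    // acquire backend resources here
    @Override void close() { opened = false; }  // release them here

    @Override String interpret(String st) {
        if (!opened) {
            throw new IllegalStateException("open() must be called first");
        }
        progress = 0;
        String result = st.toUpperCase(Locale.ROOT);
        progress = 100;
        return result;
    }

    @Override void cancel() { /* nothing long-running to cancel in this toy */ }

    @Override int getProgress() { return progress; }  // 0..100

    @Override List<String> completion(String buf, int cursor) {
        // Offer every keyword that starts with the text before the cursor.
        int end = Math.max(0, Math.min(cursor, buf.length()));
        String prefix = buf.substring(0, end);
        List<String> candidates = new ArrayList<>();
        for (String keyword : KEYWORDS) {
            if (keyword.startsWith(prefix)) {
                candidates.add(keyword);
            }
        }
        return candidates;
    }
}
```

In this sketch only `open`, `close` and `interpret` do real work; the remaining methods show where cancellation, progress reporting and auto-completion would hook in.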
Flink Interpreter
https://github.com/apache/incubator-zeppelin/blob/master/flink/src/main/java/org/apache/zeppelin/flink/FlinkInterpreter.java
Zeppelin Server
Thrift
Flink Interpreter
Interpreter JVM process
FlinkILoop
ExecutionEnvironment
Using interpreter
Configure, bind, use
Using interpreter
Use different interpreters in the same notebook
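One way to picture mixing interpreters in a notebook: each paragraph may start with a `%name` directive (for example `%flink` or `%md`) that selects the interpreter, and paragraphs without a directive go to a default. The `ParagraphDispatcher` class below is a hypothetical illustration of that routing idea, with interpreters modelled as plain functions; it is not Zeppelin's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher: picks an interpreter from a leading "%name" directive.
class ParagraphDispatcher {
    private final Map<String, Function<String, String>> interpreters = new LinkedHashMap<>();
    private final String defaultName;

    ParagraphDispatcher(String defaultName) {
        this.defaultName = defaultName;
    }

    // "Bind" an interpreter to this notebook under the given name.
    void bind(String name, Function<String, String> interpreter) {
        interpreters.put(name, interpreter);
    }

    String run(String paragraph) {
        String name = defaultName;
        String body = paragraph;
        String trimmed = paragraph.trim();
        if (trimmed.startsWith("%")) {
            // The directive ends at the first whitespace character.
            int end = 0;
            while (end < trimmed.length() && !Character.isWhitespace(trimmed.charAt(end))) {
                end++;
            }
            name = trimmed.substring(1, end);
            body = trimmed.substring(end).trim();
        }
        Function<String, String> interpreter = interpreters.get(name);
        if (interpreter == null) {
            throw new IllegalArgumentException("No interpreter bound for: " + name);
        }
        return interpreter.apply(body);
    }
}
```

A notebook would call `bind` once per configured interpreter and then `run` for every paragraph, so a Markdown paragraph and a Flink paragraph can sit next to each other.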
Display System
Zeppelin Server
Flink Interpreter Other Interpreter
Zeppelin webapp
Websocket, REST
Text Html Table Angular
Display System
Select display system through output
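Selecting the display system through output can be sketched as follows: the interpreter's output begins with a directive such as `%html`, `%table` or `%angular`, and anything without a directive is shown as plain text. The `DisplayOutput` class and `DisplayType` enum are hypothetical names used only for this illustration, and the parsing rules are a simplification.

```java
import java.util.Locale;

// The four display types named on the slide.
enum DisplayType { TEXT, HTML, TABLE, ANGULAR }

// Hypothetical holder for parsed interpreter output.
class DisplayOutput {
    final DisplayType type;
    final String body;

    DisplayOutput(DisplayType type, String body) {
        this.type = type;
        this.body = body;
    }

    // Picks the display type from a leading "%type" directive, defaulting to TEXT.
    static DisplayOutput parse(String output) {
        for (DisplayType candidate : DisplayType.values()) {
            String directive = "%" + candidate.name().toLowerCase(Locale.ROOT);
            if (output.startsWith(directive)) {
                int n = directive.length();
                // The directive must be a whole word, not a prefix of a longer one.
                if (output.length() == n || Character.isWhitespace(output.charAt(n))) {
                    return new DisplayOutput(candidate, output.substring(n).trim());
                }
            }
        }
        return new DisplayOutput(DisplayType.TEXT, output);
    }
}
```

For example, an interpreter that prints `%table name\tvalue\nflink\t1` would be rendered by the table display, while the same text without the directive would appear as plain text.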
Built-in scheduler
The built-in scheduler runs your notebook on a cron expression.
Flexible layout
DEMO
Flink Integration
• ZeppelinContext: access to Zeppelin-provided features
  - Dynamic form
  - Angular display system
• Dependency loading
• Auto completion
• Cancel
• Get progress information
Thank you
Q & A
Moon
moon@nflabs.com
NFLabs
www.nflabs.com
http://zeppelin.incubator.apache.org/
Project roadmap
Multi-tenancy
Two approaches
1. Implement authentication and ACLs inside Zeppelin
https://github.com/apache/incubator-zeppelin/pull/53
2. Run Zeppelin on top of Docker



http://github.com/NFLabs/z-manager
Zeppelin for organizations
An Engineer
engineer by http://aarondavis.design/
A Team
An Organization
That’s too many!
What is the problem?
Too much:
Install
Configure
Cluster resources
Solution?
We have containers
+
reverse proxy
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Z Manager PoC
httpd + mod_php
nginx
Linux box
2 days, bash + php :(
Z Manager PoC
Z Manager
http://github.com/NFLabs/z-manager
Apache License 2.0
Containerized deployment per user
Reverse proxy
Single binary
Simple web application
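The combination of one container per user and a reverse proxy can be sketched as a lookup from user to upstream address. Everything in this example is an assumption made for illustration: the `UserProxyRouter` class, the `/u/<user>/...` path scheme and the upstream addresses are hypothetical and are not taken from Z Manager itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical routing table: one Zeppelin container (upstream) per user.
class UserProxyRouter {
    private final Map<String, String> upstreamByUser = new HashMap<>();

    // Record where a user's container is listening, e.g. "http://127.0.0.1:9001".
    void register(String user, String upstream) {
        upstreamByUser.put(user, upstream);
    }

    // Maps "/u/<user>/<rest>" to "<upstream>/<rest>"; empty when no route matches.
    Optional<String> resolve(String path) {
        String[] parts = path.split("/", 4); // "", "u", user, rest
        if (parts.length < 3 || !parts[0].isEmpty() || !"u".equals(parts[1])) {
            return Optional.empty();
        }
        String upstream = upstreamByUser.get(parts[2]);
        if (upstream == null) {
            return Optional.empty();
        }
        String rest = parts.length == 4 ? parts[3] : "";
        return Optional.of(upstream + "/" + rest);
    }
}
```

The point of the design is that users never install or configure anything themselves: the manager starts a container, registers its address, and the proxy forwards each request to the right instance.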
Z Manager
SGA to ASF coming *
Z Manager
Auto-update
Linux box
go + react :)
Z Manager process
Z Manager
Helium
People do similar work with different data
New visualization
Model & Algorithm
Data process pipeline
Package and distribute work
New visualization
Model & Algorithm
Data process pipeline
Pkg
Repo
Helium
https://s.apache.org/helium
Platform for Data Analytics Application on top of Apache Zeppelin
Helium Application = View + Algorithm
Zeppelin provided Resources
Resources
Data
Computing
Any Java object
 - Result of last execution
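The idea of an application as a view plus an algorithm, run against a resource that Zeppelin provides, can be sketched like this. `SketchApplication` and the `Map`-based resource pool are hypothetical stand-ins chosen for illustration; they do not reflect the actual Helium API, which the slides do not show.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical application: an algorithm over a resource, plus a view of its result.
class SketchApplication<I, O> {
    private final Function<I, O> algorithm;
    private final Function<O, String> view;

    SketchApplication(Function<I, O> algorithm, Function<O, String> view) {
        this.algorithm = algorithm;
        this.view = view;
    }

    // Looks up a named resource (any Java object), runs the algorithm, renders the view.
    @SuppressWarnings("unchecked")
    String run(Map<String, Object> resources, String resourceName) {
        Object resource = resources.get(resourceName);
        if (resource == null) {
            throw new IllegalArgumentException("No such resource: " + resourceName);
        }
        I input = (I) resource; // unchecked: the caller must pass a matching type
        return view.apply(algorithm.apply(input));
    }
}
```

Because the algorithm and the view are packaged together and only the resource changes, the same application could be reused by people who do similar work with different data.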