What makes data-driven environments more efficient and how to build a data science toolchain around notebook technologies
Creator of Apache Zeppelin
Co-Founder, CTO
Moon soo Lee
moon@zepl.com
#GDSC 2018
Who am I
A true believer that data science notebook changes how
people collaborate
Creator of Apache Zeppelin
Co-founder
https://github.com/Leemoonsoo
It was 2013, really wanted to have
interactive analytics interface for .
Started an open source project, Zeppelin (http://zeppelin-project.org/), a data science notebook.
Became an Apache project in 2016.
http://zeppelin.apache.org
Iterations
- REPL interface (2012)
- Editor / Result interface (2013)
- Notebook interface (2014)
Pilot to Production in 1 day
Hey, take a look
I need an update every morning!
More notebook consumers than producers
At the same time
Opensource project receiving contributions like
Authentication
Access control
Realized that notebook is a great collaboration tool
Why notebook?
Notebook is
- Interactive
- Flexible
- Visualized
- Inline description
- Contains a story
- Shareable
How to build a collaborative environment with notebook technology
- Data sharing
- Multi-user environment
- Notebook sharing
Notebook Sharing
You, Data scientist, Data engineer, Data Analyst, Marketing, SW engineer, Sales, Executive
You’re using only half of its
potential if not sharing
Github
nbviewer
Zeppelin
Airbnb/knowledge-repo
Commercial services for notebook sharing
Categories: VCS, Open source, Service
Github
● Store notebook in github
● Versioning
● Github provides .ipynb viewer
● Fork / pull request / merge
● Private / Public / Team / Org
● Hard to apply Notebook level ACL
● Not easy for Non-engineers
nbviewer
● Publishing notebook
● Share notebook by sharing link
● Easy to use
● No access control
nbconvert (rendering .ipynb to static HTML) as a web service
Apache Zeppelin
● Share notebook with ACL, Read/Write/Execute
● For Jupyter notebooks, need to convert .ipynb to Zeppelin format on the command line.
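The Read/Write/Execute ACL above can be pictured as a per-notebook permission table. What follows is a minimal Python sketch of that idea, not Zeppelin's actual implementation: the `NotebookAcl` class, its method names, the rule that the owner can do everything, and the user names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"


@dataclass
class NotebookAcl:
    """Hypothetical per-notebook ACL: maps each permission to a set of users."""
    owner: str
    grants: dict = field(default_factory=dict)

    def grant(self, user: str, permission: Permission) -> None:
        # Record that this user holds this permission on the notebook.
        self.grants.setdefault(permission, set()).add(user)

    def is_allowed(self, user: str, permission: Permission) -> bool:
        # Assumption: the owner can always do everything;
        # everyone else needs an explicit grant.
        return user == self.owner or user in self.grants.get(permission, set())


# Illustrative users: an analyst who may only read,
# and an engineer who may read and execute but not edit.
acl = NotebookAcl(owner="moon")
acl.grant("analyst", Permission.READ)
acl.grant("engineer", Permission.READ)
acl.grant("engineer", Permission.EXECUTE)
```

Keeping the three permissions separate is what lets a notebook have many consumers who can read or re-run it without being able to change it.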
Airbnb/knowledge-repo
https://github.com/airbnb/knowledge-repo
● .ipynb, md as a post
● Git repo for version control
● Feeds
● Search
● No access control
Commercial services for notebook sharing
Google Colab
● Share notebook through google drive
● View/Edit/Run ipynb notebook using Colab
● Realtime collaboration
ZEPL
● Notebook level ACL
● View/Edit/Run .ipynb and Zeppelin notebook
● Realtime collaboration
● Import existing notebook from git/s3 storage
www.zepl.com
Data Sharing
DON’Ts
● Email attach
● Direct send
● Share through USB
● ...
Email attach
Local copy in laptop
USB drive
DO’s
● Provide access to the same dataset
● Access control capability
● Horizontal scalability
Data catalog
● Provides location of data, what it means and how to load it
○ e.g.
Dataset | Location | Schema | Note
Activity | s3://service/activity | Date (DateTime), type (INT), action (String) | Type is either RUN or STOP. ….
Images | s3://service/images | 512x256 pixel images | Images are collected from profile photo...
● Catalog needs to be accessible / searchable / annotatable
● Many different ways to build one, depending on team / infra
○ Hive Metastore as a data catalog
○ Cloud infrastructure service (e.g. AWS Glue Data Catalog, Azure Data Catalog)
○ Data catalog / publishing software (e.g. CKAN, DKAN)
○ Custom built on top of RDBMS, NoSQL, indexing engine
○ Build data catalog using Notebook
Build data catalog using Notebook
● Flexible enough to describe data
● Searchable, shareable, annotatable
● Programmatic generation
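As a rough illustration of "programmatic generation", a notebook cell could build the catalog table from a list of entries and offer a simple keyword search. This is a sketch under stated assumptions: `DatasetEntry`, `render_catalog`, and `search` are hypothetical names, and the two entries reuse the Activity and Images examples from the Data catalog slide.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """One row of the catalog (hypothetical structure)."""
    name: str
    location: str
    schema: str
    note: str


def render_catalog(entries):
    """Render catalog entries as a pipe-delimited text table."""
    header = ["Dataset", "Location", "Schema", "Note"]
    rows = [header] + [[e.name, e.location, e.schema, e.note] for e in entries]
    return "\n".join(" | ".join(row) for row in rows)


def search(entries, keyword):
    """Return entries whose name, schema or note mentions the keyword."""
    needle = keyword.lower()
    return [e for e in entries
            if needle in f"{e.name} {e.schema} {e.note}".lower()]


# Example entries taken from the Data catalog slide.
catalog = [
    DatasetEntry("Activity", "s3://service/activity",
                 "Date (DateTime), type (INT), action (String)",
                 "Type is either RUN or STOP."),
    DatasetEntry("Images", "s3://service/images",
                 "512x256 pixel images",
                 "Images are collected from profile photo..."),
]

print(render_catalog(catalog))
```

In practice the entry list would more likely be generated from the storage layer or a metastore than typed by hand; the hand-written list here only keeps the sketch self-contained.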
Multi-user environment
I like my notebook running on my laptop.
No you don’t.
Sign in and Run
Install libraries and
Install notebook and
Configure driver, environments and
Request access to data and
Setup access to notebook repo and
….
Run
JupyterHub (diagram)
● Reverse Proxy: routes /hub to JupyterHub and /user/[name] to a per-user Jupyter server
● Jupyter server (one per user), each with a Kernel (Python, R)
● Authenticator: LDAP, OAuth, etc
● Spawner: Docker, k8s
● Notebook Storage (Filesystem, Git, etc)
Zeppelin (diagram)
● Zeppelin Server with Auth / ACL: LDAP, OAuth, etc
● Notebook Storage (Filesystem, Git, etc)
● Interpreter Manager: runs multiple Interpreters (kernels)
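The /hub and /user/[name] routes in the JupyterHub diagram can be read as a simple path-based dispatch. The sketch below only illustrates that idea in Python and is not JupyterHub's proxy code: the `route` function and the backend labels are hypothetical, and sending unmatched paths to the hub is an assumption.

```python
def route(path: str) -> str:
    """Return the backend a reverse proxy would forward this path to."""
    parts = [p for p in path.split("/") if p]
    if parts and parts[0] == "hub":
        # Login, spawning and administration are handled by the hub.
        return "hub"
    if len(parts) >= 2 and parts[0] == "user":
        # Each user has a dedicated notebook server with its own kernels.
        return f"notebook-server:{parts[1]}"
    # Assumption: anything else falls back to the hub.
    return "hub"


for example in ("/hub/login", "/user/moon/notebooks/report.ipynb", "/"):
    print(example, "->", route(example))
```

In this layout one user's kernels never share a server process with another user's, which is why notebook sharing ends up decoupled from the execution environment. In the Zeppelin layout a single server owns auth, storage and all interpreters.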
Single-user notebook servers behind a reverse proxy
Diagram: Browser, Reverse Proxy, Single user Notebook servers (each with its own Kernel), Notebook Storage
● Easier to implement / manage
● Notebook sharing is decoupled from the execution environment
● Notebook sharing is usually basic or restricted (no notebook level ACL)
● e.g.
○ JupyterHub
○ AWS Sagemaker
Multi-user notebook server
Diagram: Browser, Multi user Notebook server, multiple Kernels, Notebook Storage
● More complex to implement / manage
● Notebook sharing is coupled with the execution environment
● Notebook sharing is usually more advanced and fine grained
● e.g.
○ Apache Zeppelin
○ ZEPL
○ Google Colab
Conclusion
Notebook Share
Data share
Multi-user environment
Collaboration
Thanks


Collaborative environment with data science notebook
