Data Science at Scale with Apache
Spark and Zeppelin Notebook
Carolyn Duby
Big Data Solutions Architect
Hortonworks
About Carolyn Duby
• Big Data Solutions Architect
• High performance data intensive systems
• Data science
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby
• Github: carolynduby
• Hortonworks
– Innovation through data
– Enterprise ready, 100% open source, modern data platforms
– Engineering, Technical Support, Professional Services, Training
https://www.meetup.com/futureofdata-boston
Agenda
• Moving beyond the desktop to distributed
computing
– Apache Spark
• Recording your results and sharing them with
others
– Apache Zeppelin
Are you Outgrowing
your Desktop?
• Analyzing and training with only a portion of the
available data
• Analysis or training too slow
• Out of memory
• Data accumulates over time
How do you collaborate and
record results?
• Show your work
– Effective peer review
– Answer questions more quickly
– Correct errors
– Apply methods to other data
• Increased quality and respect for results
• Justify business decisions
Data Science at Scale with Apache
Open Source
• Apache Spark version 2.1
– Cleaning and analysis of large data sets
– http://spark.apache.org
• Apache Zeppelin Notebook 0.7.0
– Capture and share analysis
– Visualize data for exploration and results
– https://zeppelin.apache.org
Apache SPARK
• Distributed processing efficiently crunches large
data sets
– Optimized
– Horizontally scalable with multi tenancy
– Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS,
Hive, Phoenix, S3, etc
SPARK Libraries
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
– Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
– Machine learning for training models and predicting
• GraphX - http://spark.apache.org/graphx/
– Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
– Complex event processing and data ingest
Zeppelin
• Notebook
– Combine Markdown, shell, Spark, and SQL commands in the
same notebook
– Easily integrate with Spark in different languages
– Visualize data using graphs and pivot charts
– Share notebooks or paragraphs
ARCHITECTURE
(Diagram components)
• Client Browser
• Zeppelin
• Spark Driver
• Spark Application Master (YARN container)
• Spark Executor (YARN container) running tasks, shown three times
Getting Started
• Use a distribution
– Curated set of compatible open source projects
• Sandbox - single node cluster in VM or Azure
– https://hortonworks.com/products/sandbox/
• Hortonworks Community Connection
– http://community.hortonworks.com
• On premise
– Use Apache Ambari to manage on premise physical hardware
• Cloud
– Automated provisioning with Cloudbreak
(https://github.com/hortonworks/cloudbreak)
– AWS, Azure, Google Cloud
Zeppelin Basics
• Notes are composed of paragraphs
• Paragraph contains code or markdown
– Specify interpreter with %<interpreter name>, or leave blank for the
default
– Enter commands
– Click play button to run code on cluster
– Results display in paragraph
• Code and results can be shown or hidden
(Screenshot callouts)
• Create/open note
• Note tools
• Paragraph tools
• User and note configuration
• Markdown interpreter (%md), editor hidden
• Shell interpreter (%sh), editor shown
Markdown
(Screenshot callouts)
• # headers
• %md
• hyperlink
• show/hide editor
• run paragraph
• run all paragraphs
• block quote
Example
• Crimes in Chicago Kaggle Dataset
• Interesting opportunities for time series and prediction
– https://www.kaggle.com/currie32/crimes-in-chicago
Data PIPELINE
• Stages: Acquire, Clean, Explore, Analyze
• Source: Kaggle
• Common Store: raw CSV zip, clean ORC
Optimizing Data Cleaning
• Keep a raw copy
– Web sites go away, remove data, change links and interfaces
• Store the clean data
– Saves time each time you analyze
• Use a standard format (Optimized Row Columnar (ORC), Parquet,
etc.)
– Query data with Hive
• Shared location if security and privacy requirements allow
– Collaborate by sharing data with others
Acquire Dataset
• %sh interpreter
• Bash shell
• Show intermediate
results for debug
CLEAN DATASET
Switching to Spark Scala code
Spark is fast but lazy
• Transformations
– Specify which data to read
– Modify data
• Actions
– Show data
– Write data
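The transformation/action split can be sketched in a few lines of PySpark. This is a minimal illustration, not code from the talk's notebook: it assumes the `spark` session that Zeppelin normally predefines in a `%pyspark` paragraph, and the function name and filter condition are made up for the example.

```python
def count_multiples_of_three(spark, n):
    """Count the multiples of three among the ids 0 .. n-1."""
    # Transformations only describe the work; Spark runs nothing yet.
    numbers = spark.range(n)                   # which data to read
    multiples = numbers.filter("id % 3 = 0")   # how to modify it
    # An action asks for a result, so Spark now executes the whole plan.
    return multiples.count()

# In a Zeppelin %pyspark paragraph (assumes the predefined `spark` session):
#   count_multiples_of_three(spark, 10)
```

Because nothing runs until `count()`, Spark can plan the read and the filter together instead of materializing the intermediate result.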
Header and case data on same CSV line
Apply numeric types
On clean data
Add some columns
to make
aggregations easier
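The notebook code for this step is not reproduced in the slide text, so the following is only a sketch of the idea in PySpark. The column names in `NUMERIC_TYPES`, the helper names, and the choice of month and hour as derived columns are all assumptions for illustration; `df` is any Spark DataFrame whose columns were read as strings.

```python
# Assumed column names and target types; adjust them to the real schema.
NUMERIC_TYPES = {"Latitude": "double", "Longitude": "double", "Year": "int"}

def cast_expressions(columns, numeric_types):
    """Return one SQL expression per column, casting the numeric ones."""
    expressions = []
    for name in columns:
        quoted = "`{0}`".format(name)  # backticks allow spaces in names
        if name in numeric_types:
            expressions.append(
                "CAST({0} AS {1}) AS {0}".format(quoted, numeric_types[name]))
        else:
            expressions.append(quoted)
    return expressions

def apply_numeric_types(df, numeric_types=NUMERIC_TYPES):
    """Return a DataFrame with the listed columns cast to numeric types."""
    return df.selectExpr(*cast_expressions(df.columns, numeric_types))

def add_time_parts(df, timestamp_column):
    """Append month and hour columns derived from a timestamp column."""
    quoted = "`{0}`".format(timestamp_column)
    return df.selectExpr("*",
                         "month({0}) AS month".format(quoted),
                         "hour({0}) AS hour".format(quoted))
```

Precomputing columns such as month and hour once, during cleaning, keeps later GROUP BY queries short.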
Table for SQL
Save clean
data as ORC
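A minimal PySpark sketch of these two callouts, under stated assumptions: the view name `crimes`, the function names, and the overwrite mode are illustrative choices, not taken from the talk's notebook. `df` is the cleaned DataFrame and `spark` is the session Zeppelin normally provides.

```python
def save_clean_data(df, orc_path, view_name="crimes"):
    """Expose the clean data to SQL and persist it as ORC."""
    # Register a temporary view so SQL paragraphs can query the data.
    df.createOrReplaceTempView(view_name)
    # Write the clean copy once; later notebooks can skip the cleaning step.
    df.write.mode("overwrite").orc(orc_path)

def load_clean_data(spark, orc_path, view_name="crimes"):
    """Read the stored ORC data back and register it for SQL again."""
    df = spark.read.orc(orc_path)
    df.createOrReplaceTempView(view_name)
    return df
```

A temporary view lives only as long as the Spark session, which is why the load function registers it again after reading the stored data.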
EXPLORE DATASET
Read clean
data and
create table
Specify query
Select visualization
Configure visualization
X
Y
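The queries behind these screenshots are not in the slide text. As a hedged sketch of the explore step, the helper below builds a count-per-value query and hands the result to Zeppelin for display. The view name `crimes`, the function names, and the use of `z.show` (the ZeppelinContext display call available in `%pyspark` paragraphs) are assumptions for illustration.

```python
def counts_by_query(column, view="crimes"):
    """Build a SQL query counting rows per distinct value of one column."""
    quoted = "`{0}`".format(column)  # backticks allow spaces in names
    return ("SELECT {0}, count(*) AS crimes FROM {1} "
            "GROUP BY {0} ORDER BY {0}").format(quoted, view)

def show_counts(spark, z, column, view="crimes"):
    """Run the query and let Zeppelin render it as a table or chart."""
    z.show(spark.sql(counts_by_query(column, view)))

# In a Zeppelin %pyspark paragraph (assumes predefined `spark` and `z`):
#   show_counts(spark, z, "Year")
```

The same query text can also be pasted into a `%sql` paragraph, where the chart buttons let you pick which column goes on each axis.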
Hover to see
values
ANALYZE DATASET
CREATE DATA TO FIT POISSON
Fit Poisson Model
Evaluate
Model
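The model code itself is not in the slide text, and the talk may well have used a Spark ML model rather than anything like this. As a plain-Python sketch of what fitting and evaluating a Poisson distribution on count data means: the maximum-likelihood rate is the sample mean, and the log-likelihood measures how well a rate explains the observed counts. All function names here are illustrative.

```python
import math

def fit_poisson_rate(counts):
    """Maximum-likelihood Poisson rate: the mean of the observed counts."""
    if not counts:
        raise ValueError("need at least one observation")
    return sum(counts) / float(len(counts))

def poisson_pmf(k, rate):
    """Probability of observing exactly k events at the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def log_likelihood(counts, rate):
    """Sum of log-probabilities of the counts; higher means a better fit."""
    return sum(math.log(poisson_pmf(k, rate)) for k in counts)
```

For evaluation, compare `log_likelihood` at the fitted rate against other candidate rates, or compare `poisson_pmf` values with the observed frequency of each count.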
Model Pipelines
• https://spark.apache.org/docs/2.1.0/ml-pipeline.html
TRANSFORMER
(Diagram labels)
• Transformers, Estimator, Pipeline Model
• Training data
• Test data
• Predictions
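The roles in the diagram can be illustrated with a toy, plain-Python sketch. These classes are not Spark's; they only mimic the contract described in the Spark ML pipeline guide: a Transformer maps data to data, an Estimator is fit on training data and yields a model that is itself a Transformer, and fitting a Pipeline yields a Pipeline Model that can be applied to test data. The class names `Scale` and `MeanCenter` are made up for the example.

```python
class Scale(object):
    """Transformer: needs no training data."""
    def __init__(self, factor):
        self.factor = factor
    def transform(self, rows):
        return [x * self.factor for x in rows]

class MeanCenter(object):
    """Estimator: learns the mean from the training data."""
    def fit(self, rows):
        return MeanCenterModel(sum(rows) / float(len(rows)))

class MeanCenterModel(object):
    """Fitted model: a Transformer that subtracts the learned mean."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        return [x - self.mean for x in rows]

class Pipeline(object):
    """Estimator that fits its stages in order on the training data."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the next stage
            fitted.append(stage)
        return PipelineModel(fitted)

class PipelineModel(object):
    """Transformer made of already fitted stages."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

# Fit on training data, then apply the fitted model to test data.
model = Pipeline([Scale(2.0), MeanCenter()]).fit([1.0, 2.0, 3.0])
predictions = model.transform([4.0])
```

Keeping the fitted state in the Pipeline Model means the test data passes through exactly the steps learned from the training data.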
Python – Looks like Scala
R
Tips and Tricks
• Use val for variables used across paragraphs
– Vars can yield unpredictable results when run out of
order
• Break up big notebooks
– Store intermediate results
– Avoid reloading and recalculating the same values
• Verify your notebook by running all paragraphs
Sharing Notebooks
• Share link to notebook or paragraph
– Readers access your Zeppelin server
– Use logins and permissions
• Export to JSON and save to shared file
– Readers get JSON from shared file (github, cloud, etc)
– Import to their Zeppelin server
• Sync your notebooks to Zeppelin Hub (https://www.zepl.com)
– Share Zeppelin Hub link with readers
– Free version for small teams
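Before importing a shared note, a reader may want a quick look at what it contains. The sketch below assumes the exported JSON has a top-level `name` and a `paragraphs` list whose entries keep their source in `text`; treat those key names as an assumption to check against your own export. The function name is illustrative.

```python
import json

def summarize_note(note_json):
    """Return the note name and the interpreter used by each paragraph."""
    note = json.loads(note_json)
    interpreters = []
    for paragraph in note.get("paragraphs", []):
        text = (paragraph.get("text") or "").lstrip()
        if text.startswith("%"):
            # The first token of the paragraph, such as %md or %sh.
            interpreters.append(text.split()[0])
        else:
            interpreters.append("(default)")
    return note.get("name"), interpreters
```

Typical use is `summarize_note(open(path).read())` on a file pulled from the shared location, where `path` is wherever the export was saved.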
Questions and THANK YOU!
REFERENCES
Reproducible Research
• Sandve GK, Nekrutenko A, Taylor J, Hovig E
(2013) Ten Simple Rules for Reproducible
Computational Research. PLoS Comput Biol
9(10): e1003285.
doi:10.1371/journal.pcbi.1003285
– http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003285&type=printable
Zeppelin and Spark
• Spark
– https://dzone.com/articles/try-the-latest-innovations-in-apache-spark-and-apa
– https://hortonworks.com/hadoop-tutorial/learning-spark-zeppelin/
– https://spark.apache.org/docs/2.1.0/ml-pipeline.html
• Example Notebooks
– https://github.com/hortonworks-gallery/zeppelin-notebooks
Zeppelin Interpreters
• Markdown syntax
– http://daringfireball.net/projects/markdown/syntax
Example
• Chicago Crimes Data Set
– https://www.kaggle.com/currie32/crimes-in-chicago
• Example notebooks
– https://github.com/carolynduby/ODSC2017
www.globalbigdataconference.com
Twitter : @bigdataconf


