Data Science at Scale with Apache Spark and Zeppelin Notebook

Carolyn Duby
Big Data Solutions Architect
Hortonworks

About Carolyn Duby
• Big Data Solutions Architect
• High performance data intensive systems
• Data science
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby
• Github: carolynduby
• Hortonworks
– Innovation through data
– Enterprise ready, 100% open source, modern data platforms
– Engineering, Technical Support, Professional Services, Training

https://www.meetup.com/futureofdata-boston

Agenda
• Moving beyond the desktop to distributed
computing
– Apache Spark
• Recording your results and sharing them with
others
– Apache Zeppelin

Are you Outgrowing
your Desktop?
• Analyzing and training with only a portion of the
available data
• Analysis or training too slow
• Out of memory
• Data accumulates over time

How do you collaborate and
record results?
• Show your work
– Effective peer review
– Answer questions more quickly
– Correct errors
– Apply methods to other data
• Increased quality and respect for results
• Justify business decisions

Data Science at Scale with Apache
Open Source
• Apache Spark version 2.1
– Cleaning and analysis of large data sets
– http://spark.apache.org
• Apache Zeppelin Notebook 0.7.0
– Capture and share analysis
– Visualize data for exploration and results
– https://zeppelin.apache.org

Apache SPARK
• Distributed processing efficiently crunches large
data sets
– Optimized
– Horizontally scalable with multi tenancy
– Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS,
Hive, Phoenix, S3, etc

SPARK Libraries
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
– Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
– Machine learning for training models and predicting
• GraphX - http://spark.apache.org/graphx/
– Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
– Complex event processing and data ingest

Zeppelin
• Notebook
– Combine Markdown, shell, Spark, and SQL commands in the
same notebook
– Easily integrate with Spark in different languages
– Visualize data using graphs and pivot charts
– Share notebooks or paragraphs


ARCHITECTURE
[Diagram: Client Browser; Zeppelin (with the Spark Driver); Spark Application Master in a YARN container; three Spark Executors, each in its own YARN container and each running two Tasks]

Getting Started
• Use a distribution
– Curated set of compatible open source projects
• Sandbox - single node cluster in VM or Azure
– https://hortonworks.com/products/sandbox/
• Hortonworks Community Connection
– http://community.hortonworks.com
• On premise
– Use Apache Ambari to manage on premise physical hardware
• Cloud
– Automated provisioning with Cloudbreak
(https://github.com/hortonworks/cloudbreak)
– AWS, Azure, Google Cloud

Zeppelin Basics
• Notes are composed of paragraphs
• Paragraph contains code or markdown
– Specify interpreter - %<interpreter name> or blank for
default
– Enter commands
– Click play button to run code on cluster
– Results display in paragraph
• Code and results can be shown or hidden

[Screenshot of a Zeppelin note, with callouts: create/open note; note tools; paragraph tools; user and note configuration; Markdown interpreter (%md), editor hidden; shell interpreter (%sh), editor shown]
Markdown
[Screenshot of a %md paragraph, with callouts: # headers; hyperlink; block quote; show/hide editor; run paragraph; run all paragraphs]

Example
• Crimes in Chicago Kaggle Dataset
– https://www.kaggle.com/currie32/crimes-in-chicago
• Interesting opportunities for time series and prediction

Data PIPELINE
[Diagram: pipeline stages Acquire, Clean, Explore, Analyze. Data is acquired from Kaggle; the Common Store holds the raw CSV zip and the clean ORC]

Optimizing Data Cleaning
• Keep a raw copy
– Web sites go away, remove data, change links and interfaces
• Store the clean data
– Saves time each time you analyze
• Use a standard format (Optimized Row Columnar (ORC), Parquet,
etc.)
– Query data with Hive
• Shared location if security and privacy requirements allow
– Collaborate by sharing data with others

Acquire Dataset

• %sh interpreter
• Bash shell
• Show intermediate
results for debug
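
The slide's paragraph uses the %sh interpreter. As a rough Python equivalent, the sketch below unpacks an already downloaded archive into a raw-data directory and prints what it extracted, so intermediate results are visible for debugging. The function name `extract_raw` and both paths are hypothetical; the download itself is not shown.

```python
import zipfile
from pathlib import Path


def extract_raw(archive_path, raw_dir):
    """Unpack a downloaded archive into raw_dir and report what was extracted.

    The archive itself is left untouched, so a raw copy always survives.
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(raw_dir)
        names = archive.namelist()
    # Show intermediate results: one line per extracted entry with its size.
    for name in names:
        path = raw_dir / name
        size = path.stat().st_size if path.is_file() else 0
        print("{}\t{} bytes".format(name, size))
    return names
```

A call such as `extract_raw("crimes-in-chicago.zip", "raw/")` (file names assumed) would list each extracted CSV with its size.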

CLEAN DATASET

Spark is fast but lazy
• Transformations
– Specify which data to read
– Modify data
• Actions
– Show data
– Write data
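
A minimal PySpark sketch of the transformation/action split, assuming a `SparkSession` named `spark` (Zeppelin should provide one in a `%pyspark` paragraph with Spark 2) and assuming the CSV has `Arrest` and `Primary Type` columns. The function name and column names are illustrative, not taken from the talk's notebook.

```python
def arrests_by_type(spark, csv_path):
    """Count arrests per crime type and print the five largest groups."""
    # Reading with header=True may peek at the first line for column names,
    # but the full file is not scanned here.
    crimes = spark.read.csv(csv_path, header=True)

    # Transformations only describe the work; nothing is computed yet.
    arrests = (
        crimes
        .filter("lower(Arrest) = 'true'")
        .groupBy("Primary Type")
        .count()
        .orderBy("count", ascending=False)
    )

    # Action: Spark now plans and runs the whole chain on the cluster.
    arrests.show(5)
    return arrests
```

Because the chain is only a plan until `show` runs, Spark can optimize the filter, grouping, and sort together rather than executing them one at a time.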

[Notebook screenshots, with callouts: header and case data on same CSV line; apply numeric types on clean data; add some columns to make aggregations easier; table for SQL; save clean data as ORC]
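
The cleaning steps named in the callouts could look roughly like the PySpark sketch below. It is not the talk's notebook code: the input columns (`ID`, `Date`, `Primary Type`, `Arrest`), the date pattern, and the output names such as `crime_year` are assumptions about the dataset. Writing ORC from Spark 2.1 may also require a build with Hive support.

```python
def clean_crimes(raw):
    """Return a typed DataFrame with extra columns for aggregation."""
    typed = raw.selectExpr(
        "cast(ID as bigint) as id",
        "`Primary Type` as primary_type",
        "(lower(Arrest) = 'true') as arrest",
        # Assumed date pattern; adjust it to match the real file.
        "cast(unix_timestamp(`Date`, 'MM/dd/yyyy hh:mm:ss a') as timestamp) as ts",
    )
    # Drop rows that fail to parse, for example stray header text.
    parsed = typed.filter("id is not null and ts is not null")
    # Derived columns make later GROUP BY queries simpler.
    return parsed.selectExpr(
        "*",
        "year(ts) as crime_year",
        "month(ts) as crime_month",
        "hour(ts) as crime_hour",
    )


def save_clean(spark, csv_path, orc_path, table_name="crimes"):
    """Clean the raw CSV, register it for SQL, and store it as ORC."""
    raw = spark.read.csv(csv_path, header=True)
    clean = clean_crimes(raw)
    clean.createOrReplaceTempView(table_name)    # table for SQL paragraphs
    clean.write.mode("overwrite").orc(orc_path)  # save clean data as ORC
    return clean
```

Renaming `Primary Type` to `primary_type` avoids spaces in stored column names, which some file formats and SQL queries handle poorly.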

EXPLORE DATASET

[Notebook screenshots, with callouts: read clean data and create table; specify query; select visualization; configure visualization (X and Y)]
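
As a hedged PySpark sketch of the explore step: read the stored ORC data, register it as a table, and run a query whose result can be charted. The table name `crimes` and the column `crime_year` are hypothetical; they stand for whatever the cleaning step actually produced.

```python
def load_clean(spark, orc_path, table_name="crimes"):
    """Read the clean ORC data and expose it to SQL under table_name."""
    clean = spark.read.orc(orc_path)
    clean.createOrReplaceTempView(table_name)
    return clean


def crimes_per_year(spark, table_name="crimes"):
    """Return one row per year with the number of recorded crimes."""
    query = (
        "SELECT crime_year, count(*) AS crimes "
        "FROM {} GROUP BY crime_year ORDER BY crime_year"
    ).format(table_name)
    return spark.sql(query)
```

In Zeppelin the same query can also be typed directly into a `%sql` paragraph, where the result table offers the chart selection and X/Y configuration the callouts mention.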

ANALYZE DATASET

Model Pipelines
• https://spark.apache.org/docs/2.1.0/ml-pipeline.html
[Diagram: Transformers and an Estimator are combined into a Pipeline; fitting the Pipeline on training data produces a Model, which turns test data into predictions]
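
The pipeline flow can be sketched with the `pyspark.ml` API. The choice of `VectorAssembler` and `LogisticRegression` is an illustration, not the model used in the talk, and the sketch assumes both DataFrames already have numeric feature columns and a numeric `label` column.

```python
def train_and_predict(train, test, feature_cols, label_col="label"):
    """Fit a two-stage pipeline on train and return predictions for test."""
    # Imported here so the module loads even where PySpark is absent.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Transformer: packs the feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    # Estimator: learns a model from the assembled features.
    classifier = LogisticRegression(featuresCol="features", labelCol=label_col)

    pipeline = Pipeline(stages=[assembler, classifier])
    model = pipeline.fit(train)   # fitting yields a PipelineModel
    return model.transform(test)  # adds a prediction column to test
```

A typical call first splits one DataFrame, for example `train, test = df.randomSplit([0.8, 0.2], seed=42)`, then passes both parts with a list of feature column names.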

Tips and Tricks
• Use val for variables used across paragraphs
– A var can yield unpredictable results when paragraphs
are run out of order
• Break up big notebooks
– Store intermediate results
– Avoid reloading and recalculating the same values
• Verify your notebook by running all paragraphs

Sharing Notebooks
• Share link to notebook or paragraph
– Readers access your Zeppelin server
– Use logins and permissions
• Export to JSON and save to shared file
– Readers get JSON from shared file (github, cloud, etc)
– Import to their Zeppelin server
• Sync your notebooks to Zeppelin Hub (https://www.zepl.com)
– Share Zeppelin Hub link with readers
– Free version for small teams
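
An exported note is plain JSON, so a reader can inspect it before importing. The sketch below lists the interpreter each paragraph selects, assuming the export has a top-level `paragraphs` list whose entries carry the paragraph source in a `text` field; that layout should be checked against a real export. The function name is hypothetical.

```python
import json


def paragraph_interpreters(note_json):
    """Return the interpreter named at the top of each paragraph.

    Paragraphs without a leading %directive are reported as "default".
    """
    note = json.loads(note_json)
    interpreters = []
    for paragraph in note.get("paragraphs", []):
        text = (paragraph.get("text") or "").lstrip()
        first_token = text.split(None, 1)[0] if text else ""
        if first_token.startswith("%"):
            interpreters.append(first_token[1:])
        else:
            interpreters.append("default")
    return interpreters
```

This gives a quick view of whether a shared note needs interpreters, such as `sh`, that may not be enabled on the reader's Zeppelin server.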

Reproducible Research
• Sandve GK, Nekrutenko A, Taylor J, Hovig E
(2013) Ten Simple Rules for Reproducible
Computational Research. PLoS Comput Biol
9(10): e1003285.
doi:10.1371/journal.pcbi.1003285
– http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003285&type=printable

Zeppelin and Spark
• Spark
– https://dzone.com/articles/try-the-latest-innovations-in-apache-spark-and-apa
– https://hortonworks.com/hadoop-tutorial/learning-spark-zeppelin/
– https://spark.apache.org/docs/2.1.0/ml-pipeline.html
• Example Notebooks
– https://github.com/hortonworks-gallery/zeppelin-notebooks

Zeppelin Interpreters
• Markdown syntax
– http://daringfireball.net/projects/markdown/syntax

Example
• Chicago Crimes Data Set
– https://www.kaggle.com/currie32/crimes-in-chicago
• Example notebooks
– https://github.com/carolynduby/ODSC2017

www.globalbigdataconference.com
Twitter : @bigdataconf
