SPARKLY NOTEBOOK: INTERACTIVE
ANALYSIS AND VISUALIZATION WITH SPARK
FELIX CHEUNG
APRIL 2015
HTTP://WWW.MEETUP.COM/SEATTLE-SPARK-MEETUP/EVENTS/208711962/
SETUP
• Spark on CDH cluster
• Vagrant - 2-nodes - custom provisioning
AGENDA
• IPython + PySpark cluster
• Zeppelin
• Spark’s Streaming k-means
• Lightning
SPARK - 10 SEC INTRODUCTION
• Spark
• Spark SQL + Data Frame + data source
• Spark Streaming
• MLlib
• GraphX
We spend a lot of time looking at data.
REPL
• Read-Eval-Print-Loop
A set of REPLs related to Spark…
$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
15/04/15 11:31:28 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> a.collect.foreach(x => println(x))
1
2
3
4
GOOD
• See results instantly
NOT SO GOOD
• Only OK as an IDE
• No Save / Repeat
• No visualization
NOTEBOOK
IPYTHON/JUPYTER
“IPython” http://ipython.org/
• interactive shell
• browser-based notebook
• 'Kernel'
• great support for visualization library (eg. matplotlib)
• built on pyzmq, tornado
Jupyter
IPython will continue to exist as a Python kernel for Jupyter, but
the notebook and other language-agnostic parts of IPython will
move to new projects under the Jupyter name. IPython 3.0 will
be the last monolithic release of IPython.
IPYTHON NOTEBOOK

NOTEBOOK == BROWSER-BASED REPL
IPython Notebook is a web-based interactive
computational environment for creating IPython
notebooks. An IPython notebook is a JSON
document containing an ordered list of input/output
cells which can contain code, text, mathematics,
plots and rich media.
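To make the "JSON document with an ordered list of cells" concrete, here is a minimal sketch of a hand-built notebook. It assumes the nbformat 4 layout (the slide does not name a format version), and the cell contents are made up for illustration.

```python
import json

# A hand-built notebook, assuming the nbformat 4 layout.
# The cell contents below are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "# Word count\nCount words with PySpark.",
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": "print(1 + 1)",
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": "2\n"}
            ],
        },
    ],
}

# Serialize to the text that would be stored in a .ipynb file.
ipynb_text = json.dumps(notebook, indent=1)

# Reading it back gives the cells in their original order.
cell_types = [cell["cell_type"] for cell in json.loads(ipynb_text)["cells"]]
print(cell_types)
```

Because inputs and outputs live in the same document, a saved notebook can be shared and viewed without re-running it.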
MATPLOTLIB
matplotlib tries to make easy things easy and hard things
possible. You can generate plots, histograms, power
spectra, bar charts, errorcharts, scatterplots, etc, with just a
few lines of code, with familiar MATLAB APIs.
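import matplotlib.pyplot as plt

# y_pos, performance, error and people are assumed to be defined earlier in the notebook
plt.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
plt.yticks(y_pos, people)
plt.xlabel('Performance')
plt.title('How fast do you want to go today?')
plt.show()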
PYSPARK
• Spark on Python; this serves as the Kernel,
integrating with IPython
• Each notebook spins up a new instance of the
Kernel (i.e. PySpark running as the Spark driver, in
whichever deploy mode Spark/PySpark supports)
(All notebook examples are a subset of those in
the Meetup reconstructed here)
Markdown
Spark in Python
Source: http://nbviewer.ipython.org/github/ResearchComputing/scientific_computing_tutorials/blob/master/spark/02_word_count.ipynb
WORD2VEC EXAMPLE
Word2Vec computes distributed vector
representations of words. Distributed vector
representations have been shown to be useful in many
natural language processing applications such as
named entity recognition, disambiguation, parsing,
tagging and machine translation.

https://code.google.com/p/word2vec/
Spark MLlib implements the Skip-gram approach.
With Skip-gram we want to predict a window of
words given a single word.
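As a rough illustration of "predict a window of words given a single word", the sketch below lists the (center, context) training pairs that a Skip-gram model would see. The function name and the window size are assumptions for illustration; this is not MLlib's implementation, which also learns the vectors.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs("the quick brown fox".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger window yields more pairs per word and pulls in looser, more topical context.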
WORD2VEC DATASET
Wikipedia dump http://mattmahoney.net/dc/textdata
grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
then randomly sampled to ~200k lines
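The grep step cuts the single long text8 line into chunks of at most 16 words. A minimal Python sketch of the same idea follows, assuming the pattern is a word followed by up to 15 more words; the sample text is synthetic.

```python
import re

# A word followed by up to 15 further words, i.e. chunks of at most 16 words.
CHUNK = re.compile(r"\w+(?:\W+\w+){0,15}")


def split_into_lines(text):
    """Return successive non-overlapping chunks, like `grep -o` prints them."""
    return [match.group(0) for match in CHUNK.finditer(text)]


# Synthetic stand-in for the corpus: 40 space-separated tokens.
sample = " ".join("w%d" % i for i in range(40))
lines = split_into_lines(sample)
word_counts = [len(line.split()) for line in lines]
print(word_counts)
```

Each chunk then serves as one "sentence" of input for Word2Vec.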
MORE VISUALIZATIONS
• matplotlib: http://matplotlib.org
• Seaborn: http://stanford.edu/~mwaskom/software/seaborn/
• Bokeh: http://bokeh.pydata.org/en/latest/
SETUP
To set up IPython
• Python 2.7.9 (separate from the CentOS default 2.6.6), on all
nodes
• matplotlib, on the host running IPython
To run IPython with the PySpark Kernel, set these in the environment
(Please check out my handy script on github)
• PYSPARK_PYTHON: command to run Python, e.g. “python2.7”
• PYSPARK_DRIVER_PYTHON: command to run IPython
• PYSPARK_DRIVER_PYTHON_OPTS: “notebook --profile”
• PYSPARK_SUBMIT_ARGS: pyspark command line, e.g. --master, --deploy-mode
• YARN_CONF_DIR: if YARN mode
• LD_LIBRARY_PATH: for matplotlib
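A minimal sketch of how these settings could be assembled before launching pyspark. The helper name, the "yarn-client" master and the "pyspark" profile name are hypothetical values chosen for illustration; the author's own script on github may do this differently.

```python
import os


def pyspark_notebook_env(master, profile, python_cmd="python2.7", base_env=None):
    """Return a copy of the environment with the PySpark/IPython settings added.

    master and profile are supplied by the caller; nothing here is taken
    from the talk beyond the variable names themselves.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = python_cmd            # Python used on the worker nodes
    env["PYSPARK_DRIVER_PYTHON"] = "ipython"      # driver runs inside IPython
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook --profile=" + profile
    env["PYSPARK_SUBMIT_ARGS"] = "--master " + master
    return env


# Hypothetical values, for illustration only.
env = pyspark_notebook_env(master="yarn-client", profile="pyspark", base_env={})
for name in sorted(env):
    print(name + "=" + env[name])

# To launch, the dict could be handed to the pyspark launcher, e.g.
# subprocess.call(["pyspark"], env=env)
```

YARN_CONF_DIR and LD_LIBRARY_PATH are left out because their values depend entirely on the cluster layout.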
IPYTHON/JUPYTER KERNELS
• IPython
• IGo
• Bash
• IR
• IHaskell
• IMatlab
• ICSharp
• IScala
• IRuby
• IJulia
.. and more https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
ZEPPELIN
Apache Zeppelin (incubating) is an interactive data analytics environment
for distributed data processing systems. It provides a beautiful interactive
web-based interface, data visualization, a collaborative work
environment and many other nice features to make your data analytics
more fun and enjoyable.
Zeppelin has been incubating since Dec 2014.

https://zeppelin.incubator.apache.org/
• Shell script & calling library package
• Load and process data with Spark
• SQL query powered by Spark SQL - progress & parameterization via dynamic form
• Python & data passing across languages (interpreters)
ZEPPELIN ARCHITECTURE
• Realtime collaboration - enabled by websocket communications
• Frontend: AngularJS
• Backend server: Java
• Interpreters: Java
• Visualization: NVD3
INTERPRETERS
• Spark group
  • Spark (Scala)
  • PySpark
  • Spark SQL
  • Dependency
• Markdownjs
• Shell
• Hive
• Coming: jdbc, Tajo, etc.
CLUSTERING
• Clustering tries to find natural groupings in
data. It puts objects into groups in which
those within a group are more similar to each
other than to those in other groups.
• Unsupervised learning
K-MEANS
• First, given an initial set of k cluster centers,
we find which cluster each data point is
closest to
• Then, we compute the average of each of the
new clusters and use the result to update our
cluster centers
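The two steps above can be sketched in plain Python as follows. This is a single-machine illustration with made-up points and starting centers, not the MLlib implementation.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))


def kmeans_step(points, centers):
    """One assignment pass followed by one update pass."""
    groups = [[] for _ in centers]
    for point in points:
        groups[closest(point, centers)].append(point)
    new_centers = []
    for center, group in zip(centers, groups):
        if not group:  # keep a center that attracted no points
            new_centers.append(center)
            continue
        dim = len(center)
        new_centers.append(
            tuple(sum(p[d] for p in group) / len(group) for d in range(dim))
        )
    return new_centers


def kmeans(points, centers, max_iterations=20):
    """Repeat the step until the centers stop moving or the budget runs out."""
    for _ in range(max_iterations):
        updated = kmeans_step(points, centers)
        if updated == centers:
            break
        centers = updated
    return centers


# Illustrative data: two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

The result depends on the starting centers, which is why the choice of initialization matters.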
K-MEANS|| IN MLLIB
• a parallelized variant of k-means++
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Parameters:
• k is the number of desired clusters.
• maxIterations is the maximum number of iterations to run.
• initializationMode specifies either random initialization or initialization via
k-means||.
• runs is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple
times on a given dataset, the algorithm returns the best clustering result).
• initializationSteps determines the number of steps in the k-means||
algorithm.
• epsilon determines the distance threshold within which we consider k-means
to have converged.
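For intuition about the initialization mode, here is a minimal sketch of sequential k-means++ seeding, the algorithm that k-means|| parallelizes. It is not k-means|| itself, which samples many candidates per round over a small number of rounds (the initializationSteps parameter). The function names and data are assumptions for illustration, and the points are assumed to be distinct and at least k in number.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng):
    """Pick k initial centers.

    The first is chosen uniformly; each later one is chosen with probability
    proportional to its squared distance from the nearest center so far.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Illustrative data with three loose groups.
points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (9.0, 10.0), (20.0, 0.0)]
seeds = kmeanspp_seed(points, k=3, rng=random.Random(0))
print(seeds)
```

Points far from every chosen center are the most likely to be picked next, so the seeds tend to spread across the data.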
CASE STUDY:

K-MEANS - ZEPPELIN
Details on github at: http://bit.ly/1JWOPh8
ANOMALY DETECTION WITH K-MEANS
Using Spark DataFrame, csv data source, to process KDDCup’99 data

Scoring with different k values
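The slide does not spell out the scoring method, so the sketch below rests on a common assumption: a point's anomaly score is its distance to the nearest cluster center, and a clustering's score is the mean of those distances, which can be compared across k values. The centers, points and threshold are made up for illustration.

```python
import math


def anomaly_score(point, centers):
    """Distance from point to its nearest cluster center."""
    return min(
        math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
        for center in centers
    )


def clustering_score(points, centers):
    """Mean distance to the nearest center; lower means tighter clusters."""
    return sum(anomaly_score(p, centers) for p in points) / len(points)


def flag_anomalies(points, centers, threshold):
    """Points whose score exceeds the threshold."""
    return [p for p in points if anomaly_score(p, centers) > threshold]


# Hypothetical trained centers and a few points to score.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (10.0, 9.0), (5.0, 5.0)]
flagged = flag_anomalies(points, centers, threshold=3.0)
print(flagged, clustering_score(points, centers))
```

In practice the threshold would be derived from the score distribution of the training data, for example a high percentile.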
COMING SOON (NOW!)
Realtime updates
Dashboard
OTHER NOTEBOOKS
• Spark-notebook: https://github.com/andypetrella/spark-notebook
• ISpark: https://github.com/tribbloid/ISpark
• Spark Kernel: https://github.com/ibm-et/spark-kernel
• Jove Notebook: https://github.com/jove-sh/jove-notebook
• Beaker: https://github.com/twosigma/beaker-notebook
• Databricks Cloud notebook
PART 2
STREAMING K-MEANS
WHY STREAMING?
• Train - model - predict works well on static
data
• What if data is
  • Coming in streams
  • Changing over time?
STREAMING K-MEANS DESIGN
• Proposed by Dr Jeremy Freeman
STREAMING K-MEANS
• key concept: forgetfulness
  • balances the relative importance of new data versus past history
• half-life
  • time it takes before past data contributes to only one half of the current model
STREAMING K-MEANS
• time unit
  • batches (which have a fixed duration in time), or points
• eliminate dying clusters
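A minimal sketch of how forgetfulness, half-life and the time unit could fit together in the per-cluster update. It follows the usual weighted-average formulation of streaming k-means; the function names and numbers are assumptions for illustration, and the handling of dying clusters is left out.

```python
import math


def decay_from_half_life(half_life):
    """Decay factor a such that a ** half_life == 0.5."""
    return math.exp(math.log(0.5) / half_life)


def update_cluster(center, weight, batch, decay, time_unit="batches"):
    """Blend one cluster's old center with the new points assigned to it.

    center: tuple of floats; weight: effective number of past points;
    batch: list of points (tuples) assigned to this cluster in this batch.
    Returns the new center and the new weight.
    """
    m = len(batch)
    # Forget once per batch, or once for every new point.
    discount = decay if time_unit == "batches" else decay ** m
    old = weight * discount
    total = old + m
    if total == 0:
        return center, 0.0
    sums = [sum(p[d] for p in batch) for d in range(len(center))]
    new_center = tuple((c * old + s) / total for c, s in zip(center, sums))
    return new_center, total


# Half-life of one batch: the past counts for half as much after each batch.
a = decay_from_half_life(1.0)
center, weight = update_cluster((0.0,), 10.0, [(1.0,)] * 10, a)
print(center, weight)
```

With a decay of 1.0 all history is kept and the center becomes a running mean; with a decay of 0.0 only the newest batch counts.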

VISUALIZING

STREAMING K-MEANS - LIGHTNING
LIGHTNING
• Lightning - data visualization server
http://lightning-viz.org
• provides API-based access to reproducible, web-based,
interactive visualizations. It includes a core set
of visualization types, but is built for extendability
and customization. Lightning supports modern
libraries like d3.js and three.js, and is designed for
interactivity over large data sets and continuously
updating data streams.
VISUALIZING STREAMING K-MEANS ON IPYTHON + LIGHTNING
RUNNING LIGHTNING
• API: node.js, Python, Scala
• Extension support for custom charts (e.g. d3.js)
• Requirements:
  • Postgres recommended (SQLite ok)
  • node.js (npm, gulp)
The Freeman Lab at Janelia Research Campus uses Lightning to visualize
large-scale neural recordings from zebrafish, in collaboration with the
Ahrens Lab
SPARK STREAMING K-MEANS
DEMO
Environment
• requires: numpy, scipy, scikit-learn
• IPython/Python requires: lightning-python package
Demo consists of 3 parts:

https://github.com/felixcheung/spark-ml-streaming
• Python driver script, data generator
• Scala job - Spark Streaming & Streaming k-means
• IPython notebook to process result, visualize with Lightning

Originally this was part of the Python driver script - it has
been modified for this talk to run within IPython
CHALLENGES
• Package management
• Version/build conflicts!
YOU CAN RUN THIS TOO!
• Notebooks available at http://bit.ly/1JWOPh8
• Everything is heavily scripted and automated
• Vagrant config for local, virtual environment available at http://bit.ly/1DB3OLw
QUESTIONS?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
