Analytics pipelines with
Jupyter and Spark
Who we are
● NETOPIA
● mobilPay
● mobilPay Wallet
● web2sms
● btko.in
● kartela.ro
● mobilender.mx
Challenges
Three-dimensional problem
● Time: Past events or crystal ball?
● Profile: Who is looking at the data?
● Quantity: How much data is there to look at?
Profile
● Data Scientist
● Data Engineer
Quantity
● Hundreds of MB to a few GB
● Up to a million events/records
vs.
● GB to TB to PB
● Hundreds of millions to billions and beyond
events/records
Also
● Computing vs. Storage
● Vertical vs. Horizontal scalability
● Distributed/ML libraries
● Dependency hell
Time
● Past: Analytics
● NOW
● Future: Forecasting (a.k.a. Prediction)
“Classic” Approach
● Data Engineer
○ Small Data: grep, sed, awk
○ Big Data: Java, Scala, Python, Pig, Hadoop, lately Spark & others
● Data Scientist
○ Small Data: R/RStudio
○ Big Data: No way, José!
New Approach
● Data Engineer and Data Scientist, Small Data and Big Data alike:
○ Notebook technologies: Jupyter (most used), Zeppelin, but also lesser-known ones (Rodeo, Beaker)
Data analysis with
Jupyter, Pandas and Spark
Outline
About the data:
● Set of mobile transactions
● Set (separate) of retail transactions
About the tools: Jupyter, Pandas and Spark
Our experience
Future work
Datasets
● Elements of analysis
○ Mobile transactions: Transactions
○ Retail data: Transactions, Products, Stock data
● We know
○ Mobile transactions: Transaction value, User identifier, Merchant
○ Retail data: Transaction value, Sold products, Merchant
● We don’t know
○ Mobile transactions: What product was bought
○ Retail data: Who the user is
● Size
○ Mobile transactions: Hundreds of thousands of entries
○ Retail data: Hundreds of millions of entries
● Status
○ Mobile transactions: Building prediction models
○ Retail data: Gathering data
Mobile transactions data
Mobile data: Environment
● Data: SQL database, CSV files, pickle files, other input sources
● Jupyter notebooks in Docker container with Anaconda
○ Preprocessing notebooks: diagnostics, cleaning, feature building
○ Analysis and model testing notebooks: Pandas, R (with rpy2), scikit-learn, custom code
● Stages: raw data, models, visualizations
Docker image
… with Anaconda
● Anaconda: Python distribution for data
science, with the conda package manager
● Using docker-compose for
setting up container
parameters
● Many available images
● Our base image:
○ pyspark from Jupyter Docker
Stacks
○ Extended with required libraries
● Libraries are added or
updated with docker build:
○ Self-contained
○ Easy versioning
Jupyter Notebook
(1)
Web application for creating
documents with live code,
explanations and visualizations
● Initially, part of IPython
● Narrative with live code
● Protocol for interactive
exploration
○ Run blocks of code
○ Embedded JS
● Executable documents
○ Code
○ HTML and Markdown
○ Metadata
● Kernels for multiple
languages
○ Python
○ R
○ Scala
○ Bash
● Internal format: JSON
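A minimal sketch of what "Internal format: JSON" means in practice, using only the Python standard library. It assumes the nbformat 4 layout (a top-level dict with `cells`, `metadata`, `nbformat` and `nbformat_minor`); the cell contents and the kernel name are illustrative, not taken from the original notebooks.

```python
import json

# A minimal notebook in the nbformat 4 layout: a dict holding cells and metadata.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        # Illustrative kernel description; any installed kernel could be named here.
        "kernelspec": {
            "name": "python3",
            "display_name": "Python 3",
            "language": "python",
        },
    },
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Transactions overview\n", "Narrative text goes here."],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # the cell has not been run yet
            "metadata": {},
            "outputs": [],
            "source": ["total = sum([10, 20, 30])\n", "total"],
        },
    ],
}

# Serialize the way an .ipynb file is stored, then read it back.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Because outputs and execution counts live in the same JSON file as the code, every run can change the file, which is one reason versioning notebooks is harder than versioning plain scripts.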
Jupyter Notebook
(2)
Web application for creating
documents with live code,
explanations and visualizations
● Plugins and widgets
● Easy to share (formats:
Notebook, PDF, HTML, …)
● Large ecosystem
○ JupyterLab / JupyterHub
○ GitHub visualizations
○ Blog integration
○ Education: teaching, evaluation
○ Microsoft, Google, Bloomberg,
IBM, O'Reilly
○ Executable books
● Versioning is complicated
Pandas
Python library for data
manipulation and analysis
● DataFrame objects
○ Tabular data structures
○ Each column has one data type
● Based on NumPy (fast)
● Processing is (mostly) done in
memory
● Data manipulation:
○ Hierarchical indexing
○ Reshaping, pivoting, grouping
○ String operations
○ Time series operations
● Reading / writing from / to
many formats (CSV, JSON,
HDF5, …)
● Visualization: matplotlib,
Seaborn, Bokeh, …
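The Pandas features listed on this slide can be sketched on a small example. The transactions below are made up, and the column names (`user_id`, `merchant`, `value`, `timestamp`) are assumptions for illustration rather than the schema of the real dataset.

```python
import pandas as pd

# Hypothetical transactions; column names are illustrative.
tx = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3],
        "merchant": ["A", "B", "A", "A", "B"],
        "value": [10.0, 5.0, 7.5, 2.5, 20.0],
        "timestamp": pd.to_datetime(
            ["2016-01-03", "2016-01-20", "2016-01-15", "2016-02-02", "2016-02-10"]
        ),
    }
)

# Each column has exactly one data type.
print(tx.dtypes)

# Grouping: total transaction value per merchant.
per_merchant = tx.groupby("merchant")["value"].sum()

# Pivoting: users as rows, merchants as columns, summed values as cells.
pivot = tx.pivot_table(
    index="user_id", columns="merchant", values="value", aggfunc="sum", fill_value=0
)

# String operations: vectorized methods under the .str accessor.
merchant_lower = tx["merchant"].str.lower()

# Time series operations: monthly totals ("MS" = month start frequency).
monthly = tx.set_index("timestamp")["value"].resample("MS").sum()
print(monthly)
```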
rpy2
Interface between Python and
R
● Translates DataFrames
between Python and R
● From a Python kernel in Jupyter: use the %%R cell magic
● Direct access to R objects
(rpy2.robjects)
Jupyter, Pandas and R
(Screenshot: a notebook combining Python, R with rpy2, and HTML and Markdown)
Mobile data: User retention
Active users:
● Classic: 1+ transactions in a given period
● Rolling: 1+ transactions in a given or
subsequent period
Plots:
● X: period (day, week, month)
● Y (cohort): period or another type of
segment
● By transaction criteria (merchant,
product, etc.)
Results:
● Response to campaigns
● Activity recurrence
(Plot: cohorts on the Y axis, periods on the X axis)
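A sketch of how the two active-user definitions and a cohort table could be computed with Pandas. The data is made up, months are used as periods, and a cohort is taken to be the month of a user's first transaction. The "rolling" rule is implemented under one reading of the slide, stated here as an assumption: a user counts as active in a period if they have already made their first transaction and transact in that period or a later one.

```python
import pandas as pd

# Hypothetical transactions: one row per transaction.
tx = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3, 3],
        "timestamp": pd.to_datetime(
            [
                "2016-01-05", "2016-02-10", "2016-03-01",  # user 1
                "2016-01-20", "2016-03-15",                # user 2
                "2016-02-03", "2016-02-25",                # user 3
            ]
        ),
    }
)

# Period = calendar month; cohort = month of the user's first transaction.
tx["period"] = tx["timestamp"].dt.strftime("%Y-%m")
first_seen = tx.groupby("user_id")["timestamp"].transform("min")
tx["cohort"] = first_seen.dt.strftime("%Y-%m")

# Cohort table: distinct active users per (cohort, period).
cohort_table = (
    tx.groupby(["cohort", "period"])["user_id"].nunique().unstack(fill_value=0)
)

# Classic: 1 if the user has at least one transaction in the period.
active = (pd.crosstab(tx["user_id"], tx["period"]) > 0).astype(int)

# Rolling (assumed reading): active from the first transaction onward,
# for as long as there is a transaction in this or a later period.
started = active.cummax(axis=1)
still_to_come = active.iloc[:, ::-1].cummax(axis=1).iloc[:, ::-1]
rolling = started * still_to_come

classic_per_period = active.sum(axis=0)
rolling_per_period = rolling.sum(axis=0)
print(cohort_table)
print(classic_per_period)
print(rolling_per_period)
```

The rolling count is never lower than the classic one: user 2 skips February but returns in March, so they still count as active in February under the rolling rule.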
Mobile data: Correlations
Features:
● How similar are two features?
Merchants:
● Which merchants have common users?
Products:
● Which products are sold together?
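One possible way to answer "which merchants have common users?" with Pandas is a user-by-merchant indicator matrix multiplied by its own transpose. This is a sketch on made-up data with assumed column names, not necessarily the method used in the original analysis.

```python
import pandas as pd

# Hypothetical transactions; column names are illustrative.
tx = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3, 4],
        "merchant": ["A", "B", "A", "B", "B", "C", "C"],
    }
)

# Indicator matrix: 1 if the user has at least one transaction at the merchant.
visits = (pd.crosstab(tx["user_id"], tx["merchant"]) > 0).astype(int)

# Merchant x merchant matrix: each cell is the number of users the two share.
# The diagonal holds the number of distinct users per merchant.
common_users = visits.T.dot(visits)

# Feature similarity: pairwise Pearson correlation between the indicator columns.
merchant_corr = visits.corr()

print(common_users)
print(merchant_corr.round(2))
```

The same construction would apply to "which products are sold together?" by replacing users with baskets and merchants with products, provided the data records which products belong to the same transaction.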
Mobile data: Clusters
● Group users by behavior
● Identify outliers
● Future: automatic cluster labeling
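A sketch of grouping users by behavior and flagging outliers with scikit-learn. The slides do not say which algorithm was used, so k-means is an assumption here, as are the two features (transactions per month, average value), the number of clusters and the "mean plus two standard deviations" outlier threshold.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-user features: [transactions per month, average value].
features = np.array(
    [
        [2, 10], [3, 12], [2, 11], [3, 9],          # occasional, small amounts
        [20, 100], [22, 95], [21, 105], [19, 98],   # frequent, large amounts
        [3, 60],                                    # unusual: rare but large
    ],
    dtype=float,
)

# Put both features on the same scale so neither dominates the distance.
scaled = StandardScaler().fit_transform(features)

# Group users into two behavioral clusters (k is chosen by hand here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = kmeans.labels_

# Distance of every user to the center of their own cluster.
distances = np.linalg.norm(scaled - kmeans.cluster_centers_[labels], axis=1)

# Illustrative rule: users unusually far from their cluster center are outliers.
threshold = distances.mean() + 2 * distances.std()
outliers = np.where(distances > threshold)[0]

print(labels)
print(outliers)
```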
Retail transactions data
Retail data: Our experience
First try: Out-of-core processing with HDF5
● Data did not fit in memory
● HDF5: format for large data
● Pandas + HDF5, Blaze, Dask, Odo
● Easy to use functions
● Library incompatibilities
● Slow queries, use indexes
● Occasional runtime errors
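The out-of-core idea can be sketched without HDF5. The example below deliberately swaps in chunked CSV reading with plain Pandas, so that it needs no extra libraries: only one chunk is held in memory at a time, and partial aggregates are combined as the file is read. The in-memory string stands in for a large file on disk, and the column names are made up.

```python
import io

import pandas as pd

# Stand-in for a large file on disk; in practice this would be a file path.
csv_data = io.StringIO(
    "merchant,value\n"
    "A,10\n"
    "B,5\n"
    "A,7\n"
    "C,1\n"
    "B,20\n"
    "A,3\n"
)

totals = {}

# With chunksize set, read_csv returns an iterator of small DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(csv_data, chunksize=2):
    partial = chunk.groupby("merchant")["value"].sum()
    # Merge the chunk's partial sums into the running totals.
    for merchant, value in partial.items():
        totals[merchant] = totals.get(merchant, 0) + int(value)

print(totals)
```

This pattern works for aggregates that can be combined from partial results (sums, counts, minima). Operations that need all rows at once, such as exact medians or joins, are where dedicated tools become necessary.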
Retail data: Environment
● Data: CSV files, Apache Parquet, Cassandra, other input sources
● Jupyter notebooks in Docker containers with Spark and Anaconda
○ Preprocessing notebooks: diagnostics, cleaning, feature building
○ Analysis and model testing notebooks
■ Large data: Spark ML + scikit-learn
■ Small (selection) data: Pandas, scikit-learn and R
● Stages: raw data, models, visualizations
● In progress
Spark
Engine for big data processing
● DataFrames
○ Built on top of RDDs
○ Similar to Pandas and R
○ SQL queries
○ Automatic query optimization
through query plan
○ String, date-time and statistics
functions
○ Group by, filters
● Jupyter integration: work in
progress
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Spark
Machine Learning
MLlib and ML
● MLlib
○ Uses RDDs
○ Summaries, correlations,
sampling
○ SVMs, logistic regression,
decision trees, ensembles and
Naive Bayes
○ Clustering
○ Feature transformation
● ML
○ Works with DataFrames
○ Many wrappers for MLlib
○ Pipelines:
■ Transformers, Estimators,
Parameters
■ labelCol, featuresCol,
predictionCol, ...
○ R formulas (y ~ x1 + x2)
Retail data: Our experience
Current: Spark + Docker
● No issues at current size (several GBs)
● Docker Compose for creating master, workers and Jupyter container
(driver)
● ML libraries are easy to work with
● Incomplete Python API for ML (e.g., summaries)
● Documentation needs improvement
● Model diagnostics
○ Some metrics are available
○ Supplement with scikit-learn (example: build ROC curves)
● scikit-learn or R on top of Spark
○ Parallelize parameter search (e.g., grid search)
○ Spark sklearn (github.com/databricks/spark-sklearn): Grid Search
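Supplementing Spark's metrics with scikit-learn can look like the sketch below. It assumes the true labels and predicted scores have already been brought to the driver as NumPy arrays (for example from a small sample of a Spark DataFrame); the numbers are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Points of the ROC curve: false positive rate vs. true positive rate
# at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve as a single summary number.
auc = roc_auc_score(y_true, y_score)

for x, y in zip(fpr, tpr):
    print(f"FPR={x:.2f}  TPR={y:.2f}")
print(f"AUC={auc:.3f}")
```

The `fpr` and `tpr` arrays can be passed directly to a plotting library such as matplotlib to draw the curve inside the notebook.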
Future work
Mobile wallet transactions:
● Data fits in memory
● Use Spark for distributing workload
ERP transactions:
● Some data fits in memory, after processing
● Build a web app for data exploration
● Forecast
○ Sales
○ Inventory requirements
● Try Spark Streaming
http://xkcd.com/1425/
