Better {ML} Together: GraphLab Create + Spark
amanda casari
senior data scientist @ Concur
@amcasari
{brief} intro to Spark
Step 1: Create a Resilient Distributed Dataset (RDD)
–  Contains arbitrary Java or Python objects
Step 2: Perform parallel operations
–  Transformations define a new dataset (DAG)
–  Actions kick off a job on the cluster
Source: Apache Spark
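The split between transformations and actions can be sketched in plain Python. This is a toy stand-in, not Spark's API: the `LazyDataset` class and the `double` helper are hypothetical names made up for illustration. In PySpark the corresponding calls would be along the lines of `sc.parallelize(...)`, `map`, `filter` and `collect`.

```python
# Toy model of lazy transformations and eager actions.
# NOT Spark's API: LazyDataset is a hypothetical class that only mirrors the idea.

class LazyDataset(object):
    def __init__(self, source, steps=()):
        self._source = source        # the original data
        self._steps = tuple(steps)   # recorded transformations (the lineage)

    # Transformations: return a new dataset, no work happens yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    # Action: replay the recorded steps and return concrete results.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = map(fn, items)
            else:
                items = filter(fn, items)
        return list(items)


calls = []  # records every element that double() actually touches

def double(x):
    calls.append(x)
    return x * 2

numbers = LazyDataset([1, 2, 3, 4])
doubled = numbers.map(double).filter(lambda x: x % 4 == 0)
calls_before_action = list(calls)   # still empty: only the plan exists so far
result = doubled.collect()          # the action triggers the actual work
```

Until `collect()` is called, `double` never runs; the chained calls only extend the recorded plan, which is roughly what the DAG on the slide refers to.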
ML in GraphLab Create
•  Recommender system
–  Factorization-based methods
–  Neighborhood-based item similarity
–  Popularity-based methods
•  Classification
–  Deep learning
–  Boosted trees
–  Support vector machine
–  Logistic regression
•  Regression
–  Boosted trees
–  Deep learning
–  Linear regression
•  Text analysis
–  Topic modeling (LDA)
–  Featurization utilities: bm25, tf_idf, remove stop words, etc.
•  Image analysis
–  Deep learning
–  Image load, resize, mean computation
•  Clustering: K-means
•  Graph analytics
•  Nearest neighbors
•  Vowpal Wabbit wrapper
MLlib + GraphX on Apache Spark
•  Classification and regression
–  linear models (SVMs, logistic regression, linear regression)
–  naive Bayes
–  decision trees
–  ensembles of trees (Random Forests and Gradient-Boosted Trees)
–  isotonic regression
•  Collaborative filtering
–  alternating least squares (ALS)
•  Clustering
–  k-means
–  Gaussian mixture
–  power iteration clustering (PIC)
–  latent Dirichlet allocation (LDA)
–  streaming k-means
•  Dimensionality reduction
–  singular value decomposition (SVD)
–  principal component analysis (PCA)
•  Optimization (developer)
–  stochastic gradient descent
–  limited-memory BFGS (L-BFGS)
•  Graph analytics
ML in PySpark > even better
+ scikit-learn
+ pandas
Concur customers (all data, January 2015)
•  Concur Mobile: ~3.5M users
•  TripIt: ~11M users
•  Travel: ~18,750 clients
•  Expense: ~16,500 clients
Build Data Science Products to Help Deliver the Perfect Trip through Personalization {and Intelligent Automation}
Use Case: Identifying Customer Behavior
•  Can unsupervised learning help us learn more about our user behavior…without additional instrumentation?
Business Problem
Automate the process of expense filing by predicting for a user:
1.  the correct currency of a transaction
2.  the correct reimbursement currency for a customer
Data Set
Expense Report Transactions (AKA Everyone’s Favorite Part of Business Travel!)
A typical sprint planning...
Data Science Product Manager: “Just see how things go!”
A typical sprint demo...
Data Science Product Manager: “So it’s ready for production????”
problem formulation > tool-chain construction
Q: How best to move quickly from raw data through the pipeline to:
1.  Show value of work
2.  Communicate results
3.  Move models into production pipeline
…NOTEBOOKS > Jupyter + PySpark + GraphLab Create
•  Use lots of different data sources
•  ETL
•  Clean up your data
•  Run Sophisticated Analytics
•  Discover Insights!
•  Integrate Results into Pipelines
•  Communicate Results and Value
Based on a diagram by Paco Nathan
{0. Initial ETL job with Spark}
1.  Import data to SparkContext
2.  Convert RDD to Spark DataFrame for more exploration
3.  Convert Spark DataFrame to GraphLab Create SFrame
4.  Exploration in GraphLab Create
5.  Case : Clustering Users based on Transaction History
+ Additional steps to improve performance using GLC toolkits
7. Save model for deployment to production
8. Convert SFrame back to RDD
QUICK demo: explore > readyForProd
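For step 5, the idea behind clustering users can be sketched with a tiny k-means (Lloyd's algorithm) in plain Python. This is a stand-in for illustration only, not the GraphLab Create or MLlib toolkit used in the demo. The feature vectors are hypothetical per-user summaries (transactions per month, mean amount), and the initial centroids are simply the first k points so the run is repeatable.

```python
# Toy k-means (Lloyd's algorithm) in plain Python.
# NOT the GraphLab Create or MLlib implementation; the data below is made up.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def mean(points):
    n = float(len(points))
    return tuple(sum(dim) / n for dim in zip(*points))

def kmeans(points, k, max_iterations=20):
    centroids = [tuple(p) for p in points[:k]]  # deterministic init: first k points
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its closest centroid.
        assignments = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean(members) if members else centroids[i])
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, assignments

# Hypothetical users: (transactions per month, mean transaction amount)
users = [(2, 40.0), (3, 45.0), (2, 50.0),
         (20, 400.0), (22, 380.0), (19, 420.0)]
centroids, assignments = kmeans(users, k=2)
```

On this made-up data the light spenders and the heavy spenders end up in separate clusters. In practice features on such different scales would usually be normalized first, which this sketch skips.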
Demo time!
@amcasari
amanda.casari@concur.com
humor from xkcd
Seattle Spark Meetup
Thanks to Emad @Dato!
{Appendix}
my local setup
•  Hardware:
–  MacBook Pro (mid 2014)
–  ~420 GB free disk space
–  16GB RAM
–  2.5GHz with 4 cores
…{Not exactly an enterprise AWS cluster}
•  Software:
–  Spark 1.2.2 / 1.3.1 / 1.4.0 for Hadoop 2.4 {using Spark 1.3.1 for demo}
–  GraphLab Create 0.internal
–  Hadoop 2.5.2
–  Scala 2.11.7
–  Python 2.7.10 on Anaconda 2.3.0
working with glc & spark
•  Get these things working together first!
–  Hadoop 2.4+, YARN {if using Spark built on Hadoop}
–  Spark 1.1+ {integrate with YARN if you plan to use yarn-client}
•  Increased spark.driver.memory to 3g (ref here)
•  Added spark.driver.extraClassPath to spark-defaults.conf (ref here)
•  Install GraphLab Create in a clean Python 2.7+ environment
•  Follow instructions on how to point Spark at your glc jar
•  If your JAVA_HOME is different than your HADOOP_JAVA_HOME, set GRAPHLAB_JAVA_HOME
•  Helpful hints on working with IPython + Spark (including symlink info) here
•  It’s not just you but it might be your configuration…
–  Spark 1.3.1+ + IPython Kernel in yarn-client may be having issues? {we will stick with local for demo}
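The two Spark settings named above could look like this in spark-defaults.conf. A sketch only: the property names and the 3g value come from the slide, but the jar path is a placeholder assumption, not the real location of the GraphLab Create jar on any machine.

```properties
# spark-defaults.conf (fragment)
# Driver memory raised to 3g, as on the slide.
spark.driver.memory           3g
# Placeholder path: point this at wherever your GraphLab Create jar lives.
spark.driver.extraClassPath   /path/to/your/graphlab-create.jar
```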
