Better {ML} Together: GraphLab Create + Spark

1
Better {ML} Together:  
GraphLab Create + Spark
amanda casari
senior data scientist @ Concur
@amcasari

2
{brief} intro to Spark
Step 1: Create Resilient Distributed Dataset (RDD)
–  Contains arbitrary Java or Python objects
Step 2: Perform parallel operations
–  Transformations define a new dataset (DAG)
–  Actions kick off a job on the cluster
Source: Apache Spark
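The split between transformations and actions can be sketched with a toy, pure-Python model. This is not Spark's API and nothing here runs in parallel; `LazyDataset`, `to_cents`, and the sample amounts are hypothetical names and values chosen for illustration. The only point is that transformations just record lineage, while an action is what triggers the work.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source            # the original in-memory data
        self._lineage = tuple(lineage)   # ordered (kind, function) steps: a linear DAG

    # Transformations: describe a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    # Actions: replay the recorded lineage over the source and return a result.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())


calls = []  # records when the mapped function actually runs


def to_cents(amount):
    calls.append(amount)
    return int(round(amount * 100))


amounts = LazyDataset([12.50, 3.99, 40.00])           # made-up transaction amounts
large = amounts.map(to_cents).filter(lambda cents: cents >= 1000)
calls_before_action = len(calls)                      # nothing has executed so far
result = large.collect()                              # the action triggers the work
```

In real Spark the same shape applies: a chain of transformations builds the DAG, and the job is only submitted to the cluster when an action is called.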

3
ML in GraphLab Create
•  Recommender system
–  Factorization-based methods
–  Neighborhood-based item similarity
–  Popularity-based methods
•  Classification
–  Deep learning
–  Boosted trees
–  Support vector machine
–  Logistic regression
•  Regression
–  Boosted trees
–  Deep learning
–  Linear regression
•  Text analysis
–  Topic modeling (LDA)
–  Featurization utilities: bm25, tf_idf, remove stop words, etc.
•  Image analysis
–  Deep learning
–  Image load, resize, mean computation
•  Clustering
–  K-means
•  Graph analytics
•  Nearest neighbors
•  Vowpal Wabbit wrapper
MLlib + GraphX on Apache Spark
•  Classification and regression
–  linear models (SVMs, logistic regression, linear regression)
–  naive Bayes
–  decision trees
–  ensembles of trees (Random Forests and Gradient-Boosted Trees)
–  isotonic regression
•  Collaborative filtering
–  alternating least squares (ALS)
•  Clustering
–  k-means
–  Gaussian mixture
–  power iteration clustering (PIC)
–  latent Dirichlet allocation (LDA)
–  streaming k-means
•  Dimensionality reduction
–  singular value decomposition (SVD)
–  principal component analysis (PCA)
•  Optimization (developer)
–  stochastic gradient descent
–  limited-memory BFGS (L-BFGS)
•  Graph analytics
ML in PySpark > even better
+ scikit-learn
+ pandas

Concur customers.
All data, January 2015
•  Concur Mobile: ~3.5M users
•  TripIt: ~11M users
•  Travel: ~18,750 clients
•  Expense: ~16,500 clients

5
Build Data Science Products
to Help Deliver the Perfect Trip
through Personalization
{and Intelligent Automation}

6
Use Case: Identifying Customer Behavior
•  Can unsupervised learning help us learn more about our user behavior…without additional instrumentation?
Business Problem
Automate the process of expense filing by predicting for a user:
1.  the correct currency of a transaction
2.  the correct reimbursement currency for a customer
Data Set
Expense Report Transactions (AKA Everyone’s Favorite Part of Business Travel!)

A typical sprint planning...
Data Science Product Manager: “Just see how things go!”

A typical sprint demo...
Data Science Product Manager: “So it’s ready for production????”

9
problem formulation > tool-chain construction
Q: How best to move quickly from raw data through the pipeline to:
1.  Show value of work
2.  Communicate results
3.  Move models into production pipeline
…NOTEBOOKS > Jupyter + PySpark + GraphLab Create
Workflow diagram:
•  Use lots of different data sources
•  ETL
•  Clean up your data
•  Run Sophisticated Analytics
•  Discover Insights!
•  Integrate Results into Pipelines
•  Communicate Results and Value
Based on Diagram Source: Paco Nathan

10
{0. Initial ETL job with Spark}
1.  Import data to SparkContext
2.  Convert RDD to a Spark DataFrame for more exploration
3.  Convert Spark DataFrame to GraphLab Create SFrame
4.  Exploration in GraphLab Create
5.  Case: Clustering Users Based on Transaction History
6.  Additional steps to improve performance using GLC toolkits
7.  Save model for deployment to production
8.  Convert SFrame back to RDD
QUICK demo: explore > readyForProd
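The clustering case in the demo steps can be illustrated with a minimal, pure-Python k-means sketch. It stands in for the GraphLab Create clustering toolkit and is not that toolkit's API or implementation. The user IDs, the two features (expense reports per month, mean transaction amount), and all values are made up for illustration; initial centroids are simply the first k points so the result is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical per-user features derived from transaction history:
# (expense reports per month, mean transaction amount).
user_ids = ["u1", "u2", "u3", "u4", "u5", "u6"]
features = [(2.0, 40.0), (3.0, 55.0), (2.5, 45.0),
            (20.0, 400.0), (22.0, 380.0), (19.0, 420.0)]

centroids, assignments = kmeans(features, k=2)
clusters = dict(zip(user_ids, assignments))  # user id -> cluster index
```

On real data the features would normally be scaled first, since the raw amounts would otherwise dominate the distance.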

12
@amcasari
amanda.casari@concur.com
humor from xkcd
Seattle Spark Meetup
Thanks to Emad @Dato!

14
my local setup
•  Hardware:
–  MacBook Pro (mid 2014)
–  ~420 GB free disk space
–  16 GB RAM
–  2.5 GHz with 4 cores
–  …{Not exactly an enterprise AWS cluster}
•  Software:
–  Spark 1.2.2 / 1.3.1 / 1.4.0 for Hadoop 2.4 (using Spark 1.3.1 for demo)
–  GraphLab Create 0.internal
–  Hadoop 2.5.2
–  Scala 2.11.7
–  Python 2.7.10 on Anaconda 2.3.0

15
working with glc & spark
•  Get these things working together first!
–  Hadoop 2.4+, YARN {if using Spark built on Hadoop}
–  Spark 1.1+ {integrate with YARN if you plan to use yarn-client}
•  Increased spark.driver.memory to 3g (ref here)
•  Added spark.driver.extraClassPath to spark-defaults.conf (ref here)
•  Install GraphLab Create in a clean Python 2.7+ environment
•  Follow instructions on how to point Spark at your glc jar
•  If your JAVA_HOME differs from your HADOOP_JAVA_HOME, set GRAPHLAB_JAVA_HOME
•  Helpful hints on working with IPython + Spark (including symlink info) here
•  It’s not just you, but it might be your configuration…
–  Spark 1.3.1+ with the IPython kernel in yarn-client mode may be having issues? {we will stick with local for demo}
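The two Spark settings mentioned above could look roughly like this in spark-defaults.conf. This is a sketch only: the jar path is a placeholder, not the real location or file name of the glc jar, and 3g is simply the value quoted on the slide.

```
# spark-defaults.conf (sketch; the jar path is a placeholder)
spark.driver.memory          3g
spark.driver.extraClassPath  /path/to/your/glc.jar
```

Driver memory is set here rather than in application code because, in client mode, the driver JVM has already started by the time application code runs.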
