®
© 2014 MapR Technologies 1
Machine Learning with Spark
Carol McDonald
Agenda
•  Classification
•  Clustering
•  Collaborative Filtering with Spark
•  Model training
•  Alternating Least Squares
•  The code
Three Categories of Techniques for Machine Learning
•  Classification: identifies the category for an item
•  Collaborative filtering (recommendation): recommends items
•  Clustering: groups similar items
What is classification?
A form of ML that:
•  Identifies which category an item belongs to
•  Uses supervised learning algorithms
•  Requires labeled data
Examples:
•  Spam detection
–  spam/non-spam
•  Credit card fraud detection
–  fraud/non-fraud
•  Sentiment analysis
Building and deploying a classifier model
If it walks/swims/quacks like a duck …
Attributes, Features:
•  If it walks
•  If it swims
•  If it quacks
“When I see a bird that walks like a duck
and swims like a duck and quacks like a
duck, I call that bird a duck.”
classify something based on “if” conditions.
Answer, Label:
•  Duck
•  Not duck
… then it must be a duck
[Diagram: examples sorted into ducks and not ducks by whether each walks, swims, and quacks]
Label:
•  Duck
•  Not duck
Features:
•  walks
•  swims
•  quacks
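The duck test maps onto one labeled example: three boolean features and one label. A minimal sketch in plain Scala, where `DuckTest`, `Bird`, `isDuck`, and `toVector` are illustrative names that do not appear in the deck:

```scala
// Hypothetical names, for illustration only.
object DuckTest {
  // Features: one boolean per observed attribute.
  case class Bird(walks: Boolean, swims: Boolean, quacks: Boolean)

  // Hand-written "if" rule: the label is "duck" only when all three features hold.
  def isDuck(b: Bird): Boolean = b.walks && b.swims && b.quacks

  // The same features encoded as a numeric vector, the form a learner consumes.
  def toVector(b: Bird): Array[Double] =
    Array(b.walks, b.swims, b.quacks).map(f => if (f) 1.0 else 0.0)
}
```

A supervised learner replaces the hand-written rule with one inferred from many labeled examples.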
Building and deploying a classifier model
Reference: Learning Spark, O'Reilly book
Vectorizing Data
•  identify interesting features (those that contribute to the model)
•  assign features to dimensions
Example: vectorize an apple
Features: [size, color, weight]
Vector: [3.2, 16777184.0, 45.8]
Example: vectorize a text document
(Term Frequency-Inverse Document Frequency, TF-IDF)
Dictionary: [a, advance, after, …, you, yourself, youth, zigzag]
Vector: [223, 1, 1, 0, …, 12, 10, 6, 1]
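To make the text example concrete, here is a small sketch that assigns one vector dimension per dictionary word and stores the raw count of that word. It covers only the term-frequency part; the inverse-document-frequency weighting is left out. `TermCounts` and `vectorize` are hypothetical names.

```scala
// Illustrative sketch: raw term counts over a fixed dictionary (no IDF weighting).
object TermCounts {
  // Returns one count per dictionary entry, in dictionary order.
  def vectorize(dictionary: Seq[String], text: String): Array[Double] = {
    // Lower-case the text, split on whitespace, and count each distinct word.
    val counts = text.toLowerCase.split("\\s+").filter(_.nonEmpty)
      .groupBy(identity).map { case (w, ws) => w -> ws.length.toDouble }
    // Words missing from the text get a count of 0.0.
    dictionary.map(w => counts.getOrElse(w, 0.0)).toArray
  }
}
```

Words that are not in the dictionary are ignored, which is one reason a hashing approach can be convenient when the vocabulary is not known in advance.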
Build Term Frequency Feature vectors
import org.apache.spark.mllib.feature.HashingTF
// examples of spam
val spam = sc.textFile("spam.txt")
// examples of not spam
val normal = sc.textFile("normal.txt")
// Create a HashingTF instance to map email text to vectors of 10000 features
val tf = new HashingTF(numFeatures = 10000)
// Each email is split into words, and each word is mapped to one feature
val spamFeatures = spam
  .map(email => tf.transform(email.split(" ")))
val normalFeatures = normal
  .map(email => tf.transform(email.split(" ")))
Reference: Learning Spark, O'Reilly book
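The idea behind a hashing term-frequency transform can be sketched without Spark: each word is hashed to a bucket index, and the bucket's count is incremented. The version below is simplified and uses Scala's `hashCode`; the hash function Spark's `HashingTF` uses may differ, and `HashingSketch` is a hypothetical name.

```scala
// Simplified sketch of the hashing trick; not Spark's implementation.
object HashingSketch {
  // Map a word to a bucket in [0, numFeatures) using its hash code.
  def bucket(word: String, numFeatures: Int): Int = {
    val raw = word.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw // keep the index non-negative
  }

  // Term-frequency vector: each word increments the count in its bucket.
  def transform(words: Seq[String], numFeatures: Int): Array[Double] = {
    val vec = new Array[Double](numFeatures)
    words.foreach(w => vec(bucket(w, numFeatures)) += 1.0)
    vec
  }
}
```

No dictionary is needed, at the cost that two different words can land in the same bucket.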
Build Model
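import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
// Label spam as 1.0 (positive) and normal email as 0.0 (negative)
val positiveExamples = spamFeatures.map(features => LabeledPoint(1.0, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0.0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // Cache because logistic regression is an iterative algorithm
// Run logistic regression using the SGD algorithm
val model = new LogisticRegressionWithSGD()
  .run(trainingData)
Reference: Learning Spark, O'Reilly book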
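For intuition about what training with SGD does, here is a small plain-Scala sketch of logistic regression trained by stochastic gradient descent. It is not MLlib's implementation; the step size, epoch count, and all names (`LogisticSketch`, `train`, `predict`) are assumptions for illustration.

```scala
// Illustrative sketch of logistic regression with SGD; not MLlib's code.
object LogisticSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  // data: (label in {0.0, 1.0}, feature vector). Returns the learned weights.
  def train(data: Seq[(Double, Array[Double])], stepSize: Double, epochs: Int): Array[Double] = {
    val w = new Array[Double](data.head._2.length)
    for (_ <- 1 to epochs; example <- data) {
      val (label, x) = example
      // Gradient of the log loss with respect to the score is (prediction - label).
      val error = sigmoid(dot(w, x)) - label
      for (i <- w.indices) w(i) -= stepSize * error * x(i)
    }
    w
  }

  // Classify as 1.0 when the predicted probability exceeds 0.5.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (sigmoid(dot(w, x)) > 0.5) 1.0 else 0.0
}
```

The weights are updated after every example, which is why caching the training data matters when the same pass is repeated many times.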
Model Evaluation
// Test on a positive example (spam)
Vector posTest = tf.transform(Arrays.asList(
"O M G GET cheap stuff by sending money to...".split(" ")));
// Test on a negative example (not spam)
Vector negTest = tf.transform(Arrays.asList(
"Hi Dad, I started studying Spark the other ...".split(" ")));
System.out.println("Prediction for positive: " +
model.predict(posTest));
System.out.println("Prediction for negative: " +
model.predict(negTest));
Clustering
•  Clustering is the unsupervised learning task that involves grouping objects
into clusters of high similarity
–  Search results grouping
–  grouping of customers by similar habits
–  Anomaly detection
•  data traffic
–  Text categorization
What is Clustering?
Clustering = (unsupervised) task of grouping similar objects
MLlib K-means algorithm for clustering
1.  Randomly initialize the cluster centers
2.  Assign each point to the closest cluster center
3.  Move each cluster center to the mean of the points assigned to it
4.  Repeat steps 2 and 3 until convergence
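The steps above can be sketched in plain Scala. This is an illustration, not MLlib's K-means: the initial centers are passed in by the caller (who may pick them at random), and a fixed iteration count stands in for a convergence check. `KMeansSketch` and its members are hypothetical names.

```scala
// Illustrative K-means sketch on in-memory points; not MLlib's implementation.
object KMeansSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step 2: index of the closest center for a point.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Step 3: component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dim = points.head.length
    Array.tabulate(dim)(d => points.map(_(d)).sum / points.length)
  }

  // Steps 2 to 4: alternate assignment and update for a fixed number of iterations.
  def run(points: Seq[Point], initialCenters: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initialCenters) { (centers, _) =>
      val groups = points.groupBy(p => closest(p, centers))
      // A center that attracts no points keeps its previous position.
      centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
}
```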
Examples of ML Algorithms
Machine learning
Supervised:
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision Forests
•  Regression
–  Linear
–  Logistic
Unsupervised:
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component Analysis
–  SVD
ML Algorithms
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Collaborative Filtering with Spark
•  Recommend Items
–  (filtering)
•  Based on User preferences data
–  (collaborative)
Train a Model to Make Predictions
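[Diagram: training data feeds an algorithm that builds a model; new data goes into the model, which outputs predictions]
Training data: Ted and Carol like Movie B and C
New data: Bob likes Movie B. What might he like?
Prediction: Bob likes Movie B, so predict Movie C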
Alternating Least Squares
•  Approximates the sparse user-item rating matrix
–  as the product of two dense matrices, the User and Item factor matrices
–  tries to learn the hidden features of each user and item
–  the algorithm alternately fixes one factor matrix and solves for the other
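To show the alternation concretely, here is a deliberately tiny sketch with rank 1, so each user and each item has a single hidden feature and every least-squares solve reduces to one division. It is not MLlib's ALS: the all-ones initialization, the plain `lambda` regularization, and all names (`AlsSketch`, `solve`, `train`, `predict`) are assumptions for illustration.

```scala
// Illustrative rank-1 alternating least squares; not MLlib's implementation.
object AlsSketch {
  case class Rating(user: Int, item: Int, rating: Double)

  // Least-squares solution for one side while the other side is held fixed:
  // factor = sum(rating * fixedFactor) / (sum(fixedFactor^2) + lambda).
  private def solve(ratings: Seq[Rating], fixed: Map[Int, Double], lambda: Double,
                    keyOf: Rating => Int, otherOf: Rating => Int): Map[Int, Double] =
    ratings.groupBy(keyOf).map { case (id, rs) =>
      val num = rs.map(r => r.rating * fixed(otherOf(r))).sum
      val den = rs.map(r => fixed(otherOf(r)) * fixed(otherOf(r))).sum + lambda
      id -> num / den
    }

  // Returns (userFactors, itemFactors) after the given number of alternations.
  def train(ratings: Seq[Rating], iterations: Int, lambda: Double): (Map[Int, Double], Map[Int, Double]) = {
    var items: Map[Int, Double] = ratings.map(_.item).distinct.map(_ -> 1.0).toMap
    var users: Map[Int, Double] = Map.empty
    for (_ <- 1 to iterations) {
      users = solve(ratings, items, lambda, _.user, _.item) // fix items, solve for users
      items = solve(ratings, users, lambda, _.item, _.user) // fix users, solve for items
    }
    (users, items)
  }

  // Predicted rating is the product of the user and item factors.
  def predict(users: Map[Int, Double], items: Map[Int, Double], user: Int, item: Int): Double =
    users(user) * items(item)
}
```

With a higher rank, each division becomes a small linear system per user or item, but the fix-one-side, solve-the-other loop is the same.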
ML Cross Validation Process
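[Diagram: data is split into a training set and a test set; the training set is used to build the model, the test set is used to test the model's predictions, and the train/test loop repeats]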
Ratings Data
Parse Input
import org.apache.spark.mllib.recommendation.Rating
// parse an input line of the form UserID::MovieID::Rating
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
// create an RDD of Rating objects (ratingText is an RDD[String] of input lines)
val ratingsRDD = ratingText.map(parseRating).cache()
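The parser above throws on a malformed line. A defensive variant, sketched in plain Scala, returns `None` instead. The `Rating` case class here is a local stand-in for MLlib's class of the same name, and `RatingParser` is a hypothetical name.

```scala
import scala.util.Try

// Illustrative sketch: tolerant parsing of UserID::MovieID::Rating lines.
object RatingParser {
  // Local stand-in for org.apache.spark.mllib.recommendation.Rating.
  case class Rating(user: Int, product: Int, rating: Double)

  // Returns None when the line has too few fields or a field is not numeric.
  // Extra trailing fields are ignored.
  def parse(line: String): Option[Rating] =
    line.split("::") match {
      case Array(u, m, r, _*) =>
        Try(Rating(u.trim.toInt, m.trim.toInt, r.trim.toDouble)).toOption
      case _ => None
    }
}
```

Dropping bad lines silently hides data problems, so counting the `None` results before discarding them is usually worthwhile.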
Build Model
[Diagram: data is split into a training set, which is used to build the model, and a test set]
Split the ratings RDD into a training data RDD (80%) and a test data RDD (20%).
Build a user product matrix model.
Create Model
import org.apache.spark.mllib.recommendation.ALS
// Randomly split ratings RDD into training data RDD (80%)
// and test data RDD (20%)
val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
val trainingRatingsRDD = splits(0).cache()
val testRatingsRDD = splits(1).cache()
// build an ALS user product matrix model with rank=20,
// iterations=10
val model = (new ALS().setRank(20).setIterations(10)
  .run(trainingRatingsRDD))
Get predictions
// get predicted ratings to compare to test ratings
val testUserProductRDD = testRatingsRDD.map {
case Rating(user, product, rating) => (user, product)
}
// call model.predict with test Userid, MovieId input data
val predictionsForTestRDD = model.predict(testUserProductRDD)
[Diagram: (user, movie) pairs from the test data go into the model, which returns predicted ratings]
Compare predictions to Tests
Join predicted ratings to test ratings in order to compare:
Test ratings (key, value): ((user, product), test rating)
Predicted ratings (key, value): ((user, product), predicted rating)
Joined (key, value): ((user, product), (test rating, predicted rating))
Test Model
// prepare predictions for comparison
val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map{
case Rating(user, product, rating) => ((user, product), rating)
}
// prepare test for comparison
val testKeyedByUserProductRDD = testRatingsRDD.map{
case Rating(user, product, rating) => ((user, product), rating)
}
//Join the test with predictions
val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD
.join(predictionsKeyedByUserProductRDD)
Compare predictions to Tests
Find false positives, where test rating <= 1 and predicted rating >= 4:
Joined (key, value): ((user, product), (test rating, predicted rating))
Test Model
// keep pairs where the user rated low but the model predicted high
val falsePositives = (testAndPredictionsJoinedRDD.filter {
  case ((user, product), (ratingT, ratingP)) =>
    (ratingT <= 1 && ratingP >= 4)
})
falsePositives.take(2)
Array[((Int, Int), (Double, Double))] =
((3842,2858),(1.0,4.106488210964762)),
((6031,3194),(1.0,4.790778049100913))
Test Model Mean Absolute Error
// Evaluate the model using Mean Absolute Error (MAE) between
// test and predictions
val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
  case ((user, product), (testRating, predRating)) =>
    val err = (testRating - predRating)
    Math.abs(err)
}.mean()
meanAbsoluteError: Double = 0.7244940545944053
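The same metric can be checked by hand on a few (test rating, predicted rating) pairs. The sketch below computes MAE on an in-memory sequence and adds root mean squared error (RMSE) for comparison. RMSE is not used in the deck and is included only as a related metric; `ErrorMetrics` is a hypothetical name.

```scala
// Illustrative sketch of error metrics over (actual, predicted) pairs.
object ErrorMetrics {
  // Mean of the absolute differences.
  def meanAbsoluteError(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (t, p) => math.abs(t - p) }.sum / pairs.length

  // Square root of the mean squared difference; penalizes large misses more.
  def rootMeanSquaredError(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (t, p) => (t - p) * (t - p) }.sum / pairs.length)
}
```

An MAE around 0.72, as reported on the slide, means predictions are off by roughly three quarters of a rating point on average.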
Get Predictions for new user
// ratings from a new user (user id 0)
val newRatingsRDD = sc.parallelize(Array(Rating(0, 260, 4), Rating(0, 1, 3)))
// union the new ratings with the existing ratings
val unionRatingsRDD = ratingsRDD.union(newRatingsRDD)
// build an ALS user product matrix model
val model = (new ALS().setRank(20).setIterations(10)
  .run(unionRatingsRDD))
// get 5 recommendations for user id 0
val topRecsForUser = model.recommendProducts(0, 5)
Soon to Come
•  Spark On Demand Training
–  https://www.mapr.com/services/mapr-academy/
•  Blogs and Tutorials:
–  Movie Recommendations with Collaborative Filtering
–  Spark Streaming
Machine Learning Blog
•  https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
Spark on MapR
•  Certified Spark Distribution
•  Fully supported and packaged by MapR in partnership with
Databricks
•  YARN integration
–  Spark can then allocate resources from cluster when needed
References
•  Spark Online course: learn.mapr.com
•  Spark web site: http://spark.apache.org/
•  https://databricks.com/
•  Spark on MapR:
–  http://www.mapr.com/products/apache-spark
•  Spark SQL and DataFrame Guide
•  Apache Spark vs. MapReduce – Whiteboard Walkthrough
•  Learning Spark - O'Reilly Book
•  Apache Spark
Q&A
