Apache Spark
Large-scale recommendations with Apache Spark and Python
Christian S. Perone
christian.perone@gmail.com
AGENDA
INTRODUCTION
Big Data
The Elephant
APACHE SPARK
Apache Spark Introduction
Resilient Distributed Datasets
Data Frames
Spark and Machine Learning
COLLABORATIVE FILTERING
Introduction
Factorization
Practice time
Q&A
WHO AM I
Christian S. Perone
Machine Learning/Software Engineer
Blog
http://blog.christianperone.com
Open-source projects
https://github.com/perone
Twitter @tarantulae
Section I
INTRODUCTION
WHAT IS BIG DATA?
Future is data-based
User generated content
Online / streaming
Internet of Things (IoT)
We want to be able to handle data, query it, build models, make
predictions, etc.
THE CASE AGAINST THE ELEPHANT
The truth is that Map-Reduce as a processing paradigm continues to be
severely restrictive, and is no more than a subset of richer processing
systems.
—Paper Trail, The Elephant was a Trojan Horse – 2014
(...) we don’t really use MapReduce anymore.
—Urs Hölzle, Google I/O Keynote (see context) – 2014
Every real distributed machine learning (ML) researcher/engineer knows
that MR is bad. ML algorithms are iterative and MR is not suited for
iterative algorithms, which is due to unnecessary frequent I/O (...).
—Kenneth Tran, On the imminent decline of MapReduce – 2014
The Mahout community decided to move its codebase onto modern data
processing systems that offer a richer programming model and more
efficient execution than Hadoop MapReduce. Mahout will therefore reject
new MapReduce algorithm implementations from now on (...).
—Mahout, Goodbye MapReduce – 2014
Section II
APACHE SPARK
WHAT IS APACHE SPARK?
Apache Spark is a fast and expressive cluster computing system
compatible with Apache Hadoop.
It improves computation performance by means of:
In-memory computing primitives
General computation graphs
Spark has a rich API with bindings for Scala/Python/Java/R,
including an interactive shell for Python and Scala.
We will focus on the Python API.
APACHE SPARK - CONCEPTS
The main goal of Spark is to provide the user with an API for working
with distributed collections of data as if they were local. These
collections are called RDDs (Resilient Distributed Datasets).
Immutable collections of objects spread across a cluster
Built using parallel transformations (map, reduce, filter, group,
etc)
These RDDs can be rebuilt upon failure, and they are lazy
Controllable persistence for reuse (including caching in RAM)
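These properties take only a few lines to see in the PySpark shell. A minimal sketch, assuming the sc SparkContext the shell provides: transformations only record lineage, and persist() keeps the computed partitions in RAM for reuse.
>>> rdd = sc.parallelize(range(1, 1001))
>>> squares = rdd.map(lambda x: x * x)            # transformation: lazy
>>> evens = squares.filter(lambda x: x % 2 == 0)  # still lazy
>>> evens.persist()                               # cache in RAM once computed
>>> evens.count()                                 # action: runs the job
500
>>> evens.take(5)                                 # served from the cache
[4, 16, 36, 64, 100]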
APACHE SPARK - RDDS
APACHE SPARK - TRANSFORMATIONS VS ACTIONS
Operations on RDDs come in two main types:
TRANSFORMATIONS
These are lazy operations that create new RDDs from other
RDDs. Examples:
map, filter, union, distinct, etc.
ACTIONS
These are the operations that actually perform computation and return
results or write them to disk. Examples:
count, collect, first
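A small illustration of the split (a sketch, again assuming the shell's sc): the transformations return immediately and only record lineage; the action at the end actually runs the job.
>>> rdd = sc.parallelize([1, 2, 3, 4, 5])
>>> doubled = rdd.map(lambda x: x * 2)     # transformation: nothing runs
>>> big = doubled.filter(lambda x: x > 4)  # still nothing runs
>>> print(big.toDebugString())             # inspect the recorded lineage
>>> big.collect()                          # action: triggers execution
[6, 8, 10]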
APACHE SPARK - JOB EXECUTION
SPARK INTERACTIVE SHELL
./bin/pyspark --master local[4]
Creating an RDD from a list:
>>> data = [1, 2, 3, 4, 5, 6, 7, 8]
>>> rdd = sc.parallelize(data)
Creating an RDD from a file:
>>> rdd = sc.textFile("data.txt")
TRANSFORMATIONS AND ACTIONS
Filtering and counting a big log:
>>> rdd_log = sc.textFile('nginx_access.log')
>>> rdd_log.filter(lambda l: 'x.html' in l).count()
238
Collecting the interesting lines:
>>> rdd_log = sc.textFile('nginx_access.log')
>>> lines = rdd_log.filter(lambda l: 'x.html' in l).collect()
>>> lines
['201.140.8.128 [19/Jun/2012:09:17:31 +0100] 
"GET /x.html HTTP/1.1"', (...)]
Breaking down:
>>> filter_rdd = rdd_log.filter(lambda l: 'x.html' in l)
>>> filter_rdd.count()
238
APACHE SPARK - RDDS VS DATAFRAME
RDDs are usually not very intuitive to read for complex
computations; they describe how Spark should perform the
computation rather than what you want to compute.
They also miss some important optimizations, especially for
PySpark.
That’s why DataFrames are so awesome.
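A rough side-by-side with made-up data (a sketch): the RDD version spells out how to aggregate, while the DataFrame version declares what we want and leaves the planning to the optimizer.
>>> pairs = sc.parallelize([("Perone", 1), ("Doe", 1), ("Perone", 1)])
>>> counts = pairs.reduceByKey(lambda a, b: a + b)  # spell out the "how"
>>> df = spark.createDataFrame(pairs, ["User", "hits"])
>>> df.groupBy("User").sum("hits").show()           # declare the "what"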
APACHE SPARK - DATAFRAMES
DataFrames provide a DSL for structured data manipulation.
Very similar to Pandas DataFrames (also contain methods for
conversions).
Can load data from JSON/Parquet/libsvm/etc.
Optimizer is able to look inside of operations.
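A quick sketch of the conversion and loading bullets (the file name ratings.parquet is hypothetical):
>>> import pandas as pd
>>> pdf = pd.DataFrame({"User": ["Perone", "Doe"], "score": [1, 2]})
>>> df = spark.createDataFrame(pdf)  # Pandas -> Spark
>>> back = df.toPandas()             # Spark -> Pandas (collects to the driver)
>>> ratings = spark.read.parquet("ratings.parquet")  # columnar source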
APACHE SPARK - DATAFRAMES
./bin/pyspark --master local[4]
Creating a DataFrame from a JSON:
>>> df = spark.read.json("example.json")
Filter by a column:
>>> df.filter(df["User"]=="Perone").count()
120
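The same DataFrame can also be queried with plain SQL (a sketch reusing the df from above; the view name events is arbitrary):
>>> df.createOrReplaceTempView("events")
>>> spark.sql("SELECT COUNT(*) FROM events WHERE User = 'Perone'").show()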
APACHE SPARK - CATALYST BASICS
Powers both the SQL queries and also the DataFrame API.
Extensible query optimizer.
For example, Catalyst represents expressions as trees and simplifies them with transformation rules:
Add(Attribute(x), Add(Literal(1), Literal(2)))
tree.transform {
case Add(Literal(c1), Literal(c2)) => Literal(c1+c2)
case Add(left, Literal(0)) => left
case Add(Literal(0), right) => right
}
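From PySpark you can watch Catalyst apply rules like these (a sketch; the column x is an assumption about df's schema). explain(True) prints the parsed, analyzed, optimized, and physical plans, where constant folding turns 1 + 2 into 3:
>>> from pyspark.sql.functions import col, lit
>>> df.select(col("x") + (lit(1) + lit(2))).explain(True)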
APACHE SPARK - SPARK.ML VS SPARK.MLLIB
As of Spark 2.0, the RDD-based APIs in the spark.mllib package
have entered maintenance mode. The primary Machine Learning
API for Spark is now the DataFrame-based API in the spark.ml
package.
—http://spark.apache.org/docs/latest/ml-guide.html
MLlib will still support the RDD-based API with bug fixes.
No more new features to the RDD-based API.
In the Spark 2.x releases, MLlib will add features to the
DataFrames-based API to reach feature parity with the RDD-based API.
After reaching feature parity (roughly estimated for Spark 2.2), the
RDD-based API will be deprecated.
The RDD-based API is expected to be removed in Spark 3.0.
APACHE SPARK - STACK
APACHE SPARK - ML
Its goal is to make practical machine learning scalable and easy. At a
high level, it provides tools such as:
ML Algorithms: algorithms such as classification, regression,
clustering, and collaborative filtering
Featurization: feature extraction, transformation,
dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML
Pipelines
Persistence: saving and loading models, Pipelines, etc.
Utilities: linear algebra, statistics, data handling, etc.
APACHE SPARK - ML
Word2vec example using spark.ml:
>>> from pyspark.ml.feature import Word2Vec
>>> documents = [
... ("Hi I heard about Spark".split(" "), ),
... ("I wish Java could use case classes".split(" "), ),
... ("Logistic regression models are neat".split(" "), )
... ]
>>> documentDF = spark.createDataFrame(documents, ["text"])
>>> documentDF.take(1)
[Row(text=[u'Hi', u'I', u'heard', u'about', u'Spark'])]
>>> word2Vec = Word2Vec(vectorSize=3, minCount=0,
... inputCol="text", outputCol="result")
>>> model = word2Vec.fit(documentDF)
>>> result = model.transform(documentDF)
>>> result.take(1)
[Row(text=[u'Hi', u'I', u'heard', u'about', u'Spark'],
result=DenseVector([-0.0168, 0.0042, -0.0308]))]
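The fitted model also exposes the learned vectors and nearest neighbours in the embedding space. With such a tiny corpus the neighbours mean little, so this only sketches the calls:
>>> model.getVectors().show(3)           # one row per word: (word, vector)
>>> model.findSynonyms("Spark", 2).show()  # 2 nearest words by cosine similarity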
Section III
COLLABORATIVE FILTERING
COLLABORATIVE FILTERING
Collaborative filtering methods are based on collecting and
analyzing a large amount of information on users' behaviors,
activities, or preferences, and predicting what users will like based
on their similarity to other users.
Doesn't rely on item content, unlike content-based methods
(works for complex items)
Doesn't need item/user metadata
Suffers from the "new item" problem
Cold start
EXPLICIT FACTORIZATION
Approximate the sparse user × item ratings matrix ("?" marks an
unknown rating; rows are users, e.g. Christian; columns are items,
e.g. AC/DC's "Back in Black") by the product of two low-rank factor
matrices:

    ? 2 3 1
    1 ? ? 4      ≈   X (user factors) × Y (item factors)
    3 2 ? ?
    5 3 2 ?

OPTIMIZATION
\min_{x,y} \sum_{u,i} \left( r_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
* biases omitted
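spark.ml solves this objective with Alternating Least Squares (ALS). A minimal sketch, assuming a ratings DataFrame with illustrative userId/movieId/rating columns; rank is the number of latent factors and regParam is the λ above.
>>> from pyspark.ml.recommendation import ALS
>>> als = ALS(rank=10, maxIter=10, regParam=0.1,
...           userCol="userId", itemCol="movieId",
...           ratingCol="rating")
>>> model = als.fit(ratings)               # learns X and Y
>>> model.transform(ratings).show(3)       # adds a "prediction" column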
LET’S DO IT
Practice time!
Notebook at: https://github.com/perone/spark-als-intro
Load/parse data
Pandas integration, sampling, plotting
Spark SQL
Split data (train/test)
Build model
Train model
Evaluate model
Have fun!
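As a preview, a condensed sketch of that pipeline, under the same assumptions as the ALS example above (column names are illustrative):
>>> train, test = ratings.randomSplit([0.8, 0.2], seed=42)
>>> model = als.fit(train)
>>> from pyspark.ml.evaluation import RegressionEvaluator
>>> preds = model.transform(test).na.drop()   # drop cold-start NaNs
>>> evaluator = RegressionEvaluator(metricName="rmse",
...     labelCol="rating", predictionCol="prediction")
>>> evaluator.evaluate(preds)                 # RMSE on held-out ratings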
Section IV
Q&A
Q&A