Unsupervised Learning with Apache Spark
● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown
● How many kinds of stuff are there?
● Why is some stuff not like the others?
● How do I contextualize new stuff?
● Is there a simpler way to represent this stuff?
● Learn hidden structure of your data
● Interpret new data as it relates to this structure
● Clustering
○ Partition data into categories
● Dimensionality reduction
○ Find a condensed representation of your data
● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
[Diagram: bigfile.txt (HDFS) → lines → numbers, each shown as partitions; sum is returned to the Driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()
[Diagram: bigfile.txt (HDFS) → lines → numbers, each shown as partitions; sum is returned to the Driver]
[Diagram: bigfile.txt, lines, numbers (partitions); sum at the Driver]
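As a rough local analogy for the pipeline above, the sketch below splits the input into chunks, lets each chunk compute a partial sum, and adds the partial sums at the end. It is plain Scala rather than the Spark API, and the names `PartitionedSum`, `partitions` and `sum` are made up for illustration.

```scala
object PartitionedSum {
  // Split the input lines into roughly equal chunks, standing in for RDD partitions.
  def partitions(lines: Seq[String], numPartitions: Int): Seq[Seq[String]] = {
    val size = math.max(1, math.ceil(lines.size.toDouble / numPartitions).toInt)
    lines.grouped(size).toSeq
  }

  // Each "partition" parses and sums its own lines;
  // the "driver" only adds up the partial sums.
  def sum(lines: Seq[String], numPartitions: Int): Double = {
    val partialSums = partitions(lines, numPartitions).map(_.map(_.toDouble).sum)
    partialSums.sum
  }
}
```

Only one number per chunk reaches the final addition, which is the property that makes the sum cheap to distribute.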
Supervised, discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)
Supervised, continuous: Regression
● Linear regression (and regularized variants)
Unsupervised, discrete: Clustering
● K-means
Unsupervised, continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares
● Anomalies as data points far away from any cluster
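One way to read this bullet, sketched in plain Scala: given cluster centers that were already found, flag every point whose nearest center is farther away than some cutoff. The object name `AnomalySketch` and the `threshold` parameter are assumptions for illustration; the slides do not say how the cutoff is chosen.

```scala
object AnomalySketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Euclidean distance from a point to the nearest cluster center.
  def distanceToNearestCenter(centers: Seq[Point], p: Point): Double =
    math.sqrt(centers.map(squaredDistance(_, p)).min)

  // Keep the points whose nearest center is farther away than `threshold`
  // (a hypothetical, application-specific cutoff).
  def anomalies(centers: Seq[Point], points: Seq[Point], threshold: Double): Seq[Point] =
    points.filter(p => distanceToNearestCenter(centers, p) > threshold)
}
```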
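import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("kmeans_data.txt")
// MLlib 1.0+ expects an RDD[Vector], so wrap each parsed row with Vectors.dense
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)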
● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
○ Recompute cluster centers from the points in each cluster
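The two alternating steps can be sketched on a local collection as follows. This is a minimal single-machine illustration in plain Scala, not MLlib's implementation; `LloydSketch`, `closest`, `step` and `run` are hypothetical names, and leaving an empty cluster's center unchanged is an assumption of this sketch.

```scala
object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step 1: index of the closest existing center.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(j => squaredDistance(centers(j), p))

  // Step 2: recompute each center as the mean of its assigned points.
  // A center with no assigned points is left unchanged.
  def step(centers: Seq[Point], points: Seq[Point]): Seq[Point] = {
    val assigned = points.groupBy(p => closest(centers, p))
    centers.indices.map { j =>
      assigned.get(j) match {
        case Some(ps) =>
          val dims = centers(j).length
          Array.tabulate(dims)(d => ps.map(_(d)).sum / ps.size)
        case None => centers(j)
      }
    }
  }

  // Alternate the two steps for a fixed number of iterations.
  def run(initial: Seq[Point], points: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial)((cs, _) => step(cs, points))
}
```

For example, with points (0,0), (0,2), (10,0), (10,2) and initial centers (1,1) and (9,1), a single call to `step` moves the centers to (0,1) and (10,1), and further iterations leave them there.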
● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
■ Process each data point independently
○ Recompute cluster centers from the points in each cluster
■ Average across partitions
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
● K-means is very sensitive to the initial set of center points chosen.
● The best existing algorithm for choosing centers is highly sequential.
● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
● Repeat until all k initial centers are chosen
● The initial clustering has expected cost within a factor of O(log k) of the optimum cost
● Requires k passes over the data
● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on sampled points to find initial centers
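The sequential K-Means++ seeding described in the first bullets above can be sketched like this. It is a local illustration in plain Scala, not MLlib's initialization code: `KMeansPlusPlusSketch`, `sampleIndex` and `init` are hypothetical names, and the weights are the squared distances used by the standard K-Means++ formulation.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Pick an index with probability proportional to its weight.
  def sampleIndex(weights: Seq[Double], rng: Random): Int = {
    val target = rng.nextDouble() * weights.sum
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    val i = cumulative.indexWhere(_ > target)
    // Fallback covers floating-point rounding and the all-zero-weights case.
    if (i >= 0) i else weights.indices.maxBy(j => weights(j))
  }

  def init(points: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] = {
    require(k >= 1 && k <= points.size, "k must be between 1 and the number of points")
    // Start with a uniformly random point from the dataset.
    var centers = Vector(points(rng.nextInt(points.size)))
    // Every further center needs one more pass over the data: k passes in total.
    while (centers.size < k) {
      val weights = points.map(p => centers.map(squaredDistance(_, p)).min)
      centers = centers :+ points(sampleIndex(weights, rng))
    }
    centers
  }
}
```

Because each new center depends on all the centers chosen before it, the loop cannot be parallelized across centers, which is the sequential bottleneck that the few-pass oversampling variant in the later bullets is meant to avoid.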
● Select a basis for your data that
○ Is orthonormal
○ Maximizes variance along its axes
● Find dominant trends
● Find a lower-dimensional representation that lets you visualize the data
● Feature learning - find a representation that's good for clustering or classification
● Latent Semantic Analysis
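import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val data: RDD[Vector] = ...
val mat = new RowMatrix(data)
// compute the top 5 principal components
val principalComponents = mat.computePrincipalComponents(5)
// project data into subspace (multiply by the components, not by mat itself)
val transformed = mat.multiply(principalComponents)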
● Center data
● Find covariance matrix
● Its eigenvectors are the principal components
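The three steps above can be sketched locally in plain Scala. This is an illustration under stated assumptions rather than how MLlib computes principal components: `PcaSketch` and its methods are hypothetical names, power iteration is swapped in as a simple way to get only the top eigenvector, and it assumes at least two rows, a nonzero covariance matrix, and a starting vector that is not orthogonal to the top component.

```scala
object PcaSketch {
  type Vec = Array[Double]

  // Step 1: subtract the per-column mean from every row.
  def center(rows: Seq[Vec]): Seq[Vec] = {
    val n = rows.head.length
    val means = Array.tabulate(n)(j => rows.map(_(j)).sum / rows.size)
    rows.map(r => Array.tabulate(n)(j => r(j) - means(j)))
  }

  // Step 2: n x n sample covariance matrix of already-centered rows.
  def covariance(centered: Seq[Vec]): Array[Vec] = {
    val n = centered.head.length
    Array.tabulate(n, n)((i, j) => centered.map(r => r(i) * r(j)).sum / (centered.size - 1))
  }

  // Step 3 (top component only): power iteration converges to the eigenvector
  // with the largest eigenvalue, i.e. the first principal component up to sign.
  def topComponent(cov: Array[Vec], iterations: Int = 100): Vec = {
    val n = cov.length
    var v = Array.fill(n)(1.0 / math.sqrt(n))
    for (_ <- 1 to iterations) {
      val w = Array.tabulate(n)(i => (0 until n).map(j => cov(i)(j) * v(j)).sum)
      val norm = math.sqrt(w.map(x => x * x).sum)
      v = w.map(_ / norm)
    }
    v
  }
}
```

For rows (1,2), (2,4), (3,6), which all lie on one line, the covariance matrix is [[1, 2], [2, 4]] and the top component points along (1, 2), normalized to unit length.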
[Diagrams: an m × n Data matrix and its n × n Covariance Matrix; the Data matrix is split into row blocks, each yielding an n × n matrix, and these are combined into a single n × n matrix]
def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2
  // Compute the upper triangular part of the gram matrix.
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2
  )
  RowMatrix.triuToFull(n, GU.data)
}
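To make the aggregation pattern in the snippet above concrete, here is a local sketch in plain Scala that mimics it without Spark or Breeze. Each simulated partition folds its rows into a packed upper-triangular accumulator of length n(n+1)/2, and the partial accumulators are then merged. All names (`GramianSketch`, `packedIndex`, `addOuter`, `merge`, `toFull`, `gramian`) are hypothetical, and the column-major packed layout is an assumption chosen to match the usual BLAS convention for packed upper storage.

```scala
object GramianSketch {
  // Index of entry (i, j), with i <= j, in column-major packed upper-triangular storage.
  def packedIndex(i: Int, j: Int): Int = i + j * (j + 1) / 2

  // seqOp: add the upper triangle of the rank-1 update v * v^T to the accumulator, in place.
  def addOuter(acc: Array[Double], v: Array[Double]): Array[Double] = {
    for (j <- v.indices; i <- 0 to j) acc(packedIndex(i, j)) += v(i) * v(j)
    acc
  }

  // combOp: merge two partial accumulators element-wise, in place on the first.
  def merge(a: Array[Double], b: Array[Double]): Array[Double] = {
    for (idx <- a.indices) a(idx) += b(idx)
    a
  }

  // Expand the packed upper triangle into a full symmetric n x n matrix.
  def toFull(n: Int, packed: Array[Double]): Array[Array[Double]] =
    Array.tabulate(n, n)((i, j) => packed(packedIndex(math.min(i, j), math.max(i, j))))

  // Simulate the distributed aggregate: one accumulator per partition, then merge.
  def gramian(partitions: Seq[Seq[Array[Double]]], n: Int): Array[Array[Double]] = {
    val nt = n * (n + 1) / 2
    val partials = partitions.map(_.foldLeft(new Array[Double](nt))(addOuter))
    toFull(n, partials.foldLeft(new Array[Double](nt))(merge))
  }
}
```

With rows (1,2) and (3,4) placed in two separate partitions, the result is [[10, 14], [14, 20]], the same as multiplying the transposed 2 × 2 data matrix by itself. Each accumulator holds about half of an n × n matrix, which is where the memory limit on n comes from.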
● The n × n matrix (n^2 entries) must fit in memory
● Not yet implemented: an EM algorithm can do it with O(kn) memory, where k is the number of principal components
