Unsupervised Learning with Apache Spark

● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, Apache Hadoop committer
● Before that, studied combinatorial optimization and distributed systems at Brown

● How many kinds of stuff are there?
● Why is some stuff not like the others?
● How do I contextualize new stuff?
● Is there a simpler way to represent this stuff?

● Learn hidden structure of your data
● Interpret new data as it relates to this
structure

● Clustering
○ Partition data into categories
● Dimensionality reduction
○ Find a condensed representation of your
data

● Designing a system for processing huge
data in parallel
● Taking advantage of it with algorithms that
work well in parallel

[Diagram: bigfile.txt (HDFS) → lines → numbers, each split into partitions; sum is returned to the driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()

[Diagram: bigfile.txt (HDFS) → lines → numbers, each split into partitions; sum is returned to the driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache()
  .sum()

[Diagram: bigfile.txt → lines → numbers; only the partitions of numbers feed sum on the driver]

Supervised, discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)
Supervised, continuous: Regression
● Linear regression (and regularized variants)
Unsupervised, discrete: Clustering
● K-means
Unsupervised, continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares

● Anomalies as data points far away from any
cluster
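
The idea above can be sketched in a few lines. This is a minimal illustration on plain Scala collections rather than RDDs; the helper names and the `threshold` parameter are assumptions made for the example, not part of MLlib.

```scala
// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Distance from a point to its nearest cluster center.
def distanceToNearestCenter(centers: Seq[Array[Double]], point: Array[Double]): Double =
  math.sqrt(centers.map(c => squaredDistance(c, point)).min)

// Flag points that lie farther than `threshold` from every center.
// `threshold` is a hypothetical tuning knob; choosing it is up to you.
def anomalies(centers: Seq[Array[Double]],
              points: Seq[Array[Double]],
              threshold: Double): Seq[Array[Double]] =
  points.filter(p => distanceToNearestCenter(centers, p) > threshold)
```

With an RDD of points, the same `filter` could be applied to the RDD once the centers have been fetched to the driver.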

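import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("kmeans_data.txt")
// Parse each space-separated line into a dense vector (KMeans.train takes an RDD[Vector])
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)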

● Alternate between two steps:
○ Assign each point to a cluster based on
existing centers
■ Process each data point independently
○ Recompute cluster centers from the
points in each cluster
■ Average across partitions

// Find the sum and count of points mapping to each center
// (centers, costAccum, and mergeContribs are defined outside this excerpt)
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()

// Update the cluster centers and costs
// (k, epsilon, run, and iteration are defined outside this excerpt)
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value

● K-means is very sensitive to the initial set of center points chosen.
● The best existing algorithm for choosing centers is highly sequential.

● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
● Repeat until k initial centers are chosen
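
The seeding steps above can be sketched as follows. This is a minimal single-machine illustration, not MLlib's implementation; it weights each point by its squared distance to the nearest chosen center, as standard K-means++ does, and the function names are made up for the example.

```scala
import scala.util.Random

// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Pick an index with probability proportional to its non-negative weight.
// Zero-weight entries are never picked unless every weight is zero.
def weightedIndex(weights: IndexedSeq[Double], rng: Random): Int = {
  val cumulative = weights.scanLeft(0.0)(_ + _).tail
  val target = rng.nextDouble() * cumulative.last
  val i = cumulative.indexWhere(_ > target)
  if (i >= 0) i else weights.length - 1
}

// K-means++ seeding: the first center is uniform at random, each later one
// is drawn with probability proportional to its squared distance from the
// nearest center chosen so far. Each new center needs a pass over the data.
def kMeansPlusPlus(points: IndexedSeq[Array[Double]],
                   k: Int,
                   rng: Random): IndexedSeq[Array[Double]] = {
  require(k >= 1 && k <= points.length)
  var centers = Vector(points(rng.nextInt(points.length)))
  while (centers.length < k) {
    val weights = points.map(p => centers.map(c => squaredDistance(c, p)).min)
    centers = centers :+ points(weightedIndex(weights, rng))
  }
  centers
}
```

The `while` loop makes the sequential nature visible: the weights for each new center depend on all centers picked before it.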

● Initial clustering has expected cost within a factor of O(log k) of the optimum cost

● Requires k passes over the data

● Do only a few (~5) passes
● Sample m points on each pass
● Oversample: end up with more than k candidate points
● Run K-means++ on the sampled points to find the k initial centers
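
A rough sketch of this scheme on plain Scala collections follows. It is an illustration under assumptions, not MLlib's code: the function names are invented, each pass keeps a point with probability m times its share of the current total cost, and the final reduction weights every candidate by the number of points nearest to it before running a K-means++ style selection.

```scala
import scala.util.Random

def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Squared distance from a point to its nearest center.
def costTo(centers: Seq[Array[Double]], p: Array[Double]): Double =
  centers.map(c => squaredDistance(c, p)).min

// One pass: keep each point independently with probability
// m * cost(point) / totalCost, so roughly m points are drawn per pass.
def samplePass(points: Seq[Array[Double]], candidates: Seq[Array[Double]],
               m: Int, rng: Random): Seq[Array[Double]] = {
  val costs = points.map(p => costTo(candidates, p))
  val total = costs.sum
  if (total == 0.0) Seq.empty
  else points.zip(costs).collect {
    case (p, c) if rng.nextDouble() < m * c / total => p
  }
}

// Oversampled candidate set: a few passes, each independent per point
// and therefore easy to run in parallel.
def candidateCenters(points: IndexedSeq[Array[Double]], m: Int, passes: Int,
                     rng: Random): Seq[Array[Double]] = {
  var candidates: Seq[Array[Double]] = Seq(points(rng.nextInt(points.length)))
  for (_ <- 0 until passes) {
    candidates = candidates ++ samplePass(points, candidates, m, rng)
  }
  candidates
}

// Pick an index with probability proportional to its non-negative weight.
def pickWeighted(weights: IndexedSeq[Double], rng: Random): Int = {
  val cumulative = weights.scanLeft(0.0)(_ + _).tail
  val target = rng.nextDouble() * cumulative.last
  val i = cumulative.indexWhere(_ > target)
  if (i >= 0) i else weights.length - 1
}

// Reduce the candidates to k centers with a weighted K-means++ selection.
// May return duplicates if there are fewer than k distinct candidates.
def reduceToK(points: Seq[Array[Double]], candidates: IndexedSeq[Array[Double]],
              k: Int, rng: Random): IndexedSeq[Array[Double]] = {
  val counts = Array.fill(candidates.length)(0.0)
  points.foreach { p =>
    val nearest = candidates.indices.minBy(i => squaredDistance(candidates(i), p))
    counts(nearest) += 1.0
  }
  var centers = Vector(candidates(pickWeighted(counts.toIndexedSeq, rng)))
  while (centers.length < k) {
    val w = candidates.indices.map(i => counts(i) * costTo(centers, candidates(i)))
    centers = centers :+ candidates(pickWeighted(w, rng))
  }
  centers
}

def kMeansParallelInit(points: IndexedSeq[Array[Double]], k: Int, m: Int,
                       passes: Int, rng: Random): IndexedSeq[Array[Double]] =
  reduceToK(points, candidateCenters(points, m, passes, rng).toIndexedSeq, k, rng)
```

Only the final reduction is sequential, and it runs over the small candidate set rather than the full data.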

Supervised, discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)
Supervised, continuous: Regression
● Linear regression (and regularized variants)
Unsupervised, discrete: Clustering
● K-means
Unsupervised, continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares

● Select a basis for your data that
○ Is orthonormal
○ Maximizes variance along its axes

● Find a lower-dimensional representation that lets you visualize the data
● Feature learning - find a representation that's good for clustering or classification
● Latent Semantic Analysis

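import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val data: RDD[Vector] = ...
val mat = new RowMatrix(data)
// compute the top 5 principal components
val principalComponents = mat.computePrincipalComponents(5)
// project data into the subspace spanned by the principal components
val transformed = mat.multiply(principalComponents)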

● Center data
● Find covariance matrix
● Its eigenvectors are the principal
components
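
The three steps above can be sketched on a small in-memory dataset. This is a toy illustration, not how MLlib computes PCA: it finds only the leading component, uses power iteration in place of a full eigendecomposition, and assumes the starting vector is not orthogonal to that component. All names are invented for the example.

```scala
// Column means of an m x n dataset stored as rows.
def columnMeans(rows: Seq[Array[Double]]): Array[Double] = {
  val n = rows.head.length
  val sums = new Array[Double](n)
  rows.foreach(r => for (j <- 0 until n) sums(j) += r(j))
  sums.map(_ / rows.length)
}

// Sample covariance matrix (n x n) of the centered rows.
def covariance(rows: Seq[Array[Double]]): Array[Array[Double]] = {
  val n = rows.head.length
  val mean = columnMeans(rows)
  val cov = Array.ofDim[Double](n, n)
  rows.foreach { r =>
    for (i <- 0 until n; j <- 0 until n)
      cov(i)(j) += (r(i) - mean(i)) * (r(j) - mean(j))
  }
  cov.map(_.map(_ / (rows.length - 1)))
}

// Leading eigenvector of the covariance matrix by power iteration.
// The iteration count is an arbitrary choice for this sketch.
def topComponent(cov: Array[Array[Double]], iterations: Int = 200): Array[Double] = {
  var v = Array.tabulate(cov.length)(i => 1.0 / (i + 1))
  for (_ <- 0 until iterations) {
    val w = cov.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
    val norm = math.sqrt(w.map(x => x * x).sum)
    v = w.map(_ / norm)
  }
  v
}
```

For points lying on the line y = 2x, `topComponent(covariance(rows))` points along that line, which is the direction of maximum variance.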

[Diagram: the m × n data matrix is split row-wise across partitions; each partition produces an n × n matrix, and these are combined into a single n × n result]

def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2
  // Compute the upper triangular part of the gram matrix.
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2
  )
  RowMatrix.triuToFull(n, GU.data)
}
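
To make the aggregation pattern concrete, here is a simplified sketch on plain Scala collections, where a nested `Seq` stands in for the partitions of an RDD. It stores the full n × n matrix instead of the packed upper triangle used above, and the function names are invented for the example.

```scala
// Add the outer product v * v^T into the n x n accumulator.
// Plays the role of the seqOp above, without the packed storage.
def addOuterProduct(acc: Array[Array[Double]], v: Array[Double]): Array[Array[Double]] = {
  for (i <- v.indices; j <- v.indices) acc(i)(j) += v(i) * v(j)
  acc
}

// Element-wise sum of two partial Gramians. Plays the role of the combOp.
def merge(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
  for (i <- a.indices; j <- a.indices) a(i)(j) += b(i)(j)
  a
}

// Gramian A^T A of a matrix whose rows are grouped into partitions:
// each partition builds its own n x n partial result, then the
// partial results are summed.
def gramian(partitions: Seq[Seq[Array[Double]]], n: Int): Array[Array[Double]] =
  partitions
    .map(rows => rows.foldLeft(Array.ofDim[Double](n, n))(addOuterProduct))
    .foldLeft(Array.ofDim[Double](n, n))(merge)
```

The result does not depend on how the rows are split, since A^T A is the sum of the outer products of the individual rows.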

● The n × n matrix (n^2 entries) must fit in memory
● Not yet implemented: an EM algorithm can do it with O(kn) memory, where k is the number of principal components
