Building machine learning
algorithms on Apache Spark
William Benton (@willb)
Red Hat, Inc.
Session hashtag: #EUds5
Motivation
Forecast
Introducing our case study: self-organizing maps
Parallel implementations for partitioned collections (in particular, RDDs)
Beyond the RDD: data frames and ML pipelines
Practical considerations and key takeaways
Introducing self-organizing maps
Training self-organizing maps

    while t < maxupdates:
        random.shuffle(examples)   # process the training set in random order
        for ex in examples:
            t = t + 1
            if t == maxupdates:
                break
            bestMatch = closest(som[t], ex)
            # sigma(t), the neighborhood size, controls how much of the map
            # around the BMU is affected; alpha(t), the learning rate, controls
            # how much closer to the example each unit gets
            for (unit, wt) in neighborhood(bestMatch, sigma(t)):
                som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt
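The online update rule can be sketched in plain Python, with no Spark involved. The 1-D map, the Gaussian neighborhood weight, and the linearly decaying sigma and alpha schedules below are illustrative assumptions, not the talk's exact implementation:

```python
import math
import random

def train_som(examples, units, maxupdates, sigma0=2.0, alpha0=0.5):
    """Online SOM training: each update moves the best-matching unit
    (BMU) and its neighbors a little closer to one example."""
    units = list(units)
    t = 0
    while t < maxupdates:
        random.shuffle(examples)        # process the training set in random order
        for ex in examples:
            t += 1
            if t > maxupdates:
                break
            # neighborhood size and learning rate both decay over time
            sigma = sigma0 * (1.0 - t / (maxupdates + 1))
            alpha = alpha0 * (1.0 - t / (maxupdates + 1))
            bmu = min(range(len(units)), key=lambda i: abs(units[i] - ex))
            for i, u in enumerate(units):
                wt = math.exp(-((i - bmu) ** 2) / (2.0 * sigma * sigma))
                units[i] = u + (ex - u) * alpha * wt
    return units

random.seed(42)
examples = [0.0, 0.1, 0.9, 1.0]
trained = train_som(examples, units=[0.4, 0.5, 0.6], maxupdates=200)
# the ends of the map get pulled out toward the two clusters of examples
print(min(trained), max(trained))
```

Note that every update depends on the map produced by the previous update, which is exactly the serial-execution problem the next section examines.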
Parallel implementations for
partitioned collections
Historical aside: Amdahl's Law

If p is the fraction of a program that can be parallelized and s is the speedup of that fraction, the overall speedup is S = 1 / ((1 - p) + p/s); as s → ∞, S approaches 1 / (1 - p).
What forces serial execution?

    state[t+1] = combine(state[t], x)

    f1: (T, T) => T
    f2: (T, U) => T

Folding each example into a shared state (the shape of f2) depends on the previous state, so it must run one step at a time; an operation shaped like f1, which combines two values of the same type, is what lets us merge results computed independently on parts of the data.
How can we fix these?

    a ⊕ b = b ⊕ a
    (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

If the combining operation ⊕ is commutative and associative, per-partition partial results can be computed independently and merged in any order. Techniques such as SGD and L-BFGS can be adapted along these lines, and there will be examples of each of these approaches for many problems in the literature and in open-source code!
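These algebraic laws are exactly what lets a reduction be split across partitions: the partial results can be merged in whatever order partitions happen to finish. A plain-Python sketch (the partitioning is simulated, not Spark's):

```python
import random
from functools import reduce

random.seed(42)

def parallel_reduce(partitions, op):
    """Reduce each partition independently, then merge the partial
    results in arbitrary order -- as a distributed runtime would."""
    partials = [reduce(op, part) for part in partitions]
    random.shuffle(partials)            # partitions finish in any order
    return reduce(op, partials)

data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, 100, 25)]

# addition is commutative and associative: always the right answer
assert all(parallel_reduce(partitions, lambda a, b: a + b) == sum(data)
           for _ in range(50))

# subtraction is neither: the "answer" depends on merge order
results = {parallel_reduce(partitions, lambda a, b: a - b) for _ in range(50)}
print(sorted(results))                  # more than one distinct value
```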
Implementing atop RDDs

We'll start with a batch implementation of our technique:

    for t in (1 to iterations):
        state = newState()
        for ex in examples:
            bestMatch = closest(som[t-1], ex)
            hood = neighborhood(bestMatch, sigma(t))
            state.matches += ex * hood
            state.hoods += hood
        som[t] = newSOM(state.matches / state.hoods)

Each batch produces a model that can be averaged with other models, so the inner loop can run over each partition independently. This averaging trick won't always work for every algorithm, though!
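The batch formulation replaces sequential updates with two sums, a weighted sum of examples (`matches`) and a sum of neighborhood weights (`hoods`), that can be accumulated independently and divided at the end. A minimal plain-Python sketch; the 1-D map and Gaussian neighborhood are illustrative assumptions:

```python
import math

def batch_som_step(examples, units, sigma):
    """One batch SOM iteration: accumulate weighted example sums and
    neighborhood-weight sums, then divide to get the new units."""
    matches = [0.0] * len(units)    # state.matches: sum of ex * hood weight
    hoods = [0.0] * len(units)      # state.hoods: sum of hood weights
    for ex in examples:
        bmu = min(range(len(units)), key=lambda i: abs(units[i] - ex))
        for i in range(len(units)):
            wt = math.exp(-((i - bmu) ** 2) / (2.0 * sigma * sigma))
            matches[i] += ex * wt
            hoods[i] += wt
    # newSOM(state.matches / state.hoods)
    return [m / h for m, h in zip(matches, hoods)]

units = [0.25, 0.5, 0.75]
examples = [0.0, 0.1, 0.9, 1.0]
for t in range(10):
    units = batch_som_step(examples, units, sigma=0.5)
print(units)    # the end units are pulled toward the two example clusters
```

Because `matches` and `hoods` are plain sums, each partition can accumulate its own pair and the pairs can simply be added together, which is what makes each batch's model mergeable.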
An implementation template

    var nextModel = initialModel
    for (i <- 0 until iterations) {
      val newState = examples.aggregate(ModelState.empty())(
        // "fold": update the state for this partition with a single new example
        { case (state: ModelState, example: Example) =>
            state.update(nextModel.lookup(example, i), example) },
        // "reduce": combine the states from two partitions
        { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
      )
      nextModel = modelFromState(newState)
    }

Referring to nextModel inside the closures causes the model object to be serialized with the closure and shipped with every task.
An implementation template

    var nextModel = initialModel
    for (i <- 0 until iterations) {
      // broadcast the current working model for this iteration
      val current = sc.broadcast(nextModel)
      val newState = examples.aggregate(ModelState.empty())(
        // get the value of the broadcast variable on each worker
        { case (state: ModelState, example: Example) =>
            state.update(current.value.lookup(example, i), example) },
        { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
      )
      nextModel = modelFromState(newState)
      // remove the stale broadcasted model
      current.unpersist()
    }

Broadcasting fixes the serialization problem, but plain aggregate is still the wrong implementation of the right interface.
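The template's two callbacks, a per-partition fold and a cross-partition combine, can be simulated in plain Python. The ModelState here is just a (sum, count) pair whose "model" is a mean; the names mirror the template, not Spark's API:

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Mimics RDD.aggregate: fold each partition from zero with seq_op,
    then combine the per-partition states with comb_op."""
    states = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, states)

# ModelState: running (sum, count); "fitting" just computes the mean
empty = (0.0, 0)

def update(state, example):             # "fold": one new example
    s, n = state
    return (s + example, n + 1)

def combine(s1, s2):                    # "reduce": merge two partitions
    return (s1[0] + s2[0], s1[1] + s2[1])

partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
total, count = aggregate(partitions, empty, update, combine)
model = total / count
print(model)    # → 3.0
```

In the real template, the current model would be broadcast once per iteration so every task reads one shared copy instead of serializing the model into each closure.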
Implementing on RDDs

[diagrams elided: with aggregate, the driver itself combines every partition's state, one ⊕ at a time; with treeAggregate, workers combine partial states pairwise in a tree, and the driver merges only the final few results]
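The difference can be sketched directly: a linear merge performs all of the combines in one place (the driver), while a tree merge halves the number of pending states at each level, and each level's combines could run in parallel on the workers. A toy version:

```python
def linear_combine(states, op):
    """aggregate-style: one process merges every partition state itself."""
    acc = states[0]
    for s in states[1:]:
        acc = op(acc, s)
    return acc

def tree_combine(states, op):
    """treeAggregate-style: merge pairwise in rounds; each round's
    merges are independent and could run in parallel."""
    while len(states) > 1:
        states = [op(states[i], states[i + 1]) if i + 1 < len(states)
                  else states[i]
                  for i in range(0, len(states), 2)]
    return states[0]

states = list(range(1, 9))              # eight per-partition states
add = lambda a, b: a + b
assert linear_combine(states, add) == tree_combine(states, add) == 36
```

The two agree only because the operation is associative, which is another reason the algebraic laws from earlier matter.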
Beyond the RDD: Data frames and
ML Pipelines
RDDs: some good parts

    val rdd: RDD[String] = /* ... */
    rdd.map(_ * 3.0).collect()      // doesn't compile

    val df: DataFrame = /* data frame with one String-valued column */
    df.select($"_1" * 3.0).show()   // compiles, but crashes at runtime

Because RDDs are statically typed, the nonsensical multiplication is rejected at compile time; the equivalent data-frame query only fails once it runs.
RDDs: some good parts

    rdd.map {
      vec => (vec, model.value.closestWithSimilarity(vec))
    }

    val predict = udf((vec: SV) =>
      model.value.closestWithSimilarity(vec))
    df.withColumn("predictions", predict($"features"))

With RDDs, applying a model is an ordinary function call inside map; with data frames, the same logic has to be wrapped in a UDF before it can be used in a query.
RDDs versus query planning

    val numbers1 = sc.parallelize(1 to 100000000)
    val numbers2 = sc.parallelize(1 to 1000000000)

    // materializes the full cartesian product before filtering
    numbers1.cartesian(numbers2)
      .map { case (x, y) => (x, y, expensive(x, y)) }
      .filter { case (x, y, _) => isPrime(x) && isPrime(y) }

    // far cheaper: filter each side first -- but with RDDs,
    // you must perform this rewrite by hand
    numbers1.filter(isPrime(_))
      .cartesian(numbers2.filter(isPrime(_)))
      .map { case (x, y) => (x, y, expensive(x, y)) }
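The manual rewrite is doing what a query planner does automatically: pushing the filters below the cartesian product so far fewer pairs are ever materialized. A plain-Python analogue with a toy primality test and small ranges:

```python
from itertools import product

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

xs, ys = range(1, 51), range(1, 51)

# naive: build every pair, then filter
naive = [(x, y) for x, y in product(xs, ys)
         if is_prime(x) and is_prime(y)]

# pushed down: filter each side first, then take the product
primes_x = [x for x in xs if is_prime(x)]
primes_y = [y for y in ys if is_prime(y)]
pushed = list(product(primes_x, primes_y))

assert sorted(naive) == sorted(pushed)
# 2500 candidate pairs versus 225 surviving pairs; the gap grows
# quadratically with the input sizes
print(len(list(product(xs, ys))), len(pushed))
```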
RDDs and the Java heap

    val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))

On the JVM heap, this is three array objects: an outer array of element pointers and two inner arrays of doubles. Each object carries a header (class pointer, flags, lock word, size), so the four doubles amount to 32 bytes of data... and 64 bytes of overhead!
ML pipelines: a quick example
from pyspark.ml.clustering import KMeans
K, SEED = 100, 0xdea110c8
randomDF = make_random_df()
kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
model = kmeans.fit(randomDF)
withPredictions = model.transform(randomDF).select("x", "y", "prediction")
Working with ML pipelines

    estimator.fit(df)       model.transform(df)

An estimator is configured by parameters (such as inputCol, epochs, and seed), and its fit method trains on a data frame to produce a model; the model is a transformer whose transform method adds output columns (outputCol) to a data frame.
Defining parameters

    private[som] trait SOMParams extends Params
        with DefaultParamsWritable {

      final val x: IntParam =
        new IntParam(this, "x", "width of self-organizing map (>= 1)",
          ParamValidators.gtEq(1))

      final def getX: Int = $(x)

      final def setX(value: Int): this.type = set(x, value)

      // ...
    }

Each parameter declares a name, a documentation string, and a validator; the getter reads the current value with $(...) and the chainable setter returns this.type.
Don’t repeat yourself
/**
* Common params for KMeans and KMeansModel
*/
private[clustering] trait KMeansParams extends Params
with HasMaxIter with HasFeaturesCol
with HasSeed with HasPredictionCol with HasTol { /* ... */ }
Estimators and transformers

    estimator.fit(df)       model.transform(df)
Validate and transform at once

    def transformSchema(schema: StructType): StructType = {
      // check that the input columns exist...
      require(schema.fieldNames.contains($(featuresCol)))
      // ...and are the proper type
      schema($(featuresCol)) match {
        case sf: StructField => require(sf.dataType.equals(VectorType))
      }
      // ...and that the output columns don't exist
      require(!schema.fieldNames.contains($(predictionCol)))
      require(!schema.fieldNames.contains($(similarityCol)))
      // ...and then make a new schema
      schema.add($(predictionCol), "int")
            .add($(similarityCol), "double")
    }
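transformSchema's checks translate directly into any schema representation. A plain-Python analogue using a dict from column name to type name; the column names and type strings are illustrative:

```python
def transform_schema(schema, features_col="features",
                     prediction_col="prediction", similarity_col="similarity"):
    """Validate the input schema and derive the output schema, mirroring
    transformSchema: inputs must exist with the right type, and the
    output columns must not exist yet."""
    # check that the input columns exist...
    assert features_col in schema, f"missing input column {features_col!r}"
    # ...and are the proper type
    assert schema[features_col] == "vector", "features must be a vector column"
    # ...and that the output columns don't exist
    assert prediction_col not in schema and similarity_col not in schema
    # ...and then make a new schema
    return {**schema, prediction_col: "int", similarity_col: "double"}

out = transform_schema({"id": "long", "features": "vector"})
print(out)
# → {'id': 'long', 'features': 'vector', 'prediction': 'int', 'similarity': 'double'}
```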
Training on data frames
def fit(examples: DataFrame) = {
import examples.sparkSession.implicits._
import org.apache.spark.ml.linalg.{Vector=>SV}
val dfexamples = examples.select($(exampleCol)).rdd.map {
case Row(sv: SV) => sv
}
/* construct a model object with the result of training */
new SOMModel(train(dfexamples, $(x), $(y)))
}
Practical considerations and key takeaways
Retire your visibility hacks

The old trick of declaring your own package inside the org.apache.spark namespace just to reach package-private classes:

    package org.apache.spark.ml.hacks

    object Hacks {
      import org.apache.spark.ml.linalg.VectorUDT
      val vectorUDT = new VectorUDT
    }

is no longer necessary; since Spark 2.0, the vector and matrix types are exposed as developer API:

    package org.apache.spark.ml.linalg

    /* imports, etc., are elided ... */

    @Since("2.0.0")
    @DeveloperApi
    object SQLDataTypes {
      val VectorType: DataType = new VectorUDT
      val MatrixType: DataType = new MatrixUDT
    }
Caching training data
val wasUncached = examples.storageLevel == StorageLevel.NONE
if (wasUncached) { examples.cache() }
/* actually train here */
if (wasUncached) { examples.unpersist() }
Improve serial execution times
Are you repeatedly comparing training data to a model that only changes
once per iteration? Consider caching norms.
Are you doing a lot of dot products in a for loop? Consider replacing
these loops with a matrix-vector multiplication.
Seek to limit the number of library invocations you make and thus the
time you spend copying data to and from your linear algebra library.
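The norm-caching suggestion works because squared Euclidean distance decomposes as |x - u|² = |x|² + |u|² - 2·(x·u): the unit norms |u|² change only when the model does, so they can be computed once per iteration and reused for every example. A small sketch with hypothetical unit vectors:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

units = [[0.0, 1.0], [2.0, 2.0], [5.0, 0.0]]

# computed once per iteration, reused for every example
unit_norms = [dot(u, u) for u in units]

def closest(ex):
    ex_norm = dot(ex, ex)                   # once per example
    # |ex - u|^2 == |ex|^2 + |u|^2 - 2 * (ex . u): only the dot
    # product must be recomputed for each (example, unit) pair
    dists = [ex_norm + unit_norms[i] - 2.0 * dot(ex, u)
             for i, u in enumerate(units)]
    return min(range(len(units)), key=dists.__getitem__)

print(closest([4.0, 0.5]))    # → 2
```

Batching those per-pair dot products into a single matrix-vector multiplication goes one step further, trading many small library calls for one big one.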
Key takeaways
There are several techniques you can use to develop parallel
implementations of machine learning algorithms.
The RDD API may not be your favorite way to interact with Spark as a user,
but it can be extremely valuable if you’re developing libraries for Spark.
As a library developer, you might need to rely on developer APIs and dive
in to Spark’s source code, but things are getting easier with each release!
@willb • willb@redhat.com
https://chapeau.freevariable.com
https://radanalytics.io
Thanks!
subhashenia
 
PDF
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
PPTX
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 
PDF
Research Methodology Overview Introduction
ayeshagul29594
 
PDF
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
PDF
Development and validation of the Japanese version of the Organizational Matt...
Yoga Tokuyoshi
 
PDF
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
PDF
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
 
PPTX
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
 
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
 
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
 
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
 
apidays Singapore 2025 - The API Playbook for AI by Shin Wee Chuang (PAND AI)
apidays
 
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
 
apidays Singapore 2025 - The Quest for the Greenest LLM , Jean Philippe Ehre...
apidays
 
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
 
1750162332_Snapshot-of-Indias-oil-Gas-data-May-2025.pdf
sandeep718278
 
apidays Singapore 2025 - Surviving an interconnected world with API governanc...
apidays
 
Powerful Uses of Data Analytics You Should Know
subhashenia
 
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
How to Add Columns and Rows in an R Data Frame
subhashenia
 
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 
Research Methodology Overview Introduction
ayeshagul29594
 
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
Development and validation of the Japanese version of the Organizational Matt...
Yoga Tokuyoshi
 
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
 
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
 

Building Machine Learning Algorithms on Apache Spark with William Benton

  • 1. Building machine learning algorithms on Apache Spark William Benton (@willb) Red Hat, Inc. Session hashtag: #EUds5
  • 6. #EUds5 Forecast Introducing our case study: self-organizing maps Parallel implementations for partitioned collections (in particular, RDDs) Beyond the RDD: data frames and ML pipelines Practical considerations and key takeaways 6
  • 14-17. #EUds5 Training self-organizing maps 12

        while t < maxupdates:
            # process the training set in random order
            random.shuffle(examples)
            for ex in examples:
                t = t + 1
                if t == maxupdates:
                    break
                bestMatch = closest(som[t], ex)
                # the neighborhood size sigma(t) controls how much of the map
                # around the BMU is affected; the learning rate alpha(t)
                # controls how much closer to the example each unit gets
                for (unit, wt) in neighborhood(bestMatch, sigma(t)):
                    som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt
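As a concrete illustration, here is a minimal NumPy sketch of one SOM update step. The function name `som_step`, the Gaussian neighborhood, and the map layout are assumptions for illustration, not the talk's actual code:

```python
import numpy as np

def som_step(som, ex, t, alpha, sigma):
    """One sequential SOM update: find the best-matching unit (BMU) for
    example ex, then move every unit toward ex, weighted by a Gaussian
    neighborhood around the BMU."""
    h, w, d = som.shape
    flat = som.reshape(-1, d)
    # BMU: the unit whose weight vector is closest to the example
    bmu = int(np.argmin(np.linalg.norm(flat - ex, axis=1)))
    bx, by = divmod(bmu, w)
    # Gaussian neighborhood weight for every unit, by grid distance to the BMU
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    dist2 = (rows - bx) ** 2 + (cols - by) ** 2
    wt = np.exp(-dist2 / (2.0 * sigma(t) ** 2))
    # move each unit toward the example, scaled by learning rate and weight
    return som + alpha(t) * wt[:, :, None] * (ex - som)

rng = np.random.default_rng(42)
som = rng.random((5, 5, 3))   # a 5x5 map of 3-dimensional units
ex = rng.random(3)            # one training example
new_som = som_step(som, ex, t=0, alpha=lambda t: 0.5, sigma=lambda t: 1.0)
```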
  • 19. #EUds5 Historical aside: Amdahl's Law 14

        lim (s_p → ∞) S_o = 1 / (1 - p)

    where p is the fraction of the work that can be parallelized and s_p is the speedup of that fraction.
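The limit can be checked numerically; `amdahl_speedup` is a hypothetical helper implementing the full law, S_o = 1 / ((1 - p) + p / s_p):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# With 95% of the work parallelizable, even unbounded parallelism
# cannot beat 1 / (1 - 0.95) = 20x overall.
print(amdahl_speedup(0.95, 10))    # about 6.9x with a 10x parallel speedup
print(amdahl_speedup(0.95, 1e12))  # approaches the 20x ceiling
```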
  • 20-23. #EUds5 What forces serial execution? A loop-carried dependence: computing the next state requires the previous one.

        state[t+1] = combine(state[t], x)

  • 24-26. #EUds5 What forces serial execution? It depends on the shape of the combining function: f1 can merge two partial states, but f2 can only fold one example into an existing state.

        f1: (T, T) => T
        f2: (T, U) => T
  • 27-32. #EUds5 How can we fix these? Require the combining operation ⊕ to be commutative and associative, so partial results can be merged in any order:

        a ⊕ b = b ⊕ a
        (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

    Sometimes that means trading one algorithm for another (e.g., SGD vs. L-BFGS). There will be examples of each of these approaches for many problems in the literature and in open-source code!
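A toy illustration of why commutativity and associativity help: with a merge like the hypothetical running-stats state below, per-partition folds can be combined in any grouping and still give the same answer:

```python
from functools import reduce

# Toy "state": (count, sum). Merging two states is commutative and
# associative, so partitions can be combined in any order.
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, 100, 25)]

# fold each partition independently, then merge the partial states
partials = [reduce(merge, ((1, x) for x in part)) for part in partitions]
count, total = reduce(merge, partials)
print(count, total / count)  # 100 50.5
```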
  • 33-36. #EUds5 Implementing atop RDDs. We'll start with a batch implementation of our technique:

        for t in (1 to iterations):
            state = newState()
            for ex in examples:
                bestMatch = closest(som[t-1], ex)
                hood = neighborhood(bestMatch, sigma(t))
                state.matches += ex * hood
                state.hoods += hood
            som[t] = newSOM(state.matches / state.hoods)

    Each batch produces a model that can be averaged with other models, so the inner loop can run independently on each partition. (This won't always work!)
  • 37-43. #EUds5 An implementation template 30

        var nextModel = initialModel
        for (i <- 0 until iterations) {
          // broadcast the current working model for this iteration
          val current = sc.broadcast(nextModel)
          val newState = examples.aggregate(ModelState.empty())(
            // "fold": update the state for this partition with a single new example
            { case (state: ModelState, example: Example) =>
                state.update(current.value.lookup(example, i), example) },
            // "reduce": combine the states from two partitions
            { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
          )
          nextModel = modelFromState(newState)
          // remove the stale broadcast model
          current.unpersist
        }

    Referring to nextModel directly inside the closures would cause the model object to be serialized with the closure; broadcasting it and reading current.value avoids that. Even with the broadcast, plain aggregate is "the wrong implementation of the right interface."
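The aggregate contract can be sketched in plain Python; the names `seq_op` and `comb_op` are assumptions mirroring RDD.aggregate's fold and reduce functions:

```python
from functools import reduce

# Minimal sketch of the aggregate pattern: fold each partition with
# seq_op starting from a fresh zero state, then merge the per-partition
# states with comb_op -- the same contract as RDD.aggregate.
def aggregate(partitions, zero, seq_op, comb_op):
    partials = [reduce(seq_op, part, zero()) for part in partitions]
    return reduce(comb_op, partials, zero())

# toy "model state": count and sum of examples seen
zero = lambda: (0, 0.0)
seq_op = lambda state, ex: (state[0] + 1, state[1] + ex)
comb_op = lambda s1, s2: (s1[0] + s2[0], s1[1] + s2[1])

parts = [[1.0, 2.0], [3.0, 4.0], [5.0]]
count, total = aggregate(parts, zero, seq_op, comb_op)
print(count, total)  # 5 15.0
```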
  • 44-51. #EUds5 Implementing on RDDs [aggregate vs. treeAggregate diagrams] With aggregate, every partition's state is shipped to the driver, which performs all of the combines itself. With treeAggregate, workers combine partial states in rounds, so the driver only has to merge the last few partial results.
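A sketch of the tree-shaped combining that treeAggregate performs, as pairwise rounds; this is an illustration of the idea, not Spark's actual scheduling:

```python
from functools import reduce

# Merge partial states pairwise in rounds, so no single node has to
# combine all of them at once.
def tree_combine(states, comb_op):
    while len(states) > 1:
        pairs = [states[i:i + 2] for i in range(0, len(states), 2)]
        states = [reduce(comb_op, pair) for pair in pairs]
    return states[0]

comb_op = lambda a, b: a + b
print(tree_combine([1, 2, 3, 4, 5, 6, 7, 8], comb_op))  # 36
```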
  • 55. #EUds5 Beyond the RDD: Data frames and ML Pipelines
  • 56-59. #EUds5 RDDs: some good parts 47. RDDs are statically typed, so nonsense operations are rejected at compile time; the equivalent data-frame query compiles but fails at runtime:

        val rdd: RDD[String] = /* ... */
        rdd.map(_ * 3.0).collect()    // doesn't compile

        val df: DataFrame = /* data frame with one String-valued column */
        df.select($"_1" * 3.0).show() // crashes at runtime
  • 60-61. #EUds5 RDDs: some good parts 51

        rdd.map { vec =>
          (vec, model.value.closestWithSimilarity(vec))
        }

        val predict = udf((vec: SV) => model.value.closestWithSimilarity(vec))
        df.withColumn("predictions", predict($"features"))
  • 62. #EUds5 RDDs versus query planning 53

        val numbers1 = sc.parallelize(1 to 100000000)
        val numbers2 = sc.parallelize(1 to 1000000000)
        numbers1.cartesian(numbers2)
          .map { case (x, y) => (x, y, expensive(x, y)) }
          .filter { case (x, y, _) => isPrime(x) && isPrime(y) }

  • 63. #EUds5 RDDs versus query planning 54. A query planner could push the filters below the cartesian product automatically; with RDDs you must do it by hand:

        numbers1.filter(isPrime(_))
          .cartesian(numbers2.filter(isPrime(_)))
          .map { case (x, y) => (x, y, expensive(x, y)) }
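The same push-down idea in toy Python form; `is_small_prime` is a stand-in for the slide's isPrime:

```python
from itertools import product

def is_small_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

xs, ys = range(1, 51), range(1, 51)

# naive: build the full cross product, then filter
naive = [(x, y) for x, y in product(xs, ys)
         if is_small_prime(x) and is_small_prime(y)]

# "pushed-down" filters: shrink both sides before the product
px = [x for x in xs if is_small_prime(x)]
py = [y for y in ys if is_small_prime(y)]
pushed = list(product(px, py))

# same result, but the second version enumerates 225 pairs instead of 2500
print(len(naive), len(pushed))
```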
  • 64-67. #EUds5 RDDs and the JVM heap 55

        val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))

    On the JVM heap this 2x2 matrix is an object graph: an outer array (class pointer, flags, size, locks) holding element pointers to two inner arrays, each with its own header before the doubles 1.0, 2.0 and 3.0, 4.0. That's 32 bytes of data and 64 bytes of overhead!
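A rough Python analogue of the overhead story: nested lists carry per-object headers, much like nested JVM arrays, while a flat typed array stores just the packed payload:

```python
import sys
import array

nested = [[1.0, 2.0], [3.0, 4.0]]
flat = array.array("d", [1.0, 2.0, 3.0, 4.0])

# total footprint of the nested structure: outer list + row lists + floats
nested_bytes = (sys.getsizeof(nested)
                + sum(sys.getsizeof(row) for row in nested)
                + sum(sys.getsizeof(x) for row in nested for x in row))
print(nested_bytes)               # well over the 32 bytes of payload
print(flat.itemsize * len(flat))  # 32 bytes of packed doubles
```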
  • 68. #EUds5 ML pipelines: a quick example 59

        from pyspark.ml.clustering import KMeans
        K, SEED = 100, 0xdea110c8
        randomDF = make_random_df()
        kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
        model = kmeans.fit(randomDF)
        withPredictions = model.transform(randomDF).select("x", "y", "prediction")
  • 69-75. #EUds5 Working with ML pipelines. An estimator's fit(df) trains on a data frame and returns a model; the model's transform(df) adds predictions to a data frame. Both are configured by params: e.g., inputCol, epochs, and seed on the estimator, and outputCol on the model.

        estimator.fit(df)
        model.transform(df)
  • 76-81. #EUds5 Defining parameters 65

        private[som] trait SOMParams extends Params
          with DefaultParamsWritable {

          final val x: IntParam = new IntParam(this, "x",
            "width of self-organizing map (>= 1)",
            ParamValidators.gtEq(1))

          final def getX: Int = $(x)

          final def setX(value: Int): this.type = set(x, value)

          // ...
        }
  • 82. #EUds5 Don't repeat yourself 71

        /**
         * Common params for KMeans and KMeansModel
         */
        private[clustering] trait KMeansParams extends Params
          with HasMaxIter with HasFeaturesCol with HasSeed
          with HasPredictionCol with HasTol { /* ... */ }
  • 88-92. #EUds5 Validate and transform at once 73

        def transformSchema(schema: StructType): StructType = {
          // check that the input columns exist...
          require(schema.fieldNames.contains($(featuresCol)))
          // ...and are the proper type
          schema($(featuresCol)) match {
            case sf: StructField =>
              require(sf.dataType.equals(VectorType))
          }
          // ...and that the output columns don't exist
          require(!schema.fieldNames.contains($(predictionCol)))
          require(!schema.fieldNames.contains($(similarityCol)))
          // ...and then make a new schema
          schema.add($(predictionCol), "int")
                .add($(similarityCol), "double")
        }
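The validate-and-transform pattern, sketched over a plain dict "schema"; the column names and type tags here are hypothetical:

```python
def transform_schema(schema, features_col="features",
                     prediction_col="prediction",
                     similarity_col="similarity"):
    # the input column must exist and hold vectors
    assert features_col in schema, f"missing column {features_col}"
    assert schema[features_col] == "vector", "features must be a vector"
    # the output columns must not already exist
    assert prediction_col not in schema
    assert similarity_col not in schema
    # return the schema the transformer will produce
    return {**schema, prediction_col: "int", similarity_col: "double"}

out = transform_schema({"features": "vector", "x": "double"})
print(out)
```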
  • 93. #EUds5 Training on data frames 78

        def fit(examples: DataFrame) = {
          import examples.sparkSession.implicits._
          import org.apache.spark.ml.linalg.{Vector => SV}
          val dfexamples = examples.select($(exampleCol)).rdd.map {
            case Row(sv: SV) => sv
          }
          /* construct a model object with the result of training */
          new SOMModel(train(dfexamples, $(x), $(y)))
        }
  • 95. #EUds5 Retire your visibility hacks 80

        package org.apache.spark.ml.hacks
        object Hacks {
          import org.apache.spark.ml.linalg.VectorUDT
          val vectorUDT = new VectorUDT
        }

  • 96. #EUds5 Retire your visibility hacks 81

        package org.apache.spark.ml.linalg
        /* imports, etc., are elided ... */
        @Since("2.0.0")
        @DeveloperApi
        object SQLDataTypes {
          val VectorType: DataType = new VectorUDT
          val MatrixType: DataType = new MatrixUDT
        }
  • 97. #EUds5 Caching training data 82

        val wasUncached = examples.storageLevel == StorageLevel.NONE
        if (wasUncached) { examples.cache() }
        /* actually train here */
        if (wasUncached) { examples.unpersist() }
  • 98. #EUds5 Improve serial execution times Are you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms. Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication. Seek to limit the number of library invocations you make and thus the time you spend copying data to and from your linear algebra library. 83
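The loop-versus-matvec advice, illustrated with NumPy (the sizes are arbitrary): comparing every map unit to one example is a matrix-vector operation, so a single call replaces a Python-level loop of dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
units = rng.random((1000, 64))   # 1000 units, each a 64-d weight vector
ex = rng.random(64)              # one example

# a loop of dot products, one library call per unit
loop_scores = np.array([u @ ex for u in units])

# one matrix-vector multiplication, a single library call
matvec_scores = units @ ex

assert np.allclose(loop_scores, matvec_scores)
```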
  • 99. #EUds5 Key takeaways There are several techniques you can use to develop parallel implementations of machine learning algorithms. The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you're developing libraries for Spark. As a library developer, you might need to rely on developer APIs and dive into Spark's source code, but things are getting easier with each release! 84