Building machine learning
algorithms on Apache Spark
William Benton (@willb)
Red Hat, Inc.
Session hashtag: #EUds5
Motivation
Forecast
Introducing our case study: self-organizing maps
Parallel implementations for partitioned collections (in particular, RDDs)
Beyond the RDD: data frames and ML pipelines
Practical considerations and key takeaways
Introducing self-organizing maps
Training self-organizing maps

    while t < maxupdates:
        random.shuffle(examples)   # process the training set in random order
        for ex in examples:
            t = t + 1
            if t == maxupdates:
                break
            bestMatch = closest(som[t], ex)
            # sigma(t), the neighborhood size, controls how much of the map
            # around the BMU is affected; alpha(t), the learning rate, controls
            # how much closer to the example each unit gets
            for (unit, wt) in neighborhood(bestMatch, sigma(t)):
                som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt
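The online update rule can be sketched in plain Python, with no Spark involved. The 1-D map, the Gaussian neighborhood weight, and the linearly decaying sigma and alpha schedules below are illustrative assumptions, not the talk's exact implementation:

```python
import math
import random

def train_som(examples, units, maxupdates, sigma0=2.0, alpha0=0.5):
    """Online SOM training: each update moves the best-matching unit
    (BMU) and its neighbors a little closer to one example."""
    units = list(units)
    t = 0
    while t < maxupdates:
        random.shuffle(examples)        # process the training set in random order
        for ex in examples:
            t += 1
            if t > maxupdates:
                break
            # neighborhood size and learning rate both decay over time
            sigma = sigma0 * (1.0 - t / (maxupdates + 1))
            alpha = alpha0 * (1.0 - t / (maxupdates + 1))
            bmu = min(range(len(units)), key=lambda i: abs(units[i] - ex))
            for i, u in enumerate(units):
                wt = math.exp(-((i - bmu) ** 2) / (2.0 * sigma * sigma))
                units[i] = u + (ex - u) * alpha * wt
    return units

random.seed(42)
examples = [0.0, 0.1, 0.9, 1.0]
trained = train_som(examples, units=[0.4, 0.5, 0.6], maxupdates=200)
# the ends of the map get pulled out toward the two clusters of examples
print(min(trained), max(trained))
```

Note that every update depends on the map produced by the previous update, which is exactly the serial-execution problem the next section examines.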
Parallel implementations for
partitioned collections
Historical aside: Amdahl's Law

If p is the fraction of a program that can be parallelized and s is the speedup of that fraction, the overall speedup is S = 1 / ((1 - p) + p/s); as s → ∞, S approaches 1 / (1 - p).
What forces serial execution?

    state[t+1] = combine(state[t], x)

    f1: (T, T) => T
    f2: (T, U) => T

Folding each example into a shared state (the shape of f2) depends on the previous state, so it must run one step at a time; an operation shaped like f1, which combines two values of the same type, is what lets us merge results computed independently on parts of the data.
How can we fix these?

    a ⊕ b = b ⊕ a
    (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

If the combining operation ⊕ is commutative and associative, per-partition partial results can be computed independently and merged in any order. Techniques such as SGD and L-BFGS can be adapted along these lines, and there will be examples of each of these approaches for many problems in the literature and in open-source code!
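These algebraic laws are exactly what lets a reduction be split across partitions: the partial results can be merged in whatever order partitions happen to finish. A plain-Python sketch (the partitioning is simulated, not Spark's):

```python
import random
from functools import reduce

random.seed(42)

def parallel_reduce(partitions, op):
    """Reduce each partition independently, then merge the partial
    results in arbitrary order -- as a distributed runtime would."""
    partials = [reduce(op, part) for part in partitions]
    random.shuffle(partials)            # partitions finish in any order
    return reduce(op, partials)

data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, 100, 25)]

# addition is commutative and associative: always the right answer
assert all(parallel_reduce(partitions, lambda a, b: a + b) == sum(data)
           for _ in range(50))

# subtraction is neither: the "answer" depends on merge order
results = {parallel_reduce(partitions, lambda a, b: a - b) for _ in range(50)}
print(sorted(results))                  # more than one distinct value
```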
Implementing atop RDDs

We'll start with a batch implementation of our technique:

    for t in (1 to iterations):
        state = newState()
        for ex in examples:
            bestMatch = closest(som[t-1], ex)
            hood = neighborhood(bestMatch, sigma(t))
            state.matches += ex * hood
            state.hoods += hood
        som[t] = newSOM(state.matches / state.hoods)

Each batch produces a model that can be averaged with other models, so the inner loop can run over each partition independently. This averaging trick won't always work for every algorithm, though!
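The batch formulation replaces sequential updates with two sums, a weighted sum of examples (`matches`) and a sum of neighborhood weights (`hoods`), that can be accumulated independently and divided at the end. A minimal plain-Python sketch; the 1-D map and Gaussian neighborhood are illustrative assumptions:

```python
import math

def batch_som_step(examples, units, sigma):
    """One batch SOM iteration: accumulate weighted example sums and
    neighborhood-weight sums, then divide to get the new units."""
    matches = [0.0] * len(units)    # state.matches: sum of ex * hood weight
    hoods = [0.0] * len(units)      # state.hoods: sum of hood weights
    for ex in examples:
        bmu = min(range(len(units)), key=lambda i: abs(units[i] - ex))
        for i in range(len(units)):
            wt = math.exp(-((i - bmu) ** 2) / (2.0 * sigma * sigma))
            matches[i] += ex * wt
            hoods[i] += wt
    # newSOM(state.matches / state.hoods)
    return [m / h for m, h in zip(matches, hoods)]

units = [0.25, 0.5, 0.75]
examples = [0.0, 0.1, 0.9, 1.0]
for t in range(10):
    units = batch_som_step(examples, units, sigma=0.5)
print(units)    # the end units are pulled toward the two example clusters
```

Because `matches` and `hoods` are plain sums, each partition can accumulate its own pair and the pairs can simply be added together, which is what makes each batch's model mergeable.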
An implementation template

    var nextModel = initialModel
    for (i <- 0 until iterations) {
      val newState = examples.aggregate(ModelState.empty())(
        // "fold": update the state for this partition with a single new example
        { case (state: ModelState, example: Example) =>
            state.update(nextModel.lookup(example, i), example) },
        // "reduce": combine the states from two partitions
        { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
      )
      nextModel = modelFromState(newState)
    }

Referring to nextModel inside the closures causes the model object to be serialized with the closure and shipped with every task.
An implementation template

    var nextModel = initialModel
    for (i <- 0 until iterations) {
      // broadcast the current working model for this iteration
      val current = sc.broadcast(nextModel)
      val newState = examples.aggregate(ModelState.empty())(
        // get the value of the broadcast variable on each worker
        { case (state: ModelState, example: Example) =>
            state.update(current.value.lookup(example, i), example) },
        { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
      )
      nextModel = modelFromState(newState)
      // remove the stale broadcasted model
      current.unpersist()
    }

Broadcasting fixes the serialization problem, but plain aggregate is still the wrong implementation of the right interface.
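The template's two callbacks, a per-partition fold and a cross-partition combine, can be simulated in plain Python. The ModelState here is just a (sum, count) pair whose "model" is a mean; the names mirror the template, not Spark's API:

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Mimics RDD.aggregate: fold each partition from zero with seq_op,
    then combine the per-partition states with comb_op."""
    states = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, states)

# ModelState: running (sum, count); "fitting" just computes the mean
empty = (0.0, 0)

def update(state, example):             # "fold": one new example
    s, n = state
    return (s + example, n + 1)

def combine(s1, s2):                    # "reduce": merge two partitions
    return (s1[0] + s2[0], s1[1] + s2[1])

partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
total, count = aggregate(partitions, empty, update, combine)
model = total / count
print(model)    # → 3.0
```

In the real template, the current model would be broadcast once per iteration so every task reads one shared copy instead of serializing the model into each closure.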
Implementing on RDDs

[diagrams elided: with aggregate, the driver itself combines every partition's state, one ⊕ at a time; with treeAggregate, workers combine partial states pairwise in a tree, and the driver merges only the final few results]
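The difference can be sketched directly: a linear merge performs all of the combines in one place (the driver), while a tree merge halves the number of pending states at each level, and each level's combines could run in parallel on the workers. A toy version:

```python
def linear_combine(states, op):
    """aggregate-style: one process merges every partition state itself."""
    acc = states[0]
    for s in states[1:]:
        acc = op(acc, s)
    return acc

def tree_combine(states, op):
    """treeAggregate-style: merge pairwise in rounds; each round's
    merges are independent and could run in parallel."""
    while len(states) > 1:
        states = [op(states[i], states[i + 1]) if i + 1 < len(states)
                  else states[i]
                  for i in range(0, len(states), 2)]
    return states[0]

states = list(range(1, 9))              # eight per-partition states
add = lambda a, b: a + b
assert linear_combine(states, add) == tree_combine(states, add) == 36
```

The two agree only because the operation is associative, which is another reason the algebraic laws from earlier matter.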
Beyond the RDD: Data frames and
ML Pipelines
RDDs: some good parts

    val rdd: RDD[String] = /* ... */
    rdd.map(_ * 3.0).collect()      // doesn't compile

    val df: DataFrame = /* data frame with one String-valued column */
    df.select($"_1" * 3.0).show()   // compiles, but crashes at runtime

Because RDDs are statically typed, the nonsensical multiplication is rejected at compile time; the equivalent data-frame query only fails once it runs.
RDDs: some good parts

    rdd.map {
      vec => (vec, model.value.closestWithSimilarity(vec))
    }

    val predict = udf((vec: SV) =>
      model.value.closestWithSimilarity(vec))
    df.withColumn("predictions", predict($"features"))

With RDDs, applying a model is an ordinary function call inside map; with data frames, the same logic has to be wrapped in a UDF before it can be used in a query.
RDDs versus query planning

    val numbers1 = sc.parallelize(1 to 100000000)
    val numbers2 = sc.parallelize(1 to 1000000000)

    // materializes the full cartesian product before filtering
    numbers1.cartesian(numbers2)
      .map { case (x, y) => (x, y, expensive(x, y)) }
      .filter { case (x, y, _) => isPrime(x) && isPrime(y) }

    // far cheaper: filter each side first -- but with RDDs,
    // you must perform this rewrite by hand
    numbers1.filter(isPrime(_))
      .cartesian(numbers2.filter(isPrime(_)))
      .map { case (x, y) => (x, y, expensive(x, y)) }
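The manual rewrite is doing what a query planner does automatically: pushing the filters below the cartesian product so far fewer pairs are ever materialized. A plain-Python analogue with a toy primality test and small ranges:

```python
from itertools import product

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

xs, ys = range(1, 51), range(1, 51)

# naive: build every pair, then filter
naive = [(x, y) for x, y in product(xs, ys)
         if is_prime(x) and is_prime(y)]

# pushed down: filter each side first, then take the product
primes_x = [x for x in xs if is_prime(x)]
primes_y = [y for y in ys if is_prime(y)]
pushed = list(product(primes_x, primes_y))

assert sorted(naive) == sorted(pushed)
# 2500 candidate pairs versus 225 surviving pairs; the gap grows
# quadratically with the input sizes
print(len(list(product(xs, ys))), len(pushed))
```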
RDDs and the Java heap

    val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))

On the JVM heap, this is three array objects: an outer array of element pointers and two inner arrays of doubles. Each object carries a header (class pointer, flags, lock word, size), so the four doubles amount to 32 bytes of data... and 64 bytes of overhead!
ML pipelines: a quick example
from pyspark.ml.clustering import KMeans
K, SEED = 100, 0xdea110c8
randomDF = make_random_df()
kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
model = kmeans.fit(randomDF)
withPredictions = model.transform(randomDF).select("x", "y", "prediction")
Working with ML pipelines

    estimator.fit(df)       model.transform(df)

An estimator is configured by parameters (such as inputCol, epochs, and seed), and its fit method trains on a data frame to produce a model; the model is a transformer whose transform method adds output columns (outputCol) to a data frame.
Defining parameters

    private[som] trait SOMParams extends Params
        with DefaultParamsWritable {

      final val x: IntParam =
        new IntParam(this, "x", "width of self-organizing map (>= 1)",
          ParamValidators.gtEq(1))

      final def getX: Int = $(x)

      final def setX(value: Int): this.type = set(x, value)

      // ...
    }

Each parameter declares a name, a documentation string, and a validator; the getter reads the current value with $(...) and the chainable setter returns this.type.
Don’t repeat yourself
/**
* Common params for KMeans and KMeansModel
*/
private[clustering] trait KMeansParams extends Params
with HasMaxIter with HasFeaturesCol
with HasSeed with HasPredictionCol with HasTol { /* ... */ }
Estimators and transformers

    estimator.fit(df)       model.transform(df)
Validate and transform at once

    def transformSchema(schema: StructType): StructType = {
      // check that the input columns exist...
      require(schema.fieldNames.contains($(featuresCol)))
      // ...and are the proper type
      schema($(featuresCol)) match {
        case sf: StructField => require(sf.dataType.equals(VectorType))
      }
      // ...and that the output columns don't exist
      require(!schema.fieldNames.contains($(predictionCol)))
      require(!schema.fieldNames.contains($(similarityCol)))
      // ...and then make a new schema
      schema.add($(predictionCol), "int")
            .add($(similarityCol), "double")
    }
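transformSchema's checks translate directly into any schema representation. A plain-Python analogue using a dict from column name to type name; the column names and type strings are illustrative:

```python
def transform_schema(schema, features_col="features",
                     prediction_col="prediction", similarity_col="similarity"):
    """Validate the input schema and derive the output schema, mirroring
    transformSchema: inputs must exist with the right type, and the
    output columns must not exist yet."""
    # check that the input columns exist...
    assert features_col in schema, f"missing input column {features_col!r}"
    # ...and are the proper type
    assert schema[features_col] == "vector", "features must be a vector column"
    # ...and that the output columns don't exist
    assert prediction_col not in schema and similarity_col not in schema
    # ...and then make a new schema
    return {**schema, prediction_col: "int", similarity_col: "double"}

out = transform_schema({"id": "long", "features": "vector"})
print(out)
# → {'id': 'long', 'features': 'vector', 'prediction': 'int', 'similarity': 'double'}
```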
Training on data frames
def fit(examples: DataFrame) = {
import examples.sparkSession.implicits._
import org.apache.spark.ml.linalg.{Vector=>SV}
val dfexamples = examples.select($(exampleCol)).rdd.map {
case Row(sv: SV) => sv
}
/* construct a model object with the result of training */
new SOMModel(train(dfexamples, $(x), $(y)))
}
Practical considerations and key takeaways
Retire your visibility hacks

The old trick of declaring your own package inside the org.apache.spark namespace just to reach package-private classes:

    package org.apache.spark.ml.hacks

    object Hacks {
      import org.apache.spark.ml.linalg.VectorUDT
      val vectorUDT = new VectorUDT
    }

is no longer necessary; since Spark 2.0, the vector and matrix types are exposed as developer API:

    package org.apache.spark.ml.linalg

    /* imports, etc., are elided ... */

    @Since("2.0.0")
    @DeveloperApi
    object SQLDataTypes {
      val VectorType: DataType = new VectorUDT
      val MatrixType: DataType = new MatrixUDT
    }
Caching training data
val wasUncached = examples.storageLevel == StorageLevel.NONE
if (wasUncached) { examples.cache() }
/* actually train here */
if (wasUncached) { examples.unpersist() }
Improve serial execution times
Are you repeatedly comparing training data to a model that only changes
once per iteration? Consider caching norms.
Are you doing a lot of dot products in a for loop? Consider replacing
these loops with a matrix-vector multiplication.
Seek to limit the number of library invocations you make and thus the
time you spend copying data to and from your linear algebra library.
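The norm-caching suggestion works because squared Euclidean distance decomposes as |x - u|² = |x|² + |u|² - 2·(x·u): the unit norms |u|² change only when the model does, so they can be computed once per iteration and reused for every example. A small sketch with hypothetical unit vectors:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

units = [[0.0, 1.0], [2.0, 2.0], [5.0, 0.0]]

# computed once per iteration, reused for every example
unit_norms = [dot(u, u) for u in units]

def closest(ex):
    ex_norm = dot(ex, ex)                   # once per example
    # |ex - u|^2 == |ex|^2 + |u|^2 - 2 * (ex . u): only the dot
    # product must be recomputed for each (example, unit) pair
    dists = [ex_norm + unit_norms[i] - 2.0 * dot(ex, u)
             for i, u in enumerate(units)]
    return min(range(len(units)), key=dists.__getitem__)

print(closest([4.0, 0.5]))    # → 2
```

Batching those per-pair dot products into a single matrix-vector multiplication goes one step further, trading many small library calls for one big one.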
Key takeaways
There are several techniques you can use to develop parallel
implementations of machine learning algorithms.
The RDD API may not be your favorite way to interact with Spark as a user,
but it can be extremely valuable if you’re developing libraries for Spark.
As a library developer, you might need to rely on developer APIs and dive
in to Spark’s source code, but things are getting easier with each release!
@willb • willb@redhat.com
https://chapeau.freevariable.com
https://radanalytics.io
Thanks!
subhashenia
 
PDF
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
PPTX
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 
PDF
Research Methodology Overview Introduction
ayeshagul29594
 
PDF
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
PDF
Development and validation of the Japanese version of the Organizational Matt...
Yoga Tokuyoshi
 
PDF
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
PDF
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
 
PPTX
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
 
apidays Helsinki & North 2025 - From Chaos to Clarity: Designing (AI-Ready) A...
apidays
 
The Best NVIDIA GPUs for LLM Inference in 2025.pdf
Tamanna36
 
05_Jelle Baats_Tekst.pptx_AI_Barometer_Release_Event
FinTech Belgium
 
apidays Singapore 2025 - The API Playbook for AI by Shin Wee Chuang (PAND AI)
apidays
 
apidays Singapore 2025 - Designing for Change, Julie Schiller (Google)
apidays
 
apidays Singapore 2025 - The Quest for the Greenest LLM , Jean Philippe Ehre...
apidays
 
tuberculosiship-2106031cyyfuftufufufivifviviv
AkshaiRam
 
1750162332_Snapshot-of-Indias-oil-Gas-data-May-2025.pdf
sandeep718278
 
apidays Singapore 2025 - Surviving an interconnected world with API governanc...
apidays
 
Powerful Uses of Data Analytics You Should Know
subhashenia
 
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
How to Add Columns and Rows in an R Data Frame
subhashenia
 
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
apidays Helsinki & North 2025 - APIs at Scale: Designing for Alignment, Trust...
apidays
 
Research Methodology Overview Introduction
ayeshagul29594
 
OOPs with Java_unit2.pdf. sarthak bookkk
Sarthak964187
 
Development and validation of the Japanese version of the Organizational Matt...
Yoga Tokuyoshi
 
JavaScript - Good or Bad? Tips for Google Tag Manager
📊 Markus Baersch
 
The European Business Wallet: Why It Matters and How It Powers the EUDI Ecosy...
Lal Chandran
 
Listify-Intelligent-Voice-to-Catalog-Agent.pptx
nareshkottees
 

Building Machine Learning Algorithms on Apache Spark with William Benton

  • 1. Building machine learning algorithms on Apache Spark William Benton (@willb) Red Hat, Inc. Session hashtag: #EUds5
  • 6. #EUds5 Forecast Introducing our case study: self-organizing maps Parallel implementations for partitioned collections (in particular, RDDs) Beyond the RDD: data frames and ML pipelines Practical considerations and key takeaways 6
  • 14-17. #EUds5 Training self-organizing maps 12

        while t < maxupdates:
            # process the training set in random order
            random.shuffle(examples)
            for ex in examples:
                t = t + 1
                if t == maxupdates:
                    break
                bestMatch = closest(som[t], ex)
                # the neighborhood size sigma(t) controls how much of the map
                # around the BMU is affected; the learning rate alpha(t)
                # controls how much closer to the example each unit gets
                for (unit, wt) in neighborhood(bestMatch, sigma(t)):
                    som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt
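As a concrete illustration, here is a minimal NumPy sketch of one SOM update step. The function name `som_step`, the Gaussian neighborhood, and the map layout are assumptions for illustration, not the talk's actual code:

```python
import numpy as np

def som_step(som, ex, t, alpha, sigma):
    """One sequential SOM update: find the best-matching unit (BMU) for
    example ex, then move every unit toward ex, weighted by a Gaussian
    neighborhood around the BMU."""
    h, w, d = som.shape
    flat = som.reshape(-1, d)
    # BMU: the unit whose weight vector is closest to the example
    bmu = int(np.argmin(np.linalg.norm(flat - ex, axis=1)))
    bx, by = divmod(bmu, w)
    # Gaussian neighborhood weight for every unit, by grid distance to the BMU
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    dist2 = (rows - bx) ** 2 + (cols - by) ** 2
    wt = np.exp(-dist2 / (2.0 * sigma(t) ** 2))
    # move each unit toward the example, scaled by learning rate and weight
    return som + alpha(t) * wt[:, :, None] * (ex - som)

rng = np.random.default_rng(42)
som = rng.random((5, 5, 3))   # a 5x5 map of 3-dimensional units
ex = rng.random(3)            # one training example
new_som = som_step(som, ex, t=0, alpha=lambda t: 0.5, sigma=lambda t: 1.0)
```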
  • 19. #EUds5 Historical aside: Amdahl's Law 14

        lim (s_p → ∞) S_o = 1 / (1 - p)

    where p is the fraction of the work that can be parallelized and s_p is the speedup of that fraction.
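The limit can be checked numerically; `amdahl_speedup` is a hypothetical helper implementing the full law, S_o = 1 / ((1 - p) + p / s_p):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# With 95% of the work parallelizable, even unbounded parallelism
# cannot beat 1 / (1 - 0.95) = 20x overall.
print(amdahl_speedup(0.95, 10))    # about 6.9x with a 10x parallel speedup
print(amdahl_speedup(0.95, 1e12))  # approaches the 20x ceiling
```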
  • 20-23. #EUds5 What forces serial execution? A loop-carried dependence: computing the next state requires the previous one.

        state[t+1] = combine(state[t], x)

  • 24-26. #EUds5 What forces serial execution? It depends on the shape of the combining function: f1 can merge two partial states, but f2 can only fold one example into an existing state.

        f1: (T, T) => T
        f2: (T, U) => T
  • 27-32. #EUds5 How can we fix these? Require the combining operation ⊕ to be commutative and associative, so partial results can be merged in any order:

        a ⊕ b = b ⊕ a
        (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

    Sometimes that means trading one algorithm for another (e.g., SGD vs. L-BFGS). There will be examples of each of these approaches for many problems in the literature and in open-source code!
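A toy illustration of why commutativity and associativity help: with a merge like the hypothetical running-stats state below, per-partition folds can be combined in any grouping and still give the same answer:

```python
from functools import reduce

# Toy "state": (count, sum). Merging two states is commutative and
# associative, so partitions can be combined in any order.
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, 100, 25)]

# fold each partition independently, then merge the partial states
partials = [reduce(merge, ((1, x) for x in part)) for part in partitions]
count, total = reduce(merge, partials)
print(count, total / count)  # 100 50.5
```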
  • 33-36. #EUds5 Implementing atop RDDs. We'll start with a batch implementation of our technique:

        for t in (1 to iterations):
            state = newState()
            for ex in examples:
                bestMatch = closest(som[t-1], ex)
                hood = neighborhood(bestMatch, sigma(t))
                state.matches += ex * hood
                state.hoods += hood
            som[t] = newSOM(state.matches / state.hoods)

    Each batch produces a model that can be averaged with other models, so the inner loop can run independently on each partition. (This won't always work!)
  • 37-43. #EUds5 An implementation template 30

        var nextModel = initialModel
        for (i <- 0 until iterations) {
          // broadcast the current working model for this iteration
          val current = sc.broadcast(nextModel)
          val newState = examples.aggregate(ModelState.empty())(
            // "fold": update the state for this partition with a single new example
            { case (state: ModelState, example: Example) =>
                state.update(current.value.lookup(example, i), example) },
            // "reduce": combine the states from two partitions
            { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
          )
          nextModel = modelFromState(newState)
          // remove the stale broadcast model
          current.unpersist
        }

    Referring to nextModel directly inside the closures would cause the model object to be serialized with the closure; broadcasting it and reading current.value avoids that. Even with the broadcast, plain aggregate is "the wrong implementation of the right interface."
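The aggregate contract can be sketched in plain Python; the names `seq_op` and `comb_op` are assumptions mirroring RDD.aggregate's fold and reduce functions:

```python
from functools import reduce

# Minimal sketch of the aggregate pattern: fold each partition with
# seq_op starting from a fresh zero state, then merge the per-partition
# states with comb_op -- the same contract as RDD.aggregate.
def aggregate(partitions, zero, seq_op, comb_op):
    partials = [reduce(seq_op, part, zero()) for part in partitions]
    return reduce(comb_op, partials, zero())

# toy "model state": count and sum of examples seen
zero = lambda: (0, 0.0)
seq_op = lambda state, ex: (state[0] + 1, state[1] + ex)
comb_op = lambda s1, s2: (s1[0] + s2[0], s1[1] + s2[1])

parts = [[1.0, 2.0], [3.0, 4.0], [5.0]]
count, total = aggregate(parts, zero, seq_op, comb_op)
print(count, total)  # 5 15.0
```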
  • 44-51. #EUds5 Implementing on RDDs [aggregate vs. treeAggregate diagrams] With aggregate, every partition's state is shipped to the driver, which performs all of the combines itself. With treeAggregate, workers combine partial states in rounds, so the driver only has to merge the last few partial results.
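A sketch of the tree-shaped combining that treeAggregate performs, as pairwise rounds; this is an illustration of the idea, not Spark's actual scheduling:

```python
from functools import reduce

# Merge partial states pairwise in rounds, so no single node has to
# combine all of them at once.
def tree_combine(states, comb_op):
    while len(states) > 1:
        pairs = [states[i:i + 2] for i in range(0, len(states), 2)]
        states = [reduce(comb_op, pair) for pair in pairs]
    return states[0]

comb_op = lambda a, b: a + b
print(tree_combine([1, 2, 3, 4, 5, 6, 7, 8], comb_op))  # 36
```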
  • 55. #EUds5 Beyond the RDD: Data frames and ML Pipelines
  • 56-59. #EUds5 RDDs: some good parts 47. RDDs are statically typed, so nonsense operations are rejected at compile time; the equivalent data-frame query compiles but fails at runtime:

        val rdd: RDD[String] = /* ... */
        rdd.map(_ * 3.0).collect()    // doesn't compile

        val df: DataFrame = /* data frame with one String-valued column */
        df.select($"_1" * 3.0).show() // crashes at runtime
  • 60-61. #EUds5 RDDs: some good parts 51

        rdd.map { vec =>
          (vec, model.value.closestWithSimilarity(vec))
        }

        val predict = udf((vec: SV) => model.value.closestWithSimilarity(vec))
        df.withColumn("predictions", predict($"features"))
  • 62. #EUds5 RDDs versus query planning 53

        val numbers1 = sc.parallelize(1 to 100000000)
        val numbers2 = sc.parallelize(1 to 1000000000)
        numbers1.cartesian(numbers2)
          .map { case (x, y) => (x, y, expensive(x, y)) }
          .filter { case (x, y, _) => isPrime(x) && isPrime(y) }

  • 63. #EUds5 RDDs versus query planning 54. A query planner could push the filters below the cartesian product automatically; with RDDs you must do it by hand:

        numbers1.filter(isPrime(_))
          .cartesian(numbers2.filter(isPrime(_)))
          .map { case (x, y) => (x, y, expensive(x, y)) }
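The same push-down idea in toy Python form; `is_small_prime` is a stand-in for the slide's isPrime:

```python
from itertools import product

def is_small_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

xs, ys = range(1, 51), range(1, 51)

# naive: build the full cross product, then filter
naive = [(x, y) for x, y in product(xs, ys)
         if is_small_prime(x) and is_small_prime(y)]

# "pushed-down" filters: shrink both sides before the product
px = [x for x in xs if is_small_prime(x)]
py = [y for y in ys if is_small_prime(y)]
pushed = list(product(px, py))

# same result, but the second version enumerates 225 pairs instead of 2500
print(len(naive), len(pushed))
```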
  • 64-67. #EUds5 RDDs and the JVM heap 55

        val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))

    On the JVM heap this 2x2 matrix is an object graph: an outer array (class pointer, flags, size, locks) holding element pointers to two inner arrays, each with its own header before the doubles 1.0, 2.0 and 3.0, 4.0. That's 32 bytes of data and 64 bytes of overhead!
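A rough Python analogue of the overhead story: nested lists carry per-object headers, much like nested JVM arrays, while a flat typed array stores just the packed payload:

```python
import sys
import array

nested = [[1.0, 2.0], [3.0, 4.0]]
flat = array.array("d", [1.0, 2.0, 3.0, 4.0])

# total footprint of the nested structure: outer list + row lists + floats
nested_bytes = (sys.getsizeof(nested)
                + sum(sys.getsizeof(row) for row in nested)
                + sum(sys.getsizeof(x) for row in nested for x in row))
print(nested_bytes)               # well over the 32 bytes of payload
print(flat.itemsize * len(flat))  # 32 bytes of packed doubles
```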
  • 68. #EUds5 ML pipelines: a quick example 59

        from pyspark.ml.clustering import KMeans
        K, SEED = 100, 0xdea110c8
        randomDF = make_random_df()
        kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
        model = kmeans.fit(randomDF)
        withPredictions = model.transform(randomDF).select("x", "y", "prediction")
  • 69-75. #EUds5 Working with ML pipelines. An estimator's fit(df) trains on a data frame and returns a model; the model's transform(df) adds predictions to a data frame. Both are configured by params: e.g., inputCol, epochs, and seed on the estimator, and outputCol on the model.

        estimator.fit(df)
        model.transform(df)
  • 76-81. #EUds5 Defining parameters 65

        private[som] trait SOMParams extends Params
          with DefaultParamsWritable {

          final val x: IntParam = new IntParam(this, "x",
            "width of self-organizing map (>= 1)",
            ParamValidators.gtEq(1))

          final def getX: Int = $(x)

          final def setX(value: Int): this.type = set(x, value)

          // ...
        }
  • 82. #EUds5 Don't repeat yourself 71

        /**
         * Common params for KMeans and KMeansModel
         */
        private[clustering] trait KMeansParams extends Params
          with HasMaxIter with HasFeaturesCol with HasSeed
          with HasPredictionCol with HasTol { /* ... */ }
  • 88-92. #EUds5 Validate and transform at once 73

        def transformSchema(schema: StructType): StructType = {
          // check that the input columns exist...
          require(schema.fieldNames.contains($(featuresCol)))
          // ...and are the proper type
          schema($(featuresCol)) match {
            case sf: StructField =>
              require(sf.dataType.equals(VectorType))
          }
          // ...and that the output columns don't exist
          require(!schema.fieldNames.contains($(predictionCol)))
          require(!schema.fieldNames.contains($(similarityCol)))
          // ...and then make a new schema
          schema.add($(predictionCol), "int")
                .add($(similarityCol), "double")
        }
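The validate-and-transform pattern, sketched over a plain dict "schema"; the column names and type tags here are hypothetical:

```python
def transform_schema(schema, features_col="features",
                     prediction_col="prediction",
                     similarity_col="similarity"):
    # the input column must exist and hold vectors
    assert features_col in schema, f"missing column {features_col}"
    assert schema[features_col] == "vector", "features must be a vector"
    # the output columns must not already exist
    assert prediction_col not in schema
    assert similarity_col not in schema
    # return the schema the transformer will produce
    return {**schema, prediction_col: "int", similarity_col: "double"}

out = transform_schema({"features": "vector", "x": "double"})
print(out)
```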
  • 93. #EUds5 Training on data frames 78

        def fit(examples: DataFrame) = {
          import examples.sparkSession.implicits._
          import org.apache.spark.ml.linalg.{Vector => SV}
          val dfexamples = examples.select($(exampleCol)).rdd.map {
            case Row(sv: SV) => sv
          }
          /* construct a model object with the result of training */
          new SOMModel(train(dfexamples, $(x), $(y)))
        }
  • 95. #EUds5 Retire your visibility hacks 80

        package org.apache.spark.ml.hacks
        object Hacks {
          import org.apache.spark.ml.linalg.VectorUDT
          val vectorUDT = new VectorUDT
        }

  • 96. #EUds5 Retire your visibility hacks 81

        package org.apache.spark.ml.linalg
        /* imports, etc., are elided ... */
        @Since("2.0.0")
        @DeveloperApi
        object SQLDataTypes {
          val VectorType: DataType = new VectorUDT
          val MatrixType: DataType = new MatrixUDT
        }
  • 97. #EUds5 Caching training data 82

        val wasUncached = examples.storageLevel == StorageLevel.NONE
        if (wasUncached) { examples.cache() }
        /* actually train here */
        if (wasUncached) { examples.unpersist() }
  • 98. #EUds5 Improve serial execution times Are you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms. Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication. Seek to limit the number of library invocations you make and thus the time you spend copying data to and from your linear algebra library. 83
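The loop-versus-matvec advice, illustrated with NumPy (the sizes are arbitrary): comparing every map unit to one example is a matrix-vector operation, so a single call replaces a Python-level loop of dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
units = rng.random((1000, 64))   # 1000 units, each a 64-d weight vector
ex = rng.random(64)              # one example

# a loop of dot products, one library call per unit
loop_scores = np.array([u @ ex for u in units])

# one matrix-vector multiplication, a single library call
matvec_scores = units @ ex

assert np.allclose(loop_scores, matvec_scores)
```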
  • 99. #EUds5 Key takeaways There are several techniques you can use to develop parallel implementations of machine learning algorithms. The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you're developing libraries for Spark. As a library developer, you might need to rely on developer APIs and dive into Spark's source code, but things are getting easier with each release! 84