No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin Kulka and Michał Kaczmarczyk

#EUai9
Marcin Kulka and Michał Kaczmarczyk
9LivesData
Oct/26/2017

Who are we?
• Marcin Kulka – Senior Software Engineer
• Michał Kaczmarczyk (Ph.D.) – Software Architect, Team Leader and Project Manager

Who are we?
• Advanced software R&D company (Warsaw, Poland)
• 75+ scientists and software engineers
• Specializing in scalable storage, distributed and big data systems
• Cooperating with partners all around the world

• Masato Asahara (Ph.D.) – Researcher, NEC Data Science Research Laboratory
• Ryohei Fujimaki (Ph.D.) – Research Fellow, NEC Data Science Research Laboratory

Agenda
• Typical use case for predictive modeling problem
• Our technology - Automatic Predictive Modeling
• Design challenges
• Evaluation results
• Our observations

Predictive analysis in industry and business
• Driver risk assessment
• Inventory optimization
• Churn retention
• Predictive maintenance
• Product price optimization
• Sales optimization
• Energy/water operation management

... but Predictive Modeling
• Takes a long time
• Requires high skills

Typical predictive modeling use case
• Training data and validation data → predictive models
• Predictive models applied to test data → highly accurate prediction results

Predictive model design
• Algorithm selection: accuracy vs. transparency (black box or white box)
• Hyperparameter tuning: best balance
• Feature selection: determining a set of features
– Sales = f(Price, Location) or Sales = f(Price, Weather)
• A lot of effort, many models...
• Many iterations, weeks...
• Sophisticated knowledge...

Automatic predictive modeling
• Algorithm selection (black box or white box)
• Feature selection
• Best balance
• Highly accurate results in a short time!

Exploring massive modeling possibilities
• Data preprocessing strategies
• Algorithms
• Feature selection
• Hyperparameter tuning
• 1000s of models!
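
To illustrate how the number of candidate models grows, here is a minimal Python sketch that enumerates a small search space. The preprocessing names, feature sets, algorithms and hyperparameter grids are all hypothetical placeholders, not the ones used in the talk.

```python
from itertools import product

# Hypothetical search space; every name and value below is illustrative only.
PREPROCESSING = ["raw", "standardized", "binned"]
FEATURE_SETS = [
    ("price", "location"),
    ("price", "weather"),
    ("price", "location", "weather"),
]
ALGORITHMS = {
    "logistic_regression": {"C": [0.01, 0.1, 1.0, 10.0]},
    "decision_tree": {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 10]},
}


def enumerate_candidates():
    """Yield one candidate per (preprocessing, features, algorithm, hyperparameters)."""
    for prep, features in product(PREPROCESSING, FEATURE_SETS):
        for algo, grid in ALGORITHMS.items():
            names = sorted(grid)
            # Cartesian product over this algorithm's hyperparameter grid.
            for values in product(*(grid[n] for n in names)):
                yield {
                    "preprocessing": prep,
                    "features": features,
                    "algorithm": algo,
                    "hyperparameters": dict(zip(names, values)),
                }


candidates = list(enumerate_candidates())
print(len(candidates))  # 3 * 3 * (4 + 6) = 90
```

Even this toy space yields 90 candidates; realistic grids reach thousands quickly.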

Automating and accelerating with Spark
• Data preprocessing strategies
• Algorithms
• Feature selection
• Hyperparameter tuning
• Complete in hours!
Modeling flow = training + validation
• Training data → training models → models
• Models + validation data + validation criteria → validating models → best model

Modeling and prediction flow
• Best model + test data → prediction → best prediction
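
The flow above can be sketched in a few lines of plain Python. This is a toy illustration, assuming a one-feature threshold classifier and accuracy as the validation criterion; the function names (`train`, `accuracy`, `select_best`) and the data are made up for the example.

```python
def train(params, training_data):
    """Toy 'training': a threshold classifier placed between the class means."""
    pos = [x for x, y in training_data if y == 1]
    neg = [x for x, y in training_data if y == 0]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    # params["w"] moves the threshold between the two class means.
    threshold = mean_neg + params["w"] * (mean_pos - mean_neg)
    return lambda x: 1 if x > threshold else 0


def accuracy(model, data):
    """Validation criterion: fraction of correctly predicted labels."""
    return sum(model(x) == y for x, y in data) / len(data)


def select_best(param_grid, training_data, validation_data, criterion=accuracy):
    """Train one model per parameter set and keep the best one on validation data."""
    models = [(p, train(p, training_data)) for p in param_grid]
    return max(models, key=lambda pm: criterion(pm[1], validation_data))


training = [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)]
validation = [(2.5, 0), (3.5, 0), (5.0, 1), (6.5, 1)]
test = [1.5, 4.5, 8.0]

best_params, best_model = select_best(
    [{"w": 0.1}, {"w": 0.5}, {"w": 0.9}], training, validation
)
predictions = [best_model(x) for x in test]  # prediction with the best model only
```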

Design challenges and solutions

Challenges in achieving high execution performance
• Using native ML engines in Spark
• Parameter-aware scheduling
• Predictive work balancing

Using native ML engines in Spark
Why?

Comparison of Spark and native ML engines
• Scalability
– Spark (+ Spark ML): yes
– Native ML engines: no (or very limited)
• Choice of algorithms (relevant for accuracy)
– Spark (+ Spark ML): some
– Native ML engines: many (+ possibly some custom, very efficient)
• Performance
– Spark (+ Spark ML): medium (distributed nature, synchronization overhead)
– Native ML engines: extremely high, if data fits a single server
• We would like to combine Spark and ML engines
Combining Spark and ML engines for training
• Training data (parquet) on HDFS
• Data preprocessing (MapReduce)
• Converting to RDD[Matrix]
– RDD of huge, efficiently stored objects optimized for ML computations
• Machine learning (map operation)
– "Single ML engine" on a single executor
– Input requirements: size & format
• 1000s of models stored on HDFS
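
A rough, Spark-free Python sketch of the idea: pack many small records into a few large dense objects, then apply a training function to each with a plain `map`. The `DenseMatrix` class, `to_matrices` and `native_engine_train` are hypothetical stand-ins; the real system uses an RDD of matrices and native ML engines, whose interfaces are not shown in the slides.

```python
from array import array
from dataclasses import dataclass


@dataclass
class DenseMatrix:
    """Row-major matrix backed by one contiguous buffer of doubles."""
    rows: int
    cols: int
    values: array

    @classmethod
    def from_records(cls, records):
        cols = len(records[0])
        flat = array("d", (v for rec in records for v in rec))
        return cls(len(records), cols, flat)

    def column_mean(self, j):
        # Every cols-th value, starting at offset j, belongs to column j.
        return sum(self.values[j::self.cols]) / self.rows


def to_matrices(records, rows_per_matrix):
    """Group per-record rows into a few large matrices (stand-in for RDD[Matrix])."""
    return [
        DenseMatrix.from_records(records[i:i + rows_per_matrix])
        for i in range(0, len(records), rows_per_matrix)
    ]


def native_engine_train(matrix):
    """Placeholder for a native ML engine call; here it only returns column means."""
    return [matrix.column_mean(j) for j in range(matrix.cols)]


records = [(1.0, 10.0), (3.0, 30.0), (5.0, 50.0), (7.0, 70.0)]
matrices = to_matrices(records, rows_per_matrix=2)
models = list(map(native_engine_train, matrices))  # the "map operation"
```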

Combining Spark and ML engines for validation
• Validation data (parquet) on HDFS
• Data preprocessing (MapReduce)
• Converting to RDD[Matrix]
• Prediction (map operation): computing validation results for many models
• Validation (MapReduce): computing validation scores
• Best model stored on HDFS

Combining Spark and ML engines for prediction
• Test data (parquet) on HDFS
• Data preprocessing (MapReduce)
• Convert to RDD[Matrix]
• Predict (map operation): computations only for selected models
• Prediction results (parquet) stored on HDFS

Many models to schedule
• Matrices: X1, X2, X3
• Parameters θ1, θ2, θ3 ...: algorithms, hyperparameters, data preprocessing strategies
• Machine learning runs on matrices with parameters

Naive scheduling
• Parameter θ1 is run against matrices X1, X2 and X3, each needing load & convert
• Waste of memory
• Frequent data loading from other servers
• Frequent data-to-matrix conversion

Parameter-aware scheduling
• Parameters θ1, θ2 and θ3 are run against the same matrix X1
• Efficient memory usage
• Infrequent data loading from other servers
• Infrequent data-to-matrix conversion
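
The difference between the two orderings can be made concrete with a small Python sketch that counts how often one executor has to load and convert a matrix. The single-slot cache and the `count_loads` helper are assumptions made for the illustration, not a description of the actual scheduler.

```python
from itertools import product


def count_loads(tasks, cache_size=1):
    """Count matrix loads for one executor that processes tasks in order.

    The executor keeps at most `cache_size` matrices in memory (LRU eviction).
    """
    cache, loads = [], 0
    for matrix_id, _param in tasks:
        if matrix_id in cache:
            cache.remove(matrix_id)  # refresh its LRU position below
        else:
            loads += 1               # load & convert needed
            if len(cache) == cache_size:
                cache.pop(0)         # evict the least recently used matrix
        cache.append(matrix_id)
    return loads


matrices = ["X1", "X2", "X3"]
params = ["theta1", "theta2", "theta3"]

# Naive order: iterate parameters first, so the matrix changes on every task.
naive = [(m, p) for p, m in product(params, matrices)]
# Parameter-aware order: run all parameters for one matrix before moving on.
parameter_aware = [(m, p) for m, p in product(matrices, params)]

print(count_loads(naive))            # 9
print(count_loads(parameter_aware))  # 3
```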

Machine learning – the most work-intensive and time-consuming part
• Training data (parquet) on HDFS → data preprocessing (MapReduce) → convert to matrix → machine learning (map operation) → 1000s of models on HDFS
• We must ensure a good balance of parallelized work

Naive balancing of models to compute
• Complicated models: 5 min + 5 min on one executor
• Decision tree models: 1 min + 1 min on another executor, then wait 8 min...

Predictive balancing
• Balancing complex and simple models (based on previous estimation)
• Complex models first
• Each executor: 5 min + 1 min
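
A minimal Python sketch of the two strategies, using the 5 min / 1 min figures from the slides as estimated training times. The greedy "longest first, least loaded executor next" rule is one common way to implement such balancing and is an assumption here; the slides do not specify the exact algorithm.

```python
import heapq


def makespan(assignments):
    """Total time until the slowest executor finishes."""
    return max(sum(costs) for costs in assignments)


def naive_balance(estimates, workers):
    """Split the task list into contiguous, equally sized chunks."""
    size = -(-len(estimates) // workers)  # ceiling division
    return [estimates[i:i + size] for i in range(0, len(estimates), size)]


def predictive_balance(estimates, workers):
    """Complex models first; each goes to the currently least loaded executor."""
    heap = [(0, w) for w in range(workers)]  # (current load, executor id)
    assignments = [[] for _ in range(workers)]
    for cost in sorted(estimates, reverse=True):
        load, w = heapq.heappop(heap)
        assignments[w].append(cost)
        heapq.heappush(heap, (load + cost, w))
    return assignments


# Estimated minutes: two complicated models, two decision tree models.
estimates = [5, 5, 1, 1]
print(makespan(naive_balance(estimates, 2)))       # 10
print(makespan(predictive_balance(estimates, 2)))  # 6
```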

Evaluation – targeting Top-10%
• Prediction problem
– Comparing Top-10% precision of targeting potential positive samples
• Comparing with manual predictive modeling
– Done with scikit-learn v0.18.1
– Selected algorithms (Logistic Regression, SVM, Random Forests)
– Selected preprocessing strategies
– All parameters of algorithms set to default values, except Random Forest (n_estimators = 200)
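
One plausible reading of the Top-10% precision metric, sketched in Python: rank samples by predicted score, take the top 10%, and measure the share of true positives among them. The function name and the example data are hypothetical, and the exact definition used in the evaluation may differ.

```python
def top_fraction_precision(scores, labels, fraction=0.10):
    """Precision among the `fraction` of samples with the highest scores."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))  # at least one sample is targeted
    top = ranked[:k]
    return sum(label for _, label in top) / k


# 20 samples, so the top 10% contains 2 samples.
scores = list(range(20))       # higher score = more likely positive
labels = [0] * 18 + [0, 1]     # only the highest-scored sample is positive
print(top_fraction_precision(scores, labels))  # 0.5
```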

Evaluation – data sets
• KDDCUP 2014 competition data
– 557K records for training and validation data
– 62K records for test data
– Features: 500
• KDDCUP 2015 competition data
– 108K records for training and validation data
– 12K records for test data
– Features: 500
• IJCAI 2015 competition data
– 87K records for training, validation and test data
– Features: 500

Evaluation – cluster specification
• Scalable Modular Server (DX2000)
• Size: 3U
• Server modules: 34
• CPU: 272 cores (Intel Xeon D 2.1GHz)
– 128 cores used in the evaluation
• RAM: 2TB
• Storage: 34TB SSD
• Internal network: 10GbE
• Spark v1.6.0, Hadoop v2.7.3

Evaluation results and conclusions
Top-10% precision results

Data          Our technology   Logistic regression   SVM     Random Forests
KDDCUP 2014   15.6%            13.5%                 12.0%   14.8%
KDDCUP 2015   97.1%            95.5%                 93.1%   97.2%
IJCAI 2015    8.2%             8.3%                  8.1%    8.2%

• Competitive results with good accuracy

• Short execution time
• Full automation of the whole process
• Handling data of any size

Data          Execution time (our technology)
KDDCUP 2014   172 minutes
KDDCUP 2015   45 minutes
IJCAI 2015    36 minutes

Our observations
• Using RDD of huge but compact objects
optimized for ML computations
• Limiting execution time overhead in tests on
YARN
• Stable execution on YARN

RDD[DenseMatrix]
• Training data (parquet) on HDFS → data preprocessing (MapReduce) → converting to RDD[DenseMatrix] → machine learning (map operation) → 1000s of models on HDFS
• Spark used for parallelization
• All the data needed for a single execution kept without memory overhead
• Performance-critical operations executed:
– On objects with optimized linear algebra operations
– By fast native ML algorithms

Limiting execution overhead in tests
• Submitting a Spark application takes time
– Spark submit → test → Spark submit → test → Spark submit → test
• We submit only once
– Spark submit → test → test → test

Stable execution on YARN
• Default configuration sometimes fails with not enough memory
• Spark Web UI:
• Giving Spark much more memory, but the application still fails
• Known problem in Spark

• JVM system memory suddenly spikes above the YARN limit (*)
[Figure: memory (GB) over time; a spike of JVM system memory usage exceeds the YARN limit (6 GB)]
(*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit”, Spark Summit 2016.

• Tip: spark.yarn.executor.memoryOverhead should be carefully configured
• Recommended overhead: 6-10%
• 15% overhead required in our case
• Must be thoroughly investigated
(http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
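
As a rough illustration of the tip, the Python sketch below computes an overhead value in MiB from a percentage and formats it as `spark-submit` arguments. The 5120 MiB executor size and the helper names are assumptions for the example; the 384 MiB floor follows the minimum stated in the Spark "Running on YARN" documentation for this setting.

```python
MIN_OVERHEAD_MIB = 384  # documented minimum for spark.yarn.executor.memoryOverhead


def overhead_mib(executor_memory_mib, overhead_percent):
    """Overhead in MiB for a given executor heap size, rounded up."""
    wanted = (executor_memory_mib * overhead_percent + 99) // 100
    return max(MIN_OVERHEAD_MIB, wanted)


def submit_conf(executor_memory_mib, overhead_percent):
    """Build the memory-related spark-submit arguments (value is given in MiB)."""
    overhead = overhead_mib(executor_memory_mib, overhead_percent)
    return [
        "--executor-memory", f"{executor_memory_mib}m",
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead}",
    ]


# Hypothetical executor with a 5120 MiB heap.
print(overhead_mib(5120, 10))  # 512
print(overhead_mib(5120, 15))  # 768
print(" ".join(submit_conf(5120, 15)))
```

With these example numbers, heap plus overhead is 5120 + 768 = 5888 MiB per YARN container; the right percentage still has to be found by measuring the actual application.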

Summary
• Predictive modeling problem
– Requires sophisticated knowledge
– Takes a long time
• Our technology: Automatic Predictive Modeling
– Combines Spark with native ML engines
– Fully automates the whole process
– Provides highly accurate results
– Takes at most hours
– Handles data of any size

Future work
• Extending to other models (e.g. deep learning)
• Speeding up with GPUs
• Reducing YARN memory overhead
