#EUai9
Marcin Kulka and Michał Kaczmarczyk
9LivesData
Oct/26/2017
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark
Who are we?
• Marcin Kulka – Senior Software Engineer
• Michał Kaczmarczyk (Ph.D.) – Software Architect, Team Leader and Project Manager
Who are we?
• Advanced software R&D company (Warsaw, Poland)
• 75+ scientists and software engineers
• Specializing in scalable storage, distributed and big data systems
• Cooperating with partners all around the world
• Masato Asahara (Ph.D.) - Researcher, NEC Data Science Research Laboratory
• Ryohei Fujimaki (Ph.D.) - Research Fellow, NEC Data Science Research Laboratory
Agenda
• Typical use case for predictive modeling problem
• Our technology - Automatic Predictive Modeling
• Design challenges
• Evaluation results
• Our observations
Motivation
Predictive analysis in industry and business
• Driver risk assessment
• Inventory optimization
• Churn retention
• Predictive maintenance
• Product price optimization
• Sales optimization
• Energy/water operation management
... but predictive modeling
• Takes a long time
• Requires advanced skills
Typical predictive modeling use case
• Training Data, Validation Data, Test Data → Predictive models → Highly accurate prediction results
Predictive model design
• Algorithm selection: accuracy vs. transparency (black box vs. white box)
• Hyperparameter tuning: finding the best balance
• Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
• A lot of effort, many models…
• Many iterations, weeks...
• Sophisticated knowledge...
Automatic predictive modeling
• Algorithm selection: accuracy vs. transparency (black box vs. white box)
• Hyperparameter tuning: finding the best balance
• Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
• Highly accurate results in a short time!
Our technology
Exploring massive modeling possibilities
• Data preprocessing strategies
• Algorithms
• Feature selection!
• Hyperparameter tuning
• 1000s of models!
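To make the "1000s of models" figure concrete, here is a minimal sketch of how the four choices multiply. The preprocessing names, algorithms, hyperparameter grids and feature names are all hypothetical and chosen only for illustration; the slides do not list the actual search space.

```python
from itertools import combinations, product

# Hypothetical search space; every name and value below is illustrative.
preprocessing = ["none", "standardize", "binarize"]
algorithms = {
    "logistic_regression": {"C": [0.01, 0.1, 1.0, 10.0]},
    "decision_tree": {"max_depth": [3, 5, 8], "min_leaf": [1, 10]},
}
features = ["price", "location", "weather"]


def feature_subsets(names):
    """Yield every non-empty subset of the feature names."""
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


def candidate_models():
    """Yield one configuration per combination of the four choices."""
    for prep, (algo, grid) in product(preprocessing, algorithms.items()):
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            for subset in feature_subsets(features):
                yield {
                    "preprocessing": prep,
                    "algorithm": algo,
                    "hyperparameters": dict(zip(keys, values)),
                    "features": subset,
                }


models = list(candidate_models())
# 3 preprocessing strategies * (4 + 6) hyperparameter settings * 7 feature subsets
print(len(models))  # 210
```

Even this tiny space yields 210 candidates; with realistic grids and hundreds of features the count quickly reaches the thousands mentioned on the slide.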
Automating and accelerating with Spark
• Data preprocessing strategies, algorithms, feature selection and hyperparameter tuning
• Complete in hours!
Modeling flow = training + validation
• Training data → Training models → Models
• Models + Validation data + Validation criteria → Validating models → Best model
Modeling and prediction flow
• Training data → Training models → Models
• Models + Validation data + Validation criteria → Validating models → Best model
• Best model + Test data → Prediction → Best prediction
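The flow above can be sketched in a few lines. This is a toy illustration under stated assumptions: the "models" are hypothetical one-feature threshold rules, the validation criterion is plain accuracy, and the data is made up. None of this is the authors' implementation.

```python
def train(train_rows, feature):
    """Fit a one-feature threshold rule: predict 1 when the value exceeds
    the midpoint between the two class means seen in the training data."""
    pos = [x[feature] for x, y in train_rows if y == 1]
    neg = [x[feature] for x, y in train_rows if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return {"feature": feature, "threshold": threshold}


def predict(model, x):
    return 1 if x[model["feature"]] > model["threshold"] else 0


def accuracy(model, rows):
    """Validation criterion (an assumption): fraction of correct predictions."""
    return sum(predict(model, x) == y for x, y in rows) / len(rows)


# Made-up data: (features, label) pairs for training and validation.
training = [((1.0, 5.0), 0), ((2.0, 1.0), 0), ((8.0, 4.0), 1), ((9.0, 2.0), 1)]
validation = [((1.5, 3.0), 0), ((8.5, 3.0), 1)]
test = [(0.5, 2.0), (9.5, 2.0)]

models = [train(training, f) for f in (0, 1)]              # training models
best = max(models, key=lambda m: accuracy(m, validation))  # validating models
predictions = [predict(best, x) for x in test]             # prediction
print(best["feature"], predictions)  # 0 [0, 1]
```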
Design challenges and solutions
Challenges in achieving high execution performance
• Using native ML engines in Spark
• Parameter-aware scheduling (θ1, θ2, θ3)
• Predictive work balancing
Using native ML engines in Spark
Why?
Comparison of Spark and native ML engines
Criterion | Spark (+ Spark ML) | Native ML engines
Scalability | Yes | No (or very limited)
Choice of algorithms | Some | Many (+ possibly some custom, very efficient)
Performance | Medium | Extremely high
• Choice of algorithms: accuracy
• Spark performance: distributed nature, synchronization overhead
• Native ML engine performance: if data fits a single server
• We would like to combine Spark and ML engines
Combining Spark and ML engines for training
• Training data (parquet) is read from HDFS
• Data preprocessing (MapReduce)
• Converting to RDD[Matrix]: RDD of huge, efficiently stored objects optimized for ML computations!!!
• Machine Learning (map operation): ’Single ML engine’ on a single executor; input requirements: size & format
• 1000s of models are written to HDFS
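The training pipeline can be mimicked without Spark to show the shape of the computation. In this sketch plain Python lists stand in for RDD partitions, `to_matrix` is a hypothetical conversion step, and `train_native` is a toy stand-in for a native ML engine (a one-feature ridge regression through the origin). How the real system splits data across matrices is not stated in the slides.

```python
from array import array
from itertools import product


def to_matrix(rows):
    """Pack a partition's rows into one row-major block of doubles."""
    n_cols = len(rows[0])
    data = array("d", (v for row in rows for v in row))
    return {"n_rows": len(rows), "n_cols": n_cols, "data": data}


def train_native(matrix, ridge):
    """Toy stand-in for a native ML engine: fit y ~ w * x with an L2
    penalty, reading x from column 0 and y from column 1."""
    data, n_cols = matrix["data"], matrix["n_cols"]
    xs = data[0::n_cols]
    ys = data[1::n_cols]
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return {"ridge": ridge, "w": sxy / (sxx + ridge)}


rows = [(float(i), 2.0 * i) for i in range(1, 9)]  # preprocessed rows
partitions = [rows[:4], rows[4:]]                  # stand-in for RDD partitions
matrices = [to_matrix(p) for p in partitions]      # stand-in for RDD[Matrix]
ridges = [0.0, 1.0, 10.0]                          # hypothetical parameters

# The "map operation": one engine call per (matrix, parameter) pair.
models = [train_native(m, r) for m, r in product(matrices, ridges)]
print(len(models), models[0]["w"])  # 6 2.0
```

In Spark the final list comprehension would correspond to a `map` over the RDD of matrices, so each engine call runs inside a single executor on data that is already in the format the engine expects.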
Combining Spark and ML engines for validation
• Validation data (parquet) is read from HDFS
• Data preprocessing (MapReduce)
• Converting to RDD[Matrix]
• Prediction (map operation): computing validation results for many models
• Validation (MapReduce): computing validation scores
• The best model is written to HDFS
Combining Spark and ML engines for prediction
• Test data (parquet) is read from HDFS
• Data preprocessing (MapReduce)
• Convert to RDD[Matrix]
• Predict (map operation): computations only for selected models
• Prediction results (parquet) are written to HDFS
Many models to schedule
• Data: Matrix X1, Matrix X2, Matrix X3
• Parameters θ1, θ2, θ3 ...: algorithms, hyperparameters, data preprocessing strategies
• Machine Learning is run on the matrices with these parameters
Naive scheduling
• Figure labels: Load & Convert; Parameter θ1 with Matrix X1, Parameter θ1 with Matrix X2, Parameter θ1 with Matrix X3
• Waste of memory
• Frequent data loading from other servers
• Frequent data-to-matrix conversion
Parameter-aware scheduling
• Figure labels: Parameter θ1, Parameter θ2 and Parameter θ3 with Matrix X1
• Efficient memory usage
• Infrequent data loading from other servers
• Infrequent data-to-matrix conversion
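The difference between the two schedules can be illustrated by counting load-and-convert operations. This sketch assumes a hypothetical executor that keeps only the most recently used matrix in memory; the task ordering and the cache model are illustrative assumptions, not the scheduler described in the talk.

```python
from itertools import product


def count_loads(tasks):
    """Count load-and-convert operations for an executor that keeps only
    the most recently used matrix in memory."""
    loads, cached = 0, None
    for matrix, _param in tasks:
        if matrix != cached:
            loads += 1
            cached = matrix
    return loads


matrices = ["X1", "X2", "X3"]
params = ["theta1", "theta2", "theta3"]

# Naive order: iterate parameters first, so the matrix changes on every task.
naive = [(m, p) for p, m in product(params, matrices)]
# Parameter-aware order: run every parameter against a matrix before moving on.
aware = [(m, p) for m, p in product(matrices, params)]

print(count_loads(naive), count_loads(aware))  # 9 3
```

Both orders execute the same nine (matrix, parameter) tasks; only the grouping changes, which is why the parameter-aware order needs a third of the loads in this example.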
Machine learning – the most work-intensive and time-consuming part
• Training data (parquet) from HDFS → Data preprocessing (MapReduce) → Convert to matrix → Machine Learning (map operation) → 1000s of models to HDFS
• We must ensure a good balance of parallelized work
Naive balancing of models to compute
• One executor: two complicated models, 5 min + 5 min
• Another executor: two decision tree models, 1 min + 1 min, then wait 8 min…
Predictive balancing
• Balancing complex and simple models (based on previous estimation)
• Complex models first
• Each executor: 5 min + 1 min
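The slide's numbers can be reproduced with a small scheduling sketch. The predictive variant below uses a longest-task-first greedy assignment, which is one common way to implement "complex models first"; the slides do not say which algorithm the authors use, and the per-model estimates are simply the minutes shown in the figure.

```python
import heapq


def naive_split(estimates, n_workers):
    """Cut the task list into contiguous chunks, one per worker."""
    size = -(-len(estimates) // n_workers)  # ceiling division
    return [estimates[i:i + size] for i in range(0, len(estimates), size)]


def predictive_split(estimates, n_workers):
    """Longest task first, each to the worker with the least work so far."""
    heap = [(0, w) for w in range(n_workers)]
    plan = [[] for _ in range(n_workers)]
    for minutes in sorted(estimates, reverse=True):
        load, w = heapq.heappop(heap)
        plan[w].append(minutes)
        heapq.heappush(heap, (load + minutes, w))
    return plan


def makespan(plan):
    """Time until the busiest worker finishes."""
    return max(sum(tasks) for tasks in plan)


estimates = [5, 5, 1, 1]  # estimated minutes per model, as in the figure
print(makespan(naive_split(estimates, 2)))       # 10
print(makespan(predictive_split(estimates, 2)))  # 6
```

With the naive split one executor works for 10 minutes while the other idles for 8; with the predictive split both finish after 6 minutes.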
Evaluation
Evaluation – targeting Top-10%
• Prediction problem
– Comparing Top-10% precision of targeting potential positive samples
• Comparing with manual predictive modeling
– Done with scikit-learn v0.18.1
– Selected algorithms (Logistic Regression, SVM, Random Forests)
– Selected preprocessing strategies
– All algorithm parameters set to default values, except Random Forest (n_estimators = 200)
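One plausible reading of "Top-10% precision" is the share of true positives among the 10% of samples with the highest predicted scores. The function below is a sketch of that reading; the exact definition used in the evaluation is not given in the slides, and the scores and labels are made up.

```python
def top_fraction_precision(scores, labels, fraction=0.10):
    """Precision among the `fraction` of samples with the highest scores."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _score, label in ranked[:k]) / k


# Made-up example: 20 samples, so the top 10% are the 2 highest-scored ones.
scores = [i / 20 for i in range(20)]
labels = [0] * 17 + [1, 0, 1]
print(top_fraction_precision(scores, labels))  # 0.5
```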
Evaluation – data sets
• KDDCUP 2014 competition data
– 557K records for training and validation data
– 62K records for test data
– Features: 500
• KDDCUP 2015 competition data
– 108K records for training and validation data
– 12K records for test data
– Features: 500
• IJCAI 2015 competition data
– 87K records for training, validation and test data
– Features: 500
Evaluation – cluster specification
• Scalable Modular Server (DX2000)
• Size: 3U
• Server modules: 34
• CPU: 272 cores (Intel Xeon D 2.1GHz)
– 128 cores used in the evaluation
• RAM: 2TB
• Storage: 34TB SSD
• Internal network: 10GbE
• Spark v1.6.0, Hadoop v2.7.3
Evaluation results and conclusions
• Competitive results with good accuracy
Top-10% precision results
Data | Our technology | Logistic regression | SVM | Random Forests
KDDCUP 2014 | 15.6% | 13.5% | 12.0% | 14.8%
KDDCUP 2015 | 97.1% | 95.5% | 93.1% | 97.2%
IJCAI 2015 | 8.2% | 8.3% | 8.1% | 8.2%
Evaluation results and conclusions
• Short execution time
• Full automation of the whole process
• Handling data of any size
Execution time
Data | Our technology
KDDCUP 2014 | 172 minutes
KDDCUP 2015 | 45 minutes
IJCAI 2015 | 36 minutes
Our observations
• Using RDD of huge but compact objects optimized for ML computations
• Limiting execution time overhead in tests on YARN
• Stable execution on YARN
RDD[DenseMatrix]
• Training data (parquet) from HDFS → Data preprocessing (MapReduce) → Converting to RDD[Matrix] → Machine Learning (map operation) → 1000s of models to HDFS
• Spark used for parallelization
• All the necessary data for a single execution kept without memory overhead
• Performance-critical operations executed:
– On objects with optimized linear algebra operations
– By fast native ML algorithms
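The benefit of one compact object over many small ones can be seen even outside the JVM. The sketch below is only an analogy in CPython, not a measurement of Spark: it compares a list of lists of floats with a single row-major `array` of doubles holding the same values. The matrix size is arbitrary.

```python
import sys
from array import array


def nested_size(rows):
    """Approximate bytes used by a list of lists of Python floats."""
    total = sys.getsizeof(rows)
    for row in rows:
        total += sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)
    return total


n_rows, n_cols = 1000, 50
rows = [[float(r * n_cols + c) for c in range(n_cols)] for r in range(n_rows)]

# One contiguous, row-major block of doubles: the "dense matrix".
dense = array("d", (v for row in rows for v in row))


def cell(r, c):
    """Read element (r, c) from the row-major block."""
    return dense[r * n_cols + c]


nested_bytes = nested_size(rows)
dense_bytes = sys.getsizeof(dense)
print(nested_bytes > dense_bytes, cell(2, 3))  # True 103.0
```

The dense block stores eight bytes per value plus a small header, whereas the nested layout pays for a separate object and a pointer per value.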
Limiting execution overhead in tests
• Submitting a Spark application takes time
• Naive approach: Spark submit → Test → Spark submit → Test → Spark submit → Test
• We submit only once: Spark submit → Test → Test → Test
Stable execution on YARN
• Default configuration sometimes fails due to insufficient memory
• Spark Web UI: (screenshot)
• Giving Spark a lot of memory does not help: the application still fails
• Known problem in Spark
Stable execution on YARN
• JVM system memory suddenly spikes over the YARN limit (*)
• Figure: memory (GB) over time; a spike of JVM system memory usage exceeds the YARN limit (6GB)
(*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit”, Spark Summit 2016.
Stable execution on YARN
• Tip: spark.yarn.executor.memoryOverhead must be configured carefully
• Recommended overhead: 6-10%
• 15% overhead required in our case
• Must be thoroughly investigated
(http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
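As a worked example of the tip above, the sketch below splits a YARN container limit into executor heap and overhead. It assumes the overhead is expressed as a fraction of executor memory, uses the 6GB limit from the earlier figure and the 15% the authors report, and applies the 384 MB minimum mentioned in the Spark 2.1.1 documentation. The helper names are hypothetical.

```python
def executor_sizing(container_limit_mb, overhead_fraction, min_overhead_mb=384):
    """Split a YARN container limit into executor heap and overhead (MB).

    Assumption: overhead is a fraction of the executor heap, so
    heap * (1 + overhead_fraction) must fit within the container limit.
    """
    heap_mb = int(container_limit_mb / (1 + overhead_fraction))
    overhead_mb = max(min_overhead_mb, container_limit_mb - heap_mb)
    heap_mb = container_limit_mb - overhead_mb
    return heap_mb, overhead_mb


def spark_conf_args(heap_mb, overhead_mb):
    """Build spark-submit arguments for the two memory settings."""
    return [
        "--conf", f"spark.executor.memory={heap_mb}m",
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}",
    ]


heap, overhead = executor_sizing(6 * 1024, 0.15)
print(heap, overhead)  # 5342 802
print(" ".join(spark_conf_args(heap, overhead)))
```

The right fraction depends on the workload, which is why the slide stresses that it must be investigated rather than copied.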
Summary
• Predictive modeling problem
– Requires sophisticated knowledge
– Takes a long time
• Our technology: Automatic Predictive Modeling
– Combines Spark with native ML engines
– Fully automates the whole process
– Provides highly accurate results
– Takes at most hours
– Handles data of any size
Future work
• Extending to other models (e.g. deep learning)
• Speeding up with GPUs
• Reducing YARN memory overhead
Thank you!

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin Kulka and Michał Kaczmarczyk

  • 1. #EUai9 Marcin Kulka and Michał Kaczmarczyk 9LivesData Oct/26/2017 No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark
  • 2. Who we are? • Marcin Kulka – Senior Software Engineer • Michał Kaczmarczyk (Ph.D.) – Software Architect, Team Leader and Project Manager 2
  • 3. Who we are? • Advanced software R&D company (Warsaw, Poland) • 75+ scientists and software engineers • Specializing in scalable storage, distributed and big data systems • Cooperating with partners all around the world 3
  • 4. 4
  • 5. • Masato Asahara (Ph.D.) - Researcher, NEC Data Science Research Laboratory • Ryohei Fujimaki (Ph.D.) - Research Fellow, NEC Data Science Research Laboratory 5
  • 6. Agenda • Typical use case for predictive modeling problem • Our technology - Automatic Predictive Modeling • Design challenges • Evaluation results • Our observations 6
  • 8. Predictive analysis in industry and business 8 Driver risk assessment Inventory Optimization Churn Retention Predictive Maintenance Product price optimization Sales optimization Energy/water operation mgmt
  • 9. ... but Predictive Modeling • Takes a long time • Requires high skills 9
  • 10. Typical predictive modeling use case 1010 Training Data Validation Data Test Data Highly accurate prediction results
  • 11. Typical predictive modeling use case 1111 Predictive models Training Data Validation Data Test Data Highly accurate prediction results
  • 12. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box)
  • 13. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance
  • 14. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
  • 15. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather) • A lot of effort, many models…
  • 16. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather) • A lot of effort, many models… • Many iterations, weeks...
  • 17. Predictive model design • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather) • A lot of effort, many models… • Many iterations, weeks... • Sophisticated knowledge...
  • 18. Automatic predictive modeling • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather)
  • 19. Automatic predictive modeling • Algorithm selection: accuracy vs. transparency (black box vs. white box) • Hyperparameter tuning: best balance • Feature selection: determining a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather) • Highly accurate results in a short time!
  • 21. Exploring massive modeling possibilities • Data preprocessing strategies
  • 22. Exploring massive modeling possibilities • Data preprocessing strategies • Algorithms
  • 23. Exploring massive modeling possibilities • Data preprocessing strategies • Algorithms • Feature selection!
  • 24. Exploring massive modeling possibilities • Data preprocessing strategies • Algorithms • Feature selection! • Hyperparameter tuning
  • 25. Exploring massive modeling possibilities • Data preprocessing strategies • Algorithms • Feature selection! • Hyperparameter tuning • 1000s of models!
  • 27. Automating and accelerating with Spark • Data preprocessing strategies • Algorithms • Feature selection! • Hyperparameter tuning • Complete in hours!
  • 29. Modeling flow = training + validation • Training models: training data in, models out • Validating models: models, validation data and validation criteria in, best model out • Test data
  • 30. Modeling and prediction flow • Training models: training data in, models out • Validating models: models, validation data and validation criteria in, best model out • Prediction: best model and test data in, best prediction out
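The flow on slides 29 and 30 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two toy candidate models, the negative mean squared error criterion, and all function names (`fit_mean`, `fit_linear`, `neg_mse`, `select_best`) are hypothetical and are not taken from the talk.

```python
from statistics import mean


def fit_mean(train):
    """Hypothetical candidate 1: always predict the mean training target."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_linear(train):
    """Hypothetical candidate 2: least-squares line through the origin."""
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: slope * x


def neg_mse(model, data):
    """Assumed validation criterion: negative mean squared error."""
    return -mean((model(x) - y) ** 2 for x, y in data)


def select_best(candidates, train, validation, criterion):
    """Train every candidate, score each on validation data, keep the best."""
    models = {name: fit(train) for name, fit in candidates.items()}
    best = max(models, key=lambda name: criterion(models[name], validation))
    return best, models[best]


# Toy data following y = 2x, split into the three sets named on the slides.
train = [(1, 2), (2, 4), (3, 6)]
validation = [(4, 8)]
test = [5]

candidates = {"mean": fit_mean, "linear": fit_linear}
best_name, best_model = select_best(candidates, train, validation, neg_mse)

# Prediction step: only the selected model touches the test data.
predictions = [best_model(x) for x in test]
```

Keeping the test data out of `select_best` mirrors the slides: validation data chooses the model, test data is used only for the final prediction.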
  • 32. Challenges to achieve high execution performance • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  • 33. Challenges to achieve high execution performance • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  • 34. Using native ML engines in Spark • Why?
  • 35. Comparison of Spark and native ML engines • Compared: Spark (+ Spark ML) vs. native ML engines
  • 36. Comparison of Spark and native ML engines • Scalability: Spark (+ Spark ML) yes; native ML engines no (or very limited)
  • 37. Comparison of Spark and native ML engines • Scalability: Spark (+ Spark ML) yes; native ML engines no (or very limited) • Choice of algorithms: Spark (+ Spark ML) some; native ML engines many (+ possibly some custom, very efficient) • Callout: accuracy
  • 38. Comparison of Spark and native ML engines • Scalability: Spark (+ Spark ML) yes; native ML engines no (or very limited) • Choice of algorithms: Spark (+ Spark ML) some; native ML engines many (+ possibly some custom, very efficient) • Performance: Spark (+ Spark ML) medium (distributed nature, synchronization overhead); native ML engines extremely high (if data fits a single server) • Callout: accuracy
  • 40. Comparison of Spark and native ML engines • Scalability: Spark (+ Spark ML) yes; native ML engines no (or very limited) • Choice of algorithms: Spark (+ Spark ML) some; native ML engines many (+ possibly some custom, very efficient) • Performance: Spark (+ Spark ML) medium; native ML engines extremely high • We would like to combine Spark and ML engines
  • 41. Combining Spark and ML engines for training • Training data (parquet) on HDFS • Models
  • 43. Combining Spark and ML engines for training • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Machine learning (map operation) • Models
  • 44. Combining Spark and ML engines for training • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Machine learning (map operation): a 'single ML engine' on a single executor • Models
  • 45. Combining Spark and ML engines for training • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Machine learning (map operation): a 'single ML engine' on a single executor • Input requirements: size & format • Models
  • 46. Combining Spark and ML engines for training • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Machine learning (map operation) • Models
  • 47. Combining Spark and ML engines for training • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Converting to RDD[Matrix] • Machine learning (map operation) • Models
  • 48. Combining Spark and ML engines for training • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Converting to RDD[Matrix]: an RDD of huge, efficiently stored objects optimized for ML computations! • Machine learning (map operation) • Models
  • 49. Combining Spark and ML engines for training • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Converting to RDD[Matrix]: an RDD of huge, efficiently stored objects optimized for ML computations! • Machine learning (map operation) • 1000s of models written to HDFS
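As a rough illustration of the "huge, efficiently stored objects" idea, the sketch below packs all rows of one partition into a single row-major matrix backed by one contiguous buffer of doubles. The class `CompactMatrix` and the function `partition_to_matrix` are hypothetical names, not the authors' code and not Spark's own matrix types; in Spark such a function would presumably be applied once per partition (for example through `mapPartitions`), which is an assumption here.

```python
from array import array


class CompactMatrix:
    """Hypothetical row-major matrix stored in one contiguous double buffer."""

    def __init__(self, n_rows, n_cols, values):
        self.n_rows = n_rows
        self.n_cols = n_cols
        self.values = values  # array('d') of length n_rows * n_cols

    def get(self, i, j):
        """Return the element at row i, column j."""
        return self.values[i * self.n_cols + j]


def partition_to_matrix(rows):
    """Pack every row of one partition (an iterator) into a CompactMatrix."""
    values = array("d")
    n_rows = 0
    n_cols = 0
    for row in rows:
        if n_rows == 0:
            n_cols = len(row)
        elif len(row) != n_cols:
            raise ValueError("all rows must have the same number of features")
        values.extend(row)  # appended as raw doubles, no per-row object kept
        n_rows += 1
    return CompactMatrix(n_rows, n_cols, values)


# One partition with two records and three features each.
partition = iter([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
matrix = partition_to_matrix(partition)
```

One object per partition, instead of one object per record, is what keeps per-object overhead low and lets a native engine read the buffer directly.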
  • 50. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS
  • 53. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS • Data preprocessing (MapReduce) • Converting to RDD[Matrix] • Prediction (map operation): computing validation results for many models
  • 54. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS • Data preprocessing (MapReduce) • Converting to RDD[Matrix] • Prediction (map operation) • Validation (MapReduce): computing validation scores
  • 55. Combining Spark and ML engines for validation • Validation data (parquet) on HDFS • Data preprocessing (MapReduce) • Converting to RDD[Matrix] • Prediction (map operation) • Validation (MapReduce) • Best model written to HDFS
  • 56. Combining Spark and ML engines for prediction • Test data (parquet) on HDFS • Data preprocessing (MapReduce) • Convert to RDD[Matrix] • Predict (map operation): computations only for selected models • Prediction results (parquet) written to HDFS
  • 57. Design challenges • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  • 58. Many models to schedule • Matrix X1 • Matrix X2 • Matrix X3
  • 59. Many models to schedule • Matrices X1, X2, X3 • Parameters θ1, θ2, θ3 ...: algorithms, hyperparameters, data preprocessing strategies
  • 60. Many models to schedule • Matrices X1, X2, X3 • Parameters θ1, θ2, θ3 ...: algorithms, hyperparameters, data preprocessing strategies • Machine learning
  • 61. Naive scheduling • Diagram: parameter θ1 paired with matrices X1, X2 and X3, each requiring load & convert • Waste of memory • Frequent data loading from other servers • Frequent data-to-matrix conversion
  • 62. Parameter-aware scheduling • Diagram: parameters θ1, θ2 and θ3 run against the same matrix X1 • Efficient memory usage • Infrequent data loading from other servers • Infrequent data-to-matrix conversion
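One plausible reading of slides 61 and 62 is that tasks sharing the same matrix are grouped so the matrix is loaded and converted once. The sketch below counts "load & convert" operations under both strategies. The task representation and the names `naive_schedule`, `parameter_aware_schedule` and `count_loads` are assumptions made for illustration, not the authors' scheduler.

```python
from collections import defaultdict


def naive_schedule(tasks):
    """One work unit per (matrix, parameter) pair: each unit loads its matrix."""
    return [(matrix_id, [param]) for matrix_id, param in tasks]


def parameter_aware_schedule(tasks):
    """Group all parameters that share a matrix into one work unit."""
    grouped = defaultdict(list)
    for matrix_id, param in tasks:
        grouped[matrix_id].append(param)
    return list(grouped.items())


def count_loads(schedule):
    """Each work unit performs exactly one load & convert of its matrix."""
    return len(schedule)


# Three matrices and three parameter sets give nine models to train.
tasks = [(m, p) for p in ("theta1", "theta2", "theta3") for m in ("X1", "X2", "X3")]

naive_loads = count_loads(naive_schedule(tasks))
aware_loads = count_loads(parameter_aware_schedule(tasks))
```

With this toy workload the naive plan performs nine loads and the grouped plan three, which matches the slide's claim of infrequent loading and conversion.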
  • 63. Design challenges • Using native ML engines in Spark • Parameter-aware scheduling • Predictive work balancing
  • 64. Machine learning: the most work-intensive and time-consuming part • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Convert to matrix • Machine learning (map operation) • 1000s of models written to HDFS • We must ensure a good balance of parallel work
  • 65. Naive balancing of models to compute • Complicated models: 5 min each
  • 66. Naive balancing of models to compute • Complicated models: 5 min + 5 min on one executor • Decision tree models: 1 min + 1 min on the other executor • The second executor waits 8 min…
  • 67. Predictive balancing • Balancing complex and simple models (based on previous estimation) • Complex models first • Each executor gets 5 min + 1 min
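The "complex models first" rule resembles the classic longest-processing-time-first greedy heuristic, so a sketch using that heuristic is given below. It is a stand-in technique chosen for illustration, not necessarily the authors' algorithm; the model names and the per-model time estimates are hypothetical.

```python
import heapq


def naive_balance(models, n_workers):
    """Split models into contiguous chunks in submission order."""
    chunk = -(-len(models) // n_workers)  # ceiling division
    return [models[i * chunk:(i + 1) * chunk] for i in range(n_workers)]


def predictive_balance(models, n_workers):
    """Assign the longest estimated model first to the least-loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]  # (load, worker id)
    assignment = [[] for _ in range(n_workers)]
    for model in sorted(models, key=lambda m: m[1], reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(model)
        heapq.heappush(heap, (load + model[1], worker))
    return assignment


def makespan(assignment):
    """Total time until the slowest worker finishes, in minutes."""
    return max(sum(minutes for _, minutes in worker) for worker in assignment)


# (name, estimated minutes), mirroring the 5 min / 1 min example on the slides.
models = [("complex_a", 5), ("complex_b", 5), ("tree_a", 1), ("tree_b", 1)]

naive = naive_balance(models, 2)
balanced = predictive_balance(models, 2)
```

On the slide's numbers the naive split finishes after 10 minutes with one worker idle for 8, while the estimate-driven split finishes after 6.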
  • 69. Evaluation – targeting Top-10% • Prediction problem – Comparing Top-10% precision of targeting potential positive samples • Comparing with manual predictive modeling – Done with scikit-learn v0.18.1 – Selected algorithms (Logistic Regression, SVM, Random Forests) – Selected preprocessing strategies – All algorithm parameters left at their default values, except Random Forest (n_estimators = 200)
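The slide does not define Top-10% precision formally. A common reading, assumed here, is: rank samples by predicted score, keep the top 10%, and report the fraction of those that are truly positive. The function name `top_percent_precision` and the rounding rule for the cutoff (integer floor, at least one sample) are assumptions.

```python
def top_percent_precision(scores, labels, percent=10):
    """Fraction of true positives among the highest-scored `percent` of samples."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Assumed cutoff rule: floor of percent * n / 100, but at least one sample.
    k = max(1, len(scores) * percent // 100)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Twenty samples with increasing scores; samples 5 and 19 are positive.
scores = [i / 20 for i in range(20)]
labels = [1 if i in (5, 19) else 0 for i in range(20)]

# The top 10% are samples 19 and 18, of which one is positive.
precision = top_percent_precision(scores, labels)
```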
  • 70. Evaluation – data sets • KDDCUP 2014 competition data – 557K records for training and validation data – 62K records for test data – Features: 500 • KDDCUP 2015 competition data – 108K records for training and validation data – 12K records for test data – Features: 500 • IJCAI 2015 competition data – 87K records for training, validation and test data – Features: 500
  • 71. Evaluation – cluster specification • Scalable Modular Server (DX2000) • Size: 3U • Server modules: 34 • CPU: 272 cores (Intel Xeon D 2.1GHz) – 128 cores used in the evaluation • RAM: 2TB • Storage: 34TB SSD • Internal network: 10GbE • Spark v1.6.0, Hadoop v2.7.3
  • 72. Evaluation results and conclusions • Top-10% precision results (our technology / logistic regression / SVM / Random Forests) – KDDCUP 2014: 15.6% / 13.5% / 12.0% / 14.8% – KDDCUP 2015: 97.1% / 95.5% / 93.1% / 97.2% – IJCAI 2015: 8.2% / 8.3% / 8.1% / 8.2%
  • 73. Evaluation results and conclusions • Competitive results with good accuracy • Top-10% precision results (our technology / logistic regression / SVM / Random Forests) – KDDCUP 2014: 15.6% / 13.5% / 12.0% / 14.8% – KDDCUP 2015: 97.1% / 95.5% / 93.1% / 97.2% – IJCAI 2015: 8.2% / 8.3% / 8.1% / 8.2%
  • 74. Evaluation results and conclusions • Short execution time • Full automation of the whole process • Handling data of any size • Execution time of our technology – KDDCUP 2014: 172 minutes – KDDCUP 2015: 45 minutes – IJCAI 2015: 36 minutes
  • 76. Our observations • Using RDD of huge but compact objects optimized for ML computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  • 78. RDD[DenseMatrix] • Training data (parquet) on HDFS • Data preprocessing (MapReduce) • Converting to RDD[Matrix] • Machine learning (map operation) • 1000s of models written to HDFS
  • 79. RDD[DenseMatrix] • Spark used for parallelization • All the data needed for a single execution kept without memory overhead • Performance-critical operations executed: – On objects with optimized linear algebra operations – By fast native ML algorithms
  • 80. Our observations • Using RDD of huge but compact objects optimized for fast computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  • 81. Limiting execution overhead in tests • Submitting a Spark application takes time • Diagram: every test is preceded by its own Spark submit (Spark submit, Test, Spark submit, Test, Spark submit, Test)
  • 82. Limiting execution overhead in tests • We submit only once • Diagram: a single Spark submit followed by all tests (Spark submit, Test, Test, Test)
  • 83. Our observations • Using RDD of huge but compact objects optimized for fast computations • Limiting execution time overhead in tests on YARN • Stable execution on YARN
  • 84. Stable execution on YARN • The default configuration sometimes fails with not enough memory • Spark Web UI (screenshot) • Giving Spark a lot of memory does not help: the application still fails • Known problem in Spark
  • 85. Stable execution on YARN • JVM system memory suddenly spikes over the YARN limit (*) • Chart: memory (GB) over time, with a spike of JVM system memory usage above the YARN limit (6GB) • (*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit”, Spark Summit 2016.
  • 86. Stable execution on YARN • Tip: spark.yarn.executor.memoryOverhead must be configured carefully • Recommended overhead: 6-10% • 15% overhead required in our case • Must be thoroughly investigated • (http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
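To make the 15% tip concrete, the helper below derives a value for `spark.yarn.executor.memoryOverhead` (given in megabytes in the Spark versions discussed here) from an executor heap size and a percentage. The 5120 MB executor size is a hypothetical example, the 384 MB floor follows the default minimum described in the linked Spark documentation, and the function names are illustrative only.

```python
MIN_OVERHEAD_MB = 384  # default minimum overhead per the Spark on YARN docs


def overhead_mb(executor_memory_mb, overhead_percent):
    """Off-heap overhead in MB for a given heap size and percentage."""
    return max(MIN_OVERHEAD_MB, executor_memory_mb * overhead_percent // 100)


def yarn_executor_conf(executor_memory_mb, overhead_percent):
    """Build the two Spark settings that determine the YARN container size."""
    return {
        "spark.executor.memory": f"{executor_memory_mb}m",
        "spark.yarn.executor.memoryOverhead": str(
            overhead_mb(executor_memory_mb, overhead_percent)
        ),
    }


# Hypothetical 5 GB executor heap with the 15% overhead reported on the slide.
conf = yarn_executor_conf(5120, 15)

# YARN must grant heap plus overhead for the container to stay within its limit.
container_mb = 5120 + overhead_mb(5120, 15)
```

The resulting pairs could be passed as `--conf key=value` options to `spark-submit`; the right percentage still has to be found by measurement, as the slide stresses.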
  • 88. Summary • Predictive modeling problem – Requires sophisticated knowledge – Takes a long time • Our technology: Automatic Predictive Modeling – Combines Spark with native ML engines – Fully automates the whole process – Provides highly accurate results – Takes at most hours – Handles data of any size
  • 89. Future work • Extending to other models (e.g. deep learning) • Speeding up with GPUs • Reducing YARN memory overhead