Quick! Quick! Exploration!:
A framework for searching a predictive
model on Apache Spark
Masato Asahara*, Yoshiki Takahashi+
and Kazuyuki Shudo+
* NEC Corporation, + Tokyo Institute of Technology
Jun/21/2018 @DataWorks Summit 2018
2 © NEC Corporation 2018
Who are we?
▌Masato Asahara (Ph.D.)
▌Principal Software Architect and Researcher
at NEC System Platform Research Laboratory
Masato Asahara (Ph.D.) currently leads the development of Spark-based
machine learning and data analytics systems, which fully automate
predictive modeling.
Masato received his Ph.D. degree from Keio University, and has worked at
NEC for 8 years as a researcher in the field of distributed computing
systems and computing resource management technologies.
▌Yoshiki Takahashi
▌Master's student at Tokyo Institute of Technology
Yoshiki Takahashi is a student in the master's program in computer science
at the graduate school of Tokyo Institute of Technology. His academic
research proposal was accepted at SysML 2018, a conference that has
attracted attention since its earlier days as a NIPS workshop.
He worked on the development of a Spark-based machine learning platform
for automatic predictive modeling during his internship at NEC Data
Science Research Laboratories in 2017. He received his B.S. degree in 2017
from Tokyo Institute of Technology.
Agenda
1000+ Patterns → Best model
Quick! Scalable! Plug-in
Our framework: ×13 faster than MLlib!!
(A cluster with 16 CPU cores, using the HIGGS dataset from the UCI ML repository)
Predictive Modeling Automation Framework
Predictive Analysis in Enterprise Area
• Driver Risk Assessment
• Inventory Optimization
• Churn Retention
• Predictive Maintenance
• Product Price Optimization
• Sales Optimization
• Energy/Water Operation Mgmt.
Pain of Modern Predictive Modeling
• High skill: Precious CS Ph.D.s
• Evolving ML Technology: Quick Trial w/ New ML algo.
• Long time: Many Tuning Parameters
Our Framework Automates Predictive Modeling!
Values of Our Automation Framework
• Democratized to business users
• Quick model selection
• Easy integration with future ML implementations
Best
Design Challenges and Solutions
High Level Architecture
[Figure labels: Training Data, Validate Data, Training, Validate, Criteria, Run, HDFS]
Design Challenges
• High Scalability
• Open for ML Implementations
Just Adding Nodes doesn’t Improve Performance
5 min 4 min
1 min 2 min Wait 6 min
Scheduling as a Combinatorial Optimization Problem
[Figure: a Scheduler assigns Job 1, Job 2, Job 3, ⋮ to workers W1 and W2, with completion times T1 and T2; example job times: 5 min, 5 min, 2 min, 1 min, 5 min]
Minimize max {T1, T2}
[Figure: the same assignment with every job time shown as "??? min"]
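A minimal sketch of the "minimize max {T1, T2}" objective, not the framework's actual scheduler (the slides do not say which algorithm it uses). It assumes the four job times from the "Just Adding Nodes" slide (5, 4, 1 and 2 min), assumes that slide's figure pairs 5+4 on one worker and 1+2 on the other, and swaps in the greedy longest-processing-time-first (LPT) heuristic. The names lpt_schedule and makespan are hypothetical.

```python
import heapq


def makespan(assignment, durations):
    """Finish time of the busiest worker, i.e. max {T1, T2, ...}."""
    return max(sum(durations[job] for job in jobs) for jobs in assignment)


def lpt_schedule(durations, n_workers):
    """Greedy LPT heuristic: give the longest remaining job to the least-loaded worker."""
    # Min-heap of (accumulated load, worker index, assigned job indices).
    heap = [(0, worker, []) for worker in range(n_workers)]
    heapq.heapify(heap)
    for job, duration in sorted(enumerate(durations), key=lambda jd: -jd[1]):
        load, worker, jobs = heapq.heappop(heap)
        jobs.append(job)
        heapq.heappush(heap, (load + duration, worker, jobs))
    # Return the job lists ordered by worker index.
    return [jobs for _, _, jobs in sorted(heap, key=lambda entry: entry[1])]


durations = [5, 4, 1, 2]          # minutes, taken from the earlier slide
naive = [[0, 1], [2, 3]]          # assumed pairing: 5+4 on W1, 1+2 on W2
balanced = lpt_schedule(durations, n_workers=2)

naive_time = makespan(naive, durations)
balanced_time = makespan(balanced, durations)
```

With these inputs the naive pairing finishes in 9 min while one worker idles for 6 min, and the greedy assignment finishes in 6 min. LPT is a heuristic and is not guaranteed to be optimal in general.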
Scheduler Pre-profiles Job Time via Sampled Data
[Figure: each parameter setting (eta, max_depth, ⋮, round) is trained on Small Sampled Data; the Scheduler records "This training takes xx.x sec."]
[Figure: Profiling on Small Sampled Data → Scheduling → Automated Predictive Modeling on Entire Data]
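A minimal sketch of the pre-profiling idea: time every grid point on a small sample, then use the measured times as cost estimates for scheduling. This is an illustration under assumptions, not the framework's code. The function names profile_jobs and toy_train, the grid values, and the synthetic data are all hypothetical; only the parameter names eta, max_depth and round (spelled rounds here) come from the slide. How the real scheduler extrapolates from sample time to full-data time is not stated in the slides.

```python
import itertools
import random
import time


def toy_train(rows, eta, max_depth, rounds):
    """Hypothetical stand-in for a real training call; cost grows with rounds and depth."""
    total = 0.0
    for _ in range(rounds):
        for _ in range(max_depth):
            for value in rows:
                total += eta * value
    return total


def profile_jobs(train_fn, rows, grid, sample_ratio=0.01, seed=0):
    """Time every grid point on a small random sample of the training rows."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * sample_ratio))
    sample = rng.sample(rows, sample_size)
    estimates = []
    for params in grid:
        start = time.perf_counter()
        train_fn(sample, **params)
        estimates.append((params, time.perf_counter() - start))
    return estimates


# Hypothetical search space; the values are illustrative only.
grid = [
    {"eta": eta, "max_depth": depth, "rounds": rounds}
    for eta, depth, rounds in itertools.product([0.1, 0.3], [3, 6], [10, 50])
]
rows = [float(i) for i in range(10_000)]  # synthetic training data
estimates = profile_jobs(toy_train, rows, grid)

# Longest estimated jobs first, ready to hand to a scheduler.
ranked = sorted(estimates, key=lambda item: item[1], reverse=True)
```

The measured times depend on the machine, so they are best read as relative costs between grid points rather than absolute predictions.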
Preliminary Evaluation: Pre-profiling Takes Little Time
2.34%
In pre-profiling,
• sampling 1% of the training data
• executing training over the same search space as the automatic predictive modeling
Design Challenges
• High Scalability
• Open for ML Implementations
Reducing Implementation Costs to Add New ML impl.
• Distributed Learning
• Validation / Model Selection
Naïve Design: Requires Many Changes to Plug in New ML
[Figure labels: Distributed Learning invokes Training ×3; XGB Format Data, TF Format Data, New ML Format Data; Add code]
Easy Integration with New ML impl. by Encapsulation
[Figure labels: Distributed Learning invokes Training ×4 through Encapsulation; Common Format Data; Add code]
[Figure labels: Validation / Model selection invokes Prediction ×4 through Encapsulation; Common Format Data; Add code]
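A minimal sketch of the encapsulation idea: the framework invokes training and prediction only through one interface over a common data format, so a new ML implementation is added as one wrapper class. The interface below is an assumption for illustration, not the framework's API. Learner, MajorityClassLearner and validate are hypothetical names, and the nested-list Matrix type stands in for the common format.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import List, Sequence

Matrix = List[List[float]]  # stand-in for the common format data


class Learner(ABC):
    """Hypothetical plug-in interface; the framework side calls only these two methods."""

    @abstractmethod
    def fit(self, X: Matrix, y: Sequence[int]) -> None:
        ...

    @abstractmethod
    def predict(self, X: Matrix) -> List[int]:
        ...


class MajorityClassLearner(Learner):
    """Toy plug-in: always predicts the most frequent training label."""

    def fit(self, X: Matrix, y: Sequence[int]) -> None:
        self.label_ = Counter(y).most_common(1)[0][0]

    def predict(self, X: Matrix) -> List[int]:
        return [self.label_] * len(X)


def validate(learner: Learner, X_train, y_train, X_valid, y_valid) -> float:
    """Framework-side code: unchanged no matter which Learner is plugged in."""
    learner.fit(X_train, y_train)
    predictions = learner.predict(X_valid)
    hits = sum(p == t for p, t in zip(predictions, y_valid))
    return hits / len(y_valid)


score = validate(
    MajorityClassLearner(),
    X_train=[[0.0], [1.0], [2.0]], y_train=[1, 1, 0],
    X_valid=[[3.0], [4.0]], y_valid=[1, 0],
)
```

A wrapper for a real library would convert the common format to that library's own input type inside fit and predict, which keeps the conversion code out of the framework.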
Evaluation
Evaluation Setup
▌Dataset
HIGGS (UCI Dataset Repository)
• 1M samples each for training, validation and test data
• 28 features
▌Scheduler Training
Executes the same grids as training,
using a 1% sample of the training data
▌Environment
Apache Spark 2.3.0
Apache Hadoop 3.1.0
▌Exploring Algorithms
Gradient Boosting Tree (GBT)
• XGBoost 0.8
• 864 grid points
Multi-layer Perceptron (MLP)
• TensorFlow 1.8.0
• 324 grid points
Logistic Regression (LR)
• scikit-learn 0.18.1
• 5 grid points
Random Forest (RF)
• scikit-learn 0.18.1
• 18 grid points
Evaluation Result: Total Execution Time
×13.1 faster!!
Spark MLlib Focuses on Scaling out for Huge Data Size
[Figure labels: Training Dataset; Core 1, Core 2, Core 3; Shuffle; Complete training!; Next Model]
We Focus on Huge Search Space of Parameter Tuning
[Figure labels: Our Framework; Training Dataset; Core 1, Core 2, Core 3; Read entire data; No-Shuffle; Complete training!; Next Model ×3]
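A minimal sketch of the no-shuffle pattern: every worker holds the entire training data and trains a different parameter setting, so no data is exchanged between workers during training. It is an illustration under assumptions, not the framework's Spark code. Threads stand in for cores, a toy 1-D k-nearest-neighbour classifier stands in for a real learner, and all names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0 or 1)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def train_and_validate(k, train, valid):
    """One independent job: uses the whole training set, shares nothing with other jobs."""
    correct = sum(knn_predict(train, x, k) == y for x, y in valid)
    return k, correct / len(valid)


def search(grid, train, valid, n_workers=3):
    """Run one job per grid point in parallel and pick the best validation score."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda k: train_and_validate(k, train, valid), grid))
    best = max(results, key=lambda result: result[1])
    return results, best


# Synthetic data: (feature, label) pairs.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1)]
valid = [(0.5, 0), (4.5, 1), (2.4, 0), (2.6, 1)]
results, best = search([1, 3, 5], train, valid)
```

Because each job is independent, a worker can start its next model as soon as it finishes the current one, which is the behaviour the figure contrasts with the shuffle step in data-parallel training.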
Evaluation Result: Execution Performance for Scalability
[Chart values: 72.7%, 78.4%, 81.7%, 84.7%]
Evaluation Result: Improvement of Error and AUC
                            Classification Accuracy    AUC
Best model*                 0.756                      0.837
Gradient Boosting Tree**    0.743 (-0.013)             0.825 (-0.012)
Logistic Regression**       0.642 (-0.114)             0.684 (-0.153)
Random Forest**             0.724 (-0.032)             0.801 (-0.036)
* Best model produced by our framework.
** Using default hyperparameters of XGBoost and scikit-learn.
(Values in parentheses are differences from the best model.)
Evaluation Result: Amount of Code for Adding New ML
# Lines of Code w/o comments
• 151 lines
• 292 lines (python: 116, scala: 176)
• 290 lines (python: 90, scala: 200)
Summary and Future work
Summary – Automation Framework for Predictive Modeling
Best model
Quick! Scalable! Plug-in
Values
• Democratized to business users
• Quick model selection
• Easy integration with future ML implementations
Best
Design Challenges (Addressed)
• High Scalability
• Open for ML Implementations
Future work - Convert Data Structure for Each ML impl.
Common Format: Double[][]
[Figure labels: Memory Copy & Convert; Sparse, Column-oriented, Row-oriented]
A Common Memory Format that can be Read w/o Copy is Better
Common Format: ????
[Figure labels: Zero-copy read; Sparse, Column-oriented, Row-oriented]
Apache Arrow …?
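A minimal sketch of the difference between "Memory Copy & Convert" and a zero-copy read, using only the Python standard library. It does not use Apache Arrow and is not the framework's code. It assumes a small row-major buffer of doubles as the common format, and all variable names are hypothetical.

```python
from array import array

n_rows, n_cols = 2, 3
# One contiguous row-major buffer of doubles (the shared "common format" in this sketch).
buffer = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Copy and convert: builds a new nested list, duplicating every value.
copied = [list(buffer[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

# Zero-copy read: a 2-D view over the same memory, no values are duplicated.
view = memoryview(buffer).cast("B").cast("d", shape=[n_rows, n_cols])

# A later write to the buffer is visible through the view but not in the copy.
buffer[0] = 9.0
copy_value = copied[0][0]
view_value = view[0, 0]
```

The copy keeps its old value while the view sees the update, which shows that the view reads the original memory directly. A shared columnar format aims for the same property across different ML implementations.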