@s_kontopoulos
Machine Learning at Scale: Challenges
and Solutions
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
Who am I?
- Senior Software Engineer @ Lightbend, Fast Data Team
- Contributor at Apache Flink
- skonto
- s_kontopoulos
- SlideShare: stavroskontopoulos
- stavroskontopoulos
All trademarks and registered trademarks are property of their respective holders.
Agenda
- ML in the Enterprise
- ML from development to production
- Key technologies: Apache Spark as a case study
ML in the Enterprise
ML is a key tool that fuels the effort of coupling business monitoring (BI) with
predictive and prescriptive analytics.
business insights -> business optimization -> data monetization
ML in the Enterprise - The Data-Science LifeCycle
Identify Business Question
Identify and collect related Data
Data cleansing, feature extraction (Data pre-processing)
Experiment planning
Model Building
Model Evaluation
Model Deployment/Management in Production
Model Optimization - Performance
Machine Learning Model
A model is a function that maps inputs to outputs and essentially expresses a
mathematical abstraction.
Linear Regression:
Neural Network:
Random Forest:
Function composition
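To make the "model is a function" view concrete, here is a minimal sketch in plain Python. The names `linear_model`, `compose` and `relu` are illustrative assumptions, not taken from the slides; the formulas that appeared on the slide are not reproduced here.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_model(weights: Vector, bias: float) -> Callable[[Vector], float]:
    """Return f(x) = w . x + b as an ordinary function (hypothetical helper)."""
    def predict(x: Vector) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return predict


def compose(*fs):
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    def composed(x):
        for f in fs:
            x = f(x)
        return x
    return composed


def relu(z: float) -> float:
    """A simple non-linearity, as used between layers of a neural network."""
    return max(0.0, z)


f = linear_model([2.0, -1.0], 0.5)   # a linear regression style model
g = compose(f, relu)                 # a tiny "network": linear map, then activation

print(f([1.0, 3.0]))  # 2*1 - 1*3 + 0.5
print(g([1.0, 3.0]))  # the same value passed through relu
```

Larger models can be read the same way: a chain of simple functions composed into one.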
Model Evolution
- Models can be either pre-computed (e.g. trained off-line) or updated on-line.
- Online ML with streaming:
- Pure online learning uses only the latest arrived data point to update the model. In practice, models
are usually updated per batch/window (e.g. online k-means).
- An interesting case is when we sample the stream and train a model only when the distribution
changes.
- Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling.
- Re-train the model from scratch, ignoring the previous one.
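As a rough illustration of the "pure online" case, the sketch below updates a one-dimensional linear model with a single SGD step per arriving data point. The stream (y = 3x + 1 without noise), the learning rate and the function name `sgd_step` are assumptions made for the example.

```python
import random


def sgd_step(w, b, x, y, lr=0.05):
    """One online update from a single (x, y) pair under squared-error loss."""
    err = (w * x + b) - y          # prediction error for this data point
    return w - lr * err * x, b - lr * err


random.seed(0)
w, b = 0.0, 0.0                    # the model starts untrained
for _ in range(5000):              # each iteration stands for one stream element
    x = random.uniform(-1.0, 1.0)
    y = 3.0 * x + 1.0              # hypothetical noiseless stream
    w, b = sgd_step(w, b, x, y)

print(round(w, 3), round(b, 3))
```

A per-batch or per-window variant would average the gradient over the elements of the window before applying the update.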
Machine Learning Pipeline
A machine learning pipeline in production describes all steps, from data
pre-processing before the data is fed to the model, through to processing of the
model output (post-processing).
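A pipeline of this shape can be sketched as three stages applied in order. The classes and the stand-in model below are hypothetical; the point is only that the pre-processing parameters are fitted once and then reused unchanged for every later input.

```python
import statistics


class Standardize:
    """Pre-processing step: learns mean/std on training data, reuses them later."""

    def fit(self, xs):
        self.mean = statistics.fmean(xs)
        self.std = statistics.pstdev(xs)
        return self

    def __call__(self, x):
        return (x - self.mean) / self.std


class Pipeline:
    """Applies the given steps in order: pre-processing, model, post-processing."""

    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, x):
        for step in self.steps:
            x = step(x)
        return x


def model(z):
    return 2.0 * z + 1.0            # stand-in for a trained model (assumption)


def label(score):
    return "high" if score >= 2.0 else "low"   # post-processing of the raw score


scaler = Standardize().fit([8.0, 12.0])        # fitted on (toy) training data
pipeline = Pipeline(scaler, model, label)

print(pipeline(12.0), pipeline(8.0))
```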
Machine Learning Pipeline in Libraries
Pros:
- Data and test data go through the same steps
- Like a CI (continuous integration) pipeline, people can reason about the data
transformations
- Caching of computations
- Model serving is easier
Multiple Models in a Pipeline
Within the same pipeline it is also possible to run multiple models:
a) Model Segmentation
b) Model Ensemble
c) Model Chaining
d) Model Composition
http://dmg.org/pmml/v4-1/MultipleModels.html
http://dl.acm.org/citation.cfm?id=1859403
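Three of the four patterns above (segmentation, ensemble, chaining) can be sketched as small combinators over models-as-functions. This is an illustrative reading of the terms, not the PMML definition; the helper names and toy models are assumptions.

```python
from statistics import fmean


def ensemble(models):
    """Model ensemble: average the predictions of several models."""
    return lambda x: fmean([m(x) for m in models])


def chain(first, second):
    """Model chaining: the output of one model is the input of the next."""
    return lambda x: second(first(x))


def segment(selector, models_by_segment):
    """Model segmentation: pick a model based on a property of the input."""
    return lambda x: models_by_segment[selector(x)](x)


def m1(x):
    return x + 1.0      # toy model


def m2(x):
    return 3.0 * x      # toy model


def size_of(x):
    return "small" if x < 10 else "large"


by_size = segment(size_of, {"small": m1, "large": m2})

print(ensemble([m1, m2])(2.0))
print(chain(m1, m2)(2.0))
print(by_size(5.0), by_size(20.0))
```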
Model Development & Production
Data Scientist
GO
Data Engineer
Model Standardization
Figure: ML Framework -> Export -> Model Definition (PFA - Portable Format For Analytics) -> Import -> Evaluation, with Data going in and Predictions coming out.
- PFA or PMML won’t break the pipeline. PFA is more flexible than PMML.
“Unlike PMML, PFA has control structures to direct program flow, a true type system for both
model parameters and data, and its statistical functions are much more finely grained and can
accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/)
- Custom model definitions and implementations are more flexible or more
optimized but could break the pipeline.
- Some implementations:
- https://github.com/jpmml/jpmml-evaluator-spark
- https://github.com/jpmml
- https://github.com/opendatagroup/hadrian
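The export/import idea can be illustrated with a toy, made-up JSON model definition. This is neither PFA nor PMML; the field names (`type`, `weights`, `bias`) and both functions are assumptions chosen for the sketch. What it shows is the separation: the training side only writes a document, and the serving side only needs an evaluator for that document format.

```python
import json


def export_linear_model(weights, bias):
    """Training side: serialize the model as a portable document (toy format)."""
    return json.dumps({"type": "linear", "weights": list(weights), "bias": bias})


def import_model(document):
    """Serving side: rebuild a scoring function from the document alone."""
    spec = json.loads(document)
    if spec["type"] != "linear":
        raise ValueError("unsupported model type: " + spec["type"])
    weights, bias = spec["weights"], spec["bias"]
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias


document = export_linear_model([0.5, 2.0], 1.0)
scorer = import_model(document)

print(document)
print(scorer([2.0, 1.0]))
```

A custom format like this one is exactly the kind of definition that "could break the pipeline": every consumer has to agree on it, which is the problem standards such as PMML and PFA address.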
Model Lifecycle
Some concerns about model lifecycle:
- Model evolution
- Model release practices
- Model versioning
- Model update process
Model Governance
A model should:
● be governed by the company’s policies and procedures, laws and regulations,
and the organization’s goals
● be searchable across the company
● be transparent, explainable, traceable and interpretable for auditors and
regulators. Example GDPR requirements:
https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr/
● have an approval and release process
Model Server
“A model server is a system which handles the lifecycle of a model and provides
the required APIs for deploying a model/pipeline.”
Images: CLIPPER (https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/), TensorFlow Serving (https://www.tensorflow.org/serving/)
Model Serving - Requirements
Other requirements:
- Response time: the time to calculate a prediction. It could be a few milliseconds.
- Throughput: predictions per second.
- Support for running multiple models. It is very common to run hundreds of models,
e.g. a telecom operator with one model per customer, or in IoT one
model per site/sensor.
- Support for multiple versions of the same machine learning pipeline within the system.
One reason can be A/B testing.
- Model update: how quickly and easily can a model be updated?
- Uptime/reliability
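The requirements above (many models, several versions side by side, updates without downtime, A/B testing) can be sketched with a small in-memory registry. `ModelServer` and `ab_version` are hypothetical names for this example and do not correspond to any real serving system's API.

```python
import threading
import zlib


class ModelServer:
    """Toy in-memory model server keyed by (model name, version)."""

    def __init__(self):
        self._models = {}
        self._lock = threading.Lock()

    def deploy(self, name, version, model):
        """Add or replace a model version while the server keeps running."""
        with self._lock:
            self._models[(name, version)] = model

    def predict(self, name, x, version=None):
        """Score with the requested version, or the latest one by default."""
        with self._lock:
            available = sorted(v for (n, v) in self._models if n == name)
            if not available:
                raise KeyError(name)
            chosen = available[-1] if version is None else version
            model = self._models[(name, chosen)]
        return model(x)          # scoring happens outside the lock


def ab_version(request_id, version_a, version_b, share_b=0.5):
    """Deterministically route a request to version A or B (A/B testing)."""
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return version_b if bucket < share_b * 100 else version_a


server = ModelServer()
server.deploy("churn", 1, lambda x: x + 1)
server.deploy("churn", 2, lambda x: x * 10)     # new version, no restart needed

print(server.predict("churn", 3))               # latest version
print(server.predict("churn", 3, version=1))    # pinned version
print(server.predict("churn", 3, version=ab_version("user-42", 1, 2)))
```

Hashing the request id (rather than drawing a random number) keeps a given user on the same version across requests.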
TensorFlow Serving Issues
Not all systems cover the requirements. For example:
● Metadata not available. (https://github.com/tensorflow/serving/issues/612)
● No new models at runtime: (https://github.com/tensorflow/serving/issues/422)
● Can be hard to build from scratch:
https://github.com/tensorflow/serving/issues/327
Model Serving with Apache Flink
Apache Flink: a streaming engine based on the Beam model, with low latency
compared to Spark.
Idea: Exploit Flink’s low-latency capabilities for serving models. Focus on offline
models loaded from permanent storage and update them without interruption.
FLIP Proposal:
(https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8oGRPsPuk8)
Combines different efforts: https://github.com/FlinkML
● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/)
● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky)
● https://github.com/FlinkML/flink-tensorflow (Eron Wright)
Use a control stream and a data stream. Keep the model in the operator’s state. Join the streams.
Flink provides two ways of implementing low-level joins: a key-based join built on CoProcessFunction and
a partition-based join built on RichCoFlatMapFunction.
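The control-stream/data-stream pattern can be simulated in plain Python. This is not the Flink API: `ModelServingOperator`, `on_control` and `on_data` are hypothetical names that only mimic the idea of a two-input operator holding the current model in its state.

```python
class ModelServingOperator:
    """Toy two-input operator: one input carries models, the other carries data."""

    def __init__(self):
        self.model = None            # operator state: the model currently served

    def on_control(self, new_model):
        """A new model arrived on the control stream: swap it in, emit nothing."""
        self.model = new_model
        return []

    def on_data(self, record):
        """A record arrived on the data stream: score it with the current model."""
        if self.model is None:
            return []                # no model yet; this sketch simply drops it
        return [self.model(record)]


def run(events, operator):
    """Feed interleaved (kind, payload) events to the operator, collect outputs."""
    out = []
    for kind, payload in events:
        handler = operator.on_control if kind == "control" else operator.on_data
        out.extend(handler(payload))
    return out


events = [
    ("data", 1.0),                          # arrives before any model exists
    ("control", lambda x: x * 2),           # first model version
    ("data", 1.0),
    ("control", lambda x: x + 100),         # model updated without stopping
    ("data", 1.0),
]

print(run(events, ModelServingOperator()))
```

In a real streaming job the state would be checkpointed by the engine, and records arriving before the first model could be buffered instead of dropped.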
More here:
https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
Data Lakes
How can we work with data to cover future needs and use cases? We need a
robust ML framework plus flexible infrastructure. Data warehouses will not work.
Data lakes to the rescue.
“A data lake is a method of storing data within a system or repository, in its natural
format, that facilitates the collocation of data in various schemata and structural
forms, usually object blobs or files.”
- Wikipedia
● Agility. It can be seen as a tool that makes data accessible to different users
and facilitates ML.
● Designed for low-cost storage
● Schema on read
● Security and governance still maturing.
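"Schema on read" means the raw records are stored as they arrive and a schema is applied only when a consumer reads them. A minimal sketch, with invented sensor records and a hypothetical `read_with_schema` helper:

```python
import json

# Raw records kept in their natural format; fields and types vary per record.
RAW = [
    '{"sensor": "a", "temp": "21.5", "site": "athens"}',
    '{"sensor": "b", "temp": 19}',
    '{"sensor": "c"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields, cast them, fill defaults."""
    for line in lines:
        raw = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            row[field] = cast(raw[field]) if field in raw else default
        yield row


# Each consumer brings its own schema; this one only needs two fields.
schema = {"sensor": (str, None), "temp": (float, None)}
rows = list(read_with_schema(RAW, schema))

for row in rows:
    print(row)
```

A different consumer could read the same raw lines with a schema that includes `site`, without any rewrite of the stored data.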
Data Lake Issues
“Through 2018, 80% of data lakes will not include effective metadata management
capabilities, making them inefficient.”
- Gartner
Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM
Watson Platform etc.
Notebooks
Very convenient for the data scientist or the analyst.
Production is usually based on traditional deployment methods.
- Spark Notebook
- Apache Zeppelin
- Jupyter
ML with Apache Spark
“A popular big data framework for ML and data-science.”
- You can work locally and move to production fast
- ETL/Feature Engineering
- Hyper-parameter tuning
- Rich Model support
- Multiple language support (Scala, Java, Python, R)
Apache Spark - Intro
A framework for distributed in-memory data processing.
- The user defines computations/operations (map, flatMap etc.) on the data-sets
(bounded or not) as a DAG.
- The DAG is shipped to the nodes where the data lie, the computation is executed and
the results are sent back to the user.
- The data-sets are treated as immutable distributed data (RDDs).
- A Resilient Distributed Dataset (RDD) is an immutable distributed
collection of objects.
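The "define operations first, execute later" model can be imitated on a single machine. The class below is a plain-Python illustration, not Spark's API: `LazyDataset`, `flat_map` and `collect` are assumed names, and the plan is a linear chain, the simplest form of a DAG.

```python
class LazyDataset:
    """Immutable dataset handle: transformations only extend a plan."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)           # recorded operations, never mutated

    def map(self, f):
        return LazyDataset(self._source, self._ops + (("map", f),))

    def flat_map(self, f):
        return LazyDataset(self._source, self._ops + (("flat_map", f),))

    def collect(self):
        """Action: only now is the recorded plan executed over the data."""
        data = iter(self._source)
        for kind, f in self._ops:
            data = self._apply(kind, f, data)
        return list(data)

    @staticmethod
    def _apply(kind, f, data):
        if kind == "map":
            return (f(x) for x in data)
        return (y for x in data for y in f(x))


lines = LazyDataset(["a b", "c"])
words = lines.flat_map(str.split)        # nothing runs yet
upper = words.map(str.upper)             # still nothing runs

print(upper.collect())
print(lines.collect())                   # the original dataset is unchanged
```

In a distributed engine the same recorded plan is what gets shipped to the nodes that hold the data partitions.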
Apache Spark - Basic Example in Scala
Basic statistics: a hello world for ML.
There are three APIs: RDD, DataFrames, Datasets
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
                 RDD      DataFrames (SQL)  Datasets
Syntax Errors    Runtime  Compile Time      Compile Time
Analysis Errors  Runtime  Runtime           Compile Time
“Datasets support encoders which allow to map semi-structured formats (eg
JSON) to constructs of type safe languages (Scala, Java). Also they have better
performance compared to java serialization or kryo.”
MLlib
A library for machine learning on top of Spark. Has two APIs:
- RDD based (spark.mllib).
- Datasets / Dataframes based (spark.ml).
The latter is relatively new and makes it easier to construct an ML pipeline or run an
algorithm. The former is older and has more features.
“As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered
maintenance mode.”
What are the implications?
● MLlib will still support the RDD-based API in spark.mllib with bug fixes.
● MLlib will not add new features to the RDD-based API.
● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
● The RDD-based API is expected to be removed in Spark 3.0.
Supports different categories of ML algorithms:
● Basic statistics (correlations etc)
● Pipelines (LSH, TF-IDF)
● Extracting, transforming and selecting features
● Classification and Regression (Random forests, Gradient boosted trees)
● Clustering (K-means, LDA, etc)
● Collaborative filtering
● Frequent Pattern Mining
● Model selection and tuning
Enables use cases such as fraud detection, recommendation engines, ...
MLlib Local
A new package is available for production use of the algorithms without the need
for Spark itself. How does this method compare with PMML?
https://issues.apache.org/jira/browse/SPARK-13944
https://issues.apache.org/jira/browse/SPARK-16365
MLlib - Unsupervised Learning Example
Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data
It describes wine quality along different dimensions such as chlorides, sugar etc.
We will apply k-means to identify different clusters of wine quality.
The example is implemented with both the mllib and the ml APIs, as Spark notebooks.
Normalize Data -> K-means -> PCA -> Visualize
Steps shown in the notebooks:
- Parse the data
- Train k-means with different k
- Count errors for the elbow method
- PCA analysis to verify k-means with k=2
- PCA, K=2
Available with the mllib implementation
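The normalize, k-means and elbow steps can be sketched without Spark. The code below is a plain-Python stand-in, not the notebook code from the talk and not the MLlib API; the six 2-D points are invented toy data in place of the wine data set, and the PCA step is left out.

```python
import random


def normalize(rows):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns the centers and the within-cluster squared error."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[idx].append(p)
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost


# Invented toy data: two well separated groups.
raw = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
points = normalize(raw)

# Elbow method: train with different k and compare the errors.
costs = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
for k, cost in costs.items():
    print(k, round(cost, 4))
```

The error drops sharply from k=1 to k=2 and only slightly afterwards, which is the "elbow" that suggests k=2 for this toy data.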
Spark Deep Learning Pipelines
- People know SQL
- Models are productized as SQL UDFs.
Predictions as a SQL statement:
SELECT my_custom_keras_model_udf(image) AS predictions FROM my_spark_image_table
https://github.com/databricks/spark-deep-learning
BigDL
● Developed by Intel.
● It does not use GPUs; it is optimized for Intel processors:
“It is orders of magnitude faster than out-of-box open source Caffe, Torch or
TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).”
● It is implemented as a standalone package on Spark.
● Can be used with existing Spark or Hadoop clusters.
● High performance, powered by Intel MKL and multi-threaded programming.
● Easily scaled out.
● Appropriate for users who are not DL experts.
● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and
testing machine learning models.
● A lot of useful features: loss functions, layer support etc.
● Implements a parameter server for distributed training of DL models
● Supports visualization via TensorBoard:
https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
BigDL in practice
For a cool example of using BigDL on Mesos, check out our blog:
http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/
Thank you! Questions?
https://github.com/skonto/talks/blob/master/big-data-italy-2017/ml/references.md
Machine learning at scale challenges and solutions

  • 1. @s_kontopoulos Machine Learning at Scale: Challenges and Solutions Stavros Kontopoulos Senior Software Engineer @ Lightbend, M.Sc.
  • 2. @s_kontopoulos Who am I? 2 skonto s_kontopoulos S. Software Engineer @ Lightbend, Fast Data Team Apache Flink Contributor at SlideShare stavroskontopoulos stavroskontopoulos All trademarks and registered trademarks are property of their respective holders.
  • 3. @s_kontopoulos Agenda - ML in the Enterprise - ML from development to production - Key technologies: Apache Spark as a case study 3
  • 4. @s_kontopoulos ML in the Enterprise ML is a key tool that fuels the effort of coupling business monitoring (BI) with predictive and prescriptive analytics. business insights -> business optimization -> data monetization 4
  • 5. @s_kontopoulos ML in the Enterprise - The Data-Science LifeCycle Identify Business Question Identify and collect related Data Data cleansing, feature extraction (Data pre-processing) Experiment planning Model Building Model Evaluation Model Deployment/Management in Production Model Optimization - Performance 5
  • 6. @s_kontopoulos Machine Learning Model A model is a function that maps inputs to outputs and essentially expresses a mathematical abstraction. Linear Regression: Neural Network: Random Forest: Function composition 6
  • 7. @s_kontopoulos Model Evolution - Models can be either pre-computed eg. trained off-line or updated on-line. - Online ML with Streaming: - Pure online means only use the latest arrived data point to update the model. Usually models are updated per batch/window eg. online k-means though. - An interesting case is when we sample the stream and train a model only when the distribution changes. - Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling - Re-train the model by ignoring the previous one. 7
  • 8. @s_kontopoulos Machine Learning Pipeline Machine learning pipeline in Production: describes all steps from data preprocessing before feeding the model to model output processing (post-processing). 8
  • 9. @s_kontopoulos Machine Learning Pipeline in Libraries Pros: - Data and test data go through the same steps - Like a CI (continuous integration) pipeline people can reason about data transformation - Caching of computations - Model serving easier 9
  • 10. @s_kontopoulos Multiple Models in a Pipeline Within the same pipeline it is also possible to run multiple models: a) Model Segmentation b) Model Ensemble c) Model Chaining d) Model Composition http://dmg.org/pmml/v4-1/MultipleModels.html http://dl.acm.org/citation.cfm?id=1859403 10
  • 11. @s_kontopoulos Model Development & Production Data Scientist All trademarks and registered trademarks are property of their respective holders. GO Data Engineer 11
  • 12. @s_kontopoulos Model Standardization 12 ML Framework Model Definition Evaluation Data Predictions Export Import PFA - Portable Format For Analytics
  • 13. @s_kontopoulos Model Standardization 13 - PFA or PMML won’t break the pipeline. PFA is more flexible than PMML. “Unlike PMML, PFA has control structures to direct program flow, a true type system for both model parameters and data, and its statistical functions are much more finely grained and can accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/) - Custom model definitions and implementations are more flexible or more optimized but could break the pipeline. - Some Implementations: - https://github.com/jpmml/jpmml-evaluator-spark - https://github.com/jpmml - https://github.com/opendatagroup/hadrian
  • 14. @s_kontopoulos Model Lifecycle Some concerns about model lifecycle: - Model evolution - Model release practices - Model versioning - Model update process 14
  • 15. @s_kontopoulos Model Governance ● governed by the company’s policies and procedures, laws and regulations and organization’s goals ● searchable across company ● be transparent, explainable, traceable and interpretable for auditors and regulators. Example GDPR requirements: https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in- the-gdpr/ ● have approval and release process 15
  • 16. @s_kontopoulos Model Server “A model server is a system which handles the lifecycle of a model and provides the required APIs for deploying a model/pipeline.” Image: https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/ Image: https://www.tensorflow.org/serving/ CLIPPER Tensorflow Serving 16
  • 17. @s_kontopoulos Model Serving - Requirements Other requirements: - Response time - time to calculate a prediction. Could be a few mills. - Throughput - predictions per second. - Support for running multiple models (very common to run hundreds of models eg. A telecom operator where there is one model per customer or in IoT one model per site/sensor). 17
  • 18. @s_kontopoulos Model Serving - Requirements - multiple versions of the same machine learning pipeline within the system. One reason can be A/B testing. - Model update- How quickly and easy a model can be updated? - Uptime/reliability 18
  • 19. @s_kontopoulos Tensorflow Serving Issues Not all systems cover the requirements. For example: ● Metadata not available. (https://github.com/tensorflow/serving/issues/612) ● No new models at runtime: (https://github.com/tensorflow/serving/issues/422) ● Can be hard to build from scratch: https://github.com/tensorflow/serving/issues/327 19
  • 20. @s_kontopoulos Model Serving with Apache Flink Apache Flink: Low latency compared to Spark streaming engine based on the Beam model. 20
  • 21. @s_kontopoulos Model Serving with Apache Flink Idea: Exploit Flink’s low latency capabilities for serving models. Focus on offline models loaded from a permanent storage and update them without interruption. FLIP Proposal: (https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8 oGRPsPuk8) Combines different efforts: https://github.com/FlinkML ● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/) ● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky) ● https://github.com/FlinkML/flink-tensorflow (Eron Wright) 21
  • 22. @s_kontopoulos Model Serving with Apache Flink 22 Use a control stream and a data Stream. Keep model in operator’s state. Join the streams. Flink provides 2 ways of implementing low-level joins - key based join based on CoProcessFunction and partitions-based join based on RichCoFlatMapFunction.
  • 23. Model Serving with Apache Flink
      More here: https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
  • 24. Data Lakes
      How can we work with data so that future needs and use cases are covered? We need a robust ML framework plus flexible infrastructure. Data warehouses will not work. Data lake to the rescue.
      “A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.” - Wikipedia
  • 25. Data Lakes
      ● Agility. It can be seen as a tool that makes data accessible to different users and facilitates ML.
      ● Designed for low-cost storage.
      ● Schema on read.
      ● Security and governance are still maturing.
  • 26. Data Lake Issues
      “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” - Gartner
      Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM Watson Platform, etc.
  • 27. Notebooks
      Very convenient for the data scientist or the analyst. Production is usually based on traditional deployment methods.
      - Spark Notebook
      - Apache Zeppelin
      - Jupyter
  • 28. ML with Apache Spark
      “A popular big data framework for ML and data-science.”
      - You can work locally and move to production fast
      - ETL/Feature Engineering
      - Hyper-parameter tuning
      - Rich model support
      - Multiple language support (Scala, Java, Python, R)
  • 29. Apache Spark - Intro
      A framework for distributed in-memory data processing.
  • 30. Apache Spark - Intro
      - The user defines computations/operations (map, flatMap, etc.) on the data-sets (bounded or not) as a DAG.
      - The DAG is shipped to the nodes where the data lie, the computation is executed and the results are sent back to the user.
      - The data-sets are treated as immutable distributed data (RDDs).
      - A Resilient Distributed Dataset (RDD) is an immutable distributed collection of objects.
  • 31. Apache Spark - Basic Example in Scala
      Basic statistics, a hello world for ML.
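      As a rough illustration of what "basic statistics" covers, the plain-Python sketch below computes a mean, a sample variance and a Pearson correlation for two small columns. It is not the Scala/Spark example from the slide; the column names and values are made up for the illustration.

```python
import math
import statistics
from typing import List

# Hypothetical measurements for two features of the same five samples.
sugar: List[float] = [1.0, 2.0, 3.0, 4.0, 5.0]
density: List[float] = [2.1, 3.9, 6.2, 7.8, 10.1]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


sugar_mean = statistics.mean(sugar)
sugar_variance = statistics.variance(sugar)  # sample variance (divides by n - 1)
correlation = pearson(sugar, density)

print(f"mean={sugar_mean}, variance={sugar_variance}, correlation={correlation:.4f}")
```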
  • 32. Apache Spark - Intro
      There are three APIs: RDD, DataFrames, Datasets
      https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

                        RDD       DataFrames (SQL)   Datasets
      Syntax Errors     Runtime   Compile Time       Compile Time
      Analysis Errors   Runtime   Runtime            Compile Time
  • 33. Apache Spark - Intro
      “Datasets support encoders which allow to map semi-structured formats (eg JSON) to constructs of type safe languages (Scala, Java). Also they have better performance compared to java serialization or kryo.”
  • 34. MLlib
      A library for machine learning on top of Spark. It has two APIs:
      - RDD-based (spark.mllib).
      - Datasets/DataFrames-based (spark.ml).
      The latter is relatively new and makes it easier to construct an ML pipeline or run an algorithm. The former is older and has more features.
  • 35. MLlib
      “As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode.”
      What are the implications?
      ● MLlib will still support the RDD-based API in spark.mllib with bug fixes.
      ● MLlib will not add new features to the RDD-based API.
      ● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
      ● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
      ● The RDD-based API is expected to be removed in Spark 3.0.
  • 36. MLlib
      Supports different categories of ML algorithms:
      ● Basic statistics (correlations, etc.)
      ● Pipelines (LSH, TF-IDF)
      ● Extracting, transforming and selecting features
      ● Classification and regression (random forests, gradient-boosted trees)
      ● Clustering (K-means, LDA, etc.)
      ● Collaborative filtering
      ● Frequent pattern mining
      ● Model selection and tuning
      These make it possible to implement fraud detection, recommendation engines, ...
  • 37. MLlib Local
      A new package is available for using the algorithms in production without needing Spark itself. How does this method compare with PMML?
      https://issues.apache.org/jira/browse/SPARK-13944
      https://issues.apache.org/jira/browse/SPARK-16365
  • 38. MLlib - Unsupervised Learning Example
      Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data
      It describes wine quality along different dimensions such as chlorides, sugar, etc. We will apply k-means to identify different clusters of wine quality. Both the mllib and the ml implementations are available as Spark notebooks.
      Steps: Normalize Data -> K-means -> PCA -> Visualize
  • 39. MLlib - Unsupervised Learning Example
      Parse the data; train k-means with different values of k.
  • 40. MLlib - Unsupervised Learning Example
      Counting errors for the elbow method.
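      The elbow method can be illustrated without Spark. The sketch below is plain Python, not the notebook code from the talk, and assumes that "errors" means the within-cluster sum of squared errors (WSSSE). The six 2-D points, the deterministic initialization and the helper names are made up for the illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Toy data with two obvious groups, standing in for the normalized wine features.
points: List[Point] = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
    (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(data: List[Point], k: int, iterations: int = 10) -> List[Point]:
    """Plain Lloyd iterations with a deterministic (evenly spaced) initialization."""
    n = len(data)
    centers = [data[i * n // k] for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in members) / len(members),
             sum(p[1] for p in members) / len(members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers


def wssse(data: List[Point], centers: List[Point]) -> float:
    """Within-cluster sum of squared errors: distance of each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in data)


# Train with different k and record the error for each.
costs: Dict[int, float] = {k: wssse(points, kmeans(points, k)) for k in (1, 2, 3)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

      On this toy data the error drops sharply from k=1 to k=2 and only slightly from k=2 to k=3; that bend is the "elbow" that suggests k=2.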
  • 41. MLlib - Unsupervised Learning Example
      PCA analysis to verify k-means with k=2.
  • 42. MLlib - Unsupervised Learning Example
      PCA, k=2.
  • 43. MLlib - Unsupervised Learning Example
      Available with the mllib implementation.
  • 44. Spark Deep Learning Pipelines
      - People know SQL.
      - Models are productized as SQL UDFs. A prediction becomes a SQL statement:
        SELECT my_custom_keras_model_udf(image) AS predictions FROM my_spark_image_table
      https://github.com/databricks/spark-deep-learning
  • 45. BigDL
      ● Developed by Intel.
      ● It does not use GPUs; it is optimized for Intel processors. “It is orders of magnitude faster than out-of-box open source Caffe, Torch or TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).”
      ● It is implemented as a standalone package on Spark.
      ● Can be used with existing Spark or Hadoop clusters.
      ● High performance, powered by Intel MKL and multi-threaded programming.
      ● Easily scaled out.
      ● Appropriate for users who are not DL experts.
  • 46. BigDL
      ● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and testing machine learning models.
      ● Many useful features: loss functions, layer support, etc.
      ● Implements a parameter server for distributed training of DL models.
      ● Supports visualization via TensorBoard: https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
  • 47. BigDL in practice
      For a cool example of using BigDL on Mesos, check our blog: http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/