Operationalizing Edge Machine Learning
with Apache Spark
ParallelM
Growing AI Investments; Few Deployed at Scale
Source: “Artificial Intelligence: The Next Digital Frontier?”, McKinsey Global Institute, June 2017
• Out of 160 reviewed AI use cases, 88% did not progress beyond the experimental stage
• But successful early AI adopters report profit margins 3–15% higher than industry average
• Survey of 3073 AI-aware C-level executives: 20% AI in production; 80% developing, experimenting, contemplating
Challenges of Deploying & Managing ML in Production
• Diverse focus and expertise of Data Science & Ops teams
• Increased risk from non-deterministic nature of ML
• Current Operations solutions do not address uniqueness of ML Apps
Challenges of Edge/Distributed Topologies
• Varied resources at each level
• Scale, heterogeneity, disconnected operation
IoT is Driving Explosive Growth in Data Volume
Diagram labels: “Things”; Edge and Network; Data Center/Cloud; Data Lake
What We Need For Operational ML
• Accelerate deployment & facilitate collaboration between Data & Ops teams
• Monitor validity of ML predictions, diagnose data and ML performance issues
• Orchestrate training, update, and configuration of ML pipelines across
distributed, heterogeneous infrastructure with tracking
What We Need For Edge Operational ML
• Distribute analytics processing to the optimal point for each use case
• Flexible management framework enables:
• Secure centralized and/or local learning, prediction, or combined learning/prediction
• Granular monitoring and control of model update policies
• Support multi-layer topologies to achieve maximum scale while accommodating low
bandwidth or unreliable connectivity
Diagram labels: Sources, Streams, Edge Intelligence, Batches, Data Lake, Central/Cloud Intelligence, Edge/Cloud Orchestration
MLOps – Managing the full Production ML Lifecycle
Diagram labels: ML Orchestration, ML Health, Business Impact, Model Governance, Continuous Integration/Deployment, Database, Business Value, Machine Learning Models
Our Approach
Diagram labels: MCenter, MCenter Server, MCenter Agent (four instances), MCenter Developer Connectors, Analytic Engines, and more…, Data Science Platforms, Data Streams, Data Lakes
Diagram flow labels: Models, Retraining; Control, Statistics; Events, Alerts; Data
Operational Abstraction
• Link pipelines (training and inference) via
an ION (Intelligence Overlay Network)
• Basically a Directed Graph
representation with allowance for cycles
• Pipelines are DAGs within each engine
• Distributed execution over
heterogeneous engines, programming
languages and geographies
Diagram label: Policy-based Update
Example – KMeans Batch Training Plus Streaming Inference Anomaly Detection
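The ION idea above (pipelines as nodes, linked in a directed graph that may contain cycles) can be sketched as a small data structure. This is only an illustration, not MCenter's implementation: the class names, the `engine` and `kind` fields, and the node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    engine: str  # hypothetical label for the engine that runs the pipeline
    kind: str    # hypothetical label: "training", "inference" or "policy"


@dataclass
class ION:
    """Directed graph of pipeline nodes; cycles are allowed."""
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)  # node name -> downstream names

    def add_node(self, node):
        self.nodes[node.name] = node
        self.edges.setdefault(node.name, [])

    def link(self, src, dst):
        # Both endpoints must already be registered.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unknown node in link {src!r} -> {dst!r}")
        self.edges[src].append(dst)

    def has_cycle(self):
        # Depth-first search with three colors: a back edge to a GRAY node
        # (one still on the current DFS path) means the graph has a cycle.
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {name: WHITE for name in self.nodes}

        def visit(name):
            color[name] = GRAY
            for nxt in self.edges[name]:
                if color[nxt] == GRAY:
                    return True
                if color[nxt] == WHITE and visit(nxt):
                    return True
            color[name] = BLACK
            return False

        return any(color[n] == WHITE and visit(n) for n in self.nodes)
```

A training node linked to a policy node linked to an inference node forms an acyclic chain; adding a feedback edge from inference back to training produces the kind of cycle the slide says an ION permits, while each individual pipeline remains a DAG inside its own engine.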
An Example ION to Resource Mapping
• Edge: every 5 min
• Cloud: every Tuesday at 10AM
• Human Approved Model Update
Diagram labels: Sources, Streams, Edge Intelligence, Batches, Data Lake, Central/Cloud Intelligence, Models
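The two schedules in this mapping ("every 5 min" and "every Tuesday at 10AM") can be computed with the standard library. This is a minimal sketch under assumptions: the function names are hypothetical, times are naive local datetimes, and the slides do not say how MCenter actually schedules runs.

```python
from datetime import datetime, timedelta

TUESDAY = 1  # datetime.weekday() numbers Monday as 0


def next_interval_run(last_run, minutes):
    """Next run of a fixed-interval job, e.g. every 5 minutes."""
    return last_run + timedelta(minutes=minutes)


def next_weekly_run(now, weekday, hour):
    """Next occurrence of `weekday` at `hour`:00 strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # Today is the right weekday but the slot has already passed.
        candidate += timedelta(days=7)
    return candidate
```

For example, `next_weekly_run(now, TUESDAY, 10)` would give the next cloud retraining slot, and `next_interval_run(last_run, 5)` the next edge inference run.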
Pipeline Examples
• Training Pipeline (SparkML)
• Inference Pipeline (SparkML)
Instrument, Upload, Orchestrate, Monitor
Integrating with Analytics Engines (Spark)
Job Management
• Via SparkLauncher: A library to control launching, monitoring and terminating jobs
• PM Agent communicates with Spark through this library for job management (also uses Java
API to launch child processes)
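SparkLauncher is a JVM library, so the sketch below is a stand-in rather than the agent's actual code: it builds an equivalent `spark-submit` command line in Python and starts it as a child process. The helper names and the example application path are hypothetical; `--master`, `--class` and `--conf` are standard `spark-submit` options.

```python
import subprocess


def build_spark_submit(app, master, main_class=None, conf=None, app_args=()):
    """Return the argv list for a spark-submit invocation."""
    cmd = ["spark-submit", "--master", master]
    if main_class:
        cmd += ["--class", main_class]
    # Sort so the generated command is deterministic.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    cmd += list(app_args)
    return cmd


def launch(cmd):
    """Start the command as a child process and return its handle."""
    return subprocess.Popen(cmd)
```

The returned `Popen` handle covers the local process only: `poll()` reports whether it is still running and `terminate()` stops it.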
Statistics
• Via SparkListener: A Spark-driver callback service
• SparkListener taps into all accumulators, which are one of the popular ways to expose statistics
• PM Agent communicates with the Spark driver and exposes statistics via a REST endpoint
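Exposing collected statistics over REST can be illustrated with a small standard-library HTTP server. This is a sketch under assumptions, not the agent's real interface: the `/stats` path, the JSON layout, and the `StatsStore` class are hypothetical, and the store would be filled by whatever component reads accumulator values from the driver.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatsStore:
    """Thread-safe holder for the latest value of each named statistic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def update(self, name, value):
        with self._lock:
            self._values[name] = value

    def snapshot(self):
        with self._lock:
            return dict(self._values)


def make_server(store, host="127.0.0.1", port=0):
    """Build an HTTP server that serves the store as JSON at /stats."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/stats":
                self.send_error(404)
                return
            body = json.dumps(store.snapshot()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    # port=0 lets the OS choose a free port; read it from server_address.
    return ThreadingHTTPServer((host, port), Handler)
```

Calling `serve_forever()` on the returned server in a background thread makes the latest snapshot available to any HTTP client.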
ML Health / Model collection and updates
• PM Agent delivers and receives health events, health objects and models via sockets from
custom PM components in the ML Pipeline
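The slides do not describe the wire format used on these sockets. As an illustration only, the sketch below assumes newline-delimited JSON messages with hypothetical `kind` and `payload` fields, sent from a pipeline component to the agent.

```python
import json
import socket


def send_event(sock, kind, payload):
    """Send one event as a single JSON line."""
    message = json.dumps({"kind": kind, "payload": payload}) + "\n"
    sock.sendall(message.encode("utf-8"))


def read_events(sock):
    """Yield decoded events until the peer closes the connection."""
    with sock.makefile("r", encoding="utf-8") as stream:
        for line in stream:
            if line.strip():
                yield json.loads(line)
```

With a connected pair of sockets, a component would call `send_event(sock, "health", {...})` for a health statistic or `send_event(sock, "model", {...})` to announce a model, and the agent would iterate over `read_events(sock)` on its end.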
Demo Description
Training
Inference
Thank You!
Nisha Talagala
nisha.talagala@parallelm.com
Vinay Sridhar
vinay.sridhar@parallelm.com
Integrating with Analytics Engines (TensorFlow)
Job Management
• TensorFlow Python programs run as standalone applications
• Standard OS process control mechanisms are used to monitor and control TensorFlow programs
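OS-level process control of a standalone Python program might look like the following sketch. The function names and the grace period are assumptions, and the slides do not say which mechanisms the agent actually uses.

```python
import subprocess
import sys


def run_program(script_args):
    """Start a Python program as a standalone child process."""
    return subprocess.Popen([sys.executable, *script_args])


def stop_gracefully(proc, grace_seconds=5.0):
    """Ask the process to exit, then force-kill it if it does not."""
    if proc.poll() is not None:
        return proc.returncode  # already finished
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()
```

Monitoring is then a matter of calling `proc.poll()` periodically: `None` means the program is still running, and any other value is its exit code.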
Statistics Collection
• PM Agent parses the contents of TensorBoard log files to extract the statistics and events that data scientists added
ML Health / Model collection
• Generation of models and health objects is recorded on a shared medium
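If the shared medium is a directory that both the TensorFlow program and the agent can reach, new models can be picked up by polling it. This is a minimal sketch under that assumption; the class name and the `*.model` file pattern are hypothetical.

```python
from pathlib import Path


class SharedDirWatcher:
    """Report files that appeared in a shared directory since the last poll."""

    def __init__(self, directory, pattern="*.model"):
        self.directory = Path(directory)
        self.pattern = pattern
        self._seen = set()

    def poll(self):
        """Return not-yet-reported files, oldest modification time first."""
        current = sorted(
            self.directory.glob(self.pattern),
            key=lambda p: (p.stat().st_mtime, p.name),
        )
        fresh = [p for p in current if p.name not in self._seen]
        self._seen.update(p.name for p in fresh)
        return fresh
```

A writer would typically save a model under a temporary name and rename it once complete, so that a poll never sees a half-written file.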
An Example ION
• Node 1: Inference Pipeline (every 5 min on Spark cluster 1)
• Node 2: (Re)Training Pipeline (every Tuesday at 10AM on Spark cluster 2)
• Node 3: Policy (human approves/rejects; when: anytime there is a new model)
Edges between nodes are labeled: Model
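The policy node in this example (a human approves or rejects each new model before it reaches inference) can be sketched as a small gate. The class, its method names, and the model identifiers are hypothetical and only illustrate the flow.

```python
class ApprovalPolicy:
    """Hold new models until a human approves or rejects them."""

    def __init__(self, deploy):
        self._deploy = deploy  # callback that hands a model to inference
        self._pending = set()

    def on_new_model(self, model_id):
        # Triggered any time the training node produces a model.
        self._pending.add(model_id)

    def pending(self):
        return sorted(self._pending)

    def review(self, model_id, approved):
        """Record the human decision; forward the model only if approved."""
        if model_id not in self._pending:
            raise KeyError(f"no pending model {model_id!r}")
        self._pending.discard(model_id)
        if approved:
            self._deploy(model_id)
        return approved
```

In this sketch a rejected model is simply dropped, so the inference pipeline keeps running with whichever model was approved last.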
