Gene Davis, Splice Machine
Splice Machine’s use of Apache Spark and MLflow
#UnifiedAnalytics #SparkAISummit
Splice Machine
• What are we?
– A scale-out RDBMS that enables simultaneous transactions (OLTP)
and analytics (OLAP)
– Powers Operational AI: the ability to run AI applications in real time
• Who uses us?
– Companies in financial services, healthcare, supply chain, etc.
– One example: 7 PB, 2B record updates/day, 2M queries/day with sub-second response time
• How do we do it?
– Transactional SQL engine on top of HBase and Spark
– "Dual engine" architecture
– Many delivery options: on-premises or cloud service (AWS, Azure, bespoke cloud, etc.)
Operational AI
Integrated data platform for real-time AI applications
• Operational Database
– Scale-out
– OLTP
– Fast
• Enterprise Data Warehouse
– In-Memory
– OLAP
– Massively Parallel
• ML Models
– Notebooks
– Algorithms
– Model Workflow
• Diagram labels: Operational Intelligence, Business Intelligence, Artificial Intelligence, Intelligent Decisions, On Premise
The Three Dimensions of Intelligence
• OLAP: What has happened in the past that might impact you?
• OLTP: What is happening right now?
• ML: What will happen in the future?
The Three Dimensions of Intelligence
• OLTP, OLAP, and ML: key platforms are duct-taped together, leading to
– High Infrastructure Costs
– Latency in Decision Making
– Isolation from Business Processes
Intelligent Action - Before
Intelligent Action - After
Data Science Pain Points
Data Scientist
• Is my data ready to go?
• Is it still relevant?
• Do my features still align?
• What data did I use?
• What algorithms/parameters gave the best model?
• Why didn’t I get the same results?
• What libraries are used?
• What model version is deployed?
Data Engineer
• The ETL process changed again – now what?
• The Data Scientist requested a different level of granularity – how do I do that?
MLflow and ML Manager
• Splice Machine chose MLflow
– MLflow Tracking: track experiment runs and parameters
– MLflow Models: package model artifacts
• Splice ML Manager
– Machine Learning on the Splice Machine Stack
– MLflow Tracking and Models
– Includes a UI to deploy to Amazon SageMaker
ML Manager Architecture
[Architecture diagram. Labels: On Premises, Splice Machine Data Platform, Native Spark Data Source, Deployment Automation]
Native Spark Datasource
• Efficient interface from the Splice relational
tables into Spark DataFrames (and back again)
• No serialization/deserialization
• Examples:
– interestingDf = spliceContext.df("select * from interesting_table")
– spliceContext.insert(dfWithData, "table_name")
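The query-then-write-back pattern in these two calls can be tried without a Splice cluster. The sketch below is only a stand-in: `ToyContext` is a hypothetical class backed by Python's built-in `sqlite3`, and its `df` and `insert` methods merely mirror the names on the slide. It returns plain lists of dicts rather than Spark DataFrames, and it says nothing about how Splice Machine avoids serialization.

```python
import sqlite3


class ToyContext:
    """Hypothetical stand-in for spliceContext; NOT the Splice Machine API."""

    def __init__(self):
        # In-memory database so the sketch needs no external services.
        self.conn = sqlite3.connect(":memory:")

    def df(self, sql):
        """Run a query and return its rows as a list of dicts."""
        cursor = self.conn.execute(sql)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]

    def insert(self, rows, table_name):
        """Append rows (a list of dicts with identical keys) to a table."""
        if not rows:
            return
        columns = list(rows[0].keys())
        placeholders = ", ".join("?" for _ in columns)
        # Table and column names are interpolated for illustration only;
        # do not build SQL this way from untrusted input.
        sql = (
            f"INSERT INTO {table_name} ({', '.join(columns)}) "
            f"VALUES ({placeholders})"
        )
        self.conn.executemany(sql, [tuple(r[c] for c in columns) for r in rows])
        self.conn.commit()


# Set up two small tables (names and values are made up for the sketch).
ctx = ToyContext()
ctx.conn.execute("CREATE TABLE interesting_table (id INTEGER, score REAL)")
ctx.conn.execute("CREATE TABLE table_name (id INTEGER, score REAL)")
ctx.conn.executemany(
    "INSERT INTO interesting_table VALUES (?, ?)", [(1, 0.5), (2, 0.9)]
)

# Read from one table, transform in Python, write the result to another.
interesting_rows = ctx.df("select * from interesting_table")
high_scores = [r for r in interesting_rows if r["score"] > 0.6]
ctx.insert(high_scores, "table_name")
written_back = ctx.df("select * from table_name")
```

With the real data source, the middle step would be a Spark transformation on the DataFrame instead of a Python list comprehension.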
Accessing MLflow Capabilities
• Start with Splice’s MLManager
– manager = MLManager()
– Convenience class on top of MLflow
• APIs
– manager.create_experiment()
– manager.set_active_experiment()
– manager.create_new_run()
– manager.log_param()
– manager.log_metric()
– manager.log_spark_model()
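The slide lists method names only, so the order in which they are called is sketched here with a stand-in. `ToyManager` is a hypothetical in-memory class, not the pysplice `MLManager`: the argument lists are assumptions, `best_run` is an extra helper that is not on the slide, and `log_spark_model` is left out because it needs a Spark model. The score comes from a placeholder function, not from a trained model.

```python
import time
import uuid


class ToyManager:
    """Hypothetical in-memory stand-in; NOT the pysplice MLManager."""

    def __init__(self):
        self.experiments = {}          # experiment name -> list of run dicts
        self.active_experiment = None
        self.active_run = None

    def create_experiment(self, name):
        self.experiments.setdefault(name, [])

    def set_active_experiment(self, name):
        if name not in self.experiments:
            raise KeyError(f"unknown experiment: {name}")
        self.active_experiment = name

    def create_new_run(self):
        if self.active_experiment is None:
            raise RuntimeError("set an active experiment first")
        run = {
            "run_id": uuid.uuid4().hex,
            "start_time": time.time(),
            "params": {},
            "metrics": {},
        }
        self.experiments[self.active_experiment].append(run)
        self.active_run = run

    def log_param(self, key, value):
        self.active_run["params"][key] = value

    def log_metric(self, key, value):
        self.active_run["metrics"][key] = value

    def best_run(self, metric):
        """Helper added for this sketch: run with the highest metric value."""
        runs = self.experiments[self.active_experiment]
        return max(runs, key=lambda r: r["metrics"][metric])


def placeholder_score(depth):
    """Stands in for model evaluation; peaks at depth 5 by construction."""
    return 1.0 / (1 + abs(depth - 5))


manager = ToyManager()
manager.create_experiment("demo_experiment")
manager.set_active_experiment("demo_experiment")

# One run per candidate hyperparameter value.
for depth in (3, 5, 8):
    manager.create_new_run()
    manager.log_param("max_depth", depth)
    manager.log_metric("score", placeholder_score(depth))

best = manager.best_run("score")
```

Logging every parameter and metric per run is what lets questions such as "which parameters gave the best model?" be answered afterwards from the recorded runs.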
MLflow UI
Deployment Automation
ML Manager
• Beta Launched in March
– MLflow v0.8
• Available at cloud.splicemachine.com
• MLManager() API Open Source at:
– https://github.com/splicemachine/pysplice
– (subject to change per MLflow 1.0 API)
DEMO
Many Disparate Tools
• Data Sources
– OLTP: Oracle, Cassandra, Dynamo
– OLAP: Redshift, Snowflake, S3
• Notebooks: Apache Zeppelin, Jupyter
• Data Manipulation: Python, Pandas, Scikit, Spark
• Machine Learning: MLlib, R
• Experimentation Tracking: MLflow
• Deployment: SageMaker, AzureML
Insurance Claim Example
