Splice Machine's use of Apache Spark and MLflow

Gene Davis, Splice Machine
#UnifiedAnalytics #SparkAISummit

Splice Machine
• What are we?
– A scale-out RDBMS that enables simultaneous transactions (OLTP)
and analytics (OLAP)
– Powers Operational AI: the ability to run AI applications in real time
• Who uses us?
– Companies in financial services, healthcare, supply chain, etc.
– One example: 7 PB, 2B record updates/day, 2M queries/day with
sub-second response time
• How do we do it?
– Transactional SQL engine on top of HBase and Spark
• "Dual engine" architecture
– Many delivery options: on-premises or cloud service (AWS, Azure,
bespoke cloud, etc.)

Operational AI
Integrated data platform for real-time AI applications
• Operational Database
– Scale-out
– OLTP
– Fast
• Enterprise Data Warehouse
– In-Memory
– OLAP
– Massively Parallel
• ML Models
– Notebooks
– Algorithms
– Model Workflow
• Diagram labels: Operational Intelligence, Business Intelligence,
Artificial Intelligence, Intelligent Decisions, On Premises

The Three Dimensions of
Intelligence
• OLAP: What has happened in the past that might impact you?
• OLTP: What is happening right now?
• ML: What will happen in the future?

The Three Dimensions of
Intelligence
OLTP • OLAP • ML
Key platforms are duct-taped together, leading to:
• High infrastructure costs
• Latency in decision making
• Isolation from business processes

Intelligent Action - Before

Intelligent Action - After

Data Science Pain Points
Data Scientist / Data Engineer
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
● What data did I use?
● What algorithms/parameters gave the best model?
● Why didn’t I get the same results?
● What libraries are used?
● What model version is deployed?

MLflow and ML Manager
• Splice Machine chose MLflow
– MLflow Tracking: Track experiment runs and parameters
– MLflow Models: Package model artifacts
• Splice ML Manager
– Machine Learning on the Splice Machine Stack
– MLflow Tracking and Models
– Includes a UI to deploy to Amazon SageMaker

ML Manager Architecture
• Splice Machine Data Platform (on premises)
• Native Spark Data Source
• Deployment Automation

Native Spark Data Source
• Efficient interface from the Splice relational
tables into Spark DataFrames (and back again)
• No serialization/deserialization
• Examples:
– interestingDf = spliceContext.df("select * from interesting_table")
– spliceContext.insert(dfWithData, 'table_name')

Accessing MLflow Capabilities
• Start with Splice’s MLManager
– manager = MLManager()
– Convenience class on top of MLflow
• APIs
– manager.create_experiment()
– manager.set_active_experiment()
– manager.create_new_run()
– manager.log_param()
– manager.log_metric()
– manager.log_spark_model()

MLflow UI

Deployment Automation

ML Manager
• Beta launched in March
– MLflow v0.8
• Available at cloud.splicemachine.com
• MLManager() API open source at:
– https://github.com/splicemachine/pysplice
– (subject to change per MLflow 1.0 API)

DEMO

Many Disparate Tools
• Data Sources
– OLTP: Oracle, Cassandra, Dynamo
– OLAP: Redshift, Snowflake, S3
• Notebooks: Apache Zeppelin, Jupyter
• Data Manipulation: Python, Pandas, Scikit, Spark
• Machine Learning: MLlib, R
• Experimentation Tracking: MLflow
• Deployment: SageMaker, AzureML

Insurance Claim Example
