Splice Machine Proprietary and Confidential
Open Source RDBMS
For Mixed Operational and Analytical Workloads
Monte Zweben
October 20, 2016
Who We Are
The Open Source RDBMS Powered By Hadoop & Spark
SQL, Scale Out, Speed
• ANSI SQL: No retraining or rewrites for SQL-based analysts, reports, and applications
• ¼ the Cost: Scales out on commodity hardware
• Transactions: Ensure reliable updates across multiple rows
• Mixed Workloads: Simultaneously support OLTP and OLAP workloads
• Elastic: Increase scale in just a few minutes
• 10x Faster: Leverages Spark in-memory technology
DECISIONS IN THE MOMENT
• Life Sciences
• Digital Marketing
• Financial Services
• Supply Chain Optimization
Today’s Reality: Stale Data, Backward-Looking Decisions
How old is the data in your reports?
• 1 day +
• 1 day
• 4 hours +
• 1 hour +
• Real-time
[Chart: segment values in extraction order: 24%, 50%, 7%, 9%, 9%]
* Source: Webinars on 11-3-15 and 12-10-15, 237 respondents
Legacy ETL Architectures Unable to Keep Up
[Diagram: OLTP systems (ERP, CRM, Supply Chain, HR, …) and mixed workload apps; stream or batch updates; ODS; ETL (Extract, Transform, Load); OLAP systems (Data Warehouse, Datamart); ad hoc analytics, executive business reports, operational reports]
Pain
• Separate OLTP & OLAP systems
• Messy ETL “glue”
• Why?
  • Different workloads
  • Different data structures
  • Hard to isolate workloads
• No longer adequate
  • Can’t afford to wait days or hours to analyze data
Recent Approach: Lambda Architecture
Complex to set up and maintain
• Speed Layer
• Batch Layer
• Serving Layer
Developer Integrates Specialized Compute Engines
New Approach: Lambda-In-A-Box Architecture
Easy to use with SQL
• Speed Layer
• Batch Layer
• Serving Layer
SQL Optimizer Selects Pre-Integrated Compute Engines
Simultaneous OLTP & OLAP Workloads
Unique Dual-Engine Architecture isolates workloads
• Traditional RDBMSs: bottlenecks and delays. As OLAP load rises, OLTP response times increase.
• Splice Machine: HBase engine and Spark engine with workload isolation. As OLAP load rises, OLTP response times remain flat.
[Charts: OLTP response time versus OLAP load, one per system]
Power Old and New Applications
Proven Building Blocks: Spark, Hadoop and Derby
Apache Derby
• ANSI SQL-99 RDBMS
• Java-based
• ODBC/JDBC Compliant
Apache HBase/Hadoop
• Auto-sharding
• High availability
• Scalability to 100s of PBs
Apache Spark
• Analytical engine
• Fast, in-memory technology
• Memory resilient to node failure
HBase: Proven Scale-Out
• Auto-sharding
• Scales with commodity hardware
• Cost-effective from GBs to PBs
• High availability through failover and replication
• LSM-trees
Apache Spark
Unmatched Performance
• Fastest sort of 1PB of data
Advanced In-Memory Technology
• Spill-to-disk for large datasets
• Resilient against node failures
• Pipelining for computation parallelism
Most Active Apache Community
• Almost 1000 contributors
Extensive Libraries
• Over 140 and growing
• Libraries for machine learning, streaming and graph processing
Splice Machine: Advanced Spark Integration
Innovative, High-Performance RDD Creation
• Fast access to HFiles in HDFS
• Merged with deltas from Memstore
• Avoids slower HBase API
Universal Execution Plan and Byte Code
• Optimizer, plan and code shared across Spark or HBase execution
[Diagram: each physical node hosts an HBase Region Server (Region 1 … Region N, each with a Memstore), a Spark Worker (RDD 1 … RDD N) and HDFS (HFiles)]
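The RDD-creation bullets describe reading HFiles directly and merging them with newer deltas that are still in the Memstore. The following is a minimal Python sketch of that merge idea only. The cell layout, the `merge_hfiles_with_memstore` name and the use of `None` as a delete marker are assumptions made for illustration; they are not taken from Splice Machine's code.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical cell layout: row key -> (timestamp, value); a value of None marks a delete.
Cell = Tuple[int, Optional[str]]


def merge_hfiles_with_memstore(
    hfiles: Iterable[Dict[str, Cell]],
    memstore: Dict[str, Cell],
) -> List[Tuple[str, str]]:
    """Return the newest live value per row key, sorted by key."""
    newest: Dict[str, Cell] = {}
    # Fold in every source; the cell with the highest timestamp wins.
    for source in list(hfiles) + [memstore]:
        for key, cell in source.items():
            if key not in newest or cell[0] > newest[key][0]:
                newest[key] = cell
    # Drop rows whose newest cell is a delete marker.
    return [(k, c[1]) for k, c in sorted(newest.items()) if c[1] is not None]


hfile_a = {"row1": (1, "a"), "row2": (2, "b")}
hfile_b = {"row2": (4, "b2"), "row3": (3, "c")}
memstore = {"row1": (5, "a-new"), "row3": (6, None)}  # unflushed deltas

merged = merge_hfiles_with_memstore([hfile_a, hfile_b], memstore)
print(merged)
```

In this toy version the Memstore is just one more source with newer timestamps, which is why a scan over flushed files alone would return stale values for `row1` and `row3`.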
Splice Machine Architecture
1. Standard install of HBase Cluster (HBase, HDFS, ZooKeeper) with Spark
2. Distribute Splice Machine JAR to each region server
3. Automatically invoke co-processors on each region
[Diagram: each HBase Region Server hosts regions on HDFS and runs the Splice Parser, Planner, Optimizer and Executor (snapshot isolation, indexes); each Spark Worker hosts RDDs, a Splice Executor (snapshot isolation, indexes) and executors with cache and tasks; the cluster also includes Spark Master, HMaster and ZooKeeper. Legend: HBase Co-Processor]
Splice Machine: Query Execution
1. Parse SQL
   • Generate Abstract Syntax Tree (AST)
   • Bind AST to Transactional Dictionary
2. Optimize query plan
   • Determine access plan (e.g., base table, index), join order and join algorithm using cost-based statistics (e.g., cardinality estimates)
   • Unroll nested subqueries
3. Generate optimal byte code
OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results
OLAP Execution on Spark
4b. Generate Spark execution plan
5b. Submit Spark plan with byte code
6b. Fair scheduling of distributed tasks
7b. Generate RDD from HFiles and Memstore
8b. Execute query and return results
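The query execution steps fork after byte code generation into an OLTP path on HBase and an OLAP path on Spark. The slides do not say how the optimizer picks a path, so the sketch below assumes a simple cutoff on the estimated row count. `QueryPlan`, `choose_engine` and `OLAP_ROW_THRESHOLD` are hypothetical names and values used only to illustrate the routing idea.

```python
from dataclasses import dataclass

# Hypothetical cutoff: plans expected to touch more rows than this go to Spark.
OLAP_ROW_THRESHOLD = 20_000


@dataclass
class QueryPlan:
    sql: str
    estimated_rows: int  # cardinality estimate from cost-based statistics


def choose_engine(plan: QueryPlan, threshold: int = OLAP_ROW_THRESHOLD) -> str:
    """Route small lookups to HBase (OLTP path) and large scans to Spark (OLAP path)."""
    return "spark" if plan.estimated_rows > threshold else "hbase"


point_lookup = QueryPlan("SELECT * FROM orders WHERE id = 42", estimated_rows=1)
big_aggregate = QueryPlan(
    "SELECT region, SUM(total) FROM orders GROUP BY region",
    estimated_rows=50_000_000,
)

print(choose_engine(point_lookup))
print(choose_engine(big_aggregate))
```

Because the plan and byte code are shared across both engines, a routing decision of this kind can be made late, after optimization, without compiling the query twice.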
Isolated Resource Management
Isolate Spark & HBase resources through Linux Cgroups
Configurable Spark Resource Management
Prioritize Spark resources between Query, Admin & Import jobs
Custom resource pools through XML
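Spark's fair scheduler reads its pools from an XML allocation file, which is one plausible reading of "custom resource pools through XML". The sketch below generates such a file in Python. The pool names and weights are assumptions loosely based on the Query, Admin and Import job classes named on the slide, and whether Splice Machine uses this exact file is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical pool names and weights, loosely based on the Query, Admin and
# Import job classes named on the slide.
POOLS = {"query": 4, "admin": 1, "import": 2}


def build_allocations(pools: dict) -> str:
    """Build an allocation document in the format Spark's fair scheduler reads."""
    root = ET.Element("allocations")
    for name, weight in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = "1"
    return ET.tostring(root, encoding="unicode")


xml_text = build_allocations(POOLS)
print(xml_text)
```

In stock Spark, a file like this is referenced through `spark.scheduler.allocation.file` with `spark.scheduler.mode` set to `FAIR`; a higher `weight` gives a pool a proportionally larger share of the cluster.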
Spark Query Management
• Visualization of active and completed queries
• Visualization of stages for each query, plus kill function
• Visualization of stages for query plan, plus kill function
• Detailed metrics for tasks in each stage
Working With External Data and Compute Engines
Virtual Table Interface (VTI)
• Execute federated queries against external files, libraries or databases
• External Databases
  • Use JDBC to access data in DBs such as Oracle and DB2
• External Libraries
  • Access over 140 Spark libraries for machine learning and streaming
• External Files
  • Pre-defined or dynamic schema
  • Access local FS, HDFS, AWS S3
  • Sample query: [query not preserved in the extracted text]
MapReduce I/O Formats
• Accept federated queries from MapReduce, Pig, and Hive
• Register Splice Machine schema in HCatalog
• Merge structured (Splice) and unstructured data in ad-hoc query
• Seamless integration with the Hadoop ecosystem
ANSI SQL-99+ Coverage
• Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
• DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE
• Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
• DML – e.g., INSERT, DELETE, UPDATE, SELECT
• Query specification – e.g., GROUP BY, HAVING
• SET functions – e.g., UNION, ABS, MOD, ALL, INTERSECT, EXCEPT
• Aggregation functions – e.g., AVG, MAX, COUNT
• String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH
• Constraints – e.g., PRIMARY KEY, CHECK, FOREIGN KEY, UNIQUE, NOT NULL
• Conditional functions – e.g., CASE, searched CASE
• Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
• Joins – e.g., INNER JOIN, LEFT OUTER JOIN
• Transactions – e.g., COMMIT, ROLLBACK, Snapshot Isolation
• Sub-queries
• Triggers
• User-defined functions (UDFs)
• Views – including grouped views
• Window Functions – e.g., FIRST_VALUE, LAST_VALUE, LEAD, LAG
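To make a few of the listed features concrete (constraints, DML and a window function), here is a small runnable sketch. It uses Python's built-in `sqlite3` module purely as a stand-in engine, not Splice Machine, and the `trades` table and its columns are made up for the example. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# SQLite is used here only as a convenient stand-in engine; it is not Splice Machine.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trades (
        id     INTEGER PRIMARY KEY,
        symbol TEXT NOT NULL,
        price  REAL NOT NULL CHECK (price > 0)
    )
""")
conn.executemany(
    "INSERT INTO trades (id, symbol, price) VALUES (?, ?, ?)",
    [(1, "ABC", 15.27), (2, "ABC", 15.74), (3, "ABC", 15.65)],
)

# LAG exposes the previous row's price so each row can be compared with it.
rows = conn.execute("""
    SELECT id, price,
           LAG(price) OVER (PARTITION BY symbol ORDER BY id) AS prev_price
    FROM trades
    ORDER BY id
""").fetchall()
print(rows)
conn.close()
```

The first row has no predecessor within its partition, so its `prev_price` comes back as NULL (`None` in Python).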
High Concurrency, ACID transactions
Required to support OLTP applications
share_quantity           share_price
TIMESTAMP   VALUE        TIMESTAMP   VALUE
T12         4,000        T7          $15.11
T7          2,000        T5          $15.65
T3          5,000        T2          $15.74
T1          3,000        T0          $15.27

“Virtual” snapshot for a transaction @T6: share_quantity = 5,000 (T3), share_price = $15.65 (T5)
value_held = share_quantity * share_price
@T6: value_held = 5,000 * $15.65
@T3: value_held = 5,000 * $15.74
• State-of-the-art, distributed snapshot isolation
• Form of Multi-Version Concurrency Control (MVCC)
• Writers do not block readers
• Fast, high concurrency
• Delivers performance for small reads/writes & batch loads
• Extends research from Google Percolator & Yahoo Labs
• Patent pending technology
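The snapshot example on this slide can be reproduced with a few lines of Python. This is a simplified sketch of an MVCC read, assuming every listed version is already committed; it leaves out commit tracking and write-write conflict detection, which a real snapshot isolation implementation needs. `VERSIONS`, `read_at` and `value_held` are illustrative names.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

# Versioned cells copied from the slide: column -> [(commit timestamp, value), ...]
VERSIONS: Dict[str, List[Tuple[int, Decimal]]] = {
    "share_quantity": [(1, Decimal(3000)), (3, Decimal(5000)),
                       (7, Decimal(2000)), (12, Decimal(4000))],
    "share_price": [(0, Decimal("15.27")), (2, Decimal("15.74")),
                    (5, Decimal("15.65")), (7, Decimal("15.11"))],
}


def read_at(column: str, snapshot_ts: int) -> Decimal:
    """Return the newest version written at or before snapshot_ts."""
    visible = [(ts, v) for ts, v in VERSIONS[column] if ts <= snapshot_ts]
    if not visible:
        raise KeyError(f"no version of {column} visible at T{snapshot_ts}")
    return max(visible, key=lambda pair: pair[0])[1]


def value_held(snapshot_ts: int) -> Decimal:
    """Compute share_quantity * share_price as seen by one snapshot."""
    return read_at("share_quantity", snapshot_ts) * read_at("share_price", snapshot_ts)


print(value_held(6))  # snapshot @T6: 5,000 * $15.65
print(value_held(3))  # snapshot @T3: 5,000 * $15.74
```

Readers never wait here because each snapshot only filters versions by timestamp; later writes at T7 and T12 add new versions without changing what a transaction @T6 sees.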
BI and SQL tool support via ODBC/JDBC
No application rewrites needed
Open Source
Features                                                          Community Edition   Enterprise Edition
Scale-out Architecture, ANSI SQL & Concurrent ACID Transactions   ✓                   ✓
OLAP and OLTP Resource Isolation                                  ✓                   ✓
Distributed In-Memory Joins, Aggregations, Scans and Groupings    ✓                   ✓
Cost-Based Statistics, Query Optimizer, Management Console        ✓                   ✓
Compaction Optimization                                           ✓                   ✓
Apache Kafka-enabled Streaming                                    ✓                   ✓
Virtual Table Interfaces                                          ✓                   ✓
New Releases and Maintenance Updates                              ✓                   ✓
Tutorials, Forums, Videos, Documentation, Community Support       ✓                   ✓
Backup and Restore, Column Access Control                                             ✓
Encryption, Kerberos, LDAP Support                                                    ✓
24/7 Support via Web and Phone                                                        ✓
Complimentary Account Management Services                                             ✓
Try it at scale immediately on AWS Sandbox
• 5-click sandbox
• Cluster has full system deployed
• SSH for CLI
• URL to Management Consoles
• Open SQL connection on any node
• Customize template
Community
• Slack channel - #splicecommunity
• Video and code tutorials
• GitHub
Advisory Board
Advisory Board includes luminaries in databases and technology
Roger Bamford
Former Principal Architect at Oracle
Father of Oracle RAC
Mike Franklin
Chair, Dept. of Computer Science, UChicago
Director, UC Berkeley AMPLab
Founder of Apache Spark
Marie-Anne Neimat
Co-Founder, Times-Ten Database
Former VP, Database Eng. at Oracle
Ken Rudin
Head of Growth and Analytics for Google Search
Head of Analytics at Facebook
Abhinav Gupta
Co-Founder, Rocket Fuel
Runs 15PB HBase Cluster
WE ARE HIRING
Seasoned Team
• Monte Zweben: Co-Founder & Chief Executive Officer
• John Leach: Co-Founder & Chief Technology Officer (St. Louis Hadoop User Group)
• Krishnan Parasuraman: VP of Sales and Business Development
• Eran Pilovsky: Chief Financial Officer
• Gene Davis: Co-Founder & VP of Products & Operations
• Eric Kalabacos: VP of Customer Solutions
Next Steps
Try Us!
splicemachine.com/get-started
GitHub • Tutorials • Sandbox
Powering Real-Time
Applications & Analytics
Enabling Decisions in the Moment
October 20, 2016


October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

  • 1. Splice Machine Proprietary and Confidential Open Source RDBMS For Mixed Operational and Analytical Workloads Monte Zweben October 20, 2016
  • 2. Splice Machine Proprietary and Confidential Who We Are The Open Source RDBMS Powered By Hadoop & Spark 2 ANSI SQL No retraining or rewrites for SQL-based analysts, reports, and applications ¼ the Cost Scales out on commodity hardware SQL Scale Out Speed Transactions Ensure reliable updates across multiple rows Mixed Workloads Simultaneously support OLTP and OLAP workloads Elastic Increase scale in just a few minutes 10x Faster Leverages Spark in-memory technology
  • 3. Splice Machine Proprietary and Confidential Life Sciences Digital Marketing Financial Services DECISIONS IN THE MOMENT Supply Chain Optimization
  • 4. Splice Machine Proprietary and Confidential Today’s Reality: Stale Data, Backward-Looking Decisions 4 How old is the data in your reports?  1 day +  1 day  4 hours +  1 hour +  Real-time
  • 5. Splice Machine Proprietary and Confidential Today’s Reality: Stale Data, Backward-Looking Decisions 5 24% 50% 7% 9% 9% * Source: Webinars on 11-3-15 and 12-10-15, 237 respondents How old is the data in your reports?  1 day +  1 day  4 hours +  1 hour +  Real-time
  • 6. Splice Machine Proprietary and Confidential Legacy ETL Architectures Unable to Keep Up Ad Hoc Analytics Executive Business Reports Operational Reports ERP CRM Supply Chain HR … Data Warehouse Datamart Stream or Batch Updates Mixed Workload Apps ODS ETL OLTP Systems Extract Transform Load OLAP Systems Pain  Separate OLTP & OLAP systems  Messy ETL “glue”  Why?  Different workloads  Different data structures  Hard to isolate workloads  No longer adequate  Can’t afford to wait days or hours to analyze data 6
  • 7. Splice Machine Proprietary and Confidential Recent Approach: Lambda Architecture Complex to setup and maintain 7 Speed Layer Batch Layer Serving Layer Developer Integrates Specialized Compute Engines
  • 8. Splice Machine Proprietary and Confidential New Approach: Lambda-In-A-Box Architecture Easy to use with SQL 8 Speed Layer Batch Layer SQL Optimizer Selects Pre-Integrated Compute Engines Serving Layer
  • 9. Splice Machine Proprietary and Confidential Simultaneous OLTP & OLAP Workloads 9 Unique Dual-Engine Architecture isolates workloads Traditional RDBMSs Splice Machine HBASE Engine SPARK Engine BOTTLENECKS, DELAYS O L A P WORKLOAD ISOLATION O L T P K E Y
  • 10. Splice Machine Proprietary and Confidential Simultaneous OLTP & OLAP Workloads 10 Unique Dual-Engine Architecture isolates workloads Traditional RDBMSs Splice Machine As OLAP load rises, OLTP response times increase OLAP LOAD OLTPRESPONSETIME As OLAP load rises, OLTP response times remain flat OLAP LOAD OLTPRESPONSETIME
  • 11. Splice Machine Proprietary and Confidential Power Old and New Applications
  • 12. Splice Machine Proprietary and Confidential Proven Building Blocks: Spark, Hadoop and Derby Apache Derby  ANSI SQL-99 RDBMS  Java-based  ODBC/JDBC Compliant Apache HBase/Hadoop  Auto-sharding  High availability  Scalability to 100s of PBs Apache Spark  Analytical engine  Fast, in-memory technology  Memory resilient to node failure 12
  • 13. Splice Machine Proprietary and Confidential HBase: Proven Scale-Out  Auto-sharding  Scales with commodity hardware  Cost-effective from GBs to PBs  High availability thru failover and replication  LSM-trees 13
  • 14. Splice Machine Proprietary and Confidential Apache 14 Unmatched Performance  Fastest sort of 1PB of data Advanced In-Memory Technology  Spill-to-disk for large datasets  Resilient against node failures  Pipelining for computation parallelism Most Active Apache Community  Almost 1000 contributors Extensive Libraries  Over 140 and growing  Libraries for machine learning, streaming and graph processing
  • 15. Splice Machine Proprietary and Confidential Splice Machine: Advanced Spark Integration 15 Innovative, High-Performance RDD Creation  Fast access to HFiles in HDFS  Merged with deltas from Memstore  Avoids slower HBase API Universal Execution Plan and Byte Code  Optimizer, plan and code shared across Spark or HBase execution ••• HBase Region Server HDFS ••• Region 1 Memstore Spark Worker •••RDD 1 HFile HFile••• P H Y S I C A L N O D E RDD N HFile••• HFile••• Region N Memstore HBase Region Server HDFS ••• Region 1 Memstore Spark Worker •••RDD 1 HFile HFile••• P H Y S I C A L N O D E RDD N HFile••• HFile••• Region N Memstore
  • 16. Splice Machine Proprietary and Confidential Splice Machine Architecture 1. Standard install of HBase Cluster (HBase, HDFS, ZooKeeper) with Spark HBase Co-Processor L E G E N D 2. Distribute Splice Machine JAR to each region server 3. Automatically invoke co- processors on each region 16 Cach e ••• Tas k Executor Tas k HBase Region Server ••• HDFS SPLICE PARSER SPLICE PLANNER SPLICE OPTIMIZER SPLICE EXECUTOR • Snapshot Isolation • Indexes Region Region SPLICE EXECUTOR • Snapshot Isolation • Indexes Spark Worker RDD Spark Master RDD Cach e ••• Tas k Executor Tas k ••• ••• ••• Cach e ••• Tas k Executor Tas k HBase Region Server HDFS SPLICE PARSER SPLICE PLANNER SPLICE OPTIMIZER SPLICE EXECUTOR • Snapshot Isolation • Indexes Region Region SPLICE EXECUTOR • Snapshot Isolation • Indexes Spark Worker RDDRDD Cach e ••• Tas k Executor Tas k ••• ••• ••• HMasterZookeeper
  • 17. Splice Machine: Query Execution
  • 18. Splice Machine: Query Execution
      1. Parse SQL
         - Generate Abstract Syntax Tree (AST)
         - Bind AST to Transactional Dictionary
  • 19. Splice Machine: Query Execution
      1. Parse SQL
      2. Optimize query plan
         - Determine access plan (e.g., base table, index), join order and join algorithm using cost-based statistics (e.g., cardinality estimates)
         - Unroll nested subqueries
  • 20. Splice Machine: Query Execution
      1. Parse SQL
      2. Optimize query plan
      3. Generate optimal byte code
  • 21. Splice Machine: Query Execution
      1. Parse SQL
      2. Optimize query plan
      3. Generate optimal byte code
      OLTP Execution on HBase
      4a. Execute OLTP query from byte code
      5a. Use block cache and bloom filters to optimize data access
      6a. Return results
  • 22. Splice Machine: Query Execution
      1. Parse SQL
      2. Optimize query plan
      3. Generate optimal byte code
      OLTP Execution on HBase
      4a. Execute OLTP query from byte code
      5a. Use block cache and bloom filters to optimize data access
      6a. Return results
      OLAP Execution on Spark
      4b. Generate Spark execution plan
      5b. Submit Spark plan with byte code
      6b. Fair scheduling of distributed tasks
      7b. Generate RDD from HFiles and Memstore
      8b. Execute query and return results
  • 23. Isolated Resource Management: Isolate Spark & HBase resources through Linux cgroups
  • 24. Isolated Resource Management: Isolate Spark & HBase resources through Linux cgroups
  • 25. Configurable Spark Resource Management: Prioritize Spark resources between Query, Admin & Import jobs. Custom resource pools through XML.
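Spark's fair scheduler reads its pool definitions from an XML allocation file, which is presumably what "custom resource pools through XML" on slide 25 refers to. The sketch below parses such a file and shows how pool weights translate into relative shares. Only the element names (`allocations`, `pool`, `schedulingMode`, `weight`, `minShare`) come from Spark's allocation file format; the pool names and numbers are invented for illustration and are not Splice Machine's actual configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical pools mirroring the Query / Admin / Import split on the slide.
# The names and numbers are assumptions, not Splice Machine defaults.
FAIR_SCHEDULER_XML = """
<allocations>
  <pool name="query">
    <schedulingMode>FAIR</schedulingMode>
    <weight>4</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="admin">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>1</minShare>
  </pool>
  <pool name="import">
    <schedulingMode>FAIR</schedulingMode>
    <weight>3</weight>
  </pool>
</allocations>
"""


def load_pools(xml_text):
    """Parse pool definitions, applying Spark's documented defaults
    (FIFO mode, weight 1, minShare 0) for omitted elements."""
    root = ET.fromstring(xml_text)
    pools = {}
    for pool in root.findall("pool"):
        pools[pool.get("name")] = {
            "mode": pool.findtext("schedulingMode", default="FIFO"),
            "weight": int(pool.findtext("weight", default="1")),
            "min_share": int(pool.findtext("minShare", default="0")),
        }
    return pools


def weight_shares(pools):
    """Relative share each pool gets when all pools are busy."""
    total = sum(p["weight"] for p in pools.values())
    return {name: p["weight"] / total for name, p in pools.items()}


if __name__ == "__main__":
    pools = load_pools(FAIR_SCHEDULER_XML)
    for name, share in weight_shares(pools).items():
        print(f"{name}: {share:.0%} of contended resources")
```

With these assumed weights, query jobs would receive half of the contended resources, import jobs three eighths and admin jobs one eighth.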
  • 26. Spark Query Management: Visualization of active and completed queries
  • 27. Spark Query Management (cont’d): Visualization of stages for each query, plus kill function
  • 28. Spark Query Management (cont’d): Visualization of stages for query plan, plus kill function
  • 29. Spark Query Management (cont’d): Detailed metrics for tasks in each stage
  • 30. Spark Query Management (cont’d)
  • 31. Working With External Data and Compute Engines
      Virtual Table Interface (VTI)
      - Execute federated queries against external files, libraries or databases
      - External databases: use JDBC to access data in DBs such as Oracle and DB2
      - External libraries: access over 140 Spark libraries for machine learning and streaming
      - External files: pre-defined or dynamic schema; access local FS, HDFS, AWS S3
      - Sample query: (not included in this transcript)
      MapReduce I/O Formats
      - Accept federated queries from MapReduce, Pig, and Hive
      - Register Splice Machine schema in HCatalog
      - Merge structured (Splice) and unstructured data in ad-hoc queries
      - Seamless integration with the Hadoop ecosystem
  • 32. ANSI SQL-99+ Coverage
      - Data types: e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
      - DDL: e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE
      - Predicates: e.g., IN, BETWEEN, LIKE, EXISTS
      - DML: e.g., INSERT, DELETE, UPDATE, SELECT
      - Query specification: e.g., GROUP BY, HAVING
      - SET functions: e.g., UNION, ABS, MOD, ALL, INTERSECT, EXCEPT
      - Aggregation functions: e.g., AVG, MAX, COUNT
      - String functions: e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH
      - Constraints: e.g., PRIMARY KEY, CHECK, FOREIGN KEY, UNIQUE, NOT NULL
      - Conditional functions: e.g., CASE, searched CASE
      - Privileges: e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
      - Joins: e.g., INNER JOIN, LEFT OUTER JOIN
      - Transactions: e.g., COMMIT, ROLLBACK, Snapshot Isolation
      - Sub-queries
      - Triggers
      - User-defined functions (UDFs)
      - Views, including grouped views
      - Window functions: e.g., FIRST_VALUE, LAST_VALUE, LEAD, LAG
  • 33. High Concurrency, ACID Transactions
      Required to support OLTP applications
      share_quantity           share_price
      TIMESTAMP  VALUE         TIMESTAMP  VALUE
      T12        4,000         T7         $15.11
      T7         2,000         T5         $15.65
      T3         5,000         T2         $15.74
      T1         3,000         T0         $15.27
      "Virtual" snapshot for a transaction @T6: share_quantity = 5,000 (written at T3), share_price = $15.65 (written at T5)
      value_held = share_quantity * share_price
      @T6: value_held = 5,000 * $15.65
      @T3: value_held = 5,000 * $15.74
      - State-of-the-art, distributed snapshot isolation
      - Form of Multi-Version Concurrency Control (MVCC)
      - Writers do not block readers
      - Fast, high concurrency
      - Delivers performance for small reads/writes & batch loads
      - Extends research from Google Percolator & Yahoo Labs
      - Patent-pending technology
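The snapshot read on slide 33 can be reproduced with a few lines of Python. This is a minimal sketch of a multi-version read using the timestamps and values from the slide, not Splice Machine's implementation. It assumes that a version written at timestamp T is visible to a snapshot taken at T (which matches the slide's @T3 result); the names `VERSIONS`, `read_as_of` and `value_held` are illustrative.

```python
from bisect import bisect_right
from decimal import Decimal

# Versions per column as (write timestamp, value), oldest first.
# The numbers are taken from the slide; timestamps T0..T12 become ints.
VERSIONS = {
    "share_quantity": [(1, 3000), (3, 5000), (7, 2000), (12, 4000)],
    "share_price": [
        (0, Decimal("15.27")),
        (2, Decimal("15.74")),
        (5, Decimal("15.65")),
        (7, Decimal("15.11")),
    ],
}


def read_as_of(column, snapshot_ts):
    """Return the newest value written at or before snapshot_ts."""
    versions = VERSIONS[column]
    timestamps = [ts for ts, _ in versions]
    idx = bisect_right(timestamps, snapshot_ts)
    if idx == 0:
        raise KeyError(f"{column} has no version visible at T{snapshot_ts}")
    return versions[idx - 1][1]


def value_held(snapshot_ts):
    """value_held = share_quantity * share_price, both read from one snapshot."""
    return read_as_of("share_quantity", snapshot_ts) * read_as_of(
        "share_price", snapshot_ts
    )


if __name__ == "__main__":
    for ts in (6, 3):
        print(f"@T{ts}: value_held = ${value_held(ts):,.2f}")
```

A reader at T6 never sees the later writes at T7 and T12, so those writers do not need to block it.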
  • 34. BI and SQL tool support via ODBC/JDBC: No application rewrites needed
  • 35. Open Source
      Features | Community Edition | Enterprise Edition
      Scale-out Architecture, ANSI SQL & Concurrent ACID Transactions | ✓ | ✓
      OLAP and OLTP Resource Isolation | ✓ | ✓
      Distributed In-Memory Joins, Aggregations, Scans and Groupings | ✓ | ✓
      Cost-Based Statistics, Query Optimizer, Management Console | ✓ | ✓
      Compaction Optimization | ✓ | ✓
      Apache Kafka-enabled Streaming | ✓ | ✓
      Virtual Table Interfaces | ✓ | ✓
      New Releases and Maintenance Updates | ✓ | ✓
      Tutorials, Forums, Videos, Documentation, Community Support | ✓ | ✓
      Backup and Restore, Column Access Control | | ✓
      Encryption, Kerberos, LDAP Support | | ✓
      24/7 Support via Web and Phone | | ✓
      Complimentary Account Management Services | | ✓
  • 36. Try it at scale immediately on AWS Sandbox
      - 5-click sandbox
      - Cluster has full system deployed
      - SSH for CLI
      - URL to Management Consoles
      - Open SQL connection on any node
      - Customize template
  • 37. Community
      - Slack channel: #splicecommunity
      - Video and code tutorials
      - GitHub
  • 38. Advisory Board
      Advisory Board includes luminaries in databases and technology
      - Roger Bamford: Former Principal Architect at Oracle; Father of Oracle RAC
      - Mike Franklin: Chair, Dept of Computer Science, UChicago; Director, UC Berkeley AMPLab; Founder of Apache Spark
      - Marie-Anne Neimat: Co-Founder, Times-Ten Database; Former VP, Database Eng. at Oracle
      - Ken Rudin: Head of Growth and Analytics for Google Search; Head of Analytics at Facebook
      - Abhinav Gupta: Co-Founder, Rocket Fuel; Runs 15PB HBase Cluster
  • 39. WE ARE HIRING
  • 40. Seasoned Team
      - Monte Zweben: Co-Founder & Chief Executive Officer
      - John Leach: Co-Founder & Chief Technology Officer; St. Louis Hadoop User Group
      - Krishnan Parasuraman: VP of Sales and Business Development
      - Eran Pilovsky: Chief Financial Officer
      - Gene Davis: Co-Founder & VP of Products & Operations
      - Eric Kalabacos: VP of Customer Solutions
  • 41. Next Steps: Try Us! splicemachine.com/get-started (GitHub • Tutorials • Sandbox)
  • 42. Powering Real-Time Applications & Analytics: Enabling Decisions in the Moment. October 20, 2016

Editor's Notes

  • #3: The Hadoop RDBMS is designed to scale out from a single server to thousands of machines, with a high degree of fault tolerance. Rather than relying on high-end hardware, Splice Machine uses the scale-out and high availability of Hadoop, proven in production clusters of dozens of petabytes at large-scale leaders like Yahoo, Facebook, and Twitter.
      The Hadoop RDBMS benefits include:
      - Affordability: scale out using commodity hardware
      - Elasticity: expand or scale back easily
      - Transactional: execute real-time updates and ACID transactions
      - ANSI SQL: leverage existing SQL code, tools, and skills
      - Flexibility: support both operational and analytical workloads
      Notes: SQL (Structured Query Language) is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS).
  • #13: Splice Machine has focused on the orange blocks to maximize the value of our R&D investment.
      Derby's database engine is a full-featured, embedded relational database engine that supports JDBC and SQL as programming APIs. It uses IBM DB2 SQL syntax.
      - Apache Derby originated at Cloudscape Inc., an Oakland, California start-up founded in 1996
      - In 1999, Informix Software, Inc. acquired Cloudscape, Inc.
      - In 2001, IBM acquired the database assets of Informix Software, including Cloudscape
      - In August 2004, IBM contributed the code to the Apache Software Foundation as Derby
      Splice Machine has focused on the middle of the stack to maximize the value created by our R&D:
      - Our parallelization engine to execute SQL
      - Secondary indexes
      - Join strategies
      - Query optimizers
      - Performance
      - High concurrency, lockless programming
  • #14: HBase is a “distributed, versioned, non-relational database modeled after Google's Bigtable, a distributed storage system for structured data”. HBase can handle very high throughput scaling.
      Fully leverage HBase as a storage engine for horizontal scale-out
      - Auto-sharding in HBase is based on regions, with regions assigned to region servers
      - HBase does region balancing/re-balancing
      - Recall that regions are ranges of rowkeys in sorted order
      RDBMS primary key uses HBase rowkeys
      - Fast single-row selects
      - Fast range-based scans
      - Dense secondary indices stored in separate tables
      Write pipeline
      - Writes go to the WAL first (not shown) for durability
      - Then writes go to the memstore
      - Memstores are eventually flushed to disk as storefiles
      Read pipeline
      - Blocks are read from storefiles and memstores
      - Blocks are cached in the Block Cache (not shown)
      Remember that HDFS is an immutable file system
      - Storefiles are written and never updated
      - Updates are really inserts (upserts)
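The write and read pipelines in note #14 can be illustrated with a toy in-memory model. The sketch below is not HBase code: `TinyRegion` and its methods are hypothetical names, the flush is triggered by entry count rather than memstore size, and the block cache is left out. It only shows the ordering of WAL, memstore and immutable storefiles, and why an update behaves as an upsert.

```python
class TinyRegion:
    """Toy model of one region: WAL -> memstore -> immutable storefiles."""

    def __init__(self, flush_threshold=2):
        self.wal = []          # append-only log, written first for durability
        self.memstore = {}     # most recent writes, held in memory
        self.storefiles = []   # oldest first; each is an immutable sorted tuple
        self.flush_threshold = flush_threshold

    def put(self, rowkey, value):
        self.wal.append((rowkey, value))   # 1. log the write
        self.memstore[rowkey] = value      # 2. apply it to the memstore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # 3. eventually flush to a storefile

    def flush(self):
        if self.memstore:
            self.storefiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, rowkey):
        # Newest data wins: memstore first, then storefiles newest to oldest.
        if rowkey in self.memstore:
            return self.memstore[rowkey]
        for storefile in reversed(self.storefiles):
            for key, value in storefile:
                if key == rowkey:
                    return value
        return None

    def scan(self, start, stop):
        # Merge oldest to newest so later versions overwrite earlier ones,
        # then return rows in sorted rowkey order for the range [start, stop).
        merged = {}
        for storefile in self.storefiles:
            merged.update(storefile)
        merged.update(self.memstore)
        return [(k, merged[k]) for k in sorted(merged) if start <= k < stop]


if __name__ == "__main__":
    region = TinyRegion(flush_threshold=2)
    region.put("row1", "a")
    region.put("row2", "b")      # triggers a flush to the first storefile
    region.put("row1", "a2")     # an "update" is just a newer insert
    print(region.get("row1"))
    print(region.storefiles[0])  # the flushed file still holds the old value
    print(region.scan("row1", "row3"))
```

The first storefile is never rewritten after the update; the read path resolves the newer value from the memstore instead.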
  • #17: Key points
      - A theme in distributed computing is moving the code to where the data is, because the data is big
      - Splice Machine has its own query execution and task parallelization engine
        - Secret sauce
        - Not based on MapReduce
        - Predicates pushed to shards and locally applied
        - Each region executes local HBase operations
        - Results are returned to the controlling node and “spliced together”, hence the “splice” in the company name
      - Serialization
        - Highly compressed storage format for table rows
        - Snappy compression
        - Reduces network traffic
      - Join strategies: Nested Loop, SortMerge, Merge, Broadcast
      - Relies on HBase co-processors: end-points and observers
      - Performance: speed of the read and async write pipelines
        - 20 msecs for read
        - 30 msecs for write
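The "push predicates to the shards, then splice the results together" idea in note #17 can be sketched as a scatter-gather over sorted key ranges. This is an illustrative single-process simulation, not Splice Machine's engine: `REGIONS`, `region_scan`, `splice` and `query` are hypothetical names, and a real deployment would run the per-region step inside HBase co-processors in parallel.

```python
import heapq

# Each region holds a contiguous, sorted range of rowkeys (sample data).
REGIONS = [
    [(1, {"qty": 5}), (2, {"qty": 40})],
    [(3, {"qty": 12}), (4, {"qty": 3})],
    [(5, {"qty": 25})],
]


def region_scan(region, predicate):
    """Runs where the data lives: filter locally, ship back only matches."""
    return [(key, row) for key, row in region if predicate(row)]


def splice(partials):
    """Controlling node: merge the already-sorted partial results by rowkey."""
    return list(heapq.merge(*partials, key=lambda kv: kv[0]))


def query(predicate):
    partials = [region_scan(region, predicate) for region in REGIONS]
    return splice(partials)


if __name__ == "__main__":
    for key, row in query(lambda row: row["qty"] >= 10):
        print(key, row)
```

Because the filter runs at each region, only three of the five sample rows cross the simulated network boundary.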
  • #34: This is a big area of focus for us
      - Based on MVCC “snapshot isolation”
      - Lockless is key here
      - Patent pending in this area
      - Based on timestamps
      - Transaction C will see changes from transaction A
      - Transaction C won’t see changes from transaction B
      Here's an explanation of what is depicted in Figure 1: Transaction T1 bumps up the Qty for A by 10 twice, then commits at time t6. At the commit, A's Qty is 30, which is now visible to other transactions. Note, however, that T2 started at time t4, before T1's commit, so its value for A is still 10. Thus, when it computes C = A+10, the result is 20. T3 starts at t7, overlapping T2, and attempts to update B just as T2 did, resulting in a write-write conflict. T3 rolls back and is reissued as T3'. This succeeds with the previously committed value from T2.
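The scenario in note #34 can be replayed against a small snapshot-isolation model. This is a simplified sketch, not the patented Splice Machine algorithm: it assumes a write conflicts when another active transaction has an uncommitted write to the same key, or when a version was committed after the writer began. The class names, the logical clock (whose ticks do not match t4, t6 and t7 in the note) and the values written to B are all illustrative assumptions.

```python
import itertools


class WriteConflict(Exception):
    pass


class Store:
    def __init__(self, initial):
        self.clock = itertools.count(1)                      # logical timestamps
        self.versions = {k: [(0, v)] for k, v in initial.items()}
        self.pending = {}                                    # key -> active writer

    def begin(self):
        return Txn(self, next(self.clock))


class Txn:
    def __init__(self, store, begin_ts):
        self.store, self.begin_ts, self.writes = store, begin_ts, {}

    def read(self, key):
        if key in self.writes:                               # own writes first
            return self.writes[key]
        visible = [v for ts, v in self.store.versions[key] if ts <= self.begin_ts]
        return visible[-1]                                   # newest in snapshot

    def write(self, key, value):
        owner = self.store.pending.get(key)
        newest_commit = self.store.versions[key][-1][0]
        if (owner is not None and owner is not self) or newest_commit > self.begin_ts:
            self.rollback()
            raise WriteConflict(key)
        self.store.pending[key] = self
        self.writes[key] = value

    def commit(self):
        commit_ts = next(self.store.clock)
        for key, value in self.writes.items():
            self.store.versions[key].append((commit_ts, value))
            del self.store.pending[key]
        self.writes = {}

    def rollback(self):
        for key in self.writes:
            self.store.pending.pop(key, None)
        self.writes = {}


def replay_note_scenario():
    store = Store({"A": 10, "B": 0, "C": 0})
    t1 = store.begin()
    t1.write("A", t1.read("A") + 10)
    t2 = store.begin()                      # starts before T1 commits
    t1.write("A", t1.read("A") + 10)
    t1.commit()                             # A is now 30 for later snapshots
    t2.write("C", t2.read("A") + 10)        # T2 still sees A = 10, so C = 20
    t2.write("B", 1)                        # value 1 is an assumption
    t3 = store.begin()
    try:
        t3.write("B", 2)                    # overlaps T2's write to B
        conflicted = False
    except WriteConflict:
        conflicted = True                   # T3 was rolled back
    t2.commit()
    t3_retry = store.begin()                # T3' sees T2's committed B
    t3_retry.write("B", t3_retry.read("B") + 1)
    t3_retry.commit()
    final = store.begin()
    return conflicted, {k: final.read(k) for k in ("A", "B", "C")}


if __name__ == "__main__":
    print(replay_note_scenario())
```

No locks are taken on the read path: each transaction simply reads the newest version at or before its begin timestamp, and only overlapping writers to the same key are forced to retry.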