Tim Vaillancourt
Sr. Technical Operations Architect
One Tool to Rule Them All: Seamless SQL on
MongoDB, MySQL and Redis with Apache Spark
Starring…
MySQL as “the RDBMS”
MongoDB as “the document-store”
Redis as “the in-memory K/V store”
About Me
• Joined Percona in January 2016
• Sr Technical Operations Architect for MongoDB
• Previous:
• EA DICE (MySQL DBA)
• EA SPORTS (Sys/NoSQL DBA Ops)
• Amazon/AbeBooks Inc (Sys/MySQL+NoSQL DBA Ops)
• Main techs: MySQL, MongoDB, Cassandra, Solr, Redis, queues, etc
• 10+ years tuning Linux for database workloads (off and on)
• NOT an Apache Spark expert
Apache Spark
• “…is a fast and general engine for large-scale data processing”
• Written in Scala; uses Akka for internal messaging and runs on the JVM
• Supports jobs written in Java, Scala, Python and R
• Pluggable datasources for various file types, databases, SaaSs, etc
• Fast and efficient: jobs work on datasources quickly in parallel
• Optional clustering
• Master/Slave Spark cluster, Zookeeper for Master elections
• Slave workers connect to master
• Master distributes tasks evenly to available workers
• Streaming and Machine Learning (MLlib) capabilities
• Programmatic and SQL(!) querying capabilities
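To make the above concrete, here is a minimal sketch of what a Spark job looks like from Python. This is not from the deck: the app name and the data are invented, and the transformation is factored into a pure function so it can be checked without a Spark installation.

```python
# A minimal PySpark job sketch (assumes the `pyspark` package and a
# reachable master; "local[*]" is the standard single-machine master URL).

def squares_of_evens(numbers):
    """The core logic as pure Python, mirroring the filter/map below."""
    return [n * n for n in numbers if n % 2 == 0]

def main():
    # Imported inside main() so the logic above stays testable without Spark.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "demo-job")
    rdd = sc.parallelize(range(10))          # distribute the data
    result = (rdd.filter(lambda n: n % 2 == 0)
                 .map(lambda n: n * n)
                 .collect())                 # action: triggers execution
    print(result)
    sc.stop()

if __name__ == "__main__":
    main()
```

Transformations like `filter` and `map` are lazy; nothing runs until an action such as `collect()` is called.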
Apache Spark: Software Architecture
• Jobs go to the Cluster Master or run directly in the client JVM
• Cluster Master assigns jobs to nodes with available resources via messages
• Cluster Master HA
• Slaves reconnect to Master
• Apache Zookeeper for true HA
Apache Spark: Hadoop Comparison
• Hadoop MapReduce
• Batch-based
• Uses data less efficiently
• Relatively hard to develop/maintain
• Spark
• Stream Processing
• Fast/Parallelism
• Keeps data in memory as much as possible during jobs
• Divides work into many lightweight sub-tasks in threads
• Datasources
• Uses datasource-awareness to scale (e.g. indices, shard-awareness, etc)
• Spark allows processing and storage to scale separately
Apache Spark: RDDs and DataFrames
• RDD: Resilient Distributed Dataset
• Original API to access data in Spark
• Lazy: does not access data until a real action is performed
• Spark’s optimiser cannot see inside
• RDDs are slow in Python
• DataFrames API
• Higher-level API, focused on the “what” rather than the “how”
• Has schemas / table-like
• Interchangeable Programming and SQL APIs
• Much easier to read and comprehend
• Optimises execution plan
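The contrast above can be sketched with the same computation written both ways. This is illustrative, not from the talk: column names and data are invented, and the RDD-style arithmetic is duplicated as a pure function so it can be checked without Spark.

```python
# RDD API vs DataFrame API on the same data (pyspark 1.x-style SQLContext).

ROWS = [("stn-1", 11.5), ("stn-2", 9.0), ("stn-1", 14.5)]

def avg_temp_by_station(rows):
    """Pure-Python equivalent of the map/reduceByKey chain below: the RDD
    style spells out *how* the average is computed, step by step."""
    sums = {}
    for station, temp in rows:
        total, count = sums.get(station, (0.0, 0))
        sums[station] = (total + temp, count + 1)
    return {k: total / count for k, (total, count) in sums.items()}

def main():
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[*]", "rdd-vs-df")
    sqlContext = SQLContext(sc)

    # RDD API: the optimiser cannot see inside these lambdas
    rdd = sc.parallelize(ROWS)
    by_station = (rdd.map(lambda r: (r[0], (r[1], 1)))
                     .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                     .mapValues(lambda t: t[0] / t[1]))
    print(by_station.collect())

    # DataFrame API: declare *what*; Spark plans and optimises the execution
    df = sqlContext.createDataFrame(rdd, ["station", "temp"])
    df.groupBy("station").avg("temp").show()
    sc.stop()
```

The DataFrame version is both shorter and visible to Spark's optimiser, which is the main argument the slide makes.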
Apache Spark: Datasources
• Provides a pluggable mechanism for accessing structured data through Spark SQL
• At least these databases are supported in some way
• MySQL
• MongoDB
• Redis
• Cassandra
• Postgres
• HBase
• HDFS
• File
• S3
• In practice: search GitHub, find .jar file, deploy it!
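“Deploy it” here means putting the connector on the job's classpath at submit time. A hypothetical invocation might look like the following; the jar path, package coordinates, and script name are placeholders, so check each connector's README for the right artifact:

```shell
# Illustrative only: paths and artifact coordinates are assumptions
spark-submit \
  --jars /opt/spark/jars/mysql-connector-java.jar \
  --packages org.mongodb.spark:mongo-spark-connector_2.10:1.1.0 \
  my_job.py
```

`--jars` ships a local .jar to the workers, while `--packages` resolves Maven coordinates at launch.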
Apache Spark: SQLContext
• ANSI SQL
• A 30+ year-old language
• Easy to understand
• Nearly everyone already knows it
• Spark SQLContext
• A Spark module for structured data processing, wrapping the RDD API
• Uses the same execution engine as the programmatic APIs
• Supports:
• JOINs/unions
• EXPLAINs
• Subqueries
• ORDER/GROUP/SORT BYs
• Most datatypes you’d expect
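A small sketch of the SQLContext features listed above (pyspark 1.x style; table and column names are invented): a DataFrame is registered as a temp table, queried through a subquery, and EXPLAINed.

```python
# SQLContext demo: temp tables, a subquery, and EXPLAIN (Spark 1.x API).

SUBQUERY_SQL = """
    SELECT name FROM
      (SELECT name, temp FROM readings ORDER BY temp DESC) ranked
    LIMIT 3
"""

def main():
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[*]", "sqlcontext-demo")
    sqlContext = SQLContext(sc)

    readings = sqlContext.createDataFrame(
        [("Neah Bay", 11.5), ("Port Angeles", 9.0), ("Tofino", 10.2)],
        ["name", "temp"])
    readings.registerTempTable("readings")   # now addressable from SQL

    sqlContext.sql(SUBQUERY_SQL).explain()   # show the execution plan
    sqlContext.sql(SUBQUERY_SQL).show()
    sc.stop()
```

`registerTempTable` (renamed `createOrReplaceTempView` in Spark 2.x) is the bridge between the programmatic and SQL APIs.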
Apache Spark: Use Cases
• Business Intelligence/Analytics
• Understand your data with ad-hoc queries
• Tip: use dedicated replicas for expensive queries!
• Data Summaries and Batch Jobs
• Perform expensive summaries in the background, save the result
• Tip: use burstable/cloud hardware for infrequent batch jobs
• Real-time Stream Processing
• Process data as it enters your system
So why not Apache Drill?
• A schema-free SQL engine for Hadoop, NoSQL and Cloud Storage
• Drill does not support / work with
• Relational databases (MySQL) or Redis
• No programmatic-level querying
• No streaming/continuous query functionality
• I don’t know much about it
The Demo
• Scenario: You run a Weather Station data app that stores data in both an RDBMS and a document store
• Goal: summarise weather station data stored in an RDBMS and a document store
• Min Water Temperature
• Avg Water Temperature
• Max Water Temperature
• Total Sample Count
• Get Top-10 (based on avg water temp)
The Demo
• RDBMS: Percona Server for MySQL 5.7
• Stores the weather station metadata (roughly 350 stations: ID, name, location, etc)
• Document-Store: Percona Server for MongoDB 3.2
• Stores the weather time-series sample data (roughly 80,000 samples: various weather readings from stations)
• In-Memory K/V Store: Redis 2.8
• Stores summarised Top-10 data for fast querying of min, avg, max temperature and total sample counts
The Demo
• Apache Spark 1.6.2 Cluster
• 1 x Master
• 2 x Worker/Slaves
• 1 x PySpark Job
• 1 x MacBook Pro
• 3 x VirtualBox VMs
• Job submitted on Master
• Job submitted on Master
The Demo
(Play Demo Video Now)
The Demo: The PySpark Job
• SparkContext and SQLContext setup
• MySQL table registered as a SQLContext temp table
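The original code for this slide was shown as a screenshot, so the following is a hedged reconstruction of this first step: context setup plus a MySQL table loaded over JDBC. Host, database, table, and credential values are assumptions.

```python
# Demo job, part 1 (sketch): SparkContext/SQLContext setup and the MySQL
# station-metadata table registered as a SQLContext temp table.

MYSQL_OPTIONS = {
    "url": "jdbc:mysql://mysql-host:3306/weather",   # assumed host/db
    "dbtable": "weather_stations",                   # assumed table name
    "user": "spark",
    "password": "secret",
    "driver": "com.mysql.jdbc.Driver",
}

def main():
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="weather-summary")
    sqlContext = SQLContext(sc)

    # Spark 1.x JDBC datasource; needs the MySQL connector .jar deployed
    stations = (sqlContext.read.format("jdbc")
                          .options(**MYSQL_OPTIONS)
                          .load())
    stations.registerTempTable("stations")   # now queryable via SQL
```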
The Demo: The PySpark Job
• MongoDB collection registered as a SQLContext temp table
• New Redis hash schema registered as a SQLContext temp table
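Again reconstructed from the slide's annotations rather than its screenshot: the MongoDB sample collection and a Redis-backed table are registered as temp tables. Connector format strings and option names vary by connector version, so treat the ones below as assumptions and consult each connector's docs.

```python
# Demo job, part 2 (sketch): MongoDB and Redis datasources as temp tables.

MONGO_URI = "mongodb://mongo-host:27017/weather.samples"   # assumed URI

def main():
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="weather-summary")
    sqlContext = SQLContext(sc)

    # MongoDB collection -> DataFrame (mongo-spark connector, 1.x era)
    samples = (sqlContext.read
                         .format("com.mongodb.spark.sql")
                         .option("uri", MONGO_URI)
                         .load())
    samples.registerTempTable("samples")

    # Redis hashes -> DataFrame; the exact datasource class depends on the
    # spark-redis build deployed (see its README)
    top10 = (sqlContext.read
                       .format("org.apache.spark.sql.redis")
                       .option("table", "weather:top10")
                       .load())
    top10.registerTempTable("top10")
```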
The Demo: The PySpark Job
• Reading back from Redis
• Aggregation
The Demo: The PySpark Job
• Aggregation (continued)
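The aggregation itself can be sketched as a single SQL statement over the registered temp tables, producing the four summary values and the Top-10 ordering the demo calls for. Table and column names mirror the demo's description, not its exact (screenshotted) code.

```python
# Demo job, part 3 (sketch): min/avg/max water temperature and sample count
# per station, ordered to give the Top-10 by average.

SUMMARY_SQL = """
    SELECT s.station_id,
           MIN(m.water_temp) AS min_temp,
           AVG(m.water_temp) AS avg_temp,
           MAX(m.water_temp) AS max_temp,
           COUNT(*)          AS sample_count
    FROM samples m
    JOIN stations s ON s.station_id = m.station_id
    GROUP BY s.station_id
    ORDER BY avg_temp DESC
    LIMIT 10
"""

def main():
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="weather-summary")
    sqlContext = SQLContext(sc)
    # ("stations" and "samples" temp tables registered as on earlier slides)
    top10 = sqlContext.sql(SUMMARY_SQL)
    top10.show()
```

Note the query joins across what are physically a MySQL table and a MongoDB collection; to Spark SQL they are just two temp tables.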
The Demo
Questions?
DATABASE PERFORMANCE
MATTERS