Real-time Application Architecture
David Mellor
VP & Chief Architect, Curriculum Associates

Building a Real-Time Feedback Loop for Education
David Mellor
VP & Chief Architect, Curriculum Associates
(Adjusted title to match abstract submission)
• Curriculum Associates has a mission to make classrooms better places for teachers and students.
• Our founding value drives us to continually innovate and produce exciting new products that give every student and teacher the chance to succeed:
– Students
– Teachers
– Administrators
• To meet some of our latest expectations, and provide the best teacher/student feedback available, we are enabling our educators with real-time data.
Our Mission
• The Architectural Journey
• Understanding Sharding
• Using Multiple Storage Engines
• Adding Kafka Message Queues
• Integrating the Data Lake
In This Talk
The Architectural Journey
The Architecture End State
[Diagram: iReady event system (Lesson and iReady events) → HTC Dispatch → Confluent Kafka (Brokers, ZooKeeper), with a Debezium Connector (DB to Kafka) and an S3 Connector (Kafka to S3) → Data Lake (Raw Store, Read Optimized Store) → Nightly Load Files → dual MemSQL Reporting DBs]
Our Architectural Journey
• Where did we start, and what fundamental problem did we need to solve to get real-time values to our users?
[Diagram: iReady (Lesson and iReady events) → Scheduled Jobs → ETL to Data Warehouse → Data Warehouse → ETL to Data Mart → Reporting Data Mart]
Start with the Largest Aggregate Report
Our largest aggregate report logically consists of:
– 6,000,000 leaf values filtered to 250,000
– 600,000,000 leaf values filtered to 10,000,000, used as the intermediate dataset
– Rolled up to produce 300 aggregate totals
– Response target: 1 second
6,000,000+ students and 600,000,000+ facts overall; a district report rolls 10,000,000 facts up into 300 schools.
[Diagram: dimension table (SID, DESC, ATTR1) joined to fact table (SID, FACT1, FACT2)]
• SQL Compatible – SQL is our developers' basic paradigm
• Fast Calculations – we need to compute large calculations across large datasets
• Fast Updates – we need to do real-time updates
• Fast Loads – we need to reload our reporting database nightly
• Massively Scalable – we need to support large data volumes and large numbers of concurrent users
• Cost Effective – we need a practical solution based on cost
In-Memory Databases: MemSQL
• Columnar and Row storage models provide very fast aggregations across large amounts of data
• Very fast load times allow us to update our reporting DB nightly
• Very fast update times for Row storage tables
• Highly scalable, based on its MPP architecture
• Unique ability to query across Columnar and Row tables in a single query
• Convert our existing database design to be optimal in MemSQL
• Analyze our usage patterns to determine the best sharding key
• Create a prototype and run typical queries to determine the optimal database structure across the spectrum of projected usage
– Use the same sharding key in all tables
– Push predicates down to as many tables as we can
Our MemSQL Journey Begins
Understanding Sharding
Why is the selection of a sharding key so important?
• Creating a database defines the number of partitions (here, nine: PS1–PS9 spread across NODE1–NODE3)
• Tables are created in the database using a sharding key that is advantageous to query execution
• The goal is to have the execution of a given query distributed as evenly as possible over the partitions
[Diagram: dimension table (SID, DESC, ATTR1) and fact table (SID, FACT1, FACT2) sharded into partitions PS1–PS9 across three nodes]
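The partitioning idea above can be sketched in a few lines. This is an illustrative simulation, not MemSQL internals: rows are assigned to partitions by hashing the sharding key, and partitions are spread evenly across nodes. The 9-partition/3-node layout mirrors the slide; the function names are mine.

```python
PARTITIONS = 9
NODES = 3

def partition_for(shard_key: int) -> int:
    """Map a sharding-key value to one of the 9 partitions."""
    return hash(shard_key) % PARTITIONS

def node_for(partition: int) -> int:
    """Partitions PS1-PS3 live on NODE1, PS4-PS6 on NODE2, PS7-PS9 on NODE3."""
    return partition // (PARTITIONS // NODES)

# Shard 90,000 fact rows by their student id (sid).
counts = [0] * PARTITIONS
for sid in range(90_000):
    counts[partition_for(sid)] += 1

# A good sharding key spreads the work evenly: each partition
# ends up holding roughly the same number of rows, so a query
# touching many sids keeps all nine partitions equally busy.
assert sum(counts) == 90_000
assert max(counts) - min(counts) < 1_000
```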
How does the sharding key affect the "join"?
Select a.sid, b.factid from table1 a, table2 b
Where a.sid in {10 ….. } and b.sid in {10 ….. }
And a.sid = b.sid
The basis of the join is the sid column. When the sharding key is chosen based on the sid column for both tables, the join can be done independently within each partition and the results merged. This is the ideal situation for getting the nodes working in parallel, which can maximize query performance.
How does the sharding key affect the "join"? (continued)
Select a.sid, b.factid from table1 a, table2 b
Where a.sid in {10 ….. } and b.sid in {10 ….. }
And a.sid = b.sid
When the sharding key is not based on the sid column for both tables, the join cannot be done independently within each partition, and will cause what is called a broadcast. This is not the ideal situation for getting the nodes working in parallel, and we have seen query performance degradation in these cases.
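The co-located versus broadcast contrast can be sketched as a toy simulation. This is a hedged illustration, not a distributed engine: with both tables sharded on sid, each partition joins locally and ships nothing; when one table is sharded on another column, its rows must be broadcast to every partition before the join. Table contents and names are invented for the example.

```python
PARTITIONS = 3

def shard(rows, key):
    """Distribute rows into partitions by hashing the given column."""
    parts = [[] for _ in range(PARTITIONS)]
    for row in rows:
        parts[row[key] % PARTITIONS].append(row)
    return parts

table1 = [{"sid": s, "attr": s * 2} for s in range(12)]
table2 = [{"sid": s, "factid": s * 10, "other": s + 7} for s in range(12)]

# Case 1: both tables sharded on sid -> matching sids land on the same
# partition, so each partition joins locally and the results are merged.
t1_parts = shard(table1, "sid")
t2_parts = shard(table2, "sid")
local_join = [
    (a["sid"], b["factid"])
    for p in range(PARTITIONS)
    for a in t1_parts[p]
    for b in t2_parts[p]
    if a["sid"] == b["sid"]
]

# Case 2: table2 sharded on "other" -> a given sid's rows may sit on any
# partition, so every table2 row must be shipped to every other partition
# (a broadcast) before the join can run.
rows_shipped = len(table2) * (PARTITIONS - 1)

assert len(local_join) == 12   # same answer either way...
assert rows_shipped == 24      # ...but the broadcast moves data over the network
```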
Using Multiple Storage Engines
• Row storage was the most performant for queries, loads, and updates – it is also the most expensive solution
• Columnar storage was performant for some queries and for loads, but degraded with updates – cost effective, but not performant enough on too many of the target queries
• To maximize our use of MemSQL, we combined Row storage and Columnar storage to create a logical table:
– Volatile (changeable) data is kept in Row storage
– Non-volatile (immutable) data is kept in Columnar storage
– Requests for data are made using "Union All" queries
Columnar and Row Storage Models
Columnar and Row
[Diagram: a logical table (SID, FACT1, FACT2, Active) presented to queries as one table, backed by a Row Storage portion holding the volatile rows and a Columnar Storage portion holding the immutable rows]
Select sid, fact1, fact2
From fact_row
Where sid in (1 …10)
Union All
Select sid, fact1, fact2
From fact_columnar
Where sid in (1 …10)
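The logical-table pattern behind that Union All can be sketched in plain Python. This is a minimal stand-in, assuming nothing about MemSQL itself: two lists play the role of the row-store and columnar tables, and a read unions the matching rows from both.

```python
fact_row = [        # volatile: still-active facts, updated in place
    {"sid": 1, "fact1": 10, "fact2": 5, "active": True},
    {"sid": 2, "fact1": 20, "fact2": 6, "active": True},
]
fact_columnar = [   # immutable: settled facts, bulk-loaded nightly
    {"sid": 1, "fact1": 11, "fact2": 7, "active": False},
    {"sid": 3, "fact1": 30, "fact2": 8, "active": False},
]

def select_facts(sids):
    """Equivalent of: SELECT ... FROM fact_row WHERE sid IN (...)
       UNION ALL SELECT ... FROM fact_columnar WHERE sid IN (...)."""
    return [r for r in fact_row if r["sid"] in sids] + \
           [r for r in fact_columnar if r["sid"] in sids]

rows = select_facts({1, 2, 3})
assert len(rows) == 4                       # both stores contribute rows
assert sum(r["fact1"] for r in rows) == 71  # aggregate over the logical table
```

The caller never needs to know which store a row came from; Union All (not Union) is the right operator because the two stores hold disjoint rows and no dedup pass is wanted.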
Adding Kafka Message Queues
Dispatching the Human Time Events
[Diagram: iReady event system (Lesson and iReady events) → HTC Dispatch → Kafka (Brokers, ZooKeeper) → MemSQL Reporting DB]
• We had a database engine that could perform our queries
• We solved our cost and scaling needs
• We proved we could load and update the database on the desired schedule
• How are we going to get the real-time data to the Reporting DB?
Dispatching the Human Time Events (continued)
[Diagram: JSON event payloads flow from the iReady event system through HTC Dispatch into Kafka, then through a MemSQL Pipeline into the MemSQL Reporting DB]
• Use MemSQL Pipelines to ingest data into MemSQL from Kafka
• Pipelines are declared MemSQL objects
• Managed and controlled by MemSQL
• No significant transforms
• Tables are augmented with a column to contain the event in JSON form
• All other columns are derived
Kafka and MemSQL Pipelines
[Diagram: the plain fact table (SID, FACT1, FACT2) vs. the augmented table (SID, FACT1, FACT2, event)]

CREATE TABLE realtime.fact_table (
  event JSON NOT NULL,
  SID AS event::data::$SID PERSISTED varchar(36),
  FACT1 AS event::data::rawFact1 PERSISTED int(11),
  FACT2 AS event::data::rawFact2 PERSISTED int(11),
  KEY (SID)
);

CREATE PIPELINE fact_pipe AS
LOAD DATA KAFKA '0.0.0.0:0000/fact-event-stream'
INTO TABLE realtime.fact_table
COLUMNS (event);
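What the persisted computed columns buy can be illustrated outside the database. In this hedged sketch the pipeline stores only the raw JSON event, and SID/FACT1/FACT2 are derived from paths inside it; the field names follow the slide's event::data::... paths, while the sample payload is invented.

```python
import json

def ingest(event_json: str) -> dict:
    """Store the raw event plus the columns derived from its data payload,
    mimicking a table whose only loaded column is `event`."""
    data = json.loads(event_json)["data"]
    return {
        "event": event_json,             # the raw JSON column
        "SID": str(data["SID"]),         # event::data::$SID      -> varchar(36)
        "FACT1": int(data["rawFact1"]),  # event::data::rawFact1  -> int(11)
        "FACT2": int(data["rawFact2"]),  # event::data::rawFact2  -> int(11)
    }

msg = '{"data": {"SID": "s-42", "rawFact1": "7", "rawFact2": "3"}}'
row = ingest(msg)
assert row["SID"] == "s-42" and row["FACT1"] == 7 and row["FACT2"] == 3
```

Because the derived columns are persisted at write time, queries can filter and index on SID without re-parsing the JSON on every read.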
Adding the Nightly Rebuild Process
[Diagram: transactional DB → Debezium Connector (DB to Kafka) → Confluent Kafka, alongside the iReady → HTC Dispatch event path into the MemSQL Reporting DB]
• Get the transactional data from the database
• Employ database replication to dedicated replicas
• Introduce the Confluent platform to unify data movement through Kafka
• Deploy the Debezium Confluent Connector to move the replication log data into Kafka
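The change-data-capture idea behind Debezium can be reduced to a small replay loop. This is a hedged sketch: the event shape here is simplified and is not Debezium's actual envelope, but it shows why replaying replication-log events in order reproduces the source table.

```python
def apply_change(table: dict, event: dict) -> None:
    """Apply one replication-log event to an in-memory replica keyed by sid."""
    op, sid = event["op"], event["sid"]
    if op in ("insert", "update"):
        table[sid] = event["row"]   # upsert the new row image
    elif op == "delete":
        table.pop(sid, None)        # drop the row

replica = {}
log = [   # a toy slice of a replication log, in commit order
    {"op": "insert", "sid": 1, "row": {"fact1": 10}},
    {"op": "insert", "sid": 2, "row": {"fact1": 20}},
    {"op": "update", "sid": 1, "row": {"fact1": 15}},
    {"op": "delete", "sid": 2},
]
for event in log:
    apply_change(replica, event)

assert replica == {1: {"fact1": 15}}   # replica matches the source end state
```

Ordering is what makes this work: within a key, events must be applied in log order, which is why the CDC stream is partitioned so that all changes for one row land on the same Kafka partition.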
Integrating the Data Lake
Create and Update the Data Lake
[Diagram: Confluent Kafka → S3 Connector (Kafka to S3) → Data Lake (Raw Store → Read Optimized Store), alongside the existing Debezium and HTC Dispatch paths into the MemSQL Reporting DB]
• Build a Data Lake in S3
• Deploy the Confluent S3 Connector to move the transaction data from Kafka to the Data Lake
• Split the Data Lake into two distinct forms – Raw and Read Optimized
• Deploy Spark to move the data from the Raw form to the Read Optimized form
Move the Data from the Data Lake to MemSQL
[Diagram: Data Lake (Raw Store → Read Optimized Store) → Nightly Load Files → MemSQL Reporting DB, alongside the Kafka paths]
• Deploy Spark to transform the data from the Read Optimized form to a Reporting Optimized form
• Save the output to a managed S3 location
• Deploy MemSQL S3 Pipelines to automatically ingest the nightly load files from a specified location
• Deploy a MemSQL Pipeline to Kafka
• Activate the MemSQL Pipeline when the reload is complete
Swap the Light/Dark MemSQL DB
[Diagram: two MemSQL Reporting DBs – the live ("Light") DB serving traffic and the freshly rebuilt ("Dark") DB – fed by the Nightly Load Files and Kafka paths]
• Open up the Dark DB to accept connections
• Trigger an iReady application event to drain the current connection pool and replace the connections with new connections to the new database
• Close the current Light DB
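The light/dark swap can be sketched as a connection-pool operation. This is an illustrative sketch only: the pool class, the database names, and the string "connections" are all invented stand-ins for the real application's pool and the two MemSQL reporting databases.

```python
class ConnectionPool:
    """Toy pool: the app holds a fixed set of connections to one database."""

    def __init__(self, db_name: str, size: int = 4):
        self.db_name = db_name
        self.connections = [f"{db_name}-conn-{i}" for i in range(size)]

    def drain_and_reconnect(self, new_db: str) -> None:
        """Close the existing connections and open new ones against new_db,
        mirroring the application event triggered during the nightly swap."""
        self.connections = [f"{new_db}-conn-{i}"
                            for i in range(len(self.connections))]
        self.db_name = new_db

light, dark = "reporting_a", "reporting_b"  # two MemSQL reporting DBs
pool = ConnectionPool(light)

# Nightly swap: the dark DB finishes its reload and starts accepting
# connections, the app drains its pool onto it, the roles flip, and the
# old light DB can then be closed.
pool.drain_and_reconnect(dark)
light, dark = dark, light

assert pool.db_name == "reporting_b"
assert all(c.startswith("reporting_b") for c in pool.connections)
```

The benefit of the pattern is that readers never see a half-loaded database: queries hit either the old copy or the fully rebuilt one, never an intermediate state.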
The Architecture End State
[Diagram: the same end-state architecture shown earlier – iReady events → HTC Dispatch → Confluent Kafka (Debezium Connector, S3 Connector) → Data Lake (Raw Store, Read Optimized Store) → Nightly Load Files → dual MemSQL Reporting DBs]
• Ensure the system you are considering is up to the challenge of your most sophisticated queries
• With distributed systems, spend time picking the right sharding strategy
• Make use of multiple storage engines where available
• Design workflows with message queues for flexibility and updateability
• Incorporate data lakes for long-term retention and context
Key Takeaways
Curriculum Associates Strata NYC 2017
Editor's Notes
  • #7: User growth: 250K – 4.5M in 4 years; 80K concurrent users; 60K/min user diagnostic item responses; 13K/min lesson component starts; 332K/day diagnostics completed; 1.6M/day lessons completed
  • #13: Create database defines the number of partitions. A partition is created on a specific node. Tables in the database are sharded evenly among the partitions using the defined sharding key.
  • #14: A good design allows the join to all be performed on a single node. If not, MemSQL needs to shuttle the join data to the necessary node to perform the join.
  • #15: A good design allows the join to all be performed on a single node. If not, MemSQL needs to shuttle the join data to the necessary node to perform the join.
  • #25: Raw Store is Avro or JSON; Read Optimized Store is Parquet.