Opensource Column Store Databases:
MariaDB ColumnStore vs. ClickHouse
Alexander Rubin
VirtualHealth
2
About me
● Working with MySQL for 10-15 years
○ Started at MySQL AB 2006
- Sun Microsystems, Oracle (MySQL Consulting)
- Percona since 2014
○ Recently joined Virtual Health (medical records startup)
3
MariaDB ColumnStore, ClickHouse and Storage Formats
Caution:
1. This talk is not about specifics of implementation
○ A number of presentations about Clickhouse and MariaDB @ Percona Live 2019
2. This is all about:
○ What? -- what is the problem
○ Why? -- why queries are slow
○ How? -- how to solve
3. Examples are real-world, from medical insurance records
○ (but no actual PII data shown)
4
MySQL = { }
Intro: MySQL and Slow Queries
Simple query - top 10 - clients who visited doctors most often (data from 2017-2019)
mysql> SELECT
-> client_id,
-> min(date) as first_visit,
-> max(date) as last_visit,
-> count(distinct date) as days_visited,
-> count(cv.id) as visits,
-> count(distinct cv.service_location_name) as locations
-> FROM client_visit cv
-> GROUP BY client_id
-> ORDER by visits desc
-> LIMIT 10;
+-----------+-------------+------------+--------------+--------+-----------+
| client_id | first_visit | last_visit | days_visited | visits | locations |
+-----------+-------------+------------+--------------+--------+-----------+
| ......... | 2017-08-07 | 2019-05-24 | .. | ... | .. |
10 rows in set (46.24 sec)
5
What exactly is slow?
Is 47 seconds slow?
… depends on expectations
● In the Data Science world it is blazing fast
● For a realtime report/dashboard it is extremely slow
… Web-based queries: users tend to reload the page if it is slow
… leaving MySQL with tons of queries doing the same thing
6
What to do?
Some ideas:
1. Use index Luke!
2. Table per report
3. Pre-aggregate - table per group of reports
4. Something else
7
Use index
But, it is already using index:
id: 1
select_type: SIMPLE
table: cv
partitions: NULL
type: index
possible_keys: FK_client_visit
key: FK_client_visit
key_len: 5
ref: NULL
rows: 10483873
filtered: 100.00
Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)
PRIMARY KEY (`id`),
KEY `FK_client_visit_author_id` (`client_id`)
8
Ok, better index: covering index
mysql> alter table client_visit add key comb(client_id, date, service_location_name);
Query OK, 0 rows affected (38.48 sec)
Records: 0 Duplicates: 0 Warnings: 0
table: cv
partitions: NULL
type: index
possible_keys: FK_client_id,comb
key: comb
key_len: 776
ref: NULL
rows: 10483873
filtered: 100.00
Extra: Using index; Using temporary; Using filesort
10 rows in set (12.18 sec)
Still slow!
9
Ok, how large is the table?
mysql> show table status like 'client_visit'\G
*************************** 1. row ***************************
Name: client_visit
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 10483873
Avg_row_length: 233
Data_length: 2 452 602 880
Index_length: 1 644 773 376
~= 4G on disk, that is it!
24 columns, including
`notes` text
`description` text
etc...
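A quick back-of-the-envelope check of those SHOW TABLE STATUS numbers (an illustrative sketch, not from the talk): InnoDB's Avg_row_length is simply Data_length / Rows, and the "~4G on disk" is data plus indexes.

```python
# Numbers reported by SHOW TABLE STATUS above.
rows = 10_483_873
avg_row_length = 233          # bytes, as reported by InnoDB
data_length = 2_452_602_880   # bytes
index_length = 1_644_773_376  # bytes

# Avg_row_length is derived from Data_length / Rows, so the product
# should roughly reproduce Data_length.
estimated_data = rows * avg_row_length
print(estimated_data)               # ~2.44 GB, close to Data_length

# Total on-disk footprint = data + indexes.
total_gib = (data_length + index_length) / 1024**3
print(round(total_gib, 2))          # ~3.82 GiB, i.e. the "~4G on disk"
```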
10
Ok, other options in MySQL?
Create table per each report
Problems
1. Too many tables
2. Hard to maintain
11
Ok, other options in MySQL?
Pre-aggregate in a table:
● group by client_id + avg, sum, …
● group by date + avg,sum
Final report will do another aggregation if needed
Problems:
1. Some aggregates can’t be re-aggregated
2. Still too many tables
3. Hard to maintain
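Why some aggregates can't be re-aggregated (an illustrative sketch, not from the slides): an average of per-group averages is not the overall average; the same goes for COUNT(DISTINCT) summed across groups. SUM and COUNT, by contrast, can be safely re-summed, so a pre-aggregate table should keep those and recompute AVG as SUM/COUNT in the final report.

```python
# Two pre-aggregated groups of billed amounts (hypothetical data).
group_a = [100, 100, 100, 100]   # avg = 100
group_b = [500]                  # avg = 500

avg_a = sum(group_a) / len(group_a)
avg_b = sum(group_b) / len(group_b)

# Re-aggregating the stored averages gives the wrong answer...
avg_of_avgs = (avg_a + avg_b) / 2   # 300.0

# ...while the true average must be weighted by group size.
true_avg = sum(group_a + group_b) / (len(group_a) + len(group_b))  # 180.0

print(avg_of_avgs, true_avg)
```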
12
And it was only the beginning… now this:
SELECT
cv.client_id as client_id,
min(date) as first_visit,
max(date) as last_visit,
count(distinct date) as days_visited,
count(distinct cv.id) as visits,
count(distinct cp.cpt_code) as procedures,
count(distinct cv.service_location_name) as locations,
sum(billed_amount) as total_billed,
max(billed_amount) as max_price,
avg(billed_amount) as avg_price
FROM
client_visit cv
join client_procedure cp on cp.encounter_id = cv.encounter_id
join client_procedure_claim cpc on cp.id = cpc.client_procedure_id
join client_claim cc on cc.id = cpc.client_claim_id
GROUP BY client_id
ORDER BY total_billed desc
LIMIT 10
Highly normalized schema
13
4 table JOIN, all large tables
+-----------+-------------+------------+--------------+--------+------------+-----------+--------------+-----------+-------------+
| client_id | first_visit | last_visit | days_visited | visits | procedures | locations | total_billed | max_price | avg_price |
+-----------+-------------+------------+--------------+--------+------------+-----------+--------------+-----------+-------------+
| ....... | 2018-02-14 | 2019-04-22 | 64 | 64 | .. | .. | 200K | 11K | 449.34 |
...
10 rows in set (5 min 18.16 sec)
14
Why is MySQL slow for such queries?
1. Row-oriented - even if we retrieve 2 fields, it still needs to read the whole row
2. InnoDB organizes tables in pages (16K) - it needs to read even more
3. MySQL uses 1 CPU core per query, not utilizing all cores
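The row-vs-column difference can be sketched in a few lines of Python (a toy model, not how InnoDB or ClickHouse actually store bytes): the same table laid out once as row tuples and once as per-column arrays. Aggregating one field out of 24 touches only 1/24th of the values in the columnar layout.

```python
# Toy model: one table, two physical layouts.
n_rows, n_cols = 1000, 24

# Row-oriented: each row is a tuple of 24 fields; reading one field
# still drags the whole row through memory (and, on disk, whole 16K pages).
rows = [tuple(r * n_cols + c for c in range(n_cols)) for r in range(n_rows)]

# Column-oriented: each column is its own contiguous array; an aggregate
# over one column reads exactly the values it needs (and compresses well,
# since similar values sit next to each other).
columns = [[r * n_cols + c for r in range(n_rows)] for c in range(n_cols)]

total_row_layout = sum(row[3] for row in rows)  # scans all 24 fields per row
total_col_layout = sum(columns[3])              # scans only column 3

assert total_row_layout == total_col_layout     # same answer either way
values_touched_row = n_rows * n_cols            # 24,000 fields scanned
values_touched_col = n_rows                     # 1,000 fields scanned
print(values_touched_row // values_touched_col) # 24x less data read
```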
15
Why is MySQL slow for such queries?
Row-oriented DBMS (MySQL) Column-oriented DBMS
https://clickhouse.yandex/docs/en/
16
Column Store Databases
MariaDB Columnstore
https://mariadb.com/kb/en/library/mariadb-columnstore/
Tips and Tricks with MariaDB ColumnStore
Wednesday 5:10 PM - 5:35 PM@ Texas
17
Column Store Databases
Yandex Clickhouse
https://clickhouse.yandex/
Low Cost Transactional and Analytics With MySQL and
Clickhouse, Have Your Cake and Eat It Too!
Wednesday 5:40 PM - 6:05 PM
Clickhouse Features to Blow your Mind
Thursday 11:55 AM - 12:45 PM
18
Column-store tests
Testing box 1:
● AWS ec2 instance, c5d.4xlarge
● RAM: 32.0 GiB
● vCPU: 16
● Disk: NVMe SSD + EBS
Testing box 2:
● AWS ec2 instance, c5d.18xlarge
● RAM: 144.0 GiB
● vCPU: 72
● Disk: NVMe SSD + EBS
19
Is it worth using column store: Q1
                      MySQL    Clickhouse    MariaDB
Response time (sec)   46.24    0.754         11.43
Speedup vs MySQL      --       62x (6248%)   4x (404%)
(AWS ec2 instance, c5d.4xlarge)
20
Is it worth using column store: Q2
                      MySQL             Clickhouse   MariaDB
Response time         5 min 18.16 sec   33.83 sec    1 min 2.16 sec
Speedup vs MySQL      --                9x (940%)    5x (511%)
(AWS ec2 instance, c5d.4xlarge)
21
Table sizes on disk
                        MySQL            Clickhouse     ColumnStore
client_visit            5,876,219,904    793,976,832    3,606,462,464
client_procedure        13,841,203,200   2,253,180,928  9,562,865,664
client_procedure_claim  2,466,250,752    292,007,936    335,683,584
client_claim            11,710,496,768   2,400,182,272  6,720,749,568
Total                   33,894,170,624   5,739,347,968  20,225,761,280
Smaller than MySQL (x)  --               5.91x          1.68x
Compression
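The "smaller than MySQL" factors are just the on-disk totals divided out; a quick check of the table above:

```python
# Total on-disk sizes from the table above, in bytes.
mysql_total = 33_894_170_624
clickhouse_total = 5_739_347_968
columnstore_total = 20_225_761_280

print(round(mysql_total / clickhouse_total, 2))   # 5.91
print(round(mysql_total / columnstore_total, 2))  # 1.68
```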
22
Exporting from MySQL
Usually 3 options
1. ETL to Star Schema
2. ETL to flat de-normalized tables
3. Copy / replicate realtime (as is)
I will talk about option 3.
23
Yandex Clickhouse
sudo apt-get install clickhouse-client
clickhouse-server
24
Clickhouse: export from mysql (schema)
https://github.com/Altinity/clickhouse-mysql-data-reader
1. Schema import
$ clickhouse-mysql --create-table-sql \
    --src-host=mysql-replica-host \
    --src-user=export \
    --src-password=xxxxxx \
    --src-schemas=main \
    --src-tables=client_condition,client_procedure,client_visit
It will choose a partition key and a sort key, e.g.
ENGINE = MergeTree() PARTITION BY toYYYYMM(created_date) ORDER BY client_id
Beware: enum is not supported (bug)
25
Clickhouse: export from MySQL (data)
1. Use clickhouse-mysql-data-reader (slower)
2. Use native Clickhouse MySQL connection:
INSERT INTO client_procedure_claim SELECT *
FROM
mysql('host', 'db', 'client_procedure_claim', 'export', 'xxxxx')
Ok.
0 rows in set. Elapsed: 17.821 sec. Processed 37.40 million rows, 299.18
MB (2.10 million rows/s., 16.79 MB/s.)
26
Clickhouse: connect using MySQL client
https://github.com/sysown/proxysql/wiki/ClickHouse-Support
$ wget https://github.com/sysown/proxysql/releases/download/v2.0.4/proxysql_2.0.4-ubuntu18_amd64.deb
$ dpkg -i proxysql_2.0.4-clickhouse-ubuntu18_amd64.deb
$ proxysql --clickhouse-server
27
Clickhouse: connect using MySQL client
$ mysql -h 127.0.0.1 -P 6032 ...
Admin> SELECT * FROM clickhouse_users;
Empty set (0.00 sec)
Admin> INSERT INTO clickhouse_users VALUES ('clicku','clickp',1,100);
Query OK, 1 row affected (0.00 sec)
Admin> LOAD CLICKHOUSE USERS TO RUNTIME;
Query OK, 0 rows affected (0.00 sec)
Admin> SAVE CLICKHOUSE USERS TO DISK;
Query OK, 0 rows affected (0.01 sec)
https://github.com/sysown/proxysql/wiki/ClickHouse-Support
28
Clickhouse: connect using MySQL client
mysql -h 127.0.0.1 -P 6090 -uclicku -pclickp
...
Server version: 5.5.30 (ProxySQL ClickHouse Module)
mysql> select client_id, count(cv.id) as visits,
count(distinct cv.service_location_name) as locations
from client_visit cv
group by client_id order by visits desc limit 10;
+-----------+--------+-----------+
| client_id | visits | locations |
+-----------+--------+-----------+
...
10 rows in set (0.53 sec)
29
ProxySQL to Clickhouse - experimental
Some bugs exist:
select min(date) as first_visit, max(date) as last_visit from
client_visit;
ERROR 2013 (HY000): Lost connection to MySQL server during query
30
Clickhouse - joining MySQL source
CREATE VIEW clients_mysql as select * from
mysql('mysql_host', 'db', 'client', 'export', 'xxxx');
select count(*) from clients_mysql;
SELECT count(*)
FROM clients_mysql
┌─count()─┐
│ 1035284 │
└─────────┘
31
Clickhouse - joining MySQL source
SELECT cv.client_id as client_id ...
FROM
...
INNER JOIN
(
SELECT
user_id,
is_active
FROM clients_mysql
) AS c ON cv.client_id = c.user_id
WHERE c.is_active = 1
...
10 rows in set. Elapsed: 36.653 sec. Processed 122.58 million
rows, 2.30 GB (3.34 million rows/s., 62.72 MB/s.)
CREATE VIEW clients_mysql as
select * from
mysql('mysql_host', 'db',
'client', 'export', 'xxxx');
32
MariaDB Column Store Architecture
33
MariaDB ColumnStore: export from mysql
Schema import - need to create custom schema
mysqldump --no-data
… change engine=InnoDB to engine=Columnstore
34
MariaDB ColumnStore: export from mysql
Schema import - need to create custom schema
mysqldump --no-data does not work out of the box
$ mcsmysql test < client_visit.sql
ERROR 1069 (42000) at line 25: Too many keys specified; max 0
keys allowed
$ mcsmysql test < client_visit.sql
ERROR 1075 (42000) at line 25: Incorrect table definition; there
can be only one auto column and it must be defined as a key
35
MariaDB ColumnStore: export from mysql (data)
Fastest way -
1. mysql > select into outfile …
2. $ cpimport ...
Easiest way:
1. Import into InnoDB locally (columnstore includes MySQL server)
2. Run “insert into columnstore_table select * from innodb_table”
36
MariaDB ColumnStore: Joining MySQL source
1. Export using InnoDB storage engine
2. JOIN across engines
SELECT ...
FROM
client_visit cv
join client_procedure cp on cp.encounter_id = cv.encounter_id
join client_procedure_claim cpc on cp.id = cpc.client_procedure_id
join client_claim cc on cc.id = cpc.client_claim_id
join client_innodb c on cv.client_id = c.user_id
WHERE c.is_active = 1
GROUP BY client_id
ORDER BY total_billed desc
limit 10;
37
Clickhouse: replication from MySQL
clickhouse-mysql --src-server-id=1 --src-resume --src-wait \
    --nice-pause=1 --src-host=10.0.0.142 --src-user=chreader \
    --src-password=pass --src-tables=wiki.pageviews \
    --dst-host=127.0.0.1 --pump-data --csvpool
38
ColumnStore: replication from MySQL
Coming soon
https://jira.mariadb.org/browse/MCOL-498
https://jira.mariadb.org/browse/MCOL-593
39
Update / delete - MariaDB ColumnStore
DMLs are usually the slowest in Columnar Stores
Single row:
MariaDB [vh]> update client set is_active = 0 where user_id = 3216031;
Query OK, 1 row affected (0.156 sec)
Rows matched: 0 Changed: 0 Warnings: 0
40
Update / delete - Clickhouse
Implemented as “Alter table update” (mutations)
Asynchronous
https://clickhouse.yandex/docs/en/query_language/alter/#alter-mutations
Single row:
:) ALTER TABLE client
UPDATE is_active = 0 WHERE user_id = 3216031
Ok.
0 rows in set. Elapsed: 0.006 sec.
41
When NOT to use Column Store db (1)
Single row (full row) by id: select * from client_claim where id = <num>;
MySQL: 1 row in set (0.00 sec)
Columnstore: 1 row in set (0.039 sec)
Clickhouse: 1 rows in set. Elapsed: 0.023 sec.
42
When NOT to use Column Store db (2)
When MySQL can use an index + LIMIT
SELECT *
FROM client_claim
WHERE place_of_service = 'OFFICE'
ORDER BY received_date DESC
LIMIT 10
MySQL: 10 rows in set (0.00 sec) -- index scan
Columnstore: 10 rows in set (20.838 sec)
Clickhouse: 10 rows in set. Elapsed: 1.148 sec
Columnstore may also throw: ERROR 1815 (HY000): Internal error: IDB-2015: Sorting length exceeded. Session variable max_length_for_sort_data needs to be set higher.
43
When NOT to use Column Store db (2)
mysql> explain SELECT * FROM client_claim WHERE place_of_service = 'OFFICE'
ORDER BY received_date DESC LIMIT 10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: client_claim
partitions: NULL
type: ref
possible_keys: place_of_service_received_date
key: place_of_service_received_date
key_len: 768
ref: const
rows: 13072594
filtered: 100.00
Extra: Using where
44
Other idea: daily online snapshots
● Size of a columnar store DB is significantly smaller
● We can load daily snapshots
○ I.e. store 30 days of data that can be queried (without restore from backup)
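A retention sketch for that idea (a hypothetical helper, assuming daily snapshot tables named snapshot_YYYYMMDD; the naming scheme and function are illustrative, not from the talk): keep the last 30 days and list older snapshots for dropping.

```python
from datetime import date, timedelta

def snapshots_to_drop(snapshot_names, today, keep_days=30):
    """Given tables named snapshot_YYYYMMDD, return those older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in snapshot_names:
        # Parse the trailing YYYYMMDD suffix into a date.
        day = date.fromisoformat(f"{name[-8:-4]}-{name[-4:-2]}-{name[-2:]}")
        if day < cutoff:
            stale.append(name)
    return stale

names = ["snapshot_20190501", "snapshot_20190510", "snapshot_20190601"]
print(snapshots_to_drop(names, today=date(2019, 6, 5)))
# ['snapshot_20190501']
```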
45
Summary: Clickhouse
Advantages
● Fastest queries
● Very efficient storage
Disadvantages
● For JOINs it needs RAM to hold all JOINed tables
● No native MySQL protocol (only via ProxySQL)
● No standard SQL support (data types, etc.)
46
Summary: MariaDB ColumnStore
Advantages
● Native MySQL protocol - easier to integrate
● Native shared nothing cluster
Disadvantages
● Slower queries
47
Conclusion (1)
Clickhouse and MariaDB ColumnStore can help optimize slow reporting queries
a. Those queries can be slow by design in MySQL
b. MySQL is better at lots of small queries than at a few large ones
c. MySQL will not be able to utilize all hardware resources
48
Conclusion (2)
We can see 5x-10x, up to 60x, performance increase for some queries
○ From 5 minutes to 30 seconds in Clickhouse
49
Conclusion (3)
Both systems can be used together with MySQL
a. Sync data quickly
b. Replicate realtime
50
Thank you!
MariaDB and Clickhouse Percona Live 2019 talk

  • 1. Opensource Column Store Databases: MariaDB ColumnStore vs. ClickHouse Alexander Rubin VirtualHealth
  • 2. 2 About me ● Working with MySQL for 10-15 years ○ Started at MySQL AB 2006 - Sun Microsystems, Oracle (MySQL Consulting) - Percona since 2014 ○ Recently joined Virtual Health (medical records startup)
  • 3. 3 MariaDB ColumnStore, ClickHouse and Storage Formats Caution: 1. This talk is not about specifics of implementation ○ A number of presentations about Clickhouse and MariaDB @ Percona Live 2019 2. This is all about: ○ What? -- what is the problem ○ Why? -- why queries are slow ○ How? -- how to solve 3. Examples are real-world example, medical insurance records ○ (but no actual PII data shown)
  • 4. 4 MySQL = { } Into: MySQL and Slow Queries Simple query - top 10 - clients who visited doctors most often (data from 2017-2019) mysql> SELECT -> client_id, -> min(date) as first_visit, -> max(date) as last_visit, -> count(distinct date) as days_visited, -> count(cv.id) as visits, -> count(distinct cv.service_location_name) as locations -> FROM client_visit cv -> GROUP BY client_id -> ORDER by visits desc -> LIMIT 10; +-----------+-------------+------------+--------------+--------+-----------+ | client_id | first_visit | last_visit | days_visited | visits | locations | +-----------+-------------+------------+--------------+--------+-----------+ | ......... | 2017-08-07 | 2019-05-24 | .. | ... | .. | 10 rows in set (46.24 sec)
  • 5. 5 What exactly is slow? Is 47 seconds slow? … depends on expectations ● Data Science world it is blazing fast ● Realtime report/dashboard - extremely slow … Web based queries - users tends to reload page if it is slow … Leaving MySQL with tons of queries doing the same thing
  • 6. 6 What to do? Some ideas: 1. Use index Luke! 2. Table per report 3. Pre-aggregate - table per group of reports 4. Something else
  • 7. 7 Use index But, it is already using index: id: 1 select_type: SIMPLE table: cv partitions: NULL type: index possible_keys: FK_client_visit key: FK_client_visit key_len: 5 ref: NULL rows: 10483873 filtered: 100.00 Extra: Using temporary; Using filesort 1 row in set, 1 warning (0.00 sec) PRIMARY KEY (`id`), KEY `FK_client_visit_author_id` (`client_id`)
  • 8. 8 Ok, better index: covered index mysql> alter table client_visit add key comb(client_id, date, service_location_name); Query OK, 0 rows affected (38.48 sec) Records: 0 Duplicates: 0 Warnings: 0 table: cv partitions: NULL type: index possible_keys: FK_client_id,comb key: comb key_len: 776 ref: NULL rows: 10483873 filtered: 100.00 Extra: Using index; Using temporary; Using filesort 10 rows in set (12.18 sec) Still slow!
  • 9. 9 Ok, how large is the table? mysql> show table status like 'client_visit'G *************************** 1. row *************************** Name: client_visit Engine: InnoDB Version: 10 Row_format: Dynamic Rows: 10483873 Avg_row_length: 233 Data_length: 2 452 602 880 Index_length: 1 644 773 376 ~= 4G on disk, that is it! 24 columns, including `notes` text `description` text etc...
  • 10. 10 Ok, other options in MySQL? Create table per each report Problems 1. Too many tables 2. Hard to maintain
  • 11. 11 Ok, other options in MySQL? Pre-aggregate in a table: ● group by client_id + avg, sum, … ● group by date + avg,sum Final report will do another aggregation if needed Problems: 1. Some aggregates can’t be re-aggregated 2. Still too many tables 3. Hard to maintain
  • 12. 12 And it was only the beginning… now this: SELECT cv.client_id as client_id, min(date) as first_visit, max(date) as last_visit, count(distinct date) as days_visited, count(distinct cv.id) as visits, count(distinct cp.cpt_code) as procedures, count(distinct cv.service_location_name) as locations, sum(billed_amount) as total_billed, max(billed_amount) as max_price, avg(billed_amount) as avg_price FROM client_visit cv join client_procedure cp on cp.encounter_id = cv.encounter_id join client_procedure_claim cpc on cp.id = cpc.client_procedure_id join client_claim cc on cc.id = cpc.client_claim_id GROUP BY client_id ORDER BY total_billed desc LIMIT 10 Highly normalized schema
  • 13. 13 4 table JOIN, all large tables +-----------+-------------+------------+--------------+--------+------------+-----------+--------------+-----------+-------------+ | client_id | first_visit | last_visit | days_visited | visits | procedures | locations | total_billed | max_price | avg_price | +-----------+-------------+------------+--------------+--------+------------+-----------+--------------+-----------+-------------+ | ....... | 2018-02-14 | 2019-04-22 | 64 | 64 | .. | .. | 200K | 11K | 449.34 | ... 10 rows in set (5 min 18.16 sec)
  • 14. 14 Why MySQL is slow for such queries? 1. Row oriented - even if we retrieve 2 fields it will need to read a row 2. InnoDB organize table by pages (16k page) - will need to read more 3. MySQL will use 1 cpu core per query, not utilizing all cores
  • 15. 15 Why MySQL is slow for such queries? Row-oriented DBMS (MySQL) Column-oriented DBMS https://clickhouse.yandex/docs/en/
  • 16. 16 Column Store Databases MariaDB Columnstore https://mariadb.com/kb/en/library/mariadb-columnstore/ Tips and Tricks with MariaDB ColumnStore Wednesday 5:10 PM - 5:35 PM@ Texas
  • 17. 17 Column Store Databases Yandex Clickhouse https://clickhouse.yandex/ Low Cost Transactional and Analytics With MySQL and Clickhouse, Have Your Cake and Eat It Too! Wednesday 5:40 PM - 6:05 PM Clickhouse Features to Blow your Mind Thursday 11:55 AM - 12:45 PM
  • 18. 18 Column-store tests Testing box 1: ● AWS ec2 instance, c5d.4xlarge ● RAM: 32.0 GiB ● vCPU: 16 ● Disk: NVMe SSD + EBS Testing box 2: ● AWS ec2 instance, c5d.18xlarge ● RAM: 144.0 GiB ● vCPU: 72 ● Disk: NVMe SSD + EBS
  • 19. 19 Is it worth using column store: Q1 MySQL Clickhouse MariaDB Response time (sec) 46.24 0.754 11.43 Speed increase compared to MySQL (times, %) 62x 6248% 4x 404% AWS ec2 instance, c5d.4xlarge
  • 20. 20 Is it worth using column store: Q2 MySQL Clickhouse MariaDB Response time (sec) 5 min 18.16 sec 33.83 sec 1 min 2.16 sec Speed increase compared to MySQL (times, %) 9x 940% 5x 511% AWS ec2 instance, c5d.4xlarge
  • 21. 21 Table sizes on disk MySQL Clickhouse ColumnStore client_visit 5,876,219,904 793,976,832 3,606,462,464 client_procedure 13,841,203,200 2,253,180,928 9,562,865,664 client_procedure_claim 2,466,250,752 292,007,936 335,683,584 client_claim 11,710,496,768 2,400,182,272 6,720,749,568 Total 33,894,170,624 5,739,347,968 20,225,761,280 Smaller compared to MySQL size (x) 5.91 1.68 Compression
  • 22. 22 Exporting from MySQL Usually 3 options 1. ETL to Star Schema 2. ETL to flat de-normalized tables 3. Copy / replicate realtime (as is) I will talk about option 3.
  • 23. 23 Yandex Clickhouse sudo apt-get install clickhouse-client clickhouse-server
  • 24. 24 Clickhouse: export from mysql (schema) https://github.com/Altinity/clickhouse-mysql-data-reader 1. Schema import $ clickhouse-mysql --create-table-sql --src-host=mysql-replica-host --src-user=export --src-password=xxxxxx --src-schemas=main --src-tables=client_condition,client_procedure,client_visit It will choose partition key and sort key, i.e. ENGINE = MergeTree() PARTITION BY toYYYYMM(created_date) ORDER BY client_id Beware: enum is not supported (bug)
  • 25. 25 Clickhouse: export from MySQL (data) 1. Use clickhouse-mysql-data-reader (slower) 2. Use native Clickhouse MySQL connection: INSERT INTO client_procedure_claim SELECT * FROM mysql('host', 'db', 'client_procedure_claim', 'export', 'xxxxx') Ok. 0 rows in set. Elapsed: 17.821 sec. Processed 37.40 million rows, 299.18 MB (2.10 million rows/s., 16.79 MB/s.)
  • 26. 26 Clickhouse: connect using MySQL client https://github.com/sysown/proxysql/wiki/ClickHouse-Support $ wget https://github.com/sysown/proxysql/releases/download/v2.0.4/p roxysql_2.0.4-ubuntu18_amd64.deb $ dpkg -i proxysql_2.0.4-clickhouse-ubuntu18_amd64.deb $ proxysql --clickhouse-server
  • 27. 27 Clickhouse: connect using MySQL client $ mysql -h 127.0.0.1 -P 6032 ... Admin> SELECT * FROM clickhouse_users; Empty set (0.00 sec) Admin> INSERT INTO clickhouse_users VALUES ('clicku','clickp',1,100); Query OK, 1 row affected (0.00 sec) Admin> LOAD CLICKHOUSE USERS TO RUNTIME; Query OK, 0 rows affected (0.00 sec) Admin> SAVE CLICKHOUSE USERS TO DISK; Query OK, 0 rows affected (0.01 sec) https://github.com/sysown/proxysql/wiki/ClickHouse-Support
  • 28. 28 Clickhouse: connect using MySQL client mysql -h 127.0.0.1 -P 6090 -uclicku -pclickp ... Server version: 5.5.30 (ProxySQL ClickHouse Module) mysql> select client_id, count(cv.id) as visits, count(distinct cv.service_location_name) as locations from client_visit cv group by client_id order by visits desc limit 10; +-----------+--------+-----------+ | client_id | visits | locations | +-----------+--------+-----------+ ... 10 rows in set (0.53 sec)
  • 29. 29 ProxySQL to Clickhouse - experimental Some bugs exist, e.g.: select min(date) as first_visit, max(date) as last_visit from client_visit; ERROR 2013 (HY000): Lost connection to MySQL server during query
  • 30. 30 Clickhouse - joining MySQL source CREATE VIEW clients_mysql as select * from mysql('mysql_host', 'db', 'client', 'export', 'xxxx'); select count(*) from clients_mysql; SELECT count(*) FROM clients_mysql ┌─count()─┐ │ 1035284 │ └─────────┘
  • 31. 31 Clickhouse - joining MySQL source SELECT cv.client_id as client_id ... FROM ... INNER JOIN ( SELECT user_id, is_active FROM clients_mysql ) AS c ON cv.client_id = c.user_id WHERE c.is_active = 1 ... 10 rows in set. Elapsed: 36.653 sec. Processed 122.58 million rows, 2.30 GB (3.34 million rows/s., 62.72 MB/s.) CREATE VIEW clients_mysql as select * from mysql('mysql_host', 'db', 'client', 'export', 'xxxx');
  • 32. 32 MariaDB Column Store Architecture
  • 33. 33 MariaDB ColumnStore: export from mysql Schema import - you need to create a custom schema: mysqldump --no-data … then change engine=InnoDB to engine=Columnstore
  • 34. 34 MariaDB ColumnStore: export from mysql Schema import - need to create custom schema mysqldump --no-data does not work out of the box $ mcsmysql test < client_visit.sql ERROR 1069 (42000) at line 25: Too many keys specified; max 0 keys allowed $ mcsmysql test < client_visit.sql ERROR 1075 (42000) at line 25: Incorrect table definition; there can be only one auto column and it must be defined as a key
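One way around these errors is to strip the key definitions and the AUTO_INCREMENT attribute while swapping the engine. A sketch assuming GNU sed (the database and table names are the ones from the slides; a real dump may still need hand-editing for corner cases):

```shell
# Convert an InnoDB dump into ColumnStore-friendly DDL (sketch, GNU sed):
# drop key/constraint lines (ColumnStore: "max 0 keys allowed"), drop
# AUTO_INCREMENT (it would have to be defined as a key), swap the engine,
# then remove the trailing comma left before the closing paren.
mysqldump --no-data main client_visit |
  sed -e '/^[[:space:]]*\(PRIMARY \|UNIQUE \)\{0,1\}KEY /d' \
      -e '/^[[:space:]]*CONSTRAINT /d' \
      -e 's/ AUTO_INCREMENT\(=[0-9]*\)\{0,1\}//g' \
      -e 's/ENGINE=InnoDB/ENGINE=Columnstore/' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/,\n)/\n)/g' \
  > client_visit_cs.sql
```

The result can then be fed to `mcsmysql test < client_visit_cs.sql` without the two errors above.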
  • 35. 35 MariaDB ColumnStore: export from mysql (data) Fastest way: 1. mysql > select into outfile … 2. $ cpimport ... Easiest way: 1. Import into InnoDB locally (ColumnStore includes a MySQL server) 2. Run “insert into columnstore_table select * from innodb_table”
  • 36. 36 MariaDB ColumnStore: Joining MySQL source 1. Export using InnoDB storage engine 2. JOIN across engines SELECT ... FROM client_visit cv join client_procedure cp on cp.encounter_id = cv.encounter_id join client_procedure_claim cpc on cp.id = cpc.client_procedure_id join client_claim cc on cc.id = cpc.client_claim_id join client_innodb c on cv.client_id = c.user_id WHERE c.is_active = 1 GROUP BY client_id ORDER BY total_billed desc limit 10;
  • 37. 37 Clickhouse: replication from MySQL clickhouse-mysql --src-server-id=1 --src-resume --src-wait --nice-pause=1 --src-host=10.0.0.142 --src-user=chreader --src-password=pass --src-tables=wiki.pageviews --dst-host=127.0.0.1 --pump-data --csvpool
  • 38. 38 ColumnStore: replication from MySQL Coming soon https://jira.mariadb.org/browse/MCOL-498 https://jira.mariadb.org/browse/MCOL-593
  • 39. 39 Update / delete - MariaDB ColumnStore DMLs are usually the slowest in Columnar Stores Single row: MariaDB [vh]> update client set is_active = 0 where user_id = 3216031; Query OK, 1 row affected (0.156 sec) Rows matched: 0 Changed: 0 Warnings: 0
  • 40. 40 Update / delete - Clickhouse Implemented as “Alter table update” (mutations) Asynchronous https://clickhouse.yandex/docs/en/query_language/alter/#alter-mutations Single row: :) ALTER TABLE client UPDATE is_active = 0 WHERE user_id = 3216031 Ok. 0 rows in set. Elapsed: 0.006 sec.
  • 41. 41 When NOT to use Column Store db (1) Single row (full row) by id: select * from client_claim where id = <num>; MySQL: 1 row in set (0.00 sec) Columnstore: 1 row in set (0.039 sec) Clickhouse: 1 rows in set. Elapsed: 0.023 sec.
  • 42. 42 When NOT to use Column Store db (2) When MySQL can use an index + LIMIT: SELECT * FROM client_claim WHERE place_of_service = 'OFFICE' ORDER BY received_date DESC LIMIT 10 MySQL: 10 rows in set (0.00 sec) -- index scan Columnstore: 10 rows in set (20.838 sec) -- after fixing ERROR 1815 (HY000): Internal error: IDB-2015: Sorting length exceeded. Session variable max_length_for_sort_data needs to be set higher. Clickhouse: 10 rows in set. Elapsed: 1.148 sec
  • 43. 43 When NOT to use Column Store db (2) mysql> explain SELECT * FROM client_claim WHERE place_of_service = 'OFFICE' ORDER BY received_date DESC LIMIT 10\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: client_claim partitions: NULL type: ref possible_keys: place_of_service_received_date key: place_of_service_received_date key_len: 768 ref: const rows: 13072594 filtered: 100.00 Extra: Using where
  • 44. 44 Other idea: daily online snapshots ● The columnar store DB is significantly smaller ● We can load daily snapshots ○ i.e. store 30 days of data that can be queried (without restoring from backup)
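A minimal rotation sketch for the snapshot idea. Table names and the 30-day window are illustrative; `date -d` is GNU date, and `CREATE TABLE new AS old` is the ClickHouse-style structure copy. The script only generates the statements; you would feed them to the server with clickhouse-client (or adapt them for mcsmysql):

```shell
# Daily snapshot rotation sketch: derive a dated table name for today's
# snapshot and the name of the snapshot that falls outside a 30-day window.
SNAPSHOT="client_visit_$(date +%Y%m%d)"
EXPIRED="client_visit_$(date -d '30 days ago' +%Y%m%d)"
echo "CREATE TABLE $SNAPSHOT AS client_visit;"   # copy structure, then INSERT ... SELECT
echo "DROP TABLE IF EXISTS $EXPIRED;"
```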
  • 45. 45 Summary: Clickhouse Advantages ● Fastest queries ● Very efficient storage Disadvantages ● For JOINs, needs enough RAM to hold all joined tables ● No native MySQL protocol (only via ProxySQL) ● Limited standard SQL support (data types, etc.)
  • 46. 46 Summary: MariaDB ColumnStore Advantages ● Native MySQL protocol - easier to integrate ● Native shared nothing cluster Disadvantages ● Slower queries
  • 47. 47 Conclusion (1) Clickhouse and MariaDB ColumnStore can help optimize slow reporting queries a. Those queries can be slow by design in MySQL b. MySQL is better suited to lots of small queries than to a few large ones c. MySQL will not be able to utilize all hardware resources
  • 48. 48 Conclusion (2) We can see a 5x-10x, and up to 60x, performance increase for some queries ○ From 5 minutes to 30 seconds in Clickhouse
  • 49. 49 Conclusion (3) Both systems can be used together with MySQL a. Sync data quickly b. Replicate realtime