MariaDB ColumnStore
Agenda
• Why use a column-based database?
– And why not?
• MariaDB ColumnStore architecture
• Using and Sizing MariaDB ColumnStore
• Getting data into MariaDB ColumnStore
Why a Columnar Database
Data organization
• Row-by-row
– Good for row based processing
– Use indexing for lookups
• Typically B-Tree indexing
– Indexing is difficult for data that is not well distributed
– Indexing slows down DML
• Column based
– Good for dataset based processing
– Needs no indexes
– Data is typically organized in chunks
– Lends itself to high level of compression
– Metadata used for filtering and processing
– Large data volumes are much less of an issue
– Loading data is consistently fast, independent of data size
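The difference between the two layouts can be sketched in a few lines of Python. This is a toy illustration only, not how any storage engine is implemented; the records, the column names and the `to_columns` helper are made up for the example.

```python
# Toy illustration only: the same three records in a row layout and a column layout.
rows = [
    {"id": 1, "city": "Oslo", "amount": 10},
    {"id": 2, "city": "Rome", "amount": 25},
    {"id": 3, "city": "Oslo", "amount": 40},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited even though only one field is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: the aggregate reads one list and never touches the other columns.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals are the same; the point is that the column layout reaches the answer by reading one column, which is why dataset-style processing suits it.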
Workload – Query Vision/Scope
• OLTP/NoSQL workloads
• OLAP/analytic/reporting workloads
Suited for reporting or analysis of millions to billions of rows from data sets containing millions to trillions of rows.
[Chart: query scope in rows (1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000) against data size (10-100GB; 100-1,000GB; 1-10TB)]
Columnar General Best Practices
• Not suited for OLTP
• Micro-batch loading allows for near real-time behavior
• Infrequently used columns do not impact other queries
• Columnar storage is suitable for sparse columns (nulls compress nicely)
Data Modeling Best Practices
• Star-schema optimizations are generally a good idea
• Conservative data typing is very important
– Especially around the fixed-length vs. dictionary boundary (8 bytes)
– IP Address vs. IP Number
• Break down compound fields into individual fields:
– Trivializes searching for sub-fields
– Can avoid dictionary overhead
– Cost to re-assemble is generally small
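The "IP Address vs. IP Number" point can be illustrated with Python's standard `ipaddress` module. This is a sketch, not part of the original deck; the helper names `ip_to_number` and `number_to_ip` are made up. A dotted IPv4 string can be up to 15 characters, which is past the 8-byte boundary, while the same address as a number needs at most 32 bits.

```python
import ipaddress


def ip_to_number(dotted: str) -> int:
    """Convert a dotted IPv4 string to its integer form (hypothetical helper)."""
    return int(ipaddress.IPv4Address(dotted))


def number_to_ip(number: int) -> str:
    """Re-assemble the dotted string from the integer form (hypothetical helper)."""
    return str(ipaddress.IPv4Address(number))


address = "192.168.1.10"
number = ip_to_number(address)

# The string needs one byte per character; the number fits in 32 bits.
print(len(address), number, number.bit_length())
print(number_to_ip(number))
```

Which integer column type to pick for the number is a modeling decision the slide does not spell out; the sketch only shows that the conversion is cheap in both directions.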
MariaDB ColumnStore
The basics
MariaDB ColumnStore
High-performance columnar storage engine that supports a wide variety of analytical use cases with SQL in highly scalable distributed environments
• Faster, more efficient queries: parallel query processing for distributed environments
• Easier enterprise analytics: single SQL interface for both OLTP and analytics
• Better price performance: power of SQL and freedom of open source for big data analytics
MariaDB ColumnStore
• GPLv2 open source
• Columnar, massively parallel MariaDB storage engine
• Scalable, high-performance analytics platform
• Built-in redundancy and high availability
• Runs on premises and on AWS cloud
• Full SQL syntax and capabilities regardless of platform
[Diagram labels: Big Data Sources; ELT Tools; MariaDB ColumnStore (Node 1, Node 2, Node 3 ... Node N on Local / AWS® / GlusterFS®); BI Tools; Analytics; Insight]
SQL Features
Source: InfiniDB SQL Syntax Guide
• SELECT query
• Aggregation
• Windowing functions
• Cross-engine joins
• Disk-based joins
• UDF
• DML
• DDL
MariaDB ColumnStore Architecture
• User connections
• User Module (1 ... n): MariaDB front end and query engine; processes SQL requests
• Performance Module (1 ... n): distributed processing engine
• Data storage
User Module at a Glance
• MariaDB
– Functionality: hosts MariaDB; connection management; SQL parsing & optimization
– Value: familiar DBMS interface; leverages existing partner integrations; delivers rich SQL syntax support
• Extent Map
– Functionality: abstracts physical and logical storage; metadata store
– Value: enables partition elimination
• ExeMgr
– Functionality: work distribution; final results management and aggregation
– Value: multi-threaded to take advantage of multi-core HW platforms
Performance Module at a Glance
• PrimProc
– Functionality: scale-out cache management; distributed scan, filter, join and aggregation operations; resource management
– Value: independent scalability and tunable performance; multi-threaded to take advantage of multi-core HW platforms
• Data
– Functionality: high-speed bulk load; transactional DML and DDL; online schema extensions
– Value: non-blocking read enabled; multi-threaded to take advantage of multi-core HW platforms
MariaDB ColumnStore
MariaDB Functions
• MariaDB Client
• MariaDB Connectivity (JDBC, ODBC)
• MariaDB Security
• Initial SQL Statement Parsing
• Initial SQL Optimization < Custom Handler Class >
• Execute final sort and final limit
• Display final results
ExeMgr Functions
• SQL Optimization
• Distribute work for scan, filter, join, functions, expressions, group by, aggregation, etc. to all available Performance Modules to be run in parallel
• Collect the results returned by the Performance Modules
• Return the final results to MariaDB for display
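The distribute-then-collect pattern described for ExeMgr can be sketched as a scatter/gather in Python. This is a conceptual model only: the function names, the in-memory partitions and the use of a thread pool are assumptions made for illustration, not ColumnStore internals.

```python
from concurrent.futures import ThreadPoolExecutor


def performance_module_scan(partition, threshold):
    """Stand-in for one Performance Module: filter locally, return a partial aggregate."""
    matching = [value for value in partition if value > threshold]
    return {"count": len(matching), "sum": sum(matching)}


def exe_mgr_query(partitions, threshold):
    """Stand-in for ExeMgr: fan the work out in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(
            pool.map(lambda part: performance_module_scan(part, threshold), partitions)
        )
    return {
        "count": sum(partial["count"] for partial in partials),
        "sum": sum(partial["sum"] for partial in partials),
    }


# Three hypothetical data partitions, one per worker.
partitions = [[5, 12, 30], [7, 50], [100, 1, 2, 60]]
print(exe_mgr_query(partitions, threshold=10))
```

Each worker returns only a small partial aggregate, so the coordinating step merges a handful of numbers rather than the underlying rows.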
[Diagram: user connections reach User Modules 1 ... n (MariaDB front end and ColumnStore ExeMgr), which process SQL requests; Performance Modules 1 ... n, the distributed processing engine, execute the queries against data storage.]
Compression with Data Storage Layer
• Logical layer: Blocks (8KB); Extent 1 (8MB~64MB, 8 million rows)
• Physical layer: Segment File 1 (maps to an extent); compression chunks
Data Types
At the physical layer, all columns are stored as:
• 1-byte field with 8192 values per 8k block
• 2-byte field with 4096 values per 8k block
• 4-byte field with 2048 values per 8k block
• 8-byte field with 1024 values per 8k block
• Dictionary structure made up of 2 files/extents with:
– An 8-byte fixed-length token (pointer)
– A variable-length value stored at the location identified by the pointer
Data Types: Examples
At the physical layer, all columns are stored as:
• 1-byte field, e.g. TinyInt, Char(1)
• 2-byte field, e.g. SmallInt, Char(2)
• 4-byte field, e.g. Int, Char(3), Char(4), date, float
• 8-byte field, e.g. BigInt, Char(5-8), datetime, real/double
• Dictionary, e.g. Varchar(8) or larger, Char(9) or larger
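The values-per-block figures and the CHAR examples above follow from simple arithmetic, sketched here in Python. The helper names are made up, and `char_storage` encodes only the CHAR examples listed on the slides; it is not a full type-mapping rule.

```python
BLOCK_BYTES = 8 * 1024  # the 8k block size given on the slide


def values_per_block(field_bytes: int) -> int:
    """Number of fixed-width values that fit in one 8k block."""
    return BLOCK_BYTES // field_bytes


def char_storage(length: int) -> str:
    """Map CHAR(length) to the storage class listed in the slide's examples."""
    if length < 1:
        raise ValueError("CHAR length must be at least 1")
    # (field width in bytes, largest CHAR length stored in that width)
    for width, max_length in ((1, 1), (2, 2), (4, 4), (8, 8)):
        if length <= max_length:
            return f"{width}-byte field"
    return "dictionary"  # CHAR(9) or larger, per the slide


for width in (1, 2, 4, 8):
    print(width, values_per_block(width))

print(char_storage(4), char_storage(8), char_storage(9))
```

The jump from CHAR(8) to CHAR(9) is the fixed-length vs. dictionary boundary that the data modeling slide asks designers to watch.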
Using MariaDB ColumnStore
Just like MariaDB Server
MariaDB ColumnStore
MariaDB ColumnStore uses standard “Engine=columnstore” syntax
mysql> use tpcds_djoshi
Database changed
mysql> select count(*) from store_sales;
+----------+
| count(*) |
+----------+
|  2880404 |
+----------+
1 row in set (1.68 sec)
mysql> describe warehouse;
+-------------------+--------------+------+-----+---------+-------+
| Field             | Type         | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| w_warehouse_sk    | int(11)      | NO   |     | NULL    |       |
| w_warehouse_id    | char(16)     | NO   |     | NULL    |       |
| w_warehouse_name  | varchar(20)  | YES  |     | NULL    |       |
| w_warehouse_sq_ft | int(11)      | YES  |     | NULL    |       |
| w_street_number   | char(10)     | YES  |     | NULL    |       |
| w_street_name     | varchar(60)  | YES  |     | NULL    |       |
| w_street_type     | char(15)     | YES  |     | NULL    |       |
| w_suite_number    | char(10)     | YES  |     | NULL    |       |
| w_city            | varchar(60)  | YES  |     | NULL    |       |
| w_county          | varchar(30)  | YES  |     | NULL    |       |
| w_state           | char(2)      | YES  |     | NULL    |       |
| w_zip             | char(10)     | YES  |     | NULL    |       |
| w_country         | varchar(20)  | YES  |     | NULL    |       |
| w_gmt_offset      | decimal(5,2) | YES  |     | NULL    |       |
+-------------------+--------------+------+-----+---------+-------+
14 rows in set (0.05 sec)
CREATE TABLE `game_warehouse`.`dim_title` (
`id` INT,
`name` VARCHAR(45),
`publisher` VARCHAR(45),
`release_date` DATE,
`language` INT,
`platform_name` VARCHAR(45),
`version` VARCHAR(45)
) ENGINE=columnstore;
Uses custom scalable columnar architecture
MariaDB Front End
Standard ANSI SQL
Sizing
Minimum spec
• UM: 4 core, 32G RAM
• PM: 4 core, 16G RAM
Typical server spec
• UM: 8 core, 264G RAM
• PM: 8 core, 64G RAM
Data storage
• External data volumes
– Maximum 2 data volumes per IO channel per PM node server
– Up to 2TB on disk per data volume ≈ max 4TB per PM node
• Local disk
– Up to 2TB on disk per PM node server
Detailed sizing guide: based on data size and workload
Sizing - Example
• MariaDB ColumnStore: 60TB uncompressed data = 6TB compressed data at 10x compression
• 2 UMs: 8 core, 512G (based on workload)
• 6TB compressed = 3 data volumes (at 2TB per volume)
– With 1 data volume per PM node: 3 PMs
• Data growth: 2TB per month; data retention: 2 years
– Plan for 2TB × 24 = 48TB additional
– 48TB = 4.8TB compressed ≈ 3 data volumes (at 2TB per volume)
– With 1 data volume per PM node: 3 additional PMs
• Total: 6 PMs, 2 UMs
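The arithmetic in this example can be reproduced with a short Python sketch. The function name and its parameters are made up; the defaults (10x compression, 2TB per data volume, 1 volume per PM node) are the figures used on the slide, not general recommendations.

```python
import math


def pm_nodes_needed(uncompressed_tb, compression_ratio=10, volume_tb=2, volumes_per_pm=1):
    """Estimate PM nodes from uncompressed data size, using the slide's assumptions."""
    compressed_tb = uncompressed_tb / compression_ratio
    volumes = math.ceil(compressed_tb / volume_tb)  # partial volumes round up
    return math.ceil(volumes / volumes_per_pm)


initial_pms = pm_nodes_needed(60)        # 60TB -> 6TB compressed -> 3 volumes
growth_tb = 2 * 24                       # 2TB per month over 2 years of retention
growth_pms = pm_nodes_needed(growth_tb)  # 48TB -> 4.8TB compressed -> 3 volumes

print(initial_pms, growth_pms, initial_pms + growth_pms)
```

The 4.8TB of compressed growth needs 2.4 volumes, which rounds up to 3; that rounding is where the "≈ 3 data volumes" on the slide comes from.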
Loading data
Data Load and Extents (local load)
Each extent is shown as 8 million rows.
• 1st data load: CSV file, data range 1 ~ 200, 16 million rows
– Extent 1: min 1, max 200
– Extent 2: min 1, max 200
• 2nd data load: new CSV file, data range 150 ~ 210, 16 million + 8 rows
– Extent 3: min 150, max 210
– Extent 4: min 150, max 210
– Extent 5: min 150, max 210
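The per-extent min and max values are what make extent (partition) elimination possible. The Python sketch below models the five extents from this slide and shows which ones a range predicate would still have to scan. The `Extent` class and the `extents_to_scan` function are made up for illustration and are not ColumnStore APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    number: int
    min_value: int
    max_value: int


# The five extents from the slide, with their min/max metadata.
EXTENTS = [
    Extent(1, 1, 200),
    Extent(2, 1, 200),
    Extent(3, 150, 210),
    Extent(4, 150, 210),
    Extent(5, 150, 210),
]


def extents_to_scan(extents, low, high):
    """Keep only extents whose [min, max] range overlaps the predicate range [low, high]."""
    return [e.number for e in extents if e.max_value >= low and e.min_value <= high]


print(extents_to_scan(EXTENTS, 205, 205))  # value above the first load's max
print(extents_to_scan(EXTENTS, 1, 100))    # values below the second load's min
print(extents_to_scan(EXTENTS, 160, 180))  # overlaps both loads
```

When successive loads cover mostly distinct ranges, as date-ordered loads tend to, more extents can be skipped from metadata alone.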
Bulk Data Load: cpimport
• Fastest way to load data into MariaDB ColumnStore
• Load data from a CSV file
cpimport dbName tblName [loadFile]
• Load data from standard input
mysql -e 'select * from source_table;' -N db2 | cpimport destination_db destination_tbl -s '\t'
• Load data from a binary source file
cpimport -I1 mydb mytable sourcefile.bin
• Multiple tables can be loaded in parallel by launching multiple jobs
• Read queries continue without being blocked
• A successful cpimport is auto-committed
• In case of errors, the entire load is rolled back
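When cpimport is driven from a script, the command line can be assembled programmatically. The Python sketch below only builds the argument list and does not run anything. The helper name, database, table and file path are hypothetical, and it uses only the positional arguments and the `-s` delimiter option that appear on this slide.

```python
import shlex


def build_cpimport_argv(db, table, load_file=None, delimiter=None):
    """Build a cpimport argument list (hypothetical helper; nothing is executed)."""
    argv = ["cpimport", db, table]
    if load_file is not None:
        argv.append(load_file)       # omitted when data arrives on standard input
    if delimiter is not None:
        argv += ["-s", delimiter]    # field delimiter, as in the slide's -s option
    return argv


# Load from a file with a comma delimiter (names and path are made up).
print(shlex.join(build_cpimport_argv("mydb", "mytable", "/tmp/mytable.csv", delimiter=",")))

# Read tab-delimited rows from standard input, as in the mysql pipe example.
print(build_cpimport_argv("destination_db", "destination_tbl", delimiter="\t"))
```

Passing the list to `subprocess.run` instead of a shell string avoids quoting problems with delimiters such as tab.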
Bulk Data Load: cpimport mode 1
Single file central input: data source at UM
cpimport -m1 mytest mytable mytable.tbl
[Diagram: cpimport and the source sit on the UM node (name node); three PM nodes (data nodes) are shown below it.]
Bulk Data Load: cpimport mode 2
Distributed input: data source at PMs
Partitioned load file on each PM
cpimport -m2 testdb mytable /home/mydata/mytable.tbl
[Diagram: cpimport sits on the UM node (name node); each of the three PM nodes (data nodes) has its own source.]
Bulk Data Load: cpimport mode 3
Distributed input: data source at PMs
Bulk load command at one or more PMs
cpimport -m3 testdb mytable /home/mydata/mytable.tbl
[Diagram: each of the three PM nodes (data nodes) has its own source and its own cpimport; the UM node (name node) is shown above them.]
Bulk Data Load: LOAD DATA INFILE
• Traditional way of importing data into any MariaDB storage engine table
• Up to 2 times slower than cpimport for large imports
• The operation can be rolled back whether it succeeds or fails
mysql> load data infile '/tmp/outfile1.txt' into table destinationTable;
Query OK, 9765625 rows affected (2 min 20.01 sec)
Records: 9765625 Deleted: 0 Skipped: 0 Warnings: 0
Thank you
