Big Data Analytics
with MariaDB
ColumnStore
Andrew Hutchings (LinuxJedi)
Senior Software Engineer
My Background
• MySQL / Drizzle
– Sun/Oracle: Senior Support Engineer, NDB & C/C++ API specialist (dev/support)
– Rackspace: Core Engineer (Drizzle)
– SkySQL: Senior Sustaining Engineer (Drizzle & C/C++ connectors)
– Co-Author of MySQL 5.1 Plugins Development
• OpenStack (at HP Cloud)
– Core CI Engineer
– Lead Engineer for LBaaS
– Principal Engineer for Advanced Technology Group
• NGINX
– Senior Developer Advocate & Technical Product Manager
• MariaDB
– Lead Engineer for MariaDB ColumnStore
History of MariaDB ColumnStore
• March 2010 - Calpont launches InfiniDB
• September 2014 - Calpont (now itself called InfiniDB) closes down
– MariaDB (then SkySQL) supports InfiniDB customers
• April 2016 - MariaDB announces development of MariaDB ColumnStore
• August 2016 - I joined MariaDB and jumped straight into ColumnStore
• December 2016 - MariaDB ColumnStore 1.0 GA
– InfiniDB + MariaDB 10.1 + Many fixes and improvements
MariaDB ColumnStore
• GPLv2 Open Source
• Columnar, massively parallel MariaDB storage engine
• Scalable, high-performance analytics platform
• Built-in redundancy and high availability
• Runs on premises or in the AWS cloud
• Full SQL syntax and capabilities regardless of platform
[Diagram: big data sources flow through ELT tools into MariaDB ColumnStore (Node 1 … Node N on local, AWS, or GlusterFS storage), and BI tools query it for analytics insight]
MariaDB ColumnStore
A high-performance columnar storage engine that supports a wide variety of
analytical use cases with SQL in highly scalable, distributed environments
• Parallel query processing for distributed environments: faster, more efficient queries
• A single SQL interface for OLTP and analytics: easier enterprise analytics
• The power of SQL and the freedom of open source applied to big data analytics: better price/performance
Workload – Query Vision/Scope
[Chart: rows per query (1 to 10,000,000,000) vs. data set size (10 GB to 10 TB); OLTP/NoSQL workloads sit at the low end, OLAP/analytic/reporting workloads at the high end]
Suited for reporting or analysis of millions to billions of rows from data sets containing millions to trillions of rows.
Row-oriented vs. Column-oriented format
• Row oriented: rows are stored sequentially in a file; a scan reads every record, row by row
• Column oriented: each column is stored in a separate file; a scan reads only the relevant columns
Example (row-oriented layout):
ID Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
The same data in column-oriented layout (one file per column):
ID: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F
SELECT Fname FROM People WHERE State = 'NY'
Analytics
• In-database distributed analytics with
complex join, aggregation, window functions
• Cross-engine joins allow queries to reference both ColumnStore and
non-ColumnStore tables (see the sketch below).
• Extensible user-defined functions allow specialized logic to be
executed at the PM level.
• Standard MariaDB Connectors provide for out
of the box integration with:
– BI Tools (Tableau, Pentaho, ..)
– Custom Application Code (Java, Scala, C#,
Python, ..)
– Data Processing Frameworks (R, Spark,
Numpy, ..)
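A minimal sketch of what such a cross-engine query could look like; the web_item_sales fact table appears later in this deck, while the items dimension table and its columns are hypothetical:
-- Hypothetical cross-engine join: a ColumnStore fact table joined to an
-- InnoDB dimension table on the same server
SELECT i.name, SUM(s.revenue) AS total_revenue
FROM web_item_sales s                    -- ENGINE=columnstore
JOIN items i ON i.item_id = s.item_id    -- ENGINE=InnoDB
GROUP BY i.name;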
Item ID  Server_date  Revenue    Running Average
1        2017-02-01   20,000.00  20,000.00
1        2017-02-02    5,001.00  12,500.50
2        2017-02-01   15,000.00  15,000.00
2        2017-02-04   34,029.00  34,029.00
2        2017-02-05    7,138.00  20,583.50
3        2017-02-01   17,250.00  17,250.00
3        2017-02-03   25,010.00  25,010.00
3        2017-02-04   21,034.00  23,022.00
3        2017-02-05    4,120.00  12,577.00
Window Function Example: Daily Running Average Revenue by Item
SELECT item_id, server_date, revenue,
AVG(revenue) OVER
(PARTITION BY item_id ORDER BY server_date
RANGE INTERVAL 1 DAY PRECEDING) running_avg
FROM web_item_sales
[Diagram: BI tools, custom big data applications, and data processing frameworks all connect over JDBC / ODBC / MariaDB connectors]
Enterprise Grade
• Enterprise Grade Security
– SSL, role based access, auditability.
– MaxScale database firewall
• Deployment Flexibility
– Run on commodity Linux servers on premises or in the cloud.
– AWS-optimized AMI image.
– Add horizontal capacity as you grow.
• High Availability
– Automatic UM failover
– Automatic PM failover with distributed data attachment across all PMs
in SAN and EBS environments
[Diagram: a MaxScale load balancer in front of User Modules (UM) and Performance Modules (PM) over shared-nothing distributed data storage, compressed by default]
MariaDB ColumnStore Architecture
[Diagram: applications (BI tools, SQL clients, custom big data apps) connect to the MariaDB SQL front end, which drives the distributed query engine over columnar distributed data storage on local storage | SAN | NAS | EBS | GlusterFS]
MariaDB ColumnStore
[Diagram: SQL arrives at the User Module (UM); column primitives flow down to the Performance Modules (PM) over shared-nothing distributed data storage, and intermediate results flow back up to the UM]
• Query received and parsed by the MariaDB front end on the UM
• The storage engine plugin breaks the query down into primitive
operations and distributes them across the PMs
• Primitives are processed on the PMs
• One thread works on a range of rows
• Column restrictions and projections are executed
• Group by/aggregation is executed against local data
• Each PM works on primitives in parallel, fully distributed
• Each primitive executes in a fraction of a second
• Intermediate results are returned to the UM
MariaDB ColumnStore
MariaDB ColumnStore uses the standard "ENGINE=columnstore" syntax
mysql> use tpcds_djoshi
Database changed
mysql> select count(*) from store_sales;
+----------+
| count(*) |
+----------+
| 2880404 |
+----------+
1 row in set (1.68 sec)
mysql> describe warehouse;
+-------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| w_warehouse_sk | int(11) | NO | | NULL | |
| w_warehouse_id | char(16) | NO | | NULL | |
| w_warehouse_name | varchar(20) | YES | | NULL | |
| w_warehouse_sq_ft | int(11) | YES | | NULL | |
| w_street_number | char(10) | YES | | NULL | |
| w_street_name | varchar(60) | YES | | NULL | |
| w_street_type | char(15) | YES | | NULL | |
| w_suite_number | char(10) | YES | | NULL | |
| w_city | varchar(60) | YES | | NULL | |
| w_county | varchar(30) | YES | | NULL | |
| w_state | char(2) | YES | | NULL | |
| w_zip | char(10) | YES | | NULL | |
| w_country | varchar(20) | YES | | NULL | |
| w_gmt_offset | decimal(5,2) | YES | | NULL | |
+-------------------+--------------+------+-----+---------+-------+
14 rows in set (0.05 sec)
CREATE TABLE `game_warehouse`.`dim_title` (
`id` INT,
`name` VARCHAR(45),
`publisher` VARCHAR(45),
`release_date` DATE,
`language` INT,
`platform_name` VARCHAR(45),
`version` VARCHAR(45)
) ENGINE=columnstore;
Uses a custom, scalable columnar architecture
ColumnStore Modules
• User Module (UM)
–MariaDB Storage Engine Plugin
–ExeMgr
–DMLProc, DDLProc
–cpimport
• Performance Module (PM)
–PrimProc
–WriteEngine
–ProcMgr / ProcMon
–cpimport
Compression with Data Storage Layer
• Logical layer: 8 KB blocks grouped into extents (8 MB–64 MB, ~8 million rows each)
• Physical layer: each extent maps to a segment file, stored on disk as compression chunks
Extent Map
The key meta-structure that powers MariaDB ColumnStore’s performance:
• A catalog of all extents, including the minimum and maximum values for a column’s data within each extent
• The master copy of the Extent Map lives on the primary PM node
• On system startup it is copied to all other UM and PM nodes for disaster recovery and failover purposes
• The Extent Map is resident in memory for quick access on all nodes
• As extents are modified, updates are broadcast to all participating nodes
• Stores about 64 bytes for each 8–64 MB on disk
Extent Map
When performing queries:
• Only the extents for columns referenced in join and filter conditions are considered
• The minimum and maximum values recorded for those extents are compared against the join and filter predicates to eliminate extents
• Multiple columns can be used together for partition elimination
• Transitive properties apply: a filter on a dimension column (a date, for example) can allow partition elimination on the fact table (see the sketch below)
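An illustrative sketch of that transitive case; the fact_sales and dim_date tables and their columns are hypothetical, not from this deck:
-- The date filter on dim_date narrows the matching date_key values, and the
-- per-extent min/max on fact_sales.date_key lets ColumnStore skip extents
-- that cannot contain matching rows.
SELECT d.calendar_month, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
WHERE d.calendar_date BETWEEN '2016-01-01' AND '2016-01-31'
GROUP BY d.calendar_month;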
Data Types
At the physical layer, all columns are stored as one of:
• 1-byte field, 8192 values per 8 KB block (examples: TINYINT, CHAR(1))
• 2-byte field, 4096 values per 8 KB block (examples: SMALLINT, CHAR(2))
• 4-byte field, 2048 values per 8 KB block (examples: INT, CHAR(3), CHAR(4), DATE, FLOAT)
• 8-byte field, 1024 values per 8 KB block (examples: BIGINT, CHAR(5-8), DATETIME, DOUBLE)
• Dictionary structure, made up of 2 files/extents:
  – an 8-byte fixed-length token (pointer)
  – a variable-length value stored at the location identified by the pointer
  (examples: VARCHAR(8) or larger, CHAR(9) or larger; see the DDL sketch below)
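A minimal DDL sketch tying this back to table design; the fact_play_events table and its columns are illustrative, not from this deck, and the comments note the expected physical storage under the rules above:
CREATE TABLE game_warehouse.fact_play_events (
  player_id     BIGINT,        -- 8-byte fixed-length field
  event_date    DATE,          -- 4-byte fixed-length field
  level_reached SMALLINT,      -- 2-byte fixed-length field
  region_code   CHAR(2),       -- 2-byte fixed-length field
  device_model  VARCHAR(45)    -- dictionary: 8-byte token + variable-length value
) ENGINE=columnstore;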
Sizing
Minimum spec:
• UM: 4 cores, 32 GB RAM
• PM: 4 cores, 16 GB RAM
Typical server spec:
• UM: 8 cores, 64 GB RAM
• PM: 8 cores, 64 GB RAM
Data storage:
• External data volumes: maximum of 2 data volumes per I/O channel per PM node, up to 2 TB of disk per data volume ≈ max 4 TB per PM node
• Local disk: up to 2 TB of disk per PM node
A detailed sizing guide is available, based on data size and workload.
Sizing - Example
• 60 TB of uncompressed data ≈ 6 TB compressed at 10x compression
• 2 UMs, 8 cores, 512 GB RAM (based on workload)
• 6 TB compressed = 3 data volumes (at 2 TB per volume); with 1 data volume per PM node, that is 3 PMs
• Data growth of 2 TB per month with 2 years of data retention:
  - Plan for 2 TB x 24 = 48 TB additional uncompressed data
  - 48 TB ≈ 4.8 TB compressed ≈ 3 data volumes (at 2 TB per volume); with 1 data volume per PM node, 3 additional PMs
• Total: 6 PMs, 2 UMs (see the worked arithmetic below)
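The same arithmetic written out as a quick SQL calculation; the 10x compression ratio, 2 TB per volume, and 1 volume per PM figures are assumptions carried over from the example above:
-- Worked sizing arithmetic for the example above (illustrative figures only)
SELECT
  60 / 10                 AS initial_compressed_tb,   -- 60 TB raw -> 6 TB compressed
  CEIL(60 / 10 / 2)       AS initial_pm_nodes,        -- 3 volumes => 3 PMs
  (2 * 24) / 10           AS growth_compressed_tb,    -- 48 TB raw -> 4.8 TB compressed
  CEIL((2 * 24) / 10 / 2) AS additional_pm_nodes;     -- 3 more volumes => 3 more PMs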
Analytics with MariaDB ColumnStore
• SQL features: aggregation, windowing functions, UDFs
• Tuning commands
• ETL
Windowing Functions
• Aggregate over a series of related rows
• Simplified functions for complex statistical analytics over a sliding window per row:
  - Cumulative, moving or centered aggregates
  - Simple statistical functions like rank, max, min, average, median
  - More complex functions such as distribution, percentile, lag, lead
  - Without running complex sub-queries
Supported functions: MAX, MIN, COUNT, SUM, AVG, VARIANCE, VAR_POP, VAR_SAMP, STD, STDDEV, STDDEV_POP, STDDEV_SAMP, ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, NTH_VALUE, FIRST_VALUE, LAST_VALUE, CUME_DIST, LAG, LEAD, NTILE, PERCENTILE_CONT, PERCENTILE_DISC, MEDIAN
Source: InfiniDB SQL Syntax Guide
Window Function Example: Top N Visitors for Each Month
[Screenshot: totals for each visitor by month, ranked per month as time_rank; Top 1: time_rank = 1, Top 2: time_rank <= 2, Top N: time_rank <= N]
A sketch of such a query follows.
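A minimal sketch along these lines; the slide showed a screenshot rather than SQL, so the monthly_visits table and its columns here are hypothetical:
-- Rank visitors within each month by their monthly total, then keep the top N
SELECT visit_month, visitor, total_visits
FROM (
  SELECT visit_month, visitor, SUM(visits) AS total_visits,
         RANK() OVER (PARTITION BY visit_month
                      ORDER BY SUM(visits) DESC) AS time_rank
  FROM monthly_visits
  GROUP BY visit_month, visitor
) ranked
WHERE time_rank <= 3;   -- Top 3; use <= N for Top N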
Data Modeling Best Practices
• Star-schema optimizations are generally a good idea
• Conservative data typing is very important
  - Especially around the fixed-length vs. dictionary boundary (8 bytes)
  - e.g. store an IP address as an IP number rather than a string
• Break down compound fields into individual fields:
  - Trivializes searching for sub-fields
  - Can avoid dictionary overhead
  - The cost to re-assemble them is generally small
A sketch illustrating both points follows.
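A hedged sketch of both ideas; the web_hits table and its columns are illustrative, not from this deck. Storing the IP as a number keeps it a 4-byte fixed-length column instead of a dictionary-backed string, and splitting a compound code keeps each part cheap to filter on:
-- Illustrative only: IP stored as a number, and a compound
-- "region-store-register" style code split into individual fields
CREATE TABLE web_hits (
  hit_date    DATE,
  ip_num      INT UNSIGNED,    -- e.g. INET_ATON('192.168.100.200')
  region_id   SMALLINT,        -- parts of what might otherwise be one
  store_id    INT,             -- compound 'EU-0042-07' style string
  register_id TINYINT
) ENGINE=columnstore;

-- Filter on the numeric IP; convert back only for display
SELECT INET_NTOA(ip_num) AS ip, COUNT(*) AS hits
FROM web_hits
WHERE ip_num BETWEEN INET_ATON('192.168.0.0') AND INET_ATON('192.168.255.255')
GROUP BY ip;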
[Diagram: three horizontal partitions (extents) of 8 million rows each]
Storage Architecture Reduces I/O
• Only touch column files that appear in filter, projection, group by, and join conditions
• Eliminate disk block touches for partitions outside the filter and join conditions
Extent Elimination
Extent 1: ShipDate 2016-01-12 - 2016-03-05
Extent 2: ShipDate 2016-03-05 - 2016-09-23
Extent 3: ShipDate 2016-09-24 - 2017-01-06
SELECT Item, sum(Quantity) FROM Orders
WHERE ShipDate BETWEEN '2016-01-01' AND '2016-01-31'
GROUP BY Item
Id OrderId Line Item Quantity Price Supplier ShipDate ShipMode
1 1 1 Laptop 5 1000 Dell 2016-01-12 G
2 1 2 Monitor 5 200 LG 2016-01-13 G
3 2 1 Mouse 1 20 Logitech 2016-02-05 M
4 3 1 Laptop 3 1600 Apple 2016-01-31 P
... rows up to 8M, ShipDate through 2016-03-05 (Extent 1: scanned)
... rows 8M+1 to 16M, ShipDate 2016-03-05 to 2016-09-23 (Extent 2: ELIMINATED PARTITION)
... rows 16M+1 to 24M, ShipDate 2016-09-24 to 2017-01-06 (Extent 3: ELIMINATED PARTITION)
Tuning Commands
MariaDB [test]> select count(*) from t1 where i = 5;
+----------+
| count(*) |
+----------+
| 2200000 |
+----------+
1 row in set (0.27 sec)
MariaDB [test]> select calGetStats()\G
*************************** 1. row ***************************
calGetStats(): Query Stats: MaxMemPct-0; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-11042; CacheI/O-11042;
BlocksTouched-11042; PartitionBlocksEliminated-0; MsgBytesIn-332KB; MsgBytesOut-3KB; Mode-Distributed
1 row in set (0.00 sec)
calGetStats: Information On The Last Query Executed Within A Given Session
MariaDB [test]> select calSetTrace(1);
+----------------+
| calSetTrace(1) |
+----------------+
| 0 |
+----------------+
1 row in set (0.00 sec)
MariaDB [test]> select d.name dept_name,
-> count(*) emp_count,
-> sum(e.salary) salary_cost
-> from emp e
-> join i_dept d on e.dept_id = d.dept_id
-> group by dept_name;
+-------------+-----------+-------------+
| dept_name | emp_count | salary_cost |
+-------------+-----------+-------------+
| Engineering | 2 | 2500 |
| Sales | 2 | 3800 |
+-------------+-----------+-------------+
2 rows in set, 1 warning (0.03 sec)
Tuning Commands
calGetTrace: Detailed distributed query execution plan
MariaDB [test]> select calGetTrace()\G
*************************** 1. row ***************************
calGetTrace():
Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows
CES UM - - - - - - 0.000 2
BPS PM e 3013 (dept_id,salary) 2 4 0 0.008 2
HJS PM e-d 3013 - - - - ----- -
TAS UM - - - - - - 0.000 2
1 row in set (0.00 sec)
Tuning Commands
Query Statistics
Users can view the query statistics by selecting the rows from the
query stats table in the infinidb_querystats schema.
Example 1: List execution time and rows returned for all SELECT queries within the past 12 hours
select queryid, query, endtime-starttime,
rows from querystats where starttime >=
now() - interval 12 hour and querytype =
'SELECT';
Example 2: List the average, min and max running time of all INSERT SELECT queries within the past 12 hours
select min(endtime-starttime), max(endtime-starttime),
avg(endtime-starttime) from querystats where
querytype='INSERT SELECT' and starttime >=
now() - interval 12 hour;
ETL
Bulk Data Load: cpimport
• Fastest way to load data into MariaDB ColumnStore
• Load data from CSV file
cpimport dbName tblName [loadFile]
• Load data from Standard Input
mysql -e 'select * from source_table;' -N db2 | cpimport destination_db
destination_tbl -s '\t'
• Load data from Binary Source file
cpimport -I1 mydb mytable sourcefile.bin
• Multiple tables can be loaded in parallel by launching multiple jobs
• Read queries continue without being blocked
• Successful cpimport is auto-committed
• In case of errors, entire load is rolled back
Bulk Data Load: cpimport mode 1
Single-file central input: data source at the UM
cpimport -m1 mytest mytable mytable.tbl
[Diagram: the source file resides on the UM (name) node; cpimport on the UM distributes rows to the PM (data) nodes]
Bulk Data Load: cpimport mode 2
Distributed input: data sources at the PMs, with a partitioned load file on each PM
cpimport -m2 testdb mytable /home/mydata/mytable.tbl
[Diagram: cpimport is launched from the UM (name) node; each PM (data) node loads its own partitioned source file]
Bulk Data Load: cpimport mode 3
Bulk load command run locally at one or more PMs:
cpimport -m3 testdb mytable /home/mydata/mytable.tbl
[Diagram: cpimport runs on each PM (data) node against that node's partitioned source file]
Bulk Data Load: LOAD DATA INFILE
• The traditional way of importing data into any MariaDB storage engine table
• Up to 2 times slower than cpimport for large imports
• Both successful and failed operations can be rolled back
mysql> load data infile '/tmp/outfile1.txt' into table destinationTable;
Query OK, 9765625 rows affected (2 min 20.01 sec)
Records: 9765625 Deleted: 0 Skipped: 0 Warnings: 0
Resources
Technical Use Cases
• Data warehousing: selective column-based queries, a large number of dimensions
• High-performance analytics on large volumes of data: reporting and analysis on millions or billions of rows, from datasets containing millions to trillions of rows, terabytes to petabytes in size
• Analytics requiring complex joins and windowing functions
Customer Use Cases
Industry | Category | Use Case
Gaming | Behavior Analytics | Projecting and predicting user behavior based on past and current data
Advertising | Customer Analytics | Customer behavior data for market segmentation and predictive analytics
Advertising | Loyalty Analytics | Customer analytics focusing on a person’s commitment to a product, company, or brand
Web, E-commerce | Click Stream Analytics | Web activity analysis, software testing, and market research with analytics on which areas of web pages users click while browsing [Deal News]
Marketing | Promotional Testing | Using marketing and campaign management data to identify the best criteria to use for a particular marketing offer
Social Network | Network Analytics | Relationship analytics among network nodes
Financial | Fraud Analytics | Monitoring user financial transactions and identifying patterns of behaviour to predict and detect abnormal or fraudulent activity, preventing damage to the user and the institution
Healthcare | Patient Analytics | Analyzing patient medical records to identify patterns that can improve medical treatment
Healthcare | Clinical Analytics | Analyzing clinical data and its impact on patients to identify patterns that can improve medical treatment
Telco | Network and Application Performance Analytics | Streaming data from network devices and applications, enriched with business operations data, to uncover actionable insights for network planning, operations and marketing analytics
Aviation | Flight Analytics | Proactively projecting parts replacement, maintenance and airplane retirement based on real-time and historically collected flight parameter data [Boeing]
Coming Soon - ColumnStore 1.1
● Text / Blob datatype support
● Bulk Write API Connector
○ Kafka integration
○ Maxscale CDC Replication integration
○ Custom
● User Defined Aggregate & Window functions.
● Data Redundancy for local storage.
● Installation improvements.
● Performance & stability improvements.
● MariaDB Server 10.2
SPECIAL NY DATABASE MONTH CONFERENCE PRICING | Use code: BIGDATALDN
https://m18.mariadb.com
Conference registration: $99 | Technical workshop: $49
Thank you
Andrew Hutchings (LinuxJedi)
Senior Software Engineer