Fast, Powerful and
Scalable Analytics
Maria Luisa Raviol
Senior Sales Engineer EMEA-MariaDB
Why Analytics?
•  Get the most value out of your data assets
•  Faster, better decision making
•  Cost reduction
•  New products and services
Types of analytics
•  What is happening?
•  Why is it happening?
•  What is likely to happen?
•  What should I do about it?
Descriptive: What happened?
●  Reports
○  Sales report
○  Expense summary
●  Ad-hoc requests to analysts
Diagnostic: Why did it happen?
●  Aggregates: aggregate a measure over one or more dimensions
○  Find total sales
○  Top five products ranked by sales
●  Roll-ups: aggregate at different levels of a dimension hierarchy
○  Given total sales by city, roll up to get sales by state
●  Drill-down: inverse of roll-up
○  Given total sales by state, drill down to get totals by city
●  Slicing and dicing:
○  Equality and range selections on one or more dimensions
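The operations above can be sketched in a few lines of Python. This is only an illustration: the `SALES` rows, the `aggregate` helper and all amounts are invented for the example and do not come from the deck.

```python
from collections import defaultdict

# Hypothetical sales facts: (state, city, product, amount).
SALES = [
    ("NY", "New York", "Laptop", 1000),
    ("NY", "New York", "Mouse", 20),
    ("NY", "Albany", "Monitor", 200),
    ("CA", "Fresno", "Laptop", 1600),
    ("CA", "Fresno", "Mouse", 20),
]

def aggregate(rows, key):
    """Sum the amount over the dimension(s) selected by `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill-down level: total sales by (state, city).
by_city = aggregate(SALES, key=lambda r: (r[0], r[1]))

# Roll-up: fold the city totals one level up the hierarchy, to state.
by_state = defaultdict(int)
for (state, _city), total in by_city.items():
    by_state[state] += total
by_state = dict(by_state)

# Slice: equality selection on one dimension, then aggregate.
laptops_by_state = aggregate([r for r in SALES if r[2] == "Laptop"],
                             key=lambda r: r[0])

# Aggregate plus ranking: products ordered by total sales.
top_products = sorted(aggregate(SALES, key=lambda r: r[2]).items(),
                      key=lambda kv: kv[1], reverse=True)

print(by_state)          # {'NY': 1220, 'CA': 1620}
print(top_products[0])   # ('Laptop', 2600)
```

Every one of these operations reads many rows but only a few columns, which is the access pattern the rest of the deck optimizes for.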
Predictive: What is likely to happen?
●  Sales prediction
○  Analyze data to identify trends, spot weaknesses, or determine conditions among broader data sets to make decisions about the future
●  Targeted marketing
○  What is the likelihood of a customer buying a particular product, based on past buying behavior?
Real World Example - Visualization
Prescriptive: What is the best course of action?
Paradox of choice
With too many choices, which one is the best?
Data Analytics Use Cases
By industry
Finance
•  Identify trade patterns
•  Detect fraud and anomalies
•  Predict trading outcomes
Manufacturing
•  Simulations to improve design/yield
•  Detect production anomalies
•  Predict machine failures (sensor data)
Telecom
•  Behavioral analysis of customer calls
•  Network analysis (performance and reliability)
Healthcare
•  Find genetic profiles/matches
•  Analyze health vs spending
•  Predict viral outbreaks
Analytics Database requirements
Why this is different from OLTP
and why indexes are not helpful
OLTP or Transactional Workload
•  OLTP applications
–  have a read/write ratio of around 50/50
–  Web apps and e-commerce have more reads, often closer to 90/10
–  Single rows are selected, inserted, updated and deleted, one by one or in small groups
•  OLTP data structure is
–  a representation of the business or the applications
•  An order references a customer, and an order item is linked to an order
–  Sometimes individual aspects break the normal form, for performance reasons
–  Transactions and ACID properties are required
The OLAP or Analytics Workload
•  Deals with data from a high-level perspective
•  Contains structured, semi-structured and sometimes unstructured data
•  Data structures are optimized for analytics use and performance
•  Handles data in large groups of rows
–  SELECTs data by date, customer location, product id etc.
–  Dealing with individual data items is usually inefficient
•  Analytics data
–  Often comes from many different sources
–  Is loaded in batches or streamed in, mostly just INSERTed
–  Is sometimes purged, but most of the time not
•  Queries are largely ad hoc
•  Transactions and ACID requirements are relaxed
Analytics database requirements
•  Fast access to large amounts of data
•  Scalable as data grows over time
–  Analytics requirements are increasing
–  Regulatory requirements
–  New data sources are added
•  Load performance must be fast, scalable and predictable
•  Data loading should be very flexible due to the different sources of data
–  Some data is loaded in batches, some is streamed
•  Query performance also needs to be scalable
•  Data compression is a requirement
–  Data size constraints, as well as read performance from disk
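To see why storing similar values together helps compression, here is a minimal run-length encoding sketch. It is a generic illustration only: the deck does not say which codec MariaDB ColumnStore uses, and the `state_column` data and the function names are invented for the example.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical low-cardinality column whose equal values sit next to each other.
state_column = ["CA"] * 4 + ["MA"] * 2 + ["NY"] * 6

encoded = rle_encode(state_column)
print(encoded)                              # [('CA', 4), ('MA', 2), ('NY', 6)]
print(len(state_column), "->", len(encoded))  # 12 -> 3
```

Fewer stored values also means fewer blocks to read from disk, which is the read-performance side of the requirement above.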
B-TREE INDEXES: THE GOOD
•  Well-known technology
•  Works with most types of data
•  Scales reasonably well
•  Really good for OLTP transactional data
B-TREE INDEXES: THE BAD
•  Really bad for unbalanced data
•  Index modifications can be really slow
•  Index modifications are largely single threaded
•  Slows down with the amount of data
•  Really not scalable with large amounts of data
In summary, what do we need?
•  Something that
–  can compress data A LOT
–  can be written to with fast and predictable performance
–  doesn't necessarily support transactions (performance is key)
–  can support analytics queries (ad hoc, aggregate)
–  can scale as data grows
–  can still have a level of high availability
•  Something that works with analytics tools (Tableau, Pentaho, MicroStrategy, etc.)
The Solution
Distributed Column based storage
Existing Approaches
Data Warehouses
•  Limited real-time analytics
•  Slow releases of product innovation
•  Expensive hardware and software
•  Hard to use
•  Purpose built rather than predictive analytics
Hadoop / NoSQL
•  Limited SQL support
•  Difficult to install/manage
•  Limited talent pool
•  Data lake with no data management
Row-oriented vs. Column-oriented format
• Row oriented:
– Rows stored sequentially in a file
– Scans through every record row by row
• Column oriented:
– Each column is stored in a separate file
– Scans only the relevant columns
ID Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
ID: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F
SELECT Fname FROM People WHERE State = 'NY'
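A rough way to compare the two layouts for this query is to count how many stored values each scan has to read. The sketch below uses the five rows from the slide; the counting model and the names `row_scan` and `column_scan` are simplifying assumptions, not how any particular engine measures I/O.

```python
COLUMNS = ("ID", "Fname", "Lname", "State", "Zip", "Phone", "Age", "Sex")
PEOPLE = [
    (1, "Bugs", "Bunny", "NY", "11217", "(718) 938-3235", 34, "M"),
    (2, "Yosemite", "Sam", "CA", "95389", "(209) 375-6572", 52, "M"),
    (3, "Daffy", "Duck", "NY", "10013", "(212) 227-1810", 35, "M"),
    (4, "Elmer", "Fudd", "ME", "04578", "(207) 882-7323", 43, "M"),
    (5, "Witch", "Hazel", "MA", "01970", "(978) 744-0991", 57, "F"),
]

def row_scan(rows):
    """Row layout: whole rows are read, so every field counts as read."""
    values_read, result = 0, []
    for row in rows:
        values_read += len(row)
        if row[3] == "NY":          # State
            result.append(row[1])   # Fname
    return result, values_read

# Column layout: one list ("file") per column.
column_files = {name: [row[i] for row in PEOPLE]
                for i, name in enumerate(COLUMNS)}

def column_scan(files):
    """Column layout: read State fully, then only the matching Fname values."""
    state = files["State"]
    matches = [i for i, value in enumerate(state) if value == "NY"]
    values_read = len(state) + len(matches)
    return [files["Fname"][i] for i in matches], values_read

print(row_scan(PEOPLE))           # (['Bugs', 'Daffy'], 40)
print(column_scan(column_files))  # (['Bugs', 'Daffy'], 7)
```

Both scans return the same answer; the column layout simply never touches Lname, Zip, Phone, Age or Sex.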
Single-Row Operations - Insert
Row oriented: new rows appended to the end.
Column oriented: new value added to each file.
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Key: 1, 2, 3, 4, 5, 6
Fname: Bugs, Yosemite, Daffy, Elmer, Witch, Marvin
Lname: Bunny, Sam, Duck, Fudd, Hazel, Martian
State: NY, CA, NY, ME, MA, CA
Zip: 11217, 95389, 10013, 04578, 01970, 91602
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991, (818) 761-9964
Age: 34, 52, 35, 43, 57, 26
Sex: M, M, M, M, F, M
Columnar storage is not efficient for singleton inserts (OLTP), because each insert touches every column file rather than a single row. Batch loads on column-oriented storage are faster (compression, no indexes).
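The difference between singleton inserts and batch loads can be sketched by counting file appends. This is a toy cost model under stated assumptions: `RowTable`, `ColumnTable` and the `writes` counter are hypothetical names, and one "write" stands for one append operation on one file.

```python
COLUMNS = ("Key", "Fname", "Lname", "State", "Zip", "Phone", "Age", "Sex")
NEW_ROW = (6, "Marvin", "Martian", "CA", "91602", "(818) 761-9964", 26, "M")

class RowTable:
    """Row layout: one file holding whole rows."""
    def __init__(self):
        self.file = []
        self.writes = 0

    def insert(self, row):
        self.file.append(row)   # one append per inserted row
        self.writes += 1

class ColumnTable:
    """Column layout: one file (list) per column."""
    def __init__(self, columns):
        self.files = {name: [] for name in columns}
        self.writes = 0

    def insert(self, row):
        # Singleton insert: every column file is touched once.
        for name, value in zip(self.files, row):
            self.files[name].append(value)
            self.writes += 1

    def bulk_load(self, rows):
        # Batch load: each column file is extended once, whatever the batch size.
        for i, name in enumerate(self.files):
            self.files[name].extend(row[i] for row in rows)
            self.writes += 1

row_table, col_table = RowTable(), ColumnTable(COLUMNS)
row_table.insert(NEW_ROW)
col_table.insert(NEW_ROW)
print(row_table.writes, col_table.writes)   # 1 8

batch_table = ColumnTable(COLUMNS)
batch_table.bulk_load([NEW_ROW] * 1000)
print(batch_table.writes)                   # 8
```

In this model a single-row insert costs eight appends in the column layout, while a 1000-row batch still costs eight, which is why batching suits columnar storage.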
Single-Row Operations - Update
Row oriented: updating a column in 100% of rows means changing 100% of blocks on disk.
Column oriented: only the blocks that hold the updated column are changed.
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
Key: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F
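As a back-of-the-envelope check of the update claim, the following sketch counts the blocks rewritten when one column is updated in every row. The block capacity of 8192 values and the assumption that all values have the same width are invented for the example.

```python
import math

def dirty_blocks_row_layout(n_rows, n_columns, values_per_block):
    """Blocks hold whole rows, so changing one column in every row rewrites every block."""
    return math.ceil(n_rows * n_columns / values_per_block)

def dirty_blocks_column_layout(n_rows, values_per_block):
    """Only the blocks of the updated column's own file are rewritten."""
    return math.ceil(n_rows / values_per_block)

N_ROWS = 8_000_000        # hypothetical table size
N_COLUMNS = 8             # the People table on the slide has 8 columns
VALUES_PER_BLOCK = 8192   # assumed block capacity, equal-width values

print(dirty_blocks_row_layout(N_ROWS, N_COLUMNS, VALUES_PER_BLOCK))  # 7813
print(dirty_blocks_column_layout(N_ROWS, VALUES_PER_BLOCK))          # 977
```

Under these assumptions the column layout rewrites roughly one eighth of the blocks, that is, one column's share of the table.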
Single-Row Operations - Delete
Row oriented: row deleted from the file
Column oriented: value deleted from each file
Recommended: partition drop, which allows dropping data from all columns in bulk.
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Key: 1, 2, 3, 4, 5, 6
Fname: Bugs, Yosemite, Daffy, Elmer, Witch, Marvin
Lname: Bunny, Sam, Duck, Fudd, Hazel, Martian
State: NY, CA, NY, ME, MA, CA
Zip: 11217, 95389, 10013, 04578, 01970, 91602
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991, (818) 761-9964
Age: 34, 52, 35, 43, 57, 26
Sex: M, M, M, M, F, M
Changing the table structure
Row oriented: requires rebuilding the whole table
Column oriented: creates a new file for the new column
Column-oriented storage is very flexible for adding columns; no full rebuild is required.
Key Fname Lname State Zip Phone Age Sex Active
1 Bugs Bunny NY 11217 (718) 938-3235 34 M Y
2 Yosemite Sam CA 95389 (209) 375-6572 52 M N
3 Daffy Duck NY 10013 (212) 227-1810 35 M N
4 Elmer Fudd ME 04578 (207) 882-7323 43 M Y
5 Witch Hazel MA 01970 (978) 744-0991 57 F N
Key: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F
Active: Y, N, N, Y, N
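Adding the Active column can be sketched for both layouts as follows. The function names, the three-column sample and the default value "N" are assumptions made for the illustration.

```python
ROWS = [(1, "Bugs", "NY"), (2, "Yosemite", "CA"), (3, "Daffy", "NY")]
FILES = {
    "Key": [1, 2, 3],
    "Fname": ["Bugs", "Yosemite", "Daffy"],
    "State": ["NY", "CA", "NY"],
}

def add_column_row_layout(rows, default):
    """Row layout: every stored row is rewritten with the extra field."""
    rebuilt = [row + (default,) for row in rows]
    return rebuilt, len(rebuilt)           # number of rows rewritten

def add_column_column_layout(files, name, default):
    """Column layout: one new file is created; existing files are reused as-is."""
    n_rows = len(next(iter(files.values())))
    new_files = dict(files)                # shallow copy: column lists are not copied
    new_files[name] = [default] * n_rows
    return new_files, 1                    # number of files created

rebuilt_rows, rows_rewritten = add_column_row_layout(ROWS, "N")
new_files, files_created = add_column_column_layout(FILES, "Active", "N")
print(rows_rewritten, files_created)       # 3 1
```

The existing column lists are shared between `FILES` and `new_files`, mirroring the point that queries on the old columns are not disturbed by the schema change.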
Storage Architecture
•  Columnar storage
–  Each column stored as a separate file
–  No index management for query performance tuning
–  Online schema changes: add a new column without impacting running queries
•  Automatic horizontal partitioning
–  Logical partition every 8 million rows
–  In-memory metadata of partition min and max
–  Query engine performs partition elimination
–  No partition management for query performance tuning
•  Compression
–  Accelerate decompression rate
–  Reduce I/O for compressed blocks
Data automatically arranged by
•  Column – acts as vertical partitioning
•  Extents – act as horizontal partitioning
[Diagram: Column 1, Column 2, Column 3 ... Column N are vertical partitions; each column is split into Extent 1, Extent 2 ... Extent M (8 million rows, 8MB~64MB each), which are horizontal partitions]
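The extent layout with per-extent min and max metadata can be sketched as below. This is a simplified model, not MariaDB ColumnStore's implementation: the `Extent` class, `build_extents` and the sample `ship_day` values are hypothetical, and the demo uses 4 rows per extent instead of 8 million.

```python
from dataclasses import dataclass

EXTENT_ROWS = 8_000_000  # extent size quoted on the slide

@dataclass
class Extent:
    first_row: int   # position of the extent's first value in the column
    values: list     # the column values held by this extent
    min_value: int   # in-memory metadata, usable for partition elimination
    max_value: int

def build_extents(column_values, extent_rows=EXTENT_ROWS):
    """Split one column into fixed-size extents and record min/max for each."""
    extents = []
    for start in range(0, len(column_values), extent_rows):
        chunk = column_values[start:start + extent_rows]
        extents.append(Extent(start, chunk, min(chunk), max(chunk)))
    return extents

# Tiny demo column (day numbers), 4 rows per extent.
ship_day = [12, 13, 36, 31, 65, 70, 90, 120, 267, 300]
extents = build_extents(ship_day, extent_rows=4)
metadata = [(e.first_row, e.min_value, e.max_value) for e in extents]
print(metadata)   # [(0, 12, 36), (4, 65, 120), (8, 267, 300)]
```

Because the metadata is tiny compared with the data, it can stay in memory and be consulted before any extent is read from disk.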
High Performance Query Processing
Storage Architecture reduces I/O
• Only touch column files that are in filter, projection, group by, and join conditions
• Eliminate disk block touches to partitions outside filter and join conditions
SELECT Item, sum(Quantity) FROM Orders
WHERE ShipDate between '2016-01-01' and '2016-01-31'
GROUP BY Item
Id     OrderId  Line  Item     Quantity  Price  Supplier  ShipDate    ShipMode
1      1        1     Laptop   5         1000   Dell      2016-01-12  G
2      1        2     Monitor  5         200    LG        2016-01-13  G
3      2        1     Mouse    1         20     Logitech  2016-02-05  M
4      3        1     Laptop   3         1600   Apple     2016-01-31  P
...    ...      ...   ...      ...       ...    ...       ...         ...
8M                                                        2016-03-05
8M+1                                                      2016-03-05
...    ...      ...   ...      ...       ...    ...       ...         ...
16M                                                       2016-09-23
16M+1                                                     2016-09-24
...    ...      ...   ...      ...       ...    ...       ...         ...
24M                                                       2017-01-06
Extent 1 (horizontal partition, 8 million rows): ShipDate 2016-01-12 - 2016-03-05
Extent 2 (horizontal partition, 8 million rows): ShipDate 2016-03-05 - 2016-09-23 (ELIMINATED PARTITION)
Extent 3 (horizontal partition, 8 million rows): ShipDate 2016-09-24 - 2017-01-06 (ELIMINATED PARTITION)
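Partition elimination for this query reduces to an interval-overlap test against each extent's min and max ShipDate. The sketch below uses the three ranges from the slide; the function name `extents_to_scan` and the dictionary layout are assumptions for the illustration.

```python
from datetime import date

# Min/max ShipDate per extent, as shown on the slide.
EXTENT_RANGES = {
    1: (date(2016, 1, 12), date(2016, 3, 5)),
    2: (date(2016, 3, 5), date(2016, 9, 23)),
    3: (date(2016, 9, 24), date(2017, 1, 6)),
}

def extents_to_scan(ranges, low, high):
    """Keep only extents whose [min, max] range overlaps the filter [low, high]."""
    return [extent_id
            for extent_id, (min_date, max_date) in ranges.items()
            if min_date <= high and max_date >= low]

# WHERE ShipDate between '2016-01-01' and '2016-01-31'
print(extents_to_scan(EXTENT_RANGES, date(2016, 1, 1), date(2016, 1, 31)))  # [1]
```

Extents 2 and 3 are skipped without reading any of their blocks, and within extent 1 only the Item, Quantity and ShipDate column files are needed.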
MariaDB Big Data Solution
MariaDB AX and MariaDB ColumnStore
MariaDB AX
Analytics - simple, fast, scalable… and open source
MariaDB ColumnStore
High performance columnar storage engine that supports a wide variety of analytical use cases in highly scalable distributed environments
•  Parallel query processing for distributed environments
•  Faster, more efficient queries
•  Single interface for OLTP and analytics
•  Easy to manage and scale
•  Easier enterprise analytics
•  Power of SQL and freedom of open source for big data analytics
•  Better price performance
MariaDB AX
•  MariaDB Server
•  MariaDB MaxScale
•  MariaDB ColumnStore: parallel queries, distributed storage, no indexes, automatic partitioning, read optimized, high compression, low disk I/O
[Architecture diagram: MariaDB MaxScale; UM (User Module): MariaDB Server with ColumnStore; PM (Performance Module): ColumnStore storage]
Better Price Performance
No need to maintain a third platform
•  Run analytics from the same SQL front end
•  No need to update application code
•  Leverage MariaDB extensible architecture
Flexible deployment options
Cloud and on-premise
Run on commodity hardware
Open source, subscription-based pricing
High data compression
•  More efficient at storing big data
•  Less hardware
Customers have saved by choosing MariaDB AX over Oracle (healthcare), MemSQL (auto parts), and Vertica (finance, SEO marketing). Come see them at M18!
[Chart: MariaDB ColumnStore costs 90.3% less per TB per year than a commercial data warehouse]
Easier Enterprise Analytics
Full ANSI SQL
•  No more "SQL-like" queries
•  Supports complex joins, aggregations and window functions
Single SQL Front-end
Use a single SQL interface for analytics and OLTP
Leverage MariaDB security features: encryption for data in motion, role-based access and auditing
Easy to manage and scale
•  Eliminates the need for indexes and views
•  Automated horizontal/vertical partitioning
•  Linearly scalable by adding new nodes as data grows
•  Out-of-the-box connection with BI tools
Faster, More Efficient Queries
Parallel distributed query execution
•  Distributes queries into a series of parallel operations
•  Fully parallel high-speed data ingestion
–  TPC-H lineitem table - 750K to 1 million rows per min
Optimized for columnar storage
Columnar storage reduces disk I/O
Blazing fast for read-intensive workloads
Ultra fast data import
Highly available analytic environment
•  Built-in redundancy
•  Automatic fail-over
Parallel Query Processing
MariaDB ColumnStore
Analytics Use Cases
Healthcare / Life Science Industry
Genome analysis
•  In-depth genome research for the dairy industry to improve production of milk and protein
•  Fast data load for large genome datasets (DNA data for 7 billion cows in US - 20GB per load)
Healthcare spending analysis
•  Analyze 3TB of US health care spending for 155 conditions with 7 years of historical data
•  Used Sankey diagrams, treemaps, and pyramid charts to analyze trends by age, sex, type of care, and condition
Why MariaDB ColumnStore
•  Strong security features including role-based data access and audit plugin
•  MPP architecture handles analytics on big data with high speed
•  Easy to analyze archived data with SQL-based analytics
•  Does not require a DBA to index or partition data
Telecommunication Industry
Customer behavior analysis
•  Analyze call data records to segment customers based on their behavior
•  Data-driven analysis for customer satisfaction
•  Create behavior-based upsell or cross-sell opportunities
Call data analysis
•  Data size: 6TB
•  Ingest 1.5 billion rows of logs per day with 30 million texts and 3 million calls
•  Call and network quality analysis
•  Provide higher quality customer services based on data
Why MariaDB ColumnStore
•  ColumnStore supports time-based partitioning and time-series analysis
•  Fast data load for real-time analytics
•  MPP architecture handles analytics on big data with high speed
•  Easy to analyze the archived data with SQL-based analytics
In Conclusion
•  Analytics requires a different technology to be able to cope with
–  Different types of data
–  Different types of data access
•  OLTP databases have different requirements compared to analytics
•  Column-based storage allows high compression
•  Metadata can replace indexing
•  Distributed processing allows for performance and scalability
•  MariaDB ColumnStore implements a fast and efficient distributed database for analytics
•  MariaDB AX is the subscription for professional use of MariaDB ColumnStore
•  MariaDB ColumnStore is gaining wide acceptance
Thank you
MariaDB AX Use Cases
IHME - Institute for Health Metrics and Evaluation
IHME Visualizations library: http://www.healthdata.org/results/data-visualizations
Started with 4.2 TB, with a goal of growing to 30 TB of data
Customer Use Case - 1
Industry: healthcare (Medicaid)
Data: surveys
Use case: decision support system
Details:
•  Identify trends and patterns
•  Determine population cohorts
•  Predict health outcomes
•  Anticipate funding / capacity
•  Recommend intervention
Can’t do complex queries on current hardware with Oracle and snowflake schemas
Limited to optimizing for simple, known queries (2-3 columns)
Replaced with ColumnStore
> a single table
> 2.5 million rows, 248 columns
> complex, ad-hoc queries
> query 20+ columns in seconds
Customer Use Case - 2
Industry: biotechnology (genetics)
Data: genotypes
Use case: genetic profiling
Details:
•  Find genetic mates (beef and dairy)
•  Predict meat production (pork)
•  Gene/DNA analysis
Had to convert to CSV files and schedule
import jobs (cron)
Always receiving new genetic data
Migrated to data adapter (Python)
> streamline import process
> remove steps / possible error
> remove delays
> import data on demand
> immediate customer access
Customer Use Case - 3
Industry: Mobile text/call app
Data: call and text logs
Use case: Mobile app use analytics
Details:
•  30 million texts and 3 million phone calls per day
•  1.5 billion rows of logs per day
•  The text and call volume will continue to grow
InnoDB backend hit the scale limit of 6TB and required a lot of performance tuning and index management
Migrated to MariaDB AX
> Able to process 24 months - 24TB vs the 6-month limitation of InnoDB
> Same BI tools and client applications worked with MariaDB AX seamlessly
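The figures on this slide can be cross-checked with a little arithmetic. The calculation below assumes an even ingest rate over the day and a constant data growth per month, which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_day = 1_500_000_000           # log rows per day, from the slide
rows_per_second = rows_per_day / SECONDS_PER_DAY
print(round(rows_per_second))          # 17361 (average sustained ingest rate)

retained_tb, retained_months = 24, 24  # MariaDB AX: 24 months, 24TB
tb_per_month = retained_tb / retained_months
print(tb_per_month)                    # 1.0

innodb_limit_tb = 6                    # InnoDB scale limit from the slide
print(innodb_limit_tb / tb_per_month)  # 6.0 (months of data before the limit)
```

At about 1 TB of new data per month, a 6TB ceiling corresponds to the 6-month limitation mentioned for InnoDB.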
MariaDB AX
Analytics made easy –
simple, fast, scalable…
Thank you

Big Data Analytics with MariaDB AX

  • 1. Fast, Powerful and Scalable Analytics Maria Luisa Raviol Senior Sales Engineer EMEA-MariaDB
  • 2. Why Analytics ? •  Get the most value of your data asset •  Faster Better decision making process •  Cost reduction •  New products and services
  • 3. What is likely to happen? Why is it happening? Types of analytics What is happening? What should I do about it?
  • 4. Descriptive: What happened ? ●  Reports ○  Sales Report ○  Expense summary ●  Ad-hoc requests to analyst
  • 5. Diagnostics: Why did it happen ●  Aggregates: aggregate measure over one or more dimension ○  Find total sales ○  Top five product ranked by sales ●  Roll-ups: Aggregate at different levels of dimension hierarchy ○  given total sales by city, roll-up to get sales by state ●  Drill-down: Inverse of roll-ups ○  given total sales by state, drill-down to get total by city ●  Slicing and Dicing: ○  Equality and range selections on one or more dimensions
  • 6. Predictive: What is likely to happen ●  Sales Prediction ○  Analyze data to identify trends, spot weakness or determine conditions among broader data sets for making decisions about the future ●  Targeted marketing ○  what is likelihood of a customer buying a particular product based on past buying behavior
  • 7. Real World Example - Visualization
  • 8. Prescriptive: What is the best course of action? Paradox of choices With too many choices, which one is the best?
  • 9. Data Analytics Use Cases By industry Finance Identify trade patterns Detect fraud and anomalies Predict trading outcomes Manufacturing Simulations to improve design/yield Detect production anomalies Predict machine failures (sensor data) Telecom Behavioral analysis of customer calls Network analysis (perf and reliability) Healthcare Find genetic profiles/matches Analyze health vs spending Predict viral outbreaks
  • 10. Analytics Database requirements Why this is different from OLTP and why indexes are not helpful
  • 11. OLTP or Transactional Workload •  OLTP applications –  have a read / write ratio of maybe 50/50 –  Web apps / E-commerce have more reads, ending with maybe 90/10 –  Single rows are selected, inserted, updated and deleted, one by one or in small groups •  OLTP data structure is –  a representation of the business or the applications •  An order reference a customer, and order item is linked to an order –  Sometimes individual aspects break the normal form, for performance reasons –  Transactions and ACID properties are required
  • 12. The OLAP or Analytics Workload •  Deals with data from a high level perspective •  Contains structured, semi-structured and sometimes unstructured data •  Data structures are optimized for analytics use and performance •  Handles data in large groups of rows –  SELECTs data by date, customer location, product id etc. –  Dealing with individual data items is usually ineffective •  Analytics data –  Often comes from many different sources –  Are loaded in batch or streamed in, mainly just INSERTed –  Are sometimes purged, but most of the times not •  Queries are largely ad-hoc, •  Transactions and ACID requirements are relaxed
  • 13. Analytics database requirements •  Fast access to large amounts of data •  Scalable as data grows over time –  Analytics requirements increasing –  Regulatory requirements –  New data sources are added •  Load performance must be fast, scalable and predictable •  Data loading should be very flexible due to the different sources of data –  Some data loaded in batch, other is streamed •  Query performance also need to be scalable •  Data compression is a requirement –  Data size constraints, as well as read performance from disk
  • 14. B-TREE INDEXES THE GOOD B-TREE INDEXES THE BAD•  Well known technology •  Works with most types of data •  Scales reasonably well •  Really good for OLTP transactional data •  Really bad for unbalanced data •  Index modifications can be really slow •  Index modifications are largely single threaded •  Slows down with the amount of data •  Really not scalable with large amount of data
  • 15. In summary, what do we need •  Something that –  can compress data A LOT –  can be written to with fast and predictable performance –  doesn't necessarily support transactions •  performance is key –  can support analytics queries (ad hoc, aggregate) –  can scale as data grows –  can still have a level of high availability •  Something that works with analytics tools, (Tableau, Pehtaho, Microstrategy, etc.)
  • 17. Existing Approaches Limited real time analytics Slow releases of product innovation Expensive hardware and software Data Warehouses Hadoop / NoSQL LIMITED SQL SUPPORT DIFFICULT TO INSTALL/ MANAGE LIMITED TALENT POOL DATA LAKE W/ NO DATA MANAGEMENT Hard to use Purpose Built rather than predictive analytics
  • 18. Row-oriented vs. Column-oriented format • Row oriented – Rows stored sequentially in a file – Scans through every record row by row • Column oriented: – Each column is stored in a separate file – Scans only the relevant columns ID Fname Lname State Zip Phone Age Sex 1 Bugs Bunny NY 11217 (718) 938-3235 34 M 2 Yosemite Sam CA 95389 (209) 375-6572 52 M 3 Daffy Duck NY 10013 (212) 227-1810 35 M 4 Elmer Fudd ME 04578 (207) 882-7323 43 M 5 Witch Hazel MA 01970 (978) 744-0991 57 F ID 1 2 3 4 5 Fname Bugs Yosemite Daffy Elmer Witch Lname Bunny Sam Duck Fudd Hazel State NY CA NY ME MA Zip 11217 95389 10013 04578 01970 Phone (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991 Age 34 52 35 43 57 Sex M M M M F SELECT Fname FROM People WHERE State = 'NY'
  • 19. Single-Row Operations - Insert Row oriented: new rows appended to the end. Column oriented: new value added to each file. Key Fname Lname State Zip Phone Age Sex 1 Bugs Bunny NY 11217 (718) 938-3235 34 M 2 Yosemite Sam CA 95389 (209) 375-6572 52 M 3 Daffy Duck NY 10013 (212) 227-1810 35 M 4 Elmer Fudd ME 04578 (207) 882-7323 43 M 5 Witch Hazel MA 01970 (978) 744-0991 57 F 6 Marvin Martian CA 91602 (818) 761-9964 26 M Key 1 2 3 4 5 Fname Bugs Yosemite Daffy Elmer Witch Lname Bunny Sam Duck Fudd Hazel State NY CA NY ME MA Zip 11217 95389 10013 04578 01970 Phone (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991 Age 34 52 35 43 57 Sex M M M M F 6 Marvin Martian CA 91602 (818) 761-9964 26 M Columnar insert not efficient for singleton insertions (OLTP). Batch loads touches row vs. column. Batch load on column-oriented is faster (compression, no indexes).
  • 20. Single-Row Operations - Update Row oriented: Update 100% of rows means change 100% of blocks on disk. Column oriented: Just update the blocks needed to be updated Key Fname Lname State Zip Phone Age Sex 1 Bugs Bunny NY 11217 (718) 938-3235 34 M 2 Yosemite Sam CA 95389 (209) 375-6572 52 M 3 Daffy Duck NY 10013 (212) 227-1810 35 M 4 Elmer Fudd ME 04578 (207) 882-7323 43 M 5 Witch Hazel MA 01970 (978) 744-0991 57 F Key 1 2 3 4 5 Fname Bugs Yosemite Daffy Elmer Witch Lname Bunny Sam Duck Fudd Hazel State NY CA NY ME MA Zip 11217 95389 10013 04578 01970 Phone (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991 Age 34 52 35 43 57 Sex M M M M F
  • 21. Single-Row Operations - Delete
    Row oriented: rows are deleted.
    Column oriented: the value is deleted from each column file.
    Recommended: use partition drop to remove column data in bulk.
    Row-oriented layout (row 6 deleted):
      Key  Fname     Lname    State  Zip    Phone           Age  Sex
      1    Bugs      Bunny    NY     11217  (718) 938-3235  34   M
      2    Yosemite  Sam      CA     95389  (209) 375-6572  52   M
      3    Daffy     Duck     NY     10013  (212) 227-1810  35   M
      4    Elmer     Fudd     ME     04578  (207) 882-7323  43   M
      5    Witch     Hazel    MA     01970  (978) 744-0991  57   F
      6    Marvin    Martian  CA     91602  (818) 761-9964  26   M
    Column-oriented layout (value for key 6 deleted from each file):
      Key:   1, 2, 3, 4, 5, 6
      Fname: Bugs, Yosemite, Daffy, Elmer, Witch, Marvin
      Lname: Bunny, Sam, Duck, Fudd, Hazel, Martian
      State: NY, CA, NY, ME, MA, CA
      Zip:   11217, 95389, 10013, 04578, 01970, 91602
      Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991, (818) 761-9964
      Age:   34, 52, 35, 43, 57, 26
      Sex:   M, M, M, M, F, M
  • 22. Changing the table structure
    Row oriented: requires rebuilding the whole table.
    Column oriented: a new file is created for the new column.
    Column-oriented storage is very flexible for adding columns; no full rebuild is required.
    Row-oriented layout (Active column added):
      Key  Fname     Lname  State  Zip    Phone           Age  Sex  Active
      1    Bugs      Bunny  NY     11217  (718) 938-3235  34   M    Y
      2    Yosemite  Sam    CA     95389  (209) 375-6572  52   M    N
      3    Daffy     Duck   NY     10013  (212) 227-1810  35   M    N
      4    Elmer     Fudd   ME     04578  (207) 882-7323  43   M    Y
      5    Witch     Hazel  MA     01970  (978) 744-0991  57   F    N
    Column-oriented layout (new Active file):
      Key:    1, 2, 3, 4, 5
      Fname:  Bugs, Yosemite, Daffy, Elmer, Witch
      Lname:  Bunny, Sam, Duck, Fudd, Hazel
      State:  NY, CA, NY, ME, MA
      Zip:    11217, 95389, 10013, 04578, 01970
      Phone:  (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
      Age:    34, 52, 35, 43, 57
      Sex:    M, M, M, M, F
      Active: Y, N, N, Y, N
  • 23. Storage Architecture
    •  Columnar storage
       –  Each column stored as a separate file
       –  No index management for query performance tuning
       –  Online schema changes: add a new column without impacting running queries
    •  Automatic horizontal partitioning
       –  Logical partition every 8 million rows
       –  In-memory metadata of partition min and max
       –  Query engine performs partition elimination
       –  No partition management for query performance tuning
    •  Compression
       –  Accelerate decompression rate
       –  Reduce I/O for compressed blocks
    Diagram: Column 1, Column 2, Column 3, ... Column N, each split into Extent 1 (8 million rows, 8MB~64MB), Extent 2 (8 million rows), ... Extent M (8 million rows)
    Data automatically arranged by
    •  Column – acts as vertical partitioning
    •  Extents – act as horizontal partitioning
  • 24. High Performance Query Processing
    Storage architecture reduces I/O
    •  Only touch column files that are in filter, projection, group by, and join conditions
    •  Eliminate disk block touches to partitions outside filter and join conditions
    Horizontal partitions (8 million rows each):
      Extent 1: ShipDate: 2016-01-12 - 2016-03-05
      Extent 2: ShipDate: 2016-03-05 - 2016-09-23 (ELIMINATED PARTITION)
      Extent 3: ShipDate: 2016-09-24 - 2017-01-06 (ELIMINATED PARTITION)
    SELECT Item, sum(Quantity) FROM Orders WHERE ShipDate between '2016-01-01' and '2016-01-31' GROUP BY Item
      Id     OrderId  Line  Item     Quantity  Price  Supplier  ShipDate    ShipMode
      1      1        1     Laptop   5         1000   Dell      2016-01-12  G
      2      1        2     Monitor  5         200    LG        2016-01-13  G
      3      2        1     Mouse    1         20     Logitech  2016-02-05  M
      4      3        1     Laptop   3         1600   Apple     2016-01-31  P
      ...    ...      ...   ...      ...       ...    ...       ...         ...
      8M                                                        2016-03-05
      8M+1                                                      2016-03-05
      ...    ...      ...   ...      ...       ...    ...       ...         ...
      16M                                                       2016-09-23
      16M+1                                                     2016-09-24
      ...    ...      ...   ...      ...       ...    ...       ...         ...
      24M                                                       2017-01-06
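Partition elimination on this slide comes down to comparing the query's date range against each extent's stored minimum and maximum. The sketch below shows that check using the ShipDate ranges from the slide; the dictionary `EXTENT_RANGES` and the function `extents_to_scan` are hypothetical names for illustration and do not reflect ColumnStore's internal data structures.

```python
from datetime import date

# Per-extent (min, max) ShipDate metadata, taken from the slide.
EXTENT_RANGES = {
    1: (date(2016, 1, 12), date(2016, 3, 5)),
    2: (date(2016, 3, 5), date(2016, 9, 23)),
    3: (date(2016, 9, 24), date(2017, 1, 6)),
}


def extents_to_scan(extent_ranges, low, high):
    """Return the extents whose [min, max] range overlaps [low, high].

    Any extent that does not overlap cannot contain a matching row,
    so it can be skipped without reading its blocks.
    """
    return [
        extent_id
        for extent_id, (min_value, max_value) in sorted(extent_ranges.items())
        if min_value <= high and max_value >= low
    ]


# WHERE ShipDate between '2016-01-01' and '2016-01-31'
print(extents_to_scan(EXTENT_RANGES, date(2016, 1, 1), date(2016, 1, 31)))
```

Only extent 1 overlaps January 2016, so extents 2 and 3 are eliminated by a metadata comparison alone, with no index involved.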
  • 25. MariaDB Big Data Solution MariaDB AX and MariaDB ColumnStore
  • 26. MariaDB AX Analytics - simple, fast, scalable… and open source
  • 27. MariaDB ColumnStore
    High performance columnar storage engine that supports a wide variety of analytical use cases in highly scalable distributed environments
    •  Faster, More Efficient Queries: parallel query processing for distributed environments
    •  Easier Enterprise Analytics: single interface for OLTP and analytics; easy to manage and scale
    •  Better Price Performance: power of SQL and freedom of open source for big data analytics
  • 28. MariaDB AX
    Components: MariaDB Server, MariaDB MaxScale, MariaDB ColumnStore
    ColumnStore features: parallel queries, distributed storage, no indexes, automatic partitioning, read optimized, high compression, low disk I/O
    Architecture diagram: MariaDB MaxScale in front of MariaDB Server nodes with ColumnStore (UM, User Module), which sit on top of ColumnStore Storage nodes (PM, Performance Module)
  • 29. Better Price Performance
    No need to maintain a third platform
    •  Run analytics from the same SQL front end
    •  No need to update application code
    •  Leverage MariaDB extensible architecture
    Flexible deployment options
    •  Cloud and on-premise
    •  Run on commodity hardware
    •  Open source, subscription-based pricing
    High data compression
    •  More efficient at storing big data
    •  Less hardware
    Customers have saved by going to MariaDB AX against Oracle (HealthCare), MemSQL (Auto-parts), Vertica (Finance, SEO Marketing): Come see them at M18!
    Chart: Commercial Data Warehouse vs. MariaDB ColumnStore, 90.3% less per TB per year
  • 30. Easier Enterprise Analytics
    Full ANSI SQL
    •  No more SQL-“like” queries
    •  Supports complex joins, aggregation and window functions
    Single SQL front end
    •  Use a single SQL interface for analytics and OLTP
    •  Leverage MariaDB security features: encryption for data in motion, role-based access and auditing
    Easy to manage and scale
    •  Eliminates the need for indexes and views
    •  Automated horizontal/vertical partitioning
    •  Scales linearly by adding new nodes as data grows
    •  Out-of-the-box connection with BI tools
  • 31. Faster, More Efficient Queries
    Parallel distributed query execution
    •  Distributes queries into a series of parallel operations
    •  Fully parallel high-speed data ingestion
       –  TPCH lineitem table - 750K to 1 million rows per min
    Optimized for columnar storage
    •  Columnar storage reduces disk I/O
    •  Blazing fast on read-intensive workloads
    •  Ultra fast data import
    Highly available analytic environment
    •  Built-in redundancy
    •  Automatic fail-over
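One common way to split a query into parallel operations is partial aggregation: each worker aggregates its own partition, and the partial results are merged at the end. The sketch below illustrates that general pattern for a `sum(Quantity) ... GROUP BY Item` query. It is an assumption-laden toy, not a description of how ColumnStore schedules work: threads stand in for nodes, and the sample data and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (Item, Quantity) pairs, split across two partitions.
PARTITIONS = [
    [("Laptop", 5), ("Monitor", 5)],
    [("Mouse", 1), ("Laptop", 3)],
]


def partial_sum(partition):
    """Aggregate one partition independently of the others."""
    totals = Counter()
    for item, quantity in partition:
        totals[item] += quantity
    return totals


def parallel_group_sum(partitions):
    """Run partial_sum on every partition in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_sum, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts per key
    return dict(merged)


print(parallel_group_sum(PARTITIONS))
```

Because sums can be merged in any order, partitions can be added as data grows without changing the merge step.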
  • 33. Healthcare / Life Science Industry
    Genome analysis
    •  In-depth genome research for the dairy industry to improve production of milk and protein
    •  Fast data load for large genome datasets (DNA data for 7 billion cows in US - 20GB per load)
    Healthcare spending analysis
    •  Analyze 3TB of US health care spending for 155 conditions with 7 years of historical data
    •  Used Sankey diagrams, treemaps, and pyramid charts to analyze trends by age, sex, type of care, and condition
    Why MariaDB ColumnStore
    •  Strong security features including role-based data access and audit plugin
    •  MPP architecture handles analytics on big data at high speed
    •  Easy to analyze archived data with SQL-based analytics
    •  Does not require a DBA to index or partition data
  • 34. Telecommunication Industry
    Customer behavior analysis
    •  Analyze call data records to segment customers based on their behavior
    •  Data-driven analysis for customer satisfaction
    •  Create behavior-based upsell or cross-sell opportunities
    Call data analysis
    •  Data size: 6TB
    •  Ingest 1.5 billion rows of logs per day with 30 million texts and 3 million calls
    •  Call and network quality analysis
    •  Provide higher quality customer services based on data
    Why MariaDB ColumnStore
    •  ColumnStore supports time-based partitioning and time-series analysis
    •  Fast data load for real-time analytics
    •  MPP architecture handles analytics on big data at high speed
    •  Easy to analyze archived data with SQL-based analytics
  • 35. In Conclusion
    •  Analytics requires a different technology to be able to cope with
       –  Different types of data
       –  Different types of data access
    •  OLTP databases have different requirements compared to analytics
    •  Column-based storage allows high compression
    •  Metadata can replace indexing
    •  Distributed processing allows for performance and scalability
    •  MariaDB ColumnStore implements a fast and efficient distributed database for analytics
    •  MariaDB AX is the subscription for professional use of MariaDB ColumnStore
    •  MariaDB ColumnStore is gaining wide acceptance
  • 37. MariaDB AX Use Cases
  • 38. IHME - Institute of Health Metrics and Evaluation
    IHME Visualizations library: http://www.healthdata.org/results/data-visualizations
    Started with 4.2 TB, with goal to go to 30TB of data
  • 39. Customer Use Case - 1
    Industry: healthcare (Medicaid)
    Data: surveys
    Use case: decision support system
    Details:
    •  Identify trends and patterns
    •  Determine population cohorts
    •  Predict health outcomes
    •  Anticipate funding / capacity
    •  Recommend intervention
    Can’t do complex queries on current hardware with Oracle and snowflake schemas
    Limited to optimizing for simple, known queries (2-3 columns)
    Replaced with ColumnStore
    > a single table
    > 2.5 million rows, 248 columns
    > complex, ad-hoc queries
    > query 20+ columns in seconds
  • 40. Customer Use Case - 2
    Industry: biotechnology (genetics)
    Data: genotypes
    Use case: genetic profiling
    Details:
    •  Find genetic mates (beef and dairy)
    •  Predict meat production (pork)
    •  Gene/DNA analysis
    Had to convert to CSV files and schedule import jobs (cron)
    Always receiving new genetic data
    Migrated to data adapter (Python)
    > streamline import process
    > remove steps / possible error
    > remove delays
    > import data on demand
    > immediate customer access
  • 41. Customer Use Case - 3
    Industry: mobile text/call app
    Data: call and text logs
    Use case: mobile app use analytics
    Details:
    •  30 million texts and 3 million phone calls per day
    •  1.5 billion rows of logs per day
    •  The text and call volume will continue to grow
    InnoDB backend hit its scale limit at 6TB and required a lot of performance tuning and index management
    Migrated to MariaDB AX
    > Able to process 24 months (24TB) vs. the 6-month limitation of InnoDB
    > Same BI tools and client applications worked with MariaDB AX seamlessly
  • 42. MariaDB AX Analytics made easy – simple, fast, scalable…