Can data virtualization uphold performance with complex queries?

WEBINAR SERIES
Paul Moxon
SVP Data Architectures & Chief Evangelist
Denodo
2nd April 2020

Speakers
Paul Moxon
SVP Data Architectures & Chief Evangelist, Denodo

Agenda
1. Origins of the Performance Myth
2. Just the Facts, Ma’am
3. The Proof is in the Pudding
4. Q&A
5. Next Steps

Myth #1:
Data virtualization can’t perform with large data sets and complex queries.

Origins of the Performance Myth

Early ‘Federation Servers’ Had Poor Performance
Data Federation Servers didn’t live up to their hype
• Early forms of Data Virtualization were Data Federation Servers
• e.g. IBM InfoSphere Federation Server
• They had limited connectivity and limited query processing
• Couldn’t handle complex queries or relied on retrieving all data for processing
• Sometimes mistakenly positioned as an alternative to a Data Warehouse
• Performance comparisons were not favorable
• As a result, Data Federation got a bad name
• ‘Data Federation’ is now used as a pejorative in comparisons with Data Virtualization

Poor Performance Compared to What?
What are you comparing Data Virtualization performance against?
• Comparing against a Data Warehouse?
• This assumes that all of the data is in the Data Warehouse…is that the case?
• Did you take into account the time, cost, and latency introduced by copying all of the data into the Data Warehouse?
• Comparing against hand-coded applications? Or BI Tools (AKA ‘data blending’)?
• Sometimes just a lack of understanding of enterprise Data Virtualization technology
• Assuming that Data Virtualization is ‘naïve federation’

Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
• Extensive testing using queries from the standard test TPC-DS*.
• Compare the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL.
[Diagram: both sides of the comparison use the same three tables: Sales Facts (290 M rows), Customer Dim. (2 M rows), Items Dim. (400 K rows)]
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions, including Big Data systems.

Performance Comparison Results
Logical Data Warehouse vs. Physical Data Warehouse
Query Description | Returned Rows | Time Netezza | Time Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On the fly data movement
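The gap between the two systems is easier to judge as a relative overhead. The short sketch below only redoes the arithmetic on the timings reported in the results table; the function and variable names are illustrative and not part of any Denodo tooling.

```python
# Timings in seconds, copied from the results table: (query, Netezza, Denodo federated).
results = [
    ("Total sales by customer", 20.9, 21.4),
    ("Total sales by customer and year between 2000 and 2004", 52.3, 59.0),
    ("Total sales by item brand", 4.7, 5.0),
    ("Total sales by item where sale price less than current list price", 3.5, 5.2),
]


def overhead_pct(netezza_s: float, denodo_s: float) -> float:
    """Extra time of the federated run relative to the single-system run, in percent."""
    return (denodo_s - netezza_s) / netezza_s * 100.0


# Relative overhead per query, keyed by the query description.
overheads = {query: overhead_pct(n, d) for query, n, d in results}

for query, netezza_s, denodo_s in results:
    extra_s = denodo_s - netezza_s
    print(f"{query}: +{extra_s:.1f} s ({overheads[query]:.1f}% slower)")
```

In absolute terms the federated runs are between 0.3 and 6.7 seconds slower than the single MPP system in these four queries.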

Denodo Platform – Layers of Performance Optimization
Four layers of performance optimization
1. Query Rewriting and Delegation
• Push processing to where the data lives, minimize the amount of data going through the network
• Automatic, but with controls
2. MPP Query Acceleration
• Offload processing to co-located MPP cluster
3. Caching
• Caching data in a local cache for performance improvement
4. Throttling and Controlling Data Access
• Managing the workload on the Data Sources
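As an illustration of the caching layer, here is a minimal sketch of a time-limited result cache placed in front of a slow source. It is not Denodo's cache implementation: `TTLCache`, `fetch`, and `ttl_seconds` are hypothetical names chosen for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Hypothetical result cache: reuse a query result until it is ttl_seconds old."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # function that runs the query on the slow source
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, query: str) -> Any:
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh enough: the source is not touched
        result = self._fetch(query)  # missing or expired: go to the source
        self._entries[query] = (now, result)
        return result
```

The trade-off is the usual one for caching: repeated queries avoid the slow source, at the price of results that can be up to `ttl_seconds` old.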

Denodo Platform – Query Optimization Pipeline
Query Parsing
• Maps query entities (tables, fields) to actual metadata
• Retrieves execution capabilities and restrictions for views involved in the query
Static Optimizer
• Query delegation
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Data movement query plans
Dynamic Optimizer
• Classic cost-based optimization using data distribution statistics, indexes, transfer rates, etc., generating query plans and selecting the best plan
• Picks optimal JOIN methods and orders based on statistics
Execution
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
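The cost-based step of the pipeline can be pictured as scoring candidate plans and keeping the cheapest. The sketch below is a toy model under loud assumptions: the cost formula, the `Plan` fields, and the source processing times are invented for illustration, and only the row counts echo the figures used elsewhere in this deck. It is not Denodo's optimizer.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Plan:
    name: str
    rows_transferred: int   # estimated rows crossing the network
    source_seconds: float   # estimated processing time inside the sources (made up)


def estimated_cost(plan: Plan, rows_per_second: float) -> float:
    """Toy cost: time spent in the sources plus time to move rows over the network."""
    return plan.source_seconds + plan.rows_transferred / rows_per_second


def choose_plan(plans: Iterable[Plan], rows_per_second: float) -> Plan:
    """Pick the plan with the lowest estimated cost."""
    return min(plans, key=lambda p: estimated_cost(p, rows_per_second))


candidates = [
    # Pull 290 M sales rows and 2 M customer rows, then join and aggregate centrally.
    Plan("simple federation", rows_transferred=292_000_000, source_seconds=0.0),
    # Aggregate at the source first, so about 2 M rows per side are moved instead.
    Plan("aggregation push-down", rows_transferred=4_000_000, source_seconds=20.0),
]

best = choose_plan(candidates, rows_per_second=1_000_000)
print(best.name)
```

With a hypothetical transfer rate of one million rows per second the push-down plan wins; with an unrealistically fast network the ranking flips, which is why the decision depends on statistics and transfer rates rather than on fixed rules.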

Static vs. Dynamic Optimization
• Static optimization:
• Based on SQL transformations.
• Rewrite the query into a more efficient equivalent form.
• Remove redundancies, inactive sub-trees, etc.
• Push-down delegation:
• Optimize query by pushing down sub-trees to underlying data sources.
• Dynamic optimization:
• Use statistics and indices to estimate costs of alternative execution plans.
• Select Join methods and Join ordering.
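Push-down delegation is easiest to see by counting the rows that leave the sources. The sketch below uses two in-memory SQLite databases as stand-ins for two separate data sources; the table names, column names, and data are hypothetical, and the Python functions only mimic what a virtualization layer would do.

```python
import sqlite3
from collections import defaultdict


def make_sources():
    """Create two independent 'sources': one with sales, one with customers."""
    sales = sqlite3.connect(":memory:")
    sales.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
    sales.executemany("INSERT INTO sales VALUES (?, ?)",
                      [(i % 3, 10.0) for i in range(300)])
    customers = sqlite3.connect(":memory:")
    customers.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
    customers.executemany("INSERT INTO customer VALUES (?, ?)",
                          [(0, "Ann"), (1, "Bob"), (2, "Cy")])
    return sales, customers


def naive_federation(sales, customers):
    """Pull every row from both sources, then join and aggregate locally."""
    sales_rows = sales.execute("SELECT customer_id, amount FROM sales").fetchall()
    names = dict(customers.execute("SELECT customer_id, name FROM customer").fetchall())
    totals = defaultdict(float)
    for customer_id, amount in sales_rows:
        totals[names[customer_id]] += amount
    return dict(totals), len(sales_rows) + len(names)


def aggregation_push_down(sales, customers):
    """Delegate the GROUP BY to the sales source; only aggregated rows are moved."""
    partial = sales.execute(
        "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id").fetchall()
    names = dict(customers.execute("SELECT customer_id, name FROM customer").fetchall())
    totals = {names[customer_id]: total for customer_id, total in partial}
    return totals, len(partial) + len(names)


sales_db, customer_db = make_sources()
naive_totals, naive_rows = naive_federation(sales_db, customer_db)
pushed_totals, pushed_rows = aggregation_push_down(sales_db, customer_db)
print(naive_rows, pushed_rows)
```

Both functions return the same totals; the difference is how many rows cross the boundary between the sources and the layer that combines them.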

Denodo Platform – Query Optimization Techniques
• Advanced Query Optimization:
• Query Delegation.
• Cost and Source Constraint Based Query Plans.
• Automatic Query Rewriting.
• Join Optimizations.
• Data Movement.
• Asynchronous Multi-threaded Processing.
• Server Throttling Mechanisms.
• Linear Scalability.

Denodo Platform – MPP Query Acceleration
Utilizing the power of a co-located MPP engine
• Denodo Platform supports using an MPP cluster to accelerate queries
• Hive, Spark, Impala, Presto
• Operations that can be parallelized can be moved to the MPP cluster
• e.g. GROUP BY aggregations
• Data is copied to the cluster and the operation is delegated for processing
• Data is copied as Parquet files
• Results returned to Denodo Platform
• Does not require any special commands from the user
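GROUP BY aggregations parallelize well because each partition can be aggregated independently and the partial results merged afterwards. The sketch below shows that idea only: worker threads stand in for cluster nodes, the Parquet copy step is not reproduced, and all names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (group key, amount)


def partial_group_by(rows: List[Row]) -> Counter:
    """Aggregate one partition: sum of amount per key."""
    acc: Counter = Counter()
    for key, amount in rows:
        acc[key] += amount
    return acc


def parallel_group_by(rows: List[Row], workers: int = 4) -> Dict[str, int]:
    """Split rows into partitions, aggregate each in parallel, then merge the partials."""
    partitions = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_group_by, partitions))
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the per-key sums together
    return dict(merged)


rows = [("CA", 5), ("NY", 7), ("CA", 1), ("TX", 2), ("NY", 3)]
print(parallel_group_by(rows))
```

The merge step is cheap because it only sees one row per key per partition, which is the same property that makes partial aggregation push-down effective.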

Denodo MPP Query Acceleration
[Diagram: query plan. Sources: Current Sales (60 M rows), Hist. Sales (215 M rows), Customer (2 M rows), Date Dim (73K rows). Labels: Group by Customer key and Date key; 4.8M rows (sales by customer); join; group by State and Year]
1. Partial aggregation push-down
Maximizes source processing, dramatically reduces network traffic
2. Integrated with Cost Based Optimizer
Based on data volume estimation and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP
3. On-demand data transfer
Denodo automatically generates and uploads Parquet files
4. Integration with local data
The engine detects when data is cached or comes from a local table already in the MPP
5. Fast parallel execution
Support for Spark, Presto and Impala for fast analytical processing in inexpensive Hadoop-based solutions
System | Execution Time | Optimization Techniques
Others | ~ 39 min | Simple federation
No MPP | 3.4 min | Aggregation push-down
With MPP | 47 sec | Aggregation push-down + MPP integration (Impala 4 nodes)
Scenario 1 – Query Optimization
Same Store Sales by Year
Scenario:
• Current sales data (last 12 months) in EDW
• Historical data offloaded to Hadoop cluster for cheaper storage
• Store data is in the RDBMS
• Date dimension in EDW
Very large data volumes:
• Sales tables have hundreds of millions of rows
[Diagram: Current Sales (60 million rows) union Historical Sales (215 million rows); join with Store (401 rows, RDBMS); join with Date (73K rows, EDW); group by Store and Year]
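The logical query behind this scenario can be sketched in SQL. In the example below a single in-memory SQLite database stands in for the virtual layer, whereas in the scenario the tables live in three different systems. All table names, column names, and sample rows are hypothetical, and `UNION ALL` is an assumption about how the two sales tables are combined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE current_sales    (store_id INTEGER, date_id INTEGER, amount REAL);
    CREATE TABLE historical_sales (store_id INTEGER, date_id INTEGER, amount REAL);
    CREATE TABLE store            (store_id INTEGER, name TEXT);
    CREATE TABLE date_dim         (date_id INTEGER, year INTEGER);

    INSERT INTO date_dim VALUES (1, 2019), (2, 2020);
    INSERT INTO store VALUES (1, 'Downtown'), (2, 'Airport');
    INSERT INTO historical_sales VALUES (1, 1, 100.0), (2, 1, 50.0), (1, 1, 25.0);
    INSERT INTO current_sales VALUES (1, 2, 70.0), (2, 2, 30.0);
""")

# Union the two sales tables, join to both dimensions, then aggregate.
SAME_STORE_SALES_BY_YEAR = """
    SELECT s.name, d.year, SUM(x.amount) AS total_sales
    FROM (SELECT store_id, date_id, amount FROM current_sales
          UNION ALL
          SELECT store_id, date_id, amount FROM historical_sales) AS x
    JOIN store    AS s ON s.store_id = x.store_id
    JOIN date_dim AS d ON d.date_id  = x.date_id
    GROUP BY s.name, d.year
    ORDER BY s.name, d.year
"""

sales_by_store_year = conn.execute(SAME_STORE_SALES_BY_YEAR).fetchall()
for name, year, total in sales_by_store_year:
    print(name, year, total)
```

A user would write one query of this shape against the virtual views; how the union, joins, and aggregation are split across the EDW, the Hadoop cluster, and the RDBMS is left to the optimizer.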

Scenario 2 – MPP Query Acceleration
Average Customer Purchases by State and Year
Scenario:
• Current sales data (last 12 months) in EDW
• Historical data offloaded to Hadoop cluster for cheaper storage
• Customer data is in the RDBMS
• Date dimension in EDW
Very large data volumes:
• Sales tables have hundreds of millions of rows
[Diagram: Current Sales (60 million rows) union Historical Sales (215 million rows); join with Customer (2 million rows, RDBMS); join with Date (73K rows, EDW); group by State and Year]

Data Virtualization and Performance
Busting the myth
• Four layers of performance optimization
• Denodo Platform has a sophisticated query optimizer to process queries
• Uses advanced techniques to leverage the power of underlying data stores (when possible)
• Offload processing to MPP engine
• Take advantage of the power of an MPP cluster for heavy-duty processing
• Caching to speed up slower data sources
• Resource Manager to optimize queries with strict SLAs
• Performance is comparable to accessing data in a single data store
• Large data sets…complex queries…performance is still excellent


Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
www.denodo.com/TestDrive
GET STARTED TODAY

Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.
