WEBINAR SERIES
Can data virtualization uphold performance with complex queries?
Paul Moxon
SVP Data Architectures & Chief Evangelist
Denodo
2nd April 2020
Speakers
Paul Moxon, SVP Data Architectures & Chief Evangelist, Denodo
Agenda
1. Origins of the Performance Myth
2. Just the Facts, Ma’am
3. The Proof is in the Pudding
4. Q&A
5. Next Steps
Myth #1: Data virtualization can’t perform with large data sets and complex queries.
Origins of the Performance Myth
Early ‘Federation Servers’ Had Poor Performance
Data Federation Servers didn’t live up to their hype
• Early forms of Data Virtualization were Data Federation Servers
• e.g. IBM InfoSphere Federation Server
• They had limited connectivity and limited query processing
• Couldn’t handle complex queries or relied on retrieving all data for processing
• Sometimes mistakenly positioned as an alternative to a Data Warehouse
• Performance comparisons were not favorable
• As a result, Data Federation got a bad name
• Data Federation is used as a pejorative comparison to Data Virtualization
Poor Performance Compared to What?
What are you comparing Data Virtualization performance against?
• Comparing against a Data Warehouse?
• This assumes that all of the data is in the Data Warehouse…is that the case?
• Did you take into account the time, cost, and latency introduced by copying all of the data into the Data Warehouse?
• Comparing against hand-coded applications? Or BI Tools (AKA ‘data blending’)?
• Sometimes just a lack of understanding of enterprise Data Virtualization technology
• Assuming that Data Virtualization is ‘naïve federation’
Just the Facts, Ma’am
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
• Extensive testing using queries from the TPC-DS* benchmark.
• Compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL.
[Diagram: the same three tables appear on each side of the comparison: Customer Dim. (2 M rows), Sales Facts (290 M rows), Items Dim. (400 K rows)]
* TPC-DS is the de facto industry-standard benchmark for measuring the performance of decision support solutions, including Big Data systems.
Performance Comparison Results
Logical Data Warehouse vs. Physical Data Warehouse
| Query Description | Returned Rows | Time Netezza | Time Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected) |
| --- | --- | --- | --- | --- |
| Total sales by customer | 1.99 M | 20.9 sec | 21.4 sec | Full aggregation push-down |
| Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec | 59.0 sec | Full aggregation push-down |
| Total sales by item brand | 31.35 K | 4.7 sec | 5.0 sec | Partial aggregation push-down |
| Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec | 5.2 sec | On the fly data movement |
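As a rough reading of the results above, the relative overhead of the federated run can be worked out from the reported timings. The sketch below is illustrative arithmetic only: the function name `overhead_pct` and the dictionary layout are assumptions made for this example and are not part of the original benchmark.

```python
# Timings in seconds, copied from the results table: (Netezza, Denodo federated).
results = {
    "Total sales by customer": (20.9, 21.4),
    "Total sales by customer and year between 2000 and 2004": (52.3, 59.0),
    "Total sales by item brand": (4.7, 5.0),
    "Total sales by item where sale price less than current list price": (3.5, 5.2),
}


def overhead_pct(netezza_s, denodo_s):
    """Extra time of the federated run, as a percentage of the Netezza time."""
    return (denodo_s - netezza_s) / netezza_s * 100.0


# Hypothetical helper structure: overhead per query, rounded to one decimal.
overheads = {q: round(overhead_pct(n, d), 1) for q, (n, d) in results.items()}

for query, pct in overheads.items():
    print(f"{query}: +{pct}%")
```

On these figures the overhead ranges from about 2% to about 49%, and the largest absolute difference is 6.7 seconds.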
Denodo Platform – Layers of Performance Optimization
Four layers of performance optimization
1. Query Rewriting and Delegation
• Push processing to where the data lives, minimize the amount of data going through the network
• Automatic, but with controls
2. MPP Query Acceleration
• Offload processing to co-located MPP cluster
3. Caching
• Caching data in a local cache for performance improvement
4. Throttling and Controlling Data Access
• Managing the workload on the Data Sources
Denodo Platform – Query Optimization Pipeline
Query Parsing
• Maps query entities (tables, fields) to actual metadata
• Retrieves execution capabilities and restrictions for views involved in the query
Static Optimizer
• Query delegation
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Data movement query plans
Dynamic Optimizer
• Classic cost-based optimization using data distribution statistics, indexes, transfer rates, etc., generating query plans and selecting the best plan
• Picks optimal JOIN methods and orders based on statistics
Execution
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
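The cost-based step in this pipeline can be illustrated with a deliberately simplified model. The sketch below assumes that cost is just the number of rows crossing the network to the virtualization layer; the `Plan` class, the `cheapest` function, and the row estimates (taken from the benchmark table sizes) are hypothetical and do not describe the actual Denodo optimizer, which per the slide also weighs statistics, indexes, and transfer rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    rows_transferred: int  # assumed cost metric: rows moved over the network


def cheapest(plans):
    """Select the candidate plan with the lowest estimated cost."""
    return min(plans, key=lambda p: p.rows_transferred)


SALES_ROWS = 290_000_000   # Sales Facts
CUSTOMER_ROWS = 2_000_000  # Customer Dim.

plans = [
    # Pull every sales row and every customer row, then join and aggregate centrally.
    Plan("naive federation", SALES_ROWS + CUSTOMER_ROWS),
    # Aggregate sales per customer at the source (at most one row per customer),
    # then join the result with the customer rows.
    Plan("aggregation push-down", CUSTOMER_ROWS + CUSTOMER_ROWS),
]

best = cheapest(plans)
print(best.name, best.rows_transferred)
```

Under this toy model the push-down plan moves roughly 4 million rows instead of 292 million, which is why it wins.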
Static vs. Dynamic Optimization
• Static optimization:
• Based on SQL transformations.
• Rewrite the query in a more efficient form.
• Remove redundancies, inactive sub-trees, etc.
• Push-down delegation:
• Optimize query by pushing down sub-trees to underlying data sources.
• Dynamic optimization:
• Use statistics and indices to estimate costs of alternative execution plans.
• Select Join methods and Join ordering.
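Push-down delegation can be sketched with two in-memory SQLite databases standing in for two remote sources. This is a minimal illustration under stated assumptions: the table names, sample rows, and the functions `naive_federation` and `aggregation_push_down` are invented for this example and are not Denodo APIs.

```python
import sqlite3
from collections import defaultdict

# Two separate in-memory databases stand in for two remote data sources.
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
sales_db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5), (2, 1.0), (3, 4.0)],
)

customer_db = sqlite3.connect(":memory:")
customer_db.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
customer_db.executemany(
    "INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cy")]
)
names = dict(customer_db.execute("SELECT customer_id, name FROM customer"))


def naive_federation():
    """Fetch every sales row, then aggregate in the middle tier."""
    rows = sales_db.execute("SELECT customer_id, amount FROM sales").fetchall()
    totals = defaultdict(float)
    for cid, amount in rows:
        totals[cid] += amount
    return {names[c]: t for c, t in totals.items()}, len(rows)


def aggregation_push_down():
    """Delegate the GROUP BY to the source; fetch one row per customer."""
    rows = sales_db.execute(
        "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
    ).fetchall()
    return {names[c]: t for c, t in rows}, len(rows)


naive_totals, naive_rows = naive_federation()
pushed_totals, pushed_rows = aggregation_push_down()
print(naive_totals == pushed_totals, naive_rows, pushed_rows)  # True 6 3
```

Both strategies return the same totals, but the pushed-down query moves three rows from the sales source instead of six; with hundreds of millions of fact rows the same ratio is what reduces network traffic.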
Denodo Platform – Query Optimization Techniques
• Advanced Query Optimization:
• Query Delegation.
• Cost and Source Constraint Based Query Plans.
• Automatic Query Rewriting.
• Join Optimizations.
• Data Movement.
• Asynchronous Multi-threaded Processing.
• Server Throttling Mechanisms.
• Linear Scalability.
Denodo Platform – MPP Query Acceleration
Utilizing the power of a co-located MPP engine
• Denodo Platform supports using an MPP cluster to accelerate queries
• Hive, Spark, Impala, Presto
• Operations that can be parallelized can be moved to the MPP cluster
• e.g. GROUP BY aggregations
• Data is copied to the cluster and the operation is delegated for processing
• Data is copied in Parquet format
• Results are returned to Denodo Platform
• Does not require any special commands from the user
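The reason GROUP BY aggregations parallelize well can be shown in a few lines: each worker aggregates its own partition, and the partial results are merged at the end. The sketch below uses a local thread pool purely to show that structure; a real MPP engine distributes partitions across cluster nodes, and the names `partial_group_by` and `parallel_group_by` are assumptions for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_group_by(chunk):
    """Sum amounts per key for one partition of the data."""
    sums = Counter()
    for key, amount in chunk:
        sums[key] += amount
    return sums


def parallel_group_by(rows, workers=4):
    """Partition the rows, aggregate each partition, then merge the partials."""
    chunks = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_group_by, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return dict(total)


rows = [("CA", 10), ("NY", 5), ("CA", 7), ("TX", 3), ("NY", 1), ("CA", 2)]
result = parallel_group_by(rows)
print(result)
```

The merge step is cheap because each partial result holds at most one entry per group, regardless of how many input rows the partition contained.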
Denodo MPP Query Acceleration
[Diagram labels: Current Sales (60 M rows); Hist. Sales (215 M rows); Customer (2 M rows); Date Dim (73K rows); group by Customer key and Date key; 4.8M rows (sales by customer); join; group by State and Year]
1. Partial aggregation push-down
Maximizes source processing; dramatically reduces network traffic
2. Integrated with Cost Based Optimizer
Based on data volume estimation and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP
3. On-demand data transfer
Denodo automatically generates and uploads Parquet files
4. Integration with local data
The engine detects when data is cached or comes from a local table already in the MPP
5. Fast parallel execution
Support for Spark, Presto and Impala for fast analytical processing in inexpensive Hadoop-based solutions

| System | Execution Time | Optimization Techniques |
| --- | --- | --- |
| Others | ~ 39 min | Simple federation |
| No MPP | 3.4 min | Aggregation push-down |
| With MPP | 47 sec | Aggregation push-down + MPP integration (Impala 4 nodes) |
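The execution times reported on this slide translate into speedup factors as follows. This is plain arithmetic on the published figures; note that the "Others" time is given only approximately (~ 39 min), so the derived ratios are approximate too. The variable names are illustrative.

```python
# Execution times from the slide, converted to seconds.
simple_federation_s = 39 * 60   # "Others", approximately 39 min
push_down_s = 3.4 * 60          # "No MPP", 3.4 min
push_down_mpp_s = 47            # "With MPP", 47 sec

speedup_push_down = round(simple_federation_s / push_down_s, 1)
speedup_mpp_over_push_down = round(push_down_s / push_down_mpp_s, 1)
speedup_overall = round(simple_federation_s / push_down_mpp_s, 1)

print(speedup_push_down, speedup_mpp_over_push_down, speedup_overall)
```

On these numbers, aggregation push-down alone is roughly 11 times faster than simple federation, adding the MPP gives roughly another factor of 4, and the combined improvement is close to 50 times.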
The Proof is in the Pudding
Scenario 1 – Query Optimization
Same Store Sales by Year
Scenario:
• Current sales data (last 12 months) in EDW
• Historical data offloaded to Hadoop cluster for cheaper storage
• Store data is in the RDBMS
• Date dimension in EDW
Very large data volumes:
• Sales tables have hundreds of millions of rows
[Diagram: query plan. Current Sales (60 million rows) union Historical Sales (215 million rows), joined with Store (401 rows, RDBMS) and Date (73K rows, EDW), grouped by Store and Year]
Scenario 2 – MPP Query Acceleration
Average Customer Purchases by State and Year
Scenario:
• Current sales data (last 12 months) in EDW
• Historical data offloaded to Hadoop cluster for cheaper storage
• Customer data is in the RDBMS
• Date dimension in EDW
Very large data volumes:
• Sales tables have hundreds of millions of rows
[Diagram: query plan. Current Sales (60 million rows) union Historical Sales (215 million rows), joined with Customer (2 million rows, RDBMS) and Date (73K rows, EDW), grouped by State and Year]
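One detail worth spelling out for this scenario is that an average cannot be pushed below a union as-is: averaging the per-source averages gives the wrong answer when the sources hold different numbers of rows. A common approach, sketched here as an assumption rather than as a description of Denodo internals, is to push down SUM and COUNT to each source and divide only after merging. The function names and sample rows are hypothetical.

```python
def partial_avg_state(rows):
    """Per-source partial state: {(state, year): [sum, count]}."""
    state = {}
    for us_state, year, amount in rows:
        acc = state.setdefault((us_state, year), [0.0, 0])
        acc[0] += amount
        acc[1] += 1
    return state


def merge_avg(*partials):
    """Combine the partial sums and counts, then compute the final average."""
    merged = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            acc = merged.setdefault(key, [0.0, 0])
            acc[0] += total
            acc[1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Tiny stand-ins for the two sales sources: (state, year, purchase amount).
current = [("CA", 2020, 30.0), ("CA", 2020, 10.0)]                         # EDW
historical = [("CA", 2020, 50.0), ("NY", 2019, 8.0), ("NY", 2019, 4.0)]    # Hadoop

averages = merge_avg(partial_avg_state(current), partial_avg_state(historical))
print(averages)
```

For CA in 2020 the correct average is 90 / 3 = 30, whereas averaging the two per-source averages (20 and 50) would give 35.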
Summary & Conclusions
Data Virtualization and Performance
Busting the myth
• Four layers of performance optimization
• Denodo Platform has a sophisticated query optimizer to process queries
• Uses advanced techniques to leverage power of underlying data stores (when possible)
• Offload processing to MPP engine
• Take advantage of power of MPP cluster for heavy duty processing
• Caching to speed up slower data sources
• Resource Manager to optimize queries with strict SLAs
• Performance is comparable to accessing data in single data store
• Large data sets…complex queries…performance is still excellent
Myth #1: Data virtualization can’t perform with large data sets and complex queries.
Q&A
Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
www.denodo.com/TestDrive
GET STARTED TODAY
Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.
