#FastDataStrategy
Redefine Analytics with In-Memory Parallel
Processing and Data Virtualization
Pablo Alvarez-Yañez
Product Manager, Denodo
Agenda
1. Modern Data Architectures
2. Denodo Platform – Big Data Integrations
3. Demo
4. Putting This All Together
5. Next Steps
The Modern Data Architecture
Organizations are Storing More and More Data…
… That Needs to be Stored and Processed
Data Lake – The Concept
Source: https://resources.zaloni.com/blog/what-is-a-data-lake
Data Lake – The Challenges
"However, getting value out of the data remains the
responsibility of the business end user. (…) Without at least
some semblance of information governance, the lake will end
up being a collection of disconnected data pools or
information silos all in one place."
Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users who have previously found value in using the same data in the lake.
Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents. Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure.
Sources: https://www.forbes.com/sites/danwoods/2016/08/26/why-data-lakes-are-evil/ and https://www.gartner.com/newsroom/id/2809117
Data Lake – The Concept
Logical Data Lake with Data Virtualization
Denodo Platform and Big Data Integrations
Hadoop as a Data Source
Denodo offers native connectors for all the major
SQL-on-Hadoop engines:
▪ Hive
▪ Impala
▪ SparkSQL
▪ Presto
In addition, Denodo offers connectivity for HBase and direct HDFS access to a variety of file formats
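Denodo's connectors are configured inside the platform itself, but as a rough illustration of what a direct SQL-on-Hadoop query looks like, here is a minimal Python sketch using the PyHive client against a Hive endpoint; the host, port, database, and table names are hypothetical:

```python
# Minimal sketch: querying a SQL-on-Hadoop engine (Hive) directly with PyHive.
# Host, port, database, and table names are hypothetical; Denodo's native
# connectors encapsulate this kind of connection behind a virtual view.
from pyhive import hive

conn = hive.Connection(host="hadoop-edge-node", port=10000, database="sales_db")
cursor = conn.cursor()
cursor.execute("SELECT zip, SUM(amount) FROM historical_sales GROUP BY zip")
for zip_code, total in cursor.fetchall():
    print(zip_code, total)
```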
Hadoop as Cache
Denodo uses an external RDBMS of your choice
to persist copies of the result sets to improve
execution times
• Since the data is persisted in an RDBMS, Denodo can push down relational operations, like JOINs with other tables, to the database used as the cache
SQL-on-Hadoop systems can also be used as Denodo's cache
Cache load process, based on direct load to HDFS (see the sketch below):
1. Creation of the target table in the cache system
2. Generation of Parquet files (in chunks) with Snappy compression on the local machine
3. Parallel upload of the Parquet files to HDFS
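As a minimal sketch of steps 2 and 3 (Denodo performs all of this internally), the following Python uses pyarrow to write Snappy-compressed Parquet chunks locally and upload them to HDFS in parallel. The HDFS endpoint, paths, and chunk size are assumptions; step 1 would be a DDL statement issued against the cache system:

```python
# Illustrative sketch of the cache load, not Denodo's internal code.
# Assumes pyarrow with libhdfs available and a Hive-compatible cache target.
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

def load_cache(result_set: pa.Table, hdfs_dir: str, chunk_rows: int = 1_000_000) -> None:
    # Step 1 (not shown): hypothetical DDL against the cache system, e.g.
    #   CREATE EXTERNAL TABLE ... STORED AS PARQUET LOCATION '<hdfs_dir>'
    hdfs = fs.HadoopFileSystem(host="namenode", port=8020)  # hypothetical endpoint

    # Step 2: generate Parquet files in chunks, Snappy-compressed, locally.
    local_files = []
    for i, batch in enumerate(result_set.to_batches(max_chunksize=chunk_rows)):
        path = f"/tmp/cache_chunk_{i}.parquet"
        pq.write_table(pa.Table.from_batches([batch]), path, compression="snappy")
        local_files.append(path)

    # Step 3: upload the Parquet files to HDFS in parallel.
    def upload(path: str) -> None:
        target = f"{hdfs_dir}/{path.rsplit('/', 1)[-1]}"
        with open(path, "rb") as src, hdfs.open_output_stream(target) as dst:
            dst.write(src.read())

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(upload, local_files))
```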
Hadoop as Processing Engine
Denodo's optimizer provides native integration with MPP systems to deliver one extra key capability: query acceleration
Denodo can move processing to the MPP on demand, during the execution of a query
• Parallel power for calculations in the virtual layer
• Avoids slow on-disk processing when intermediate buffers don't fit into Denodo's memory (swapped data)
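Conceptually, offloading is a cost trade-off between swapping buffers to disk in the virtual layer and shipping rows to the cluster. The following is a toy sketch with invented cost constants, not Denodo's actual optimizer logic:

```python
# Toy model of the offload decision; the cost constants are invented for
# illustration and do not reflect Denodo's real cost-based optimizer.
def should_offload_to_mpp(rows: int,
                          fits_in_memory: bool,
                          ship_cost_per_row: float = 1e-6,
                          swap_cost_per_row: float = 1e-4,
                          mpp_startup_cost: float = 5.0) -> bool:
    """Offload when swapping to disk in the virtual layer would cost more
    than transferring the rows to the MPP and processing them in parallel."""
    if fits_in_memory:
        return False  # in-memory processing in the virtual layer is cheap
    ship_cost = rows * ship_cost_per_row + mpp_startup_cost
    swap_cost = rows * swap_cost_per_row
    return ship_cost < swap_cost
```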
Combining Denodo’s Optimizer with a Hadoop MPP
Denodo provides the most advanced optimizer in the
market, with techniques focused on data virtualization
scenarios with large data volumes
In addition to traditional cost-based optimization (CBO), Denodo's optimizer applies innovative optimization strategies, designed specifically for virtualized scenarios, that go beyond traditional RDBMS optimizations.
Combined with its tight integration with SQL-on-Hadoop MPP databases, this makes a very powerful combination
Example: Scenario
Evolution of sales per ZIP code over
the previous years.
Scenario:
▪ Current data (last 12 months) in EDW
▪ Historical data offloaded to Hadoop cluster for
cheaper storage
▪ Customer master data is used often, so it is
cached in the Hadoop cluster
Very large data volumes:
▪ Sales tables have hundreds of millions of rows
(Execution tree: the union of Current Sales, 100 million rows, and Historical Sales, 300 million rows, is joined with Customer, 2 million rows, cached, and grouped by ZIP.)
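Expressed as SQL, the logical query behind this execution tree might look like the sketch below; the schema and column names are hypothetical:

```python
# Hypothetical SQL for the scenario's execution tree (names are illustrative).
# Current sales live in the EDW; historical sales are offloaded to Hadoop;
# the customer master is cached in the cluster.
QUERY = """
SELECT c.zip, SUM(s.amount) AS total_sales
FROM (
    SELECT customer_id, amount FROM edw.current_sales        -- 100 M rows
    UNION ALL
    SELECT customer_id, amount FROM hadoop.historical_sales  -- 300 M rows
) AS s
JOIN customer AS c ON c.customer_id = s.customer_id          -- 2 M rows (cached)
GROUP BY c.zip
"""
```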
Example: What are the options?
1. Option A: Simple Federation in Virtual Layer
▪ Move hundreds of millions of rows for processing in the virtual layer
2. Option B: Data Shipping
▪ Move “Current sales” to Hadoop and process content in the cluster
▪ Moves 100 million rows
3. Option C: Partial Aggregation Pushdown (Denodo 6)
▪ Modifies the execution tree to split the aggregation in two steps (see the sketch after this list):
1. first by Customer ID, for the JOIN (pushed down to the sources)
2. second by ZIP, for the final results (in the virtual layer)
▪ Significantly reduces network traffic, but processing a large amount of data in the virtual layer (the aggregation by ZIP) becomes the bottleneck
4. Denodo’s MPP Integration (Denodo 7 – next slide)
(Diagram: execution trees for the Simple Federation and Data Shipping options, showing where the join, group by ID, and group by ZIP operations run.)
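As a hedged sketch of what Option C's rewrite amounts to (the names are illustrative, not Denodo's generated statements):

```python
# Hypothetical two-step rewrite produced by partial aggregation pushdown.
# Step 1 runs at each source (EDW and Hadoop); step 2 runs in the virtual layer.
PUSHED_TO_EACH_SOURCE = """
SELECT customer_id, SUM(amount) AS partial_sum   -- step 1: aggregation by Customer ID
FROM sales
GROUP BY customer_id
"""

FINAL_IN_VIRTUAL_LAYER = """
SELECT c.zip, SUM(p.partial_sum) AS total_sales  -- step 2: final aggregation by ZIP
FROM partials AS p           -- union of the partial results from both sources
JOIN customer AS c ON c.customer_id = p.customer_id
GROUP BY c.zip
"""
```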
Example: Denodo’s Integration with the Hadoop Ecosystem
System   | Execution Time | Optimization Techniques
Others   | ~10 min        | Simple federation
No MPP   | 43 sec         | Aggregation push-down
With MPP | 11 sec         | Aggregation push-down + MPP integration (Impala, 8 nodes)

1. Partial aggregation push-down: maximizes source processing and dramatically reduces network traffic (only 2 M pre-aggregated rows of sales by customer are transferred)
2. Integration with the cost-based optimizer: based on data volume estimates and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP
3. On-demand data transfer: Denodo automatically generates and uploads Parquet files
4. Integration with local and pre-cached data: the engine detects when data is cached or already resides in a local table in the MPP
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based infrastructure

(Execution tree: Current Sales, 100 M rows, and Historical Sales, 300 M rows, are grouped by customer ID at the sources; the resulting 2 M rows of sales by customer are joined with the cached Customer table, 2 M rows, and grouped by ZIP in the MPP.)
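To make technique 3 concrete: after the partial aggregates are uploaded as Parquet, the MPP-side work might resemble the following; the DDL, paths, and table names are hypothetical, not Denodo's generated statements:

```python
# Hypothetical MPP-side steps after Denodo uploads the Parquet chunks
# (illustrative Impala/Hive-style SQL; names and location are assumptions).
CREATE_OVER_PARQUET = """
CREATE EXTERNAL TABLE tmp_sales_by_customer (customer_id BIGINT, partial_sum DOUBLE)
STORED AS PARQUET
LOCATION '/denodo/tmp/sales_by_customer'
"""

FINAL_MPP_QUERY = """
SELECT c.zip, SUM(t.partial_sum) AS total_sales
FROM tmp_sales_by_customer AS t
JOIN customer_cached AS c ON c.customer_id = t.customer_id  -- pre-cached in the MPP
GROUP BY c.zip
"""
```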
Demo
Putting all the pieces together
Putting all the pieces together
These three techniques (Hadoop as a data source, as a cache, and as a processing engine) can be combined to tackle complex scenarios with big data volumes efficiently:
▪ Surfaces all the company's data in a business-friendly manner, without the need to replicate everything into the Hadoop lake
▪ Improves governance and metadata management to avoid "data swamps": data lineage, catalog, access control, impact analysis for changes, etc.
▪ Allows for on-demand combination of real-time data (from the original sources) with historical data (in the cluster)
▪ Leverages the processing power of the existing cluster, controlled by Denodo's optimizer
Architecture – Technical notes
To benefit from this architecture, Denodo servers should run on edge nodes of the Hadoop cluster
This will ensure:
▪ Faster uploads to HDFS
▪ Faster data retrieval from the MPP
▪ Better compatibility with the Hadoop configuration and library versions
Denodo cluster:
▪ Multiple nodes behind a load balancer for HA
▪ Running on Hadoop edge nodes
Hadoop cluster:
▪ Processing and storage nodes
▪ Same subnet as the Denodo cluster
Q&A
Next Steps
▪ Download Denodo Express
▪ Denodo for AWS and Denodo for Azure: access the Denodo Platform in the cloud, with a 30-day free trial available!
Thank you!
© Copyright Denodo Technologies. All rights reserved.
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.
#FastDataStrategy