DATA VIRTUALIZATION PACKED LUNCH
WEBINAR SERIES
Sessions Covering Key Data Integration Challenges
Solved with Data Virtualization
Data Virtualization
An Essential Component of a Cloud Data Lake
Pablo Alvarez-Yanez
Director of Product Management, Denodo
Agenda
1. Current challenges in data management
2. Cloud Data Lakes
3. Shortcomings of data lakes
4. Data virtualization and cloud data lakes working
together
5. Cloud, on-prem and hybrid
6. Key takeaways
Current Challenges in Data Management
1. End Users: faster & more accurate decision making
• Significant increase in business speed & complexity of requirements
2. Regulations: enterprise-wide governance & data security
• Thousands of new regulations worldwide: tax, finance, privacy, HR, environmental, etc.
3. IT: cost reduction
• Huge data growth with associated storage and operational costs
Data lakes were born to efficiently
address the challenge of cost reduction:
data lakes allow for cheap, efficient
storage of very large amounts of data
Cloud implementation simplified the
complexity of managing a large data
lake
A Bit of History – Etymology of “data lake”
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
Pentaho’s CTO James Dixon is credited with coining
the term "data lake". He described it in his blog in
2010:
"If you think of a data mart as a store of bottled
water – cleansed and packaged and structured
for easy consumption – the data lake is a large
body of water in a more natural state. The
contents of the data lake stream in from a
source to fill the lake, and various users of the
lake can come to examine, dive in, or take
samples."
The Data Lake – Architecture I
Distributed File System
Cheap storage for large data volumes
• Support for multiple file formats (Parquet, CSV,
JSON, etc)
• Examples:
• On-prem: HDFS
• Cloud native: AWS S3, Azure ADLS, Google GCS
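
To make the storage layer concrete, here is a minimal, hedged sketch in PySpark that reads Parquet and CSV files straight from object storage. The bucket name and paths are hypothetical, and it assumes the cluster is configured with S3 credentials; the same code points at HDFS or ADLS simply by changing the URI scheme.

from pyspark.sql import SparkSession

# The Spark session is the entry point to the lake's files.
spark = SparkSession.builder.appName("lake-storage-sketch").getOrCreate()

# Columnar formats such as Parquet carry their own schema and compress well.
orders = spark.read.parquet("s3a://example-data-lake/raw/orders/")

# Text formats such as CSV need a little more guidance.
customers = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("s3a://example-data-lake/raw/customers/"))

orders.printSchema()
customers.show(5)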
The Data Lake – Architecture II
Distributed File System
Execution Engine
Massively parallel & scalable execution engine
• Cheaper execution than traditional EDW
architectures
• Decoupled from storage
• Doesn’t require specialized HW
• Examples:
• SQL-on-Hadoop engines: Spark, Hive, Impala, Drill,
Dremio, Presto, etc.
• Cloud native: AWS Redshift, Snowflake, AWS Athena,
Delta Lake, GCP BigQuery
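
As a hedged illustration of how the engine is decoupled from storage, the PySpark sketch below (paths, table and column names are hypothetical) registers raw Parquet files as a SQL view and runs a parallel aggregation over them; an Athena, Presto or BigQuery query over external tables follows the same pattern.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-engine-sketch").getOrCreate()

# Expose raw files as a SQL-queryable view; nothing is loaded into a warehouse.
spark.read.parquet("s3a://example-data-lake/raw/orders/") \
     .createOrReplaceTempView("orders")

# The engine parallelizes this aggregation across the cluster; scaling compute
# up or down never touches the stored data.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()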
The Data Lake – Architecture III
Adoption of new transformation techniques
• Data ingested is normally raw and unusable by end
users
• Data is transformed and moved to different
“zones” with different levels of curation
• End users only access the refined zone
• Use of ELT as a cheaper transformation technique
than ETL
• Use of the engine and storage of the lake for data
transformation instead of external ETL flows
• Removes the need for additional staging HW
[Diagram: Raw zone → Trusted zone → Refined zone, layered on the Distributed File System and Execution Engine]
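
The zone-based ELT pattern can be sketched as follows, using the lake's own engine instead of an external ETL server. This is a hedged example: the zone paths, schema and cleansing rules are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-elt-sketch").getOrCreate()

# Raw zone: data exactly as ingested.
raw = spark.read.json("s3a://example-data-lake/raw/events/")

# Trusted zone: typed, deduplicated and quality-checked.
trusted = (raw
           .dropDuplicates(["event_id"])
           .withColumn("event_ts", F.to_timestamp("event_ts"))
           .filter(F.col("event_ts").isNotNull()))
trusted.write.mode("overwrite").parquet("s3a://example-data-lake/trusted/events/")

# Refined zone: curated, consumer-facing data set.
refined = trusted.groupBy("customer_id").agg(F.count("*").alias("event_count"))
refined.write.mode("overwrite").parquet("s3a://example-data-lake/refined/customer_activity/")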
Data Lake Example – AWS
• Data ingested using AWS Glue (or other ETL tools)
• Raw data stored in S3 object store
• Maintain fidelity and structure of data
• Metadata extracted/enriched using Glue Data Catalog
• Business rules/DQ rules applied to S3 data as it is copied to Trusted Zone data stores
• Trusted Zone contains more than one data store – select the best data store for the data and its processing
• Refined Zone contains data for consumers – curated data sets (data marts?)
• Refined Zone data stores differ – Redshift, Athena, Snowflake, …
[Diagram: data is ingested from internal & external data sources via AWS Glue into the Raw Zone (S3 for raw data), promoted through the Trusted Zone and on to the Refined Zone; consumers include data portals, BI & visualization tools, analytic workbenches, mobile apps, etc.]
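
On the consumption side of this example, a refined-zone table registered in the Glue Data Catalog can be queried through Athena. The sketch below uses boto3; the database, table and result-bucket names are hypothetical.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit a query against a refined-zone table; results land in S3.
response = athena.start_query_execution(
    QueryString=(
        "SELECT region, SUM(revenue) AS revenue "
        "FROM sales_refined GROUP BY region"
    ),
    QueryExecutionContext={"Database": "example_refined"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Query started:", response["QueryExecutionId"])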
Hadoop-Based Data Lakes – A Data Scientist’s Playground
The early data scientists saw Hadoop as their
personal supercomputer.
Hadoop-based Data Lakes helped democratize
access to state-of-the-art supercomputing with
off-the-shelf HW (and, later, the cloud).
The industry push for BI made Hadoop-based
solutions the standard way to bring modern
analytics to any corporation.
Hadoop-based Data Lakes became
“data science silos”
Can data lakes also address the
other data management
challenges?
Can they provide fast decision
making with proper
governance and security?
Changing the Data Lake Goals
“The popular view is that a
data lake will be the one
destination for all the data
in their enterprise and the
optimal platform for all
their analytics.”
Nick Heudecker, Gartner
The Data Lake as the Repository of All Data
To efficiently enable self-service initiatives, a data lake must provide access to all company data.
Is that realistic? And even if it were possible, it comes with multiple trade-offs:
• Huge up-front investment: creating ingestion pipelines for all company datasets into the lake is costly
• Questionable ROI, as much of that data may never be used
• Replicate the EDW? Replace it entirely?
• Large recurring maintenance costs: those pipelines need to be constantly modified as data structures change in the sources
• Risk of inconsistencies: data needs to be frequently synchronized to avoid stale datasets
• Loss of capabilities: data lake capabilities may differ from those of the original sources, e.g. quick access by ID in an operational RDBMS
Efficient use of the data lake to accelerate insights comes at a price in cost, time to market and governance
COST
GOVERNANCE
Purpose-specific data lakes
If we restrict the use of the data lake to a specific use case (e.g. data science), some of those problems go away.
However, to maintain fast insights and self-service, we add an additional burden on the end user:
• Higher complexity: end users need to find where the data is and how to use it
• Risk of inconsistencies: data may be in multiple places, in different formats and calculated at different times
• Loss of security: frustration drives the use of shadow IT, “personal” extracts, uncontrolled data prep flows, etc.
An environment with multiple purpose-specific systems slows down TTM and jeopardizes security and governance
TTM
SECURITY
Data Lakes in the ‘Pit of Despair’
Data Lakes are 5-10 years from the
Plateau of Productivity and are
deep in the
Trough of Disillusionment
Gartner – The Evolution of Analytical Environments
This is a Second Major Cycle of Analytical Consolidation
[Timeline diagram, ©2018 Gartner, Inc., ID 342254]
• 1980s – Pre-EDW: fragmented or nonexistent analysis across multiple structured operational sources
• 1990s – EDW: unified analysis; consolidated data (“collect the data”) on a single server with multiple nodes, providing more analysis than any one server could
• 2000s–2010s – Post-EDW: fragmented analysis again; data collected into different repositories (data warehouses, cubes, data lakes), new data types and processing requirements, uncoordinated views
• LDW: unified analysis; a logically consolidated view of all data (“connect and collect”) across multiple servers of multiple nodes (data warehouse, data lake, marts, ODS, staging/ingest), providing more analysis than any one system can
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018
Gartner – Logical Data Warehouse
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018
[Diagram: Gartner’s Logical Data Warehouse architecture, highlighting DATA VIRTUALIZATION]
How can a logical
architecture enabled by
Data Virtualization help?
Faster Time to Market for data projects
A data virtualization layer allows you to connect directly to all kinds of data sources: the EDW,
application databases, SaaS applications, etc.
This means that not all data needs to be replicated to the data lake for consumers to access it
from a single (virtual) repository.
In some cases it makes sense to replicate data in the lake; in others it doesn’t. DV opens that door.
• Data can be accessed immediately, easily improving the TTM and ROI of the lake
• If data turns out not to be useful, no time was lost preparing pipelines and copying data
• Data can be ingested and synchronized into the lake efficiently when needed
• Denodo can load and update data in the data lake natively, using Parquet and parallel loads
• Execution is pushed down to the original sources, taking advantage of their capabilities
• Especially significant in the case of an EDW with strong processing capabilities
TTM
COST
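
From the consumer's point of view, the virtual layer is simply another SQL endpoint. The sketch below is generic rather than Denodo-specific: the ODBC DSN, credentials and view names are hypothetical, and the join federates a lake-backed view with an EDW-backed view, leaving it to the virtualization engine to decide what to push down to each source.

import pyodbc

# Connect to the (hypothetical) virtual layer like any other database.
conn = pyodbc.connect("DSN=virtual_layer;UID=analyst;PWD=secret")
cursor = conn.cursor()

# One logical query; parts of it can be delegated to the lake and to the EDW.
cursor.execute("""
    SELECT c.segment, SUM(o.amount) AS revenue
    FROM lake_orders o          -- view backed by the cloud data lake
    JOIN edw_customers c        -- view backed by the existing EDW
      ON o.customer_id = c.customer_id
    GROUP BY c.segment
""")
for segment, revenue in cursor.fetchall():
    print(segment, revenue)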
Easier self-service through a single delivery layer
From an end user perspective, access to all data is done through a single layer in
charge of delivering any data, regardless of its actual physical location
A single delivery layer also allows you to enforce security and governance policies
The virtual layer becomes the “delivery zone” of the data lake, offering modeling and
caching capabilities, documentation and output in multiple formats
GOVERNANCE
• Built-in rich modeling capabilities to tailor data models to end
users
• Integrated catalog, search and documentation capabilities
• Access via SQL, REST, OData and GraphQL with no additional
coding
• Advanced security controls, SSO, workload management,
monitoring, etc.
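
The same logical view can also be consumed as a web service without extra coding on the provider side. The snippet below is only illustrative: the base URL, resource name, query parameters and response shape are hypothetical placeholders for whatever the delivery layer actually publishes.

import requests

resp = requests.get(
    "https://data-services.example.com/sales/customer_revenue",
    params={"region": "EMEA", "format": "json"},
    auth=("analyst", "secret"),
    timeout=30,
)
resp.raise_for_status()

# Iterate over the returned rows (the exact shape depends on the publishing layer).
for row in resp.json():
    print(row)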
Accelerates query execution
Controlling data delivery separately from storage allows a virtual layer to accelerate
query execution, providing faster response than the sources alone
• Aggregate-aware capabilities to accelerate execution of analytical queries
• Flexible caching options to materialize frequently used data:
• Full datasets
• Partial results
• Hybrid (cached content + updates from source in real time)
• Powerful optimization capabilities for multi-source federated queries
PERFORMANCE
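
Aggregate awareness is easiest to see with a toy example. The sketch below is illustrative only (table names and the matching rule are hypothetical): if a query's grouping columns and measures are covered by a pre-computed summary, the virtual layer can answer it from the small summary instead of the detail data.

# Map of available summaries: name -> (grouping columns kept, measures pre-aggregated)
SUMMARIES = {
    "sales_by_day_region": ({"order_date", "region"}, {"amount"}),
}

def choose_table(group_by_cols, measures):
    """Return a covering summary table if one exists, otherwise the detail table."""
    for name, (dims, facts) in SUMMARIES.items():
        if set(group_by_cols) <= dims and set(measures) <= facts:
            return name
    return "sales_detail"

# A daily-revenue query is served from the small summary table ...
print(choose_table({"order_date"}, {"amount"}))    # -> sales_by_day_region
# ... while a per-customer query still has to go to the detail data.
print(choose_table({"customer_id"}, {"amount"}))   # -> sales_detail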
Denodo’s Logical Data Lake
[Diagram: the logical data lake. The virtual layer – source abstraction, query engine and business delivery, together with a business catalog and security and governance – sits between the consuming tools (BI & reporting, mobile applications, predictive analytics, AI/ML, real-time dashboards) and the IT storage and processing systems (ETL, data warehouse, Kafka, the physical data lake, files). The physical lake keeps its raw, trusted and refined zones on the distributed file system and execution engine, while the virtual layer acts as its delivery zone.]
Cloud, On-Prem or
Hybrid?
Denodo Customers Cloud Survey - 2019
• More than 60% of companies already have multiple projects in cloud
• 25% are Cloud-First and/or are in “advanced” state
• Only 4.5% do not have plans for Cloud in the short term
• More than 46% have hybrid integration needs, more than 35% are already multi-cloud
• Key Use Cases include: Analytics (49%), Data Lake (45%), Cloud Data Warehouse (40%)
• Less than 9% of on-prem systems are decommissioned (Forrester estimates 8%)
• Key Technologies in Cloud Journey: Cloud Platform Tools (56%), Data Virtualization (49.5%),
Data Lake Technology (48%)
Source: Denodo Cloud Survey 2019, N = 200.
https://www.denodo.com/en/document/whitepaper/denodo-global-cloud-survey-2019
Denodo and cloud
A virtual layer like Denodo should be deployed based on “data gravity”: wherever most of
your data is.
However, as we have seen, data gravity can change over time, which calls for hybrid models.
Denodo’s deployment model allows for multiple options
• Cloud deployments with full automation of the infrastructure management
• One-click changes in cluster settings, type of nodes, versions, etc.
• Elastic options for cluster auto-scaling
• Traditional on-prem installations
• Hybrid models with cloud and on-prem Denodo installations talking to each other
Logical Multi-Cloud Architecture
Key Takeaways
1. In most cases, not all the data is going to be in the data lake
2. Large data lake projects are complex environments that will benefit from a virtual ‘consumption’ layer
3. Data virtualization provides the governance and management infrastructure required for a successful data lake implementation
4. Data virtualization is more than just a data access or services layer; it is a key component of a data lake
Data Virtualization: An Essential Component of a Cloud Data Lake
Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
www.denodo.com/TestDrive
GET STARTED TODAY
Thank you!
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and
microfilm, without the prior written authorization of Denodo Technologies.