Building the
Enterprise Data Lake
Considerations before you
jump in
December, 2015
Mark Madsen
www.ThirdNature.net
@markmadsen1
What This Session Isn’t
SQL..
.
SQL!
SQL?
SQL
The craft model of information delivery does not scale
© Third Nature, Inc.
So we shifted to data publishing
Industrialized data delivery for self-service access.
Events and sensors are a relatively new data source
Sensor data doesn’t fit well with current methods of modeling,
collection and storage, or with the technology to process and analyze it.
There’s lots of other new data involved
You can store this data in an RDBMS, but…
These sorts of things slow user requests down
Conclusion: any methodology built on the premise that you
must know and model all the data first is untenable
Analytics embiggens data volume problems
Many of the processing problems are O(n²) or worse, so
moderate data can be a problem for scale-up platforms
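To make the growth rate concrete, here is a minimal sketch that is not from the original slides. It assumes a hypothetical all-pairs task (for example, comparing every record against every other record) and counts the comparisons required; the function name and record counts are illustrative only.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Each 10x increase in records costs roughly 100x more comparisons.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} records -> {pairwise_comparisons(n):>13,} comparisons")
```

The point of the sketch is only the ratio: a dataset that is modest in bytes can still demand billions of operations on a single machine.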
Old market says: There’s nothing wrong with what
you have, just keep buying new products from us
The emerging big data market has an answer…
The data lake
Views of the lake
Is the business vs supports the business?
Application vs infrastructure?
The naïve idea of a data lake leads to predictable results
You can’t install Hadoop and hope it solves all the problems
The answer isn’t just technology, it’s architecture
Schema
In the DW world both data and processing are bounded
No consideration for feedback loops and change
Processing only
happens here
Carefully
controlled
access
here
Nobody here creates
new information
Sources few and
well understood
Complex DI
is controlled
by IT
Schemas are few
and designed
Tools are authorized,
few in number and
kind
One way flow
This is a monolithic, layered architecture
In the big data world flow is unbounded and continuous
Feedback
loops allowed
End-of-analysis
dataset may be
start of a BI dataset
Continuous data
integration and delivery
Files are back as both
input and storage
Minimal
barrier of /
control on
collection
Areas of
provisioned
data
Any shape in,
rectangles out
This needs a distributed service architecture
Deconstructing data environments
There are three
things happening in
a data warehouse:
▪ Data acquisition
▪ Data management
▪ Data delivery
Isolate them from one
another, allow read-write
use, and you are
on the path.
Data
Warehouse
Data lake subsystems / components
The acquisition component allows any data to be collected at any latency. The
management component allows some data to be standardized and integrated. The
access component provides access at any latency and via any means an application
chooses. Processing can be done to any data at any time from any area.
Data Acquisition
Collect & Store
Incremental
Batch
One-time copy
Real time
Data Lake Platform Services
Data Management
Process & Integrate
Data Access
Deliver & Use
Data storage
In reality, you are building three systems, not one. Avoid the monolith.
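The separation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's design: the class names `Storage`, `Acquisition`, `Management` and `Access`, the area names `"raw"` and `"standardized"`, and the `to_celsius` transform are all hypothetical. The only thing the three components share is the storage they read from and write to.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Storage:
    """Shared data storage, organized into named areas."""
    areas: dict[str, list[Any]] = field(default_factory=dict)

    def append(self, area: str, record: Any) -> None:
        self.areas.setdefault(area, []).append(record)

    def read(self, area: str) -> list[Any]:
        # Return a copy of the list so callers cannot add or remove stored records.
        return list(self.areas.get(area, []))


class Acquisition:
    """Collect & store: accepts any record, with no model required up front."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def collect(self, record: Any) -> None:
        self.storage.append("raw", record)


class Management:
    """Process & integrate: standardizes some of the data, not all of it."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def standardize(self, transform: Callable[[Any], Optional[Any]]) -> None:
        # Rebuild the standardized area from raw; records for which the
        # transform returns None are left in the raw area only.
        results = (transform(r) for r in self.storage.read("raw"))
        self.storage.areas["standardized"] = [r for r in results if r is not None]


class Access:
    """Deliver & use: reads from any area, not from a single access point."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage

    def fetch(self, area: str) -> list[Any]:
        return self.storage.read(area)


def to_celsius(record: Any) -> Optional[dict]:
    """Hypothetical transform: handles only dict records that carry temp_f."""
    if isinstance(record, dict) and "temp_f" in record:
        return {"device": record["device"], "temp_c": (record["temp_f"] - 32) * 5 / 9}
    return None


storage = Storage()
Acquisition(storage).collect({"device": "a1", "temp_f": 212})
Acquisition(storage).collect("unparsed log line")
Management(storage).standardize(to_celsius)
print(Access(storage).fetch("raw"))
print(Access(storage).fetch("standardized"))
```

Because each component depends only on the storage interface, any one of them could be replaced or scaled separately, which is the practical meaning of avoiding the monolith.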
Data lake functions depend on platform services
Base Platform Services
Data Movement Metadata Data Persistence
Workflow
Management
Processing Engines Dataflow Services
Data Curation
Data Access
Services
Data Acquisition
Collect & Store
Data Management
Process & Integrate
Data Access
Deliver & Use
Platform services needed
DATA ARCHITECTURE
We’re so focused on the light switch that we’re not
talking about the light
Decouple the Data Architecture
The core of the data lake isn’t a database or HDFS,
it’s the data architecture that the tools implement.
We need a data architecture that is not limiting:
▪ Deals with change easily and at scale
▪ Does not enforce requirements and models up front
▪ Does not limit the format or structure of data
▪ Assumes the range of data latencies in and out, from
streaming to one-time bulk
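One common way to avoid enforcing a model up front is to store records as received and apply structure only when reading. The slides do not name a technique, so treat the following as an assumed illustration: the sample records, the field names and the `read_with_schema` helper are hypothetical.

```python
import json

# Records are stored exactly as they arrived; their shape varies over time.
raw_lines = [
    '{"device": "a1", "temp_c": 21.5}',
    '{"device": "b2", "temp_c": 19.0, "humidity": 0.41}',  # a new field appears
    '{"device": "c3"}',                                    # a field is missing
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: project each record onto the requested fields."""
    for line in lines:
        record = json.loads(line)
        # Missing fields become None instead of causing the record to be rejected.
        yield {name: record.get(name) for name in fields}


temps = list(read_with_schema(raw_lines, ["device", "temp_c"]))
humidity = list(read_with_schema(raw_lines, ["device", "humidity"]))
print(temps)
print(humidity)
```

Two consumers read the same stored lines with different schemas, and neither needed the collection step to know about `humidity` in advance.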
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
You need to keep the original because just like baking,
you can’t unmake dough once it’s mixed.
Data architecture is required by the services, and vice versa
Raw data in an immutable
storage area
Standardized or
enhanced data
Common or
usage-specific data
Transient data
Data Acquisition
Collect & Store
Platform Services
Data Access
Deliver & Use
Data Management
Process & Integrate
The data areas map (mostly) to functional areas of the lake
Collection can’t be limited by database scale and latency.
Immutability, persistence and concurrency are required.
Incremental
Collect
Batch
One-time copy
Real time
Manage & Integrate Process, Deliver, Use
Stages, not layers
Some tools require specific repositories or models.
Others can reach in to get what they need. Do not
enforce a single access point or model.
The geography has been redefined
The box IT created:
• not any data, rigidly typed data
• not any form, tabular rows and
columns of typed data
• not any latency, persist what the
DB can keep up with
• not any process, only queries
The digital world was diminished
to only what’s inside the box until
we forgot the box was there.
Layered data architecture
The DW assumed a single flat
model of data, DB in the center.
The data lake enables new ways
to organize data:
▪ Raw – straight from the source
▪ Enhanced – cleaned, standardized
▪ Integrated – modeled,
augmented, ~semi-persistent
▪ Derived – analytic output,
pattern-based sets, ephemeral
Implies a new technology architecture
and data modeling approaches.
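The four ways of organizing data listed above can be read as successive functions over an untouched raw set. The sketch below is an assumption-laden illustration, not the presenter's implementation: the sensor records, the `DEVICE_SITES` reference table and the function names are all hypothetical.

```python
from statistics import mean

# Raw: kept exactly as received. A tuple, so the sketch cannot add or drop records.
RAW = (
    {"device": "A1 ", "temp_c": "21.5"},
    {"device": "a1", "temp_c": "22.5"},
    {"device": "B2", "temp_c": "n/a"},
)

DEVICE_SITES = {"a1": "plant-1", "b2": "plant-2"}  # hypothetical reference data


def enhance(raw):
    """Enhanced: cleaned and standardized; unparseable readings are skipped."""
    out = []
    for r in raw:
        try:
            temp = float(r["temp_c"])
        except ValueError:
            continue  # the bad reading is still available in RAW
        out.append({"device": r["device"].strip().lower(), "temp_c": temp})
    return out


def integrate(enhanced):
    """Integrated: modeled and augmented with reference data."""
    return [{**r, "site": DEVICE_SITES.get(r["device"])} for r in enhanced]


def derive(integrated):
    """Derived: analytic output that can be recomputed whenever it is needed."""
    by_site = {}
    for r in integrated:
        by_site.setdefault(r["site"], []).append(r["temp_c"])
    return {site: mean(values) for site, values in by_site.items()}


print(derive(integrate(enhance(RAW))))
```

Since nothing downstream modifies `RAW`, a later change to the cleaning rules (for example, deciding to keep the `"n/a"` reading as a null) only requires rerunning the later steps against the original records.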
The data lake enables evolutionary design for data
Evolutionary design is required because data needs change. You
need a system not for stability – we have that in the DW – but for
evolution and change: the data lake.
Data Acquisition
Collect & Store
Incremental
Batch
One-time copy
Real time
Data Lake Platform Services
Data Management
Process & Integrate
Data Access
Deliver & Use
Data storage
You can’t build this all at once. You need to grow it over time.
Away from “one throat to choke”, back to best of breed
Tight coupling leads to efficient
reuse and standardization, and
to slow changes.
In a rapidly evolving market,
componentized architectures,
modularity and loose coupling
are preferable to monolithic
stacks, single-vendor
architectures and tight
coupling.
Architecture, not blueprints:
there is no single answer. It
depends on your goals and
starting position.
Questions?
“When a new technology rolls over you, you're either part of
the steamroller or part of the road.” – Stewart Brand
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/
glass_buildings.jpg - http://www.flickr.com/photos/erikvanhannen/547701721
About the Presenter
Mark Madsen is president of Third Nature, a
consulting and advisory firm focused on
analytics, business intelligence and data
management. Mark is an award-winning
author, architect and CTO. Over the past ten
years Mark received awards for his work
from the American Productivity & Quality
Center, TDWI, and the Smithsonian Institute.
He is an international speaker, a contributor
to Forbes, and a member of the O’Reilly Strata
program committee. For more information
or to contact Mark, follow @markmadsen on
Twitter or visit http://ThirdNature.net
About Third Nature
Third Nature is a consulting and advisory firm focused on new and emerging technology
and practices in information strategy, analytics, business intelligence and data
management. If your question is related to data, analytics, information strategy and
technology infrastructure then you’re at the right place.
Our goal is to help organizations solve problems using data. We offer education, consulting
and research services to support business and IT organizations as well as technology
vendors.
We fill the gap between what the industry analyst firms cover and what IT needs. We
specialize in strategy and architecture, so we look at emerging technologies and markets,
evaluating how technologies are applied to solve problems rather than evaluating product
features.


Building the Enterprise Data Lake: A look at architecture
