Architecting a Datalake
Laurent Léturgez – Sep 2019
Big Data Meetup - Lille
Whoami
• Database and BigData Architect (Hadoop, Data Science and other cool topics)
• Former Developer and Consultant
• Owner@Premiseo: Data Management on Premises and in the Cloud
• Blogger since 2004
• http://laurent-leturgez.com
• Twitter : @lleturgez
What’s on the menu ?
• What is a Datalake ?
• Keys to architect a Datalake
• Design, Security
• Data movement, Data Processing
• Discovery
• Solutions available
• Example
• Datalake Implementation driven by IoT
What is a Datalake ?
• Repository of data stored in natural format
• Single Store of Enterprise data
• Raw Data
• Transformed Data : Reports, DataViz, Results (AI, ML …)
• Data Structure:
• Structured Data : Row, Columns, Relational Data
• Semi Structured Data: CSV, XML, JSON, log files
• Unstructured Data: Mails, Documents, Binaries (Images, Videos)
What is a Datalake ?
• Features
• Data are usually integrated unprocessed
• Processed data can be kept in the Datalake
• Data are kept … ready to be transformed
• Data are saved as long as possible
• A Datalake is
• Organized
• Managed
What is a Datalake ?
• A Datalake is not a data warehouse
Source: martinfowler.com
Keys to architect a Datalake
• A well-thought-out design
• Vital for
• Success
• Discovery efficiency
• ETL development effort
• Coupled with Security and business process
Keys to architect a Datalake
• A well-thought-out design … example
• Operational Areas
• Raw Area
• Data landing zone in native raw format
• Data are kept indefinitely in this area
• Data Tagging
• Folder Structure organized by Source, Dataset, Date etc.
• Staging Area
• Data Preparation Area : Decompression, cleansing, aggregation
• Data Quality Management is usually done here
• Hub Area
• Trusted layer of data
• Data is ready for analytics, organized functionally
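
The folder structure of the operational areas can be sketched in a few lines of Python. This is only an illustration: the `zone/source/dataset/yyyy/mm/dd` layout, the zone names and the `zone_path` helper are assumptions, not a layout prescribed by the talk.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zone names mirroring the operational areas listed above.
ZONES = ("raw", "staging", "hub")


def zone_path(zone: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build a lake path organized by zone, source, dataset and ingestion date."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return str(
        PurePosixPath(zone, source, dataset,
                      f"{day:%Y}", f"{day:%m}", f"{day:%d}", filename)
    )


# Example: an ERP extract landing in the raw area on 12 Sep 2019.
print(zone_path("raw", "erp", "purchases", date(2019, 9, 12), "purchases.csv.gz"))
```

Keeping the date as separate path components makes it cheap to list or purge one day of data without scanning the whole dataset.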
Keys to architect a Datalake
• A well-thought-out design … example
• (Extra) Supported Area
• Master Data Area
• Customer, Products, Financial Data
• Used by Analytics
• Exploratory Area
• Playground for Data Scientists and Analysts
• Temporary Area
• Testing Data decompression
• Single point of data storage before moving across the network
Keys to architect a Datalake
• Security
• Data Access Control
• By User
• By Application
• ETL Software
• Analytics
• …
• By Operational zone
• By Source
Key Point: IAM Integration
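
The access-control dimensions above (user or application, operational zone, source) can be modelled as a small grant table. The sketch below is a simplified assumption for illustration: the `Grant` class, the principals and the `is_allowed` helper are hypothetical, and a real deployment would delegate these decisions to the IAM system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    principal: str   # a user or an application identity
    zone: str        # operational zone, e.g. "raw" or "hub"
    source: str      # data source; "*" means any source
    mode: str        # "r" (read) or "rw" (read and write)


# Hypothetical grants; in practice these would come from the IAM system.
GRANTS = [
    Grant("etl-tool", "raw", "*", "rw"),
    Grant("etl-tool", "staging", "*", "rw"),
    Grant("analytics-app", "hub", "*", "r"),
    Grant("alice", "hub", "erp", "r"),
]


def is_allowed(principal: str, zone: str, source: str, write: bool = False) -> bool:
    """Return True if any grant covers this principal, zone, source and mode."""
    for g in GRANTS:
        if g.principal != principal or g.zone != zone:
            continue
        if g.source not in ("*", source):
            continue
        if write and g.mode != "rw":
            continue
        return True
    return False
```

With this table, the ETL tool may write to the raw area, while the analytics application can only read from the hub.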
Keys to architect a Datalake
• Security
• Data Security
• Data Lake Management (Role Control)
• Data Resilience
• Disaster recovery
• Backup / Restore
• SLA: Availability, RTO, RPO
• Data Encryption
• At rest
• In transit
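
The SLA terms above translate into simple arithmetic. The sketch below assumes a 365-day year and assumes that the worst-case data loss equals the interval between two backups; both helpers are hypothetical names used only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Simplification: worst-case data loss equals the backup interval."""
    return backup_interval_hours <= rpo_hours


# An availability target of 99.9 % leaves a budget of roughly 8.76 hours per year.
print(round(max_downtime_hours(99.9), 2))
```

The RTO is not computed here: it has to be measured by actually rehearsing a restore.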
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as the central point for
• Data Ingestion
• Data Processing
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as the central point
• Data Ingestion
• Tools / ETL
• Metadata strategy should be in place (Data Catalog for tagging)
• Data Format
• Naming convention for files/directories: ingestion date, format, source etc.
• Batch or real time
• Many small files or few big files
• Data Partitioning → Maximum query and processing performance
• Cloud or OnPrem ?
• Network issues, hybrid Cloud considerations
• Data Processing
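
The "many small files or few big files" question often ends with a compaction step that merges small files into larger ones. Here is a minimal planning sketch, assuming a greedy, order-preserving grouping; `plan_compaction` and the target size are hypothetical, and the actual merge would be done by the processing engine.

```python
from __future__ import annotations

from typing import Iterable


def plan_compaction(files: Iterable[tuple[str, int]], target_bytes: int) -> list[list[str]]:
    """Group (name, size) pairs into batches of roughly target_bytes each.

    Greedy and order-preserving: a batch is closed as soon as adding the
    next file would exceed the target. A file larger than the target ends
    up alone in its own batch.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch lists the files that would be rewritten as one larger file.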
Keys to architect a Datalake
• Data Movement, Data Processing
• Consider the Data Lake as the central point
• Data Ingestion
• Data Processing
• Tools
• Hadoop (on Prem / Cloud)
• Legacy database systems (SQL Server PolyBase, Oracle Connector for Hadoop, AWS Redshift Spectrum/Athena etc.)
• Analytics, DataViz and ML
• Databricks, Power BI, SAS, Qlik etc.
• Data Colocation
• Data Format
• Compressed / Uncompressed
• Column oriented
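
The difference between a row-oriented and a column-oriented layout, and the fact that compression is a separate choice, can be shown with the standard library alone. The records and the `to_columns` helper below are made up for illustration; real lakes would typically use a columnar file format rather than JSON.

```python
from __future__ import annotations

import gzip
import json

# Hypothetical sensor readings in row-oriented form (one dict per record).
rows = [
    {"truck_id": "T1", "part": "brake", "temp_c": 81.5},
    {"truck_id": "T2", "part": "brake", "temp_c": 79.0},
    {"truck_id": "T3", "part": "clutch", "temp_c": 64.2},
]


def to_columns(records: list[dict]) -> dict[str, list]:
    """Pivot row-oriented records into one list of values per column."""
    columns: dict[str, list] = {key: [] for key in records[0]}
    for record in records:
        for key in columns:
            columns[key].append(record[key])
    return columns


columns = to_columns(rows)

# A query on one column only needs to touch that column's values.
avg_temp = sum(columns["temp_c"]) / len(columns["temp_c"])

# Compression is independent of the layout: here gzip over a JSON payload.
raw_bytes = json.dumps(columns).encode("utf-8")
packed = gzip.compress(raw_bytes)
assert gzip.decompress(packed) == raw_bytes  # lossless round trip
```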
Keys to architect a Datalake
• Orchestration
• Cloud Automation or Job Automation ?
• Batch or real time
• Batch automation
• Monitoring
• Data volume
• Real Time (Usually used for IoT)
• How is the pipeline built ?
• Event based or not ?
• Monitoring
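
The "event based or not" question can be illustrated with a tiny in-process dispatcher. This is a sketch under assumptions: the `on`, `emit` and `run_batch` names and the `file_landed` event are hypothetical, and a real pipeline would rely on a message broker or a cloud event service instead of a Python dictionary.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]
_handlers: dict[str, list[Handler]] = defaultdict(list)


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one event type."""
    def register(func: Handler) -> Handler:
        _handlers[event_type].append(func)
        return func
    return register


def emit(event_type: str, payload: dict) -> int:
    """Dispatch one event to its handlers; return how many handlers ran."""
    handlers = _handlers[event_type]
    for handler in handlers:
        handler(payload)
    return len(handlers)


processed: list[str] = []


@on("file_landed")
def stage_file(payload: dict) -> None:
    # Event-based style: react as soon as a file lands in the raw area.
    processed.append(payload["path"])


def run_batch(paths: list[str]) -> int:
    """Batch style: a scheduler calls this once per period with all new paths."""
    for path in paths:
        emit("file_landed", {"path": path})
    return len(paths)
```

Both styles end up calling the same handler, which keeps the processing logic independent of how the pipeline is triggered.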
Keys to architect a Datalake
• Discovery
• Tagging and Metadata management : Similar … but different
• MetaData management :
• Data about data : creation and modification date, source, format etc.
• Traditional metadata: source, connection string, data type, length, versions etc.
• Modern metadata: included in files (Avro, for example) or a database
• Advanced metadata: automated processing of metadata
• Tagging
• Set of tags to understand/describe datasets in the datalake
• Usually stored in a Catalog or KV database or through Naming conventions
• Key points: When was the data tagged ? Who owns the tagging system ?
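
The distinction between metadata and tags can be made concrete with a minimal catalog sketch. Everything here is an assumption for illustration: the `CatalogEntry` fields, the in-memory dictionary standing in for a catalog or KV database, and the `find_by_tag` helper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    path: str
    # Metadata: data about the data.
    source: str
    fmt: str
    created: str                      # ISO date string
    # Tags: free-form labels that describe the dataset.
    tags: set[str] = field(default_factory=set)
    tagged_by: str = ""               # who made the tagging decision


catalog: dict[str, CatalogEntry] = {}   # a dict stands in for the KV store


def register(entry: CatalogEntry) -> None:
    """Add or overwrite the catalog entry for a dataset path."""
    catalog[entry.path] = entry


def find_by_tag(tag: str) -> list[str]:
    """Return the paths of all datasets carrying the given tag, sorted."""
    return sorted(p for p, e in catalog.items() if tag in e.tags)


register(CatalogEntry("raw/erp/purchases", "erp", "csv", "2019-09-12",
                      {"finance", "parts"}, "data-steward"))
register(CatalogEntry("raw/iot/sensors", "iot", "avro", "2019-09-12",
                      {"parts", "telemetry"}, "data-steward"))
```

Recording `tagged_by` next to the tags addresses the ownership question raised in the key points.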
Solutions available
• Solutions available
• On Prem :
• Hadoop / HDFS
• Cloud
• AWS : S3 Buckets
• Azure : Azure Datalake Store Gen1/Gen2, Storage Accounts
• GCP: Google Cloud Storage
• Oracle Cloud Infrastructure: Object Storage
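
Most of the stores listed above are addressed through a URI scheme by Hadoop-compatible tools. The sketch below maps a few commonly used schemes to the store they point at; the `store_of` helper and the sample URIs are hypothetical, and Oracle Cloud Infrastructure is deliberately left out of the table.

```python
from urllib.parse import urlparse

# URI schemes commonly used by Hadoop-compatible tools.
STORES = {
    "hdfs": "Hadoop / HDFS",
    "s3": "AWS S3",
    "s3a": "AWS S3",
    "adl": "Azure Data Lake Store Gen1",
    "abfss": "Azure Data Lake Storage Gen2",
    "wasbs": "Azure Storage Account (Blob)",
    "gs": "Google Cloud Storage",
}


def store_of(uri: str) -> str:
    """Name the storage service behind a URI, or 'unknown' if not listed."""
    return STORES.get(urlparse(uri).scheme, "unknown")


print(store_of("s3a://my-bucket/raw/erp/purchases/"))
```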
Implementation
• Example : Solution
• Customer : Industry, Truck maker
• Project : Parts failure prediction
• Sensors are embedded in trucks
• Data collection for parts health
• Data are integrated in real time into the Datalake
• Legacy data are integrated into the datalake (batch mode)
• Parts related data (mostly coming from ERPs) : Serial number, provider, purchases etc.
• Predictive algorithms are designed to replace parts before they break
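
The idea of combining real-time sensor data with batch-loaded parts data can be sketched as follows. This is not the customer's algorithm, which the talk does not describe: the threshold rule, the `max_temp_c` field, the sample values and the `parts_to_replace` helper are all assumptions standing in for a real predictive model.

```python
from __future__ import annotations

from statistics import mean

# Hypothetical batch data from the ERP, keyed by part serial number.
parts = {
    "SN-001": {"provider": "ACME", "max_temp_c": 90.0},
    "SN-002": {"provider": "Globex", "max_temp_c": 70.0},
}

# Hypothetical real-time sensor readings: (serial number, temperature in °C).
readings = [
    ("SN-001", 71.0), ("SN-001", 74.0),
    ("SN-002", 69.0), ("SN-002", 75.0),
]


def parts_to_replace(parts: dict, readings: list, margin: float = 0.95) -> list[str]:
    """Flag parts whose mean reading exceeds margin * their rated maximum."""
    by_serial: dict[str, list[float]] = {}
    for serial, value in readings:
        by_serial.setdefault(serial, []).append(value)
    return sorted(
        serial
        for serial, values in by_serial.items()
        if serial in parts and mean(values) > margin * parts[serial]["max_temp_c"]
    )
```

The join key is the serial number, which is why both the sensor stream and the ERP extract need to carry it.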
Implementation
• Example: Solution
• Azure Datalake Store / Storage Accounts closely integrated with MS SQL Databases
• Why not on Prem ?
• Infrastructure costs
• Fuzzy Data volume prediction
• Hadoop management
Implementation
• Example: Solution
• Why Azure ?
• Long-time Microsoft customer
• Many services already used (Legacy databases: MS SQL DWH, Power BI etc.)
• Active Directory Integration: Security, ACL and
• Batch Integration by Talend
• Real Time Integration by Azure Products (IoT Hub + Azure Functions)
• Close integration with Databricks for Analytics and Data Processing
Conclusion
• Datalakes are now central components for enterprises
• Without …
• Organized Data
• Managed Data (Security, design etc.)
• High volume of Data
• … there can be no powerful AI or ML algorithms
• … and no powerful Analytic processes
Questions ?
