Automated Metadata Management in Data Lake – A CI/CD Driven Approach
Keyuri Shah, Lead Engineer
Josh Reilly, Lead Engineer
Agenda
§ Introduction
§ Need for Metadata Management
§ Architecture
§ Overview of the Tool
§ Live Demo
Need for Metadata Management
• What is Metadata Management
• Motivation for a config-driven tool
  • Governance
  • Easy to Maintain
• Development Stack
  • Python
  • GitLab CI
• Use Cases
  • Enterprise Data Lake
  • Sharing the Lake across different teams
Config File Options
Database
▪ name
▪ owning_team
▪ description
▪ access:
  ▪ type: ad_group
  ▪ <env>: AD_GRP_NAM
  ▪ type: data
Tables
• name
• description
• schema
• encrypted_columns
• masked_columns
Views
• name
• database
• description
• query
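As a rough illustration of how a config-driven tool might check these options before a pipeline run, here is a minimal Python sketch. The key names come from the slide; which keys are required, the grouping into database/table/view, the `validate_config` helper, and all example values are assumptions made for illustration, not the presenters' implementation.

```python
# Hypothetical required keys per config kind, based on the field names on the slide.
# Which keys are actually mandatory in the presenters' tool is not stated.
REQUIRED_KEYS = {
    "database": {"name", "owning_team", "description", "access"},
    "table": {"name", "description", "schema"},
    "view": {"name", "database", "description", "query"},
}


def validate_config(kind, config):
    """Return a sorted list of required keys that are missing from config."""
    if kind not in REQUIRED_KEYS:
        raise ValueError(f"unknown config kind: {kind}")
    return sorted(REQUIRED_KEYS[kind] - set(config))


# Example values are invented for illustration only.
example_table = {
    "name": "orders",
    "description": "Example table",
    "schema": [
        {"name": "order_id", "type": "bigint"},
        {"name": "email", "type": "string"},
    ],
    "encrypted_columns": ["email"],
    "masked_columns": [],
}

# A view config with two required keys left out on purpose.
example_view = {"name": "orders_v", "database": "sales"}

print(validate_config("table", example_table))
print(validate_config("view", example_view))
```

A check like this could run as an early pipeline stage so that an incomplete config fails before anything is applied to the lake.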
Design
Live Demo
• Create a database/table/view config
• Check in to Git
• Run the CI/CD pipeline to plan and apply to int
• Verify Database/Table/View in Databricks
• Update Schema & Run Pipeline
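The plan step in the demo flow above could look something like the following Python sketch: compare the schema in the config with what currently exists and emit the SQL that would be applied. The `plan_table` function, its arguments, and the decision to handle only new tables and added columns are assumptions for illustration; the slides do not describe how the actual tool computes its plan.

```python
def plan_table(database, table, desired_columns, current_columns=None):
    """Return the SQL statements needed to reach the desired schema.

    desired_columns and current_columns map column name -> column type.
    current_columns is None when the table does not exist yet.
    Only table creation and column additions are handled in this sketch.
    """
    full_name = f"{database}.{table}"

    if current_columns is None:
        cols = ", ".join(f"{name} {dtype}" for name, dtype in desired_columns.items())
        return [f"CREATE TABLE IF NOT EXISTS {full_name} ({cols})"]

    # Columns present in the config but missing from the existing table.
    added = {n: t for n, t in desired_columns.items() if n not in current_columns}
    if not added:
        return []  # nothing to apply

    cols = ", ".join(f"{name} {dtype}" for name, dtype in added.items())
    return [f"ALTER TABLE {full_name} ADD COLUMNS ({cols})"]


# First run: the table does not exist, so the plan is a CREATE statement.
print(plan_table("sales", "orders", {"order_id": "bigint"}))

# After a schema update in the config, the plan adds only the new column.
print(plan_table(
    "sales", "orders",
    {"order_id": "bigint", "email": "string"},
    current_columns={"order_id": "bigint"},
))
```

Separating plan from apply lets a reviewer inspect the generated statements in the pipeline output before they run against an environment.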
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Keyuri Shah: https://www.linkedin.com/in/keyuri-shah
Josh Reilly: https://www.linkedin.com/in/josh-reilly-51052996/

