NFTBank x Snowflake Tech Seminar
Building modern data pipeline with
Snowflake + DBT + Airflow
Index
Session 1: Data Quality & Productivity
- Data Quality
- Data Quality Validation
- Data Catalog, Lineage Documentation
- DBT Introduction
Session 2: Integrate DBT with Airflow
- DBT Cloud or Airflow?
- Astronomer Cosmos
- dbt deps
Session 3: Cost Optimization
- Query Optimization
- Cost Monitoring
Chris Hoyean Song
NFTBank VP of AIOps 2021.11 ~
Riiid VP of AIOps 2020.09-2021.11
Naver AI Engineer 2017.12-2020.05
Kakao Data Engineer 2015.08-2017.12
Startup CEO 2013-2015
KAIST Innovation & Tech Management Master
KAIST Computer Science Bachelor
Commercialization of AI Tech
& Scalable AI
I connect artificial intelligence technology with business.
My area of expertise is ML pipelines, which increase the
productivity of AI projects.
"The biggest lesson that can be read from 70 years of AI research is that
general methods that leverage computation are ultimately the most
effective, and by a large margin. "
-Richard Sutton
https://www.linkedin.com/in/chris-song-0bb03439/
Data Value Chain (MLOps):
on-chain (blockchain) data + off-chain data → processed data → feature vectors → estimated price models for NFTs → API → customers
Data Quality & Productivity
Session 1
Data Quality: The Rule of Ten
Image source: https://www.barlog.de/en/services/engineering-cae/
The Rule of Ten for defect costs states that the cost of an
undiscovered defect increases tenfold at each value-generation
step, which is why catching data defects early saves so much.
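As a quick illustration of the compounding, a tiny sketch (the stage names and base cost below are hypothetical, not from the slides):

```python
# Rule of Ten: a defect's fix cost multiplies tenfold at each
# value-generation step it passes through undetected.
BASE_COST = 1  # relative cost of fixing the defect where it is introduced

# Hypothetical pipeline stages, for illustration only
stages = ["ingestion", "transformation", "feature store", "model", "customer API"]

def fix_cost(stage_index: int, base: float = BASE_COST) -> float:
    """Relative cost of fixing a defect first discovered at stage `stage_index`."""
    return base * 10 ** stage_index

for i, stage in enumerate(stages):
    print(f"{stage}: {fix_cost(i):,.0f}x")
```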
1-1. Data Quality Validation
Data Quality
- Data Freshness Monitoring
- Detects when data stops loading
- Data Unique Test
- Duplicate data silently breaks business logic
- Data Count Test
- The query succeeds, but the result is empty
- Data Min/Max Test
- Catches invalid out-of-range values
- Schema Validation
- Partner API responses can suddenly change shape
- Without schema validation, such pipeline failures are hard to detect
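The checks above are just predicates over a batch of rows; a minimal sketch (the column names and thresholds are hypothetical, and in practice the rows would come from a warehouse query):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical batch of rows pulled from the warehouse
rows = [
    {"id": 1, "price": 3.2, "loaded_at": datetime.now(timezone.utc)},
    {"id": 2, "price": 5.9, "loaded_at": datetime.now(timezone.utc)},
]

def check_freshness(rows, max_age=timedelta(hours=1)):
    """Freshness: the newest row must be recent, i.e. data is still loading."""
    newest = max(r["loaded_at"] for r in rows)
    return datetime.now(timezone.utc) - newest <= max_age

def check_unique(rows, key="id"):
    """Uniqueness: duplicate keys would silently break downstream logic."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_count(rows, min_rows=1):
    """Count: the query may succeed yet return an empty result."""
    return len(rows) >= min_rows

def check_min_max(rows, col="price", lo=0.0, hi=1e6):
    """Min/max: values outside a sane range indicate invalid data."""
    return all(lo <= r[col] <= hi for r in rows)

assert all([check_freshness(rows), check_unique(rows),
            check_count(rows), check_min_max(rows)])
```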
1-2. Data Catalog, Lineage Documentation
- Data Catalog
- Where does the data live, and what columns does it have?
- Who is in charge of it?
- Who modified it last?
- Data Lineage
- Which upstream data was this data built from?
- Which tables are built from this data?
- Are there circular references in how data is built?
- I want to see the big picture of how our core data is created.
Trial #1. Datahub
Datahub is one of the best data documentation tools available today
Problem: with a backlog of other tasks, keeping the Datahub documentation up to date never becomes a priority
Trial #2. Airflow DAG DVL
Features:
- Manually implemented Airflow module
Cons:
- Too many DVL files, which are difficult to manage in a
unified way
Too many Test Queries
Too much test code;
left unmanaged, it turns into legacy
Trial #3. Pydantic
Features:
- Suitable for schema validation
- Integrates with serialization
- A default dependency of FastAPI
Cons:
- Slow: serialization takes significant time
- When optimizing response latency, it tends to get removed
- Memory: uses about twice the memory of
collections.namedtuple
- When optimizing memory, it tends to get removed
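To illustrate what Pydantic-style validation buys, here is a stdlib-only sketch of the same idea: rejecting a partner API response whose schema has drifted. The field names and types are hypothetical, and real Pydantic does far more (coercion, nesting, serialization).

```python
# Stdlib-only sketch of the kind of schema check Pydantic automates.
# Hypothetical expected schema for a partner API response:
EXPECTED_SCHEMA = {"token_id": int, "collection": str, "price_eth": float}

def validate(record: dict) -> dict:
    """Raise ValueError if a field is missing or has the wrong type."""
    for field, typ in EXPECTED_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], typ):
            raise ValueError(
                f"{field}: expected {typ.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return record

ok = validate({"token_id": 42, "collection": "punks", "price_eth": 1.5})

try:
    # Partner API silently changed token_id from int to str
    validate({"token_id": "42", "collection": "punks", "price_eth": 1.5})
    drift_detected = False
except ValueError:
    drift_detected = True
```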
Trial #4. Datadog SLO Monitoring
- Periodically sends custom metrics to Datadog
- Used primarily to monitor data served in real time
Features:
- Suitable for real-time monitoring
- Useful for SLO and SLA management
Trial #5. DBT
Features:
- The Data Catalog and Data Lineage are generated automatically just by writing SQL according to DBT
conventions
- Adding one line to a yaml file creates a Data Validation Test
- You can see which tests each model has and whether they pass
- Works easily with all major data platforms: Snowflake, BigQuery, Databricks, PostgreSQL
Cons:
- Integration with Airflow is not seamless: it needs a lot of care
- DBT Cloud is limiting because it cannot coordinate with other Airflow DAGs
- Not many references yet
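The "one line in a yaml file" point refers to dbt's built-in schema tests; a sketch of what that looks like (the model and column names are hypothetical):

```yaml
# models/schema.yml — each entry under `tests:` becomes a data
# validation test that `dbt test` runs against the warehouse.
version: 2

models:
  - name: nft_estimated_prices   # hypothetical model name
    columns:
      - name: token_id
        tests:
          - unique        # duplicate keys would break business logic
          - not_null
      - name: price_eth
        tests:
          - not_null
```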
DBT: Data pipeline with just SQL files
$ dbt test
$ dbt docs serve
Reflections on adopting DBT
Integrate DBT with Airflow
Session 2
2-1. How can we trigger DBT models? DBT Cloud or Airflow?
2-2. How can we build task dependency?: Astronomer Cosmos
https://github.com/astronomer/astronomer-cosmos
Features:
- A library to run DBT inside Airflow
- Renders the DBT model lineage as Airflow task dependencies
- After each dbt run, quality is checked with the
connected dbt test
- Reduces the time needed to implement Airflow DAGs
2-3. How can we sync dbt files to Airflow?: dbt deps
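`dbt deps` installs the dependencies declared in `packages.yml`, which is one way to pull dbt files into the Airflow environment. A sketch (the git URL and revision below are placeholders):

```yaml
# packages.yml — `dbt deps` downloads these into dbt_packages/
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - git: "https://github.com/your-org/your-dbt-models.git"  # placeholder
    revision: main
```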
Data Warehouse Cost Optimization
Session 3
Data Warehouse Cost Optimization
Two areas: Query Optimization and Cost Monitoring
3-1. Query Optimization
The following four conditions must be met to optimize queries:
1. Partitioning: partition based on frequently used query conditions
2. File Size: files within a partition should be in the 50-512 MB range, a size suited to large queries,
for better performance
3. Clustering: optimize query performance by keeping data sorted
4. Query Tuning: reduce the scope of data scanned
- Remove unnecessary scans
- Eliminate inefficient JOINs
- Remove non-essential CROSS JOINs
- Scan only the ranges a MERGE statement needs
- Specify only the necessary scan ranges for dbt tests
Snowflake vs BigQuery: Characteristics
- Partitioning: Snowflake uses micro-partitions; BigQuery uses explicit partitions
- Cost data delay: Snowflake is near real-time (custom dashboard) or 1-2 days (built-in dashboard); BigQuery is 2-3 days (GCP cost analysis)
- Cost analysis dimensions: Snowflake breaks down per warehouse size, user, and tag; BigQuery per project and product (BigQuery compute, storage)
Snowflake Micro Partition
- Normally it is common to set partitions explicitly
- Snowflake instead keeps users from managing partitions at all; partitioning is fully managed by Snowflake
3-2. Cost Monitoring: Realtime Snowflake Cost Monitoring
Our Senior Data Engineer, Jensen, visualized real-time
cost dashboards with automated tools
This lets you immediately identify where your
costs go
You can query usage history directly with
Snowflake queries to estimate costs in real
time:
- Warehouse type
- Runtime (minutes or hours)
- Credits per warehouse-hour
- Cost per credit
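The four factors above combine into a simple estimate; a sketch with illustrative numbers (the credits-per-hour table follows Snowflake's published doubling pattern starting at 1 credit/hour for XS, and the dollars-per-credit rate is an assumption that depends on edition and region):

```python
# Illustrative credits consumed per hour by warehouse size;
# check Snowflake's pricing docs for your actual rates.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def estimate_cost(warehouse: str, minutes: float,
                  dollars_per_credit: float = 3.0) -> float:
    """Estimated USD cost = credits/hour * hours * $/credit."""
    hours = minutes / 60
    return CREDITS_PER_HOUR[warehouse] * hours * dollars_per_credit

# e.g. a Medium warehouse running 90 minutes at $3/credit
cost = estimate_cost("M", 90)  # 4 credits/hr * 1.5 hr * $3 = $18
```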
3-2. Cost Monitoring: Daily cost review
- Data engineers gather to review cost monitoring and surface issues before the daily stand-up starts
- Problems found are fixed immediately
- An incident retrospective is held for significant cost overruns
3-2. Cost Monitoring: Predicting and Retrospecting Costs
- Before running a large backfill, estimate how much it will cost
and get spending approval from the lead
- After the backfill, compare estimated against actual spend and review
- The more records accumulate, the more accurate estimates become,
and the faster inefficient queries can be recognized and tuned
Thanks to Jensen
Senior Data Engineer
https://www.linkedin.com/in/jensen-yap-8020b535/
Join NFTBank! 🚀
https://nftbank.breezy.hr/
NFTBank Medium | Apply to NFTBank | NFTBank YouTube
Thank you