Upleveling Analytics with Kafka
Amy Chen
Amy Chen, Kafka Summit 2023
Amy Chen
dbt Labs
they/them
@notamyfromdbt
amy@dbtlabs.com
“Always a data practitioner,
never a stakeholder”
The framework of the conversation
Yes, it is that hard.
Who is Apache Kafka For?
● To my audience:
○ What is your experience level?
○ How long did it take for you to get to this level?
Folks with `Analyst` in their LinkedIn profile:
22,100,000
Evidence - Merit
● Analytics Engineers own:
○ The data contracts that dictate how a Kafka message relates to a dbt model
○ Debugging upstream dbt
issues and upstream
○ Updating Kafka topics as
needed
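A data contract of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `orders` topic, its field names, and the helper `contract_violations` are hypothetical and do not come from the talk. The idea is that the contract lists the fields and types a downstream dbt model expects, and every Kafka message can be checked against it.

```python
import json

# Hypothetical contract for an "orders" topic: field name -> required Python type.
ORDERS_CONTRACT = {
    "order_id": str,
    "amount": float,
    "created_at": str,
}


def contract_violations(raw_message: bytes, contract: dict) -> list:
    """Return readable violations; an empty list means the message conforms."""
    try:
        payload = json.loads(raw_message)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return problems


good = b'{"order_id": "A-1", "amount": 19.99, "created_at": "2023-09-26T10:00:00Z"}'
bad = b'{"order_id": 42, "amount": 19.99}'

print(contract_violations(good, ORDERS_CONTRACT))  # []
print(contract_violations(bad, ORDERS_CONTRACT))   # two violations: order_id type, created_at missing
```

In practice a schema registry would usually enforce this at the topic level; the sketch only shows what the contract is checking.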
Hypothesis: Apache Kafka is for Analysts
Why do you want an analyst to upskill?
● Career progression
● Upleveling the data team/more
hands on deck
My first data stack (without even knowing it)
*Joke/Diagram courtesy of Xebia Data
The next data stack
So why am I telling you my
life story?
● Because it doesn't start with Apache Kafka.
● Skills ahead of time:
○ Command Line
○ AWS architecture: IAM Roles, VPNs, EC2
○ Data Warehousing and modeling
○ Resource Monitoring
Experimentation & Testing the Hypothesis
How to learn Kafka
What didn't work
● Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale (book)
● Community Slacks
What worked
● Friends (the text messages were weird)
● Confluent & Snowflake's developer workshops
● Stack Overflow
● Medium posts
Analytics Engineering & Kafka: The experiment metrics
● Build a working streaming pipeline
● Apply analytics engineering best practices to make it production ready
○ Testing
○ Documentation
○ Version Control
○ Scalability
● Variable
○ Managed Kafka Service: AWS MSK and Confluent Cloud
My first end to end streaming pipeline
Loading the Data: AWS MSK
Tools:
● CloudFormation script to set up the Amazon MSK (Managed Streaming for Apache Kafka) architecture
● Snowflake Kafka Connector
Actions:
● Used CloudFormation to set up the infrastructure, including an EC2 instance, Kafka cluster, Linux jumphost, and IAM roles
● Installed the Kafka Connector for Snowpipe Streaming
● Set up a producer and topic in the MSK cluster to ingest from a REST API
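The producer step (REST API results into a topic) comes down to turning each API record into a key and a value, both as bytes. The sketch below is a minimal illustration, not the pipeline from the talk: the `id` key field, the sample payload, and the `send` callable are all assumptions. `send` stands in for whatever produce call a real Kafka client library provides, so the example runs without a broker.

```python
import json
from typing import Callable, Iterable


def to_kafka_record(event: dict, key_field: str = "id") -> tuple:
    """Turn one REST API result into the (key, value) byte pair a producer sends."""
    key = str(event[key_field]).encode("utf-8")
    # sort_keys keeps the serialized form stable, which makes diffs and tests easier.
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return key, value


def publish_all(events: Iterable[dict], send: Callable[[bytes, bytes], None]) -> int:
    """Hand every event to `send` and return how many were handed over."""
    count = 0
    for event in events:
        key, value = to_kafka_record(event)
        send(key, value)
        count += 1
    return count


# Stand-in for a real producer: collect records in a list instead of a broker.
sent = []
api_response = [{"id": 1, "status": "ok"}, {"id": 2, "status": "late"}]
publish_all(api_response, lambda k, v: sent.append((k, v)))
print(sent[0])  # (b'1', b'{"id": 1, "status": "ok"}')
```

Keeping serialization separate from sending means the record shape can be unit tested without any Kafka infrastructure.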
Insert in Kafka Picture
Look, data! ✨
Loading the Data: Confluent Cloud
Tools:
● Confluent Cloud
● Snowflake Snowpipe
Actions:
● Created the topic
● Created the Snowflake Sink as a connector in the UI
● Provided credentials
● Data in Snowflake
Loading the Data: the Metrics
● Tested: ✅ ❌
○ Used a console consumer for ad hoc checks
○ Up next: Schema Registry; no unit testing
● Documented: ✅
○ In the dbt project
● Version Controlled: ❌
○ Terraform is overkill for pet projects
● Scalability: ✅
○ AWS MSK - easy to switch out
Loading the Data
Lessons Learned:
● No version control logic - how do I save configurations?
● Security access is a large determinant of success
● AWS MSK - know your bash commands well to debug
● Confluent - UI errors vs third-party errors
● Personal issue: cross-regional dependencies
● Overall - know the bigger picture of the connections
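One lightweight answer to "how do I save configurations?" is to generate the connector properties from a small script and commit the rendered, secret-free output. The sketch below assumes the self-managed Snowflake sink connector; the property names are written as best recalled from its documentation and should be verified against the connector version in use. The environment variable names, the `LOADER` role, and the helper functions are hypothetical.

```python
import json
import os

# Assumed mapping of secret connector properties to environment variables.
SECRET_PROPERTIES = {
    "snowflake.url.name": "SNOWFLAKE_URL",
    "snowflake.user.name": "SNOWFLAKE_USER",
    "snowflake.private.key": "SNOWFLAKE_PRIVATE_KEY",
}


def build_connector_config(topic: str, database: str, schema: str) -> dict:
    """Non-secret sink connector properties; safe to commit to version control."""
    return {
        "name": f"snowflake-sink-{topic}",
        "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
        "topics": topic,
        "snowflake.ingestion.method": "SNOWPIPE_STREAMING",
        "snowflake.database.name": database,
        "snowflake.schema.name": schema,
        "snowflake.role.name": "LOADER",  # hypothetical role name
    }


def with_secrets(config: dict, env=os.environ) -> dict:
    """Copy of config with secrets filled in from env at deploy time."""
    merged = dict(config)
    for prop, variable in SECRET_PROPERTIES.items():
        merged[prop] = env[variable]  # raises KeyError if a variable is missing
    return merged


def render(config: dict) -> str:
    """Serialize with sorted keys so the committed file produces clean diffs."""
    return json.dumps(config, indent=2, sort_keys=True)


print(render(build_connector_config("orders", "RAW", "KAFKA")))
```

The committed file never contains credentials; `with_secrets` is only called on the machine that registers the connector.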
Transforming the Data
Tools:
● dbt Cloud
● Snowflake
Actions:
● Used dbt to version control my logic in dbt models and created dynamic tables inside Snowflake
● Applied tests and documentation to my dbt models
Transforming the Data: the Metrics
● ✅ Tested:
○ dbt tests
● ✅ Documented:
○ In the dbt project
● ✅ Version Controlled:
○ dbt GitHub repository
● ✅ Scalability:
○ Performance levers in place
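For readers who have not used dbt tests: two of dbt's built-in generic tests are `not_null` and `unique`, which are normally declared in YAML and compiled to SQL. The plain-Python sketch below only mimics what those two tests assert, on a made-up list of rows; the column names and data are assumptions for illustration, and this is not how dbt itself runs them.

```python
from collections import Counter


def not_null_failures(rows: list, column: str) -> list:
    """Rows where `column` is missing or None (what a not_null test flags)."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows: list, column: str) -> list:
    """Non-null values of `column` that occur more than once (what a unique test flags)."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical rows, as they might land in the warehouse from a topic.
rows = [
    {"order_id": "A-1", "amount": 10.0},
    {"order_id": "A-2", "amount": None},
    {"order_id": "A-2", "amount": 5.0},
]

print(not_null_failures(rows, "amount"))   # [{'order_id': 'A-2', 'amount': None}]
print(unique_failures(rows, "order_id"))   # ['A-2']
```

A test passes when its failure list is empty, which mirrors how a dbt test passes when its query returns no rows.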
Visualizing the Data
Tools:
● Hex
Actions:
● Created a notebook that selected from the dynamic table
Drawing Conclusions
The Metrics
✅ Data from Kafka topic to notebook
❌ ✅ Testing implemented
✅ Documented entire pipeline
❌ ✅ Version controlled
⏲ Scalability
Conclusion: Apache Kafka is for Analysts
● More hands on deck with business knowledge
● Career advancement
● Learn some software development best practices
What does an analyst actually need to know?
● Dependency management
○ Where is your source coming from? What happens if there's a change upstream?
○ Data contracts
● How to debug upstream
○ Your report broke - how do you work backwards?
● SQL/Git/CLI
○ Have to flatten that JSON blob somehow
○ Version control & speedy development
○ Debugging
● Cost/Performance optimization
○ When do you need streaming?
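"Flatten that JSON blob" usually means turning a nested message into one flat set of columns. In this stack that would more likely be done in SQL inside a dbt model; the Python sketch below only shows the logic. The sample message, the `_` separator, and the choice to keep arrays as JSON strings are assumptions for illustration.

```python
import json


def flatten(value, prefix: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into one level; lists are kept as JSON strings."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(child, name, sep))
    elif isinstance(value, list):
        # Arrays need their own modeling decision (explode vs keep); keep them here.
        flat[prefix] = json.dumps(value)
    else:
        flat[prefix] = value
    return flat


message = '{"order": {"id": "A-1", "customer": {"country": "US"}}, "items": [1, 2]}'
print(flatten(json.loads(message)))
# {'order_id': 'A-1', 'order_customer_country': 'US', 'items': '[1, 2]'}
```

Each flattened key maps naturally to a column name, which is why an upstream rename or a newly nested field can break a report downstream.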
What can be blockers?
● Security
○ How much access to the infrastructure do you have?
● Data governance
○ How do you maintain PII data?
● Cross-team reliance
○ How do other teams work?
○ Data contracts
Thank you!
Amy Chen
@notamyfromdbt
amy@dbtlabs.com

