Change Data Capture with Mongo + Kafka
By Dan Harvey
Change data capture with MongoDB and Kafka.
High level stack
React.js - Website
Node.js - API Routing
Ruby on Rails + MongoDB - Core API
Java - Opinion Streams, Search, Suggestions
Redshift - SQL Analytics
Problems
• Keep user experience consistent
• Streams / search index need to update
• Keep developers efficient
• Loosely couple services
• Trust denormalisations
Use case
• User to User recommender
• Suggest “interesting” users to a user
• Update as soon as you make a new opinion
• Instant feedback for contributing content
Log transformation
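[Diagram: the Rails API writes JSON/BSON documents (User, Opinion) to Mongo. The Optailer reads the Mongo oplog and publishes Avro records to "Kafka: User Topic" and "Kafka: Opinion Topic" (labelled "Change data capture"). Java Services such as the User Recommender consume those topics (labelled "Stream processing").]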
Op(log)tailer
• Converts BSON/JSON to Avro
• Guarantees latest document in topic (eventually)
• Does not guarantee all changes
• Compacting Kafka topic (only keeps latest)
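The compaction behaviour in these bullets can be illustrated with a small in-memory model. This is a sketch of the semantics only, not Kafka's implementation and not the Optailer's code; the `compact` function and the sample records are hypothetical. It keeps the last value written for each key and assumes a `None` value marks a deleted document.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (document id, latest document, or None if the document was deleted)
Record = Tuple[str, Optional[dict]]


def compact(log: Iterable[Record]) -> List[Record]:
    """Return the compacted view of a log: the last record per key."""
    latest: Dict[str, Optional[dict]] = {}
    for key, value in log:
        # Re-insert so the key's position reflects its most recent write.
        latest.pop(key, None)
        latest[key] = value
    # Drop keys whose most recent record is a delete marker.
    return [(k, v) for k, v in latest.items() if v is not None]


log = [
    ("user-1", {"name": "Ann"}),
    ("user-2", {"name": "Bob"}),
    ("user-1", {"name": "Ann", "bio": "hi"}),
    ("user-2", None),
]
compacted = compact(log)
```

The first version of `user-1` is gone after compaction, which matches "does not guarantee all changes", while a consumer that reads the topic to the end still ends up with the latest document for every key.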
Avro Schemas
• Each Kafka topic has a schema
• Schemas evolve over time
• Readers and Writers will have different schemas
• Allows us to update services independently
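A rough sketch of why differing reader and writer schemas still work: when Avro resolves a record, fields are matched by name, writer fields the reader does not know are ignored, and reader fields the writer did not send are filled from the reader's default. The code below models only that rule for flat records in plain Python. It does not use an Avro library, and `resolve`, `NO_DEFAULT`, and the field names are hypothetical.

```python
from typing import Any, Dict

NO_DEFAULT = object()  # marks a reader field that has no default value


def resolve(record: Dict[str, Any], reader_fields: Dict[str, Any]) -> Dict[str, Any]:
    """Project a written record onto the reader's fields, applying defaults."""
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not NO_DEFAULT:
            resolved[name] = default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


reader_v1 = {"id": NO_DEFAULT, "name": NO_DEFAULT}
reader_v2 = {"id": NO_DEFAULT, "name": NO_DEFAULT, "bio": ""}

old_record = {"id": 1, "name": "Ann"}                   # written with v1
new_record = {"id": 1, "name": "Ann", "bio": "hello"}   # written with v2

upgraded = resolve(old_record, reader_v2)    # new reader, old data
downgraded = resolve(new_record, reader_v1)  # old reader, new data
```

Because the added field carries a default, a service on the new schema can read old records and a service on the old schema can read new ones, so either side can be deployed first.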
Schema Changes
• Schema to ID managed by Confluent registry
• Readers and writers discover schemas
• Avro deals with resolution to compiled schema
• Must be forwards and backwards compatible
[Diagram: Kafka message: byte[] = schema ID: int, followed by message: byte[]]
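The message layout on this slide (a schema ID followed by the Avro-encoded bytes) can be sketched as below. The sketch assumes the framing used by the Confluent serializers, which is one magic byte with value 0, then a 4-byte big-endian schema ID, then the payload; the slide itself only shows the ID and the message bytes. The function names `frame` and `unframe` are hypothetical, and the payload is treated as opaque bytes.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0   # assumption: Confluent wire format marker
HEADER_SIZE = 5  # 1 magic byte + 4-byte schema ID


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a Kafka message value into (schema ID, Avro payload)."""
    if len(message) < HEADER_SIZE:
        raise ValueError("message is shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[HEADER_SIZE:]


framed = frame(42, b"\x02\x06Ann")
schema_id, payload = unframe(framed)
```

A reader would look up the writer schema for `schema_id` in the registry and then decode `payload` against its own compiled schema.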
Search indexing
• User / Topic / Opinion search
• Re-use Kafka topics from before
• Index from Kafka to Elasticsearch
• Need to update quickly and reliably
Samza Indexers
• Index from Kafka to Elasticsearch
• Used Samza for transform and loading
• Far less code than Java Kafka consumers
• Stores offsets and state in Kafka
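The transform step of such an indexer can be sketched without Samza itself. The following is a plain-Python illustration, not Samza code and not the code from the talk; the function name `to_bulk_lines` and the index name `users` are hypothetical. It turns one change record from a topic into lines for the Elasticsearch bulk API, assuming a null value means the document was deleted.

```python
import json
from typing import List, Optional


def to_bulk_lines(index: str, doc_id: str, doc: Optional[dict]) -> List[str]:
    """Build Elasticsearch bulk API lines for one (key, value) change record."""
    if doc is None:
        # A delete needs only the action line.
        return [json.dumps({"delete": {"_index": index, "_id": doc_id}})]
    # An index action is the action line followed by the document source.
    return [
        json.dumps({"index": {"_index": index, "_id": doc_id}}),
        json.dumps(doc),
    ]


upsert_lines = to_bulk_lines("users", "user-1", {"name": "Ann"})
delete_lines = to_bulk_lines("users", "user-2", None)
bulk_body = "\n".join(upsert_lines + delete_lines) + "\n"
```

Indexing by document ID makes the write idempotent: if the job restarts from an older stored offset and replays some messages, the same documents are simply overwritten.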
Elasticsearch Producer
• Samza consumers/producers deal with I/O
• Wrote new ElasticsearchSystemProducer
• Contributed back to Samza project
• Included in Samza 0.10.0 (released soon)
Samza Good/Bad
• Good API
• Simple transformations easy
• Simple ops: logging, metrics all built in
• Only depends on Kafka
• Inbuilt state management
• Joins tricky, need consistent partitioning
• Complex flows are hard (Flink/Spark better)
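The partitioning requirement for joins can be shown with a toy model. A task that owns partition N of one topic can only join against records in partition N of the other, so both topics need the same key, the same partition count, and the same partitioner. The sketch below is an assumption-laden illustration: `partition_for` uses CRC32 as a stand-in hash (Kafka's default Java partitioner uses murmur2), and the topic contents are made up.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in partitioner: hash the key and take it modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records: Iterable[Tuple[str, str]], num_partitions: int) -> Dict[int, List[Tuple[str, str]]]:
    """Group (key, value) records by the partition their key hashes to."""
    partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


users = [("user-1", "Ann"), ("user-2", "Bob")]
# Opinions are keyed by their author's user ID, not by opinion ID.
opinions = [("user-1", "likes Kafka"), ("user-2", "likes Samza"), ("user-1", "likes Avro")]

user_parts = route(users, 4)
opinion_parts = route(opinions, 4)

joined = []
for partition, opinion_records in opinion_parts.items():
    # Only the users in the same partition are visible to this task.
    local_users = dict(user_parts.get(partition, []))
    for user_id, text in opinion_records:
        joined.append((local_users[user_id], text))
```

If the opinions were keyed by opinion ID, or the two topics had different partition counts, `local_users[user_id]` could miss and the stream would have to be repartitioned first.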
Decoupling Good/Bad
• Easy to try out complex new services
• Easy to keep data stores in sync, low latency
• Started to duplicate core logic
• More overhead with more services
• Need high level framework for denormalisations
• Samza SQL being developed
Ruby Workers
• Ruby Kafka consumers not great…
• Optailer to AWS SQS (Shoryuken gem)
• No order guarantee like Kafka topics
• But guaranteed trigger off database writes
• Better for core data transformations
Future
• Segment.io user interaction logs to Kafka
• Use in product, view counts, etc…
• Fill Redshift for analytics (currently batch)
• Kafka CopyCat instead of our Optailer
• Avro transformation in Samza
Questions?
• email: dan@state.com
• twitter: @danharvey
