Change data capture with MongoDB and Kafka.

This document outlines a change data capture architecture using MongoDB and Kafka, detailing a tech stack that includes React.js, Node.js, and Ruby on Rails. It discusses various components such as user recommendation systems, schema management with Avro, and indexing with Elasticsearch, along with the pros and cons of using Samza for data transformations. The author also hints at future improvements, including user interaction logs and analytics integration.

1. Change Data Capture with Mongo + Kafka
By Dan Harvey

3. High level stack
• React.js - Website
• Node.js - API Routing
• Ruby on Rails + MongoDB - Core API
• Java - Opinion Streams, Search, Suggestions
• Redshift - SQL Analytics

4. Problems
• Keep user experience consistent
• Streams / search index need to update
• Keep developers efficient
• Loosely couple services
• Trust denormalisations

5. Use case
• User to User recommender
• Suggest "interesting" users to a user
• Update as soon as you make a new opinion
• Instant feedback for contributing content

6. Log transformation
[Diagram. Labels: Rails API (JSON/BSON), Mongo, oplog, Optailer, Kafka: Opinion Topic, User Recommender, Kafka: User Topic, Java Services (Avro). Stages: change data capture, stream processing.]

7. Op(log)tailer
• Converts BSON/JSON to Avro
• Guarantees latest document in topic (eventually)
• Does not guarantee all changes
• Compacting Kafka topic (only keeps latest)
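
The "only keeps latest" behaviour of a compacted topic can be illustrated with a small in-memory sketch. This is not the Optailer or Kafka itself: the record shape and the `compact` helper are hypothetical, and real compaction runs in the background on the broker, so older records can remain visible for a while.

```python
from typing import Dict, List, Tuple

# Hypothetical change record: (document id, latest document body).
Record = Tuple[str, dict]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the most recent record per key, ordered by last write."""
    latest: Dict[str, dict] = {}
    for key, value in log:
        # Remove and re-insert so the key moves to the position of its newest write.
        latest.pop(key, None)
        latest[key] = value
    return list(latest.items())


log = [
    ("user-1", {"name": "Ann", "opinions": 1}),
    ("user-2", {"name": "Bob", "opinions": 0}),
    ("user-1", {"name": "Ann", "opinions": 2}),
]
compacted = compact(log)
print(compacted)
```

A consumer that reads the compacted result sees the latest version of every document but not every intermediate change, which matches the two guarantees listed on the slide.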

8. Avro Schemas
• Each Kafka topic has a schema
• Schemas evolve over time
• Readers and Writers will have different schemas
• Allows us to update services independently
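
A rough sketch of why readers and writers can hold different schemas: when a reader decodes a record, fields it expects but the writer did not send are filled from the reader's defaults, and fields it does not know are dropped. The code below imitates that rule with plain dictionaries. It does not use the Avro library, and the field names and the `resolve` helper are made up for illustration.

```python
REQUIRED = object()  # marker for a reader field that has no default


def resolve(record: dict, reader_fields: dict) -> dict:
    """Project a written record onto the reader's fields (simplified Avro-style rule)."""
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not REQUIRED:
            resolved[name] = default  # field added after the record was written
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved  # writer fields the reader does not know are dropped


# Hypothetical user schemas: v2 adds "opinion_count" with a default of 0.
reader_v1 = {"id": REQUIRED, "name": REQUIRED}
reader_v2 = {"id": REQUIRED, "name": REQUIRED, "opinion_count": 0}

old_record = {"id": "u1", "name": "Ann"}
new_record = {"id": "u1", "name": "Ann", "opinion_count": 3}

upgraded = resolve(old_record, reader_v2)    # new reader, old data
downgraded = resolve(new_record, reader_v1)  # old reader, new data
print(upgraded, downgraded)
```

Because both directions succeed, a producer and its consumers can be deployed in either order, which is the independence the slide refers to.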

9. Schema Changes
• Schema to ID managed by Confluent registry
• Readers and writers discover schemas
• Avro deals with resolution to compiled schema
• Must be forwards and backwards compatible
[Diagram: Kafka message: byte[], made up of schema ID: int and message: byte[]]
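
The message layout on this slide (a schema ID followed by the encoded message) can be sketched as below. The sketch assumes the Confluent wire format, in which a single zero "magic" byte precedes a 4-byte big-endian schema ID. The slide shows only the ID and the payload, so the magic byte is an assumption about this deck's setup, and the `frame` and `unframe` helpers are hypothetical names.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # assumed: Confluent wire format begins with a zero byte
HEADER = struct.Struct(">bI")  # 1-byte magic + 4-byte big-endian schema ID


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an already Avro-encoded payload with the schema ID header."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into (schema ID, Avro payload)."""
    magic, schema_id = HEADER.unpack(message[:HEADER.size])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER.size:]


framed = frame(42, b"\x02hi")
schema_id, payload = unframe(framed)
print(framed, schema_id, payload)
```

A reader would use the recovered ID to look up the writer's schema in the registry before decoding the payload.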

10. Search indexing
• User / Topic / Opinion search
• Re-use Kafka topics from before
• Index from Kafka to Elasticsearch
• Need to update quickly and reliably

11. Samza Indexers
• Index from Kafka to Elasticsearch
• Used Samza for transform and loading
• Far less code than Java Kafka consumers
• Stores offsets and state in Kafka
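
One way to picture an indexer that stores its offsets durably: it periodically checkpoints the next offset to read, and after a failure it resumes from the last checkpoint. Some records are then indexed twice, which is harmless when each write is an upsert keyed by document ID. The sketch below is a plain in-memory simulation under those assumptions, not Samza or Elasticsearch code, and the `Indexer` class, its parameters, and the sample records are all hypothetical.

```python
class Indexer:
    def __init__(self):
        self.index = {}      # stand-in for the search index: doc id -> document
        self.checkpoint = 0  # next offset to read (kept in Kafka in the real system)

    def run(self, topic, checkpoint_every=2, crash_at=None):
        """Index records from the checkpoint onward; return False on a simulated crash."""
        for offset in range(self.checkpoint, len(topic)):
            if offset == crash_at:
                return False  # fail before this record; uncheckpointed work is replayed
            doc_id, doc = topic[offset]
            self.index[doc_id] = doc  # upsert, so replaying a record is idempotent
            if (offset + 1) % checkpoint_every == 0:
                self.checkpoint = offset + 1
        self.checkpoint = len(topic)
        return True


topic = [
    ("u1", {"name": "Ann"}),
    ("u2", {"name": "Bob"}),
    ("u1", {"name": "Ann B"}),
    ("u3", {"name": "Cy"}),
    ("u2", {"name": "Bobby"}),
]

indexer = Indexer()
finished_first = indexer.run(topic, crash_at=3)  # stops before offset 3
checkpoint_after_crash = indexer.checkpoint      # offset 2 was indexed but not checkpointed
finished_second = indexer.run(topic)             # replays offset 2, then continues
print(checkpoint_after_crash, indexer.checkpoint, indexer.index)
```

The replayed record at offset 2 overwrites the same document with the same content, so the final index is identical to an uninterrupted run.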

12. Elasticsearch Producer
• Samza consumers/producers deal with I/O
• Wrote new ElasticsearchSystemProducer
• Contributed back to Samza project
• Included in Samza 0.10.0 (released soon)

13. Samza Good/Bad
• Good API
• Simple transformations easy
• Simple ops: logging, metrics all built in
• Only depends on Kafka
• Inbuilt state management
• Joins tricky, need consistent partitioning
• Complex flows are hard (Flink/Spark better)
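
The point about joins needing consistent partitioning can be shown with a small simulation. If both topics are keyed by the same ID, hashed the same way, and split into the same number of partitions, every record for a given key lands in the same partition number, so each task can join locally. The sketch uses `zlib.crc32` purely as a deterministic stand-in for the producer's real partitioner, and the helper names and sample data are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic key -> partition mapping (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_topic(records, num_partitions):
    """Distribute (key, value) records across partitions by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def local_join(user_partition, opinion_partition):
    """Join opinions to users using only the data held in one partition."""
    users = dict(user_partition)
    return [(users[key], opinion) for key, opinion in opinion_partition if key in users]


users = [("user-1", "Ann"), ("user-2", "Bob"), ("user-3", "Cy")]
opinions = [("user-1", "likes Kafka"), ("user-3", "likes Avro"), ("user-1", "likes Samza")]

NUM_PARTITIONS = 4  # both topics must use the same count for the join to work
user_parts = partition_topic(users, NUM_PARTITIONS)
opinion_parts = partition_topic(opinions, NUM_PARTITIONS)

joined = []
for p in range(NUM_PARTITIONS):
    joined.extend(local_join(user_parts[p], opinion_parts[p]))
print(sorted(joined))
```

If the two topics used different partition counts or different keys, some opinions would sit in a partition that does not hold the matching user, and the local join would silently drop them.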

14. Decoupling Good/Bad
• Easy to try out complex new services
• Easy to keep data stores in sync, low latency
• Started to duplicate core logic
• More overhead with more services
• Need high level framework for denormalisations
• Samza SQL being developed

15. Ruby Workers
• Ruby Kafka consumers not great…
• Optailer to AWS SQS (Shoryuken gem)
• No order guarantee like Kafka topics
• But guaranteed trigger off database writes
• Better for core data transformations

16. Future
• Segment.io user interaction logs to Kafka
• Use in product, view counts, etc…
• Fill Redshift for analytics (currently batch)
• Kafka CopyCat instead of our Optailer
• Avro transformation in Samza

17. Questions?
• email: [email protected]
• twitter: @danharvey