Designing Scalable and Extendable Data Pipeline for Call Of Duty Games

This document discusses scaling and extending the data pipeline for Call of Duty games. Some key points: 1) The data pipeline uses partitioned Apache Kafka topics for scalability, but too many topics create operational overhead, so topic names follow a convention that expresses data types rather than producers or consumers. 2) A stream processing layer called "Refinery" measures, validates, enriches, filters and routes messages, which keeps the pipeline flexible for new use cases. 3) A unified message envelope and a Schema Registry supporting multiple formats standardize the message format, so new games can be integrated into the pipeline easily.

Yaroslav Tkachenko
Senior Data Engineer at Activision

600+ topics in the biggest cluster (Apache Kafka)

Scaling the data pipeline even further (in order of increasing complexity)
• Volume: well-known industry techniques
• Games: using previous experience
• Use-cases: completely unpredictable

Kafka topics are partitioned and replicated
[Figure: a Kafka topic split into Partition 1, Partition 2 and Partition 3, each an ordered sequence of numbered messages, with consumers or producers attached to the partitions]

We need to keep the number of topics and partitions low
More topics mean more operational burden.
The number of partitions in a fixed cluster is not infinite.
Autoscaling Kafka is impossible, scaling is hard.

Topic naming convention
$env.$source.$title.$category-$version
prod.glutton.1234.telemetry_match_event-v1
• $source (glutton): producer
• $title (1234): unique game id, e.g. “CoD WW2 on PSN”
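
To make the convention concrete, here is a minimal sketch of a parser for names of the form $env.$source.$title.$category-$version. The `TopicName` type and the `parse_topic_name` function are hypothetical names introduced only for illustration; they are not part of the original pipeline.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    """Fields of a topic name (hypothetical helper type)."""
    env: str
    source: str
    title: str
    category: str
    version: str


def parse_topic_name(topic: str) -> TopicName:
    """Split '$env.$source.$title.$category-$version' into its fields."""
    # Raises ValueError if the name has fewer than four dot-separated parts.
    env, source, title, rest = topic.split(".", 3)
    # The version is whatever follows the last '-' in the final part.
    category, sep, version = rest.rpartition("-")
    if not sep or not category or not version:
        raise ValueError(f"missing '-$version' suffix in topic name: {topic!r}")
    return TopicName(env, source, title, category, version)


parsed = parse_topic_name("prod.glutton.1234.telemetry_match_event-v1")
print(parsed.source, parsed.title, parsed.category)
```

Note that parsing a name is not the same as enforcing it: nothing in this sketch stops a producer from writing dev data to a topic whose name starts with prod.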

A proper solution was invented decades ago.
Think about databases.

A messaging system IS a form of database
Data topic = Database + Table.
Data topic = Namespace + Data type.

Compare this:
prod.glutton.1234.telemetry_match_event-v1
dev.user_login_records.4321.all-v1
prod.marketplace.5678.purchase_event-v1
with this:
telemetry.matches
user.logins
marketplace.purchases

Each approach has pros and cons
• Topics that use metadata for their names are obviously easier to track and monitor (and even consume).
• As a consumer, I can consume exactly what I want, instead of consuming a single large topic and extracting required values.
• These dynamic fields can and will change. Producers (sources) and consumers will change.
• Very efficient utilization of topics and partitions.
• Finally, it’s impossible to enforce any constraints with a topic name. And you can always end up with dev data in a prod topic and vice versa.

After removing necessary metadata from the topic names, stream processing becomes mandatory.

Stream processing becomes mandatory
Measuring → Validating → Enriching → Filtering & routing
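
The four stages can be sketched as plain functions chained in order. This is only an illustration under stated assumptions: messages are modelled as dicts, and the required header fields, the routing table and the rule that drops non-prod traffic are all hypothetical, not taken from the real pipeline.

```python
from collections import Counter
from typing import Optional, Tuple

REQUIRED_FIELDS = ("env", "source", "category")   # hypothetical header fields
ROUTES = {                                         # hypothetical routing table
    "match_event": "telemetry.matches",
    "login": "user.logins",
}
metrics = Counter()


def measure(message: dict) -> None:
    """Measuring: count every message seen, valid or not."""
    metrics[message.get("category", "unknown")] += 1


def validate(message: dict) -> bool:
    """Validating: require the metadata that no longer lives in the topic name."""
    return all(field in message for field in REQUIRED_FIELDS)


def enrich(message: dict, now_ms: int) -> dict:
    """Enriching: attach an ingestion timestamp without mutating the input."""
    return {**message, "ingested_at": now_ms}


def route(message: dict) -> Optional[str]:
    """Filtering & routing: drop non-prod traffic, map category to a topic."""
    if message["env"] != "prod":
        return None
    return ROUTES.get(message["category"])


def process(message: dict, now_ms: int) -> Optional[Tuple[str, dict]]:
    """Run all four stages; return (destination topic, message) or None."""
    measure(message)
    if not validate(message):
        return None
    enriched = enrich(message, now_ms)
    topic = route(enriched)
    return None if topic is None else (topic, enriched)


print(process({"env": "prod", "source": "glutton", "category": "match_event"}, 1000))
```

Because routing decisions are made from message metadata rather than from the topic name, changing the table is enough to send a data type to a new destination.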

Having a single message schema for a topic is more than just a nice-to-have.

[Figure: a stream processor receiving messages in JSON, Protobuf, Avro and custom formats, plus several formats marked "?"]

Custom deserialization
// Application.java
props.put("value.deserializer", "com.example.CustomDeserializer");

// CustomDeserializer.java
public class CustomDeserializer implements Deserializer<???> {
    @Override
    public ??? deserialize(String topic, byte[] data) {
        ???
    }
}

Message envelope anatomy
Message = Header / Metadata + Body / Payload
• Header / Metadata: ID, env, timestamp, source, game, ...
• Body / Payload: Event

Unified message envelope
syntax = "proto2";

message MessageEnvelope {
    optional bytes message_id = 1;
    optional uint64 created_at = 2;
    optional uint64 ingested_at = 3;
    optional string source = 4;
    optional uint64 title_id = 5;
    optional string env = 6;
    optional UserInfo resource_owner = 7;
    optional SchemaInfo schema_info = 8;
    optional string message_name = 9;
    optional bytes message = 100;
}
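
To show why the envelope helps, here is a minimal sketch that mirrors a subset of the MessageEnvelope fields in a Python dataclass. It is not generated Protobuf code: `resource_owner` and `schema_info` are left out because their types are not shown in the slides, the `wrap` and `is_prod_match_event` helpers are hypothetical, and the timestamp unit is an assumption.

```python
import json
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MessageEnvelope:
    message_id: bytes
    created_at: int                     # producer timestamp; unit is an assumption
    source: str
    title_id: int
    env: str
    message_name: str
    message: bytes                      # opaque payload in any game-specific format
    ingested_at: Optional[int] = None   # would be filled in by the pipeline


def wrap(payload: bytes, *, source: str, title_id: int, env: str,
         message_name: str, created_at: int) -> MessageEnvelope:
    """Attach header metadata to a payload without inspecting the payload."""
    return MessageEnvelope(
        message_id=uuid.uuid4().bytes,
        created_at=created_at,
        source=source,
        title_id=title_id,
        env=env,
        message_name=message_name,
        message=payload,
    )


def is_prod_match_event(envelope: MessageEnvelope) -> bool:
    """A header-only check: the payload bytes are never decoded here."""
    return envelope.env == "prod" and envelope.message_name == "match_event"


# Two payloads in different formats travel through the same header logic.
json_payload = json.dumps({"match_id": 7}).encode("utf-8")
binary_payload = b"\x08\x07"
for payload in (json_payload, binary_payload):
    envelope = wrap(payload, source="glutton", title_id=1234, env="prod",
                    message_name="match_event", created_at=1000)
    print(is_prod_match_event(envelope), len(envelope.message))
```

The design point is that a stream processor needs only one deserializer, for the envelope itself; the payload stays as bytes until a consumer that knows its schema decodes it.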

Schema Registry
• API to manage message schemas
• Single source of truth for all producers and consumers
• It should be impossible to send a message to the pipeline without registering its schema in the Schema Registry!
• A good Schema Registry supports immutability, versioning and basic validation
• Activision uses a custom Schema Registry implemented with Python and Cassandra
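
The three properties named above (immutability, versioning, basic validation) can be sketched with a small in-memory class. This is not Activision's implementation, which the slides only describe as Python and Cassandra; here a "schema" is reduced to a tuple of required field names, and every method name is an assumption made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


class SchemaRegistry:
    """In-memory stand-in: schemas are append-only and addressed by version."""

    def __init__(self) -> None:
        # Maps schema name to its versions; list index is version - 1.
        self._schemas: Dict[str, List[Tuple[str, ...]]] = {}

    def register(self, name: str, required_fields: Iterable[str]) -> int:
        """Versioning: every registration creates a new version number."""
        versions = self._schemas.setdefault(name, [])
        # Immutability: stored as a tuple, and there is no update or delete.
        versions.append(tuple(required_fields))
        return len(versions)

    def get(self, name: str, version: int) -> Tuple[str, ...]:
        """Look up a schema; unregistered schemas raise KeyError."""
        versions = self._schemas.get(name, [])
        if not 1 <= version <= len(versions):
            raise KeyError(f"unregistered schema: {name} v{version}")
        return versions[version - 1]

    def validate(self, name: str, version: int, payload: dict) -> bool:
        """Basic validation: all required fields are present in the payload."""
        return all(field in payload for field in self.get(name, version))


registry = SchemaRegistry()
v1 = registry.register("match_event", ["match_id"])
v2 = registry.register("match_event", ["match_id", "map"])
print(v1, v2, registry.validate("match_event", v2, {"match_id": 7, "map": "x"}))
```

A producer-side gate could call `validate` before sending, so that a message with an unregistered schema fails with `KeyError` instead of reaching the pipeline.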

Summary: scaling and extending the data pipeline
Games
• By using a unified message envelope and topic names, adding a new game becomes almost effortless
• “Operational” stream processing makes it possible
• Still flexible enough: each game can use its own message payload format via Schema Registry
Use-cases
• Topic names express data types, not producers or consumers
• Stream filtering & routing allows low-cost experiments
• Data catalog built on top of Schema Registry promotes data discovery

4. 1+ PB Data lake size (AWS S3)

6. 10k+ Messages per second (Apache Kafka)

18. Refinery

20. Number of supported message formats 8

27. Thanks! @sap1ens