www.mapflat.com
Testing data streaming applications
Lars Albertsson, independent consultant
Øyvind Løkling, Schibsted Media Group
Who’s talking?
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (very large machines)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant
Why stream processing?
● Increasing number of data-driven features
● 90+% fed by batch processing
○ Simpler, better tooling
○ 1+ hour data reaction time
● Stream processing for
○ 100ms - 1 hour reaction
○ Decoupled, asynchronous microservices
[Diagram labels: User content, Professional content, Ads / partners, User behaviour, Systems, Ads, System diagnostics, Recommendations, Data-based features, Curated content, Pushing, Business intelligence, Experiments, Exploration]
The organic past
● Many paths
● Synchronous
● Link failure -> chain failure
● Heterogeneous
● Difficult to recover from transformation bugs
[Diagram labels: Service, App, DB, Poll, Queue, Aggregate logs, NFS, Hourly dump, Data warehouse, ETL, scp, HTTP]
The unified log
● Publish data in streams
● Replicated, sharded append-only log
● Pub / sub with history
○ Kafka, Google Pub/Sub, AWS Kinesis
● Tap to data lake for batch processing
[Diagram labels: Ads, Search, Feed, App, Stream, Unified log, Data lake]
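As a rough illustration of "pub / sub with history", the sketch below models one log partition as an in-memory append-only list. The names `UnifiedLog`, `append` and `read_from` are hypothetical and chosen for this example; they are not the API of Kafka, Google Pub/Sub or AWS Kinesis.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class UnifiedLog:
    """Toy single-partition append-only log (hypothetical, not a Kafka API)."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset: int) -> List[Any]:
        """Return every record at or after `offset`; history is never removed."""
        return self.records[offset:]


log = UnifiedLog()
for event in ["click", "view", "click"]:
    log.append(event)

# A consumer that starts late can read the full history from offset 0.
replayed = log.read_from(0)
# A consumer that has already processed offsets 0 and 1 only sees newer records.
tail = log.read_from(2)
```

Because producers only append and each consumer tracks its own offset, producers and consumers do not need to run at the same time.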
Stream processing
● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovers from link failures
● Replay on job bug fix
[Diagram labels: Ads, Search, Feed, App, Stream, Job, Data lake, Business intelligence]
Stream processing building blocks
● Aggregate
○ Calculate time windows
○ Aggregate state (in memory / local database / shared database)
● Filter
○ Slim down stream
○ Privacy, security concerns
● Join
○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
○ Join streams within time windows, e.g. click-through rate
● Transform
○ Bring data into same “shape”, schema
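The four building blocks can be sketched as plain functions over an iterable of events. This is a minimal illustration in Python, not Spark Streaming or Kafka Streams code; the event fields (`ts`, `user`, `ip`), the `GEO` lookup table and the 60-second window are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical click events: event time in seconds, user id, source IP.
events = [
    {"ts": 3, "user": "a", "ip": "10.0.0.1"},
    {"ts": 42, "user": "b", "ip": "10.0.0.2"},
    {"ts": 65, "user": "a", "ip": "10.0.0.1"},
    {"ts": 70, "user": None, "ip": "10.0.0.3"},
]
# Hypothetical static dataset used for enrichment (geo IP lookup).
GEO = {"10.0.0.1": "SE", "10.0.0.2": "NO"}


def keep_identified(stream):
    """Filter: drop events without a user id."""
    return (e for e in stream if e["user"] is not None)


def enrich_with_country(stream):
    """Join: look up each event's IP in the static dataset."""
    return ({**e, "country": GEO.get(e["ip"], "unknown")} for e in stream)


def to_common_shape(stream):
    """Transform: keep only the fields that downstream jobs agree on."""
    return ({"ts": e["ts"], "country": e["country"]} for e in stream)


def count_per_window(stream, window_s=60):
    """Aggregate: count events per tumbling window of `window_s` seconds."""
    counts = defaultdict(int)
    for e in stream:
        counts[e["ts"] // window_s * window_s] += 1
    return dict(counts)


shaped = list(to_common_shape(enrich_with_country(keep_identified(events))))
windows = count_per_window(shaped)
```

Here the aggregate state lives in memory; the slide's other options (local or shared database) would replace the `counts` dictionary.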
Stream processing technologies
● Spark Streaming
○ Ideal if you are already using Spark, same model
○ Bridges gap between data science / data engineers, batch and stream
● Kafka Streams
○ Library - new, positions itself as a lightweight alternative
○ Tightly coupled to Kafka
● Others
○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda
Egress
● Update database table, e.g. for polling dashboard
● Create service index table n+1. Notify service to switch.
● Post to external web service
● Push stream to client
[Diagram labels: Service, Stream, Job, App]
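The "index table n+1" pattern can be sketched as follows: the job writes a complete new table next to the one being served, then notifies the service to switch. `IndexService` and its methods are hypothetical names for this illustration, and the in-memory dictionaries stand in for real database tables.

```python
class IndexService:
    """Toy service that answers lookups from the currently active index table."""

    def __init__(self):
        self.tables = {}      # version number -> complete index table
        self.active = None    # version currently being served

    def load(self, version, table):
        """The job writes a complete new table next to the active one."""
        self.tables[version] = dict(table)

    def switch_to(self, version):
        """Notification from the job: start serving the given version."""
        if version not in self.tables:
            raise KeyError(f"table {version} has not been loaded")
        self.active = version

    def lookup(self, key):
        """Serve a lookup from the active table only."""
        return self.tables[self.active].get(key)


service = IndexService()
service.load(1, {"a": "old"})
service.switch_to(1)
service.load(2, {"a": "new"})   # readers still see table 1 at this point
before = service.lookup("a")
service.switch_to(2)
after = service.lookup("a")
```

Readers never observe a half-written table, because the switch happens only after table n+1 is complete.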
Test concepts
[Diagram labels: Test harness, Test fixture, System under test (SUT), 3rd party component (e.g. DB), Test input, Test oracle, Test framework (e.g. JUnit, Scalatest), Seam, IDEs, Build tools]
Potential test scopes
● Unit
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Choose stable interfaces
Each scope has a cost
[Diagram labels: Job, Service, App, Stream]
Stream application properties
● Output = function(input, code)
○ Perfect for testing!
○ Avoid: nondeterministic processing, reading wall clock
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
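A small sketch of "output = function(input, code)": the job below takes time only from the timestamps carried in its input, never from the wall clock, so the same input always yields the same output. The `sessions_started` function, the `ts` field and the 1800-second gap are assumptions made for this example.

```python
from typing import Dict, List


def sessions_started(events: List[Dict], gap_s: int = 1800) -> int:
    """Count sessions: a new session starts after more than `gap_s` seconds
    without events.

    Only the event timestamps in the input are used, never time.time(),
    so the result is reproducible in a test.
    """
    count = 0
    last_ts = None
    for e in sorted(events, key=lambda e: e["ts"]):
        if last_ts is None or e["ts"] - last_ts > gap_s:
            count += 1
        last_ts = e["ts"]
    return count


events = [{"ts": 0}, {"ts": 100}, {"ts": 5000}]
result = sessions_started(events)
```

Sorting by event time also makes the result independent of arrival order, which removes one common source of nondeterminism.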
Recommended scopes
● Single job
● Multiple jobs
● Pipeline, including service
[Diagram labels: Job, Service, App, Stream]
Scopes to avoid
● Unit
○ Few stable interfaces
○ Not necessary
○ Avoid mocks, DI rituals
● Full system, including client
○ Client automation fragile
“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
[Diagram labels: Job, Service, App, Stream]
Stream application, example harness
IDE, CI, debug integration
[Diagram labels: Scalatest, Spark Streaming jobs, DB, Topic, Kafka, Test input, Test oracle, Docker, IDE / Gradle, Polling]
Test lifecycle
1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture
For absence tests (asserting that some output is never produced), send dummy sync messages at the end.
[Diagram labels: 1. Docker; 2, 7. Scalatest; 4. Spark; 5; 6; IDE / Gradle]
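Step 6 of the lifecycle, polling until the result appears or a timeout expires, could look roughly like this. The sketch is in Python rather than Scalatest; `poll_until` and `fake_poll_database` are hypothetical helpers, and the fake stands in for a query against the database in the fixture.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(fetch: Callable[[], Optional[T]], timeout_s: float = 10.0,
               interval_s: float = 0.001) -> T:
    """Call `fetch` until it returns a non-None result or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result within {timeout_s} s")
        time.sleep(interval_s)


# Stand-in for the database in the fixture: the row "appears" on the third poll.
polls = {"n": 0}


def fake_poll_database():
    polls["n"] += 1
    return {"clicks": 2} if polls["n"] >= 3 else None


row = poll_until(fake_poll_database, timeout_s=1.0)
```

The clock is read only by the harness to bound the wait; the job under test should still take time from its input.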
Input generation
● Input & output are denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
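An input generation routine can be as simple as a builder with sensible defaults, so each test states only the fields it cares about. The `make_page_view` function and all field names below are hypothetical examples.

```python
import itertools

_ids = itertools.count(1)  # source of unique event ids


def make_page_view(**overrides):
    """Build a valid, wide page-view event; tests override only what matters."""
    event = {
        "event_id": next(_ids),
        "ts": 0,
        "user": "user-1",
        "url": "/front",
        "country": "SE",
        "device": "desktop",
    }
    event.update(overrides)
    return event


# A test about mobile traffic states only the field it cares about.
mobile_view = make_page_view(device="mobile")
default_view = make_page_view()
```

When a field is added to the event type, only the builder changes; tests that do not mention the field keep working.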
Test oracles
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
○ Reuse for production data quality monitoring
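A sketch of a field-focused oracle and a reusable invariant check. `get_path` is a tiny stand-in for a lens, not the API of JsonPath, Play JSON or Monocle, and the "order" record with its fields is a hypothetical data type.

```python
from functools import reduce


def get_path(record, path):
    """Tiny stand-in for a lens: follow a dotted path into nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), record)


def assert_fields(record, expected):
    """Oracle: compare only the listed paths and ignore every other field."""
    for path, want in expected.items():
        got = get_path(record, path)
        assert got == want, f"{path}: expected {want!r}, got {got!r}"


def check_order_invariants(record):
    """Invariants for a hypothetical 'order' type; return a list of violations."""
    problems = []
    if record["amount"] < 0:
        problems.append("amount must be non-negative")
    if not record["user"]["id"]:
        problems.append("user.id must be set")
    return problems


output = {"amount": 12, "currency": "SEK", "user": {"id": "u1", "country": "SE"}}
assert_fields(output, {"user.country": "SE", "amount": 12})
violations = check_order_invariants(output)
```

Since `check_order_invariants` returns violations instead of raising, the same function can run in tests and in a production data quality job.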
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers
● Document “golden path”
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake implementations
○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
○ Distribute access tokens, etc
○ Pay $ or $$$
Top anti-patterns
1. Test as afterthought or in production
Data processing applications are suited for test!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework - vendor lock-in
7. Using wall clock time
8. Embedded fixture components
Thank you. Questions?
Credits:
Øyvind Løkling, Schibsted Media Group
● Content inspiration
Confluent, LinkedIn, Google, Netflix, Apache Samza
● Images
Tracey Saxby, Integration and Application Network, University of Maryland
Center for Environmental Science (ian.umces.edu/imagelibrary/).
Bonus slides
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
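The rule for saved production input, "individual regressions ok, weighted sum must not decline", can be expressed as a small check. The case names, scores and weights below are made-up values for illustration.

```python
def weighted_score(results, weights):
    """Sum of per-case scores multiplied by their weights."""
    return sum(weights[case] * score for case, score in results.items())


def no_overall_regression(baseline, candidate, weights):
    """Individual cases may get worse as long as the weighted sum does not decline."""
    return weighted_score(candidate, weights) >= weighted_score(baseline, weights)


weights = {"case_a": 3.0, "case_b": 1.0}      # hypothetical importance weights
baseline = {"case_a": 0.5, "case_b": 1.0}     # scores from the current release
candidate = {"case_a": 0.75, "case_b": 0.5}   # case_b regresses, case_a improves
ok = no_overall_regression(baseline, candidate, weights)
```

A golden set would instead require every case to be at least as good as before.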
Hadoop / Spark counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
○ Reuse test oracle invariants in production
Obtaining quality metrics
[Diagram labels: DB, Quality assessment job]
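"Odd code path => bump counter" can be sketched as below. A plain `collections.Counter` stands in for the processing tool's counters (Spark accumulators or Hadoop counters); `parse_age`, the counter names and the 0 to 130 range are assumptions made for the example.

```python
from collections import Counter

counters = Counter()  # stand-in for Spark accumulators or Hadoop counters


def parse_age(raw):
    """Parse an age field; count every odd path instead of failing the job."""
    try:
        age = int(raw)
    except (TypeError, ValueError):
        counters["age_unparseable"] += 1
        return None
    if not 0 <= age <= 130:
        counters["age_out_of_range"] += 1
        return None
    return age


ages = [parse_age(raw) for raw in ["34", "abc", None, "250", "7"]]
```

After the run, the counter values become quality metrics that can be pushed to a database and monitored.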
Quality testing in the process
● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for publishing dataset
● Push aggregates to DB
○ Standard ops: monitor, alert
[Diagram labels: DB, ∆?, Code ∆!]
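A "relative vs history" check used as a precondition for publishing might look like this. The `safe_to_publish` function, the 50% threshold and the record counts are hypothetical; a real pipeline would pick its own metric and limits.

```python
def safe_to_publish(todays_count, history, max_drop=0.5):
    """Return False if today's record count dropped by more than `max_drop`
    relative to the mean of previous runs."""
    if not history:
        return True  # nothing to compare against yet
    mean = sum(history) / len(history)
    return todays_count >= mean * (1 - max_drop)


history = [1000, 1100, 900]  # hypothetical record counts from earlier runs
normal_day = safe_to_publish(950, history)
large_drop = safe_to_publish(300, history)
```

The same aggregate can also be pushed to a database so that standard monitoring and alerting pick up slower drifts.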
