Data Pipeline at Tapad 
@tobym 
@TapadEng
Who am I? 
Toby Matejovsky 
First engineer hired at Tapad, 3+ years ago 
Scala developer 
@tobym
What are we talking about?
Outline 
• What Tapad does 
• Why bother with a data pipeline? 
• Evolution of the pipeline 
• Day in the life of an analytics pixel 
• What’s next
What Tapad Does 
Cross-platform advertising and analytics 
Process billions of events per day 
Cross platform? 
Device Graph 
Node = device, edge = inferred connection 
Roughly a billion devices 
Roughly a quarter billion edges 
85+% accuracy 
Why a Data Pipeline? 
Graph building 
Sanity while processing big data 
Decouple components 
Data accessible at multiple stages
Graph Building 
Realtime mode, without impacting bidding latency 
Batch mode
Sanity 
Billions of events, terabytes of logs per day 
Don’t have NSA’s budget 
Clear data retention policy 
Store aggregations
Decouple Components 
Bidder only bids; the graph-building process only builds the graph 
Data stream can split and merge
Data accessible at multiple stages 
Logs on edge of system 
Local spool of data 
Kafka broker 
Consumer local spool 
HDFS
Evolution of the Data Pipeline 
Dark Ages: Monolithic, synchronous processing 
Renaissance: Queues, asynchronous work in the same process 
Age of Exploration: Inter-process communication, ad hoc batching 
Age of Enlightenment: Standardize on Kafka and Avro
Dark Ages 
Monolithic, synchronous processing 
It was fast enough, and we had to start somewhere.
Renaissance 
Queues, asynchronous work in the same process 
No, it wasn’t fast enough.
Age of Exploration 
Inter-process communication, ad hoc batching 
Servers at the edge batch up events, ship them to another 
service.
Age of Enlightenment 
Standardize on Kafka and Avro 
Properly engineered and supported, reliable
Tangent! 
Batching, queues, and serialization 
Batching 
Batching is great; it will really help throughput 
Batching != slow 
Queues 
Queues are amazing, until they explode and destroy the Rube Goldberg 
machine. 
“I’ll just increase the buffer size.” 
- spoken one day before someone ended up on double PagerDuty rotation 
Care and feeding of your queue 
Monitor 
Back-pressure 
Buffering 
Spooling 
Degraded mode 
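To make the back-pressure and degraded-mode points concrete, here is a minimal sketch built on a bounded java.util.concurrent queue. The Event type, capacity, and drop policy are illustrative, not Tapad's production code.

import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

// Hypothetical event wrapper standing in for a serialized pixel request.
final case class Event(payload: Array[Byte])

class BoundedSpool(capacity: Int) {
  // Bounded queue: when it fills up, producers feel back-pressure instead of
  // the process quietly growing until it runs out of memory.
  private val queue = new ArrayBlockingQueue[Event](capacity)

  // Back-pressure: the producing thread blocks until there is room.
  def enqueueBlocking(e: Event): Unit = queue.put(e)

  // Degraded mode: if the queue is full, drop the oldest (stalest) event to
  // make room for the new one; the caller should count drops for monitoring.
  def enqueueOrDropOldest(e: Event): Boolean =
    queue.offer(e) || { queue.poll(); queue.offer(e) }

  // Consumer side: wait up to timeoutMs for the next event.
  def dequeue(timeoutMs: Long): Option[Event] =
    Option(queue.poll(timeoutMs, TimeUnit.MILLISECONDS))
}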
Serialization - Protocol Buffers 
Tagged fields 
Sort of self-describing 
required, optional, repeated fields in schema 
“Map” type: 
message StringPair { 
required string key = 1; 
optional string value = 2; 
} 
Serialization - Avro 
Optional field: union { null, long } user_timestamp = null; 
Splittable (Hadoop world) 
Schema evolution and storage 
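As a rough illustration of Avro schema evolution (the record name and fields below are made up for the example), a record written with an old schema can be read with a newer one that adds an optional field with a default:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord, GenericRecordBuilder}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object SchemaEvolutionCheck {
  // Writer schema: the "old" version of the record, without user_timestamp.
  val writerSchema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Pixel","fields":[
      |  {"name":"partner_id","type":"long"}
      |]}""".stripMargin)

  // Reader schema: the "new" version adds an optional field with a default,
  // expressed as a union of null and long.
  val readerSchema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"Pixel","fields":[
      |  {"name":"partner_id","type":"long"},
      |  {"name":"user_timestamp","type":["null","long"],"default":null}
      |]}""".stripMargin)

  def roundTrip(): GenericRecord = {
    val record = new GenericRecordBuilder(writerSchema).set("partner_id", 42L).build()

    // Encode with the writer (old) schema...
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
    encoder.flush()

    // ...and decode with the reader (new) schema; the missing field picks up its default.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, decoder)
  }
}

This is the kind of unit test the speaker notes recommend: serialize with one schema, deserialize with the other, and assert the expected field values.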
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
Browser loads pixel from pixel server 
Pixel server immediately responds with a 200 and a transparent GIF, then serializes the requests into a batch file 
The batch file ships every few seconds or when it reaches 2 KB, whichever comes first
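A minimal sketch of that time-or-size flush policy follows; the class name, thresholds, and the ship callback are illustrative (the real edge servers write length-delimited protobuf records to a spool file and ship that):

import java.io.ByteArrayOutputStream

// Time-or-size batching: buffer serialized requests and ship the batch when it
// reaches maxBytes or becomes older than maxAgeMillis, whichever comes first.
class BatchSpooler(maxBytes: Int, maxAgeMillis: Long)(ship: Array[Byte] => Unit) {
  private val buffer = new ByteArrayOutputStream()
  private var firstWriteAt = 0L

  def append(serializedRequest: Array[Byte]): Unit = synchronized {
    if (buffer.size() == 0) firstWriteAt = System.currentTimeMillis()
    buffer.write(serializedRequest)
    flushIfNeeded()
  }

  def flushIfNeeded(): Unit = synchronized {
    val tooBig = buffer.size() >= maxBytes
    val tooOld = buffer.size() > 0 && System.currentTimeMillis() - firstWriteAt >= maxAgeMillis
    if (tooBig || tooOld) {
      ship(buffer.toByteArray) // hand the batch off to the shipper
      buffer.reset()
    }
  }
}

In practice a background timer would also call flushIfNeeded() periodically, so an idle server still ships a partial batch.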
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
Pixel ingress server receives a 2-kilobyte file containing serialized web requests. 
It deserializes them, processes some requests immediately (update the database), then converts them into Avro records with a schema-hash header and publishes them to various Kafka topics
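A sketch of that publish step, using today's Kafka producer API for illustration (the talk predates this client; the hosts, topic handling, and exact header layout are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Publish an Avro-encoded event, prefixed with a schema-hash header so consumers
// can look up the full schema elsewhere (e.g. in ZooKeeper).
object PixelPublisher {
  private val props = new Properties()
  props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092") // illustrative hosts
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

  private val producer = new KafkaProducer[String, Array[Byte]](props)

  def publish(topic: String, schemaHash: Array[Byte], avroBytes: Array[Byte]): Unit = {
    // Prepend the schema hash so the record can be decoded later without shipping
    // the full schema with every message.
    val payload = schemaHash ++ avroBytes
    producer.send(new ProducerRecord[String, Array[Byte]](topic, payload))
  }
}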
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
Producer clients figure out where to publish via the broker they connect to 
Kafka topics are partitioned into multiple chunks; each partition has a master and a slave on different servers so the topic survives an outage. 
Configurable retention based on time 
Can add topics dynamically
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
Consumer processes are organized into groups 
Many consumer groups can read from same Kafka topic 
Plugins: 
trait Plugin[A] { 
def onStartup(): Unit 
def onSuccess(a: A): Unit 
def onFailure(a: A): Unit 
def onShutdown(): Unit 
} 
GraphitePlugin, BatchingLogfilePlaybackPlugin, TimestampDrivenClockPlugin, 
BatchingTimestampDrivenClockPlugin, …
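As an example of what a plugin implementation might look like, here is a hypothetical stand-in (not Tapad's actual GraphitePlugin; the report callback is assumed) that counts successes and failures so produced-vs-consumed rates can be graphed:

import java.util.concurrent.atomic.AtomicLong

// Counts outcomes and pushes them to a metrics backend via the supplied callback.
class CountingPlugin[A](metricPrefix: String, report: (String, Long) => Unit) extends Plugin[A] {
  private val successes = new AtomicLong(0)
  private val failures  = new AtomicLong(0)

  def onStartup(): Unit = report(s"$metricPrefix.startup", 1L)
  def onSuccess(a: A): Unit = report(s"$metricPrefix.success", successes.incrementAndGet())
  def onFailure(a: A): Unit = report(s"$metricPrefix.failure", failures.incrementAndGet())
  def onShutdown(): Unit = report(s"$metricPrefix.shutdown", 1L)
}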
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
trait Plugins[A] { 
private val _plugins = ArrayBuffer.empty[Plugin[A]] 
def plugins: Seq[Plugin[A]] = _plugins 
def registerPlugin(plugin: Plugin[A]) = _plugins += plugin 
}
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
object KafkaConsumer { 
sealed trait Result { 
def notify[A](plugins: Seq[Plugin[A]], a: A): Unit 
} 
case object Success extends Result { 
def notify[A](plugins: Seq[Plugin[A]], a: A) { 
plugins.foreach(_.onSuccess(a)) 
} 
} 
}
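The slide only shows the Success case; a Failure counterpart inside KafkaConsumer would presumably mirror it and dispatch to onFailure (sketch, not the original code):

case object Failure extends Result {
  def notify[A](plugins: Seq[Plugin[A]], a: A) {
    plugins.foreach(_.onFailure(a))
  }
}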
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
/** Decorate a Function1[A, B] with retry logic */ 
case class Retry[A, B](maxAttempts: Int, backoff: Long)(f: A => B) { 
  def apply(a: A): Result[A, B] = { 
    def execute(attempt: Int, errorLog: List[Throwable]): Result[A, B] = { 
      val result = try { 
        Success(this, a, f(a)) 
      } catch { 
        case e: Throwable => Failure(this, a, e :: errorLog) 
      } 
      result match { 
        case failure @ Failure(_, _, errorLog) if errorLog.size < maxAttempts => 
          val _backoff = (math.pow(2, attempt) * backoff).toLong 
          Thread.sleep(_backoff) // wait before the next invocation 
          execute(attempt + 1, errorLog) // try again 
        case failure @ Failure(_, _, errorLog) => 
          failure 
        case success => 
          success // no retry needed 
      } 
    } 
    execute(attempt = 0, errorLog = Nil) 
  } 
}
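Usage might look something like this (the write function, attempt count, and back-off are illustrative, not taken from the talk):

// Stand-in for the real side-effecting write.
def writeToHdfs(bytes: Array[Byte]): Unit = ()

// Wrap a flaky write with up to 5 attempts and exponential back-off starting at 100 ms.
val retryingWrite = Retry(maxAttempts = 5, backoff = 100L) { bytes: Array[Byte] =>
  writeToHdfs(bytes)
}
retryingWrite(Array[Byte](1, 2, 3)) // returns Success, or Failure carrying every Throwable seen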
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
Consumers write the events into “permanent storage” in HDFS. 
File format is Avro, written in batches. 
Data retention policy is essential.
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
Hadoop 2 - YARN 
Scalding to write map-reduce jobs easily 
Rewrite Avro files as Parquet 
Oozie to schedule regular jobs
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
YARN
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Scalding 
class WordCountJob(args : Args) extends Job(args) { 
TextLine( args("input") ) 
.flatMap('line -> 'word) { line : String => tokenize(line) } 
.groupBy('word) { _.size } 
.write( Tsv( args("output") ) ) 
// Split a piece of text into individual words. 
def tokenize(text : String) : Array[String] = { 
// Lowercase each word and remove punctuation. 
text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+") 
} 
}
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Parquet 
Column-oriented storage for Hadoop 
Nested data is okay 
Projections 
Predicates
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Parquet 
val requests = ParquetAvroSource 
.project[Request](args("requests"), Projection[Request]("header.query_params", "partner_id")) 
.read 
.sample(args("sample-rate").toDouble) 
.mapTo('Request -> ('queryParams, 'partnerId)) { req: TapestryRequest => 
(req.getHeader.getQueryParams, req.getPartnerId) 
}
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Oozie 
<workflow-app name="combined_queries" xmlns="uri:oozie:workflow:0.3"> 
<start to="devices-location"/> 
<!--<start to="export2db"/>--> 
<action name="devices-location"> 
<shell xmlns="uri:oozie:shell-action:0.1"> 
<job-tracker>${jobTracker}</job-tracker> 
<name-node>${nameNode}</name-node> 
<exec>hadoop</exec> 
<argument>fs</argument> 
<argument>-cat</argument> 
<argument>${devicesConfig}</argument> 
<capture-output/> 
</shell> 
<ok to="networks-location"/> 
<error to="kill"/> 
</action>
pixel server → pixel ingress → Kafka → consumer → HDFS → Hadoop jobs 
Day in the life of a pixel 
Near real-time consumers and batch Hadoop jobs generate data cubes from incoming events and save those aggregations into Vertica for fast and easy querying with SQL.
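A rough sketch of one such aggregation in Scalding (field and source names are illustrative): count events per partner per day, so the result is small enough to export into Vertica.

import com.twitter.scalding._

// Illustrative cube-style rollup: one output row per (partnerId, day).
class DailyPartnerCounts(args: Args) extends Job(args) {
  Tsv(args("events"), ('partnerId, 'day))
    .groupBy('partnerId, 'day) { _.size('eventCount) }
    .write(Tsv(args("output")))
}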
Stack summary 
Scala, Jetty/Netty, Finagle 
Avro, Protocol Buffers, Parquet 
Kafka 
Zookeeper 
Hadoop - YARN and HDFS 
Vertica 
Scalding 
Oozie, Sqoop
What’s next? 
Hive 
Druid 
Impala 
Oozie alternative
Thank you! Yes, we’re hiring :) 
@tobym 
@TapadEng 
Toby Matejovsky, Director of Engineering 
toby@tapad.com 
@tobym

Editor's Notes

  • #4: Data pipelines can look a bit like a Rube Goldberg machine
  • #6: HTTP requests indicating “user is interested in a widget”, “want to show an ad?”, “ad was served”, “user bought a widget”
  • #7: At any given time, have roughly a billion devices and a quarter billion edges. Graph is constantly changing in realtime whenever a signal is processed, or a record expires. Accuracy is checked against an objective third party dataset.
  • #8: Generating a terabyte of logs per day, can’t store it all. Don’t want to store it all either, more data takes longer to process
  • #9: Realtime bidding infrastructure has a very tight SLA and is very sensitive to latency. It needs access to the graph database, and incoming signals may add or modify an edge depending on a big list of rules. We used to do this in-process; the obvious problem is having the bidder do work that isn’t directly related to bidding. Solution: publish the signals to a queue (Kafka) and let a consumer pull from it and build the graph in near-realtime, one signal at a time, plus some contextual history for similar signals. Batch mode: a Scalding job running on a one-petabyte, 50-node Hadoop cluster. It looks at several weeks’ worth of signals and creates an entire “new” graph. More connections, same or better accuracy.
  • #10: Data retention policy. For some data, it is fine to store aggregations instead of individual elements.
  • #11: Transparency, not just input → black box → output. A slow graph-building process won’t slow down the bidder. Deploy new versions of some component in the pipeline without needing to interrupt another process. Easy to tap into the data stream at any point.
  • #12: Can inspect the data at any one of these places, which aids debugging. Log produced vs consumed counts at each stage to see if things are flowing properly.
  • #13: Dark ages - had to start somewhere, and it was fast enough
  • #14: Had to start somewhere, and it was fast enough in the beginning.
  • #15: Pretty obvious that the synchronous stuff didn’t work once we started to scale, so just process things in a separate thread pool. Standard software development here; nothing fancy.
  • #16: Edge servers serialize HTTP requests using protocol buffers, write delimited records to a file, and ship the file every N seconds or when the file hits a certain size, whichever comes first. Easy because it was the same code deployed on different machines; we just needed to add the serialization/deserialization, ship/receive, and batch modes. Very simple: batch mode is just a loop that calls the original single-event processor.
  • #17: Apache Kafka is a distributed queuing system. Fast (a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients). Scalable (can expand capacity without downtime; queues are partitioned and replicated, not limited by single-node capacity, distributed by design). Durable (messages are written to disk on master and slave machines). Avro is a serialization format like protobuf; it supports maps and default values, which protobuf doesn’t. Used for our HDFS storage as well; standardizing allows us to use the same code whether it’s running in a consumer reading from Kafka or in a Hadoop job reading from HDFS.
  • #20: Batching will really improve processing throughput, because you save the cost of repeated setup and teardown. Works at all scales, batching != slow: on the small end, think about how an optimizing compiler performs “loop unrolling” – perform a dozen operations on each iteration instead of one per iteration. Can batch inside of some function in your application, and inter-process.
  • #21: Queues are great because they allow for elasticity. However, this can be a double-edged sword because the elasticity may hide a problem until it becomes catastrophic. An unbounded queue WILL cause the system to fail one day. If the producer is faster than the consumer, it will put messages in the queue until you run out of memory.
  • #22: Monitor – Graphite metrics for produced vs consumed counts; alert if things are too far off. Back-pressure – provide back-pressure via a bounded queue. A bounded java.util.concurrent.LinkedBlockingQueue is great for this; if it’s full, the inserting thread blocks until there is space. Similar with an ExecutorService backed by the same queue; if a thread fails to submit a job, either throw an exception or have the inserting thread run it. “Increase the buffer size” – actually this is okay, just take some time to think about what a good size is. The main issue with a big queue size is GC pressure. Spooling – the producer can spool messages locally and retry later, to avoid OOMing. Degraded mode – just drop some data. The bidder process does this with incoming bid requests by discarding from the front of the queue (those are the messages that have been in the queue the longest, so get rid of them if they are already stale or at risk of becoming stale).
  • #23: Protocol buffers have tagged fields (just a number, so you can use whatever name you want, and change it later), then a type (int, string, etc.), then the length of the field, then the field value. This is cool because each record can be decoded without having the same schema as the encoder. Each field describes its type, but not its name, so you need the generated classes to fully deserialize into something useful with the field names you expect. Evolve the schema by adding a new field with a new tag number, or deleting an old field. Never reuse a tag number. Easier to evolve the schema than Avro because of this technique.
  • #24: No optional type, because all fields are always present in the same order as the schema, so use a union with null for optional fields. Also there is a Map type. Schema evolution is possible via resolution rules, but be careful; fields are matched by name, so you cannot rename things thoughtlessly. For example, give a default value to a new field so it’s possible to parse a record encoded without that field. Lots of overhead to send the schema with each request; don’t do it. So how does one deal with having multiple records with multiple versions of the schema? Store the schema hash with the record, then store the actual schema (JSON) somewhere else; we use ZooKeeper. Also in HDFS, the header of a giant Avro file can contain the schema for the records contained within. Naturally splittable, good for map-reduce jobs because a single file can be split up automatically among N mappers. Uses a split marker. Test with unit tests: serialize with one schema, deserialize with the other, and ensure there are no exceptions and you have the expected values in each field.
  • #25: Serialize with protocol buffers
  • #26: Some things are supposed to be processed immediately, so do it. Others can wait long enough to do it the right way, so publish the request to the appropriate topic. Topic is just another name for a particular queue.
  • #27: Configure the number of partitions per topic in the broker config files. Consumers can autodiscover brokers via ZooKeeper; producers autodiscover based on connecting to an existing broker. We have a 24-hour retention policy, and brokers each have a terabyte of storage available. Once the data is older than the configured age, it’s gone. Don’t fall behind! Started using Kafka at v0.7.1 and built some tooling for ourselves that didn’t exist yet.
  • #28: Consumers autodiscover brokers via ZooKeeper. Batching and discrete consumers. Plugins such as GraphitePlugin, BatchingLogfilePlaybackPlugin, TimestampDrivenClockPlugin, BatchingTimestampDrivenClockPlugin, … TimestampDrivenClockPlugin is for a producer: it registers itself with ZooKeeper and saves the latest timestamp that it has processed, which allows other processes to coordinate by taking the minimum timestamp published by the group of producers.
  • #29: This is how a plugin is registered with a given producer or consumer client.
  • #30: Example of plugin callbacks being run after notification of a success
  • #31: A consumer is basically a Function1[A, B] Here’s some retry logic with exponential back-off. Eventually it will fail and stop processing.
  • #32: Batch writes so you have a smaller number of bigger files. Many small files are the Achilles’ heel of Hadoop: mappers take too long to spin up. A data retention policy is essential because storage consumed WILL expand to the limits of storage available. Make clear distinctions between data that lives for a week, a month, a year. Scratch space as well: use it, but be aware that it could be wiped out if necessary.
  • #33: YARN is like the OS of the Hadoop cluster; it allocates resources like compute power to jobs which need them. Scalding is a Scala API which makes it easy to write map-reduce jobs. Oozie is a job scheduler and coordinator. It’s sort of clunky and uses lots of XML. Not in love with it, but it gets the job done and we haven’t committed to seriously exploring other options yet.
  • #34: Photo credit Hortonworks (http://hortonworks.com/hadoop/yarn/) Basically, HDFS is great and everything just reads from that. YARN allows any application to then run on the same hadoop cluster so it can easily get at the data in HDFS.
  • #35: Scalding is a Scala API which makes it easy to write map-reduce jobs. See the example code. joinWithTiny is fantastically fast if you can get away with it, because everything is done in-memory in the mapper; no need for extra map-reduce steps for the join.
  • #36: Parquet is a column-oriented storage format for Hadoop. Push-down predicates and projections make for faster reads, sometimes giving HUGE speedups. A predicate lets you check some field before reading data into your application; a projection lets you load only the specified fields out of a record. It is a meta-format, so we still use Avro-generated classes.
  • #37: Example of a projection
  • #38: Oozie coordinates workflows, which are directed acyclic graphs of actions like “wait for this file, then run this job; if it errors go to this step (kill/cleanup), otherwise go to that step (export to database with Sqoop)”. An XML workflow, plus some properties files.
  • #41: Hive, to make data in HDFS available to non-programmers; SQL is easier than writing a map-reduce job. Oozie is a bit awkward, and we know there are alternatives. Druid is a realtime big-data analytics database; we essentially have our own homegrown version of this, though not as mature. Impala is another SQL-on-Hadoop sort of thing.