SlideShare a Scribd company logo
An adaptive and eventually self-healing framework for geo-
distributed real-time data ingestion
Angad Singh
InMobi
The problem domain
Scale
● 15 billion events per day (post filtering)
● 1.5+ billion users, 200 million per day
● 4 geographically distributed data centers (DCs)
● User’s request may land on non-local DC
Ingestion requirements
● multiple tenants, multiple schemas per tenant
● batch, stream, micro-batch and on-demand ingestion
● 20+ streams, 100+ data types
● need to ingest, transform, validate and aggregate this data
● need to ingest streaming data in real-time (<1 min) for ad-serving/targeting use
cases (strict SLA)
The problem domain
Usage/serving requirements
● need to pivot this data by user, activity type and other primary keys
● serve an aggregated view (profile) at the end in < 5ms p99 latency
● need both real-time serving of the view
● as well as batch summaries for analytics, inference algorithms, feedback loops
● need to be resilient to failure, absolutely no room for data loss/lag in ingestion
Data arrival, volume and velocity
● data may be received out of order, or duplicated
● data can arrive in periodic batches or real-time/streaming or once in a while
● data may arrive in bursts or trickle slowly in some streams (autoscale)
● user data may be received in any DC, but needs to be collectively available in a
single DC
The problem domain
Multi-tenancy
● Quotas
● Rate limiting/SLAs
● Isolation
Manageability
● need to be self-serve, flexible for specific changes in the flow, easily deployable
● may need online migration, reprocessing, etc. of data
● hassle-free schema evolution across the stack
● monitoring, visibility, operability aspects for all of the above
The architecture
Serving layer
(user store)
aerospike
cluster
API
dedup, aggregate, business rules
Ratelimiting/quotas
API
Ad serving
<5ms,99.95%success
notifications
pubsub
(kafka)
notification
listeners (storm)
periodic
dumps
streaming
offline snapshot
store (HDFS)
batch inference jobs
(MR/spark)
analytics engine
(cubes, lens)
real-time
enrichment
on user
engagement
Ingestion layer
globaldcglobaldc
offline snapshot
store (HDFS)
globaldc
localdc
localdc
localdc
upstreamingestionsources
batchsources
streamingsources
adaptors
localdc
adaptors
adaptors
(MR/storm)
localdclocaldc
routers
localdc
routers
routers
(MR/storm)
localdclocaldc
sinkssinks
sinks
(MR/storm)
remotedc
Ingestion Service
orchestrate/manage
remotedc remotedc
Architecture
DC1 (global)
DC2 (slave)
DC3 (slave)
adaptors
(MR)
adaptors
(storm)
adaptors
(storm)
routers
(MR)
routers
(storm)
routers
(storm)
User-Colo
Metadata
(aerospike)
User-Colo
Metadata
(aerospike)
User-Colo
Metadata
(aerospike)
custom
replication
(contains userid)
(contains userid)
(contains userid)
sinks
getColo(userid)
sinks
sinks
Kafka-data-replicator
(stormtopology)
global colo tagger
(storm)
tag not found
tagged
data
tag found
write tag
User Store
API
History
Profile
User Store
API
History
Profile
User Store
API
History
Profile
XDR
(profile)
Cross-DC
architecture
custom
replication
DC1 (global)
DC2 (slave)
DC3 (slave)
adaptors
(MR)
adaptors
(storm)
adaptors
(storm)
routers
(storm)
routers
(storm)
routers
(storm)
User-Colo
Metadata
(aerospike)
User-Colo
Metadata
(aerospike)
User-Colo
Metadata
(aerospike)
XDRXDR
(contains userid)
(contains userid)
(contains userid)
sinks
getColo(userid)
sinks
sinks
Kafka-data-replicator
(stormtopology)
global colo tagger
(storm)
tag not found
tagged
data
tag found
write tag
User Store
API
History
Profile
User Store
API
History
Profile
User Store
API
History
Profile
XDR
(profile)
Mapper
Comparison to
map-reducePartitioner Shuffler Reducer
The ingestion layer
An adaptive and eventually self healing framework for geo-distributed real-time data ingestion
Current Features
Business-agnostic APIs
● Built on simple RESTful APIs: Schema, Feed, Sink, Source, Flow, Driver, Data router, Adaptor
● Unified APIs for doing batch, streaming, micro-batch ingestion.
● Self-serve system which provides rule validation, metrics, etc. and makes the expression of sources,
sinks and flows easy with custom DSL.
Platform-agnostic Flow Execution
● Pluggable execution engine (storm, hadoop, spark) - provides a Driver API
● Uses falcon for batch scheduling, in-built scheduler for streaming drivers (storm, etc.)
Serialization support
● Pluggable schema serde support (thrift, avro)
Current Features
Schema management
● Schema is a first class citizen.
● Contract between source, sink and flow all based on and validated against schema
● Schema versioning and compatibility checks.
● Error-free schema evolution across data flows
● Clean abstractions to centrally manage all the schemas, data sources/feeds, sinks (key value store,
HDFS, etc.) and data flows (storm topologies, MR jobs) which are part of the ingestion pipelines
Manageability, operability
● All entities - schemas, sinks and flows - can be updated online without any downtime.
● Retries, error handling, metrics, orchestration hooks, etc. come standard
Out of the box support for
● Cross-colo flow chaining
● Data routing
● Transformation, validation, conversion
● All based on pluggable code
An adaptive and eventually self healing framework for geo-distributed real-time data ingestion
An adaptive and eventually self healing framework for geo-distributed real-time data ingestion
The problems we’ve seen
Storm
● as usual, lot of knobs to tune based on lot of metrics: workers, threads, tasks, acks, max
spout pending, buffer sizes, xmx, num slots, execute/process/ack latency, capacity, etc.
● debugging storm topology’s isn’t easy: threads, workers, shared logs, shuffling of data
between workers, netty, the ack system, etc.
● storm (0.9.x) doesn’t like heterogenous load: unbalanced distribution between supervisors.
heavy topologies can choke each other. rebalancing not fully resource aware (1.x tries to
solve this)
● no rolling upgrades, supervisor failures cause unrecoverable errors
● zookeeper issues: too many executors leads to worker heartbeat update failure to zk.
● storm-kafka issue: storm-kafka spout unaware of purging (earliestOffset update)
● storm-kafka issue: invisible data loss
● retries should done cautiously
● etc
Kafka
● topic deletion asynchronous, slow
● tuning num partitions manually
● bad consumers can cause excessive logging on brokers
Features under development
● Autoscaling flows - rebalance storm topology based on spout lag, priority and
current throughput (or bolt capacity) - runtime metrics or linear regression on
historical metrics
● Streaming and batch compaction / dedup of data based on domain specific
rules
● Automatic fallback from streaming to batch ingestion in case of huge
backlogs, for low priority ingestions
● Dynamic rerouting / sharding of data between DCs for load balancing cross-
DC flows
● Eventual self-correction of data based on validations on the aggregated view
(data received from multiple streams)
● Data lineage/auditing
● Backfill management

More Related Content

What's hot (19)

PPTX
Monitoring Cassandra with graphite using Yammer Coda-Hale Library
Nader Ganayem
 
PDF
Windowing in apex
Yogi Devendra Vyavahare
 
PPTX
Devops kc
Philip Thompson
 
PDF
Optimistic Algorithm and Concurrency Control Algorithm
Shounak Katyayan
 
PDF
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Flink Forward
 
PPTX
Fault Tolerance and Processing Semantics in Apache Apex
Apache Apex Organizer
 
PDF
BigDataSpain 2016: Introduction to Apache Apex
Thomas Weise
 
PDF
Stateful stream processing with Apache Flink
Knoldus Inc.
 
PPTX
Apache Samza - New features in the upcoming Samza release 0.10.0
Navina Ramesh
 
PPTX
Stream processing - Apache flink
Renato Guimaraes
 
PDF
An Introduction to Prometheus
Evgeny Shmarnev
 
PDF
Data-Oriented Programming with Clojure and Jackdaw (Charles Reese, Funding Ci...
confluent
 
PPTX
Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...
InfluxData
 
PDF
From Startup to Mature Company: PostgreSQL Tips and techniques
John Ashmead
 
PDF
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward
 
PDF
Distributed systems vs compositionality
Roland Kuhn
 
PDF
Low latency stream processing with jet
StreamNative
 
PPTX
Harvesting the Power of Samza in LinkedIn's Feed
Mohamed El-Geish
 
PDF
HBaseCon2017 HBase at Xiaomi
HBaseCon
 
Monitoring Cassandra with graphite using Yammer Coda-Hale Library
Nader Ganayem
 
Windowing in apex
Yogi Devendra Vyavahare
 
Devops kc
Philip Thompson
 
Optimistic Algorithm and Concurrency Control Algorithm
Shounak Katyayan
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Flink Forward
 
Fault Tolerance and Processing Semantics in Apache Apex
Apache Apex Organizer
 
BigDataSpain 2016: Introduction to Apache Apex
Thomas Weise
 
Stateful stream processing with Apache Flink
Knoldus Inc.
 
Apache Samza - New features in the upcoming Samza release 0.10.0
Navina Ramesh
 
Stream processing - Apache flink
Renato Guimaraes
 
An Introduction to Prometheus
Evgeny Shmarnev
 
Data-Oriented Programming with Clojure and Jackdaw (Charles Reese, Funding Ci...
confluent
 
Monitoring and Alerting with InfluxDB 2.0 | Deniz Kusefoglu & Nate Isley | In...
InfluxData
 
From Startup to Mature Company: PostgreSQL Tips and techniques
John Ashmead
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward
 
Distributed systems vs compositionality
Roland Kuhn
 
Low latency stream processing with jet
StreamNative
 
Harvesting the Power of Samza in LinkedIn's Feed
Mohamed El-Geish
 
HBaseCon2017 HBase at Xiaomi
HBaseCon
 

Viewers also liked (12)

PDF
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Institute of Contemporary Sciences
 
PDF
Designing a Real Time Data Ingestion Pipeline
DataScience
 
PDF
Creating a Modern Data Architecture
Zaloni
 
PDF
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Zaloni
 
PPTX
10 Amazing Things To Do With a Hadoop-Based Data Lake
VMware Tanzu
 
PDF
Data Lake,beyond the Data Warehouse
Data Science Thailand
 
PDF
The Emerging Data Lake IT Strategy
Thomas Kelly, PMP
 
PDF
Implementing a Data Lake with Enterprise Grade Data Governance
Hortonworks
 
PDF
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Hortonworks
 
PPTX
Big data architectures and the data lake
James Serra
 
PPTX
Hadoop data ingestion
Vinod Nayal
 
PPTX
Analysing Smart City Development in india
Omkar Parishwad
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Institute of Contemporary Sciences
 
Designing a Real Time Data Ingestion Pipeline
DataScience
 
Creating a Modern Data Architecture
Zaloni
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Zaloni
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
VMware Tanzu
 
Data Lake,beyond the Data Warehouse
Data Science Thailand
 
The Emerging Data Lake IT Strategy
Thomas Kelly, PMP
 
Implementing a Data Lake with Enterprise Grade Data Governance
Hortonworks
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Hortonworks
 
Big data architectures and the data lake
James Serra
 
Hadoop data ingestion
Vinod Nayal
 
Analysing Smart City Development in india
Omkar Parishwad
 
Ad

Similar to An adaptive and eventually self healing framework for geo-distributed real-time data ingestion (20)

PDF
Real-time Stream Processing using Apache Apex
Apache Apex
 
PPTX
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Navina Ramesh
 
PDF
Eranea's solution and technology for mainframe migration / transformation : d...
Eranea
 
PDF
Network visibility and control using industry standard sFlow telemetry
pphaal
 
PDF
Introduction to Apache Apex by Thomas Weise
Big Data Spain
 
PDF
Stream Processing Overview
Maycon Viana Bordin
 
PDF
M|18 Choosing the Right High Availability Strategy for You
MariaDB plc
 
PDF
3450 - Writing and optimising applications for performance in a hybrid messag...
Timothy McCormick
 
PDF
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Monal Daxini
 
PPTX
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
DataWorks Summit
 
PDF
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
PDF
Serverless London 2019 FaaS composition using Kafka and CloudEvents
Neil Avery
 
PPTX
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
PPTX
The Enterprise IT Checklist for Docker Operations
Nicola Kabar
 
PDF
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Martin Zapletal
 
PDF
Big Data Streams Architectures. Why? What? How?
Anton Nazaruk
 
PDF
High-Speed Reactive Microservices
Rick Hightower
 
PPTX
redGuardian DP100 large scale DDoS mitigation solution
Redge Technologies
 
PPTX
Designing apps for resiliency
Masashi Narumoto
 
PDF
Reactive mistakes - ScalaDays Chicago 2017
Petr Zapletal
 
Real-time Stream Processing using Apache Apex
Apache Apex
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Navina Ramesh
 
Eranea's solution and technology for mainframe migration / transformation : d...
Eranea
 
Network visibility and control using industry standard sFlow telemetry
pphaal
 
Introduction to Apache Apex by Thomas Weise
Big Data Spain
 
Stream Processing Overview
Maycon Viana Bordin
 
M|18 Choosing the Right High Availability Strategy for You
MariaDB plc
 
3450 - Writing and optimising applications for performance in a hybrid messag...
Timothy McCormick
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Monal Daxini
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
DataWorks Summit
 
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
Serverless London 2019 FaaS composition using Kafka and CloudEvents
Neil Avery
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 
The Enterprise IT Checklist for Docker Operations
Nicola Kabar
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Martin Zapletal
 
Big Data Streams Architectures. Why? What? How?
Anton Nazaruk
 
High-Speed Reactive Microservices
Rick Hightower
 
redGuardian DP100 large scale DDoS mitigation solution
Redge Technologies
 
Designing apps for resiliency
Masashi Narumoto
 
Reactive mistakes - ScalaDays Chicago 2017
Petr Zapletal
 
Ad

More from Angad Singh (7)

PDF
From Journeyman to Master
Angad Singh
 
PPT
Oxylabs apps
Angad Singh
 
PDF
OpenSolaris School OS Beginners Guide
Angad Singh
 
ODP
NetBeans 6.5
Angad Singh
 
PPT
Netbeans 6.1 Talk
Angad Singh
 
PPT
Open Solaris 2008.05
Angad Singh
 
PDF
Sun Spot Talk
Angad Singh
 
From Journeyman to Master
Angad Singh
 
Oxylabs apps
Angad Singh
 
OpenSolaris School OS Beginners Guide
Angad Singh
 
NetBeans 6.5
Angad Singh
 
Netbeans 6.1 Talk
Angad Singh
 
Open Solaris 2008.05
Angad Singh
 
Sun Spot Talk
Angad Singh
 

Recently uploaded (20)

PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PDF
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
PDF
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
PDF
July Patch Tuesday
Ivanti
 
PDF
What Makes Contify’s News API Stand Out: Key Features at a Glance
Contify
 
PDF
"Beyond English: Navigating the Challenges of Building a Ukrainian-language R...
Fwdays
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
PPTX
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
PPTX
Webinar: Introduction to LF Energy EVerest
DanBrown980551
 
PDF
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
IoT-Powered Industrial Transformation – Smart Manufacturing to Connected Heal...
Rejig Digital
 
PPTX
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
PPTX
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
PDF
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
How Startups Are Growing Faster with App Developers in Australia.pdf
India App Developer
 
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
July Patch Tuesday
Ivanti
 
What Makes Contify’s News API Stand Out: Key Features at a Glance
Contify
 
"Beyond English: Navigating the Challenges of Building a Ukrainian-language R...
Fwdays
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
Webinar: Introduction to LF Energy EVerest
DanBrown980551
 
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
DevBcn - Building 10x Organizations Using Modern Productivity Metrics
Justin Reock
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
IoT-Powered Industrial Transformation – Smart Manufacturing to Connected Heal...
Rejig Digital
 
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
From Sci-Fi to Reality: Exploring AI Evolution
Svetlana Meissner
 
Jak MŚP w Europie Środkowo-Wschodniej odnajdują się w świecie AI
dominikamizerska1
 

An adaptive and eventually self healing framework for geo-distributed real-time data ingestion

  • 1. An adaptive and eventually self-healing framework for geo- distributed real-time data ingestion Angad Singh InMobi
  • 2. The problem domain Scale ● 15 billion events per day (post filtering) ● 1.5+ billion users, 200 million per day ● 4 geographically distributed data centers (DCs) ● User’s request may land on non-local DC Ingestion requirements ● multiple tenants, multiple schemas per tenant ● batch, stream, micro-batch and on-demand ingestion ● 20+ streams, 100+ data types ● need to ingest, transform, validate and aggregate this data ● need to ingest streaming data in real-time (<1 min) for ad-serving/targeting use cases (strict SLA)
  • 3. The problem domain Usage/serving requirements ● need to pivot this data by user, activity type and other primary keys ● serve an aggregated view (profile) at the end in < 5ms p99 latency ● need both real-time serving of the view ● as well as batch summaries for analytics, inference algorithms, feedback loops ● need to be resilient to failure, absolutely no room for data loss/lag in ingestion Data arrival, volume and velocity ● data may be received out of order, or duplicated ● data can arrive in periodic batches or real-time/streaming or once in a while ● data may arrive in bursts or trickle slowly in some streams (autoscale) ● user data may be received in any DC, but needs to be collectively available in a single DC
  • 4. The problem domain Multi-tenancy ● Quotas ● Rate limiting/SLAs ● Isolation Manageability ● need to be self-serve, flexible for specific changes in the flow, easily deployable ● may need online migration, reprocessing, etc. of data ● hassle-free schema evolution across the stack ● monitoring, visibility, operability aspects for all of the above
  • 6. Serving layer (user store) aerospike cluster API dedup, aggregate, business rules Ratelimiting/quotas API Ad serving <5ms,99.95%success notifications pubsub (kafka) notification listeners (storm) periodic dumps streaming offline snapshot store (HDFS) batch inference jobs (MR/spark) analytics engine (cubes, lens) real-time enrichment on user engagement Ingestion layer globaldcglobaldc offline snapshot store (HDFS) globaldc localdc localdc localdc upstreamingestionsources batchsources streamingsources adaptors localdc adaptors adaptors (MR/storm) localdclocaldc routers localdc routers routers (MR/storm) localdclocaldc sinkssinks sinks (MR/storm) remotedc Ingestion Service orchestrate/manage remotedc remotedc Architecture
  • 7. DC1 (global) DC2 (slave) DC3 (slave) adaptors (MR) adaptors (storm) adaptors (storm) routers (MR) routers (storm) routers (storm) User-Colo Metadata (aerospike) User-Colo Metadata (aerospike) User-Colo Metadata (aerospike) custom replication (contains userid) (contains userid) (contains userid) sinks getColo(userid) sinks sinks Kafka-data-replicator (stormtopology) global colo tagger (storm) tag not found tagged data tag found write tag User Store API History Profile User Store API History Profile User Store API History Profile XDR (profile) Cross-DC architecture custom replication
  • 8. DC1 (global) DC2 (slave) DC3 (slave) adaptors (MR) adaptors (storm) adaptors (storm) routers (storm) routers (storm) routers (storm) User-Colo Metadata (aerospike) User-Colo Metadata (aerospike) User-Colo Metadata (aerospike) XDRXDR (contains userid) (contains userid) (contains userid) sinks getColo(userid) sinks sinks Kafka-data-replicator (stormtopology) global colo tagger (storm) tag not found tagged data tag found write tag User Store API History Profile User Store API History Profile User Store API History Profile XDR (profile) Mapper Comparison to map-reducePartitioner Shuffler Reducer
  • 11. Current Features Business-agnostic APIs ● Built on simple RESTful APIs: Schema, Feed, Sink, Source, Flow, Driver, Data router, Adaptor ● Unified APIs for doing batch, streaming, micro-batch ingestion. ● Self-serve system which provides rule validation, metrics, etc. and makes the expression of sources, sinks and flows easy with custom DSL. Platform-agnostic Flow Execution ● Pluggable execution engine (storm, hadoop, spark) - provides a Driver API ● Uses falcon for batch scheduling, in-built scheduler for streaming drivers (storm, etc.) Serialization support ● Pluggable schema serde support (thrift, avro)
  • 12. Current Features Schema management ● Schema is a first class citizen. ● Contract between source, sink and flow all based on and validated against schema ● Schema versioning and compatibility checks. ● Error-free schema evolution across data flows ● Clean abstractions to centrally manage all the schemas, data sources/feeds, sinks (key value store, HDFS, etc.) and data flows (storm topologies, MR jobs) which are part of the ingestion pipelines Manageability, operability ● All entities - schemas, sinks and flows - can be updated online without any downtime. ● Retries, error handling, metrics, orchestration hooks, etc. come standard Out of the box support for ● Cross-colo flow chaining ● Data routing ● Transformation, validation, conversion ● All based on pluggable code
  • 15. The problems we’ve seen Storm ● as usual, lot of knobs to tune based on lot of metrics: workers, threads, tasks, acks, max spout pending, buffer sizes, xmx, num slots, execute/process/ack latency, capacity, etc. ● debugging storm topology’s isn’t easy: threads, workers, shared logs, shuffling of data between workers, netty, the ack system, etc. ● storm (0.9.x) doesn’t like heterogenous load: unbalanced distribution between supervisors. heavy topologies can choke each other. rebalancing not fully resource aware (1.x tries to solve this) ● no rolling upgrades, supervisor failures cause unrecoverable errors ● zookeeper issues: too many executors leads to worker heartbeat update failure to zk. ● storm-kafka issue: storm-kafka spout unaware of purging (earliestOffset update) ● storm-kafka issue: invisible data loss ● retries should done cautiously ● etc Kafka ● topic deletion asynchronous, slow ● tuning num partitions manually ● bad consumers can cause excessive logging on brokers
  • 16. Features under development ● Autoscaling flows - rebalance storm topology based on spout lag, priority and current throughput (or bolt capacity) - runtime metrics or linear regression on historical metrics ● Streaming and batch compaction / dedup of data based on domain specific rules ● Automatic fallback from streaming to batch ingestion in case of huge backlogs, for low priority ingestions ● Dynamic rerouting / sharding of data between DCs for load balancing cross- DC flows ● Eventual self-correction of data based on validations on the aggregated view (data received from multiple streams) ● Data lineage/auditing ● Backfill management