Big Data Streams Architectures
Why? What? How?
Anton Nazaruk
CTO @ VITech+
Big Data in 2016+?
● No longer an exotic buzzword
● Mature enough and already adopted by the majority of businesses
● A set of well-defined tools and processes… questionable
● Data analysis at scale - extracting value from your data!
○ Prescriptive - reveals what action should be taken
○ Predictive - analysis of likely scenarios of what might happen
○ Diagnostic - analysis of the past, shows what happened and why (classic)
○ Descriptive - real-time analytics (stocks, healthcare, ...)
Big Data analysis challenges
● Integration - the ability to have the needed data in the needed place
● Latency - data has to be available for processing immediately
● Throughput - the ability to consume and process massive volumes of data
● Consistency - a data mutation in one place must be reflected everywhere
● Team collaboration - inconvenient interfaces for inter-team communication
● Technology adoption - a typical technology stack greatly complicates the entire project ecosystem: a whole new world of hiring, deployment, testing, scaling, fault tolerance, upgrades, monitoring, etc.
It’s a challenge!
Evolutionary system
Solution
The Event Log
What every software engineer should know about real-time data's unifying abstraction
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
The event log
Reference architecture
Transition case
Unified ordered event log
Kafka
● Fast - a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
● Scalable - can be elastically and transparently expanded without downtime
● Durable - messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact
● Reliable - has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees
Kafka - high level view
Kafka - building blocks
● Producer - a process that publishes messages to a Kafka topic
● Topic - a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log
● Partition - a part of a topic and the unit of parallelism in Kafka. Write/read order is guaranteed only within a partition
● Replica - an up-to-date copy of a partition. Each partition is replicated across a configurable number of servers for fault tolerance (like an HDFS block)
● Consumer - process that subscribes to topics and processes published
messages
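The building blocks above can be illustrated with a toy in-memory model. This is only a sketch, not Kafka's client API: the `Topic` class, its method names and the CRC32-based partitioner are assumptions made for illustration. It shows that events with the same key land in the same partition and are read back in publish order from a given offset.

```python
from zlib import crc32


class Topic:
    """Toy in-memory topic: a fixed number of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Hypothetical partitioner: stable hash of the key modulo partition count.
        return crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, value):
        """Producer side: append to one partition, return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        """Consumer side: read a partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = Topic("page-views", num_partitions=3)
for i in range(5):
    topic.publish("user-a", f"a{i}")
    topic.publish("user-b", f"b{i}")

# All events for one key sit in one partition, in the order they were published.
p = topic.partition_for("user-a")
values_a = [value for key, value in topic.read(p, 0) if key == "user-a"]
print(values_a)
```

Ordering holds per partition only: events for "user-a" and "user-b" may interleave differently depending on which partitions their keys map to.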
Stream Processing - high level
Stream Processing - possible implementation frameworks
● Apache Storm
● Apache Spark
● Apache Samza
● Apache Flink
● Apache Flume
● ...
● Pros
○ Automatic fault tolerance
○ Scaling
○ Guarantees against data loss
○ Stream processing DSL/SQL (joins, filters, count aggregates, etc.)
● Cons
○ Overall system complexity grows significantly
■ New cluster to maintain, monitor, upgrade, etc. (Apache Storm)
■ Multi-pattern (mixed) data access (Spark/Samza on YARN)
○ Another framework for your team to learn
Stream Processing - microservices
Small, independent processes that communicate with each other to form
complex applications which utilize language-agnostic APIs.
These services are small building blocks, highly decoupled and focused on
doing a small task, facilitating a modular approach to system-building.
The microservices architectural style is becoming the standard for building
modern applications.
Stream Processing - microservices communication
The three most commonly used protocols are:
● Synchronous request-response calls (mainly via HTTP REST APIs)
● Asynchronous (non-blocking IO) request-response communication (Akka, Play Framework, etc.)
● Asynchronous message buffers (RabbitMQ, JMS, ActiveMQ, etc.)
Stream Processing - microservices platforms
Microservices deployment platforms:
● Apache Mesos with a framework like Marathon
● Swarm from Docker
● Kubernetes
● YARN with something like Slider
● Various hosted container services such as ECS from Amazon
● Cloud Foundry
● Heroku
Stream Processing - microservices
Why can’t I just package and deploy my event-processing code on a YARN / Mesos / Docker / Amazon cluster and let it take care of fault tolerance, scaling and other weird things?
Stream Processing - microservices communication
The fourth protocol is:
● Asynchronous, ordered and manageable logs of events - Kafka
Stream Processing - new era (Kafka & microservices)
Stream Processing - Kafka
● New Kafka Consumer (0.9+)
○ Light - the consumer client is just a thin JAR without heavy third-party dependencies (ZooKeeper, Scala runtime, etc.)
○ Acts as a load balancer
○ Fault tolerant
○ Simple-to-use API
○ Kafka Streams - elegant DSL (should be officially released this month)
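The "acts as a load balancer" and "fault tolerant" points can be pictured with a simplified partition assignment. This is a sketch under stated assumptions: `assign_partitions` and the service names are hypothetical, and Kafka's own assignment strategies are more involved. The idea is that partitions are spread across the live consumers of a group and redistributed when one of them disappears.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the live consumers in round-robin fashion."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = list(range(6))

# Three instances of the same service share the six partitions.
before = assign_partitions(partitions, ["svc-1", "svc-2", "svc-3"])

# svc-2 crashes: its partitions are handed to the remaining instances.
after = assign_partitions(partitions, ["svc-1", "svc-3"])

print(before)
print(after)
```

Every partition always has exactly one owner in the group, so adding instances spreads the load and losing one does not leave data unread.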
Stream Processing - Kafka & microservices
1. Language-agnostic logs of events (buffers)
2. No backpressure on consumers (API endpoints with sync approach)
3. Fault tolerance - no data loss
4. A failed service doesn’t bring the entire chain down
5. Resuming from the last committed offset position
6. No circuit-breaker-like patterns needed
7. Smooth config management across all nodes and services
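Points 3 to 5 can be sketched with a small simulation. All names here (`OffsetStore`, `run_service`, the "billing" group) are hypothetical and stand in for what the Kafka consumer and broker do. A service commits its position after each processed event, crashes, and a restarted instance resumes from the last committed offset.

```python
class OffsetStore:
    """Stands in for the durable store of committed offsets."""

    def __init__(self):
        self._committed = {}

    def commit(self, group, offset):
        self._committed[group] = offset

    def committed(self, group):
        return self._committed.get(group, 0)


def run_service(log, store, group, handle, crash_at=None):
    """Process events from the last committed offset; commit after each one."""
    offset = store.committed(group)
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            raise RuntimeError("simulated crash")
        handle(log[offset])
        offset += 1
        store.commit(group, offset)  # the committed value is the next offset to read


log = [f"event-{i}" for i in range(6)]
store = OffsetStore()
processed = []

try:
    run_service(log, store, "billing", processed.append, crash_at=3)
except RuntimeError:
    pass  # the service died; the log and the committed offset survive

run_service(log, store, "billing", processed.append)  # restarted instance
print(processed)
```

Committing after processing means an event could be handled twice if the crash falls between handling and committing, so handlers are usually written to tolerate repeats.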
Lambda Architecture
Kappa Architecture
Architectures comparison
                                                   | Lambda            | Kappa
Processing paradigm                                | Batch + Streaming | Streaming
Re-processing paradigm                             | Every batch cycle | Only when code changes
Resource consumption                               | Higher            | Lower
Maintenance/support complexity                     | Higher            | Lower
Ability to re-create dataset for any point in time | No (or very hard) | Yes
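The Kappa column can be illustrated by replaying a retained log. This is a minimal sketch with made-up events and hypothetical job functions (`job_v1`, `job_v2`, `rebuild`), not a reference implementation. Re-processing happens only when the code changes, and a dataset for an earlier point in time is obtained by replaying the log up to a cut-off offset.

```python
from collections import Counter

# Retained event log: (user, action) pairs, oldest first.
event_log = [
    ("alice", "view"), ("bob", "view"), ("alice", "buy"),
    ("alice", "view"), ("bob", "buy"), ("bob", "view"),
]


def job_v1(events):
    """Original streaming job: count all events per user."""
    view = Counter()
    for user, _action in events:
        view[user] += 1
    return dict(view)


def job_v2(events):
    """Changed logic: count only purchases per user."""
    view = Counter()
    for user, action in events:
        if action == "buy":
            view[user] += 1
    return dict(view)


def rebuild(job, log, up_to=None):
    """Replay the log from offset 0 (optionally up to a cut-off) into a fresh view."""
    return job(log[:up_to])


v1_view = rebuild(job_v1, event_log)            # view served by the first version
v2_view = rebuild(job_v2, event_log)            # code changed: replay and swap views
snapshot = rebuild(job_v2, event_log, up_to=3)  # dataset as of the first three events

print(v1_view, v2_view, snapshot)
```

There is a single code path for live and historical data, which is where the lower maintenance cost in the table comes from.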
Even more interesting comparison
                             | Hadoop-centric system   | Kafka-centric system
Data replication             | +                       | +
Fault tolerance              | +                       | +
Scaling                      | +                       | +
Random reads                 | With HBase              | With Elasticsearch/Solr
Ordered reads                | -                       | +
Secondary indices            | With Elasticsearch/Solr | With Elasticsearch/Solr
Storage for big files (>10M) | +                       | -
TCO                          | Higher                  | Lower
Summary
1. Event-log-centric system design - from chaos to structured architecture
2. Kafka as a reference storage implementation of the event log
3. Microservices as a distributed event-processing approach
4. Kappa Architecture as a symbiosis of microservices and Kafka
Useful links
1. “I Heart Logs” by Jay Kreps http://shop.oreilly.com/product/0636920034339.do
2. http://confluent.io/blog
3. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
4. “Making Sense of Stream Processing” by Martin Kleppmann
5. http://kafka.apache.org/
6. http://martinfowler.com/articles/microservices.html