Real-time stream processing
for Big Data
Presented by Luay AL-Assadi
INTRODUCTION
Rise of Web 2.0 and the Internet of Things.
• Huge amounts of data (e.g., sensors, social media, online marketing).
• Many kinds of tracked information are valuable only for a short time and therefore have to be processed immediately.
• Example: monitoring user activity to optimize product or video recommendations for the current user context.
Traditional batch-oriented approaches.
• Complex Event Processing (CEP) engines and DBMSs.
Distributed data processing.
• MapReduce.
Real-time analytics: Big Data in motion
• Real-time data infrastructure:
  • Built from distributed components.
  • Components communicate over an asynchronous network.
  • Engineered on top of the JVM (Java Virtual Machine).
• Real-time Big Data basic architecture model:
  • Collecting data from various places.
  • Moving data to the streaming layer.
  • Analyzing data in the stream processor.
  • Forwarding outputs to the serving layer.
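The four steps of the basic architecture model can be sketched as a tiny in-memory pipeline. This is only an illustration under simplifying assumptions: the `collect` and `run_pipeline` functions, the `(user, clicks)` event shape, and the use of a `deque` as the streaming layer are all hypothetical stand-ins, not part of any real system named in the slides.

```python
from collections import deque


def collect(sources):
    """Collection step: gather raw events from several (hypothetical) sources."""
    for source in sources:
        yield from source


def run_pipeline(sources):
    stream = deque()   # stands in for the streaming layer (a message queue)
    serving = {}       # stands in for the serving layer (a queryable view)

    # Move collected data to the streaming layer.
    for event in collect(sources):
        stream.append(event)

    # Stream processor: analyze each event and forward the result.
    while stream:
        user, clicks = stream.popleft()
        serving[user] = serving.get(user, 0) + clicks
    return serving


sensors = [("alice", 1), ("bob", 2)]
web = [("alice", 3)]
view = run_pipeline([sensors, web])
print(view)
```

In a real deployment each stage would run as a separate distributed component; here they share one process so the data flow is easy to follow.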
Real-time analytics: Big Data in motion
• Big Data Architecture Model: Lambda Architecture
[Diagram: Collecting Data -> Streaming Data -> Batch processing (Store Data) and Stream processing -> Serving Layer]
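A minimal sketch of the Lambda idea, assuming a simple page-view count as the query: a batch view is recomputed from the stored dataset, a real-time view covers events that arrived since the last batch run, and the serving layer merges both. The function names and sample data are hypothetical.

```python
from collections import Counter


def batch_view(stored_events):
    """Batch layer: recompute counts from the complete stored dataset."""
    return Counter(stored_events)


def realtime_view(recent_events):
    """Stream layer: count only events that arrived after the last batch run."""
    return Counter(recent_events)


def serve(key, batch, realtime):
    """Serving layer: answer a query by merging both views."""
    return batch.get(key, 0) + realtime.get(key, 0)


stored = ["page_a", "page_b", "page_a"]   # already covered by the batch run
recent = ["page_a", "page_c"]             # not yet covered by the batch run

b = batch_view(stored)
r = realtime_view(recent)
print(serve("page_a", b, r))
```

The cost of this design is that the same logic has to exist twice, once for the batch path and once for the streaming path.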
Real-time analytics: Big Data in motion
• Big Data Architecture Model: Kappa Architecture
[Diagram: Collecting Data -> Streaming Data (Store, retain Data) -> Stream processing -> Serving Layer]
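A rough sketch of the Kappa idea, assuming the streaming layer retains every event: there is no separate batch path, and changed logic is applied by replaying the retained log through the stream processor again. The `log` list, the `process` function, and the sample events are hypothetical.

```python
log = []  # retained, append-only event log (the single source of truth)


def process(events, transform):
    """Stream processor: fold events into a serving view."""
    view = {}
    for user, amount in events:
        view[user] = view.get(user, 0) + transform(amount)
    return view


for event in [("alice", 10), ("bob", 5), ("alice", 7)]:
    log.append(event)

# Initial job over the stream.
view_v1 = process(log, lambda amount: amount)

# The logic changed: replay the same retained log with the new job
# instead of maintaining a separate batch layer.
view_v2 = process(log, lambda amount: amount * 2)

print(view_v1, view_v2)
```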
Real-time streamers
• RabbitMQ.
  • Broker-centric, with message acknowledgements.
  • Focused on delivery guarantees between producers and consumers.
  • The broker may fall over if consumers are too slow.
[Diagram: Producer, BROKER, Consumer, with Message and Ack arrows]
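The acknowledgement idea can be illustrated with a toy in-memory broker: a delivered message stays tracked until the consumer acknowledges it, and unacknowledged messages are put back on the queue. The `Broker` class and its methods are hypothetical and are not the RabbitMQ client API; this is a sketch of the concept only.

```python
from collections import deque


class Broker:
    """Toy broker that tracks deliveries until they are acknowledged."""

    def __init__(self):
        self.queue = deque()
        self.unacked = {}     # delivery tag -> message body
        self.next_tag = 0

    def publish(self, body):
        self.queue.append(body)

    def deliver(self):
        """Hand the next message to a consumer and remember it as unacked."""
        if not self.queue:
            return None
        body = self.queue.popleft()
        tag = self.next_tag
        self.next_tag += 1
        self.unacked[tag] = body
        return tag, body

    def ack(self, tag):
        """Consumer confirms processing; the broker can forget the message."""
        del self.unacked[tag]

    def requeue_unacked(self):
        """Called when a consumer is lost: redeliver what was never acked."""
        for tag in sorted(self.unacked):
            self.queue.append(self.unacked[tag])
        self.unacked.clear()


broker = Broker()
broker.publish("m1")
broker.publish("m2")

tag1, body1 = broker.deliver()
broker.ack(tag1)                  # m1 processed successfully

tag2, body2 = broker.deliver()    # consumer crashes before acking m2
broker.requeue_unacked()
tag3, body3 = broker.deliver()    # m2 is delivered again
```

Because the broker holds per-message state, a backlog of undelivered or unacknowledged messages grows inside the broker when consumers are slow.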
Real-time streamers
• Kafka.
  • Producer-centric.
  • Supports both online and offline consumers.
  • Uses ZooKeeper to reliably maintain its state across a cluster.
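One way to picture online and offline consumers is an append-only log in which every consumer tracks its own read offset. The sketch below is a toy in-memory model under that assumption; `Log` and `Consumer` are hypothetical names, and this is not the Kafka client API.

```python
class Log:
    """Toy append-only log standing in for one topic partition."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record


class Consumer:
    """Each consumer keeps its own offset, so a slow reader does not block others."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch


log = Log()
online, offline = Consumer(log), Consumer(log)

log.append("a")
first = online.poll()       # online consumer reads as data arrives
log.append("b")
log.append("c")
second = online.poll()
catch_up = offline.poll()   # offline consumer reads everything later, in one go
```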
Real-time processors:
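Latency vs. Throughput & Efficiency
• Low latency (stream): handling data items immediately as they arrive.
• High throughput (batch): buffering data items and processing them in batches increases efficiency.
• Micro-batch, between the two: stream processors group tuples into batches, and batch processors restrict the batch size.
[Figure: spectrum from low latency to high throughput. Stream: SAMZA, STORM. Micro-batch: Trident, SPARK Streaming. Batch: SPARK.]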
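The difference between one-at-a-time handling and micro-batching can be sketched in a few lines. The function names and the `max_batch_size` parameter are hypothetical; the point is only that the handler is invoked once per item in the first case and once per bounded group in the second.

```python
def one_at_a_time(items, handle):
    """Lowest latency: every item is handed over as soon as it arrives."""
    for item in items:
        handle([item])


def micro_batch(items, handle, max_batch_size):
    """Higher throughput: buffer items and hand them over in bounded groups."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == max_batch_size:
            handle(batch)
            batch = []
    if batch:                 # flush the final, possibly smaller batch
        handle(batch)


single_calls = []
one_at_a_time(range(5), single_calls.append)

batched_calls = []
micro_batch(range(5), batched_calls.append, max_batch_size=2)

print(len(single_calls), len(batched_calls))
```

Fewer handler invocations amortize per-call overhead, at the price that an item may wait until its batch fills up.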
Real-time processors
• STORM
  • Storm was developed by Nathan Marz at BackType, which was acquired by Twitter in 2011.
  • Initially promoted as the “Hadoop of real-time”.
  • A vital part of a Storm deployment is a ZooKeeper cluster for reliable coordination.
Real-time processors
• STORM
  • Topology: a network made of spouts and bolts, similar to a Hadoop MapReduce job.
  • Stream: an unbounded sequence of tuples.
  • Spouts and bolts: a spout receives data continuously, transforms it into a stream of tuples, and sends the tuples to the bolts to be processed.
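The spout and bolt roles can be imitated with Python generators. This is a conceptual sketch only and does not use the Storm API: the function names are hypothetical, the "topology" is a simple chain, and the spout is finite so the example terminates, whereas a real stream is unbounded.

```python
def sentence_spout():
    """Spout: turns a raw source into a stream of tuples."""
    for line in ["storm processes streams", "streams of tuples"]:
        yield (line,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
result = dict(topology)   # the last emitted count per word wins
print(result)
```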
Real-time processors
• STORM: Nodes
  • Master node: runs a daemon called ‘Nimbus’, which is similar to the ‘JobTracker’ of a Hadoop cluster.
    • Assigns jobs.
    • Monitors performance.
Real-time processors
• STORM: Nodes
  • Worker node: runs a daemon called ‘Supervisor’, which runs one or more worker processes on its node.
  • Apache ZooKeeper facilitates communication between Nimbus and the Supervisors with the help of message acknowledgements and processing status.
Real-time processors
• SAMZA
  • Initially created at LinkedIn and submitted to the Apache Incubator in July 2013.
  • Samza was co-developed with the queueing system Kafka.
  • Samza requires a little more work than Storm to deploy, as it not only depends on a ZooKeeper cluster but also runs on top of Hadoop YARN.
Real-time processors
• SAMZA - YARN
  • YARN is a cluster scheduler. It allows you to allocate a number of containers (processes) in a cluster of machines and execute arbitrary commands on them. The Samza client uses YARN to run a Samza job.
  • NodeManager: responsible for launching processes on its machine.
  • ResourceManager: talks to all of the NodeManagers to tell them what to run.
  • ApplicationMaster: responsible for managing the application’s workload, asking for containers, and handling notifications when one of its containers fails.
Real-time processors
• SAMZA
  • Decouples individual processing steps.
  • Buffering data between processing steps makes (intermediate) results available to unrelated parties.
  • Prevents data loss by periodically checkpointing current progress and, after a failure, reprocessing all data from the last checkpoint.
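Checkpointing and replay can be sketched as follows. This is a simplified model and not the Samza API: the `run` function, the checkpoint interval of two messages, and the simulated failure are assumptions made for illustration, and the state is snapshotted together with the offset to keep the example short.

```python
def run(log, checkpoint, processed, fail_at=None):
    """Process `log` from the checkpointed offset, checkpointing every 2 messages."""
    offset = checkpoint["offset"]
    state = dict(checkpoint["state"])        # restore state from the checkpoint
    while offset < len(log):
        if offset == fail_at:
            raise RuntimeError("simulated container failure")
        state["sum"] = state.get("sum", 0) + log[offset]
        processed.append(log[offset])        # record every processing attempt
        offset += 1
        if offset % 2 == 0:                  # periodic checkpoint
            checkpoint["offset"] = offset
            checkpoint["state"] = dict(state)
    return state


log = [1, 2, 3, 4, 5]
checkpoint = {"offset": 0, "state": {}}
processed = []

try:
    run(log, checkpoint, processed, fail_at=3)   # fails after handling 3 messages
except RuntimeError:
    pass

final_state = run(log, checkpoint, processed)    # restart from the last checkpoint
print(final_state, processed)
```

The third message is handled twice because it was processed after the last checkpoint and before the failure; no message is lost, but some may be reprocessed.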
Real-time processors
• SPARK
  • A batch-processing framework that is often mentioned as the unofficial successor of Hadoop, as it offers several benefits in comparison.
  • Significant performance improvements through in-memory caching.
  • Provides a variety of machine learning algorithms out of the box through the MLlib library.
Real-time processors
• SPARK – Architecture
Discussion
                      STORM           SAMZA                      SPARK
Achievable latency    << 100 ms       < 100 ms                   < 1 s
Processing model      one-at-a-time   one-at-a-time              micro-batch
Ordering guarantees   No              within stream partitions   between batches
Elasticity: Yes, Yes, No
Taken together, these systems show that low latency has to be traded off against other desirable properties such as throughput, fault tolerance, reliability (processing guarantees), and ease of development.
References
• https://www.quora.com/What-are-the-differences-between-Apache-Spark-Storm-Samza-Flink-Beam-Apex
• https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems
• https://samza.apache.org/learn/documentation/0.10/introduction/architecture.html
• https://dzone.com/articles/streaming-big-data-storm-spark
• Paper: Real-time stream processing for Big Data
THANKS
