Big Stream Processing Systems & Big Graphs
Based on presentations during Brno Data Week 2018 by Prof. Sherif Sakr
Created by: Tichý, T. & Luhan, J. (Feb. 2019)
https://www.chedteb.eu/
Static vs Streaming Data Computation
• Today, many applications produce data continuously
(e.g., user activity logs, web logs, sensors, database transactions, ...).
• Stream processing engines analyze data as it arrives.
• The main goal of stream processing is to reduce the overall latency
of obtaining results.
Big stream
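The contrast between static and streaming computation can be sketched in a few lines. This is a minimal illustration, not the API of any stream engine: the names `batch_mean` and `running_means` are hypothetical, and the readings are made-up sample values.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Static computation: the full dataset must exist before we start."""
    return sum(values) / len(values)


def running_means(stream: Iterable[float]) -> Iterator[float]:
    """Streaming computation: emit an updated result after every event."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # a result is available immediately


readings = [4.0, 8.0, 6.0, 2.0]  # made-up sensor readings
print(batch_mean(readings))           # one answer, after all data is collected
print(list(running_means(readings)))  # one answer per arriving event
```

The streaming version keeps only a count and a total, so its latency per result does not depend on how much data has already arrived.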
Big streams
• In 2010, Walmart reported that it was handling more than 1 million
customer transactions every hour.
• The New York Stock Exchange (NYSE) reported trading more than 800
million shares on a typical day in October 2012.
• By the end of 2011, there were about 30 billion Radio-Frequency
Identification (RFID) tags.
• All of these applications and domains share a crucial requirement:
to collect, process and analyse big streams of data in real time.
Can we use Hadoop for Big Streams?
• From the stream-processing point of view, the main limitation
of Hadoop is that it was designed so that the entire output of
each map and reduce task is materialized into a local file before
it can be consumed by the next stage.
• This materialization step enables a simple and elegant
checkpoint/restart fault-tolerance mechanism, but it causes a
significant delay for jobs with real-time processing requirements.
Big stream: Processing systems
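The delay caused by materialization can be made visible with a small sketch. It is plain Python, not Hadoop code: `parse` and `double` are hypothetical stages, and a list stands in for the intermediate file. The `log` shows that the materialized variant finishes the whole first stage before the second stage sees anything, while the pipelined variant produces the first result immediately.

```python
from typing import Iterable, Iterator, List

log: List[str] = []  # records the order in which work happens


def parse(record: str) -> int:
    log.append(f"parse {record}")
    return int(record)


def double(number: int) -> int:
    log.append(f"double {number}")
    return number * 2


def materialized(records: List[str]) -> List[int]:
    """Hadoop-like: stage 1 stores all its output before stage 2 runs."""
    stage1 = [parse(r) for r in records]  # stands in for the intermediate file
    return [double(n) for n in stage1]


def pipelined(records: Iterable[str]) -> Iterator[int]:
    """Streaming-like: each record flows through both stages right away."""
    for r in records:
        yield double(parse(r))


materialized(["1", "2"])
print(log)  # both parses happen before the first double
log.clear()
list(pipelined(["1", "2"]))
print(log)  # parse and double alternate record by record
```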
Apache Storm
• Storm is a real-time distributed computing framework for reliably
processing unbounded data streams.
• Storm was created by Nathan Marz and his team at BackType, and
released as open source in 2011 after BackType was acquired by Twitter.
• Entered the Apache Incubator in September 2013 and became a
top-level Apache project in September 2014.
• Provides general primitives for real-time computation.
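Storm's documentation describes its primitives as spouts (stream sources) and bolts (processing steps) wired into a topology; those terms are not on the slide. The sketch below only imitates that shape with Python generators. It does not use Storm's API, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def sentence_spout() -> Iterator[str]:
    """Source of the stream; a real spout would never run out of data."""
    yield "big streams need fast answers"
    yield "big graphs need iteration"


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Processing step: turn each sentence into a stream of words."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Processing step: emit the updated counts after every word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield dict(counts)


# "Topology": spout -> split -> count, evaluated one tuple at a time.
latest: Dict[str, int] = {}
for latest in count_bolt(split_bolt(sentence_spout())):
    pass
print(latest["big"], latest["need"])
```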
Big graphs
• While it is great that we can analyse a huge
amount of data, the results would not be useful without
some kind of graphical presentation of this
data.
• BigData = BigGraphs.
• We use many algorithms to analyse and visualize our data.
Big graphs
Examples of graph processing algorithms
• PageRank
• Triangle Counting
• Connected Components
• Random Walk
• Graph Coloring
• Community Detection
• and many others
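As an illustration of the first algorithm in the list, here is a minimal PageRank sketch using power iteration. It is a teaching example under stated assumptions: every vertex has at least one outgoing edge, the damping factor 0.85 and the iteration count are conventional defaults rather than values from the slides, and the three-vertex graph is made up.

```python
from typing import Dict, List


def pagerank(out_links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank; assumes every vertex has an outgoing edge."""
    n = len(out_links)
    ranks = {v: 1.0 / n for v in out_links}
    for _ in range(iterations):
        # Every vertex starts with the "random jump" share ...
        new_ranks = {v: (1.0 - damping) / n for v in out_links}
        # ... and then receives an equal slice of each in-neighbour's rank.
        for v, targets in out_links.items():
            share = damping * ranks[v] / len(targets)
            for t in targets:
                new_ranks[t] += share
        ranks = new_ranks
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # made-up example graph
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # the vertex with the highest rank
```

Note that the same edge lists are traversed in every iteration, which is the access pattern the following slides discuss.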
Main challenges of graph processing
• Data is dynamic -> no way of doing "schema on write"
• Structure-driven computation -> poor memory locality and data transfer issues
• Algorithms are explorative and iterative
• Combinatorial explosion of datasets -> relationships grow exponentially and limited scalability
• Irregular structure -> challenging graph partitioning and limited parallelism
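Why partitioning is challenging can be shown by counting cut edges, i.e. edges whose endpoints end up on different workers and therefore cause network traffic. The sketch below is an assumption-laden toy: a made-up ring of six vertices, two workers, and two hypothetical placement rules.

```python
from typing import Callable, List, Tuple

Edge = Tuple[int, int]


def cut_edges(edges: List[Edge], part_of: Callable[[int], int]) -> int:
    """Count edges whose endpoints land on different workers."""
    return sum(1 for u, v in edges if part_of(u) != part_of(v))


# Made-up example: a ring of six vertices, split across two workers.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

modulo_cut = cut_edges(ring, lambda v: v % 2)  # hash-style placement
range_cut = cut_edges(ring, lambda v: v // 3)  # neighbours kept together
print(modulo_cut, range_cut)
```

On a regular ring a good placement is easy to find; on an irregular real-world graph no such simple rule exists, which is the point of the last bullet.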
Can we use Hadoop for Big Graphs?
• MapReduce does not directly support iterative
algorithms.
• The invariant graph topology data is reloaded and reprocessed
at each iteration, wasting I/O, network bandwidth, and
CPU.
• Materialization of intermediate results at every
MapReduce iteration harms performance.
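The cost of reloading the topology can be illustrated with a toy driver loop. This is plain Python, not MapReduce code: `load_edges` stands in for reading the unchanged edge list from storage, `one_job` for a single job, and the graph is made up. The algorithm is connected components by label propagation.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]
edge_reads = 0  # how many edge records were loaded from "storage"


def load_edges() -> List[Edge]:
    """Stands in for re-reading the unchanged topology from disk."""
    global edge_reads
    edges = [(1, 2), (2, 3), (4, 5)]  # made-up graph
    edge_reads += len(edges)
    return edges


def one_job(labels: Dict[int, int]) -> Dict[int, int]:
    """One MapReduce-style pass: each vertex takes the smallest label nearby."""
    new_labels = dict(labels)
    for u, v in load_edges():  # the topology is loaded again in every job
        smallest = min(labels[u], labels[v])
        new_labels[u] = min(new_labels[u], smallest)
        new_labels[v] = min(new_labels[v], smallest)
    return new_labels


labels = {v: v for v in range(1, 6)}
jobs = 0
while True:  # the loop lives in an external driver, not in MapReduce itself
    updated = one_job(labels)
    jobs += 1
    if updated == labels:
        break
    labels = updated
print(labels, jobs, edge_reads)
```

The three edges are read once per job even though they never change; only the labels do.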
An Overview of Big Graph Processing Systems
Big graphs: Processing systems
Google Pregel
• The first BSP-based (Bulk Synchronous Parallel) implementation for graph processing
• Communication through message passing (usually sent along
the outgoing edges from each vertex) on a shared-nothing architecture
• Advantages:
  • No locks -> message-based communication
  • No semaphores -> global synchronization
  • Iteration isolation -> massively parallelizable
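The vertex-centric BSP model can be sketched on a single machine. The example below propagates the maximum value through a graph in supersteps. It is an illustration of the model only, not Pregel's actual API; the function name, the graph, and the values are made up, and every vertex is assumed to appear as a key in `out_edges`.

```python
from collections import defaultdict
from typing import Dict, List


def propagate_max(out_edges: Dict[str, List[str]],
                  values: Dict[str, int]) -> Dict[str, int]:
    """Every vertex ends up with the largest value that can reach it."""
    values = dict(values)
    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox: Dict[str, List[int]] = defaultdict(list)
    for vertex, targets in out_edges.items():
        for target in targets:
            inbox[target].append(values[vertex])

    while inbox:  # no messages in flight -> all vertices have halted
        outbox: Dict[str, List[int]] = defaultdict(list)
        # Each vertex touches only its own value and messages: no locks needed.
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for target in out_edges[vertex]:
                    outbox[target].append(best)
        # Barrier: messages sent now are delivered in the next superstep.
        inbox = outbox
    return values


out_edges = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
result = propagate_max(out_edges, {"a": 3, "b": 6, "c": 2, "d": 1})
print(result)
```

Swapping `inbox` for `outbox` plays the role of the global synchronization barrier: within one superstep a vertex never sees messages produced in that same superstep.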
GraphX
• A distributed graph engine built on top of Spark.
• GraphX extends Spark's Resilient Distributed Dataset (RDD)
abstraction to introduce the Resilient Distributed Graph
(RDG).
• The GraphX RDG leverages advances in distributed graph
representation and exploits the graph structure to minimize
communication and storage overhead.
GraphX
• One system for the entire graph pipeline. Unlike other graph
processing systems, the GraphX API enables the composition
of graphs with unstructured and tabular data.
• Tables and graphs are views of the same physical data.
• Each view has its own operators that exploit the semantics of
the view to achieve efficient execution.
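The idea that tables and graphs are two views of the same physical data can be sketched without Spark. The example below is plain Python rather than the GraphX API: one vertex table and one edge table (made-up data) are queried once with a table-style operator and twice with graph-style operators, all under hypothetical names.

```python
from typing import Dict, List, Tuple

# One physical dataset: a vertex table and an edge table (made-up data).
vertices: Dict[int, str] = {1: "alice", 2: "bob", 3: "carol"}
edges: List[Tuple[int, int, str]] = [(1, 2, "follows"), (3, 2, "follows"),
                                     (2, 1, "blocks")]


def rows_with_label(label: str) -> List[Tuple[int, int, str]]:
    """Table view: an ordinary filter over edge rows."""
    return [row for row in edges if row[2] == label]


def in_degrees() -> Dict[int, int]:
    """Graph view: a structural question answered from the same rows."""
    degrees = {v: 0 for v in vertices}
    for _, dst, _ in edges:
        degrees[dst] += 1
    return degrees


def triplets() -> List[Tuple[str, str, str]]:
    """Graph view: join each edge with the attributes of both endpoints."""
    return [(vertices[src], label, vertices[dst]) for src, dst, label in edges]


print(rows_with_label("follows"))
print(in_degrees())
print(triplets())
```

No data is copied or converted between the two views; only the operators differ.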
To be continued
The last episode of our series will
focus on machine learning
in Big Data and the challenges we
face in Big Data.

