Big Stream Processing Systems, Big Graphs

Big Stream Processing Systems & Big Graphs
Based on presentations during Brno Data Week 2018 by Prof. Sherif Sakr
Created by: Tichý, T. & Luhan, J. (Feb. 2019)
https://www.chedteb.eu/

Static vs Streaming Data Computation
• In many applications today, data is produced continuously
(e.g., user activity logs, web logs, sensors, database transactions, ...).
• Stream processing engines analyze data as it arrives.
• The main goal of stream processing is to reduce the overall latency
of obtaining results.
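
The contrast between static and streaming computation can be sketched in a few lines of Python. This is only an illustration: the function names and the sample readings are hypothetical and do not come from the presentation. The batch version needs the complete dataset before it can answer, while the streaming version emits an updated result after every event.

```python
from typing import Iterable, Iterator, List


def batch_average(values: List[float]) -> float:
    """Static computation: needs the complete dataset before it can answer."""
    return sum(values) / len(values)


def streaming_average(events: Iterable[float]) -> Iterator[float]:
    """Streaming computation: emits an updated result after every event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor readings, used only for illustration.
readings = [10.0, 20.0, 30.0, 40.0]
final_batch = batch_average(readings)
running = list(streaming_average(readings))
```

The last value of `running` equals `final_batch`, but the streaming version already had a usable answer after the first reading.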
Big stream

Big streams
• In 2010, Walmart reported that it was handling more than 1 million
customer transactions every hour.
• The New York Stock Exchange (NYSE) reported trading more than 800
million shares on a typical day in October 2012.
• By the end of 2011, there were about 30 billion Radio-Frequency
Identification (RFID) tags.
• In all of these applications and domains, there is a crucial requirement
to collect, process and analyse big streams of data in real time.

Can we use Hadoop for Big Streams?
• From the stream-processing point of view, the main limitation
of Hadoop is that it was designed so that the entire output of
each map and reduce task is materialized into a local file before
it can be consumed by the next stage.
• This materialization step enables a simple and elegant
checkpoint/restart fault-tolerance mechanism, but it causes a
significant delay for jobs with real-time processing requirements.
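
The effect of materialization on latency can be sketched in Python. This is a simplified model under stated assumptions, not Hadoop code: an in-memory list stands in for the local file, and the stage functions and trace lists are hypothetical. The trace records the order in which the stages touch each record.

```python
from typing import Iterable, Iterator, List


def map_stage(records: Iterable[int], trace: List[str]) -> Iterator[int]:
    """First stage: doubles each record and logs when it was processed."""
    for record in records:
        trace.append(f"map {record}")
        yield record * 2


def consume_stage(records: Iterable[int], trace: List[str]) -> Iterator[int]:
    """Next stage: adds one to each record and logs when it was processed."""
    for record in records:
        trace.append(f"consume {record}")
        yield record + 1


def run_materialized(records: List[int], trace: List[str]) -> List[int]:
    # The whole intermediate output is built before the next stage starts
    # (the list stands in for the local file used by Hadoop).
    intermediate = list(map_stage(records, trace))
    return list(consume_stage(intermediate, trace))


def run_pipelined(records: List[int], trace: List[str]) -> List[int]:
    # Each record flows through both stages before the next one is read.
    return list(consume_stage(map_stage(records, trace), trace))


materialized_trace: List[str] = []
pipelined_trace: List[str] = []
materialized_result = run_materialized([1, 2, 3], materialized_trace)
pipelined_result = run_pipelined([1, 2, 3], pipelined_trace)
```

Both runs produce the same result, but in the materialized run the first "consume" entry appears only after every "map" entry, whereas in the pipelined run it appears right after the first record is mapped.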
Big stream: Processing systems

Apache Storm
• Storm is a real-time distributed computing framework for reliably
processing unbounded data streams.
• Storm was created by Nathan Marz and his team at BackType and was
released as open source in 2011, after Twitter acquired BackType.
• It entered the Apache Incubator in September 2013.
• Provides general primitives for real-time computations.
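
Storm's primitives are usually described as spouts (stream sources) and bolts (processing steps) wired into a topology. The following Python sketch only mimics that shape with generators. It is not Storm's API: all function names are hypothetical, and a real spout would produce an unbounded stream rather than a short list.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout-like source: emits one sentence at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt-like step: splits each sentence into lower-case words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt-like step: emits the running count of each word as it arrives."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Wire the "topology": spout -> split -> count.
updates = list(count_bolt(split_bolt(sentence_spout(["big stream", "big graph"]))))
```

Each element of `updates` is produced as soon as its word arrives, which is the property that matters for real-time processing.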

Big graphs
• While it is great that we can analyse a huge
amount of data, it would not be useful without
some kind of graphical presentation of this
data.
• BigData = BigGraphs.
• We use many algorithms to visualize our data.
Big graphs

Examples of graph processing algorithms
• PageRank
• Triangle Counting
• Connected Components
• Random Walk
• Graph Coloring
• Community Detection
• and many others
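
As an illustration of the first algorithm in the list, here is a minimal PageRank sketch in Python using plain power iteration. The tiny graph, the damping factor of 0.85 and the fixed iteration count are assumptions made for the example. The sketch also assumes every vertex has at least one outgoing edge and that every edge target is itself a key of the graph.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over an adjacency list.

    Assumes every vertex has at least one outgoing edge.
    """
    n = len(graph)
    ranks = {vertex: 1.0 / n for vertex in graph}
    for _ in range(iterations):
        incoming = {vertex: 0.0 for vertex in graph}
        for vertex, targets in graph.items():
            # A vertex splits its current rank evenly among its out-edges.
            share = ranks[vertex] / len(targets)
            for target in targets:
                incoming[target] += share
        ranks = {vertex: (1.0 - damping) / n + damping * incoming[vertex]
                 for vertex in graph}
    return ranks


# Hypothetical three-page web graph.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
```

Note that every iteration walks over the whole graph again, which is the iterative access pattern typical of graph algorithms.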

Main challenges of graph processing
• Data is dynamic -> no way of doing "schema on write"
• Structure-driven computation -> poor memory locality
and data transfer issues
• Algorithms are explorative and iterative
• Combinatorial explosion of datasets -> relationships
grow exponentially and scalability is limited
• Irregular structure -> challenging graph partitioning
and limited parallelism

Can we use Hadoop for Big Graphs?
• MapReduce does not directly support iterative
algorithms.
• The invariant graph topology data is reloaded and reprocessed
at each iteration, wasting I/O, network bandwidth, and
CPU.
• Materialization of intermediate results at every
MapReduce iteration harms performance.
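
The reload problem can be sketched with a toy iterative job in Python. This is a simplified model, not Hadoop code: `GraphStore`, `bfs_job` and `shortest_hops` are hypothetical names, and the store only counts how often the unchanged topology is read. Each call to `bfs_job` stands for one MapReduce job computing hop distances from a source vertex.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class GraphStore:
    """Stands in for the graph on disk and counts how often it is read."""

    def __init__(self, adjacency: Dict[str, List[str]]) -> None:
        self._adjacency = adjacency
        self.loads = 0

    def load(self) -> Dict[str, List[str]]:
        self.loads += 1
        return dict(self._adjacency)


def bfs_job(store: GraphStore, distances: Dict[str, int]) -> Dict[str, int]:
    """One job: the topology is read again although it never changes."""
    adjacency = store.load()
    emitted: Dict[str, List[int]] = defaultdict(list)
    # Map phase: keep the known distance and propose distance + 1 to neighbours.
    for vertex, dist in distances.items():
        emitted[vertex].append(dist)
        for neighbour in adjacency.get(vertex, []):
            emitted[neighbour].append(dist + 1)
    # Reduce phase: keep the smallest candidate per vertex.
    return {vertex: min(candidates) for vertex, candidates in emitted.items()}


def shortest_hops(store: GraphStore, source: str) -> Tuple[Dict[str, int], int]:
    """Driver loop: launches one job per iteration until nothing changes."""
    distances = {source: 0}
    jobs = 0
    while True:
        updated = bfs_job(store, distances)
        jobs += 1
        if updated == distances:
            return distances, jobs
        distances = updated


store = GraphStore({"a": ["b"], "b": ["c"], "c": ["d"], "d": []})
hops, jobs = shortest_hops(store, "a")
```

For this four-vertex chain the driver needs four jobs, and the same topology is loaded four times, once per job.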

An Overview of Big Graph Processing Systems
Big graphs: Processing systems

Google Pregel
• The first BSP-based (Bulk Synchronous Parallel) implementation
for graph processing
• Communication through message passing (usually sent along
the outgoing edges from each vertex) in a shared-nothing
architecture
• Advantages:
  • No locks -> message-based communication
  • No semaphores -> global synchronization
  • Iteration isolation -> massively parallelizable
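
The vertex-centric BSP model can be sketched as a single-process Python loop. This is not Pregel's API, which is not shown in the presentation: `pregel_max` is a hypothetical name, and the example propagates the maximum vertex value through the graph. Messages sent in one superstep become visible only in the next one, which plays the role of the global synchronization barrier.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def pregel_max(graph: Dict[str, List[str]],
               initial: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Propagates the maximum value along outgoing edges in supersteps."""
    values = dict(initial)
    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox: Dict[str, List[int]] = defaultdict(list)
    for vertex, targets in graph.items():
        for target in targets:
            inbox[target].append(values[vertex])
    supersteps = 1
    while inbox:
        outbox: Dict[str, List[int]] = defaultdict(list)
        # Each vertex only reads its own messages, so no locks are needed.
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for target in graph[vertex]:
                    outbox[target].append(best)
            # Otherwise the vertex stays silent until new messages arrive.
        # Barrier: messages become visible only in the next superstep.
        inbox = outbox
        supersteps += 1
    return values, supersteps


ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
result, steps = pregel_max(ring, {"a": 3, "b": 6, "c": 2})
```

The computation stops when a superstep produces no messages. In a distributed setting the inner loop over vertices would run in parallel across workers.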

GraphX
• A distributed graph engine built on top of Spark.
• GraphX extends Spark's Resilient Distributed Dataset (RDD)
abstraction to introduce the Resilient Distributed Graph
(RDG).
• The GraphX RDG leverages advances in distributed graph
representation and exploits the graph structure to minimize
communication and storage overhead.

GraphX
• One system for the entire graph pipeline. Unlike other graph
processing systems, the GraphX API enables the composition
of graphs with unstructured and tabular data.
• Tables and graphs are views of the same physical data.
• Each view has its own operators that exploit the semantics of
the view to achieve efficient execution.
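
The idea of tables and graphs as two views of the same data can be sketched in Python. This is not the GraphX API, which is Scala-based and not shown in the presentation. The vertex and edge rows and both function names are hypothetical. One function treats the edge rows as a table, the other treats the same rows as a graph.

```python
from collections import Counter
from typing import Dict, List, Tuple

Vertex = Tuple[int, str]        # (id, name)
Edge = Tuple[int, int, str]     # (source id, destination id, label)

# The same rows back both views; nothing is copied or converted.
vertices: List[Vertex] = [(1, "alice"), (2, "bob"), (3, "carol")]
edges: List[Edge] = [(1, 2, "follows"), (2, 3, "follows"), (1, 3, "likes")]


def edges_with_label(edge_table: List[Edge], label: str) -> List[Edge]:
    """Table view: a relational-style filter over the edge rows."""
    return [row for row in edge_table if row[2] == label]


def out_degrees(vertex_table: List[Vertex],
                edge_table: List[Edge]) -> Dict[str, int]:
    """Graph view: a structural query (out-degree) over the same rows."""
    names = dict(vertex_table)
    degree = Counter(src for src, _dst, _label in edge_table)
    return {names[vid]: degree.get(vid, 0) for vid in names}


follows = edges_with_label(edges, "follows")
degrees = out_degrees(vertices, edges)
```

Each view offers the operators that fit it: filtering by column for the table, degree computation for the graph.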

To be continued
The last episode of our series will
focus on machine learning
in Big Data and the challenges we
have to face in Big Data.
