Evolution of
Spark
Framework for
Simplifying Big
Data Analytics
Submitted By:
Rishabh Verma
Information Technology
1404313027
Submitted To:
Prof. A.K.Solanki
Head of Department
Content
 Types of data
 What is big data?
 What is Big Data Analytics?
 Facts on Big Data
 Characteristics of Big Data
 Traditional Approach: Hadoop
 Hadoop Architecture: HDFS and MapReduce
 What is Spark?
 Spark Ecosystem
 Spark SQL
 Spark Streaming
 MLlib
 GraphX
 Comparison between Hadoop MapReduce and Apache Spark
 Conclusion
Types of Data
 Relational Data (Tables/Transaction/Legacy Data)
 Text Data (Web)
 Semi-structured Data (XML)
 Graph Data
 Social Network, Semantic Web (RDF)
 Streaming Data
 You can only scan the data once
What is Big Data?
Similar to “smaller Data” but Bigger in Size.
What is Big Data Analytics?
 Examining large data sets to find hidden patterns.
 Uncovering unknown correlations, market trends, customer
preferences, and other useful business information.
Facts on Big Data
 Over 90% of the world's data was created in the past two years alone.
 Every minute we send 204 million emails, generate 1.8 million
Facebook likes, send 28 thousand tweets, and upload
200 thousand photos to Facebook.
 Google receives 3.5 billion search queries every day.
Traditional Approach: Hadoop
 An open-source framework for running
applications on large clusters.
 Used for distributed storage
and processing of very large
datasets using the MapReduce model.
 Hadoop splits files into large blocks and distributes them
across the nodes of a cluster.
Hadoop Architecture
HDFS
 Contains two types of nodes: a Namenode (master) and a number
of Datanodes (workers).
 The Namenode manages the filesystem tree and the metadata of
all the files.
 Datanodes are the workhorses: they store and retrieve data on
command of the Namenode and continuously send heartbeat signals
to it.
 Data is replicated to ensure fault tolerance; the replication
factor is usually 3.
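The replication idea can be sketched in a few lines of Python. This is a toy illustration only, assuming a simple random placement (`place_replicas` and the node names are invented for this sketch; real HDFS placement is rack-aware):

```python
import random

def place_replicas(block_id, datanodes, replication_factor=3):
    """Toy placement: pick `replication_factor` distinct datanodes for a block.
    Real HDFS also considers rack topology; this sketch does not."""
    if replication_factor > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    return random.sample(datanodes, replication_factor)

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
replicas = place_replicas("blk_001", nodes, replication_factor=3)
# three distinct datanodes now hold a copy of the block
```

Losing any one (or even two) of the chosen nodes still leaves a live copy, which is the point of a replication factor of 3.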
MapReduce
 A “Map” job sends a query for processing to the various
nodes in a Hadoop cluster, and a “Reduce” job collects all
the results into a single output.
 Map:
(in_key, in_value) => list(out_key, intermediate_value)
 Reduce:
(out_key, list(intermediate_value)) => list(out_value)
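The two signatures above can be exercised with the classic word-count example. A minimal single-process sketch (the function names `map_fn`, `reduce_fn`, and `run_mapreduce` are invented here; a real Hadoop job distributes each phase across nodes):

```python
from itertools import groupby

def map_fn(key, value):
    # (in_key, in_value) -> list of (out_key, intermediate_value)
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # (out_key, [intermediate_value]) -> (out_key, out_value)
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))
    # Shuffle/sort phase: group intermediate pairs by key
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: fold each key's values into one result
    return dict(
        reduce_fn(k, [v for _, v in group])
        for k, group in groupby(intermediate, key=lambda kv: kv[0])
    )

counts = run_mapreduce([("doc1", "spark hadoop spark")], map_fn, reduce_fn)
# counts == {"hadoop": 1, "spark": 2}
```

The sort-then-group step in the middle is exactly the shuffle the framework performs between the map and reduce tasks.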
MapReduce working
 MapReduce splits the input data-set so that it can be
processed by map tasks in parallel.
 The framework sorts the map outputs, which become the input
to the reduce tasks.
 The input and output of the job are stored in the filesystem.
What is Spark?
 Apache Spark is a fast, in-memory data processing engine.
 Integrates with Hadoop and its ecosystem and can read
existing data.
 Provides high-level APIs in
1) Java
2) Scala
3) Python
 10-100x faster than MapReduce.
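Much of that speed comes from lazy, chained transformations evaluated in memory rather than materialized to disk between steps. A pure-Python sketch of the idea (the `ToyRDD` class is invented for illustration and is not Spark's API):

```python
class ToyRDD:
    """Minimal sketch of an RDD-style lazy pipeline (illustrative only)."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    def map(self, f):
        # Transformations just record work; nothing runs yet
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._ops + [("filter", p)])

    def collect(self):
        # Only an action (collect) triggers execution, in memory
        out = list(self._data)
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

result = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
# result == [0, 4, 16]
```

In real Spark the same chaining lets the engine keep intermediate results in cluster memory instead of writing them to HDFS after every stage, which is where the 10-100x claim comes from.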
SPARK ECOSYSTEM
 Spark SQL
-For SQL and structured data processing.
 MLlib
-Machine learning algorithms.
 GraphX
-Graph processing.
 Spark Streaming
-Stream processing of live data streams.
Spark SQL
Integrated Queries
-Spark SQL is a component on top of Spark Core for structured
data processing.
Hive Compatibility
-Spark SQL reuses the Hive frontend and metastore, giving full
compatibility with existing Hive data, queries, and UDFs.
Uniform Data Access
-DataFrames and SQL provide a common way to access a variety of
data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC.
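"Uniform data access" means one entry point regardless of the underlying format. A hedged pure-Python sketch that mimics the idea (not Spark's actual DataFrameReader API; `read_rows` and the newline-delimited JSON convention are assumptions of this sketch):

```python
import csv
import io
import json

def read_rows(fmt, text):
    """One call site, many formats: return a list of row dicts.
    Spark SQL exposes the same idea via spark.read.format(...)."""
    if fmt == "json":
        # Assume newline-delimited JSON records
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {fmt}")

rows_json = read_rows("json", '{"name": "a", "age": 1}\n{"name": "b", "age": 2}')
rows_csv = read_rows("csv", "name,age\na,1\nb,2")
# Downstream code can treat both the same way
```

Queries written against the resulting rows do not care which source produced them, which is the property the slide is describing.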
SPARK STREAMING
 Ingest streaming data from sources (e.g. live logs, system telemetry,
IoT device data) into a data ingestion system such as Apache Kafka or
Amazon Kinesis.
 Process the data in parallel on a cluster; this is what stream
processing engines are designed to do.
 Push the results out to downstream systems such as HBase, Cassandra,
or Kafka.
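The ingest-process-sink pipeline can be sketched with Spark Streaming's micro-batch model in plain Python. Everything here is a simplification: batches are count-based rather than time-based, and `process`, `sink`, and the event shape are invented for the sketch:

```python
from collections import Counter

def micro_batches(events, batch_size):
    """Chop an already-ingested event stream into fixed-size micro-batches,
    the discretized model Spark Streaming uses (simplified)."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def process(batch):
    # Example processing stage: count event types within one batch
    return Counter(e["type"] for e in batch)

def sink(results, store):
    # Stand-in for writing results to HBase/Cassandra/Kafka
    for r in results:
        store.append(r)

store = []
events = [{"type": "click"}, {"type": "view"},
          {"type": "click"}, {"type": "view"}]
sink((process(b) for b in micro_batches(events, 2)), store)
# store holds one per-type count per micro-batch
```

Real Spark Streaming does the same three things, but each batch is an RDD processed in parallel across the cluster.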
Spark Streaming
 Easy, reliable, fast processing of live data streams.
 Fast failure and straggler recovery.
 Dynamic load balancing.
 Applications in cyber security, online advertising and
campaigns, intrusion detection systems (IDS), and alarms.
MLlib
 MLlib is a machine learning library that can be called
from the Scala, Python, and Java programming languages.
 Performs multiple iterations to improve accuracy.
 Nine times as fast as the disk-based implementation used
by Apache Mahout.
 Some of the algorithms provided:
 Clustering: K-means
 Decomposition: Principal Component Analysis (PCA)
 Regression: Linear Regression
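K-means is a good example of the "multiple iterations" point: each pass re-assigns points and recomputes centers. A minimal one-dimensional sketch (MLlib's implementation is distributed and uses smarter initialization; the naive first-k init here is an assumption of the sketch):

```python
def kmeans_1d(points, k, iterations=10):
    """Tiny 1-D k-means: alternate assignment and centroid update.
    Illustrates the iteration MLlib performs at cluster scale."""
    centroids = points[:k]  # naive init: first k points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

centers = kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)
# centers converge near [1.0, 10.0]
```

Because each iteration re-reads the same points, keeping the data cached in memory (as Spark does) is what makes iterative algorithms like this so much faster than a disk-based MapReduce loop.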
GraphX
 Graph processing library for Apache Spark.
 GraphX unifies ETL (extract, transform, load) and iterative
graph computation within a single system.
 RDGs (Resilient Distributed Graphs) associate records with the
vertices and edges of a graph, and let users exploit that
structure in fewer than 20 lines of code.
 GraphFrames, an advancement over GraphX, provides a uniform
API for all three languages.
Advantages of Spark over Hadoop
Apache Spark vs. Hadoop MapReduce
 Speed: Spark is 10-100x faster than Hadoop thanks to in-memory
computation; MapReduce is slower, relying on disk-based computation.
 Latency: Spark ensures lower-latency computations by caching
partial results in the memory of its distributed workers; MapReduce
is completely disk-oriented.
 Workloads: Spark handles real-time data and performs streaming,
batch processing, and machine learning in the same cluster;
MapReduce is mainly focused on batch processing and generating
reports over historical queries.
CONCLUSION
To conclude, the choice between Hadoop MapReduce and Apache
Spark depends on the use case; neither can be declared the
better option in isolation.
References
[1]. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott
Shenker, Ion Stoica, “Spark: Cluster Computing with Working Sets”,
University of California, Berkeley, 2010.
[2]. Yanfeng Zhang, Qixin Gao, Lixin Gao, Cuirong Wang, “PrIter: A
Distributed Framework for Prioritizing Iterative Computations”,
IEEE Transactions on Parallel and Distributed Systems, vol. 24,
no. 9, pp. 1884-1893, Sept. 2013.
[3]. Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled
Elmeleegy, Scott Shenker and Ion Stoica, “Delay Scheduling: A Simple
Technique for Achieving Locality and Fairness in Cluster Scheduling”,
Proceedings of the 5th European Conference on Computer Systems
(EuroSys), ACM, New York, 2010.
