Spark Streaming &
Spark SQL
Yousun Jeong
jerryjung@sk.com
History - Spark
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become one
of the largest OSS communities in big data, with over
200 contributors in 50+ organizations
“Organizations that are looking at big data challenges – including collection, ETL,
storage, exploration and analytics – should consider Spark for its in-memory
performance and the breadth of its model. It supports advanced analytics solutions
on Hadoop clusters, including the iterative model required for machine learning and
graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
History - Spark
Some key points about Spark:
• handles batch, interactive, and real-time within a single
framework
• native integration with Java, Python, Scala programming
at a higher level of abstraction
• multi-step Directed Acyclic Graphs (DAGs): a job can have many
stages, compared to just the single Map and Reduce steps of
Hadoop MapReduce
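As an illustrative sketch of that last point (the file paths and the threshold are assumptions, not from the slides), one Spark job can chain several transformations into a single DAG; the shuffle introduced by reduceByKey marks a stage boundary, while the surrounding maps and filters are pipelined within their stages:

// assumes an existing SparkContext named sc, as in the later examples
val counts = sc.textFile("hdfs:///logs/access.log")     // assumed input path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                    // shuffle: stage boundary
  .filter { case (_, n) => n > 100 }                     // pipelined into the post-shuffle stage
counts.saveAsTextFile("hdfs:///logs/frequent_words")     // assumed output path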
Data Sharing in MR
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
Spark
Benchmark Test
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
RDD
Resilient Distributed Datasets (RDD) are the primary
abstraction in Spark – a fault-tolerant collection of
elements that can be operated on in parallel
There are currently two types:
• parallelized collections – take an existing Scala collection
and run functions on it in parallel
• Hadoop datasets – run functions on each record of a file
in Hadoop distributed file system or any other storage
system supported by Hadoop
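A minimal sketch of the two RDD types above (the collection contents and the HDFS path are assumptions, and an existing SparkContext sc is assumed):

// parallelized collection: distribute an existing Scala collection
val parallelized = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = parallelized.map(_ * 2).collect()

// Hadoop dataset: run functions on each record of a file in HDFS
val hadoopData = sc.textFile("hdfs:///data/input.txt")
val lineCount = hadoopData.count()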
Fault Tolerance
• An RDD is an immutable, deterministically re-
computable, distributed dataset.
• An RDD tracks its lineage info to rebuild lost data
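A small sketch of inspecting that lineage (the path and transformations are illustrative assumptions); toDebugString prints the chain of parent RDDs Spark would use to recompute lost partitions:

val errors = sc.textFile("hdfs:///data/events.log")   // assumed input path
  .filter(_.contains("ERROR"))
  .map(_.split("\t")(1))
println(errors.toDebugString)   // prints the lineage back to the underlying HadoopRDD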
Benefit of Spark
Spark helps us gain processing speed and implement various
big data applications easily and quickly
▪ Support for Event Stream Processing
▪ Fast Data Queries in Real Time
▪ Improved Programmer Productivity
▪ Fast Batch Processing of Large Data Sets
Why I use Spark …
Big Data
Big Data is not just “big”
The 3Vs of Big Data: Volume, Velocity, and Variety
Big Data Processing
1. Batch Processing
• processing data en masse
• big & complex
• higher latencies, e.g. MapReduce
2. Stream Processing
• one-at-a-time processing
• computations are relatively simple and generally independent
• sub-second latency, e.g. Storm
3. Micro-Batching
• small batch size (batch+streaming)
Spark Streaming Integration
Spark Streaming In Action
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
// (sparkConf, serverIP, and serverPort are assumed to be defined earlier)
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
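To try this locally (a suggestion not found in the original slides), a common approach is to point serverIP and serverPort at a netcat listener started with nc -lk 9999 on the same machine; each line typed into netcat then shows up in the word counts printed for the next 10-second batch.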
Spark UI
Spark SQL
Spark SQL In Action
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingDataTable = sql("""
SELECT e.action, u.age, u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib
val trainingData = trainingDataTable.map { row =>
val features = Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD().run(trainingData)
Spark SQL In Action
val allCandidates = sql("""
SELECT userId,
age,
latitude,
longitude
FROM Users
WHERE subscribed = FALSE""")
// Results of ML algorithms can be used as tables
// in subsequent SQL statements.
case class Score(userId: Int, score: Double)
val scores = allCandidates.map { row =>
val features = Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
scores.registerAsTable("Scores")
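As a hedged follow-on sketch (the score threshold and column list are assumptions), the registered Scores table can then be queried like any other table:

val highScores = sql("""
  SELECT userId, score
  FROM Scores
  WHERE score > 0.9
  ORDER BY score DESC""")
highScores.collect().foreach(println)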
MR vs RDD - Compute an Average
RDD vs DF - Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people").groupBy("name").agg("name", avg("age")).collect()
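For comparison with the deck's other Scala examples, a hedged Scala version of the same DataFrame average (Spark 1.3+ API; the table and column names follow the Python line above):

import org.apache.spark.sql.functions.avg

sqlContext.table("people")
  .groupBy("name")
  .agg(avg("age"))
  .collect()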
Spark 2.0: Structured Streaming
• Structured Streaming
• High-level streaming API built on Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
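To make the bullet points above concrete, here is a minimal Structured Streaming sketch using the API as it shipped in Spark 2.0 (readStream/writeStream, rather than the pre-release .stream() calls shown in the Page View Count example on the next slide). The socket source, host/port, and console sink are illustrative assumptions, not part of the original slides.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// treat lines arriving on a socket as an unbounded table
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// the same DataFrame operations used in batch queries run on the stream
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// continuously write the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()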
Spark 2.0 Example: Page View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: “every 5 sec”
Output mode: “update-in-place”, into MySQL sink
logs = ctx.read.format("json").stream("s3://logs")

logs.groupBy(logs.user_id)
    .agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql://...")
Spark 2.0 Use Case: Fraud Detection
Spark 2.0 Performance
Q & A
Thank You!