Spark and Spark Streaming
Eric Fu
2018-Jun-04
Agenda
• Spark
• Resilient Distributed Datasets (RDD)
• Transformations and Actions
• Implementation
• Spark SQL
• Spark Streaming
• Discretized Streams (D-Streams)
• Stateful Transformations
• Consistency: exactly-once
• Spark Structured Streaming
• System Design
MapReduce
MapReduce reuses intermediate data by writing it to external storage
How to achieve fault-tolerance?
• Need an efficient way to keep data in memory without losing it on failure
• Copy to external storage (costly)
• Replicate to several nodes (costly)
• Just recompute it (works only when the computation is deterministic)
Resilient Distributed Datasets (RDD)
• RDD is a read-only, partitioned collection of records
• RDD can only be created through deterministic operations
Resilient Distributed Datasets (RDD)
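// `spark` stands for the SparkContext (usually named `sc`)
val lines = spark.textFile("hdfs://...")
val errors = lines
  .filter(_.startsWith("ERROR"))   // keep only error lines
  .filter(_.contains("HDFS"))      // ...that mention HDFS
  .map(_.split('\t')(3))           // take the 4th tab-separated field
  .collect()                       // action: triggers the computation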
Fault-tolerance
• Failed partition can be recomputed
• Stragglers can be moved to other nodes
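Recovery by recomputation can be sketched without Spark. The toy model below is only an illustration under assumptions: `Dataset`, `Source`, `Filtered`, `Mapped` and `Cache` are hypothetical names, not Spark classes. Each dataset remembers how to compute a partition from its parent, so a lost cached partition is rebuilt on the next access.

```scala
// Toy model of lineage-based recovery; names are illustrative, not Spark's API.
object LineageSketch {
  // A dataset is described by how to compute each partition, not by its data.
  sealed trait Dataset[A] {
    def numPartitions: Int
    def compute(partition: Int): Seq[A]
  }

  // Source data that can always be re-read (e.g. a file in stable storage).
  final case class Source[A](partitions: Vector[Seq[A]]) extends Dataset[A] {
    def numPartitions: Int = partitions.size
    def compute(partition: Int): Seq[A] = partitions(partition)
  }

  // Deterministic, partition-wise transformations of a parent dataset.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
  }

  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
  }

  // In-memory cache that may lose partitions, e.g. when a node fails.
  final class Cache[A](ds: Dataset[A]) {
    private val blocks = scala.collection.mutable.Map.empty[Int, Seq[A]]

    // A missing partition is recomputed from lineage, then cached again.
    def get(partition: Int): Seq[A] =
      blocks.getOrElseUpdate(partition, ds.compute(partition))

    def lose(partition: Int): Unit = { blocks.remove(partition); () }

    def cached: Set[Int] = blocks.keySet.toSet
  }
}
```

Only the lost partition is recomputed; the other partitions stay untouched, which is why recovery is cheap as long as every step is deterministic.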
Programming Interface
• API similar to Java 8 Streams
• Driver - Master - Worker
• Driver tracks RDD lineage
• Driver sends functions to Workers
Transformations
Actions
Example: PageRank
PageRank in Spark
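The slide's original code was not preserved in this transcript. As a rough stand-in, here is a plain-Scala sketch of the same iteration using ordinary collections instead of RDDs. The object name `PageRankSketch`, the damping factor 0.85 and the uniform initial ranks are assumptions for illustration. The comments mark where a Spark version would use `join` and `reduceByKey`.

```scala
// Plain-Scala sketch of the PageRank iteration; no Spark dependency.
object PageRankSketch {
  def pageRank(links: Map[String, Seq[String]],
               iterations: Int,
               damping: Double = 0.85): Map[String, Double] = {
    val n = links.size
    // Start with a uniform rank for every page that has an entry in `links`.
    var ranks: Map[String, Double] = links.keys.map(url => url -> 1.0 / n).toMap
    for (_ <- 1 to iterations) {
      // "join" links with ranks, then emit one contribution per outgoing link.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
        case (url, dests) => dests.map(dest => dest -> ranks(url) / dests.size)
      }
      // "reduceByKey": sum the contributions received by each page.
      val summed: Map[String, Double] = contribs.groupMapReduce(_._1)(_._2)(_ + _)
      ranks = links.keys.map { url =>
        url -> ((1 - damping) / n + damping * summed.getOrElse(url, 0.0))
      }.toMap
    }
    ranks
  }
}
```

Each iteration reads the same `links` data again, which is the reuse pattern that benefits from keeping a dataset in memory.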
Better to specify the partitioning explicitly (e.g., hash-partition links and ranks the same way so the join needs no shuffle)
Inside RDD
• Partitions
• Dependencies (parents)
• Iterator (computes a partition from its parents' data)
• Metadata (partitioning scheme and data placement)
Inside RDD (cont.)
• HDFS Files
• map
• union
• sample
• join
Two kinds of dependencies: narrow and wide
Job Execution
When the user runs an action:
1. Build lineage graph (DAG)
2. Find missing partitions
3. Schedule tasks based on locality
4. Wait until completed
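Steps 1 and 2 can be illustrated with a very small model. This is a sketch under assumptions, not Spark's scheduler: `Node` and `missing` are hypothetical names, and the model works per dataset rather than per partition. It walks the lineage graph from the target, stops at anything already cached, and returns what must be computed in dependency order.

```scala
// Toy lineage walk: which datasets must be computed for an action?
object JobSketch {
  final case class Node(name: String, parents: Seq[Node] = Nil)

  // Returns the names of uncached datasets, parents before children.
  def missing(target: Node, cached: Set[String]): Seq[String] = {
    def visit(n: Node, seen: Vector[String]): Vector[String] =
      if (cached(n.name) || seen.contains(n.name)) seen // cached or already planned
      else n.parents.foldLeft(seen)((acc, p) => visit(p, acc)) :+ n.name
    visit(target, Vector.empty)
  }
}
```

A real scheduler would additionally group the result into stages and place each task near its input data (step 3).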
Spark SQL
A Relational, Declarative API to Spark
Differences
• DataFrame API
• DataFrame = Table
• Keep track of schema
• An RDD of Row objects
• Catalyst
• SQL Optimizer
• SQL with UDF
Catalyst
row.get("x")+3
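An expression such as the one above is represented by the optimizer as a tree that rules can rewrite. The following is a minimal sketch of that idea, not Catalyst's actual classes: `Expr`, `Literal`, `Attribute`, `Add`, `transform` and `foldConstants` are simplified stand-ins, and the bottom-up traversal order is an assumption made for brevity.

```scala
// Simplified expression tree with a rule-based rewrite and an interpreter.
object CatalystSketch {
  sealed trait Expr
  final case class Literal(value: Int) extends Expr
  final case class Attribute(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Rewrite children first, then apply `rule` to the node if it matches.
  def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val rewritten = e match {
      case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
      case leaf      => leaf
    }
    if (rule.isDefinedAt(rewritten)) rule(rewritten) else rewritten
  }

  // Constant folding: replace the sum of two literals by a single literal.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
  }

  // Interpreting the tree against a row (column name -> value).
  def eval(e: Expr, row: Map[String, Int]): Int = e match {
    case Literal(v)   => v
    case Attribute(n) => row(n)
    case Add(l, r)    => eval(l, row) + eval(r, row)
  }
}
```

With these definitions, `x + (1 + 2)` folds to `x + 3`, which is the shape of the expression shown on the slide.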
Spark Streaming
From Batch to Streaming System
Existing Streaming Systems
• Continuous operator model
• Long-running, stateful operators
• Hard to handle faults or stragglers
• Hard to perform backup & recovery – replication or upstream backup
Discretized Streams (D-Streams)
• Structure a streaming computation as a series of short, stateless, deterministic batch computations on small time intervals
• Higher latency (about 1s, vs. about 100ms for continuous operator systems)
• Higher throughput (2–5x faster than Storm)
• Easy to handle faults or stragglers (parallel recovery in 1-2s)
Continuous model vs. D-Stream
Example
• Running word count
• Auto checkpoint
• Fault or straggler
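The running word count can be modelled in plain Scala to show the discretization idea. This is a sketch under assumptions: `Event`, `discretize`, `countBatch` and `runningCounts` are hypothetical names, timestamps are non-negative milliseconds, and intervals without events are simply skipped. Each interval is processed by a stateless batch job, and the running state is the previous state merged with that batch's result.

```scala
// Toy D-Stream: discretize events into intervals, count per interval, keep a running total.
object DStreamSketch {
  final case class Event(timeMs: Long, word: String)

  // Assign each event to the interval [k * batchMs, (k + 1) * batchMs).
  def discretize(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1)

  // Per-interval batch job: a stateless, deterministic word count.
  def countBatch(batch: Seq[Event]): Map[String, Int] =
    batch.groupMapReduce(_.word)(_ => 1)(_ + _)

  // State after each interval = previous state merged with that interval's counts.
  def runningCounts(events: Seq[Event], batchMs: Long): Seq[Map[String, Int]] =
    discretize(events, batchMs)
      .map { case (_, batch) => countBatch(batch) }
      .scanLeft(Map.empty[String, Int]) { (state, counts) =>
        counts.foldLeft(state) { case (acc, (w, c)) =>
          acc.updated(w, acc.getOrElse(w, 0) + c)
        }
      }
      .tail // drop the initial empty state
}
```

Because every state is a deterministic function of the previous state and one batch, a lost state can be rebuilt from an earlier checkpoint plus the batches after it.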
Programming Interface
• Input
• Transformation
• Stateless
• Or with state across intervals
• Output operation
Stateful transformations
• Windowing
• Groups the records from a sliding window into one RDD
• Incremental aggregation
• Aggregate over a sliding window
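The difference between recomputing a window and aggregating it incrementally can be shown on per-interval totals. A minimal sketch, assuming a sum aggregate and a window measured in whole intervals (`naive` and `incremental` are illustrative names): the incremental version adds the interval entering the window and subtracts the one leaving it, which is the role of the inverse function in Spark Streaming's `reduceByKeyAndWindow`.

```scala
// Sliding-window sum over per-interval totals, two ways.
object WindowSketch {
  // Naive: re-sum the last `window` intervals at every step.
  def naive(perInterval: Seq[Int], window: Int): Seq[Int] =
    perInterval.indices.map { i =>
      perInterval.slice(math.max(0, i - window + 1), i + 1).sum
    }

  // Incremental: previous sum + entering interval - leaving interval.
  def incremental(perInterval: Seq[Int], window: Int): Seq[Int] =
    perInterval.indices.scanLeft(0) { (sum, i) =>
      val leaving = if (i >= window) perInterval(i - window) else 0
      sum + perInterval(i) - leaving
    }.tail
}
```

The incremental form does constant work per step regardless of window length, but it only applies to aggregates that have an inverse (sum and count do, max does not).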
Consistency Semantics
• Hard to provide consistency of state across nodes in a streaming system
• D-Streams provide consistent "exactly-once" processing across the cluster
Exactly-once (1/3)
Exactly-once (2/3)
Exactly-once (3/3)
State Management
• Asynchronous RDD Checkpointing
• Lineage cutoff
Spark Structured Streaming
Incremental SQL Processing
Differences
• To provide exactly-once
• Input sources must be replayable
• Output sinks must support idempotent writes
• SQL and DataFrame API
• User can mark a column as denoting event time
• An additional continuous processing mode
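Why replayable sources and idempotent sinks together give exactly-once results can be seen in a small model. This is a sketch with hypothetical names (`ReplayableSource`, `IdempotentSink`, `runBatch`), not the Structured Streaming API: a batch is identified by an id, the source can re-read it on demand, and the sink overwrites by batch id instead of appending.

```scala
// Toy model: retrying a batch does not change the final result.
object ExactlyOnceSketch {
  // Replayable source: any batch can be re-read by its id (like offsets in a log).
  final class ReplayableSource(batches: Vector[Seq[Int]]) {
    def read(batchId: Int): Seq[Int] = batches(batchId)
  }

  // Idempotent sink: writes are keyed by batch id, so a retry overwrites.
  final class IdempotentSink {
    private val committed = scala.collection.mutable.Map.empty[Int, Int]
    def write(batchId: Int, value: Int): Unit = { committed(batchId) = value }
    def total: Int = committed.values.sum
  }

  // One batch: read, compute a deterministic result, write it under the batch id.
  def runBatch(source: ReplayableSource, sink: IdempotentSink, batchId: Int): Unit =
    sink.write(batchId, source.read(batchId).sum)
}
```

If a failure makes the engine run a batch twice, the second write replaces the first, so the output looks as if each record had been processed once.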
"Incrementalize"
Window
Tumbling Window
Hopping Window
Sliding Window
Session Window
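The window types differ mainly in how a timestamp is assigned to windows. The sketch below covers three of them under assumptions: timestamps are non-negative, windows are half-open intervals `[start, end)`, windows starting before time 0 are ignored, and the function names are illustrative. The sliding window from the slide is left out because its exact definition is not given in this transcript.

```scala
// Window assignment for tumbling, hopping, and session windows.
object WindowAssignSketch {
  // Tumbling: fixed-size, non-overlapping; each timestamp is in exactly one window.
  def tumbling(t: Long, size: Long): (Long, Long) = {
    val start = t - t % size
    (start, start + size)
  }

  // Hopping: fixed-size windows starting every `hop`; a timestamp may be in several.
  def hopping(t: Long, size: Long, hop: Long): Seq[(Long, Long)] = {
    val lastStart = t - t % hop
    Iterator.iterate(lastStart)(_ - hop)
      .takeWhile(start => start > t - size && start >= 0)
      .map(start => (start, start + size))
      .toSeq
      .reverse
  }

  // Session: a new window starts when the gap since the previous event exceeds `gap`.
  def sessions(times: Seq[Long], gap: Long): Seq[Seq[Long]] =
    times.sorted.foldLeft(Vector.empty[Vector[Long]]) { (acc, t) =>
      acc.lastOption match {
        case Some(cur) if t - cur.last <= gap => acc.init :+ (cur :+ t)
        case _                                => acc :+ Vector(t)
      }
    }
}
```

A tumbling window is the special case of a hopping window where the hop equals the size.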
Watermarks
• It's impossible to allow arbitrarily late data
• Need to set a watermark for event time columns
• Watermarks affect when stateful operators can forget old state
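How a watermark bounds state can be sketched with a small windowed counter. This is an illustration under assumptions, not the engine's implementation: the watermark is taken as the largest event time seen minus an allowed lateness, windows are tumbling, and `WindowedCounter` and its members are hypothetical names.

```scala
// Toy watermark: finalize and forget windows that can no longer receive data.
object WatermarkSketch {
  final case class Event(eventTime: Long, key: String)

  final class WindowedCounter(windowSize: Long, allowedLateness: Long) {
    private var maxEventTime = Long.MinValue
    private var dropped = 0
    private val open = scala.collection.mutable.Map.empty[(Long, String), Int]

    // The watermark trails the largest event time seen so far.
    def watermark: Long =
      if (maxEventTime == Long.MinValue) Long.MinValue
      else maxEventTime - allowedLateness

    // Processes one event and returns the windows finalized by it.
    def process(e: Event): Map[(Long, String), Int] = {
      val start = e.eventTime - e.eventTime % windowSize
      if (start + windowSize <= watermark) {
        dropped += 1 // too late: its window was already finalized
      } else {
        val k = (start, e.key)
        open(k) = open.getOrElse(k, 0) + 1
      }
      maxEventTime = math.max(maxEventTime, e.eventTime)
      // Windows ending at or before the watermark are emitted and their state dropped.
      val done = open.filter { case ((s, _), _) => s + windowSize <= watermark }.toMap
      done.keys.foreach(k => open.remove(k))
      done
    }

    def openWindows: Map[(Long, String), Int] = open.toMap
    def droppedCount: Int = dropped
  }
}
```

Late events that arrive before the watermark passes their window are still counted; anything later is dropped, which is what keeps the amount of retained state bounded.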
System Design
Architecture
Master
• Tracks the D-Stream lineage graph
• Schedules tasks to compute new RDD partitions
Worker
• Receives and stores RDD partitions (input or computed)
• Executes tasks
Some Details
• Pipelines operators that can be grouped into a single task
• Submits tasks for the next timestep before the current one has finished
• Checkpoints RDDs asynchronously, then forgets their lineage
• Block store manages RDD partitions in an LRU fashion
• Master recovery
Fault and Straggler
• Parallel Recovery
• Parallel across partitions of the RDDs in each timestep
• Parallel across timesteps for independent operations
• Detect stragglers (1.4× slower)
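Straggler detection with a fixed slowdown threshold might look like the sketch below. It assumes the 1.4× factor is measured against the median task runtime in the same stage, which the slide does not state explicitly; `median` and `stragglers` are illustrative names.

```scala
// Flag tasks whose runtime exceeds `factor` times the median runtime.
object StragglerSketch {
  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    if (s.size % 2 == 1) s(s.size / 2)
    else (s(s.size / 2 - 1) + s(s.size / 2)) / 2
  }

  // Returns the ids of tasks considered stragglers; empty input yields an empty set.
  def stragglers(runtimes: Map[String, Double], factor: Double = 1.4): Set[String] =
    if (runtimes.isEmpty) Set.empty
    else {
      val m = median(runtimes.values.toSeq)
      runtimes.collect { case (id, t) if t > factor * m => id }.toSet
    }
}
```

A flagged task can then be run speculatively on another node, which is safe because the tasks are deterministic.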
Get your hands dirty!
Thanks!
Q&A

Editor's Notes

  • #13: We can also write a custom Partitioner class to group pages that link to each other together (e.g., partition the URLs by domain name).
  • #14: and metadata about its partitioning scheme and data placement
  • #38: The system tries to place both state and tasks to maximize data locality, but this underlying flexibility makes speculation and parallel recovery possible
  • #39: dropping data to disk if there is not enough memory