Spark and Spark Streaming

Eric Fu
2018-Jun-04

Agenda
• Spark
  • Resilient Distributed Datasets (RDD)
  • Transformations and Actions
  • Implementation
• Spark SQL
• Spark Streaming
  • Discretized Streams (D-Streams)
  • Stateful Transformations
  • Consistency: exactly-once
• Spark Structured Streaming
• System Design

MapReduce
MapReduce reuses intermediate data by writing it to external storage

How to achieve fault-tolerance?
• Goal: an efficient way to keep data in memory and still survive failures
• Copy to external storage (costly)
• Replicate to several nodes (costly)
• Just recompute it (works only when the computation is deterministic)

Resilient Distributed Datasets (RDD)
• An RDD is a read-only, partitioned collection of records
• An RDD can only be created through deterministic operations on stable storage or on other RDDs

Resilient Distributed Datasets (RDD)
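// `spark` is assumed to be a SparkContext
val lines = spark.textFile("hdfs://...")
// Transformations are lazy; only collect() triggers a job
val errors = lines
  .filter(_.startsWith("ERROR"))
  .filter(_.contains("HDFS"))
  .map(_.split('\t')(3))   // keep the 4th tab-separated field
  .collect()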

Fault-tolerance
• A failed partition can be recomputed from its lineage
• Straggler tasks can be rerun on other nodes
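
The recompute idea can be sketched in plain Scala. This is a toy model, not Spark's classes: `ToyRdd`, `source`, `derived`, and `cache` are hypothetical names, and a "failure" is simulated by dropping a cached partition.

```scala
// Toy model of lineage-based recovery; not Spark's actual classes.
object LineageSketch {
  // A dataset is described by how to compute each partition, not by its data.
  final case class ToyRdd[A](numPartitions: Int, compute: Int => Vector[A]) {
    def map[B](f: A => B): ToyRdd[B] =
      ToyRdd[B](numPartitions, p => compute(p).map(f))
    def filter(pred: A => Boolean): ToyRdd[A] =
      ToyRdd[A](numPartitions, p => compute(p).filter(pred))
  }

  // Stable input: partition p always yields the same records.
  val source: ToyRdd[Int] =
    ToyRdd[Int](3, p => Vector(p * 10, p * 10 + 1, p * 10 + 2))

  // The lineage of `derived` is "source, then filter, then map".
  val derived: ToyRdd[Int] = source.filter(_ % 2 == 0).map(_ * 2)

  // In-memory cache of computed partitions; an entry may be lost on failure.
  val cache = scala.collection.mutable.Map.empty[Int, Vector[Int]]

  // Serve from the cache, or rebuild the partition from lineage if it is missing.
  def getPartition(p: Int): Vector[Int] =
    cache.getOrElseUpdate(p, derived.compute(p))
}
```

Because every step is deterministic, removing partition 1 from `cache` and calling `getPartition(1)` again yields the same records as before the loss.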

Programming Interface
• API similar to Java 8 Streams
• Driver - Master - Worker
• The driver tracks RDD lineage
• The driver sends functions (closures) to the workers

PageRank in Spark
Better to specify the partitioning explicitly, so the repeated join of links and ranks does not reshuffle links
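
The slide's code is not preserved in this transcript. As a stand-in, here is a minimal sketch of the iteration in plain Scala collections rather than the Spark API. The damping constants 0.15 and 0.85, the object name, and the assumption that every page appears as a key of `links` are choices made for this sketch.

```scala
// PageRank on local collections; a sketch of the algorithm, not Spark code.
object PageRankSketch {
  // links: page -> outgoing neighbours. Assumes every page appears as a key.
  def pageRank(links: Map[String, Seq[String]], iterations: Int): Map[String, Double] = {
    var ranks: Map[String, Double] = links.keys.map(p => (p, 1.0)).toMap
    for (_ <- 1 to iterations) {
      // Each page sends rank / outDegree to every neighbour
      // (the "join links with ranks, then flatMap" step).
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (page, outs) =>
        outs.map(dest => (dest, ranks(page) / outs.size))
      }
      // Sum contributions per page (the "reduceByKey" step).
      val summed: Map[String, Double] =
        contribs.groupBy(_._1).map { case (page, cs) => (page, cs.map(_._2).sum) }
      // Apply damping to obtain the new ranks.
      ranks = links.keys.map(p => (p, 0.15 + 0.85 * summed.getOrElse(p, 0.0))).toMap
    }
    ranks
  }
}
```

In Spark itself, `links` is reused in every iteration, which is why hash-partitioning and persisting it once pays off: only the much smaller `ranks` dataset then moves across the network.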

Inside RDD
• Partitions
• Dependencies (parents)
• Iterator (computes a partition from its parents' data)
• Metadata (partitioning scheme, preferred locations)

Inside RDD (cont.)
• HDFS Files
• map
• union
• sample
• join
Two kinds of dependencies: narrow (e.g. map, union) and wide (e.g. join on inputs that are not co-partitioned)
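
Why the two kinds matter for recovery can be sketched as follows. The types are hypothetical and much simpler than Spark's own dependency classes; they only model which parent partitions a lost child partition needs.

```scala
// Toy model of narrow vs wide dependencies; names are illustrative only.
object DependencySketch {
  sealed trait Dependency
  // Narrow: a child partition depends on a small, fixed set of parent partitions.
  final case class Narrow(parentPartitions: Int => Seq[Int]) extends Dependency
  // Wide: a child partition may depend on every parent partition (a shuffle).
  case object Wide extends Dependency

  // Which parent partitions are needed to rebuild child partition `p`?
  def parentsToRecompute(dep: Dependency, p: Int, numParentPartitions: Int): Seq[Int] =
    dep match {
      case Narrow(parents) => parents(p)
      case Wide            => 0 until numParentPartitions
    }

  // One-to-one dependency, as in map or filter.
  val mapDep: Dependency = Narrow(p => Seq(p))
}
```

With a narrow dependency a lost partition needs only one parent partition, so recovery is cheap and can be pipelined on a single node. With a wide dependency it may need all of them.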

Job Execution
When the user runs an action:
1. Build lineage graph (DAG)
2. Find missing partitions
3. Schedule tasks based on locality
4. Wait until completed
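
Steps 2 and 3 can be sketched in a few lines. This is a deliberately simplified model with hypothetical names (`Task`, `missing`, `place`). Spark's real scheduler also splits the DAG into stages and uses delay scheduling, which is omitted here.

```scala
// Toy model of "find missing partitions" and "schedule by locality".
object SchedulerSketch {
  final case class Task(partition: Int, preferredNode: Option[String])

  // Step 2: only partitions that are not already cached need a task.
  def missing(all: Range, cached: Set[Int]): Seq[Int] =
    all.filterNot(cached)

  // Step 3: run on the preferred node if it is free, otherwise on any free node.
  def place(task: Task, freeNodes: Set[String]): Option[String] =
    task.preferredNode.filter(freeNodes).orElse(freeNodes.headOption)
}
```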

Spark SQL
A Relational, Declarative API to Spark

Differences
• DataFrame API
  • DataFrame = Table
  • Keeps track of the schema
  • An RDD of Row objects
• Catalyst
  • SQL Optimizer
• SQL with user-defined functions (UDFs)

Spark Streaming
From Batch to Streaming System

Existing Streaming Systems
• Continuous operator model
  • Long-running, stateful operators
• Hard to handle faults or stragglers
• Recovery is costly: replication or upstream backup

Discretized Streams (D-Streams)
• Structure a streaming computation as a series of short, stateless,
deterministic batch computations on small time intervals
• Higher latency (about 1 s, versus about 100 ms for continuous operators)
• Higher throughput (2–5x faster than Storm)
• Easy to handle faults or stragglers (parallel recovery in 1–2 s)
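
The discretization itself can be sketched in plain Scala. `Event`, `discretize`, and `wordCount` are hypothetical names, and the batch interval is an assumed parameter. The point is that each interval becomes an ordinary deterministic batch job.

```scala
// Toy model of micro-batching; not the Spark Streaming API.
object DStreamSketch {
  final case class Event(timeMs: Long, word: String)

  // Assign each record to the batch interval it arrived in.
  def discretize(events: Seq[Event], intervalMs: Long): Map[Long, Seq[Event]] =
    events.groupBy(e => e.timeMs / intervalMs)

  // A stateless, deterministic batch job: word count for one interval.
  def wordCount(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.word).map { case (w, es) => (w, es.size) }
}
```

Since `wordCount` depends only on its input batch, a lost result for any interval can be recomputed independently of the others.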

Example
• Running word count
• Auto checkpoint
• Fault or straggler

Programming Interface
• Input
• Transformation
  • Stateless
  • Or with state across intervals
• Output operation

Stateful transformations
• Windowing
  • Groups the records from a sliding window into one RDD
• Incremental aggregation
  • Aggregate over a sliding window
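
A rough sketch of why incremental aggregation helps, using per-interval counts in a plain `Vector`. The function names are made up for this sketch. Spark Streaming exposes the same idea through the variant of `reduceByKeyAndWindow` that takes an inverse reduce function.

```scala
// Sliding-window sum over per-interval counts; a sketch, not the Spark API.
object WindowSketch {
  // Naive: re-sum the last `w` intervals for every window position.
  def naive(counts: Vector[Int], w: Int): Vector[Int] =
    counts.indices.map(i => counts.slice(math.max(0, i - w + 1), i + 1).sum).toVector

  // Incremental: previous window + newest interval - interval that slid out.
  def incremental(counts: Vector[Int], w: Int): Vector[Int] =
    counts.indices.scanLeft(0) { (prev, i) =>
      val leaving = if (i - w >= 0) counts(i - w) else 0
      prev + counts(i) - leaving
    }.tail.toVector
}
```

Both functions return the same sums, but the incremental version does constant work per step regardless of the window length. It only applies when the aggregation has an inverse, as addition does.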

Consistency Semantics
• Hard to keep state consistent across nodes in a streaming system
• D-Streams provide consistent "exactly-once" processing across the
cluster

State Management
• Asynchronous RDD Checkpointing
• Lineage cutoff

Spark Structured Streaming
Incremental SQL Processing

Differences
• Exactly-once processing requires:
  • Input sources must be replayable
  • Output sinks must support idempotent writes
• SQL and DataFrame API
• User can mark a column as denoting event time
• An additional continuous processing mode
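
How replayable input and idempotent output combine into exactly-once results can be sketched with a toy source and sink. All names and data here are hypothetical. The batch id stands in for whatever key the real sink uses to deduplicate.

```scala
// Toy model: replayable source + idempotent sink = exactly-once output.
object ExactlyOnceSketch {
  // Replayable source: any offset range can be read again after a failure.
  val source: Vector[Int] = Vector(5, 7, 11, 13)

  // Idempotent sink: writing the same batch id twice leaves a single result.
  val sink = scala.collection.mutable.Map.empty[Long, Int]

  // Process offsets [from, until) and write the result under `batchId`.
  def runBatch(batchId: Long, from: Int, until: Int): Unit =
    sink.update(batchId, source.slice(from, until).sum)
}
```

If a batch is retried after a failure, the replay reads the same offsets and the write overwrites the earlier result instead of adding to it.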

Window
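• Tumbling window: fixed size, no overlap
• Hopping window: fixed size, advances by a hop shorter than the size, so windows overlap
• Sliding window: fixed size, evaluated whenever an event enters or leaves the window
• Session window: variable size, closed after a gap of inactivity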
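
Window assignment for the two fixed-grid cases can be sketched as below. This is illustrative only: it assumes non-negative timestamps and `0 < hop <= size`, and it drops windows that would start before time 0.

```scala
// Assigning an event timestamp to tumbling and hopping windows (as [start, end)).
object WindowAssignSketch {
  // Tumbling: each event belongs to exactly one window.
  def tumbling(t: Long, size: Long): (Long, Long) = {
    val start = t - t % size
    (start, start + size)
  }

  // Hopping: windows start every `hop`, so an event may fall into several.
  def hopping(t: Long, size: Long, hop: Long): Seq[(Long, Long)] = {
    val lastStart = t - t % hop // latest window start that is <= t
    Iterator.iterate(lastStart)(_ - hop)
      .takeWhile(s => s > t - size && s >= 0)
      .map(s => (s, s + size))
      .toSeq
      .reverse
  }
}
```

A hopping window whose hop equals its size degenerates to a tumbling window.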

Watermarks
• Allowing arbitrarily late data is impractical: state would grow without bound
• Need to set a watermark on the event-time column
• Watermarks determine when stateful operators can forget old state
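
A minimal sketch of the mechanism, assuming the watermark is "largest event time seen so far minus an allowed lateness" and is updated per event. Structured Streaming updates it per micro-batch, so this is a simplification. `State` and `onEvent` are hypothetical names.

```scala
// Toy watermark handling for counts per tumbling event-time window.
object WatermarkSketch {
  // counts maps a window start time to the number of events in that window.
  final case class State(maxEventTime: Long, counts: Map[Long, Int])

  def onEvent(s: State, eventTime: Long, windowSize: Long, allowedLateness: Long): State = {
    val maxSeen = math.max(s.maxEventTime, eventTime)
    val watermark = maxSeen - allowedLateness
    val windowStart = eventTime - eventTime % windowSize
    // Ignore events whose window already ended at or before the watermark.
    val updated =
      if (windowStart + windowSize <= watermark) s.counts
      else s.counts.updated(windowStart, s.counts.getOrElse(windowStart, 0) + 1)
    // Forget state for windows that can no longer receive data.
    State(maxSeen, updated.filter { case (start, _) => start + windowSize > watermark })
  }
}
```

With a window size of 10 and an allowed lateness of 5, an event at time 3 arriving after an event at time 25 is dropped, because the watermark has already reached 20.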

Architecture
Master
• Tracks the D-Stream lineage graph
• Schedules tasks to compute new RDD partitions
Worker
• Receives and stores RDD partitions (input or computed)
• Executes tasks

Some Details
• Pipelines operators that can be grouped into a single task
• Submits tasks for the next timestep before the current one has finished
• Checkpoints RDDs asynchronously, then forgets their lineage
• Block store manages RDD partitions in an LRU fashion
• Master recovery

Fault and Straggler
• Parallel Recovery
  • Parallel across partitions of the RDDs in each timestep
  • Parallel across timesteps for independent operations
• Detect stragglers (tasks running 1.4× slower than their peers)
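
The threshold rule can be sketched as follows. The slide gives only the 1.4× factor. Using the median running time of the tasks in the same stage as the baseline is an assumption of this sketch, as are the function names.

```scala
// Threshold-based straggler detection; assumes a non-empty list of runtimes.
object StragglerSketch {
  // Median of task running times (upper median for even counts, a simplification).
  def median(xs: Seq[Double]): Double = xs.sorted.apply(xs.size / 2)

  // Indices of tasks that run more than `factor` times longer than the median.
  def stragglers(runtimes: Seq[Double], factor: Double = 1.4): Seq[Int] = {
    val m = median(runtimes)
    runtimes.zipWithIndex.collect { case (t, i) if t > factor * m => i }
  }
}
```

A flagged task would then be rerun speculatively on another node, and whichever copy finishes first wins.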
