WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
#UnifiedDataAnalytics #SparkAISummit
Itai Yaffe, Nielsen
Stream, Stream, Stream
Different Streaming Methods with Spark and Kafka
Introduction
Itai Yaffe
● Tech Lead, Big Data group
● Dealing with Big Data
challenges since 2012
Introduction - part 2 (or: “your turn…”)
● Data engineers? Data architects? Something else?
● Working with Spark? Planning to?
● Working with Kafka? Planning to?
Agenda
Nielsen Marketing Cloud (NMC)
○ About
○ High-level architecture
Data flow - past and present
Spark Streaming
○ “Stateless” and “stateful” use-cases
Spark Structured Streaming
“Streaming” over our Data Lake
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen in March 2015
● A Data company
● Machine learning models for insights
● Targeting
● Business decisions
Nielsen Marketing Cloud - questions we try to answer
1. How many unique users of a certain profile can we reach?
E.g. a campaign for young women who love tech
2. How many impressions did a campaign receive?
Nielsen Marketing Cloud - high-level architecture
Data flow in the old days ...
[Diagram: CSV files -> standalone Java processes -> in-DB aggregation -> OLAP]
Data flow in the old days… What’s wrong with that?
● CSV-related issues, e.g.:
○ Truncated lines in input files
○ Can’t enforce schema
● Scale-related issues, e.g.:
○ Had to “manually” scale the processes
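The CSV problems above can be made concrete with a small sketch (plain Python, not the original Java/Spark code; the three-column schema is made up for illustration): a truncated input line only gets caught if a schema is explicitly enforced on every row.

```python
import csv
import io

# Hypothetical 3-column schema: user_id (int), segment (str), timestamp (int)
SCHEMA = (int, str, int)

def validate(row):
    """Return a typed tuple, or None if the row is truncated/malformed."""
    if len(row) != len(SCHEMA):
        return None                      # truncated line: wrong column count
    try:
        return tuple(typ(val) for typ, val in zip(SCHEMA, row))
    except ValueError:
        return None                      # type mismatch: CSV alone can't enforce it

raw = "1001,tech-lovers,1571000000\n1002,tech-lo"   # second line truncated mid-write
rows = list(csv.reader(io.StringIO(raw)))
valid = [r for r in map(validate, rows) if r is not None]
print(valid)        # only the intact first line survives
```

Without such per-row validation, truncated lines silently produce bad aggregates downstream.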
That's one small step for [a] man… (2014)
“Apache Spark is the Taylor Swift of big data software” (Derrick Harris, Fortune.com, 2015)
[Diagram: CSV files -> Spark batch jobs -> in-DB aggregation -> OLAP]
Why just a small step?
● Solved the scaling issues
● Still faced the CSV-related issues
Data flow - the modern way
[Image: Spark + Kafka. Photography Copyright: NBC]
Spark Streaming - “stateless” app use-case (2015)
[Diagram: read messages from Kafka -> Spark Streaming -> in-DB aggregation -> OLAP]
The need for stateful streaming
Fast forward a few months...
● New requirements were being raised
● Specific use-case:
○ To take the load off of the operational DB (used both as OLTP and OLAP), we wanted to move most of
the aggregative operations to our Spark Streaming app
Stateful streaming via “local” aggregations
1. Read Messages
2. Aggregate current micro-batch
3. Write combined aggregated data
4. Read aggregated data of previous micro-batches from HDFS
5. Upsert aggregated data to OLAP (every X micro-batches)
Stateful streaming via “local” aggregations
● Required us to manage the state on our own
● Error-prone
○ E.g. what if my cluster is terminated and the data on HDFS is lost?
● Complicates the code
○ Mixed input sources for the same app (Kafka + files)
● Possible performance impact
○ Might cause the Kafka consumer to lag
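The five steps above can be simulated in a few lines of plain Python (a sketch only — the real app used Spark Streaming, Kafka and HDFS; `hdfs`, `olap` and `UPSERT_EVERY` are illustrative stand-ins). It also shows why the pattern is fragile: if the `hdfs` state were lost mid-run, the accumulated aggregates would be gone.

```python
# Minimal simulation of the "local" aggregation pattern.
# "hdfs" and "olap" are plain dicts standing in for real storage.
from collections import Counter

hdfs = Counter()        # combined aggregates of previous micro-batches
olap = {}               # upsert target
UPSERT_EVERY = 2        # "every X micro-batches"

def process_micro_batch(messages, batch_num):
    local = Counter(messages)           # 2. aggregate current micro-batch
    previous = Counter(hdfs)            # 4. read previous aggregates "from HDFS"
    combined = previous + local
    hdfs.clear()
    hdfs.update(combined)               # 3. write combined aggregates back
    if batch_num % UPSERT_EVERY == 0:   # 5. upsert every X micro-batches
        olap.update(combined)

for i, batch in enumerate([["a", "b", "a"], ["b", "c"]], start=1):
    process_micro_batch(batch, i)

print(olap)   # {'a': 2, 'b': 2, 'c': 1}
```

All the state handling here is hand-rolled, which is exactly the burden Structured Streaming later removed.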
Structured Streaming - to the rescue?
Spark 2.0 introduced Structured Streaming
● Enables running continuous, incremental processes
○ Basically manages the state for you
● Built on Spark SQL
○ DataFrame/Dataset API
○ Catalyst Optimizer
● Many other features
● Was in ALPHA mode in 2.0 and 2.1
Structured Streaming
Structured Streaming - stateful app use-case
1. Read Messages
2. Aggregate current window
3. Checkpoint (state and offsets) handled internally by Spark
4. Upsert aggregated data to OLAP (on window end)
Structured Streaming - known issues & tips
● 3 major issues we had in 2.1.0 (solved in 2.1.1):
○ https://issues.apache.org/jira/browse/SPARK-19517
○ https://issues.apache.org/jira/browse/SPARK-19677
○ https://issues.apache.org/jira/browse/SPARK-19407
● Checkpointing to S3 wasn’t straightforward
○ Tried using EMRFS consistent view
■ Worked for stateless apps
■ Encountered sporadic issues for stateful apps
Structured Streaming - strengths and weaknesses (IMO)
● Strengths include:
○ Running incremental, continuous processing
○ Increased performance (e.g. via the Catalyst SQL optimizer)
○ Massive efforts are invested in it
● Weaknesses were mostly related to maturity
Back to the future - Spark Streaming revived for “stateful” app use-case
1. Read Messages
2. Aggregate current micro-batch
3. Write Files
4. Load Data into OLAP
Cool, so… Why can’t we stop here?
● Significantly underutilized cluster resources = wasted $$$
Cool, so… Why can’t we stop here? (cont.)
● Extreme load of Kafka brokers’ disks
○ Each micro-batch needs to read ~300M messages, and Kafka can’t store it all in memory
● ConcurrentModificationException when using Spark Streaming + Kafka 0.10 integration
○ Forced us to use 1 core per executor to avoid it
○ https://issues.apache.org/jira/browse/SPARK-19185 supposedly solved in 2.4.0 (possibly solving
https://issues.apache.org/jira/browse/SPARK-22562 as well)
● We wish we could run it even less frequently
○ Remember - longer micro-batches result in a better aggregation ratio
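The aggregation-ratio point can be illustrated with a toy calculation (plain Python; the key space of 100 is made up): with a fixed set of aggregation keys, a longer batch collapses far more input into the same number of output rows.

```python
# Illustrates "longer micro-batches result in a better aggregation ratio":
# with a fixed key space, more messages per batch still collapse into one
# aggregated row per key, so the output/input ratio shrinks.
from collections import Counter

KEYS = 100   # hypothetical number of distinct aggregation keys

def aggregation_ratio(batch_size):
    messages = [i % KEYS for i in range(batch_size)]   # repeated keys
    aggregated = Counter(messages)                      # one output row per key
    return len(aggregated) / len(messages)

short, long_ = aggregation_ratio(200), aggregation_ratio(10_000)
print(short, long_)   # the longer batch aggregates 50x better
```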
Introducing RDR
RDR (or Raw Data Repository) is our Data Lake
● Kafka topic messages are stored on S3 in Parquet format
partitioned by date (date=2019-10-17)
● RDR Loaders - stateless Spark Streaming applications
● Applications can read data from RDR for various use-cases
○ E.g. analyzing data of the last day or the last 30 days
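A sketch of what selecting "the last N days" from such date-partitioned paths might look like (plain Python; the `rdr` bucket name is illustrative):

```python
# Builds s3://<bucket>/date=YYYY-MM-DD/ prefixes for a "last N days" read
# over a date-partitioned Data Lake layout like the one described above.
from datetime import date, timedelta

def partition_paths(bucket, end, days):
    """Partition prefixes for the `days` days ending at `end`, oldest first."""
    return [
        f"s3://{bucket}/date={(end - timedelta(days=d)).isoformat()}/"
        for d in range(days - 1, -1, -1)
    ]

paths = partition_paths("rdr", date(2019, 10, 17), 3)
print(paths)
# ['s3://rdr/date=2019-10-15/', 's3://rdr/date=2019-10-16/', 's3://rdr/date=2019-10-17/']
```

A batch application would then read only these prefixes instead of scanning the whole lake.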
Can we leverage our Data Lake and use it as the data source (instead of Kafka)?
Potentially yes ...
[Diagram: S3 RDR with partitions date=2019-10-14, date=2019-10-15, date=2019-10-16]
1. Read RDR files from the last day
2. Process files
... but
● This ignores late arriving events
Enter “streaming” over RDR
+ +
How do we “stream” RDR files - producer side
[Diagram: RDR Loaders between Kafka and S3 RDR]
1. Read Messages
2. Write files (to S3 RDR)
3. Write files’ paths (to topics with files’ paths as messages)
How do we “stream” RDR files - consumer side
[Diagram: consumers between the files’-paths topics and S3 RDR]
1. Read files’ paths
2. Read RDR files
3. Process files
How do we “stream” RDR files – producer & consumers
[Diagram: RDR Loader and consumers around S3 RDR]
1. Read Messages (from the topic with raw data)
2. Write files (to S3 RDR)
3. Write files’ paths (to the topic with files’ paths)
4. Read files’ paths
5. Read RDR files
6. Process files
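The producer/consumer flow can be sketched end-to-end in plain Python (topics and the file store are simple lists/dicts here, standing in for Kafka and S3; all names are illustrative):

```python
# End-to-end sketch of "streaming over the Data Lake": the loader writes raw
# messages to a file store and publishes each file's path to a second topic;
# consumers then follow paths instead of reading raw messages from Kafka.

raw_topic = [["msg1", "msg2"], ["msg3"]]   # batches of raw messages
paths_topic = []                            # "topic with files' paths"
file_store = {}                             # stands in for S3 RDR

def rdr_loader():
    for i, batch in enumerate(raw_topic):               # 1. read messages
        path = f"s3://rdr/date=2019-10-17/part-{i}"
        file_store[path] = batch                        # 2. write files
        paths_topic.append(path)                        # 3. write files' paths

def consumer():
    processed = []
    for path in paths_topic:                            # 4. read files' paths
        processed.extend(file_store[path])              # 5. read RDR files
    return [m.upper() for m in processed]               # 6. process files

rdr_loader()
print(consumer())   # ['MSG1', 'MSG2', 'MSG3']
```

The path messages are tiny (~1K per hour in our case), which is what relieves the Kafka brokers.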
How do we use the new RDR “streaming” infrastructure?
1. Read files’ paths
2. Read RDR files
3. Aggregate current batch and write files
4. Load Data into OLAP
Did we solve the aforementioned problems?
● EMR clusters are now transient - no more idle clusters
[Chart: daily cluster utilization over Day 1, Day 2, Day 3 - 80% REDUCTION]
Did we solve the aforementioned problems? (cont.)
● No more extreme load of Kafka brokers’ disks
○ We still read old messages from Kafka, but now we only read
about 1K messages per hour (rather than ~300M)
● The new infra doesn’t depend on the integration of Spark Streaming with Kafka
○ No more weird exceptions ...
● We can run the Spark batch applications as (in)frequently as we’d like
● Built-in handling of late arriving events
Summary
● Initially replaced standalone Java with Spark & Scala
○ Still faced CSV-related issues
● Introduced Spark Streaming & Kafka for “stateless” use-cases
○ Quickly needed to handle stateful use-cases as well
● Tried Spark Streaming for stateful use-cases (via “local” aggregations)
○ Required us to manage the state on our own
● Moved to Structured Streaming (for all use-cases)
○ Cons were mostly related to maturity
Summary (cont.)
● Went back to Spark Streaming (with Druid as OLAP)
○ Performance penalty in Kafka for long micro-batches
○ Under-utilized Spark clusters
○ Etc.
● Introduced “streaming” over our Data Lake
○ Eliminated Kafka performance penalty
○ Spark clusters are much better utilized = $$$ saved
○ And more ...
Want to know more?
● Women in Big Data
○ A world-wide program that aims:
■ To inspire, connect, grow, and champion success of women in the Big Data & analytics field
■ To grow women’s representation in the Big Data field to over 25% by 2020
○ Over 20 chapters and 14,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you -
https://www.womeninbigdata.org/wibd-structure/
● Counting Unique Users in Real-Time: Here's a Challenge for You!
○ Big Data LDN, November 13th 2019, https://tinyurl.com/y5ffvlqk
● NMC Tech Blog - https://medium.com/nmc-techblog
QUESTIONS
Itai Yaffe
THANK YOU
Itai Yaffe
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
Structured Streaming -
additional slides
Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
Data stream as an unbounded table: new data arriving in the data stream = new rows appended to an unbounded table.
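A minimal simulation of this model (plain Python, not Spark): each trigger appends the new rows to the "unbounded table" and incrementally updates a running word count, echoing the WordCount example on the next slide.

```python
# "Data stream as an unbounded table": every trigger appends new rows,
# and the result table is updated incrementally over just the new rows.
from collections import Counter

unbounded_table = []          # the stream modeled as an ever-growing table
result = Counter()            # the continuously-updated result table

def on_trigger(new_lines):
    unbounded_table.extend(new_lines)      # new data = new rows appended
    for line in new_lines:                 # incremental: only new rows processed
        result.update(line.split())
    return dict(result)

print(on_trigger(["cat dog", "dog"]))   # {'cat': 1, 'dog': 2}
print(on_trigger(["dog owl"]))          # {'cat': 1, 'dog': 3, 'owl': 1}
```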
Structured Streaming - basic concepts
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
Structured Streaming - WordCount example
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
Structured Streaming - basic terms
● Input sources:
○ File
○ Kafka
○ Socket, Rate (for testing)
● Output modes:
○ Append (default)
○ Complete
○ Update (added in Spark 2.1.1)
○ Different types of queries support different output modes
■ E.g. for non-aggregation queries, Complete mode is not supported, as it is infeasible to keep all
unaggregated data in the Result Table
● Output sinks:
○ File
○ Kafka (added in Spark 2.2.0)
○ Foreach
○ Console, Memory (for debugging)
○ Different types of sinks support different output modes
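One way to build intuition for the output modes is a toy running word count (plain Python sketch; Append is rejected here because aggregated rows keep changing, mirroring the restriction noted above):

```python
# Illustrative sketch of output modes on a running word count:
# Complete emits the whole result table, Update only the changed rows,
# and Append is unsupported because aggregated rows are never final.
from collections import Counter

result = Counter()

def process(new_lines, mode):
    before = dict(result)
    for line in new_lines:
        result.update(line.split())
    if mode == "complete":
        return dict(result)                                   # entire result table
    if mode == "update":
        return {k: v for k, v in result.items() if before.get(k) != v}
    raise ValueError("append is unsupported for this aggregation")

print(process(["cat dog"], "complete"))        # {'cat': 1, 'dog': 1}
print(process(["dog"], "update"))              # {'dog': 2}
```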
Fault tolerance
● The goal - end-to-end exactly-once semantics
● The means:
○ Trackable sources (i.e. offsets)
○ Checkpointing
○ Idempotent sinks
aggDF
.writeStream
.outputMode("complete")
.option("checkpointLocation", "path/to/HDFS/dir")
.format("memory")
.start()
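How the three ingredients combine can be sketched with a toy replay (plain Python; `source`, `sink` and `checkpoint` are stand-ins): a crash after a write but before the checkpoint makes one record get re-written on recovery, and the idempotent upsert makes that harmless.

```python
# Sketch of exactly-once semantics from trackable offsets + checkpointing
# + an idempotent sink: recovery replays from the last checkpoint, and
# duplicate writes are absorbed by the upsert-by-key sink.
source = [("k1", 1), ("k2", 2), ("k3", 3)]   # (key, value) at implicit offsets
sink = {}                                     # idempotent: upsert by key
checkpoint = {"offset": 0}

def run(crash_after=None):
    for offset in range(checkpoint["offset"], len(source)):
        key, value = source[offset]
        sink[key] = value                     # idempotent write (safe to repeat)
        if crash_after is not None and offset == crash_after:
            return                            # crash BEFORE checkpointing this offset
        checkpoint["offset"] = offset + 1     # checkpoint after a successful write

run(crash_after=1)    # crashes having written k2 but not checkpointed it
run()                 # recovery: replays k2 (harmless), then processes k3
print(sink)           # {'k1': 1, 'k2': 2, 'k3': 3}
```

Note the memory sink in the snippet above is for debugging; a production sink must actually be idempotent for this guarantee to hold.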
Monitoring
● Interactive APIs:
○ streamingQuery.lastProgress()/status()
○ Output example
● Asynchronous API:
○ val spark: SparkSession = ...
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
println("Query started: " + queryStarted.id)
}
override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
println("Query terminated: " + queryTerminated.id)
}
override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
println("Query made progress: " + queryProgress.progress)
}
})
Structured Streaming in production
So we started moving to Structured Streaming
Use case | Previous architecture | Old flow | New architecture | New flow
---------|----------------------|----------|------------------|---------
Existing Spark app | Periodic Spark batch job | Read Parquet from S3 -> Transform -> Write Parquet to S3 | Stateless Structured Streaming | Read from Kafka -> Transform -> Write Parquet to S3
Existing Java app | Periodic standalone Java process (“manual” scaling) | Read CSV -> Transform and aggregate -> Write to RDBMS | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS
New app | N/A | N/A | Stateful Structured Streaming | Read from Kafka -> Transform and aggregate -> Write to RDBMS