Sessionization At Scale
Using Spark Streaming in production and staying sane
Marina Grechuhin & Yuval Itzchakov
12/09/2017
Yuval Itzchakov
2Confidential
• Developer @ Clicktale for the past 3 years
• Previously developer @ IDF (8200)
• @yuvalitzchakov
• https://stackoverflow.com/users/1870803/yuval-itzchakov
• http://asyncified.io
Marina Grechuhin
3Confidential
• Team Leader @ Clicktale
• Previously co-founder and VP R&D @ SureVisit
• Previously – many more
Yes
No
Agenda
4Confidential
• Introduction to Spark
• Spark In Depth
• What Is Sessionization?
• Spark Brief Overview
• Sessionization With Spark Streaming
• Scale Challenges
• Structured Streaming with Stateful Aggregations
5Confidential
Architecture – Pipeline CEC
Elastic Load Balancing
Auto Scaling group
Ingest Servers
{
  "version": 1,
  "location": "http://adobe.com/shoe.html",
  "projectId": 10,
  "documentReferrer": "",
  "visitId": 6403608503386111,
  "domContentLoaded": 324,
  "visitorId": 3246944914767871,
  "pageviewId": 1199465738272767,
  "engagementTime": 2336,
  "messageId": 0
}
6Confidential
Pipeline CEC – Data Types
• Init Message
• Chunk Messages 0-N
• End Message
14
7Confidential
Sizing Pipeline CEC
[Diagram: Elastic Load Balancing and Ingest tiers; labels: 10, 500G/Day, 100G/Day, 1415, 8, 10]
8Confidential
What is Sessionization?
Session:
“A sequence of requests made by a single end-user during a visit to a particular site”
(Wikipedia)
• To be able to aggregate user actions over time
• All data doesn’t arrive at once, but piece by piece
9Confidential
Pipeline CEC – Data Types
[Diagram: a Visit contains several PageViews; each PageView is built from Init, Chunk, and End messages]
PageView – User’s Journey on a single web page
Visit – User’s journey on site
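The grouping described above can be sketched in plain Python (not Spark): messages are buffered per `pageviewId`, and a page view is emitted only once its End message arrives. The `kind` field and its values are hypothetical; the sample message on the architecture slide does not show how the message type is actually encoded.

```python
from collections import defaultdict


def sessionize(messages):
    """Group messages by pageviewId and yield a page view once its End message arrives.

    `kind` is a hypothetical field ("init", "chunk", "end"); the real wire
    format may encode the message type differently.
    """
    open_pageviews = defaultdict(list)  # pageviewId -> messages seen so far
    for msg in messages:
        key = msg["pageviewId"]
        open_pageviews[key].append(msg)
        if msg["kind"] == "end":
            # The page view is complete: emit it and drop its buffered state.
            yield key, open_pageviews.pop(key)


stream = [
    {"pageviewId": 1, "kind": "init"},
    {"pageviewId": 2, "kind": "init"},
    {"pageviewId": 1, "kind": "chunk"},
    {"pageviewId": 1, "kind": "end"},
    {"pageviewId": 2, "kind": "chunk"},
]
completed = list(sessionize(stream))
```

Page view 2 never receives an End message here, so it stays buffered. That is the state that has to survive between batches.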
10Confidential
Requirements overview
• Message size ranging from 200 B to 1 KB (may grow over time)
• Process incoming user messages at up to 100,000 messages per second
• Handle traffic peaks of up to 1,000,000 messages per second (common with Fortune 500 companies)
• Scale out as needed without user intervention (hopefully linearly)
• Save user state until a session is complete, and only then send it down the pipeline
• Latency: up to 10 seconds from ingestion to processing (make data available as soon as it’s ready)
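As a rough back-of-the-envelope check, these message rates and sizes translate into raw ingest bandwidth as follows. The sketch assumes decimal units (1 KB = 1,000 bytes) and ignores protocol and serialization overhead.

```python
def bandwidth_mb_per_sec(messages_per_sec, message_bytes):
    """Raw ingest bandwidth in MB/s (decimal megabytes) for a message rate and size."""
    return messages_per_sec * message_bytes / 1_000_000


steady_small = bandwidth_mb_per_sec(100_000, 200)      # 200 B messages
steady_large = bandwidth_mb_per_sec(100_000, 1_000)    # 1 KB messages
peak_large = bandwidth_mb_per_sec(1_000_000, 1_000)    # 1 KB messages at peak
```

Under these assumptions the steady state is between 20 and 100 MB/s, and a peak of 1 KB messages reaches about 1 GB/s.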
11Confidential
12Confidential
Spark Ecosystem
Source: https://www.linkedin.com/pulse/apache-spark-big-data-dataframe-things-know-abhishek-choudhary
13Confidential
Spark Streaming
• Discretized Stream (DStream)
• Micro batching
• One RDD every batch
Where is the state kept between batches?
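Micro-batching can be illustrated without Spark: events are cut into consecutive fixed-length intervals, and each interval becomes one batch (one RDD in DStream terms). This is a simplified sketch; the timestamps stand for arrival times, and `micro_batches` is an illustrative helper, not a Spark API.

```python
def micro_batches(events, batch_interval):
    """Split (timestamp, value) events into consecutive fixed-length batches.

    Each returned list stands in for the one RDD produced per batch interval.
    Intervals with no events yield empty batches.
    """
    if not events:
        return []
    events = sorted(events, key=lambda e: e[0])
    start = events[0][0]
    n_batches = int((events[-1][0] - start) // batch_interval) + 1
    batches = [[] for _ in range(n_batches)]
    for ts, value in events:
        batches[int((ts - start) // batch_interval)].append(value)
    return batches


events = [(0.0, "a"), (0.4, "b"), (1.2, "c"), (3.1, "d")]
batches = micro_batches(events, batch_interval=1.0)
```

Nothing in a batch knows about the previous one, which is why the question of where state lives between batches comes up.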
14Confidential
mapWithState
Source: https://databricks.com/wp-content/uploads/2016/01/blog-faster-stateful-streaming-figure-1-1024x562.png
[Diagram labels: Init, Chunk, End; PageView; Visit]
• Partial Updates
• Timeout
• Initial State
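The three features listed above can be mimicked with a toy keyed state store. This is a plain-Python sketch of the concept, not the `mapWithState` API; the class and method names are made up for illustration.

```python
class KeyedState:
    """Toy per-key state store with partial updates, an idle timeout, and initial state."""

    def __init__(self, timeout, initial_state=None):
        self.timeout = timeout
        # key -> (state, time of last update); initial state counts as updated at t=0
        self.store = {k: (v, 0.0) for k, v in (initial_state or {}).items()}

    def update(self, key, value, now):
        """Partial update: fold one new value into the existing state for `key`."""
        state, _ = self.store.get(key, ([], now))
        new_state = state + [value]
        self.store[key] = (new_state, now)
        return new_state

    def expire(self, now):
        """Remove and return every key that has been idle for longer than the timeout."""
        expired = {k: s for k, (s, last) in self.store.items() if now - last > self.timeout}
        for k in expired:
            del self.store[k]
        return expired


state = KeyedState(timeout=30.0, initial_state={"visit-1": ["init"]})
state.update("visit-1", "chunk", now=5.0)
state.update("visit-2", "init", now=20.0)
timed_out = state.expire(now=40.0)
```

Here `visit-1` has been idle for 35 seconds and is expired, while `visit-2` stays in the store.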
15Confidential
DStream[(String, String)]
(“dardasaba”, “hello”), (“dardasaba”, “goodbye”), (“hathatul”, “w00t”), (“hathatul”, “nope”), (“gargamel”, “muhaha”)
OpenHashMap[String, List[String]] (one per executor)
Executor 1: Key “dardasaba”, Value [“hello”, “goodbye”]
Executor 2: Key “hathatul”, Value [“w00t”, “nope”]
Executor 3: Key “gargamel”, Value [“muhaha”]
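A minimal sketch of how keyed pairs end up in per-executor maps: each key is hashed to an executor index, and every value for that key is appended to that executor's map. `zlib.crc32` is used here only as a stable stand-in hash; Spark's own partitioner works on the key's hash code, so the actual key-to-executor assignment will differ.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_executors):
    """Pick an executor index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_executors


def distribute(pairs, num_executors):
    """Build one key -> list-of-values map per executor."""
    executors = [defaultdict(list) for _ in range(num_executors)]
    for key, value in pairs:
        executors[partition_for(key, num_executors)][key].append(value)
    return executors


pairs = [
    ("dardasaba", "hello"),
    ("dardasaba", "goodbye"),
    ("hathatul", "w00t"),
    ("hathatul", "nope"),
    ("gargamel", "muhaha"),
]
executors = distribute(pairs, num_executors=3)
```

The property that matters is that all values for one key land on the same executor, so the state for that key lives in exactly one place.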
16Confidential
What Could Possibly Go Wrong?
17Confidential
Scale Challenges
• Stability
• Resiliency
• Scalability (scale up / down)
• Monitoring
18Confidential
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing
• S3
  • Task failure - Eventual consistency on read
• AWS EFS (?)
  • Not suited for small file systems (limited IOPS)
• HDFS
  • Best overall write performance out of the three
  • Can be installed on the same node as Spark Workers
  • Relatively low maintenance (if used only for checkpoint)
19Confidential
Spark Streaming Challenges
1. Stability & Resiliency -> Checkpointing (cont.)
• Problem:
  • State not always recoverable
  • No matter the DFS, checkpointing limits your throughput:
    • 1 KB message size
    • 100,000 messages/sec
    • 1 minute checkpoint time (occurs every 40 seconds)
• Workaround:
  • None (in Spark Streaming)
Checkpointing – This is the cost???
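The figures above can be turned into a quick worked calculation. The reading assumed here is that a checkpoint takes about 60 seconds to write and is triggered every 40 seconds; the slide does not spell this out, so treat the interpretation as an assumption. Decimal units (1 KB = 1,000 bytes) are assumed as well.

```python
MESSAGE_BYTES = 1_000          # 1 KB per message (decimal units, an assumption)
MESSAGES_PER_SEC = 100_000
CHECKPOINT_DURATION_SEC = 60   # assumed: time one checkpoint takes to write
CHECKPOINT_INTERVAL_SEC = 40   # assumed: how often a checkpoint is triggered

# Raw data arriving per second, in MB.
ingest_mb_per_sec = MESSAGE_BYTES * MESSAGES_PER_SEC / 1_000_000

# Data arriving between two consecutive checkpoint triggers, in GB.
data_per_interval_gb = ingest_mb_per_sec * CHECKPOINT_INTERVAL_SEC / 1_000

# A ratio above 1 means a checkpoint takes longer than the gap between checkpoints.
checkpoint_load = CHECKPOINT_DURATION_SEC / CHECKPOINT_INTERVAL_SEC
```

Under these assumptions about 4 GB arrives between checkpoint triggers and the checkpoint load is 1.5, so checkpointing alone takes longer than the time available for it.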
20Confidential
Spark Streaming Challenges
2. Resiliency -> Managing user state between application upgrades
• Problems:
  • Can’t change the graph
  • Can’t change your data structures
• Workaround:
  • Roll your own using `stateSnapshots()`
  • Provide on start up using `StateSpec.initialState()`
* Can potentially double the job time overhead (critical with high throughput).
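One way to make a hand-rolled snapshot survive a change in data structures is to store a schema version next to the state and migrate old entries on load. The sketch below covers only that storage-format side in plain Python; the field names, the version numbers, and the rename from `msgs` to `messages` are all hypothetical. In a real job the snapshot would come from the state stream and be fed back in as initial state on start-up.

```python
import json

SCHEMA_VERSION = 2  # hypothetical: version 2 renamed "msgs" to "messages"


def dump_snapshot(state):
    """Serialize the full key -> state map together with its schema version."""
    return json.dumps({"version": SCHEMA_VERSION, "state": state})


def migrate(entry, version):
    """Upgrade one state entry written by an older application version."""
    if version == 1:
        entry = {"messages": entry["msgs"]}
    return entry


def load_snapshot(text):
    """Parse a snapshot and migrate every entry to the current schema."""
    snapshot = json.loads(text)
    version = snapshot["version"]
    return {key: migrate(entry, version) for key, entry in snapshot["state"].items()}


old_snapshot = json.dumps({"version": 1, "state": {"visit-1": {"msgs": ["init", "chunk"]}}})
restored = load_snapshot(old_snapshot)
round_trip = load_snapshot(dump_snapshot(restored))
```

Writing and reading the full snapshot on every run is the extra overhead that the footnote above warns about.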
21Confidential
Spark Streaming Challenges
3. Stability -> Frozen Jobs
• Problem:
  • Spark Streaming defaults to one job (batch) at a time
  • If a particular job is stuck, all others wait indefinitely
• Workaround:
  • Monitor job status using Spark’s driver REST API (http://<driver ip>:4040/api/v1/applications)
  • Consider using Speculation (should be done carefully)
  • Enable Blacklisting if a particular node is faulty
  • If you like to live dangerously, consider modifying “spark.streaming.concurrentJobs”
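A minimal sketch of the monitoring workaround: poll the job list from the driver's REST API and flag jobs that have been running for too long. The `/jobs` path under an application and the `jobId`, `status`, and `submissionTime` fields reflect the monitoring API as I understand it; verify them against your Spark version. The threshold and the sample data are made up.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def fetch_jobs(driver_url, app_id):
    """Fetch the job list from the driver's monitoring REST API."""
    with urlopen(f"{driver_url}/api/v1/applications/{app_id}/jobs") as response:
        return json.load(response)


def parse_spark_time(text):
    """Parse timestamps such as '2017-09-12T10:00:00.000GMT'."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fGMT").replace(tzinfo=timezone.utc)


def find_stuck_jobs(jobs, now, max_runtime_sec):
    """Return the ids of RUNNING jobs submitted more than max_runtime_sec ago."""
    stuck = []
    for job in jobs:
        if job["status"] != "RUNNING":
            continue
        age = (now - parse_spark_time(job["submissionTime"])).total_seconds()
        if age > max_runtime_sec:
            stuck.append(job["jobId"])
    return stuck


# Made-up sample data in the assumed response shape.
sample_jobs = [
    {"jobId": 41, "status": "SUCCEEDED", "submissionTime": "2017-09-12T10:00:00.000GMT"},
    {"jobId": 42, "status": "RUNNING", "submissionTime": "2017-09-12T10:00:10.000GMT"},
    {"jobId": 43, "status": "RUNNING", "submissionTime": "2017-09-12T10:04:55.000GMT"},
]
now = datetime(2017, 9, 12, 10, 5, 0, tzinfo=timezone.utc)
stuck = find_stuck_jobs(sample_jobs, now, max_runtime_sec=60)
```

In production, `fetch_jobs("http://<driver ip>:4040", app_id)` would supply the job list, and a stuck job would trigger an alert or a restart.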
22Confidential
Spark Streaming Challenges
4. Scalability
• Scale Up – Just works*
• Scale Down – Who takes over the worker’s state? No one!
23Confidential
Spark Streaming Challenges
5. Monitoring
• Logging mechanism – Log only a small, random percentage of traffic
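Sampled logging can be sketched as a logger that records each message with a fixed probability. The class name, the 1% rate, and the in-memory list standing in for a log sink are illustrative assumptions.

```python
import random


class SampledLogger:
    """Log roughly `rate` of all messages, chosen at random."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self.rng = rng or random.Random()
        self.lines = []  # stand-in for a real log sink

    def log(self, message):
        """Record the message with probability `rate`; return whether it was logged."""
        if self.rng.random() < self.rate:
            self.lines.append(message)
            return True
        return False


logger = SampledLogger(rate=0.01, rng=random.Random(42))
for i in range(100_000):
    logger.log(f"message {i}")
sampled_count = len(logger.lines)
```

At 100,000 messages per second, a 1% sample still produces about 1,000 log lines per second, so the rate may need to be much lower in practice.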
25Confidential
Is there a better alternative?
26Confidential
Structured Streaming
“The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended” (Structured Streaming Documentation)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-stream-as-a-table.png
27Confidential
Structured Streaming (Cont.)
Source: https://spark.apache.org/docs/latest/img/structured-streaming-model.png
28Confidential
mapGroupsWithState
A second iteration at stateful aggregations in Spark
Resiliency & Stability -> Checkpointing
• Checkpoints are incremental, only deltas!
• Allows state recovery between upgrades *
* According to a set of tests we ran; this may not apply to all cases and isn’t documented behavior
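The idea of an incremental checkpoint can be sketched as writing only the difference between two versions of the state and replaying it on recovery. This is a conceptual illustration in plain Python, not a description of Spark's on-disk format.

```python
def diff_state(previous, current):
    """Return (upserts, deletes) needed to turn `previous` into `current`."""
    upserts = {
        k: v for k, v in current.items()
        if k not in previous or previous[k] != v
    }
    deletes = [k for k in previous if k not in current]
    return upserts, deletes


def apply_delta(base, upserts, deletes):
    """Rebuild a state version from an older version plus one delta."""
    state = dict(base)
    state.update(upserts)
    for k in deletes:
        state.pop(k, None)
    return state


version_1 = {"visit-1": ["init"], "visit-2": ["init", "chunk"]}
version_2 = {"visit-1": ["init", "chunk"], "visit-3": ["init"]}
upserts, deletes = diff_state(version_1, version_2)
recovered = apply_delta(version_1, upserts, deletes)
```

When only a small fraction of keys changes per batch, the delta is far smaller than a full copy of the state.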
29Confidential
Spark Structured Streaming
• More new features and cool stuff
• Event-time-based timeouts (previously only processing-time-based)
• Watermarking (New)
• Deduplication (New)
• Timeout per state item (Enhancement)
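Watermarking and deduplication work together: the watermark bounds how late an event may arrive, which in turn bounds how long the IDs seen so far must be remembered. The sketch below is a simplified per-event model in plain Python; Spark advances the watermark between triggers rather than per event, and the class and parameter names are made up.

```python
class WatermarkedDeduplicator:
    """Drop duplicate ids and events older than (max event time - allowed lateness)."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")
        self.seen = {}  # event id -> event time, kept only while inside the watermark

    @property
    def watermark(self):
        return self.max_event_time - self.allowed_lateness

    def accept(self, event_id, event_time):
        """Return True if the event is new and not too late."""
        if event_time < self.watermark or event_id in self.seen:
            return False
        self.seen[event_id] = event_time
        self.max_event_time = max(self.max_event_time, event_time)
        # Forget ids that fell behind the watermark so state stays bounded.
        self.seen = {i: t for i, t in self.seen.items() if t >= self.watermark}
        return True


dedup = WatermarkedDeduplicator(allowed_lateness=10)
results = [
    dedup.accept("a", 100),  # new event
    dedup.accept("a", 100),  # duplicate id
    dedup.accept("b", 95),   # late, but still ahead of the watermark (90)
    dedup.accept("c", 120),  # advances the watermark to 110
    dedup.accept("d", 105),  # behind the watermark
]
```

Without the watermark, the set of seen IDs would grow without bound.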
30Confidential
Our experience so far
Running ~ 1 month in production with Spark 2.2 and mapGroupsWithState:
Pros:
• Queries seem to take less time on average than Spark Streaming *
• No need to save state manually
• Deduplication out of the box is awesome
• Event based timeouts + Watermarking for late data is also awesome
* In peak hours, from ~ 3 seconds per batch to 0.6 seconds per query (x5)
31Confidential
Our experience so far (Cont.)
Neutral:
• Kafka users: Spark now maps a TopicPartition to a particular Executor, improving data
locality (less shuffling).
• This means that in order to scale up, you must have at least a 1:1 mapping
between number of Kafka partitions and Spark Executors.
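The scaling constraint can be illustrated with a toy assignment of partitions to executors. Round-robin is an assumption made for illustration and is not how Spark actually picks locations; the point is only that executors beyond the partition count receive no partition.

```python
def assign_partitions(num_partitions, num_executors):
    """Round-robin Kafka partitions over executors; returns executor -> partitions."""
    assignment = {executor: [] for executor in range(num_executors)}
    for partition in range(num_partitions):
        assignment[partition % num_executors].append(partition)
    return assignment


def idle_executors(assignment):
    """Executors that received no partition and therefore read no data."""
    return [executor for executor, partitions in assignment.items() if not partitions]


balanced = assign_partitions(num_partitions=8, num_executors=8)
over_provisioned = assign_partitions(num_partitions=8, num_executors=12)
idle = idle_executors(over_provisioned)
```

With 8 partitions and 12 executors, four executors stay idle, so adding executors helps only if partitions are added too.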
Cons:
• Creates a significantly larger memory overhead (due to internal state implementation)
• Makes heavier use of HDFS (many small file writes)
• Doesn’t support multiple states (yet)
• UI not as good as Streaming
32Confidential
Wrapping up
• Overall, Spark Streaming is a great candidate for small-to-medium loads or for streams without stateful aggregations.
• If you’re considering Spark as an option for your business, start with Structured Streaming from the get-go.
• Do consider Apache Flink as an alternative; its similar state management module allows pluggable state stores.
33Confidential
Resources
• Real-time Streaming ETL with Structured Streaming: https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
• Making Structured Streaming Ready for Production: https://www.youtube.com/watch?v=UQiuyov4J-4&feature=youtu.be
• Arbitrary Stateful Aggregations in Structured Streaming in Apache Spark: https://www.youtube.com/watch?v=JAb4FIheP28
• Exploring Spark Stateful Streaming: http://asyncified.io/2016/07/31/exploring-stateful-streaming-with-apache-spark
• Exploring Stateful Streaming with Spark Structured Streaming: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming
Thank you for listening!
Questions?

Spark Streaming @ Scale (Clicktale)

Editor's Notes

  • #6: Monitor message sending mechanism 1 kafka topic
  • #8: Monitor message sending mechanism 1 kafka topic
  • #10: How do we aggregate user messages over time in a Streaming application??
  • #11: Do a brief overview of all points, 15-20 seconds per point. At the end of the slide do an intro to Spark and talk a little about why we chose it over alternatives
  • #13: Ask a question: How many people use Spark in production? How many people use Spark Streaming in production? How many do Sessionization?
  • #14: Spark is not real-time streaming, but micro-batching. Where is the state held?
  • #19: Talk about each file system briefly