PRESENTED BY
Redis + Spark Structured Streaming:
A Perfect Combination to Scale-out Your Continuous
Applications
Dave Nielsen
Redis Labs
Agenda:
How to collect and process data streams in real-time at scale
IoT
User Activity
Messages
http://bit.ly/spark-redis
Breaking up Our Solution into Functional Blocks
Click data
1. Data Ingest: Record all clicks
2. Data Processing: Count clicks in real-time
3. Data Querying: Query clicks by asset
The Actual Building Blocks of Our Solution
Click data →
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
1. Data Ingest
Data Ingest using Redis Streams
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
What is Redis Streams?
Redis Streams in its Simplest Form
Producer → Consumer
Redis Streams can Connect Many Producers and Consumers
Producer 2
Producer m
Producer 1
Producer 3
Consumer 1
Consumer n
Consumer 2
Consumer 3
Comparing Redis Streams with Redis Pub/Sub, Lists, Sorted Sets
Pub/Sub
• Fire and forget
• No persistence
• No lookback queries
Lists
• Tight coupling between
producers and consumers
• Persistence for transient
data only
• No lookback queries
Sorted Sets
• Data ordering isn’t built-in;
producer controls the order
• No maximum limit
• The data structure is not
designed to handle data
streams
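The "tight coupling" drawback of Lists comes down to destructive reads: a pop hands each element to exactly one consumer, while a stream is an append-only log that every consumer reads at its own offset. A minimal in-memory sketch of the difference (plain Scala with hypothetical names, no Redis involved):

```scala
// List-style delivery: dequeue is destructive, so two readers split the data.
val entries = Vector("e1", "e2", "e3")
val queue = scala.collection.mutable.Queue.from(entries)
val readerA = queue.dequeue() // "e1"
val readerB = queue.dequeue() // "e2" -- readerB never sees "e1"

// Stream-style delivery: the log stays intact; each consumer keeps its
// own read offset and can replay from the beginning independently.
val offsetA = 0
val offsetB = 0
val seenByA = entries.drop(offsetA)
val seenByB = entries.drop(offsetB) // both consumers see every entry
```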
What is Redis Streams?
Pub/Sub Lists Sorted Sets
It is like Pub/Sub, but
with persistence
It is like Lists, but decouples
producers and consumers
It is like Sorted Sets, but
asynchronous
+
• Lifecycle management of streaming data
• Built-in support for timeseries data
• A rich choice of options for consumers to read streaming and static data
• Super fast lookback queries powered by radix trees
• Automatic eviction of data based on the upper limit
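The last bullet refers to capped streams: XADD can trim a stream to a maximum length (MAXLEN) so the oldest entries are evicted as new ones arrive. A rough analogue of that trimming policy, not Redis's actual radix-tree implementation:

```scala
// Keep only the newest maxLen entries, evicting from the oldest end.
def append(log: Vector[String], entry: String, maxLen: Int): Vector[String] = {
  val grown = log :+ entry
  if (grown.length > maxLen) grown.takeRight(maxLen) else grown
}

// Appending five entries to a stream capped at three...
val capped = (1 to 5)
  .map(i => s"entry-$i")
  .foldLeft(Vector.empty[String])((log, e) => append(log, e, 3))
// ...leaves only the three newest entries.
```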
Redis Streams Benefits
It enables asynchronous data exchange between producers and consumers
(messaging, analytics, data backup) and historical range queries
Redis Streams Benefits
Producer → Redis Stream → Image Processors (a consumer group of five)
Arrival Rate: 500/sec; Consumption Rate: 500/sec
With consumer groups, you can scale out and avoid backlogs
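The slide's numbers work out to 100 entries per second per processor: five consumers in one group share a 500/sec stream, so consumption matches arrival. A sketch of that partitioning (illustrative only; a real consumer group delivers entries on demand via XREADGROUP rather than by fixed round-robin):

```scala
// 500 entries arriving in one second, shared by a group of 5 consumers.
val entryCount = 500
val groupSize = 5

// Assign each entry round-robin and count the per-consumer load.
val perConsumer = (0 until entryCount)
  .groupBy(_ % groupSize)
  .map { case (_, ids) => ids.size }
// Every consumer handles 100/sec, so the backlog stays flat.
```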
Redis Streams Benefits
Producers (1..m) add entries with XADD. Standalone consumers read with
XREAD (analytics, data backup, messaging); a consumer group of classifiers
(Classifier 1..n, deep learning-based classification) reads with XREADGROUP
and acknowledges each entry with XACK.
Simplify data collection, processing and
distribution to support complex scenarios
Our Ingest Solution
Redis Stream
1. Data Ingest
Command
xadd clickstream * img [image_id]
Sample data
127.0.0.1:6379> xrange clickstream - +
1) 1) "1553536458910-0"
   2) 1) "image_1"
      2) "1"
2) 1) "1553536469080-0"
   2) 1) "image_3"
      2) "1"
3) 1) "1553536489620-0"
   2) 1) "image_3"
      2) "1"
...
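The IDs in the sample above follow the Redis stream entry-ID format: a millisecond Unix timestamp, a dash, and a sequence number that disambiguates entries added in the same millisecond. Because IDs sort by timestamp first, range queries like the xrange shown here come back in arrival order. A small parsing sketch:

```scala
// Split a stream entry ID such as "1553536458910-0" into (millis, sequence).
def parseId(id: String): (Long, Long) = {
  val parts = id.split("-", 2)
  (parts(0).toLong, parts(1).toLong)
}

// Sorting by the parsed pair reproduces the stream's arrival order.
val ids = List("1553536489620-0", "1553536458910-0", "1553536469080-0")
val ordered = ids.sortBy(parseId)
```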
2. Data Processing
Data Processing using Spark’s Structured Streaming
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
What is Structured Streaming?
Definition
“Structured Streaming provides fast, scalable, fault-tolerant, end-to-end
exactly-once stream processing without the user having to reason about
streaming.”
How Structured Streaming Works
Micro-batches as
DataFrames (tables)
Source: Data Stream
DataFrame Operations
Selection: df.select("xyz").where("a > 10")
Filtering: df.filter(_.a > 10).map(_.b)
Aggregation: df.groupBy("xyz").count()
Windowing: df.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
Deduplication: df.dropDuplicates("guid")
Output Sink
Spark Structured Streaming
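The aggregation used later in this deck, clickstream.groupBy("img").count, has the same shape as a plain Scala groupBy applied to one micro-batch. An illustrative reduction on static data:

```scala
// One micro-batch of click rows, reduced to the image id each row carries.
val batch = Seq("image_1", "image_3", "image_3", "image_1", "image_2")

// groupBy("img").count over a single micro-batch boils down to grouping
// the rows by value and taking each group's size.
val counts = batch.groupBy(identity).map { case (img, rows) => img -> rows.size }
```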
Spark-Redis Library
Redis Streams as data source; Redis as data sink
(Redis Stream → ClickAnalyzer/Structured Stream Processing → Redis Hash)
• Developed using Scala
• Compatible with Spark 2.3 and higher
• Supports
  • RDD
  • DataFrames
  • Structured Streaming
1. Connect to the Redis instance
2. Map Redis Stream to Structured Streaming schema
3. Create the query object
4. Run the query
Steps for Using Redis Streams as Data Source
Redis Streams as Data Source
1. Connect to the Redis instance
val spark = SparkSession.builder()
.appName("redis-df")
.master("local[*]")
.config("spark.redis.host", "localhost")
.config("spark.redis.port", "6379")
.getOrCreate()
val clickstream = spark.readStream
.format("redis")
.option("stream.keys","clickstream")
.schema(StructType(Array(
StructField("img", StringType)
)))
.load()
val queryByImg = clickstream.groupBy("img").count
Redis Streams as Data Source
2. Map Redis Stream to Structured Streaming schema
val spark = SparkSession.builder()
.appName("redis-df")
.master("local[*]")
.config("spark.redis.host", "localhost")
.config("spark.redis.port", "6379")
.getOrCreate()
val clickstream = spark.readStream
.format("redis")
.option("stream.keys","clickstream")
.schema(StructType(Array(
StructField("img", StringType)
)))
.load()
val queryByImg = clickstream.groupBy("img").count
xadd clickstream * img [image_id]
Redis Streams as Data Source
3. Create the query object
val spark = SparkSession.builder()
.appName("redis-df")
.master("local[*]")
.config("spark.redis.host", "localhost")
.config("spark.redis.port", "6379")
.getOrCreate()
val clickstream = spark.readStream
.format("redis")
.option("stream.keys","clickstream")
.schema(StructType(Array(
StructField("img", StringType)
)))
.load()
val queryByImg = clickstream.groupBy("img").count
Redis Streams as Data Source
4. Run the query
val clickstream = spark.readStream
.format("redis")
.option("stream.keys","clickstream")
.schema(StructType(Array(
StructField("img", StringType)
)))
.load()
val queryByImg = clickstream.groupBy("img").count
val clickWriter: ClickForeachWriter = new ClickForeachWriter("localhost","6379")
val query = queryByImg.writeStream
.outputMode("update")
.foreach(clickWriter)
.start()
query.awaitTermination()
Custom output sink
How to Set Up Redis as an Output Sink
override def process(record: Row) = {
  val img = record.getString(0)
  val count = record.getLong(1)
  if (jedis == null) {
    connect() // lazily open the Jedis connection on first use
  }
  jedis.hset("clicks:" + img, "img", img)
  jedis.hset("clicks:" + img, "count", count.toString)
}
Create a custom class extending ForeachWriter and override its process() method
Save as Hash with structure
clicks:[image]
  img [image]
  count [count]

Example
clicks:image_1001
  img image_1001
  count 1029
clicks:image_1002
  img image_1002
  count 392
...

Table: Clicks
img        count
image_1001 1029
image_1002 392
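The process() method above issues two HSET calls per row. The key/field layout it produces can be sketched without a live Redis connection; hashWrites is a hypothetical helper that returns the (key, field, value) triples the writer would send:

```scala
// Model the HSET calls the ForeachWriter issues for one aggregated row.
def hashWrites(img: String, count: Long): Seq[(String, String, String)] = {
  val key = "clicks:" + img // the key prefix groups hashes into one "table"
  Seq((key, "img", img), (key, "count", count.toString))
}

val writes = hashWrites("image_1001", 1029L)
```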
3. Data Querying
Query Redis using Spark SQL
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
1. Initialize Spark Context with Redis
2. Create table
3. Run Query
Steps to Query Redis using Spark SQL
Redis Hash to SQL mapping
clicks:image_1001
  img image_1001
  count 1029
clicks:image_1002
  img image_1002
  count 392
...

img        count
image_1001 1029
image_1002 392
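The mapping works because every hash stored under the clicks: prefix contributes one row, and its fields become the table's columns. A sketch of that projection in plain Scala, with the keyspace modeled as a simple Map (the scan and the typing are simplified assumptions, not the Spark-Redis internals):

```scala
// A toy keyspace: hash keys mapped to their field/value pairs.
val keyspace: Map[String, Map[String, String]] = Map(
  "clicks:image_1001" -> Map("img" -> "image_1001", "count" -> "1029"),
  "clicks:image_1002" -> Map("img" -> "image_1002", "count" -> "392"),
  "other:key"         -> Map("x" -> "y") // ignored: not under the prefix
)

// SELECT img, count FROM clicks: keys under "clicks:" become rows,
// hash fields become columns.
val rows = keyspace.collect {
  case (key, hash) if key.startsWith("clicks:") => (hash("img"), hash("count").toInt)
}.toSeq.sortBy(_._1)
```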
1. Initialize
scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder().appName("redis-test").master("local[*]").config("spark.redis.host","localhost").config("spark.redis.port","6379").getOrCreate()
scala> val sc = spark.sparkContext
scala> import spark.sql
scala> import spark.implicits._
2. Create table
scala> sql("CREATE TABLE IF NOT EXISTS clicks(img STRING, count INT) USING org.apache.spark.sql.redis OPTIONS (table 'clicks')")
How to Query Redis using Spark SQL
3. Run Query
scala> sql("select * from clicks").show();
+----------+-----+
| img|count|
+----------+-----+
|image_1001| 1029|
|image_1002| 392|
|. | .|
|. | .|
|. | .|
|. | .|
+----------+-----+
How to Query Redis using Spark SQL
Code
Email dave@redislabs.com for this slide deck
Or download from https://github.com/redislabsdemo/
Recap
Building Blocks of our Solution
1. Data Ingest: Redis Stream
2. Data Processing: ClickAnalyzer (Structured Stream Processing)
3. Data Querying: Redis Hash + Spark SQL
Spark-Redis Library is used for: Redis Streams as data source; Redis as data sink
Questions?
Thank you!
dave@redislabs.com
@davenielsen
Dave Nielsen
Editor's Notes
  • #3: Agenda: Takeaway: Use Redis Streams + Spark-Redis + Structured Streaming with micro-batches in Spark to collect and process data streams in real-time at scale
  • #4: Call the Spark engine with spark-submit in the scala directory: spark-submit --class com.redislabs.streaming.ClickAnalysis --jars ./lib/spark-redis-2.4.0-SNAPSHOT-jar-with-dependencies.jar --master local[*] ./target/scala-2.11/redisexample_2.11-1.0.jar. 4 data structures. clickstream – the Redis Stream data structure that collects all clicks: $ XLEN clickstream. Spark 2.3 introduced Structured Streaming, which processes micro-batches – collected from streams such as Redis Streams or Kafka, a few messages every few milliseconds. Spark then runs queries to aggregate, collecting and storing the results somewhere, such as Redis. Trick – any hash whose key starts with clicks: belongs to a table called Clicks: $ HGETALL clicks:image_1. Fields called img and count are columns in the table. Configure Spark SQL so it knows that any key starting with clicks: belongs to the Clicks table: $ HMSET clicks:image_test img test count 10000
  • #5: Summarize demo
  • #6: Read then go to next slide
  • #7: What are the functional blocks: Data ingest-collect all clicks without losing any – Redis Cloud w/Streams  free up to 30 MB Spark-Redis into Spark Process data in real-time – Spark w/Structured Streaming for Microbatches Data Querying – some kind of custom chart, leaderboard, or Grafana  Using Spark SQL with Redis Cloud again
  • #11: Cover at high level Connects producers with consumers. May have many of either
  • #12: Redis Streams supports both asynchronous communication and look-back queries
  • #13: How many have used Pub/Sub, Lists or Sorted Sets? Pub/Sub – no lookback queries; all asynchronous. Lists – one list cannot support many consumers. Sorted Sets – solve the copy problem (you don't need a copy for each consumer), but you always have to poll for data. You can use a blocking call, but that transforms it into a list. For streaming you have to poll.
  • #14: Redis Streams manages the life cycle of streaming data effectively (Example: consumer groups and their commands XREADGROUP, XACK and XCLAIM ensure every data object is consumed properly). It offers consumers a rich choice of options from which to consume the data – they can read from where they left off, only the new data, or from the beginning. The lookback queries are super fast as they are powered by radix trees. Kafka or Kinesis have a timeframe limit; Redis has no timeframe – you can cap by max length / size.
  • #15: If you have different types of consumers …
  • #16: Ex: toll booth – can back up – but we can match rate of arrival with rate of departure
  • #18: So this is our stream example
  • #19: clickstream is the stream key Lets redis create the timestamp img is the field So that’s how data ingest works. Any questions?
  • #20: Stop and run query to see latest count scala> sql("select * from clicks").show(); $ DEL clicks:image_test
  • #23: Marketing definition from databricks
  • #24: With Structured Streaming, Spark pulls data in micro-batches (like a table). Every micro-batch has rows, and each micro-batch is like a DataFrame, so you can run DataFrame operations on it. Windowing – can aggregate over the last 30 mins. Can also dedupe. Go to the Databricks website to see more. I'm doing aggregations in my demo. Spark used to only do batches; now in 2.4 there are micro-batches – batches in milliseconds. Continuous processing, which delivers DataFrames in even smaller slices, is still experimental. Finally you can define an output sink, or output to the console. 3 modes: dump everything, append only, or update.
  • #25: Redis Labs developed and supports this open source library. Data source and data sink. Redis Streams. Dump data into Redis. Query Redis from Spark. All written in Scala.
  • #26: How do you use Redis as a data source? There are 4 steps: connect to the Redis db, map Redis Streams key-value pairs to a micro-batch, create the query object, run the query in a loop.
  • #27: Connect to Redis Cloud. Move to next slide.
  • #28: 2. Interpreting the stream data Clickstream is key name Img is field name
  • #29: 3. Defining the query Group by img Count
  • #30: 4. Dumping into a ForEachWriter - Custom – see next slide -
  • #32: Stop and run query to see latest count scala> sql("select * from clicks").show();
  • #33: Any questions? How do you query redis? Can connect ODBC drivers to Spark and can query Redis?
  • #34: How to connect to Redis How to map to Redis? Run Query
  • #35: 1. 2. Create a table (do it only once). Tell it to use class name org.apache.spark.sql.redis Map to table ‘clicks’
  • #36: 3. Run query