WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Roshan Kumar, Redis Labs
@roshankumar
Redis + Structured Streaming:
A Perfect Combination to Scale-out Your
Continuous Applications
#UnifiedAnalytics #SparkAISummit
This Presentation is About…
How to collect and process data streams in real time, at scale
IoT
User Activity
Messages
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuous Applications
http://bit.ly/spark-redis
Breaking up Our Solution into Functional Blocks
Click data
1. Data Ingest: Record all clicks
2. Data Processing: Count clicks in real time
3. Data Querying: Query clicks by asset
Redis Stream → Structured Stream Processing (ClickAnalyzer) → Redis Hash → Spark SQL
1. Data Ingest | 2. Data Processing | 3. Data Querying
The Actual Building Blocks of Our Solution
Click data
1. Data Ingest
Redis Stream → Structured Stream Processing (ClickAnalyzer) → Redis Hash → Spark SQL
1. Data Ingest | 2. Data Processing | 3. Data Querying
Data Ingest using Redis Streams
What is Redis Streams?
Redis Streams in its Simplest Form
Redis Streams Connects Many Producers and Consumers
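As a minimal sketch using the clickstream stream and img field that appear later in this deck, any number of producers append entries with XADD while any number of consumers read them with XREAD:

127.0.0.1:6379> XADD clickstream * img image_1
127.0.0.1:6379> XREAD BLOCK 0 STREAMS clickstream $

XADD appends an entry with an auto-generated ID (*); XREAD BLOCK 0 … $ blocks until new entries arrive after the call.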
Comparing Redis Streams with Redis Pub/Sub, Lists, Sorted Sets
Pub/Sub
• Fire and forget
• No persistence
• No lookback queries

Lists
• Tight coupling between producers and consumers
• Persistence for transient data only
• No lookback queries

Sorted Sets
• Data ordering isn't built-in; the producer controls the order
• No maximum limit
• The data structure is not designed to handle data streams
What is Redis Streams?
It is like Pub/Sub, but with persistence.
It is like Lists, but decouples producers and consumers.
It is like Sorted Sets, but asynchronous.
Plus:
• Lifecycle management of streaming data
• Built-in support for time-series data
• A rich set of options for consumers to read streaming and static data
• Super-fast lookback queries powered by radix trees
• Automatic eviction of data based on an upper limit (lookback and eviction are illustrated with commands right after this list)
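For example, the lookback and eviction bullets above map directly to Redis commands (the IDs and MAXLEN value are illustrative):

127.0.0.1:6379> XRANGE clickstream 1553536458910-0 1553536489620-0
127.0.0.1:6379> XADD clickstream MAXLEN ~ 1000000 * img image_1

XRANGE returns the entries whose IDs (timestamps) fall in the given range; MAXLEN ~ caps the stream at roughly the given number of entries, trimming the oldest.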
Redis Streams Benefits
It enables asynchronous data exchange between producers and consumers, as well as historical range queries
Redis Streams Benefits
With consumer groups, you can scale out and avoid backlogs
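A brief sketch with Redis commands (the group and consumer names are hypothetical):

127.0.0.1:6379> XGROUP CREATE clickstream click-group $
127.0.0.1:6379> XREADGROUP GROUP click-group consumer-1 COUNT 10 STREAMS clickstream >
127.0.0.1:6379> XACK clickstream click-group 1553536458910-0

Each consumer in the group receives a disjoint subset of new entries, and XACK marks an entry as processed so it leaves that consumer's pending list.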
Redis Streams Benefits
Simplifies data collection, processing, and distribution to support complex scenarios
Data Ingest Solution
Redis Stream
1. Data Ingest
Command
xadd clickstream * img [image_id]
Sample data
127.0.0.1:6379> xrange clickstream - +
1) 1) "1553536458910-0"
   2) 1) "image_1"
      2) "1"
2) 1) "1553536469080-0"
   2) 1) "image_3"
      2) "1"
3) 1) "1553536489620-0"
   2) 1) "image_3"
      2) "1"
...
2. Data Processing
Redis Stream → Structured Stream Processing (ClickAnalyzer) → Redis Hash → Spark SQL
1. Data Ingest | 2. Data Processing | 3. Data Querying
Data Processing using Spark’s Structured Streaming
What is Structured Streaming?
“Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.”
Definition
How Does Structured Streaming Work?
Source: Data Stream → Micro-batches as DataFrames (tables) → DataFrame Operations → Output Sink

DataFrame Operations
Selection:     df.select("xyz").where("a > 10")
Filtering:     df.filter(_.a > 10).map(_.b)
Aggregation:   df.groupBy("xyz").count()
Windowing:     df.groupBy(
                 window($"timestamp", "10 minutes", "5 minutes"),
                 $"word"
               ).count()
Deduplication: df.dropDuplicates("guid")
Spark Structured Streaming
Spark-Redis Library
Redis Stream → Structured Stream Processing (ClickAnalyzer) → Redis Hash
Redis Streams as data source; Redis as data sink

• Developed using Scala
• Compatible with Spark 2.3 and higher
• Supports:
  • RDD
  • DataFrames
  • Structured Streaming
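One possible way to pull the library into a project is shown below; the artifact coordinates and version are assumptions, so check the spark-redis repository (http://bit.ly/spark-redis) for the exact ones:

// build.sbt (version shown is illustrative)
libraryDependencies += "com.redislabs" % "spark-redis" % "2.4.0"

// or for an interactive session
// ./bin/spark-shell --packages com.redislabs:spark-redis:2.4.0 --conf spark.redis.host=localhost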
Redis Streams as Data Source
1. Connect to the Redis instance
2. Map Redis Stream to Structured Streaming schema
3. Create the query object
4. Run the query
Code Walkthrough: Redis Streams as Data Source
1. Connect to the Redis instance

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Spark session configured with the Redis connection (host/port are picked up by spark-redis)
val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

// Read the Redis Stream "clickstream" as a streaming DataFrame with a single "img" column
val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

// Count clicks per image
val queryByImg = clickstream.groupBy("img").count
Code Walkthrough: Redis Streams as Data Source
2. Map the Redis Stream to a Structured Streaming schema

val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(      // schema maps the field written by: xadd clickstream * img [image_id]
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count
Code Walkthrough: Redis Streams as Data Source
3. Create the query object

val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

// The query object: a running count of clicks per image
val queryByImg = clickstream.groupBy("img").count
Code Walkthrough: Redis Streams as Data Source
4. Run the query

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count

val clickWriter: ClickForeachWriter = new ClickForeachWriter("localhost", "6379")

val query = queryByImg.writeStream
  .outputMode("update")
  .foreach(clickWriter)   // custom output sink (ClickForeachWriter, shown next)
  .start()

query.awaitTermination()
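One detail not shown on the slide: for fault tolerance, a production job would normally also give the writer a checkpoint location (the path below is hypothetical):

val query = queryByImg.writeStream
  .outputMode("update")
  .option("checkpointLocation", "/tmp/checkpoints/clicks")  // hypothetical path
  .foreach(clickWriter)
  .start()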
Redis as Output Sink
override def process(record: Row): Unit = {
  // Row layout from queryByImg: (img, count)
  val img = record.getString(0)
  val count = record.getLong(1)
  if (jedis == null) {
    connect()
  }
  // Store one Hash per image: clicks:[image] with fields img and count
  jedis.hset("clicks:" + img, "img", img)
  jedis.hset("clicks:" + img, "count", count.toString)
}
Create a custom class that extends ForeachWriter and override its process() method (a fuller sketch of the class follows after the table below)
Save each record as a Hash with the structure:
  clicks:[image]
    img   [image]
    count [count]

Example:
  clicks:image_1001
    img   image_1001
    count 1029
  clicks:image_1002
    img   image_1002
    count 392
  ...

Table: Clicks
  img          count
  image_1001   1029
  image_1002   392
  ...          ...
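Putting the pieces together, a minimal sketch of the full writer class referenced as ClickForeachWriter("localhost","6379") might look like this, assuming the Jedis client implied by the jedis.hset(...) calls above (the open/close handling is an assumption, not from the slides):

import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

class ClickForeachWriter(host: String, port: String) extends ForeachWriter[Row] {

  // Created lazily on each executor; left null until the first record arrives
  var jedis: Jedis = _

  def connect(): Unit = {
    jedis = new Jedis(host, port.toInt)
  }

  override def open(partitionId: Long, epochId: Long): Boolean = true

  override def process(record: Row): Unit = {
    val img = record.getString(0)
    val count = record.getLong(1)
    if (jedis == null) {
      connect()
    }
    // One Hash per image: clicks:[image] with fields img and count
    jedis.hset("clicks:" + img, "img", img)
    jedis.hset("clicks:" + img, "count", count.toString)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) {
      jedis.close()
    }
  }
}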
3. Data Querying
Redis Stream → Structured Stream Processing (ClickAnalyzer) → Redis Hash → Spark SQL
1. Data Ingest | 2. Data Processing | 3. Data Querying
Query Redis using Spark SQL
3 Steps to Query Redis using Spark SQL
1. Initialize Spark Context with Redis
2. Create table
3. Run Query
Redis Hash to SQL mapping
  clicks:image_1001
    img   image_1001
    count 1029
  clicks:image_1002
    img   image_1002
    count 392
  ...

  img          count
  image_1001   1029
  image_1002   392
  ...          ...
1. Initialize

scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder().appName("redis-test").master("local[*]").config("spark.redis.host", "localhost").config("spark.redis.port", "6379").getOrCreate()
scala> val sc = spark.sparkContext
scala> import spark.sql
scala> import spark.implicits._

2. Create table

scala> sql("CREATE TABLE IF NOT EXISTS clicks(img STRING, count INT) USING org.apache.spark.sql.redis OPTIONS (table 'clicks')")

How to Query Redis using Spark SQL
3. Run Query
scala> sql("select * from clicks").show();
+----------+-----+
| img|count|
+----------+-----+
|image_1001| 1029|
|image_1002| 392|
|       ...|  ...|
+----------+-----+
How to Query Redis using Spark SQL
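Since clicks is now a regular table to Spark SQL, other queries work the same way; for example, a hypothetical follow-up query (not from the slides) for the most-clicked images:

scala> sql("SELECT img, `count` FROM clicks WHERE `count` > 500 ORDER BY `count` DESC").show()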
Recap
Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuous Applications
Redis Stream → Structured Stream Processing (ClickAnalyzer) → Redis Hash → Spark SQL
1. Data Ingest | 2. Data Processing | 3. Data Querying
Building Blocks of our Solution
Spark-Redis Library is used for: Redis Streams as data source; Redis as data sink
Questions?
roshan@redislabs.com
@roshankumar
Roshan Kumar
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT