Shay Nativ - Redis Labs
Real-time Machine Learning with
Redis-ML and Apache Spark
Agenda
● Intro to Redis and Redis Labs - 5 min
● Using Redis-ML for Model Serving - why and how - 10 min
● Building a recommendation system using Spark-ML and Redis-ML - 10 min
● Q&A
Redis Labs – Home of Redis
Founded in 2011
HQ in Mountain View CA, R&D center in Tel-Aviv IL
The commercial company behind Open Source Redis
Provider of the Redis Enterprise (Redisᵉ) technology, platform and products
Redis Labs Products

SERVICES
● Redisᵉ Cloud – fully managed Redisᵉ service on hosted servers within AWS, MS Azure, GCP, IBM Softlayer, Heroku, CF & OpenShift
● Redisᵉ Cloud Private – fully managed Redisᵉ service in VPCs within AWS, MS Azure, GCP & IBM Softlayer

SOFTWARE
● Redisᵉ Pack – downloadable Redisᵉ software for any enterprise datacenter or cloud environment
● Managed Redisᵉ Pack – fully managed Redisᵉ Pack in private data centers
A Brief Overview of Redis
● Started in 2009 by Salvatore Sanfilippo
● Most popular KV store
● In memory - disk backed
● Notable Users:
○ Twitter, Netflix, Uber, Groupon, Twitch
○ Many, many more...
Redis Main Differentiations
● Simplicity (through Data Structures)
● Extensibility (through Redis Modules)
● Performance

Data structures: Strings, Lists, Sets, Sorted Sets, Hashes, HyperLogLogs, Bitmaps, Bit fields, Geospatial Indexes
A Quick Recap of Redis
Key → Value
● Strings / Bitmaps / BitFields: "I'm a Plain Text String!" / 00110101 11001110 10101010
● Hash Tables (objects!): { A: "foo", B: "bar", C: "baz" }
● Linked Lists: [ A → B → C → D → E ]
● Sets: { A, B, C, D, E }
● Sorted Sets: { A: 0.1, B: 0.3, C: 100, D: 1337 }
● Geo Sets: { A: (51.5, 0.12), B: (32.1, 34.7) }
● HyperLogLog
Simple Redis Example (string, hash)
127.0.0.1:6379> SET spark summit
OK
127.0.0.1:6379> GET spark
"summit"
127.0.0.1:6379> HMSET spark_hash org apache version 2.1.1
OK
127.0.0.1:6379> HGET spark_hash version
"2.1.1"
127.0.0.1:6379> HGETALL spark_hash
1) "org"
2) "apache"
3) "version"
4) "2.1.1"
Another Simple Redis Example (sorted set)
127.0.0.1:6379> zadd my_sorted_set 1 foo
(integer) 1
127.0.0.1:6379> zadd my_sorted_set 5 bar
(integer) 1
127.0.0.1:6379> zadd my_sorted_set 3 baz
(integer) 1
127.0.0.1:6379> ZRANGE my_sorted_set 0 2
1) "foo"
2) "baz"
3) "bar"
127.0.0.1:6379>
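The ZRANGE result above comes back ordered by ascending score. The same ordering logic can be sketched in plain Python (this only illustrates the ordering, it does not talk to Redis):

```python
# Members of the sorted set, keyed by score, as added with ZADD above.
scores = {"foo": 1, "bar": 5, "baz": 3}

# ZRANGE 0 2 returns members sorted by ascending score.
by_score = sorted(scores, key=scores.get)
print(by_score)  # ['foo', 'baz', 'bar']
```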
What Modules Actually Are
• Dynamic libraries loaded into Redis
• Written in C/C++
• Use a C ABI/API isolating Redis internals
• Use existing or add new data structures
• Near-zero latency access to data
• Add new capabilities: new commands, new data types
Modules: A Revolutionary Approach
Adapt your database to your data, not the other way around
● Crypto Engine Wrapper – secure way to store data in Redis via encrypt/decrypt with various Themis primitives
● Time Series – time series values aggregation in Redis
● Graph – graph database on Redis, based on the Cypher language
● Rate Limiter – based on the Generic Cell Rate Algorithm (GCRA)
● ReJSON – JSON engine on Redis (pre-released)
● Secondary Index/RQL – indexing + SQL-like syntax for querying indexes (pre-released)
● Neural Redis – simple neural network native to Redis
● Redis-ML – machine learning model serving
● RediSearch – full-text search engine in Redis
Redis-ML – Machine Learning Model Server

Spark-ML End-to-End Flow
Data loaded to Spark → Spark training → model saved to Parquet file → batch evaluation → pre-computed results → custom server → client app
ML Models Serving Challenges
• Models are becoming bigger and more complex
• Can be challenging to deploy & serve
• Do not scale well in speed or size
• Can be very expensive
A Simpler Machine Learning Lifecycle
Data loaded into Spark (or any training platform) → Spark training → model saved in Redis-ML → Redis-ML serving → client apps
Redis-ML – ML Serving Engine
• Store training output as “hot model”
• Perform evaluation directly in Redis
• Enjoy the performance, scalability and HA of Redis
ML Models in Redis-ML
● Tree Ensembles
● Linear Regression
● Logistic Regression
● Matrix + Vector Operations
● More to come...
Random Forest Model
• A collection of decision trees
• Supports classification & regression
• A splitter node can be:
◦ Categorical (e.g. day == "Sunday")
◦ Numerical (e.g. age < 43)
• The decision is taken by a majority vote of the decision trees
Titanic Survival Predictor on a Decision Tree
● Sex = Male? NO → Survived
● YES → Age < 9.5? NO → Died
● YES → *Sibsp > 2.5? YES → Died, NO → Survived
*Sibsp = siblings + spouses
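The decision tree above can be written out as a few lines of plain Python. This is only a sketch of the traversal logic shown on the slide, not the Redis-ML representation; the field names (sex, age, sibsp) are illustrative:

```python
# Sketch of the Titanic decision tree above in plain Python.
# Splitter order and thresholds follow the slide; not the Redis-ML API.

def predict(passenger):
    """Walk the decision tree and return 'Survived' or 'Died'."""
    if passenger["sex"] != "male":
        return "Survived"
    if passenger["age"] >= 9.5:
        return "Died"
    # young male: sibsp = siblings + spouses aboard
    if passenger["sibsp"] > 2.5:
        return "Died"
    return "Survived"

print(predict({"sex": "female", "age": 30, "sibsp": 0}))  # Survived
print(predict({"sex": "male", "age": 34, "sibsp": 2}))    # Died
```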
Titanic Survival Predictor on a Random Forest
● Tree #1: Sex = Male? → Age < 9.5? → *Sibsp > 2.5? (leaves: Survived / Died)
● Tree #2: Country = US? → State = CA? → Height > 1.60m? (leaves: Survived / Died)
● Tree #3: Weight < 80kg? → I.Q < 100? → Eye color = blue? (leaves: Survived / Died)
*Sibsp = siblings + spouses
Would John Survive The Titanic?
• John's features:
{male, 34, married + 2, US, CA, 1.78m, 78kg, 110iq, blue eyes}
• Tree #1 – Survived
• Tree #2 – Died
• Tree #3 – Survived
• Random forest decision – Survived
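The majority vote behind that decision is simple to sketch (the per-tree votes are taken from the slide):

```python
# Minimal sketch of the random-forest majority vote described above.
from collections import Counter

tree_votes = ["Survived", "Died", "Survived"]  # Tree #1, #2, #3 for John
decision, count = Counter(tree_votes).most_common(1)[0]
print(decision)  # Survived (2 of 3 trees)
```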
Forest Data Type Example
> MODULE LOAD "./redis-ml.so"
OK
> ML.FOREST.ADD myforest 0 . CATEGORIC sex "male" .L LEAF 1 .R LEAF 0
OK
> ML.FOREST.RUN myforest sex:male
"1"
> ML.FOREST.RUN myforest sex:no_thanx
"0"
Using Redis-ML With Spark
scala> import com.redislabs.client.redisml.MLClient
scala> import com.redislabs.provider.redis.ml.Forest
scala> val jedis = new Jedis("localhost")
scala> val rfModel = pipelineModel.stages.last.asInstanceOf[RandomForest]
// Create a new forest instance
scala> val f = new Forest(rfModel.trees)
// Load the model to redis
scala> f.loadToRedis("forest-test", "localhost")
// Classify a feature vector
scala> jedis.getClient.sendCommand(MLClient.ModuleCommand.FOREST_RUN,
"forest-test", makeInputString(0))
scala> jedis.getClient.getStatusCodeReply
res53: String = 1
Real World Challenge
• Ad serving company
• Need to serve 20,000 ads/sec @ 50msec data-center latency
• Runs 1K campaigns → 1K random forests
• Each forest has 15K trees
• On average each tree is 7 levels deep
• Would require ~1,000 x c4.8xlarge instances
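A quick back-of-envelope check of that workload, assuming each ad request evaluates one full forest (an assumed interpretation of the numbers above):

```python
# Back-of-envelope arithmetic for the ad-serving workload above.
requests_per_sec = 20_000      # ads served per second
trees_per_forest = 15_000      # trees in each random forest
comparisons_per_tree = 7       # average tree depth

comparisons_per_sec = requests_per_sec * trees_per_forest * comparisons_per_tree
print(f"{comparisons_per_sec:,}")  # 2,100,000,000 node comparisons per second
```

Billions of node comparisons per second under a 50 ms budget is why a naive serving tier needs on the order of a thousand large instances.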
Redis-ML with Spark ML
Classification time: ~40x faster than evaluating in Spark
Real World Example: Movie Recommendation System
Overview: train with Spark, serve the model from Redis-ML
Concept: One Forest For Each Movie
The user's features (age, gender, movie ratings) are run through a separate forest per movie, and each forest returns a predicted rating for its movie, e.g. Movie_1 forest → 3, Movie_2 forest → 2, ... Movie_n forest → 5.
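Picking the recommendation from those per-movie predictions is then a one-line argmax. The values below are the illustrative outputs from the slide, not real forest evaluations:

```python
# Sketch of the per-movie-forest idea: keep the movie whose forest predicts
# the highest rating. These predictions are stand-ins for real forest outputs.
predicted = {"Movie_1": 3, "Movie_2": 2, "Movie_n": 5}
best = max(predicted, key=predicted.get)
print(best)  # Movie_n
```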
The Tools
● Transform: gen_data.py (Python)
● Train: Spark ML
● Classify: Redis + Redis-ML
● Containers: Docker
Using the Dockers
$ docker pull shaynativ/redis-ml
$ docker run --net=host shaynativ/redis-ml &
$
$ docker pull shaynativ/spark-redis-ml
$ docker run --net=host shaynativ/spark-redis-ml
Step 1: Get The Data
• Download and extract the MovieLens 100K Dataset
• The data is organized in separate files:
• Ratings: user id | item id | rating (1-5) | timestamp
• Item (movie) info: movie id | genre info fields (1/0)
• User info: user id | age | gender | occupation
• Our classifier should return the expected rating (from 1 to 5) a user would give the movie in question
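For reference, the ratings file in MovieLens 100K (u.data) is tab-separated, so one row can be parsed with a simple split. The sample line is taken from the dataset distribution; treat the exact values as illustrative:

```python
# Parse one line of the MovieLens 100K ratings file (u.data):
# user id \t item id \t rating (1-5) \t timestamp
def parse_rating(line):
    user_id, item_id, rating, timestamp = line.rstrip("\n").split("\t")
    return int(user_id), int(item_id), int(rating), int(timestamp)

print(parse_rating("196\t242\t3\t881250949"))  # (196, 242, 3, 881250949)
```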
Step 2: Transform
• The training data for each movie should contain 1 line per user:
• class (rating from 1 to 5 the user gave to this movie)
• user info (age, gender, occupation)
• user ratings of other movies (movie_id:rating ...)
• user genre rating averages (genre:avg_score ...)
• Run gen_data.py to transform the files to the desired format
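The per-user row described above can be sketched as a label followed by feature:value pairs. This is a hypothetical illustration of the row shape; gen_data.py's exact output format is assumed, not taken from its source:

```python
# Hypothetical sketch of one training row: label (the user's rating for the
# target movie) plus feature:value pairs (user info, other-movie ratings,
# genre averages). The exact gen_data.py format is assumed.
def make_row(label, features):
    # features: ordered (feature_id, value) pairs
    pairs = ",".join(f"{fid}:{val}" for fid, val in features)
    return f"{label} {pairs}"

row = make_row(4, [(1, 34), (2, 1), (12, 1.0), (1801, 0.2)])
print(row)  # 4 1:34,2:1,12:1.0,1801:0.2
```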
Step 3: Train and Load to Redis

// Create a new forest instance
val rf = new RandomForestClassifier()
  .setFeatureSubsetStrategy("auto")
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(500)
...
// Train model
val model = pipeline.fit(trainingData)
...
val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
// Load the model to redis
val f = new Forest(rfModel.trees)
f.loadToRedis("movie-10", "127.0.0.1")
Step 4: Execute in Redis
The trained model now lives in Redis-ML; client apps query it there directly.
Python Client Example
>> import redis
>> config = {"host":"localhost", "port":6379}
>> r = redis.StrictRedis(**config)
>> user_profile = r.get("user_shay_profile")
>> print(user_profile)
12:1.0,13:1.0,14:3.0,15:1.0,17:1.0,18:1.0,19:1.0,20:1.0,23:1.0,24:5.0,1.0,115:1.0,116:2.0,117:2.0,119:1.0,120:4.0,121:2.0,122:2.0,
........
1360:1.0,1361:1.0,1362:1.0,
1701:6.0,1799:435.0,1801:0.2,1802:0.11,1803:0.04,1812:0.04,1813:0.07,1814:0.24,1815:0.09,1816:0.32,1817:0.06
>> r.execute_command("ML.FOREST.RUN", "movie-10", user_profile)
'3'
Redis CLI Example
127.0.0.1:6379> KEYS *
 1) "movie-5"
 2) "movie-1"
........
 8) "movie-6"
 9) "movie-4"
10) "movie-10"
11) "user_1_profile"
127.0.0.1:6379> ML.FOREST.RUN movie-10 12:1.0,13:1.0,,332:3.0,333:1.0,334:1.0,335:2.0,336:1.0,357:2.0,358:1.0,359:1.0,362:1.0,367:1.
........
,410:3.0,411:2.0,412:2.0,423:1.0,454:1.0,455:1.0,456:1.0,457:3.0,458:1.0,459:1.0,470:1
"3"
Performance
Redis time: 0.635129ms, res=3
Spark time: 46.657662ms, res=3.0
---------------------------------------
Redis time: 0.644444ms, res=3
Spark time: 49.028983ms, res=3.0
---------------------------------------
Classification averages:
redis: 0.9401250000000001 ms
spark: 58.01970206666667 ms
ratio: 61.71488053893542
diffs: 0.0
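The "ratio" line above is just the Spark average divided by the Redis average (values copied from the benchmark output):

```python
# Quick check of the benchmark's "ratio" line: Spark avg / Redis avg.
redis_avg_ms = 0.9401250000000001
spark_avg_ms = 58.01970206666667

ratio = spark_avg_ms / redis_avg_ms
print(round(ratio, 2))  # 61.71
```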
Getting Actual Recommendations - Python Script

#!/usr/bin/env python
import operator
import redis

config = {"host": "localhost", "port": 6379}
r = redis.StrictRedis(**config)
user_profile = r.get("user-1-profile")
results = {}
for i in range(1, 11):
    results[i] = r.execute_command("ML.FOREST.RUN", "movie-{}".format(i), user_profile)
print("Movies sorted by scores:")
sorted_results = sorted(results.items(), key=operator.itemgetter(1), reverse=True)
for k, v in sorted_results:
    print("movie-{}:{}".format(k, v))
print("")
print("Recommended movie: movie-{}".format(sorted_results[0][0]))
Getting Actual Recommendations - Results
$ ./classify_user.py 1
Movies sorted by scores:
movie-4:3
movie-3:2
movie-6:2
movie-7:2
movie-8:2
movie-9:2
movie-1:1
movie-2:1
movie-5:1
movie-10:0
Recommended movie for user 1: movie-4
Summary
• Train with Spark, serve with Redis
• ~97% lower serving resource cost
• Simplified ML lifecycle
• Redisᵉ (Cloud or Pack):
‒ Scaling, HA, performance
‒ PAYG – cost optimized
‒ Ease of use
‒ Supported by the teams who created Spark and Redis
Resources
● Redis-ML: https://github.com/RedisLabsModules/redis-ml
● Spark-Redis-ML: https://github.com/RedisLabs/spark-redis-ml
● Databricks Notebook: http://bit.ly/sparkredisml
● Dockers: https://hub.docker.com/r/shaynativ/redis-ml/
https://hub.docker.com/r/shaynativ/spark-redis-ml/
Q&A
Thank You.
shay@redislabs.com