Hunter Kelly
@retnuh
Real-Time Domain Rankings
with Kafka Streams
What are we doing?
● Discovering relevant domains in the Fashion Web
● Use modified HITS - a.k.a. Hubs & Authorities
Why use Kafka Streams?
● Why not Spark, Hadoop, Whatevs?
● Advantages of Kafka Streams
Show me the nitty gritty!
● Okay!
The Fashion Web
Curated list of Fashion-related sites
● Initial lists from domain experts
● Support multiple languages
● What are the next “best” domains?
cosmopolitan
stylecaster
instyle
racked
harpersbazaar
elle
vogue
hypebeast
Need Data-driven method
● Kleinberg’s HITS* algorithm
● Allows us to discover Hubs and Authorities
● Flattened to domain level
* Hyperlink-Induced Topic Search, but no-one seems to call it that
http://www.cs.cornell.edu/home/kleinber/auth.pdf
http://en.wikipedia.org/wiki/HITS_algorithm
HITS in a nutshell
Adjacency Matrix
Image courtesy of
http://faculty.ycp.edu/~dbabcock/PastCourses/cs360/lectures/lecture15.html
HITS in a nutshell
● Why not just counts?
● If you point to good sites, you’re a better Hub
● If good sites point to you, you’re a better Authority
● Each iteration, the weights change, until they reach convergence
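The iteration described above can be sketched in plain Java. This is a minimal, textbook version of HITS on a small adjacency matrix, not the modified variant the talk uses; the class name `HitsSketch`, the toy graph, and the tolerance are assumptions made for illustration.

```java
import java.util.Arrays;

public class HitsSketch {

    /**
     * Runs textbook HITS on an adjacency matrix where a[i][j] != 0 means
     * node i links to node j. Returns {hubs, authorities}, each L2-normalised.
     */
    public static double[][] hits(int[][] a, int maxIterations, double tolerance) {
        int n = a.length;
        double[] hubs = new double[n];
        double[] auths = new double[n];
        Arrays.fill(hubs, 1.0);
        Arrays.fill(auths, 1.0);

        for (int iter = 0; iter < maxIterations; iter++) {
            double[] newAuths = new double[n];
            double[] newHubs = new double[n];

            // Authority of j = sum of the hub scores of the nodes pointing to j.
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (a[i][j] != 0) newAuths[j] += hubs[i];
                }
            }
            normalise(newAuths);

            // Hub of i = sum of the (updated) authority scores of the nodes i points to.
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (a[i][j] != 0) newHubs[i] += newAuths[j];
                }
            }
            normalise(newHubs);

            // Stop once the weights have effectively stopped changing.
            double change = 0.0;
            for (int k = 0; k < n; k++) {
                change += Math.abs(newHubs[k] - hubs[k]) + Math.abs(newAuths[k] - auths[k]);
            }
            hubs = newHubs;
            auths = newAuths;
            if (change < tolerance) break;
        }
        return new double[][] { hubs, auths };
    }

    /** Scales v to unit L2 length; leaves an all-zero vector untouched. */
    private static void normalise(double[] v) {
        double sumSq = 0.0;
        for (double x : v) sumSq += x * x;
        double norm = Math.sqrt(sumSq);
        if (norm == 0.0) return;
        for (int i = 0; i < v.length; i++) v[i] /= norm;
    }

    public static void main(String[] args) {
        // Hypothetical toy graph: node 0 links to 2 and 3, node 1 links to 2.
        int[][] links = {
            {0, 0, 1, 1},
            {0, 0, 1, 0},
            {0, 0, 0, 0},
            {0, 0, 0, 0}
        };
        double[][] result = hits(links, 100, 1e-9);
        System.out.println("hubs        = " + Arrays.toString(result[0]));
        System.out.println("authorities = " + Arrays.toString(result[1]));
    }
}
```

In the toy graph, node 2 is linked from both hubs, so it ends up with a higher authority score than node 3; a plain inbound-link count would also rank it first here, but the scores diverge from counts once hubs differ in quality.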
Common Questions
● What about non-fashion domains?
● Why not PageRank?
Kafka Streams
Please tell me you aren’t using Kafka
to do Matrix multiplication!
No, I am not crazy enough* to use Kafka to do
matrix multiplication!
*Although, I probably did spend too much
time thinking about it on the bus!
Why not use Map-Reduce?
Basically what it was invented for, right?
You could use Hadoop...
You could use Spark…
● High infrastructure overhead if not using it for anything else
● Bad initial experience w/ Spark Streaming snapshotting and recovery
● Already using Kafka!
Why use Kafka Streams?
● Has primitives necessary for Map-Reduce
○ Map step: groupBy, groupByKey
○ Reduce step: reduce, aggregate
● Focus is on your data, not distributed computing machinery
● Streaming allows us to have (near) real-time, up-to-date data
The Nitty Gritty
Kafka Streams 101
● No explicit consumers/producers - plumbing handled for you
● Topics are still a fundamental communication piece
● Think of each individual datum flowing through KStreams & KTables
Kafka Streams 101 - KStream
● Focus is on specific functional transformations - map, filter, flatMap
● Also supports various flavours of joins with other KStreams
● Usually created from one or more topics or a transformation on another KStream
Kafka Streams 101 - KTable
● Still offers functional transforms, but on a primary-keyed table
● Offers persistent storage
● Created from aggregations on a KStream or transforms on other KTables
Kafka Streams 101 - KGroupedStream
● The bridge between KStream and KTable
● Created by doing a groupBy or groupByKey on a KStream
● Create KTables by doing reduce, aggregate, count
Data Flow
[Diagram: Input Topic; Domain Link Extractor (KStream ops: flatMap); KStream; Domain Reducer (KTable ops: groupByKey, reduce, toStream); Output Topic (log compacted); HITS Calculator & API]
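The operations named in the data flow (flatMap, groupByKey, reduce, toStream) might be wired together roughly as follows. This is a sketch against the current Kafka Streams DSL (`StreamsBuilder`, `Grouped`), which postdates the talk; the topic names `pages` and `domain-links`, the space-separated value format, and the class name are assumptions, and both stages are collapsed into a single topology for brevity. It needs the `kafka-streams` library on the classpath.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class DomainFlowSketch {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Assumed input format: key = source domain,
        // value = space-separated target domains found on one page.
        KStream<String, String> pages =
            builder.stream("pages", Consumed.with(Serdes.String(), Serdes.String()));

        // "Map" step: flatMap each page into one record per (source, target) link.
        KStream<String, String> links = pages.flatMap((source, targets) -> {
            List<KeyValue<String, String>> out = new ArrayList<>();
            for (String target : targets.split("\\s+")) {
                if (!target.isEmpty()) out.add(KeyValue.pair(source, target));
            }
            return out;
        });

        // "Reduce" step: group by source domain and fold the targets into a
        // sorted, de-duplicated, space-separated list held in a KTable.
        links.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
             .reduce((acc, next) -> {
                 TreeSet<String> merged = new TreeSet<>(Arrays.asList(acc.split(" ")));
                 merged.add(next);
                 return String.join(" ", merged);
             })
             // Emit every table update as a stream record. Log compaction is a
             // setting on the output topic itself, not something the DSL sets here.
             .toStream()
             .to("domain-links", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```

With a compacted output topic, a downstream consumer only needs the latest record per source domain to rebuild the current link table.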
Current
Domain Link Extractor
[Diagram: Input Topic; Fetch Content; External Links; Resolve Links; Extract Domains; KStream ops: flatMap; KStream; Extracted Link Resolver]
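The resolve-and-extract steps behave like a flatMap: one page goes in, zero or more (source domain, target domain) pairs come out. A plain-Java sketch of that function follows; the class and method names are hypothetical, and stripping a leading `www.` is a deliberately naive stand-in for proper domain flattening (a real extractor would more likely consult a public-suffix list).

```java
import java.net.URI;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class DomainLinkSketch {

    /**
     * Turns one page and the raw hrefs found on it into
     * (sourceDomain, targetDomain) pairs, keeping external links only.
     */
    public static List<Map.Entry<String, String>> domainLinks(String pageUrl, List<String> hrefs) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        URI page = URI.create(pageUrl);
        String source = domainOf(page);
        if (source == null) return out;

        for (String href : hrefs) {
            URI resolved;
            try {
                // Resolve relative links against the page URL.
                resolved = page.resolve(href);
            } catch (IllegalArgumentException e) {
                continue; // skip hrefs that are not valid URIs
            }
            String target = domainOf(resolved);
            // Links without a host (e.g. mailto:) and internal links are dropped.
            if (target != null && !target.equals(source)) {
                out.add(new AbstractMap.SimpleImmutableEntry<>(source, target));
            }
        }
        return out;
    }

    /** Lower-cases the host and strips a leading "www." (naive flattening). */
    static String domainOf(URI uri) {
        String host = uri.getHost();
        if (host == null) return null;
        host = host.toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }
}
```

Because the function returns a list per input, it can be dropped into a `flatMap` call by converting each entry into a key-value record.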
Domain Link Extractor
[Diagram: Input Topic; Fetch Content; External Links; Resolve Links; Extract Domains; KStream ops: flatMap (×3); Output Topic]
Domain Link Extractor - With Fork
[Diagram: Input Topic; Fetch Content; External Links; Resolve Links; Extract Domains; KStream ops: flatMap; KStream; Other Function(s); KStream, Topic, Whatev’s]
So what’s the point?
TL;DR
● Kafka Streams can help solve problems in your application domain
● Focus on your data!
● Naturally decompose the problem into flexible microservices
Hunter Kelly
@retnuh
https://github.com/retnuh

Kafka Summit SF 2017 - Real-Time Document Rankings with Kafka Streams