SlideShare a Scribd company logo
Joining Infinity — Windowless Stream Processing with Flink
Sanjar Akhmedov, Software Engineer, ResearchGate
It started when two researchers discovered first-
hand that collaborating with a friend or colleague on
the other side of the world was no easy task. There are many variations of
passages of Lorem Ipsum
ResearchGate is a social
network for scientists.
Connect the world of science.
Make research open to all.
Structured system
There are many variations of
passages of Lorem Ipsum
We have, and are
continuing to change
how scientific
knowledge is shared and
discovered.
11,000,000+
Members
110,000,000+
Publications
1,300,000,000+
Citations
Feature: Research Timeline
Feature: Research Timeline
Diverse data sources
Proxy
Frontend
Services
memcache MongoDB Solr PostgreSQL
Infinispan HBaseMongoDB Solr
Big data pipeline
Change
data
capture
Import
Hadoop cluster
Export
Data Model
Account Publication
Claim
1 *
Author
Authorship
1*
Hypothetical SQL
Publication
Authorship
1*
CREATE TABLE publications (
id SERIAL PRIMARY KEY,
author_ids INTEGER[]
);
Account
Claim
1 *
Author
CREATE TABLE accounts (
id SERIAL PRIMARY KEY,
claimed_author_ids INTEGER[]
);
CREATE MATERIALIZED VIEW account_publications
REFRESH FAST ON COMMIT
AS
SELECT
accounts.id AS account_id,
publications.id AS publication_id
FROM accounts
JOIN publications
ON ANY (accounts.claimed_author_ids) = ANY (publications.author_ids);
• Data sources are distributed across different DBs
• Dataset doesn’t fit in memory on a single machine
• Join process must be fault tolerant
• Deploy changes fast
• Up-to-date join result in near real-time
• Join result must be accurate
Challenges
Change data capture (CDC)
User Microservice DB
Request Write
Cache
Sync
Solr/ES
Sync
HBase/HDFS
Sync
Change data capture (CDC)
User Microservice DB
Request Write
Log
K2
1
K1
4
Extract
Change data capture (CDC)
User Microservice DB
Request Write
Log
K2
1
K1
4
K1
Ø
Extract
Change data capture (CDC)
User Microservice DB
Request Write
Log
K2
1
K1
4
K1
Ø
KN
42
…
Extract
Change data capture (CDC)
User Microservice DB
Cache
Request Write
Log
K2
1
K1
4
K1
Ø
KN
42
…
Extract
Sync
Change data capture (CDC)
User Microservice DB
Cache
Request Write
Log
K2
1
K1
4
K1
Ø
KN
42
…
Extract
Sync
HBase/HDFSSolr/ES
Join two CDC streams into one
NoSQL1
SQL Kafka
Kafka
Flink Streaming
Join
Kafka NoSQL2
Flink job topology
Accounts
Stream
Join(CoFlatMap)
Account
Publications
Publications
Stream
…
Author 2
Author 1
Author N
DataStream<Account> accounts = kafkaTopic("accounts");
DataStream<Publication> publications = kafkaTopic("publications");
DataStream<AccountPublication> result = accounts.connect(publications)
.keyBy("claimedAuthorId", "publicationAuthorId")
.flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {
transient ValueState<String> authorAccount;
transient ValueState<String> authorPublication;
public void open(Configuration parameters) throws Exception {
authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null));
authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null));
}
public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
authorAccount.update(account.id);
if (authorPublication.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
authorPublication.update(publication.id);
if (authorAccount.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
});
Prototype implementation
DataStream<Account> accounts = kafkaTopic("accounts");
DataStream<Publication> publications = kafkaTopic("publications");
DataStream<AccountPublication> result = accounts.connect(publications)
.keyBy("claimedAuthorId", "publicationAuthorId")
.flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {
transient ValueState<String> authorAccount;
transient ValueState<String> authorPublication;
public void open(Configuration parameters) throws Exception {
authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null));
authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null));
}
public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
authorAccount.update(account.id);
if (authorPublication.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
authorPublication.update(publication.id);
if (authorAccount.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
});
Prototype implementation
DataStream<Account> accounts = kafkaTopic("accounts");
DataStream<Publication> publications = kafkaTopic("publications");
DataStream<AccountPublication> result = accounts.connect(publications)
.keyBy("claimedAuthorId", "publicationAuthorId")
.flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {
transient ValueState<String> authorAccount;
transient ValueState<String> authorPublication;
public void open(Configuration parameters) throws Exception {
authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null));
authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null));
}
public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
authorAccount.update(account.id);
if (authorPublication.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
authorPublication.update(publication.id);
if (authorAccount.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
});
Prototype implementation
DataStream<Account> accounts = kafkaTopic("accounts");
DataStream<Publication> publications = kafkaTopic("publications");
DataStream<AccountPublication> result = accounts.connect(publications)
.keyBy("claimedAuthorId", "publicationAuthorId")
.flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {
transient ValueState<String> authorAccount;
transient ValueState<String> authorPublication;
public void open(Configuration parameters) throws Exception {
authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null));
authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null));
}
public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
authorAccount.update(account.id);
if (authorPublication.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
authorPublication.update(publication.id);
if (authorAccount.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
});
Prototype implementation
DataStream<Account> accounts = kafkaTopic("accounts");
DataStream<Publication> publications = kafkaTopic("publications");
DataStream<AccountPublication> result = accounts.connect(publications)
.keyBy("claimedAuthorId", "publicationAuthorId")
.flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {
transient ValueState<String> authorAccount;
transient ValueState<String> authorPublication;
public void open(Configuration parameters) throws Exception {
authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null));
authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null));
}
public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
authorAccount.update(account.id);
if (authorPublication.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
authorPublication.update(publication.id);
if (authorAccount.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
});
Prototype implementation
DataStream<Account> accounts = kafkaTopic("accounts");
DataStream<Publication> publications = kafkaTopic("publications");
DataStream<AccountPublication> result = accounts.connect(publications)
.keyBy("claimedAuthorId", "publicationAuthorId")
.flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {
transient ValueState<String> authorAccount;
transient ValueState<String> authorPublication;
public void open(Configuration parameters) throws Exception {
authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null));
authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null));
}
public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
authorAccount.update(account.id);
if (authorPublication.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
authorPublication.update(publication.id);
if (authorAccount.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
});
Prototype implementation
DataStream<Account> accounts = kafkaTopic("accounts");
DataStream<Publication> publications = kafkaTopic("publications");
DataStream<AccountPublication> result = accounts.connect(publications)
.keyBy("claimedAuthorId", "publicationAuthorId")
.flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {
transient ValueState<String> authorAccount;
transient ValueState<String> authorPublication;
public void open(Configuration parameters) throws Exception {
authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null));
authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null));
}
public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
authorAccount.update(account.id);
if (authorPublication.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
authorPublication.update(publication.id);
if (authorAccount.value() != null) {
out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
}
}
});
Prototype implementation
Example dataflow
Account Publications
Accounts
Alice 2
Publications
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Author 2
Author N
…
Example dataflow
Account Publications
Accounts
Alice 2
Publications
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Author 2
Author N
…
Example dataflow
Account Publications
Accounts
Alice 2
Publications
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Author 2
Alice
Author N
…
Example dataflow
Account Publications
Accounts
Alice 2
Bob 1
Publications
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Author 2
Alice
Author N
…
Example dataflow
Account Publications
Accounts
Alice 2
Bob 1
Publications
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Author 2
Alice
Author N
…
(Bob, 1)
Example dataflow
Account Publications
Accounts
Alice 2
Bob 1
Publications
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob
Author 2
Alice
Author N
…
(Bob, 1)
Example dataflow
Account Publications
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob
Author 2
Alice
Author N
…
Example dataflow
Account Publications
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob
Author 2
Alice
Author N
…
Example dataflow
Account Publications
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Example dataflow
Account Publications
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Example dataflow
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
• ✔ Data sources are distributed across different DBs
• ✔ Dataset doesn’t fit in memory on a single machine
• ✔ Join process must be fault tolerant
• ✔ Deploy changes fast
• ✔ Up-to-date join result in near real-time
• ? Join result must be accurate
Challenges
Paper1 gets deleted
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 Ø
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Paper1 gets deleted
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 Ø
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Paper1 gets deleted
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 Ø
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
?
Paper1 gets deleted
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 Ø
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Need previous
value
Paper1 gets deleted
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 Ø
Accounts
Stream
Join
Account
Publications
Diff with
Previous
State
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Paper1 gets deleted
Account Publications
K1 (Bob, Paper1)
K1 Ø
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 Ø
Accounts
Stream
Join
Account
Publications
Diff with
Previous
State
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Paper1 gets deleted
Account Publications
K1 (Bob, Paper1)
K1 Ø
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 Ø
Accounts
Stream
Join
Account
Publications
Diff with
Previous
State
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Need K1 here,
e.g. K1 = 𝒇(Bob, Paper1)
Paper1 gets updated
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 2
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Paper1 gets updated
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 2
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice
Author N
…
Paper1 gets updated
Account Publications
K1 (Bob, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 2
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice Paper1
Author N
…
Paper1 gets updated
Account Publications
K1 (Bob, Paper1)
(Alice, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 2
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice Paper1
Author N
…
(Alice, Paper1)
Paper1 gets updated
Account Publications
K1 (Bob, Paper1)
?? (Alice, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 2
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice Paper1
Author N
…
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
Accounts
Alice 2
Bob 1
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice Paper1
Author N
…
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
Accounts
Alice 2
Bob 1
Bob Ø
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice Paper1
Author N
…
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
Accounts
Alice 2
Bob 1
Bob Ø
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Bob Paper1
Author 2
Alice Paper1
Author N
…
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
K1 Ø
Accounts
Alice 2
Bob 1
Bob Ø
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Ø Paper1
Author 2
Alice Paper1
Author N
…
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
K1 Ø
Accounts
Alice 2
Bob 1
Bob Ø
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Ø Paper1
Author 2
Alice Paper1
Author N
…
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
K1 Ø
Accounts
Alice 2
Bob 1
Bob Ø
Alice 1
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Ø Paper1
Author 2
Alice Paper1
Author N
…
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
K1 Ø
Accounts
Alice 2
Bob 1
Bob Ø
Alice 1
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Ø Paper1
Author 2
Alice Paper1
Author N
…
2. (Alice, 1)
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
K1 Ø
Accounts
Alice 2
Bob 1
Bob Ø
Alice 1
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Alice Paper1
Author 2
Ø Paper1
Author N
…
2. (Alice, 1)
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
K1 Ø
Accounts
Alice 2
Bob 1
Bob Ø
Alice 1
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Alice Paper1
Author 2
Ø Paper1
Author N
…
2. (Alice, 1)
(Alice, Paper1)
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
K1 Ø
K2 (Alice, Paper1)
K2 Ø
Accounts
Alice 2
Bob 1
Bob Ø
Alice 1
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Alice Paper1
Author 2
Ø Paper1
Author N
…
2. (Alice, 1)
(Alice, Paper1)
Alice claims Paper1 via different author
Account Publications
K1 (Bob, Paper1)
K2 (Alice, Paper1)
K1 Ø
K3 (Alice, Paper1)
K2 Ø
Accounts
Alice 2
Bob 1
Bob Ø
Alice 1
Publications
Paper1 1
Paper1 (1, 2)
Accounts
Stream
Join
Account
Publications
Publications
Stream
Author 1
Alice Paper1
Author 2
Ø Paper1
Author N
…
2. (Alice, 1)
(Alice, Paper1)
Pick correct natural IDs
e.g. K3 = 𝒇(Alice, Author1, Paper1)
• Keep previous element state to update
previous join result
• Stream elements are not domain entities
but commands such as delete or upsert
• Joined stream must have natural IDs
to propagate deletes and updates
How to solve deletes and updates
Generic join graph
Account
Publications
Accounts
Stream
Publications
Stream
Diff
Alice
Bob
…
Diff
Paper1
PaperN
…
Join
Author1
AuthorM
…
Generic join graph
Operate on commands
Account
Publications
Accounts
Stream
Publications
Stream
Diff
Alice
Bob
…
Diff
Paper1
PaperN
…
Join
Author1
AuthorM
…
Memory requirements
Account
Publications
Accounts
Stream
Publications
Stream
Diff
Alice
Bob
…
Diff
Paper1
PaperN
…
Join
Author1
AuthorM
…
Full copy of
Accounts stream
Full copy of
Publications
stream
Full copy of
Accounts stream
on left side
Full copy of
Publications stream
on right side
Network load
Account
Publications
Accounts
Stream
Publications
Stream
Diff
Alice
Bob
…
Diff
Paper1
PaperN
…
Join
Author1
AuthorM
…
Reshuffle
Reshuffle
Network
Network
• In addition to handling Kafka traffic we need to reshuffle all
data twice over the network
• We need to keep two full copies of each joined stream in
memory
Resource considerations
Questions
We are hiring - www.researchgate.net/careers
Apache Flink Meetup:  Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink

More Related Content

What's hot (20)

PPTX
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
PDF
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward
 
PDF
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
ucelebi
 
PPTX
Kostas Tzoumas - Apache Flink®: State of the Union and What's Next
Ververica
 
PPTX
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Flink Forward
 
PDF
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward
 
PPTX
Apache Flink @ NYC Flink Meetup
Stephan Ewen
 
PPTX
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Flink Forward
 
PDF
Flink Streaming Berlin Meetup
Márton Balassi
 
PPTX
Continuous Processing with Apache Flink - Strata London 2016
Stephan Ewen
 
PPTX
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward
 
PPTX
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...
Flink Forward
 
PPTX
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...
Flink Forward
 
PPTX
The Stream Processor as the Database - Apache Flink @ Berlin buzzwords
Stephan Ewen
 
PPTX
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward
 
PPTX
Apache Flink Berlin Meetup May 2016
Stephan Ewen
 
PDF
A look at Flink 1.2
Stefan Richter
 
PPTX
Apache Flink at Strata San Jose 2016
Kostas Tzoumas
 
PDF
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
Timo Walther
 
PDF
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Ververica
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
ucelebi
 
Kostas Tzoumas - Apache Flink®: State of the Union and What's Next
Ververica
 
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Flink Forward
 
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward
 
Apache Flink @ NYC Flink Meetup
Stephan Ewen
 
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Flink Forward
 
Flink Streaming Berlin Meetup
Márton Balassi
 
Continuous Processing with Apache Flink - Strata London 2016
Stephan Ewen
 
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward
 
Flink Forward SF 2017: Stephan Ewen - Convergence of real-time analytics and ...
Flink Forward
 
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...
Flink Forward
 
The Stream Processor as the Database - Apache Flink @ Berlin buzzwords
Stephan Ewen
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward
 
Apache Flink Berlin Meetup May 2016
Stephan Ewen
 
A look at Flink 1.2
Stefan Richter
 
Apache Flink at Strata San Jose 2016
Kostas Tzoumas
 
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
Timo Walther
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward
 

Viewers also liked (20)

PDF
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Ververica
 
PPTX
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Ververica
 
PDF
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Ververica
 
PDF
Real-time analytics as a service at King
Gyula Fóra
 
PPTX
Kostas Tzoumas - Stream Processing with Apache Flink®
Ververica
 
PPTX
Stream Analytics with SQL on Apache Flink
Fabian Hueske
 
PPTX
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
confluent
 
PPTX
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Ververica
 
PPTX
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Slim Baltagi
 
PDF
Dive into Spark Streaming
Gerard Maas
 
PPTX
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
Fabian Hueske
 
PPTX
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Fabian Hueske
 
PPTX
Robert Metzger - Apache Flink Community Updates November 2016 @ Berlin Meetup
Ververica
 
PPTX
Taking a look under the hood of Apache Flink's relational APIs.
Fabian Hueske
 
PPTX
Apache Flink Community Updates November 2016 @ Berlin Meetup
Robert Metzger
 
PDF
K. Tzoumas & S. Ewen – Flink Forward Keynote
Flink Forward
 
PDF
Alexander Kolb – Flink. Yet another Streaming Framework?
Flink Forward
 
PDF
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Flink Forward
 
PDF
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Dan Halperin
 
PDF
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Till Rohrmann
 
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
Ververica
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Ververica
 
Stefan Richter - A look at Flink 1.2 and beyond @ Berlin Meetup
Ververica
 
Real-time analytics as a service at King
Gyula Fóra
 
Kostas Tzoumas - Stream Processing with Apache Flink®
Ververica
 
Stream Analytics with SQL on Apache Flink
Fabian Hueske
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
confluent
 
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Ververica
 
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi
Slim Baltagi
 
Dive into Spark Streaming
Gerard Maas
 
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing
Fabian Hueske
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Fabian Hueske
 
Robert Metzger - Apache Flink Community Updates November 2016 @ Berlin Meetup
Ververica
 
Taking a look under the hood of Apache Flink's relational APIs.
Fabian Hueske
 
Apache Flink Community Updates November 2016 @ Berlin Meetup
Robert Metzger
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
Flink Forward
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Flink Forward
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Flink Forward
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Dan Halperin
 
Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForwa...
Till Rohrmann
 
Ad

Similar to Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink (20)

PDF
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
HostedbyConfluent
 
PDF
Open west 2015 talk ben coverston
bcoverston
 
PPTX
Apache Flink Training: DataStream API Part 2 Advanced
Flink Forward
 
PPTX
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
PPTX
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
PDF
Apache Flink Stream Processing
Suneel Marthi
 
PDF
CDC Stream Processing with Apache Flink
Timo Walther
 
PDF
Tapping into Scientific Data with Hadoop and Flink
Michael Häusler
 
PPTX
Stream processing - Apache flink
Renato Guimaraes
 
PDF
Apache Flink - a Gentle Start
Liangjun Jiang
 
PPTX
Advanced
mxmxm
 
PDF
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
Flink Forward
 
PPTX
January 2016 Flink Community Update & Roadmap 2016
Robert Metzger
 
PDF
Streaming Dataflow with Apache Flink
huguk
 
PPTX
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward
 
PDF
Deeply Declarative Data Pipelines
HostedbyConfluent
 
PDF
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
PDF
Changelog Stream Processing with Apache Flink
Flink Forward
 
PPTX
Real Time Visibility with Flink
Rafi Aroch
 
PDF
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
confluent
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
HostedbyConfluent
 
Open west 2015 talk ben coverston
bcoverston
 
Apache Flink Training: DataStream API Part 2 Advanced
Flink Forward
 
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
Apache Flink Stream Processing
Suneel Marthi
 
CDC Stream Processing with Apache Flink
Timo Walther
 
Tapping into Scientific Data with Hadoop and Flink
Michael Häusler
 
Stream processing - Apache flink
Renato Guimaraes
 
Apache Flink - a Gentle Start
Liangjun Jiang
 
Advanced
mxmxm
 
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
Flink Forward
 
January 2016 Flink Community Update & Roadmap 2016
Robert Metzger
 
Streaming Dataflow with Apache Flink
huguk
 
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward
 
Deeply Declarative Data Pipelines
HostedbyConfluent
 
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
Changelog Stream Processing with Apache Flink
Flink Forward
 
Real Time Visibility with Flink
Rafi Aroch
 
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
confluent
 
Ad

More from Ververica (14)

PDF
2020-05-06 Apache Flink Meetup London: The Easiest Way to Get Operational wit...
Ververica
 
PDF
Webinar: How to contribute to Apache Flink - Robert Metzger
Ververica
 
PDF
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Ververica
 
PDF
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Ververica
 
PDF
Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz
Ververica
 
PDF
Deploying Flink on Kubernetes - David Anderson
Ververica
 
PPTX
Webinar: Flink SQL in Action - Fabian Hueske
Ververica
 
PPTX
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
Ververica
 
PDF
2018-01 Seattle Apache Flink Meetup at OfferUp, Opening Remarks and Talk 2
Ververica
 
PPTX
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Ververica
 
PDF
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Ververica
 
PDF
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Ververica
 
PPTX
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Ververica
 
PPTX
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Ververica
 
2020-05-06 Apache Flink Meetup London: The Easiest Way to Get Operational wit...
Ververica
 
Webinar: How to contribute to Apache Flink - Robert Metzger
Ververica
 
Webinar: Deep Dive on Apache Flink State - Seth Wiesman
Ververica
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Ververica
 
Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz
Ververica
 
Deploying Flink on Kubernetes - David Anderson
Ververica
 
Webinar: Flink SQL in Action - Fabian Hueske
Ververica
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
Ververica
 
2018-01 Seattle Apache Flink Meetup at OfferUp, Opening Remarks and Talk 2
Ververica
 
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Ververica
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Ververica
 
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Ververica
 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Ververica
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Ververica
 

Recently uploaded (20)

PDF
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
PPT
introdution to python with a very little difficulty
HUZAIFABINABDULLAH
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PPTX
Fluvial_Civilizations_Presentation (1).pptx
alisslovemendoza7
 
PPTX
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
PDF
Classifcation using Machine Learning and deep learning
bhaveshagrawal35
 
PPTX
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
PDF
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
PDF
apidays Munich 2025 - Developer Portals, API Catalogs, and Marketplaces, Miri...
apidays
 
PPTX
IP_Journal_Articles_2025IP_Journal_Articles_2025
mishell212144
 
PPTX
short term internship project on Data visualization
JMJCollegeComputerde
 
PPT
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PPT
From Vision to Reality: The Digital India Revolution
Harsh Bharvadiya
 
PDF
McKinsey - Global Energy Perspective 2023_11.pdf
niyudha
 
PDF
apidays Munich 2025 - Making Sense of AI-Ready APIs in a Buzzword World, Andr...
apidays
 
PPTX
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
PDF
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
PPTX
Nursing Shift Supervisor 24/7 in a week .pptx
amjadtanveer
 
PDF
apidays Munich 2025 - The Physics of Requirement Sciences Through Application...
apidays
 
Blue Futuristic Cyber Security Presentation.pdf
tanvikhunt1003
 
introdution to python with a very little difficulty
HUZAIFABINABDULLAH
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
Fluvial_Civilizations_Presentation (1).pptx
alisslovemendoza7
 
UVA-Ortho-PPT-Final-1.pptx Data analytics relevant to the top
chinnusindhu1
 
Classifcation using Machine Learning and deep learning
bhaveshagrawal35
 
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
Key_Statistical_Techniques_in_Analytics_by_CA_Suvidha_Chaplot.pdf
CA Suvidha Chaplot
 
apidays Munich 2025 - Developer Portals, API Catalogs, and Marketplaces, Miri...
apidays
 
IP_Journal_Articles_2025IP_Journal_Articles_2025
mishell212144
 
short term internship project on Data visualization
JMJCollegeComputerde
 
Real Life Application of Set theory, Relations and Functions
manavparmar205
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
From Vision to Reality: The Digital India Revolution
Harsh Bharvadiya
 
McKinsey - Global Energy Perspective 2023_11.pdf
niyudha
 
apidays Munich 2025 - Making Sense of AI-Ready APIs in a Buzzword World, Andr...
apidays
 
Presentation (1) (1).pptx k8hhfftuiiigff
karthikjagath2005
 
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
Nursing Shift Supervisor 24/7 in a week .pptx
amjadtanveer
 
apidays Munich 2025 - The Physics of Requirement Sciences Through Application...
apidays
 

Apache Flink Meetup: Sanjar Akhmedov - Joining Infinity – Windowless Stream Processing with Flink

  • 1. Joining Infinity — Windowless Stream Processing with Flink Sanjar Akhmedov, Software Engineer, ResearchGate
  • 2. It started when two researchers discovered first- hand that collaborating with a friend or colleague on the other side of the world was no easy task. There are many variations of passages of Lorem Ipsum ResearchGate is a social network for scientists.
  • 3. Connect the world of science. Make research open to all.
  • 4. Structured system There are many variations of passages of Lorem Ipsum We have, and are continuing to change how scientific knowledge is shared and discovered.
  • 8. Diverse data sources Proxy Frontend Services memcache MongoDB Solr PostgreSQL Infinispan HBaseMongoDB Solr
  • 10. Data Model Account Publication Claim 1 * Author Authorship 1*
  • 11. Hypothetical SQL Publication Authorship 1* CREATE TABLE publications ( id SERIAL PRIMARY KEY, author_ids INTEGER[] ); Account Claim 1 * Author CREATE TABLE accounts ( id SERIAL PRIMARY KEY, claimed_author_ids INTEGER[] ); CREATE MATERIALIZED VIEW account_publications REFRESH FAST ON COMMIT AS SELECT accounts.id AS account_id, publications.id AS publication_id FROM accounts JOIN publications ON ANY (accounts.claimed_author_ids) = ANY (publications.author_ids);
  • 12. • Data sources are distributed across different DBs • Dataset doesn’t fit in memory on a single machine • Join process must be fault tolerant • Deploy changes fast • Up-to-date join result in near real-time • Join result must be accurate Challenges
  • 13. Change data capture (CDC) User Microservice DB Request Write Cache Sync Solr/ES Sync HBase/HDFS Sync
  • 14. Change data capture (CDC) User Microservice DB Request Write Log K2 1 K1 4 Extract
  • 15. Change data capture (CDC) User Microservice DB Request Write Log K2 1 K1 4 K1 Ø Extract
  • 16. Change data capture (CDC) User Microservice DB Request Write Log K2 1 K1 4 K1 Ø KN 42 … Extract
  • 17. Change data capture (CDC) User Microservice DB Cache Request Write Log K2 1 K1 4 K1 Ø KN 42 … Extract Sync
  • 18. Change data capture (CDC) User Microservice DB Cache Request Write Log K2 1 K1 4 K1 Ø KN 42 … Extract Sync HBase/HDFSSolr/ES
  • 19. Join two CDC streams into one NoSQL1 SQL Kafka Kafka Flink Streaming Join Kafka NoSQL2
  • 21. DataStream<Account> accounts = kafkaTopic("accounts"); DataStream<Publication> publications = kafkaTopic("publications"); DataStream<AccountPublication> result = accounts.connect(publications) .keyBy("claimedAuthorId", "publicationAuthorId") .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() { transient ValueState<String> authorAccount; transient ValueState<String> authorPublication; public void open(Configuration parameters) throws Exception { authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null)); authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null)); } public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception { authorAccount.update(account.id); if (authorPublication.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception { authorPublication.update(publication.id); if (authorAccount.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } }); Prototype implementation
  • 22. DataStream<Account> accounts = kafkaTopic("accounts"); DataStream<Publication> publications = kafkaTopic("publications"); DataStream<AccountPublication> result = accounts.connect(publications) .keyBy("claimedAuthorId", "publicationAuthorId") .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() { transient ValueState<String> authorAccount; transient ValueState<String> authorPublication; public void open(Configuration parameters) throws Exception { authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null)); authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null)); } public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception { authorAccount.update(account.id); if (authorPublication.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception { authorPublication.update(publication.id); if (authorAccount.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } }); Prototype implementation
  • 23. DataStream<Account> accounts = kafkaTopic("accounts"); DataStream<Publication> publications = kafkaTopic("publications"); DataStream<AccountPublication> result = accounts.connect(publications) .keyBy("claimedAuthorId", "publicationAuthorId") .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() { transient ValueState<String> authorAccount; transient ValueState<String> authorPublication; public void open(Configuration parameters) throws Exception { authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null)); authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null)); } public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception { authorAccount.update(account.id); if (authorPublication.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception { authorPublication.update(publication.id); if (authorAccount.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } }); Prototype implementation
  • 24. DataStream<Account> accounts = kafkaTopic("accounts"); DataStream<Publication> publications = kafkaTopic("publications"); DataStream<AccountPublication> result = accounts.connect(publications) .keyBy("claimedAuthorId", "publicationAuthorId") .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() { transient ValueState<String> authorAccount; transient ValueState<String> authorPublication; public void open(Configuration parameters) throws Exception { authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null)); authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null)); } public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception { authorAccount.update(account.id); if (authorPublication.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception { authorPublication.update(publication.id); if (authorAccount.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } }); Prototype implementation
  • 25. DataStream<Account> accounts = kafkaTopic("accounts"); DataStream<Publication> publications = kafkaTopic("publications"); DataStream<AccountPublication> result = accounts.connect(publications) .keyBy("claimedAuthorId", "publicationAuthorId") .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() { transient ValueState<String> authorAccount; transient ValueState<String> authorPublication; public void open(Configuration parameters) throws Exception { authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null)); authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null)); } public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception { authorAccount.update(account.id); if (authorPublication.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception { authorPublication.update(publication.id); if (authorAccount.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } }); Prototype implementation
  • 26. DataStream<Account> accounts = kafkaTopic("accounts"); DataStream<Publication> publications = kafkaTopic("publications"); DataStream<AccountPublication> result = accounts.connect(publications) .keyBy("claimedAuthorId", "publicationAuthorId") .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() { transient ValueState<String> authorAccount; transient ValueState<String> authorPublication; public void open(Configuration parameters) throws Exception { authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null)); authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null)); } public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception { authorAccount.update(account.id); if (authorPublication.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception { authorPublication.update(publication.id); if (authorAccount.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } }); Prototype implementation
  • 27. DataStream<Account> accounts = kafkaTopic("accounts"); DataStream<Publication> publications = kafkaTopic("publications"); DataStream<AccountPublication> result = accounts.connect(publications) .keyBy("claimedAuthorId", "publicationAuthorId") .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() { transient ValueState<String> authorAccount; transient ValueState<String> authorPublication; public void open(Configuration parameters) throws Exception { authorAccount = getRuntimeContext().getState(new ValueStateDescriptor<>("authorAccount", String.class, null)); authorPublication = getRuntimeContext().getState(new ValueStateDescriptor<>("authorPublication", String.class, null)); } public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception { authorAccount.update(account.id); if (authorPublication.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception { authorPublication.update(publication.id); if (authorAccount.value() != null) { out.collect(new AccountPublication(authorAccount.value(), authorPublication.value())); } } }); Prototype implementation
  • 28. Example dataflow Account Publications Accounts Alice 2 Publications Accounts Stream Join Account Publications Publications Stream Author 1 Author 2 Author N …
  • 29. Example dataflow Account Publications Accounts Alice 2 Publications Accounts Stream Join Account Publications Publications Stream Author 1 Author 2 Author N …
  • 30. Example dataflow Account Publications Accounts Alice 2 Publications Accounts Stream Join Account Publications Publications Stream Author 1 Author 2 Alice Author N …
  • 31. Example dataflow Account Publications Accounts Alice 2 Bob 1 Publications Accounts Stream Join Account Publications Publications Stream Author 1 Author 2 Alice Author N …
  • 32. Example dataflow Account Publications Accounts Alice 2 Bob 1 Publications Accounts Stream Join Account Publications Publications Stream Author 1 Author 2 Alice Author N … (Bob, 1)
  • 33. Example dataflow Account Publications Accounts Alice 2 Bob 1 Publications Accounts Stream Join Account Publications Publications Stream Author 1 Bob Author 2 Alice Author N … (Bob, 1)
  • 34. Example dataflow Account Publications Accounts Alice 2 Bob 1 Publications Paper1 1 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Author 2 Alice Author N …
  • 35. Example dataflow Account Publications Accounts Alice 2 Bob 1 Publications Paper1 1 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Author 2 Alice Author N …
  • 36. Example dataflow Account Publications Accounts Alice 2 Bob 1 Publications Paper1 1 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 37. Example dataflow Account Publications Accounts Alice 2 Bob 1 Publications Paper1 1 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 38. Example dataflow Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 39. • ✔ Data sources are distributed across different DBs • ✔ Dataset doesn’t fit in memory on a single machine • ✔ Join process must be fault tolerant • ✔ Deploy changes fast • ✔ Up-to-date join result in near real-time • ? Join result must be accurate Challenges
  • 40. Paper1 gets deleted Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 Ø Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 41. Paper1 gets deleted Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 Ø Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 42. Paper1 gets deleted Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 Ø Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N … ?
  • 43. Paper1 gets deleted Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 Ø Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N … Need previous value
  • 44. Paper1 gets deleted Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 Ø Accounts Stream Join Account Publications Diff with Previous State Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 45. Paper1 gets deleted Account Publications K1 (Bob, Paper1) K1 Ø Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 Ø Accounts Stream Join Account Publications Diff with Previous State Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 46. Paper1 gets deleted Account Publications K1 (Bob, Paper1) K1 Ø Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 Ø Accounts Stream Join Account Publications Diff with Previous State Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N … Need K1 here, e.g. K1 = 𝒇(Bob, Paper1)
  • 47. Paper1 gets updated Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 2 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 48. Paper1 gets updated Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 2 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Author N …
  • 49. Paper1 gets updated Account Publications K1 (Bob, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 2 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Paper1 Author N …
  • 50. Paper1 gets updated Account Publications K1 (Bob, Paper1) (Alice, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 2 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Paper1 Author N … (Alice, Paper1)
  • 51. Paper1 gets updated Account Publications K1 (Bob, Paper1) ?? (Alice, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 2 Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Paper1 Author N …
  • 52. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) Accounts Alice 2 Bob 1 Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Paper1 Author N …
  • 53. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) Accounts Alice 2 Bob 1 Bob Ø Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Paper1 Author N …
  • 54. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) Accounts Alice 2 Bob 1 Bob Ø Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Bob Paper1 Author 2 Alice Paper1 Author N …
  • 55. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) K1 Ø Accounts Alice 2 Bob 1 Bob Ø Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Ø Paper1 Author 2 Alice Paper1 Author N …
  • 56. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) K1 Ø Accounts Alice 2 Bob 1 Bob Ø Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Ø Paper1 Author 2 Alice Paper1 Author N …
  • 57. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) K1 Ø Accounts Alice 2 Bob 1 Bob Ø Alice 1 Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Ø Paper1 Author 2 Alice Paper1 Author N …
  • 58. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) K1 Ø Accounts Alice 2 Bob 1 Bob Ø Alice 1 Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Ø Paper1 Author 2 Alice Paper1 Author N … 2. (Alice, 1)
  • 59. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) K1 Ø Accounts Alice 2 Bob 1 Bob Ø Alice 1 Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Alice Paper1 Author 2 Ø Paper1 Author N … 2. (Alice, 1)
  • 60. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) K1 Ø Accounts Alice 2 Bob 1 Bob Ø Alice 1 Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Alice Paper1 Author 2 Ø Paper1 Author N … 2. (Alice, 1) (Alice, Paper1)
  • 61. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) K1 Ø K2 (Alice, Paper1) K2 Ø Accounts Alice 2 Bob 1 Bob Ø Alice 1 Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Alice Paper1 Author 2 Ø Paper1 Author N … 2. (Alice, 1) (Alice, Paper1)
  • 62. Alice claims Paper1 via different author Account Publications K1 (Bob, Paper1) K2 (Alice, Paper1) K1 Ø K3 (Alice, Paper1) K2 Ø Accounts Alice 2 Bob 1 Bob Ø Alice 1 Publications Paper1 1 Paper1 (1, 2) Accounts Stream Join Account Publications Publications Stream Author 1 Alice Paper1 Author 2 Ø Paper1 Author N … 2. (Alice, 1) (Alice, Paper1) Pick correct natural IDs e.g. K3 = 𝒇(Alice, Author1, Paper1)
  • 63. • Keep previous element state to update previous join result • Stream elements are not domain entities but commands such as delete or upsert • Joined stream must have natural IDs to propagate deletes and updates How to solve deletes and updates
  • 65. Generic join graph Operate on commands Account Publications Accounts Stream Publications Stream Diff Alice Bob … Diff Paper1 PaperN … Join Author1 AuthorM …
  • 66. Memory requirements Account Publications Accounts Stream Publications Stream Diff Alice Bob … Diff Paper1 PaperN … Join Author1 AuthorM … Full copy of Accounts stream Full copy of Publications stream Full copy of Accounts stream on left side Full copy of Publications stream on right side
  • 68. • In addition to handling Kafka traffic we need to reshuffle all data twice over the network • We need to keep two full copies of each joined stream in memory Resource considerations
  • 69. Questions We are hiring - www.researchgate.net/careers