Developing and Operating
Real-Time Applications at Tencent
Xiaogang SHI
robbieshi@tencent.com
Outline
PART I: Background
PART II: Introduction to Oceanus
PART III: Improvement to Flink
About Tencent
GAMES
• The largest gaming company in the world by active users and revenue
• 200 million MAU
SOCIAL NETWORK
• The largest social communities in China in terms of MAU
• Weixin & WeChat: 1 billion MAU
• QQ: 807 million MAU
FINTECH
• The leading mobile payment platform in China by active users and transactions
• 900 million MAU
• 1 billion transactions daily
CLOUD
• Covered 25 regions and operated 53 availability zones
MEDIA: VIDEO
• The leading online video streaming platform in China in terms of mobile DAU and subscriptions
• 89 million subscriptions
• Up 58% year-over-year
MEDIA: MUSIC
• The largest online music platform in China
• 800 million MAU
MEDIA: NEWS
• The top mobile news app by active users
• 150 million MAU
https://www.tencent.com/en-us/articles/8003551553167294.pdf
Real-Time Applications at Tencent
Use cases: ETL, Monitoring, Real-Time BI, Online Learning
• 17 trillion: number of messages received per day
• 210 million: maximum number of messages received per second
• 3 PB: amount of data received per day
Why Flink?
An ETL pipeline as an example
[Figure: ETL pipeline with the components Receiver, Ingestor, Message Queue, Consumer, Dispatcher, Sender, Aggregator, and File Writer.]
It’s very difficult to improve performance while ensuring correctness.
Flink’s Strengths
• Efficient support for state
• Automatic distribution while rescaling
• AT-LEAST-ONCE and EXACTLY-ONCE guarantees while tolerating failures
• Flexible and powerful programming interfaces for stream processing
• Low latency and high throughput
Oceanus Overview
A unified platform to develop and operate real-time applications
[Figure: Oceanus architecture.
 Applications: ETL, Monitoring, Business Intelligence, Recommendation, Advertisement.
 Oceanus: Canvas, SQL, JAR, Oceanus-ML; Configuration, Test, Deployment, Monitoring.
 Engine: TDFLINK (Flink at Tencent).
 Infrastructure: YARN, ZooKeeper, HDFS.
 Connected systems: Message Queue, DB, Online Services, OLAP.]
Developing with Oceanus
Users can easily develop their applications by dragging and connecting operators.
[Figure: canvas with the operators SOURCE → SELECT → WINDOW → SELECT → SINK.]
Source configuration:
• Type: Kafka
• Name: test_topic
• Fields: Name (String), Id (Integer), Etime (Long), Phone (String)
Window configuration:
• Type: Tumbling
• Length: 10 seconds
Developing with Oceanus

    INSERT INTO uni_hot_stat (
        song_id,
        batch_time,
        qy_play_score
    )
    SELECT
        song_id,
        SUM(play_score) AS qy_play_score
    FROM kfk_u002
    GROUP BY song_id

SQL → Compile → JobGraph → Submit
• Users can easily configure their applications with visualized job graphs.
• Jobs are managed with refined Cluster Clients.
Operating with Oceanus
Improve operating efficiency with rich metrics.
1. An operator is a bottleneck if its input queue is full while its output queue is not.
2. The ratio between the throughputs of different operators remains roughly the same when the parallelism changes. We can use this property to configure the parallelism (this does not work well with window operators).
3. Data skew may exist when the difference between the maximum and the minimum throughput is very large.
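The three heuristics above can be sketched as small checks over collected metrics. This is only an illustration: the class `OperatorMetricsSketch`, its method names, and all thresholds are hypothetical and are not taken from Oceanus or Flink. The parallelism helper encodes one possible reading of heuristic 2, namely that an operator's load is a fixed multiple of the source rate.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not an Oceanus or Flink API.
final class OperatorMetricsSketch {

    // Heuristic 1: the input queue is full (usage at or above the threshold)
    // while the output queue is not.
    static boolean isBottleneck(double inQueueUsage, double outQueueUsage, double fullThreshold) {
        return inQueueUsage >= fullThreshold && outQueueUsage < fullThreshold;
    }

    // Heuristic 2 (assumed reading): if an operator processes ratioToSource records
    // per source record, size it so that each subtask stays within its capacity.
    static int requiredParallelism(double targetSourceRate, double ratioToSource,
                                   double perSubtaskCapacity) {
        return (int) Math.ceil(targetSourceRate * ratioToSource / perSubtaskCapacity);
    }

    // Heuristic 3: flag skew when the fastest subtask handles far more than the slowest.
    // Using a max/min ratio (rather than an absolute difference) is an assumption.
    static boolean hasDataSkew(List<Double> subtaskThroughputs, double maxRatio) {
        if (subtaskThroughputs.isEmpty()) {
            return false;
        }
        double max = Collections.max(subtaskThroughputs);
        double min = Collections.min(subtaskThroughputs);
        if (min <= 0.0) {
            return max > 0.0; // an idle subtask next to a busy one counts as skew
        }
        return max / min > maxRatio;
    }

    public static void main(String[] args) {
        System.out.println(isBottleneck(1.0, 0.3, 0.95));
        System.out.println(requiredParallelism(10000, 0.5, 1000));
        System.out.println(hasDataSkew(Arrays.asList(100.0, 900.0), 5.0));
    }
}
```

As the slide notes, the ratio-based sizing should not be trusted for window operators, whose output rate is not proportional to their input rate.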
Operating with Oceanus
Much information can be obtained from thread stacks.
• Checkpoint timeouts: locks not released by blocked user functions, slow HDFS writes, blocked user checkpoint functions
• Performance issues
Improvement to Flink
Job Management
• Non-disruptive recovery of job masters
• Avoid split-brain with ZooKeeper transactions
• Fine-grained recovery of tasks with cached result partitions
Resource Management
• Fine-grained resource allocation
• Improve scheduling efficiency
Performance & Usability
• Local keyed streams
• Incremental windows
• UDX
• DimJoin
• Top N
Refine leader coordination
Current problems: It’s difficult to reason about leadership in Flink.
• CheckpointCoordinator cannot be shut down when the shutdown method cannot obtain the lock.
• A job master may publish its leadership after its leadership has been revoked.
• A job master may successfully complete a checkpoint after its leadership has been revoked.
[Figure: timeline for JM 1 and JM 2 with the events Grant leadership, Full GC, Confirm leadership, Grant leadership, Confirm leadership.]
[Figure: JM 1: Lost leadership, Complete Checkpoint 2. JM 2: Grant leadership, Recover from Checkpoint 1.]
Refine leader coordination
/flink/{cluster-id}/jobmanagers/{job-id}
    leader [EPHEMERAL]
    latches
        latch-1 [EPHEMERAL|SEQUENTIAL]
        latch-2 [EPHEMERAL|SEQUENTIAL]
    checkpoints
        checkpoint-1 [PERSISTENT]
        checkpoint-2 [PERSISTENT]
• Each leader contender creates an EPHEMERAL and SEQUENTIAL latch.
• The contender whose latch has the smallest sequence number is elected as the leader.
• A leader holds leadership as long as its latch exists.
• A contender can access state only while it has been granted leadership and its latch still exists.

    // Write only if this contender's latch still exists; the check and the write commit atomically.
    zkClient.inTransaction()
        .check().forPath(myLatch).and()
        .setData().forPath(dataPath).and()
        .commit();
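The election and guarded-write rules above can be illustrated without a ZooKeeper cluster. The sketch below is an in-memory stand-in, not the Tencent implementation and not the Curator API: the class `LatchElectionSketch` and its methods are hypothetical, a `TreeMap` plays the role of the `latches` znode, and a `synchronized` method plays the role of the atomic transaction.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the latch scheme; real deployments use ZooKeeper.
final class LatchElectionSketch {
    // Sequence number -> contender name, ordered like SEQUENTIAL znodes.
    private final TreeMap<Long, String> latches = new TreeMap<>();
    private long nextSequence = 0;
    private String state = null;

    // A contender joins the election by creating a latch with the next sequence number.
    synchronized long createLatch(String contender) {
        long sequence = nextSequence++;
        latches.put(sequence, contender);
        return sequence;
    }

    // Models an EPHEMERAL latch disappearing, e.g. when the contender's session expires.
    synchronized void removeLatch(long sequence) {
        latches.remove(sequence);
    }

    // The contender holding the smallest sequence number is the leader.
    synchronized String leader() {
        Map.Entry<Long, String> first = latches.firstEntry();
        return first == null ? null : first.getValue();
    }

    // Check-then-write in one atomic step: the write succeeds only if the
    // caller's latch still exists and is still the smallest one.
    synchronized boolean guardedWrite(long myLatch, String newState) {
        if (latches.isEmpty() || latches.firstKey() != myLatch) {
            return false;
        }
        state = newState;
        return true;
    }

    synchronized String state() {
        return state;
    }

    public static void main(String[] args) {
        LatchElectionSketch zk = new LatchElectionSketch();
        long jm1 = zk.createLatch("JM 1");
        long jm2 = zk.createLatch("JM 2");
        zk.guardedWrite(jm1, "checkpoint-1");
        zk.removeLatch(jm1); // JM 1 loses its latch, e.g. during a long GC pause
        System.out.println(zk.guardedWrite(jm1, "checkpoint-2")); // rejected: latch is gone
        System.out.println(zk.leader() + " / " + zk.state());
    }
}
```

Because the check and the write happen in one atomic step, a job master that has lost its latch cannot complete a checkpoint, which is the failure described on the previous slide.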
Non-disruptive recovery of job masters
Avoid restarting tasks when the job master fails.
1. Leadership is granted to the new job master.
2. The TaskExecutors are notified that the leadership has changed.
3. The TaskExecutors report their tasks and slots to the new job master.
4. The new job master rebuilds its ExecutionGraph and SlotPool from the collected task and slot information.
[Figure: ZooKeeper, two JobMasters, and three TaskExecutors, each running several Tasks.]
Fine-grained resource allocation
Current problems:
• The resource specification for operators does not take effect in resource allocation.
• Slots are allocated according to the number of available slots in task managers, instead of the amount of available resources.
• YARN containers may be killed when the used resources exceed the allocated ones.
[Figure: example resource sizes: 1 core, 256 MB; 0.5 core, 512 MB; 3 cores, 2048 MB; 3 cores, 2048 MB.]
Improvements:
• Slot instances are created and destroyed dynamically.
• A task manager creates a slot instance if its available resources are sufficient for the slot.
• A task manager destroys the slot instance if the tasks in the slot finish.
• Users can specify the resources needed by operators.
• A slot’s resources are calculated by accumulating the resources of the operators in the slot.
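The dynamic-slot rules above can be sketched as simple bookkeeping. The classes `SlotAllocationSketch`, `Resource`, and `TaskManager` below are hypothetical and are not Flink's runtime classes; the numbers in `main` reuse the sizes from the figure purely as sample input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration; not Flink's ResourceProfile or TaskExecutor.
final class SlotAllocationSketch {

    static final class Resource {
        final double cores;
        final int memoryMb;

        Resource(double cores, int memoryMb) {
            this.cores = cores;
            this.memoryMb = memoryMb;
        }

        Resource plus(Resource other) {
            return new Resource(cores + other.cores, memoryMb + other.memoryMb);
        }

        Resource minus(Resource other) {
            return new Resource(cores - other.cores, memoryMb - other.memoryMb);
        }

        boolean fitsIn(Resource available) {
            return cores <= available.cores && memoryMb <= available.memoryMb;
        }
    }

    // A slot's resources are the sum of the resources of its operators.
    static Resource slotResource(List<Resource> operators) {
        Resource total = new Resource(0.0, 0);
        for (Resource operator : operators) {
            total = total.plus(operator);
        }
        return total;
    }

    static final class TaskManager {
        private Resource available;
        private final List<Resource> slots = new ArrayList<>();

        TaskManager(Resource total) {
            this.available = total;
        }

        // Create a slot only if the remaining resources are sufficient for it.
        boolean createSlot(Resource slot) {
            if (!slot.fitsIn(available)) {
                return false;
            }
            available = available.minus(slot);
            slots.add(slot);
            return true;
        }

        // Destroy the slot once its tasks finish and return its resources.
        void destroySlot(Resource slot) {
            if (slots.remove(slot)) {
                available = available.plus(slot);
            }
        }
    }

    public static void main(String[] args) {
        Resource slot = slotResource(Arrays.asList(new Resource(1.0, 256), new Resource(0.5, 512)));
        TaskManager taskManager = new TaskManager(new Resource(3.0, 2048));
        System.out.println(taskManager.createSlot(slot)); // fits
        System.out.println(taskManager.createSlot(slot)); // fits
        System.out.println(taskManager.createSlot(slot)); // no cores left
    }
}
```

Accounting by resources rather than by slot count is what lets the task manager refuse a slot it cannot actually host, instead of over-committing its YARN container.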
Local keyed streams
Current problems: Performance is significantly degraded by data skew.
Without local aggregation:
    words.keyBy().count()
[Figure: three SOURCE tasks feed two AGGREGATOR tasks.]
With local aggregation:
    words
        .localKeyBy().window().count()
        .keyBy().sum()
[Figure: three SOURCE tasks feed three LOCAL AGGREGATOR tasks, which feed two AGGREGATOR tasks.]
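The effect of the two-stage plan can be shown with plain collections. This is a sketch of the idea only: `LocalAggregationSketch` and its methods are hypothetical, each input list stands for the words one source task sees within one window, and no Flink API is involved.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of local pre-aggregation; not a Flink operator.
final class LocalAggregationSketch {

    // Stage 1 (local): count words inside one source task for one window,
    // so each distinct word leaves the task as a single partial count.
    static Map<String, Long> localCount(List<String> words) {
        Map<String, Long> partial = new HashMap<>();
        for (String word : words) {
            partial.merge(word, 1L, Long::sum);
        }
        return partial;
    }

    // Stage 2 (global): sum the partial counts that arrive for each key.
    static Map<String, Long> globalSum(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> first = localCount(Arrays.asList("a", "a", "a", "b"));
        Map<String, Long> second = localCount(Arrays.asList("a", "a", "c"));
        // Seven input records shrink to four partial counts before the shuffle.
        System.out.println(first.size() + second.size());
        System.out.println(globalSum(Arrays.asList(first, second)));
    }
}
```

The hot key "a" reaches the global aggregator as two partial counts instead of five records, which is why pre-aggregation dampens skew; the final totals are unchanged because counting and summing compose.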
Local keyed streams
[Chart: Without Local Aggregation vs. With Local Aggregation.]
[Figure: key groups 1, 2, and 3 distributed across tasks.]
• Each task has a complete key group range.
• Groups are distributed to tasks according to their number.
• Groups with the same id are merged when restoring.
Usability
• UDX: more than 40 UDX are provided.
• Incremental Window: allows users to obtain partial results of windows.
• Dim Join: optimized implementation of joins with external storage.
• Top N
Future Work
• Improve scheduling efficiency
• Unified checkpoint mechanism for both streaming and batch jobs
• Incorporating partitioning and timing into the optimizer
• SuperSQL: efficient data analytics across data sources (Hive, HBase, PostgreSQL, etc.) and data centers
Thank You

Flink Forward San Francisco 2019: Developing and operating real-time applications with Oceanus - Xiaogang Shi
