Flink Forward San Francisco 2019: Developing and operating real-time applications with Oceanus - Xiaogang Shi

Developing and Operating Real-Time Applications at Tencent
Xiaogang SHI
robbieshi@tencent.com

Outline
PART I: Background
PART II: Introduction to Oceanus
PART III: Improvements to Flink

About Tencent
GAMES
• The largest gaming company in the world by active users and revenue
• 200 million MAU
SOCIAL NETWORK
• The largest social communities in China in terms of MAU
• Weixin & WeChat: 1 billion MAU
• QQ: 807 million MAU
FINTECH
• The leading mobile payment platform in China by active users and transactions
• 900 million MAU
• 1 billion transactions daily
CLOUD
• Covered 25 regions and operated 53 availability zones
MEDIA
VIDEO
• The leading online video streaming platform in China in terms of mobile DAU and subscriptions
• 89 million subscriptions
• Up 58% year-over-year
MUSIC
• The largest online music platform in China
• 800 million MAU
NEWS
• The top mobile news app by active users
• 150 million MAU
Source: https://www.tencent.com/en-us/articles/8003551553167294.pdf

Real-Time Applications at Tencent
ETL, Monitoring, Real-Time BI, Online Learning
• 17 trillion: number of messages received per day
• 210 million: maximum number of messages received per second
• 3 PB: amount of data received per day

Why Flink?
An ETL pipeline as an example
[Diagram components: Message Queue, Receiver, Consumers, Dispatchers, Senders, Aggregators, Ingestor, File Writer]
It’s very difficult to improve performance while ensuring correctness.

Flink’s Strengths
• Efficient support for states
• Automatic distribution while rescaling
• AT-LEAST-ONCE and EXACTLY-ONCE guarantees while tolerating failures
• Flexible and powerful programming interfaces for stream processing
• Low latency and high throughput

Oceanus Overview
A unified platform to develop and operate real-time applications
[Architecture diagram. Components shown:]
• Use cases: ETL, Monitoring, Business Intelligence, Recommendation, Advertisement
• Oceanus: Canvas, SQL, JAR; Configuration, Test, Deployment, Monitoring; Oceanus-ML
• TDFLINK (Flink at Tencent)
• YARN, ZooKeeper, HDFS
• Connected systems: Message Queue, DB, OLAP, Online Services

Developing with Oceanus
Users can easily develop their applications by dragging and connecting operators.
[Canvas example: SOURCE -> SELECT -> WINDOW -> SELECT -> SINK]
SOURCE
    Type: Kafka
    Name: test_topic
    Fields: Name String, Id Integer, Etime Long, Phone String
WINDOW
    Type: Tumbling
    Length: 10 seconds

Developing with Oceanus
INSERT INTO uni_hot_stat (
    song_id,
    batch_time,
    qy_play_score
)
SELECT
    song_id,
    sum(play_score) AS qy_play_score
FROM kfk_u002
GROUP BY song_id
[Workflow: SQL -> Compile -> JobGraph -> Submit]
Users can easily configure their applications with visualized job graphs.
Jobs are managed with refined Cluster Clients.

Operating with Oceanus
Improve operating efficiency with rich metrics.
1. An operator is a bottleneck if its input queue is full while its output queue is not.
2. The ratio between the throughputs of different operators remains roughly the same when the parallelism changes. We can use this property to configure the parallelism (this does not work well with window operators).
3. Data skew may exist when the difference between the maximum and the minimum throughput is very large.
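The three rules above can be sketched as simple checks over collected metrics. This is only an illustration: the class name, the queue-usage threshold, the skew ratio, and the idea of sizing an operator from a per-subtask capacity are assumptions, not part of Oceanus.

```java
import java.util.Arrays;

// Hypothetical helper: names and thresholds are assumptions, not Oceanus APIs.
class OperatorHeuristics {

    // Rule 1: the input queue is (nearly) full while the output queue is not.
    static boolean isBottleneck(double inQueueUsage, double outQueueUsage, double fullThreshold) {
        return inQueueUsage >= fullThreshold && outQueueUsage < fullThreshold;
    }

    // Rule 2: if an operator sees `ratioToSource` records per source record
    // regardless of parallelism, size it for the target source rate.
    static int suggestParallelism(double targetSourceRate, double ratioToSource,
                                  double perSubtaskCapacity) {
        double requiredRate = targetSourceRate * ratioToSource;
        return (int) Math.max(1, Math.ceil(requiredRate / perSubtaskCapacity));
    }

    // Rule 3: a large max/min throughput gap across subtasks hints at data skew.
    static boolean hasSkew(double[] subtaskThroughput, double maxToMinRatio) {
        double max = Arrays.stream(subtaskThroughput).max().orElse(0.0);
        double min = Arrays.stream(subtaskThroughput).min().orElse(0.0);
        if (max == 0.0) {
            return false; // no traffic at all, nothing to compare
        }
        return min == 0.0 || max / min > maxToMinRatio;
    }

    public static void main(String[] args) {
        System.out.println(isBottleneck(0.98, 0.30, 0.95));
        System.out.println(suggestParallelism(100_000, 0.5, 12_000));
        System.out.println(hasSkew(new double[] {1000, 950, 100}, 5.0));
    }
}
```

As the slide notes, the ratio-based sizing in rule 2 should not be trusted for window operators, whose output rate is not proportional to their input rate.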

Operating with Oceanus
Much information can be obtained from thread stacks.
• Checkpoint timeouts: locks not released by blocked user functions, slow HDFS writes, blocked user checkpoint functions
• Performance issues

Improvements to Flink
Job Management
• Non-disruptive recovery of job masters
• Avoid split-brain with ZooKeeper transactions
• Fine-grained recovery of tasks with cached result partitions
Resource Management
• Fine-grained resource allocation
• Improve scheduling efficiency
Performance & Usability
• Local keyed streams
• Incremental windows
• UDX
• DimJoin
• Top N

Refine leader coordination
Current problems: It’s difficult to reason about leadership in Flink.
• A job master may publish its leadership after its leadership has been revoked.
    JM 1: Grant leadership -> Full GC -> Confirm leadership
    JM 2: Grant leadership -> Confirm leadership
• A job master may successfully complete a checkpoint after its leadership has been revoked.
    JM 1: Lost leadership -> Complete Checkpoint 2
    JM 2: Grant leadership -> Recover from Checkpoint 1
• The CheckpointCoordinator cannot be shut down when the shutdown method cannot obtain the lock.

Refine leader coordination
/flink/{cluster-id}/jobmanagers/{job-id}
    leader [EPHEMERAL]
    latches
        latch-1 [EPHEMERAL|SEQUENTIAL]
        latch-2 [EPHEMERAL|SEQUENTIAL]
    checkpoints
        checkpoint-1 [PERSISTENT]
        checkpoint-2 [PERSISTENT]
• Each leader contender creates an EPHEMERAL and SEQUENTIAL latch.
• The contender whose latch has the smallest sequence number is elected as the leader.
• A leader keeps its leadership as long as its latch exists.
• A contender can access state only when it has been granted leadership and its latch still exists.
// Write the data only if this contender's latch still exists (one atomic transaction).
zkClient.inTransaction()
    .check().forPath(myLatch).and()
    .setData().forPath(dataPath, data).and()
    .commit();
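The latch scheme can be illustrated without a ZooKeeper cluster. The sketch below is an in-memory stand-in, not the Oceanus implementation: the class `LatchElection` and its methods are hypothetical, a sorted map plays the role of the `latches` node, and a synchronized method plays the role of the check-and-set transaction.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the ZooKeeper subtree; all names are hypothetical.
class LatchElection {
    // sequence number -> contender, ordered like SEQUENTIAL znodes
    private final TreeMap<Integer, String> latches = new TreeMap<>();
    private final Map<String, String> data = new HashMap<>();
    private int nextSequence = 1;

    // Models creating an EPHEMERAL|SEQUENTIAL latch node.
    synchronized int createLatch(String contender) {
        int sequence = nextSequence++;
        latches.put(sequence, contender);
        return sequence;
    }

    // Models session expiry: the ephemeral latch disappears.
    synchronized void expire(int latch) {
        latches.remove(latch);
    }

    // The contender owning the smallest sequence number is the leader.
    synchronized String leader() {
        return latches.isEmpty() ? null : latches.firstEntry().getValue();
    }

    // Atomic "check latch, then set data"; refuses writes from a revoked leader.
    synchronized boolean guardedWrite(int myLatch, String path, String value) {
        if (latches.isEmpty() || latches.firstKey() != myLatch) {
            return false;
        }
        data.put(path, value);
        return true;
    }

    synchronized String read(String path) {
        return data.get(path);
    }

    public static void main(String[] args) {
        LatchElection zk = new LatchElection();
        int jm1 = zk.createLatch("JM 1");
        int jm2 = zk.createLatch("JM 2");
        zk.guardedWrite(jm1, "checkpoints", "checkpoint-1");
        zk.expire(jm1); // e.g. a long full GC makes the session time out
        System.out.println(zk.leader());
        System.out.println(zk.guardedWrite(jm1, "checkpoints", "checkpoint-2"));
        System.out.println(zk.read("checkpoints"));
    }
}
```

Because the existence check and the write happen in one step, a job master that has lost its latch can no longer complete a checkpoint or publish itself as leader, which addresses the two races described on the previous slide.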

Non-disruptive recovery of job masters
Avoid restarting tasks when the job master fails.
[Diagram: ZooKeeper, two JobMasters, and three TaskExecutors running tasks]
1. Grant leadership to the new job master.
2. Notify the TaskExecutors that the leadership has changed.
3. Report tasks and slots to the new job master.
4. Rebuild the ExecutionGraph and SlotPool with the collected task and slot information.

Fine-grained resource allocation
Current problems:
• The resource specification for operators does not take effect in resource allocation.
• Slots are allocated according to the number of available slots in task managers, instead of the amount of available resources.
• YARN containers may be killed when the used resources exceed the allocated ones.
Improvements:
• Users can specify the resources needed by operators.
• A slot’s resources are calculated by accumulating the resources of the operators in the slot.
• Slot instances are created and destroyed dynamically.
• A task manager creates a slot instance if its available resources are sufficient for the slot.
• A task manager destroys the slot instance when the tasks in the slot finish.
[Diagram labels: 1 core, 256 MB; 0.5 core, 512 MB; 3 cores, 2048 MB; 3 cores, 2048 MB]
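The bookkeeping behind dynamic slots can be sketched as follows. This is a simplified model under stated assumptions: `Resource`, `TaskManager`, and their methods are hypothetical names rather than Flink or Oceanus classes, and only CPU cores and memory are tracked, using the figures from the diagram.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of dynamic slots; not Flink's or Oceanus's actual classes.
class FineGrainedSlots {

    static final class Resource {
        final double cpuCores;
        final int memoryMb;

        Resource(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }

        Resource plus(Resource other) {
            return new Resource(cpuCores + other.cpuCores, memoryMb + other.memoryMb);
        }

        Resource minus(Resource other) {
            return new Resource(cpuCores - other.cpuCores, memoryMb - other.memoryMb);
        }

        boolean canHost(Resource other) {
            return cpuCores >= other.cpuCores && memoryMb >= other.memoryMb;
        }
    }

    // A slot's resources are the sum of the resources of its operators.
    static Resource slotResource(List<Resource> operators) {
        Resource total = new Resource(0.0, 0);
        for (Resource operator : operators) {
            total = total.plus(operator);
        }
        return total;
    }

    static final class TaskManager {
        private Resource available;
        private final List<Resource> slots = new ArrayList<>();

        TaskManager(Resource total) {
            this.available = total;
        }

        // Create a slot only if the remaining resources are sufficient.
        boolean tryCreateSlot(Resource slot) {
            if (!available.canHost(slot)) {
                return false;
            }
            available = available.minus(slot);
            slots.add(slot);
            return true;
        }

        // Destroy the slot once its tasks have finished and reclaim its resources.
        void destroySlot(Resource slot) {
            if (slots.remove(slot)) {
                available = available.plus(slot);
            }
        }

        Resource available() {
            return available;
        }
    }

    public static void main(String[] args) {
        Resource slot = slotResource(Arrays.asList(new Resource(1.0, 256), new Resource(0.5, 512)));
        TaskManager taskManager = new TaskManager(new Resource(3.0, 2048));
        System.out.println(taskManager.tryCreateSlot(slot));
        System.out.println(taskManager.available().cpuCores + " cores, "
                + taskManager.available().memoryMb + " MB left");
    }
}
```

Allocating by remaining resources rather than by a fixed slot count is what keeps a container from being asked to run more than it was sized for.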

Local keyed streams
Current problems: Performance is significantly degraded by data skew.
Without local aggregation: SOURCE (x3) -> AGGREGATOR (x2)
    words.keyBy().count()
With local aggregation: SOURCE (x3) -> LOCAL AGGREGATOR (x3) -> AGGREGATOR (x2)
    words
        .localKeyBy().window().count()
        .keyBy().sum()
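The effect of the two-stage plan can be shown with plain collections. The sketch below does not use Flink and does not reproduce the `localKeyBy()` API; the class and method names are hypothetical. Each source pre-counts its own words for one window, and the global aggregator sums the partial counts, so a hot key sends one record per source per window instead of one record per occurrence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of local pre-aggregation; not the Flink/Oceanus API.
class LocalAggregationSketch {

    // Local stage: count words within one source's window, no shuffle needed.
    static Map<String, Long> countLocally(List<String> words) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // Global stage: sum the partial counts that arrive for each key.
    static Map<String, Long> sumGlobally(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Long::sum));
        }
        return totals;
    }

    // Number of records sent over the network to the global aggregators.
    static long shuffledRecords(List<Map<String, Long>> partials) {
        long records = 0;
        for (Map<String, Long> partial : partials) {
            records += partial.size();
        }
        return records;
    }

    public static void main(String[] args) {
        List<List<String>> sources = Arrays.asList(
                Arrays.asList("a", "a", "a", "b"),
                Arrays.asList("a", "a", "c"),
                Arrays.asList("a", "b", "a"));
        List<Map<String, Long>> partials = new ArrayList<>();
        for (List<String> source : sources) {
            partials.add(countLocally(source));
        }
        System.out.println(sumGlobally(partials));
        System.out.println(shuffledRecords(partials) + " records shuffled");
    }
}
```

The totals match what a single keyed count would produce, while the skewed key "a" contributes three shuffled records rather than seven.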

Local keyed streams
[Chart: "Without Local Aggregation" vs. "With Local Aggregation"; x-axis values 0, 10, 20, 33, 50; y-axis from 0 to 45]
[Diagram: key groups 1, 2, 3 held by every task and redistributed among tasks when restoring]
• Each task has a complete key group range.
• Groups are distributed to tasks according to their numbers.
• Groups with the same id are merged when restoring.
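One possible reading of these three points is sketched below. It is an interpretation, not the Oceanus implementation: state is reduced to one partial sum per key group, the class name is hypothetical, and the owner of a group is computed with the contiguous-range formula that Flink uses for regular keyed state (keyGroup * parallelism / maxParallelism). Whether local keyed streams use the same rule is an assumption.

```java
import java.util.Arrays;

// Hypothetical model: oldState[task][keyGroup] is the partial sum a task holds.
class LocalKeyGroupRestore {

    // Assumed assignment rule, borrowed from Flink's regular keyed state.
    static int ownerOf(int keyGroup, int numKeyGroups, int parallelism) {
        return keyGroup * parallelism / numKeyGroups;
    }

    // Every old task holds all key groups; copies of the same group are merged
    // (here: summed) into the single new task that owns that group number.
    static long[][] restore(long[][] oldState, int newParallelism) {
        if (oldState.length == 0) {
            return new long[newParallelism][0];
        }
        int numKeyGroups = oldState[0].length;
        long[][] restored = new long[newParallelism][numKeyGroups];
        for (long[] taskState : oldState) {
            for (int group = 0; group < numKeyGroups; group++) {
                int owner = ownerOf(group, numKeyGroups, newParallelism);
                restored[owner][group] += taskState[group];
            }
        }
        return restored;
    }

    public static void main(String[] args) {
        long[][] oldState = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        for (long[] taskState : restore(oldState, 2)) {
            System.out.println(Arrays.toString(taskState));
        }
    }
}
```

After the restore each task can again receive records for any key, so it starts new state for the groups it did not inherit; the merge only decides where the checkpointed partial results resume.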

Usability
UDX
• More than 40 UDX are provided
Incremental Window
• Allows users to obtain partial results of windows
Dim Join
• Optimized implementation of joins with external storage
Top N

Future Work
• Improve scheduling efficiency
• Unified checkpoint mechanism for both streaming and batch jobs
• Incorporate partitioning and timing into the optimizer
• SuperSQL: efficient data analytics across data sources (Hive, HBase, PostgreSQL, etc.) and data centers
