Interactive Control over Temporal Consistency while Stylizing Video Streams
Problem: Per-frame stylization of videos often leads to temporal flickering
[Video: input vs. per-frame stylization]
Further, most of the techniques do not provide consistency control
Style-specific offline processing:
• Processing Images and Video for an Impressionist Effect, Peter Litwinowicz, SIGGRAPH, 1997.
• Video Watercolorization using Bidirectional Texture Advection, Bousseau et al., Transactions on Graphics, 2007.
• Stylizing Animation by Example, Bénard et al., Transactions on Graphics, 2013.
• Stylizing Video by Example, Jamriška et al., Transactions on Graphics, 2019.
Temporal inconsistency can add to the artistic look and feel (Fišer et al., Color Me Noisy: Example-based Rendering of Hand-colored Animations with Temporal Noise Control, EGSR 2014).
Motivation: to cater to the needs of live video streaming or conferencing.
Stylizing a live video conferencing session
Src: https://towardsdatascience.com/fancy-and-custom-neural-style-transfer-filters-for-video-conferencing-7eba2be1b6d5
Characteristics of a practical tool for stylizing video streams:
• Handles a wide range of stylization techniques
• Provides interactive temporal consistency control
• Processes high-resolution video streams with low latency
| Aspect | Bonneel et al., SIGGRAPH 2015 | Yao et al., MM 2017 | Lai et al., ECCV 2018 | Shekhar et al., VMV 2019 | Thiomonier et al., ICME 2021 | Ours |
| --- | --- | --- | --- | --- | --- | --- |
| Requires pre-processing? | No | Yes | No | Yes | No | No |
| Provides consistency control? | Yes | No | No | Yes | No | Yes |
| Provides interactive consistency control? | No | N/A | N/A | Yes | N/A | Yes |

(N/A = Not Applicable)
These blind approaches do not require knowledge of the underlying stylization technique. However, what about interactive consistency control?
Bonneel et al., Blind Video Temporal Consistency, SIGGRAPH 2015
Yao et al., Occlusion-aware Video Temporal Consistency, MM 2017
Lai et al., Learning Blind Video Temporal Consistency, ECCV 2018
Shekhar et al., Consistent Filtering of Videos and Dense Light-Fields Without Optic-Flow, VMV 2019
Thiomonier et al., Learning Long Term Style Preserving Blind Video Temporal Consistency, ICME 2021
Temporal consistency (𝜆)
[Pipeline overview: (1) compute the weights 𝑤𝑝 and 𝑤𝑛 from the input frames; (2) a linear combination of the warped stylized frames yields 𝐿𝑡; (3) warping the previous output 𝑂𝑡−1 yields 𝐺𝑡.]

Input:
𝐼𝑡−1, 𝐼𝑡, 𝐼𝑡+1 -- input images at time instances 𝑡 − 1, 𝑡, 𝑡 + 1
𝑃𝑡−1, 𝑃𝑡, 𝑃𝑡+1 -- per-frame stylized images at time instances 𝑡 − 1, 𝑡, 𝑡 + 1
𝑂𝑡−1 -- output at the previous time instance 𝑡 − 1
Output:
𝑂𝑡 -- output at time instance 𝑡?
Input (at time instance 𝑡): per-frame stylized results 𝑃𝑡−1, 𝑃𝑡, 𝑃𝑡+1, input images 𝐼𝑡−1, 𝐼𝑡, 𝐼𝑡+1, and the previous output 𝑂𝑡−1. Γ is a warping function towards time instance 𝑡.

Global Consistency
𝐺𝑡 = Γ(𝑂𝑡−1)
• Simple yet effective
• Leads to a loss of stylization (in terms of colors and textures)
• Warping errors keep getting propagated

Local Consistency
𝑤𝑝 = exp(−𝛼‖𝐼𝑡 − Γ(𝐼𝑡−1)‖²)
𝑤𝑛 = exp(−𝛼‖𝐼𝑡 − Γ(𝐼𝑡+1)‖²)
𝐿𝑡 = 𝑤𝑝 ∙ Γ(𝑃𝑡−1) + 𝑤𝑛 ∙ Γ(𝑃𝑡+1) + (1 − 𝑤𝑝 − 𝑤𝑛) ∙ 𝑃𝑡
• Backward and forward warping reduces artifacts due to occlusion and flow inaccuracies
• Preserves local temporal variations
• Cannot reduce inconsistencies significantly
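The two consistency images can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: it assumes the frames have already been warped towards time 𝑡 by the flow-based warping function Γ (warping itself is not shown), and the value of 𝛼 is illustrative.

```python
import numpy as np

def consistency_images(I_prev_w, I_t, I_next_w,
                       P_prev_w, P_t, P_next_w, O_prev_w, alpha=10.0):
    """Per-pixel weights and the locally/globally consistent images.

    All *_w arguments are frames already warped towards time t by Γ
    (not implemented here); images are float arrays of shape (H, W, 3).
    """
    # Warping-confidence weights w_p and w_n from the input frames.
    w_p = np.exp(-alpha * np.sum((I_t - I_prev_w) ** 2, axis=-1, keepdims=True))
    w_n = np.exp(-alpha * np.sum((I_t - I_next_w) ** 2, axis=-1, keepdims=True))
    # Local consistency: blend warped neighbouring stylized frames with P_t.
    L_t = w_p * P_prev_w + w_n * P_next_w + (1.0 - w_p - w_n) * P_t
    # Global consistency: simply the warped previous output.
    G_t = O_prev_w
    return w_p, w_n, L_t, G_t
```

Where warping is perfect (𝐼𝑡 equals the warped neighbours), the weights approach 1 and 𝐿𝑡 leans fully on the warped stylized neighbours; where warping fails, the weights decay and 𝐿𝑡 falls back to 𝑃𝑡.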
[Full pipeline: (1) compute the weights 𝑤𝑝 and 𝑤𝑛 from the input frames; (2) a linear combination of the warped stylized frames yields 𝐿𝑡; (3) warping the previous output 𝑂𝑡−1 yields 𝐺𝑡; (4) a linear combination of 𝐿𝑡 and 𝐺𝑡 yields 𝐴𝑡; (5) optimization solving yields the output 𝑂𝑡. Inputs and output as defined previously.]
argmin over 𝑂𝑡 of ∫ ‖𝛻𝑂𝑡 − 𝛻𝑃𝑡‖² + 𝑤𝑠 ‖𝑂𝑡 − 𝐴𝑡‖²

Data term ‖𝛻𝑂𝑡 − 𝛻𝑃𝑡‖²: high-frequency details from the per-frame stylized 𝑃𝑡. Smoothness term ‖𝑂𝑡 − 𝐴𝑡‖²: temporally consistent content from 𝐴𝑡, scaled by the weighting parameter 𝑤𝑠. 𝑂𝑡 is the per-frame output.

• The formulation is similar to that employed by Bonneel et al. (SIGGRAPH 2015) and Shekhar et al. (VMV 2019)
• Our novelty is the way in which we construct the consistent image 𝐴𝑡
• Through an adaptive combination, the consistent image preserves both local and global consistency aspects:
𝐴𝑡 = (1 − 𝑤𝑝) ∙ 𝐿𝑡 + 𝑤𝑝 ∙ 𝐺𝑡
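The Euler–Lagrange equation of this energy is the screened Poisson equation −Δ𝑂𝑡 + 𝑤𝑠𝑂𝑡 = −Δ𝑃𝑡 + 𝑤𝑠𝐴𝑡, which the sketch below minimizes by plain gradient descent. It is illustrative only: periodic boundaries (via np.roll), step size, and iteration count are our simplifying assumptions, not the paper's GPU solver.

```python
import numpy as np

def stabilize(P_t, A_t, w_s, n_iter=200, tau=0.1):
    """Gradient-descent sketch of argmin ||∇O_t − ∇P_t||² + w_s ||O_t − A_t||².

    w_s may be a scalar or a per-pixel map; images are 2-D float arrays.
    """
    def lap(x):  # 5-point Laplacian with periodic boundaries
        return (np.roll(x, 1, 0) + np.roll(x, -1, 0)
                + np.roll(x, 1, 1) + np.roll(x, -1, 1) - 4.0 * x)

    O = P_t.copy()
    for _ in range(n_iter):
        grad = -lap(O) + lap(P_t) + w_s * (O - A_t)  # ∝ dE/dO
        O -= tau * grad
    return O
```

With 𝑤𝑠 = 0 the solution keeps the gradients of 𝑃𝑡 (starting at 𝑃𝑡 is already stationary); as 𝑤𝑠 grows, the output is pulled towards the temporally consistent 𝐴𝑡.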
• We want to invoke the smoothness term only when the warping accuracy is sufficiently high. 𝑤𝑠 is thus driven by the similarity of the combined warped input image 𝐴𝑡ᴵ to 𝐼𝑡:
𝐴𝑡ᴵ = 𝑤𝑝 ∙ Γ(𝐼𝑡−1) + 𝑤𝑛 ∙ Γ(𝐼𝑡+1) + (1 − 𝑤𝑝 − 𝑤𝑛) ∙ 𝐼𝑡
𝑤𝑠 = 𝜆 ∙ exp(−𝛼‖𝐼𝑡 − 𝐴𝑡ᴵ‖²)
• We clamp the weights 𝑤𝑝 and 𝑤𝑛 such that 0 < 𝑤𝑝 < 𝑘1, 0 < 𝑤𝑛 < 𝑘2, and 0 < 𝑘1 + 𝑘2 < 1
• We can control the degree of temporal consistency by varying 𝑘1 and 𝜆
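The clamping and the smoothness weight can be sketched as follows. As before, warped inputs are assumed to be given, and the default values of 𝜆, 𝛼, 𝑘1, 𝑘2 here are illustrative, not the paper's presets.

```python
import numpy as np

def smoothness_weight(I_t, I_prev_w, I_next_w,
                      lam=1.0, alpha=10.0, k1=0.4, k2=0.4):
    """Clamped blending weights w_p, w_n and the smoothness weight w_s.

    I_prev_w / I_next_w are input frames already warped towards time t.
    """
    assert 0.0 < k1 + k2 < 1.0
    w_p = np.exp(-alpha * np.sum((I_t - I_prev_w) ** 2, axis=-1, keepdims=True))
    w_n = np.exp(-alpha * np.sum((I_t - I_next_w) ** 2, axis=-1, keepdims=True))
    # Clamp: 0 < w_p < k1 and 0 < w_n < k2, so 1 − w_p − w_n stays positive.
    w_p = np.clip(w_p, 1e-6, k1)
    w_n = np.clip(w_n, 1e-6, k2)
    # Combination of warped inputs; w_s is high only where warping is accurate.
    A_t_I = w_p * I_prev_w + w_n * I_next_w + (1.0 - w_p - w_n) * I_t
    w_s = lam * np.exp(-alpha * np.sum((I_t - A_t_I) ** 2, axis=-1, keepdims=True))
    return w_p, w_n, w_s
```

Raising 𝑘1 lets more of the globally consistent image through, while 𝜆 scales the overall pull of the smoothness term; both act as the interactive consistency controls.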
Ablation: per-frame stylized; only global consistency (𝐴𝑡 = 𝐺𝑡); only local consistency (𝐴𝑡 = 𝐿𝑡); our full approach (𝐴𝑡 as a linear combination of 𝐺𝑡 and 𝐿𝑡).
We require interactive performance, and the bottleneck in this regard is slow flow-based warping. To overcome this, we develop a fast optic-flow neural network model.
[Chart: Sintel final test EPE (horizontal axis, lower is better) vs. frames per second (vertical axis, higher is better) for GMA, RAFT, VCN, LiteFlowNet2, PWC-Net, FlowNet2, ARFlow, SPyNet, and ours.]
Neural network compression steps: (a) remove DenseNet connections, (b) remove the last flow estimator, (c) separable convolutions in the refinement stage, (d) prune 40% of the channels.
This results in a speedup factor of approx. 2.8×, from 30 FPS to 85 FPS on an RTX 2080.
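To get a feel for why steps (c) and (d) help, here is back-of-the-envelope parameter arithmetic for a single 3×3 convolution layer. The 128-channel width is illustrative, not the network's actual size.

```python
def conv_params(c_in, c_out, k=3):
    """Parameters of a dense k×k convolution (weights + bias)."""
    return c_in * c_out * k * k + c_out

def separable_conv_params(c_in, c_out, k=3):
    """Depthwise k×k followed by pointwise 1×1 (each with bias)."""
    return (c_in * k * k + c_in) + (c_in * c_out + c_out)

dense = conv_params(128, 128)                 # 147,584 parameters
separable = separable_conv_params(128, 128)   # 17,792 → roughly 8× fewer
# Pruning 40% of the channels shrinks both c_in and c_out, so a dense
# layer's parameter count drops roughly quadratically, to about 0.36×.
pruned = conv_params(int(128 * 0.6), int(128 * 0.6))
```

Separable convolutions and channel pruning compound with the architectural removals (a) and (b), which is consistent with the overall ~2.8× end-to-end speedup reported above.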
[Chart: runtime performance on an RTX 3090, time in milliseconds for optical flow, stabilization, and total, at 640×480 px, 1280×720 px, 1920×1080 px, and 1920×1080 px (fast preset).]
“Fast preset” = downscale the flow computation by 2× and use only 50 iterations of the stabilization optimization instead of 150 (approx. 25 fps at 1920×1080).
Visual comparisons (two example sequences): per-frame stylized; Bonneel et al. [SIGGRAPH Asia 2015]; Lai et al. [ECCV 2018]; ours.
For 19 participants and 9 different videos, we compare our method against Bonneel et al., Lai et al., and Ours-objective through 171 randomized A/B tests per comparison (19 participants × 9 videos). We ask the participants to select the output which best preserves (i) temporal consistency and (ii) similarity with the per-frame processed video.

| Comparison | Ours preferred | Other preferred |
| --- | --- | --- |
| vs. Lai et al. | 132 | 39 |
| vs. Bonneel et al. | 128 | 43 |
| vs. Ours-objective* | 127 | 44 |

*Ours-objective = the parameter setting that performs best on objective benchmarks (vs. Ours = subjectively determined parameters).
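As a quick sanity check on these counts (our addition, not part of the study itself), a simple two-choice sign test against the null hypothesis that both outputs are equally preferred, using the reported 171 trials per comparison:

```python
from math import comb

def sign_test_p(n, k):
    """One-sided p-value P[X ≥ k] for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Preference counts (ours vs. baseline) out of 171 trials per comparison.
for baseline, ours in [("Lai", 132), ("Bonneel", 128), ("Ours-objective", 127)]:
    print(f"{baseline}: {ours / 171:.1%} prefer ours, p = {sign_test_p(171, ours):.2e}")
```

All three preference rates sit around 75% and are far above chance (p ≪ 0.001); the paper may of course report its own statistical analysis.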
Results (two example sequences): per-frame processed vs. stabilized (ours).
Lowering 𝑘1 or 𝜆 and increasing 𝛼 can remove such residual artifacts.
Prompt: 1920s car in a roundabout, old movie.
Per-frame processed (Img2Img Stable Diffusion) vs. stabilized (ours).
• By combining local and global consistency aspects, we can achieve consistency while preserving stylization
• Reasonable flow accuracy from a lightweight flow network is enough to make stylized videos consistent
• Existing objective metrics for temporal consistency do not capture subjective preference
• We propose the first approach that provides interactive consistency control for per-frame stylized videos
• A novel temporal consistency term that combines local and global consistency aspects
• Fast optical-flow inference is achieved by developing a lightweight flow network architecture based on PWC-Net
• The entire pipeline is GPU-based and can handle video streams at full-HD resolution
Future Work
• Use learning-based temporal denoising for local consistency to further improve the quality of results
• Explore the usage of depth-based and saliency-based masks to spatially vary consistency
Thank you!
Website and Code!