Interactive Control over Temporal Consistency while Stylizing Video Streams
Problem: Per-frame stylization of videos often leads to temporal flickering
[Video: input vs. per-frame stylization]
Further, most of the techniques do not provide consistency control
Style-specific offline processing:
• Processing Images and Video for an Impressionist Effect, Peter Litwinowicz, SIGGRAPH, 1997.
• Video Watercolorization using Bidirectional Texture Advection, Bousseau et al., Transactions on Graphics, 2007.
• Stylizing Animation by Example, Bénard et al., Transactions on Graphics, 2013.
• Stylizing Video by Example, Jamriška et al., Transactions on Graphics, 2019.
Temporal inconsistency can add to the artistic look and feel (Fišer et al., Color Me Noisy: Example-based Rendering of Hand-colored Animations with Temporal Noise Control, EGSR 2014).
Motivation: to cater to the needs of live video streaming or conferencing.
Stylizing a live video conferencing session
Src: https://towardsdatascience.com/fancy-and-custom-neural-style-transfer-filters-for-video-conferencing-7eba2be1b6d5
Characteristics of a practical tool for stylizing video streams:
• Handles a wide range of stylization techniques
• Provides interactive temporal consistency control
• Processes high-resolution video streams with low latency
| Aspect | Bonneel et al., SIGGRAPH 2015 | Yao et al., MM 2017 | Lai et al., ECCV 2018 | Shekhar et al., VMV 2019 | Thiomonier et al., ICME 2021 | Ours |
| --- | --- | --- | --- | --- | --- | --- |
| Requires pre-processing? | No | Yes | No | Yes | No | No |
| Provides consistency control? | Yes | No | No | Yes | No | Yes |
| Provides interactive consistency control? | No | N/A | N/A | Yes | N/A | Yes |

(N/A = Not Applicable)
These blind approaches do not require knowledge of the underlying stylization technique. However, what about interactive consistency control?
Bonneel et al., Blind Video Temporal Consistency, SIGGRAPH 2015
Yao et al., Occlusion-aware Video Temporal Consistency, MM 2017
Lai et al., Learning Blind Video Temporal Consistency, ECCV 2018
Shekhar et al., Consistent Filtering of Videos and Dense Light-Fields Without Optic-Flow, VMV 2019
Thiomonier et al., Learning Long Term Style Preserving Blind Video Temporal Consistency, ICME 2021
Temporal consistency (𝜆)
[Pipeline overview: (1) compute the weights 𝑤𝑝 and 𝑤𝑛 from the input frames; (2) a linear combination of the warped stylized frames yields 𝐿𝑡; (3) warping the previous output 𝑂𝑡−1 yields 𝐺𝑡.]

Input:
𝐼𝑡−1, 𝐼𝑡, 𝐼𝑡+1 -- input images at time instances 𝑡 − 1, 𝑡, 𝑡 + 1
𝑃𝑡−1, 𝑃𝑡, 𝑃𝑡+1 -- per-frame stylized images at time instances 𝑡 − 1, 𝑡, 𝑡 + 1
𝑂𝑡−1 -- output at the previous time instance 𝑡 − 1
Output:
𝑂𝑡 -- output at time instance 𝑡?
Input (at time instance 𝑡): per-frame stylized results 𝑃𝑡−1, 𝑃𝑡, 𝑃𝑡+1, input images 𝐼𝑡−1, 𝐼𝑡, 𝐼𝑡+1, and the previous output 𝑂𝑡−1. Γ is a warping function towards time instance 𝑡.

Global Consistency
𝐺𝑡 = Γ(𝑂𝑡−1)
• Simple yet effective
• Leads to a loss of stylization (in terms of colors and textures)
• Warping errors keep getting propagated

Local Consistency
𝑤𝑝 = exp(−𝛼‖𝐼𝑡 − Γ(𝐼𝑡−1)‖²)
𝑤𝑛 = exp(−𝛼‖𝐼𝑡 − Γ(𝐼𝑡+1)‖²)
𝐿𝑡 = 𝑤𝑝 ∙ Γ(𝑃𝑡−1) + 𝑤𝑛 ∙ Γ(𝑃𝑡+1) + (1 − 𝑤𝑝 − 𝑤𝑛) ∙ 𝑃𝑡
• Backward and forward warping reduces artifacts due to occlusion and flow inaccuracies
• Preserves local temporal variations
• Cannot reduce inconsistencies significantly
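The two consistency images can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: it assumes the frames have already been warped towards time 𝑡 by the flow-based warping function Γ (warping itself is not shown), and the value of 𝛼 is illustrative.

```python
import numpy as np

def consistency_images(I_prev_w, I_t, I_next_w,
                       P_prev_w, P_t, P_next_w, O_prev_w, alpha=10.0):
    """Per-pixel weights and the locally/globally consistent images.

    All *_w arguments are frames already warped towards time t by Γ
    (not implemented here); images are float arrays of shape (H, W, 3).
    """
    # Warping-confidence weights w_p and w_n from the input frames.
    w_p = np.exp(-alpha * np.sum((I_t - I_prev_w) ** 2, axis=-1, keepdims=True))
    w_n = np.exp(-alpha * np.sum((I_t - I_next_w) ** 2, axis=-1, keepdims=True))
    # Local consistency: blend warped neighbouring stylized frames with P_t.
    L_t = w_p * P_prev_w + w_n * P_next_w + (1.0 - w_p - w_n) * P_t
    # Global consistency: simply the warped previous output.
    G_t = O_prev_w
    return w_p, w_n, L_t, G_t
```

Where warping is perfect (𝐼𝑡 equals the warped neighbours), the weights approach 1 and 𝐿𝑡 leans fully on the warped stylized neighbours; where warping fails, the weights decay and 𝐿𝑡 falls back to 𝑃𝑡.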
[Full pipeline: (1) compute the weights 𝑤𝑝 and 𝑤𝑛 from the input frames; (2) a linear combination of the warped stylized frames yields 𝐿𝑡; (3) warping the previous output 𝑂𝑡−1 yields 𝐺𝑡; (4) a linear combination of 𝐿𝑡 and 𝐺𝑡 yields 𝐴𝑡; (5) optimization solving yields the output 𝑂𝑡. Inputs and output as defined previously.]
argmin over 𝑂𝑡 of ∫ ‖𝛻𝑂𝑡 − 𝛻𝑃𝑡‖² + 𝑤𝑠 ‖𝑂𝑡 − 𝐴𝑡‖²

Data term ‖𝛻𝑂𝑡 − 𝛻𝑃𝑡‖²: high-frequency details from the per-frame stylized 𝑃𝑡. Smoothness term ‖𝑂𝑡 − 𝐴𝑡‖²: temporally consistent content from 𝐴𝑡, scaled by the weighting parameter 𝑤𝑠. 𝑂𝑡 is the per-frame output.

• The formulation is similar to that employed by Bonneel et al. (SIGGRAPH 2015) and Shekhar et al. (VMV 2019)
• Our novelty is the way in which we construct the consistent image 𝐴𝑡
• Through an adaptive combination, the consistent image preserves both local and global consistency aspects:
𝐴𝑡 = (1 − 𝑤𝑝) ∙ 𝐿𝑡 + 𝑤𝑝 ∙ 𝐺𝑡
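The Euler–Lagrange equation of this energy is the screened Poisson equation −Δ𝑂𝑡 + 𝑤𝑠𝑂𝑡 = −Δ𝑃𝑡 + 𝑤𝑠𝐴𝑡, which the sketch below minimizes by plain gradient descent. It is illustrative only: periodic boundaries (via np.roll), step size, and iteration count are our simplifying assumptions, not the paper's GPU solver.

```python
import numpy as np

def stabilize(P_t, A_t, w_s, n_iter=200, tau=0.1):
    """Gradient-descent sketch of argmin ||∇O_t − ∇P_t||² + w_s ||O_t − A_t||².

    w_s may be a scalar or a per-pixel map; images are 2-D float arrays.
    """
    def lap(x):  # 5-point Laplacian with periodic boundaries
        return (np.roll(x, 1, 0) + np.roll(x, -1, 0)
                + np.roll(x, 1, 1) + np.roll(x, -1, 1) - 4.0 * x)

    O = P_t.copy()
    for _ in range(n_iter):
        grad = -lap(O) + lap(P_t) + w_s * (O - A_t)  # ∝ dE/dO
        O -= tau * grad
    return O
```

With 𝑤𝑠 = 0 the solution keeps the gradients of 𝑃𝑡 (starting at 𝑃𝑡 is already stationary); as 𝑤𝑠 grows, the output is pulled towards the temporally consistent 𝐴𝑡.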
• We want to invoke the smoothness term only when the warping accuracy is sufficiently high. 𝑤𝑠 is thus driven by the similarity of the combined warped input image 𝐴𝑡ᴵ to 𝐼𝑡:
𝐴𝑡ᴵ = 𝑤𝑝 ∙ Γ(𝐼𝑡−1) + 𝑤𝑛 ∙ Γ(𝐼𝑡+1) + (1 − 𝑤𝑝 − 𝑤𝑛) ∙ 𝐼𝑡
𝑤𝑠 = 𝜆 ∙ exp(−𝛼‖𝐼𝑡 − 𝐴𝑡ᴵ‖²)
• We clamp the weights 𝑤𝑝 and 𝑤𝑛 such that 0 < 𝑤𝑝 < 𝑘1, 0 < 𝑤𝑛 < 𝑘2, and 0 < 𝑘1 + 𝑘2 < 1
• We can control the degree of temporal consistency by varying 𝑘1 and 𝜆
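The clamping and the smoothness weight can be sketched as follows. As before, warped inputs are assumed to be given, and the default values of 𝜆, 𝛼, 𝑘1, 𝑘2 here are illustrative, not the paper's presets.

```python
import numpy as np

def smoothness_weight(I_t, I_prev_w, I_next_w,
                      lam=1.0, alpha=10.0, k1=0.4, k2=0.4):
    """Clamped blending weights w_p, w_n and the smoothness weight w_s.

    I_prev_w / I_next_w are input frames already warped towards time t.
    """
    assert 0.0 < k1 + k2 < 1.0
    w_p = np.exp(-alpha * np.sum((I_t - I_prev_w) ** 2, axis=-1, keepdims=True))
    w_n = np.exp(-alpha * np.sum((I_t - I_next_w) ** 2, axis=-1, keepdims=True))
    # Clamp: 0 < w_p < k1 and 0 < w_n < k2, so 1 − w_p − w_n stays positive.
    w_p = np.clip(w_p, 1e-6, k1)
    w_n = np.clip(w_n, 1e-6, k2)
    # Combination of warped inputs; w_s is high only where warping is accurate.
    A_t_I = w_p * I_prev_w + w_n * I_next_w + (1.0 - w_p - w_n) * I_t
    w_s = lam * np.exp(-alpha * np.sum((I_t - A_t_I) ** 2, axis=-1, keepdims=True))
    return w_p, w_n, w_s
```

Raising 𝑘1 lets more of the globally consistent image through, while 𝜆 scales the overall pull of the smoothness term; both act as the interactive consistency controls.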
Ablation: per-frame stylized; only global consistency (𝐴𝑡 = 𝐺𝑡); only local consistency (𝐴𝑡 = 𝐿𝑡); our full approach (𝐴𝑡 as a linear combination of 𝐺𝑡 and 𝐿𝑡).
We require interactive performance, and the bottleneck in this regard is slow flow-based warping. To overcome this, we develop a fast optic-flow neural network model.
[Chart: Sintel final test EPE (horizontal axis, lower is better) vs. frames per second (vertical axis, higher is better) for GMA, RAFT, VCN, LiteFlowNet2, PWC-Net, FlowNet2, ARFlow, SPyNet, and ours.]
Neural network compression steps: (a) remove DenseNet connections, (b) remove the last flow estimator, (c) separable convolutions in the refinement stage, (d) prune 40% of the channels.
This results in a speedup factor of approx. 2.8×, from 30 FPS to 85 FPS on an RTX 2080.
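To get a feel for why steps (c) and (d) help, here is back-of-the-envelope parameter arithmetic for a single 3×3 convolution layer. The 128-channel width is illustrative, not the network's actual size.

```python
def conv_params(c_in, c_out, k=3):
    """Parameters of a dense k×k convolution (weights + bias)."""
    return c_in * c_out * k * k + c_out

def separable_conv_params(c_in, c_out, k=3):
    """Depthwise k×k followed by pointwise 1×1 (each with bias)."""
    return (c_in * k * k + c_in) + (c_in * c_out + c_out)

dense = conv_params(128, 128)                 # 147,584 parameters
separable = separable_conv_params(128, 128)   # 17,792 → roughly 8× fewer
# Pruning 40% of the channels shrinks both c_in and c_out, so a dense
# layer's parameter count drops roughly quadratically, to about 0.36×.
pruned = conv_params(int(128 * 0.6), int(128 * 0.6))
```

Separable convolutions and channel pruning compound with the architectural removals (a) and (b), which is consistent with the overall ~2.8× end-to-end speedup reported above.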
[Chart: runtime performance on an RTX 3090, time in milliseconds for optical flow, stabilization, and total, at 640×480 px, 1280×720 px, 1920×1080 px, and 1920×1080 px (fast preset).]
“Fast preset” = downscale the flow computation by 2× and use only 50 iterations of the stabilization optimization instead of 150 (approx. 25 fps at 1920×1080).
Visual comparisons (two example sequences): per-frame stylized; Bonneel et al. [SIGGRAPH Asia 2015]; Lai et al. [ECCV 2018]; ours.
For 19 participants and 9 different videos, we compare our method against Bonneel et al., Lai et al., and Ours-objective through 171 randomized A/B tests per comparison (19 participants × 9 videos). We ask the participants to select the output which best preserves (i) temporal consistency and (ii) similarity with the per-frame processed video.

| Comparison | Ours preferred | Other preferred |
| --- | --- | --- |
| vs. Lai et al. | 132 | 39 |
| vs. Bonneel et al. | 128 | 43 |
| vs. Ours-objective* | 127 | 44 |

*Ours-objective = the parameter setting that performs best on objective benchmarks (vs. Ours = subjectively determined parameters).
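As a quick sanity check on these counts (our addition, not part of the study itself), a simple two-choice sign test against the null hypothesis that both outputs are equally preferred, using the reported 171 trials per comparison:

```python
from math import comb

def sign_test_p(n, k):
    """One-sided p-value P[X ≥ k] for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Preference counts (ours vs. baseline) out of 171 trials per comparison.
for baseline, ours in [("Lai", 132), ("Bonneel", 128), ("Ours-objective", 127)]:
    print(f"{baseline}: {ours / 171:.1%} prefer ours, p = {sign_test_p(171, ours):.2e}")
```

All three preference rates sit around 75% and are far above chance (p ≪ 0.001); the paper may of course report its own statistical analysis.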
Results (two example sequences): per-frame processed vs. stabilized (ours).
Lowering 𝑘1 or 𝜆 and increasing 𝛼 can remove such residual artifacts.
Prompt: 1920s car in a roundabout, old movie.
Per-frame processed (Img2Img Stable Diffusion) vs. stabilized (ours).
• By combining local and global consistency aspects, we can achieve consistency while preserving stylization
• Reasonable flow accuracy from a lightweight flow network is enough to make stylized videos consistent
• Existing objective metrics for temporal consistency do not capture subjective preference
• We propose the first approach that provides interactive consistency control for per-frame stylized videos
• A novel temporal consistency term that combines local and global consistency aspects
• Fast optical-flow inference is achieved by developing a lightweight flow network architecture based on PWC-Net
• The entire pipeline is GPU-based and can handle video streams at full-HD resolution
Future Work
• Use learning-based temporal denoising for local consistency to further improve the quality of results
• Explore the usage of depth-based and saliency-based masks to spatially vary consistency
Thank you!
Website and Code!