Convex-hull Estimation using XPSNR for Versatile Video Coding

Vignesh V Menon, Christian R Helmrich, Adam Więckowski, Benjamin Bross, Detlev Marpe
Video Communication and Applications Dept., Fraunhofer HHI, Germany

MHV’24
Introduction
30.10.2024 © Fraunhofer
Slide 3
HTTP Adaptive Streaming (HAS)
• HTTP Adaptive Streaming (HAS) has become the standard for delivering video content over
various internet speeds and devices [1].
• Key Components:
○ Segmented Video Content: Video is encoded at multiple quality levels and split into
segments (e.g., 2-10 seconds each).
○ Manifest Files (MPD/HLS Manifest): Provides clients with information about available
video bitrates, resolutions, and segment locations.
○ Adaptive Bitrate (ABR) Streaming: Enables real-time switching between different video
qualities based on bandwidth, device capability, and buffer state.
• Video Coding Relevance:
○ Efficient Encoding: Advanced codecs (e.g., HEVC [2], VVC [3]) enable high
compression ratios without compromising quality, allowing adaptive streaming to meet
quality and bandwidth requirements effectively.
○ Per-Title/Per-Scene Encoding: Customizes encoding for each title or scene to achieve
optimal quality at each bitrate.
○ Quality Metrics (e.g., VMAF, PSNR): Used to select encoding parameters and maintain
consistent perceptual quality across bitrates.
[1] I. Sodagar, “The MPEG-DASH Standard for Multimedia Streaming Over the Internet,” IEEE MultiMedia, vol. 18, no. 4, pp. 62–67, 2011.
[2] G. J. Sullivan et al., “Overview of the High Efficiency Video Coding (HEVC) Standard,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, 2012, pp. 1649–1668.
[3] B. Bross et al., “Overview of the Versatile Video Coding (VVC) Standard and its Applications,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, 2021, pp. 3736–3764.

Introduction
Slide 4
Coding complexity
● Modern video standards, such as Versatile Video Coding
(VVC), provide high compression efficiency, but at the cost of
significantly increased coding complexity.
● VVC Complexity: Up to 10x more complex than previous
standards (e.g., HEVC) [1].
● Challenge:
○ Computational Demands: Higher resolution and quality
requirements increase both encoding and decoding
time, impacting streaming latency and energy efficiency
[2].
● Complexity Metrics:
○ Encoding Time: Increased due to finer partitioning,
more prediction modes, and higher-level coding tools.
○ Decoding Time: High complexity impacts playback on
devices with limited processing power.
Figure: Rate-distortion (RD) and rate-decoding time curves of representative
sequences of Inter-4K dataset [3] encoded at 540p, 1080p, and 2160p resolutions
using VVenC at the faster preset, and decoded using VTM decoder.
[1] R. Kaafarani et al., “Evaluation Of Bitrate Ladders For Versatile Video Coder,” in 2021 International Conference on Visual Communications and Image Processing (VCIP), 2021, pp. 1–5.
[2] V. V. Menon et al., “Video Super-Resolution for Optimized Bitrate and Green Online Streaming,” in 2024 Picture Coding Symposium (PCS), 2024.
[3] A. Stergiou and R. Poppe, “AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling,” in IEEE Transactions on Image Processing, vol. 32, 2023, pp. 251–266.

Introduction
Slide 5
Limitations of existing metrics
● PSNR [1] measurements are based on the picture-wise mean squared error (MSE)
○ Normalized to the bit depth of the assessed images
○ Usually, per-picture PSNR values are averaged over the video
○ Problem: the sensitivity of the human visual system (HVS) to distortion is not constant
● SSIM [2] is a multiplicative combination of luminance, contrast and structure
similarities
● VMAF [3] applies several VQA models
○ Model results fused using machine learning
○ High correlation between VMAF and subjective scores.
○ Problem: high computational complexity, not differentiable, fails for
VVC-coded UHD content [4, 5].
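As a minimal sketch of the picture-wise PSNR described above, the following Python computes the MSE of each picture, normalizes it by the peak value for the given bit depth, and averages the per-picture values over the video. The function names and the flat-list picture representation are illustrative assumptions, not part of any cited tool.

```python
import math

def picture_psnr(ref, dist, bit_depth=8):
    """PSNR (dB) of one picture; ref and dist are equal-length flat lists of samples."""
    if len(ref) != len(dist) or not ref:
        raise ValueError("pictures must be non-empty and of equal size")
    mse = sum((r - d) ** 2 for r, d in zip(ref, dist)) / len(ref)
    if mse == 0:
        return math.inf  # identical pictures
    peak = (1 << bit_depth) - 1  # normalization to the bit depth
    return 10.0 * math.log10(peak * peak / mse)

def video_psnr(ref_frames, dist_frames, bit_depth=8):
    """Arithmetic mean of the per-picture PSNR values of a video."""
    values = [picture_psnr(r, d, bit_depth) for r, d in zip(ref_frames, dist_frames)]
    return sum(values) / len(values)
```

Because every picture is weighted identically and every sample error counts the same, this measure cannot reflect that the HVS is more tolerant of distortion in some regions than in others.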
[1] Alliance For Telecommunications Industry Solutions, “Objective Video Quality Measurement Using A Peak-Signal-To-Noise-Ratio (PSNR) Full Reference Technique,” T1.TR.74-2001, 2001.
[2] Z. Wang et al., “Image quality assessment: from error visibility to structural similarity,” in IEEE Transactions on Image Processing, vol. 13, no. 4, 2004, pp. 600–612.
[3] Z. Li et al., “VMAF: The journey continues,” Netflix Technology Blog, vol. 25, 2018.
[4] M. Wien and V. Baroncini, “Report on VVC compression performance verification testing in the SDR UHD Random Access Category,” in WG 05 MPEG Joint Video Coding Team(s) with ITU-T SG 16, document JVET-T0097, Oct. 2020.
[5] C. R. Helmrich et al., “Information on and analysis of the VVC encoders in the SDR UHD verification test,” in WG 05 MPEG Joint Video Coding Team(s) with ITU-T SG 16, document JVET-T0103, Oct. 2020.
Figure: VMAF computation [3].

Introduction
Slide 6
Objective
Aim:
● To enhance the efficiency and quality of HTTP Adaptive Streaming in Versatile Video
Coding (VVC) using an improved perceptual quality metric, XPSNR, as a reliable
alternative to traditional metrics like VMAF.
Core Goals:
● Perceptual Quality Optimization: Improve perceptual video quality by leveraging
XPSNR, which shows higher correlation with subjective quality scores for UHD
content.
● Efficient Bitrate-Resolution Selection: Develop a method for estimating the convex
hull online to determine optimal bitrate-resolution pairs, balancing quality and
computational efficiency.
● Resource Optimization: Reduce encoding and decoding times through effective
bitrate-resolution adaptation, allowing for lower energy consumption and enhanced
user experience.

XPSNR
Slide 8
Introduction
● XPSNR (eXtended Peak Signal-to-Noise Ratio) is an advanced VQA metric.
● Extension of PSNR: While traditional PSNR lacks correlation with human perception, XPSNR incorporates
perceptual considerations, making it more aligned with subjective video quality judgments.
● Low complexity: Designed to keep computational complexity low, XPSNR is efficient for real-time
encoding tasks, particularly for high-resolution content such as UHD.
● Defined block-wise for block B (at index k) of input picture pic.
● Also depends on image width W, height H, and bit depth BD.
● Core: a visual activity measure obtained from spatio-temporal high-pass filtering.
● Only low-complexity operations are needed, since squares and square roots cancel.
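The block-wise definition above can be sketched in the following general form, as it is usually stated in the XPSNR literature [3]. The symbols s (input picture), s_c (coded picture), w_k (weight of block B_k), a_k (visual activity of block B_k), and a_pic (a picture-level normalization that depends on W, H, and BD) are notation assumed here for illustration; the exact constants and filter definitions should be taken from [3].

```latex
\mathrm{XPSNR} = 10 \log_{10}
  \frac{W \cdot H \cdot \left(2^{BD} - 1\right)^{2}}
       {\sum_{k} w_k \sum_{[x,y] \in B_k} \left(s[x,y] - s_c[x,y]\right)^{2}},
\qquad
w_k = \sqrt{\frac{a_{\mathrm{pic}}}{a_k}}
```

Blocks with high visual activity a_k receive a small weight w_k, so distortion in busy, fast-changing regions counts less than the same distortion in smooth regions. With all w_k equal to 1, the expression reduces to conventional PSNR.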
[1] C. R. Helmrich et al., “Information on and analysis of the VVC encoders in the SDR UHD verification test,” in WG 05 MPEG Joint Video Coding Team(s) with ITU-T SG 16, document JVET-T0103, Oct. 2020.
[2] M. Wien and V. Baroncini, “Report on VVC compression performance verification testing in the SDR UHD Random Access Category,” in WG 05 MPEG Joint Video Coding Team(s) with ITU-T SG 16, document JVET-T0097,
Oct. 2020.
[3] C. R. Helmrich et al., “A study of the extended perceptually weighted peak signal-to-noise ratio (XPSNR) for video compression with different resolutions and bit depths,” in ITU Journal: ICT Discoveries, vol. 3, May 2020.
[Online] http://handle.itu.int/11.1002/pub/8153d78b-en

XPSNR
Slide 9
Improvement over VMAF
Figure: Comparison of computation times of
quality metrics for UHD videos with 240 frames.
Table: Evaluation of Pearson linear correlation. Higher
values imply higher correlation with MOS scores [1].
Table: Evaluation of Spearman rank correlation [1].
[1] C. R. Helmrich et al., “Information on and analysis of the VVC encoders in the SDR UHD verification test,” in WG 05 MPEG Joint Video Coding Team(s) with ITU-T SG 16, document JVET-T0103, Oct. 2020.

VEXUS Architecture
Slide 11
Convex-hull
● The convex hull is the set of encoding points that achieve “Pareto efficiency”: no other resolution delivers higher quality at the same bitrate [1, 2].
● Online convex-hull estimation methods provide a dynamic and adaptive means to optimize bitrate and resolution selections [2].
Figure: Conceptual plot depicting the bitrate-quality relationship for any video source encoded at different resolutions [2].
Figure: Rate-XPSNR curves and decoding times of example video sequences from the employed dataset, Inter4K [3], encoded using VVenC [4] and decoded using VVdeC [5].
[1] R. Kaafarani et al., “Evaluation Of Bitrate Ladders For Versatile Video Coder,” in 2021 International Conference on Visual Communications and Image Processing (VCIP), 2021, pp. 1–5.
[2] A. Aaron et al., “Per-title encode optimization,” The Netflix Techblog, 2015.
[3] A. Stergiou and R. Poppe, “AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling,” in IEEE Transactions on Image Processing, vol. 32, 2023, pp. 251–266.
[4] A. Wieckowski et al., “VVenC: An Open And Optimized VVC Encoder Implementation,” in 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Jul. 2021, pp. 1–2.
[5] A. Wieckowski et al., “Towards a live software decoder implementation for the upcoming versatile video coding (VVC) codec,” in Proc. IEEE International Conference on Image Processing (ICIP), pp. 3124–3128.
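The Pareto-efficiency notion behind the convex hull can be illustrated with a small Python sketch. Given (bitrate, quality, resolution) points from encodes at several resolutions, it first keeps the Pareto-efficient points and then reduces them to the upper convex hull. The tuple layout and function names are assumptions made for this example; this is not the VEXUS implementation, which estimates the hull from predicted XPSNR rather than from actual encodes.

```python
def _cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b in the (bitrate, quality) plane."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def pareto_front(points):
    """Keep points whose quality exceeds that of every point with lower or equal bitrate."""
    front = []
    best_quality = float("-inf")
    # Sort by bitrate ascending; on ties, put the higher quality first.
    for p in sorted(points, key=lambda p: (p[0], -p[1])):
        if p[1] > best_quality:
            front.append(p)
            best_quality = p[1]
    return front

def upper_convex_hull(points):
    """Upper convex hull of the Pareto-efficient points, scanned from low to high bitrate."""
    hull = []
    for p in pareto_front(points):
        # Drop the last hull point while it lies on or below the segment hull[-2] -> p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

# Invented example values: (bitrate in kbps, quality in dB, resolution label).
example = [
    (1000, 30.0, "540p"), (1000, 28.0, "1080p"),
    (2000, 31.0, "540p"), (2000, 35.0, "1080p"),
    (3000, 36.0, "1080p"), (4000, 40.0, "2160p"),
]
hull = upper_convex_hull(example)
```

In this invented example the 1080p encode at 3000 kbps is Pareto-efficient but lies below the line joining its neighbours, so it is removed from the hull.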

VEXUS Architecture
Slide 12
Components:
● Input parameters:
○ Set of supported resolutions,
○ Set of bitrates,
○ Maximum supported resolution.
● Outputs:
○ Optimized encoding bitrate ladder.
Workflow
Figure: Convex-hull estimation and encoding architecture using VEXUS.

Spatiotemporal complexity feature extraction
Slide 13
We use seven DCT-energy-based features extracted using the Video
Complexity Analyzer (VCA) [1, 2]:
● average luma texture energy (EY),
● average gradient of the luma texture energy (h),
● average luma brightness (LY),
● average chroma texture energy of the U and V channels (EU and EV),
● average chroma brightness of the U and V channels (LU and LV).
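To make the idea of DCT-energy-based features concrete, here is a simplified pure-Python sketch for a single channel. It computes a weighted sum of the absolute AC coefficients of a block as its texture energy, derives brightness from the DC coefficient, and takes the mean absolute change of block energies between consecutive frames as the temporal gradient. The exponential weighting, the square-root brightness, and all function names are assumptions for illustration; VCA's actual implementation is optimized and may differ in detail.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)
    scale = [math.sqrt(1.0 / n)] + [math.sqrt(2.0 / n)] * (n - 1)
    cos = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
           for u in range(n)]
    return [[scale[u] * scale[v] * sum(block[x][y] * cos[u][x] * cos[v][y]
                                       for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]

def block_features(block):
    """Return (texture_energy, brightness) for one block (assumed weighting)."""
    n = len(block)
    coeffs = dct2(block)
    energy = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i + j > 2:  # skip the DC coefficient
                weight = math.exp(abs((i * j / n ** 2) ** 2 - 1))
                energy += weight * abs(coeffs[i - 1][j - 1])
    brightness = math.sqrt(abs(coeffs[0][0]))
    return energy, brightness

def temporal_gradient(prev_energies, cur_energies):
    """Mean absolute difference of co-located block energies in consecutive frames."""
    diffs = [abs(c - p) for p, c in zip(prev_energies, cur_energies)]
    return sum(diffs) / len(diffs)
```

Averaging the per-block energies and brightness values over a frame, and then over a segment, gives segment-level quantities that play the role of EY and LY for luma (or EU, EV, LU, LV for chroma); averaging the temporal gradient plays the role of h.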
[1] V. V. Menon et al., “Green Video Complexity Analysis for Efficient Encoding in Adaptive Video Streaming,” in First International ACM Green Multimedia Systems Workshop (GMSys ’23), 2023.
[2] V. V. Menon et al., "JND-Aware Two-Pass Per-Title Encoding Scheme for Adaptive Live Streaming," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 2, pp. 1281-1294, Feb. 2024, doi:
10.1109/TCSVT.2023.3290725.
Figure: Heatmap of EY (left) and h (right).

XPSNR-optimized resolution prediction
Slide 14
● Process: Two-part approach involving modeling and optimization.
● Goal: Maximize perceptual quality.
The perceptual quality of the representation (rt, bt) relies on the extracted video complexity features, the encoding resolution, and the target
bitrate:
● A higher resolution and/or bitrate may improve the quality.
● The LY, LU, and LV features consider variations in luminance and chrominance within localized regions.
● The EY, EU, and EV features account for variations in texture across different frame regions, providing insights into how well the
compression or reconstruction method preserves textural details.
● h adds a temporal dimension, capturing dynamic changes in texture over time that influence perceived quality.
Modeling:
Optimization:
VEXUS optimizes the perceptual quality (in terms of XPSNR) of encoded video segments.
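The modeling and optimization steps can be read as: predict XPSNR for every candidate resolution at a given target bitrate, then keep the resolution with the highest prediction. The Python sketch below shows that selection loop. The names `select_ladder` and `toy_predictor` are hypothetical, and the toy predictor is an invented stand-in for a trained model, not fitted to any data.

```python
import math

def select_ladder(features, bitrates, resolutions, predict_xpsnr, r_max=None):
    """For each target bitrate, pick the resolution with the highest predicted XPSNR."""
    candidates = [r for r in resolutions if r_max is None or r <= r_max]
    if not candidates:
        raise ValueError("no supported resolution at or below r_max")
    ladder = []
    for b in bitrates:
        scores = {r: predict_xpsnr(features, r, b) for r in candidates}
        best = max(scores, key=scores.get)
        ladder.append((b, best, scores[best]))
    return ladder

def toy_predictor(features, r, b):
    """Hypothetical stand-in for the trained XPSNR model (invented numbers)."""
    sweet_spot = {540: 500, 1080: 2000, 2160: 8000}  # kbps, illustrative only
    return 40.0 + 2.0 * math.log2(b / 500) - 3.0 * abs(math.log2(b / sweet_spot[r]))

ladder = select_ladder({}, [500, 2000, 8000], [540, 1080, 2160], toy_predictor)
```

Passing `r_max` restricts the search to resolutions up to the maximum supported resolution, which corresponds to the input parameter of the same name in the architecture.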

Optimized QP prediction
Slide 15
Modeling:
The optimized QP is modeled as a function of the spatiotemporal features, the target bitrate b, and the normalized resolution height r′.
● For applications such as streaming, avoiding exceeding the maximum bitrates specified in the HLS/DASH manifests [1, 2] during
the encoding process is essential.
● Failure to adhere to these limits can lead to buffer overflows or underflows in video players.
[1] I. Sodagar, “The MPEG-DASH Standard for Multimedia Streaming Over the Internet,” IEEE MultiMedia, vol. 18, no. 4, pp. 62–67, 2011.
[2] A. Bentaleb, B. Taani, A. C. Begen, C. Timmerer, and R. Zimmermann, “A Survey on Bitrate Adaptation Schemes for Streaming Media Over HTTP,” IEEE Communications Surveys & Tutorials, vol. 21, no. 1, pp. 562–585, 2019.
QP optimization:
The optimization function aims to predict the QP, minimizing the discrepancy between the predicted and target bitrate for a given
resolution.

Optimized QP prediction
Slide 16
● The equation captures the non-linear relationship between bitrate and QP by employing a
logarithmic mapping of the bitrate values.
● VVenC introduced capped-VBR rate control in its January 2024 release [1]. The QP is specified
using the qp option, while the maxrate (easy mode) or MaxBitrate (expert mode) option
specifies the upper bound on bitrate variability.
● This method involves training distinct XGBoost regression models for the minimum and maximum QP values (qmin and qmax,
respectively).
● The optimized QP for a target bitrate b is then determined using linear regression between qmin and qmax over the log-scaled bitrate.
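One way to read this step is as an interpolation in the log-bitrate domain between the two predicted QP endpoints. The Python sketch below assumes that qmax corresponds to the lowest bitrate of the ladder and qmin to the highest, and that QP falls linearly with the logarithm of the bitrate in between. These assumptions, the clamping, and the function name are illustrative; the exact formulation used by VEXUS may differ.

```python
import math

def interpolate_qp(b, b_min, b_max, q_min, q_max):
    """Estimate the QP for target bitrate b by linear interpolation over log(bitrate).

    Assumes q_max is the QP at b_min and q_min is the QP at b_max.
    """
    if not 0 < b_min < b_max:
        raise ValueError("expected 0 < b_min < b_max")
    b = min(max(b, b_min), b_max)  # clamp to the modelled bitrate range
    t = (math.log(b) - math.log(b_min)) / (math.log(b_max) - math.log(b_min))
    return round(q_max - t * (q_max - q_min))
```

For example, with a ladder spanning 1000 to 16000 kbps and predicted endpoints qmin = 20 and qmax = 40, a target of 4000 kbps sits halfway on the log axis and maps to a QP midway between the two endpoints.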
Figure: QP versus normalized bitrate (in log
scale) for a representative video segment.
[1] C. Helmrich, V. George, V. V. Menon, A. Wieckowski, B. Bross, and D. Marpe, “Fast constant-quality video encoding using VVenC with rate capping based on pre-analysis statistics”, 2024 IEEE International
Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2024, pp. 1823-1828, doi: 10.1109/ICIP51287.2024.10647456.
Cascaded Approach

Experimental setup
Slide 18
● We used 1000 videos of the Inter-4K dataset [1] to validate the performance of
the encoding methods.
● We encoded the sequences at UHD (2160p) 60fps using VVenC v1.10 [2]
using preset 0 (faster).
● We extracted the spatiotemporal features using VCA v2.0.
● We ran constant quality encoding by varying qp values from qmin to qmax for
each resolution.
● We computed full-reference PSNR and XPSNR quality metrics after the
compressed video was upscaled to the original resolution (2160p).
Table: Experimental parameters used to evaluate VEXUS.
[2] A. Więckowski, J. Brandenburg, T. Hinz, C. Bartnik, V. George, G. Hege, C. Helmrich, A. Henkel, C. Lehmann, C. Stoffers, I. Zupancic, B. Bross, and D. Marpe, “VVenC: An Open And Optimized
VVC Encoder Implementation,” in Proc. IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–2.
Figure: Calculation of the ground-truth PSNR, XPSNR, and bitrate used to train
the prediction models. This example shows the ground-truth calculation for a
video encoded at 1080p with qp 30.
Dataset

Experimental setup
Slide 19
[1] Apple Inc., “HLS Authoring Specification for Apple Devices.” [Online]. Available: https://developer.apple.com/documentation/http-live-streaming/hls-authoring-specification-for-apple-devices
[2] J. De Cock et al., “Complexity-based consistent-quality encoding in the cloud,” in 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 1484–1488.
[3] C. Chen et al., “Optimized Transcoding for Large Scale Adaptive Streaming Using Playback Statistics,” in 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp.
3269–3273.
Benchmarks
1. Default: This method employs fixed-resolution encoding, i.e., all
bitstreams are encoded at the same resolution as the input video.
2. FixedLadder: This method employs a fixed set of bitrate-resolution pairs.
We use the HLS bitrate ladder specified in the Apple authoring
specifications [1] as the fixed set of bitrate-resolution pairs.
3. Bruteforce: This method determines the optimized resolution, i.e., the one that yields the
highest XPSNR for a given target bitrate, after an exhaustive encoding
process at all supported resolutions and QPs [2, 3].
Table: An example fixed bitrate-ladder, i.e., set of
bitrate-resolution pairs. Source: [1]

Prediction analysis
Slide 21
● Accuracy:
○ XPSNR prediction
■ MAE: 0.17 dB, R²: 0.99
■ Std. dev: 0.22 dB
○ QP prediction
■ MAE: 1.32, R²: 0.97
■ Std. dev: 1.86
● Speed of feature extraction: 176 fps
● Model inference time: 4 ms
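For reference, the accuracy figures above are standard regression metrics. The sketch below shows how MAE, R², and the standard deviation of the prediction error can be computed in plain Python. Treating "Std. dev" as the population standard deviation of the signed error is an assumption, since the slide does not say which variant is reported.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def error_std(y_true, y_pred):
    """Population standard deviation of the signed prediction error (assumed definition)."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_err = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean_err) ** 2 for e in errors) / len(errors))
```

Here `y_true` would hold the ground-truth XPSNR (or QP) values from actual encodes and `y_pred` the corresponding model outputs.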
Figure: Prediction results of the XPSNR prediction model.
● XPSNR prediction model
○ XGBoost regressor [1] is trained for each supported resolution.
○ Hyperparameters: max_depth=10, and n_estimators=400.
● QP prediction models
○ XGBoost regressor is trained for qmin and qmax for each supported resolution.
[1] T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2016,
pp. 785–794.

Rate-distortion Analysis
Slide 22
The RD curve of VEXUS closely mirrors that of the Bruteforce method, indicating that its predictive modeling effectively approximates
the optimized resolutions and QPs.
Figure: RD curves of representative video sequences using Default (green line), FixedLadder (blue line), Bruteforce (black line), and VEXUS (red line) encodings.

Encoding and decoding times
Slide 23
Figure: Encoding and decoding times of representative video sequences using Default (green line), FixedLadder (blue line),
Bruteforce (black line), and VEXUS (red line) encodings.
● Encoding and decoding times drop at lower bitrates, where lower resolutions are selected and both operations become
less complex.

Result Summary
Slide 24
● Coding efficiency (in terms of Bjøntegaard Delta rates [1]) and the encoding and decoding times all decrease as rmax decreases.
● The trade-off between quality and coding efficiency is based on the target audience, delivery platform, and available resources.
[1] HSTP-VID-WPOM, “Working practices using objective metrics for evaluation of video coding efficiency experiments,” International Telecommunication Union, 2020. [Online].
Available: http://handle.itu.int/11.1002/pub/8160e8da-en
Table: Average results of the encoding schemes compared to Default encoding.

Conclusions
Slide 26
● XPSNR demonstrates a better correlation with subjective quality scores for VVC-coded UHD content.
● Leveraging this insight, we introduced an approach where XPSNR is predicted for VVC-coded bitstreams using spatiotemporal
complexity features of the video and the target encoding configuration.
● We proposed VEXUS, where the convex-hull is estimated online using the predicted XPSNR.
● On average, VEXUS yields an improvement of 5.84 dB in PSNR and 0.62 dB in XPSNR at the same bitrates compared to
conventional UHD encoding with the VVenC encoder, together with a 44.43% reduction in overall encoding time and a 65.46% reduction
in overall decoding time using the VTM decoder.
Open-source tools:
1. VVC encoder: Fraunhofer Versatile Video Encoder (VVenC) v1.10
Available: https://github.com/fraunhoferhhi/vvenc
2. VVC decoder: VTM reference decoder v22.0
Available: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM
3. Spatiotemporal feature extractor: Video Complexity Analyzer (VCA) v2.0
Available: https://github.com/cd-athena/VCA
4. Convex-hull estimation framework:
Available: https://github.com/PhoenixVideo/QADRA

Thank you for your attention
— ▪ Vignesh V Menon (vignesh.menon@hhi.fraunhofer.de)
