End-to-End
Quality of Experience
Evaluation for
HTTP Adaptive Streaming

Babak Taraghi
Univ.-Prof. DI Dr. Christian Timmerer
Assoc.-Prof. DI Dr. Mathias Lux
Assoc.-Prof. DI Dr. Klaus Schöffmann
Assoc.-Prof. DI Dr. Ali Cengiz Begen
Class of 2020
ATHENA Christian Doppler (CD) Laboratory
ITEC - Institute of Information Technology
Agenda
• Introduction and Context (9 minutes)
• Evaluation Frameworks (8 minutes)
• Studies on QoE Impacting Factors
(12 minutes)
• Comprehensive Dataset Presentation
(7 minutes)
• Highlights and Future Directions
(3 minutes)
• Q&A
Introduction
Context, HAS and QoE
Research Questions
Research Methodology
Contributions and Publications
HTTP Adaptive Streaming I
Figure 1: HTTP Adaptive Streaming (HAS) concept and how the delivered quality of segments depends on the shape of the network.
4
• Provisioning
– Codecs and Encoders, Encryptors
• Delivery
– Network Protocols, and
Topologies
• Consumption
– Media players and ABR
algorithms
HTTP Adaptive Streaming II
Consumption
Delivery
Provisioning
5
End-to-end
Aspect
Quality of Experience I
The degree of delight or annoyance of the user of
an application or service. It results from the
fulfilment of his or her expectations with respect to
the utility and/or enjoyment of the application or
service in the light of the user’s personality and
current state. – Brunnström et al. [27]
6
Quality of Experience II
• How can we evaluate or measure the user's degree of annoyance or delight?
– Objective Evaluation
• Understand and formulate the metrics
– Start-up Delay: How long does it take for the user to see the first frame
of the video from the moment s/he clicks the play button?
– Delivered Media Quality: What is the delivered media quality
at each moment and on average?
• E.g.: VMAF, Resolution, and Bitrate
– Stall Events (rebuffering): How many times a
stall event happens and for how long?
• Using quality models
– Subjective Evaluation
• Investigate the quality as perceived by the user
– Conduct evaluation with human subjects
7
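The objective metrics above can be derived directly from player event logs. The following sketch assumes a hypothetical log format of (timestamp, event) pairs; it is not tied to any particular player's telemetry schema.

```python
# Sketch: deriving objective QoE metrics from a player event log.
# The log format (timestamp in seconds, event name) is hypothetical.

def qoe_metrics(events):
    """Compute start-up delay and stall statistics from (time, event) pairs."""
    startup_delay = None
    stalls = []          # durations of completed stall events
    stall_start = None
    play_clicked = None
    for t, ev in events:
        if ev == "play_clicked":
            play_clicked = t
        elif ev == "first_frame" and play_clicked is not None:
            startup_delay = t - play_clicked
        elif ev == "stall_begin":
            stall_start = t
        elif ev == "stall_end" and stall_start is not None:
            stalls.append(t - stall_start)
            stall_start = None
    return {
        "startup_delay": startup_delay,
        "stall_count": len(stalls),
        "total_stall_duration": sum(stalls),
    }

log = [(0.0, "play_clicked"), (1.8, "first_frame"),
       (30.0, "stall_begin"), (32.5, "stall_end")]
print(qoe_metrics(log))
```

Delivered media quality (e.g., VMAF per segment) would be accumulated the same way, from quality-switch events carrying the active representation.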
Research Questions
RQ1) How can we design, develop, and deploy
scalable end-to-end QoE evaluation
groundwork for HAS, encompassing both
video-on-demand content and low-latency live
streaming?
RQ2) What are the perceptual factors
influencing QoE, and how can they be effectively
evaluated through subjective assessment
methods in HAS? And how well do existing quality
models align with the findings derived from
the subjective assessments?
8
• Empirical Research Methodology
– An approach to investigation that relies on direct or indirect observation and experience to
gather data and generate knowledge. It involves systematically collecting and analysing
empirical evidence, such as measurements, experiments, and observations, to test
hypotheses and validate theories.
– Data-driven Assessment
– Real-world Evaluation and User-Centric Perspective
– Allows Objective and Subjective Measures
• Objective: Unbiased and quantifiable, using predetermined criteria
and standards [9]
• Subjective: The process of assessment based on personal
opinions, feelings, or individual judgments [9]
– Supports Iterative Improvement
– Helps with Industry and Standardization
Research Methodology
9
10
CAdViSE: cloud-based adaptive video streaming evaluation framework for the automated testing of media players.
In Proceedings of the 11th ACM Multimedia Systems Conference (MMSys), 2020
Understanding Quality of Experience of Heuristic-based HTTP Adaptive Bitrate Algorithms. In Proceedings of the
31st ACM Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV), 2021
INTENSE: In-Depth Studies on Stall Events and Quality Switches and Their Impact on the Quality of Experience in
HTTP Adaptive Streaming. In IEEE Access, 2021
Multi-codec ultra high definition 8K MPEG-DASH dataset. In Proceedings of the 13th ACM Multimedia Systems
Conference (MMSys), 2022
LLL-CAdViSE: Live Low-Latency Cloud-Based Adaptive Video Streaming Evaluation Framework. In IEEE Access, 2023
Contributions
Evaluation Frameworks
11
CAdViSE: Cloud-based Adaptive Video
Streaming Evaluation
LLL-CAdViSE: Live Low-Latency Cloud-based Adaptive Video Streaming Evaluation
• Media Players Evaluation with CAdViSE
• Live Low-latency Evaluation with LLL-CAdViSE
Use Cases
• A Quality of Experience evaluation framework for HTTP Adaptive Streaming
– Facilitates an organized and structured evaluation
• The test environment remains the same; therefore, results can be
interpreted as improved or degraded performance
– It is cloud-based, since scalability is a key factor
– Can assess multiple ABR algorithms and media players simultaneously
– Simulates network conditions; accepts network traces as plugins
• Mimics real-world network characteristics
– Provides unified insights into quality metrics
• Measures raw metrics
• Works seamlessly with analytic tools (graphs and plots)
CAdViSE (What?)
12
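The trace-plugin idea can be sketched as a mapping from a bandwidth trace to Linux traffic-control commands. The trace format (duration in seconds, bandwidth in kbit/s) and the interface name are assumptions for illustration; CAdViSE's actual plugin interface may differ.

```python
# Sketch: turning a network trace into Linux `tc` token-bucket commands,
# in the spirit of CAdViSE's network-emulation plugins. Trace format and
# interface name are illustrative assumptions.

def trace_to_tc(trace, iface="eth0"):
    """Return (hold_seconds, command) pairs that replay a bandwidth trace."""
    cmds = []
    verb = "add"  # first invocation creates the qdisc, later ones change it
    for duration, kbps in trace:
        cmds.append((duration,
                     f"tc qdisc {verb} dev {iface} root tbf "
                     f"rate {kbps}kbit burst 32kbit latency 400ms"))
        verb = "change"
    return cmds

# A toy fluctuation profile: 4 Mbit/s for 30 s, then 800 kbit/s for 30 s.
for hold, cmd in trace_to_tc([(30, 4000), (30, 800)]):
    print(f"hold {hold}s -> {cmd}")
```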
• Application Layer
– Runner, Initializer and Starter scripts
– Written with Bash Script, Python and JavaScript
• Cloud Components
– Player Container (VNC and Selenium)
– Network Emulator
– EC2 Instances, SSM Execution, DynamoDB, S3 and
CloudWatch
• Logs and Analytics
– Comprehensive Logs
– Analytic Players Plugin
CAdViSE (How?)
13
Live Low-Latency (CAdViSE)
14
Server (AWS EC2)
- Generate the live feed
- Encode
- Package (DASH & HLS)
- Ingest & Deliver
- Calculate MOS
- Manipulate network
Client (AWS EC2)
- Run media player
- Redirect requests to
server
- Record logs
- Manipulate
network
Database (AWS DynamoDB)
- Store log records
- Index the data
- Retrieve log
records
LLL-CAdViSE Console (Shell)
- Manage EC2 instances
- Initialize server and client(s)
- Execute the experiment
- Execute QoE
calculation
Preliminary Evaluation with CAdViSE
15
• 5 experiments; 9 minutes each
• AWS EC2 t2.medium instances (4 GiB RAM, 2 CPU
cores)
• Emulated network profiles: between 4 Mbit/s and 800 kbit/s
• Target Latencies: 1s, 3s, 5s, and 10s
• Two streaming formats:
– MPEG-DASH (dash.js 4.4.1)
– HLS (hls.js 1.2.0)
• ABR algorithms:
– Learn2Adapt-LowLatency (L2A-LL)
– Low-on-Latency Plus (LoLP)
• 3 Experiments of 420 seconds
• Network profiles:
– Bicycle commuter LTE network
– Car driver LTE network
– Train commuter LTE network
– Tram commuter LTE network
– Network0: up to 10 Gbps
LLL-CAdViSE Evaluation Setup
16
Figure: available bandwidth (kbps) over time (seconds, 0-410 s) for the four real-world LTE network profiles (Bicycle commuter, Car driver, Train commuter, Tram commuter), each with a polynomial trend line over the available bandwidth.
LLL-CAdViSE Evaluation Result I
17
• All time values are in seconds.
• a: Experiment title, format: [protocol]-[ABR]-[network]-
[target latency] (def: Default, l2a: L2A-LL).
• b: Average total duration of stall events.
• c: Average start-up delay.
• d: Average total duration of seek events.
• e: Average number of quality switches.
• f: Playback bitrate (min-max-avg) in kbps.
• g: Latency (min-max-avg).
• h: Playback rate (min-max-avg).
• i: Average MOS predicted by the ITU-T P.1203
quality model.
LLL-CAdViSE Evaluation Result II
18
Figure: average latency (seconds) and average P.1203 MOS per network profile (Bicycle, Car, Train, Tram, Net0) for the Default, L2A-LL, and LoLP ABR algorithms, shown for MPEG-DASH and HLS at a 5 s target latency.
Studies on QoE
Impacting Factors
19
• Exploring Adaptive Bitrate (ABR) Algorithms
• Objective and Subjective Evaluation
• Empirical Findings
Understanding Quality of Experience
• Minimum Noticeable Stall event Duration (MNSD) Evaluation
• Stall event vs. Quality level switch (SvQ) Evaluation
• Short stall events vs. a Longer stall event (SvL) Evaluation
• Relation of Stall event impact on the QoE with Video Quality level
(RSVQ) Evaluation
• Objective QoE Models Comparison
In-depth Studies on Stall Events and Quality Switches
• Throughput-based
– Uses throughput prediction heuristics to optimize streaming quality by estimating available network
bandwidth.
– Examples: PANDA, Festive, CrystalBall.
• Buffer-based
– Relies solely on buffer occupancy to make streaming decisions, aiming to prevent buffer underruns and
stalling.
– Examples: BBA0, BOLA, Quetra.
• Hybrid
– Integrates multiple heuristics such as throughput, buffer level, and latency
to make comprehensive streaming decisions.
– Examples: GTA, Elastic, MPC.
• Learning-based
– Utilizes machine learning techniques to adapt streaming quality based on
historical data and real-time network conditions.
– Examples: Pensieve, Fugu, Stick.
Exploring ABR Algorithms
20
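The first two categories can be illustrated with minimal decision rules. This is an illustrative sketch only, not the logic of PANDA, BBA0, or any other named algorithm; the bitrate ladder and thresholds are invented for the example.

```python
# Sketch: minimal throughput-based and buffer-based ABR decision rules.
# Illustrative only -- not the logic of any named algorithm.

LADDER = [400, 800, 1600, 3200, 6400]  # available bitrates in kbit/s (assumed)

def throughput_based(estimated_kbps, safety=0.8):
    """Pick the highest bitrate that fits under a safety-scaled estimate."""
    budget = estimated_kbps * safety
    candidates = [b for b in LADDER if b <= budget]
    return candidates[-1] if candidates else LADDER[0]

def buffer_based(buffer_s, reservoir=5.0, cushion=20.0):
    """Map buffer occupancy linearly onto the ladder (BBA-style idea)."""
    if buffer_s <= reservoir:
        return LADDER[0]
    if buffer_s >= reservoir + cushion:
        return LADDER[-1]
    frac = (buffer_s - reservoir) / cushion
    return LADDER[int(frac * (len(LADDER) - 1))]

print(throughput_based(3000))   # -> 1600 (budget 2400, highest fitting rung)
print(buffer_based(25.0))       # -> 6400 (buffer at the top of the cushion)
```

Hybrid algorithms combine both signals, and learning-based ones replace the hand-written rule with a trained policy.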
CAdViSE Testbed:
Cloud-based platform for assessing ABR algorithms under diverse network conditions.
Ensures reproducibility with session logs for accurate recreation of streaming sessions.
Experiment Logs:
Logs archived in DynamoDB.
Script processes logs to simulate and inject stall events using FFmpeg.
Video Processing:
Generates a JSON file for ITU-T P.1203 model to obtain Mean Opinion Score (MOS).
Concatenates audio and video tracks for finalized mp4 files.
Evaluation Portal:
Developed using Serverless Architecture and AWS Lambda.
Based on ITU-T P.910 standards for subjective assessments.
Crowdsourced Testing:
Uses Amazon Mechanical Turk for participant recruitment.
Custom web media player delivers test sequences to users.
Evaluation Process:
Participants watch and rate 10 test sequences on a 1 to 5 scale.
Reliability questions ensure valid votes.
Results stored and processed via AWS services.
Objective and Subjective Evaluation
21
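The rating and filtering steps above can be sketched as follows. The session data shape and field names are assumptions for illustration, not the portal's actual storage format.

```python
# Sketch: computing per-sequence MOS from crowdsourced ratings while
# discarding participants who fail the reliability questions.
# The data shape is hypothetical.

def mos_per_sequence(sessions):
    """sessions: list of dicts with a 'reliable' flag and 'ratings' {seq: 1..5}."""
    totals = {}
    for s in sessions:
        if not s["reliable"]:      # drop invalid votes entirely
            continue
        for seq, rating in s["ratings"].items():
            totals.setdefault(seq, []).append(rating)
    return {seq: sum(r) / len(r) for seq, r in totals.items()}

sessions = [
    {"reliable": True,  "ratings": {"seq1": 4, "seq2": 3}},
    {"reliable": True,  "ratings": {"seq1": 5, "seq2": 2}},
    {"reliable": False, "ratings": {"seq1": 1, "seq2": 1}},  # filtered out
]
print(mos_per_sequence(sessions))  # -> {'seq1': 4.5, 'seq2': 2.5}
```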
Figure: available bandwidth (kbps) over time (seconds, 0-120 s) for the four synthetic network profiles used in the evaluation: Ramp Up, Ramp Down, Stable, and Fluctuation.
Empirical Findings I
22
Average stall duration (seconds) per ABR algorithm and network profile:

             FastMPC  Elastic   BBA0  Quetra   BOLA  dash.js  Shaka
Fluctuation    73.23     5.85   7.95   10.88  28.46    41.40  52.25
Ramp Down      30.63     8.35   6.18   10.33  11.29    21.29  34.90
Ramp Up        17.18     0.00   0.19    0.00   4.13     4.55  13.39
Stable         12.84     0.16   0.00    0.00   4.20     4.26  20.12

Average start-up delay (seconds) per ABR algorithm and network profile:

             FastMPC  Elastic   BBA0  Quetra   BOLA  dash.js  Shaka
Fluctuation     5.48     5.36   5.48    5.36   5.56     5.50   5.28
Ramp Down       5.56     5.29   5.41    5.43   5.57     5.54   5.40
Ramp Up         7.22     6.37   6.56    6.78   7.48     7.51   9.65
Stable          5.65     5.46   5.40    5.42   5.62     5.65   5.65
Empirical Findings II
23
Objective vs. subjective MOS per ABR algorithm and network profile, with Pearson's correlation coefficient (PCC) between the two:

Stable network profile (PCC 0.84):
                BBA0   BOLA  dash.js  Elastic  FastMPC  Quetra  Shaka
Objective MOS   2.64   2.93    3.13     2.24     3.07    2.26   2.70
Subjective MOS  3.66   3.87    3.98     3.67     3.80    3.34   3.67

Fluctuation network profile (PCC 0.52):
                BBA0   BOLA  dash.js  Elastic  FastMPC  Quetra  Shaka
Objective MOS   2.22   1.86    1.99     2.07     1.91    1.98   1.98
Subjective MOS  3.39   3.21    3.29     3.12     3.10    3.08   3.30

Ramp Up network profile (PCC 0.94):
                BBA0   BOLA  dash.js  Elastic  FastMPC  Quetra  Shaka
Objective MOS   2.56   2.67    2.63     2.26     2.84    2.26   2.79
Subjective MOS  3.62   3.73    3.65     3.45     3.68    3.41   3.73

Ramp Down network profile (PCC 0.90):
                BBA0   BOLA  dash.js  Elastic  FastMPC  Quetra  Shaka
Objective MOS   2.33   2.43    2.33     2.00     2.48    2.00   2.35
Subjective MOS  3.48   3.65    3.48     3.26     3.48    3.26   3.45
In-depth Studies on Stall Events and
Quality Switches
• Minimum Noticeable Stall Duration (MNSD):
– Investigated the threshold below which stall events are not noticeable to users,
thus not affecting perceived QoE.
• Stall Event vs. Quality Switch (SvQ):
– Evaluated user preference between experiencing a stall event or a quality drop
during unfavourable network conditions.
• Short vs. Long Stall Events (SvL):
– Studied the impact on QoE of multiple short stall events versus a single longer
stall event, considering both predicted and perceived MOS.
• Stall Impact and Video Quality (RSVQ):
– Examined the relationship between the impact of stall events on QoE and video
quality level, addressing conflicting findings from previous studies.
• QoE Models Comparison:
– Compared various QoE objective evaluation models with subjective MOS results
to study their correlations.
24
Subjective Evaluation Portal
25
Minimum Noticeable Stall Duration
26
Figure: share of noticed vs. missed stall events per stall-event duration bucket (from <0.051 s up to <1.001 s), with a logarithmic trend line over the missed stall events.
• Decrease in noticed stall events starts at
durations less than 0.301 seconds.
• Over 45% of subjects did not notice stall
events shorter than 0.051 seconds.
• Stall events under 0.004 seconds were not
noticeable to participants.
Stall Event vs. Quality Switch
27
Set A - Case I: A pattern with 6s stall
event and upward quality switch.
Set A - Case II: A pattern without a stall
event and continuous low-quality
streaming.
Set B - Case I: A pattern with high
video quality streaming but with a 6s
stall event.
Set B - Case II: A pattern with a
downward quality switch and without
stall event.
Stall Event vs. Quality Switch
28
Mean Opinion Score per stall-event pattern:

                Set A Case I  Set A Case II  Set B Case I  Set B Case II
Perceived MOS       3.28          3.11           3.75          3.52
Predicted MOS       2.96          2.45           3.62          2.76
• Preference for Case I in both Set A
and Set B over Case II.
• Preference for higher-quality
versions even with a 6-second stall.
Short vs. Long Stall Events
29
Mean Opinion Score per stall-event pattern (count-duration in seconds):

                (0-0)  (1-4)  (1-8)  (4-1)  (4-2)  (8-1)
Perceived MOS    4.54   4.11   3.83   3.44   3.35   3.23
Predicted MOS    4.71   4.31   4.12   3.33   3.23   2.67
• Preference for longer stall
events over frequent, shorter
ones
Stall Impact and Video Quality
30
Mean Opinion Score per VMAF video quality level, with and without a stall event:

                  Q1  Q1+Stall    Q2  Q2+Stall    Q3  Q3+Stall
Perceived MOS   2.85      2.57  3.81      3.08  4.48      3.77
Predicted MOS   1.88      1.65  2.60      2.11  4.63      3.36
• Minor QoE penalty from stall events
at low-quality videos (Q1).
• Higher penalty on QoE for middle
(Q2) and high-quality (Q3) videos
with stall events.
QoE Models Comparison
31
• BiQPS and FINEAS:
- Inconsistent performance across
evaluations.
• P.1203 model:
- Best overall performance.
- Highest PCC and SRCC (> 0.8)
- Lowest RMSE: 0.326.
• Pearson Correlation Coefficient (PCC)
• Spearman’s Rank Correlation Coefficient (SRCC)
• Root Mean Square Error (RMSE)
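These three agreement measures can be computed without external dependencies. A minimal plain-Python sketch follows; the predicted and subjective values in the example are hypothetical.

```python
# Sketch: PCC, SRCC, and RMSE between predicted and subjective MOS,
# in plain Python (no SciPy dependency).
from math import sqrt

def pcc(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def srcc(x, y):
    return pcc(ranks(x), ranks(y))

def rmse(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

predicted  = [2.9, 3.6, 4.1, 2.2, 3.3]   # hypothetical model outputs
subjective = [3.1, 3.5, 4.4, 2.5, 3.2]   # hypothetical MOS votes
print(pcc(predicted, subjective), srcc(predicted, subjective),
      rmse(predicted, subjective))
```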
A Comprehensive
Dataset
32
Video Codecs and Development Procedures
Source Video Sequences
Available Representations
Video Codecs and Development Procedures
33
• Advanced Video Coding (AVC)
– Library: libx264 (version 0.160.3011) from FFmpeg, slow preset.
• High Efficiency Video Coding (HEVC)
– Library: libx265 (version 3.4) from FFmpeg, slow preset.
• AOMedia Video 1 (AV1)
– Library: libsvtav1 (version 0.9.0) from FFmpeg, preset 8.
• Versatile Video Coding (VVC)
– Library: Fraunhofer VVenC (version 1.3.1); requires 8-bit YUV input, processed with FFmpeg and encoded with VVenC.
– At dataset preparation time, MP4Box (part of the GPAC project) supported VVC in nightly builds, enabling MP4 file
packaging, VVC bitstream dumping, and MPEG-DASH content packaging
• ISOBMFF incl. VVC
• DASH manifest
Figure: VVC workflow. The encoder outputs VVC elementary streams; GPAC packages them together with the encoded audio track; playback uses the GPAC MP4Client decoder.
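The encode-and-package steps above can be sketched as command lines, built here in Python for clarity. The invocations follow the public vvencapp and MP4Box interfaces, but file names and parameter values are illustrative; the exact settings used for the dataset may differ.

```python
# Sketch: VVC encode + ISOBMFF packaging + MPEG-DASH segmentation as
# command lines. File names and parameter values are illustrative.
import shlex

def vvc_pipeline(yuv="source_8bit.yuv", w=3840, h=2160, fps=60, seg_ms=4000):
    # Encode an 8-bit YUV input to a VVC elementary stream with vvencapp.
    encode = ["vvencapp", "-i", yuv, "-s", f"{w}x{h}", "-r", str(fps),
              "--preset", "medium", "-o", "video.266"]
    # Wrap the VVC elementary stream (plus audio) into ISOBMFF with MP4Box.
    package = ["MP4Box", "-add", "video.266", "-add", "audio.mp4", "out.mp4"]
    # Produce MPEG-DASH segments and a manifest.
    dash = ["MP4Box", "-dash", str(seg_ms), "-profile", "live",
            "-out", "manifest.mpd", "out.mp4"]
    return [shlex.join(c) for c in (encode, package, dash)]

for cmd in vvc_pipeline():
    print(cmd)
```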
Source Video Sequences
34
Available Representations
35
• Resolutions up to 7680x4320 (8K)
• Maximum media duration of 322 seconds
• Segment lengths of 4 and 8 seconds
• Publicly available at:
http://www.itec.aau.at/ftp/datasets/mmsys22
Highlights And
Conclusion
Three Main Categories of Contributions:
1. Evaluation frameworks (CAdViSE and LLL-CAdViSE) for VOD and
live streaming.
– Directly addresses RQ1.
2. Studies on subjective and objective QoE assessments and the
impacts of HAS defects on QoE.
– Directly addresses RQ2.
3. Comprehensive dataset with up-to-date video technologies,
including 8K VVC.
– Directly addresses RQ1.
36
Future Work
37
• Support for New Protocols and Codecs: Extend evaluation frameworks to include
emerging standards like WebRTC and VVC.
• Machine Learning for QoE: Apply machine learning techniques to predict and optimize
QoE based on assessment data.
• Enhance Quality Models: Align existing quality models with subjective assessment
findings for better prediction accuracy.
• Real-time QoE Monitoring: Develop tools for real-time
QoE monitoring and feedback to enable dynamic
adjustments during streaming sessions.
• User-centric QoE Personalization: Investigate methods
for personalizing QoE based on individual user
preferences and viewing habits.
Thank You!
38
Q&A
  • 10. Contributions • CAdViSE: cloud-based adaptive video streaming evaluation framework for the automated testing of media players. In Proceedings of the 11th ACM Multimedia Systems Conference (MMSys), 2020 • Understanding Quality of Experience of Heuristic-based HTTP Adaptive Bitrate Algorithms. In Proceedings of the 31st ACM Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV), 2021 • INTENSE: In-Depth Studies on Stall Events and Quality Switches and Their Impact on the Quality of Experience in HTTP Adaptive Streaming. In IEEE Access, 2021 • Multi-codec ultra high definition 8K MPEG-DASH dataset. In Proceedings of the 13th ACM Multimedia Systems Conference (MMSys), 2022 • LLL-CAdViSE: Live Low-Latency Cloud-Based Adaptive Video Streaming Evaluation Framework. In IEEE Access, 2023
  • 11. Evaluation Frameworks 11 CAdViSE: Cloud-based Adaptive Video Streaming Evaluation LLL-CAdViSE: Live Low-Latency Cloud-based Adaptive Video Streaming Evaluation • Media Players Evaluation with CAdViSE • Live Low-latency Evaluation with LLL-CAdViSE Use Cases
  • 12. CAdViSE (What?) • A Quality of Experience evaluation framework for HTTP Adaptive Streaming – Facilitates an organized and structured evaluation • The test environment remains the same; therefore, the results can be interpreted as improved or degraded performance – It is cloud-based, since scalability is a key factor – Can assess multiple ABR algorithms and media simultaneously – Simulates network conditions; accepts network traces as plugins • Mimics real-world network characteristics – Provides unified insights into quality metrics • Measures raw metrics • Works seamlessly with analytic tools (graphs and plots)
  • 13. • Application Layer – Runner, Initializer and Starter scripts – Written with Bash Script, Python and JavaScript • Cloud Components – Player Container (VNC and Selenium) – Network Emulator – EC2 Instances, SSM Execution, DynamoDB, S3 and CloudWatch • Logs and Analytics – Comprehensive Logs – Analytic Players Plugin CAdViSE (How?) 13
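The network emulation mentioned above can be sketched in a few lines: the snippet below turns a bandwidth trace into a sequence of Linux `tc` token-bucket shaping commands. The `(duration, kbps)` trace format and the `tbf` burst/latency parameters are assumptions for illustration; CAdViSE's actual network emulator plugin interface may differ.

```python
# Sketch: convert a bandwidth trace into Linux `tc` (tbf) shaping commands.
# The trace format (duration in seconds, bandwidth in kbit/s) is an
# assumption for illustration, not CAdViSE's actual plugin interface.

def trace_to_tc_commands(trace, dev="eth0"):
    """trace: list of (duration_s, kbps) tuples -> list of shell commands."""
    commands = [f"tc qdisc del dev {dev} root || true"]  # reset old qdisc
    for i, (duration, kbps) in enumerate(trace):
        verb = "add" if i == 0 else "change"
        commands.append(
            f"tc qdisc {verb} dev {dev} root tbf "
            f"rate {kbps}kbit burst 32kbit latency 400ms"
        )
        commands.append(f"sleep {duration}")  # hold the rate for this step
    return commands

# e.g. the preliminary evaluation profile: 4 Mbit/s alternating with 800 kbit/s
profile = [(30, 4000), (30, 800)]
for cmd in trace_to_tc_commands(profile):
    print(cmd)
```

In the real framework the client container would execute these commands; printing them here only illustrates the shape of the emulation schedule.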
  • 14. Live Low-Latency (CAdViSE) 14 Server (AWS EC2) - Generate the live feed - Encode - Package (DASH & HLS) - Ingest & Deliver - Calculate MOS - Manipulate network Client (AWS EC2) - Run media player - Redirect requests to server - Record logs - Manipulate network Database (AWS DynamoDB) - Store log records - Index the data - Retrieve log records LLL-CAdViSE Console (Shell) - Manage EC2 instances - Initialize server and client(s) - Execute the experiment - Execute QoE calculation
  • 15. Preliminary Evaluation with CAdViSE • 5 experiments; 9:00 minutes each • AWS EC2 t2.medium instances (4 GiB RAM, 2 CPU cores) • Emulated network profiles: 4 Mbit/s <> 800 kbit/s
  • 16. LLL-CAdViSE Evaluation Setup • Target latencies: 1s, 3s, 5s, and 10s • Two streaming formats: – MPEG-DASH (dash.js 4.4.1) – HLS (hls.js 1.2.0) • ABR algorithms: – Learn2Adapt-LowLatency (L2A-LL) – Low-on-Latency Plus (LoLP) • 3 experiments of 420 seconds • Network profiles: – Bicycle commuter LTE network – Car driver LTE network – Train commuter LTE network – Tram commuter LTE network – Network0 up to 10 Gbps [Plots: available bandwidth (kbps) over time (seconds) for the Bicycle, Car Driver, Train Commuter, and Tram Commuter network profiles, each with a polynomial trend line.]
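Low-latency players typically hold a target latency partly by nudging the playback rate, which is why the evaluation reports playback rate as a metric. A minimal sketch of such a proportional catch-up controller follows; the dead zone and rate bounds are hypothetical values, and dash.js/hls.js implement their own, more elaborate controllers.

```python
def catchup_rate(latency_s, target_s, dead_zone_s=0.5,
                 min_rate=0.95, max_rate=1.05):
    """Proportional playback-rate catch-up (illustrative parameters only).

    Speeds playback up when live latency exceeds the target and slows it
    down when the player runs ahead; inside the dead zone the adjustment
    is proportional to the drift.
    """
    drift = latency_s - target_s
    if drift >= dead_zone_s:
        return max_rate      # far behind the live edge: play faster
    if drift <= -dead_zone_s:
        return min_rate      # ahead of the target: play slower
    return 1.0 + (drift / dead_zone_s) * (max_rate - 1.0)
```

For a 5 s target, a player sitting at 5.6 s latency would play at 1.05x until it catches up, while one at exactly 5 s plays at normal speed.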
  • 17. LLL-CAdViSE Evaluation Result I 17 • All time values are in seconds. • a: Experiment title, format: [protocol]-[ABR]-[network]- [target latency] (def: Default, l2a: L2A-LL). • b: Average of the sum of stall events duration. • c: Average start-up delay. • d: Average of the sum of seek events duration. • e: Average quantity of quality switches. • f: Playback bitrate (min-max-avg) in kbps. • g: Latency (min-max-avg). • h: Playback rate (min-max-avg). • i: Average MOS predicted by the ITU-T P.1203 quality model.
  • 18. LLL-CAdViSE Evaluation Result II [Charts: average P.1203 MOS (1-5) and average latency (seconds) per network profile (Bicycle, Car, Train, Tram, Net0) for the Default, L2A-LL, and LoLP ABR algorithms; left panel: MPEG-DASH at 5 s target latency, right panel: HLS at 5 s target latency.]
  • 19. Studies on QoE Impacting Factors 19 • Exploring Adaptive Bitrate (ABR) Algorithms • Objective and Subjective Evaluation • Empirical Findings Understanding Quality of Experience • Minimum Noticeable Stall event Duration (MNSD) Evaluation • Stall event vs. Quality level switch (SvQ) Evaluation • Short stall events vs. a Longer stall event (SvL) Evaluation • Relation of Stall event impact on the QoE with Video Quality level (RSVQ) Evaluation • Objective QoE Models Comparison In-depth Studies on Stall Events and Quality Switches
  • 20. • Throughput-based – Uses throughput prediction heuristics to optimize streaming quality by estimating available network bandwidth. – Examples: PANDA, Festive, CrystalBall. • Buffer-based – Relies solely on buffer occupancy to make streaming decisions, aiming to prevent buffer underruns and stalling. – Examples: BBA0, BOLA, Quetra. • Hybrid – Integrates multiple heuristics such as throughput, buffer level, and latency to make comprehensive streaming decisions. – Examples: GTA, Elastic, MPC. • Learning-based – Utilizes machine learning techniques to adapt streaming quality based on historical data and real-time network conditions. – Examples: Pensieve, Fugu, Stick. Exploring ABR Algorithms 20
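To make the buffer-based category concrete, here is a minimal BBA0-style sketch that maps buffer occupancy linearly onto a bitrate ladder. The reservoir/cushion values and the ladder itself are hypothetical illustrations, not BBA0's published parameters.

```python
def buffer_based_bitrate(buffer_s, ladder_kbps, reservoir_s=5, cushion_s=20):
    """BBA0-style mapping from buffer occupancy to a bitrate ladder.

    Below the reservoir: lowest rendition (protect against stalls).
    Above reservoir + cushion: highest rendition.
    In between: linear interpolation across the ladder.
    """
    ladder = sorted(ladder_kbps)
    if buffer_s <= reservoir_s:
        return ladder[0]
    if buffer_s >= reservoir_s + cushion_s:
        return ladder[-1]
    fraction = (buffer_s - reservoir_s) / cushion_s
    return ladder[int(fraction * (len(ladder) - 1))]

# hypothetical 5-step ladder in kbps
ladder = [800, 1500, 3000, 4500, 8000]
```

Throughput-based and hybrid algorithms would replace or combine the buffer signal with a bandwidth estimate; learning-based ones replace the hand-tuned map with a trained policy.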
  • 21. Objective and Subjective Evaluation • CAdViSE Testbed: cloud-based platform for assessing ABR algorithms under diverse network conditions; ensures reproducibility with session logs for accurate recreation of streaming sessions. • Experiment Logs: logs archived in DynamoDB; a script processes the logs to simulate and inject stall events using FFmpeg. • Video Processing: generates a JSON file for the ITU-T P.1203 model to obtain a Mean Opinion Score (MOS); concatenates audio and video tracks into finalized MP4 files. • Evaluation Portal: developed using a serverless architecture and AWS Lambda; based on the ITU-T P.910 standard for subjective assessments. • Crowdsourced Testing: uses Amazon Mechanical Turk for participant recruitment; a custom web media player delivers test sequences to users. • Evaluation Process: participants watch and rate 10 test sequences on a 1 to 5 scale; reliability questions ensure valid votes; results are stored and processed via AWS services. [Plots: the four emulated bandwidth profiles (kbps over seconds): Ramp Up, Ramp Down, Stable, and Fluctuation.]
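The log-processing step can be sketched as follows: deriving (start, duration) stall intervals from archived player events, which is the information both the FFmpeg stall-injection step and the P.1203 input need. The event format here is hypothetical; the actual CAdViSE log schema is not reproduced.

```python
def stall_intervals(events):
    """Derive (start_s, duration_s) stall intervals from player log events.

    events: chronological (timestamp_s, state) pairs, where state is
    'stalled' or 'playing'. This log format is illustrative only.
    """
    intervals, stall_start = [], None
    for ts, state in events:
        if state == "stalled" and stall_start is None:
            stall_start = ts                      # stall begins
        elif state == "playing" and stall_start is not None:
            intervals.append((stall_start, ts - stall_start))  # stall ends
            stall_start = None
    return intervals

# hypothetical session: a 2.7 s stall at t=12.4 and a 0.8 s stall at t=40.0
log = [(0.0, "playing"), (12.4, "stalled"), (15.1, "playing"),
       (40.0, "stalled"), (40.8, "playing")]
```

Each interval then drives one frozen-frame insertion in the processed video and one stalling entry in the model input.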
  • 22. Empirical Findings I

    Avg. stall (seconds) per ABR algorithm and network profile:
                 FastMPC  Elastic  BBA0   Quetra  BOLA   dash.js  Shaka
    Fluctuation  73.23    5.85     7.95   10.88   28.46  41.40    52.25
    Ramp Down    30.63    8.35     6.18   10.33   11.29  21.29    34.90
    Ramp Up      17.18    0.00     0.19   0.00    4.13   4.55     13.39
    Stable       12.84    0.16     0.00   0.00    4.20   4.26     20.12

    Avg. start-up delay (seconds):
                 FastMPC  Elastic  BBA0   Quetra  BOLA   dash.js  Shaka
    Fluctuation  5.48     5.36     5.48   5.36    5.56   5.50     5.28
    Ramp Down    5.56     5.29     5.41   5.43    5.57   5.54     5.40
    Ramp Up      7.22     6.37     6.56   6.78    7.48   7.51     9.65
    Stable       5.65     5.46     5.40   5.42    5.62   5.65     5.65
  • 23. Empirical Findings II

    Objective vs. subjective MOS per ABR algorithm and network profile:

    Stable network profile (Pearson's correlation coefficient 0.84):
                    BBA0  BOLA  dash.js  Elastic  FastMPC  Quetra  Shaka
    Objective MOS   2.64  2.93  3.13     2.24     3.07     2.26    2.70
    Subjective MOS  3.66  3.87  3.98     3.67     3.80     3.34    3.67

    Fluctuation network profile (PCC 0.52):
    Objective MOS   2.22  1.86  1.99     2.07     1.91     1.98    1.98
    Subjective MOS  3.39  3.21  3.29     3.12     3.10     3.08    3.30

    RampUp network profile (PCC 0.94):
    Objective MOS   2.56  2.67  2.63     2.26     2.84     2.26    2.79
    Subjective MOS  3.62  3.73  3.65     3.45     3.68     3.41    3.73

    RampDown network profile (PCC 0.90):
    Objective MOS   2.33  2.43  2.33     2.00     2.48     2.00    2.35
    Subjective MOS  3.48  3.65  3.48     3.26     3.48     3.26    3.45
  • 24. In-depth Studies on Stall Events and Quality Switches • Minimum Noticeable Stall Duration (MNSD): – Investigated the threshold below which stall events are not noticeable to users, thus not affecting perceived QoE. • Stall Event vs. Quality Switch (SvQ): – Evaluated user preference between experiencing a stall event or a quality drop during unfavourable network conditions. • Short vs. Long Stall Events (SvL): – Studied the impact on QoE of multiple short stall events versus a single longer stall event, considering both predicted and perceived MOS. • Stall Impact and Video Quality (RSVQ): – Examined the relationship between the impact of stall events on QoE and video quality level, addressing conflicting findings from previous studies. • QoE Models Comparison: – Compared various QoE objective evaluation models with subjective MOS results to study their correlations. 24
  • 26. Minimum Noticeable Stall Duration [Plots: share of noticed vs. missed stall events per stall-duration bucket (< 0.051 s up to < 1.001 s), with a logarithmic trend line for missed stall events.] • The decrease in noticed stall events starts at durations below 0.301 seconds. • Over 45% of subjects did not notice stall events shorter than 0.051 seconds. • Stall events under 0.004 seconds were not noticeable to participants.
  • 27. Stall Event vs. Quality Switch 27 Set A - Case I: A pattern with 6s stall event and upward quality switch. Set A - Case II: A pattern without a stall event and continuous low-quality streaming. Set B - Case I: A pattern with high video quality streaming but with a 6s stall event. Set B - Case II: A pattern with a downward quality switch and without stall event.
  • 28. Stall Event vs. Quality Switch

                   Set A Case I  Set A Case II  Set B Case I  Set B Case II
    Perceived MOS  3.28          3.11           3.75          3.52
    Predicted MOS  2.96          2.45           3.62          2.76

    • Preference for Case I in both Set A and Set B over Case II. • Preference for higher-quality versions even with a 6-second stall.
  • 29. Short vs. Long Stall Events

    Stall event pattern (count-duration):
                   (0-0)  (1-4)  (1-8)  (4-1)  (4-2)  (8-1)
    Perceived MOS  4.54   4.11   3.83   3.44   3.35   3.23
    Predicted MOS  4.71   4.31   4.12   3.33   3.23   2.67

    • Preference for a longer stall event over frequent, shorter ones.
  • 30. Stall Impact and Video Quality

    Video quality levels Q1-Q3 (by VMAF):
                   Q1    Q1+Stall  Q2    Q2+Stall  Q3    Q3+Stall
    Perceived MOS  2.85  2.57      3.81  3.08      4.48  3.77
    Predicted MOS  1.88  1.65      2.6   2.11      4.63  3.36

    • Minor QoE penalty from stall events for low-quality videos (Q1). • Higher QoE penalty for middle- (Q2) and high-quality (Q3) videos with stall events.
  • 31. QoE Models Comparison 31 • BiQPS and FINEAS: - Inconsistent performance across evaluations. • P.1203 model: - Best overall performance. - Highest PCC and SRCC (> 0.8) - Lowest RMSE: 0.326. • Pearson Correlation Coefficient (PCC) • Spearman’s Rank Correlation Coefficient (SRCC) • Root Mean Square Error (RMSE)
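The three comparison measures used above can be computed without external libraries; a minimal sketch follows (tie handling in the Spearman ranks is omitted for brevity):

```python
from math import sqrt

def pcc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def srcc(x, y):
    """Spearman's rank correlation: Pearson computed over the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pcc(ranks(x), ranks(y))

def rmse(predicted, observed):
    """Root mean square error between model output and subjective MOS."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                / len(predicted))
```

Feeding each model's predicted MOS and the subjective MOS from the studies into these functions yields exactly the PCC/SRCC/RMSE figures a comparison like the one above reports.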
  • 32. A Comprehensive Dataset 32 Video Codecs and Development Procedures Source Video Sequences Available Representations
  • 33. Video Codecs and Development Procedures • Advanced Video Coding (AVC) – Library: libx264 (version 0.160.3011) from FFmpeg, slow preset. • High Efficiency Video Coding (HEVC) – Library: libx265 (version 3.4) from FFmpeg, slow preset. • AOMedia Video 1 (AV1) – Library: libsvtav1 (version 0.9.0) from FFmpeg, preset 8. • Versatile Video Coding (VVC) – Library: Fraunhofer VVenC (version 1.3.1); requires 8-bit YUV input, pre-processed with FFmpeg and encoded with VVenC. – At dataset preparation time, MP4Box (part of the GPAC project) supported VVC in nightly builds, enabling MP4 file packaging (ISOBMFF incl. VVC), VVC bitstream dumping, and MPEG-DASH content packaging (DASH manifest). [Diagram: Encoder → VVC elementary streams → GPAC packaging with the encoded audio track → playback via decoder and GPAC MP4Client.]
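A sketch of how per-representation encode commands for the AVC/HEVC/AV1 parts of such a ladder could be assembled. The flags follow standard FFmpeg usage for these encoder libraries, but the resolution/bitrate pair and file names below are placeholders, not the dataset's actual ladder.

```python
# Encoder arguments per codec, mirroring the libraries and presets above.
CODEC_ARGS = {
    "avc":  ["-c:v", "libx264", "-preset", "slow"],
    "hevc": ["-c:v", "libx265", "-preset", "slow"],
    "av1":  ["-c:v", "libsvtav1", "-preset", "8"],
}

def encode_command(src, codec, width, height, bitrate_kbps, out):
    """Build one FFmpeg encode command (as an argv list) for a representation."""
    return (["ffmpeg", "-y", "-i", src]
            + CODEC_ARGS[codec]
            + ["-b:v", f"{bitrate_kbps}k",
               "-vf", f"scale={width}:{height}",
               out])

# hypothetical 4K HEVC representation
cmd = encode_command("source.y4m", "hevc", 3840, 2160, 16000, "hevc_2160p.mp4")
```

VVC is the odd one out: it goes through VVenC and GPAC's MP4Box instead of a single FFmpeg invocation, so it is not covered by this helper.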
  • 35. Available Representations 35 • Resolutions up to 7680x4320 or 8K • Maximum media duration of 322 seconds • Segment lengths of 4 and 8 seconds • Available publicly with the following link: http://www.itec.aau.at/ftp/datasets/mmsys22
  • 36. Highlights and Conclusion Three main categories of contributions: 1. Evaluation frameworks (CAdViSE and LLL-CAdViSE) for VOD and live streaming; directly addresses RQ1. 2. Studies on subjective and objective QoE assessments and the impacts of HAS defects on QoE; directly addresses RQ2. 3. A comprehensive dataset with up-to-date video technologies, including 8K VVC; directly addresses RQ1.
  • 37. Future Works • Support for New Protocols and Codecs: Extend the evaluation frameworks to include emerging standards such as WebRTC and VVC. • Machine Learning for QoE: Apply machine learning techniques to predict and optimize QoE based on assessment data. • Enhance Quality Models: Align existing quality models with subjective assessment findings for better prediction accuracy. • Real-time QoE Monitoring: Develop tools for real-time QoE monitoring and feedback to enable dynamic adjustments during streaming sessions. • User-centric QoE Personalization: Investigate methods for personalizing QoE based on individual user preferences and viewing habits.