VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results
arXiv:2508.18445v1 [cs.CV] 25 Aug 2025
Sizhuo Ma*, Wei-Ting Chen*, Qiang Gao*, Jian Wang*, Chris Wei Zhou*,
Wei Sun, Weixia Zhang, Linhan Cao, Jun Jia, Xiangyang Zhu, Dandan Zhu,
Xiongkuo Min, Guangtao Zhai, Baoying Chen, Xiongwei Xiao, Jishen Zeng,
Wei Wu, Tiexuan Lou, Yuchen Tan, Chunyi Song, Zhiwei Xu,
MohammadAli Hamidi, Hadi Amirpour, Mingyin Bai, Jiawang Du, Zhenyu Jiang,
Zilong Lu, Ziguan Cui, Zongliang Gan, Xinpeng Li, Shiqi Jiang, Chenhui Li,
Changbo Wang, Weijun Yuan, Zhan Li, Yihang Chen, Yifan Deng, Ruting Deng,
Zhanglu Chen, Boyang Yao, Shuling Zheng, Feng Zhang, Zhiheng Fu,
Abhishek Joshi, Aman Agarwal, Rakhil Immidisetti, Ajay Narasimha Mopidevi,
Vishwajeet Shukla, Hao Yang, Ruikun Zhang, Liyuan Pan, Kaixin Deng,
Hang Ouyang, Fan Yang, Zhizun Luo, Zhuohang Shi, Songning Lai, Weilin Ruan,
Yutao Yue
Abstract
Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluation through correlation metrics on a dataset of in-the-wild face images. The challenge attracted 127 participants, who made 1519 submissions in total. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.
1. Introduction
In recent years, face images have become integral to a wide variety of applications, including video communication, photography, augmented reality, and digital content creation.* However, real-world face images are frequently captured under non-ideal conditions due to environmental constraints and hardware limitations, resulting in common degradations such as noise, blur, compression artifacts, and poor lighting. These degradations not only diminish perceived image quality but also negatively impact downstream image processing tasks like enhancement, editing, and synthesis. Moreover, compromised image quality can adversely affect the performance and generalization ability of data-driven models, including large-scale vision systems and generative models, which depend on high-quality face image datasets for effective training [15, 34]. Thus, the development of robust generic FIQA methods capable of accurately quantifying perceptual degradation levels has become increasingly critical [2, 36].

*Sizhuo Ma (sma@snap.com), Wei-Ting Chen (weitingchen@microsoft.com), Qiang Gao (qgao@snap.com), Jian Wang (jwang4@snap.com), and Chris Wei Zhou (zhouw26@cardiff.ac.uk) are the challenge organizers. The other authors are participants of the VQualA 2025 Challenge on Face Image Quality Assessment.
To advance research in this area, we introduce the VQualA 2025 Challenge on Face Image Quality Assessment, held in conjunction with the ICCV 2025 Workshops. This challenge focuses specifically on evaluating the perceptual quality of face images at arbitrary resolutions affected by real-world degradations, emphasizing accuracy within stringent computational constraints. Participants are tasked with developing efficient and lightweight models capable of predicting the MOS of face images under conditions such as blur, noise, and low illumination. To reflect realistic deployment scenarios, submissions must adhere to computational constraints, including a maximum of 0.5 GFLOPs and fewer than 5 million parameters. Model performance is rigorously evaluated using no-reference image quality metrics and extensive subjective human studies to ensure alignment with human perceptual judgments.
The primary objective of this challenge is to encourage innovation in efficient and precise FIQA models suitable for real-time deployment on mobile and edge devices, ultimately advancing the broader field of perceptual quality assessment and enabling practical, real-world applications.

This challenge garnered significant interest, attracting 127 registered participants. Throughout the development phase, participants submitted 1058 entries, followed by 461 submissions during the final testing phase. Ultimately, 13 teams successfully submitted their final models and accompanying fact sheets, each providing detailed methodologies for face image quality assessment. Sec. 3 presents a comprehensive analysis and summary of the submitted methods. We anticipate that this challenge will contribute meaningfully to the ongoing progress of face image quality assessment methods, particularly in real-world scenarios under computational constraints.

This challenge is one of several associated with the VQualA Workshop at ICCV 2025, including: Image Super-Resolution Generated Content Quality Assessment [19], Visual Quality Comparison for Large Multimodal Models [44], GenAI-Bench AIGC Video Quality Assessment [3], Engagement Prediction for Short Videos [18], and Document Image Quality Assessment [11].
2. VQualA FIQA Challenge
2.1. Datasets and Evaluation
To ensure a fair evaluation of participant solutions, we curated distinct training, validation, and testing datasets for this challenge. Our training set comprises 27,686 images, and our validation set contains 1,000 images, all collected from CelebA [22] and Flickr. For the test set, we gathered 889 images exclusively from Flickr.
Variety in resolution. A key challenge of this competition was developing a Face Image Quality Assessment (FIQA) method capable of handling in-the-wild images with diverse resolutions. Unlike previous datasets, such as GFIQA [36], our collected face images are not normalized and exhibit a wide range of resolutions, with short-edge dimensions varying from 224 to 1024 pixels. To generate labels for all datasets, we employed the state-of-the-art FIQA method DSL-FIQA [2]: for each image, regardless of its resolution, 20 random patches were extracted, scored individually, and their scores averaged to obtain the image's quality label.
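As an illustration, the patch-based labeling could look like the sketch below. Here dsl_fiqa_score stands in for DSL-FIQA inference and is a hypothetical callable, and the 224 × 224 patch size is an assumption (the patch dimensions are not stated in the text).

    import random
    from PIL import Image

    def label_image(img: Image.Image, n_patches: int = 20, size: int = 224) -> float:
        """Average FIQA scores over random patches to obtain the image's label."""
        scores = []
        for _ in range(n_patches):
            left = random.randint(0, img.width - size)
            top = random.randint(0, img.height - size)
            patch = img.crop((left, top, left + size, top + size))
            scores.append(dsl_fiqa_score(patch))  # hypothetical DSL-FIQA call
        return sum(scores) / n_patches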
Evaluation. The challenge was structured into two distinct phases:
• Development Phase: During this phase, participants were provided with the training images and their corresponding labels, along with the validation images. They were tasked with developing their solutions and uploading prediction results for the validation set, which were then compared against the ground truth.
• Testing Phase: For the testing phase, participants were required to upload their model definitions and weights. The models then processed the unseen test images directly on our server, and the results were compared against the ground-truth labels. We intentionally did not release the test dataset due to the strict constraint of 0.5 GFLOPs and 5M parameters. Releasing the test images could have led to participants using larger models to generate pseudo-labels for these images, subsequently training smaller models that overfit, thereby compromising the fairness of the competition.
The awards were determined according to the testing-phase scores. We use the average of SROCC and PLCC as the overall score:

Score = (SROCC + PLCC) / 2    (1)
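For concreteness, Eq. (1) can be computed with SciPy's correlation functions, as in this minimal sketch; the array names are illustrative and not part of the challenge code.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def overall_score(pred: np.ndarray, mos: np.ndarray) -> float:
        """Average of SROCC and PLCC between predictions and ground-truth MOS."""
        srocc = spearmanr(pred, mos).correlation
        plcc = pearsonr(pred, mos)[0]
        return (srocc + plcc) / 2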
2.2. Baseline
To facilitate the development of solutions, a MobileNetV2-based [35] baseline was provided, which accepts 224 × 224 pixel image patches as input. During inference, multiple random patches were cropped from the original image and processed by the network; the resulting output scores were then averaged to yield a final prediction. The baseline was trained using the Adam [16] optimizer with a learning rate of 5 × 10⁻⁴ and a weight decay of 10⁻⁵. The loss function used was mean squared error. Training was conducted for 20 epochs with a batch size of 64.
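A minimal sketch of such a baseline is given below, assuming a single-output regression head on the torchvision MobileNetV2; the head layout and initialization are assumptions, as only the backbone, optimizer, and hyperparameters are stated.

    import torch
    import torch.nn as nn
    import torchvision

    # MobileNetV2 backbone with a one-unit regression head (head layout assumed).
    model = torchvision.models.mobilenet_v2(weights=None)  # pretraining unspecified
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.2),
        nn.Linear(model.last_channel, 1),  # predicts a single MOS value
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
    criterion = nn.MSELoss()

    def train_step(patches: torch.Tensor, mos: torch.Tensor) -> float:
        """One optimization step on a batch of 224x224 patches and MOS labels."""
        optimizer.zero_grad()
        loss = criterion(model(patches).squeeze(1), mos)
        loss.backward()
        optimizer.step()
        return loss.item()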
While satisfactory performance was achieved with an ensemble of 20 random crops, adherence to the GFLOPs constraint necessitates the use of a single crop, which exhibits suboptimal performance (see Tab. 1). This constraint poses a challenge for participants, requiring the development of optimal input-handling strategies, including appropriate resizing and cropping techniques, to maximize performance under computational limitations.
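Since the 0.5 GFLOPs / 5M parameter budget applies per forward pass, participants could verify compliance with a profiler. The sketch below uses fvcore as one possible tool, reusing the model from the baseline sketch above; the organizers' exact measurement setup is not specified.

    import torch
    from fvcore.nn import FlopCountAnalysis

    x = torch.randn(1, 3, 224, 224)  # a single input crop
    flops = FlopCountAnalysis(model, x).total()  # fvcore counts multiply-adds
    params = sum(p.numel() for p in model.parameters())
    print(f"{flops / 1e9:.4f} GFLOPs, {params / 1e6:.4f} M params")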
2.3. Challenge Results
Table 1 summarizes the challenge results. A total of 13 teams submitted their solutions and accompanying fact sheets. The top-performing method attained a score of 0.9664, an improvement of more than 0.13 over the baseline, notably with comparable computational complexity (GFLOPs) and a reduced number of parameters.

The subsequent section provides a detailed description of each submitted solution. A list of team members and affiliations is included in Appendix A.

Table 1. Challenge results.

Rank  Team                  Score   SROCC   PLCC    GFLOPs  Params [M]
1     ECNU-SJTU VQA Team    0.9664  0.9692  0.9637  0.3313  1.1796
2     MediaForensics        0.9624  0.9624  0.9624  0.4687  1.5189
3     Next                  0.9583  0.9630  0.9535  0.4533  1.2224
4     ATHENAFace            0.9566  0.9600  0.9533  0.4985  2.0916
5     NJUPT-IQA-Group       0.9547  0.9530  0.9564  0.4860  3.7171
6     ECNU VIS Lab          0.9406  0.9397  0.9415  0.4923  3.2805
7     JNU620                0.9334  0.9413  0.9255  0.4097  3.2511
8     ISeeCV                0.9279  0.9282  0.9275  0.4890  0.9513
9     RegNet                0.9242  0.9262  0.9222  0.4895  4.0252
10    Conquerit             0.9038  0.9118  0.8958  0.2235  4.7795
11    BIT ssvgg             0.8727  0.8897  0.8557  0.5120  4.7242
12    2077Agent             0.8432  0.8529  0.8335  0.2852  1.3005
13    DERS                  0.6999  0.7098  0.6900  0.8980  6.0523
-     Baseline              0.8309  0.8334  0.8283  0.3139  3.2511
Figure 1. ECNU-SJTU VQA Team. (Pipeline: labeled data trains the teacher model; the teacher pseudo-labels unlabeled data to retrain an enhanced teacher; the enhanced teacher pseudo-labels further unlabeled data to train the student model used for inference.)
3. Teams and Methods
3.1. Efficient Face Image Quality Assessment via Self-training and Knowledge Distillation (by ECNU-SJTU VQA Team)

The ECNU-SJTU VQA Team proposes a framework comprising two main stages, as illustrated in Fig. 1. First, they trained a teacher model using a self-training approach. Specifically, the Swin Transformer Base (Swin-B) [23] was adopted; its classification head was removed and replaced with a two-layer multilayer perceptron (MLP) with a 128-unit hidden layer and a single output neuron, serving as the regression head. They began by training the teacher model on the labeled face image quality assessment (FIQA) dataset provided by the challenge organizers. Next, they collected a large-scale unlabeled face image dataset (approximately 200k images) from the Internet. The trained teacher model was then used to generate pseudo-labels for these images. They combined the labeled and pseudo-labeled images to retrain the teacher model, thereby enhancing its performance through self-training. After this, the enhanced teacher was used to generate pseudo-labels for an additional set of collected face images (approximately 200k images) for the second-stage training.
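A compact sketch of the pseudo-labeling step in such a self-training round might look as follows; treating plain teacher predictions as pseudo-labels is our reading of the description, and the data-loading details are assumptions.

    import torch

    @torch.no_grad()
    def pseudo_label(teacher: torch.nn.Module, unlabeled_loader) -> list:
        """Run the trained teacher over unlabeled images and keep its
        predicted MOS as the pseudo-label for each image."""
        teacher.eval()
        pairs = []
        for images in unlabeled_loader:
            mos = teacher(images).squeeze(1)
            pairs.extend(zip(images, mos.tolist()))
        return pairs

    # Round 1: the teacher trained on labeled data pseudo-labels ~200k images;
    # the same Swin-B architecture is then retrained on labeled + pseudo-labeled data.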
In the second stage, they trained a student model using the labeled images together with the first-round and second-round pseudo-labeled images. The student model employed EdgeNeXt-XX-Small [27] as the backbone, with its classification head replaced by the same two-layer MLP regression head. Through learning from the ground-truth data, the teacher-labeled data, and the enhanced-teacher-labeled data, the student model achieved competitive performance, closely matching that of the enhanced teacher.
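To make the teacher/student pairing concrete, the backbones and the shared 128-unit regression head could be assembled as below; the timm model names and the ReLU between the MLP layers are assumptions.

    import timm
    import torch.nn as nn

    def regression_head(in_features: int) -> nn.Module:
        # Two-layer MLP head: 128 hidden units, one output neuron (the MOS).
        return nn.Sequential(nn.Linear(in_features, 128), nn.ReLU(), nn.Linear(128, 1))

    # Teacher: Swin-B backbone (classification head removed via num_classes=0).
    backbone_t = timm.create_model("swin_base_patch4_window7_224", num_classes=0)
    teacher = nn.Sequential(backbone_t, regression_head(backbone_t.num_features))

    # Student: EdgeNeXt-XX-Small backbone with the same head design.
    backbone_s = timm.create_model("edgenext_xx_small", num_classes=0)
    student = nn.Sequential(backbone_s, regression_head(backbone_s.num_features))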
Their approach is inspired by the Self-training with Noisy Student framework [41] but differs in two key aspects. First, unlike the original method, which uses a larger model for iterative self-training, they retained the same architecture (i.e., Swin-B); additionally, since their goal was to assess visual quality, they avoided introducing noise to the input images during self-training, as it might degrade their perceptual fidelity. Second, they further distilled the enhanced teacher into a lightweight student model to enable efficient image quality assessment. More details can be found in the challenge paper [37].
Training details. They implemented their framework using PyTorch 2.4 [30]. For the two-round teacher training, they used the AdamW [25] optimizer with a learning rate of 1 × 10⁻⁴, a weight decay of 1 × 10⁻⁶, and a learning rate decay factor of 0.1 every 10 epochs. The model was trained
[The remaining 14 pages are not included in this excerpt.]