FBP-diffusion: diffusion model combining MobileViT and dynamic loss correction for facial beauty prediction

Gan, Junying; Li, Huicong; Xie, Xiaoshan; Chen, Hantian; Zhuang, Zhenxin

doi:10.1007/s10489-025-06723-8

Open access
Published: 05 July 2025

Volume 55, article number 842, (2025)

Applied Intelligence

Abstract

Facial beauty prediction (FBP) is a frontier topic at the intersection of artificial intelligence and computational aesthetics, aiming to enable computers to autonomously predict or assess facial beauty. Although current FBP methods achieve good results on well-processed datasets, their prediction performance typically drops on datasets that contain more noisy labels, which are unavoidable in practice. Diffusion models (DMs) can denoise and reconstruct label encodings, capturing uncertainty in the prediction process through the randomness of their outputs. Therefore, we propose FBP-Diffusion, an improved diffusion model that integrates MobileViT and dynamic loss correction (DLC). Specifically, MobileViT, effective at modeling both detailed and global information, is employed as a conditional information encoder to produce preliminary predictions, which are then fed into the reverse process to guide label generation. DLC is introduced to enhance the model’s denoising capability and robustness: a cross-entropy loss is computed from the prediction probabilities of FBP-Diffusion obtained after the reverse process and probability transfer, and is then dynamically integrated into the noise estimation loss. Experimental results on four representative facial beauty databases demonstrate that FBP-Diffusion outperforms both conventional DMs and FBP methods, including a 5.17% accuracy improvement over state-of-the-art FBP methods on a relatively noisy dataset.

1 Introduction

Facial beauty prediction (FBP) is a cutting-edge topic in artificial intelligence that converts the perception of facial beauty into a machine learning recognition problem by extracting aesthetic features, thereby enhancing computers’ ability to judge or predict facial beauty. Research in FBP has accelerated the development of practical applications, including natural and realistic facial image beautification technologies [1,2,3], reliable medical beauty solutions [4], and automated assessments of comic and film characters [5]. FBP provides a theoretical foundation for various fields and industries, with significant scientific value and broad application prospects.

In recent years, many data-driven deep learning methods have aimed to achieve computational results that better reflect human aesthetics by minimizing interference and capturing deeper features, continuously advancing research in this field [7,8,9,10,11]. There are four representative facial beauty databases for FBP classification tasks: SCUT-FBP [12], SCUT-FBP5500 [13], CelebA [14], and the Large Scale Asian Female Beauty Database (LSAFBD) [15]. Differences in dataset quality can affect prediction outcomes to some extent. Although both SCUT-FBP5500 and LSAFBD have five category labels, LSAFBD is less refined and contains more noisy labels, so the same model typically achieves higher accuracy on the former than on the latter. For instance, DIDTAN [16], ER-BLS [17], TransBLS-B [18], and the method in [19] achieved 76.40%, 74.69%, 78.46%, and 75.41% accuracy on SCUT-FBP5500, respectively, compared to 65.40%, 62.13%, 66.49%, and 61.37% on LSAFBD, with differences exceeding 10%. These methods use different model architectures, including convolutional neural networks (CNNs), vision transformers (ViTs), and broad learning systems (BLS), yet they all suffer from performance degradation caused by noisy labels. Given that noisy labels are unavoidable in real-world scenarios, there is an urgent need for effective solutions to address this challenge.

Although this issue has been recognized, only a few existing FBP approaches address noisy labels. In these approaches, the classifier is mainly designed to double as a label corrector that purifies potentially noisy labels before a second stage of training [16, 20]. However, the databases are imbalanced in terms of sample distribution, and the models frequently struggle to capture differential features. As a result, they may mistakenly treat the labels of classes with fewer samples as noise and modify them extensively, leading to highly unstable training. Such denoising methods, which focus on correcting labels without improving the model architecture, have inherent limitations that hinder the effective resolution of noisy label issues.

Diffusion models (DMs) [21] are explicit generative models based on likelihood estimation. Figure 1 depicts a schematic diagram of the diffusion process of DMs. During the forward diffusion process, Gaussian noise is progressively added to the input data, while in the reverse process, the model learns to denoise and gradually recover the original data from the noisy data. Unlike image generation tasks, which require the denoising of image data $\varvec{x}_t$, DM-based classifiers denoise and reconstruct label encodings $\varvec{y}_t$ [22], viewing the modeling process of DMs as a label denoising process. This approach has been successfully applied in fields such as medical image classification, crystal property prediction, and time series forecasting [23,24,25,26,27,28,29], yielding good results. However, to our knowledge, DMs have not yet been applied to FBP.

Therefore, to reduce the influence of noisy labels and improve prediction performance, we constructed an improved diffusion model for FBP, called FBP-Diffusion. Specifically, MobileViT [30], a lightweight general-purpose ViT, is employed as a conditional information encoder, generating an initial prediction that replaces the endpoint noise and guides label generation by being input into all timesteps during the reverse process. Subsequently, MobileViT models the noise transition matrix, which is multiplied by the softmax probability obtained by the DM during forward propagation. The cross-entropy loss between this product and the true label is then introduced into the training process and dynamically added to the noise estimation loss of the DM. In addition, we use the sampling method from the Denoising Diffusion Implicit Model (DDIM) [31] to enhance the inference efficiency of FBP-Diffusion.

Extensive comparison and ablation experiments conducted on four facial beauty datasets demonstrate that FBP-Diffusion outperforms baseline classifiers in both DMs and FBP methods. It achieves an accuracy of 73.40% on the relatively noisy LSAFBD dataset, marking a 5.17% improvement over the previous best FBP methods. It attains the highest accuracy on the more refined SCUT-FBP, SCUT-FBP5500 and CelebA datasets, with rates of 67.71%, 78.60% and 82.56%, respectively.

The main contributions of this study are as follows:

We propose FBP-Diffusion, which utilizes a DM-based classifier for FBP and integrates pretrained MobileViT to obtain conditional information for the DM, effectively addressing the issue of noisy labels in FBP.
We use the noise transition matrix to perform probability transfer and enhance the DM’s denoising capability through dynamic loss correction (DLC), further improving model performance.
Comparing FBP-Diffusion with benchmark DMs and existing FBP methods reveals its state-of-the-art performance, particularly on relatively noisy datasets, highlighting its superiority, effectiveness, and generalization.

The remainder of this paper is organized as follows: Related works are reviewed in Section 2; the overall model scheme and implementation details are presented in Section 3; experimental analysis is discussed in Section 4; and the study is summarized with future research directions in Section 5.

2 Related works

2.1 Facial beauty prediction

With the development of deep learning technology, CNNs have become a mainstream tool for FBP due to their ability to effectively extract deep features. In [32,33,34], multitask CNNs are proposed by incorporating additional information from specific tasks to enhance FBP performance. Bougourzi et al. [10] introduced a framework integrating regression CNNs, with ResneXt-50 and Inception-v3 as backbones, and employed multiple dynamic loss functions during training. Sun et al. [11, 35] leveraged attribute information such as gender and ethnicity to guide CNN training, achieving implicit feature alignment across attribute domains, and further proposed dynamic attention convolution for CNNs. Besides CNNs, other network architectures have also been explored in FBP models. Gan et al. [17, 18], Zhai et al. [36] utilized local feature fusion, transfer learning-based CNNs, and adaptive transformers independently as feature extractors for the broad learning system. Peng et al. [37], Liu et al. [38] combined ViTs with CNNs to develop a two-branch network and a dynamic convolutional transformer for FBP. Although these methods produced favorable results, only a few algorithms have been specifically designed for learning with noisy labels. Gan et al. [16, 20] designed models that functioned as both classifiers and label correctors with CNNs as the backbone and a two-stage training process. However, this approach exhibited limited effectiveness on databases with imbalanced sample sizes, revealing issues such as weak model generalization and low prediction accuracy.

2.2 Diffusion model

DMs enable accurate likelihood calculation and representation learning, with a primary use in image generation tasks. In 2022, Han et al. [22] proposed the Classification and Regression Diffusion (CARD) model, which regards deterministic classification as a denoising process for labels, allowing for more flexible uncertainty modeling during label generation. Similar to other DMs, CARD includes both forward and reverse processes, and is learned by optimizing the evidence lower bound via stochastic gradient descent [39, 40].

During the forward process, CARD, according to a variance schedule $\{{\beta _t}\}_{t=1:T} \in (0,1)^T$, adds Gaussian noise $\varvec{\epsilon } \sim \mathcal {N}(0,\varvec{I})$ to the n-dimensional one-hot label encoding $\varvec{y}_0$, gradually generating a series of intermediate variables $\varvec{y}_{1:T}$. Suppose that after T timesteps in the forward process, $\varvec{y}_0$ eventually converges to

$$\begin{aligned} p(\varvec{y}_T|\varvec{x}) = \mathcal {N}(f_\phi (\varvec{x}), \varvec{I}) \end{aligned}$$

(1)

where $\varvec{x}$ represents the input image data, $f_\phi (\varvec{x})$ is the prior knowledge of the relation between $\varvec{x}$ and $\varvec{y}_0$, which is set to 0 or $\mathbb {E}(\varvec{y}|\varvec{x})$, approximated by the pretrained conditional information encoder $f_\phi$, and $\varvec{I}$ denotes the identity matrix. Then, the conversion of adjacent intermediate variables is modeled following a Gaussian distribution:

$$\begin{aligned} q \left( \varvec{y}_t|\varvec{y}_{t-1}, f_\phi (\varvec{x}) \right) = \mathcal {N} \left( \varvec{y}_t; \sqrt{1-{\beta _t}} \varvec{y}_{t-1}+(1-\sqrt{1-{\beta _t}})f_\phi (\varvec{x}), {\beta _t}\varvec{I} \right) \end{aligned}$$

(2)

The closed-form sampling distribution at an arbitrary timestep t is

$$\begin{aligned} q \left( \varvec{y}_t|\varvec{y}_0, f_\phi (\varvec{x}) \right) = \mathcal {N} \left( \varvec{y}_t; \sqrt{\bar{\alpha }_t} \varvec{y}_0+(1-\sqrt{\bar{\alpha }_t})f_\phi (\varvec{x}), (1-{\bar{\alpha }_t})\varvec{I} \right) \end{aligned}$$

(3)

where $\alpha _t:=1-\beta _t$ and $\bar{\alpha }_t:=\prod _{s=1}^{t}\alpha _s$.
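The closed-form sampling in (3) can be illustrated with a minimal NumPy sketch. The linear schedule values ($T=1000$, $\beta_1=10^{-4}$, $\beta_T=0.02$) are those reported in Section 4.2; the function names, the five-class label, and the encoder output `f_x` are hypothetical values chosen only for illustration.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_1=1e-4, beta_T=0.02):
    """Cumulative products of alpha_t = 1 - beta_t for a linear schedule."""
    betas = np.linspace(beta_1, beta_T, T)
    return np.cumprod(1.0 - betas)

def q_sample(y0, f_x, alpha_bar_t, eps):
    """Draw y_t as in Eq. (3), given a noise vector eps ~ N(0, I)."""
    root = np.sqrt(alpha_bar_t)
    return root * y0 + (1.0 - root) * f_x + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()                  # alpha_bar[t - 1] belongs to timestep t
y0 = np.array([0.0, 0.0, 1.0, 0.0, 0.0])        # one-hot label, n = 5
f_x = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # hypothetical encoder output
y_early = q_sample(y0, f_x, alpha_bar[0], rng.standard_normal(5))
y_late = q_sample(y0, f_x, alpha_bar[-1], rng.standard_normal(5))
```

At small t the sample stays close to $\varvec{y}_0$, while at $t=T$ its mean is close to $f_\phi (\varvec{x})$, which matches the endpoint distribution in (1).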

In the reverse process, CARD gradually reconstructs the label encoding $\varvec{y}_0$ from the Gaussian noise $p(\varvec{y}_T|\varvec{x}) = \mathcal {N}(f_\phi (\varvec{x}), \varvec{I})$, conditioned on $\varvec{x}$ and $f_\phi$. The forward process posterior is as follows:

$$\begin{aligned}&q(\varvec{y}_{t-1}|\varvec{y}_t, \varvec{y}_0, \varvec{x}) = q \left( \varvec{y}_{t-1}|\varvec{y}_t, \varvec{y}_0, f_\phi (\varvec{x}) \right) \\& = \mathcal {N} \left( \varvec{y}_{t-1}; \widetilde{\varvec{\mu }} \left( \varvec{y}_t, \varvec{y}_0, f_\phi (\varvec{x}) \right), {\widetilde{\beta }_t}\varvec{I} \right) \end{aligned}$$

(4)

The detailed derivation of the sampling distribution parameters from (2) to (3), as well as the mean $\widetilde{\varvec{\mu }}$ and variance $\widetilde{\beta }_t$ of $q(\varvec{y}_{t-1}|\varvec{y}_t, \varvec{y}_0, \varvec{x})$ in (4), can be found in Appendix A.1 of [22]. A function approximator $\varvec{\epsilon }_{\theta } (\varvec{x}, \varvec{y}_t, f_\phi (\varvec{x}), t)$ is constructed to predict the diffusion noise $\varvec{\epsilon }$, and the denoised label encoding $\varvec{\hat{y}}_0$ obtained after the reverse process can be re-parameterized by (3), as follows:

$$\begin{aligned} \varvec{\hat{y}}_0 = \frac{1}{\sqrt{\bar{\alpha }_t}} \left( \varvec{y}_t - (1-{\sqrt{\bar{\alpha }_t}})f_\phi (\varvec{x}) - \sqrt{1-\bar{\alpha }_t}\varvec{\epsilon }_{\theta } (\varvec{x}, \varvec{y}_t, f_\phi (\varvec{x}), t) \right) \end{aligned}$$

(5)
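As a quick check of (5), the following sketch draws $\varvec{y}_t$ with (3) and then inverts it. It assumes the noise estimate equals the true noise, which a trained $\varvec{\epsilon }_{\theta }$ only approximates; `alpha_bar_t`, `f_x`, and the three-class label are hypothetical values.

```python
import numpy as np

def predict_y0(y_t, f_x, alpha_bar_t, eps_hat):
    """Eq. (5): invert the forward sample using a noise estimate eps_hat."""
    root = np.sqrt(alpha_bar_t)
    return (y_t - (1.0 - root) * f_x - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / root

rng = np.random.default_rng(1)
alpha_bar_t = 0.37                               # arbitrary illustrative value
y0 = np.array([0.0, 1.0, 0.0])                   # one-hot label, n = 3
f_x = np.array([0.2, 0.5, 0.3])                  # hypothetical encoder output
eps = rng.standard_normal(3)

# Forward sample from Eq. (3).
root = np.sqrt(alpha_bar_t)
y_t = root * y0 + (1.0 - root) * f_x + np.sqrt(1.0 - alpha_bar_t) * eps

# With the exact noise, Eq. (5) returns the clean label encoding.
y0_hat = predict_y0(y_t, f_x, alpha_bar_t, eps)
```

Any error in the noise estimate is scaled by $\sqrt{1-\bar{\alpha }_t}/\sqrt{\bar{\alpha }_t}$, so the reconstruction is most sensitive at large t.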

Chen et al. [24] applied the DDIM sampling method to the non-Markovian forward process of CARD, generating label encodings in fewer steps on the predefined sampling trajectory $\{T=\tau _S>\cdots>\tau _s>\cdots>\tau _1=1\}$, where $S<T$, thereby reducing the inference time significantly. After replacing t with $\tau _s$, (3) can be rewritten, and $\varvec{y}_{\tau _s}$ is calculated as follows:

$$\begin{aligned} \varvec{y}_{\tau _s} = \sqrt{\bar{\alpha }_{\tau _s}}\varvec{y}_0+\left( 1-\sqrt{\bar{\alpha }_{\tau _s}} \right) f_\phi (\varvec{x}) + \sqrt{1-\bar{\alpha }_{\tau _s}}\varvec{\epsilon } \end{aligned}$$

(6)

Likewise, the denoised label $\widetilde{\varvec{y}}_0$, i.e., the predicted value of $\varvec{y}_0$, can be calculated as

$$\begin{aligned}& \widetilde{\varvec{y}}_0 = \frac{1}{\sqrt{\bar{\alpha }_{\tau _s}}}\\& \left( \varvec{y}_{\tau _s} - \left( 1-\sqrt{\bar{\alpha }_{\tau _s}} \right) f_\phi (\varvec{x}) -\sqrt{1-\bar{\alpha }_{\tau _s}} \varvec{\epsilon }_{\theta } \left( \varvec{x}, \varvec{y}_{\tau _s}, f_\phi (\varvec{x}), \tau _s \right) \right) \end{aligned}$$

(7)

Since the function approximator $\varvec{\epsilon }_{\theta } \left( \varvec{x}, \varvec{y}_{\tau _s}, f_\phi (\varvec{x}), \tau _s \right)$ predicts $\varvec{\epsilon }$, it follows that when ${\tau _{s-1}}>0$ and $\varvec{y}_{\tau _s}$ is given, $\varvec{y}_{\tau _{s-1}}$ can be calculated from (6) as follows:

$$\begin{aligned}\varvec{y}_{\tau _{s-1}} & = \sqrt{\bar{\alpha }_{\tau _{s-1}}} \widetilde{\varvec{y}}_0+ \left( 1-\sqrt{\bar{\alpha }_{\tau _{s-1}}} \right) f_\phi (\varvec{x}) \\& + \sqrt{1-\bar{\alpha }_{\tau _{s-1}}} \varvec{\epsilon }_{\theta } \left( \varvec{x}, \varvec{y}_{\tau _s}, f_\phi (\varvec{x}), \tau _s \right) \end{aligned}$$

(8)
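The iteration defined by (7) and (8) can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: the evenly spaced trajectory is an assumption (the text only requires a decreasing sequence from T to 1), and `oracle_eps` is a hypothetical stand-in for the trained network that cheats by knowing $\varvec{y}_0$, so that the loop can be checked end to end.

```python
import numpy as np

def ddim_sample(y_T, f_x, alpha_bar, taus, eps_model):
    """Deterministic DDIM-style pass over a decreasing trajectory `taus`.

    alpha_bar[t - 1] holds the cumulative product for timestep t.
    eps_model(y, f_x, t) is any callable returning a noise estimate.
    """
    y = y_T
    for s in range(len(taus)):
        t = taus[s]
        ab = alpha_bar[t - 1]
        eps_hat = eps_model(y, f_x, t)
        # Eq. (7): current estimate of the clean label encoding.
        y0_tilde = (y - (1 - np.sqrt(ab)) * f_x - np.sqrt(1 - ab) * eps_hat) / np.sqrt(ab)
        if s == len(taus) - 1:
            return y0_tilde
        ab_prev = alpha_bar[taus[s + 1] - 1]
        # Eq. (8): move to the next point on the trajectory.
        y = np.sqrt(ab_prev) * y0_tilde + (1 - np.sqrt(ab_prev)) * f_x + np.sqrt(1 - ab_prev) * eps_hat

T, S = 1000, 10
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
taus = np.linspace(T, 1, S).round().astype(int)   # assumed evenly spaced trajectory

y0 = np.array([0.0, 1.0, 0.0])
f_x = np.array([0.3, 0.4, 0.3])                   # hypothetical encoder output

def oracle_eps(y, f_x, t):
    """Stand-in for eps_theta: the exact noise implied by y, y0 and f_x."""
    ab = alpha_bar[t - 1]
    return (y - np.sqrt(ab) * y0 - (1 - np.sqrt(ab)) * f_x) / np.sqrt(1 - ab)

rng = np.random.default_rng(0)
y_T = f_x + rng.standard_normal(3)                # sample from N(f_x, I)
pred = ddim_sample(y_T, f_x, alpha_bar, taus, oracle_eps)
```

With a perfect noise estimate the loop recovers $\varvec{y}_0$ in S steps regardless of the starting sample; with a learned estimator the result is only approximate.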

2.3 Loss correction

Loss Correction (LC) assumes that the noisy labels are transferred and corrupted from the true labels according to an unknown noise transition matrix M. It aims to learn this matrix to adjust the loss function during training, thereby reducing the model’s reliance on noisy labels and enhancing its robustness [42,43,44]. The primary approach involves first estimating M with available prior knowledge, and then correcting the loss by either multiplying M with the model prediction or by multiplying M with the loss value. The effectiveness of LC largely depends on the estimation of matrix M, and several studies have been conducted to optimize this estimation. Yao et al. [46] introduced an intermediate class to decompose M into the product of two easily estimated transition matrices. Zhang et al. [47] estimated M and learned a classifier simultaneously, proposing total variation regularization to make the predicted probabilities more distinguishable. Cheng et al. [48] observed that instances with similar appearance and poor quality are more likely to be mislabeled, and therefore formulated this hypothesis as a manifold embedding to reduce the degree of freedom of M and stabilize its estimation. Wang et al. [49] extracted data with high-confidence pseudo-labels and noisy labels to train a transition-matrix-estimation network in a federated manner. However, to the best of our knowledge, LC or noise transition matrices have not yet been used for FBP or in CARD-based classifiers.

3 Method

Similar to CARD and its variants, the proposed FBP-Diffusion model denoises and reconstructs label encodings, treating deterministic classification and the learning of noisy labels as a progressive denoising process. In this section, we present the model, along with its conditional information encoder and DLC.

3.1 FBP-Diffusion

As illustrated in Fig. 2, FBP-Diffusion comprises three parts: a denoising network, a conditional information encoder $f_\phi$, and a DLC module. In the reverse process, the Gaussian noise label encoding $\varvec{y}_T \sim \mathcal {N}(f_\phi (\varvec{x}), \varvec{I})$ is progressively reconstructed into the predicted label encoding $pre.\varvec{y}_0$, represented as $\varvec{y}_T \rightarrow \cdots \rightarrow \varvec{y}_t \rightarrow \varvec{y}_{t-1} \rightarrow \cdots \rightarrow pre.\varvec{y}_0$, with timesteps denoted by $t\sim (T \rightarrow 1)$. First, given an input facial beauty image $\varvec{x}$, we use MobileViT as the encoder $f_\phi$ to obtain the initial probability prediction $f_\phi (\varvec{x})$, which is taken as the mean of the Gaussian distribution of $\varvec{y}_T$ and is also used in the response embedding of the denoising network. “Add noise” in Fig. 2 represents obtaining $\varvec{y}_T$ from $f_\phi (\varvec{x})$. Next, using ResNet18 as the encoder $f_\textrm{R}$, we obtain high-dimensional features $f_\textrm{R}(\varvec{x})$ after the Batch Normalization (BN) layers, which serve as the image embeddings of the denoising network. The denoising network is used for each conversion step in the reverse process. Finally, in the DLC module, the estimated noise transition matrix M obtained from $f_\phi$ is multiplied by the softmax probability $pre.\varvec{y}_0$ of the label encoding in the last step of the reverse process, yielding the final output $\varvec{y}_\textrm{F}$ of FBP-Diffusion. In the training stage, the difference between the predictions $\varvec{y}_{1:T}$ before and after passing through the denoising network should be close to Gaussian noise, leading to the noise estimation loss $\mathcal {L}_{\varvec{\epsilon }}$. The final output $\varvec{y}_\textrm{F}$ should be close to the true label $\varvec{y}$, resulting in the cross-entropy loss $\mathcal {L}_c$.

The denoising network learns the noise distribution added during the forward diffusion process and parameterizes the reverse process. $\varvec{y}_t$ and $f_\phi (\varvec{x})$ are concatenated and input into a fully connected layer, and the timestep t is multiplied by the embedding layer. After processing by the BN layer and the Softplus activation function, the Hadamard product with the high-dimensional feature $f_\textrm{R}(\varvec{x})$ produced by the encoder $f_\textrm{R}$ is calculated, and the resulting vector is processed layer by layer. The output of the last fully connected layer of the denoising network is the next predicted value of $\varvec{y}_{1:T}$.

3.2 Conditional information encoder

The flexible pretrained encoder, MobileViT, is used as an auxiliary classifier, i.e., $f_\phi$, to gather conditional information that guides the denoising process while learning the noise transition matrix M, as depicted in Fig. 3. $f_\phi$ serves not only as a feature extractor but also directly outputs the probabilities $f_\phi (\varvec{x})$, with a dimension equal to the number of classes, denoted by n. $f_\phi$ is pretrained on the ImageNet-1k database, and its parameters are transferred to the FBP task and fine-tuned on the input facial beauty images to accelerate the FBP-Diffusion learning process. The structure starts with a standard $3 \times 3$ convolutional layer, followed by the Inverted Residual Block (MV2) proposed in MobileNetv2, the core MobileViT block, and the global pooling and fully connected layers. $\downarrow \!2$ indicates down-sampling in the module, and L denotes the number of Transformer encoders [51] in the MobileViT block.

In the FBP-Diffusion reverse process, the image input is $\varvec{x} \in \mathbb {R} ^{H \times W \times C}$, where H, W, and C represent the height, width, and number of image channels, respectively. An n-dimensional vector is obtained by the encoder $f_\phi$, and the dimension becomes 2n after concatenation with $\varvec{y}_t$. A high-dimensional vector of dimension $fea_{dim}$ is then obtained by mapping through the fully connected layer. We set $fea_{dim}$, the dimension of the timestep t after the embedding layer, and the dimension of $f_\textrm{R}(\varvec{x})$ to 2048, and keep the dimension unchanged until the last fully connected layer of the denoising network remaps it from 2048 to n. Because the denoising network contains several fully connected layers, the memory consumption of the model increases with $fea_{dim}$. In the DLC module, the fine-tuned $f_\phi$ outputs the initial probability predictions for the whole training set, and the noise transition matrix M is obtained by normalizing the resulting confusion matrix.
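One way to obtain M from the encoder's training-set predictions is sketched below. The text does not state how the confusion matrix is oriented or along which axis it is normalized, so this sketch assumes rows index the given training label, columns index the class predicted by $f_\phi$, and each row is normalized to sum to 1. The function name and toy labels are hypothetical.

```python
import numpy as np

def transition_matrix(given_labels, pred_labels, n_classes):
    """Row-normalized confusion matrix: M[i, j] ~ P(pred = j | given = i).

    The orientation and normalization axis are assumptions of this sketch.
    """
    counts = np.zeros((n_classes, n_classes))
    np.add.at(counts, (given_labels, pred_labels), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Classes with no samples fall back to an identity row.
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1.0), np.eye(n_classes))

given = np.array([0, 0, 1, 1, 2, 2])    # toy training labels
pred = np.array([0, 1, 1, 1, 2, 0])     # toy argmax predictions of the encoder
M = transition_matrix(given, pred, n_classes=3)
```

Here class 0 is predicted as class 1 half of the time, so `M[0, 1]` is 0.5, while class 1 is always predicted correctly and its row is a unit vector.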

3.3 Dynamic loss correction

The powerful conditional information encoder effectively guides the reverse process, ensuring that the prediction $pre.\varvec{y}_0$ of the DM is closer to the true label. However, when the model excessively focuses on the conditional information, the generated labels strictly adhere to this information, which reduces generation diversity and weakens the denoising capability. To address this issue, alternative noisy label learning methods may be considered. However, a two-stage training approach solely for purifying noisy labels is not suitable for FBP. In this context, we introduce a DLC module to optimize the training process of the DM by dynamically adjusting the total loss.

During the training stage, CARD and its variants calculate the loss based on the predicted $\varvec{y}_{1:T}$ values before and after the denoising network, reflecting a single conversion of the intermediate variable, and optimize the network parameters with a noise estimation loss, mainly Mean Squared Error (MSE), and a Maximum-Mean Discrepancy (MMD) regularization loss [23]. During the inference stage, complex calculations must be performed at each timestep to obtain the prediction $pre.\varvec{y}_0$, and the overall performance of the DM is then verified, which requires significant computing resources. In other words, obtaining $pre.\varvec{y}_0$ during the training phase and calculating the loss between this prediction and the true label is difficult. Therefore, we calculate the re-parameterized denoised label $\varvec{\hat{y}}_0$ and use the cross-entropy loss between $\varvec{\hat{y}}_0$ and $\varvec{y}$ to train the denoising network, rather than running inference timestep by timestep. The detailed process is expressed in Algorithm 1.

The derivation and calculation of the cross-entropy loss are as follows:

$$\begin{aligned} \mathcal {L}_c& = \mathcal {L}_c \left( \varvec{y},\ \hat{p}(\varvec{y}_\text{F}|\varvec{x}) \right) \\&= -\varvec{y}^{\textrm{T}}\log \ \hat{p}\left( \varvec{y}_{\text{F}}|\varvec{x} \right) \\& =-\sum \nolimits _{k=1}^n{\log \ \hat{p}\left( \varvec{y}_{\text{F}}=\varvec{y}_k|\varvec{x}\right) } \\& =-\sum\nolimits _{k=1}^n \log \ \left( \sum \nolimits _{i=1}^n p\left( \varvec{y}_{\textrm{F}}=\varvec{y}_k|pre.\varvec{y}_0=\varvec{y}_i \right) \right.\\& \left. \hat{p}\left( pre.\varvec{y}_0=\varvec{y}_i|\varvec{x} \right) \right) \\& =-\sum \nolimits _{k=1}^n{\log \ \sum \nolimits _{i=1}^n{\overline{M}_{ik}}\ \hat{p} \left( pre.\varvec{y}_0=\varvec{y}_i|\varvec{x} \right) } \nonumber \\&=-\sum\nolimits_{k=1}^n{\log \ \sum\nolimits_{i=1}^n{(M^\text{T})_{ik}}\ \hat{p} \left( pre.\varvec{y}_0=\varvec{y}_i|\varvec{x} \right) } \\& =-\sum\nolimits _{k=1}^n \log \ \sum\nolimits _{i=1}^n{(M^\text{T})_{ik}} \\& \cdot \frac{\exp \left( - \left( \hat{\varvec{y}}_0-1_n \right) _{i}^{2} \right) }{\sum\nolimits _{j=1}^n {\exp \left( - \left( \hat{\varvec{y}}_0-1_n \right) _{j}^{2} \right) }} \end{aligned}$$

(9)

where $\varvec{y}$ and $\varvec{y}_{\textrm{F}}$ represent the ground truth labels and the output of FBP-Diffusion, respectively; n is the number of classes; and k, i, and j index the k-th, i-th, and j-th classes, respectively. $1_n$ denotes an n-dimensional vector with all elements equal to 1. $\overline{M}$ denotes the general noise transition matrix, which contains the transition probabilities from the noisy label to the denoised label. In our work, $\overline{M}$ is defined as $M^\textrm{T}$, where M is the prior information provided by the conditional information encoder, expressed as a probabilistic prediction of the training set. Other symbols not explained here can be found in the nomenclature at the end of this paper.

By applying the law of total probability, we decompose the target conditional probability $\hat{p}\left( \varvec{y}_{\textrm{F}}=\varvec{y}_k|\varvec{x}\right)$ in (9) into a sum of products of conditional probabilities over the intermediate variable $pre.\varvec{y}_0$.
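The computation in (9) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: probabilities are treated as row vectors so that the transfer reads `p @ M.T`, the loss follows the $-\varvec{y}^{\textrm{T}}\log \hat{p}$ form in the second line of (9), and a small constant is added inside the logarithm for numerical safety. The function name and inputs are hypothetical.

```python
import numpy as np

def corrected_cross_entropy(y0_hat, M, y_onehot):
    """Cross-entropy after probability transfer, following Eq. (9)."""
    # Softmax over -(y0_hat - 1)^2, as in the last line of Eq. (9).
    logits = -(y0_hat - 1.0) ** 2
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Probability transfer: y_F[k] = sum_i (M^T)[i, k] * p[i].
    y_f = p @ M.T
    # Cross-entropy against the one-hot label; 1e-12 guards against log(0).
    loss = -float(y_onehot @ np.log(y_f + 1e-12))
    return loss, y_f

y0_hat = np.array([0.0, 1.0, 0.0])          # re-parameterized label from Eq. (5)
y_true = np.array([0.0, 1.0, 0.0])
loss_id, y_f_id = corrected_cross_entropy(y0_hat, np.eye(3), y_true)

M = np.array([[0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5]])             # hypothetical transition matrix
loss_m, y_f_m = corrected_cross_entropy(y0_hat, M, y_true)
```

With the identity matrix the transfer leaves the softmax probabilities unchanged. With a general M the entries of $\varvec{y}_{\textrm{F}}$ need not sum to 1, which is consistent with the normalization of $\varvec{y}_{\textrm{F}}$ described for the inference phase.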

Because of the differences between the re-parameterized $\varvec{\hat{y}}_0$ and $pre.\varvec{y}_0$ obtained via step-by-step inference, the value of $\mathcal {L}_c$ during training is higher and does not change significantly, whereas the noise estimation loss $\mathcal {L}_{\varvec{\epsilon }}$ decreases rapidly and exhibits lower values. To match the characteristics of DMs training and inference and to further optimize the training process, we sum these two losses, with the total loss denoted by $\mathcal {L}$, as follows:

$$\begin{aligned} \mathcal {L} = \left( \frac{1}{\mathcal {L}_{\varvec{\epsilon }} / (\mathcal {L}_c + 1)} \right) \cdot \mathcal {L}_{\varvec{\epsilon }} + \left( \frac{\mathcal {L}_{\varvec{\epsilon }} / \mathcal {L}_c}{\mathcal {L}_{\varvec{\epsilon }} / (\mathcal {L}_c + 1)} \right) \cdot \mathcal {L}_c \end{aligned}$$

(10)
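To make the weighting in (10) concrete, the sketch below evaluates it for plain scalar losses; the function name and the two loss values are hypothetical. Evaluated as an algebraic expression, each weighted term reduces to $\mathcal {L}_c + 1$, so the two coefficients only shape the optimization if they are treated as constants during backpropagation. Whether the coefficients are excluded from the gradient is not stated in the text and is an assumption of this reading.

```python
def dynamic_total_loss(loss_eps, loss_c):
    """Evaluate Eq. (10) for scalar loss values."""
    scale = loss_eps / (loss_c + 1.0)
    w_eps = 1.0 / scale                      # weight on the noise estimation loss
    w_c = (loss_eps / loss_c) / scale        # weight on the cross-entropy loss
    return w_eps * loss_eps + w_c * loss_c, w_eps, w_c

# Hypothetical values: a small noise estimation loss and a larger cross-entropy loss.
total, w_eps, w_c = dynamic_total_loss(loss_eps=0.05, loss_c=1.6)
```

For these values the weight on $\mathcal {L}_{\varvec{\epsilon }}$ is 52 and the weight on $\mathcal {L}_c$ is 1.625, so the smaller loss is scaled up to the same magnitude as the larger one.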

In addition, we use the DDIM sampling method to improve inference efficiency, expressed as $pre.\varvec{y}_0 = \textrm{DDIM}(\varvec{y}_T=f_\phi (\varvec{x}),\varvec{x})$. Owing to the low dimensionality of the label vector, the model performs well even when S is much smaller than T, thereby significantly reducing computation time. The inference phase of FBP-Diffusion is detailed in Algorithm 2. First, the parameterized $\widetilde{\varvec{y}}_0$ is calculated by (7), and the next prediction $\varvec{y}_{\tau _{s-1}}$ is iteratively computed, yielding $pre.\varvec{y}_0$ after S sampling steps. Subsequently, the softmax probability $pre.\varvec{y}_0$ is multiplied by $M^{\textrm{T}}$ to obtain $\varvec{y}_{\textrm{F}}$. Finally, $\varvec{y}_{\textrm{F}}$ is normalized to [0, 1], yielding the final output of FBP-Diffusion.

4 Experimental results and analysis

4.1 Experimental objects

SCUT-FBP [12] is a facial beauty database created by the South China University of Technology, containing 500 frontal images of young Asian female faces at varying resolutions, each with a beauty score. The beauty scores of all images range from 1 to 5, with higher scores indicating higher beauty levels, and the scores were averaged from the ratings provided by 70 volunteers. We used the mode of the scores to classify the images into five categories: “1”, “2”, “3”, “4” and “5”, corresponding to “very unattractive”, “not attractive”, “average”, “attractive”, and “very attractive”, respectively.
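Assigning a class by the mode of the ratings can be sketched as follows. The ratings are hypothetical, and the tie-breaking rule (the lowest score wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

def mode_label(scores):
    """Most frequent integer score; ties resolve to the lowest score (an assumption)."""
    counts = Counter(scores)
    best = max(counts.values())
    return min(score for score, c in counts.items() if c == best)

ratings = [3, 4, 3, 2, 3, 4, 4, 3]   # hypothetical ratings from eight volunteers
label = mode_label(ratings)
```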

SCUT-FBP5500 [13] is another facial beauty database created by the South China University of Technology, containing 5,500 frontal facial images with beauty scores, covering different genders, ages and races, with a resolution of $350 \times 350$. There were 2,000 images of Asian women, 2,000 images of Asian men, 750 images of Caucasian men, and 750 images of Caucasian women. All images were rated on a scale ranging from 1 to 5, with higher scores indicating greater beauty, and these scores were averaged from assessments made by 60 volunteers. Likewise, we categorized the images into five classes using the mode of the scores, with the numerical labels and their meanings consistent with those in SCUT-FBP.

LSAFBD [15] is a facial beauty database created by our project team, containing 100,000 frontal facial images, covering various backgrounds, poses and ages, with a resolution of $144 \times 144$, of which 80,000 are unlabeled images and 20,000 are labeled images. We used 10,000 labeled images of Asian female faces as subjects, with a total of five attractiveness categories of “0”, “1”, “2”, “3” and “4”, corresponding to “very unattractive”, “unattractive”, “average”, “attractive” and “very attractive”, respectively. The images were evaluated by 75 volunteers, and the scores were averaged.

CelebA [14] is a facial attribute database collected and published by the Chinese University of Hong Kong, containing 202,599 facial images of 10,177 celebrities, with a resolution of $178 \times 218$. Each image has 40 attribute annotations. Using the “Attractive” label, 118,165 female images were divided into two categories: “unattractive” and “attractive”, corresponding to “1” and “2”, with 37,911 and 80,254 images, respectively. Figure 4 shows the distribution ratios and image examples of the four datasets, revealing an imbalance in their sample quantities.

4.2 Experimental environment

All experiments in this study were implemented on PyTorch 1.10.0, CUDA 11.3, and Ubuntu 22.04. The device included an NVIDIA GeForce RTX 3080 Ti GPU, a 12th Gen Intel® Core™ i9-12900K$\times$24 CPU, 64 GB of RAM, and a 2.5 TB hard drive. Each database was divided into a training set and a testing set at a ratio of 8:2. All images were resized to $224 \times 224$ resolution, and the training images were randomly flipped horizontally, rotated by $\pm 10$ degrees, processed via color jitter, and normalized. Because of its limited sample size, the training set of SCUT-FBP was additionally augmented, with each image being rotated by $180^\circ$ and then horizontally flipped to generate four versions. For the diffusion process, the number of timesteps T was set to 1000 and S to 10, and the noise was adjusted linearly according to $\beta _1=1\times 10^{-4}$ and $\beta _T=0.02$. ResNet18 was used as the encoder $f_{\textrm{R}}$ without pre-training, and MobileViT_xs was used as the conditional information encoder $f_\phi$. The number of fine-tuning epochs, i.e., the pre-epoch, was set to 11, and the training result of the last pre-epoch was selected. FBP-Diffusion was trained for 100 epochs, with inference performed every 10 epochs. The value of $fea_{dim}$ was set to 2048, and the batch size was 16 in all experiments. During training, the Adam optimizer was used with an initial learning rate of $1\times 10^{-4}$. The learning rate of the diffusion model was adjusted using a half-cycle cosine decay function, while that of $f_\phi$ was not subject to additional processing. The values of other hyperparameters not mentioned above were set in accordance with [22].
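The learning-rate schedule can be illustrated with the usual half-cycle cosine form. The initial rate ($1\times 10^{-4}$) and the 100 epochs come from the settings above; the floor `min_lr` of zero, the absence of warm-up, and the per-epoch update granularity are assumptions of this sketch.

```python
import math

def half_cycle_cosine_lr(epoch, total_epochs=100, base_lr=1e-4, min_lr=0.0):
    """Learning rate after `epoch` epochs of a half-cycle cosine decay."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term

# Rates at the epochs where inference is run (every 10 epochs).
schedule = [half_cycle_cosine_lr(e) for e in range(0, 101, 10)]
```

The rate starts at the initial value, passes through half of it at the midpoint of training, and reaches the floor at the final epoch.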

4.3 Evaluation metrics

We evaluate the performance of FBP-Diffusion with image classification metrics, including Accuracy (Acc), Precision (Prec), Recall (Rec), and F1 Score (F1), defined by

$$\begin{aligned} \textrm{Acc} = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$

(11)

$$\begin{aligned} \textrm{Prec} = \frac{TP}{TP+FP} \end{aligned}$$

(12)

$$\begin{aligned} \textrm{Rec} = \frac{TP}{TP+FN} \end{aligned}$$

(13)

$$\begin{aligned} \textrm{F}1 = \frac{2 \times \textrm{Prec} \times \textrm{Rec}}{\textrm{Prec} + \textrm{Rec}} \end{aligned}$$

(14)

where TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively. For binary classification, Prec, Rec, and F1 were calculated by treating classes “1” and “2” as the positive class in turn, and the three metrics were then averaged with weights equal to the sample proportions. Likewise, for multi-class classification, these metrics were computed for each class and then averaged to obtain the final values.

Because the weighted-average recall is essentially the same as the accuracy of the model, we chose the macro average to calculate Rec in multi-class classification and the weighted average for the other metrics. The higher the values of Accuracy, Precision, Recall, and F1, the better the model’s performance. In addition, the training time of the DMs, excluding the fine-tuning time of the conditional information encoder, was used to evaluate the efficiency of the model.
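The relation between the averaging modes can be checked with a small example. The sketch below computes Eqs. (11) to (14) per class from a confusion matrix and then applies macro and weighted averaging; the confusion matrix and the function names are made up for illustration and do not correspond to any result in this paper.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    rows = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                         # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp    # predicted c, true class differs
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        rows.append({"support": tp + fn, "prec": prec, "rec": rec, "f1": f1})
    acc = sum(cm[c][c] for c in range(n)) / total
    return acc, rows

def average(rows, key, mode):
    """'macro' = unweighted mean over classes; 'weighted' = mean weighted by support."""
    if mode == "macro":
        return sum(r[key] for r in rows) / len(rows)
    total = sum(r["support"] for r in rows)
    return sum(r[key] * r["support"] for r in rows) / total

# Hypothetical imbalanced 3-class confusion matrix (rows: true, columns: predicted).
cm = [[8, 2, 0],
      [3, 30, 7],
      [0, 5, 5]]

acc, rows = per_class_metrics(cm)
weighted_rec = average(rows, "rec", "weighted")  # equals acc
macro_rec = average(rows, "rec", "macro")        # penalizes the weak minority class
```

In this example the weighted recall coincides with the accuracy (43/60), whereas the macro recall is lower because the minority class with recall 0.5 counts as much as the majority class. This is why macro-averaged Rec carries information that Acc does not.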

4.4 Comparison experiments with baseline methods

With CARD and its variants as baseline methods, we conducted experiments on four facial beauty databases and compared their performance with FBP-Diffusion. The experimental results are listed in Tables 1−4, ordered by the size of the databases. The best results are highlighted in bold, and the second-best results are underlined. To ensure a fair comparison, all comparative experiments for DMs were conducted under an identical experimental environment and model configuration, including the versions of third-party libraries in the running environment, input image size, timesteps T and S, $\beta _t$, number of epochs, $fea_{dim}$, batch size, and initial learning rate. The only exception was the pre-epoch, which is determined by the design of $f_\phi$. The values of other hyperparameters were adopted as specified in the original papers.

Table 1 Comparison with the baseline method on the SCUT-FBP

Table 2 Comparison with the baseline method on the SCUT-FBP5500

Table 3 Comparison with the baseline method on the LSAFBD

Table 4 Comparison with the baseline method on the CelebA

Among the baseline methods, CARD first applied DMs to the classification task, with both encoders $f_\textrm{R}$ and $f_\phi$ set to ResNet18. Building on CARD, DiffMIC [23] applied DMs to general medical image classification, introduced a dual-granularity conditional guidance model as $f_\phi$, and optimized the model parameters with the MMD regularization loss. CD-Loop [25] used DMs to predict chromatin loops accurately and reliably, with a $28 \times 28$ Hi-C contact map as the input and the LeNet5 model as the encoder $f_\phi$. LRA-diffusion [24] used a pre-trained CLIP model as $f_\phi$, retrieved labels from the k-nearest neighbors in the training set based on the neighbor consistency principle to replace the true labels when modeling DMs, and accelerated the training of DMs with the DDIM sampling method, saving the output of $f_\phi$ for later reuse.

From Tables 1 to 4, FBP-Diffusion achieves the highest values for Acc, Prec, Rec, and F1 in most cases, except for Recall on SCUT-FBP, where it attains the second-highest value. A comparative analysis of the performance across the four databases reveals that LRA-diffusion, particularly on SCUT-FBP, exhibits a higher Recall value, which appears to be derived from a weighted average, as its value closely aligns with the other three metrics. This can be attributed to the varying judgment capabilities of different noise label learning methods on the positive and negative classes, with LRA-diffusion demonstrating greater sensitivity to the minority classes in smaller databases. Overall, FBP-Diffusion demonstrates the best classification performance.

Upon further examination, CD-Loop is limited by the input size and encoder settings, and the k-nearest neighbor retrieval method of LRA-diffusion is not suitable for FBP as it struggles with high similarity of image features involved in FBP tasks. DiffMIC leverages global and local features of images effectively and exhibits the second-best classification result. LRA-diffusion exhibits the shortest training time, followed by FBP-Diffusion. DiffMIC implements joint training for $f_\phi$ and the denoising networks, thus exhibiting the longest training time.

4.5 Comparison with state-of-the-art FBP methods

To further verify the effectiveness of the proposed method, we compared the accuracy of FBP-Diffusion with that of other state-of-the-art FBP methods on four databases. The results are listed in Table 5, ordered by the year of publication. Among the methods considered, 2M BeautyNet [32] is a multi-input and multi-task CNN model that uses gender recognition to assist FBP tasks. The self-correcting noise label method [20] and DIDTAN [16] serve both as classifiers and label correctors and are trained in a multistage manner. E-BLS and ER-BLS [17] use EfficientNet based on transfer learning as the feature extractor and further refine the features and fit the prediction results through BLS. Moreover, TransBLS-T and TransBLS-B [18] capture global and local features of the face with the GLAFormer encoder and effectively combine ViTs [51] with BLS. The adaptive multitask method [19] adopts an adaptive sharing strategy and attention feature fusion to improve FBP accuracy. In addition, other attention models were compared. Swin Transformer [53] improves computational efficiency and model representation through a shifted-window attention mechanism. MLP-Mixer [54] is based entirely on multilayer perceptrons and relies solely on basic matrix multiplication for data processing and feature extraction.

Table 5 Comparison of accuracies (%) of state-of-the-art FBP methods on the Facial Beauty database

The results indicate that the classification accuracy of FBP-Diffusion is better than those of the existing methods, reaching 67.71$\%$, 78.60$\%$, 73.40$\%$, and 82.56$\%$ on the four datasets, which are 1.04$\%$, 2.56$\%$, 5.17$\%$, and 0.71$\%$ higher than the second-best results, respectively.

4.6 Ablation experiment

To evaluate the effectiveness of the $f_\phi$ and DLC modules in FBP-Diffusion, ablation experiments were performed on the four databases, with the results listed in Table 6. Here, “Diffusion” refers to CARD enhanced with DDIM and some detailed implementation changes in the model fine-tuning process. Apart from the change of the encoder $f_\phi$ and the addition of the DLC module, it is identical to FBP-Diffusion, which is denoted as “Diffusion+MobileViT+DLC”. The ablation experiments were performed under the same experimental conditions and model configurations described in Section 4.2. Furthermore, experiments conducted on the same database used the same fine-tuned MobileViT model, with its pre-training time denoted by “-” since it differs from “Time”, which refers to the training time of DMs.

Table 6 Effectiveness of each module in FBP-Diffusion

The data presented in Table 6 indicate that the accuracy gradually increases with the application of MobileViT and the addition of the DLC module, and that MobileViT has a larger impact on the model’s accuracy than the DLC module, with improvements ranging from 0.49% to 7.29% and from 0% to 0.4%, respectively. In short, FBP-Diffusion outperforms the two baseline models used and achieves the best classification accuracy. It exhibits the highest values in almost all classification metrics, except for the Prec value on the LSAFBD database. This is attributed to the probability transfer in the DLC module, which comes at the cost of a certain amount of misprediction, thus lowering the Prec or Rec values. In smaller databases, there is more variation in the sample size for each category, resulting in lower macro-averaged Rec values compared to other metrics. On the LSAFBD database, the “Diffusion” model achieves an accuracy of 70.39%, surpassing all other methods listed in Table 5, which further validates the advantages of diffusion models on noisy databases. In addition, “Diffusion+MobileViT” requires the shortest training time, while FBP-Diffusion requires the second shortest.

Notably, on the more refined small-scale databases, the DLC module has almost no improvement in prediction accuracy, but performs well on the other three metrics, with a maximum improvement of 6.86%. On large-scale databases, the denoising capability and performance of DMs are relatively good, and the DLC module exhibits a less pronounced effect on the overall performance improvement. To illustrate the effects of loss correction in greater detail and highlight the effectiveness of the DLC module, the confusion matrices of the output results before and after correction are depicted in Fig. 5, and the loss curves during the training of FBP-Diffusion are depicted in Fig. 6.

The DLC module modifies the outputs corresponding to the samples through probabilistic transitions, while DMs minimize the overall loss between these modified outputs and the true labels by means of iterative training, thereby enabling a larger number of samples to be accurately classified. The confusion matrix for the SCUT-FBP database in Fig. 5 shows that some samples originally predicted as “3” are reclassified as “4” after the addition of the DLC module. For the SCUT-FBP5500 database, some samples originally predicted as “2” are reclassified as “1”, while others originally predicted as “3” are reclassified as “4” following DLC processing. Overall, the number of samples whose predicted labels change from incorrect to correct is greater than the number of samples whose labels change from correct to incorrect. A similar situation is also observed in the other two databases. However, the effect of the DLC module is less noticeable on the CelebA database due to its large scale and the binary classification task.
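The mechanism described above can be pictured with a generic forward-correction sketch in the spirit of loss correction with a noise transition matrix [45]. It is only an illustration under stated assumptions: the matrix values, the fixed weight, and all function names are hypothetical, and the way FBP-Diffusion learns M and schedules the weight dynamically is not reproduced here.

```python
import math

def transfer(probs, M):
    """Probability transfer: q[j] = sum_i probs[i] * M[i][j].

    M[i][j] is read as the probability that class i is observed as label j,
    so every row of M sums to 1 (assumed convention).
    """
    n = len(probs)
    return [sum(probs[i] * M[i][j] for i in range(n)) for j in range(n)]

def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy of a probability vector against an integer label."""
    return -math.log(max(probs[label], eps))

def total_loss(noise_loss, probs, label, M, weight):
    """Noise estimation loss plus a weighted cross-entropy on transferred probabilities.

    `weight` is a hypothetical scalar; a dynamic schedule would vary it during training.
    """
    return noise_loss + weight * cross_entropy(transfer(probs, M), label)

# Hypothetical 3-class example.
probs = [0.1, 0.6, 0.3]            # predicted probabilities after the reverse process
M = [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]              # assumed transition matrix, rows sum to 1
q = transfer(probs, M)             # approximately [0.15, 0.55, 0.30]
loss = total_loss(0.2, probs, 1, M, weight=0.5)
```

Because each row of the matrix sums to 1, the transferred vector remains a valid probability distribution; probability mass is moved between neighbouring classes, which is consistent with the reclassifications between adjacent beauty levels visible in the confusion matrices.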

In Fig. 6, the blue curve represents the noise estimation loss $\mathcal {L}_{\varvec{\epsilon }}$, the black curve represents the cross-entropy loss $\mathcal {L}_c$, and the red curve represents the overall loss $\mathcal {L}$ of FBP-Diffusion. $\mathcal {L}_c$ is derived by re-parameterizing $\varvec{\hat{y}}_0$ and the true labels $\varvec{y}$. As the difference between them is large, $\mathcal {L}_c$ exhibits a higher value and a smaller variance. However, $\mathcal {L}_{\varvec{\epsilon }}$ exhibits a significant decreasing trend from the beginning, and despite subsequent fluctuations, it remains at a low level. These two losses reflect different convergence characteristics. When combined, the trend of $\mathcal {L}$ generally follows that of $\mathcal {L}_{\varvec{\epsilon }}$, indicating that $\mathcal {L}_{\varvec{\epsilon }}$ dominates the overall loss. In the later stages of training, most of the loss values on the four databases fluctuate within 0.3. It can be seen that the larger the size of the databases, the smaller the initial loss and the faster the model converges to the same loss value. For the CelebA dataset, the loss initially drops below 0.4 and reaches a state close to 0 more quickly. This is because larger databases contain more samples and features, and the model is able to find valid feature representations faster.

Table 7 Effects of different values of $fea_{dim}$ and S on FBP-Diffusion performance

Table 7 lists the accuracy and training time of FBP-Diffusion for different values of the hyperparameters $fea_{dim}$ and S. When $fea_{dim}$ is varied, the value of S is set to 10; when S is varied, the value of $fea_{dim}$ is set to 2048. As the hyperparameter values increase, the training time gradually increases. The accuracy also exhibits a noticeable pattern, peaking at $fea_{dim}=2048$ and $S=10$ and decreasing as the values deviate from these points. This overall trend also holds for SCUT-FBP, although it is weaker there. When S varies between 10 and 40, the prediction accuracy of the model hardly changes and the training time varies irregularly because it is very short in all cases, indicating that the optimal value of S is not fixed across databases. When applying diffusion models to other domains, it is recommended to perform several experiments to determine the optimal value empirically. The accuracy values do not vary significantly, indicating that with lower $fea_{dim}$ and S values, the model still performs well while considerably reducing training time and memory usage.

5 Conclusions

To mitigate the impact of noisy labels and enhance prediction performance, we proposed FBP-Diffusion, a diffusion model that integrates MobileViT and DLC. MobileViT serves as the conditional information encoder to generate high-accuracy preliminary predictions that guide label generation during the reverse process in DMs. Using the learned noise transition matrix, we introduced a DLC module to increase the cross-entropy loss between the predicted probabilities and true labels, which is dynamically incorporated into the noise estimation loss to influence the training process of FBP-Diffusion. Experimental results on the SCUT-FBP, SCUT-FBP5500, LSAFBD, and CelebA databases demonstrate that FBP-Diffusion outperforms conventional DMs and existing FBP methods, achieving notable results and showing its applicability to other learning tasks and related fields. Additionally, as DMs have inherently strong denoising capabilities and high robustness with respect to noisy labels, the improvement in accuracy due to DLC is not substantial. Future work should focus on further optimizing the noise transition matrix and exploring other noisy label learning methods to continuously refine the model design.

Data Availability

LSAFBD: The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy. SCUT-FBP: The dataset utilized in this paper is publicly available at http://www.hcii-lab.net/data/SCUT-FBP/. SCUT-FBP5500: The dataset utilized in this paper is publicly available at https://github.com/HCIILAB/SCUT-FBP5500-Database-Release. CelebA: The dataset utilized in this paper is publicly available at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.

Abbreviations

$\beta _t$ :: noise scheduling parameter at timestep t
$\varvec{\epsilon }$ :: Gaussian noise of DM
$\varvec{\epsilon }_{\theta }$ :: function approximator, ${\theta }$ denotes the model parameters
$\varvec{\hat{y}}_0$ :: re-parameterized denoised label encoding
$\varvec{I}$ :: the identity matrix
$\varvec{x}$ or $\varvec{x}_0$ :: original image data
$\varvec{y}$ or $\varvec{y}_0$ :: one-hot encoding of the ground truth labels
$\varvec{y}_\textrm{F}$ :: final output of FBP-Diffusion
$\varvec{y}_{1:T}$ :: all variables of $\varvec{y}$ from timestep 1 to T
$\varvec{y}_{\tau _s}$ :: $\varvec{y}_t$ in DDIM
$\mathcal {L}$ :: the total loss
$\mathcal {L}_c$ :: the cross-entropy loss
$\mathcal {L}_{\varvec{\epsilon }}$ :: noise estimation loss
$\overline{M}$ :: general noise transition matrix
$\tau _s$ :: t in DDIM, where s is lowercase
$\widetilde{\beta }_t$ :: posterior variance of the forward process
$\widetilde{\varvec{\mu }}$ :: posterior mean of the forward process
$\widetilde{\varvec{y}}_0$ :: $\varvec{\hat{y}}_0$ in DDIM
$f_\textrm{R}$ :: encoder in denoising network
$f_\textrm{R}(\varvec{x})$ :: high-dimensional features from $f_\textrm{R}$
$f_\phi$ :: the conditional information encoder
$f_\phi (\varvec{x})$ :: the prior knowledge of the relation between $\varvec{x}$ and $\varvec{y}_0$
$fea_{dim}$ :: dimension of intermediate layers in the denoising network
M :: noise transition matrix in our work
n :: the number of classes
$pre.\varvec{y}_0$ :: label encoding obtained by the reverse process
q :: approximate probability distribution
S :: T in DDIM, where S is uppercase
T :: total number of diffusion steps
t :: timestep

References

[1] Luthfi M, Rachmadi RF, Purnama IKE, Nugroho SMS (2022) Mobile device facial beauty prediction using convolutional neural network as makeup reference. In: 2022 International conference on computer engineering, network, and intelligent multimedia (CENIM), pp 1–5. https://doi.org/10.1109/CENIM56801.2022.10037321
[2] Chen H, Li W, Gao X, Xiao B (2023) Aep-gan: aesthetic enhanced perception generative adversarial network for asian facial beauty synthesis. Appl Intell 53(17):20441–20468. https://doi.org/10.1007/s10489-023-04576-7
[3] Peng T, Li M, Chen F, Xu Y, Xie Y, Sun Y, Zhang D (2024) Isfb-gan: interpretable semantic face beautification with generative adversarial network. Expert Syst Appl 236:121131. https://doi.org/10.1016/j.eswa.2023.121131
[4] Atiyeh BS, Chahine F (2021) Outcome measurement of beauty and attractiveness of facial aesthetic rejuvenation surgery. J Craniofacial Surg 32(6):2091–2096. https://doi.org/10.1097/SCS.0000000000007821
[5] Guo M, Xu F, Wang S, Wang Z, Lu M, Cui X, Ling X (2024) Synthesis, style editing, and animation of 3d cartoon face. Tsinghua Sci Technol 29(2):506–516. https://doi.org/10.26599/TST.2023.9010028
[6] Przylipiak M, Przylipiak J, Terlikowski R, Lubowicka E, Chrostek L, Przylipiak A (2018) Impact of face proportions on face attractiveness. J Cosmet Dermatol 17(6):954–959. https://doi.org/10.1111/jocd.12783
[7] Arabo W, Abdulazeez AM (2024) Facial beauty prediction based on deep learning: A review. Indonesian J Comput Sci 13(1)
[8] Laurinavičius D, Maskeliūnas R, Damaševičius R (2023) Improvement of facial beauty prediction using artificial human faces generated by generative adversarial network. Cognit Comput 15(3):998–1015. https://doi.org/10.1007/s12559-023-10117-8
[9] Bae J, Buu S-J, Lee S (2024) Anchor-net: distance-based self-supervised learning model for facial beauty prediction. IEEE Access 12:61375–61387. https://doi.org/10.1109/ACCESS.2024.3394870
[10] Bougourzi F, Dornaika F, Barrena N, Distante C, Taleb-Ahmed A (2023) Cnn based facial aesthetics analysis through dynamic robust losses and ensemble regression. Appl Intell 53(9):10825–10842. https://doi.org/10.1007/s10489-022-03943-0
[11] Sun Z, Lin L, Yu Y, Jin L (2024) Learning feature alignment across attribute domains for improving facial beauty prediction. Expert Syst Appl 249:123644. https://doi.org/10.1016/j.eswa.2024.123644
[12] Xie D, Liang L, Jin L, Xu J, Li M (2015) Scut-fbp: a benchmark dataset for facial beauty perception. In: 2015 IEEE international conference on systems, man, and cybernetics, pp 1821–1826. https://doi.org/10.1109/SMC.2015.319
[13] Liang L, Lin L, Jin L, Xie D, Li M (2018) Scut-fbp5500: a diverse benchmark dataset for multi-paradigm facial beauty prediction. In: 2018 24th International conference on pattern recognition (ICPR), pp 1598–1603. https://doi.org/10.1109/ICPR.2018.8546038
[14] Liu Z, Luo P, Wang X, Tang X (2015) Deep learning face attributes in the wild. In: Proceedings of the IEEE international conference on computer vision, pp 3730–3738
[15] Zhai Y, Huang Y, Xu Y, Zeng J, Yu F, Gan J (2016) Benchmark of a large scale database for facial beauty prediction. In: Proceedings of the 1st international conference on intelligent information processing, pp 1–5. https://doi.org/10.1145/3028842.3028863
[16] Gan J, Wu B, Zou Q, Zheng Z, Mai C, Zhai Y, He G (2022) Dual-input dual-task attention network incorporating noisy label correction mechanism for facial beauty prediction. Chin J J Signal Process 38(10):2124–2133. https://doi.org/10.16798/j.issn.1003-0530.2022.10.013
[17] Gan J, Xie X, Zhai Y, He G, Mai C, Luo H (2023) Facial beauty prediction fusing transfer learning and broad learning system. Soft Comput 27(18):13391–13404. https://doi.org/10.1007/s00500-022-07563-1
[18] Gan J, Xie X, He G, Luo H (2023) Transbls: transformer combined with broad learning system for facial beauty prediction. Appl Intell 53(21):26110–26125. https://doi.org/10.1007/s10489-023-04931-8
[19] Gan J, Luo H, Xiong J, Xie X, Li H, Liu J (2023) Facial beauty prediction combined with multi-task learning of adaptive sharing policy and attentional feature fusion. Electronics 13(1):179. https://doi.org/10.3390/electronics13010179
[20] Gan J, Wu B, Zhai Y, He G, Mai C, Bai Z (2022) Self-correcting noise labels for facial beauty prediction. Chin J J Image Graph 27(08):2487–2495
[21] Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inf Process Syst 33:6840–6851
[22] Han X, Zheng H, Zhou M (2022) Card: classification and regression diffusion models. Adv Neural Inf Process Syst 35:18100–18115
[23] Yang Y, Fu H, Aviles-Rivero AI, Schönlieb C-B, Zhu L (2023) Diffmic: dual-guidance Diffusion Network for Medical Image Classification. In: International conference on medical image computing and computer assisted intervention (MICCAI), pp 95–105. https://doi.org/10.1007/978-3-031-43987-2_10
[24] Chen J, Zhang R, Yu T, Sharma R, Xu Z, Sun T, Chen C (2023) Label-retrieval-augmented diffusion models for learning from noisy labels. In: Thirty-seventh conference on neural information processing systems (NeurIPS), pp 66499–66517. https://proceedings.neurips.cc/paper_files/paper/2023/file/d191ba4c8923ed8fd8935b7c98658b5f-Paper-Conference.pdf
[25] Shen J, Wang Y, Luo J (2024) Cd-loop: a chromatin loop detection method based on the diffusion model. Front Genet 15:1393406. https://doi.org/10.3389/fgene.2024.1393406
[26] Li Y, Chen W, Hu X, Chen B, Zhou M (2024) Transformer-modulated diffusion models for probabilistic multivariate time series forecasting. In: The twelfth international conference on learning representations (ICLR). https://openreview.net/forum?id=qae04YACHs
[27] Song Z, Meng Z, King I (2024) A diffusion-based pre-training framework for crystal property prediction. In: Proceedings of the AAAI conference on artificial intelligence, vol 38, pp 8993–9001. https://doi.org/10.1609/aaai.v38i8.28748
[28] Hassan S, Ullah M, Imran AS, Mujtaba G, Yamin MM, Hashmi E, Cheikh FA, Beghdadi A (2024) A self-supervised diffusion framework for facial emotion recognition. In: 2024 IEEE international conference on image processing (ICIP), pp 465–471. https://doi.org/10.1109/ICIP51287.2024.10648251
[29] Peng H, Sun H, Luo S, Zuo Z, Zhang S, Wang Z, Wang Y (2024) Diffusion-based conditional wind power forecasting via channel attention. IET Renew Power Gener 18(3):306–320. https://doi.org/10.1049/rpg2.12825
[30] Mehta S, Rastegari M (2022) Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. In: International conference on learning representations (ICLR). https://openreview.net/forum?id=vh-0sUt8HlG
[31] Song J, Meng C, Ermon S (2021) Denoising diffusion implicit models. In: International conference on learning representations (ICLR). https://openreview.net/forum?id=St1giarCHLP
[32] Gan J, Xiang L, Zhai Y, Mai C, He G, Zeng J, Bai Z, Labati RD, Piuri V, Scotti F (2020) 2m beautynet: facial beauty prediction based on multi-task transfer learning. IEEE Access 8:20245–20256. https://doi.org/10.1109/ACCESS.2020.2968837
[33] Vahdati E, Suen CY (2021) Facial beauty prediction from facial parts using multi-task and multi-stream convolutional neural networks. Int J Pattern Recognit Artif Intell 35(12):2160002. https://doi.org/10.1142/S0218001421600028
[34] Zhang P, Liu Y (2022) Nas4fbp: facial beauty prediction based on neural architecture search. In: International conference on artificial neural networks, pp 225–236. https://doi.org/10.1007/978-3-031-15934-3_19
[35] Sun Z, Xiao Z, Yu Y, Lin L (2024) Dynamic attentive convolution for facial beauty prediction. IEICE Trans Inf Syst 107(2):239–243. https://doi.org/10.1587/transinf.2023EDL8058
[36] Zhai Y, Yu C, Qin C, Zhou W, Ke Q, Gan J, Labati RD, Piuri V, Scotti F (2020) Facial beauty prediction via local feature fusion and broad learning system. IEEE Access 8:218444–218457. https://doi.org/10.1109/ACCESS.2020.3032515
[37] Peng T, Li M, Chen F, Xu Y, Zhang D (2024) Geometric prior guided hybrid deep neural network for facial beauty analysis. CAAI Trans Intell Technol 9(2):467–480. https://doi.org/10.1049/cit2.12197
[38] Liu Q, Lin L, Shen Z, Yu Y (2023) Fbpformer: dynamic convolutional transformer for global-local-contexual facial beauty prediction. In: International conference on artificial neural networks, pp 223–235. https://doi.org/10.1007/978-3-031-44204-9_19
[39] Croitoru F-A, Hondru V, Ionescu RT, Shah M (2023) Diffusion models in vision: a survey. IEEE Trans Pattern Anal Mach Intell 45(9):10850–10869. https://doi.org/10.1109/TPAMI.2023.3261988
[40] Yang L, Zhang Z, Song Y, Hong S, Xu R, Zhao Y, Zhang W, Cui B, Yang M-H (2023) Diffusion models: a comprehensive survey of methods and applications. ACM Comput Surv 56(4):1–39. https://doi.org/10.1145/3626235
[41] Luo C (2022) Understanding diffusion models: a unified perspective
[42] Fu B, Peng Y, Lan X, Qin X (2023) Survey of label noise learning algorithms based on deep learning. Chin J J Comput Appl 43(03):674–684
[43] Li X, Liu T, Han B, Niu G, Sugiyama M (2021) Provably end-to-end label-noise learning without anchor points. In: International conference on machine learning, pp 6403–6413. https://proceedings.mlr.press/v139/li21l.html
[44] He Q, Yan Z, Diao W, Sun X (2024) Dlc: dynamic loss correction for cross-domain remotely sensed segmentation. IEEE Trans Geosci Remote Sens 62:1–14. https://doi.org/10.1109/TGRS.2024.3402127
[45] Patrini G, Rozza A, Krishna Menon A, Nock R, Qu L (2017) Making deep neural networks robust to label noise: A loss correction approach. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1944–1952
[46] Yao Y, Liu T, Han B, Gong M, Deng J, Niu G, Sugiyama M (2020) Dual t: reducing estimation error for transition matrix in label-noise learning. Adv Neural Inf Process Syst 33:7260–7271
[47] Zhang Y, Niu G, Sugiyama M (2021) Learning noise transition matrix from only noisy labels via total variation regularization. In: International conference on machine learning, pp 12501–12512. https://proceedings.mlr.press/v139/zhang21n.html
[48] Cheng D, Liu T, Ning Y, Wang N, Han B, Niu G, Gao X, Sugiyama M (2022) Instance-dependent label-noise learning with manifold-regularized transition matrix estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16630–16639
[49] Wang L, Bian J, Xu J (2024) Federated learning with instance-dependent noisy label. In: ICASSP 2024-2024 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 8916–8920. https://doi.org/10.1109/ICASSP48485.2024.10447823
[50] Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4510–4520
[51] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: International conference on learning representations. https://openreview.net/forum?id=YicbFdNTTy
[52] Zhai Y, Cao H, Deng W, Gan J, Piuri V, Zeng J (2019) Beautynet: joint multiscale cnn and transfer learning method for unconstrained facial beauty prediction. Comput Intell Neurosci 2019(1):1910624. https://doi.org/10.1155/2019/1910624
[53] Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV)
[54] Tolstikhin IO, Houlsby N, Kolesnikov A, Beyer L, Zhai X, Unterthiner T, Yung J, Steiner A, Keysers D, Uszkoreit J, Lucic M, Dosovitskiy A (2021) Mlp-mixer: an all-mlp architecture for vision. In: Advances in neural information processing systems, vol 34, pp 24261–24272. https://proceedings.neurips.cc/paper_files/paper/2021/file/cba0a4ee5ccd02fda0fe3f9a3e7b89fe-Paper.pdf

Acknowledgements

This work is supported in part by funds from the National Natural Science Foundation of China under Grant 61771347.

Author information

Authors and Affiliations

School of Electronic Information and Control Engineering, Software Engineering Institute of Guangzhou, Guangdong, 510990, Guangzhou, China
Junying Gan & Huicong Li
School of Electronic and Information Engineering, South China University of Technology, Guangdong, 510641, Guangzhou, China
Xiaoshan Xie
School of Electronics and Information Engineering, Wuyi University, Yingbin Avenue Middle, Guangdong, 529020, Jiangmen, China
Hantian Chen & Zhenxin Zhuang

Contributions

Junying Gan: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing. Huicong Li: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing. Xiaoshan Xie: Methodology, Validation, Writing – review & editing. Hantian Chen: Conceptualization, Methodology. Zhenxin Zhuang: Investigation.

Corresponding author

Correspondence to Junying Gan.

Ethics declarations

Competing interest

All the authors declare that they have no competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent for data used

This paper does not involve matters requiring ethical approval or informed consent.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Gan, J., Li, H., Xie, X. et al. FBP-diffusion: diffusion model combining MobileViT and dynamic loss correction for facial beauty prediction. Appl Intell 55, 842 (2025). https://doi.org/10.1007/s10489-025-06723-8

Accepted: 13 June 2025
Published: 05 July 2025
Version of record: 05 July 2025
DOI: https://doi.org/10.1007/s10489-025-06723-8
