1 Introduction

Facial beauty prediction (FBP) is a cutting-edge topic in artificial intelligence that converts the perception of facial beauty into a machine learning recognition problem by extracting aesthetic features, thereby enhancing computers’ ability to judge or predict facial beauty. Research in FBP has accelerated the development of practical applications, including natural and realistic facial image beautification technologies [1,2,3], reliable medical beauty solutions [4], and automated assessments of comic and film characters [5]. FBP provides a theoretical foundation for various fields and industries, with significant scientific value and broad application prospects.

In recent years, many data-driven deep learning methods have sought results that better reflect human aesthetics by minimizing interference and capturing deeper features, continuously advancing research in this field [7,8,9,10,11]. There are four representative facial beauty databases for FBP classification tasks: SCUT-FBP [12], SCUT-FBP5500 [13], CelebA [14], and the Large Scale Asian Female Beauty Database (LSAFBD) [15]. Differences in dataset quality can affect prediction outcomes to some extent. Although both SCUT-FBP5500 and LSAFBD have five category labels, the LSAFBD dataset is less refined and contains more noisy labels, which is reflected in the same model typically achieving higher accuracy on the former than on the latter. For instance, DIDTAN [16], ER-BLS [17], TransBLS-B [18], and the method in [19] achieved 76.40%, 74.69%, 78.46%, and 75.41% accuracy on SCUT-FBP5500, respectively, compared with 65.40%, 62.13%, 66.49%, and 61.37% on LSAFBD, a gap of more than 10 percentage points in every case. These methods utilize different model architectures, including convolutional neural networks (CNNs), vision transformers (ViTs), and broad learning systems (BLS), yet they still suffer from performance degradation caused by noisy labels in datasets. Given that noisy labels are unavoidable in real-world scenarios, there is an urgent need for effective solutions to address this challenge.

Although this issue is increasingly recognized, only a few existing FBP approaches address noisy labels, and these mainly design the classifier to double as a label corrector that purifies potentially noisy labels before a second stage of training [16, 20]. However, the databases are imbalanced in terms of sample distribution, and the models frequently struggle to capture discriminative features. As a result, they may mistakenly treat the labels of classes with fewer samples as noise and modify them extensively, leading to highly unstable training. Such denoising methods, which focus on correcting labels without improving model architectures, have inherent limitations that hinder the effective resolution of noisy label issues.

Diffusion models (DMs) [21] are explicit generative models based on likelihood estimations. Figure 1 depicts a schematic diagram of the diffusion process of DMs. During the forward diffusion process, Gaussian noise is progressively added to the input data, while in the reverse process, the model learns to denoise and gradually recover the original data from the noisy data. Unlike image generation tasks, which require the denoising of image data \(\varvec{x}_t\), DM-based classifiers can denoise and reconstruct label codings \(\varvec{y}_t\) [22], viewing the modeling process of DMs as a label denoising process. This approach has been successfully applied in fields such as medical image classification, crystal property prediction, and time series forecasting [23,24,25,26,27,28,29], yielding good results. However, to our knowledge, current research has not yet applied DMs to the FBP field.

Fig. 1 Process diagram of diffusion models [21, 22]

Therefore, to reduce the influence of noisy labels and improve prediction performance, we constructed an improved diffusion model for FBP, called FBP-Diffusion. Specifically, MobileViT [30], a lightweight general-purpose ViT, is employed as a conditional information encoder, generating an initial prediction that replaces the endpoint noise and guides label generation by being input into all timesteps during the reverse process. Subsequently, MobileViT models the noise transition matrix, which is multiplied by the softmax probability obtained by the DM during forward propagation. The cross-entropy loss between this product and the true label is then introduced into the training process and dynamically added to the noise estimation loss of the DM. In addition, we use the sampling method from the Denoising Diffusion Implicit Model (DDIM) [31] to enhance the inference efficiency of FBP-Diffusion.

Extensive comparison and ablation experiments conducted on four facial beauty datasets demonstrate that FBP-Diffusion outperforms both DM-based baseline classifiers and existing FBP methods. It achieves an accuracy of 73.40% on the relatively noisy LSAFBD dataset, marking a 5.17% improvement over the previous best FBP methods. It also attains the highest accuracy on the more refined SCUT-FBP, SCUT-FBP5500 and CelebA datasets, with rates of 67.71%, 78.60% and 82.56%, respectively.

The main contributions of this study are as follows:

  • We propose FBP-Diffusion, which utilizes a DM-based classifier for FBP and integrates pretrained MobileViT to obtain conditional information for the DM, effectively addressing the issue of noisy labels in FBP.

  • We use the noise transition matrix to perform probability transfer and enhance the DM’s denoising capability through dynamic loss correction (DLC), further improving model performance.

  • Comparing FBP-Diffusion with benchmark DMs and existing FBP methods reveals its state-of-the-art performance, particularly on relatively noisy datasets, highlighting its superiority, effectiveness, and generalization.

The remainder of this paper is organized as follows: Related works are reviewed in Section 2; the overall model scheme and implementation details are presented in Section 3; experimental analysis is discussed in Section 4; and the study is summarized with future research directions in Section 5.

2 Related works

2.1 Facial beauty prediction

With the development of deep learning technology, CNNs have become a mainstream tool for FBP due to their ability to effectively extract deep features. In [32,33,34], multitask CNNs are proposed by incorporating additional information from specific tasks to enhance FBP performance. Bougourzi et al. [10] introduced a framework integrating regression CNNs, with ResNeXt-50 and Inception-v3 as backbones, and employed multiple dynamic loss functions during training. Sun et al. [11, 35] leveraged attribute information such as gender and ethnicity to guide CNN training, achieving implicit feature alignment across attribute domains, and further proposed dynamic attention convolution for CNNs. Besides CNNs, other network architectures have also been explored in FBP models. Gan et al. [17, 18] and Zhai et al. [36] utilized local feature fusion, transfer learning-based CNNs, and adaptive transformers, respectively, as feature extractors for the broad learning system. Peng et al. [37] and Liu et al. [38] combined ViTs with CNNs to develop a two-branch network and a dynamic convolutional transformer for FBP, respectively. Although these methods produced favorable results, only a few algorithms have been specifically designed for learning with noisy labels. Gan et al. [16, 20] designed models that functioned as both classifiers and label correctors, with CNNs as the backbone and a two-stage training process. However, this approach exhibited limited effectiveness on databases with imbalanced sample sizes, revealing issues such as weak model generalization and low prediction accuracy.

2.2 Diffusion model

DMs enable accurate likelihood calculation and representation learning, with a primary use in image generation tasks. In 2022, Han et al. [22] proposed the Classification and Regression Diffusion (CARD) model, which regards deterministic classification as a denoising process for labels, allowing for more flexible uncertainty modeling during label generation. Similar to other DMs, CARD includes both forward and reverse processes, and is learned by optimizing the evidence lower bound via stochastic gradient descent [39, 40].

During the forward process, CARD, according to a variance schedule \(\{{\beta _t}\}_{t=1:T} \in (0,1)^T\), adds Gaussian noise \(\varvec{\epsilon } \sim \mathcal {N}(0,\varvec{I})\) to the n-dimensional one-hot label encoding \(\varvec{y}_0\), gradually generating a series of intermediate variables \(\varvec{y}_{1:T}\). Suppose that after T timesteps in the forward process, \(\varvec{y}_0\) eventually converges to

$$\begin{aligned} p(\varvec{y}_T|\varvec{x}) = \mathcal {N}(f_\phi (\varvec{x}), \varvec{I}) \end{aligned}$$
(1)

where \(\varvec{x}\) represents the input image data, \(f_\phi (\varvec{x})\) is the prior knowledge of the relation between \(\varvec{x}\) and \(\varvec{y}_0\), which is set to 0 or \(\mathbb {E}(\varvec{y}|\varvec{x})\), approximated by the pretrained conditional information encoder \(f_\phi\), and \(\varvec{I}\) denotes the identity matrix. Then, the conversion of adjacent intermediate variables is modeled following a Gaussian distribution:

$$\begin{aligned} q \left( \varvec{y}_t|\varvec{y}_{t-1}, f_\phi (\varvec{x}) \right) = \mathcal {N} \left( \varvec{y}_t; \sqrt{1-{\beta _t}} \varvec{y}_{t-1}+(1-\sqrt{1-{\beta _t}})f_\phi (\varvec{x}), {\beta _t}\varvec{I} \right) \end{aligned}$$
(2)

The closed-form sampling distribution at an arbitrary timestep t is

$$\begin{aligned} q \left( \varvec{y}_t|\varvec{y}_0, f_\phi (\varvec{x}) \right) = \mathcal {N} \left( \varvec{y}_t; \sqrt{\bar{\alpha }_t} \varvec{y}_0+(1-\sqrt{\bar{\alpha }_t})f_\phi (\varvec{x}), (1-{\bar{\alpha }_t})\varvec{I} \right) \end{aligned}$$
(3)

where \(\alpha _t:=1-\beta _t\) and \(\bar{\alpha }_t:=\prod _{s=1}^{t}\alpha _s\).
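As a quick numerical check of how (3) follows from (2), the sketch below iterates the one-step transition on the mean and variance and compares the result with the closed form. It is only an illustration: the function names, the five-class example vectors, and the linear schedule endpoints are assumptions chosen for the example, not part of CARD.

```python
import math

def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule beta_1..beta_T (endpoints are illustrative)."""
    return [beta_1 + (beta_T - beta_1) * t / (T - 1) for t in range(T)]

def iterate_forward_moments(y0, f_x, betas):
    """Apply the one-step transition (2) repeatedly to the mean and variance."""
    mean, var = list(y0), 0.0
    for b in betas:
        a = math.sqrt(1.0 - b)
        mean = [a * m + (1.0 - a) * f for m, f in zip(mean, f_x)]
        var = (1.0 - b) * var + b
    return mean, var

def closed_form_moments(y0, f_x, betas):
    """Mean and variance given directly by the closed form (3)."""
    alpha_bar = math.prod(1.0 - b for b in betas)
    s = math.sqrt(alpha_bar)
    mean = [s * y + (1.0 - s) * f for y, f in zip(y0, f_x)]
    return mean, 1.0 - alpha_bar

y0 = [0.0, 0.0, 1.0, 0.0, 0.0]        # one-hot label (hypothetical sample)
f_x = [0.1, 0.2, 0.4, 0.2, 0.1]       # stand-in for the encoder output
betas = linear_betas(1000)
m_iter, v_iter = iterate_forward_moments(y0, f_x, betas[:300])
m_closed, v_closed = closed_form_moments(y0, f_x, betas[:300])
```

Both routes agree up to floating-point error, which also illustrates why the product in \(\bar{\alpha }_t\) runs over all steps up to t.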

In the reverse process, CARD gradually reconstructs the label encoding \(\varvec{y}_0\) from the Gaussian noise \(p(\varvec{y}_T|\varvec{x}) = \mathcal {N}(f_\phi (\varvec{x}), \varvec{I})\), guided by \(\varvec{x}\) and \(f_\phi\). The forward process posterior is as follows:

$$\begin{aligned}&q(\varvec{y}_{t-1}|\varvec{y}_t, \varvec{y}_0, \varvec{x}) = q \left( \varvec{y}_{t-1}|\varvec{y}_t, \varvec{y}_0, f_\phi (\varvec{x}) \right) \\& = \mathcal {N} \left( \varvec{y}_{t-1}; \widetilde{\varvec{\mu }} \left( \varvec{y}_t, \varvec{y}_0, f_\phi (\varvec{x}) \right), {\widetilde{\beta }_t}\varvec{I} \right) \end{aligned}$$
(4)

The detailed derivation of the sampling distribution parameters from (2) to (3), as well as the mean \(\widetilde{\varvec{\mu }}\) and variance \(\widetilde{\beta }_t\) of \(q(\varvec{y}_{t-1}|\varvec{y}_t, \varvec{y}_0, \varvec{x})\) in (4), can be found in Appendix A.1 of [22]. A function approximator \(\varvec{\epsilon }_{\theta } (\varvec{x}, \varvec{y}_t, f_\phi (\varvec{x}), t)\) is constructed to predict the diffusion noise \(\varvec{\epsilon }\), and the denoised label encoding \(\varvec{\hat{y}}_0\) obtained after the reverse process can be re-parameterized by (3), as follows:

$$\begin{aligned} \varvec{\hat{y}}_0 = \frac{1}{\sqrt{\bar{\alpha }_t}} \left( \varvec{y}_t - (1-{\sqrt{\bar{\alpha }_t}})f_\phi (\varvec{x}) - \sqrt{1-\bar{\alpha }_t}\varvec{\epsilon }_{\theta } (\varvec{x}, \varvec{y}_t, f_\phi (\varvec{x}), t) \right) \end{aligned}$$
(5)
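The relation between (3) and (5) can be illustrated with a small round trip: draw \(\varvec{y}_t\) from (3) with a known noise vector, then recover \(\varvec{y}_0\) through (5). The sketch below assumes a perfect noise estimate for the first recovery and a deliberately poor one (all zeros) for the second; the helper names and numeric values are hypothetical.

```python
import math

def q_sample(y0, f_x, alpha_bar_t, eps):
    """Draw y_t from the closed form (3) for a given noise vector eps."""
    s = math.sqrt(alpha_bar_t)
    r = math.sqrt(1.0 - alpha_bar_t)
    return [s * y + (1.0 - s) * f + r * e for y, f, e in zip(y0, f_x, eps)]

def reparam_y0(y_t, f_x, alpha_bar_t, eps_hat):
    """Solve (3) for y_0 given a noise estimate eps_hat, as in (5)."""
    s = math.sqrt(alpha_bar_t)
    r = math.sqrt(1.0 - alpha_bar_t)
    return [(yt - (1.0 - s) * f - r * e) / s
            for yt, f, e in zip(y_t, f_x, eps_hat)]

y0 = [0.0, 0.0, 0.0, 1.0, 0.0]            # one-hot label (hypothetical)
f_x = [0.05, 0.10, 0.30, 0.45, 0.10]      # stand-in for the encoder output
eps = [0.3, -1.2, 0.7, 0.1, -0.4]         # fixed stand-in for a draw from N(0, I)
alpha_bar_t = 0.4                          # illustrative value

y_t = q_sample(y0, f_x, alpha_bar_t, eps)
y0_exact = reparam_y0(y_t, f_x, alpha_bar_t, eps)           # perfect estimate
y0_biased = reparam_y0(y_t, f_x, alpha_bar_t, [0.0] * 5)    # poor estimate
```

With the true noise the label is recovered exactly; with a wrong estimate the error in each coordinate is the noise error scaled by \(\sqrt{1-\bar{\alpha }_t}/\sqrt{\bar{\alpha }_t}\), so the quality of \(\varvec{\hat{y}}_0\) depends directly on the noise predictor.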

Chen et al. [24] applied the DDIM sampling method to the non-Markovian forward process of CARD, generating label encoding in fewer steps on the predefined sampling trajectory of \(\left\{ T \right. =\tau _S>\cdots>\tau _s>\cdots> \tau _1= \left. 1 \right\}\), where \(S<T\), thereby reducing the inference time significantly. After replacing t with \(\tau _s\), (3) can be rewritten, and \(\varvec{y}_{\tau _s}\) is calculated as follows:

$$\begin{aligned} \varvec{y}_{\tau _s} = \sqrt{\bar{\alpha }_{\tau _s}}\varvec{y}_0+\left( 1-\sqrt{\bar{\alpha }_{\tau _s}} \right) f_\phi (\varvec{x}) + \sqrt{1-\bar{\alpha }_{\tau _s}}\varvec{\epsilon } \end{aligned}$$
(6)

Likewise, the denoised label \(\widetilde{\varvec{y}}_0\), i.e., the predicted value of \(\varvec{y}_0\), can be calculated as

$$\begin{aligned}& \widetilde{\varvec{y}}_0 = \frac{1}{\sqrt{\bar{\alpha }_{\tau _s}}}\\& \left( \varvec{y}_{\tau _s} - \left( 1-\sqrt{\bar{\alpha }_{\tau _s}} \right) f_\phi (\varvec{x}) -\sqrt{1-\bar{\alpha }_{\tau _s}} \varvec{\epsilon }_{\theta } \left( \varvec{x}, \varvec{y}_{\tau _s}, f_\phi (\varvec{x}), \tau _s \right) \right) \end{aligned}$$
(7)

Since the function approximator \(\varvec{\epsilon }_{\theta } \left( \varvec{x}, \varvec{y}_{\tau _s}, f_\phi (\varvec{x}), \tau _s \right)\) predicts \(\varvec{\epsilon }\), it follows that when \({\tau _{s-1}}>0\) and \(\varvec{y}_{\tau _s}\) is given, \(\varvec{y}_{\tau _{s-1}}\) can be calculated from (6) as follows:

$$\begin{aligned}\varvec{y}_{\tau _{s-1}} & = \sqrt{\bar{\alpha }_{\tau _{s-1}}} \widetilde{\varvec{y}}_0+ \left( 1-\sqrt{\bar{\alpha }_{\tau _{s-1}}} \right) f_\phi (\varvec{x}) \\& + \sqrt{1-\bar{\alpha }_{\tau _{s-1}}} \varvec{\epsilon }_{\theta } \left( \varvec{x}, \varvec{y}_{\tau _s}, f_\phi (\varvec{x}), \tau _s \right) \end{aligned}$$
(8)
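A minimal sketch of the sampling loop defined by (7) and (8) is given below. Several choices are assumptions made for illustration: the evenly spaced trajectory, the linear schedule, returning \(\widetilde{\varvec{y}}_0\) at the last step, and the "oracle" noise predictor, which stands in for a trained \(\varvec{\epsilon }_{\theta }\) and is only there to make the example verifiable.

```python
import math

def make_alpha_bars(T, beta_1=1e-4, beta_T=0.02):
    """alpha_bars[t - 1] holds alpha_bar_t for a linear schedule (assumed)."""
    bars, prod = [], 1.0
    for t in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * t / (T - 1))
        bars.append(prod)
    return bars

def ddim_trajectory(T, S):
    """Evenly spaced tau_S = T > ... > tau_1 = 1 (the spacing is an assumption)."""
    return [round(1 + (T - 1) * s / (S - 1)) for s in range(S)][::-1]

def ddim_sample(y_T, f_x, eps_model, alpha_bars, taus):
    y = list(y_T)
    for idx, tau in enumerate(taus):
        s = math.sqrt(alpha_bars[tau - 1])
        r = math.sqrt(1.0 - alpha_bars[tau - 1])
        eps = eps_model(y, f_x, tau)
        # Eq. (7): predicted clean label at the current step.
        y0_tilde = [(v - (1.0 - s) * f - r * e) / s
                    for v, f, e in zip(y, f_x, eps)]
        if idx + 1 == len(taus):
            return y0_tilde        # assumed handling of the final step
        s_p = math.sqrt(alpha_bars[taus[idx + 1] - 1])
        r_p = math.sqrt(1.0 - alpha_bars[taus[idx + 1] - 1])
        # Eq. (8): move to the next (smaller) timestep on the trajectory.
        y = [s_p * y0 + (1.0 - s_p) * f + r_p * e
             for y0, f, e in zip(y0_tilde, f_x, eps)]
    return y

def make_oracle(y0_true, alpha_bars):
    """Hypothetical noise predictor that knows the true label (testing only)."""
    def eps_model(y, f_x, tau):
        s = math.sqrt(alpha_bars[tau - 1])
        r = math.sqrt(1.0 - alpha_bars[tau - 1])
        return [(v - s * y0 - (1.0 - s) * f) / r
                for v, y0, f in zip(y, y0_true, f_x)]
    return eps_model

alpha_bars = make_alpha_bars(1000)
taus = ddim_trajectory(1000, 10)
y0_true = [0.0, 1.0, 0.0]
f_x = [0.2, 0.5, 0.3]
y_T = [f + n for f, n in zip(f_x, [0.4, -0.9, 1.1])]   # f(x) plus fixed noise
y0_pred = ddim_sample(y_T, f_x, make_oracle(y0_true, alpha_bars), alpha_bars, taus)
```

With a perfect noise predictor the loop returns the true label after only S = 10 steps, which is the behavior that makes a short trajectory attractive for low-dimensional label vectors.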

2.3 Loss correction

Loss Correction (LC) assumes that the noisy labels are transferred and corrupted from the true labels according to an unknown noise transition matrix M. It aims to learn this matrix to adjust the loss function during training, thereby reducing the model’s reliance on noisy labels and enhancing its robustness [42,43,44]. The primary approach involves first estimating M with available prior knowledge, and then correcting the loss by either multiplying M with the model prediction or by multiplying M with the loss value. The effectiveness of LC largely depends on the estimation of matrix M, and several studies have been conducted to optimize this estimation. Yao et al. [46] introduced an intermediate class to decompose M into the product of two easily estimated transition matrices. Zhang et al. [47] estimated M and learned a classifier simultaneously, proposing total variation regularization to make the predicted probabilities more distinguishable. Cheng et al. [48] observed that instances with similar appearance and poor quality are more likely to be mislabeled, and therefore formulated this hypothesis as a manifold embedding to reduce the degree of freedom of M and stabilize its estimation. Wang et al. [49] extracted data with high-confidence pseudo-labels and noisy labels to train a transition-matrix-estimation network in a federated manner. However, to the best of our knowledge, LC or noise transition matrices have not yet been used for FBP or in CARD-based classifiers.
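To make the "multiply M with the model prediction" variant concrete, the following sketch computes a cross-entropy loss on the prediction after it has been passed through a transition matrix. The convention used here, `M[i][j]` as the probability that true class i is observed as noisy class j, and the three-class numbers are assumptions for illustration; the cited works differ in how M is indexed and estimated.

```python
import math

def forward_corrected_ce(probs, noisy_label, M):
    """Cross-entropy on the noisy-label distribution implied by M.

    probs[i]  : model probability of true class i
    M[i][j]   : assumed P(noisy label = j | true label = i), rows sum to 1
    """
    n = len(probs)
    noisy_probs = [sum(M[i][j] * probs[i] for i in range(n)) for j in range(n)]
    return -math.log(noisy_probs[noisy_label]), noisy_probs

probs = [0.7, 0.2, 0.1]              # hypothetical clean-class prediction
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]                # hypothetical symmetric label noise
observed = 1                         # observed (possibly noisy) label

loss_plain = -math.log(probs[observed])
loss_corrected, noisy_probs = forward_corrected_ce(probs, observed, M)
```

In this example the corrected loss is smaller than the plain loss, since part of the disagreement between the prediction and the observed label is attributed to label noise rather than to the model.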

3 Method

Similar to CARD and its variants, the proposed FBP-Diffusion model denoises and reconstructs label encodings, treating deterministic classification and the learning of noisy labels as a progressive denoising process. In this section, we present the model, along with its conditional information encoder and DLC.

3.1 FBP-Diffusion

The structure of FBP-Diffusion is illustrated in Fig. 2, comprising three parts: a denoising network, a conditional information encoder \(f_\phi\), and a DLC module. In the reverse process, the Gaussian noise label encoding \(\varvec{y}_T \sim \mathcal {N}(f_\phi (\varvec{x}), \varvec{I})\) is progressively reconstructed into the predicted label encoding \(pre.\varvec{y}_0\), represented as \(\varvec{y}_T \rightarrow \cdots \rightarrow \varvec{y}_t \rightarrow \varvec{y}_{t-1} \rightarrow \cdots \rightarrow pre.\varvec{y}_0\), with timesteps denoted by \(t\sim (T \rightarrow 1)\). First, given an input facial beauty image \(\varvec{x}\), we use MobileViT as the encoder \(f_\phi\) to obtain the initial probability prediction \(f_\phi (\varvec{x})\), which serves both as the mean of the Gaussian distribution of \(\varvec{y}_T\) and as the response embedding of the denoising network. “Add noise” in Fig. 2 represents obtaining \(\varvec{y}_T\) from \(f_\phi (\varvec{x})\). Next, utilizing ResNet18 as the encoder \(f_\textrm{R}\), we obtain high-dimensional features \(f_\textrm{R}(\varvec{x})\) after the Batch Normalization (BN) layers, which serve as the image embeddings of the denoising network. The denoising network is used for each conversion step in the reverse process. Finally, in the DLC module, the estimated noise transition matrix M obtained from \(f_\phi\) is multiplied by the softmax probability \(pre.\varvec{y}_0\) of the label encoding in the last step of the reverse process, yielding the final output \(\varvec{y}_\textrm{F}\) of FBP-Diffusion. In the training stage, the difference between the predictions \(\varvec{y}_{1:T}\) before and after passing through the denoising network should be close to Gaussian noise, leading to the noise estimation loss \(\mathcal {L}_{\varvec{\epsilon }}\). The final output \(\varvec{y}_\textrm{F}\) should be close to the true label \(\varvec{y}\), resulting in the cross-entropy loss \(\mathcal {L}_c\).

Fig. 2 The overall architecture of FBP-Diffusion

The denoising network learns the noise distribution added during the forward diffusion process and parameterizes the reverse process. \(\varvec{y}_t\) and \(f_\phi (\varvec{x})\) are concatenated and input into a fully connected layer, and the result is multiplied by the embedding of the timestep t produced by the embedding layer. After the BN layer and the Softplus activation function, the Hadamard product with the high-dimensional feature \(f_\textrm{R}(\varvec{x})\) computed by the encoder \(f_\textrm{R}\) is taken, and the resulting vector is processed layer by layer. The output of the last fully connected layer of the denoising network is the next predicted value in \(\varvec{y}_{1:T}\).

3.2 Conditional information encoder

The flexible pretrained encoder, MobileViT, is used as an auxiliary classifier, i.e., \(f_\phi\), to gather conditional information to guide the denoising process while learning the noise transition matrix M, as depicted in Fig. 3. \(f_\phi\) serves not only as a feature extractor but also directly outputs the probabilities \(f_\phi (\varvec{x})\), with a dimension equal to the number of classes, denoted by n. \(f_\phi\) was pretrained on the ImageNet-1k database, and its parameters are transferred to the FBP task and fine-tuned on the input facial beauty images to accelerate the FBP-Diffusion learning process. The structure starts with a standard \(3 \times 3\) convolutional layer, followed by the Inverted Residual Block (MV2) proposed in MobileNetv2, the core MobileViT block, and the global pooling and fully connected layers. \(\downarrow \!2\) indicates down-sampling in the module, and L denotes the number of Transformer encoders [51] in the MobileViT block.

Fig. 3 Structure of conditional information encoder [30]

In the FBP-Diffusion reverse process, the image input is \(\varvec{x} \in \mathbb {R} ^{H \times W \times C}\), where H, W, and C represent the height, width, and number of image channels, respectively. An n-dimensional vector is obtained by the encoder \(f_\phi\), and the dimension becomes 2n after concatenation with \(\varvec{y}_t\). A high-dimensional vector of dimension \(fea_{dim}\) is then obtained by mapping through the fully connected layer. We set \(fea_{dim}\), the dimension of the timestep t after the embedding layer, and the dimension of \(f_\textrm{R}(\varvec{x})\) to 2048, and keep the dimension unchanged until the last fully connected layer of the denoising network remaps it from 2048 to n. Because the denoising network contains several fully connected layers, the memory consumption of the model increases with \(fea_{dim}\). In the DLC module, the fine-tuned \(f_\phi\) outputs the initial probability prediction for the whole training set, and the noise transition matrix M is obtained by normalizing the predicted confusion matrix.
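One plausible reading of "normalizing the predicted confusion matrix" is sketched below. The text does not specify the details, so the row-wise normalization, the choice of rows as given labels and columns as encoder predictions, and the identity fallback for a class without samples are all assumptions; the function names are hypothetical.

```python
def confusion_matrix(given_labels, predicted_labels, n):
    """C[i][j] counts samples with given label i that the encoder predicts as j."""
    C = [[0] * n for _ in range(n)]
    for g, p in zip(given_labels, predicted_labels):
        C[g][p] += 1
    return C

def row_normalize(C):
    """Turn counts into a row-stochastic matrix (each row sums to 1)."""
    M = []
    for i, row in enumerate(C):
        total = sum(row)
        if total == 0:
            # Assumed fallback: a class with no samples maps to itself.
            M.append([1.0 if j == i else 0.0 for j in range(len(row))])
        else:
            M.append([c / total for c in row])
    return M

# Hypothetical training-set labels and argmax predictions of the encoder.
given = [0, 0, 0, 1, 1, 2]
predicted = [0, 0, 1, 1, 1, 0]
C = confusion_matrix(given, predicted, n=4)
M = row_normalize(C)
```

Here `M[0]` is `[2/3, 1/3, 0, 0]`, meaning that two thirds of the samples labeled as class 0 are also predicted as class 0 by the encoder.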

3.3 Dynamic loss correction

The powerful conditional information encoder effectively guides the reverse process, ensuring that the prediction \(pre.\varvec{y}_0\) of the DM is closer to the true label. However, when the model excessively focuses on the conditional information, the generated labels strictly adhere to this information, which reduces generation diversity and weakens the denoising capability. To address this issue, alternative noisy label learning methods may be considered. However, a two-stage training approach solely for purifying noisy labels is not suitable for FBP. In this context, we introduce a DLC module to optimize the training process of the DM by dynamically adjusting the total loss.

During the training stage, CARD and its variants calculate the loss based on the predicted \(\varvec{y}_{1:T}\) values before and after the denoising network, reflecting a single conversion of the intermediate variable, and optimize the network parameters with the noise estimation loss, mainly Mean Squared Error (MSE), and the Maximum-Mean Discrepancy (MMD) regularization loss [23]. During the inference stage, complex calculations must be performed at each timestep to obtain the prediction \(pre.\varvec{y}_0\), and the overall performance of the DM is then verified, which requires significant computing resources. In other words, obtaining \(pre.\varvec{y}_0\) during the training phase and calculating the loss between this prediction and the true label are difficult. Therefore, we calculate the re-parameterized denoised label \(\varvec{\hat{y}}_0\) and use the cross-entropy loss between \(\varvec{\hat{y}}_0\) and \(\varvec{y}\) to train the denoising network, rather than running the timestep-by-timestep inference. The detailed process is expressed in Algorithm 1.

Algorithm 1 Training.

The derivation and calculation of the cross-entropy loss are as follows:

$$\begin{aligned} \mathcal {L}_c& = \mathcal {L}_c \left( \varvec{y},\ \hat{p}(\varvec{y}_\text{F}|\varvec{x}) \right) \\&= -\varvec{y}^{\textrm{T}}\log \ \hat{p}\left( \varvec{y}_{\text{F}}|\varvec{x} \right) \\& =-\sum \nolimits _{k=1}^n{\log \ \hat{p}\left( \varvec{y}_{\text{F}}=\varvec{y}_k|\varvec{x}\right) } \\& =-\sum\nolimits _{k=1}^n \log \ \left( \sum \nolimits _{i=1}^n p\left( \varvec{y}_{\textrm{F}}=\varvec{y}_k|pre.\varvec{y}_0=\varvec{y}_i \right) \right.\\& \left. \hat{p}\left( pre.\varvec{y}_0=\varvec{y}_i|\varvec{x} \right) \right) \\& =-\sum \nolimits _{k=1}^n{\log \ \sum \nolimits _{i=1}^n{\overline{M}_{ik}}\ \hat{p} \left( pre.\varvec{y}_0=\varvec{y}_i|\varvec{x} \right) } \nonumber \\&=-\sum\nolimits_{k=1}^n{\log \ \sum\nolimits_{i=1}^n{(M^\text{T})_{ik}}\ \hat{p} \left( pre.\varvec{y}_0=\varvec{y}_i|\varvec{x} \right) } \\& =-\sum\nolimits _{k=1}^n \log \ \sum\nolimits _{i=1}^n{(M^\text{T})_{ik}} \\& \cdot \frac{\exp \left( - \left( \hat{\varvec{y}}_0-1_n \right) _{i}^{2} \right) }{\sum\nolimits _{j=1}^n {\exp \left( - \left( \hat{\varvec{y}}_0-1_n \right) _{j}^{2} \right) }} \end{aligned}$$
(9)

where \(\varvec{y}\) and \(\varvec{y}_{\textrm{F}}\) represent the ground truth labels and the output of FBP-Diffusion, respectively; n is the number of classes; and k, i, and j index the k-th, i-th, and j-th classes, respectively. \(1_n\) denotes an n-dimensional vector with all elements equal to 1. \(\overline{M}\) denotes the general noise transition matrix, which contains the transition probabilities from the noisy label to the denoised label. In our work, \(\overline{M}\) is defined as \(M^\textrm{T}\), where M is the prior information provided by the conditional information encoder, expressed as a probabilistic prediction of the training set. Other symbols not explained here can be found in the nomenclature at the end of this paper.

By applying the law of total probability, i.e., marginalizing over the intermediate variable \(pre.\varvec{y}_0\), we decompose the target conditional probability \(\hat{p}\left( \varvec{y}_{\textrm{F}}=\varvec{y}_k|\varvec{x}\right)\) into a sum of products of conditional probabilities in (9).

Because of the differences between the re-parameterized \(\varvec{\hat{y}}_0\) and \(pre.\varvec{y}_0\) obtained via step-by-step inference, the value of \(\mathcal {L}_c\) during training is higher and does not change significantly, whereas the noise estimation loss \(\mathcal {L}_{\varvec{\epsilon }}\) decreases rapidly and exhibits lower values. To match the characteristics of DM training and inference and to further optimize the training process, we combine these two losses with dynamic weights, with the total loss denoted by \(\mathcal {L}\), as follows:

$$\begin{aligned} \mathcal {L} = \left( \frac{1}{\mathcal {L}_{\varvec{\epsilon }} / (\mathcal {L}_c + 1)} \right) \cdot \mathcal {L}_{\varvec{\epsilon }} + \left( \frac{\mathcal {L}_{\varvec{\epsilon }} / \mathcal {L}_c}{\mathcal {L}_{\varvec{\epsilon }} / (\mathcal {L}_c + 1)} \right) \cdot \mathcal {L}_c \end{aligned}$$
(10)
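The two coefficients in (10) can be evaluated directly, as in the sketch below. Note that they simplify algebraically to \((\mathcal {L}_c+1)/\mathcal {L}_{\varvec{\epsilon }}\) and \((\mathcal {L}_c+1)/\mathcal {L}_c\), so the numeric value of \(\mathcal {L}\) is always \(2(\mathcal {L}_c+1)\); the weighting can therefore only rebalance the two gradients if the coefficients are treated as constants during backpropagation. That stop-gradient reading is our assumption for this illustration and is not stated in the text; the function names and loss values are hypothetical.

```python
def dlc_weights(loss_eps, loss_c):
    """Coefficients of (10), computed from plain floats (i.e. as constants)."""
    base = loss_eps / (loss_c + 1.0)
    w_eps = 1.0 / base                    # equals (loss_c + 1) / loss_eps
    w_c = (loss_eps / loss_c) / base      # equals (loss_c + 1) / loss_c
    return w_eps, w_c

def dlc_total(loss_eps, loss_c):
    """Total loss value of (10) for the given scalar losses."""
    w_eps, w_c = dlc_weights(loss_eps, loss_c)
    return w_eps * loss_eps + w_c * loss_c

# Hypothetical mid-training values: small noise loss, larger cross-entropy.
w_eps, w_c = dlc_weights(0.05, 1.5)
total = dlc_total(0.05, 1.5)
```

For these values the noise estimation loss receives a weight of 50 and the cross-entropy loss a weight of about 1.67, so the rapidly shrinking \(\mathcal {L}_{\varvec{\epsilon }}\) is scaled up to a magnitude comparable to that of \(\mathcal {L}_c\).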

In addition, we utilize the DDIM sampling method to improve inference efficiency, expressed as \(pre.\varvec{y}_0 = \textrm{DDIM}(\varvec{y}_T=f_\phi (\varvec{x}),\varvec{x})\). Owing to the low dimensionality of the label vector, the model performs well even when S is much smaller than T, thereby significantly reducing computation time. The inference phase of FBP-Diffusion is detailed in Algorithm 2. First, the parameterized \(\widetilde{\varvec{y}}_0\) is calculated by (7), and the next prediction \(\varvec{y}_{\tau _{s-1}}\) is iteratively computed, yielding \(pre.\varvec{y}_0\) after S sampling steps. Subsequently, the softmax probability of \(pre.\varvec{y}_0\) is multiplied by \(M^{\textrm{T}}\) to obtain \(\varvec{y}_{\textrm{F}}\). Finally, \(\varvec{y}_{\textrm{F}}\) is normalized to [0, 1], yielding the final output of FBP-Diffusion.

Algorithm 2 Inference.
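The last two inference steps, converting the reconstructed label encoding into probabilities and applying the transition matrix, can be sketched as follows. The probability mapping follows the last line of (9) and the index convention follows \(\sum _i (M^\textrm{T})_{ik}\,\hat{p}_i\); interpreting "normalized to [0, 1]" as division by the sum is an assumption, and the matrix and label encoding are hypothetical values.

```python
import math

def label_probs(y0):
    """Softmax over -(y0 - 1)^2, as in the last line of (9)."""
    logits = [-(v - 1.0) ** 2 for v in y0]
    top = max(logits)                      # subtract the max for stability
    exps = [math.exp(l - top) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def apply_transition(p, M):
    """y_F[k] = sum_i (M^T)[i][k] * p[i], then rescale to sum to 1 (assumed)."""
    n = len(p)
    MT = [[M[j][i] for j in range(n)] for i in range(n)]   # transpose of M
    y_f = [sum(MT[i][k] * p[i] for i in range(n)) for k in range(n)]
    total = sum(y_f)
    return [v / total for v in y_f]

pre_y0 = [0.1, 0.9, 0.3]                   # hypothetical output of the reverse process
M = [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.3, 0.7]]                      # hypothetical transition matrix
p = label_probs(pre_y0)
y_f = apply_transition(p, M)
predicted_class = max(range(len(y_f)), key=y_f.__getitem__)
```

The entry of \(pre.\varvec{y}_0\) closest to 1 receives the largest probability, and the transition matrix then redistributes that mass before the class with the largest value of \(\varvec{y}_{\textrm{F}}\) is selected.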

4 Experimental results and analysis

4.1 Experimental objects

SCUT-FBP [12] is a facial beauty database created by the South China University of Technology, containing 500 frontal images of young Asian female faces with beauty scores, with varying resolutions. The beauty scores of all images range from 1 to 5, with higher scores indicating higher beauty levels, and the scores were averaged from the ratings provided by 70 volunteers. We used the mode of the scores to classify the images into five categories: “1”, “2”, “3”, “4” and “5”, corresponding to “very unattractive”, “not attractive”, “average”, “attractive”, and “very attractive”, respectively.

SCUT-FBP5500 [13] is another facial beauty database created by the South China University of Technology, containing 5,500 frontal facial images with beauty scores, covering different genders, ages and races, with a resolution of \(350 \times 350\). There were 2,000 images of Asian women, 2,000 images of Asian men, 750 images of Caucasian men, and 750 images of Caucasian women. All images were rated on a scale ranging from 1 to 5, with higher scores indicating greater beauty, and these scores were averaged from assessments made by 60 volunteers. Likewise, we categorized the images into five classes using the mode of the scores, with the numerical labels and their meanings consistent with those in SCUT-FBP.

LSAFBD [15] is a facial beauty database created by our project team, containing 100,000 frontal facial images, covering various backgrounds, poses and ages, with a resolution of \(144 \times 144\), of which 80,000 are unlabeled images and 20,000 are labeled images. We used 10,000 labeled images of Asian female faces as subjects, with a total of five attractiveness categories of “0”, “1”, “2”, “3” and “4”, corresponding to “very unattractive”, “unattractive”, “average”, “attractive” and “very attractive”, respectively. The images were evaluated by 75 volunteers, and the scores were averaged.

CelebA [14] is a facial attribute database collected and published by the Chinese University of Hong Kong, containing 202,599 facial images of 10,177 celebrities, with a resolution of \(178 \times 218\). Each image has 40 attribute annotations. Using the “Attractive” label, 118,165 female images were divided into two categories: “unattractive” and “attractive”, corresponding to “1” and “2”, with 37,911 and 80,254 images, respectively. Figure 4 shows the distribution ratios and image examples of the four datasets, revealing an imbalance in their sample quantities.

Fig. 4 The distribution ratios and image samples of the four facial beauty databases

4.2 Experimental environment

All experiments in this study were implemented on PyTorch 1.10.0, CUDA 11.3, and Ubuntu 22.04. The device included an NVIDIA GeForce RTX 3080 Ti GPU, a 12th Gen Intel® Core™ i9-12900K\(\times\)24 CPU, 64GB of RAM, and a 2.5TB hard drive. Each database was divided into a training set and a testing set at a ratio of 8 : 2. All images were resized to \(224 \times 224\) resolution, and the training images were randomly flipped horizontally, rotated by \(\pm 10\) degrees, processed via color jitter, and normalized. In addition, the training set of SCUT-FBP was augmented because of its limited sample size, with each image being rotated by \(180^\circ\) and then horizontally flipped to generate four versions. For the diffusion process, the number of timesteps T was set to 1000 and S to 10, the noise was scheduled linearly from \(\beta _1=1\times 10^{-4}\) to \(\beta _T=0.02\), ResNet18 was used as the encoder \(f_{\textrm{R}}\) without pre-training, and MobileViT_xs was used as the conditional information encoder \(f_\phi\). The number of fine-tuning epochs, i.e., the pre-epoch, was set to 11, and the training result of the last pre-epoch was selected. FBP-Diffusion was trained for 100 epochs, with inference performed every 10 epochs. The value of \(fea_{dim}\) was set to 2048, and the batch size was 16 in all experiments. During training, the Adam optimizer was used with an initial learning rate of \(1\times 10^{-4}\). The learning rate of the diffusion model was adjusted by a half-cycle cosine decay function, while that of \(f_\phi\) was not subject to additional processing. The values of other hyperparameters not mentioned above were set in accordance with [22].
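For reference, a half-cycle cosine decay of the learning rate can be written as below. The exact form used in the experiments is not given, so this sketch assumes the common variant without warm-up that decays from the initial rate to a minimum of zero over the 100 training epochs; the function name and the `min_lr` parameter are hypothetical.

```python
import math

def half_cycle_cosine_lr(epoch, total_epochs, base_lr=1e-4, min_lr=0.0):
    """Learning rate after `epoch` epochs of a half-cycle cosine decay."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine

# Learning rate at the start of each of the 100 epochs, plus the end point.
schedule = [half_cycle_cosine_lr(e, 100) for e in range(101)]
```

Under these assumptions the rate starts at \(1\times 10^{-4}\), reaches half of that value at epoch 50, and approaches zero at epoch 100.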

4.3 Evaluation metrics

We evaluate the performance of FBP-Diffusion with image classification metrics, including Accuracy (Acc), Precision (Prec), Recall (Rec), and F1 Score (F1), defined by

$$\begin{aligned} \textrm{Acc} = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(11)
$$\begin{aligned} \textrm{Prec} = \frac{TP}{TP+FP} \end{aligned}$$
(12)
$$\begin{aligned} \textrm{Rec} = \frac{TP}{TP+FN} \end{aligned}$$
(13)
$$\begin{aligned} \textrm{F}1 = \frac{2 \times \textrm{Prec} \times \textrm{Rec}}{\textrm{Prec} + \textrm{Rec}} \end{aligned}$$
(14)

where TP represents true positives, FP represents false positives, FN represents false negatives, and TN represents true negatives. For binary classification, Prec, Rec, and F1 scores were calculated by treating classes “1” and “2” as the positive class in turn. The three metrics were then averaged with weights equal to the sample proportions. Likewise, for multi-class classification, these metrics were computed for each class and then averaged to obtain the final values.

Because weighted-average recall coincides with accuracy in multi-class classification, we used the macro average to calculate Rec in the multi-class setting and the weighted average for the other metrics. Higher values of Accuracy, Precision, Recall, and F1 indicate better model performance. In addition, the training time of the DMs, excluding the fine-tuning time of the conditional information encoder, was considered to evaluate the efficiency of the model.
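The averaging choices described above can be made concrete with a small confusion-matrix example. The sketch below computes accuracy, weighted precision, macro and weighted recall, and weighted F1 for a hypothetical three-class confusion matrix; the function name and the counts are illustrative only.

```python
def summarize(C):
    """Metrics from a confusion matrix C[true][pred] of raw counts."""
    n = len(C)
    total = sum(sum(row) for row in C)
    precs, recs, f1s, supports = [], [], [], []
    for k in range(n):
        tp = C[k][k]
        fn = sum(C[k]) - tp
        fp = sum(C[i][k] for i in range(n)) - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
        supports.append(sum(C[k]))       # number of true samples in class k

    def weighted(values):
        return sum(v * s for v, s in zip(values, supports)) / total

    return {
        "acc": sum(C[k][k] for k in range(n)) / total,
        "weighted_prec": weighted(precs),
        "macro_rec": sum(recs) / n,
        "weighted_rec": weighted(recs),
        "weighted_f1": weighted(f1s),
    }

# Hypothetical three-class confusion matrix (rows: true class, columns: predicted).
C = [[8, 2, 0],
     [1, 5, 4],
     [0, 1, 9]]
metrics = summarize(C)
```

In this example the weighted recall equals the accuracy (22/30), whereas the macro recall (about 0.733) gives each class the same weight regardless of its sample count, which is why it carries additional information on imbalanced datasets.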

4.4 Comparison experiments with baseline methods

With CARD and its variants as baseline methods, we conducted experiments on four facial beauty databases and compared their performance with FBP-Diffusion. The experimental results are listed in Tables 1 to 4, ordered by the size of the databases. The best results are highlighted in bold, and the second-best results are underlined. To demonstrate the effectiveness of FBP-Diffusion, all comparative experiments for DMs were conducted under identical experimental environments and model configurations, including the versions of third-party libraries in the running environment, input image size, timesteps T and S, \(\beta _t\), epoch, \(fea_{dim}\), batch_size, and initial learning rate. The only exception was the pre-epoch, which is determined by the design of \(f_\phi\). The values of other hyperparameters were adopted as per their specifications in the original papers.

Table 1 Comparison with the baseline method on the SCUT-FBP
Table 2 Comparison with the baseline method on the SCUT-FBP5500
Table 3 Comparison with the baseline method on the LSAFBD
Table 4 Comparison with the baseline method on the CelebA

Among the baseline methods, CARD first applied DMs to the classification task, with both encoders \(f_\textrm{R}\) and \(f_\phi\) set to ResNet18. Building on CARD, DiffMIC [23] applied DMs to general medical image classification, introduced a dual-granularity conditional guidance model as \(f_\phi\), and optimized the model parameters with the MMD regularization loss. CD-Loop [25] used DMs to accurately and reliably predict chromatin loops, with a \(28 \times 28\) Hi-C contact map as the input and the LeNet5 model as the encoder \(f_\phi\). LRA-diffusion [24] utilized a pre-trained CLIP model as \(f_\phi\), retrieved labels from the k-nearest neighbors in the training set based on the neighbor consistency principle to replace the true labels for modeling DMs, and accelerated the training process of DMs with the DDIM sampling method, with the output of \(f_\phi\) saved for later reuse.
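The neighbor consistency principle mentioned for LRA-diffusion can be sketched as follows. This is a generic majority-vote variant written for illustration, not the procedure used in [24]; the function name, the cosine similarity measure, and the toy features are assumptions.

```python
import numpy as np

def knn_retrieve_labels(features, labels, n_classes, k=3):
    """Replace each label with the majority label of the k nearest neighbours.

    Hypothetical helper: neighbours are ranked by cosine similarity of the
    encoder features, and a sample is never counted as its own neighbour.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    retrieved = np.empty(len(labels), dtype=int)
    for i, idx in enumerate(neighbours):
        votes = np.bincount(labels[idx], minlength=n_classes)
        retrieved[i] = votes.argmax()  # ties go to the smallest class index
    return retrieved

# Two well-separated toy clusters, each containing one noisy label.
features = np.array([
    [1.00, 0.00], [0.90, 0.10], [1.00, 0.10], [0.95, 0.05],  # cluster A
    [0.00, 1.00], [0.10, 0.90], [0.10, 1.00], [0.05, 0.95],  # cluster B
])
noisy_labels = np.array([0, 0, 1, 0, 1, 1, 1, 0])
retrieved = knn_retrieve_labels(features, noisy_labels, n_classes=2, k=3)
print(retrieved)
```

The vote is only reliable when samples of different classes are well separated in feature space; when features of different classes are highly similar, neighbours frequently carry other labels and the retrieved labels become unreliable.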

As shown in Tables 1–4, FBP-Diffusion achieves the highest values for Acc, Prec, Rec, and F1 in most cases, the exception being Recall on SCUT-FBP, where it attains the second-highest value. A comparative analysis of the performance across the four databases reveals that LRA-diffusion, particularly on SCUT-FBP, exhibits a higher Recall value, which appears to be derived from a weighted average, as its value closely aligns with the other three metrics. This can be attributed to the varying judgment capabilities of different noisy-label learning methods on the positive and negative classes, with LRA-diffusion demonstrating greater sensitivity to the minority classes in smaller databases. Overall, FBP-Diffusion demonstrates the best classification performance.

Upon further examination, CD-Loop is limited by the input size and encoder settings, and the k-nearest neighbor retrieval method of LRA-diffusion is not suitable for FBP as it struggles with high similarity of image features involved in FBP tasks. DiffMIC leverages global and local features of images effectively and exhibits the second-best classification result. LRA-diffusion exhibits the shortest training time, followed by FBP-Diffusion. DiffMIC implements joint training for \(f_\phi\) and the denoising networks, thus exhibiting the longest training time.

4.5 Comparison with state-of-the-art FBP methods

To further verify the effectiveness of the proposed method, we compared the accuracy of FBP-Diffusion with that of other state-of-the-art FBP methods on the four databases. The results are listed in Table 5, ordered by year of publication. Among the methods considered, 2M BeautyNet [32] is a multi-input and multi-task CNN model that utilizes gender recognition to assist FBP tasks. The self-correcting noise label method [20] and DIDTAN [16] serve as both classifiers and label correctors and are trained in a multistage manner. E-BLS and ER-BLS [17] use EfficientNet based on transfer learning as the feature extractor and further refine the features and fit the prediction results through BLS. Moreover, TransBLS-T and TransBLS-B [18] capture global and local features of the face with the GLAFormer encoder and effectively combine ViTs [51] with BLS. The adaptive multitask method [19] adopts an adaptive sharing strategy and attention feature fusion to improve FBP accuracy. In addition, two other general-purpose vision models were compared. Swin Transformer [53] improves computational efficiency and model representation through a shifted window attention mechanism. MLP-Mixer [54] is built entirely on multilayer perceptrons and relies solely on basic matrix multiplication for data processing and feature extraction.

Table 5 Comparison of accuracies (%) of state-of-the-art FBP methods on the four facial beauty databases

The results indicate that the classification accuracy of FBP-Diffusion is better than those of the existing methods, reaching 67.71\(\%\), 78.60\(\%\), 73.40\(\%\), and 82.56\(\%\) on the four datasets, which are 1.04\(\%\), 2.56\(\%\), 5.17\(\%\), and 0.71\(\%\) higher than the second-best results, respectively.

4.6 Ablation experiment

To evaluate the effectiveness of the \(f_\phi\) and DLC modules in FBP-Diffusion, ablation experiments were performed on the four databases, and the results are listed in Table 6. Here, “Diffusion” refers to CARD enhanced with DDIM and some detailed implementation changes in the model fine-tuning process. Apart from the encoder \(f_\phi\) and the absence of the DLC module, it is identical to FBP-Diffusion, which is denoted “Diffusion+MobileViT+DLC”. The ablation experiments were performed under the same experimental conditions and model configurations described in Section 4.2. Furthermore, experiments conducted on the same database used the same fine-tuned MobileViT model, whose pre-training time is denoted by “-” because it is not part of “Time”, which refers to the training time of DMs.

Table 6 Effectiveness of each module in FBP-Diffusion

The data in Table 6 indicate that the accuracy increases progressively as MobileViT is applied and the DLC module is added. MobileViT contributes more to the accuracy than the DLC module, with improvements ranging from 0.49% to 7.29% and from 0% to 0.4%, respectively. In summary, FBP-Diffusion outperforms the two baseline models used and achieves the best classification accuracy. It exhibits the highest values in almost all classification metrics, except for the Prec value on the LSAFBD database. This is attributed to the probability transfer in the DLC module, which comes at the cost of a certain amount of misprediction, thus lowering the Prec or Rec values. In smaller databases, the sample size varies more across categories, resulting in lower macro-averaged Rec values compared to the other metrics. On the LSAFBD database, the “Diffusion” model achieves an accuracy of 70.39%, surpassing all other methods listed in Table 5, which further validates the advantages of diffusion models on noisy databases. In addition, “Diffusion+MobileViT” requires the shortest training time, and FBP-Diffusion the second shortest.

Fig. 5 Confusion matrices of model outputs before and after loss correction

Notably, on the more refined small-scale databases, the DLC module yields almost no improvement in prediction accuracy but improves the other three metrics, by up to 6.86%. On large-scale databases, the denoising capability and performance of DMs are already relatively good, and the DLC module has a less pronounced effect on the overall performance. To illustrate the effects of loss correction in greater detail and highlight the effectiveness of the DLC module, the confusion matrices of the output results before and after correction are depicted in Fig. 5, and the loss curves during the training of FBP-Diffusion are depicted in Fig. 6.

The DLC module modifies the outputs corresponding to the samples through probabilistic transitions, while DMs minimize the overall loss between these modified outputs and the true labels through iterative training, thereby enabling a larger number of samples to be accurately classified. The confusion matrix for the SCUT-FBP database in Fig. 5 shows that some samples originally predicted as “3” are reclassified as “4” after the addition of the DLC module. For the SCUT-FBP5500 database, some samples originally predicted as “2” are reclassified as “1”, while others originally predicted as “3” are reclassified as “4” following DLC processing. Overall, the number of samples whose predicted labels change from incorrect to correct is greater than the number of samples whose labels change from correct to incorrect. A similar situation is observed in the other two databases. However, the effect of the DLC module is less noticeable on the CelebA database owing to its large scale and the binary classification task.
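The comparison of predictions before and after correction can be quantified with a short sketch that counts how many predictions change from incorrect to correct and vice versa. The function name and the toy labels are hypothetical and do not reproduce the values in Fig. 5.

```python
import numpy as np

def count_label_flips(y_true, pred_before, pred_after):
    """Count predictions that a correction step fixes or breaks."""
    y_true, pred_before, pred_after = map(
        np.asarray, (y_true, pred_before, pred_after))
    was_correct = pred_before == y_true
    is_correct = pred_after == y_true
    fixed = int(np.sum(~was_correct & is_correct))   # incorrect -> correct
    broken = int(np.sum(was_correct & ~is_correct))  # correct -> incorrect
    return {"fixed": fixed, "broken": broken, "net_gain": fixed - broken}

# Toy example: two "3" predictions and one "2" prediction are fixed,
# while one previously correct "3" prediction is moved to "4".
y_true      = [4, 4, 4, 3, 3, 1, 1, 2]
pred_before = [3, 3, 4, 3, 3, 2, 1, 2]
pred_after  = [4, 4, 4, 4, 3, 1, 1, 2]
flips = count_label_flips(y_true, pred_before, pred_after)
print(flips)
```

A positive `net_gain` corresponds to the situation described above, in which more samples are moved to the correct class than away from it, so the accuracy rises even though some previously correct predictions are lost.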

Fig. 6 Training loss curves of FBP-Diffusion

In Fig. 6, the blue curve represents the noise estimation loss \(\mathcal {L}_{\varvec{\epsilon }}\), the black curve represents the cross-entropy loss \(\mathcal {L}_c\), and the red curve represents the overall loss \(\mathcal {L}\) of FBP-Diffusion. \(\mathcal {L}_c\) is derived from the re-parameterized \(\varvec{\hat{y}}_0\) and the true labels \(\varvec{y}\). As the difference between them is large, \(\mathcal {L}_c\) exhibits a higher value and a smaller variance. In contrast, \(\mathcal {L}_{\varvec{\epsilon }}\) decreases significantly from the beginning and, despite subsequent fluctuations, remains at a low level. The two losses thus reflect different convergence characteristics. When they are combined, the trend of \(\mathcal {L}\) generally follows that of \(\mathcal {L}_{\varvec{\epsilon }}\), indicating that \(\mathcal {L}_{\varvec{\epsilon }}\) dominates the overall loss. In the later stages of training, most of the loss values on the four databases fluctuate within 0.3. The larger the database, the smaller the initial loss and the faster the model converges to the same loss value. For the CelebA dataset, the loss initially drops below 0.4 and approaches 0 more quickly. This is because larger databases contain more samples and features, allowing the model to find valid feature representations faster.

Table 7 Effects of different values of \(fea_{dim}\) and S on FBP-Diffusion performance

Table 7 lists the accuracy and training time of FBP-Diffusion for different values of the hyperparameters \(fea_{dim}\) and S. When \(fea_{dim}\) is varied, S is set to 10; when S is varied, \(fea_{dim}\) is set to 2048. As the hyperparameter values increase, the training time gradually increases. The accuracy also exhibits a noticeable pattern, peaking at \(fea_{dim}=2048\) and \(S=10\) and decreasing as the values deviate from these points. This trend holds overall but is weaker for SCUT-FBP. When S varies between 10 and 40, the prediction accuracy of the model hardly changes, and the training time fluctuates more randomly because it is very low throughout, indicating that the optimal value of S is not fixed across databases. When applying diffusion models to other domains, it is recommended to perform several experiments to determine the optimal value empirically. The accuracy values do not vary significantly, indicating that with lower \(fea_{dim}\) and S values, the model still performs well while considerably reducing training time and memory usage.
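To make the role of S concrete, the following sketch assumes that S denotes the number of timesteps retained by a DDIM-style accelerated sampler out of the T timesteps of the full diffusion process. This interpretation, the function name, the even spacing, and the value T = 1000 are assumptions made for illustration and are not taken from the experimental configuration.

```python
import numpy as np

def ddim_timesteps(T, S):
    """Select S roughly evenly spaced timesteps from 1..T.

    Hypothetical helper: the result is returned in descending order,
    i.e. the order in which a reverse (sampling) process would visit them.
    """
    if not 1 <= S <= T:
        raise ValueError("S must satisfy 1 <= S <= T")
    steps = np.linspace(1, T, S).round().astype(int)
    return steps[::-1]

# With T = 1000 (assumed) and S = 10, the reverse process visits
# 10 timesteps instead of 1000.
schedule = ddim_timesteps(T=1000, S=10)
print(schedule)
```

Under this assumption, the number of denoising network evaluations per sample grows linearly with S, which is consistent with the training time increasing as S increases.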

5 Conclusions

To mitigate the impact of noisy labels and enhance prediction performance, we proposed FBP-Diffusion, a diffusion model that integrates MobileViT and DLC. MobileViT serves as the conditional information encoder to generate high-accuracy preliminary predictions that guide label generation during the reverse process in DMs. Using the learned noise transition matrix, we introduced a DLC module to increase the cross-entropy loss between the predicted probabilities and true labels, which is dynamically incorporated into the noise estimation loss to influence the training process of FBP-Diffusion. Experimental results on the SCUT-FBP, SCUT-FBP5500, LSAFBD, and CelebA databases demonstrate that FBP-Diffusion outperforms conventional DMs and existing FBP methods, suggesting its applicability to other learning tasks and related fields. However, as DMs have inherently strong denoising capabilities and high robustness to noisy labels, the improvement in accuracy due to DLC is not substantial. Future work should focus on further optimizing the noise transition matrix and exploring other noisy label learning methods to refine the model design.