1 Introduction

The performance of a neural network (NN) is known to be quite sensitive to the nature and quality of the data used for training (Dodge & Karam, 2016; Zhou et al., 2018; Budach et al., 2022). Noise or other errors present in the data can have a profound impact on the performance and reliability of the trained network (Nazaré et al., 2018; Gupta & Gupta, 2019; Song et al., 2023). Artefacts such as labelling errors (Song et al., 2023) or image noise (Boncelet, 2009) are ubiquitous in real-world data such as medical recordings (Pusey et al., 1986; Gravel et al., 2004; Goyal et al., 2018). While the effect of label noise is rather well understood (Rolnick et al., 2017; Jiang et al., 2020; Song et al., 2023) and is accounted for in the statistical model underlying the typical loss functions, cf. (1) below, the same is not true for errors in the input variable. Such a scenario is known in statistics as Errors-in-Variables (Fuller, 2009), but has so far found only limited attention in the deep learning community. One major obstacle to applying this statistical concept is its computational cost (cf. Sect. 1.1). To our knowledge, we present the first scalable Errors-in-Variables framework for deep learning that improves prediction performance in the context of image classification.

Classification of images x via a NN \(f_\theta\), parametrized by \(\theta\) and equipped with a softmax output, can be interpreted as a statistical model for the labels y through

$$\begin{aligned} y\sim p(y|x,\theta )=\textrm{Cat}(y|f_\theta (x))\,, \end{aligned}$$
(1)

where \(\textrm{Cat}\) denotes the categorical distribution. Note that while this model considers y to be noisy, it does not explicitly account for uncertainty in the input x. In many cases this is not adequate, given that real images are often subject to noise (Boncelet, 2009). In such a scenario, a more suitable statistical model is the Errors-in-Variables (EiV) model (Fuller, 2009) that we employ in this work:

$$\begin{aligned} y\sim p(y|\zeta , \theta ) = \textrm{Cat}(y|f_\theta (\zeta )), \qquad x \sim p(x|\zeta )\,. \end{aligned}$$
(2)

The model (2) assumes that there is a true but unknown \(\zeta\) underlying the observed x, which is subject to error. For linear regression problems, it is well known that not accounting for an error in the input leads to a bias (Fuller, 2009). Similar observations have been made for general nonlinear models (Schennach, 2012; Chesher, 1991). For linear regression and multinomial logit regression, which can be seen as the linear analog of (1), explicit expressions for this bias can be deduced (Fuller, 2009; Kao & Schnell, 1987; Stefanski & Carroll, 1985).
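
As a brief illustration of this bias, consider the classical attenuation result for simple linear regression (cf. Fuller, 2009); the symbols below are introduced only for this example. If \(y=\beta \zeta +\varepsilon\) is fitted by ordinary least squares on the noisy input \(x=\zeta +u\), with input noise \(u\sim \mathcal {N}(0,\sigma _u^2)\) independent of \(\zeta\) and \(\varepsilon\), the estimated slope satisfies

$$\begin{aligned} \hat{\beta } \rightarrow \beta \cdot \frac{\sigma _\zeta ^2}{\sigma _\zeta ^2+\sigma _u^2} \quad \text {for } N\rightarrow \infty \,, \end{aligned}$$

where \(\sigma _\zeta ^2\) denotes the variance of the true input \(\zeta\); the naive estimator is thus attenuated towards zero whenever \(\sigma _u^2>0\).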

Fig. 1: Values of the posterior predictive distribution p(y|x) on the simplex for \(10^4\) test points of a noisified version of CIFAR10 (noise level \(T=91\), cf. Sect. 4) with the Bayesian version of the non-EiV model (1) on the left and the EiV model (2) proposed in this work on the right. To illustrate the general prediction trend, CIFAR10 was reduced to three metaclasses as indicated by the labels on the vertices in this plot. Points that were correctly classified by p(y|x) are marked in cyan, points that were wrongly classified are marked in red. Red points are plotted in the foreground (Color figure online)

The literature on supervised deep learning almost exclusively builds on statistical models without Errors-in-Variables (called non-EiV in this work), such as (1). The effect of errors in the input is therefore barely known. Our work builds on one of the few works on this topic (Martin & Elster, 2023), which applies a Bayesian treatment (Gal & Ghahramani, 2016; Kendall & Gal, 2017) of EiV as in (2). A Bayesian handling of (2) allows the distribution of the unknown \(\zeta\) given x to be modelled by some distribution \(p(\zeta |x)\), which we call the input posterior; its exact form depends on the chosen Bayesian modelling. Although the method from Martin and Elster (2023) improved the obtained uncertainties with its choice of \(p(\zeta |x)\), it did not improve prediction performance. We substantially improve on this by using a neural network to model the input posterior \(p(\zeta |x)\): concretely, we choose a diffusion model (Ho et al., 2020) to sample \(\zeta\) from \(p(\zeta |x)\), although in principle other types of neural networks could be used for this purpose. In contrast to how diffusion models are typically applied in the literature (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020; Croitoru et al., 2023), our diffusion model does not generate an image from pure Gaussian noise but instead starts at a specified noise level that matches the expected noise in the image, cf. Sect. 2.2 below. Using the diffusion model on noisy input data to sample \(\zeta\), we observe that EiV (2) leads to a substantial increase in the prediction performance of the trained model compared to non-EiV. We thereby provide a proof of principle that treating errors in the inputs via an adequate statistical model can indeed boost the performance of a trained NN.

Figure 1 compares the non-EiV and EiV predictions for a noisy version of CIFAR10 (cf. Sect. 4). In this plot, we reduced the ten classes of CIFAR10 to three meta classes, as indicated by the labels on the vertices of the 2-simplices shown. Correctly and incorrectly classified images are plotted in cyan and red, respectively. The left-hand side of Fig. 1 shows the values of the posterior predictive distribution p(y|x) yielded by the Bayesian treatment (Gal & Ghahramani, 2016) of the non-EiV model (1); the right plot shows the corresponding results for the Bayesian treatment of the EiV model (2). We immediately note that the EiV model has far fewer red points and hence achieves a substantially higher accuracy than the non-EiV model. Moreover, the non-EiV plot shows an accumulation of red points at the vertices of the simplex, i.e., wrong predictions made with high confidence. This accumulation is substantially reduced by the EiV model, indicating that the learned statistical model yields a more reliable prediction of class probabilities. Details of the computation are provided in Sects. 2 and 3, and in Sects. 4 and 5 we show that EiV consistently leads to a substantially better calibration of the posterior predictive distribution.

The key component that leads to this improvement in performance and reliability is the use of a NN to model the input posterior \(p(\zeta |x)\). In the case of diffusion models, a dataset of unnoisy datapoints is required. For the purpose of providing a proof of principle, we will in this work partially use the actual unnoisy dataset underlying the studied problem. In practice, such a dataset could be obtained, for example, from a simulation (Nikolenko, 2019), by measuring some inputs with higher accuracy (Plenge et al., 2012), or via a surrogate dataset. We also demonstrate, using the example of CIFAR100 with CIFAR10 as a surrogate dataset, that the latter option is viable for the proposed EiV method.

This work is structured as follows: in Sects. 2 and 3 we explain the fusion of Bayesian neural networks (BNNs) with the EiV model (2) and describe how diffusion models can be used to model \(p(\zeta |x)\). The practical implementation is discussed in Sect. 3.1. In Sects. 4 and 5 we compare the performance and calibration of EiV, non-EiV and Average Input Variable (AIV) models; in AIV, we use the diffusion model only to denoise the input variables but do not take the underlying statistical model into account. We finish by presenting our conclusions in Sect. 6.

1.1 Related work

EiV models have long been explored and applied in the statistical literature, both from the point of view of classical statistics (Fuller, 2009; Van Huffel, 1997) and from a Bayesian perspective, cf., e.g., Dellaportas and Stephens (1995); Leonard (2011). In an EiV model not only the observed data y is modelled as noise-corrupted; the latent (i.e. non-observable) explanatory variable \(\zeta\) is also inferred from the observed, noise-corrupted data x. However, while there is some literature on noisy inputs for neural networks, e.g. (Im et al., 2015; Loquercio et al., 2020; Nazaré et al., 2018; Ivanovic et al., 2022), EiV models have hardly been studied in deep learning. There is an older body of literature from the 1990s and early 2000s (Bassu et al., 1999; Van Gorp et al., 1998; Sragner & Horvath, 2003; Seghouane & Fleury, 2001) which considers EiV in a manner similar to frequentist statistics (Fuller, 2009) for rather simple, often one-dimensional, problems. These approaches have to evaluate the model at test time on the noisy input x, since the true input \(\zeta\) is unknown. Given a model for \(p(\zeta |x)\), Bayesian approaches offer a way around this problem. However, most existing works in this direction (Wright, 1999; Wright et al., 2000; Pavone et al., 2018; Zhang et al., 2011; Yuan et al., 2020) rely on Markov Chain Monte Carlo, which is computationally prohibitive for deep neural networks (DNNs) and higher dimensional problems. For the sake of computing uncertainties, a Laplace approximation (Wright, 1999; Pavone et al., 2018) can be used, though this requires computing the Hessian. Our work is conceptually closest to Martin and Elster (2023), where variational-inference-based BNNs (Gal & Ghahramani, 2016; Kendall & Gal, 2017; Gal et al., 2017; Blundell et al., 2015; Kingma et al., 2015; Duvenaud et al., 2016) are combined with an EiV handling of the input. To our knowledge, Martin and Elster (2023) is the only approach that scales to DNNs and higher dimensional problems, but, in contrast to the approach presented in this work, it does not provide any increase in prediction performance.

2 Background

2.1 Bayesian EiV for neural networks

In supervised learning, NNs model the relationship between an input variable \(x_i\) and its target \(y_i\). In the training phase the network is optimized w.r.t. the parameter \(\theta\) that best characterizes the distribution \(p(y|x,\theta )\) given a training dataset \(\mathcal {D}=\{(x_1,y_1),\ldots , (x_N,y_N)\}\) of pairs of inputs \(x_i\) and targets \(y_i\).

Bayesian neural networks While non-Bayesian approaches find an optimal value for \(\theta\) by minimizing some loss function such as the negative log-likelihood \(-\sum _{i=1}^N \log p(y_i|x_i,\theta )\), BNNs (Gal & Ghahramani, 2016; Gal et al., 2017; Kingma et al., 2015; Kendall & Gal, 2017; Blundell et al., 2015; Zhang et al., 2018; Duvenaud et al., 2016) consider \(\theta\) as a random variable and learn a variational distribution \(q_\phi (\theta )\) that approximates the computationally intractable posterior \(p(\theta |\mathcal {D})\):

$$\begin{aligned} q_\phi (\theta ) \approx p(\theta | \mathcal {D}) \propto p(\theta ) \cdot \prod _{i=1}^N p(y_i|x_i,\theta )\,, \end{aligned}$$
(3)

where \(p(\theta )\) is some prior distribution such as \(p(\theta )=\mathcal {N}(\theta |0,\lambda ^{-1} \textbf{I})\). The variational distribution in (3) is obtained by choosing the optimal variational parameter \(\phi\) within a family \((q_\phi )_\phi\) of distributions, which is typically chosen such that sampling from \(q_\phi\) and inferring an optimal \(\phi\) are feasible. This optimal \(\phi\) is often obtained by minimizing the Kullback–Leibler divergence \(\mathcal {L}(\phi )=D_{\textrm{KL}}(q_\phi (\theta )\Vert p(\theta |\mathcal {D}))\) between the variational distribution \(q_{\phi }(\theta )\) and the posterior \(p(\theta | \mathcal {D})\).

EiV Errors-in-Variables considers the input variable x as noisy. More precisely, the NN induces a distribution \(p(y|\zeta , \theta )\) that depends on a true input \(\zeta\), which is unknown; all we can observe are noisy versions x of \(\zeta\) that follow some distribution \(p(x|\zeta )\).

As was shown in (Martin & Elster, 2023), once we have access to an input posterior \(p(\zeta |x) \propto p(\zeta ) \cdot p(x|\zeta )\) that describes our knowledge about \(\zeta\) given its noisy version x, the variational approach from (3) can be adapted to the EiV setup, namely

$$\begin{aligned} q_\phi (\theta ) \approx p(\theta |\mathcal {D}) \propto p(\theta ) \cdot \prod _{i=1}^N \int \textrm{d}\zeta _i \, p(\zeta _i |x_i) \,p(y_i|\zeta _i,\theta ) , \end{aligned}$$

for which the Kullback–Leibler loss \(D_{\textrm{KL}}(q_\phi (\theta )\Vert p(\theta |\mathcal {D}))\) turns, up to constants and rescaling, into the loss function

$$\begin{aligned} \mathcal {L}(\phi ) = -\frac{1}{N} \sum _{i=1}^N \mathbb {E}_{\theta \sim q_\phi (\theta )} \left[ \log \left( \mathbb {E}_{\zeta _i \sim p(\zeta _i|x_i)} [ p(y_i | \zeta _i, \theta ) ] \right) \right] + \frac{1}{N}D_{\textrm{KL}}(q_\phi (\theta ) \Vert p(\theta ))\,. \end{aligned}$$
(4)

The crucial ingredient required to apply EiV successfully is the choice of the input posterior \(p(\zeta |x)\). For \(p(\zeta |x)\) equal to the Dirac distribution at x, the loss (4) simply coincides with the classical variational inference loss used, e.g., in (Gal & Ghahramani, 2016; Gal et al., 2017; Kendall & Gal, 2017). Next we show how diffusion models can be used as input posteriors.

2.2 Diffusion models as input posterior \(p(\zeta |x)\)

In this work we propose to use diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020; Croitoru et al., 2023) to obtain an informative input posterior distribution \(p(\zeta |x)\) for (4).

Diffusion model Suppose we are given images \(\zeta\) from an (unknown) data distribution \(p(\zeta )\). Starting from an image \(x^0=\zeta\), we iteratively replace the pixels of the image with Gaussian noise via a diffusion process (Sohl-Dickstein et al., 2015; Ho et al., 2020). More precisely, at each time step t the conditionals of this process are given by

$$\begin{aligned} p(x^t | x^{t-1}) = \mathcal {N}(x^t |\sqrt{1 - \beta _t} \,x^{t-1}, \beta _t \,\textbf{I})\,, \end{aligned}$$
(5)

where the \(0<\beta _t < 1\) are chosen such that \(\beta _{t}> \beta _{t-1}\). The reversed conditional distribution is given by (Sohl-Dickstein et al., 2015)

$$\begin{aligned} p(x^{t-1} | x^t) = \mathcal {N}(x^{t-1} |\mu (x^t, t), \Sigma (x^t, t))\,, \end{aligned}$$
(6)

where, in the context of diffusion models, \(\mu (x^t,t)\) and \(\Sigma (x^t,t)\) are modelled by a NN. In this work we follow (Ho et al., 2020), where the variance is set to a constant \(\Sigma (x^t, t) = \sigma ^2_t \textbf{I} = \frac{1-\overline{\alpha }_{t-1}}{1-\overline{\alpha }_{t}} \beta _t \,\textbf{I}\) with \(\overline{\alpha }_t = \prod _{s=1}^t (1-\beta _s)\). From (5) we obtain for the distribution of \(x^t\) given \(x^0=\zeta\)

$$\begin{aligned} p(x^t |\zeta )=\mathcal {N}(x^t | \sqrt{\overline{\alpha }_t}\, \zeta , (1-\overline{\alpha }_t) \, \textbf{I}). \end{aligned}$$
(7)

In particular, for \(t \rightarrow \infty\) we obtain an \(x^t\) that is distributed as \(\mathcal {N}(0,\textbf{I})\) and independent of \(x^0=\zeta\). To sample from the reverse distribution \(p(\zeta |x^t)\), the conditional (6) can be applied iteratively. We here summarize this iterative application, the diffusion model, as

$$\begin{aligned} \zeta = \Xi (x^t, \varvec{z}^t, t) \sim p(\zeta |x^t) \,, \end{aligned}$$
(8)

where \(\Xi\) denotes t iterative evaluations of the NN \(\mu (x^{s}, s)\) and subsequent sampling \(x^{s-1} = \mu (x^{s},s) + z^s \sim p(x^{s-1}|x^{s})\), with \(z^s\sim \mathcal {N}(0, \sigma ^2_s \textbf{I})\), starting at \(s=t\) from \(x^t\). In (8) we use the abbreviation \(\varvec{z}^t=(z^t, z^{t-1},\ldots ,z^0)\).
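
For concreteness, the reverse sampling (8) starting from a finite step t can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code; it assumes a trained network mu_net(x, s) that returns the mean \(\mu (x^s,s)\) of (6) and a 1-D tensor betas containing \(\beta _1,\ldots ,\beta _{t_{\max }}\):

```python
import torch

@torch.no_grad()
def sample_zeta(mu_net, x_t, t, betas):
    """Draw zeta = Xi(x^t, z^t, t) ~ p(zeta | x^t), cf. (8), by iterating the reverse conditional (6)."""
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)       # alpha_bar_s = prod_{r<=s} (1 - beta_r)
    x = x_t
    for s in range(t, 0, -1):                            # s = t, t-1, ..., 1
        mean = mu_net(x, s)                              # mu(x^s, s), modelled by the NN
        alpha_bar_prev = alpha_bars[s - 2] if s > 1 else torch.tensor(1.0)
        # fixed variance sigma_s^2 = (1 - alpha_bar_{s-1}) / (1 - alpha_bar_s) * beta_s
        sigma2 = (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[s - 1]) * betas[s - 1]
        x = mean + sigma2.sqrt() * torch.randn_like(x)   # x^{s-1} = mu(x^s, s) + z^s
    return x                                             # x^0 = zeta
```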

Diffusion models and EiV Song et al. (2020); Sohl-Dickstein et al. (2015); Ho et al. (2020) sample from (8) for large t, where \(x^t\) can be considered to be distributed according to \(\mathcal {N}(0,\textbf{I})\). This allows them to generate samples starting from pure Gaussian noise. We followed these approaches for training the diffusion model \(\Xi\). Once trained, however, we applied \(\Xi\) to sample from the conditional (8) at a finite but fixed \(t=T\) whose \(\overline{\alpha }_T\) describes the noise level we expect in our data. In other words, we identify the \(x^T\) from the diffusion model with the x used in (2). In particular, \(p(x|\zeta ) = p(x^T|\zeta )\) is given by (7) and the input posterior \(p(\zeta |x) = p(\zeta |x^T)\) is given by (8).

In the experiments below we introduce noise of different levels T into our data. To this end, we start from the unnoisy datapoints \(\zeta ^{\textrm{true}}\) and sample from \(p(x|\zeta ^{\textrm{true}})\), i.e., (7) at noise level \(t=T\). The same T is then used when applying the diffusion model (8) as input posterior \(p(\zeta |x)=p(x^0|x^T)\).
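
The corresponding forward noisification (7) used to create the noisy x at level T is a single reparametrized Gaussian draw; a minimal sketch with illustrative names, reusing alpha_bars as defined in the sketch above:

```python
import torch

def noisify(zeta_true, T, alpha_bars):
    """Draw x = x^T ~ p(x^T | zeta_true) = N(sqrt(alpha_bar_T) zeta_true, (1 - alpha_bar_T) I), cf. (7)."""
    alpha_bar_T = alpha_bars[T - 1]
    eps = torch.randn_like(zeta_true)
    return alpha_bar_T.sqrt() * zeta_true + (1.0 - alpha_bar_T).sqrt() * eps
```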

As pointed out in Sect. 1, training the diffusion model \(\Xi\) requires a dataset of clean, unnoisy datapoints \(\zeta\), which can, in practice, be provided by a denoiser, a simulated dataset or a surrogate dataset.

3 Models

3.1 EiV model

Our approach requires two types of networks: a diffusion model \(\Xi\) that induces \(p(\zeta |x)\) via (8) with \(t=T\), and a classification network \(f_\theta (\zeta )\) that induces a distribution \(p(y|\theta , \zeta )=\textrm{Cat}(y|f_\theta (\zeta ))\). The diffusion model \(\Xi\) needs to be trained, in the standard manner (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020), before the training of the classification network \(f_\theta\). In this section we discuss the training and evaluation of \(f_\theta\) given a trained \(\Xi\).

Training We implement the variational distribution \(q_\phi (\theta )\) via Monte Carlo Dropout as in (Gal & Ghahramani, 2016; Kendall & Gal, 2017), for which \(\phi\) is a vector of the same size as the network parameters. The random network parameter \(\theta\) is created from \(\phi\) using a random dropout mask m, i.e., \(\theta (\phi ) = \phi \odot m\). The regularization term \(D_{\textrm{KL}}(q_\phi (\theta )\Vert p(\theta ))\) equals approximately \(\frac{\lambda }{2} (1-p_D) \Vert \phi \Vert _2^2\), where \(p_D\) denotes the dropout probability, cf. (Gal & Ghahramani, 2016). The factor \(\lambda\) is the precision parameter of \(p(\theta )\), cf. Sect. 2.1. Replacing the first term in (4) by Monte Carlo sampling we obtain

$$\begin{aligned} \mathcal {L}^{\mathrm {M.C.}}(\phi )&:= -\frac{1}{M} \sum _{i=1}^M \log \left( \frac{1}{n_\zeta } \sum _{l=1}^{n_\zeta } p(y_{i}|\zeta _{i,l}, \theta _i) \right) + \frac{\lambda }{2N} (1-p_{D}) \Vert \phi \Vert _2^2 \,, \end{aligned}$$
(9)

where \(\theta _i \sim q_\phi (\theta )\), \(\{(x_1,y_1),\ldots ,(x_M,y_M)\} \subseteq \mathcal {D}\) is a minibatch of size M and the \(\zeta _{i,l} \sim p(\zeta _i |x_i)\) denote \(n_\zeta\) samples from the diffusion model for the same input \(x_i\). Recall that \(p(y_i | \zeta _{i,l},\theta _i)\) is simply the evaluation of the neural network \(f_{\theta _i}\) at input \(\zeta _{i,l}\) with the dropout realization \(\theta _i\). Note that in (9) we use the same dropout realization \(\theta _i\) for all \(n_\zeta\) inputs \(\zeta _{i,1},\ldots , \zeta _{i,n_\zeta }\); this is necessary due to the non-linear logarithm outside the inner expectation in the loss function (4).
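
For illustration, a minimal PyTorch sketch of the data term of (9). It assumes that log_probs already holds \(\log \textrm{Cat}(\cdot |f_{\theta _i}(\zeta _{i,l}))\) for a minibatch, computed with one dropout realization \(\theta _i\) per datapoint that is shared over its \(n_\zeta\) denoised inputs (cf. the dropout sketch in Sect. 4.2); the names are illustrative and the \(L_2\) term of (9) is added separately:

```python
import torch

def eiv_data_term(log_probs, y):
    """First term of the Monte Carlo loss (9).

    log_probs: (M, n_zeta, num_classes) -- log p(y | zeta_{i,l}, theta_i) for all classes
    y:         (M,)                     -- integer class labels y_i
    """
    M, n_zeta, _ = log_probs.shape
    # select log p(y_i | zeta_{i,l}, theta_i) for the observed label y_i
    log_p_y = log_probs.gather(2, y.view(M, 1, 1).expand(M, n_zeta, 1)).squeeze(-1)
    # log( (1/n_zeta) * sum_l p(y_i | zeta_{i,l}, theta_i) ), computed stably via logsumexp
    log_inner_mean = torch.logsumexp(log_p_y, dim=1) - torch.log(torch.tensor(float(n_zeta)))
    return -log_inner_mean.mean()  # average over the minibatch
```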

Evaluation Once we have trained the classification network via (9), we can evaluate it on a new input datapoint x by sampling \(\zeta \sim p(\zeta |x)\) (from the diffusion model) and \(\theta \sim q_\phi (\theta )\) (using dropout in our case) and evaluating the network as \(p(y|\zeta ,\theta )\). For example, by repeatedly sampling \(\zeta\) and \(\theta\) we obtain the posterior predictive distribution of y given x:

$$\begin{aligned} p(y|x) = \mathbb {E}_{\theta \sim q_\phi (\theta ), \zeta \sim p(\zeta |x)} [p(y|\zeta ,\theta )]. \end{aligned}$$
(10)
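
A corresponding sketch of the Monte Carlo estimate of (10) at test time, assuming a trained classifier (kept in training mode so that dropout remains active, i.e., \(\theta \sim q_\phi (\theta )\)) and the illustrative helper sample_zeta from Sect. 2.2:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def posterior_predictive(classifier, mu_net, x, T, betas, n_zeta=50, n_theta=50):
    """Monte Carlo estimate of p(y|x) in (10) for a single noisy input x of shape (C, H, W)."""
    classifier.train()                                         # keep dropout active: theta ~ q_phi(theta)
    probs = []
    for _ in range(n_zeta):
        zeta = sample_zeta(mu_net, x.unsqueeze(0), T, betas)   # zeta ~ p(zeta | x), cf. (8)
        for _ in range(n_theta):
            probs.append(F.softmax(classifier(zeta), dim=-1))  # p(y | zeta, theta)
    return torch.stack(probs).mean(dim=0).squeeze(0)           # average over zeta and theta samples
```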

3.2 Non-EiV and AIV models for comparison

We compare our model with two baseline methods. One of them is a standard Bayesian non-EiV model, cf. (1), based on Bernoulli Dropout (Gal & Ghahramani, 2016). A comparison of EiV with this model evaluates the benefit of using the statistical model (2) instead of (1). To exclude the possibility that the only benefit gained is due to denoising, we further compare our model with what we call Average Input Variable (AIV). To evaluate AIV on an input x, we first “denoise” it by drawing \(n_\zeta\) samples \(\zeta _l\sim p(\zeta |x)\) from the diffusion model and then take their average \(\overline{\zeta } = \frac{1}{n_\zeta } \sum _{l=1}^{n_\zeta } \zeta _l\) as input to the AIV model. For both baselines we use the same architecture for the classification network \(f_\theta\) as for the EiV model.

Training The non-EiV and AIV models in this work are trained using the standard loss function arising from variational inference via Bernoulli Dropout (Gal & Ghahramani, 2016):

$$\begin{aligned} -\frac{1}{M} \sum _{i=1}^M \log p_{i} + \frac{\lambda }{2N} (1-p_{D}) \Vert \phi \Vert _2^2 \,. \end{aligned}$$

For non-EiV we take \(p_{i} = \textrm{Cat}(y_i |f_{\theta _i}(x_i))\) whereas for AIV we take \(p_{i} = \textrm{Cat}(y_i |f_{\theta _i}(\overline{\zeta }_i))\) where \(\overline{\zeta }_i = \frac{1}{n_\zeta }\sum _{l=1}^{n_\zeta } \zeta _{i,l}\) with \(\zeta _{i,l} \sim p(\zeta _i|x_i)\) being \(n_\zeta\) draws from the diffusion model.

Evaluation For the evaluation of the non-EiV and AIV models we use their posterior predictive distributions, which are given by \(p(y|x) = \mathbb {E}_{\theta \sim q_\phi (\theta )} [p(y|x,\theta )]\) for non-EiV and, similarly, by \(p(y|\overline{\zeta }) = \mathbb {E}_{\theta \sim q_\phi (\theta )} [p(y|\overline{\zeta },\theta )]\) with \(\overline{\zeta }=\frac{1}{n_\zeta } \sum _{l=1}^{n_\zeta } \zeta _{l},\,\zeta _{l}\sim p(\zeta | x)\), for AIV.
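
For contrast with (10), the AIV predictive distribution averages the denoised inputs first and only then averages over dropout realizations; a sketch with the same illustrative helpers as above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def aiv_predictive(classifier, mu_net, x, T, betas, n_zeta=50, n_theta=50):
    """AIV: classify the average bar{zeta} of the diffusion samples, averaging only over theta."""
    classifier.train()                                              # keep dropout active
    zetas = [sample_zeta(mu_net, x.unsqueeze(0), T, betas) for _ in range(n_zeta)]
    zeta_bar = torch.stack(zetas).mean(dim=0)                       # bar{zeta} = (1/n_zeta) sum_l zeta_l
    probs = [F.softmax(classifier(zeta_bar), dim=-1) for _ in range(n_theta)]
    return torch.stack(probs).mean(dim=0).squeeze(0)                # E_theta[ p(y | bar{zeta}, theta) ]
```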

4 Experiments

Algorithm 1: EiV optimization

4.1 Experimental setup

In our experiments we compared the accuracy and calibration of EiV, non-EiV and AIV. We evaluated our approach on the MNIST, CIFAR10 and CIFAR100 datasets. All experiments were conducted with three seeds; we report the average of the results together with their standard deviation.

Generation of noisy and denoised data Algorithm 1 demands denoising each noisy sample on the fly. However, this is computationally infeasible with our choice of NN for approximating samples from the input posterior \(p(\zeta | x^t)\). Hence, for our experiments we approximate Algorithm 1 by storing denoised datasets with a limited number of denoised samples and performing the optimization on these fixed datasets. For each dataset we trained classification networks at different noise levels. The noise levels are parametrized by the time parameter T of Sect. 2.2. For each T we created noisy datapoints by adding Gaussian noise of level T to the original images \(\zeta ^{\textrm{true}}_i\) via (7) to obtain noisy images \(x_i = x^T_i\). From these, together with the labels \(y_i\) from the unnoisy dataset, we generated a noisy dataset \(\mathcal {D}_x^T = \{(x_i, y_i)\}\) and a denoised dataset \(\mathcal {D}_{\zeta }^T = \{((\zeta _{i,l})_{l=1,\ldots ,n_\zeta }, y_i )\}\), where the \(\zeta _{i,l}\) are \(n_\zeta =5\) samples of the diffusion model (8) for the same input \(x_{i}\) evaluated at the corresponding level T. We used \(\mathcal {D}_x^T\) for the training of the non-EiV model and \(\mathcal {D}_{\zeta }^T\) for the training of the EiV and AIV models, using the loss functions explained in Sects. 3.1 and 3.2. For the evaluation, we created test data in the same manner but used \(n_\zeta =50\) instead. Noisified and denoised samples from the data generation process are visualized in Appendix D.
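
The construction of the stored datasets \(\mathcal {D}_x^T\) and \(\mathcal {D}_{\zeta }^T\) can be sketched as follows, reusing the illustrative helpers noisify and sample_zeta from Sect. 2.2 (a simplified sketch of the procedure described above, not the authors' code):

```python
import torch

@torch.no_grad()
def build_datasets(clean_data, mu_net, T, betas, n_zeta=5):
    """Create the noisy dataset D_x^T and the denoised dataset D_zeta^T for noise level T."""
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    noisy, denoised = [], []
    for zeta_true, y in clean_data:                                # zeta_true: (C, H, W), y: label
        x_T = noisify(zeta_true.unsqueeze(0), T, alpha_bars)       # (7)
        zetas = torch.cat([sample_zeta(mu_net, x_T, T, betas)      # (8), n_zeta samples per x_T
                           for _ in range(n_zeta)], dim=0)
        noisy.append((x_T.squeeze(0), y))                          # element of D_x^T
        denoised.append((zetas, y))                                # element of D_zeta^T
    return noisy, denoised
```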

Accuracy As a measure of performance, we evaluated the accuracy of the EiV, non-EiV and AIV models by comparing the classes predicted by the posterior predictive distribution, computed as in Sects. 3.1 and 3.2, with the actual labels \(y_i\) in the data.

Calibration of the posterior predictive distribution For a well-calibrated model, its confidence is a good approximation of the actual probability of correctness. DNNs are commonly not sufficiently calibrated because they tend to yield overconfident predictions (Guo et al., 2017; Li & Hoiem, 2018). This can be visualized by a reliability diagram. In our experiments we evaluated the posterior predictive distributions of the EiV, AIV and non-EiV approaches with two common calibration measures and compared their reliability diagrams. The expected calibration error \(\textrm{ECE} = \sum _k^K \frac{|b_k |}{N} |\textrm{acc}(b_k) - \textrm{conf}(b_k) |\) bins the N predictions into K equally sized intervals to estimate the deviation of the accuracy from the confidence of the model (Naeini et al., 2015). The \(k^{\textrm{th}}\) bin \(b_k\) contains \(|b_k |\) samples and \(\sum _k |b_k | = N\); \(\textrm{acc}(b_k)\) and \(\textrm{conf}(b_k)\) are the accuracy and average confidence of the samples in bin \(b_k\). A limitation of the ECE is that it only uses the probability of the predicted class. In contrast, the static calibration error \(\textrm{SCE} = \sum _{k,c}^{K,C} \frac{|b_{k,c} |}{N C} |\textrm{acc}(b_{k,c}) - \textrm{conf}(b_{k,c}) |\) calculates a classwise ECE and averages over all C classes (Nixon et al., 2019). A precise and concise description of these measures can be found in Gawlikowski et al. (2021).
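
A minimal implementation of the ECE as defined above, using equally wide confidence bins (function and argument names are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE = sum_k |b_k|/N * |acc(b_k) - conf(b_k)| over K equally wide confidence bins.

    confidences: (N,) maximal class probability of each prediction
    correct:     (N,) 1.0 if the prediction was correct, else 0.0
    """
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    N = len(confidences)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc, conf = correct[in_bin].mean(), confidences[in_bin].mean()
            ece += in_bin.sum() / N * abs(acc - conf)   # weight each bin by its relative size
    return ece
```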

Transferability of the diffusion model The experiments conducted on MNIST and CIFAR10 require a diffusion model that is trained on the true input samples \(\zeta ^{\textrm{true}}\), to which, in a typical scenario, we do not have access; generally, we only have access to the noisy dataset \(\mathcal {D}_x^T\). We bypass this shortcoming on CIFAR100 by using CIFAR10 as a surrogate dataset: we trained a diffusion model on CIFAR10 and used it to sample \(\zeta\)s for CIFAR100. Neither EiV nor AIV requires access to the \(\zeta ^{\textrm{true}}\) of CIFAR100 in this set of experiments, but only to the noisy datapoints x.

4.2 Implementation details

Diffusion model Both diffusion models \(\Xi _{\textrm{CIFAR10}}\) and \(\Xi _{\textrm{MNIST}}\) were optimized as in (Ho et al., 2020) with the same training strategy and choice of hyperparameters. Our U-Net backbone was taken from (Song et al., 2023), which is a PyTorch implementation of the backbone used in (Ho et al., 2020). We trained \(\Xi _{\textrm{CIFAR10}}\) and \(\Xi _{\textrm{MNIST}}\) for 800,000 epochs and kept the network parameters with the lowest loss value.

Training To train the classification networks we used the Adam optimizer, a batch size of 256, an \(L_2\) penalty of 0.0001, gradient clipping at 0.1, and a learning rate of \(lr = 0.001\). During the first \(0.25 \cdot n_{\textrm{epoch}}\) epochs we linearly increased the learning rate to lr; afterwards, the learning rate decayed linearly. We used standard data normalization techniques during training. To classify CIFAR10 and CIFAR100 (Krizhevsky, 2009), a modified version of ResNet9 (He et al., 2015) was used in which an additional fully connected layer was added and all fully connected layers are augmented by dropout layers. We call this architecture ResNet9DO and trained it for \(n_{\textrm{epoch}} = 100\) epochs. An MLP with one hidden layer and dropout layers before each fully connected layer was trained on MNIST (LeCun et al., 1998) for \(n_{\textrm{epoch}} = 80\) epochs. The dropout rate \(p_D\) of all networks is 0.5. The code was implemented in PyTorch (Paszke et al., 2019), and we used Hydra (Yadan, 2019) to manage the different configurations.
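
A minimal sketch of this optimizer and learning-rate schedule in PyTorch (names are illustrative and the decay endpoint of zero is an assumption):

```python
import torch

def configure_training(model, n_epoch=100, lr=1e-3, weight_decay=1e-4):
    """Adam with L2 penalty; the learning rate rises linearly to lr over the first quarter
    of the epochs and decays linearly afterwards (assumed here to reach zero at the end)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup = int(0.25 * n_epoch)

    def lr_factor(epoch):
        if epoch < warmup:
            return (epoch + 1) / warmup                           # linear warm-up to lr
        return max(0.0, (n_epoch - epoch) / (n_epoch - warmup))   # linear decay
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler
```

The gradient clipping of 0.1 would be applied separately inside the training loop, e.g. via torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1).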

Training algorithm Pseudocode of the training algorithm and the loss function (9) is provided in Algorithm 1. The \(L_2\) regularization of (9) was implemented via Adam's \(L_2\) penalty (Kingma & Ba, 2014; Loshchilov & Hutter, 2017). EiV training requires applying the same dropout mask to all \(\zeta _{i,l}\); since PyTorch's dropout layer does not offer this option, we used a slightly modified dropout implementation to satisfy this condition.
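
The shared-mask requirement could, for example, be met by a dropout module along the following lines (a sketch under the assumption that the \(n_\zeta\) samples of each datapoint are carried along a dedicated tensor axis; this is not the authors' implementation):

```python
import torch
import torch.nn as nn

class SharedMaskDropout(nn.Module):
    """Dropout that draws one mask per datapoint i and reuses it for all n_zeta inputs zeta_{i,l}."""

    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x has shape (M, n_zeta, features); one mask per row i, broadcast over the n_zeta axis
        if not self.training or self.p == 0.0:
            return x
        mask_shape = (x.shape[0], 1) + tuple(x.shape[2:])
        keep = torch.full(mask_shape, 1.0 - self.p, device=x.device, dtype=x.dtype)
        mask = torch.bernoulli(keep)
        return x * mask / (1.0 - self.p)  # inverted-dropout scaling, as in nn.Dropout
```

Standard nn.Dropout would instead draw an independent mask for every \(\zeta _{i,l}\), which corresponds to a different \(\theta\) inside the logarithm of (9).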

5 Results

Table 1: Performance comparison of the EiV, non-EiV and AIV models for the datasets considered in this work and different noise levels T

Table 1 summarizes our main results for MNIST, CIFAR10 and CIFAR100. Here, we compare the accuracy, ECE and SCE of the three considered models, EiV, non-EiV and AIV, at different noise levels T. For each noise level, the best metric values are highlighted in bold. We observe that EiV clearly outperforms non-EiV and AIV in all three metrics. Only for the rather simple case of MNIST does AIV show a performance comparable to EiV, with a slightly better calibration but lower accuracy. The results in Table 1 can be seen as a proof of principle that a proper treatment of EiV indeed improves performance and reliability, in line with statistical expectations.

Most importantly, the CIFAR100 experiment demonstrates the practical usability of our approach: a diffusion model trained on a surrogate dataset can suffice to obtain a more accurate and better calibrated model with EiV. Moreover, for CIFAR10 and CIFAR100, the comparison with AIV shows that the observed advantage of EiV is not due to the denoising process but is actually rooted in the proper usage of the underlying statistical model. This observation is not a peculiarity of the Bayesian treatment of the network parameter; similar results for the (non-Bayesian) ResNet9 architecture can be found in Appendix B. In the following we discuss the outcomes of our experiments in more detail: improved accuracy, a better calibrated posterior predictive distribution, and the usability of surrogate datasets.

Fig. 2: Accuracy of the trained networks on the test dataset. The networks were trained with \(n_{\zeta } = 5\) but evaluated with \(n_{\zeta } = n_{\theta } = 50\). MNIST was denoised with \(\Xi _{\textrm{MNIST}}\) and CIFAR10/100 were denoised with \(\Xi _{\textrm{CIFAR10}}\). EiV (black, dashed) generally outperforms AIV (blue, solid) and non-EiV (magenta, dotted). Results were averaged over three seeds. The standard deviations are shown as shaded areas (Color figure online)

Accuracy gain Fig. 2 visualizes the accuracy of the three statistical models for different noise levels T. For all datasets the EiV model (black) outperforms the non-EiV model (magenta). On MNIST we see that the maximum accuracy of the EiV model is retained up to noise level \(T = 241\) and only afterwards starts to drop gradually. This might be explained by the image reconstruction process: if the noise level is too high, the diffusion model “reconstructs” a different digit than the true digit on which the noisification was performed. In Appendix D we present a number of such cases. On CIFAR10 the performance gain increases at higher noise levels: e.g. at noise level \(T=151\), the EiV model shows an accuracy improvement of \(6.1\%\) over non-EiV. For CIFAR100 and \(T> 2\) the accuracy of the EiV model is at least \(2.2\%\) higher than that of the non-EiV model. This demonstrates that using EiV with a surrogate dataset to learn the input posterior can yield a model that is more accurate than one that does not account for errors in the input variables.

The comparison with AIV (blue) shows that the improved accuracy is not an artefact of the denoising process itself but is related to the proper statistical treatment of EiV. Even though the EiV and AIV models perform similarly on the simplest dataset, MNIST (their curves in Fig. 2 are almost congruent), the EiV model is superior on the more complex datasets, CIFAR10 and CIFAR100. On CIFAR10, for instance, the EiV model has a \(1.1\% - 2.5\%\) higher accuracy for \(T> 2\). This superiority of EiV is most obvious on CIFAR100, where the AIV model does not lead to any performance gain compared to the non-EiV model.

More reliable posterior predictive distribution For all three models we checked the calibration of the posterior predictive distribution p(y|x), i.e., whether its value can be expected to reflect the actual accuracy. This is visualized in the reliability diagrams in Fig. 3 for CIFAR10 and CIFAR100, which plot the bin accuracy against the average bin confidence. The size of the markers is a function of the number of samples in each bin. We observe that the EiV model leads to an improved calibration in all considered cases. For CIFAR10 the EiV model (black) is almost perfectly calibrated, whereas non-EiV (magenta) and AIV (blue) are clearly overconfident. For CIFAR100, where a surrogate dataset was used, EiV still shows a significantly improved calibration compared to AIV and non-EiV. The remaining reliability diagrams in Appendix A support this conclusion. These observations are confirmed by the summary statistics presented in Table 1: EiV has the lowest ECE and SCE on CIFAR10 and CIFAR100 at all noise levels. In addition, we observe a drawback of the AIV model, as it leads to a very high ECE and SCE. The preprocessing of the input seems to make AIV more confident than non-EiV even in cases where its prediction accuracy is lower, as can be seen on CIFAR100. The increased reliability of the posterior predictive distribution under an EiV treatment is also demonstrated by the common NLL metric, which we include in Table 3 in Appendix C.

True inputs not needed Figs. 2 and 3 and Table 1 consistently demonstrate that to perform EiV successfully, it can be sufficient to use a diffusion model trained on a similar but different dataset. Our experiments on CIFAR100 show that the EiV model leads to an improved model accuracy and is better calibrated, even though the diffusion model \(\Xi _{\textrm{CIFAR10}}\) was trained on the surrogate dataset CIFAR10. Note that this property does not hold for AIV, which does not perform better than non-EiV on CIFAR100.

Results without dropout So far we have supplemented a BNN with an input posterior to compare the performance and calibration of EiV, AIV and non-EiV. In the appendix we show that qualitatively the same behaviour is observed for a standard ResNet9 without any dropout mask, both with (EiV and AIV) and without (non-EiV) an input posterior. Figures 6 and 7 and Table 4 suggest that the proposed Bayesian treatment of the input is beneficial even when it is not combined with a Bayesian treatment of \(\theta\).

6 Conclusions

Summary In this work, we have shown how the statistical concept of EiV can be applied in deep learning. We have used diffusion models as input posteriors, i.e., to obtain a distribution over the denoised inputs. Concretely, we have empirically shown the potential benefit of EiV in the context of image classification in cases where the classification network has no access to the unnoisy input data but only to noisy input data. In particular, if a diffusion model is used as an input posterior, the EiV model clearly outperforms the non-EiV and AIV models in prediction performance and model calibration. In addition, our experiments on CIFAR100 show that access to the unnoisy data may not be required at all: even if the diffusion model is trained on a dataset of unnoisy images that differs from the dataset on which the classification network is applied, EiV can lead to better calibrated and more precise predictions.

Outlook In our experiments, input errors were caused by Gaussian noise. A further possible application of our model is to images corrupted by other types of noise (Boncelet, 2009; Gravel et al., 2004) or to images with missing regions. In the latter case, using diffusion models as input posteriors is promising because these networks have already been applied successfully to inpainting tasks (Lugmayr et al., 2022). Another research direction is to verify our results on larger images.