Abstract
Errors-in-Variables is the statistical concept used to explicitly model input variable errors caused, for example, by noise. While it has long been known in statistics that not accounting for such errors can produce a substantial bias, the vast majority of deep learning models have thus far neglected Errors-in-Variables approaches. Reasons for this include a significant increase of the numerical burden and the challenge in assigning an appropriate prior in a Bayesian treatment. To date, the attempts made to use Errors-in-Variables for neural networks do not scale to deep networks or are too simplistic to enhance the prediction performance. This work shows for the first time how Bayesian deep Errors-in-Variables models can increase the prediction performance. We present a scalable variational inference scheme for Bayesian Errors-in-Variables and demonstrate a significant increase in prediction performance for the case of image classification. Concretely, we use a diffusion model as input posterior to obtain a distribution over the denoised image data. We also observe that training the diffusion model on an unnoisy surrogate dataset can suffice to achieve an improved prediction performance on noisy data.
1 Introduction
The performance of a neural network (NN) is known to be quite sensitive to the nature and quality of the data used for training (Dodge & Karam, 2016; Zhou et al., 2018; Budach et al., 2022). Noise or other errors present in the data can have a profound impact on the performance and reliability of the trained network (Nazaré et al., 2018; Gupta & Gupta, 2019; Song et al., 2023). Artefacts such as labelling errors (Song et al., 2023) or image noise (Boncelet, 2009) are a ubiquitous phenomenon in real-world data such as medical recordings (Pusey et al., 1986; Gravel et al., 2004; Goyal et al., 2018). While the effect of label noise is rather well understood (Rolnick et al., 2017; Jiang et al., 2020; Song et al., 2023) and is accounted for in the statistical model underlying the typical loss functions, cf. (1) below, the same is not true for errors in the input variable. Such a scenario is known in statistics as Errors-in-Variables (Fuller, 2009), but has so far received only limited attention in the deep learning community. One major obstacle to applying this statistical concept is its computational cost (cf. Section 1.1). To our knowledge, we present for the first time a scalable Errors-in-Variables framework in deep learning that leads to an improved prediction performance in the context of image classification.
Classification of images x via a NN \(f_\theta\) parametrized by \(\theta\) and softmax output can be interpreted as a statistical model for the labels y through
\[ y \mid x, \theta \sim \textrm{Cat}\big(f_\theta (x)\big), \tag{1} \]
where \(\textrm{Cat}\) denotes the categorical distribution. Note that while this model considers y to be noisy, it does not explicitly account for uncertainty in its input x. In many cases, this is not adequate, given that real images are often subject to noise (Boncelet, 2009). In such a scenario, a more suitable statistical model is the Errors-in-Variables (EiV) model (Fuller, 2009) that we employ in this work:
\[ y \mid \zeta , \theta \sim \textrm{Cat}\big(f_\theta (\zeta )\big), \qquad x \mid \zeta \sim p(x|\zeta ). \tag{2} \]
The model (2) makes the assumption that there is a true but unknown \(\zeta\) that underlies the observed x, which is subject to error. For linear regression problems, it is well known that not accounting for an error in the input will lead to a bias (Fuller, 2009). Similar observations have been made for general nonlinear models (Schennach, 2012; Chesher, 1991). For linear regression and multinomial logit regression, which can be seen as the linear analog of (1), explicit expressions for this bias can be deduced (Fuller, 2009; Kao & Schnell, 1987; Stefanski & Carroll, 1985).
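The bias for linear regression mentioned above can be made concrete with a small simulation. The sketch below is an illustration only: the slope, the noise levels and the sample size are made-up toy values, and the closed-form comparison uses the classical attenuation factor \(\sigma _\zeta ^2 / (\sigma _\zeta ^2 + \sigma _x^2)\) for ordinary least squares with additive input noise.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


rng = random.Random(0)
n = 20000
true_slope = 2.0   # toy value (assumption)
sigma_zeta = 1.0   # standard deviation of the true input zeta
sigma_x = 1.0      # standard deviation of the input error

zeta = [rng.gauss(0.0, sigma_zeta) for _ in range(n)]
y = [true_slope * z + rng.gauss(0.0, 0.1) for z in zeta]
x = [z + rng.gauss(0.0, sigma_x) for z in zeta]   # observed, noisy input

slope_clean = ols_slope(zeta, y)   # regression on the true input
slope_noisy = ols_slope(x, y)      # regression that ignores the input error
attenuated = true_slope * sigma_zeta**2 / (sigma_zeta**2 + sigma_x**2)

print(f"slope on true input:  {slope_clean:.3f}")
print(f"slope on noisy input: {slope_noisy:.3f} (theory: {attenuated:.3f})")
```

Regressing on the noisy input shrinks the estimated slope towards zero, and collecting more data does not remove this effect.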
Fig. 1 Values of the posterior predictive distribution p(y|x) on the simplex for \(10^4\) test points of a noisification of CIFAR10 (noise level \(T=91\), cf. Sect. 4) with the Bayesian version of the non-EiV model (1) on the left and the EiV model (2) proposed in this work on the right. To illustrate the general prediction trend, CIFAR10 was reduced to three metaclasses as indicated by the labels on the vertices in this plot. Points that were correctly classified by p(y|x) are marked in cyan, points that were wrongly classified are indicated in red. Red points are plotted in the foreground (Color figure online)
The literature on supervised deep learning almost exclusively builds on statistical models without Errors-in-Variables (called non-EiV models in this work), such as (1). The effect of errors in the input is therefore barely known. Our work builds on one of the few works on this topic (Martin & Elster, 2023), which applies a Bayesian treatment (Gal & Ghahramani, 2016; Kendall & Gal, 2017) of EiV as in (2). A Bayesian handling of (2) allows one to model the distribution of the unknown \(\zeta\) given x by some distribution \(p(\zeta |x)\), which we here call the input posterior. The exact form of this distribution will depend on the applied Bayesian modelling. Although the method from (Martin & Elster, 2023) improved the obtained uncertainties with its choice of \(p(\zeta |x)\), it did not lead to an improvement of the prediction performance. We substantially improve on this by using a neural network to model the input posterior \(p(\zeta |x)\). We chose a diffusion model (Ho et al., 2020) as input posterior, i.e., to sample \(\zeta\) from \(p(\zeta |x)\). In principle, different types of neural networks could be used to model \(p(\zeta |x)\). In contrast to how diffusion models are typically applied in the literature (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020; Croitoru et al., 2023), our diffusion model does not generate an image from pure Gaussian noise but instead starts at a specified noise level that matches the expected noise in the image, cf. Section 2.2 below. In using the diffusion model on noisy input data to sample \(\zeta\), we observe that EiV (2) leads to a substantial increase in prediction performance of the trained model in comparison to non-EiV. We thereby provide a proof of principle that treating errors in the inputs via an adequate statistical model can indeed boost the performance of a trained NN.
Figure 1 compares the non-EiV and EiV predictions for a noisy version of CIFAR10 (cf. Section 4). In this plot, we reduced the ten classes of CIFAR10 to three meta classes as indicated by the labels on the vertices of the 2-simplices shown. Correctly and incorrectly classified images are plotted in cyan and red, respectively. The left-hand side of Fig. 1 shows the values of the posterior predictive distribution p(y|x) yielded by the Bayesian treatment (Gal & Ghahramani, 2016) of the non-EiV model (1). The right plot illustrates the corresponding performance obtained with the Bayesian treatment of the EiV model (2). In Fig. 1 we immediately note that the EiV model has far fewer red points and hence demonstrates a substantially higher accuracy than the non-EiV model. Moreover, the non-EiV plot has an accumulation of red points in the vertices of the simplex (i.e. wrong predictions with high confidence). This accumulation is substantially reduced by the EiV model, indicating that the learned statistical model yields a more reliable prediction of class probabilities. In Sects. 2 and 3 details of the computation are provided and in Sect. 4 we will indeed show that EiV consistently leads to a substantially better calibration of the posterior predictive distribution.
The key component that leads to this improvement in performance and reliability is to use a NN to model the input posterior \(p(\zeta |x)\). In case of diffusion models, a dataset of unnoisy datapoints is required. For the purpose of providing a proof of principle we will in this work partially use the actual unnoisy dataset underlying the studied problem. In practice, such a dataset could be obtained, for example, from a simulation (Nikolenko, 2019), by measuring some inputs with a higher accuracy (Plenge et al., 2012), or via a surrogate dataset. We will also demonstrate, on the example of CIFAR100 with CIFAR10 used as a surrogate dataset, the usability of the latter for the proposed EiV method.
This work is structured as follows: in Sects. 2 and 3 we explain the fusion of Bayesian neural networks (BNNs) with the EiV model (2) and explain how diffusion models can be used to model \(p(\zeta |x)\). The practical implementation is discussed in Sect. 3.1. In Sects. 4 and 5 we compare the performance and calibration of EiV, non-EiV and Average Input Variable (AIV). In AIV, we use the diffusion model only to denoise the input variables but do not take the underlying statistical model into account. We finish by presenting our conclusions in Sect. 6.
1.1 Related work
EiV models have long been explored and applied in the statistical literature, from the point of view of both classical (Fuller, 2009; Van Huffel, 1997) and Bayesian statistics, cf., e.g., Dellaportas and Stephens (1995); Leonard (2011). In an EiV model, not only is the observed data y modelled as noise-corrupted, but the latent (i.e. non-observable) explanatory variable \(\zeta\) is also inferred from the observed, noise-corrupted data x. However, while there is some literature on noisy inputs for neural networks, e.g. (Im et al., 2015; Loquercio et al., 2020; Nazaré et al., 2018; Ivanovic et al., 2022), EiV models have hardly been studied in deep learning. There is an older body of literature from the 1990s and early 2000s (Bassu et al., 1999; Van Gorp et al., 1998; Sragner & Horvath, 2003; Seghouane & Fleury, 2001) which considers EiV in a manner similar to that found in frequentist statistics (Fuller, 2009) for rather simple, often one-dimensional, problems. These approaches have to evaluate the model at testing time using the noisy input x, since the true input \(\zeta\) is unknown. Given a model for \(p(\zeta |x)\), Bayesian approaches offer a way around this problem. However, most existing works in this direction (Wright, 1999; Wright et al., 2000; Pavone et al., 2018; Zhang et al., 2011; Yuan et al., 2020) rely on Markov Chain Monte Carlo, which is computationally prohibitive for deep neural networks (DNNs) and higher dimensional problems. For the sake of computing uncertainties, a Laplace approximation (Wright, 1999; Pavone et al., 2018) can be used, though doing so requires the computation of the Hessian. Our work is conceptually closest to (Martin & Elster, 2023), where variational inference based BNNs (Gal & Ghahramani, 2016; Kendall & Gal, 2017; Gal et al., 2017; Blundell et al., 2015; Kingma et al., 2015; Duvenaud et al., 2016) are combined with an EiV handling of the input.
To our knowledge, (Martin & Elster, 2023) is the only approach that is scalable to DNNs and higher dimensional problems, but it does not provide any increase in prediction performance in contrast to the approach presented in this work.
2 Background
2.1 Bayesian EiV for neural networks
In supervised learning, NNs model the relationship between an input variable \(x_i\) and its target \(y_i\). In the training phase the network is optimized w.r.t. the parameter \(\theta\) that best characterizes the distribution \(p(y|x,\theta )\) given a training dataset \(\mathcal {D}=\{(x_1,y_1),\ldots , (x_N,y_N)\}\) of pairs of inputs \(x_i\) and targets \(y_i\).
Bayesian neural networks While non-Bayesian approaches find an optimal value for \(\theta\) by minimizing some loss function such as the negative log-likelihood \(-\sum _{i=1}^N \log p(y_i|x_i,\theta )\), BNNs (Gal & Ghahramani, 2016; Gal et al., 2017; Kingma et al., 2015; Kendall & Gal, 2017; Blundell et al., 2015; Zhang et al., 2018; Duvenaud et al., 2016) treat \(\theta\) as a random variable and learn a variational distribution \(q_\phi (\theta )\) that approximates the computationally infeasible posterior \(p(\theta |\mathcal {D})\):
\[ q_\phi (\theta ) \approx p(\theta |\mathcal {D}) \propto p(\theta ) \prod _{i=1}^N p(y_i|x_i,\theta ), \tag{3} \]
where \(p(\theta )\) is some prior distribution such as \(p(\theta )=\mathcal {N}(\theta |0,\lambda ^{-1} \textbf{I})\). The variational distribution in (3) is obtained by choosing the optimal hyperparameter \(\phi\) for a family \((q_\phi )_\phi\) of distributions, which is typically chosen such that both sampling from it and inferring an optimal \(\phi\) are feasible. This optimal \(\phi\) is often obtained by minimizing the Kullback–Leibler divergence \(\mathcal {L}(\phi )=D_{\textrm{KL}}(q_\phi (\theta )\Vert p(\theta |\mathcal {D}))\) between the variational distribution \(q_{\phi }(\theta )\) and the posterior \(p(\theta | \mathcal {D})\).
EiV Errors-in-Variables considers the input variable x as noisy. More precisely, the NN induces a distribution \(p(y|\zeta , x)\) with a true input \(\zeta\), which is unknown, and all we can observe are noisy versions x of \(\zeta\) that follow some distribution \(p(x|\zeta )\).
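As was shown in (Martin & Elster, 2023), once we have access to an input posterior \(p(\zeta |x) \propto p(\zeta ) \cdot p(x|\zeta )\) that describes our knowledge about \(\zeta\) given its noisy version x, the variational approach from (3) can be adapted to the EiV setup, namely
\[ q_\phi (\theta ) \approx p(\theta |\mathcal {D}) \propto p(\theta ) \prod _{i=1}^N \int p(y_i|\zeta _i,\theta )\, p(\zeta _i|x_i)\, \textrm{d}\zeta _i, \]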
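for which the Kullback–Leibler loss \(D_{\textrm{KL}}(q_\phi (\theta )\Vert p(\theta |\mathcal {D}))\) turns, up to constants and rescaling, into the loss function
\[ \mathcal {L}(\phi ) = -\frac{1}{N} \sum _{i=1}^N \mathbb {E}_{\theta \sim q_\phi (\theta )} \Big [ \log \mathbb {E}_{\zeta _i \sim p(\zeta _i|x_i)} \big [ p(y_i|\zeta _i,\theta ) \big ] \Big ] + \frac{1}{N} D_{\textrm{KL}}(q_\phi (\theta )\Vert p(\theta )). \tag{4} \]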
The crucial ingredient required to apply EiV successfully is the choice of the input posterior \(p(\zeta |x)\). For \(p(\zeta |x)\) equal to the Dirac distribution at x, the loss (4) simply coincides with the classical variational inference loss used e.g. in (Gal & Ghahramani, 2016; Gal et al., 2017; Kendall & Gal, 2017). Next we show how diffusion models can be used as input posteriors.
2.2 Diffusion models as input posterior \(p(\zeta |x)\)
In this work we propose to use diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020; Croitoru et al., 2023) to obtain an informative input posterior distribution \(p(\zeta |x)\) for (4).
Diffusion model Suppose we are given images \(\zeta\) from an (unknown) data distribution \(p(\zeta )\). Starting from an image \(x^0=\zeta\) we iteratively substitute the pixels of the image by Gaussian noise via a diffusion process (Sohl-Dickstein et al., 2015; Ho et al., 2020). More precisely, at each time step t, the conditionals of this process are given by
\[ p(x^t|x^{t-1}) = \mathcal {N}\big (x^t \,\big |\, \sqrt{1-\beta _t}\, x^{t-1}, \beta _t \textbf{I}\big ), \tag{5} \]
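where the \(0<\beta _t < 1\) are chosen such that \(\beta _{t}> \beta _{t-1}\). The reversed conditional distribution is given by (Sohl-Dickstein et al., 2015)
\[ p(x^{t-1}|x^{t}) = \mathcal {N}\big (x^{t-1} \,\big |\, \mu (x^t,t), \Sigma (x^t,t)\big ), \tag{6} \]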
where, in the context of diffusion models, \(\mu (x^t,t)\) and \(\Sigma (x^t,t)\) are modelled by a NN. In this work we follow (Ho et al., 2020) where the variance is set to a constant \(\Sigma (x^t, t) = \sigma ^2_t \textbf{I} = \frac{1-\overline{\alpha }_{t-1}}{1-\overline{\alpha }_{t}} \beta _t \,\textbf{I}\) with \(\overline{\alpha }_t = \prod _{s=1}^t (1-\beta _s)\). From (5) we obtain for the distribution of \(x^t\) given \(x^0=\zeta\)
\[ p(x^t|\zeta ) = \mathcal {N}\big (x^t \,\big |\, \sqrt{\overline{\alpha }_t}\, \zeta , (1-\overline{\alpha }_t) \textbf{I}\big ). \tag{7} \]
In particular, we obtain for \(t \rightarrow \infty\) an \(x^t\) that is distributed as \(\mathcal {N}(0,\textbf{I})\) and independent of \(x^0=\zeta\). To sample from the reverse distribution \(p(\zeta |x^t)\) the conditional (6) can be applied iteratively. We here summarize this iterative application, the diffusion model, as
\[ \zeta = \Xi (x^t, \varvec{z}^t) \sim p(\zeta |x^t), \tag{8} \]
where \(\Xi\) denotes t iterative evaluations of the NN \(\mu (x^{s}, s)\) and subsequent sampling \(x^{s-1} = \mu (x^{s},s) + z^s \sim p(x^{s-1}|x^{s})\), with \(z^s\sim \mathcal {N}(0, \sigma ^2_s \textbf{I})\), starting at \(s=t\) from \(x^t\). In (8) we use the abbreviation \(\varvec{z}^t=(z^t, z^{t-1},\ldots ,z^0)\).
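The iterative sampling described above can be sketched in a few lines. This is a schematic illustration rather than the authors' implementation: `toy_mu` is a hypothetical placeholder for the trained network \(\mu (x^s, s)\), images are flattened to plain lists, and the linear \(\beta\) schedule with 1000 steps is the default from Ho et al. (2020), assumed here for concreteness.

```python
import math
import random


def make_schedule(betas):
    """Return alpha_bar[0..t_max] and sigma[0..t_max] for beta_1..beta_tmax."""
    alpha_bar = [1.0]                       # empty product for t = 0
    for beta in betas:
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    sigma = [0.0]                           # index 0 is unused
    for t, beta in enumerate(betas, start=1):
        var = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * beta
        sigma.append(math.sqrt(var))
    return alpha_bar, sigma


def xi(x_t, t, mu, sigma, rng):
    """Draw one sample zeta from p(zeta | x^t) via t reverse steps."""
    x = list(x_t)
    for s in range(t, 0, -1):
        mean = mu(x, s)                     # evaluation of mu(x^s, s)
        x = [m + sigma[s] * rng.gauss(0.0, 1.0) for m in mean]
    return x


# Assumed schedule: linear betas from 1e-4 to 0.02 over 1000 steps.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
alpha_bar, sigma = make_schedule(betas)


def toy_mu(x, s):
    """Hypothetical stand-in for the trained mean network."""
    return [math.sqrt(1.0 - betas[s - 1]) * v for v in x]


rng = random.Random(0)
zeta_sample = xi([0.5, -1.0, 2.0], 91, toy_mu, sigma, rng)
print(zeta_sample)
```

With \(\overline{\alpha }_0 = 1\) the variance of the last step (\(s=1\)) is zero, so the final step returns the network mean without added noise.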
Diffusion models and EiV Song et al. (2020); Sohl-Dickstein et al. (2015); Ho et al. (2020) sample from (8) with a large \(t \rightarrow \infty\), where \(x^t\) can be considered to be distributed according to \(\mathcal {N}(0,\textbf{I})\). This allows them to generate samples starting from pure Gaussian noise. We followed these approaches for training the diffusion model \(\Xi\). However, once trained we applied \(\Xi\) to sample from the conditional (8) for a finite, but fixed \(t=T\) whose \(\overline{\alpha }_T\) describes the noise level we expect in our data. In other words we identify the \(x^T\) from the diffusion model with the x used in (2). In particular, \(p(x|\zeta ) = p(x^T|\zeta )\) is given by (7) and the input posterior \(p(\zeta |x) = p(\zeta |x^T)\) is given by (8).
In the experiments below we introduce noise of different levels T in our data. To this end, we start from the unnoisy datapoints \(\zeta ^{\textrm{true}}\) and then use \(p(x|\zeta ^{\textrm{true}})\), i.e., (7) at noise level \(t=T\). The same T is then used for the application of the diffusion model (8) as input posterior \(p(\zeta |x)=p(x^0|x^T)\).
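As an illustration of the noisification step, the following sketch draws \(x^T\) from a Gaussian with mean \(\sqrt{\overline{\alpha }_T}\, \zeta ^{\textrm{true}}\) and variance \(1-\overline{\alpha }_T\), where \(\overline{\alpha }_T = \prod _{s=1}^T (1-\beta _s)\). The linear \(\beta\) schedule is the default from Ho et al. (2020) and is an assumption here, as are the helper names and the toy input.

```python
import math
import random


def alpha_bar_at(betas, t):
    """Cumulative product prod_{s=1}^{t} (1 - beta_s); equals 1 for t = 0."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def noisify(zeta_true, t, betas, rng):
    """Sample x^t given the clean datapoint zeta_true at noise level t."""
    a = alpha_bar_at(betas, t)
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for z in zeta_true]


# Assumed schedule: linear betas from 1e-4 to 0.02 over 1000 steps.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
rng = random.Random(0)
clean = [0.2, -0.7, 1.0]                    # toy "image" with three pixels

for level in (91, 151, 241):
    a = alpha_bar_at(betas, level)
    print(level, "signal scale", round(math.sqrt(a), 3),
          "noise std", round(math.sqrt(1.0 - a), 3))
    x_noisy = noisify(clean, level, betas, rng)
```

A larger T shrinks the signal and increases the noise standard deviation, which is why T serves as the noise-level parameter.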
As pointed out in Sect. 1, training the diffusion model \(\Xi\) requires a dataset of clean, unnoisy datapoints \(\zeta\), which can, in practice, be provided by a denoiser, a simulated dataset or a surrogate dataset.
3 Models
3.1 EiV model
Our approach requires two types of networks: a diffusion model \(\Xi\) that induces \(p(\zeta |x)\) via (8) with \(t=T\) and a classification network \(f_\theta (\zeta )\) that induces a distribution \(p(y|\theta , \zeta )=\textrm{Cat}(y|f_\theta (\zeta ))\). The diffusion model \(\Xi\) needs to be trained, in the standard manner (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020) before the training of the classification network \(f_\theta\). In this section we discuss the training and evaluation of \(f_\theta\) given a trained \(\Xi\).
Training We implement the variational distribution \(q_\phi (\theta )\) via Monte Carlo Dropout as in (Gal & Ghahramani, 2016; Kendall & Gal, 2017) for which \(\phi\) equals a vector of the size of the network parameter. The random network parameter \(\theta\) is created from \(\phi\) using a random dropout mask m, i.e., \(\theta (\phi ) = \phi \odot m\). The regularization term \(D_{\textrm{KL}}(q_\phi (\theta )\Vert p(\theta ))\) equals (approximately) \(\frac{\lambda }{2} (1-p_D) \Vert \phi \Vert _2^2\), where \(p_D\) denotes the dropout probability, cf. (Gal & Ghahramani, 2016). The factor \(\lambda\) is the precision parameter of \(p(\theta )\), cf. Sect. 2.1. Replacing the first term in (4) via Monte Carlo sampling we obtain
\[ \mathcal {L}(\phi ) \approx -\frac{1}{M} \sum _{i=1}^M \log \Big ( \frac{1}{n_\zeta } \sum _{l=1}^{n_\zeta } p(y_i|\zeta _{i,l},\theta _i) \Big ) + \frac{\lambda (1-p_D)}{2N} \Vert \phi \Vert _2^2, \tag{9} \]
where \(\theta _i \sim q_\phi (\theta )\), \(\{(x_1,y_1),\ldots ,(x_M,y_M)\} \subseteq \mathcal {D}\) is a minibatch of size M and \(\zeta _{i,l} \sim p(\zeta _i |x_i)\) denote \(n_\zeta\) samples from the diffusion model for the same input \(x_i\). Recall that \(p(y_i | \zeta _{i,l},\theta _i)\) is simply the evaluation of the neural network \(f_{\theta _i}\) at input \(\zeta _{i,l}\) with the dropout realization \(\theta _i\). Note that in (9) we have the same dropout realization \(\theta _i\) for all \(n_\zeta\) inputs \(\zeta _{i,1},\ldots , \zeta _{i,n_\zeta }\). This is necessary because the non-linear logarithm sits outside the inner expectation in the loss function (4).
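The role of the logarithm sitting outside the average over the \(\zeta\) samples can be illustrated with a small sketch. It computes only the data term of the loss from precomputed probabilities \(p(y_i|\zeta _{i,l},\theta _i)\) and omits the regularization term; the function names and the numbers are made up for illustration.

```python
import math


def eiv_data_term(probs):
    """Data term of the EiV loss.

    probs[i][l] is p(y_i | zeta_{i,l}, theta_i), i.e. the probability the
    network assigns to the true label y_i for the l-th denoised sample,
    with ONE dropout realization theta_i shared across all l.
    """
    total = 0.0
    for p_i in probs:
        mean_p = sum(p_i) / len(p_i)   # Monte Carlo estimate of the inner expectation
        total -= math.log(mean_p)      # logarithm applied after averaging
    return total / len(probs)


def mean_log_term(probs):
    """For contrast only: average of the logs instead of log of the average."""
    total = 0.0
    for p_i in probs:
        total -= sum(math.log(p) for p in p_i) / len(p_i)
    return total / len(probs)


# Toy minibatch with M = 2 datapoints and n_zeta = 3 samples each.
probs = [[0.9, 0.5, 0.7],
         [0.2, 0.6, 0.4]]
print(eiv_data_term(probs), mean_log_term(probs))
```

By Jensen's inequality the first quantity never exceeds the second, and the two coincide when there is a single sample per input, which corresponds to the Dirac input posterior.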
Evaluation Once we have trained the classification network via (9) we can evaluate the network on a new input datapoint x by sampling \(\zeta \sim p(\zeta |x)\) (the diffusion model) and \(\theta \sim q_\phi (\theta )\) (using dropout in our case) and evaluating the network as \(p(y|\zeta ,\theta )\). For example, by repeatedly sampling \(\zeta\) and \(\theta\) we obtain the posterior predictive distribution of y given x:
\[ p(y|x) = \mathbb {E}_{\zeta \sim p(\zeta |x)}\, \mathbb {E}_{\theta \sim q_\phi (\theta )} \big [ p(y|\zeta ,\theta ) \big ]. \tag{10} \]
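A Monte Carlo estimate of this posterior predictive distribution could look as follows. The sketch is framework-free and uses hypothetical toy stand-ins for the diffusion sampler, the dropout sampler and the classifier; averaging over all \(n_\zeta \times n_\theta\) combinations is one possible pairing and is an assumption.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def posterior_predictive(x, sample_zeta, sample_theta, classify, n_zeta, n_theta):
    """Estimate p(y|x) by averaging p(y|zeta, theta) over sampled zeta and theta."""
    zetas = [sample_zeta(x) for _ in range(n_zeta)]
    thetas = [sample_theta() for _ in range(n_theta)]
    total = None
    for zeta in zetas:
        for theta in thetas:
            p = classify(zeta, theta)
            total = p if total is None else [a + b for a, b in zip(total, p)]
    return [v / (n_zeta * n_theta) for v in total]


rng = random.Random(0)
weights = [1.5, -0.5, 0.2]                  # toy variational parameter phi


def toy_sample_zeta(x):
    """Hypothetical input posterior: scalar input plus Gaussian jitter."""
    return x + rng.gauss(0.0, 0.3)


def toy_sample_theta():
    """Hypothetical dropout realization: phi times a Bernoulli mask."""
    return [w * (1.0 if rng.random() >= 0.5 else 0.0) for w in weights]


def toy_classify(zeta, theta):
    return softmax([w * zeta for w in theta])


p_y = posterior_predictive(0.8, toy_sample_zeta, toy_sample_theta,
                           toy_classify, n_zeta=5, n_theta=5)
print(p_y)
```

The predicted class is the index of the largest entry of the returned vector.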
3.2 Non-EiV and AIV models for comparison
We compare our model with two baseline methods. One of them is a standard Bayesian non-EiV model, cf. (1), based on Bernoulli Dropout (Gal & Ghahramani, 2016). A comparison of EiV with this model evaluates the benefit of using the statistical model (2) instead of (1). To exclude the possibility that the only benefit gained is due to denoising, we further compare our model with what we call Average Input Variable (AIV). To evaluate AIV on an x we first “denoise” it by drawing \(n_\zeta\) samples \(\zeta _l\sim p(\zeta |x)\) (the diffusion model) and then take their average \(\overline{\zeta } = \frac{1}{n_\zeta } \sum _{l=1}^{n_\zeta } \zeta _l\) as input to the AIV model. For both models we use the same architecture for the classification network \(f_\theta\) as for the EiV model.
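Training The non-EiV and AIV models in this work are trained using the standard loss function arising from variational inference via Bernoulli Dropout (Gal & Ghahramani, 2016):
\[ \mathcal {L}(\phi ) \approx -\frac{1}{M} \sum _{i=1}^M \log p_i + \frac{\lambda (1-p_D)}{2N} \Vert \phi \Vert _2^2. \tag{11} \]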
For non-EiV we take \(p_{i} = \textrm{Cat}(y_i |f_{\theta _i}(x_i))\) whereas for AIV we take \(p_{i} = \textrm{Cat}(y_i |f_{\theta _i}(\overline{\zeta }_i))\) where \(\overline{\zeta }_i = \frac{1}{n_\zeta }\sum _{l=1}^{n_\zeta } \zeta _{i,l}\) with \(\zeta _{i,l} \sim p(\zeta _i|x_i)\) being \(n_\zeta\) draws from the diffusion model.
Evaluation For evaluation of the non-EiV and AIV models we use their posterior predictive distribution, which is given by \(p(y|x) = \mathbb {E}_{\theta \sim q_\phi (\theta )} [p(y|x,\theta )]\) for non-EiV and, similarly, by \(p(y|\overline{\zeta }) = \mathbb {E}_{\theta \sim q_\phi (\theta )} [p(y|\overline{\zeta },\theta )]\) with \(\overline{\zeta }=\frac{1}{n_\zeta } \sum _{l=1}^{n_\zeta } \zeta _{l},\,\zeta _{l}\sim p(\zeta | x)\) for AIV.
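The difference between AIV and EiV lies in where the average over the denoised samples is taken. The toy sketch below ignores the dropout average over \(\theta\) and uses a hypothetical scalar classifier and made-up samples; it only shows that averaging the inputs and averaging the predictions can disagree for a non-linear network.

```python
import math


def p_class1(zeta):
    """Toy classifier: probability of class 1 for a scalar input (hypothetical)."""
    return 1.0 / (1.0 + math.exp(-4.0 * zeta))


zeta_samples = [-1.0, -1.0, 4.0]   # made-up draws from p(zeta | x)

# AIV: average the inputs first, then evaluate the classifier once.
zeta_bar = sum(zeta_samples) / len(zeta_samples)
p_aiv = p_class1(zeta_bar)

# EiV-style evaluation: evaluate every sample, then average the predictions.
p_eiv = sum(p_class1(z) for z in zeta_samples) / len(zeta_samples)

print(f"AIV: {p_aiv:.3f}   EiV: {p_eiv:.3f}")
```

In this example two of the three samples favour class 0, which the averaged prediction reflects, whereas the averaged input is pulled across the decision boundary by the single outlying sample.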
4 Experiments
4.1 Experimental setup
In our experiments we compared the accuracy and calibration of EiV, non-EiV and AIV. We evaluated our approach on the MNIST, CIFAR10 and CIFAR100 datasets. All experiments were conducted over three seeds and we provide the average of these results and their standard deviation.
Generation of noisy and denoised data Algorithm 1 requires denoising each noisy sample on the fly. However, this is computationally infeasible with our choice of NN for approximating samples from the input posterior \(p(\zeta | x^t)\). Hence, for our experiments we approximate Algorithm 1 by storing denoised datasets with a limited number of denoised samples and performing the optimization on these fixed datasets. For each dataset we trained classification networks at different noise levels. The noise levels are parametrized by the time parameter T of Sect. 2.2. For each T we created noisy datapoints by adding Gaussian noise of level T to the original images \(\zeta ^{\textrm{true}}_i\) via (7) to obtain noisy images \(x_i = x^T_i\). From these, together with the labels \(y_i\) from the unnoisy dataset, we generated a noisy dataset \(\mathcal {D}_x^T = \{(x_i, y_i)\}\) and a denoised dataset \(\mathcal {D}_{\zeta }^T = \{((\zeta _{i,l})_{l=1,\ldots ,n_\zeta }, y_i )\}\), where the \(\zeta _{i,l}\) are \(n_\zeta =5\) samples of the diffusion model (8) for the same input \(x_{i}\) evaluated at the corresponding level T. We used \(\mathcal {D}_x^T\) for the training of the non-EiV model and \(\mathcal {D}_{\zeta }^T\) for the training of the EiV and AIV models using the loss functions explained in Sect. 3.1 and Sect. 3.2. For the evaluation, we created test data in the same manner but used \(n_\zeta =50\) instead. Noisified and denoised samples from the data generation process are visualized in Appendix D.
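The caching strategy described above can be sketched as a single pass over the clean data. The two callables are hypothetical placeholders: `noisify` stands for the forward noising at level T and `sample_posterior` for one draw from the diffusion model; the toy versions below merely add Gaussian jitter.

```python
import random


def build_datasets(clean_inputs, labels, noisify, sample_posterior, n_zeta):
    """Create the noisy dataset D_x and the cached denoised dataset D_zeta."""
    noisy, denoised = [], []
    for zeta_true, y in zip(clean_inputs, labels):
        x = noisify(zeta_true)                              # one noisy version per datapoint
        noisy.append((x, y))
        samples = [sample_posterior(x) for _ in range(n_zeta)]
        denoised.append((samples, y))                       # n_zeta samples share the label y
    return noisy, denoised


rng = random.Random(0)


def toy_noisify(z):
    return [v + rng.gauss(0.0, 0.5) for v in z]


def toy_sample_posterior(x):
    return [v + rng.gauss(0.0, 0.1) for v in x]


clean = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
labels = [0, 1, 2]
d_x, d_zeta = build_datasets(clean, labels, toy_noisify, toy_sample_posterior, n_zeta=5)
print(len(d_x), len(d_zeta), len(d_zeta[0][0]))
```

Storing the samples once trades memory for compute: the expensive reverse diffusion runs once per datapoint and sample instead of once per training step.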
Accuracy As a measure of performance, we evaluated the accuracy of the EiV, non-EiV and AIV model by comparing the classes predicted by the posterior predictive distribution, computed as in Sect. 3.1 and 3.2, with the actual labels \(y_i\) in the data.
Calibration of the posterior predictive distribution For a well-calibrated model, the confidence is a good approximation of the actual probability of correctness. DNNs are commonly not sufficiently calibrated because they tend to yield overconfident predictions (Guo et al., 2017; Li & Hoiem, 2018). This can be visualized by a reliability diagram. In our experiments we evaluated the posterior predictive distribution of the EiV, AIV and non-EiV approaches on two common calibration measures and compared their reliability diagrams. The expected calibration error \(\textrm{ECE} = \sum _{k=1}^K \frac{|b_k |}{N} |\textrm{acc}(b_k) - \textrm{conf}(b_k) |\) bins the N predictions into K equally sized intervals to estimate the deviation of the accuracy from the confidence of the model (Naeini et al., 2015). The \(k^{\textrm{th}}\) bin \(b_k\) contains \(|b_k |\) samples and \(\sum _k |b_k | = N\). \(\textrm{acc}(b_k)\) and \(\textrm{conf}(b_k)\) are the accuracy and average confidence of the samples in bin \(b_k\). A limitation of the ECE is that it only uses the probability for the predicted class. In contrast, the static calibration error \(\textrm{SCE} = \sum _{k=1}^{K} \sum _{c=1}^{C} \frac{|b_{k,c} |}{N C} |\textrm{acc}(b_{k,c}) - \textrm{conf}(b_{k,c}) |\) calculates a classwise ECE and averages over all C classes (Nixon et al., 2019). A precise and concise description of these measures can be found in Gawlikowski et al. (2021).
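Both measures can be computed directly from predicted class probabilities. The following is a minimal sketch under the assumption of K equal-width bins on [0, 1], with the value 1.0 assigned to the last bin; it is not taken from the authors' code, and the default of ten bins is arbitrary.

```python
def bin_index(value, n_bins):
    """Index of the equal-width bin on [0, 1] that contains `value`."""
    return min(int(value * n_bins), n_bins - 1)


def weighted_gap(bins, n):
    """Sum over bins of |b|/n * |acc(b) - conf(b)|, skipping empty bins."""
    total = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)
            acc = sum(hit for _, hit in b) / len(b)
            total += len(b) / n * abs(acc - conf)
    return total


def ece(probs, labels, n_bins=10):
    """Expected calibration error, using only the predicted class."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)
        bins[bin_index(p[pred], n_bins)].append((p[pred], 1.0 if pred == y else 0.0))
    return weighted_gap(bins, len(labels))


def sce(probs, labels, n_bins=10):
    """Static calibration error: classwise ECE averaged over all classes."""
    n_classes = len(probs[0])
    total = 0.0
    for c in range(n_classes):
        bins = [[] for _ in range(n_bins)]
        for p, y in zip(probs, labels):
            bins[bin_index(p[c], n_bins)].append((p[c], 1.0 if y == c else 0.0))
        total += weighted_gap(bins, len(labels))
    return total / n_classes


# Two confident predictions for class 0, only one of which is correct.
probs = [[0.9, 0.1], [0.9, 0.1]]
labels = [0, 1]
print(ece(probs, labels), sce(probs, labels))
```

In this toy case the model is 90% confident but only 50% accurate, so both measures report a gap of about 0.4.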
Transferability of diffusion model The experiments we conducted on MNIST and CIFAR10 require a diffusion model that is trained on the true input samples \(\zeta ^{\textrm{true}}\), to which we do not have access in a typical scenario. Generally, we only have access to the noisy dataset \(\mathcal {D}_x^T\). We bypass this shortcoming on CIFAR100 by using CIFAR10 as a surrogate dataset. Neither EiV nor AIV requires access to the \(\zeta ^{\textrm{true}}\) from CIFAR100 in this set of experiments, but only to the noisy datapoints x. Instead, we trained a diffusion model on CIFAR10 and used it to sample the \(\zeta\) for CIFAR100.
4.2 Implementation details
Diffusion model Both diffusion models \(\Xi _{\textrm{CIFAR10}}\) and \(\Xi _{\textrm{MNIST}}\) were optimized as in (Ho et al., 2020) with the same training strategy and choice of hyperparameters. Our U-Net backbone was taken from (Song et al., 2023), which is a PyTorch implementation of the backbone used in (Ho et al., 2020). We trained \(\Xi _{\textrm{CIFAR10}}\) and \(\Xi _{\textrm{MNIST}}\) for 800,000 epochs and took the network parameter with the lowest loss value.
Training To train the classification networks we used the Adam optimizer, a batch size of 256, \(L_2\) penalty of 0.0001, gradient clipping of 0.1, and a learning rate of \(lr = 0.001\). During the first \(0.25 * n_{\textrm{epoch}}\) epochs we linearly increased the learning rate to lr. Afterwards the learning rate decayed linearly. We used standard data normalization techniques during training. To classify CIFAR10 and CIFAR100 (Krizhevsky, 2009) a modified version of ResNet9 (He et al., 2015) was used where an additional fully connected layer was added and all fully connected layers are augmented by dropout layers. We call this architecture ResNet9DO. We trained this network for \(n_{\textrm{epoch}} = 100\) epochs. An MLP with one hidden layer and dropout layers before each fully connected layer was trained on MNIST (LeCun et al., 1998). We trained this MLP for \(n_{\textrm{epoch}} = 80\) epochs. The dropout rate \(p_D\) for all networks is 0.5. The code was implemented in PyTorch (Paszke et al., 2019), and we used Hydra (Yadan, 2019) to manage different configurations.
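The learning-rate schedule described above, with a linear warm-up over the first quarter of training followed by a linear decay, can be sketched as a function of the epoch. Starting the warm-up at zero, decaying to zero at the final epoch and updating once per epoch are assumptions; the text does not specify these details.

```python
def learning_rate(epoch, n_epoch, peak_lr=0.001, warmup_fraction=0.25):
    """Piecewise-linear schedule: warm up to peak_lr, then decay linearly."""
    warmup = warmup_fraction * n_epoch
    if epoch < warmup:
        return peak_lr * epoch / warmup                      # linear increase from 0
    return peak_lr * (n_epoch - epoch) / (n_epoch - warmup)  # linear decrease to 0


# Schedule for the n_epoch = 100 setting, printed every 25 epochs.
for epoch in range(0, 101, 25):
    print(epoch, learning_rate(epoch, 100))
```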
Training algorithm Pseudocode of the training algorithm and the loss function (9) is provided in Algorithm 1. The \(L_2\) regularization of (9) was implemented by Adam’s \(L_2\) penalty (Kingma & Ba, 2014; Loshchilov & Hutter, 2017). EiV training requires applying the same dropout mask to all \(\zeta _{i,l}\). PyTorch’s dropout layer does not offer this option. We used a slightly modified dropout implementation to satisfy this condition.
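The shared-mask requirement can be illustrated without any deep learning framework. The sketch below is not the authors' modified layer: it draws one Bernoulli mask per datapoint and applies it to the activations of all \(n_\zeta\) denoised samples of that datapoint. The rescaling by \(1/(1-p_D)\) follows the convention of PyTorch's `torch.nn.Dropout` and is an assumption, as is the function name.

```python
import random


def shared_dropout(samples, p_drop, rng):
    """Apply ONE dropout mask to every activation vector in `samples`.

    samples: list of n_zeta activation vectors belonging to the same datapoint.
    Returns the masked vectors and the mask that was used.
    """
    dim = len(samples[0])
    keep = [1.0 if rng.random() >= p_drop else 0.0 for _ in range(dim)]
    scale = 1.0 / (1.0 - p_drop)   # inverted-dropout scaling (assumption)
    masked = [[v * k * scale for v, k in zip(vec, keep)] for vec in samples]
    return masked, keep


rng = random.Random(0)
# Activations of n_zeta = 3 denoised samples for one datapoint (toy numbers).
activations = [[1.0, 2.0, 3.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [9.0, 1.0, 2.0, 3.0]]
masked, mask = shared_dropout(activations, p_drop=0.5, rng=rng)
print(mask)
print(masked)
```

A standard dropout layer applied to a batch would draw an independent mask for each row, which corresponds to a different \(\theta\) per \(\zeta\) sample and is exactly what the shared mask avoids.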
5 Results
Table 1 summarizes our main results for MNIST, CIFAR10 and CIFAR100. Here, we compare the accuracy, ECE and SCE of the three considered models, EiV, non-EiV and AIV, at different noise levels T. For each noise level, the best metric values are highlighted in bold. We observe that EiV clearly outperforms non-EiV and AIV in all three metrics. Only for the rather simple case of MNIST does AIV show a comparable performance to EiV with a slightly better calibration but less accuracy. The results from Table 1 can be seen as a proof of principle that a proper treatment of EiV indeed improves the performance and reliability in line with statistical expectations.
Most importantly, the CIFAR100 experiment demonstrates the practical usability of our approach by showing that it can indeed suffice to use a diffusion model that was trained on a surrogate dataset to achieve a more accurate and better calibrated model using EiV. Moreover, for CIFAR10 and CIFAR100, the comparison with AIV shows that the observed advantage of EiV is not due to the denoising process but is actually rooted in the proper usage of the underlying statistical model. This observation is, moreover, not a peculiarity of the Bayesian treatment of the network parameter; similar results for the (non-Bayesian) ResNet9 architecture can be found in Appendix B. In the following we discuss in more detail the outcomes of our experiments: improved accuracy, better calibrated posterior predictive distribution and the usability of surrogate datasets.
Fig. 2 Accuracy of the trained networks on the test dataset. The networks were trained with \(n_{\zeta } = 5\) but evaluated with \(n_{\zeta } = n_{\theta } = 50\). MNIST was denoised with \(\Xi _{\textrm{MNIST}}\) and CIFAR10/100 were denoised with \(\Xi _{\textrm{CIFAR10}}\). EiV (black, dashed) generally outperforms AIV (blue, solid) and non-EiV (magenta, dotted). Results were averaged over three seeds. The standard deviations are shown as shaded areas (Color figure online)
Accuracy gain Fig. 2 visualizes the accuracy of the three statistical models for different noise levels T. For all datasets the EiV model (black) outperforms the non-EiV model (magenta). On MNIST we see that the maximum accuracy of the EiV model is retained even up to noise level \(T = 241\) and only afterwards starts to drop gradually. This might be explained by the image reconstruction process, because if the noise level is too high, the diffusion model “reconstructs” a different digit than the true digit on which the noisification process was performed. In Appendix D we present a number of such cases. On CIFAR10 the performance gain increases at higher noise level: e.g. at noise level \(T=151\), the EiV model shows an accuracy improvement of \(6.1\%\) over non-EiV. For CIFAR100 and \(T> 2\) the accuracy of the EiV model is at least \(2.2\%\) higher than the non-EiV model. This demonstrates that using EiV with a surrogate dataset to learn the input posterior can yield a model that is more accurate than one that does not account for errors in the input variables.
The comparison with AIV (blue) shows that the improved accuracy is not an artefact of the denoising process itself but is related to the proper statistical treatment of EiV. Even though on the simplest dataset, MNIST, the EiV and the AIV model perform similarly (their curves in Fig. 2 are almost congruent), the EiV model is superior for the more complex datasets, CIFAR10 and CIFAR100. On CIFAR10, for instance, the EiV model has a \(1.1\% - 2.5\%\) higher accuracy for \(T> 2\). This superiority of EiV is most obvious for CIFAR100, where the AIV model does not lead to any performance gain compared to the non-EiV model.
More reliable posterior predictive distribution For all three models, we checked the calibration of the posterior predictive distribution p(y|x), i.e., whether its value can be expected to reflect its accuracy. This is visualized in the reliability diagrams in Fig. 3 for CIFAR10 and CIFAR100, which plot the bin accuracy vs. the average bin confidence. The size of the markers is a function of the number of samples in each bin. We observe that the EiV model leads to an improved calibration in all considered cases. For CIFAR10 the EiV model (black) is almost perfectly calibrated, whereas non-EiV (magenta) and AIV (blue) are clearly overconfident. For CIFAR100, where a surrogate dataset was used, EiV still has a significantly improved calibration compared to AIV and non-EiV. The remaining reliability diagrams in Appendix A support this conclusion. These observations are confirmed by the summary statistics presented in Table 1. EiV has the lowest ECE and SCE on CIFAR10 and CIFAR100 at all noise levels. In addition, we observe a drawback of the AIV model: it leads to a very high ECE and SCE. The preprocessing of the input seems to make AIV more confident than non-EiV even in cases where its prediction accuracy is lower, as can be seen on CIFAR100. The increased reliability of the posterior predictive distribution of an EiV treatment is also demonstrated by the common NLL metric, which we included in Table 3 in Appendix C.
True inputs not needed Figs. 2 and 3 and Table 1 consistently demonstrate that to perform EiV successfully, it can be sufficient to use a diffusion model trained on a similar but different dataset. Our experiments on CIFAR100 show that the EiV model leads to an improved model accuracy and is better calibrated, even though the diffusion model \(\Xi _{\textrm{CIFAR10}}\) was trained on the surrogate dataset CIFAR10. Note that this property does not hold for AIV, which does not perform better than non-EiV on CIFAR100.
Results without dropout So far we have supplemented a BNN with an input posterior to compare the performance and calibration of EiV, AIV and non-EiV. In the appendix we show that qualitatively the same behaviour is observed for a standard ResNet9 without any dropout mask, both with (EiV and AIV) and without (non-EiV) an input posterior. Figs. 6 and 7 and Table 4 suggest that the proposed Bayesian treatment of the input is beneficial even when not followed by a Bayesian treatment of \(\theta\).
6 Conclusions
Summary In this work, we have shown how the statistical concept of EiV can be applied in deep learning. We have used diffusion models as an input posterior to obtain a distribution over the denoised inputs. Concretely, we have empirically shown the potential benefit of EiV for image classification in cases where the classification network has access only to noisy input data and not to the unnoisy inputs. In particular, if a diffusion model is used as an input posterior, the EiV model clearly outperforms the non-EiV and AIV models in prediction performance and model calibration. In addition, our experiments on CIFAR100 show that access to the unnoisy data may not be required at all: even if the diffusion model is trained on a dataset of unnoisy images that differs from the dataset to which the classification network is applied, EiV can lead to better calibrated and more accurate predictions.
Outlook In our experiments, input errors were caused by Gaussian noise. A further possible application of our model could be on images that are corrupted by other types of noise (Boncelet, 2009; Gravel et al., 2004) or on images with missing regions. In the latter case, using diffusion models as an input posterior is promising because these networks have already been applied successfully for inpainting tasks (Lugmayr et al., 2022). Another research direction is to verify our results on larger images.
Data availability
The datasets used in this paper are public. We have provided the corresponding reference of each dataset.
Notes
Technical remark: In our implementation some of the network parameters (biases and convolutional weights) are not affected by dropout (\(p_D=0\)). For these layers \(\lambda\) is chosen such that the coefficient of the \(L^2\) regularization is the same throughout the network.
References
Bassu, D., Lo, J.T. & Nave, J. (1999). Training recurrent neural networks with noisy input measurements. In: IJCNN’99. International joint conference on neural networks. Proceedings (Cat. No. 99CH36339), (vol. 1, pp. 359–363). IEEE.
Blundell, C., Cornebise, J., Kavukcuoglu, K. & Wierstra, D. (2015). Weight uncertainty in neural network. In: International conference on machine learning, (pp. 1613–1622). PMLR.
Boncelet, C. (2009). Chapter 7-image noise models. In A. Bovik (Ed.), The essential guide to image processing (pp. 143–167). Academic Press. https://doi.org/10.1016/B978-0-12-374457-9.00007-X
Budach, L., Feuerpfeil, M., Ihde, N., Nathansen, A., Noack, N., Patzlaff, H., Harmouch, H. & Naumann, F. (2022). The effects of data quality on machine learning performance. arXiv preprint arXiv:2207.14529.
Chesher, A. (1991). The effect of measurement error. Biometrika, 78(3), 451–462.
Croitoru, F.-A., Hondru, V., Ionescu, R.T. & Shah, M. (2023). Diffusion models in vision: A survey. In IEEE transactions on pattern analysis and machine intelligence.
Dellaportas, P. & Stephens, D.A. (1995). Bayesian analysis of errors-in-variables regression models. Biometrics, 51(3), 1085–1095. https://www.jstor.org/stable/2533007
Dodge, S. & Karam, L. (2016). Understanding how image quality affects deep neural networks. In: 2016 Eighth international conference on quality of multimedia experience (QoMEX) (pp. 1–6). IEEE.
Duvenaud, D., Maclaurin, D. & Adams, R. (2016). Early stopping as nonparametric variational inference. In: Artificial intelligence and statistics (pp. 1070–1077). PMLR.
Fuller, W. A. (2009). Measurement error models. Wiley. https://doi.org/10.1002/9780470316665
Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: International conference on machine learning (pp. 1050–1059). PMLR.
Gal, Y., Hron, J. & Kendall, A. (2017). Concrete dropout. arXiv preprint arXiv:1705.07832.
Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A. M., Triebel, R., Jung, P., Roscher, R., Shahzad, M., Yang, W., Bamler, R., & Zhu, X. (2021). A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56, 1513–1589.
Van Gorp, J., Schoukens, J., & Pintelon, R. (1998). The Errors-in-Variables cost function for learning neural networks with noisy inputs. Intelligent Engineering Through Artificial Neural Networks, 8, 141–146.
Goyal, B., Agrawal, S., & Sohi, B. (2018). Noise issues prevailing in various types of medical images. Biomedical & Pharmacology Journal, 11(3), 1227.
Gravel, P., Beaudoin, G., & De Guise, J. A. (2004). A method for modeling noise in medical images. IEEE Transactions on Medical Imaging, 23(10), 1221–1232.
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K.Q. (2017). On calibration of modern neural networks. In: International conference on machine learning.
Gupta, S. & Gupta, A. (2019). Dealing with noise problem in machine learning data-sets: A systematic review. Procedia Computer Science 161, 466–474. The 5th information systems international conference, 23-24 July 2019, Surabaya, Indonesia https://doi.org/10.1016/j.procs.2019.11.146.
He, K., Zhang, X., Ren, S. & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Van Huffel, S. (1997). Recent advances in total least squares techniques and errors-in-variables modeling. SIAM.
Im, D.J., Ahn, S., Memisevic, R. & Bengio, Y. (2015). Denoising criterion for variational auto-encoding framework. In: AAAI conference on artificial intelligence.
Ivanovic, B., Lin, Y., Shrivastava, S., Chakravarty, P. & Pavone, M. (2022). Propagating state uncertainty through trajectory forecasting. In: 2022 international conference on robotics and automation (ICRA) (pp. 2351–2358). IEEE.
Jiang, L., Huang, D., Liu, M. & Yang, W. (2020). Beyond synthetic noise: Deep learning on controlled noisy labels. In: International conference on machine learning (pp. 4804–4815). PMLR.
Kao, C., & Schnell, J. F. (1987). Errors in variables in the multinomial response model. Economics Letters, 25(3), 249–254. https://doi.org/10.1016/0165-1765(87)90222-9
Kendall, A. & Gal, Y. (2017). What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977.
Kingma, D.P., Salimans, T. & Welling, M. (2015). Variational dropout and the local reparameterization trick. arXiv preprint arXiv:1506.02557.
Kingma, D.P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report. https://www.cs.toronto.edu/~kriz/cifar.html.
Lakshminarayanan, B., Pritzel, A. & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
Leonard, D. (2011). Estimating a bivariate linear relationship. Bayesian Analysis, 6(4), 727–754. https://doi.org/10.1214/11-BA627
Li, Z. & Hoiem, D. (2018). Improving confidence estimates for unfamiliar examples. 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) pp. 2683–2692.
Loquercio, A., Segu, M., & Scaramuzza, D. (2020). A general framework for uncertainty estimation in deep learning. IEEE Robotics and Automation Letters, 5(2), 3153–3160.
Loshchilov, I. & Hutter, F. (2017). Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101.
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R. & Van Gool, L. (2022). Repaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11461–11471.
Maddox, W.J., Izmailov, P., Garipov, T., Vetrov, D.P. & Wilson, A.G. (2019). A simple baseline for bayesian uncertainty in deep learning. Advances in neural information processing systems, 32.
Martin, J., & Elster, C. (2023). Aleatoric uncertainty for errors-in-variables models in deep regression. Neural Processing Letters, 55(4), 4799–4818. https://doi.org/10.1007/s11063-022-11066-3
Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning. Proceedings of the AAAI conference on artificial intelligence, 29(1), 2901–2907
Nazaré, T.S., Costa, G.B.P., Contato, W.A. & Ponti, M. (2018). Deep convolutional neural networks and noisy images. In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 22nd Iberoamerican congress, CIARP 2017, Valparaíso, Chile, November 7–10, 2017, Proceedings 22 (pp. 416–424). Springer.
Nikolenko, S.I. (2019). Synthetic data for deep learning. arXiv preprint arXiv:1909.11512.
Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. & Tran, D. (2019). Measuring calibration in deep learning. arXiv preprint arXiv:1904.01685.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J. & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Pavone, A., Svensson, J., Langenberg, A., Pablant, N., Hoefel, U., Kwak, S., Wolf, R., Team, W.-X. (2018). Bayesian uncertainty calculation in neural network inference of ion and electron temperature profiles at w7–x. Review of Scientific Instruments, 89(10), 10–102.
Plenge, E., Poot, D. H., Bernsen, M., Kotek, G., Houston, G., Wielopolski, P., Weerd, L., Niessen, W. J., & Meijering, E. (2012). Super-resolution methods in MRI: Can they improve the trade-off between resolution, signal-to-noise ratio, and acquisition time? Magnetic Resonance in Medicine, 68(6), 1983–1993.
Pusey, E., Lufkin, R. B., Brown, R., Solomon, M. A., Stark, D. D., Tarr, R., & Hanafee, W. (1986). Magnetic resonance imaging artifacts: Mechanism and clinical significance. Radiographics, 6(5), 891–911.
Rolnick, D., Veit, A., Belongie, S. & Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.
Schennach, S.M. (2012). Measurement error in nonlinear models: A review. cemmap working paper. https://www.cemmap.ac.uk/wp-content/uploads/2020/08/CWP4112.pdf.
Seghouane, A.-K., Fleury, G. (2001). A cost function for learning feedforward neural networks subject to noisy inputs. In: Proceedings of the 6th international symposium on signal processing and its applications (Cat. No. 01EX467), vol. 2, (pp. 386–389). IEEE.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning (pp. 2256–2265). PMLR.
Song, H., Kim, M., Park, D., Shin, Y., & Lee, J.-G. (2023). Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 34(11), 8135–8153. https://doi.org/10.1109/TNNLS.2022.3152527
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S. & Poole, B. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
Sragner, L. & Horvath, G. (2003). Improved model order estimation for nonlinear dynamic systems. In: 2nd IEEE International workshop on intelligent data acquisition and advanced computing systems: technology and applications, 2003. Proceedings (pp. 266–271). IEEE.
Stefanski, L. A., & Carroll, R. J. (1985). Covariate measurement error in logistic regression. The Annals of Statistics, 13(4), 1335–1351.
Wright, W. (1999). Bayesian approach to neural-network modeling with input uncertainty. IEEE Transactions on Neural Networks, 10(6), 1261–1270.
Wright, W., Ramage, G., Cornford, D., & Nabney, I. T. (2000). Neural network modelling with input uncertainty: Theory and application. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 26(1), 169–188.
Yadan, O. (2019). Hydra - A framework for elegantly configuring complex applications. Github. https://github.com/facebookresearch/hydra.
Yuan, J., Zhu, J., & Nian, V. (2020). Neural network modeling based on the Bayesian method for evaluating shipping mitigation measures. Sustainability, 12(24), 10486.
Zhang, X., Liang, F., Yu, B., & Zong, Z. (2011). Explicitly integrating parameter, input, and structure uncertainties into Bayesian neural networks for probabilistic hydrologic forecasting. Journal of Hydrology, 409(3–4), 696–709.
Zhang, G., Sun, S., Duvenaud, D. & Grosse, R. (2018). Noisy natural gradient as variational inference. In: International conference on machine learning (pp. 5852–5861). PMLR.
Zhou, Y., Liu, D. & Huang, T. (2018). Survey of face detection on low-quality images. In: 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018) (pp. 769–773). IEEE.
Funding
Open Access funding enabled and organized by Projekt DEAL. This project is part of the programme “Metrology for Artificial Intelligence in Medicine” (M4AIM) that is funded by the Federal Ministry for Economic Affairs and Climate Action (BMWK) in the frame of the QI-Digital initiative.
Author information
Contributions
JF contributes to conceptualization, methodology, experiments, analyzing, and writing. JM contributes to conceptualization, methodology, experiments, analyzing, and writing. CE contributes to conceptualization, methodology, and analyzing.
Ethics declarations
Conflict of interest
The authors declare that they have no known conflict of interest relevant to the content of this paper.
Ethics approval
Not applicable
Consent to participate
Not applicable. None of the experiments in this paper involves animals, plants, or human entities.
Consent for publication
Not applicable. The paper does not include data or images requiring permissions.
Code availability
The code to reproduce these experiments is provided at https://github.com/josh3142/EiV.
Additional information
Editor: Emilie Devijver.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Reliability diagrams
The reliability diagrams of the ResNet9D0 networks trained on CIFAR10 and CIFAR100 that are not displayed in Fig. 3 are presented in Fig. 4.
Reliability diagrams showing bin accuracy versus average bin confidence for CIFAR10 (top row) and CIFAR100 (bottom row) at different noise levels T. Color coding as in Fig. 2: EiV (black), non-EiV (magenta) and AIV (blue). The marker size for the points is chosen proportional to \((\text{ number } \text{ of } \text{ samples } \text{ in } \text{ bin})^{1/2}\). Results were averaged over three seeds (Color figure online)
Reliability diagrams showing bin accuracy versus average bin confidence for CIFAR10 (top row) and CIFAR100 (bottom row) for those noise levels T not displayed in Fig. 3. Color coding as in Fig. 2: EiV (black), non-EiV (magenta) and AIV (blue). The marker size for the points is chosen proportional to \((\text{ number } \text{ of } \text{ samples } \text{ in } \text{ bin})^{1/2}\) (Color figure online)
Appendix B Robustness of EiV
In this section we provide evidence that EiV neither depends on a large number of draws \(n_{\zeta }\) nor requires BNNs to yield a substantially improved inference. We show that \(n_{\zeta } = 5\) is sufficient to obtain a better model performance. Furthermore, a Bayesian treatment of the network parameter \(\theta\) is not required either. To study the latter, we consider (9) with the variational distribution \(q_{\phi }(\theta )\) being the Dirac distribution \(\delta (\theta - \hat{\theta })\) centered at the maximum a posteriori (MAP) estimate \(\hat{\theta }\). We conducted the same experiments as in Sect. 4 with ResNet9 on CIFAR10 and CIFAR100. The results show that EiV outperforms its competing statistical models in both settings.
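The two settings compared in this appendix can be sketched as a Monte Carlo average of class probabilities over \(n_{\zeta }\) input draws and \(n_{\theta }\) parameter draws, where the Dirac case reduces the parameter draws to the single point \(\hat{\theta }\). The following NumPy sketch is an illustration under stated assumptions only: the linear classifier, the Gaussian perturbations standing in for draws from the diffusion model, the Bernoulli masks standing in for dropout and all names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def predictive(logit_fn, zeta_draws, theta_draws):
    """Average the class probabilities over all pairs of input and parameter draws."""
    probs = [softmax(logit_fn(zeta, theta))
             for zeta in zeta_draws
             for theta in theta_draws]
    return np.mean(probs, axis=0)


def logit_fn(zeta, theta):
    """Hypothetical stand-in for the classification network: a linear map."""
    return zeta @ theta


rng = np.random.default_rng(0)
n_features, n_classes, n_zeta, n_theta = 8, 3, 5, 50

# Stand-in for draws from the input posterior (NOT an actual diffusion model).
x_denoised = rng.standard_normal(n_features)
zeta_draws = [x_denoised + 0.1 * rng.standard_normal(n_features)
              for _ in range(n_zeta)]

# Point estimate of the parameters and dropout-like random masks around it.
theta_hat = rng.standard_normal((n_features, n_classes))
theta_draws = [theta_hat * rng.binomial(1, 0.9, theta_hat.shape)
               for _ in range(n_theta)]

p_bayes = predictive(logit_fn, zeta_draws, theta_draws)  # n_zeta * n_theta passes
p_map = predictive(logit_fn, zeta_draws, [theta_hat])    # Dirac at theta_hat
print(p_bayes, p_map)
```

In the Dirac case the cost per prediction drops from \(n_{\zeta } \cdot n_{\theta }\) to \(n_{\zeta }\) forward passes, while the input is still averaged over its posterior draws.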
Accuracy of the trained networks on the test dataset. The networks were trained and, in contrast to Fig. 2, evaluated with \(n_{\zeta } = 5\). Dropout was realized with \(n_{\theta } = 50\). MNIST was denoised with \(\Xi _{\textrm{MNIST}}\) and CIFAR10 and CIFAR100 were denoised with \(\Xi _{\textrm{CIFAR10}}\). EiV (black, dashed) is superior to AIV (blue, solid) and non-EiV (magenta, dotted) in most cases
Similar to Fig. 2, Fig. 5 shows the accuracy of the three statistical models for different noise levels T. In contrast to Fig. 2, the models have been evaluated with \(n_{\zeta } = 5\). We used the same trained models as in Sect. 4 to perform inference. For CIFAR10 and CIFAR100 we observe a slightly worse performance compared to an inference with \(n_{\zeta } = 50\).
Accuracy of the (non-Bayesian) ResNet9 on the test dataset. The networks were trained with \(n_{\zeta } = 5\) and evaluated with \(n_{\zeta } \in \{5, 50\}\). The datasets were denoised with \(\Xi _{\textrm{CIFAR10}}\). EiV (black, dashed) is superior to AIV (blue, solid) and non-EiV (magenta, dotted)
In this work we introduced EiV in a Bayesian setting. In particular, our experiments in Sect. 4 were performed with BNNs. However, the accuracy plots in Fig. 6 show that EiV also leads to an improved model performance on (non-Bayesian) NNs, in which the random network parameters \(\theta _i\) are replaced by the MAP estimate \(\hat{\theta }\). The input is still treated in a Bayesian manner using the diffusion model. The numerical results are presented in Table 4 and a selection of reliability diagrams is provided in Fig. 7.
Reliability diagrams for the (non-Bayesian) ResNet9 and \(n_{\zeta } = 50\): Each diagram shows the bin accuracy versus average bin confidence for CIFAR10 (top row) and CIFAR100 (bottom row) at different noise levels T. Color coding as in Fig. 2: EiV (black), non-EiV (magenta) and AIV (blue). The marker size for the points is chosen proportional to \((\text{ number } \text{ of } \text{ samples } \text{ in } \text{ bin})^{1/2}\)
Appendix C Metrics with standard deviations
Appendix D Visualization of input data
In this section the noisification process is visualized. The parameter T determines how much an image \(x^0\) diffuses into a standard Gaussian distribution as described by (7). The denoised image \(\zeta ^T\) is obtained by reversing this diffusion process (Figs. 8, 9, 10). Using (7) the parameter T can be directly translated into the progress of the diffusion process. In particular, for the noise levels used in the experiments the mean and the standard deviation of \(p(x^T|\zeta )\) are listed in Table 5 for CIFAR and Table 6 for MNIST.
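The translation of T into a mean and a standard deviation can be sketched as follows, assuming that (7) has the usual closed form of a denoising diffusion probabilistic model, \(x^T = \sqrt{\bar{\alpha }_T}\, x^0 + \sqrt{1 - \bar{\alpha }_T}\, \varepsilon\) with standard Gaussian \(\varepsilon\). The linear variance schedule and its default values follow Ho et al. (2020) and are an assumption here, as are the function names. The schedule used for Tables 5 and 6 may differ.

```python
import numpy as np


def forward_moments(T, n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Factor on the clean image and noise standard deviation after T forward steps."""
    if not 1 <= T <= n_steps:
        raise ValueError("T must lie in 1..n_steps")
    # Assumed linear variance schedule; alpha_bar is the product of (1 - beta_t).
    betas = np.linspace(beta_start, beta_end, n_steps)
    alpha_bar = np.prod(1.0 - betas[:T])
    return float(np.sqrt(alpha_bar)), float(np.sqrt(1.0 - alpha_bar))


def noisify(x0, T, rng, **schedule):
    """Draw a noisy image x^T from a clean image x0 in a single step."""
    scale, std = forward_moments(T, **schedule)
    return scale * x0 + std * rng.standard_normal(x0.shape)


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # hypothetical CIFAR-sized image
for T in (1, 100, 500):
    scale, std = forward_moments(T)
    xT = noisify(x0, T, rng)
    print(T, scale, std, xT.shape)
```

With increasing T the factor on the clean image shrinks and the noise standard deviation grows, so that \(x^T\) approaches a standard Gaussian sample.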
Visualization of examples from MNIST at different noise levels T. The diffusion model \(\Xi _{\textrm{MNIST}}\) was used to denoise the images. At each noise level T, \(x^T\) is the input for non-EiV and \(\zeta ^T\) is used for EiV and AIV. In some cases the diffusion model reconstructs a digit that is different from the true digit
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Faller, J., Martin, J. & Elster, C. Deep Errors-in-Variables using a diffusion model. Mach Learn 114, 107 (2025). https://doi.org/10.1007/s10994-025-06744-x
DOI: https://doi.org/10.1007/s10994-025-06744-x












