Abstract
In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated tasks, splitting the sample at hand into training, validation, and evaluation sets, and only computing a single confidence interval for the prediction performance of the final selected model. We however regard the selection problem as a simultaneous inference problem and propose an algorithm to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performance in the evaluation set. For this, we use bootstrap tilting and a maxT-type multiplicity correction. Various simulation experiments show that this leads to lower confidence bounds for the conditional performance that are at least as good as bounds from standard methods, and that reliably reach the nominal coverage probability. Also, a better performing final prediction model is selected this way, especially when the sample size is small. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights.
1 Introduction
Many machine learning (ML) applications involve both model selection and the assessment of that model’s prediction performance on future observations. This is particularly challenging when only a small amount of data is available for both tasks. Allocating a greater fraction of the data to model selection makes the performance assessment less reliable, while allocating a greater fraction to performance assessment increases the risk of selecting a subpar prediction model. In such situations it is desirable to have a procedure at hand that resolves this trade-off reliably.
Recent work by Westphal and Brannath (2020) showed that it is beneficial, in terms of final model performance and statistical power, to select multiple models for performance assessment and to shift the model selection to the evaluation set, in spite of the need to then correct for multiplicity. While Westphal and Brannath (2020) proposed a multiple test for such cases, we propose a way to compute valid lower confidence bounds for the conditional prediction performance of the final selected model. Reporting a confidence interval is reasonable since a point estimate for the performance does not take the uncertainty of the estimate into account.
We follow the idea of Berk et al. (2013) and interpret this post-selection inference problem as a simultaneous inference problem, controlling the FWER,
\(P \bigl ( \theta _i \ge \theta _{i, L} \ \text {for all} \ i = 1, \ldots , m \bigr ) \ge 1 - \alpha ,\)   (1)
where \(\theta _i\) denotes the performance of prediction model i, \(\theta _{i, L}\) denotes the corresponding lower confidence bound at significance level \(\alpha > 0\), and \({\varvec{\theta }}= (\theta _1, \ldots , \theta _m)\) is the vector of true predictive performances. With this type I error control, in practice, we are therefore able to answer, with high confidence, the question whether there is a model among the candidate models whose prediction performance \(\theta _i\) is at least as large as a reference performance \(\theta _0\), no matter how and which subset of the initial competition has been selected for evaluation. In particular, since
\(P \bigl ( \theta _s \ge \theta _{s, L} \bigr ) \ge P \bigl ( \theta _i \ge \theta _{i, L} \ \text {for all} \ i = 1, \ldots , m \bigr ) \ge 1 - \alpha\)   (2)
for all \(s \in \{1, \ldots , m\}\), even if s is data-dependent, this coverage guarantee carries over to any potentially selected final model s.
Simultaneous inference for all the candidate models might be an overly ambitious requirement and might yield too conservative decisions. Therefore, in order to increase the precision of the lower confidence bound, we propose to not evaluate all candidate models, but only a promising selection of them. To put it another way: we exclude models from the evaluation that are unlikely to have the best predictive performance. It is also possible to report lower confidence bounds for the prediction performance of all the competing models.
Our proposed confidence bounds are universally valid: they work with any measure of prediction performance (as long as it accepts weights, as we will discuss later), with any combination of prediction models even from different model classes (for example, linear and non-linear candidate models), any model selection strategy (formal or informal, or even based on post-hoc considerations), and are computationally undemanding as no additional model training is involved. While Berk et al. (2013) proposed universally valid post-selection confidence bounds for regression coefficients, we are interested in a post-selection lower confidence bound for the conditional prediction performance of a model selected based on its evaluation performance.
In many applications it is sufficient to know a lower confidence bound for the performance of a prediction model. For instance, in medical diagnosis, when deploying a machine learning model to decide on the treatment of a patient’s condition, a lower bound ensures a minimum acceptable performance for patient safety. In such cases one might not be interested so much in an upper bound, because it represents the best-case scenario; the priority is to ensure an acceptable minimum performance. It is then advisable to only compute the lower bound because, at a fixed confidence level, the lower endpoint of a two-sided interval is smaller, and thus less informative, than a one-sided lower confidence bound.
1.1 Conditional versus unconditional performance
We are particularly interested in the conditional prediction performance, that is the generalization performance of the model trained on the present sample. For that, in a model selection and performance estimation regime, the prevailing recommendation in the literature is to split the sample at hand into three parts, a training, validation, and evaluation set (Goodfellow et al., 2016; Hastie et al., 2009; Japkowicz & Shah, 2011; Murphy, 2012; Raschka, 2018), see Fig. 1. Depending on the specific selection rule, the training and validation set can sometimes be combined to form a learning set. For instance, this is true when cross-validation (CV) is used to identify promising models from a number of candidate models, based on their cross-validated prediction performance. Using the entire sample at hand for CV however is not a solution to our problem since it actually estimates the unconditional prediction performance, that is the average prediction performance of a model fit on other training data from the same distribution as the original data (Bates et al., 2021; Hastie et al., 2009).
There are proposals in the literature on how to use CV anyway to obtain an estimate of the conditional performance such as Nested CV (NCV, Bates et al., 2021) and Bootstrap Bias-Corrected CV (BBC-CV, Tsamardinos et al., 2018).
In our investigations, however, it turns out that neither BBC-CV nor NCV yield intervals that reliably maintain the desired level of confidence for the conditional performance. On the contrary, the resulting confidence intervals are seriously biased and have far too low coverage probability (BBC-CV), or fluctuate widely around the nominal level (NCV). We present an extensive discussion of our findings on the two procedures in Section 1 of the Supplement Information.
In addition, both BBC-CV and NCV are only applicable when CV is employed. This can lead to very long computation times, for example, when complex neural nets are trained. Furthermore, NCV, unlike our proposed method, relies on normal approximations, needs to be run repeatedly to ensure stable estimates, and thus requires many more model fits.
In contrast, our proposed method does not need a CV scheme to directly and inherently estimate a lower confidence bound for the conditional performance, and it can be applied to a variety of performance measures such as (balanced) prediction accuracy, AUC, precision/positive predictive value, recall/sensitivity, specificity, negative predictive value, and F1 score, without any distributional assumptions or additional model training.
Fig. 1 Default evaluation pipeline, as predominantly recommended in the literature. Only a single model \(\hat{{\varvec{\beta }}}_s\) is selected for evaluation based on its validation performance \({\hat{\eta }}_s\). \({\hat{\theta }}_{s, L}\) is the lower confidence bound for that model’s evaluation performance
1.2 Bootstrap tilting confidence intervals
Our proposed confidence bounds are obtained using bootstrap resampling. In particular, we use bootstrap tilting (BT), introduced by Efron (1981), which is a general approach to estimate confidence intervals for some parameter \(\theta = \theta (F)\) using an i. i. d. sample \((y_1, y_2, \ldots , y_n)\) from an unknown distribution F. This parameter \(\theta\) will later be our performance estimate of choice. We denote the observed value of \(\theta\) by \({\hat{\theta }} = \theta ({\hat{F}})\), where \({\hat{F}}\) is the empirical distribution function of the sample \((y_1, y_2, \ldots , y_n)\).
Many bootstrap confidence interval methods, such as bootstrap percentile intervals, assume that the distribution of \({\hat{\theta }} - \theta _0\) does not depend on the test value \(\theta _0\). Due to this assumption it is not necessary to consider each value of \(\theta _0\) separately; the pivotal distribution is estimated using the non-parametric bootstrap and then inverted to derive confidence intervals. BT, on the other hand, aims to estimate the distribution of \({\hat{\theta }} - \theta _0\) for each test value \(\theta _0\) by adjusting the probabilities with which to resample from the sample \((y_1, y_2, \ldots , y_n)\). The lower confidence bound is then formed by those values of \(\theta _0\) that could not be rejected in a test of the null hypothesis \(H_0 :\theta \le \theta _0\). This way the distribution to resample from is consistent with the null distribution.
In particular, the idea is to reduce the problem to a one-parametric family \((F_\tau )_\tau\) of distributions that includes the empirical distribution function \({\hat{F}}\) and to find the particular \(F_\tau\) that has \(\alpha \cdot 100 \%\) of the bootstrap distribution exceeding the observed value \({\hat{\theta }}\),
\(P_{F_\tau } \bigl ( \theta ({\hat{F}}^*) \ge {\hat{\theta }} \bigr ) = \alpha ,\)   (3)
where \({\hat{F}}^*\) denotes the empirical distribution function of a bootstrap sample drawn from \(F_\tau\).
The family \((F_\tau )_\tau\) is restricted to have support on the observations \((y_1, y_2, \ldots , y_n)\). This allows us to express a distribution \(F_\tau\) in terms of the probability mass \({\varvec{p}}_\tau = [p_1 (\tau ), p_2 (\tau ), \ldots , p_n (\tau )]\) it puts on \((y_1, y_2, \ldots , y_n)\). For example, the empirical distribution function \({\hat{F}}\) can be identified with \({\varvec{p}}_0 = (n^{-1}, n^{-1}, \ldots , n^{-1})\).
One possible family of distributions is the exponential tilting family with weights
\(p_i (\tau ) = \frac{\exp \{ \tau \, U_i ({\varvec{p}}_0) \}}{\sum _{j=1}^{n} \exp \{ \tau \, U_j ({\varvec{p}}_0) \}} , \quad i = 1, \ldots , n,\)   (4)
and directional derivatives (also known as empirical influence functions)
\(U_i ({\varvec{p}}) = \lim _{\varepsilon \rightarrow 0} \frac{\theta \bigl ( (1 - \varepsilon ) \, {\varvec{p}} + \varepsilon \, {\varvec{\delta }}_i \bigr ) - \theta ({\varvec{p}})}{\varepsilon } ,\)   (5)
where \({\varvec{\delta }}_i\) is the vector with a 1 in position i and zeros elsewhere. In case the parameter \(\theta\) is a mean, the directional derivative reduces to \(U_i({\varvec{p}}_0) = y_i - {\bar{y}}\), and \(p_i(\tau ) = e^{\tau y_i} / \sum _{j=1}^n e^{\tau y_j}\). This is the case, for instance, when the parameter of interest is the prediction accuracy \(\theta = n^{-1} \sum _{i=1}^n I \{y_i = {\hat{y}}_i\}\) in a classification problem, where I denotes the indicator function, and \({\hat{y}}_i\) is the predicted class label of the i-th observation.
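To make this concrete, the following minimal R sketch computes the exponential tilting weights of Eq. (4) for the accuracy case, where the 0/1 correctness indicators \(I\{y_i = {\hat{y}}_i\}\) play the role of the observations whose mean is the parameter of interest. All object names and the simulated data are illustrative only.

```r
# Exponential tilting weights for a mean-type parameter (here: accuracy).
tilt_weights <- function(z, tau) {
  # z: per-observation values whose mean is the parameter of interest
  w <- exp(tau * z)
  w / sum(w)                          # p_i(tau) as in Eq. (4)
}

tilted_parameter <- function(z, tau) {
  sum(tilt_weights(z, tau) * z)       # theta(F_tau), a weighted mean
}

set.seed(1)
y     <- rbinom(100, 1, 0.5)                   # true labels (toy data)
y_hat <- ifelse(runif(100) < 0.8, y, 1 - y)    # predictions, roughly 80% accurate
z     <- as.numeric(y == y_hat)                # correctness indicators

tilted_parameter(z, tau = 0)        # tau = 0 recovers the observed accuracy
tilted_parameter(z, tau = -0.5)     # negative tau tilts towards smaller accuracy
```

Setting \(\tau = 0\) recovers the empirical distribution \({\hat{F}}\) and thus the observed accuracy, while negative values of \(\tau\) shift mass towards the incorrectly predicted observations.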
From Eq. (4) we see that the so-called tilting parameter \(\tau\) is monotonically related to \(p_i (\tau )\) and thus to \(F_\tau\) such that a specific value of \(\tau\) corresponds to a specific value of \(\theta (F_\tau ) = \theta _0\) and to a null hypothesis \(H_0 :\theta \le \theta _0\). We will use this duality between the tilting parameter \(\tau\) and the reference value \(\theta _0\) to obtain the confidence interval.
Among all the distributions with support only on \((y_1, y_2, \ldots , y_n)\), the exponential tilting weights in Eq. (4) minimize the backward Kullback–Leibler distance (relative entropy) \(d_{\text {KL}} ({\varvec{p}}_\tau , {\varvec{p}}_0) = \sum _{i=1}^n p_i (\tau ) \log [p_i (\tau ) / n^{-1}]\) in \({\varvec{p}}_\tau\) between \(F_\tau\) and \({\hat{F}}\) subject to the constraint
\(\theta ({\varvec{p}}_\tau ) = \theta _0 .\)   (6)
In this sense, the corresponding tilted distribution \(F_\tau\) is the closest distribution to the observed sample under the constraint in Eq. (6) in a one-parametric family.
Finally, to find a lower confidence bound \(\theta _L\) for \(\theta\), we find the largest value of \(\tau < 0\) such that the corresponding level \(\alpha\) test still rejects \(H_0\); this means \(\theta _L\) is the largest value of \(\theta _0\) such that, if the sample came from a distribution with parameter \(\theta _0\), the probability of observing \({\hat{\theta }}\) or an even larger value is \(\alpha\), see Eq. (3).
Conceptually, for any given value of \(\tau\), we need to sample from \(F_\tau\) and check whether Eq. (3) holds true. This is both expensive and exposed to the randomness of repeated sampling. What we actually do is to employ an importance sampling reweighting approach as proposed by Efron (1981).
Suppose we want to estimate a probability \(p_t = P (X \ge t)\) and let \((X_b)_b\) be an i. i. d. sample from the same distribution F that X has, \(b = 1, \ldots , B\). Let f denote the density of F and \(W = f/f_*\). Then
\(p_t = \int I \{ x \ge t \} \, f(x) \, \textrm{d}x = \int I \{ x \ge t \} \, W(x) \, f_*(x) \, \textrm{d}x ,\)   (7)
where the density \(f_*\) of the design distribution dominates the density f of the target distribution, that is, \(f(x) = 0\) whenever \(f_*(x) = 0\) (in order to avoid an indefinite integrand), and the importance sampling estimate for \(p_t\) is given by
\({\hat{p}}_t = \frac{1}{B} \sum _{b=1}^{B} I \{ X_b^* \ge t \} \, W(X_b^*) ,\)   (8)
where the \(X_b^*\) are sampled from \(f_*\). Efron (1981) uses the empirical distribution \({\hat{F}}\) as the design distribution, that is, \(f_*\) puts probability mass 1/n on each observation in the sample.
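A toy illustration of this identity in R (the normal distributions and all names here are our own choices, not part of the method): the tail probability of a standard normal is estimated by sampling from a shifted design distribution and reweighting with \(W = f/f_*\).

```r
# Importance-sampling estimate of p_t = P(X >= t) for X ~ N(0, 1),
# using N(2, 1) as the design distribution.
set.seed(1)
thr    <- 2.5
B      <- 1e4
x_star <- rnorm(B, mean = 2)                        # draws from the design density f_*
w      <- dnorm(x_star) / dnorm(x_star, mean = 2)   # W(X_b^*) = f / f_*

mean((x_star >= thr) * w)          # importance-sampling estimate of p_t
pnorm(thr, lower.tail = FALSE)     # exact value for comparison
```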
This idea allows us to estimate the probability in Eq. (3) and, thus, to find the lower bound \(\theta _L\) using only bootstrap resamples from the observed empirical distribution \({\hat{F}}\). Let \(M_{i,b}^*\) be the number of times \(y_i\) is sampled in resample \(b = 1, \ldots , B\). We reweight each resample b with the relative likelihood
\(W_b (\tau ) = \prod _{i=1}^{n} \bigl ( n \, p_i (\tau ) \bigr )^{M_{i,b}^*}\)   (9)
of the resample under \({\varvec{p}}_\tau\)-reweighted sampling relative to sampling with equal weights \(n^{-1}\), and calibrate the tilting parameter \(\tau < 0\) such that the estimated probability of observing at least \({\hat{\theta }}\) under the tilted distribution \(F_\tau\) is \(\alpha\),
\(\frac{1}{B} \sum _{b=1}^{B} W_b (\tau ) \, I \bigl \{ \theta ({\hat{F}}_b^*) \ge {\hat{\theta }} \bigr \} = \alpha ,\)   (10)
where \({\hat{F}}^*\) is the resampling empirical distribution. Then the value of the parameter \(\theta\) that corresponds to that calibrated value of \(\tau\) and the respective sampling weights \(p_\tau\) is the desired lower confidence bound \(\theta _L = \theta (p_\tau )\). Figure 2 illustrates this idea.
Fig. 2 BT confidence bound estimation. The solid-line distribution on the right represents \({\hat{F}}\), while the dashed-lined distribution on the left represents \({\hat{F}}_\tau\). BT finds a value for \(\tau\) such that the probability under \({\hat{F}}_\tau\) to observe at least \({\hat{\theta }}\) is \(\alpha\); this means the mass of the dashed-lined distribution that is to the right of \({\hat{\theta }}\) is equal to \(\alpha\). The associated value \({\hat{\theta }}_L\) of \(\theta\) under \({\hat{F}}_\tau\) is the desired lower confidence bound
So, the basic concept of bootstrap tilting is to create bootstrap samples from the empirical distribution and then adjust the probabilities of these resamples to match the constraint in Eq. (6). This reveals the close connection to empirical likelihood ideas and how bootstrap tilting can be applied there; see, for instance, Dickhaus (2018).
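The following R sketch puts Eqs. (4), (9), and (10) together for a plain (single-model) BT lower bound on prediction accuracy. It is an illustration only, not a reference implementation: the function name, the root-finding bracket for \(\tau\), and the number of resamples are our own choices, and the non-degenerate case of an observed accuracy strictly between 0 and 1 is assumed.

```r
# Single-model bootstrap-tilting lower bound for prediction accuracy.
# z: vector of 0/1 correctness indicators on the evaluation set.
bt_lower_bound <- function(z, alpha = 0.05, B = 2000) {
  n         <- length(z)
  theta_hat <- mean(z)                                  # observed accuracy
  idx       <- matrix(sample.int(n, n * B, replace = TRUE), nrow = B)
  M         <- t(apply(idx, 1, tabulate, nbins = n))    # resampling counts M*_{i,b}
  theta_bs  <- drop(M %*% z) / n                        # bootstrap estimates theta*_b

  # importance-sampling estimate of P_{F_tau}(theta* >= theta_hat), cf. Eq. (10)
  tail_prob <- function(tau) {
    p    <- exp(tau * z) / sum(exp(tau * z))            # tilting weights, Eq. (4)
    logW <- drop(M %*% log(n * p))                      # log W_b(tau), cf. Eq. (9)
    mean(exp(logW) * (theta_bs >= theta_hat))
  }

  # calibrate tau < 0 such that the estimated tail probability equals alpha
  tau_cal <- uniroot(function(tau) tail_prob(tau) - alpha,
                     interval = c(-10, -1e-6))$root
  p_cal <- exp(tau_cal * z) / sum(exp(tau_cal * z))
  sum(p_cal * z)                                        # theta_L = theta(p_tau)
}

# usage: bt_lower_bound(z) with z the evaluation correctness indicators
```

Only one set of resamples from \({\hat{F}}\) is drawn; changing \(\tau\) merely changes the weights \(W_b(\tau)\), which is what makes the calibration cheap.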
The tilting approach does not work if the data \(y_1 = \ldots = y_n\) is constant because then
\(U_i ({\varvec{p}}_0) = 0 \quad \text {and thus} \quad p_i (\tau ) = n^{-1} \quad \text {for all} \ i \ \text {and} \ \tau ,\)   (11)
and the empirical distribution cannot be tilted. This can for instance be an issue in binary classification, when the model perfectly predicts the true class labels. One option to deal with this issue is to switch to another (conservative) interval estimation method. In the aforementioned example this could for instance be a Clopper-Pearson lower confidence bound at a Šidák-corrected significance level.
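As a small, hedged illustration of such a fallback (the numbers are made up), the one-sided Clopper-Pearson bound at a Šidák-corrected level can be obtained directly from base R:

```r
# Fallback when tilting is impossible, e.g. all n evaluation predictions correct:
# one-sided Clopper-Pearson lower bound at a Sidak-adjusted level for m models.
n <- 50; m <- 5; alpha <- 0.05
alpha_sidak <- 1 - (1 - alpha)^(1 / m)
binom.test(x = n, n = n, alternative = "greater",
           conf.level = 1 - alpha_sidak)$conf.int[1]    # lower bound for accuracy
```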
Under the assumption that \(\theta\) is a smooth function of sample means, DiCiccio and Romano (1990) showed that BT produces consistent confidence intervals which are second-order correct, that is, they have one-sided non-coverage probabilities of \(\alpha + {\mathcal {O}}(n^{-1})\). Moreover, bootstrapping with exponential tilting weights and importance sampling reduces the variance of the p-value estimate in Eq. (10) (Davison & Hinkley, 1997); BT is transformation-invariant, reaches a level of precision comparable to other bootstrap techniques with far fewer resamples, and offers a good balance between confidence interval length and accuracy in terms of coverage (Hesterberg, 1999).
However, our proposed pipeline involves model selection, and multiple models are being evaluated, see Fig. 3. Thus, we modify the BT routine and incorporate a maxT-type multiplicity control, which is a well-known standard approach in simultaneous inference (Dickhaus, 2014). For example, the maxT correction is used to compute simultaneous confidence intervals for linear combinations of parameters such as mean differences or contrasts in (generalized) linear models, and it is implemented, for instance, in the multcomp R package (Hothorn et al., 2008). The basic idea behind the maxT correction is to estimate the joint distribution of the test statistics under the null hypothesis and use this distribution to construct confidence intervals. By central limit theorems, under the null hypothesis, this joint distribution can often be approximated by a multivariate normal distribution. An extensive discussion of the maxT correction can be found in Westfall & Young (1993). However, the application of the maxT correction in ML settings is not common. Also, to the best of our knowledge, this is the first time that BT is extended to simultaneous inference and applied in a ML evaluation setup. Our approach enables us to simultaneously evaluate the conditional performances of multiple models and provide valid confidence bounds for them and in particular one for the final selected model.
The idea to use a resampling-based approach to correct for multiplicity is not a new one. Westfall & Young (1993) provide a comprehensive overview of various resampling techniques, including the bootstrap, and how they can be applied in the multiple testing framework in order to control the FWER.
Fig. 3 Proposed evaluation pipeline. Multiple models \(s_1, \ldots , s_m\) are selected for evaluation based on their validation performances \({\hat{\eta }}_{s_1}, \ldots , {\hat{\eta }}_{s_m}\), and a final model s is selected based on the evaluation performances \({\hat{\theta }}_{s_1}, \ldots , {\hat{\theta }}_{s_m}\). This introduces a multiplicity problem concerning \({\hat{\theta }}_{s_1}, \ldots , {\hat{\theta }}_{s_m}\) and an upward bias in \({\hat{\theta }}_s\). We propose a lower confidence bound \({\hat{\theta }}_{s, L}\) that compensates for this multiplicity and bias
In the following, we consider a binary classification problem where a potentially large number r of candidate models have already been trained and a number of promising models \(s_1, \ldots , s_m\) have already been selected for evaluation, based on their validation performances \({\hat{\eta }}_{s_1}, \ldots , {\hat{\eta }}_{s_m}\) and following some selection rule. We call the models selected for evaluation the set of preselected models. In addition, we suppose that retraining of the preselected models on the entire learning data has already been performed, yielding models \(\hat{{\varvec{\beta }}}_{s_1}, \ldots , \hat{{\varvec{\beta }}}_{s_m}\). Also, suppose that the associated performance estimates \({\hat{\theta }}_{s_1}, \ldots , {\hat{\theta }}_{s_m}\) have been obtained based on the predictions from the hold-out evaluation set, and a final model \(s \in \{s_1, \ldots , s_m\}\) has been selected due to its evaluation performance \({\hat{\theta }}_s\) following some (other) selection rule.
Section 2 gives the details of our proposed method. In Sect. 3 we show a selection of results from our simulation experiments; the complete presentation can be found in the Supplement Information. We apply our proposed approach to several real-world data sets in Sect. 4. Our presentation ends with a discussion in Sect. 5.
2 Method
For brevity, let \(j = 1, \ldots , m\) denote the preselected models instead of \(s_1, \ldots , s_m\) and let \(s \in \{1, \ldots , m\}\) denote the final selected model, that is the model with the most promising evaluation performance \({\hat{\theta }}_s\), which is a function of that model’s evaluation predictions \({\hat{y}}_{1s}, {\hat{y}}_{2s}, \ldots , {\hat{y}}_{ns}\), where n is the size of the evaluation set at hand. Note that this estimate \({\hat{\theta }}_s\) of the generalization performance is subject to selection bias and therefore overly optimistic. To compute our proposed multiplicity-adjusted bootstrap tilting (MABT) lower confidence bound we only need the predictions \({\hat{y}}_{ij}\) of all competing preselected models \(j = 1, \ldots , m\) on the evaluation set and the associated true class labels \(y_i\), \(i = 1, \ldots , n\). For instance, in case the performance measure of interest is prediction accuracy, the predictions are the predicted class labels; in case of the AUC, they may be class probabilities.
2.1 Bootstrap resampling
Our proposed confidence bounds come from BT with a multiplicity adjustment due to the simultaneous evaluation of multiple candidate models. The general idea is to first estimate the tilted distribution under hypothetical values for the performance measure, and then to adjust for multiplicity in order to finally calibrate the tilting parameter \(\tau\) accordingly. Note that no additional model training is needed here once the models have been preselected and retrained using only the learning set. Thus, the predictions from the evaluation set can be understood as conditional on the trained models and on the learning data. We draw bootstrap resamples from the set of observation indices \(\{1, \ldots , n\}\) in the evaluation set. For each of the candidate models and each of the resamples, we estimate the performance from this resample. This way, for each resample \(b = 1, \ldots , B\) and each of the competing models \(j = 1, \ldots , m\), we obtain a bootstrap performance estimate \({\hat{\theta }}_{bj}^*\). (Quantities that carry the \(*\) superscript always indicate bootstrap quantities.) In the following, we use these bootstrap performance estimates to estimate various empirical cumulative distribution functions.
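A minimal R sketch of this resampling step for the accuracy case follows; `correct` is a hypothetical \(n \times m\) matrix of 0/1 correctness indicators of the m preselected models on the n evaluation observations, and all names are illustrative.

```r
# One shared set of B bootstrap resamples of the evaluation indices, giving a
# B x m matrix of bootstrap performance estimates theta*_{bj} (accuracy case).
bootstrap_performances <- function(correct, B = 10000) {
  n   <- nrow(correct)
  idx <- matrix(sample.int(n, n * B, replace = TRUE), nrow = B)    # resample indices
  theta_star <- t(apply(idx, 1, function(i) colMeans(correct[i, , drop = FALSE])))
  counts     <- t(apply(idx, 1, tabulate, nbins = n))   # M*_{i,b}, reused for W_b(tau)
  list(theta_star = theta_star, counts = counts)
}
```

The resampling counts \(M^*_{i,b}\) are kept alongside the performance estimates because they are needed later for the importance sampling weights \(W_b(\tau)\).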
2.2 Tilting
The first of these empirical cumulative distribution functions stems from the estimation of conventional BT confidence intervals. From the bootstrap performance estimates \({\hat{\theta }}_{bs}^*\) of the final selected model s, we estimate the empirical tilted cumulative distribution function
\({\hat{F}}_{s, \tau }^* (t) = \frac{1}{B} \sum _{b=1}^{B} W_b (\tau ) \, I \bigl \{ {\hat{\theta }}_{bs}^* \le t \bigr \} ,\)   (12)
where I is the indicator function, with importance sampling weights \(W_b(\tau )\), as introduced in Sect. 1. We plug the observed performance \({\hat{\theta }}_s\) of the final selected model s into \({\hat{F}}_{s, \tau }^*\) to get \({\hat{F}}_{s, \tau }^* ({\hat{\theta }}_s)\). Later, we plug this value into another distribution function that we derive next. This will provide the multiplicity correction that we need to account for the evaluation of multiple models.
2.3 Multiplicity correction and calibration
Since the performance estimates \({\hat{\theta }}_{bj}^*\) all come from different and not necessarily comparable models, we need to transform them appropriately to bring them to a comparable scale and compute a multiplicity-adjusted (1 minus) p-value that can be used later on to calibrate the tilting parameter \(\tau\) and finally obtain the desired lower confidence bound. For each of the models j, we estimate the empirical cumulative distribution function
\({\hat{F}}_{j}^* (t) = \frac{1}{B} \sum _{b=1}^{B} I \bigl \{ {\hat{\theta }}_{bj}^* \le t \bigr \}\)
from the bootstrap performance estimates \({\hat{\theta }}_{bj}^*\). Plugging these into \({\hat{F}}_j^*\) for each j yields transformed bootstrap performance estimates \({\hat{u}}_{bj}^*\) that are (approximately) uniformly distributed on the unit interval. Similar to a maxT multiplicity correction, for each bootstrap resample b, among the transformed estimates \({\hat{u}}_{bj}^*\), we now identify the maximum estimate \({\hat{u}}_{b, \max }^* = \max _{j=1}^m {\hat{u}}_{bj}^*\). From these maximum values, we estimate
\({\hat{F}}_{\max }^* (u) = \frac{1}{B} \sum _{b=1}^{B} I \bigl \{ {\hat{u}}_{b, \max }^* \le u \bigr \}\)   (13)
and refer to it as the maximum empirical cumulative distribution function.
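In R, this maxT-type step can be sketched as follows; `theta_star` is the hypothetical \(B \times m\) matrix from the resampling sketch above.

```r
# Transform each model's bootstrap estimates by its own empirical CDF and take
# the per-resample maximum; returns F*_max of Eq. (13) as an R function.
max_ecdf <- function(theta_star) {
  u_star <- apply(theta_star, 2, function(v) ecdf(v)(v))   # u*_{bj}, approx. uniform
  ecdf(apply(u_star, 1, max))                               # empirical CDF of u*_{b,max}
}
```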
2.4 Lower confidence bound
In the previous paragraphs, we estimated two empirical cumulative distribution functions, \({\hat{F}}_{s, \tau }^*\) and \({\hat{F}}_{\max }^*\) in Eqs. (12) and (13), respectively. The former estimates the tilted cumulative distribution function while the latter corrects for multiplicity. We next bring those two together and state the calibration task to obtain a conditional lower confidence bound for the performance of the final selected model s from our proposed MABT approach: Find a value of the tilting parameter \(\tau < 0\) such that
\({\hat{F}}_{\max }^* \Bigl ( {\hat{F}}_{s, \tau }^* \bigl ( {\hat{\theta }}_s \bigr ) \Bigr ) = 1 - \alpha .\)   (14)
Once this specific value for \(\tau\) has been found, we finally obtain the desired lower confidence bound via
\({\hat{\theta }}_{s, L} = \theta _s ({\varvec{p}}_\tau ) ,\)
that is, the performance of model s evaluated under the tilted weights \(p_1(\tau ), \ldots , p_n(\tau )\).
We implemented this numerically as a root-finding problem which calibrates the tilting parameter \(\tau\) conservatively, that is, Eq. (14) holds with \(\ge\). An R implementation of our proposed confidence bounds can be accessed via a public GitHub repository at https://github.com/pascalrink/mabt. It is able to compute our proposed confidence bounds for the prediction accuracy and the AUC of a prediction model.
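To make the whole procedure tangible, the following self-contained R sketch condenses Sects. 2.1–2.4 for the accuracy case. It is not the reference implementation from the repository above: the bracket for \(\tau\), the number of resamples, and all names are our own choices, the non-degenerate case of an observed accuracy strictly between 0 and 1 is assumed, and for numerical stability the tilted CDF value in Eq. (14) is computed as one minus the importance-sampling tail estimate of Eq. (10).

```r
# Condensed MABT sketch for prediction accuracy.
# correct: hypothetical n x m matrix of 0/1 correctness indicators on the evaluation set.
mabt_lower_bound <- function(correct, alpha = 0.05, B = 10000) {
  n <- nrow(correct)
  s <- which.max(colMeans(correct))               # final selected model
  z <- correct[, s]
  theta_hat <- mean(z)                            # observed evaluation accuracy of s

  idx <- matrix(sample.int(n, n * B, replace = TRUE), nrow = B)
  M   <- t(apply(idx, 1, tabulate, nbins = n))    # resampling counts M*_{i,b}
  theta_star <- t(apply(idx, 1, function(i) colMeans(correct[i, , drop = FALSE])))

  # maxT-type multiplicity correction, cf. Eq. (13)
  u_star <- apply(theta_star, 2, function(v) ecdf(v)(v))
  F_max  <- ecdf(apply(u_star, 1, max))

  # importance-sampling estimate of P_{F_tau}(theta*_s >= theta_hat), cf. Eq. (10);
  # one minus this value approximates F*_{s,tau}(theta_hat) of Eq. (12)
  tail_prob <- function(tau) {
    p    <- exp(tau * z) / sum(exp(tau * z))
    logW <- drop(M %*% log(n * p))
    mean(exp(logW) * (theta_star[, s] >= theta_hat))
  }

  # calibrate tau < 0 such that F*_max(F*_{s,tau}(theta_hat)) = 1 - alpha, cf. Eq. (14)
  tau_cal <- uniroot(function(tau) F_max(1 - tail_prob(tau)) - (1 - alpha),
                     interval = c(-10, -1e-6))$root
  p_cal <- exp(tau_cal * z) / sum(exp(tau_cal * z))
  sum(p_cal * z)                                  # lower bound: accuracy of s under p_tau
}
```

As in the single-model sketch, only one set of bootstrap resamples is drawn; both the multiplicity correction and the calibration reuse these resamples, which is why no additional model training or prediction is required.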
Figure 4 illustrates our proposed approach in case the performance measure is prediction accuracy, and a formal algorithmic description is given in Section 2 of the Supplement Information. In summary, we estimate a BT confidence bound using a significance level adjusted by the maximum distribution \({\hat{F}}_{\max }^*\) of all the preselected prediction models \(j = 1, \ldots , m\). This yields simultaneous confidence bounds \(({\hat{\theta }}_{1, L}, \ldots , {\hat{\theta }}_{m, L})\) for all the preselected models j, that is, Eq. (1) holds. This way, due to Eq. (2), we obtain a valid confidence bound for the prediction performance of any model \(j = 1, \ldots , m\) and hence a valid confidence bound \({\hat{\theta }}_{s, L}\) for the selected model s.
3 Simulation experiments
We now investigate the goodness of our proposed confidence bounds in extensive simulation experiments and compare them to existing standard approaches in terms of four aspects: coverage probability, size of the lower confidence bound, true performance of the final selected model, and the distance between the true performance and the lower bound, which we call tightness for brevity. We expect our procedure to produce models with better true predictive performance than standard methods due to the advantageous way we select models for evaluation (see Fig. 3) in comparison to the default pipeline (see Fig. 1). On the other hand, because of the multiplicity we need to account for, the question is whether this advantage also translates into larger lower confidence bounds, as these are the main quantity being reported. In other words, there is no real advantage in having found a better prediction model if we are unable to identify it as such.
The R code used for our simulation experiments and for all the associated figures can be found in a publicly accessible GitHub repository at https://github.com/pascalrink/mabt-experiments.
3.1 Data generation
Using a standard 50%/25%/25% split, we sample the training, validation, and evaluation data as well as another larger data set of 20,000 observations, all from the same distribution. The latter serves as a ground truth from which we derive the true model performances. We consider two different setups, feature case A and feature case B, for generating the features and deriving the true class labels from them. They represent different degrees of data complexity. Nonetheless, both setups lead to balanced classes.
In the first, simpler, and somewhat synthetic feature case A, we draw uncorrelated random numbers from the standard normal distribution and put them into the feature matrix \({\varvec{X}} = ({\varvec{x}}_i)_i\) with columns \({\varvec{x}}_i\). We specify a sparse true coefficient vector \({\varvec{\beta }}\) with only 1% non-zero coefficients and obtain the true class labels using the inverse logit function and a vector of uncorrelated observations \(u_i\) that are uniform on the unit interval,
\(y_i = I \bigl \{ u_i \le \bigl [ 1 + \exp \bigl ( - ({\varvec{X}} {\varvec{\beta }})_i \bigr ) \bigr ]^{-1} \bigr \} .\)
Specifically, we draw 1000 features and choose \({\varvec{\beta }}= c \cdot (1, 1, \ldots , 1, 0, 0, \ldots , 0)\) to have only 10 non-zero entries, with signal strength \(c = 2\).
In feature case B, we use the twoClassSim R function from the caret R package (Kuhn, 2008). This function generates a feature matrix that includes linear effects, non-linear effects, and noise variables, each both uncorrelated and correlated with constant correlation \(\rho = 0.8\), and produces the true class labels with \(1\%\) mislabeled data. The full description of the feature and label generation is presented in the Supplement Information in Section 3. We try to mimic a more complex case here, which is closer to real-world applications than feature case A.
3.2 Model training
As to model training, we consider both a linear and a non-linear classifier. In the linear case, we train lasso models with 100 equidistant values for the tuning parameter \(\lambda\) between zero and the maximum regularization value \(\lambda _{\max } = \min \{ \lambda > 0 :\hat{{\varvec{\beta }}}_\lambda = {\varvec{0}} \}\), which is the smallest tuning parameter value such that none of the features is selected into the model, where \(\hat{{\varvec{\beta }}}_\lambda\) is the estimate of the true coefficient vector \({\varvec{\beta }}\) from a lasso regression with regularization parameter value \(\lambda\). Note that \(\lambda _{\max }\) depends on the input training data; we leave its computation to the glmnet R function from the glmnet R package (Friedman et al., 2010).
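A hedged sketch of this candidate grid in R follows; `x_train`, `y_train`, and `x_valid` are hypothetical data objects, and glmnet is the package cited above.

```r
# 100 lasso candidates on an equidistant lambda grid between 0 and lambda_max.
library(glmnet)

lambda_max  <- glmnet(x_train, y_train, family = "binomial")$lambda[1]  # largest path value
lambda_grid <- seq(lambda_max, 0, length.out = 100)                     # decreasing grid
fits        <- glmnet(x_train, y_train, family = "binomial", lambda = lambda_grid)

# Predicted class labels of all 100 candidate models on the validation data
pred <- predict(fits, newx = x_valid, type = "class")    # n_valid x 100 matrix
```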
In the non-linear case, we train random forests, using the tuneRanger R function from the tuneRanger R package, which automatically tunes the hyperparameters (number of features to possibly split at in each node, minimal node size, fraction of observations to sample) with model-based optimization (Probst et al., 2018). Again, we end up with 100 competing models.
3.3 Performance estimation
In our proposed pipeline (see Fig. 3), performance estimation happens at two different stages. First, the performances \({\hat{\eta }}_1, \ldots , {\hat{\eta }}_r\) of all candidate models are estimated. One way to do this is to train the models using the training data and estimate their performances using the validation data. Another option is to perform ten-fold CV on the entire learning data, which is the combination of training and validation data. We expect these estimates to be less dependent on the split of the data into training and validation sets and, thus, to lead to better final selections. A third option is to use a resampling-based approach to obtain an estimate of model performance, as implemented in the tuneRanger R function that we use to tune the random forests.
In case of the linear lasso classifier, we consider both the prediction accuracy and the AUC, but for brevity we refrain from computing non-cross-validated estimates for the AUC. In case of the non-linear random forest classifier, we only consider the prediction accuracy estimate obtained from the tuneRanger R function. Note that when CV is used, the resulting performance estimates are neither conditional on the trained model nor validation performances in the sense of the non-cross-validated estimates; rather, they are unconditional performances averaged over the ten folds. The same applies to the resampling-based approach. However, we use these validation performances only to identify promising models to preselect for evaluation later on.
Once models are preselected for evaluation based on their validation performances, they are refitted using the entire learning data before their generalization performances are estimated using the evaluation data. This is the second and final round of performance estimation.
3.4 Model selection
As with performance estimation, model selection happens at two different stages, too, see Fig. 3. First, promising models are identified based on their validation performance and preselected for evaluation. Second, a final model s is selected among them, based on its evaluation performance \({\hat{\theta }}_s\). The two selection rules do not necessarily need to be the same. (Note that it is even possible to use one performance measure for preselection and another for the final selection, although we do not consider this here.)
In our simulation experiments, based on their validation performance, we preselect for evaluation either the single best model, or the top 10% of the models. In case the CV performance is used for selection, in addition to the single best and the top 10% selection rule, we select all the models with CV performance within one CV standard error of the best model (the within 1 SE selection rule). This is more adaptive than the top 10% rule; when the candidate models all perform comparably well on the validation set, more models will be selected, while it is possible that only a few candidate models will be preselected when there are a few models that perform clearly better than the rest. Because we use a maxT-type multiplicity correction (see Sect. 2.3), a larger number of candidate models usually leads to a stronger correction, which might make the resulting confidence interval unnecessarily conservative; if more models are preselected for evaluation the probability is higher that any one of them has a better performance in a bootstrap resample than the final selected model, which shifts the maximum distribution to the right.
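For the lasso candidates, the within 1 SE rule can be sketched with the cross-validation output of glmnet; `x_learn`, `y_learn`, and the grid `lambda_grid` are hypothetical (the latter as in the sketch of Sect. 3.2), and misclassification error is used in place of accuracy, which leads to the same selection.

```r
# "Within 1 SE" preselection based on ten-fold cross-validated misclassification error.
library(glmnet)

cv_fit <- cv.glmnet(x_learn, y_learn, family = "binomial",
                    type.measure = "class", nfolds = 10,
                    lambda = lambda_grid)

best        <- which.min(cv_fit$cvm)                   # smallest CV error
threshold   <- cv_fit$cvm[best] + cv_fit$cvsd[best]    # best error + 1 SE
preselected <- which(cv_fit$cvm <= threshold)          # indices into lambda_grid
```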
In general, preselecting multiple models for evaluation instead of only a single one is reasonable because of the uncertainty in the validation performance estimates since we do not want to exclude a promising model from evaluation just because it scored slightly worse in the validation than another model.
The preselected models are then refitted using all training and validation data. In any case, in the final selection stage, we select the one model s with the best evaluation performance \({\hat{\theta }}_s = \max _{i = 1}^m {\hat{\theta }}_i\) and report a lower confidence bound \({\hat{\theta }}_{s, L}\) for it.
It is possible that multiple models have the same evaluation performance. We then select the least complex of those. In case of the lasso classifier this is the model with the largest value of the tuning parameter \(\lambda\). In case of the random forest classifier, we use the computation time for model fitting as a proxy for the triplet of hyperparameters and, thus, as a measure of model complexity, and we regard models with shorter computation times as less complex.
3.5 Confidence bounds
We compute our MABT lower confidence bounds using a significance level of \(\alpha = 0.05\) and 10,000 bootstrap resamples when the performance measure of interest is the prediction accuracy, and 2000 resamples in case of the AUC to limit computation time. We use the R function auc from the R package pROC to estimate the AUC (Robin et al., 2011).
Table 1 provides a summary of the confidence bounds estimated for each configuration. In case of prediction accuracy, we compare our proposed bounds to a number of existing standard approaches, all at \(\alpha = 0.05\): the Wald normal approximation interval, which is known to sometimes fail to reach the nominal coverage probability, especially when the sample size is small; the Wilson interval, which is an improvement over the Wald interval in many respects as it allows for asymmetric intervals, incorporates a continuity correction, and can also be used when the sample size is small; the Clopper-Pearson (CP) exact interval, which uses the binomial and, thus, the correct distribution rather than an approximation to it; and, in case of the lasso classifier, the default BT confidence interval as presented in Sect. 1. For the AUC, we again take the BT confidence interval for comparison, as well as DeLong intervals, which are the default choice for an asymptotic interval here, and Hanley-McNeil (HM) intervals, which use a simpler variance estimator. To compute the DeLong intervals we use the R function ci.auc from the pROC R package (Robin et al., 2011).
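For the DeLong comparator, a hedged usage sketch with the cited pROC functions looks as follows; `y_eval` and `prob_eval` are hypothetical evaluation labels and predicted class probabilities of the final selected model.

```r
# AUC point estimate and DeLong interval via pROC.
library(pROC)

roc_obj <- roc(y_eval, prob_eval)
auc(roc_obj)                                           # point estimate of the AUC
ci.auc(roc_obj, conf.level = 0.90, method = "delong")  # two-sided 90% CI; its lower
                                                       # endpoint is a one-sided 95% bound
```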
In those scenarios where multiple models are evaluated simultaneously, a multiplicity adjustment is necessary. Our proposed MABT confidence bounds control for such multiplicity, whatever the selection rule applied. For the standard methods in the competition, however, we apply the Šidák correction and estimate the confidence bound using a reduced, adjusted significance level of \(\alpha _{\textrm{Sidak}} (m) = 1 - (1 - \alpha )^{1 / m}\), where m is the number of preselected models. We can safely assume that the predictions from the various candidate models are not negatively dependent and, thus, we choose the Šidák over the Bonferroni adjustment (Dickhaus, 2014), since it is less conservative.
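As an illustration of such an adjusted comparator, the one-sided Wilson score lower bound (shown here without continuity correction for simplicity; the formula is the textbook version, not taken from the paper) at the Šidák-reduced level can be computed directly:

```r
# Sidak-adjusted one-sided Wilson score lower bound for prediction accuracy.
wilson_lower <- function(x, n, alpha) {
  z  <- qnorm(1 - alpha)
  ph <- x / n
  (ph + z^2 / (2 * n) - z * sqrt(ph * (1 - ph) / n + z^2 / (4 * n^2))) /
    (1 + z^2 / n)
}

m <- 10; alpha <- 0.05
alpha_sidak <- 1 - (1 - alpha)^(1 / m)     # reduced level for m evaluated models
wilson_lower(x = 82, n = 100, alpha = alpha_sidak)   # illustrative numbers
```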
3.6 Results
Next we present the results from 5000 simulation runs in case of the lasso classifier and 10,000 runs in case of the random forest classifier, for each combination of simulation parameters. Figure 5 presents the overall observed coverage probabilities of the seven interval methods over all simulation experiments that use CV for validation performance estimation, that is, over all performance measures, selection rules, sample sizes, feature generation methods, and both classifiers. (In none of our simulation runs did we encounter the issue described in Eq. (11).) We aim for 95% nominal coverage.
Overall, our proposed approach yields the most reliable observed coverage probabilities. They are less conservative than all the other bounds in the proposed pipeline and even less conservative than the CP bounds in the default pipeline. In addition, the DeLong, HM, and Wald intervals are too liberal in the default pipeline.
In terms of computation time, the additional burden of MABT mainly depends on the number of bootstrap resamples drawn and the performance measure considered. For prediction accuracy the additional burden is small. On the other hand, the computation of the AUC for each candidate model in each bootstrap resample takes considerably more time. The calibration of the tilting parameter adds only little extra time. An overview of the computation times for the different methods is given in Table 2, for the different sample sizes and performance measures when employing the single best and top 10% selection rules for the standard and the proposed method, respectively.
3.6.1 Lasso classifier
The total runtime of the lasso experiments is about 2 days on a modern CPU with 50 threads. For brevity, we only present a selection of more detailed results of the lasso simulations here, namely for the most competitive approaches in terms of coverage probability and size of the lower confidence bounds. The reasons for omitting certain results from the presentation here, as well as the omitted results themselves, can be found in the Supplement Information in Section 3. For prediction accuracy, this leaves us with the following competition: Wilson unadjusted confidence bounds for the single best performing model; BT confidence bounds for the single best performing model; and our proposed MABT confidence bounds for the best model of the within 1 SE selection of models. For AUC, the competition is the same as for prediction accuracy with DeLong replacing Wilson. In any case, for preselection, the validation performance is computed using the CV estimate from the learning data. Since we split the data into 50% training, 25% validation, and 25% evaluation data, for the selected results presented here, this is effectively a 75%/25% learning/evaluation split because for CV we use both the training and validation data.
In the results we see that our proposed confidence bounds are somewhat conservative in the simpler setups, that is feature case A (see Sect. 3.1) with prediction accuracy, and less conservative in the more complex setups, that is feature case B or in case of AUC, see Figs. 6 and 7. At the same time, the adverse effect of conservatism is not pronounced and does not directly translate into smaller lower bounds, see Figs. 8 and 9. Similar observations have already been made in Hall (1988), where other conservative bootstrap intervals were not directly related to larger intervals. Also, the confidence bounds from our proposed method are less variable than the competing methods. Regarding tightness, we cannot draw a final conclusion, see Figs. 10 and 11. Note however that in some AUC simulation configurations the BT and the DeLong method yield too liberal intervals such that the comparison of our proposed confidence bounds to those in terms of size of the lower bound and tightness is unfair. (Too liberal intervals are identified as those whose coverage falls below \(0.9469 = 1 - \alpha - \sqrt{(1-\alpha ) \cdot \alpha / 5000}\), which is the desired coverage probability minus one standard error due to the finite number of simulation runs.) Nevertheless, especially in case of the smaller sample size, there is a visible gain in final model performance, see Figs. 12 and 13.
Since the estimated confidence bounds come in matched sets (per simulated data set), Table 3 displays the absolute number and proportion of simulated data sets in the lasso setting in which the MABT bound is larger than the CP and Wilson bound, respectively. We only consider those instances here where the particular confidence interval covers the true performance. In case the interval of one method covers the true value and another method does not, the former is always regarded as superior to the latter. In a large part of the instances in which the interval covers the true performance, the MABT lower bounds are superior to those of the competing methods. Also, BT turns out to be very competitive here, which is an interesting observation by itself.
Similarly, we can compare the final model performances, see Table 4. The proposed ML pipeline yields models with strictly better true performance in 42 to 61% of all cases, and models that are at least equally as good in 66 to 73% of all cases.
3.6.2 Random forest classifier
Next we present the results of the random forest simulations. Similar to the lasso case, we only show a selection of more detailed results here. The omitted results can be found in the Supplement Information in Section 4.2. In addition to the MABT confidence bounds for the best model of the top 10% preselected models, we thus present the CP and Wilson unadjusted confidence bounds for the one model with the best validation performance. Again, we perform the learning on 75% of the data and evaluate on the remaining 25%.
The total runtime is about 3.5 days on a modern CPU with 60 threads. Overall, we get similar results as in the lasso case: the MABT bounds are slightly conservative (see Fig. 14), but also slightly larger and tighter than the CP and Wilson bounds (Figs. 15 and 16). Also, the proposed pipeline yields a slight improvement in the final model performance (Fig. 17).
As before, the results come in matched sets, which allows for a per-data set comparison of the confidence interval methods and ML pipelines regarding the size of the lower bound as well as the true prediction performance. MABT lower bounds are larger than CP and Wilson bounds in the majority of data sets, see Table 5, and the proposed pipeline yields better (at least equally good) final models in 59% (64%), 59% (63%), and 58% (63%) of the simulated data sets with evaluation sample sizes 50, 100, and 150, respectively.
4 Applications to real-world data
In this section, we present some applications of our proposed approach on real-world data sets. The first example shows what can go wrong and to what extent when only one model is evaluated. In the second example we benchmark MABT confidence bounds against standard methods on a variety of data sets.
4.1 Cardiotocography data
In this first application we apply our proposed ML pipeline and the MABT confidence bounds to the Cardiotocography data set (Ayres-de Campos & Bernardes, 2010) from the UCI ML Repository. Westphal and Brannath (2020) used this data set to demonstrate that the evaluation of multiple models can be beneficial, and our application resembles their setup. The question is whether the potential gains outweigh the losses suffered through the multiplicity correction.
The Cardiotocography data set contains medical data related to fetal monitoring during pregnancy. Cardiotocography is a standard procedure used to assess the well-being of a fetus by monitoring its heart rate patterns, uterine contractions, and other relevant physiological parameters. The data set contains 2126 complete cardiotocograms which were analyzed by three expert obstetricians. Depending on the degree of anomaly observed, these experts assigned a consensus label to each CTG. For more details on this data set, see Ayres-de Campos et al. (2000). We dichotomize the ordinal label into two classes and want to predict a suspect or pathologic (abnormal) state (471 instances) versus a normal state (1655 instances) from 23 features of fetal heart rate.
Each CTG instance contains the date on which the measurement was taken. We use this information to split the data (but not as a feature in model training), reflecting the time lag in real-world applications: we learn on the first 75% of the instances, preselect promising models using the within 1 SE rule, and evaluate them on the last 25% of the instances.
During learning, we fit 100 models from the elastic net class (Zou & Hastie, 2005; Friedman et al., 2010) with five equidistant values between 0 and 1 for the tuning parameter \({\tilde{\alpha }}\) that balances the lasso and ridge penalty, and 20 equidistant values between 0 and \(\lambda _{\max }({\tilde{\alpha }})\) for the overall strength of regularization, see Sect. 3.2. We use ten-fold cross-validation to obtain the validation performance estimate for each of the models on the learning data, and preselect candidates for evaluation using the within 1 SE selection rule. We then refit the preselected models on the entire learning data and select the one model with the highest evaluation performance among them for confidence bound estimation. For comparison, we compute the confidence intervals using the methods listed in Table 1, except the BT-based intervals. We use 10,000 bootstrap resamples to compute the MABT bound. The total computation time is 15 s on a modern CPU (single core).
Table 6 shows the validation and evaluation ranks and performances for each preselected elastic net. We observe a clear drop in performance when going from the CV estimate to the evaluation estimate on previously unseen data. This might be due to the data splitting based on the CTG measurement date, which would hint at some systematic deviations.
The procedure selects nine models for evaluation using the within 1 SE selection rule with cross-validated prediction accuracy estimates between 93.04 and 93.54 %. It turns out that the model with the best validation performance has the worst evaluation performance among the preselected models. In addition, the model with the best evaluation performance has the second-lowest validation performance of all the preselected models.
This of course has an effect on the confidence intervals, too. Table 7 shows the lower confidence bound estimates for the final selected model; employing the proposed pipeline of preselecting multiple models for evaluation and subsequently computing an MABT confidence bound leads to the selection of a superior model and to a larger lower confidence bound for the generalization performance, in comparison to the default pipeline and standard methods.
Admittedly, the presented situation is a very unfortunate one, since a different allocation of the learning data to the CV folds might have prevented this phenomenon from happening. Indeed, when we use a ten-times repeated ten-fold CV scheme instead of a single run of CV, the evaluation results turn out to be much closer together, as presented in Table 8, with a total computation time of 21 s. While the best model in the validation is still the second-best model in the evaluation, there are still some rather unexpected differences between the validation and the evaluation ranks. For example, the second-best model in the validation turns out to be the worst in the evaluation, with a considerably inferior evaluation performance. However, the lower confidence bound estimates are rather similar, as displayed in Table 9.
The question that comes to mind now is whether repeated CV is the appropriate tool to prevent situations as presented in Table 6 from happening. In fact, this is not always true, see Section 6 in the Supplement Information for details.
Because these are only selected cases, we additionally run the selection-evaluation scheme 100 times, using ten-times repeated ten-fold CV, with different allocations of the learning data to the CV folds in order to investigate the average behavior of the presented approaches. We observe that in all instances the MABT lower confidence bound is the largest among the competing methods except for the unadjusted Wald bound; here, the MABT lower bound is larger in 91 of 100 cases. Note however that the Wald interval turned out too liberal in our simulation results in Sect. 3.6. Also, in 10 of these 100 instances the user would have reported a subpar lower confidence bound, due to a subpar selection, had they used the default pipeline instead of the proposed pipeline (together with MABT). In these cases the default approach yields lower bounds of about 70% while MABT yields a bound of about 75%, which means a relative gain of 7%. That is, MABT achieves \(1 - (1-0.75)/(1-0.7) = 17\%\) of the maximum achievable improvement over the default approach.
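For clarity, the three summary quantities reported here (and in the OpenML benchmark figures below) are computed as follows, using the rounded numbers from the text:

```r
# Absolute difference, relative gain, and share of the maximum possible
# improvement for a default-pipeline bound of 0.70 and an MABT bound of 0.75.
theta_default <- 0.70
theta_mabt    <- 0.75
theta_mabt - theta_default                           # absolute difference: 0.05
theta_mabt / theta_default - 1                       # relative gain: ~7%
(theta_mabt - theta_default) / (1 - theta_default)   # share of max. improvement: ~17%
```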
Overall, MABT produces the largest lower confidence bounds, but the margin often is not too large. Perhaps more importantly, the proposed pipeline together with MABT protects the user from selecting a subpar model and, thus, reporting a smaller lower confidence bound. The information whether the allocation of the data will lead to a subpar selection remains hidden to the user; they would only notice it in the evaluation if at all. At the same time, the results here show that there is a gain in using MABT confidence bounds beyond the gains of repeated CV and the selection of a better model.
4.2 OpenML benchmark
In this second application, we use data sets from OpenML (Vanschoren et al., 2013) to compare our proposed MABT confidence bounds against standard methods. The OpenML platform is used as a collaborative hub for ML research and facilitates sharing, exploration, and analysis of ML data, tasks, and experiments. For our purposes here, we follow the setup of Probst et al. (2018), who utilize the OpenML platform to benchmark their tuneRanger function against other random forest tuning implementations, using the OpenML100 benchmarking suite and the OpenML R package (Casalicchio et al., 2017).
We only use data sets from OpenML without any missing values and whose associated ML tasks are expected to complete within 10,000 s. In some of the data sets, the random forest classifier reaches near perfect prediction performance. To make the classification problems more difficult, we add a small amount of noise to the true class labels and split the data into 75% learning and 25% evaluation data. We fit random forests to the learning data using the tuneRanger R function as in the simulations in Sect. 3, and estimate the conditional prediction accuracy using the evaluation data. For preselection, we sort the candidate models by complexity (computation time, see Sect. 3.4) and use the top 10% selection rule. For the final model selection, less complex models are again preferred over more complex ones. Finally, MABT lower confidence bounds for the prediction accuracy of the final selected model, which is the model with the highest accuracy on the evaluation data, are again obtained using 10,000 bootstrap resamples.
We assess the MABT bounds in two different setups. In the first one, for each data set, we perform learning, selection, and evaluation only once; this is what a user would usually do in practice. The total computation time for this single-run setup is 50 min for the binary classification tasks and 9.5 h for the multi-class tasks, on a modern CPU with 40 threads. In the second one, however, for each data set we repeat learning, selection, and evaluation ten times and average the resulting bounds before the comparison. We do this in order to account for possibly unfortunate situations like we encountered before in Sect. 4.1. This allows a fairer assessment of the performance of our proposed approach.
For the binary classification tasks, we obtain 33 complete data sets from the OpenML platform. A detailed overview of the data sets considered here is placed in the Supplement Information in Section 7. In a per-data set comparison, the MABT lower bound is larger than the CP, Wald, and Wilson bounds in 27, 20, and 26 data sets, respectively. Figure 18 shows the per-data set absolute differences and relative gains in prediction accuracy between the MABT and the CP, Wald, and Wilson bound, respectively. In addition, it shows what percentage of the maximum possible improvement we achieve by using MABT over the comparator.
Fig. 18 Results of the single-run binary classification OpenML benchmark. Presented are the absolute differences and relative gains of MABT lower bounds in comparison to CP, Wald, and Wilson bounds, and how much of the maximum possible improvement over the comparators MABT achieves. The gains are mostly small, but there are a number of data sets in which the proposed approach leads to considerably fewer wrong predictions
Fig. 19 Results of the repeated-run binary classification OpenML benchmark. Presented are the absolute differences and relative gains of MABT lower bounds in comparison to CP, Wald, and Wilson bounds, and how much of the maximum possible improvement over the comparators MABT achieves. Again, the gains are mostly small. However, there are still some data sets in which the proposed approach leads to considerably fewer wrong predictions than the default pipeline, even though this is less pronounced than in the corresponding single-run benchmark
We conclude that in these binary classification problems our proposed approach leads to mostly small gains over the standard methods. However, there are a number of data sets in which our approach leads to major improvements. These remain visible in the ten-times repeated scenario, though they are less pronounced, see Fig. 19. At the same time, in the latter scenario, the MABT lower bound is larger than the CP, Wald, and Wilson bounds in 31, 21, and 31 data sets, respectively, thus in even more instances.
Regarding the multi-class problems, we obtain 36 complete data sets from OpenML. In the per-data set comparison, the MABT lower bound is larger than the CP, Wald, and Wilson bounds in 20, 16, and 20 instances, respectively. In Fig. 20 the gains of MABT over the standard methods are presented.
Fig. 20 Results of the single-run multi-class classification OpenML benchmark. Presented are the absolute differences and relative gains of MABT lower bounds in comparison to CP, Wald, and Wilson bounds, and how much of the maximum possible improvement over the comparators MABT achieves. The gains are comparably small, as in the binary classification benchmark. However, there are some data sets in which the proposed approach leads to far fewer wrong predictions
Fig. 21 Results of the multi-run multi-class classification OpenML benchmark. Presented are the absolute differences and relative gains of MABT lower bounds in comparison to CP, Wald, and Wilson bounds, and how much of the maximum possible improvement over the comparators MABT achieves. Similar to the transition from the single-run to the multi-run benchmark for binary classification problems, the gains are less pronounced in the multi-run than in the single-run setting, but still considerable
In these multi-class problems the gains of the MABT bounds are mostly small too, but again there are several instances where the MABT lower bounds are remarkably larger than those of the competing methods. The results for these extreme examples are presented in more detail in the Supplementary Information, Section 7.3. Similar to the binary classification problems, the gains are less pronounced in the ten-times repeated scenario, but still visible (see Fig. 21). Again, in this scenario, the MABT bound is larger than the CP, Wald, and Wilson bounds in even more data sets than before (31, 27, and 30, respectively).
5 Discussion
We proposed MABT confidence bounds that allow for valid inference when more than a single candidate prediction model is evaluated. This is especially beneficial in situations where data are scarce, as it eases the allocation of data between learning and evaluation.
Other methods in the literature require the use of CV, are computationally more expensive, cannot be applied to some relevant performance measures, or do not reach the nominal coverage level reliably. In contrast, the proposed approach directly estimates a lower confidence bound for the conditional performance by resampling the model's predictions on the evaluation set, without any additional model training, so that the additional computational burden is small. Furthermore, MABT yields valid confidence bounds for any potential selection rule, any competition of prediction models, even from different model classes, and any performance measure as long as there is a version of it that accepts weights.
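The following sketch illustrates this resampling-only structure for accuracy. It is not the exact MABT algorithm: instead of bootstrap tilting it uses a plain nonparametric bootstrap of the studentized accuracies together with a maxT-type calibration, and it covers only unweighted accuracy. It does show, however, that everything is computed from the models' evaluation-set predictions, with no retraining.

```r
# Illustrative simultaneous lower bounds for the accuracies of several models,
# computed only from their 0/1 correctness indicators on the evaluation set.
# Note: this is a simplified stand-in (plain bootstrap + maxT calibration),
# not the bootstrap-tilting-based MABT procedure itself.
simultaneous_lower_bounds <- function(correct, alpha = 0.05, B = 2000) {
  # correct: n x m matrix, rows = evaluation cases, columns = candidate models
  n <- nrow(correct)
  theta_hat <- colMeans(correct)                       # accuracy estimates
  se_hat <- sqrt(pmax(theta_hat * (1 - theta_hat), 1e-12) / n)
  max_t <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)            # resample evaluation cases
    tb  <- colMeans(correct[idx, , drop = FALSE])
    sb  <- sqrt(pmax(tb * (1 - tb), 1e-12) / n)
    max_t[b] <- max((tb - theta_hat) / sb)             # studentized max over models
  }
  crit <- quantile(max_t, probs = 1 - alpha, names = FALSE)
  list(estimate = theta_hat,
       lower = pmax(theta_hat - crit * se_hat, 0))     # simultaneous lower bounds
}
```

With observed evaluation labels y and the predictions of two candidate models, a call would look like simultaneous_lower_bounds(cbind(pred_a == y, pred_b == y)); a weighted performance measure would replace the plain column means by their weighted counterparts.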
In our simulation experiments, the proposed confidence bounds prove reliable with respect to coverage probability while offering lower confidence bounds comparable to those of competing methods. Also, evaluating multiple candidate models instead of a single one can lead to (far) better final model selections; in various real-world data examples we see that the model selection strongly depends on the data splitting, even when CV or repeated CV is employed. Our proposed approach mitigates this dependence, stabilizes the model selection, even from (repeated) CV, and still yields competitive lower confidence bounds despite the multiplicity correction. In some simulations, BT in the default single-model selection setup yields very competitive lower confidence bounds compared to the other methods, which is an interesting finding in itself, as the application of BT confidence intervals in ML has, to the best of our knowledge, not been studied yet.
Future research could explore selection rules for evaluating multiple models and criteria for deciding when to use which rule. Current approaches such as the within-1-SE and top-10% rules might be inefficient, and more sophisticated rules might spend the granted significance level more efficiently.
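As an illustration of one such existing rule, a within-1-SE selection on the evaluation-set accuracies could look as follows; the binomial standard error and the exact form of the rule are assumptions of this sketch and may differ from the implementation used in our experiments.

```r
# Sketch of a within-1-SE selection rule on evaluation-set accuracies:
# keep every model whose estimate lies within one (binomial) standard error
# of the best model's estimate.
within_one_se <- function(acc, n_eval) {
  se   <- sqrt(acc * (1 - acc) / n_eval)
  best <- which.max(acc)
  which(acc >= acc[best] - se[best])   # indices of the selected candidates
}

within_one_se(acc = c(0.86, 0.84, 0.79), n_eval = 200)  # selects models 1 and 2
```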
Finally, we regard MABT as a valuable and universally applicable tool for ML selection and evaluation tasks, offering practical advantages over standard methods.
Availability of data and materials
The Breast Cancer Wisconsin Diagnostic and Cardiotocography data sets are available online at the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic) and https://archive.ics.uci.edu/dataset/193/cardiotocography, respectively. The OpenML platform can be reached at https://www.openml.org/.
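For convenience, the OpenML data sets can be retrieved with the OpenML R package (Casalicchio et al., 2017); the data set ID below is only a placeholder, and the accessor is shown as we understand the package's interface.

```r
# Retrieve an OpenML data set by its numeric ID (the ID is a placeholder).
library(OpenML)
ds  <- getOMLDataSet(data.id = 31)  # OMLDataSet object
dat <- ds$data                      # data.frame including the target column
```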
References
Ayres-de Campos, D., & Bernardes, J. (2010). Cardiotocography. UCI Machine Learning Repository. https://doi.org/10.24432/C51S4N
Ayres-de Campos, D., Bernardes, J., Garrido, A., et al. (2000). Sisporto 2.0: A program for automated analysis of cardiotocograms. The Journal of Maternal-Fetal Medicine, 9(5), 311–318.
Bates, S., Hastie, T., & Tibshirani, R. (2021). Cross-validation: What does it estimate and how well does it do it? https://doi.org/10.48550/ARXIV.2104.00673
Berk, R., Brown, L., Buja, A., et al. (2013). Valid post-selection inference. The Annals of Statistics, 41(2), 802–837. http://www.jstor.org/stable/23566582.
Casalicchio, G., Bossek, J., Lang, M., et al. (2017). OpenML: An R package to connect to the machine learning platform OpenML. Computational Statistics, 32(3), 1–1. https://doi.org/10.1007/s00180-017-0742-2
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press. https://doi.org/10.1017/CBO9780511802843
DiCiccio, T. J., & Romano, J. P. (1990). Nonparametric confidence limits by resampling methods and least favorable families. International Statistical Review, 58(1), 59–76. http://www.jstor.org/stable/1403474.
Dickhaus, T. (2014). Simultaneous statistical inference. Springer.
Dickhaus, T. (2018). Theory of nonparametric tests. Springer.
Dua, D., & Graff, C. (2017). UCI machine learning repository. http://archive.ics.uci.edu/ml
Efron, B. (1981). Nonparametric standard errors and confidence intervals. The Canadian Journal of Statistics, 9(2), 139–158. http://www.jstor.org/stable/3314608.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. https://doi.org/10.18637/jss.v033.i01
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals. The Annals of Statistics, 16(3), 927–953. http://www.jstor.org/stable/2241604.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). Springer series in statistics. Springer.
Hesterberg, T. (1999). Bootstrap tilting confidence intervals and hypothesis tests. Computing Science and Statistics, 31, 389–393. https://www.researchgate.net/publication/2269406_Bootstrap_Tilting_Confidence_Intervals.
Hothorn, T., Bretz, F., & Westfall, P. (2008). Simultaneous inference in general parametric models. Biometrical Journal, 50(3), 346–363.
Japkowicz, N., & Shah, M. (2011). Evaluating learning algorithms: A classification perspective. Cambridge University Press. https://doi.org/10.1017/CBO9780511921803
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
Probst, P., Wright, M., & Boulesteix, A. L. (2018). Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. https://doi.org/10.1002/widm.1301
Raschka, S. (2018). Model evaluation, model selection, and algorithm selection in machine learning. https://doi.org/10.48550/ARXIV.1811.12808
Robin, X., Turck, N., Hainard, A., et al. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.
Tsamardinos, I., Greasidou, E., & Borboudakis, G. (2018). Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Machine Learning, 107, 1895–1922. https://doi.org/10.1007/s10994-018-5714-4
Vanschoren, J., van Rijn, J. N., Bischl, B., et al. (2013). OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2), 49–60. https://doi.org/10.1145/2641190.2641198
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment (Vol. 279). Wiley.
Westphal, M., & Brannath, W. (2020). Evaluation of multiple prediction models: A novel view on model selection and performance assessment. Statistical Methods in Medical Research, 29(6), 1728–1745. https://doi.org/10.1177/0962280219854487
Wolberg, W., Mangasarian, O., Street, N., et al. (1995). Breast cancer wisconsin (diagnostic). UCI Machine Learning Repository. https://doi.org/10.24432/C5DW2B
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B (Statistical Methodology), 67(2), 301–320. http://www.jstor.org/stable/3647580.
Acknowledgements
The authors acknowledge the provision of the Breast Cancer Wisconsin Diagnostic (Wolberg et al., 1995) and the Cardiotocography (Ayres-de Campos & Bernardes, 2010) data sets at the UCI Machine Learning Repository (Dua & Graff, 2017) as well as the OpenML platform for the provision of a multitude of data sets and infrastructure for our ML experiments (Vanschoren et al., 2013; Casalicchio et al., 2017). The authors would like to express their sincere appreciation to three anonymous reviewers for their constructive feedback and valuable suggestions that significantly improved the manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL. P. Rink acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG) project number 281474342.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethics approval
Not applicable
Consent to participate
Not applicable
Consent for publication
Not applicable