TCR: topologically consistent reweighting for XGBoost in regression tasks

Zühlke, Monty-Maximilian; Kudenko, Daniel

doi:10.1007/s10994-024-06704-x

TCR: topologically consistent reweighting for XGBoost in regression tasks

Open access
Published: 24 February 2025

Volume 114, article number 108, (2025)
Cite this article

You have full access to this open access article

Download PDF

Machine Learning Aims and scope Submit manuscript

TCR: topologically consistent reweighting for XGBoost in regression tasks

Download PDF

Monty-Maximilian Zühlke¹ &
Daniel Kudenko¹

1499 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

Gradient boosted tree ensembles (GBTEs) such as XGBoost continue to outperform other machine learning models on tabular data. However, the plethora of adjustable hyperparameters can exacerbate optimisation, especially in regression tasks with no intuitive performance measures such as accuracy and confidence. Automated machine learning frameworks alleviate the hyperparameter search for users, but if the optimisation procedure ends prematurely due to resource constraints, it is questionable whether users receive good models. To tackle this problem, we introduce a cost-efficient method to retrofit previously optimised XGBoost models by retraining them with a new weight distribution over the training instances. We base our approach on topological results, which allows us to infer model-agnostic weights for specific regions of the data distribution where the targets are more susceptible to input perturbations. By linking our theory to the training procedure of XGBoost regressors, we then establish a topologically consistent reweighting scheme, which is independent of the specific model instance. Empirically, we verify that our approach improves prediction performance, outperforms other reweighting methods and is much faster than a hyperparameter search. To enable users to find the optimal weights for their data, we provide guides based on our findings on 20 datasets. Our code is available at: https://github.com/montymaxzuehlke/tcr.

Prediction of autogenous shrinkage in ultra-high-performance concrete (UHPC) using hybridized machine learning

Article 30 October 2024

Hyperparameter Optimization for Gradient-Boosted Tree-Based Machine Learning Models and Their Effect on Model Performance

Building Automated Computational Models for Predicting Energy Consumption in High-Performance Concrete Production

Article 16 May 2025

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Gradient boosted tree ensembles (GBTEs) (Friedman, 2001), especially XGBoost (Chen & Guestrin, 2016), continue to outperform other machine learning algorithms on tabular data, including deep-learning models (Borisov et al., 2022; Grinsztajn et al., 2022). This is particularly surprising for regression tasks, where the target is continuously distributed, as tree ensembles are merely approximations of continuous functions. However, due to the long list of adjustable hyperparameters, finding the best GBTE can be expensive, and because manual optimisation is infeasible, users need to rely on automated machine learning frameworks such as H2O (LeDell & Poirier, 2020). While these frameworks automate the search for the best combination of parameters, it is unclear whether they will find it, especially within a limited time budget. Similarly, it is uncertain whether a renewed or extended search may lead to improvements and, therefore, justify the added computational costs. What users need are cost-efficient methods to improve their GBTE’s prediction performance independent of the specific model instance. Thus, we propose a new model-agnostic technique to achieve precisely this: retrofitting XGBoost regressors by retraining them with an adjusted weight distribution over the data points.

Similar to classification tasks, there exist approaches for regression tasks to improve a model’s prediction performance via instance weights. However, most of these focus on balancing the target distribution, either indirectly by combining under- and oversampling based on binning as in SMOGN (Branco et al., 2017) or directly via kernel density estimation as in DenseWeight (Steininger et al., 2021). But as rare targets in regression tasks are usually either extremely high or low, increasing their weights may impede a model’s ability to generalise. While predictions may improve for a few selected points, they may degrade for the remaining data (comp. the results obtained by Steininger et al. (2021)). In contrast, we extract weights based on local properties of the target function to retrofit models and improve their prediction performance over the entire target distribution.

In line with the balancing methods employed by SMOGN and DenseWeight, our approach is also data-centric. However, contrary to these methods, we neither sub-sample data points (which amounts to setting weights to 0) nor use unique instance-specific weights. Conversely, we define one consistent weight factor for several, possibly disconnected regions of points, where the target function is less robust towards feature changes. Proving that this robustness information is topologically consistent, meaning that it is locally consistent on sets which are open w.r.t. the topology of the feature space, will be our first theoretical contribution in this work. As the second contribution, we connect this robustness information to the training dynamics of XGBoost regressors to motivate a particular reweighting scheme, which we term topologically consistent reweighting (TCR). In essence, TCR will downweight the least robust elements by some predefined weight factor before the model is retrained. Figure 1 visualises the pipeline.

Through an extensive analysis across 20 datasets, we show that our approach satisfies three crucial criteria for real-world use: (i) the necessary criterion - learned representations differ for the most and least robust regions of the data (meaning the concept of data robustness is "rich" enough in a machine-learning sense); (ii) the sufficient criterion - TCR improves the average generalisation performance of XGBoost regressors while outperforming baselines provided by DenseWeight (Steininger et al., 2021), SMOGN (Branco et al., 2017) and AXIL (Geertsema & Lu, 2023; iii) the economic criterion - computing robustness values is (significantly) faster than a hyperparameter search using H2O (LeDell & Poirier, 2020). With our work, we cater to users who have already optimised and deployed an XGBoost model in a regression setup and seek a cost-efficient and quick improvement method. Because the optimal weight factor in TCR depends on the data, we guide users to select this parameter via decision trees that we fitted to our experiments’ metadata (see Sect. 5.1). To conclude, our contributions are:

We propose a theoretical framework to locate regions of points that differ in how robust their targets are to input variations and connect this information to the training procedure of XGBoost regressors.
We show empirically that this robustness information is "rich" enough for machines to learn different representations of the most and least robust regions of the data.
We develop and analyse a topologically consistent reweighting scheme (TCR) to retrofit and improve existing XGBoost regressors.
We conduct several ablation studies to investigate the hyperparameter sensitivity of TCR.
We provide tree-based guides for users to select the weight factor for their individual dataset based on our experimental results.

2 Background and related work

2.1 Gradient boosted tree ensembles and XGBoost

The basis of GBTEs are classification and regression trees (CARTs); we focus on regression trees, which can be expressed as a weighted sum

$$\begin{aligned} \phi (x):= \sum _{j=1}^T w_j {\textbf{1}}_{R_j}(x), \end{aligned}$$

where $\{ R_j\}_{j=1}^T$ is a set of disjoint regions whose union is the input or feature space, for example, ${\mathbb {R}}^m$ or $[0,1]^m$, ${\textbf{1}}$ is the indicator function and $w_j \in {\mathbb {R}}$ for $j=1, \dots , T$ are the weights of the leaves (see, for example, Hastie et al. (2001)). Whereas the indicator functions control into which region a datum falls, the $w_j$ are the final predictions for every point in the corresponding region.

While one tree has limited predictive capabilities, an ensemble of trees can be very precise by aggregating the individual trees’ outputs. On the one hand, this motivated random forests (Breiman, 2001), which combine this aggregation with bootstrapping subsets from the training data for each new tree (a process known as bagging). On the other hand, it motivated boosting methods, where trees are also added sequentially but after being trained on re-weighted versions of the dataset, for example, in AdaBoost (Freund & Schapire, 1997). Finally, gradient boosting methods are based on the paradigm that the to-be-optimised loss function can be differentiated w.r.t. to the model. More precisely, for fixed $f(x): {\mathbb {R}}^m \rightarrow {\mathbb {R}}$, we can express the empirical loss on a dataset $\{(x_i, y_i)\}_{i=1}^n \subset {\mathbb {R}}^m \times {\mathbb {R}}$ w.r.t. a loss function ${\mathcal {L}}$ as:

$$\begin{aligned} {\mathcal {L}}(f):= \sum _{i=1}^n {\mathcal {L}}(f(x_i), y_i) \end{aligned}$$

If ${\mathcal {L}}$ is differentiable w.r.t. f, we can regard $f_i:= f(x_i)$ as a variable of the loss objective ${\hat{f}}:= (f_1, \dots , f_n) \mapsto {\mathcal {L}}({\hat{f}})$ and search for ${\hat{f}}^*:= \mathop {\mathrm{arg\,min}}\nolimits _{{\hat{f}}} {\mathcal {L}}({\hat{f}})$ using gradient descent. Component-wise, this gradient is given by:

$$\begin{aligned} \frac{\partial {\mathcal {L}}(\phi , y_i)}{\partial \phi }_{|_{\phi = f_i}} \end{aligned}$$

Note that the gradient has as many components as there are elements in the dataset. If ${\mathcal {L}}$ is the squared loss, optimising this objective in the direction of steepest descent reduces to adding learners fitted to the loss residuals (Friedman, 2001). This approach allows for efficient optimisation and motivated several frameworks such as LightGBM (Ke et al., 2017), CatBoost (Prokhorenkova et al., 2018) and XGBoost (Chen & Guestrin, 2016). In this work, we focus on XGBoost as it uses a second-order approximation of the loss objective, which allows for a unique way of integrating weights (see Sect. 3.2).

2.2 Sample weights in regression tasks

Manually adjusting sample weights alters their importance relative to the entire dataset before training. One can manipulate the importance directly, via a data point’s contribution to a loss objective, or indirectly, via observing the data point more or less often using over- and undersampling, respectively. If the target distribution is assumed to be discrete (as in classification setups), emphasising important or underrepresented classes can counter imbalanced distributions and help the model produce more precise estimates at test stage. On the other hand, removing redundant data points (by setting weights to 0) can likewise improve a learner’s precision (Ohno-Machado et al., 1998).

In the continuous case, however, these class-based methods fall short. One can reduce the regression case to the classification case by binning the targets to adjust weights based on the bins, but this requires a reasonable insight into the underlying data distribution, for example, via domain knowledge, which is why many methods focus on rare targets above or below some threshold and seek to balance the overall distribution.

2.2.1 DenseWeight

For imbalanced target distributions, Steininger et al. (2021) introduced DenseWeight, using kernel density estimation methods to increase the weight of instances with underrepresented targets. More precisely, they assign instance weights depending on the probability of the targets occurring after estimating a density over the target distribution (as a measure of rarity). The final weights are influenced by a hyperparameter (“$\alpha$”), which controls the impact of the (normalised) density function. The larger the hyperparameter, the larger the weights for rare targets relative to the entire distribution.

2.2.2 SMOGN

Another related approach for imbalanced regression data is SMOGN (Branco et al., 2017), which assigns weights indirectly by combining under- and oversampling techniques. SMOGN is based on SMOTER (Torgo et al., 2013) which, in turn, is based on SMOTE (Chawla et al., 2002). The latter stands for "Synthetic Minority Over-sampling TEchnique" and was designed for classification tasks. The idea is to over-sample any minority class via synthesizing new instances instead of sampling with replacement. The synthesized elements are taken from line segments joining the nearest neighbours of the same minority class. SMOTER is SMOTE for regression tasks where a relevance function is introduced to determine the set of instances that will be over-sampled while generating new targets based on a weighted average. Finally, SMOGN combines random under-sampling with one of two over-sampling strategies. The first is SMOTER and performed if the original datum is close-enough to its nearest neighbours; the second is adding noise sampled from a Gaussian distribution if the neighbours are too far away.

2.2.3 AXIL

Recently, Geertsema and Lu (2023) showed that a GBTE’s prediction can always be expressed as a linear combination of targets, which led to so-called Additive eXplanations with Instance Loadings or AXIL weights. These weights reveal the impact of individual training targets on the final prediction similar to feature importance scores produced by LIME (Ribeiro et al., 2016) and SHAP (Lundberg & Lee, 2017). The idea can be expressed concisely as

$$\begin{aligned} {\hat{y}} = K \cdot y, \end{aligned}$$

where ${\hat{y}} \in {\mathbb {R}}^n$ are the model predictions, $y \in {\mathbb {R}}^n$ are the training targets and $K \in {\mathbb {R}}^{n \times n}$ is a weight matrix. However, we can modify the approach of Geertsema and Lu (2023) to extract instance-specific weights by measuring the total contribution of each target. More precisely, we can transpose the matrix K and set y to be the vector containing 1’s in every component (${\textbf{1}}_v$) to gain a weight vector w as:

$$\begin{aligned} w = K^\top \cdot {\textbf{1}}_v \end{aligned}$$

To the best of our knowledge, this set of instance weights has not been used before, but due to the inherent dependence on a GBTE’s prediction, we included these modified AXIL weights as an additional baseline.

2.2.4 Performance-driven approaches

Orthogonal to data-driven approaches like DenseWeight, SMOGN, AXIL and our TCR, there exist methods to gauge the importance of instances post hoc based on a trained model’s prediction performance. More precisely, these methods quantify the contribution of one or several data points to the final prediction performance w.r.t. some loss measure. For example, the Leave-One-Out method (Hastie et al., 2001) can be performed iteratively to measure the loss delta when leaving one instance out of the training data. A more elaborate approach combines this technique with the Shapley value from game theory (Shapley, 1953) to provide equitable contributions (Ghorbani & Zou, 2019). While these can be regarded as importance scores and, in extension, instance weights, they inherently depend on both the cost function and the model. Additionally, these performance-driven approaches are costly due to the iterative design of data removal, retraining and reevaluation. While determining contributions at the group level by collecting data points in subsets can alleviate the computational burden (Ghorbani & Zou, 2019), users must specify reasonable domain-specific group criteria to avoid selection bias.

2.2.5 GBTEs and sample weights

For boosting methods, Seiffert et al. (2008) compare boosting by resampling with boosting by reweighting and find that the former is favourable for unbalanced datasets. However, their work appeared years before the introduction of modern frameworks like XGBoost (Chen & Guestrin, 2016), which allows reweighting and subsampling both instances and features. Moreover, an essential aspect of XGBoost is that weights are rescaled instead of replaced, which is a more natural way to influence learning.

Finally, GBTEs and sample weights have been combined in other contexts. In line with the classic boosting paradigm, Zhang et al. (2020) propose DoubleEnsemble for time series forecasting in financial data analysis by adding learners for which sample weights are adapted iteratively via learning trajectories based on loss curves. Rogozhnikov (2016) uses boosted decision trees to determine weights for aligning Monte-Carlo simulated data with real-world data in high energy physics while MacNell et al. (2023) investigate the effect of survey weights when training a GBTE, introducing a new perspective of post-hoc reweighting to align model predictions with reality.

3 Theoretical background and framework

We now build the theoretical framework for TCR. After defining the primary measure for extracting sensitivity and robustness values from any dataset, we will prove that this information is locally consistent on sets which are open w.r.t. the topology of the feature space. We then connect robustness information to regions where the target function is more or less approximable by constant functions. Since approximating functions via piece-wise constant base learners is the goal of XGBoost (and any GBTE) in regression tasks, we can link our results to the training procedure and motivate a topologically consistent reweighting scheme, namely TCR. Finally, we present algorithms to calculate sensitivity and robustness values for labelled/unlabelled points and sets based on the entire data or subsamples before discussing the physics of robustness values and providing several toy examples.

We begin by regarding feature spaces as metric spaces (X, d) and label spaces as Banach spaces $(E, \Vert \cdot \Vert )$. The common spaces ${\mathbb {R}}^n$ and the hypercube $[0,1]^n \subset {\mathbb {R}}^n$ with any $p-$norm, $p \in [1, \infty )$, are metric spaces, while ${\mathbb {R}}$ with its Euclidean norm is a Banach space. A dataset $\{ x_i \}_{i=1}^n =: {\mathcal {D}} \subseteq X$ will be a finite collection of points while labels are assigned via a map $y: X \rightarrow E$, where we assume that $\# y({\mathcal {D}}) \ge 2$ and $\infty > \Vert y\Vert _\infty := \max _i \Vert y(x_i)\Vert$.

3.1 Data sensitivity and robustness

The key to our approach is utilising the information of how robust or, reciprocally, sensitive a data point is towards a label-changing perturbation relative to the change of the corresponding label:

Definition 1

We define the sensitivity of $x_i \in {\mathcal {D}}$ with respect to $y, {\mathcal {D}}, d$ and $\Vert \cdot \Vert$ as:

$$\begin{aligned} \displaystyle s^i_y:= \max _{x_j \in {\mathcal {D}} \setminus \{ x_i\} } \frac{\Vert y(x_i) - y(x_j) \Vert }{d(x_i, x_j)} \in (0, \infty ) \end{aligned}$$

(1)

Applying the transformation $\textstyle s \mapsto \Vert y\Vert _\infty \cdot (\Vert y\Vert _\infty + s)^{-1}$ to the sensitivity values normalises the information and inverts the meaning of small and large values. Hence, we define the robustness of $x_i \in {\mathcal {D}}$ with respect to $y, {\mathcal {D}}, d$ and $\Vert \cdot \Vert$ as:

$$\begin{aligned} \displaystyle r^i_y:= \frac{\Vert y\Vert _\infty }{\Vert y\Vert _\infty + s^i_y} \in (0,1) \end{aligned}$$

The expression $s^i_y$ measures how sensitive the target $y(x_i)$ of a point $x_i$ is towards perturbations in feature space and as shown in Fig. 2, there is an intuitive motivation for maximising the quotient in (1) over the entire dataset. In the case of housing prices, for example, d measures how different houses are w.r.t. the number of rooms, square meter, area in town, year of construction, etc., while $s^i_y$ measures how quickly a house’s price $y(x_i)$ can change when varying the aforementioned features $x_i$. The larger the price difference between similar houses, the higher the sensitivity or, equivalently, the lower the robustness. The following result reveals the topological consistency of both sensitivity and robustness values:

Theorem 1

(Topological Consistency of Sensitivity and Robustness Values) Let $x_i \in {\mathcal {D}}$ and assume $y: X \rightarrow E$ is locally Lipschitz continuous at $x_i$. Then there exists an open set $U_i \subset X$ around $x_i$ on which

$$\begin{aligned} s^i_y: U_i \rightarrow {\mathbb {R}}, \ \ \ z \mapsto s^i_y(z):= \max _{x_j \in {\mathcal {D}} \setminus \{x_i\}} \frac{\Vert y(z) - y(x_j) \Vert }{d(z, x_j)} \end{aligned}$$

is Lipschitz continuous, that is:

$$\begin{aligned} | s^i_y(z_1) - s^i_y(z_2) | \le K_i \cdot d(z_1, z_2) \ \ \ \forall \ z_1, z_2 \in U_i \end{aligned}$$

Furthermore, there exists an open set $U'_i \subseteq U_i$ and a $K'_i >0$ such that

$$\begin{aligned} | r^i_y(z_1) - r^i_y(z_2) | \le K'_i \cdot d(z_1, z_2) \ \ \ \forall \ z_1, z_2 \in U'_i, \end{aligned}$$

where

$$\begin{aligned} \displaystyle r^i_y(z):= \frac{\Vert y\Vert _\infty }{\Vert y\Vert _\infty + s^i_y(z)}: U'_i \rightarrow {\mathbb {R}}. \end{aligned}$$

Proof

(See Appendix A.) $\square$

Remark 1

Intuitively, if two points are “close enough”, their sensitivity and robustness values will be close as well, showing that the concept is locally consistent. Moreover, this consistency holds on sets that are open w.r.t. the topology of the feature space, proving the information is topologically consistent.

Assume now that the sensitivity values $\{ s_y^1, \dots , s_y^n \}$ of a labelled dataset $\{ (x_i, y_i:= y(x_i)) \}_{i=1}^n$ are ordered by magnitude (without loss of generality we can assume $s_y^1 \le \dots \le s_y^n$). Then, by design, this list of ordered sensitivity values is a list of non-decreasing lower bounds on the label function’s Lipschitz constant Lip(y) on X:

$$\begin{aligned} s_y^1 \le \dots \le s_y^n \le \sup _{x_i \ne x_j, x_i, x_j \in X} \frac{\Vert y(x_i) - y(x_j) \Vert }{d(x_i, x_j)} =: Lip(y) \end{aligned}$$

The Lipschitz constant (if it exists) can be considered as a measure for how strongly the label function deviates from a constant function in at least one region. Indeed, a function is constant if and only if its Lipschitz constant is 0. Combining these facts with Theorem 1 allows us now to interpret the sensitivity values in a novel, informal but intuitive way:

Remark 2

The higher the sensitivity (or, equivalently, lower the robustness) of a data region, the more this region prevents the target function from being constant and, thus, from being approximable by constant functions.

3.2 Data robustness and XGBoost

We now explain in what way manipulating weights based on robustness values can benefit an XGBoost regressor. An intuitive guide regarding the training of such models is given by one of the developers in (Chen, 2014). Let $\{ (x_i, y_i:= y(x_i)) \}_{i=1}^n \subset {\mathbb {R}}^m \times {\mathbb {R}}$ again be a labelled dataset and $f^k$ be an XGBoost regressor in boosting round $k+1$. Assume the model is optimised w.r.t. the squared error as the loss objective, that is:

$$\begin{aligned} \sum _{i=1}^n {\mathcal {L}}(f^k(x_i), y_i):= \sum _{i=1}^n \frac{1}{2} |y_i - f^k(x_i) |^2 \end{aligned}$$

(2)

The components of the loss gradient w.r.t. $f^k_i:= f^k(x_i)$ are given by:

$$\begin{aligned} \partial _i {\mathcal {L}}:= \frac{\partial {\mathcal {L}}(\phi , y_i)}{\partial \phi }_{|_{\phi = f^k_i}} = f^k_i - y_i \end{aligned}$$

XGBoost also uses the second derivatives $\partial ^2_i {\mathcal {L}}:= \partial _i ( \partial _i {\mathcal {L}} )$ during training, which can be considered natural weights that depend on the chosen loss function (Chen & Guestrin, 2016). More precisely, the loss objective to find the next base learner $\phi$ can be formulated for any fixed tree structure as

$$\begin{aligned} \sum _{i=1}^n \frac{\partial ^2_i {\mathcal {L}}}{2} \left( \phi (x_i) - \left( -\frac{\partial _i {\mathcal {L}}}{\partial ^2_i {\mathcal {L}}}\right) \right) ^2 + \gamma T + \frac{\lambda }{2} \Vert w\Vert ^2_2 + constant, \end{aligned}$$

(3)

which is a weighted squared error between the prediction $\phi (x_i)$ and $-\frac{\partial _i {\mathcal {L}}}{\partial ^2_i {\mathcal {L}}}$ (up to the regularisation terms). Here, T corresponds to the number of leaves of the new base learner (regulated by the hyperparameter $\gamma$) and $\frac{1}{2} \Vert w\Vert ^2$, $w \in {\mathbb {R}}^T$, contains the leaves’ weights (regulated by the hyperparameter $\lambda$). Finally, for the squared loss in (2), the $\partial ^2_i {\mathcal {L}}$ reduce to 1 and the new base learner $\phi$ is fitted to the targets $- \partial _i {\mathcal {L}}$. This makes intuitively sense as adding the new base learner to the previous model, $f^{k+1}:= f^k + \phi$, will turn (2) into:

$$\begin{aligned} \sum _{i=1}^n \frac{1}{2} |y_i - f^{k+1}(x_i) |^2 = \sum _{i=1}^n \frac{1}{2} |y_i - f^{k}(x_i) - \phi (x_i) |^2 = \sum _{i=1}^n \frac{1}{2} |(- \partial _i {\mathcal {L}}) - \phi (x_i) |^2 \end{aligned}$$

Let now $\tau := \{ \tau _i \}_{i=1}^n$ be a set of weights to regulate the squared loss contribution of the individual instances $\{ x_i \}_{i=1}^n$. Then expressions (2) and (3) turn into

$$\begin{aligned} \sum _{i=1}^n {\mathcal {L}}_{\tau _i}(f^k(x_i), y_i):= \sum _{i=1}^n \frac{\tau _i}{2} |y_i - f^k(x_i) |^2 \end{aligned}$$

(4)

and

$$\begin{aligned} \sum _{i=1}^n \frac{\tau _i}{2} \bigl ( \phi (x_i) - (-\partial _i {\mathcal {L}}) \bigr )^2 + \gamma T + \frac{\lambda }{2} \Vert w\Vert ^2_2 + constant, \end{aligned}$$

(5)

respectively. Notice how the adjusted weights $\tau _i$ cancel out for the new base learner’s target ($-\frac{\partial _i ({\mathcal {L}}_{\tau _i})}{\partial ^2_i ( {\mathcal {L}}_{\tau _i})} = -\frac{\tau _i \partial _i {\mathcal {L}}}{\tau _i \partial ^2_i {\mathcal {L}}} = -\frac{\partial _i {\mathcal {L}}}{\partial ^2_i {\mathcal {L}}} = -\partial _i {\mathcal {L}}$), while still being present in the base learner’s objective. This is fundamentally different from the classic gradient boosting paradigm, where the new base learner would be fit to the weighted residuals, that is, with training data $\{ (x_i, - \partial _i {\mathcal {L}}_{\tau _i} = -\tau _i \partial _i {\mathcal {L}}) \}_{i=1}^n$. Up to regularisation terms, this leads to the loss objective

$$\begin{aligned} \sum _{i=1}^n \frac{1}{2} |\phi (x_i) - (-\tau _i \partial _i {\mathcal {L}}) |^2, \end{aligned}$$

where the weights $\tau _i$ regulate the targets directly and not the loss contributions as in (5). This shows why adjusting weights for XGBoost regressors is more natural than other models and explains our focus in this work.

As GBTEs are piece-wise constant functions, they are also piece-wise Lipschitz. Assuming that both y and $f^k$ are Lipschitz continuous around a training instance $x_i$, we can apply Theorem 1 again to show that there exists an open set $U_i$ in $X = {\mathbb {R}}^m$ around $x_i$ such that

$$\begin{aligned} s^i_{res_k}: U_i \rightarrow {\mathbb {R}}, \ \ \ z \mapsto s^i_{res_k}(z):= \max _{x_j \in {\mathcal {D}} \setminus \{x_i\}} \frac{|res_k(z) - res_k(x_j) |}{d(z, x_j)} \end{aligned}$$

is Lipschitz continuous, where $res_k$ is the residual function

$$\begin{aligned} res_k: {\mathbb {R}}^m \rightarrow {\mathbb {R}}, \ \ \ x \mapsto f^k(x) - y(x) \end{aligned}$$

and d is any suitable metric on ${\mathbb {R}}^m$. Note that $res_k(x_i) = \partial _i {\mathcal {L}}$ in boosting round $k+1$ for all $x_i$ in the training set. Similar to Remark 2, we gain the following insight:

Remark 3

The lower the robustness of a data region w.r.t. $res_k$, the more this region prevents the residual function from being constant and, thus, from being approximable by constant functions.

Approximating $res_k$ with constant functions, however, is precisely the objective of the new base learner in (3) and (5). At the beginning of the training, the XGBoost regressor is initialised as the mean average over the training instances’ targets, $c:= n^{-1} \sum _{i=1}^n y(x_i)$. Since the expression in (1) is the same for $-y$ and invariant under translations by constant functions, the sensitivity values are identical w.r.t. the target function, y, and w.r.t to the residual function in the first boosting round, $res_1$:

$$\begin{aligned} s_y^i = s_{c-y}^i = s_{res_1}^i \end{aligned}$$

We can now refine Remark 3 to include the robustness of the data w.r.t. y:

Remark 4

The lower the robustness of a data region w.r.t. y, the more this region prevents the first residual function $res_1$ from being constant and, thus, from being approximable by constant functions.

In this sense, the robustness of a data point (w.r.t. y) can be regarded as a measure of difficulty from the perspective of the first base learner, with the least robust points being the most difficult to predict. Importantly, this difficulty persists as long as sufficiently many of the least robust elements remain in the same leaf. More generally, if two points $x_1$ and $x_2$ always fall in the same leaf of every base learner up to boosting round k, then:

$$\begin{aligned} \frac{|res_k(x_1) - res_k(x_2) |}{d(x_1, x_2)} = \frac{|f^k(x_1) - y(x_1) - f^k(x_2) + y(x_2) |}{d(x_1, x_2)} = \frac{|y(x_1) - y(x_2) |}{d(x_1, x_2)} \end{aligned}$$

Since the previous boosting round’s residuals define the targets in the next round, the least robust training elements are prone to skew the loss objective in (3) (notice that a GBTE’s final precision in any region of the data is determined by how well the residuals in that region can be approximated by constant functions during training). A final remark will conclude what we have worked out:

Remark 5

The lower the robustness of a data region w.r.t. y, the less approximable by constant functions it is and the more likely it will hinder learning.

There are two possible solutions to this problem: (i) increasing the tree complexity, especially in the least robust regions, to likewise increase the probability of a more fine-grained feature space partition and (ii) downscaling the weights of the least robust instances before training so that their impact on the loss objective is reduced. While the first is a more natural way to influence learning, it requires a (novel) hyperparameter search for a more complex cost objective. The second option, however, merely requires calculating the robustness values and retraining the existing model, which is precisely what TCR is designed for.

For this, we need to investigate which portion of the least robust elements should be reweighted and how the weights should be defined. It is reasonable to assume that these parameters depend on the dataset and so only an empirical study can reveal more insights into the dynamics (more details in Sect. 4.). However, because TCR is intended for retrofitting models with fixed hyperparameters, the degree of freedom is limited. For example, if $\sum _i \tau _i \ne 1$, the two sets $\{\tau _i \}_{i=1}^n$ and $\{ \frac{\tau _i}{\sum _i \tau _i} \}_{i=1}^n$ (normalised weights) have a different impact on the training due to the fixed hyperparameters of the regularisation terms. For this reason, most of the weights should remain at their default value 1.0, for which the hyperparameters were optimised.

3.3 Robustness values in practice

We can generalise the sensitivity measure in (1) using a nearest neighbour map (assuming that the ${\mathrm{arg\,min}}$ is a single point):

Proposition 2

Let $\{(x_i, y(x_i))\}_{i=1}^n \subset X \times E$, ${\mathcal {D}}:= \{x_i \}_{i=1}^n$ and let

$$\begin{aligned} F: X \rightarrow E, \ \ \ z \mapsto F(z):= y \bigl (\mathop {\mathrm{arg\,min}}\limits _{x_k \in {\mathcal {D}}} d(z, x_k) \bigr ) \end{aligned}$$

be the nearest neighbour map. Then for every $x_i \in {\mathcal {D}}$, there exist open sets $U_i, U'_i$ around $x_i$ with $U'_i \subseteq U_i$ and constants $K_i, K'_i$ such that

$$\begin{aligned} \displaystyle |s^i_F(z_1) - s^i_F(z_2)| \le K_i \cdot d(z_1, z_2) \ \forall \ z_1, z_2 \in U_i \end{aligned}$$

and

$$\begin{aligned} \displaystyle |r^i_F(z_1) - r^i_F(z_2)| \le K'_i \cdot d(z_1, z_2) \ \forall \ z_1, z_2 \in U'_i, \end{aligned}$$

where

$$\begin{aligned} \displaystyle s^i_F: U_i \rightarrow {\mathbb {R}}, \ \ \ z \mapsto s^i_F(z):= \max _{x_j \in {\mathcal {D}} \setminus \{x_i\}} \frac{\Vert F(z) - F(x_j) \Vert }{d(z, x_j)}. \end{aligned}$$

and

$$\begin{aligned} \displaystyle r^i_F(z):= \frac{\Vert y\Vert _\infty }{\Vert y\Vert _\infty + s^i_F(z)}: U'_i \rightarrow {\mathbb {R}}. \end{aligned}$$

In particular, $s^i_y = s^i_F(x_i)$ and $r^i_y = r^i_F(x_i)$ for all $x_i \in {\mathcal {D}}$.

Proof

Follows from Theorem 1 for $y = F$ as F is locally constant around $x_i \in {\mathcal {D}}$ and, therefore, locally Lipschitz continuous around $x_i \in {\mathcal {D}}$. $\square$

Definition 2

With the above notations, we define the sensitivity and robustness of any $z \in X$ w.r.t. $y, {\mathcal {D}}, d$ and $\Vert \cdot \Vert$ as

$$\begin{aligned} \displaystyle s_F(z):= \max _{x_j \in {\mathcal {D}} \setminus \{z\}} \frac{\Vert F(z) - F(x_j) \Vert }{d(z, x_j)} = \max _{x_j \in {\mathcal {D}} \setminus \{z\}} \frac{ \Vert y(\mathop {\mathrm{arg\,min}}\limits _{x_k \in {\mathcal {D}}} d(z, x_k)) - y(x_j) \Vert }{d(z, x_j)} \end{aligned}$$

and

$$\begin{aligned} r_F(z):= \frac{\Vert y\Vert _\infty }{\Vert y\Vert _\infty + s_F(z)}, \end{aligned}$$

respectively.

Notice that these definitions connect naturally to the above Proposition: if $x_i = \mathop {\mathrm{arg\,min}}\nolimits _{x_k \in {\mathcal {D}}} d(z, x_k)$, we have

$$\begin{aligned} \displaystyle s_F(z) = \max _{x_j \in {\mathcal {D}} \setminus \{z\}} \frac{\Vert F(z) - F(x_j) \Vert }{d(z, x_j)} = \max _{x_j \in {\mathcal {D}} \setminus \{x_i\}} \frac{\Vert F(z) - F(x_j) \Vert }{d(z, x_j)} = s^i_F(z). \end{aligned}$$

3.3.1 Algorithms to calculate robustness values

Algorithm 1 is a straight-forward adaption of the nearest neighbour approach and can be used to calculate the sensitivity and robustness values of any labelled or unlabelled point $z \in X$. However, as we need to compare z with each differently labelled $x_k \in {\mathcal {D}}$, the number of calculations grows quadratically in the number of elements when determining the robustness values for an entire dataset. Fortunately, given the necessary assumptions, Theorem 1 allows us to possibly shrink this effort: as stated in Remark 1, we know that the local consistency of the $s^i_y$ and $r^i_y$ permits us to infer the sensitivity and robustness values from close enough points. Consequently, instead of computing the robustness values for the entire set ${\mathcal {D}}$, we can use Algorithm 1 for elements in a smaller subset $\tilde{{\mathcal {D}}} \subset {\mathcal {D}}$ and infer the remaining values from their nearest neighbours in $\tilde{{\mathcal {D}}}$. This leads to Algorithm 2, which has the potential to be much more efficient (as we do not need to calculate any quotients for elements in ${\mathcal {D}} - \tilde{{\mathcal {D}}}$, comp. line 5 in Algorithm 1), enabling the use of distance comparisons based on an efficient space partitioning. Whenever $\tilde{{\mathcal {D}}} \subset {\mathcal {D}}$ and $\tilde{{\mathcal {D}}} \ne {\mathcal {D}}$, we refer to the result as proxy robustness values. Unfortunately, since the nearest neighbour model also has quadratic complexity in the worst case, the worst case complexity remains the same.

3.3.2 The physics of robustness values

Let us explore some of the physics of robustness values. In practice, we do not know the data distribution (which is why we need machine learning models), but there are some elementary dynamics influenced by the number of features and unique targets when calculating robustness values.

First, the “curse of dimensionality” points to increasingly sparse data for increasing numbers of features (the dimensionality of the surrounding space). Sparsity corresponds to larger distances between points, leading to larger robustness values. As an example, assume $(X_n)_n$ is a sequence of independently, uniformly distributed random variables over the unit interval $I:= [0, 1] \ \forall \ n$. If we fix $x_0 \in I^n:= I \times \dots \times I$ (n-times), the probability of a point being closer to $x_0$ than $\epsilon < \frac{1}{2}$ w.r.t. the Euclidean norm is bounded by the volume of the surrounding cube of side length $2\epsilon$ (intersected with $I^n$). Formally, we have ${\mathcal {P}}(\Vert (X_1, \dots , X_n) - x_0\Vert _2 \le \epsilon )< (2\epsilon )^n < 1$, meaning the probability of two points being at most $\epsilon$ apart (w.r.t. the Euclidean metric) shrinks exponentially in the dimension n. Consequently, adding features to an existing dataset can, therefore, artificially increase the robustness values.

The influence of the number of targets is more subtle and diverse. One traceable dynamic is lowering the number of unique targets, for example, by binning the values. In this case, we decrease the set of points that are relevant for the robustness calculation in (1) because more points have the same label. Consequently, binning the target values of a fixed dataset will never decrease the robustness values.

Figure 3 displays several distributions of robustness values for the real-world dataset Diabetes (Efron et al., 2004), where we decreased/increased the number of features/unique targets using different methods. We see that removing and adding features decreases and increases the absolute robustness values, respectively-a similar, though less pronounced and consistent image emerges when removing and adding unique targets. The effect is much more pronounced for the features as we manipulate multiple dimensions instead of only one (the variation of the denominator is greater than the variation of the numerator in (1)). By comparing how the indices of the 10% least robust data points change, we also notice that the strength when increasing/decreasing the target values has a noticeable impact although the robustness distributions are similar from left to right in rows 2 and 3.

These examples demonstrate that several factors can influence the concrete robustness values, and even though the overall distributions may not change significantly, the positions of the least robust elements can. Depending on the dataset, this may influence the performance of TCR, intended to reweight only the least robust elements. This highlights the necessity of a broad empirical analysis to reveal insights into the dependency on factors like the number of features and unique targets.

3.3.3 TCR for toy examples

Before we begin with our main empirical analysis, we showcase the effect of TCR for four real-valued toy data distributions, comp. Figure 4. After splitting the data into training and test sets, we calculated the robustness values before fitting multiple XGBoost regressors, either with an unchanged weight distribution or based on weights defined by TCR, where we varied the weight factors and the amount of reweighted points. Except for the third function, we see benefits in generalisation performance. Notably, reweighting the 10% least robust elements seems to provide the best results, followed by the 5% and 20% thresholds. The latter shows a sub-par performance for the last function. In all other cases, the boxplots below indicate that benefits are consistent and increase with smaller weight factors. We hypothesise that the generally weaker performance of TCR for the third function may be attributed to the comparatively smaller range of variation in the robustness values ($\approx [0.3, 0.5]$) in contrast to the distributions of the other functions ($\approx [0.1, 0.5] / [0.0, 0.4] / [0.1, 0.5]$). In other words, the least robust elements are still too robust to negatively impact training.

4 Evaluation

We now evaluate our approach of topologically consistent reweighting (TCR) across 20 regression datasets w.r.t. the following three criteria:

(i) Necessary Criterion: We test whether machines learn different representations for the most and least robust regions of the data in an unsupervised and supervised setting. This is important to show that the concept of data robustness is "rich" enough in a machine-learning sense.
(ii) Sufficient Criterion: We test whether retraining XGBoost regressors with decreased weights for the least robust data points (that is, applying TCR) can improve their generalisation performance while outperforming baselines provided by DenseWeight (Steininger et al., 2021), SMOGN (Branco et al., 2017) and AXIL (Geertsema & Lu, 2023).
(iii) Economic Criterion: We compare the time necessary to perform a reasonable hyperparameter search with the time necessary for computing the robustness values to show that the latter is much faster. We furthermore investigate the trade-off for TCR when using proxy robustness values in terms of the time savings and the changes in prediction performance.

Table 1 Dataset overview, where N, UT, S, UT/S, NF:CF, UF, R and ID indicate the name, the number of unique targets, the set size, the quotient of the previous two, the numbers of nominal and continuous features (excluding the target), the number of unique feature values, the reference and the OpenML ID, respectively; “sk” indicates that the dataset is available on scikit-learn; the IDs in parentheses refer to the original sets

Full size table

4.1 Experimental details and methodology

Below we detail the experimental setup to test the three criteria listed above, where we describe the collection of datasets and preprocessing steps, before explaining our methods and their motivation. All experiments were conducted over ten random splits for each dataset.

4.1.1 Datasets

Table 1 displays the details for all 20 datasets while Fig. 5 provides an intuitive visual comparison. The “-REG” flag for Wine Quality -REG indicates that we used the seven categories of wine quality as regression labels. Yprop41 and Topo21 stem from the same original data but were processed differently, see also (Grinsztajn et al., 2022). Mimic Los is a medical dataset we extracted from the MIMIC-III database (Johnson et al., 2016a, b; Goldberger et al., 2000); more details are in Appendix B. While the first ten sets span a diverse range of complexity and feature/target diversity, the second ten are benchmark datasets from Grinsztajn et al. (2022).

4.1.2 Preprocessing and robustness calculations

After normalising the data (min-max-scaling features and labels to [0, 1]), we split each set into training and test set at a ratio of 4:1 in a stratified fashion to guarantee similar label distributions. We then determined the robustness values of the training data (using Algorithm 2 with the entire set or subsets of 5K/10K/20K elements) and the test data (using Algorithm 1 point-wise). In both cases, we used the corresponding Euclidean norms for d and $\Vert \cdot \Vert$. In some cases, we dropped or engineered features (details in the supplementary material).

4.1.3 Weighted misalignment of principal axes

We ran principal component analysis (PCA) (Pearson, 1901) on subsets of the most and least robust elements ($\subset {\mathbb {R}}^m$), respectively, and compared the angles between the subset-dependent principal axes, weighted by the explained variance. More formally, let $\{(a_1, v_1), \dots , (a_{{\tilde{m}}}, v_{{\tilde{m}}})\}$ and $\{(b_1, w_1), \dots , (b_{{\tilde{m}}}, w_{{\tilde{m}}})\}$ be the sets of “(explained variance, principal axis)”-tuples for two subsets as provided by the PCA algorithm, where $a_1 \ge \dots \ge a_{{\tilde{m}}}$ and $b_1 \ge \dots \ge b_{{\tilde{m}}}$. Then we calculate:

$$\begin{aligned} \sum _{i=1}^{{\tilde{m}}} a_i \cdot b_i \cdot \arccos (\langle v_i, w_i \rangle ) \end{aligned}$$

(6)

Note that PCA produces unit length vectors $v_i, w_i$ by default, so $\langle v_i, w_i \rangle$ is the cosine similarity between $v_i$ and $w_i$. In essence, we measured how exchangeable the two subsets’ principal axes are as orthonormal base vectors that point in the directions of greatest variation. The more similar the feature representations of the two sets, the more exchangeable the base vectors and the smaller the sum in (6). A more detailed explanation and mathematical motivation is given in Appendix B.2.

To contrast these results, we repeated the same calculations after selecting subsets randomly (by shuffling the robustness values). In our experiments we reduced the dimension of the data by 1 ($m - {\tilde{m}} = 1$). The idea here is to determine how strongly the directions, in which the principal axes of maximum variance point, diverge between the various subsets of more and less robust training data. This allows us to measure how (mis-)aligned the principal axes in these regions are as a proxy measure for the difference of their feature representations.

4.1.4 Adversarial validation

We divided the training and test sets into least and most robust halves and discarded the original regression targets. We then combined two arbitrary but distinct halves in a new training set and assigned binary labels, encoding to which half each point belongs. On this combined set we fitted an XGBoost classifier (Chen & Guestrin, 2016) to see whether it could learn to distinguish between the two original halves; such a model is also called a domain classifier, for example, by Rabanser et al. (2019). Finally, we measured the area under the receiver operating characteristic curve over ten cross-folds (commonly referred to as AUC).^{Footnote 1}The idea here is to see how well the model can learn to distinguish between data from opposing ends of the robustness spectrum. An AUC close to 0.5 signals guesswork with the model classifying points randomly. The larger the AUC (up to 1.0), the better it has learned to differentiate between both sets, indicating qualitative differences in the least and most robust regions of the data.

4.1.5 TCR for XGBoost regressors

We evaluated TCR for XGBoost regressors to see whether retrofitting models with decreased weights for the least robust elements could improve their generalisation performance in comparison with DenseWeight (Steininger et al., 2021), SMOGN (Branco et al., 2017) and AXIL (Geertsema & Lu, 2023). Because retrofitting a model assumes that the former has been optimised previously, we began with a hyperparameter search for each of the 20 datasets. For this, we used the H2O automated machine learning framework (LeDell & Poirier, 2020) with a time-budget of 3 h (including fine-tuning) for each training/test split, resulting in a total of 200 XGBoost regressors. After the parameters were optimised, we saved the reference predictions on the test sets before reloading and retraining the models with adjusted weights for the training data and saving their test predictions as well. Finally, we compared all prediction performances in terms of the mean absolute error (MAE), the root mean squared error (RMSE) and the R2-score (R2).

For DenseWeight, we changed the only adjustable parameter $\alpha$ in $\{0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0\}$ ($\alpha > 1$ yields negative weights). For SMOGN, we used the same parameters as Branco et al. (2017) and Steininger et al. (2021), that is, Gaussian noise with scale 0.01, 5 nearest neighbours and a threshold of 0.8 for the relevance function (the implementation of SMOGN was provided by Kunz (2020)^{Footnote 2}). For AXIL, we used the same parameters as in the author’s benchmark script^{Footnote 3} for the LightGBM model (Ke et al., 2017). For TCR, we reweighted the 5%/10%/20% least robust elements of the training data based on the overall distributions of robustness values, see Figs. 6 and 7 below; to avoid instance-specific biases we used the same weight factor $\lambda \in \{0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1 \}$ for all selected elements. The idea here is to evaluate (i) for which datasets retrofitting the reference models using TCR can improve generalisation performance and (ii) how performances compare with the other approaches (DenseWeight, SMOGN, AXIL).

4.1.6 Economic feasibility of TCR

Finally, we tracked and compared the time necessary for H2O to optimise the hyperparameters as described above and the time necessary to compute the robustness values for the individual training sets using Algorithm 2 based on the entire data and subsets of 5K/10K/20K elements (proxy robustness values). We also compared the performance changes for TCR when using proxy robustness values. All computation times were benchmarked on 8 cores of an AMD EPYC 7513 with sufficient RAM. The idea here is to see how fast the robustness calculations can be performed in contrast to a hyperparameter search and to explore the trade-off between prediction performance and efficiency when using proxy robustness values for TCR.

4.2 Results

We now describe the main data analysis and benchmark results before dissecting TCR in more detail. Again, all results are averaged over 10 random seeds to increase the representativeness.

4.2.1 Comparison of robustness values

Figures 6 and 7 visualise the robustness distributions of the first and second 10 datasets listed in Table 1, respectively. The robustness values of the test sets tend to be larger due to the approximate nature of Algorithm 1 for unlabelled points and the dynamics of high-dimensional vector spaces. In essence, since every data point from the test set is assigned the target of its nearest neighbour in the training data, there needs to be at least one differently-labelled point in the training data to which it is closer than its neighbour. But this is increasingly unlikely in high-dimensional space due to the greater data sparsity (the “curse of dimensionality”). We give a more detailed explanation in Appendix B.3. For Bike Sharing Demand, Superconduct, and MBGM, we see flat parts on the left side caused by points with (almost) identical features but distinct targets. In some cases, the robustness values seem to be normally distributed as for BNG and Elevators; in other cases, we see very long tails on the left, indicating a wide range of least robust values. For Treasury, Online News Popularity, Isolet and Ailerons these tails are more evenly distributed, while the distributions seem degenerated for Bike Sharing Demand, Superconduct and MBGM. However, in almost all cases, the curves in the left plots start to grow linearly after the $5\%$ or $10\%$ mark until they reach the opposite side of the robustness spectrum, indicating qualitative differences for the data to the left and right of these thresholds.

4.2.2 Weighted misalignment heatmaps

Figure 8 displays heatmaps, where the values/colours are determined by expression (6). For example, the top right value in each heatmap is the weighted misalignment for two randomly selected disjoint halves of the training data, averaged over ten random splits. Conversely, the bottom left values show the weighted misalignment for the most and least robust half of the training data, again averaged over ten random splits.

In general, the fewer points the subsets share, the greater the misalignment of the principal axes. However, the differences are significantly larger when the subsets also differ in terms of their robustness values, indicated by the yellow ("hot") fields below the diagonal. For example, the average weighted misalignment between the principal axes for two random halves of the training data for BNG is 0.023, while it grows more than six-fold (0.144) when the set is halved based on robustness values. This demonstrates that the differences in robustness add to the misalignment. In other words, the most and least robust regions of the data are determined by different features and the colour gradients indicate that the misalignment between principal axes and the misalignment w.r.t. robustness values grow proportionally.

Table 2 Adversarial validation results (mean average AUC with standard deviation in parentheses), where the sets were halved randomly (right) and based on robustness values (left)

Full size table

4.2.3 Adversarial validation comparison table

Table 2 displays the adversarial validation results (mean average AUC with standard deviation in parentheses), where the sets were halved randomly (right half) and based on robustness values (left half). For example, the centre value for Diabetes (top left) shows the mean average AUC (0.69) when fitting the model to the set that results from combining the least robust half of the original test set ($\mathbf {t-}$) with the most robust half of the original training set ($\mathbf {x+}$).

From the right half of the table we see that the models did not learn to distinguish between any combination of sets when these were extracted randomly (AUC close to 0.5). Conversely, we see values close to 1 if the subsets were extracted based on robustness values and stem from the opposite side of the robustness spectrum (listed in bold). Combining two sets from the same side of the robustness spectrum, on the other hand, again leads to values closer to 0.5. This shows that the domain classifier can learn to distinguish between the most and least robust elements, but not between data from the same side of the robustness spectrum. In other words, the most and least robust regions of the data differ sufficiently so that a model can learn to discriminate between them, highlighting that robustness information is "rich" enough to influence learning.

4.2.4 Conclusion (necessary criterion)

Based on the results displayed in Fig. 8 and Table 2, we conclude that the necessary criterion is met. Both unsupervised and supervised models can be used to discriminate between the least and most robust regions of the data distribution, indicating that the latter are sufficiently different in terms of their feature representations.

4.2.5 Benchmarking TCR

In a first study, we compared all methods when using conservative parameters for DenseWeight ($\alpha = 0.1$) and TCR ($\lambda = 0.9$, meaning we decrease the weights of the least robust elements by only 10%) to evaluate whether these methods could produce consistent rather than large benefits. Figure 9 displays the results when reweighting the 10% least robust elements (leading to the best performance overall, similar as for the toy examples in Sect. 3.3.3); the corresponding overviews for the 5% and 20% thresholds are shown in Fig. 20 in Appendix C. We analyse differences between these thresholds in Sect. 4.3.5.

TCR improves the XGBoost models’ performances across all three metrics in 18 out of the 20 cases while also beating the other approaches in most cases. The average point scores for DenseWeight, SMOGN, AXIL, and TCR are 10.33, 0, 25.66 and 48.66, respectively, showing the superior performance of TCR. Figure 10 extends Fig. 9 by the DenseWeight and TCR performances when using optimal parameters ($\alpha$ and $\lambda$) for each dataset, respectively. TCR now improves the models’ performances across all three metrics in 19 out of the 20 cases while still beating the other approaches in most cases. The average point scores for DenseWeight, DenseWeight (optimal) SMOGN, AXIL, TCR and TCR (optimal) are 11, 37, 0, 35.66, 63.66 and 89.33, respectively. In general: SMOGN did not lead to any improvements albeit being very expensive; AXIL provided benefits in 9/20 cases but not consistently as the algorithm crashed for large sets (a limitation the authors stated (Geertsema & Lu, 2023)); DenseWeight also provided benefits for 9/20 cases, but only when using the best parameters.

4.2.6 Conclusion (sufficient criterion)

Based on the results displayed in Figs. 9 and 10, we conclude that the sufficient criterion is met. Using TCR with both conservative ($\lambda = 0.9$) and optimal weight factors improves the XGBoost regressors’ prediction performance across several metrics in nearly all cases while also outperforming the other reweighting methods. Although the prediction performance can be increased for the right weight factor $\lambda \ne 0.9$, the conservative parameter choice leads to almost reliable benefits.

4.2.7 Computation times for H2O and TCR

Table 3 lists the average computation times in seconds (with the standard deviation in parentheses) when optimising the hyperparameters using H2O and when calculating the robustness values based on the entire training sets. As we limited the hyperparameter search to 3 h (= 10800 s), we can see from column five that the H2O time budget was insufficient for training sets with more than 10,000 elements (comp. Table 1). The exception to this rule is Isolet with more than 600 features. In contrast, the robustness calculations always finish before H2O with a maximum duration of 8511.44 s for the largest set BNG.

Table 3 Average computation times in seconds (with the standard deviation in parentheses) when optimising the hyperparameters using H2O and when calculating the robustness values based on the entire training sets

Full size table

Table 4 Average computation times in seconds (with the standard deviation in parentheses) when calculating the robustness values based on the entire set or based on subsets of 5K/10K/20K elements

Full size table

Although the comparison is not entirely fair since H2O delivers a trained model while an additional training loop is necessary after determining the robustness values, training XGBoost regressors is extremely fast and thus negligible. Indeed, we measured the time necessary for 100 consecutive TCR runs on identical hardware, including loading, training and prediction routines and found that the average runtime was the highest for Online News Popularity with 211.78 s ($\approx$3.5 min or $\approx$2% of the entire budget).

4.2.8 Benchmarking TCR with proxy robustness values

Even though calculating the robustness values is very fast for small sets, we also see a significant time increase for larger sets like Online News Popularity, Diamonds and BNG (comp. Table 3). This is because the number of necessary calculations grows quadratically in the number of elements in a dataset (see Sect. 3.3). We therefore repeated the robustness calculations for randomly sampled subsets of 5k/10K/20K elements and derived the remaining values using a nearest neighbour model (see Algorithm 2). Table 4 compares the computation times, showing a significant speed up when determining these proxy robustness values. We also notice reappearing cut-off times ($\approx 60/220/920$ seconds), which approximately grow by a factor of 4 when increasing the subset size by a factor of 2 (demonstrating the quadratic complexity).

To explore the trade-off in prediction performance for TCR, we repeated the experiments by reweighting points based on the proxy robustness values. Figure 11 shows the overviews for the different sample sizes similar to Fig. 10. We notice immediately that the TCR performance scores at the bottom of each overview increase with increasing sample size up to where they are identical with the ones in Fig. 10 (note that only the training sets of Online News Popularity, Diamonds, and BNG have more than 20K elements). However, there are notable differences in absolute terms: Fig. 12 compares the TCR performance deltas for six datasets when using robustness values based on the entire set or smaller subsets. We can see that the "quality" of the robustness values (measured in terms of the subset size with the entire set producing the highest quality values) significantly impacts performance. While smaller subsets do enable benefits for conservative weight factors (left third of the plots), performance deltas are more likely to be negative for smaller weight factors. In contrast, when using robustness values calculated from the entire data, we notice a much greater absolute performance increase, even for smaller weight factors. In this sense, there is a natural trade-off between the quality of robustness values and the improvements that TCR can provide across all weight factors.

4.2.9 Conclusion (economic criterion)

Based on the results displayed in Tables 3 and 4, and Fig. 11, we conclude that the sufficient criterion is met. Calculating the robustness values for TCR is much faster than a hyperparameter search (using identical computational resources). Determining the robustness values for subsets of fixed sizes reduces the computational overhead significantly, especially for large sets, while still being sufficiently precise to enable benefits when using conservative weight factors. However, there is a notable trade-off between the "quality" of the robustness values and the corresponding prediction performances: whereas time can be saved when restricting the calculations to subsets and using the resulting proxy robustness values in TCR, the overall performance benefits can shrink significantly across all 3 metrics.

4.3 Ablation study

In this section, we dissect the previous results to gain insights into the hyperparameter sensitivity of TCR and the conditions under which it is outperformed by the other methods based on the results displayed in Fig. 10 (using a TCR threshold of 0.1). We disregarded SMOGN as it did not show any benefits. Boxplots display the distribution of differences (or deltas) between the original and the retrofitted models’ prediction performances for all three metrics (larger values are better), where the best method is highlighted in dashed red lines based on the mean average over all 10 splits. If a method neither improves nor deteriorates the original performance (on average), we expect the deltas to be normally distributed around 0, leading to evenly spaced boxes with the median (black horizontal line) at 0. We left out the boxes of the worst performances for better visibility.

4.3.1 When no method yields benefits

Wine Quality -REG is the only set for which no method could improve the original models’ prediction performances, comp. Figure 13 (Denseweight with $\alpha = 0.1$ performed the least worse). A possible explanation is that the original XGBoost regressors leave no room for improvement as one tree of depth 3 is already sufficiently complex to capture a target distribution with only 7 distinct values (comp. Table 1) and H2O did finish the hyperparameter optimisation (comp. Table 3). Another explanation for why TCR in particular underperformed may come from non-reliable robustness values due to the very limited unique targets (comp. Section 3.3.2). These may have affected the TCR performance akin to what we observed when using proxy robustness values in Figs. 11 and 12.

4.3.2 When DenseWeight performs best

For Isolet, CPU Act and Pol, the right parameter selection in DenseWeight led to the overall best improvements as displayed in Fig. 14. Although TCR ($\lambda = 0.8$) beats DenseWeight in terms of the MAE on CPU Act, the improved RMSE performances indicate that DenseWeight generally led to more precise predictions of extreme values. Inspecting Fig. 5, we see that all three sets share a relatively small number of unique targets ($< 100$), which may have led to distorted robustness values (comp. Section 3.3.2) and, consequently, the weaker performance of TCR akin to what we observed when using proxy robustness values in Figs. 11 and 12.

4.3.3 When AXIL performs best

For Yprop41 and Elevators, AXIL led to improvements over the other methods as displayed in Fig. 15, although TCR beats AXIL in terms of the MAE on Yprop41 for multiple weight factors. While Elevators has few unique targets ($< 100$, comp. Figure 5), Yprop41 is much more diverse, both in the number of unique targets and the number of features. Interestingly, on Topo21, which is based on the same original data but with significantly more features, TCR with optimal parameters can beat AXIL consistently, comp. Figure 10. What both Yprop41 and Elevators have in common, however, is the comparatively small dataset size, which may explain the weaker effect of TCR. Indeed, when reweighting the least robust 5% of the data (instead of 10%), TCR with optimal parameters beats AXIL on Yprop41 across all metrics (comp. Figure 20 in Appendix C). Again, like before, the weaker performance of TCR may be explained by distorted robustness values akin to what we observed when using proxy robustness values in Figs. 11 and 12.

4.3.4 When TCR performs best

For all other datasets not displayed above, the correct parameter selection in TCR led to improvements over all other methods. We display a selection in Fig. 16. Note that boxes and whiskers for most configurations of TCR are in the positive range, indicating reliable benefits. Moreover, we notice an approximate continuous behaviour of the boxplots, signalling that benefits are very robust in terms of the weight factors. Except for Diabetes, the displayed sets are very diverse w.r.t. the number of unique targets or features while also having training sets with at least $\approx 14K$ elements (comp. Table 1 and Fig. 5). We hypothesise that these factors likely contributed to the robustness values being more representative compared to the smaller or less diverse sets for which the other methods could outperform TCR.

4.3.5 Sensitivity analysis for TCR

To measure the sensitivity of TCR along a secondary axis, we compared the performance deltas across the range of weight factors when reweighting the 5%/10%/20% least robust elements. Figures 17 and 18 display the variation for the six datasets, where TCR was not the best option (comp. Figure 13, 14, 15) and six datasets, where it was (comp. Figure 16), respectively. In general, smaller weight factors and larger ratios of reweighted points lead to more extreme differences in prediction performance and there is a limit for all datasets regarding how extreme both parameters can be selected while maintaining benefits. However, in contrast to the first six datasets, a sub-optimal selection of parameters is much more forgiving for the second six sets, where barely any configuration entails negative performance deltas.

4.3.6 TCR for very large sets

As a final addition to our experiments, we tested TCR with 5 very large datasets (400K - 1.25M elements) to test its performance in cases where calculating the robustness values for the entire sets using Algorithm 2 is unfeasible. Instead, we calculated proxy robustness values based on subsets of 5K/10K/20K/50K elements. The details are described and displayed in Appendix C.1. In essence, we notice a mixed performance of TCR. However, the prediction performances across the different subset sizes are very similar, indicating that there is barely any impact of increasing the subset size from 5K to 50K. We hypothesise that this is an indicator of the 50K subsets still being too small to produce precise enough robustness values, effectively hindering TCR, similar to the performance graphs in Fig. 12. This is also corroborated by the less consistent trends when varying both the weight factor and the reweighting threshold.

5 Discussion and conclusion

The results presented in the previous section indicate several relationships connected to the performance of TCR that we detail below to outline the key aspects:

5.1 Data diversity and TCR

We notice a correlation between a dataset’s diversity factors, that is, the number of unique targets/features/elements, and the performance of TCR. Datasets like Wine Quality -REG, CPU Act Pol, Yprop41 and Elevators, where TCR shows a weaker (though not necessarily negative) effect, are near the origin in Fig. 5, while sets like Bike Sharing Demand, Mimic Los, Online News Popularity, Diamonds and BNG, where we see clear improvements, are further away from it. In other words, the performance of TCR increases with the diversity of the datasets in terms of the number of unique targets/features/elements. To test whether a connection between these factors and the weight factor of TCR exists, we fitted regression trees based on the information in Table 1, the results displayed in Fig. 10 and our insights gained in Sect. 4.3. For this, we created a 20-element meta-dataset with features given by the relative amount of unique target values, the number of dataset features and the number of elements (comp. columns 2, 4 and 5 of Table 1) with labels defined by the optimal weight factor for each dataset when reweighting the 10% least robust elements. We included the number of elements as the distance from the origin alone could not predict TCR benefits (see the position of Isolet in Fig. 5). Likewise, we included the relative rather than the absolute number of unique targets to account for the observed performance benefits for Diabetes and Treasury. Since there were no benefits for Wine Quality -REG, the label was set to 1.0, meaning TCR falls back to the original (default) weights.

We display the MAE tree in Fig. 19; the result for RMSE and R2 is displayed in Fig. 23 in Appendix C (both trees are identical). As indicated by the colour gradient from left (large weight factors) to right (small weight factors), it is possible to establish a qualitative distinction between datasets (note that all three metadata features were used in the tree, signalling that none is superfluous). In other words, the MAE tree in Fig. 19 shows a correlation between a dataset’s diversity factors and the optimal weight factor in TCR based on our empirical results. This enables users to heuristically select a weight factor for their individual dataset based only on meta-data information. As a concrete example, the rightmost leaf (corresponding to a weight factor of 0.1) contains 4 datasets defined by having at least 38.5 features and at least 14.6% unique target values. Although the (identical) trees for the other two metrics do not show a similar pattern, we notice that large and small weight factors can be discerned quickly (the distinction is much finer for weight factors in the range [0.3, 0.7]).

5.2 Data basis for TCR

As seen in Figs. 11 and 12, the subset size in Algorithm 2 correlates with how well TCR performs. While smaller subsets do not necessarily lead to a performance deficit, using larger sets can significantly boost the gained benefits. While it is mathematically justified to derive the robustness values from nearby points given the assumptions in Theorem 1, we have no information as to how precise the derived values are (to do this, we would effectively need to know the value of the Lipschitz constants $K_i$ and $K'_i$ in Theorem 1). The similar performance curves in our experiments with very large sets (comp. Figure 21 in Appendix C) also align with this idea and we think TCR would greatly benefit from robustness values based on the entire data.

5.3 Conclusion

In conclusion, we presented topologically consistent reweighting (TCR) as a method to retrofit existing XGBoost regressors by training them with a new weight distribution over the training instances. For this, we extracted sensitivity/robustness values from the data to identify points from regions where the target function is particularly susceptible to input changes. After proving that this information is locally consistent on sets which are open w.r.t. the topology of the feature space, we established links to the training procedure of XGBoost regressors, which ultimately motivated a reweighting scheme, namely TCR.

Empirically, we showed that TCR meets three criteria: (i) the necessary criterion - learned representations differ for the most and least robust regions of the data, meaning robustness information is “rich” enough in a machine-learning sense; (ii) the sufficient criterion - TCR improves the prediction performance of XGBoost regressors while outperforming concurrent reweighting methods; (iii) the economic criterion - computing the required robustness values is (significantly) faster than a hyperparameter search. Afterwards, we dissected the dependence of TCR on the weight factor, the number of reweighted instances and the subset size when using Algorithm 2 to determine proxy robustness values. Our results indicate that data diversity regarding the number of unique targets/features/elements correlates with whether TCR shows consistent benefits and that using the entire data in Algorithm 2 allows significantly better performance improvements. If we combine these insights with our discussion in Sect. 3.3.2 and regard the diversity of the data as well as the subset size in Algorithm 2 as factors that influence the quality or reliability of the robustness values, we can formulate the following informal hypothesis: The quality/reliability of the robustness values determines the quality/reliability of the TCR performance. Investigating this hypothesis in the future can further enlighten the dynamics behind TCR.

Furthermore, the tree-based MAE guide not only indicates a trend between the datasets’ metadata and the optimal weight factor in TCR but can aid users in identifying the optimal parameter for their individual dataset in practice. In summary, our results show that TCR provides users with a cost-efficient, data-centric and model-agnostic way to retrofit and improve their XGBoost regressors. Understanding the connection between the metadata and the optimal parameter choice for TCR in more detail can widen the applicability of our method and deepen insights into the underlying dynamics. Extending the data basis for the tree in Fig. 19, for example, can render the decision process more robust and reveal more trends. Also, since significant performance improvements can be gained when determining the precise rather than proxy robustness values, developing a more time-efficient algorithm than Algorithm 2 is crucial, especially for very large datasets.

Finally, we also have shown that other methods, such as DenseWeight and AXIL, which were not explicitly designed for retrofitting XGBoost regressors, can lead to improvements. Especially our novel way of using AXIL weights highlights that there are unexplored methods to influence training procedures. A deeper insight into the individual mechanics may allow us to combine two or more methods in a retrofitting pipeline tailored to an arbitrary dataset. Combined with an even broader empirical analysis of TCR, this would likewise connect to automating the entire retrofitting process and, in extension, the hyperparameter search of automated machine learning frameworks such as H2O.

Data availibility

Except for the Mimic Los dataset that we extracted from the MIMIC-III database, all datasets and libraries are publicly available.

Code Availability

Our code is available at: https://github.com/montymaxzuehlke/tcr.

Notes

We used the approach outlined in this Kaggle post: https://www.kaggle.com/code/carlmcbrideellis/what-is-adversarial-validation/notebook (Accessed: 4.12.2023)
(Accessed: 17.12.2023.)
https://github.com/pgeertsema/AXIL_paper/blob/main/axil_benchmarks.py (Accessed: 28.12.2023)

References

Bertin-Mahieux, T. (2011). Year prediction msd. https://doi.org/10.24432/C50K61
Borisov, V., Leemann, T., Seßler, K., et al. (2022). Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems. https://doi.org/10.1109/TNNLS.2022.3229161
Article MATH Google Scholar
Bose, S., Johnson, A., Moskowitz, A., et al. (2018). Impact of intensive care unit discharge delays on patient outcomes: A retrospective cohort study. Journal of Intensive Care Medicine, 34, 088506661880027. https://doi.org/10.1177/0885066618800276
Article Google Scholar
Branco, P., Torgo, L., & Ribeiro, R.P. (2017). SMOGN: a pre-processing approach for imbalanced regression. In: Luís Torgo PB, Moniz N (eds) Proceedings of the first international workshop on learning with imbalanced domains: Theory and applications, proceedings of machine learning research, vol. 74. PMLR, (pp. 36–50), https://proceedings.mlr.press/v74/branco17a.html
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324
Article MATH Google Scholar
Chawla, N. V., Bowyer, K. W., Hall, L. O., et al. (2002). Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
MATH Google Scholar
Chen, T. (2014). Introduction to boosted trees. University of Washington Computer Science, 22(115), 14–40.
MATH Google Scholar
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, (pp. 785–794).
Cobzaş, Ş., Miculescu, R., & Nicolae, A. (2019). Lipschitz Functions. Lecture Notes in Mathematics, Springer International Publishing, https://books.google.de/books?id=r9yZDwAAQBAJ
Cortez, P., Cerdeira, A., Almeida, F., et al. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47, 547–553.
Google Scholar
Efron, B., Hastie, T., Johnstone, I., et al. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499. https://doi.org/10.1214/009053604000000067
Article MathSciNet MATH Google Scholar
Fanaee-T, H., & Gama, J. (2014). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2, 113–127. https://doi.org/10.1007/s13748-013-0040-3
Article MATH Google Scholar
Feng, J., Lurati, L., Ouyang, H., et al. (2003). Predictive toxicology: Benchmarking molecular descriptors and statistical methods, (pp. 1463–1470), https://doi.org/10.1021/ci034032s
Fernandes, K. (2015). A proactive intelligent decision support system for predicting the popularity of online news.[SPACE]https://doi.org/10.1007/978-3-319-23485-4_53
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139. https://doi.org/10.1006/jcss.1997.1504
Article MathSciNet MATH Google Scholar
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451
Article MathSciNet MATH Google Scholar
Geertsema, P., & Lu, H. (2023). Instance-based explanations for gradient boosting machine predictions with axil weights. ArXiv: abs/2301.01864. https://api.semanticscholar.org/CorpusID:255440273
Ghorbani, A., & Zou, J. (2019). Data shapley: Equitable valuation of data for machine learning. In International conference on machine learning, PMLR, (pp. 2242–2251).
Goldberger, A. L., Amaral, L. A. N., Glass, L., et al. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation., 101(23), e215–e220. https://doi.org/10.1161/01.CIR.101.23.e215
Article MATH Google Scholar
Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 35, 507–520.
Google Scholar
Guyon, I., Sun-Hosoya, L., Boullé, M., et al. (2019). Analysis of the AutoML Challenge Series 2015–2018, (pp. 177–219) Springer International Publishing, Cham. https://doi.org/10.1007/978-3-030-05318-5_10
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: data mining, inference, and prediction. Springer series in statistics, Springer https://books.google.de/books?id=VRzITwgNV2UC
Johnson, A., Pollard, T., & Mark, R. (2016a). “MIMIC-III Clinical Database” (version 1.4). PhysioNet.
Johnson, A., Pollard, T., Shen, L., et al. (2016). Mimic-iii, a freely accessible critical care database. Scientific Data, 3, 160035. https://doi.org/10.1038/sdata.2016.35
Article Google Scholar
Kawala, F., Douzal-Chouakria, A., Gaussier, E., et al. (2013). Prédictions d’activité dans les réseaux sociaux en ligne. In 4ième conférence sur les modèles et l’analyse des réseaux: Approches mathématiques et informatiques, (p. 16).
Ke, G., Meng, Q., Finley, T., et al. (2017). Lightgbm: A highly efficient gradient boosting decision tree. In: I. Guyon, U. V. Luxburg, S. Bengio, et al (eds.), Advances in neural information processing systems, vol 30. Curran Associates, Inc., https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
Kelley Pace, R., & Barry, R. (1997). Sparse spatial autoregressions. Statistics and Probability Letters, 33(3), 291–297. https://doi.org/10.1016/S0167-7152(96)00140-X
Article MATH Google Scholar
Knauer, F., & Cukierski, W. (2015). Rossmann store sales. https://kaggle.com/competitions/rossmann-store-sales, kaggle
Kunz, N. (2020). SMOGN: Synthetic minority over-sampling technique for regression with gaussian noise. https://pypi.org/project/smogn/
LeDell, E., & Poirier, S. (2020). H2O AutoML: Scalable automatic machine learning. In 7th ICML workshop on automated machine learning (AutoML)[SPACE]https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf
Lundberg, S.M., & Lee, S.I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st international conference on neural information processing systems, (pp. 4768-4777), Curran Associates Inc., Red Hook, NY, USA, NIPS’17.
MacNell, N., Feinstein, L., Wilkerson, J., et al. (2023). Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting. PLoS ONE, 18(1), 1–15. https://doi.org/10.1371/journal.pone.0280387
Article Google Scholar
Ohno-Machado, L., Fraser, H.S.F., & Øhrn, A. (1998). Improving machine learning performance by removing redundant cases in medical data sets. In AMIA 1998, American medical informatics association annual symposium, Lake Buena Vista, FL, USA, November 7-11, 1998. AMIA, http://knowledge.amia.org/amia-55142-a1998a-1.588514/t-001-1.590475/f-001-1.590476/a-099-1.590738/a-100-1.590735
Pearson, K. (1901). Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. https://doi.org/10.1080/14786440109462720
Article MATH Google Scholar
Prokhorenkova, L., Gusev, G., Vorobev, A., et al. (2018). Catboost: unbiased boosting with categorical features. In Proceedings of the 32nd international conference on neural information processing systems, (pp. 6639–6649). Curran Associates Inc., Red Hook, NY, USA, NIPS’18.
Rabanser, S., Günnemann, S., & Lipton, Z. (2019). Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32
Ribeiro, M.T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, (pp. 1135–1144).
Rogozhnikov, A. (2016). Reweighting with boosted decision trees. Journal of Physics: Conference Series, 762(1), 012036. https://doi.org/10.1088/1742-6596/762/1/012036
Article MATH Google Scholar
Seiffert, C., Khoshgoftaar, T.M., Hulse, J.V., et al. (2008). Resampling or reweighting: A comparison of boosting implementations. In 2008 20th IEEE international conference on tools with artificial intelligence, (pp. 445–451), https://doi.org/10.1109/ICTAI.2008.59
Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn & A. W. Tucker (Eds.), Contributions to the Theory of Games II (pp. 307–317). Princeton: Princeton University Press.
MATH Google Scholar
Smith, D. J., & Vamanamurthy, M. K. (1989). How small is a unit ball? Mathematics Magazine, 62(2), 101–107. https://doi.org/10.1080/0025570X.1989.11977419
Article MathSciNet MATH Google Scholar
Steininger, M., Kobs, K., Davidson, P., et al. (2021). Density-based weighting for imbalanced regression. Machine Learning, 110, 1–25. https://doi.org/10.1007/s10994-021-06023-5
Article MathSciNet MATH Google Scholar
Torgo, L., Ribeiro, R. P., Pfahringer, B., et al. (2013). Smote for regression. In L. Correia, L. P. Reis, & J. Cascalho (Eds.), Progress in Artificial Intelligence (pp. 378–389). Springer.
MATH Google Scholar
Zhang, C., Li, Y., Chen, X., et al. (2020). Doubleensemble: A new ensemble method based on sample reweighting and feature selection for financial data analysis. In 2020 IEEE international conference on data mining (ICDM). IEEE Computer Society, Los Alamitos, CA, USA, (pp. 781–790), https://doi.org/10.1109/ICDM50108.2020.00087

Download references

Funding

Open Access funding enabled and organized by Projekt DEAL. Funded by the Lower Saxony Ministry of Science and Culture under grant number ZN3492 within the Lower Saxony “Vorab” of the Volkswagen Foundation and supported by the Center for Digital Innovations (ZDIN).

Author information

Authors and Affiliations

L3S Research Center, Appelstraße 9A, 30167, Hannover, Lower Saxony, Germany
Monty-Maximilian Zühlke & Daniel Kudenko

Authors

Monty-Maximilian Zühlke
View author publications
Search author on:PubMed Google Scholar
Daniel Kudenko
View author publications
Search author on:PubMed Google Scholar

Contributions

The first author (Monty-Maximilian Zühlke) conducted the theoretical and empirical analyses, created and implemented the codebase and wrote the article, under the supervision of the co-author Daniel Kudenko. All authors approved of the final version of the manuscript.

Corresponding author

Correspondence to Monty-Maximilian Zühlke.

Ethics declarations

Conflict of interest

Not applicable.

Ethical approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Materials availability

Not applicable.

Additional information

Editors: Rita P. Ribeiro, Ana Carolina Lorena, Albert Bifet.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A. Proof of Theorem 1

The proof of Theorem 1 requires a few preliminary results that we state first, see Cobzaş et al. (2019):

Proposition A

Let (X, d) , $(X', d')$ and $(X'', d'')$ be metric spaces. If $f:X \rightarrow X'$ and $g: X' \rightarrow X''$ are (locally) Lipschitz continuous, then $g \circ f: X \rightarrow X''$ is (locally) Lipschitz continuous. Moreover:

$$\begin{aligned} L(g \circ f) \le L(g) \cdot L(f) \end{aligned}$$

Proposition B

Let (X, d) be a metric space, ${\mathbb {R}}$ endowed with the usual Euclidean metric and $f: X \rightarrow {\mathbb {R}}$ a Lipschitz function having the property that there exists $m >0$ such that $|f(x)| \ge m$ for all $x \in X$. Then $f^{-1}$ is Lipschitz and:

$$\begin{aligned} \displaystyle L\left( \frac{1}{f}\right) \le \frac{L(f)}{m^2} \end{aligned}$$

Proposition C

Let (X, d) be a metric space. If the functions $f,g: M \rightarrow {\mathbb {R}}$ are bounded and Lipschitz, then fg is Lipschitz. Moreover:

$$\begin{aligned} L(fg) \le \sup _{x \in X} |f(x)| \cdot L(g) + \sup _{x \in X} |g(x)| \cdot L(f) \end{aligned}$$

Proposition D

Let (X, d) be a metric space and ${\mathcal {F}}$ a family of real-valued $K-$Lipschitz functions defined on X such that $\varphi (x):= \sup \{f(x) \ | \ f \in {\mathcal {F}} \}$ is finite for every $x \in X$. Then the function $\varphi$ is $K-$Lipschitz. In particular, for two Lipschitz functions $f,g: X \rightarrow {\mathbb {R}}$, their maximum $\max \{f, g\}$ is Lipschitz with $L(\max \{f, g\}) = \max \{L(f), L(g)\}$.

Proof of Theorem 1

There exists an open set $U_i \subset X$ around $x_i$ on which y is Lipschitz with $U_i \cap {\mathcal {D}} = \{x_i \}$ and $x_j \notin \partial U_i \ \forall \ j$. Naturally, the maps $z \mapsto d(z, x_j)$ and $z \mapsto \Vert y(z)-y(x_j) \Vert$ are Lipschitz on $U_i$ for all j (comp. Proposition A). Since $x_j \notin \partial U_i \ \forall \ j$, we have $\inf _{z \in U_i} d(z, x_j) > 0 \ \forall \ j \ne i$ and so the reciprocal $z \mapsto d(z, x_j)^{-1}$ is Lipschitz and bounded on $U_i$ (comp. Proposition B) as is the product with $z \mapsto \Vert y(z) - y(x_j) \Vert$ (comp. Proposition C) for all $j \ne i$. Define the following finite family of functions on $U_i$:

$$\begin{aligned} \displaystyle {\mathcal {F}}^i:= \bigl \{f_j^i (z):= \frac{\Vert y(z) - y(x_j) \Vert }{d(z, x_j)} \ | \ j \ne i \bigr \} \end{aligned}$$

Combining the aforementioned facts, the functions $f^i_j$ are Lipschitz on $U_i$. Let $L^i_j:= L(f^i_j)$ be the corresponding Lipschitz constants and $K^i:= \max _{j \ne i} L^i_j$. Define:

$$\begin{aligned} \varphi ^i(z):= \sup \{ f^i_j(z) \ | \ f^i_j \in {\mathcal {F}}_i \} = \max _{j \ne i} \{f^i_j(z) \} \end{aligned}$$

As $f^i_j(z)$ is finite for every $z \in U_i$, so is $\varphi ^i(z)$. Thus, $\varphi ^i$ is Lipschitz on $U_i$ with Lipschitz constant $L(\varphi ^i) = K^i$ (comp. Proposition D).

Finally, the normalisation function $(0, \infty ) \ni s \mapsto \Vert y\Vert _\infty \cdot (\Vert y\Vert _\infty + s)^{-1} \in (0,1)$ is continuously differentiable (hence, locally Lipschitz continuous) and the concatenation of two locally Lipschitz continuous maps is again locally Lipschitz continuous (see Proposition A). Thus, there exists a (possibly smaller) open set $U'_i \subseteq U_i$ and a (possibly larger) constant $K'_i \ge K_i$ such that:

$$\begin{aligned} | r^i_y(z_1) - r^i_y(z_2) | \le K'_i \cdot d(z_1, z_2), \ \ \ z_1, z_2 \in U'_i \end{aligned}$$

$\square$

Appendix B. Additional explanations

1.1 B.1 Mimic los details

Since this medical dataset was extracted by us, we provide some additional information (note that we cannot share the dataset itself due to privacy restrictions). We queried the MIMIC-III database (Johnson et al., 2016a, b; Goldberger et al., 2000), where we used guidance from Bose et al. (2018) as to which medical features are of interest. The set contains a variety of features from the available patient records from the first 48 h after admission. The regression target is the final length of the patient’s stay (Los "=" Length of stay).

1.2 B.2 How Eq. (6) is related to the difference of the feature representations

By applying PCA to two subsets $A, B \subset {\mathbb {R}}^m$, we receive orthonormal bases $\{v_1,..., v_{{\tilde{m}}}\}$ and $\{w_1,..., w_{{\tilde{m}}}\}$, where $m\ge {\tilde{m}}>0$. These base vectors point in the directions along which the feature variation is greatest for the data contained in A and B, respectively, while being orthonormal: $\langle v_i, v_j \rangle = \langle w_i, w_j \rangle = \delta _{ij}$, where $\delta _{ij}$ is the Kronecker delta and $\langle \cdot , \cdot \rangle$ is the Euclidean inner product. The variation of data in A (B) is greatest along $v_1$ ($w_1$), while the variation in A (B) is greatest along $v_2$ ($w_2$) once the prior variation of $v_1$ ($w_1$) has been accounted for (which is expressed by them being orthogonal, $\langle v_1, v_2 \rangle = 0$). Similarly, the variation in A (B) is greatest along $v_3$ ($w_3$) once the prior variation of $v_1$ and $v_2$ ($w_1$ and $w_2$) has been accounted for, and so forth.

Each element in A (or B) can be represented as a linear combination of $\{v_1,..., v_{{\tilde{m}}} \}$ (or $\{w_1,..., w_{{\tilde{m}}} \}$) up to some reconstruction term, depending on how much smaller ${\tilde{m}}$ is than the original feature dimension m. More precisely, for $m> {\tilde{m}} > 0$, each $a \in A$ can be deconstructed into $a = a_v + a_v^{\bot }$ for $a_v \in {\mathbb {R}}^{{\tilde{m}}}, a_v^{\bot } \in {\mathbb {R}}^{m - {\tilde{m}}}$, where $a_v = \sum _{i=1}^{{\tilde{m}}} \alpha _i v_i$ is the feature representation for $a_v$ in terms of the base provided by PCA (and similarly for $b = b_w + b_w^{\bot }$).

Now, since $\{v_1,..., v_{{\tilde{m}}} \}$ and $\{w_1,..., w_{{\tilde{m}}} \}$ are bases for the same vector space (up to isomorphy), we can also express $a_v$ and $b_w$ in the respective other basis: $a_w = \sum _{i=1}^{{\tilde{m}}} \tilde{\alpha }_i w_i$, $b_v = \sum _{i=1}^{{\tilde{m}}} \tilde{\beta }_i v_i$. This means, we gain the following equality:

$$\begin{aligned} \alpha _j = \langle a_v, v_j \rangle = \langle a_w, v_j \rangle = \left\langle \sum _{i=1}^{{\tilde{m}}} \tilde{\alpha }_i w_i, v_j \right\rangle = \sum _{j \ne i=1}^{{\tilde{m}}} \tilde{\alpha }_i \langle w_i, v_j \rangle + \tilde{\alpha }_j \langle w_j, v_j \rangle \end{aligned}$$

The more aligned $w_j$ and $v_j$ are, the less aligned are $v_j$ and $\{w_1, \dots w_{j-1}, w_{j+1}, \dots , w_{{\tilde{m}}} \}$ by definition of $w_j$. In other words, for $\langle w_j, v_j \rangle \rightarrow 1$, we get $\langle w_i, v_j \rangle \rightarrow 0 \ \forall \ i \ne j$ and, thus, $\alpha _j \rightarrow \tilde{\alpha }_j$. We see that $\langle w_j, v_j \rangle$ effectively measures how well one can substitute $w_j$ for $v_j$ in the two base representations of $a_v$, thereby expressing how similar the data variation along $v_j$ and $w_j$ is. The term $\textrm{arccos}(\langle v_i, w_i \rangle )$ measures the angle between $v_i$ and $w_i$ in the range $[0, \pi ]$, quantising how (mis-)aligned the vectors are (0:= completely aligned, 1:=completely misaligned), so the result is always non-negative. Intuitively, the more aligned the principal axes $v_j$ and $w_j$ are, the the more similar the data variation in A and B along them. Finally, (6) is the sum over all these (mis-)alignment angles, ordered and weighted by how much variation of the data is explained along the corresponding principal axes $v_j$ and $w_j$ ($j=1, \dots , {\tilde{m}}$), respectively.

1.3 B.3 The robustness values of the test sets are higher than the robustness values of the training sets

The robustness values of the test sets tend to be larger due to the approximate nature of Algorithm 1 for unlabelled points and the dynamics of high-dimensional vector spaces. More precisely, assume z is an element of the test set and $x_z:= \mathop {\mathrm{arg\,min}}\nolimits _{x_k \in {\mathcal {D}}} d(z, x_k)$ its nearest neighbour in the training set ${\mathcal {D}}$ with $x_z \ne z$. As we set the label $y(z):= y(x_z)$ in Algorithm 1, the set of admissible points in the robustness calculation, that is, the points having a distinct label, are identical for z and $x_z$. Consequently, the difference in robustness between z and $x_z$ is determined entirely by their distance to differently labelled elements. Moreover, the robustness of z can only be smaller than or equal to the robustness of $x_z$ if there is at least one point $x_* \in {\mathcal {D}}$ with $y(x_*) \ne y(x_z) = y(z)$ and

$$\begin{aligned} s_y(x_z) \le \frac{\Vert y(x_z) - y(x_*) \Vert }{d(z, x_*)} \Leftrightarrow d(x_*, z) \le \frac{\Vert y(x_z) - y(x_*) \Vert }{s_y(x_z)} = \frac{\Vert y(x_z) - y(x_*) \Vert }{\displaystyle \max _{x_j \in {\mathcal {D}} \setminus \{ x_z\} } \frac{\Vert y(x_z) - y(x_j) \Vert }{d(x_z, x_j)}}. \end{aligned}$$

In particular, we see that

$$\begin{aligned} d(z, x_*) = d(x_*, z) \le \displaystyle \min _{x_j \in {\mathcal {D}} \setminus \{ x_z\} } \Bigg ( \frac{\Vert y(x_z) - y(x_*) \Vert }{\Vert y(x_z) - y(x_j) \Vert } d(x_z, x_j) \Bigg ) \le d(x_z, x_*). \end{aligned}$$

In other words, z needs to lie in the d-ball of radius $d(x_z, x_*)$ around $x_*$. The volume of an n-dimensional Euclidean ball of radius r, $V_n(r)$, is given by

$$\begin{aligned} \frac{r^n \pi ^{\frac{n}{2}}}{\Gamma (\frac{n}{2} +1 )}, \end{aligned}$$

where $\Gamma (\cdot )$ is the Gamma function. This expression goes to 0 for $n \rightarrow \infty$ independent of the radius (Smith & Vamanamurthy, 1989). Assuming that z is, for example, uniformly distributed over the hypercube (similar to the setup in Sect. 3.3.2), the the probability of $d(z, x_*)$ being smaller or equal to $d(x_z, x_*)$ for any $x_*$ is small for large numbers of features. This means that the probability of $s_y(x_z)$ being smaller than $s_y(z)$ (determined via Algorithm 1) or, equivalently, the robustness of z being smaller than the robustness of its nearest neighbour $x_z$, is small as well. By repeating this argument for every z in the test set, we receive an explanation for the robustness values of the test set being larger than the robustness values of the training set.

Appendix C. Additional results and graphics

1.1 C.1 TCR for very large sets

As a final addition to our experiments, we tested TCR with 5 very large datasets (400 K–1.25 M elements) to test its performance in cases where calculating the robustness values for the entire sets using Algorithm 2 is unfeasible. Tables 5 and 6 list the dataset details and computation times, respectively, where we calculated the proxy robustness values based on subsets of 5K/10K/20K/50K elements using Algorithm 2. Whereas the computation times for subsets of 5K/10K/20K elements for Yolanda, Year and Buzzinsocialmedia Twitter are similar to the ones for Diamonds and BNG from before (comp. Table 3), we notice a significant increase for the two larger sets due to the nearest neighbour model needing to predict many more elements.

To increase the likelihood of H2O finding a suitable configuration of hyperparameters, we set the time limits to 9 h for Yolanda, Year and Buzzinsocialmedia Twitter ($\approx$ 400–600 K elements), to 15 h for Rossmann Store Sales ($\approx$800 K elements) and to 24 h for Delays Zurich Transport ($\approx$1.25 M elements). The results are displayed in Figs. 21 and 22. Although performance improvements exist, particularly for the smaller sets, the overall impression is mixed. However, we notice that the performance curves in Fig. 21 are very similar independent of the subset size, which may indicate that the robustness information is not precise enough to warrant benefits similar to what we noticed for Online News Popularity, Diamonds and BNG (comp. Figure 12). This is further corroborated by the less consistent trends when varying both the weight factor and the reweighting threshold (comp. Figure 22).

Table 5 Dataset overview as in Table 1 for the five very large sets

Full size table

Table 6 Average computation times in seconds (with the standard deviation in parentheses) when calculating the training sets’ robustness values using Algorithm 2 based on subsets of 5K/10K/20K/50K elements

Full size table

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zühlke, MM., Kudenko, D. TCR: topologically consistent reweighting for XGBoost in regression tasks. Mach Learn 114, 108 (2025). https://doi.org/10.1007/s10994-024-06704-x

Download citation

Received: 09 February 2024
Revised: 05 November 2024
Accepted: 15 November 2024
Published: 24 February 2025
Version of record: 24 February 2025
DOI: https://doi.org/10.1007/s10994-024-06704-x

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

TCR: topologically consistent reweighting for XGBoost in regression tasks

Abstract

Similar content being viewed by others

Prediction of autogenous shrinkage in ultra-high-performance concrete (UHPC) using hybridized machine learning

Hyperparameter Optimization for Gradient-Boosted Tree-Based Machine Learning Models and Their Effect on Model Performance

Building Automated Computational Models for Predicting Energy Consumption in High-Performance Concrete Production

Explore related subjects

1 Introduction

2 Background and related work

2.1 Gradient boosted tree ensembles and XGBoost

2.2 Sample weights in regression tasks

2.2.1 DenseWeight

2.2.2 SMOGN

2.2.3 AXIL

2.2.4 Performance-driven approaches

2.2.5 GBTEs and sample weights

3 Theoretical background and framework

3.1 Data sensitivity and robustness

Definition 1

Theorem 1

Proof

Remark 1

Remark 2

3.2 Data robustness and XGBoost

Remark 3

Remark 4

Remark 5

3.3 Robustness values in practice

Proposition 2

Proof

Definition 2

3.3.1 Algorithms to calculate robustness values

3.3.2 The physics of robustness values

3.3.3 TCR for toy examples

4 Evaluation

4.1 Experimental details and methodology

4.1.1 Datasets

4.1.2 Preprocessing and robustness calculations

4.1.3 Weighted misalignment of principal axes

4.1.4 Adversarial validation

4.1.5 TCR for XGBoost regressors

4.1.6 Economic feasibility of TCR

4.2 Results

4.2.1 Comparison of robustness values

4.2.2 Weighted misalignment heatmaps

4.2.3 Adversarial validation comparison table

4.2.4 Conclusion (necessary criterion)

4.2.5 Benchmarking TCR

4.2.6 Conclusion (sufficient criterion)

4.2.7 Computation times for H2O and TCR

4.2.8 Benchmarking TCR with proxy robustness values

4.2.9 Conclusion (economic criterion)

4.3 Ablation study

4.3.1 When no method yields benefits

4.3.2 When DenseWeight performs best

4.3.3 When AXIL performs best

4.3.4 When TCR performs best

4.3.5 Sensitivity analysis for TCR

4.3.6 TCR for very large sets

5 Discussion and conclusion

5.1 Data diversity and TCR

5.2 Data basis for TCR

5.3 Conclusion

Data availibility

Code Availability

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval and consent to participate

Consent for publication

Materials availability

Additional information

Publisher's Note

Appendices