Abstract
A common strategy to train deep neural networks (DNNs) is to use very large architectures and to train them until they (almost) achieve zero training error. Empirically observed good generalization performance on test data, even in the presence of substantial label noise, corroborates such a procedure. On the other hand, in statistical learning theory it is known that over-fitting models may lead to poor generalization properties, as can occur, e.g., in empirical risk minimization (ERM) over overly large hypotheses classes. Inspired by this contradictory behavior, so-called interpolation methods have recently received much attention, leading, e.g., to consistent and optimally learning local averaging schemes with zero training error. We extend this analysis to ERM-like methods for least squares regression and show that for certain large hypotheses classes of so-called inflated histograms, some interpolating empirical risk minimizers enjoy very good statistical guarantees while others fail in the worst sense. Moreover, we show that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.
1 Introduction
During the last few decades statistical learning theory (SLT) has developed powerful techniques to analyze many variants of (regularized) empirical risk minimizers, see e.g. Devroye et al. (1996); Vapnik (1998); van de Geer (2000); Györfi et al. (2002); Steinwart and Christmann (2008); Tsybakov (2009); Shalev-Shwartz and Ben-David (2014). The resulting learning guarantees, which include finite sample bounds, oracle inequalities, learning rates, adaptivity, and consistency, assume in most cases that the effective hypotheses space of the considered method is sufficiently small in terms of some notion of capacity such as VC-dimension, fat-shattering dimension, Rademacher complexities, covering numbers, or eigenvalues.
Most training algorithms for DNNs also optimize a (regularized) empirical error term over a hypotheses space, namely the class of functions that can be represented by the architecture of the considered DNN, see Goodfellow et al. (2016), Part II. However, unlike for many classical empirical risk minimizers, the hypotheses space is parametrized in a rather complicated manner. Consequently, the optimization problem is, in general, harder to solve. A common way to address this in practice is to use very large DNNs, since despite their size, training them is often easier, see e.g. Salakhutdinov (2017), Ma et al. (2018) and the references therein. Now, for sufficiently large DNNs it has recently been observed that common training algorithms can achieve zero training error on randomly or even arbitrarily labeled training sets, see Zhang et al. (2016). Because of this ability, their effective hypotheses space can no longer have a sufficiently small capacity in the sense of classical SLT, so that the usual techniques for analyzing learning algorithms are no longer suitable, see e.g. the discussion in Zhang et al. (2016), Belkin et al. (2018), Nagarajan and Kolter (2019), Zhou et al. (2020), Zhang et al. (2021). In fact, SLT provides well-known examples of large hypotheses spaces for which zero training error is possible but a simple empirical risk minimizer fails to learn. This phenomenon is known as over-fitting, and common wisdom suggests that successful learning algorithms need to avoid over-fitting, see e.g. Györfi et al. (2002), pp. 21–22. The empirical evidence mentioned above thus stands in stark contrast to this credo of SLT.
This somewhat paradoxical behavior has recently sparked interest, leading to deeper theoretical investigations of the so-called double/multiple-descent phenomenon for different model settings. More specifically, Belkin et al. (2020) analyzed linear regression with random feature selection and investigated the random Fourier feature model. This model has also been analyzed by Mei and Montanari (2019). For linear regression, where model complexity is measured in terms of the number of parameters, the authors in Bartlett et al. (2020), Tsigler and Bartlett (2020) show that over-parameterization is even essential for benign over-fitting. However, these results are highly distribution dependent and require a specific covariance structure and (sub-) Gaussian data. For more details we refer also to Belkin et al. (2018); Chen et al. (2020); Liang et al. (2020); Neyshabur et al. (2019); Allen-Zhu et al. (2019). Another line of research (Belkin et al., 2019) shows for classical learning methods, namely the Nadaraya-Watson estimator with certain singular kernels, that interpolating the training data can achieve optimal rates for problems of nonparametric regression and prediction with the square loss.
Nonparametric regression with DNNs has been analyzed by various authors, see e.g. McCaffrey and Gallant (1994); Kohler and Krzyżak (2005); Kohler and Langer (2021); Suzuki (2018); Yarotsky (2018) and references therein. In particular, we highlight Schmidt-Hieber (2020), where it is shown that sparsely connected ReLU-DNNs achieve the minmax rates of convergence up to log-factors under a general assumption on the regression function. In Kohler and Langer (2021) the authors show that sparsity is not necessary: optimal rates of convergence have also been established for fully connected feedforward neural networks with ReLU activation. Here, an important observation is that DNNs are able to circumvent the curse of dimensionality (Bauer & Kohler, 2019). More structured input data are investigated in Kohler et al. (2023).
Beyond this empirical evidence, there are thus also theoretical results showing that interpolating the data and achieving good learning performance are simultaneously possible. So far, however, the considered interpolating learning methods neither implement an empirical risk minimization (ERM) scheme nor closely resemble the learning mechanisms of DNNs. In this paper, we take a step towards closing this gap.
First, we explicitly construct, for data sets of size n, large classes of hypotheses \({\mathcal {H}}_n\) for which we show that some interpolating least squares ERM algorithms over \({\mathcal {H}}_n\) enjoy very good statistical guarantees, while other interpolating least squares ERM algorithms over \({\mathcal {H}}_n\) fail in a strong sense. To be more precise, we observe the following phenomena: There exists a universally consistent empirical risk minimizer and there exists an empirical risk minimizer whose predictors converge to the negative regression function for most distributions. In particular, the latter empirical risk minimizer is not consistent for most distributions, and even worse, the obtained risks are usually far off the best possible risk. We further construct modifications that enjoy minmax optimal rates of convergence up to some log factor under standard assumptions. In addition, there are also ERM algorithms that exhibit an intermediate behavior between these two extreme cases, with arbitrarily slow convergence.
The finding that an interpolating estimator is not necessarily benign is a known fact. For instance, Zhou et al. (2020), Yang et al. (2021), Bartlett and Long (2021), Koehler et al. (2021) show that the uniform bound fails to give the precise evaluation of the risk for minimum norm interpolators in over-parameterized linear models, and there are estimators that provably give worse errors than the minimum norm interpolator. In this paper, we analyze a different setting and different class of estimators. To put our results in perspective, we note that classical SLT shows that for sufficiently small hypotheses classes, all versions of empirical risk minimizers enjoy good statistical guarantees. In contrast, our results demonstrate that this is no longer true for large hypotheses classes. For such hypotheses spaces, the description empirical risk minimizer is thus not sufficient to identify well-behaving learning algorithms. Instead, the class of algorithms described by ERM over such hypotheses spaces may encompass learning algorithms with extremely distinct learning behavior.
Second, we show that exactly the same phenomena occur for interpolating ReLU-DNNs of at least two hidden layers with widths growing linearly in both input dimension d and sample size n. We present DNN training algorithms that produce interpolating predictors and that enjoy consistency and optimal rates, at least up to some log factor. In addition, this training can be done in \({\mathcal {O}}(d^2\cdot n^2)\)-time if the DNNs are implemented as fully connected networks. Since the constructed predictors have a particularly sparse structure, the training time can actually be reduced to \({\mathcal {O}}(d\cdot n \cdot \log n)\) by implementing the DNNs as loosely connected networks. Moreover, we show that there are other efficient and feasible training algorithms for exactly the same architectures that fail in the worst possible sense, and like in the ERM case, there are also a variety of training algorithms performing in between these two extreme cases.
The rest of the paper is organized as follows: In Sect. 2 we first recall classical histograms as ERMs and then extend them to the class of inflated histograms. We provide specific examples of interpolating predictors from that class, and in our main theorems we derive consistency results and learning rates. In Sect. 3 we then explain how inflated histograms can be approximated by ReLU networks with analogous learning properties. We discuss our results in Sect. 4.
All our proofs are deferred to the Appendices A, B, C, and D. Finally, in the supplementary material E we derive general uniform bounds for histograms based on data-dependent partitions. This result is needed for proving our main results and is of independent interest.
2 The histogram rule revisited
In this section, we reconsider the histogram rule within the framework of regression. Specifically, in Sect. 2.1, we recall the classical histogram rule and demonstrate how to modify it to obtain a predictor that can interpolate the given data. In Sect. 2.2, we construct specific interpolating empirical risk minimizers for a broad class of loss functions. The core idea is to begin with classical histogram rules and then expand their hypothesis spaces, allowing us to find interpolating empirical risk minimizers within these enlarged spaces. Section 2.3 presents a generic oracle inequality, while Sect. 2.4 focuses on learning rates for the least squares loss.
To begin with, let us introduce some necessary notations. Throughout this work, we consider \(X:=[-1,1]^d\) and \(Y=[-1,1]\) if not specified otherwise. Moreover, \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) denotes the loss function. If not specified otherwise, we restrict ourselves to the least squares loss \(L(x,y,f(x))=(y-f(x))^2\). Given a dataset \(D:= ((x_1, y_1),...,(x_n, y_n)) \in (X\times Y)^n\) drawn i.i.d. from an unknown distribution P on \(X \times Y\), the aim of supervised learning is to build a function \(f_D: X \rightarrow \mathbb {R}\) based on D such that its risk
$$\begin{aligned} {\mathcal {R}}_{L,P}(f_D) := \int _{X\times Y} L\bigl (x,y,f_D(x)\bigr ) \, \textrm{d}P(x,y) \end{aligned}$$
is close to the smallest possible risk
$$\begin{aligned} {\mathcal {R}}^*_{L,P} := \inf \bigl \{ {\mathcal {R}}_{L,P}(f) \, : \, f:X\rightarrow \mathbb {R}\text { measurable} \bigr \} \, . \end{aligned}$$
In the following, \({\mathcal{R}^*_{L,P}}\) is called the Bayes risk and an \({f_{L,P}^*}: X \rightarrow \mathbb {R}\) satisfying \({\mathcal{R}_{L,P}(f^*_{L,P})} = {\mathcal{R}^*_{L,P}}\) is called a Bayes decision function. Recall that for the least squares loss, \({f_{L,P}^*}\) equals the conditional mean function, i.e. \({f_{L,P}^*}(x) = \mathbb {E}_P(Y|x)\) for \(P_X\)-almost all \(x\in X\), where \(P_X\) denotes the marginal distribution of P on X. In general, estimators \(f_D\) having small excess risk
$$\begin{aligned} {\mathcal {R}}_{L,P}(f_D) - {\mathcal {R}}^*_{L,P} = \Vert f_D - f^*_{L,P}\Vert _{L_2(P_X)}^2 \, , \end{aligned}$$(3)
where \(\Vert \cdot \Vert _{L_2(P_X)}\) denotes the usual \(L_2\)-norm with respect to \(P_X\), are considered as good in classical statistical learning theory.
Now, to describe the class of learning algorithms we are interested in, we need the empirical risk of an \(f:X\rightarrow \mathbb {R}\), i.e.
$$\begin{aligned} {\mathcal {R}}_{L,D}(f) := \frac{1}{n}\sum _{i=1}^n L\bigl (x_i,y_i,f(x_i)\bigr ) \, . \end{aligned}$$
Recall that an empirical risk minimizer over some set \(\mathcal{F}\) of functions \(f:X\rightarrow \mathbb {R}\) chooses, for every data set D, an \(f_D \in {\mathcal {F}}\) that satisfies
$$\begin{aligned} {\mathcal {R}}_{L,D}(f_D) = \inf _{f\in {\mathcal {F}}} {\mathcal {R}}_{L,D}(f) \, . \end{aligned}$$
Note that the definition of empirical risk minimizers implicitly requires that the infimum on the right hand side is attained, namely by \(f_D\). In general, however, \(f_D\) does not need to be unique. It is well-known that if we have a suitably increasing sequence of hypotheses classes \({\mathcal {F}}_n\) with controlled capacity, then every empirical risk minimizer \(D\mapsto f_D\) that ensures \(f_D \in {\mathcal {F}}_n\) for all data sets D of length n learns in the sense of e.g. universal consistency, and under additional assumptions it may also enjoy minmax optimal learning rates, see e.g. Devroye et al. (1996), van de Geer (2000), Györfi et al. (2002); Steinwart and Christmann (2008).
2.1 Classical histograms
Particularly simple empirical risk minimizers are histogram rules (HRs). To recall the latter, we fix a finite partition \(\mathcal{A} = (A_j)_{j\in J}\) of X and for \(x\in X\) we write A(x) for the unique cell \(A_j\) with \(x\in A_j\). Moreover, we define
$$\begin{aligned} {\mathcal {H}}_{\mathcal {A}} := \Bigl \{ \sum _{j\in J} c_j \varvec{1}_{A_j} \, : \, c_j \in Y \text { for all } j\in J \Bigr \} \, , \end{aligned}$$(4)
where \(\varvec{1}_{A_j}\) denotes the indicator function of the cell \(A_j\). Now, given a data set D and a loss L an \(\mathcal{A}\)-histogram is an \(h_{D, \mathcal{A}} = \sum _{j=1}^m c_j^*\varvec{1}_{A_j} \in {\mathcal {H}}_{\mathcal{A}}\) that satisfies
for all so-called non-empty cells \(A_j\), that is, cells \(A_j\) with \(N_j:=|\{i: x_i \in A_j\}| >0\). Clearly, \(D\mapsto h_{D, \mathcal{A}}\) is an empirical risk minimizer. Moreover, note that in general \(h_{D, \mathcal{A}}\) is not uniquely determined, since \(c_j^*\in Y\) can take arbitrary values for empty cells \(A_j\). In particular, there is more than one empirical risk minimizer over \({\mathcal {H}}_{\mathcal{A}}\) as soon as \(m,n\ge 2\).
Before we proceed, let us consider the specific example of the least squares loss in more detail. Here, a simple calculation shows, see Lemma A.1, that for all non-empty cells \(A_j\), the coefficient \(c_j^*\) in (5) is uniquely determined by
$$\begin{aligned} c_j^* = \frac{1}{N_j} \sum _{i: x_i \in A_j} y_i \, . \end{aligned}$$(6)
In the following, we call every resulting \(D\mapsto h_{D, \mathcal{A}}\) with
an empirical HR for regression with respect to the least-squares loss L. For later use we also introduce an infinite sample version of a classical histogram
for all cells \(A_j\) with \(P_X(A_j)>0\). Similarly to empirical histograms one has
We are mostly interested in HRs on \(X=[-1,1]^d\) whose underlying partition essentially consists of cubes with a fixed width. To rigorously deal with boundary effects, we first say that a partition \((B_j)_{j\ge 1}\) of \(\mathbb {R}^d\) is a cubic partition of width \(s>0\), if each cell \(B_j\) is a translated version of \([0,s)^d\), i.e. there is an \(x^\dagger \in \mathbb {R}^d\), called offset, such that for all \(j\ge 1\) there exists a \(k_j:= ( k_{j,1},\dots , k_{j,d})\in \mathbb {Z}^d\) with
$$\begin{aligned} B_j = x^\dagger + s\cdot k_j + [0,s)^d \, . \end{aligned}$$(8)
Now, a partition \(\mathcal{A} = (A_j)_{j\in J}\) of \(X=[-1,1]^d\) is called a cubic partition of width \(s>0\), if there is a cubic partition \({\mathcal {B}}=(B_j)_{j\ge 1}\) of \(\mathbb {R}^d\) with width \(s>0\) such that \(J= \{j\ge 1: B_j \cap X \ne \emptyset \}\) and \(A_j = B_j\cap X\) for all \(j\in J\). If \(s\in (0,1]\), then, up to reordering, this \((B_j)_{j\ge 1}\) is uniquely determined by \(\mathcal{A}\).
If the hypotheses space (4) is based on a cubic partition of \(X=[-1,1]^d\) with width \(s>0\), then the resulting HRs are well understood. For example, universal consistency and learning rates have been established, see e.g. Devroye et al. (1996); Györfi et al. (2002). In general, these results only require a suitable choice for the widths \(s=s_n\) for \(n\rightarrow \infty\) but no specific choice of the cubic partition of width s. For this reason we write \(\mathcal{H}_s:= \bigcup \mathcal{H}_{\mathcal{A}}\), where the union runs over all cubic partitions \(\mathcal{A}\) of X with fixed width \(s\in (0,1]\).
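For illustration, the following minimal Python sketch (our own, not part of the formal analysis; the function name fit_cubic_histogram, the zero offset, and the zero prediction on empty cells are illustrative choices) fits such an empirical HR for the least squares loss over a cubic partition of width s.

```python
import numpy as np

def fit_cubic_histogram(X, y, s, offset=0.0):
    """Least squares histogram over a cubic partition of [-1, 1]^d of width s.

    Each cell is indexed by the integer vector floor((x - offset) / s); on every
    non-empty cell the prediction is the average of the labels falling into it,
    i.e. the cell-wise least squares minimizer."""
    keys = np.floor((np.asarray(X, float) - offset) / s).astype(int)
    cells = {}
    for k, target in zip(map(tuple, keys), y):
        cells.setdefault(k, []).append(target)
    coeffs = {k: float(np.mean(v)) for k, v in cells.items()}

    def predict(X_new):
        new_keys = np.floor((np.asarray(X_new, float) - offset) / s).astype(int)
        # empty cells: predict 0 (any value in Y = [-1, 1] yields an ERM)
        return np.array([coeffs.get(tuple(k), 0.0) for k in new_keys])

    return predict

# toy usage: d = 2, n = 200, clipped noisy regression function
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.clip(0.5 * X[:, 0] + 0.1 * rng.normal(size=200), -1.0, 1.0)
h_D = fit_cubic_histogram(X, y, s=0.25)
print(h_D(X[:5]), y[:5])
```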
2.2 Interpolating predictors and inflated histograms
In this section we construct particular interpolating empirical risk minimizers for a broad class of losses.
Definition 2.1
(Interpolating Predictor) We say that an \(f:X\rightarrow Y\) interpolates D, if
where we emphasize that the infimum is taken over all \(\mathbb {R}\)-valued functions, while f is required to be Y-valued.
Clearly, an \(f:X\rightarrow Y\) interpolates D if and only if
where \(x_1^*,\dots , x_m^*\) are the elements of \(D_X:= \{x_i: i=1,\dots ,n\}\).
It is easy to check that for the least squares loss L and all data sets D there exists an \(f_D^*\) interpolating D. Moreover, we have \({\mathcal{R}^*_{L,D}} > 0\) if and only if D contains contradicting samples, i.e. \(x_i = x_k\) but \(y_i \ne y_k\). Finally, if \({\mathcal{R}^*_{L,D}} = 0\), then any interpolating \(f_D^*\) needs to satisfy \(f_D^*(x_i) = y_i\) for all \(i=1,\dots ,n\).
Definition 2.2
(Interpolatable Loss) We say that L is interpolatable for D if there exists an \(f:X\rightarrow Y\) that interpolates D, i.e. \({\mathcal{R}_{L,D}(f)} = {\mathcal{R}^*_{L,D}}\).
Note that (9) in particular ensures that the infimum over \(\mathbb {R}\) on the right is attained at some \(c^*_i\in Y\). Many common losses, including the least squares, the hinge, and the classification loss, are interpolatable for all D, and for these losses we have \({\mathcal{R}^*_{L,D}} > 0\) if and only if D contains contradicting samples, i.e. \(x_i = x_k\) but \(y_i \ne y_k\). Moreover, for the least squares loss, \(c^*_i\) can be easily computed by averaging over all labels \(y_k\) that belong to some sample \(x_k\) with \(x_k = x_i^*\).
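As a small illustration of this last remark (the helper name is ours), the values \(c^*_i\) for the least squares loss are obtained by averaging the labels of coinciding covariates:

```python
import numpy as np

def interpolation_targets(X, y):
    """For the least squares loss, the optimal value c_i^* at a distinct covariate
    x_i^* is the average of all labels y_k whose sample satisfies x_k = x_i^*."""
    groups = {}
    for x, target in zip(map(tuple, np.asarray(X, float)), y):
        groups.setdefault(x, []).append(target)
    xstar = np.array(list(groups.keys()))
    cstar = np.array([np.mean(v) for v in groups.values()])
    return xstar, cstar   # distinct points x_1^*, ..., x_m^* and targets c_1^*, ..., c_m^*

# contradicting samples: x = 0.5 carries the labels 1 and 0, hence c^* = 0.5 there
X = np.array([[0.5], [0.5], [-0.2]])
y = np.array([1.0, 0.0, -0.3])
print(interpolation_targets(X, y))
```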
Let us now describe more precisely the inflated versions of \(\mathcal{H}_s\). For \(r,s>0\) and \(m\ge 0\) we want to consider functions of the form
$$\begin{aligned} f = h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty } \end{aligned}$$
with \(h\in \mathcal{H}_s,\, b_i \in 2Y,\, x_i^* \in X,\) and \(t \in [0,r]\), where \(B_\infty := [-1,1]^d\). In other words, for \(m\ge 1\), such an f changes a classical histogram \(h \in \mathcal{H}_s\) on at most m small neighborhoods of some arbitrary points \(x_1^*,\dots ,x_m^*\) in X. Such changes are useful for finding interpolating predictors. In general, however, these small neighborhoods \(x_i^* + t B_\infty\) may intersect and may be contained in more than one cell \(A_j\) of the considered partition \(\mathcal{A}\) with \(h \in {\mathcal {H}}_{\mathcal {A}}\). To avoid undesired boundary effects we restrict the class of all admissible cubic partitions \({\mathcal {A}}\) of X associated with h. An additional technical difficulty arises in particular when constructing interpolating predictors, since the set of points \(\{x_1^*,..., x_m^*\}\subset X\) is then naturally given by the random input variables. As a consequence, the admissible cubic partitions become data-dependent. As a next step, we introduce the notion of a partitioning rule. To this end, we write
$$\begin{aligned} \textrm{Pot}_m(X) := \bigl \{ A \subset X \, : \, |A| = m \bigr \} \end{aligned}$$
for the set of all subsets of X having cardinality m. Moreover, we denote the set of all finite partitions of X by \({\mathcal {P}}(X)\).
Definition 2.3
Given an integer \(m\ge 1\), an m-sample partitioning rule for X is a map \(\pi _m: \textrm{Pot}_m(X)\rightarrow {\mathcal {P}}(X)\), i.e. a map that associates to every subset \(\{x_1^*,..., x_m^*\}\subset X\) of cardinality m a finite partition \(\mathcal{A}\). Additionally, we will call an m-sample partitioning rule that assigns to any such \(\{x_1^*,..., x_m^*\}\in \textrm{Pot}_m(X)\) a cubic partition with fixed width \(s \in (0,1]\) an m-sample cubic partitioning rule and write \(\pi _{m,s}\).
Next we explain in more detail which particular partitions are considered as admissible.
Definition 2.4
(Proper Alignment) Let \({\mathcal {A}}\) be a cubic partition of X with width \(s\in (0,1]\), \({\mathcal {B}}\) be the partition of \(\mathbb {R}^d\) that defines \(\mathcal{A}\), and \(r\in (0,s)\). We say that \({\mathcal {A}}\) is properly aligned to the set of points \(\{x_1^*,..., x_m^*\}\in \textrm{Pot}_m(X)\) with parameter r, if for all \(i,k=1,\dots ,m\) we have
where \(B(x_i^*)\) is the unique cell of \({\mathcal {B}}\) that contains \(x_i^*\).
Clearly, if \({\mathcal {A}}\) is properly aligned with parameter \(r> 0\), then it is also properly aligned for any parameter \(t \in [0, r]\) for the same set of points \(\{x_j^*\}_{j=1}^m\) in \(\textrm{Pot}_m(X)\). Moreover, any cubic partition \({\mathcal {A}}\) of X with width \(s>0\) is properly aligned with the parameter \(r=0\) for any set of points \(\{x_j^*\}_{j=1}^m\) in \(\textrm{Pot}_m(X)\).
In what follows, we establish the existence of cubic partitions \({\mathcal {A}}\) that are properly aligned to a given set of points with parameter \(r>0\) being sufficiently small. In other words, we construct a special m-sample cubic partitioning rule \(\pi _{m,s}\). We call henceforth any such rule \(\pi _{m,s}\) an m-sample properly aligned cubic partitioning rule. To this end, let \(D_X:= \{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) be a set of points and note that (12) holds for all \(r>0\) satisfying
Clearly, a brute-force algorithm finds such an r in \(\mathcal{O}(dm^2)\)-time. However, a smarter approach is to first sort the first coordinates \(x_{1,1}^*,\dots , x_{m,1}^*\) and to determine the smallest positive distance \(r_1\) between two consecutive, non-identical ordered coordinates. This approach is then repeated for the remaining \(d-1\) coordinates, so that at the end we have \(r_1,\dots ,r_d>0\). Then
satisfies (12) and the resulting algorithm requires \(\mathcal{O}(d\cdot m \log m)\) time. Our next result shows that we can also ensure (11) by jiggling the cubic partitions. Being rather technical, the proof is deferred to Appendix B.
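For illustration (our own sketch; the exact constant used for \(r^*\) in (13) is part of the formal construction), the sorting step reads as follows. Since two distinct points differ by at least \(\min _\ell r_\ell\) in the sup-norm, every radius below \(\min _\ell r_\ell /2\) keeps their r-neighborhoods disjoint.

```python
import numpy as np

def separation_radii(Xstar):
    """Coordinate-wise smallest positive gaps r_1, ..., r_d of the distinct points
    x_1^*, ..., x_m^*, computed by sorting each coordinate (O(d * m * log m))."""
    radii = []
    for coord in np.asarray(Xstar, float).T:
        gaps = np.diff(np.sort(coord))
        gaps = gaps[gaps > 0]
        radii.append(gaps.min() if gaps.size else np.inf)
    return np.array(radii)

Xstar = np.array([[0.1, -0.4], [0.1, 0.2], [0.7, 0.2]])
r = separation_radii(Xstar)   # here r_1 = r_2 = 0.6
print(r, r.min() / 3)         # any radius below min_l r_l / 2 works; we take a third
```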
Theorem 2.5
(Existence of Properly Aligned Cubic Partitioning Rule) For all \(d\ge 1\), \(s\in (0,1]\), and \(m\ge 1\) there exists an m-sample cubic partitioning rule \(\pi _{m,s}\) with \(|{{\,\textrm{Im}\,}}(\pi _{m,s})|\le (m+1)^d\) that assigns to each set of points \(D_X:=\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) a cubic partition \(\mathcal{A}\) that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r:= r_{D_X}:= \min \{r^*,\frac{s}{3\,m+3}\}\), where \(r^*= r^*_{D_X}\) is defined in (13).
The construction of an m-sample cubic partitioning rule \(\pi _{m,s}\) basically relies on the representation (8) of cubic partitions \({\mathcal {B}}\) of \(\mathbb {R}^d\). In fact, the proof of Theorem 2.5 shows that there exists a finite set \(x_1^\dagger ,..., x^\dagger _K \in \mathbb {R}^d\) of candidate offsets, with \(K=(m+1)^d\). While at first glance this number seems to be prohibitively large for an efficient search, it turns out that the proof of Theorem 2.5 actually provides a simple \(\mathcal{O}(d\cdot m)\)-time algorithm that identifies, coordinate-wise, the offset \(x_\ell ^\dagger\) leading to \(\pi _{m,s}(\{x_1^*,..., x_m^*\})\).
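The exact offset search is part of the proof in Appendix B; the following sketch (our own, with the hypothetical name aligned_offset and the simplifying assumption of equally spaced candidate offsets \(k\cdot s/(m+1)\)) only illustrates the pigeonhole idea: since \(2r < s/(m+1)\), each sample can block at most one candidate per coordinate, so an unblocked offset always exists and is found in \(\mathcal{O}(d\cdot m)\) time.

```python
import numpy as np

def aligned_offset(Xstar, s, r):
    """Coordinate-wise pigeonhole search for an offset such that every point of
    Xstar keeps distance > r from all grid lines of the width-s cubic partition
    with that offset.  Per coordinate there are m + 1 candidates spaced
    s / (m + 1) apart, and each point blocks at most one of them."""
    Xstar = np.asarray(Xstar, float)
    m, d = Xstar.shape
    spacing = s / (m + 1)
    assert 2 * r < spacing, "r too large for the pigeonhole argument"
    offset = np.zeros(d)
    for l in range(d):
        blocked = np.zeros(m + 1, dtype=bool)
        for f in np.mod(Xstar[:, l], s):
            k_near = int(round(f / spacing))
            for k in (k_near - 1, k_near, k_near + 1):
                k %= m + 1
                circ = min(abs(f - k * spacing), s - abs(f - k * spacing))
                if circ <= r:           # candidate k puts a grid line r-close to f
                    blocked[k] = True
        offset[l] = spacing * int(np.flatnonzero(~blocked)[0])
    return offset

Xstar = np.array([[0.05, -0.3], [0.41, 0.9], [-0.77, 0.12]])
s, m = 0.5, Xstar.shape[0]
print(aligned_offset(Xstar, s, r=s / (3 * m + 3)))
```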
Being now well prepared, we introduce the class of inflated histograms.
Definition 2.6
Let \(s \in (0,1]\) and \(m\ge 1\). Then a function \(f:X\rightarrow Y\) is called an m-inflated histogram of width s, if there exist a subset \(\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) and a cubic partition \(\mathcal{A}\) of width s that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r\in [0,s)\) such that
$$\begin{aligned} f = h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty } \, , \end{aligned}$$
where \(h\in {\mathcal {H}}_{\mathcal {A}}\), \(t\in [0,r]\), and \(b_i\in 2Y\) for all \(i=1,\dots ,m\). We denote the set of all m-inflated histograms of width s by \({\mathcal {F}}_{s, m}\). Moreover, for \(k\ge 1\) we write
Note that the condition \(t\le r < s\) ensures that the representation \(f= h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty }\) of any \(f\in {\mathcal {F}}_{s, k}\) is unique. In addition, given an \(f\in \mathcal{F}^*_{s,k}\), the number m of inflation points \(\{x_1^*,..., x_m^*\}\) is uniquely determined, too, and hence so is the representation of f. For a depiction of an inflated histogram for regression (with and without proper alignment) we refer to Fig. 1.
Fig. 1 Left. Depiction of an inflated histogram for regression for a cubic partition \({\mathcal {A}}=(A_j)_{j \in J}\) that is not properly aligned to the data (black crosses). The predictions \(c_i^*\) and \(c_j^*\) on the associated cells \(A_i\) and \(A_j\) are calculated according to (6), i.e. by a local average. Mispredicted samples are corrected according to (14) on a \(tB_\infty\)-neighborhood for some small \(t > 0\). Note that one sample is too close to the cell boundary, i.e. (11) is violated. Right. An inflated histogram that is properly aligned to the same data set. Note that (11) ensures that boundary effects as for the left HR do not take place. For inflated histograms these effects seem to be a negligible technical nuisance. For their DNN counterparts considered in Sect. 3, however, such effects may significantly complicate the constructions of interpolating predictors, see Fig. 2 (Color figure online)
So far we have formalized the notion of interpolation and defined an appropriate inflated hypotheses class for modified histograms. The m-inflated histograms in Definition 2.6 can attain any values in Y that arise from classical histograms or from changes of classical histograms on a discrete set (note that this implicitly restricts the choices of the \(b_i\)). This is a quite general definition and m-inflated histograms do not need to be interpolating. In our next result we go a step further by providing a sufficient condition for the existence of interpolating predictors in \({\mathcal {F}}_{s, m}\). The idea is to give a condition on the \(b_i\) that ensures that an inflated histogram is interpolating. This depends, of course, on the \(c_j\).
Proposition 2.7
Let L be a loss that is interpolatable for \(D=((x_1,y_1),\dots ,(x_n,y_n))\) and let \(x_1^*,\dots , x_m^*\) be as in (9). Moreover, for \(s\in (0,1]\) and \(r>0\) we fix an \(f^*\in {\mathcal {F}}_{s, m}\) with representation as given in Definition 2.6. For \(i=1,\dots ,m\) let \(j_i\) be the index such that \(x_i^* \in A_{j_i}\). Then \(f^*\) interpolates D, if for all \(i=1,\dots ,m\) we have
Proof of Proposition 2.7
By our assumptions we have
where the last equality is a consequence of the fact that there is an \(f:X\rightarrow Y\) satisfying (9). Moreover, since (11) and (12) hold, we find \(f^*(x_i^*) = h(x_i^*) + b_i = c_{j_i} + b_i = c_i^*\), and therefore \(f^*\) interpolates D by (9).\(\square\)
Note that for all \(c_{j_i} \in Y\) the value \(b_i\) given by (14) satisfies \(b_i \in 2Y\) and we have \(b_i=0\) if \(c_{j_i}\) is contained in the \(\arg \min\) in (14). Consequently, defining \(b_i\) by (14) always gives an interpolating \(f^*\in {\mathcal {F}}_{s, m}\). Moreover, (14) shows that an interpolating \(f^*\in {\mathcal {F}}_{s, m}\) can have an arbitrary histogram part \(h\in \mathcal{H}_{\mathcal{A}}\), that is, the behavior of \(f^*\) outside the small \(tB_\infty\)-neighborhoods around the samples of D can be arbitrary. In other words, as soon as we have found a properly aligned cubic partition \(\mathcal{A}\) in the sense of \({\mathcal {F}}_{s,m}\), we can pick an arbitrary histogram \(h\in \mathcal{H}_{\mathcal{A}}\) and compute the \(b_i\)’s by (14). Intuitively, if the chosen \(tB_\infty\)-neighborhoods are sufficiently small, then the prediction capabilities of the resulting interpolating predictor are (mostly) determined by the chosen histogram part \(h\in \mathcal{H}_{\mathcal{A}}\). Based on this observation, we can now construct different, interpolating \(f^*_D\in {\mathcal {F}}_{s, m}\) that have particularly good and bad learning behaviors.
Example 2.8
(Good interpolating histogram rule) Let L be the least squares loss, \(s\in (0,1]\) be a cell width, \(\rho \ge 0\) be an inflation parameter, and \(D=((x_1,y_1),\dots ,(x_n,y_n))\) be a data set. By \(D_X=\{x_1^*,...,x_m^*\}\) we denote the set of all covariates \(x_j \in X\) with \((x_j, y_j)\) belonging to the data set. For \(m=|D_X|\), Theorem 2.5 ensures the existence of a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\) with width \(s\in (0,1]\) that is properly aligned to \(D_X\) with the data-dependent parameter r. Based on this data-dependent cubic partition \({\mathcal {A}}_D\) we fix an empirical histogram for regression
$$\begin{aligned} h_{D,\mathcal{A}_D}^+ := \sum _{j\in J} c_j^+ \varvec{1}_{A_j} \end{aligned}$$(15)
with coefficients \((c_j^+)_{j \in J}\) precisely given in (6). Applying now Proposition 2.7 gives us an \(f_{D,s,\rho }^+\in {\mathcal {F}}_{s, m} \subset \mathcal{F}^*_{s,n}\), which interpolates D and has the representation
$$\begin{aligned} f_{D,s,\rho }^+ = h_{D,\mathcal{A}_D}^+ + \sum _{i=1}^m b_i^+ \varvec{1}_{x_i^* + t B_\infty } \, , \end{aligned}$$
where the \(b^+_1,\dots ,b_m^+\) are calculated according to the rule (14), and \(t:= \min \{r, \rho \}\) is again data-dependent. We call the map \(D\mapsto f_{D,s,\rho }^+\) a good interpolating histogram rule.
Example 2.9
(Bad interpolating histogram rule) Let L be the least squares loss, \(s\in (0,1]\) be a cell width, \(\rho \ge 0\) be an inflation parameter, and \(D=((x_1,y_1),\dots ,(x_n,y_n))\) be a data set. Consider again a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\) with width \(s\in (0,1]\) that is properly aligned to \(D_X\) with parameter r, and fix an empirical histogram \(h_{D,\mathcal{A}_D}^+ \in \mathcal{H}_{s}\) as in (15). Setting \(t:= \min \{r, \rho \}\), we define a predictor \(f_{D,s,\rho }^-\in {\mathcal {F}}_{s, m}\) by
$$\begin{aligned} f_{D,s,\rho }^- := h_{D,\mathcal{A}_D}^- + \sum _{i=1}^m b_i^- \varvec{1}_{x_i^* + t B_\infty } \end{aligned}$$
with \(\mathcal{H}_{\mathcal{A}}\)-part \(h_{D,\mathcal{A}_D}^-:= - h_{D,\mathcal{A}_D}^+\). The \(b^-_1,\dots ,b_m^-\) are calculated according to (14) and satisfy
$$\begin{aligned} b_i^- = c_i^* + c_{j_i}^+ \end{aligned}$$
for any \(i=1,...,m\) and where \(j_i\) denotes the index such that \(x_i^* \in A_{j_i}\). By writing
we easily see that the definition of \(f_{D,s,\rho }^-\) gives \(f_{D,s,\rho }^-\in {\mathcal {F}}_{s, m} \subset \mathcal{F}^*_{s,n}\) and
while Proposition 2.7 ensures that \(f_{D,s,\rho }^-\) interpolates D. We call the map \(D\mapsto f_{D,s,\rho }^-\) a bad interpolating histogram rule and remark that t is, like for good interpolating histogram rules, data-dependent.
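To make Examples 2.8 and 2.9 concrete, the following self-contained Python sketch (our own illustration; for brevity it uses the zero partition offset instead of the properly aligned, data-dependent offset of Theorem 2.5) implements both rules. The only difference between the good and the bad rule is the sign of the histogram part; the corrections \(b_i\) are then computed as in (14).

```python
import numpy as np

def fit_interpolating_histogram(X, y, s, t, sign=+1):
    """Good (sign=+1) or bad (sign=-1) interpolating histogram rule: a cell-wise
    average histogram (negated for the bad rule) plus corrections b_i on tiny
    t-neighborhoods of the samples so that the data are interpolated."""
    X, y = np.asarray(X, float), np.asarray(y, float)

    # histogram part h: cell-wise label averages (the coefficients c_j^+)
    cells = {}
    for k, target in zip(map(tuple, np.floor(X / s).astype(int)), y):
        cells.setdefault(k, []).append(target)
    coeffs = {k: sign * float(np.mean(v)) for k, v in cells.items()}

    def h(Z):
        keys = np.floor(np.asarray(Z, float) / s).astype(int)
        return np.array([coeffs.get(tuple(k), 0.0) for k in keys])

    # interpolation targets c_i^* at the distinct covariates and corrections b_i
    groups = {}
    for x, target in zip(map(tuple, X), y):
        groups.setdefault(x, []).append(target)
    xstar = np.array(list(groups.keys()))
    cstar = np.array([np.mean(v) for v in groups.values()])
    b = cstar - h(xstar)                       # b_i = c_i^* - c_{j_i}, cf. (14)

    def predict(Z):
        Z = np.asarray(Z, float)
        # add b_i on the closed box x_i^* + t B_infinity around each sample
        hit = np.all(np.abs(Z[:, None, :] - xstar[None, :, :]) <= t, axis=2)
        return h(Z) + hit.astype(float) @ b

    return predict

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.clip(X[:, 0] ** 2 - 0.5 + 0.1 * rng.normal(size=100), -1, 1)
good = fit_interpolating_histogram(X, y, s=0.25, t=1e-6, sign=+1)
bad = fit_interpolating_histogram(X, y, s=0.25, t=1e-6, sign=-1)
print(np.max(np.abs(good(X) - y)), np.max(np.abs(bad(X) - y)))  # both are 0
```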
2.3 A generic oracle inequality for empirical risk minimization
The main purpose of this section is to present a general variance-improved oracle inequality that bounds the excess risk of empirical risk minimizers for a broad class of loss functions and is of independent interest. In Sect. 2.4, we apply this result to the special case of the least squares loss and to histogram rules that choose their cubic partitions in a certain data-dependent way. In particular, we give an optimized uniform bound that crucially relies on an explicit capacity bound, expressed in terms of covering numbers, see Definition E.1. This is a necessary step for establishing the learning properties of histograms based on data-dependent cubic partitions. The proof of this result is provided in Appendix E.1.
Theorem 2.10
Let \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) be a locally Lipschitz continuous loss, \(\mathcal{F}\subset \mathcal{L}_{\infty }(X)\) be a closed, separable set satisfying \(\Vert f \Vert _\infty \le M\) for a suitable constant \(M>0\) and all \(f\in \mathcal{F}\), and P be a distribution on \(X\times Y\) that has a Bayes decision function \(f_{L,P}^{*}\) with \({\mathcal{R}_{L,P}({f_{L,P}^*})}< \infty\). Assume that there exist constants \(B>0\), \(\vartheta \in [0,1]\), and \(V\ge B^{2-\vartheta }\) such that for all measurable \(f: X \rightarrow [-M,M]\) we have
Then, for all measurable empirical risk minimization algorithms \(D\mapsto f_D\), all \(n\ge 1\), \(\tau >0\), and all \(\varepsilon >0\) we have
with probability \(P^n\) not less than \(1- e^{-\tau }\). Here, \(\mathcal{N}(\mathcal{F},\Vert \cdot \Vert _\infty ,\varepsilon )\) denotes the \(\varepsilon\)-covering number of \(\mathcal{F}\).
Note that variance-improved oracle inequalities generally provide refined bounds on the estimation error part of the excess risk under the stricter assumptions (18) and (19). This leads to faster rates of convergence compared to a basic statistical analysis, see e.g. (Steinwart and Christmann (2008), Chapters 6 and 7). Our variance-improved oracle inequality improves over the one in (Steinwart and Christmann (2008), Theorem 7.2) for empirical risk minimizers: we go beyond finite function classes and bound the capacity in terms of covering numbers.
2.4 Main results for least squares loss
Our main results below show that the descriptions good and bad interpolating histogram rule from Examples 2.8 and 2.9, respectively, are indeed justified, provided the inflation parameter is chosen appropriately. Here we recall that good learning algorithms can be described by a small excess risk, or equivalently, a small distance to the Bayes decision function \({f_{L,P}^*}\), see (3). To describe bad learning behavior, we denote the point spectrum of \(P_X\) by
$$\begin{aligned} \Delta := \bigl \{ x\in X \, : \, P_X(\{x\}) > 0 \bigr \} \, , \end{aligned}$$
see Hoffman-Jorgensen (2017). One easily verifies that \(\Delta\) is at most countable, since \(P_X\) is finite. Moreover, for an arbitrary but fixed version \({f_{L,P}^*}\) of the Bayes decision function, we write
$$\begin{aligned} f^\dagger _{L,P} := {f_{L,P}^*}\, \varvec{1}_{\Delta } - {f_{L,P}^*}\, \varvec{1}_{X\setminus \Delta } \qquad \text {and} \qquad {\mathcal {R}}^\dagger _{L,P} := {\mathcal {R}}_{L,P}\bigl (f^\dagger _{L,P}\bigr ) \, , \end{aligned}$$
where we note that \({\mathcal {R}}^\dagger _{L,P}\) does, of course, not depend on the choice of \({f_{L,P}^*}\). Moreover, note that for \(x\in \Delta\) the value \({f_{L,P}^*}(x)\) is also independent of the choice of \({f_{L,P}^*}\) and it holds \(f^\dagger _{L,P} (x) = {f_{L,P}^*}(x)\). In contrast, for \(x\in X{\setminus } \Delta\) with \({f_{L,P}^*}(x) \ne 0\) we have \(f^\dagger _{L,P}(x) \ne {f_{L,P}^*}(x)\). In fact, a quick calculation using (3) shows
$$\begin{aligned} {\mathcal {R}}^\dagger _{L,P} - {\mathcal {R}}^*_{L,P} = \bigl \Vert f^\dagger _{L,P} - {f_{L,P}^*}\bigr \Vert _{L_2(P_X)}^2 = 4 \int _{X\setminus \Delta } \bigl ({f_{L,P}^*}\bigr )^2 \, \textrm{d}P_X \, , \end{aligned}$$(21)
and consequently we have \({\mathcal {R}}^\dagger _{L,P} - {\mathcal{R}_{L,P}^{*}}>0\) whenever \(P_X(\Delta ) < 1\) and \({f_{L,P}^*}\) does not almost surely vanish on \(X\setminus \Delta\). It seems fair to say that the overwhelming majority of “interesting” P fall into this category. Finally, note that in general we do not have an equality of the form (3), when we replace \({\mathcal{R}_{L,P}^{*}}\) and \({f_{L,P}^*}\) by \({\mathcal {R}}^\dagger _{L,P}\) and \({f_{L,P}^\dagger }\). However, for \(y,t,t'\in Y=[-1,1]\) we have \(|L(y,t) - L(y,t')| \le 4 |t-t'|\), and consequently we find
for all \(f:X\rightarrow Y\). For this reason, we will investigate the bad interpolating histogram rule only with respect to its \(L_2\)-distance to \({f_{L,P}^\dagger }\).
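For completeness, the elementary computation behind this bound, which only uses the least squares loss and \(|y|,|t|,|t'|\le 1\), is
$$\begin{aligned} \bigl |L(y,t) - L(y,t')\bigr | = \bigl |(y-t)^2 - (y-t')^2\bigr | = |t-t'|\,\bigl |2y - t - t'\bigr | \le \bigl (2|y| + |t| + |t'|\bigr )\,|t-t'| \le 4\,|t-t'| \, , \end{aligned}$$
so that \(|{\mathcal {R}}_{L,P}(f) - {\mathcal {R}}_{L,P}(g)| \le 4\Vert f-g\Vert _{L_1(P_X)} \le 4\Vert f-g\Vert _{L_2(P_X)}\) holds for all \(f,g:X\rightarrow Y\).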
Before we state our main result of this section we need to introduce one more assumption that will be required for parts of our results.
Assumption 2.11
There exists a non-decreasing continuous map \(\varphi : \mathbb {R}_+ \rightarrow \mathbb {R}_+\) with \(\varphi (0)=0\) such that for any \(t \ge 0\) and \(x \in X\) one has \(P_X(x + tB_\infty ) \le \varphi ( t)\).
Note that this assumption implies \(P_X(\{x\})=0\) for any \(x \in X\). Moreover, it is satisfied for the uniform distribution \(P_X\) on X, if we consider \(\varphi (t):= 2^d t^d\), and a simple argument shows that, modulo the constant appearing in \(\varphi\), the same is true if \(P_X\) only has a bounded Lebesgue density. The latter is, however, not necessary. Indeed, for \(X=[-1,1]\) and \(0<\beta < 1\) it is easy to construct unbounded Lebesgue densities that satisfy Assumption 2.11 for \(\varphi\) of the form \(\varphi (t) = ct^\beta\), and higher-dimensional analogs are also easy to construct. Moreover, in higher dimensions Assumption 2.11 also applies to various distributions living on sufficiently smooth low-dimensional manifolds.
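Spelled out for the uniform distribution on \(X=[-1,1]^d\), whose Lebesgue density is \(2^{-d}\) on X, the computation behind this example is simply
$$\begin{aligned} P_X\bigl (x + tB_\infty \bigr ) = 2^{-d}\,\lambda ^d\bigl ((x+tB_\infty )\cap X\bigr ) \le 2^{-d}\,(2t)^d = t^d \le 2^d t^d = \varphi (t) \, , \end{aligned}$$
where \(\lambda ^d\) denotes the Lebesgue measure; replacing \(2^{-d}\) by an arbitrary bound on the density only changes the constant in \(\varphi\).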
With these preparations we can now establish the following theorem that shows that for an inflation parameter \(\rho =0\) (see Examples 2.8, 2.9) the good interpolating histogram rule is universally consistent while the bad interpolating histogram rule fails to be consistent in a stark sense. It further shows consistency, respectively non-consistency for \(\rho =\rho _n>0\) with \(\rho _n\rightarrow 0\).
Theorem 2.12
Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto f_{D,s,\rho }^+\) denote the good interpolating histogram rule from Example 2.8. Similarly, let \(D \mapsto f_{D,s,\rho }^-\) denote the bad interpolating histogram rule from Example 2.9. Assume that \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\) as well as \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\).
(i) (Non)-consistency for \(\rho _n = 0\). We have in probability for \(|D|\rightarrow \infty\)
$$\begin{aligned} \Vert f_{D,s_n,0}^+- {f_{L,P}^*} \Vert _{{L_{2}(P_X)}}&\rightarrow 0 \,, \end{aligned}$$(23)
$$\begin{aligned} \Vert f_{D,s_n,0}^-- {f_{L,P}^\dagger } \Vert _{{L_{2}(P_X)}}&\rightarrow 0 \, . \end{aligned}$$(24)
(ii) (Non)-consistency for \(\rho _n >0\). Let \((\rho _n)_{n \in \mathbb {N}}\) be a non-negative sequence with \(\rho _n \rightarrow 0\) as \(n \rightarrow \infty\). Then for all distributions P that satisfy Assumption 2.11 for a function \(\varphi\) with \(n\varphi (\rho _n) \rightarrow 0\) for \(n\rightarrow \infty\), we have
$$\begin{aligned} ||f_{D,s_n, \rho _n}^+- {f_{L,P}^*}||_{L_2(P_X)} \rightarrow 0 \,, \end{aligned}$$(25)
$$\begin{aligned} ||f_{D,s_n,\rho _n}^-- {f_{L,P}^\dagger }||_{L_2(P_X)} \rightarrow 0 \,, \end{aligned}$$(26)
in probability for \(|D|\rightarrow \infty\).
The proof of Theorem 2.12 is provided in Appendix C.2. Our second main result, whose proof is provided in Appendix C.3, refines the above theorem and establishes learning rates for the good and bad interpolating histogram rules, provided the width \(s_n\) and the inflation parameter \(\rho _n\) decrease sufficiently fast as \(n \rightarrow \infty\).
Theorem 2.13
Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto f_{D,s,\rho }^+\) denote the good interpolating histogram rule from Example 2.8. Similarly, let \(D \mapsto f_{D,s,\rho }^-\) denote the bad interpolating histogram rule from Example 2.9. Suppose that \({f_{L,P}^*}\) is \(\alpha\)-Hölder continuous with \(\alpha \in (0,1]\) and that P satisfies Assumption 2.11 for some function \(\varphi\). Assume further that \((s_n)_{n \in \mathbb {N}}\) is a sequence with
$$\begin{aligned} s_n = n^{-\gamma } \, , \qquad \gamma := \frac{1}{2\alpha +d} \, , \end{aligned}$$
and that \((\rho _n)_{n\ge 1}\) is a non-negative sequence with \(n \varphi (\rho _n) \le \ln (n) n^{-2/3}\) for all \(n\ge 1\). Then there exists a constant \(c_{d,\alpha }>0\) only depending on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\), such that for all \(n\ge 1\) the good interpolating histogram rule satisfies
with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Furthermore, for all \(n\ge 1\), the bad interpolating histogram rule satisfies
with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\).
For a discussion of our results, we refer to Sect. 4.
3 Approximation of histograms with ReLU networks
The goal of this section is to build neural networks of suitable depth and width that mimic the learning properties of inflated histogram rules. To be more precise, we aim to construct a particular class of inflated networks that contains good and bad interpolating predictors, similar to the good and bad interpolating histogram rules from Example 2.8 and Example 2.9, respectively.
We begin with describing in more detail the specific networks that we will consider. Given an activation function \(\sigma : \mathbb {R}\rightarrow \mathbb {R}\) and \(b \in \mathbb {R}^p\) we define the shifted activation function \(\sigma _b: \mathbb {R}^p \rightarrow \mathbb {R}^p\) as
$$\begin{aligned} \sigma _b(y) := \bigl ( \sigma (y_1 + b_1), \dots , \sigma (y_p + b_p) \bigr ) \, , \end{aligned}$$
where \(y_j\), \(j=1,...,p\), denote the components of \(y \in \mathbb {R}^p\). A hidden layer with activation \(\sigma\), of width \(p \in \mathbb {N}\) and with input dimension \(q \in \mathbb {N}\) is a function \(H_\sigma :\mathbb {R}^q \rightarrow \mathbb {R}^p\) of the form
$$\begin{aligned} H_\sigma (x) := \sigma _b(Ax) \, , \qquad x\in \mathbb {R}^q \, , \end{aligned}$$
where A is a \(p \times q\) weight matrix and \(b \in \mathbb {R}^{p}\) is a shift vector or bias. Clearly, each pair (A, b) describes a layer, but in general, a layer, if viewed as a function, can be described by more than one such pair. The class of networks we consider is given in the following definition.
Definition 3.1
Given an activation function \(\sigma : \mathbb {R}\rightarrow \mathbb {R}\) and an integer \(\tilde{L}\ge 1\), a neural network with architecture \(p \in \mathbb {N}^{\tilde{L}+1}\) is a function \(f: \mathbb {R}^{p_0} \rightarrow \mathbb {R}^{p_{\tilde{L}}}\), having a representation of the form
$$\begin{aligned} f = H_{\text {id}, \tilde{L}} \circ H_{\sigma , \tilde{L}-1} \circ \cdots \circ H_{\sigma , 1} \, , \end{aligned}$$
where \(H_{\sigma , l}: \mathbb {R}^{p_{l-1}} \rightarrow \mathbb {R}^{p_l}\) is a hidden layer of width \(p_l \in \mathbb {N}\) and input dimension \(p_{l-1} \in \mathbb {N}\), \(l=1,...,\tilde{L}-1\). Here, the last layer \(H_{\text{ id }, \tilde{L}}: \mathbb {R}^{p_{\tilde{L}-1}} \rightarrow \mathbb {R}^{p_{\tilde{L}}}\) is associated to the identity \(\text{ id }: \mathbb {R}\rightarrow \mathbb {R}\).
A network architecture is therefore described by an activation function \(\sigma\) and a width vector \(p = (p_0,...,p_{\tilde{L}}) \in \mathbb {N}^{\tilde{L}+1}\). The positive integer \(\tilde{L}\) is the number of layers, \(\tilde{L}-1\) is the number of hidden layers or the depth. Here, \(p_0\) is the input dimension and \(p_{\tilde{L}}\) is the output dimension. In the sequel, we confine ourselves to the ReLU-activation function \(|\cdot |_+: \mathbb {R}\rightarrow [0,\infty )\) defined by
$$\begin{aligned} |t|_+ := \max \{0, t\} \, , \qquad t\in \mathbb {R} \, . \end{aligned}$$
Moreover, we consider networks with fixed input dimension \(p_0=d\) and output dimension \(p_{\tilde{L}}=1\), that is,
Thus, we may parameterize the (inner) architecture by the width vector \((p_1,...,p_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\) of the hidden layers only. In the following, we denote the set of all such neural networks by \({\mathcal {A}}_{p_1,...,p_{\tilde{L}-1}}\).
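For readers who prefer code, the following minimal numpy sketch (our own; the helper names are illustrative) evaluates such a network: every hidden layer applies \(x\mapsto \sigma _b(Ax)\) with the ReLU activation, and the final layer uses the identity.

```python
import numpy as np

def relu(z):
    """ReLU activation |z|_+ = max(z, 0), applied componentwise."""
    return np.maximum(z, 0.0)

def network(x, layers):
    """Evaluate a network in the sense of Definition 3.1: 'layers' is a list of
    (A, b) pairs; each hidden layer maps z to relu(A z + b), and the final pair
    is the identity (output) layer z -> A z + b."""
    z = np.asarray(x, float)
    for A, b in layers[:-1]:
        z = relu(A @ z + b)
    A, b = layers[-1]
    return A @ z + b

# a member of A_{3, 2}: input dimension d = 2, hidden widths 3 and 2, output 1
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(3, 2)), rng.normal(size=3)),   # first hidden layer
    (rng.normal(size=(2, 3)), rng.normal(size=2)),   # second hidden layer
    (rng.normal(size=(1, 2)), np.zeros(1)),          # identity output layer
]
print(network(np.array([0.3, -0.7]), layers))
```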
3.1 \(\varepsilon\)-approximate inflated histograms
Motivated by the representation (4) for histograms, the first step of our construction approximates the indicator function of a multi-dimensional interval by a small part of a possibly large DNN. This will be our main building block. We emphasize that the ReLU activation function is particularly suited for this approximation and it thus plays a key role in our entire construction.
For the formulation of the corresponding result we fix some notation. For \(z_1, z_2 \in {\mathbb {R}^d}\) we write \(z_1\le z_2\) if each coordinate satisfies \(z_{1,i}\le z_{2,i}\), \(i=1,\dots ,d\). We define \(z_1 < z_2\) analogously. In addition, if \(z_1\le z_2\), then the multi-dimensional interval is \([z_1, z_2]:= \{ z\in {\mathbb {R}^d}: z_1\le z\le z_2 \}\), and we similarly define \((z_1, z_2)\) if \(z_1 < z_2\). Finally, for \(s\in \mathbb {R}\), we let \(z_1 + s:= (z_{1,1}+s,\dots ,z_{1,d}+s)\).
Definition 3.2
Let \(A\subset X\), \(z_1, z_2\in {\mathbb {R}^d}\) with \(z_{1} < z_{2}\) and \(\varepsilon >0\) with \(\varepsilon < \frac{1}{2}\cdot \min \{z_{2,i}-z_{1,i}: i=1,\dots ,d \}\). Then a network \(\varvec{1}_A^{(\varepsilon )} \in {\mathcal {A}}_{2d,1}\) is called an \(\varepsilon\)-Approximation of the indicator function \(\varvec{1}_A: X \rightarrow [0,1]\) if
and if
The next lemma ensures the existence of such approximations. The full construction is elementary calculus and is provided in Appendix D.2, in particular in Lemma D.3. Lemma D.5 provides then the desired properties.
Lemma 3.3
[Existence of \(\varepsilon\)-Approximations] Let \(z_1, z_2\in {\mathbb {R}^d}\) and \(\varepsilon >0\) as in Definition 3.2. Then for all \(A\subset X\) with \([z_{1}+\varepsilon , z_{2}-\varepsilon ] \subset A \subset [z_{1}, z_{2}]\) there exists an \(\varepsilon\)-Approximation \(\varvec{1}_A^{(\varepsilon )}\) of \(\varvec{1}_A\).
Fig. 2 Left. Approximation \(\varvec{1}_A^{(\varepsilon )}\) (orange) of the indicator function \(\varvec{1}_A\) for \(A =[0.1, 0.6]\) (blue) according to Lemma 3.3 for \(\varepsilon = 0.1\) on \(X=[0,1]\). The construction of \(\varvec{1}_A^{(\varepsilon )}\) ensures that \(\varvec{1}_A^{(\varepsilon )}\) coincides with \(\varvec{1}_A\) modulo a small set that is controlled by \(\varepsilon >0\). Right. A DNN (blue) for regression that approximates the histogram \(\varvec{1}_{[0,0.5)} + 0.8 \cdot \varvec{1}_{[0.5,1)}\) and a DNN (orange) that additionally tries to interpolate two samples \(x_1 = 0.2\) and \(x_2= 0.575\) (located at the two vertical dotted lines) with \(y_i = -0.5\). The label \(y_1\) is correctly interpolated since the alignment condition (11) is satisfied for \(x_1\) with \(t=0.15\) and \(\varepsilon =\delta = t/3 = 0.05\) as in Example 3.6. In contrast, \(y_2\) is not correctly interpolated since condition (11) is violated for this t and hence \(\varepsilon\) and \(\delta\) are too large (Color figure online)
Figure 2 illustrates \(\varvec{1}_A^{(\varepsilon )}\) for \(d=1\). Moreover, the proof of Lemma D.3 shows that out of the \(2d^2\) weight parameters of the first layer, only 2d are non-zero. In addition, the 2d weight parameters of the neuron in the second layer are all identical. In order to approximate inflated histograms we need to know how to combine several functions of the form provided by Lemma 3.3 into a single neural network. An appealing feature of our DNNs is that the concatenation of layer structures is very easy.
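The exact construction is given in Appendix D.2; the following sketch (our own, not necessarily identical to the one of Lemma D.3) realizes one function with the properties just described: the first hidden layer has \(2d\) ReLU neurons, each depending on a single coordinate, and the single second-layer neuron has \(2d\) identical incoming weights \(-1/\varepsilon\) and bias 1.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def indicator_approx(x, z1, z2, eps):
    """Two-hidden-layer ReLU block g with 1_{[z1+eps, z2-eps]} <= g <= 1_{[z1, z2]}.
    First hidden layer: 2d neurons, each depending on one coordinate only;
    second hidden layer: one neuron whose 2d incoming weights all equal -1/eps."""
    x, z1, z2 = (np.asarray(v, float) for v in (x, z1, z2))
    below = relu((z1 + eps) - x)      # > 0 as soon as x_i < z_{1,i} + eps
    above = relu(x - (z2 - eps))      # > 0 as soon as x_i > z_{2,i} - eps
    return relu(1.0 - (below.sum() + above.sum()) / eps)

# d = 1, A = [0.1, 0.6], eps = 0.1 as in Fig. 2 (left): value 1 on [0.2, 0.5],
# value 0 outside [0.1, 0.6], and linear ramps in between
for x in (0.0, 0.1, 0.15, 0.3, 0.55, 0.6, 0.8):
    print(x, indicator_approx([x], [0.1], [0.6], 0.1))
```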
Lemma 3.4
If \(c \in \mathbb {R}\), \(p, p' \in \mathbb {N}^2\), and \(g \in {\mathcal {A}}_{p}\), \(g' \in {\mathcal {A}}_{p'}\), then \(cg \in {\mathcal {A}}_{p}\) and \(g + g' \in {\mathcal {A}}_{p+p'}.\)
Lemma 3.4 describes some properties of neural networks with respect to scaling and addition. It tells us that the class of neural networks is closed under scalar multiplication and addition, with the width of the resulting networks adjusted appropriately. The proof is based on elementary linear algebra. For an extended version of this result, see Lemma D.2. In particular, our constructed DNNs have a particularly sparse structure and the number of required neurons behaves in a very controlled and natural fashion.
With these insights, we are now able to find a representation similar to (4). To this end, we choose a cubic partition \({\mathcal {A}}=(A_j)_{j \in J}\) of X with width \(s>0\) and define for \(\varepsilon \in (0, \frac{s}{3}]\)
$$\begin{aligned} {\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}} := \Bigl \{ \sum _{j\in J} c_j \varvec{1}_{A_j}^{(\varepsilon )} \, : \, c_j \in Y \text { for all } j\in J \Bigr \} \, , \end{aligned}$$
where \(\varvec{1}_{A_j}^{(\varepsilon )}:= (\varvec{1}_{B_j}^{(\varepsilon )})_{|A_j}\) is the restriction of \(\varvec{1}_{B_j}^{(\varepsilon )}\) to \(A_j\) and \(\varvec{1}_{B_j}^{(\varepsilon )}\) is an \(\varepsilon\)-approximation of \(\varvec{1}_{B_j}\) of Lemma 3.3. Here, \(B_j\) is the cell with \(A_j = B_j\cap X\), see the text around (8). We call any function in \({\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}}\) an \(\varepsilon\)-approximate histogram.
Our considerations above show that we have \({\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}} \subset {\mathcal {A}}_{p_1,p_2}\) with \(p_1 = 2d|J|\) and \(p_2 = |J|\). Thus, any \(\varepsilon\)-approximate histogram can be represented by a neural network with 2 hidden layers. Inflated versions are now straightforward.
Definition 3.5
Let \(s \in (0,1]\), \(m\ge 1\), and \(\varepsilon \in (0, s/3]\). Then a function \(f^{\varepsilon }: X \rightarrow Y\) is called an \(\varepsilon\)-approximated m-inflated histogram of width s if there exist a subset \(\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) and a cubic partition \(\mathcal{A}\) of width s that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r\in [0,s)\) such that
$$\begin{aligned} f^{\varepsilon } = h^{(\varepsilon )} + \sum _{i=1}^m b_i \varvec{1}^{(\delta )}_{x_i^* + t B_\infty } \, , \end{aligned}$$
where \(h^{(\varepsilon )} \in {\mathcal {H}}_{\mathcal {A}}^{(\varepsilon )}\), \(t \in (0,r]\), \(\delta \in (0, t/3]\), \(b_i \in 2Y\) and where \(\varvec{1}^{(\delta )}_{x_i^* + t B_\infty }\) is a \(\delta\)-approximation of \(\varvec{1}_{x_i^* + t B_\infty }\) for all \(i=1,...,m\). We denote the set of all \(\varepsilon\)-approximated m-inflated histograms of width s by \({\mathcal {F}}^{(\varepsilon )}_{s, m}\).
A short calculation shows that \({\mathcal {F}}^{(\varepsilon )}_{s, m} \subset {\mathcal {A}}_{p_1,p_2}\) with \(p_1 = 2d(m+|J|)\), \(p_2 = m+|J|\) and \(|J| \le (2/s)^d\). With these preparations, we can now introduce good and bad interpolating DNNs.
Example 3.6
(Good and bad interpolating DNN) Let L be the least squares loss, \(s \in (0,1]\) be a cell width and let \(\rho > 0\) be an inflation parameter. For a data set \(D=((x_1,y_1),\dots ,(x_n,y_n))\) we consider again a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\), with \(m=|D_X|\), being properly aligned to \(D_X\) with parameter r. Set \(t:=\min \{r, \rho \}\). According to Example 2.8, a good interpolating HR is given by
$$\begin{aligned} f_{D,s,\rho }^+ = \sum _{j\in J} c_j^+ \varvec{1}_{A_j} + \sum _{i=1}^m b_i^+ \varvec{1}_{x_i^* + t B_\infty } \, , \end{aligned}$$
where the \((c_j^+)_{j \in J}\) are given in (6) and \(b^+_1,\dots ,b_m^+\) are from (14). For \(\varepsilon := \delta := t/3\) we then define the good interpolating DNN by
$$\begin{aligned} g_{D,s,\rho }^+ := \sum _{j\in J} c_j^+ \varvec{1}_{A_j}^{(\varepsilon )} + \sum _{i=1}^m b_i^+ \varvec{1}^{(\delta )}_{x_i^* + t B_\infty } \, . \end{aligned}$$
Clearly, we have \(g_{D,s,\rho }^+\in {\mathcal {F}}^{(\varepsilon )}_{s, m}\). We call the map \(D\mapsto g_{D,s,\rho }^+\) a good interpolating DNN and it is easy to see that this network indeed interpolates D. Finally, the bad interpolating DNN \(g_{D,s,\rho }^-\) is defined analogously using the bad interpolating HR from Example 2.9, instead.
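Putting the pieces together, a purely functional Python sketch of Example 3.6 (our own simplification: zero partition offset instead of the properly aligned one, and distinct covariates so that \(c_i^*=y_i\)) reads as follows; sign=-1 yields the bad interpolating DNN.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def box_bump(x, z1, z2, eps):
    # ReLU block with 1_{[z1+eps, z2-eps]} <= value <= 1_{[z1, z2]}
    below, above = relu((z1 + eps) - x), relu(x - (z2 - eps))
    return relu(1.0 - (below.sum() + above.sum()) / eps)

def fit_interpolating_dnn(X, y, s, t, sign=+1):
    """Good (sign=+1) or bad (sign=-1) interpolating DNN: eps-approximate cell
    indicators weighted by the (possibly negated) histogram coefficients, plus
    delta-approximate bumps of height b_i on the t-boxes around the samples,
    with eps = delta = t / 3."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    eps = delta = t / 3.0
    keys = np.floor(X / s).astype(int)
    cells = {}
    for k, target in zip(map(tuple, keys), y):
        cells.setdefault(k, []).append(target)
    coeffs = {k: sign * float(np.mean(v)) for k, v in cells.items()}
    # corrections b_i = y_i - c_{j_i}; for continuous data no sample is
    # eps-close to a cell boundary with probability one, so the cell bump
    # equals 1 at every sample point
    b = np.array([y[i] - coeffs[tuple(keys[i])] for i in range(len(y))])

    def predict(x):
        x = np.asarray(x, float)
        k = tuple(np.floor(x / s).astype(int))
        z1 = np.array(k) * s
        out = coeffs.get(k, 0.0) * box_bump(x, z1, z1 + s, eps)
        for i in range(len(y)):
            out += b[i] * box_bump(x, X[i] - t, X[i] + t, delta)
        return out

    return predict

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.clip(np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50), -1, 1)
g_plus = fit_interpolating_dnn(X, y, s=0.25, t=1e-6)
print(max(abs(g_plus(X[i]) - y[i]) for i in range(50)))  # ~ 0: the data are interpolated
```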
Similarly to our inflated histograms from the previous section, the next theorem shows that the good interpolating DNN is consistent while the bad interpolating DNN fails to be. The proof of this result is given in Appendix D.3.
Theorem 3.7
[(Non)-consistency] Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto g_{D,s,\rho }^+\) denote the good interpolating DNN from Example 3.6. Similarly, let \(D \mapsto g_{D,s,\rho }^-\) denote the bad interpolating DNN from Example 3.6. Assume that \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\), \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\) as well as \(s_n > 2n^{-1/d}\). Additionally, let \((\rho _n)_{n \in \mathbb {N}}\) be a non-negative sequence with \(\rho _n \le 2n^{-1/d}\). Then \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\). Moreover, for all distributions P that satisfy Assumption 2.11 for a function \(\varphi\) with \(\rho _n^{-d} \varphi ( \rho _n ) \rightarrow 0\) for \(n\rightarrow \infty\), we have
in probability for \(|D|\rightarrow \infty\).
The above result can further be refined to establish rates of convergence if the width \(s_n\) and the inflation parameter \(\rho _n\) converge to zero sufficiently fast as \(n \rightarrow \infty\). The proof is provided in Appendix D.4.
Theorem 3.8
[Learning Rates] Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto g_{D,s,\rho }^+\) denote the good interpolating DNN from Example 3.6. Similarly, let \(D \mapsto g_{D,s,\rho }^-\) denote the bad interpolating DNN from Example 3.6. Suppose that \({f_{L,P}^*}\) is \(\alpha\)-Hölder continuous with \(\alpha \in (0,1]\) and that P satisfies Assumption 2.11 for some function \(\varphi\). Assume further that \((s_n)_{n \in \mathbb {N}}\) is a sequence with
$$\begin{aligned} s_n = n^{-\gamma } \, , \qquad \gamma := \frac{1}{2\alpha +d} \, , \end{aligned}$$
and that \((\rho _n)_{n\ge 1}\) is a non-negative sequence with \(\rho _n \le 2n^{-1/d}\) and \(\rho _n^{-d} \varphi (\rho _n) \le \ln (n) n^{-2/3}\) for all \(n\ge 1\). Then there exists a constant \(c_{d,\alpha }>0\) only depending on d, \(\alpha\), and the Hölder constant \(|f^*_{L,P}|_\alpha\), such that for all \(n\ge 2\) the good interpolating DNN satisfies
with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Furthermore, for all \(n\ge 2\), the bad interpolating DNN satisfies
with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Finally, there exists a natural number \(n_{d, \alpha } > 0\) such that for any \(n \ge n_{d, \alpha }\) we have \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\).
Note that the rates of convergence in (34) and (35) remain true if we consider a sequence \(s_n\) with \(c^{-1} n^{-\gamma } \le s_n \le cn^{-\gamma }\) for some constant c independent of n. In fact, the only reason why we have formulated Theorem 3.8 with \(s_n = n^{-\gamma }\) is to avoid another constant appearing in the statements. Moreover, if we choose \(s_n:= 2 a \lfloor n^{\gamma }\rfloor ^{-1}\) with \(a:= 3^{1/d}/(3^{1/d}-2)\), then we have \(|J| \le (2 s_n^{-1} + 2)^d \le (a^{-1}n^{1/d} + 2)^d \le n\) for all \(n\ge 3\). Consequently, for \(m:= n\), we can choose \(n_{d, \alpha }:= 3\), and hence we have \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\) for all \(n\ge 3\) while (34) and (35) hold true modulo a change in the constant \(c_{\alpha ,d}\).
4 Discussion and summary of results
In this section we summarize our findings and put them into a broader context.
4.1 Inflated histograms
To set the results from Sect. 2 in context, let us first recall that even for a fixed hypotheses class, ERM is, in general, not a single algorithm, but a collection of algorithms. In fact, this ambiguity appears as soon as the ERM optimization problem does not have a unique solution for certain data sets, and as Lemma A.1 shows, this non-uniqueness may even occur for strictly convex loss functions such as the least squares loss. Now, the standard techniques of statistical learning theory are capable of showing that for sufficiently small hypotheses classes, all versions of ERM enjoy good statistical guarantees. In other words, the non-uniqueness of ERM does not affect its learning capabilities as long as the hypotheses class is sufficiently small. In addition, it is folklore that in some large hypotheses classes, there may be heavily overfitting ERM solutions, leading to the usual conclusion that such hypotheses classes should be avoided.
In contrast to this common wisdom, however, Theorem 2.12 demonstrates that for large hypotheses classes, the situation may be substantially more complicated: First, it shows that there exist empirical risk minimizers whose predictors converge to a function \({f_{L,P}^\dagger }\), see (24), that in almost all interesting cases is far off the target regression function, see (21), confirming that the overfitting issue is indeed present for the chosen hypotheses classes. Moreover, this strong overfitting may actually take place with fast convergence, see (28). Despite this negative result, however, we can also find empirical risk minimizers that enjoy a good learning behavior in terms of consistency (23) and almost optimal learning rates (27). In other words, both the expected overfitting and standard learning guarantees may be realized by suitable versions of empirical risk minimizers over these hypotheses classes. In fact, these two different behaviors are just extreme examples, and a variety of intermediate behaviors are possible, too: Indeed, as the training error can be solely controlled by the corrections on the inflating parts, the behavior of the histogram part h can be chosen arbitrarily. For our theorems above, we have chosen a particularly good and a particularly bad h-part, respectively, but of course, a variety of other choices leading to intermediate behavior are also possible. As a consequence, the ERM property of an algorithm working with a large hypotheses class is, in general, no longer a sufficient notion for describing its learning behavior. Instead, additional assumptions are required to determine its learning behavior. In this respect we also note that for our inflated hypotheses classes, other learning algorithms that do not (approximately) minimize the empirical risk may also enjoy good learning properties. Indeed, by setting the inflating parts to zero, we recover standard histograms, which in general do not have close-to-zero training error, but for which the guarantees of our good interpolating predictors also hold true.
Of course, the chosen hypotheses classes may, to some extent, appear artificial. Nonetheless, in Sect. 3 they are key for showing that for sufficiently large DNN architectures exactly the same phenomena occur for some of their global minima.
4.2 Neural networks
To fully appreciate Theorems 3.7 and 3.8 as well as their underlying construction let us discuss its various consequences.
Training. The good interpolating DNN predictors \(g_{D,s_n, \rho _n}^+\) show that it is actually possible to train sufficiently large, over-parameterized DNNs such that they become consistent and enjoy optimal learning rates up to a logarithmic factor without adapting the network size to the particular smoothness of the target function. In fact, it suffices to consider DNNs with two hidden layers and 4dn, respectively 2n, neurons in the first, respectively second, hidden layer. In other words, Theorems 3.7 and 3.8 already apply to moderately over-parameterized DNNs and, by the particular properties of the ReLU-activation function, also to all larger network architectures. In addition, when using architectures of minimal size, training, that is, constructing \(g_{D,s_n, \rho _n}^+\), can be done in \(\mathcal{O}(d^2\cdot n^2)\)-time if the NNs are implemented as fully connected networks. Moreover, the constructed NNs have a particularly sparse structure and exploiting this can actually reduce the training time to \(\mathcal{O}(d \cdot n\cdot \log n)\). While we present statistically sound end-to-end proofs of consistency and optimal rates for NNs, we also need to admit that our training algorithm is mostly interesting from a theoretical point of view, but useless for practical purposes.
Optimization Landscape. Theorems 3.7 and 3.8 also have consequences for DNNs trained by variants of stochastic gradient descent (SGD) if the resulting predictor is interpolating. Indeed, these theorems show that ending in a global minimum may result in either a very good learning behavior or an extremely poor, overfitting behavior. In fact, all the observations made for histograms at the end of Sect. 2 apply to DNNs, too. In particular, since for \(n\ge n_{d,\alpha }\) the \({\mathcal {A}}_{4dn, 2n}\)-networks can \(\varepsilon\)-approximate all functions in \(\mathcal{F}_{s,n}^*\) for all \(\varepsilon \ge 0\) and all \(s\in [n^{-1/d}, 1]\), we can, for example, find, for each polynomial learning rate slower than \(n^{-\alpha \gamma }\), an interpolating learning method \(D\mapsto f_D\) with \(f_D\in {\mathcal {A}}_{4dn, 2n}\) that learns with this rate. Similarly, we can find interpolating \(f_D\in {\mathcal {A}}_{4dn, 2n}\) with various degrees of bad learning behavior. In summary, the optimization landscape induced by \({\mathcal {A}}_{4dn, 2n}\) contains a wide variety of global minima whose learning properties range somewhat continuously from essentially optimal to extremely poor. Consequently, an optimization guarantee for (S)GD, that is, a guarantee that (S)GD finds a global minimum in the optimization landscape, is useless for learning guarantees unless more information about the particular nature of the minimum found is provided. Moreover, it becomes clear that considering (S)GD without taking the initialization of the weights and biases into account is a meaningless endeavor: For example, constructing \(g_{D,s_n, \rho _n}^{\pm }\) can be viewed as a very particular form of initialization for which (S)GD will not change the parameters anymore. More generally, when the parameters are initialized randomly in the attraction basin of \(g_{D,s_n, \rho _n}^{\pm }\), GD will converge to \(g_{D,s_n, \rho _n}^{\pm }\), and therefore the behavior of GD is completely determined by the initialization. In this respect note that so far there is no statistically sound way to distinguish between good and bad interpolating DNNs on the basis of the training set alone, and hence the only way to identify good interpolating DNNs obtained by SGD is to use a validation set (that SGD can reach bad global minima is shown in Liu et al. (2020)). Now, for the good interpolating DNNs of Theorem 3.7 it is actually possible to construct a finite set of candidates such that the one with the best validation error achieves the optimal learning rates without knowing \(\alpha\). For DNNs trained by SGD, however, we do not have this luxury anymore. Indeed, while we can still identify the best predicting DNN from a finite set of SGD-learned interpolating DNNs, we no longer have any theoretical understanding of whether there is any useful candidate among them, or whether they all behave like a bad \(g_{D,s_n, \rho _n}^-\).
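The validation-based selection just mentioned can be sketched generically as follows; the helper names are hypothetical, the candidates are plain histograms indexed by a width parameter for brevity, and the selection step itself is agnostic to whether the candidates interpolate (the explicit candidate set of Theorem 3.7 is not reproduced here).

```python
import numpy as np

def hist_fit(xt, yt, s):
    """Plain histogram on a cubic partition of [-1, 1] with width s."""
    edges = np.linspace(-1.0, 1.0, int(round(2.0 / s)) + 1)
    cells = np.clip(np.digitize(xt, edges) - 1, 0, len(edges) - 2)
    means = np.array([yt[cells == j].mean() if np.any(cells == j) else 0.0
                      for j in range(len(edges) - 1)])
    return lambda x: means[np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)]

def select_by_validation(x, y, widths, fit, val_fraction=0.2, seed=0):
    """Fit one candidate per width on a training split and return the one with
    the smallest empirical L2 error on the held-out validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = max(1, int(val_fraction * len(x)))
    val, train = idx[:n_val], idx[n_val:]
    candidates = [fit(x[train], y[train], s) for s in widths]
    errors = [np.mean((g(x[val]) - y[val]) ** 2) for g in candidates]
    return candidates[int(np.argmin(errors))], min(errors)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 500)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=500)
f_best, err = select_by_validation(x, y, widths=[0.5, 0.25, 0.1, 0.05], fit=hist_fit)
print("validation error of the selected width:", err)
```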
For both consistency and learning with essentially optimal rates it is by no means necessary to find a global minimum, or at least a local minimum, in the optimization landscape. For example, the positive learning rates (27) also hold for ordinary cubic histograms with widths \(s_n:= n^{-\gamma }\), and the latter can, of course, also be approximated by \({\mathcal {A}}_{4dn, 2n}\). Repeating the proof of Theorem 3.8 it is easy to verify that these approximations also enjoy the good learning rates (34). Moreover, these approximations \(f_D\) are almost never global minima, or more precisely, \(f_D\) is not a global minimum as soon as there exists a cubic cell A containing two samples \(x_i\) and \(x_j\) with different labels, i.e. \(y_i\ne y_j\). In fact, in this case, \(f_D\) is not even a local minimum. To see this, assume without loss of generality that \(x_i\) is one of the samples in A with \(y_i \ne f_D(x_i)\). Considering \(f_{D,\lambda }:= f_D + \lambda b_i^+ \varvec{1}_{x_i+tB_\infty }^{(t/3)}\) for all \(\lambda \in [0,1]\) and \(t:= \min \{r,\rho \}\) we then see that there is a continuous path in the parameter space of \({\mathcal {A}}_{4dn, 2n}\) that corresponds to the \(\Vert \cdot \Vert _\infty\)-continuous path \(\lambda \mapsto f_{D,\lambda }\) in the set of functions \({\mathcal {A}}_{4dn, 2n}\) for which we have \({\mathcal{R}_{L,D}(f_{D,\lambda })} < {\mathcal{R}_{L,D}(f_D)}\) for all \(\lambda \in (0,1]\). In other words, \(f_D\) is not a local minimum. In this respect we note that this phenomenon also occurs to some extent in under-parameterized DNNs, at least for \(d=1\). Indeed, if we consider \(m:= 1\) and \(s_n:= n^{-\gamma }\), then \(f_D, f_{D,\lambda }\in {\mathcal {A}}_{4dn^{\gamma d}, 2 n^{\gamma d}}\) for all sufficiently large n. Now, the functions in \({\mathcal {A}}_{4dn^{\gamma d}, 2 n^{\gamma d}}\) have \(\mathcal{O}(d^2 n^{2\gamma d})\) many parameters and for \(2\gamma d = \frac{2d}{2\alpha +d}< 1\), that is \(\alpha > d/2 = 1/2\), we then see that we have strictly fewer than \(\mathcal{O}(\sqrt{n})\) neurons with \(\mathcal{O}(n)\) parameters, while all the observations made so far still hold.
Finally, we want to mention that a number of recent works analyze concrete efficient GD-type algorithms (Ji et al., 2021; Ji & Telgarsky, 2019; Song et al., 2021; Chen et al., 2021; Kuzborskij & Szepesvari, 2021; Kohler & Krzyzak, 2019; Nguyen & Mücke, 2024; Braun et al., 2024) and SGD-type algorithms (Rolland et al., 2021; Deng et al., 2022; Li & Liang, 2018; Kalimeris et al., 2019; Allen-Zhu et al., 2019; Cao et al., 2024), with a focus on the particular algorithmic properties and network architecture (e.g. early stopping and the required degree of overparameterization) rather than ERM. Also, the effect of regularization is investigated in e.g. Hu et al. (2021); Wei et al. (2019). Our work differs from the perspective taken in these works in that we aim to provide, all at once, a theoretical investigation of the qualitatively different learning properties of interpolating ReLU-DNNs.
Data availability
Not applicable.
Code availability
Not applicable.
Notes
1. Note that this gives \(A(x_i^*) = B(x_i^*)\cap X\).
2. By end-to-end we mean the explicit construction of an efficient, feasible, and implementable training algorithm and the rigorous statistical analysis of this very particular algorithm under minimal assumptions.
3. This is justified since \(\varepsilon _n = \rho _n/2 \le n^{-1/d} < s_n/2\).
References
Allen-Zhu, Z., Li, Y. & Liang, Y. (2019). Learning and generalization in overparameterized neural networks, going beyond two layers. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 6158–6169.
Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR.
Bartlett, P. L., & Long, P. M. (2021). Failures of model-dependent generalization bounds for least-norm interpolation. Journal of Machine Learning Research, 22(204), 1–5.
Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48), 30063–30070.
Bauer, H. (2001). Measure and Integration Theory. Berlin: De Gruyter.
Bauer, B., & Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4), 2261–2285.
Belkin, M., Hsu, D. J., & Mitra, P. (2018). Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pp. 2300–2311. Curran Associates, Inc.
Belkin, M., Hsu, D., & Xu, J. (2020). Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4), 1167–1180.
Belkin, M., Rakhlin, A., & Tsybakov, A. B. (2019). Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1611–1619. PMLR.
Braun, A., Kohler, M., Langer, S., & Walk, H. (2024). Convergence rates for shallow neural networks learned by gradient descent. Bernoulli, 30(1), 475–502.
Cao, D., Guo, Z.-C., & Shi, L. (2024). Stochastic gradient descent for two-layer neural networks. Preprint at arXiv:2407.07670.
Chen, L., Min, Y., Belkin, M., & Karbasi, A. (2020). Multiple descent: Design your own generalization curve. Preprint at arXiv:2008.01036.
Chen, Z., Cao, Y., Zou, D., & Gu, Q. (2021). How much over-parameterization is sufficient to learn deep relu networks? In International Conference on Learning Representations (ICLR).
Deng, Y., Kamani, M. M., & Mahdavi, M. (2022). Local SGD optimizes overparameterized neural networks in polynomial time. In International Conference on Artificial Intelligence and Statistics, pp. 6840–6861. PMLR.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
Hoffmann-Jørgensen, J. (2017). Probability with a View Towards Statistics (Vol. I). Routledge.
Hu, T., Wang, W., Lin, C., & Cheng, G. (2021). Regularization matters: A nonparametric perspective on overparametrized neural network. In International Conference on Artificial Intelligence and Statistics, pp. 829–837. PMLR.
Ji, Z., Li, J., & Telgarsky, M. (2021). Early-stopped neural networks are consistent. Advances in Neural Information Processing Systems, 34, 1805–1817.
Ji, Z., & Telgarsky, M. (2019). Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In International Conference on Learning Representations (ICLR).
Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., & Zhang, H. (2019). SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32.
Koehler, F., Zhou, L., Sutherland, D. J., & Srebro, N. (2021). Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting. Advances in Neural Information Processing Systems, 34, 20657–20668.
Kohler, M., & Krzyżak, A. (2005). Adaptive regression estimation with multilayer feedforward neural networks. Nonparametric Statistics, 17(8), 891–913.
Kohler, M., & Krzyzak, A. (2019). Over-parametrized deep neural networks do not generalize well. Preprint at arXiv:1912.03925.
Kohler, M., & Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4), 2231–2249.
Kohler, M., Langer, S., & Reif, U. (2023). Estimation of a regression function on a manifold by fully connected deep neural networks. Journal of Statistical Planning and Inference, 222, 160–181.
Kuzborskij, I., & Szepesvári, C. (2021). Nonparametric regression with shallow overparameterized neural networks trained by gd with early stopping. In Conference on Learning Theory, pp. 2853–2890. PMLR.
Li, Y., & Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31.
Liang, T., Rakhlin, A., & Zhai, X. (2020). On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In Conference on Learning Theory, pp. 2683–2711. PMLR.
Liu, S., Papailiopoulos, D., & Achlioptas, D. (2020). Bad global minima exist and SGD can reach them. Advances in Neural Information Processing Systems, 33, 8543–8552.
Ma, S., Bassily, R., & Belkin, M. (2018). The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3325–3334.
McCaffrey, D. F., & Gallant, A. R. (1994). Convergence rates for single hidden layer feedforward networks. Neural Networks, 7(1), 147–158.
Mei, S., & Montanari, A. (2019). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75, 667.
Nagarajan, V., & Kolter, J. Z. (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems 32.
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., & Srebro, N. (2019). Towards understanding the role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations (ICLR).
Nguyen, M., & Mücke, N. (2024). How many neurons do we need? a refined analysis for shallow networks trained with gradient descent. Journal of Statistical Planning and Inference, 233, 106169.
Rolland, P., Ramezani-Kebrya, A., Song, C. H., Latorre, F., & Cevher, V. (2021). Linear convergence of SGD on overparametrized shallow neural networks.
Salakhutdinov, R. (2017). Deep learning tutorial at the Simons Institute. https://simons.berkeley.edu/talks/ruslan-salakhutdinov-01-26-2017-1.
Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4), 1875–1897.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Song, C., Ramezani-Kebrya, A., Pethick, T., Eftekhari, A., & Cevher, V. (2021). Subquadratic overparameterization for shallow neural networks. Advances in Neural Information Processing Systems, 34, 11247–11259.
Steinwart, I., & Christmann, A. (2008). Support Vector Machines. Springer.
Suzuki, T. (2018). Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: Optimal rate and curse of dimensionality. In International Conference on Learning Representations.
Tsigler, A., & Bartlett, P. L. (2020). Benign overfitting in ridge regression. Preprint at arXiv:2009.14286.
Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.
van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge University Press.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.
Wei, C., Lee, J. D., Liu, Q., & Ma, T. (2019). Regularization matters: Generalization and optimization of neural nets vs their induced kernel. Advances in Neural Information Processing Systems, 32.
Yang, Z., Bai, Y., & Mei, S. (2021). Exact gap between generalization error and uniform convergence in random feature models. In International Conference on Machine Learning, pp. 11704–11715. PMLR.
Yarotsky, D. (2018). Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pp. 639–649. PMLR.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. Technical report, arXiv:1611.03530.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.
Zhou, L., Sutherland, D. J., & Srebro, N. (2020). On uniform convergence and low-norm interpolation learning. Advances in Neural Information Processing Systems, 33.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
All authors whose names appear on the submission 1) made substantial contributions to the conception or design of the work; 2) drafted the work or revised it critically for important intellectual content; 3) approved the version to be published; and 4) agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Conflict of interest
All authors whose names appear on the submission declare to have no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Hendrik Blockeel.
Appendices
A Characterization of empirical risk minimizers
In this section we provide a full characterization of empirical risk minimizers, which we use several times when proving our main results.
Lemma A.1
(Characterization of ERMs) Let Y be convex, \(A\subseteq X\) be non-empty, \(\mathcal{A} = (A_j)_{j\in J}\) be a finite partition of A, and
Moreover, let \(D=((x_1,y_1),\dots ,(x_n,y_n)) \in (X\times Y)^n\) be a data set and let \(L_A(x, y, t)=\varvec{1}_A(x)L(x,y,t)\), with L being the least squares loss. Furthermore, denote the number of samples whose covariates fall into cell \(A_j\) by \(N_j\), that is \(N_j:=|\{i: x_i \in A_j\}|\). Then, for every \(f^*\in \mathcal{H}_{\mathcal{A}}\) with representation \(f^* = \sum _{j\in J} c_j \varvec{1}_{A_j}\), the following statements are equivalent:
(i) The function \(f^*\) is an empirical risk minimizer, that is
$$\begin{aligned} {\mathcal{R}_{L_A,D}(f^*)} = \min _{f\in \mathcal{H}_{\mathcal{A}}} {\mathcal{R}_{L_A,D}(f)}\,. \end{aligned}$$
(ii) For all \(j\in J\) satisfying \(N_j \ne 0\) we have
$$\begin{aligned} c_j = \frac{1}{N_j}\sum _{i: x_i\in A_j} y_i \;. \end{aligned}$$(36)
Proof of Lemma A.1
We first note that for an \(f^*\in \mathcal{H}_{\mathcal{A}}\) with representation \(f^*= \sum _{j\in J} c_j \varvec{1}_{A_j}\) we have
Consequently, \(f^*\) is an empirical risk minimizer if and only if \(c_j\) minimizes \(\sum _{i: x_i\in A_j} L(x_i,y_i, \cdot )\) for all \(j\in J\). Now, if \(N_j = 0\), the sum is empty, and hence there is nothing to consider. For \(j\in J\) with \(N_j \ne 0\) we observe that
which is minimized for \(c_j\) given by (36). \(\square\)
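As a numerical illustration of Lemma A.1 (a minimal numpy sketch with hypothetical names), the cell-wise label means attain the smallest empirical least-squares risk among functions that are constant on the cells of a fixed partition; perturbing the coefficients can only increase the risk.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = np.sign(x) + 0.3 * rng.normal(size=200)

edges = np.linspace(-1, 1, 9)                            # fixed partition of A = [-1, 1]
cells = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)

# ERM over piecewise-constant functions: cell-wise means on occupied cells, cf. (36)
c_star = np.array([y[cells == j].mean() if np.any(cells == j) else 0.0
                   for j in range(len(edges) - 1)])
risk_star = np.mean((y - c_star[cells]) ** 2)

# any perturbation of the coefficients leaves the empirical risk at least as large
for _ in range(1000):
    c = c_star + 0.1 * rng.normal(size=c_star.shape)
    assert np.mean((y - c[cells]) ** 2) >= risk_star
print("minimal empirical risk of the cell-wise means:", risk_star)
```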
B Existence of properly aligned cubic partitioning rule
In this section we prove the existence of a properly aligned cubic partitioning rule.
Proof of Theorem 2.5
Recall that cubic partitions \(\mathcal{B}\) of \(\mathbb {R}^d\) have a representation of the form (8). Now, to construct \(\pi _{m,s}\) we will consider a finite set of candidate offsets \(x_1^\dagger , \dots , x_K^\dagger \in \mathbb {R}^d\). For the construction of these offsets we write \(\delta := s/(m+1)\) and for \(j \in \{0,\dots ,m\}\) we further define
Now, our candidate offsets \(x_1^\dagger , \dots , x_K^\dagger \in \mathbb {R}^d\) are exactly those vectors whose coordinates are taken from \(z_{0}^\dagger , \dots , z_{m}^\dagger\). Clearly, this gives \(K=(m+1)^d\). Now let \(\{x_1^*,\dots , x_m^*\}\in \textrm{Pot}_m([-1,1]^d)\). In the following, we will identify the offset \(x_\ell ^\dagger\) that leads to \(\pi _{m,s}(\{x_1^*,\dots , x_m^*\})\) coordinate-wise. We begin by determining its first coordinate \(x_{\ell ,1}^\dagger\). To this end, we define
Our first goal is to show that \(I_0,\dots ,I_m\) are a partition of \(\mathbb {R}\). To this end, we fix an \(x\in \mathbb {R}\). Then there exists a unique \(k\in \mathbb {Z}\) with \(k s \le x < (k +1) s\). Moreover, for \(y:= x- k s\in [0,s)\), there exists a unique \(j\in \{0,\dots ,m\}\) with \(j\delta \le y < (j+1)\delta\). Consequently, we have found \(x\in [k s + j\delta , \, k s +(j+1)\delta )\). This shows \(\mathbb {R}\subset I_0\cup \dots \cup I_m\), and the converse inclusion is trivial. Let us now fix some \(j,j'\in \{0,\dots ,m\}\) and assume that there is an \(x\in I_j \cap I_{j'}\). Then there exist \(k,k'\in \mathbb {Z}\) such that
Since \((j+1)\delta \le s\) and \((j'+1)\delta \le s\), we conclude that \(k s \le x < (k +1) s\) and \(k' s \le x < (k' +1) s\). As observed above this implies \(k = k'\). Now consider \(y:= x-k s\in [0,s)\). Then (37) implies
and again we have seen above that this implies \(j=j'\). This shows \(I_j \cap I_{j'} =\emptyset\) for all \(j\ne j'\).
Let us now denote the first coordinate of \(x_i^*\) by \(x_{i,1}^*\). Then \(D_{X,1}^*:= \{x_{i,1}^*: i=1,\dots ,m\}\) satisfies \(|D_{X,1}^*|\le m\) and since we have \(m+1\) cells \(I_j\), we conclude that there exists a \(j_1^*\in \{0,\dots ,m\}\) with \(D_{X,1}^* \cap I_{j_1^*} = \emptyset\). We define
Next we repeat this construction for the remaining \(d-1\) coordinates, so that we finally obtain \(x_\ell ^\dagger := (z_{j_1^*}^\dagger , \dots , z_{j_d^*}^\dagger )\in \mathbb {R}^d\) for indices \(j_1^*,\dots ,j_d^*\in \{0,\dots ,m\}\) found by the above reasoning.
It remains to show that (11) holds for the cubic partition (8) with offset \(x_\ell ^\dagger\) and all \(t>0\) with \(t\le \frac{s}{3\,m+3} = \delta /3\). To this end, we fix an \(x_i^*\). Then its cell \(B(x_i^*)\) is described by a unique \(k:= (k_1,\dots ,k_d)\in \mathbb {Z}^d\), namely
Let us now consider the first coordinate \(x_{i,1}^*\). By construction we know that \(x_{i,1}^* \not \in I_{j_1^*}\) and
Now, \(x_{i,1}^* \not \in I_{j_1^*}\) implies
Since the right hand side of (38) excludes the case \(x_{i,1}^* \ge (k_1+1) s +(j_1^*+1)\delta\), we hence find
This shows \(x_{i,1}^* + r < x_{\ell ,1}^\dagger + (k_1 +1) s\) for all \(r\in [-t,t]\). To show that \(x_{i,1}^* + r > x_{\ell ,1}^\dagger + k_1 s\) holds for all \(r\in [-t,t]\) we first observe that \(x_{i,1}^* \not \in I_{j_1^*}\) also implies
Now, the left hand side of (38) excludes the case \(x_{i,1}^* < k_1 s + j_1^*\delta\). Consequently, we have
and this yields \(x_{i,1}^* + r > x_{\ell ,1}^\dagger + k_1 s\) for all \(r\in [-t,t]\). Finally, by repeating these considerations for the remaining \(d-1\) coordinates, we conclude that \(x_i^* + t B_\infty \subset B(x_i^*)\). \(\square\)
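The coordinate-wise pigeonhole argument of this proof translates into a short routine. The sketch below (hypothetical names) selects, for each coordinate, a slot of width \(\delta = s/(m+1)\) modulo s that contains no sample coordinate and places the grid line there; the precise shifted grid points \(z_j^\dagger\) of the proof are not reproduced, and the midpoint of the empty slot is used instead, which already yields a margin of \(\delta /2 \ge \delta /3\) to all cell boundaries.

```python
import numpy as np

def aligned_offset(samples, s):
    """Choose a partition offset such that every sample stays away from all
    cell boundaries of the shifted cubic partition of width s."""
    m, d = samples.shape
    delta = s / (m + 1)
    offset = np.empty(d)
    for coord in range(d):
        frac = np.mod(samples[:, coord], s)              # positions within one period of length s
        occupied = set((frac // delta).astype(int))      # at most m of the m+1 slots are hit
        j_star = next(j for j in range(m + 1) if j not in occupied)
        offset[coord] = (j_star + 0.5) * delta           # grid line placed inside the empty slot
    return offset

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(20, 3))
s = 0.3
off = aligned_offset(X, s)
r = np.mod(X - off, s)                                   # position relative to the shifted grid
dist = np.minimum(r, s - r)                              # distance to the nearest cell boundary
print("smallest distance to a boundary:", dist.min())
print("required margin s/(3m+3)      :", s / (3 * (len(X) + 1)))
```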
C Learning properties of inflated histograms
In this section we provide the proofs of the results for the good and bad interpolating histogram rules from Sect. 2.2. To this end, let us introduce some more notation. For a measurable set A and a loss \(L: X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) we introduce the loss \(L_A: X\times Y \times \mathbb {R}\rightarrow [0,\infty )\) by
Obviously, for any measurable function \(f:X \rightarrow \mathbb {R}\) it holds
Moreover, by linearity, for every measurable set \(B\subset A\), the risk then decomposes as
The next result shows that the Bayes risk also enjoys a similar decomposition.
Lemma C.1
Let \(A,B\subset X\) be non-empty, disjoint, and measurable with \(A\cup B = X\). Then we have
Proof of Lemma C.1
Basically, this is a consequence of the presence of the indicator functions \(\varvec{1}_A, \varvec{1}_B\) in the definition of \(L_A, L_B\), see (39). More precisely, there is a sequence of functions \(f_n^A\) with \(\{f_n^A \ne 0\} \subset A\) such that
as \(n \rightarrow \infty\), and similarly for A replaced by B. Thus, for \(f_n:= f_n^A + f_n^B\), one has
Since the converse inequality is trivial, this proves the lemma. \(\square\)
C.1 Preparatory lemmata
The next lemma provides a bound on the difference of the risks of two measurable functions.
Lemma C.2
Let \(Y=[-1,1]\) and let \(f_1, f_2: X \rightarrow Y\) be measurable functions. For \(A\subset X\) measurable and non-empty we define \(L_A(x,y,t)=\varvec{1}_A(x)L(x,y, t)\) with L being the least squares loss. Then the following two inequalities hold:
Proof of Lemma C.2
We begin by proving the first inequality. To this end, we note that the definition of L yields
Now observe that \(y, f_i(x) \in [-1,1]\) implies \((y-f_i(x))^2 \le 4\). Moreover, we also have \((y-f_i(x))^2\ge 0\), and hence we conclude that
Combining these considerations we find
The second inequality can be shown similarly. Namely, we have
where we again used \(f_i(x) \in [-1,1]\). \(\square\)
Lemma C.3
Let \(h: X \rightarrow \mathbb {R}\) be measurable, \(A \subseteq X\), and L be the least-squares loss. Then we have the identity
Proof of Lemma C.3
Given \(x \in X\) and using
we obtain for the difference of inner risks
Thus, since
we arrive at
i.e., we have shown the assertion. \(\square\)
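For the reader's convenience we note that, assuming \({f_{L,P}^*}(x) = \mathbb {E}(Y\mid x)\), the identity in question is presumably the standard bias identity for the least squares loss, which can be obtained as follows:
$$\begin{aligned} \mathbb {E}\bigl ((Y-h(x))^2 \mid x\bigr ) - \mathbb {E}\bigl ((Y-{f_{L,P}^*}(x))^2 \mid x\bigr ) = h(x)^2 - {f_{L,P}^*}(x)^2 - 2\,\mathbb {E}(Y\mid x)\bigl (h(x) - {f_{L,P}^*}(x)\bigr ) = \bigl (h(x) - {f_{L,P}^*}(x)\bigr )^2 , \end{aligned}$$
and integrating over A with respect to \(P_X\) then gives
$$\begin{aligned} {\mathcal {R}}_{L_A,P}(h) - {\mathcal {R}}_{L_A,P}({f_{L,P}^*}) = \int _A \bigl (h - {f_{L,P}^*}\bigr )^2 \, \textrm{d}P_X \,. \end{aligned}$$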
With these preparations we can now present the following key lemma that shows that it suffices to understand the behavior of the good and bad interpolating histogram rules on \(\Delta\) and the behavior of \(h_{D,\mathcal{A}_D}^+\).
Lemma C.4
Let L be the least squares loss, P be a distribution on \(X\times Y\) with point spectrum \(\Delta\), see (20), and \(D \in (X\times Y)^n\) be a data set. Then for all \(s\in (0,1]\) and all \(\rho \ge 0\) the good interpolating histogram rule satisfies
where \(D_X^{+t}\) is defined by (16). Moreover, for all \(s\in (0,1]\) and all \(\rho \ge 0\) the bad interpolating histogram rule satisfies
Proof of Lemma C.4
To simplify notation, we write \(A:= D_X^{+t}{\setminus } \Delta\) and \(B:= X {\setminus }(D_X^{+t}\cup \Delta )\). Note that this yields the partition \(X = \Delta \cup A \cup B\). In addition, we have \(f_{D,s,\rho }^+(x) = h_{D,\mathcal{A}_D}^+(x)\) for all \(x\in X{\setminus } D_X^{+t}\). Using this in combination with \(B\subset X\setminus D_X^{+t}\) as well as the risk decomposition formula (41) and Lemma C.1 we then find
Moreover, Lemma C.2 applied to \(f_1:= f_{D,s,\rho }^+\) and \(f_2:= {f_{L,P}^*}\) implies
In addition, we have
where again we used (41) and Lemma C.1. Combining these estimates we then obtain the assertion for the good interpolating ERM.
To prove the inequality for the bad interpolating histogram rule, we consider the decomposition
where in the first integral we used \({f_{L,P}^\dagger }(x) = {f_{L,P}^*}(x)\) for all \(x\in \Delta\). Now, \(f_{D,s,\rho }^-(x)\in [-1,1]\) and \({f_{L,P}^\dagger }(x) \in [-1,1]\) for all \(x\in X\) gives
Moreover, by (17) we find \(f_{D,s,\rho }^-(x) = -f_{D,s,\rho }^+(x) = - h_{D,\mathcal{A}_D}^+(x)\) for all \(x\in X{\setminus } D_X^{+t}\) and thus also for all \(x\in B\). In addition, \(B\subset X{\setminus } \Delta\) shows \({f_{L,P}^\dagger }(x) = -{f_{L,P}^*}(x)\) for all \(x\in B\). Together, these considerations give \(f_{D,s,\rho }^-(x) - {f_{L,P}^\dagger }(x) = -h_{D,\mathcal{A}_D}^+(x) + {f_{L,P}^*}(x)\) for all \(x\in B\), and consequently we obtain
Combining these considerations finishes the proof. \(\square\)
C.2 Proof of Theorem 2.12
Throughout this section we assume that the general assumptions of Theorem 2.12 are satisfied. In particular, \(D \in (X \times Y)^n\) is an i.i.d. sample of size \(n \ge 1\) and \(D_X:=\{x_1^*,..., x_{m_n}^*\}\in \textrm{Pot}_m(X)\) is the set of input observations. Moreover, \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\) as well as \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\).
C.2.1 The good interpolating histogram rule
We begin by introducing the basic strategy of our proof. To this end, consider the good interpolating histogram rule from Example 2.8 with representation
In view of (3) it suffices to consider the excess risk of \(f_{D,s,\rho }^+\). Now observe that in the case i), i.e. for \(\rho =0\), we have \(t=0\) and thus \(D_X^{+t} = D_X\). Since \(P_X(D_X\setminus \Delta )= 0\) by the definition of \(\Delta\), we then find by Lemma C.4 that
where
Moreover, in the case ii), i.e. for \(\rho =\rho _n>0\) the distribution P satisfies Assumption 2.11, which ensures \(\Delta = \emptyset\). The latter implies \({\mathcal {R}}_2(f_{D,s_n,0}^+) = 0\), and therefore we find by Lemma C.4 that
Moreover, by Assumption 2.11 and \(t\le \rho = \rho _n\) we obtain
and consequently, it suffices to bound \({\mathcal {R}}_1 ( h_{D,\mathcal{A}_D}^+)\). Therefore, the rest of this subsection is devoted to bounding \({\mathcal {R}}_1 ( h_{D,\mathcal{A}_D}^+)\) and \({\mathcal {R}}_2(f_{D,s_n,0}^+)\) individually.
Bounding \({\mathcal {R}}_1 (h_{D,\mathcal{A}_D}^+)\). Thanks to Proposition E.4, we already know that
in probability for \(n\rightarrow \infty\).
Bounding \({\mathcal {R}}_2(f_{D,s_n,0}^+)\). If \(\Delta = \emptyset\) we obviously have \({\mathcal {R}}_2(f_{D,s_n,0}^+) = 0\), and hence we assume \(\Delta \ne \emptyset\) in the following. In this case, \(\Delta\) can be at most countable, and therefore we fix an at most countable enumeration \((\tilde{x}_j)_{j\in J}\) of \(\Delta\), i.e.
Let us further fix an \(\epsilon >0\) and a finite subset \(\Delta _0 \subset \Delta\) such that \(P_X(\Delta \setminus \Delta _0) \le \epsilon\). With the help of (41) and Lemma C.1 we then observe that
Since \(Y=[-1,1]\) is bounded the second difference can be bounded by
Our next step is to bound the first difference in (44). To this end, we write
\(C_j:= \{\tilde{x}_j\}\) for \(j\in J_0\), and \({\mathcal {C}}:= (C_j)_{j\in J_0}\). Then \({\mathcal {C}}\) is a finite partition of \(\Delta _0\), and we set
Since all \(C_j\) are singletons, every measurable function \(f:X\rightarrow Y\) satisfies \(\varvec{1}_{\Delta _0} f \in {\mathcal {F}}_{\mathcal {C}}\). We thus conclude that \(f_D:= \varvec{1}_{\Delta _0}f_{D,s_n,0}^+\in {\mathcal {F}}_{\mathcal {C}}\), too. Moreover, by (40) we know
Our next goal is to show that \(f_D\) minimizes the empirical risk over \({\mathcal {F}}_{\mathcal {C}}\) with respect to \(L_{\Delta _0}\). To this end, we fix a \(j \in J_0\) for which we have \(N_j:=|\{i: x_i \in C_j\}| > 0\). Since \(f_{D,s_n,0}^+\) interpolates D by construction, Proposition 2.7 then gives
Thus, Lemma A.1 shows that \(f_D\) is indeed an empirical risk minimizer with respect to \(L_{\Delta _0}\) and \({\mathcal {F}}_{\mathcal {C}}\).
Our next goal is to apply Theorem 2.10, which holds for all ERM with respect to \(L_{\Delta _0}\) and \({\mathcal {F}}_{\mathcal {C}}\), to our specific ERM \(f_D\). To this end, we first observe, as in the proof of Corollary E.2, that since L is the least squares loss, the assumptions (18) and (19) of Theorem 2.10 are satisfied for \(L_{\Delta _0}\) with \(\vartheta = 1\), \(B=4\), and \(V=16\). Moreover, our assumption \(Y= [-1,1]\) ensures that \(L_{\Delta _0}\) is locally Lipschitz continuous with \(|L_{\Delta _0}|_{1,1} \le 4\). In addition, we have
Applying Theorem 2.10 and optimizing the resulting oracle inequality with respect to \(\varepsilon\) like at the end of the proof of Corollary E.2, we then see that, for all \(n\ge 1\) and \(\tau >0\),
holds with probability \(P^n\) not less than \(1- e^{-\tau }\). Now, to bound the approximation error term, we note that
and hence we easily find
Setting \(\tau := \ln (n)\) we conclude that
holds with probability \(P^n\) not less than \(1- 1/n\). For later use note that this oracle inequality actually holds for all ERMs with respect to \(L_{\Delta _0}\) and \({\mathcal {F}}_{\mathcal {C}}\), since so does Theorem 2.10 and we have not used any property of our specific ERM \(f_D\) to derive (49). Finally, combining this with (44), (45), (47), and the obvious \({\mathcal{R}_{L_{\Delta _0},P}(f_{D,s_n,0}^+)} - {\mathcal{R}_{L_{\Delta _0},P}^{*}}\ge 0\), we conclude that
in probability for \(n\rightarrow \infty\).
C.2.2 The bad interpolating histogram rule
In this subsection we consider the bad interpolating histogram rule from Example 2.9 with representation
Now observe that in the case i) of Theorem 2.12, i.e. for \(\rho =0\), we have \(t=0\) and thus \(D_X^{+t} = D_X\). Since \(P_X(D_X\setminus \Delta )= 0\) by the definition of \(\Delta\), we then see by Lemma C.4 and Proposition E.4 that it suffices to show that
in probability for \(n\rightarrow \infty\). To this end, we fix an \(\epsilon >0\) and a finite \(\Delta _0\subset \Delta\) with \(P_X(\Delta {\setminus } \Delta _0 ) \le \epsilon\). Then we note that the decomposition (44) and the estimate (45) for \(f_{D,s_n,0}^+\) also holds for \(f_{D,s_n,0}^-\). Consequently, it suffices to bound the term
To this end, recall that \(f_{D,s_n,0}^+\) and \(f_{D,s_n,0}^-\) are both interpolating predictors, and hence we have
for all samples \((x_i, y_i)\) of D, and thus in particular for all samples \((x_i, y_i)\) of \(D_0\) with \(x_i \in \Delta\). Let us define \(f_D:= \varvec{1}_{\Delta _0}f_{D,s_n,0}^-\). Combining (51) with (48) we see that \(f_D\) is an empirical risk minimizer over the hypotheses set \({\mathcal {F}}_{\mathcal {C}}\) defined in (46) with respect to \(L_{\Delta _0}\). Since (49) has been shown for all ERMs with respect to \(L_{\Delta _0}\) and \({\mathcal {F}}_{\mathcal {C}}\) we thus find
in probability for \(n\rightarrow \infty\). This finishes the proof in the case i) of Theorem 2.12. Moreover, in the case ii), i.e. for \(\rho =\rho _n>0\) the distribution P satisfies Assumption 2.11, which ensures \(\Delta = \emptyset\). In combination with Lemma C.4 the latter implies
Now, the first term has already been bounded in (43) and the excess risk of \(h_{D,\mathcal{A}_D}^+\) can again be bounded by Proposition E.4.
C.3 Proof of Theorem 2.13 (Learning Rates)
In the following we suppose that all assumptions of Theorem 2.13 are satisfied.
Let us first prove the assertions for the good interpolating histogram rule. To this end, we first recall that Assumption 2.11 implies \(\Delta = \emptyset\). By (3) and Lemma C.4 we then obtain
Now, (43) shows
Moreover, by Theorem 2.5 we know that \(|{{\,\textrm{Im}\,}}(\pi _{m,s})|\le (m+1)^d \le 2^d n^d\) for all \(m\le n\). Consequently, applying Proposition E.5 with \(c=2^d\) and \(\beta := d\) we find
with probability \(P^n\) at least \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\), where \(c_{d,\alpha }>0\) is a constant only depending on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\). Combining this with (52) we then obtain (27).
Finally, inequality (27) for the bad interpolating histogram rule follows analogously, since in this case Lemma C.4 shows
D Learning properties of approximating neural networks
D.1 Auxiliary Results on Functions that can be represented by DNNs
In this section we present some results on algebraic properties of the set of functions that can be represented by DNNs. We particularly focus on the network sizes required to perform algebraic transformations of such functions.
To this end, recall that throughout this work we solely consider the ReLU-activation function \(\sigma := |\cdot |_+\) and its shifted extensions (29). Given an input dimension d, a depth \(\tilde{L}\ge 2\), and a width vector \((p_1,\dots ,p_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\), a function \(f\in {\mathcal {A}}_{p_1,\dots ,p_{\tilde{L}-1}}\) is then of the form (31), i.e.
where each layer \(H_l\), \(l=1,\dots , \tilde{L}\), is of the form (30), where we drop the index for the activation to ease notation. Specifically, each layer can be represented by a \(p_l \times p_{l-1}\) weight matrix \(A^{(l)}\) with \(p_0:= d\) and \(p_{\tilde{L}}:= 1\) and a shift vector \(b^{(l)}\in \mathbb {R}^{p_l}\), and the last layer \(H_{\tilde{L}}\) has the identity as an activation function. In the following, we thus say that the network f is represented by \((\mathfrak A, \mathfrak B)\), where \(\mathfrak A:= (A^{(1)}, \dots , A^{(\tilde{L})})\) and \(\mathfrak B:= (b^{(1)}, \dots , b^{(\tilde{L})})\). For later use we emphasize that \(p_{\tilde{L}} = 1\) implies \(b^{(\tilde{L})}\in \mathbb {R}\). Moreover note that each pair \((\mathfrak A, \mathfrak B)\) determines a neural network, but in general, a neural network, if viewed as a function, can be described by more than one such pair.
Now, our first lemma describes the changes in the representation when manipulating a single neural network.
Lemma D.1
Let \(d\ge 1\), \(\tilde{L}\ge 2\), and \(p:=(p_1,\dots ,p_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\). Moreover, let \(f\in \mathcal{A}_{p}\) be a neural network with representation \(\mathfrak A:= (A^{(1)}, \dots , A^{(\tilde{L})})\) and \(\mathfrak B:= (b^{(1)}, \dots , b^{(\tilde{L})})\). Then the following statements hold true:
(i) For all \(\alpha \in \mathbb {R}\) and \(c \in \mathbb {R}\) we have \(\alpha f + c \in \mathcal{A}_{p}\) with representation
$$\begin{aligned} \bigl (A^{(1)}, \dots , A^{(\tilde{L}-1)}, \alpha A^{(\tilde{L})}\bigr ) \qquad \qquad \text{ and } \qquad \qquad \bigl (b^{(1)}, \dots , b^{(\tilde{L}-1)},\alpha b^{(\tilde{L})}+c\bigr ) \,. \end{aligned}$$
(ii) We have \(|f|_+\in \mathcal{A}_{p,1}\) with representation
$$\begin{aligned} \bigl (A^{(1)}, \dots , A^{(\tilde{L})}, 1\bigr ) \qquad \qquad \text{ and } \qquad \qquad \bigl (b^{(1)}, \dots , b^{(\tilde{L})}, 0\bigr ) . \end{aligned}$$
Proof of Lemma D.1
(i) This immediately follows from the representation (31) and the fact that \(H_{\tilde{L}}\) does not have an activation function.
(ii) Let \(\tilde{H}_1, \dots , \tilde{H}_{\tilde{L}+1}\) be the layers of the neural network \(\tilde{f}\) given by the new representation. Then we have \(H_l = \tilde{H}_l\) for all \(l=1,\dots , \tilde{L}-1\) as well as \(\tilde{H}_{\tilde{L}} = |H_{\tilde{L}}|_+\) and \(\tilde{H}_{\tilde{L}+1} = {{\,\textrm{id}\,}}_\mathbb {R}\). Applying the representation (31) for f and \(\tilde{f}\) then gives the assertion.
\(\square\)
Our next lemma describes a possible representation of the sum of two nets with the same depth \(\tilde{L}\).
Lemma D.2
Let \(d\ge 1\), \(\tilde{L}\ge 2\), and \(\dot{p}:=(\dot{p}_1,\dots ,\dot{p}_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\) and \(\ddot{p}:=(\ddot{p}_1,\dots ,\ddot{p}_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\) be two width vectors. Then for all \(\dot{f}\in \mathcal{A}_{\dot{p}}\) and \(\ddot{f}\in \mathcal{A}_{\ddot{p}}\) we have
In addition, if \((\dot{\mathfrak A},\dot{\mathfrak B})\) and \((\ddot{\mathfrak A},\ddot{\mathfrak B})\) are representations of \(\dot{f}\) and \(\ddot{f}\), then \(\dot{f} + \ddot{f}\) has the representation \(\mathfrak A:= (A^{(1)}, \dots , A^{(\tilde{L})})\) and \(\mathfrak B:= (b^{(1)}, \dots , b^{(\tilde{L})})\) defined by
as well as
for all \(l=2,\dots , \tilde{L}-1\) and
Proof of Lemma D.2
Let \(\dot{H}_{1}, \dots , \dot{H}_{\tilde{L}}\) be the layers of \(\dot{f}\) and \(\ddot{H}_{1}, \dots , \ddot{H}_{\tilde{L}}\) be the layers of \(\ddot{f}\). For \(l =1,\dots ,\tilde{L}\), we further introduce the concatenation of layers
Moreover, for \(l =1,\dots ,\tilde{L}\), let \(H_l\) be the layer given by \(A^{(l)}\) and \(b^{(l)}\) and \(W_l:= H_{l}\circ \dots \circ H_1\). Since the last layers of \(\dot{f}\) and \(\ddot{f}\) do not have an activation function, we then find
for all \(x\in \mathbb {R}^d\). Similarly, for all \(l=2,\dots ,\tilde{L}-1\) and all \(x\in \mathbb {R}^d\) we have
Finally, for the first layer and all \(x\in \mathbb {R}^d\) we obtain
Combining these results gives \(W_l =(\dot{W}_{l}, \ddot{W}_{l} )^T\) for all \(l=1,\dots ,\tilde{L}\), i.e. we have found the assertion. \(\square\)
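A minimal numpy sketch of this block-diagonal construction (hypothetical helper names; depth and widths chosen arbitrarily) confirms that the stacked representation computes the sum of the two networks:

```python
import numpy as np

def forward(weights, biases, x):
    """Evaluate a ReLU network; the last layer has no activation function."""
    a = x
    for A, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(A @ a + b, 0.0)
    return weights[-1] @ a + biases[-1]

rng = np.random.default_rng(3)
d = 4                                                    # input dimension

def random_net(hidden_widths):
    dims = [d] + hidden_widths + [1]
    return ([rng.normal(size=(q, p)) for p, q in zip(dims[:-1], dims[1:])],
            [rng.normal(size=q) for q in dims[1:]])

(Wd, bd), (Wdd, bdd) = random_net([5, 6]), random_net([7, 2])

# representation of the sum: stack the first layers, use block-diagonal weights
# in the remaining hidden layers, concatenate the output weights, add the output biases
W = [np.vstack([Wd[0], Wdd[0]])]
b = [np.concatenate([bd[0], bdd[0]])]
for l in range(1, len(Wd) - 1):
    W.append(np.block([[Wd[l], np.zeros((Wd[l].shape[0], Wdd[l].shape[1]))],
                       [np.zeros((Wdd[l].shape[0], Wd[l].shape[1])), Wdd[l]]]))
    b.append(np.concatenate([bd[l], bdd[l]]))
W.append(np.hstack([Wd[-1], Wdd[-1]]))
b.append(bd[-1] + bdd[-1])

x = rng.normal(size=d)
assert np.allclose(forward(W, b, x), forward(Wd, bd, x) + forward(Wdd, bdd, x))
print("stacked network equals the sum of the two networks")
```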
D.2 Approximating step functions by DNNs
In this section we collect the main pieces to approximate histograms with DNNs. The first lemma, which is a longer and more detailed version of Lemma 3.3, shows how to approximate an indicator function on a multidimensional interval by a small ReLU-DNN with two hidden layers.
Lemma D.3
Let \(d\ge 1\) and let \(z_1 = (z_{1,1},\dots ,z_{1,d})\in {\mathbb {R}^d}\) and \(z_2 = (z_{2,1},\dots ,z_{2,d})\in {\mathbb {R}^d}\) be two vectors with \(z_{1} < z_{2}\). Moreover, let \(\varepsilon >0\) satisfy
and define
where \(I_d\) denotes the d-dimensional identity matrix, and \(A^{(3)}, b^{(2)}, b^{(3)}\in \mathbb {R}\). Then the neural network \(f_\varepsilon :\mathbb {R}^d\rightarrow \mathbb {R}\) given by the representation \(\mathfrak A:= (A^{(1)}, A^{(2)}, A^{(3)})\) and \(\mathfrak B:= (b^{(1)}, b^{(2)}, b^{(3)})\) satisfies \(f_\varepsilon \in {\mathcal {A}}_{2d,1}\) and
Proof of Lemma D.3
Let \(H_1, H_2, H_3\) be the layers of \(f_\varepsilon\). Then we have \(H_3 = {{\,\textrm{id}\,}}_\mathbb {R}\) and if \(h_1^{(1)}, \dots , h_d^{(1)}, h_1^{(2)},\dots , h_d^{(2)}\) denote the 2d component functions of \(H_1\), that is
we thus find
for all \(x\in \mathbb {R}^d\). Therefore, we first investigate the functions \(1- h_i^{(1)} - h_i^{(2)}\). To this end, let us fix an \(i\in \{1,\dots ,d\}\) and an \(x=(x_1,\dots ,x_d)\in {\mathbb {R}^d}\). Then we obviously have
and
Since \(z_{1,i}+\varepsilon < z_{2,i}-\varepsilon\), we consequently find
In particular, we have
Combining our initial equation (57) with (59) and (60) yields
i.e., we have shown (55). Next we will verify (54). To this end, we first note that (57) gives
Our next intermediate goal is to show
To this end, we assume the converse, i.e. there is an \(x\in {\mathbb {R}^d}\) and an \(i_0\in \{1,\dots ,d\}\) with
Without loss of generality we may assume that \(i_0 = d\). Then combining both inequalities we find
and this shows that there is also an \(i\in \{1,\dots ,d-1\}\) with \(1 - h_i^{(1)}(x) - h_i^{(2)}(x) > 1\). This contradicts (60), and hence we have shown (62). Now combining (61) with (62) and (58) we obtain
i.e. we have found (54). Finally, the equation \(\{ f_\varepsilon > 1\} = \emptyset\) immediately follows from combining (57) and (60), and \(\{ f_\varepsilon < 0\} = \emptyset\) is a direct consequence of (57). \(\square\)
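Since the explicit weight matrices are not displayed above, the following sketch uses the standard ramp construction, which is assumed to coincide with the one of Lemma D.3; it realizes \(f_\varepsilon\) with two hidden layers (2d and 1 ReLU neurons) in numpy and checks the properties (54)–(56) numerically (all names are hypothetical):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def box_indicator_net(z1, z2, eps):
    """ReLU net that is 1 on [z1+eps, z2-eps], 0 outside (z1, z2), and in [0, 1]."""
    z1, z2 = np.asarray(z1, float), np.asarray(z2, float)
    d = z1.size

    def f(x):
        x = np.atleast_2d(x)
        h1 = relu((z1 + eps - x) / eps)                  # first hidden layer, units 1..d
        h2 = relu((x - z2 + eps) / eps)                  # first hidden layer, units d+1..2d
        s = np.sum(1.0 - h1 - h2, axis=1)                # per-coordinate ramps
        return relu(s - (d - 1))                         # second hidden layer + identity output

    return f

d, eps = 2, 0.05
z1, z2 = -0.5 * np.ones(d), 0.5 * np.ones(d)
f = box_indicator_net(z1, z2, eps)

x = np.random.default_rng(4).uniform(-1, 1, size=(100000, d))
vals = f(x)
inner = np.all((z1 + eps - x <= 0) & (x - z2 + eps <= 0), axis=1)
outer = np.any((x <= z1) | (x >= z2), axis=1)
assert np.all(np.abs(vals[inner] - 1.0) < 1e-12)         # f = 1 on the shrunken cube, cf. (55)
assert np.all(np.abs(vals[outer]) < 1e-12)               # f = 0 outside (z1, z2),     cf. (54)
assert np.all((vals > -1e-12) & (vals < 1.0 + 1e-12))    # 0 <= f <= 1,                cf. (56)
print("all three properties verified on", len(x), "random points")
```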
Our next goal is to describe how well the function \(f_\varepsilon\) found in Lemma D.3 approximates the indicator function \(\varvec{1}_{[z_1,z_2]}\). To this end, we first recall a well-known estimate on \(\Vert \cdot \Vert _\infty\)-covering numbers of cuboids in the following lemma. We include its proof for the sake of completeness.
Lemma D.4
Let \(s_1,\dots ,s_d >0\), \(s_{\textrm{min}}:= \min \{s_1,\dots ,s_d\}\), \(z\in {\mathbb {R}^d}\), and
Then for all \(\varepsilon \in (0,s_{\textrm{min}}]\) we have
Proof of Lemma D.4
Let us fix an \(i\in \{1,\dots ,d\}\). Since \(\varepsilon \le s_i\), we then need at most \(\lceil \frac{s_i}{2\varepsilon }\rceil\) closed intervals of length \(2\varepsilon\) to cover the interval \([z_i, z_i + s_i]\). From this it is easy to conclude that
and hence we have shown the assertion. \(\square\)
Now, the next lemma provides the announced description of the approximation error.
Lemma D.5
Let \(z_1, z_2\in [-1,1]^d\) and let \(\varepsilon >0\) be as in Lemma D.3. Moreover, let \(A\subset [-1,1]^d\) be a subset satisfying \((z_1,z_2) \subset A\subset [z_1,z_2]\). Then the neural network \(f_\varepsilon \in \mathcal{A}_{2d,1}\) constructed in Lemma D.3 satisfies
Moreover, if A is a cube of side length \(s>0\), that is \(z_{2,i}-z_{1,i} = s\) for all \(i=1,\dots ,d\), and we have a distribution \(P_X\) on \([-1,1]^d\) that satisfies Assumption 2.11 for some \(\varphi : \mathbb {R}_+ \rightarrow \mathbb {R}_+\), then we further have
Proof of Lemma D.5
By (55) and (54) we find the inclusions \(\{ f_\varepsilon = 1\} = [z_1 + \varepsilon , z_2 - \varepsilon ] \subset A\) and \(\{ f_\varepsilon > 0 \} \subset (z_1, z_2 ) \subset A\). Using \(\{ f_\varepsilon < 0\} = \emptyset\), which is known by (56), we then obtain
i.e. we have shown (63). Now, to establish (64), we first note that (63) together with \(A\subset [z_1,z_2]\) implies
To further bound \([z_1,z_2] \setminus (z_1 + \varepsilon , z_2 - \varepsilon )\) we define
Then we have \([z_1,z_2] {\setminus } (z_1 + \varepsilon , z_2 - \varepsilon ) \subset S_1^-\cup \dots \cup S_d^- \cup S_1^+\cup \dots \cup S_d^+\), and hence we obtain
Now observe that since A is a cube with side length s, the sets \(S_i^-\) and \(S_i^+\) are cuboids with side lengths \(s_1,\dots ,s_d\), where \(s_i = \varepsilon\) and \(s_j = s\) for all \(j\ne i\). Applying Lemma D.4 then shows
and combining with Assumption 2.11 we obtain
Inserting this estimate into (65) yields (64). \(\square\)
As a second step in our construction presented in Subsection 3.1 we combine Lemmas D.1 and D.2 with Lemma D.3 to approximate step-functions on cubic partitions by ReLU-DNNs with two hidden layers.
Proposition D.6
Let \(A_1,\dots ,A_k\) be mutually disjoint subsets of \(X:= [-1,1]^d\) such that for each \(i\in \{1,\dots ,k\}\) there exist \(z_i^-, z_i^+\in X\) with \(z_i^-< z_i^+\) and \((z_i^-, z_i^+) \subset A_i \subset [z_i^-, z_i^+]\). Moreover, let \(z_{i,j}^\pm\) be the j-th coordinate of \(z_i^\pm\). Then for all \(g:X\rightarrow \mathbb {R}\) of the form
with \(\alpha _i \in \mathbb {R}\), all \(\varepsilon >0\) satisfying
and all \(m_1 \ge 2d k\) and \(m_2\ge k\), there exists a neural network \(f_\varepsilon \in \mathcal{A}_{m_1,m_2}\) such that
and \(\Vert f_\varepsilon \Vert _\infty = \max \{|\alpha _1|,\dots ,|\alpha _k|\}\). In addition, if \(A_1,\dots ,A_k\) are cubes of side length \(s>0\), i.e. \(z_i^+-z_i^- = (s,\dots , s)\in \mathbb {R}^d\) for all \(i=1,\dots ,k\), and \(P_X\) is a distribution on \([-1,1]^d\) that satisfies Assumption 2.11 for some \(\varphi : \mathbb {R}_+ \rightarrow \mathbb {R}_+\), then we further have
Proof of Proposition D.6
Since \(\mathcal{A}_{2dk, k}\subset \mathcal{A}_{m_1,m_2}\) it suffices to find an \(f_\varepsilon\) with the desired properties in \(\mathcal{A}_{2dk, k}\). By assumption and Lemma D.3, we find, for all \(\varepsilon >0\) and \(i=1,\dots , k\), a neural network \(f^{(\varepsilon )}_i \in {\mathcal {A}}_{2d,1}\), and Lemma D.5 shows that
Moreover, for any \(\alpha _i \in \mathbb {R}\), Lemma D.1 ensures \(\alpha _i f^{(\varepsilon )}_i \in {\mathcal {A}}_{2d,1}\) with
Now, applying Lemma D.2 shows that
belongs to \({\mathcal {A}}_{2kd,k}\), and since we have
for all \(i\ne l\), our previous considerations give us
Finally, the identity \(\Vert f_\varepsilon \Vert _\infty = \max \{|\alpha _1|,\dots ,|\alpha _k|\}\) follows from (67) and \(\Vert f_i^{(\varepsilon )} \Vert _\infty = |\alpha _i|\) for all \(i=1,\dots ,k\) and the bound on \(P_X \bigl ( \{ f_\varepsilon \ne \varvec{1}_{A} \} \bigr )\) is a direct consequence of (64). \(\square\)
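Combining the previous two sketches, a step function on a partition can be approximated by a weighted sum of box-indicator networks; by Lemma D.2 the summands can then be realized in parallel inside a single network with \(2dk\) and k hidden neurons. A toy check in one dimension (hypothetical names; the ramp construction is the same assumption as before):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def box_indicator(z1, z2, eps, x):
    """Ramp approximation of the indicator of [z1, z2] (as in the sketch above)."""
    h1 = relu((z1 + eps - x) / eps)
    h2 = relu((x - z2 + eps) / eps)
    return relu(np.sum(1.0 - h1 - h2, axis=1) - (x.shape[1] - 1))

def step_function_net(cells, alphas, eps, x):
    """Weighted sum of box-indicator networks approximating sum_i alpha_i 1_{A_i}."""
    return sum(a * box_indicator(z1, z2, eps, x) for (z1, z2), a in zip(cells, alphas))

s, eps = 0.5, 1e-3
edges = np.linspace(-1.0, 1.0, 5)                        # four cells of width s = 0.5
cells = [(np.array([lo]), np.array([hi])) for lo, hi in zip(edges[:-1], edges[1:])]
alphas = np.array([0.3, -1.0, 0.7, 0.1])

x = np.random.default_rng(5).uniform(-1, 1, size=(50000, 1))
g = alphas[np.clip(np.digitize(x[:, 0], edges) - 1, 0, len(cells) - 1)]  # the step function
f = step_function_net(cells, alphas, eps, x)
print("fraction of points where the net differs from the step function:",
      np.mean(np.abs(f - g) > 1e-9))
```

The printed fraction is of the order of the total mass of the thin boundary strips, in line with the bound on \(P_X(\{f_\varepsilon \ne g\})\) in Proposition D.6.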
D.3 Proof of Main Theorem 3.7
Throughout this section we assume that the general assumptions of Theorem 3.7 are satisfied. In particular, \(D \in (X \times Y)^n\) is an i.i.d. sample of size \(n \ge 1\) and \(D_X:=\{x_1^*,..., x_{m_n}^*\}\in \textrm{Pot}_m(X)\) is the set of input observations. Moreover, \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\), \(s_n^d > 2^d/n\) and \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\). In addition, we let \((\rho _n)_{n \in \mathbb {N}}\) be a non-negative sequence with \(\rho ^d_n \le 2^d/n\) and \(\rho _n^{-d} \varphi ( \rho _n ) \rightarrow 0\) for \(n\rightarrow \infty\). Finally, let \((\varepsilon _n)_{n \in \mathbb {N}}\) and \((\delta _n)_{n \in \mathbb {N}}\) be positive sequences with \(\varepsilon _n = \delta _n = \rho _n/2\).
We first show our claim for the good interpolating DNN from Example 3.6, which has the representation
with \(t=\min \{r, \rho _n\}\) and associated \({\mathcal {H}}_{{\mathcal {A}}}^{(\epsilon )}\)-part
We split the excess risk into three different terms
Convergence of the first term follows from Lemma C.2 and by exploiting Assumption 2.11. We obtain
Hence, by our assumption on \(\rho _n\) we may conclude
in probability for \(|D|\rightarrow \infty\).
For bounding the second term in (68) we recall that \(|J|\le \left( \frac{2}{s_n}\right) ^d\). Lemma C.2 and Proposition D.6 yield (see Note 3)
Hence, our assumption on \(\rho _n\) ensures
in probability for \(|D|\rightarrow \infty\). Finally, convergence of the last term in (68) is easily derived with the help of Proposition E.4 and we conclude that
in probability for \(|D|\rightarrow \infty\).
We now turn to considering the bad interpolating DNN from Example 3.6. Since we have \(g_{D,s_n, \rho _n}^-(x) \in [-1,1]\) and \(f_{L,P}^\dagger (x) \in [-1,1]\) for all \(x \in X\), Assumption 2.11 gives
Moreover, for all \(x \in X{\setminus } D_X^{+t}\) we have \(g_{D,s_n, \rho _n}^-(x) = -h_{D, {\mathcal {A}}_D}^{+, (\varepsilon _n)}\) and \(f_{L,P}^\dagger (x)= -f^*_{L,P}(x)\). Hence
Combining both considerations with the first part of the proof then shows that, in probability for \(|D|\rightarrow \infty\),
Finally, since \(|J| \le (\frac{2}{s_n})^d \le n\), Proposition D.6 shows that \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\).
D.4 Proof of Main Theorem 3.8
Let all assumptions of Theorem 3.8 be satisfied. Moreover, we let \((\varepsilon _n)_{n \in \mathbb {N}}\) and \((\delta _n)_{n \in \mathbb {N}}\) be positive sequences with \(\varepsilon _n = \delta _n = \rho _n/2\). We prove the result for the good interpolating DNN by reconsidering (68). Indeed, by our assumption we have \(\rho _n^{-d} \varphi (\rho _n) \le \ln (n) n^{-2/3}\), and thus (69) leads to
Moreover, (70) gives
Finally, (53) shows with probability \(P^n\) at least \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\)
where \(c_{d,\alpha }>0\) is a constant only depending on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\). Collecting the above considerations shows the first part of the theorem.
Now, coming to the bad interpolating DNN, we derive from (71)
Moreover, combining the results from (70) and (72) gives with probability \(P^n\) at least \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\)
where \(c'_{d,\alpha }= 4\cdot d \cdot 6^{d} + c_{d,\alpha }\). Thus, with probability \(P^n\) at least \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\)
where \(c''_{d,\alpha } = 4 + c'_{d,\alpha }\).
Finally, we have \(|J| \le (\frac{2}{s_n})^d = 2^d n^{\frac{d}{2\alpha +d }}\le n\), provided \(n \ge n_{d, \alpha }\), for some \(n_{d, \alpha } \in \mathbb {N}\), depending on d and \(\alpha\). Proposition D.6 shows then that \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\).
E Uniform bounds for histograms based on data-dependent partitions
E.1 A generic oracle inequality for empirical risk minimization
If not stated otherwise, we assume throughout this subsection that X is an arbitrary non-empty set that is equipped with some \(\sigma\)-algebra. We write \({\mathcal {L}}_\infty\) for the corresponding set of all bounded, measurable functions \(f:X\rightarrow \mathbb {R}\). Moreover, \(Y\subset \mathbb {R}\) is assumed to be measurable. Following (Steinwart and Christmann (2008), Definition 2.18) we say that a measurable loss \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) is locally Lipschitz continuous if for all \(a\ge 0\) there exists a constant \(c_{a}\ge 0\) such that
Moreover, for \(a\ge 0\), the smallest such constant \(c_{a}\) is denoted by \(|L|_{a,1}\).
In addition, we need the notion of covering numbers, which is recalled in the following definition.
Definition E.1
Let (T, d) be a metric space and \(\varepsilon >0\). We call \(S\subset T\) an \(\varepsilon\)-net of T if for all \(t\in T\) there exists an \(s\in S\) with \(d(s,t)\le \varepsilon\). Moreover, the \(\varepsilon\)-covering number of T is defined by
where \(\inf \emptyset := \infty\) and \(B_{d}(s,\varepsilon ):=\{t\in T: d(t,s)\le \varepsilon \}\) denotes the closed ball with center \(s\in T\) and radius \(\varepsilon\).
Moreover, if (T, d) is a subspace of a normed space \((E,\Vert \cdot \Vert )\) and the metric is given by \(d(x,x') = \Vert x-x' \Vert\), \(x,x'\in T\), we write \(\mathcal{N}(T,\Vert \cdot \Vert ,\varepsilon ):= \mathcal{N}(T,d,\varepsilon )\).
Finally, we need to fix some notation related to empirical risk minimization. To this end, we fix a loss function \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) and an \(\mathcal{F}\subset \mathcal{L}_{\infty }(X)\). Given a distribution P on \(X\times Y\), we denote the smallest possible risk attained by functions in \(\mathcal{F}\) by \({\mathcal{R}_{L,P,\mathcal{F}}^*}\), that is
Finally, following (Steinwart and Christmann (2008), Definition 6.2), we say that an ERM method \(D\mapsto f_D\) with respect to L and \(\mathcal{F}\) is measurable, if for all \(n\ge 1\) the map
is measurable with respect to the universal completion of the product \(\sigma\)-algebra of the product space \((X\times Y)^n \times X\). Recall from (Steinwart and Christmann (2008), Lemma 6.17) that for closed, separable \(\mathcal{F} \subset \mathcal{L}_{\infty }(X)\) for which there exists an ERM, there also exists a measurable ERM. Moreover, in this case the map
is also measurable with respect to the universal completion of the product \(\sigma\)-algebra of \((X\times Y)^n\), see (Steinwart and Christmann (2008), Lemma 6.3). In the following, we thus assume that \((X\times Y)^n\) is equipped with this universal completion. Finally, for a loss L and a function \(f \in {\mathcal {F}}\), we denote by \(L \circ f: X \times Y \rightarrow [0, \infty )\) the map \((x,y) \mapsto L(x,y, f(x))\).
With the help of these notions we can now prove the generic oracle inequality for empirical risk minimizers.
Proof of Theorem 2.10
We first note that (18) ensures \({\mathcal{R}_{L,P}(f_D)} - {\mathcal{R}_{L,P}^{*}}\le B\) and since we have additionally assumed \(V\ge B^{2-\vartheta }\), we see that it suffices to consider sample sizes \(n\ge 16\tau\).
Given an \(f\in \mathcal{F}\), we define \(h_f:=L\circ f-L\circ {f_{L,P}^*}\). Let us now fix an \({f_0}\in \mathcal{F}\) and a data set \(D\in (X\times Y)^n\). Since \(f_D\) is an empirical risk minimizer, we have \({\mathcal{R}_{L,D}(f_D)}\le {\mathcal{R}_{L,D}({f_0})}\), and hence we find \(\mathbb {E}_D h_{f_D} \le \mathbb {E}_D h_{f_0}\). As a consequence, we obtain
To bound the first difference in (74) we first observe that for \(f,f'\in \mathcal{F}\), \(x\in X\), and \(y\in Y\) the local Lipschitz continuity of L gives
and thus we have \(\Vert h_f-h_{f'} \Vert _\infty \le |L|_{M,1} \cdot \Vert f-f' \Vert _\infty\) for all \(f,f'\in \mathcal{F}\). Now, let \(\mathcal{C}\subset \mathcal{F}\) be a minimal \(\varepsilon\)-net of \(\mathcal{F}\) with respect to \(\Vert \cdot \Vert _\infty\). For a data set \(D\in (X\times Y)^n\) there then exists an \(f\in \mathcal{C}\) such that \(\Vert f-f_D \Vert _\infty \le \varepsilon\), and hence \(\Vert h_{f_D} - h_f \Vert _\infty \le |L|_{M,1} \cdot \varepsilon\). This yields
For \(f\in \mathcal{C}\) and \(r>0\) we now define the function
It is easy to see that both \(\mathbb {E}_P g_{f,r}=0\) and \(\Vert g_{f,r} \Vert _\infty \le 2Br^{-1}\) hold. In addition, in the case \(\vartheta >0\) and \(b:= \mathbb {E}_P h_f\ne 0\), setting \(q:= \frac{2}{2-\vartheta }\), \(q':= \frac{2}{\vartheta }\), and \(a:= r\) in the second inequality of (Steinwart and Christmann (2008), Lemma 7.1) shows
Furthermore, in the case \(\vartheta >0\) and \(\mathbb {E}_P h_f=0\), the variance bound (19) gives \(\mathbb {E}_P h_f^2=0\), and hence we have \(\mathbb {E}_P g_{f,r}^2 \le V{r^{\vartheta -2}}\). Finally, in the case \(\vartheta = 0\), we have \(\mathbb {E}_P g_{f,r}^2 \le \mathbb {E}_P h_f^2 \, r^{-2} \le V{r^{\vartheta -2}}\). In summary, we have thus found
in all cases. By applying Bernstein’s inequality in the form of (Steinwart and Christmann (2008), Theorem 6.12) in combination with a union bound we thus find
for all \(r>0\). Let us now pick a data set \(D\in (X\times Y)^n\) that satisfies the above inequality, that is
For an \(f\in \mathcal{C}\) with \(\Vert f-f_D \Vert _\infty \le \varepsilon\), Inequality (75) together with the definition of \(g_{f,r}\) then gives
Our next goal is to estimate the second difference (74), that is \(\mathbb {E}_D h_{f_0}- \mathbb {E}_P h_{f_0}\). Let us first consider the case \(\vartheta >0\). Here, we have both \(\Vert h_{f_0}- \mathbb {E}_P h_{f_0} \Vert _\infty \le 2B\) and
Furthermore, setting \(q:= \frac{2}{2-\vartheta }\), \(q':= \frac{2}{\vartheta }\), \(a:=\bigl (\frac{2^{1-\vartheta }\vartheta ^\vartheta V\tau }{n} \bigr )^{1/2}\), and \(b:= \bigl (\frac{2\mathbb {E}_P h_{f_0}}{\vartheta }\bigr )^{\vartheta /2}\) in (Steinwart and Christmann (2008), Lemma 7.1) yields
By another application of Bernstein’s inequality we consequently find that
holds with probability \(P^n\) not less than \(1-e^{-\tau }\). Finally, in the case \(\vartheta =0\), Hoeffding’s inequality in combination with \(\Vert h_{f_0} \Vert _\infty \le B\le \sqrt{V}\) also yields (78).
To finish the proof, we now combine (74), (76), (77), and (78). As a result we see that
holds with probability \(P^n\) not less than \(1-(1+|\mathcal{C}|)e^{-\tau }\). In the following, we fix a data set D, for which this inequality holds. Defining
a simple calculation then shows both
Moreover, \(V\ge B^{2-\vartheta }\) together with \(n\ge 16\tau\) gives
Finally, we have
Inserting these estimates in our inequality on \(\mathbb {E}_Ph_{f_D}\) gives
and by elementary transformations we thus conclude that
Now the assertion follows by a simple algebraic transformation of \(\tau\) and taking the infimum over all \({f_0}\in \mathcal{F}\). \(\square\)
If we have an upper bound on the covering numbers occurring in Theorem 2.10, then we can optimize the right hand side of its oracle inequality with respect to \(\varepsilon\). The following corollary executes this idea for the least squares loss and histogram rules that choose their cubic partitions in a certain, data-dependent way.
Corollary E.2
Let \(Y = [-M,M]\) and let L be the least squares loss. For \(K<\infty\) and \(A<\infty\) let \({\mathcal {A}}_1, \dots ,{\mathcal {A}}_K\) be finite partitions of X satisfying \(|{\mathcal {A}}_i|\le A\) for all \(i=1,\dots ,K\). Moreover, let \(D \mapsto h_{D, \mathcal{A}_D }\) be an algorithm that first chooses a partition \(\mathcal{A}_D\) from \({\mathcal {A}}_1, \dots , {\mathcal {A}}_K\) and then computes the corresponding \({\mathcal {A}}_D\)-histogram. Then, for all \(n\ge 1\) and \(\tau >0\), we have
with probability \(P^n\) not less than \(1- Ke^{-\tau }\).
Proof of Corollary E.2
Since L is the least squares loss, the assumptions (18) and (19) of Theorem 2.10 are satisfied with \(\vartheta = 1\), \(B=4\,M^2\), and \(V=16\,M^2\). Moreover, our assumption \(Y\subset [-M,M]\) ensures that L is locally Lipschitz continuous with \(|L|_{M,1} \le 4\,M\).
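For completeness, these constants can be verified directly (our own computation; the precise statements of conditions (18) and (19) are given in the main text): for \(y,t,t'\in [-M,M]\) the least squares loss satisfies
\[ |L(y,t)-L(y,t')| = \bigl |(t'-t)(2y-t-t')\bigr | \le 4M\, |t-t'| \qquad \text {and}\qquad L(y,t)=(y-t)^2\le 4M^2 , \]
which yields \(|L|_{M,1}\le 4M\) and a supremum bound of size \(4M^2\). Moreover, since the considered histograms take values in \([-M,M]\), the excess loss \(h_f:= L\circ f - L\circ {f_{L,P}^*}\) factorizes as \(h_f(x,y)=\bigl ({f_{L,P}^*}(x)-f(x)\bigr )\bigl (2y-f(x)-{f_{L,P}^*}(x)\bigr )\), and together with \(\mathbb {E}_P h_f=\Vert f-{f_{L,P}^*} \Vert _{L_2(P_X)}^2\) this gives \(\mathbb {E}_P h_f^2\le 16M^2\, \mathbb {E}_P h_f\), that is, \(\vartheta =1\) and \(V=16M^2\).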
Now, for a fixed \(i\in \{1,\dots ,K\}\) we recall that the histogram rule \(D\mapsto h_{D,\mathcal{A}_i}\) is an empirical risk minimizer over the hypotheses class
where \({\mathcal {A}}_i=(A_j)_{j \in J}\). Moreover, for any \(\varepsilon >0\), the \(\varepsilon\)-covering number of \({\mathcal {H}}_{{\mathcal {A}}_i}\) satisfies
For \(n\ge 1\), \(\tau >0\), and \(\varepsilon >0\) Theorem 2.10 thus gives
with probability \(P^n\) not less than \(1- e^{-\tau }\).
Next we optimize this bound over \(\varepsilon >0\). To this end, we consider the strongly convex function
where \(\alpha := 20\,M\), \(\beta := \frac{512AM^2}{n}\), and \(\gamma := 2\,M\). Then a simple calculation shows that h has a minimum at \(\varepsilon ^*:= \frac{\beta }{\alpha }\), giving
Inserting this estimate in our above oracle inequality obtained from Theorem 2.10, using the fact that
and finally applying a simple union bound then gives the assertion. \(\square\)
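To make the algorithm of Corollary E.2 concrete, the following Python sketch implements a least squares histogram on a cubic partition of \(X=[-1,1]^d\) with width s, predicting the cell-wise mean of the responses and 0 on empty cells. The width selection by a hold-out split is only one possible data-dependent choice among finitely many candidate partitions; all function names and the tie-breaking at cell boundaries are our own illustrative conventions and not taken from the paper.

```python
import numpy as np

def cell_index(X, s):
    """Map points in [-1, 1]^d to the integer index of their cubic cell of width s."""
    idx = np.floor((X + 1.0) / s).astype(int)
    n_cells = int(np.ceil(2.0 / s))
    # Points on the right boundary fall into the last cell.
    return np.clip(idx, 0, n_cells - 1)

def fit_histogram(X, y, s):
    """A_D-histogram for the cubic partition of width s: cell-wise mean of y."""
    keys = [tuple(k) for k in cell_index(X, s)]
    sums, counts = {}, {}
    for k, yi in zip(keys, y):
        sums[k] = sums.get(k, 0.0) + yi
        counts[k] = counts.get(k, 0) + 1
    means = {k: sums[k] / counts[k] for k in sums}

    def predict(X_new):
        # Empty cells receive the prediction 0, as for the infinite sample histogram.
        return np.array([means.get(tuple(k), 0.0) for k in cell_index(X_new, s)])

    return predict

def select_width(X, y, widths=(1.0, 0.5, 0.25, 0.125)):
    """Data-dependent choice of the partition from finitely many candidates
    via a simple hold-out split (one admissible choice in Corollary E.2)."""
    m = len(y) // 2
    best_s, best_err = widths[0], np.inf
    for s in widths:
        err = np.mean((y[m:] - fit_histogram(X[:m], y[:m], s)(X[m:])) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s

# Illustrative usage on synthetic data with X = [-1, 1]^2 and Y = [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = np.clip(np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=500), -1.0, 1.0)
h_D = fit_histogram(X, y, select_width(X, y))  # the chosen A_D-histogram
```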
E.2 Learning properties of histograms
The first lemma describes how well the infinite sample histogram rules defined in (7) can approximate the least squares Bayes risk.
Lemma E.3
(Approximation Error) Let L be the least squares loss, \(X:= [-1,1]^d\), \(Y= [-1,1]\), and P be a distribution on \(X\times Y\). Then, for all \(\varepsilon > 0\), there exists an \(s_\varepsilon >0\) such that for any cubic partition \(\mathcal{A}\) of X with width \(s \in (0, s_\varepsilon ]\) one has
Moreover, if \({f_{L,P}^*}\) is \(\alpha\)-Hölder continuous for some \(\alpha \in (0,1]\), then for all \(s\in (0,1]\) and all cubic partitions \(\mathcal{A}\) of X with width s we have
Proof of Lemma E.3
For the proof of the first assertion we fix an \(\varepsilon >0\). Then recall that there exists a continuous function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) with compact support such that
see e.g. (Bauer (2001), Theorem 29.14 and Lemma 26.2). Moreover, since \(\Vert {f_{L,P}^*} \Vert _\infty \le 1\), we can assume without loss of generality that \(\Vert f \Vert _\infty \le 1\). Now, since f is continuous and has compact support, f is uniformly continuous, and hence there exists a \(\delta \in (0,1]\) such that for all \(x,x'\in X\) with \(\Vert x-x' \Vert _\infty \le \delta\) we have
We define \(s_\varepsilon := \delta\). Now, we fix a cubic partition \(\mathcal{A} = (A_j)_{j\in J}\) of X with width \(s\in (0,s_\varepsilon ]\). For \(x\in X\) with \(P_X(A(x)) > 0\) we then have
For such x we then define
For the remaining \(x\in X\) we simply set \(\bar{f}(x):= 0\). With these preparations we then have
Clearly, (80) shows that the third term is bounded by \(\varepsilon\). Let us now consider the second term. Here we first note that for an \(x\in X\) with \(P_X(A(x)) > 0\) we have
where in the last step we used (81). Consequently, we obtain
In other words, the second term is bounded by \(\varepsilon\), too. Let us finally consider the first term. Here we have
Consequently, the first term is bounded by \(\varepsilon\), too, and hence we conclude by (83) that the excess risk satisfies
A simple variable transformation then yields the first assertion.
To show the second assertion we first note that for all \(x,x'\in X\) we have
For \(s\in (0,1]\), \(\varepsilon := |{f_{L,P}^*}|_\alpha \cdot s^\alpha\), and \(x,x'\in X\) with \(\Vert x-x' \Vert _\infty \le s\) we thus find
Now consider \(f: = {f_{L,P}^*}\) and fix an arbitrary cubic partition \(\mathcal{A}\) of X with width s. Then \(\bar{f}\) defined by (82) is given by \(\bar{f} = h_{P,{\mathcal {A}}}\). Moreover, we have
where in the last step we used (84) and (85). \(\square\)
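For orientation, we record the quantitative content of the second assertion as we read it (our own summary; Hölder continuity is measured here with respect to \(\Vert \cdot \Vert _\infty\), and a different norm on X may introduce an additional d-dependent factor): on every cell A with \(P_X(A)>0\) the infinite sample histogram \(h_{P,{\mathcal {A}}}\) is the \(P_X\)-average of \({f_{L,P}^*}\) over A, so that \(|h_{P,{\mathcal {A}}}(x)-{f_{L,P}^*}(x)|\le |{f_{L,P}^*}|_\alpha \, s^\alpha\) for \(P_X\)-almost all \(x\in X\), and hence
\[ {\mathcal {R}}_{L,P}(h_{P,{\mathcal {A}}})-{\mathcal {R}}_{L,P}^* = \Vert h_{P,{\mathcal {A}}}-{f_{L,P}^*} \Vert _{L_2(P_X)}^2 \le |{f_{L,P}^*}|_\alpha ^2\, s^{2\alpha } , \]
where \({\mathcal {R}}_{L,P}\) denotes the least squares risk and \({\mathcal {R}}_{L,P}^*\) the corresponding Bayes risk.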
Based on the previous results we can now establish universal consistency of the empirical histogram rule \(D \mapsto h_{D,{\mathcal {A}}_D}\) for regression based on a cubic data-dependent partition \({\mathcal {A}}_D\) from \({\mathcal {P}}(X)\).
Proposition E.4
(Universal Consistency) Let L be the least squares loss, \(X:= [-1,1]^d\), \(Y= [-1,1]\), P be a distribution on \(X\times Y\), and \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\) drawn from P with \(|D_X|=m_n\). Suppose that \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\) as well as \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\). Assume further that \(\pi _{m_n, s_n}\) is an \(m_n\)-sample cubic partitioning rule of width \(s_n \in (0,1]\), satisfying \(|{{\,\textrm{Im}\,}}(\pi _{m_n, s_n})| \le c n^\beta\), for some \(c<\infty\) and some \(\beta >0\) that are independent of n. Denoting \({\mathcal {A}}_D:=\pi _{m_n, s_n}(D_X)\), we have
in probability as \(n\rightarrow \infty\).
Proof of Proposition E.4
Note that for any \(\varepsilon >0\) and for any \({\mathcal {A}}\in {{\,\textrm{Im}\,}}(\pi _{m_n, s_n})\), the \(\varepsilon\)-covering number of \({\mathcal {H}}_{\mathcal {A}}\) satisfies
with \(|{\mathcal {A}}|\le (2/s_n)^d\). Let us write \(\mathcal{P}_n:= {{\,\textrm{Im}\,}}(\pi _{1, s_n}) \cup \dots \cup {{\,\textrm{Im}\,}}(\pi _{n, s_n})\). Applying Corollary E.2 with \(A:= (2/s_n)^d\) and \(K:= |\mathcal{P}_n| \le c n^{1+\beta }\) gives, for all \(\tau \ge 1\) and \(n\ge 1\), with probability \(P^n\) at least \(1-2c n^{1+\beta } e^{-\tau }\) that
Now, for all \(\varepsilon >0\), Lemma E.3 guarantees the existence of an \(s_\varepsilon >0\) such that for any cubic partition \({\mathcal {A}}\) of width \(s_n \in (0, s_\varepsilon ]\) we have
Since we assumed \(s_n \rightarrow 0\), we conclude that the latter inequality holds for all sufficiently large n. Combining both bounds, we find that for all sufficiently large n, with probability \(P^n\) at least \(1-2c n^{1+\beta } e^{-\tau }\), it holds that
where \(c_d=1024 \cdot 2^d\). Finally, choosing \(\tau = (\beta + 2) \ln (n)\), the result follows by recalling that \(\frac{\ln (n s_n^d)}{n s_n^d} \rightarrow 0\) by assumption. \(\square\)
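For concreteness, here is a simple admissible width sequence (our own example, not taken from the paper): \(s_n:= n^{-1/(2d)}\). Then \(s_n\rightarrow 0\) and \(n s_n^d = \sqrt{n}\rightarrow \infty\), so that
\[ \frac{\ln (n s_n^d)}{n s_n^d} = \frac{\ln \sqrt{n}}{\sqrt{n}} = \frac{\ln n}{2\sqrt{n}} \rightarrow 0 , \]
and the assumptions of Proposition E.4 on \((s_n)_{n\in \mathbb {N}}\) are satisfied.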
We now come to the second main contribution of this section, namely the derivation of learning rates for the empirical histogram rule \(D \mapsto h_{D,{\mathcal {A}}_D}\) for regression based on a cubic data-dependent partition \({\mathcal {A}}_D\) from \({\mathcal {P}}(X)\).
Proposition E.5
(Learning Rates) Let L be the least squares loss, \(X:= [-1,1]^d\), \(Y= [-1,1]\), P be a distribution on \(X\times Y\), and \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\) drawn from P with \(|D_X|=m_n\). Assume the Bayes decision function \(f^*_{L,P}\) is \(\alpha\)-Hölder continuous for some \(\alpha \in (0,1]\). Suppose further that \((s_n)_{n \in \mathbb {N}}\) is a sequence satisfying
Assume further that \(\pi _{m_n, s_n}\) is an \(m_n\)-sample cubic partitioning rule of width \(s_n \in (0,1]\), satisfying \(|{{\,\textrm{Im}\,}}(\pi _{m_n, s_n})| \le c n^\beta\), for some \(c<\infty\) and some \(\beta >0\) that are independent of n. Denoting \({\mathcal {A}}_D:=\pi _{m_n, s_n}(D_X)\), the excess risk then satisfies for all \(n\ge 1\) the inequality
with probability \(P^n\) at least \(1- cn^{1+\beta } e^{-n^{d \gamma }}\), where \(c_{d,\alpha }>0\) is a constant depending only on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\).
Proof of Proposition E.5
Since the Bayes decision function \(f^*_{L,P}\) is \(\alpha\)-Hölder continuous by assumption, Lemma E.3 gives us for all \(n\ge 1\) that
Repeating the proof of Proposition E.4 by replacing (87) with (88) shows that for all \(\tau \ge 1\) and \(n\ge 1\) we have
with probability \(P^n\) not less than \(1-2cn^{1+\beta } e^{-\tau }\). Using the definition of \(s_n\) and setting \(\tau _n := ns_n^{2\alpha } = n^{\frac{d}{2\alpha + d}}\) then gives the assertion. \(\square\)
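To make the resulting rate explicit (a remark of ours based on the identity \(n s_n^{2\alpha }=n^{\frac{d}{2\alpha +d}}\) used in the proof): this identity corresponds to the width choice \(s_n=n^{-\frac{1}{2\alpha +d}}\), for which the approximation and estimation terms are balanced in the sense that
\[ s_n^{2\alpha } = n^{-\frac{2\alpha }{2\alpha +d}} \qquad \text {and}\qquad \frac{1}{n s_n^d} = n^{-\frac{2\alpha }{2\alpha +d}} . \]
Consequently, up to constants and possibly logarithmic factors, the excess risk bound of Proposition E.5 is of the order \(n^{-\frac{2\alpha }{2\alpha +d}}\), the classical rate for regression with an \(\alpha\)-Hölder continuous Bayes decision function.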