1 Introduction

During the last few decades statistical learning theory (SLT) has developed powerful techniques to analyze many variants of (regularized) empirical risk minimizers, see e.g. Devroye et al. (1996); Vapnik (1998); van de Geer (2000); Györfi et al. (2002); Steinwart and Christmann (2008); Tsybakov (2009); Shalev-Shwartz and Ben-David (2014). The resulting learning guarantees, which include finite sample bounds, oracle inequalities, learning rates, adaptivity, and consistency, assume in most cases that the effective hypotheses space of the considered method is sufficiently small in terms of some notion of capacity such as VC-dimension, fat-shattering dimension, Rademacher complexities, covering numbers, or eigenvalues.

Most training algorithms for DNNs also optimize a (regularized) empirical error term over a hypotheses space, namely the class of functions that can be represented by the architecture of the considered DNN, see Goodfellow et al. (2016), Part II. However, unlike for many classical empirical risk minimizers, the hypotheses space is parametrized in a rather complicated manner. Consequently, the optimization problem is, in general, harder to solve. A common way to address this in practice is to use very large DNNs, since despite their size, training them is often easier, see e.g. Salakhutdinov (2017), Ma et al. (2018) and the references therein. Now, for sufficiently large DNNs it has recently been observed that common training algorithms can achieve zero training error on randomly or even arbitrarily labeled training sets, see Zhang et al. (2016). Because of this ability, their effective hypotheses space can no longer have a sufficiently small capacity in the sense of classical SLT, so that the usual techniques for analyzing learning algorithms are no longer suitable, see e.g. the discussion in Zhang et al. (2016), Belkin et al. (2018), Nagarajan and Kolter (2019), Zhou et al. (2020), Zhang et al. (2021). In fact, SLT provides well-known examples of large hypotheses spaces for which zero training error is possible but a simple empirical risk minimizer fails to learn. This phenomenon is known as over-fitting, and common wisdom suggests that successful learning algorithms need to avoid over-fitting, see e.g. Györfi et al. (2002), pp. 21–22. The empirical evidence mentioned above thus stands in stark contrast to this credo of SLT.

This somewhat paradoxical behavior has recently sparked interest, leading to deeper theoretical investigations of the so-called double/multiple-descent phenomenon for different model settings. More specifically, Belkin et al. (2020) analyzed linear regression with random feature selection and investigated the random Fourier feature model. This model has also been analyzed by Mei and Montanari (2019). For linear regression, where model complexity is measured in terms of the number of parameters, the authors in Bartlett et al. (2020), Tsigler and Bartlett (2020) show that over-parameterization is even essential for benign over-fitting. However, these results are highly distribution dependent and require a specific covariance structure and (sub-) Gaussian data. For more details we refer also to Belkin et al. (2018); Chen et al. (2020); Liang et al. (2020); Neyshabur et al. (2019); Allen-Zhu et al. (2019). Another line of research (Belkin et al., 2019) shows for classical learning methods, namely the Nadaraya-Watson estimator with certain singular kernels, that interpolating the training data can achieve optimal rates for problems of nonparametric regression and prediction with the square loss.

Nonparametric regression with DNNs has been analyzed by various authors, see e.g. McCaffrey and Gallant (1994); Kohler and Krzyżak (2005); Kohler and Langer (2021); Suzuki (2018); Yarotsky (2018) and references therein. In particular, we highlight Schmidt-Hieber (2020), where it is shown that sparsely connected ReLU-DNNs achieve the minimax rates of convergence up to log-factors under a general assumption on the regression function. In Kohler and Langer (2021), the authors show that sparsity is not necessary for optimal rates of convergence: such rates have also been established for fully connected feedforward neural networks with ReLU activation. Here, an important observation is that DNNs are able to circumvent the curse of dimensionality (Bauer & Kohler, 2019). More structured input data are investigated in Kohler et al. (2023).

Beyond empirical evidence there are therefore also theoretical results showing that interpolating the data and good learning performance are simultaneously possible. So far, however, the considered interpolating learning methods neither implement an empirical risk minimization (ERM) scheme nor closely resemble the learning mechanisms of DNNs. In this paper, we take a step towards closing this gap.

First, we explicitly construct, for data sets of size n, large classes of hypotheses \({\mathcal {H}}_n\) for which we show that some interpolating least squares ERM algorithms over \({\mathcal {H}}_n\) enjoy very good statistical guarantees, while other interpolating least squares ERM algorithms over \({\mathcal {H}}_n\) fail in a strong sense. To be more precise, we observe the following phenomena: There exists a universally consistent empirical risk minimizer and there exists an empirical risk minimizer whose predictors converge to the negative regression function for most distributions. In particular, the latter empirical risk minimizer is not consistent for most distributions, and even worse, the obtained risks are usually far off the best possible risk. We further construct modifications that enjoy minimax optimal rates of convergence up to some log factor under standard assumptions. In addition, there are also ERM algorithms that exhibit an intermediate behavior between these two extreme cases, with arbitrarily slow convergence.

The finding that an interpolating estimator is not necessarily benign is a known fact. For instance, Zhou et al. (2020), Yang et al. (2021), Bartlett and Long (2021), Koehler et al. (2021) show that uniform bounds fail to give a precise evaluation of the risk of minimum norm interpolators in over-parameterized linear models, and that there are estimators that provably give worse errors than the minimum norm interpolator. In this paper, we analyze a different setting and a different class of estimators. To put our results in perspective, we note that classical SLT shows that for sufficiently small hypotheses classes, all versions of empirical risk minimizers enjoy good statistical guarantees. In contrast, our results demonstrate that this is no longer true for large hypotheses classes. For such hypotheses spaces, the description “empirical risk minimizer” is thus not sufficient to identify well-behaved learning algorithms. Instead, the class of algorithms described by ERM over such hypotheses spaces may encompass learning algorithms with extremely different learning behaviors.

Second, we show that exactly the same phenomena occur for interpolating ReLU-DNNs of at least two hidden layers with widths growing linearly in both the input dimension d and the sample size n. We present DNN training algorithms that produce interpolating predictors and that enjoy consistency and optimal rates, at least up to some log factor. In addition, this training can be done in \({\mathcal {O}}(d^2\cdot n^2)\)-time if the DNNs are implemented as fully connected networks. Since the constructed predictors have a particularly sparse structure, the training time can actually be reduced to \({\mathcal {O}}(d\cdot n \cdot \log n)\) by implementing the DNNs as loosely connected networks. Moreover, we show that there are other efficient and feasible training algorithms for exactly the same architectures that fail in the worst possible sense, and, as in the ERM case, there is also a variety of training algorithms performing in between these two extreme cases.

The rest of the paper is organized as follows: In Sect. 2 we first recall classical histograms as ERMs and then extend them to the class of inflated histograms. We provide specific examples of interpolating predictors from that class. In our main theorems we derive consistency results and learning rates. In the subsequent Sect. 3 we explain how inflated histograms can be approximated by ReLU networks with analogous learning properties. We discuss our results in Sect. 4.

All our proofs are deferred to Appendices A, B, C, and D. Finally, in Appendix E we derive general uniform bounds for histograms based on data-dependent partitions. This result is needed for proving our main results and is of independent interest.

2 The histogram rule revisited

In this section, we reconsider the histogram rule within the framework of regression. Specifically, in Sect. 2.1, we recall the classical histogram rule and demonstrate how to modify it to obtain a predictor that can interpolate the given data. In Sect. 2.2, we construct specific interpolating empirical risk minimizers for a broad class of loss functions. The core idea is to begin with classical histogram rules and then expand their hypothesis spaces, allowing us to find interpolating empirical risk minimizers within these enlarged spaces. Section 2.3 presents a generic oracle inequality, while Sect. 2.4 focuses on learning rates for the least squares loss.

To begin with, let us introduce some necessary notation. Throughout this work, we consider \(X:=[-1,1]^d\) and \(Y=[-1,1]\) if not specified otherwise. Moreover, \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) denotes the loss function. If not specified otherwise, we restrict ourselves to the least squares loss \(L(x,y,f(x))=(y-f(x))^2\). Given a dataset \(D:= ((x_1, y_1),...,(x_n, y_n)) \in (X\times Y)^n\) drawn i.i.d. from an unknown distribution P on \(X \times Y\), the aim of supervised learning is to build a function \(f_D: X \rightarrow \mathbb {R}\) based on D such that its risk

$$\begin{aligned} {\mathcal{R}_{L,P}(f_D)}:= \int _{X \times Y} L(x,y, f_D(x)) \; dP(x,y) , \end{aligned}$$
(1)

is close to the smallest possible risk

$$\begin{aligned} {\mathcal{R}^*_{L,P}} = \inf _{f:X \rightarrow \mathbb {R}} {\mathcal{R}_{L,P}(f)} \,. \end{aligned}$$
(2)

In the following, \({\mathcal{R}^*_{L,P}}\) is called the Bayes risk and an \({f_{L,P}^*}: X \rightarrow \mathbb {R}\) satisfying \({\mathcal{R}_{L,P}(f^*_{L,P})} = {\mathcal{R}^*_{L,P}}\) is called a Bayes decision function. Recall that for the least squares loss, \({f_{L,P}^*}\) equals the conditional mean function, i.e. \({f_{L,P}^*}(x) = \mathbb {E}_P(Y|x)\) for \(P_X\)-almost all \(x\in X\), where \(P_X\) denotes the marginal distribution of P on X. In general, estimators \(f_D\) having small excess risk

$$\begin{aligned} {\mathcal{R}_{L,P}(f_D)} - {\mathcal{R}^*_{L,P}} = ||f_D - {f_{L,P}^*}||_{L_2(P_X)}^2 , \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _{L_2(P_X)}\) denotes the usual \(L_2\)-norm with respect to \(P_X\), are considered as good in classical statistical learning theory.

Now, to describe the class of learning algorithms we are interested in, we need the empirical risk of an \(f:X\rightarrow \mathbb {R}\), i.e.

$$\begin{aligned} {\mathcal{R}_{L,D}(f)}:= \frac{1}{n}\sum _{i=1}^n L(x_i, y_i, f(x_i)) . \end{aligned}$$

Recall that an empirical risk minimizer over some set \(\mathcal{F}\) of functions \(f:X\rightarrow \mathbb {R}\) chooses, for every data set D, an \(f_D \in {\mathcal {F}}\) that satisfies

$$\begin{aligned} {\mathcal{R}_{L,D}(f_D)} = \inf _{f \in {\mathcal {F}}}{\mathcal{R}_{L,D}(f)} \,. \end{aligned}$$

Note that the definition of empirical risk minimizers implicitly requires that the infimum on the right-hand side is attained, namely by \(f_D\). In general, however, \(f_D\) does not need to be unique. It is well-known that if we have a suitably increasing sequence of hypotheses classes \({\mathcal {F}}_n\) with controlled capacity, then every empirical risk minimizer \(D\mapsto f_D\) that ensures \(f_D \in {\mathcal {F}}_n\) for all data sets D of length n learns in the sense of e.g. universal consistency, and under additional assumptions it may also enjoy minimax optimal learning rates, see e.g. Devroye et al. (1996), van de Geer (2000), Györfi et al. (2002); Steinwart and Christmann (2008).
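To fix ideas, the following minimal Python sketch (our own illustration; the function names are not from the paper) computes the empirical least squares risk and performs ERM over a small finite class of candidate functions. The tie-breaking in the last step already hints at the non-uniqueness of ERM discussed above.

```python
# A minimal sketch (our own illustration; names are not from the paper) of
# empirical least squares risk and ERM over a small finite candidate class.
import numpy as np

def empirical_risk(f, X, y):
    """R_{L,D}(f) = (1/n) * sum_i (y_i - f(x_i))^2 for the least squares loss."""
    return float(np.mean((y - f(X)) ** 2))

def erm(candidates, X, y):
    """Pick one empirical risk minimizer from the finite class `candidates`.
    If several candidates attain the minimum, argmin returns the first one,
    which already illustrates the non-uniqueness of ERM discussed above."""
    risks = [empirical_risk(f, X, y) for f in candidates]
    return candidates[int(np.argmin(risks))]

# Toy usage with three constant predictors on X = [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 1))
y = np.clip(0.5 * X[:, 0] + 0.1 * rng.standard_normal(20), -1.0, 1.0)
F = [lambda x, c=c: np.full(x.shape[0], c) for c in (-0.5, 0.0, 0.5)]
f_D = erm(F, X, y)
```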

2.1 Classical histograms

Particularly simple empirical risk minimizers are histogram rules (HRs). To recall the latter, we fix a finite partition \(\mathcal{A} = (A_j)_{j\in J}\) of X and for \(x\in X\) we write A(x) for the unique cell \(A_j\) with \(x\in A_j\). Moreover, we define

$$\begin{aligned} {\mathcal {H}}_{\mathcal{A}} := \biggl \{ \sum _{j\in J} c_j\varvec{1}_{A_j} \; : \; c_j \in Y \biggr \} \;, \end{aligned}$$
(4)

where \(\varvec{1}_{A_j}\) denotes the indicator function of the cell \(A_j\). Now, given a data set D and a loss L, an \(\mathcal{A}\)-histogram is an \(h_{D, \mathcal{A}} = \sum _{j\in J} c_j^*\varvec{1}_{A_j} \in {\mathcal {H}}_{\mathcal{A}}\) that satisfies

$$\begin{aligned} \sum _{i: x_i \in A_j} L(x_i,y_i, c^*_j) = \inf _{c\in Y} \sum _{i: x_i \in A_j} L(x_i, y_i, c) \end{aligned}$$
(5)

for all so-called non-empty cells \(A_j\), that is, cells \(A_j\) with \(N_j:=|\{i: x_i \in A_j\}| >0\). Clearly, \(D\mapsto h_{D, \mathcal{A}}\) is an empirical risk minimizer. Moreover, note that in general \(h_{D, \mathcal{A}}\) is not uniquely determined, since \(c_j^*\in Y\) can take arbitrary values for empty cells \(A_j\). In particular, there is more than one empirical risk minimizer over \({\mathcal {H}}_{\mathcal{A}}\) as soon as \(|J|,n\ge 2\).

Before we proceed, let us consider the specific example of the least squares loss in more detail. Here, a simple calculation shows, see Lemma A.1, that for all non-empty cells \(A_j\), the coefficient \(c_j^*\) in (5) is uniquely determined by

$$\begin{aligned} c_j^*:= \frac{1}{N_j} \sum _{i: x_i \in A_j} y_i . \end{aligned}$$
(6)

In the following, we call every resulting \(D\mapsto h_{D, \mathcal{A}}\) with

$$\begin{aligned} h_{D, \mathcal{A}}: = \sum _{j\in J} c_j^*\varvec{1}_{A_j} \in {\mathcal {H}}_{\mathcal{A}} \end{aligned}$$

an empirical HR for regression with respect to the least-squares loss L. For later use we also introduce an infinite sample version of a classical histogram

$$\begin{aligned} h_{P,{\mathcal {A}}}: = \sum _{j\in J} c^*_j \varvec{1}_{A_j} , \qquad \qquad \text{ where } \quad c^*_j:= \frac{1}{P_X(A_j)} \int _{A_j}{f_{L,P}^*}(x) dP_X(x) \; \end{aligned}$$
(7)

for all cells \(A_j\) with \(P_X(A_j)>0\). Similarly to empirical histograms one has

$$\begin{aligned} {\mathcal {R}}_{L,P}( h_{P,{\mathcal {A}}}) = \inf _{h \in {\mathcal {H}}_{\mathcal {A}}} {\mathcal {R}}_{L,P}(h) \;. \end{aligned}$$

We are mostly interested in HRs on \(X=[-1,1]^d\) whose underlying partition essentially consists of cubes with a fixed width. To rigorously deal with boundary effects, we first say that a partition \((B_j)_{j\ge 1}\) of \(\mathbb {R}^d\) is a cubic partition of width \(s>0\), if each cell \(B_j\) is a translated version of \([0,s)^d\), i.e. there is an \(x^\dagger \in \mathbb {R}^d\), called the offset, such that for all \(j\ge 1\) there exists a \(k_j= ( k_{j,1},\dots , k_{j,d})\in \mathbb {Z}^d\) with

$$\begin{aligned} B_j = x^\dagger + sk_j + [0,s)^d \, . \end{aligned}$$
(8)

Now, a partition \(\mathcal{A} = (A_j)_{j\in J}\) of \(X=[-1,1]^d\) is called a cubic partition of width \(s>0\), if there is a cubic partition \({\mathcal {B}}=(B_j)_{j\ge 1}\) of \(\mathbb {R}^d\) with width \(s>0\) such that \(J= \{j\ge 1: B_j \cap X \ne \emptyset \}\) and \(A_j = B_j\cap X\) for all \(j\in J\). If \(s\in (0,1]\), then, up to reordering, this \((B_j)_{j\ge 1}\) is uniquely determined by \(\mathcal{A}\).

If the hypotheses space (4) is based on a cubic partition of \(X=[-1,1]^d\) with width \(s>0\), then the resulting HRs are well understood. For example, universal consistency and learning rates have been established, see e.g. Devroye et al. (1996); Györfi et al. (2002). In general, these results only require a suitable choice for the widths \(s=s_n\) for \(n\rightarrow \infty\) but no specific choice of the cubic partition of width s. For this reason we write \(\mathcal{H}_s:= \bigcup \mathcal{H}_{\mathcal{A}}\), where the union runs over all cubic partitions \(\mathcal{A}\) of X with fixed width \(s\in (0,1]\).
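For illustration, the following sketch (our own; helper names such as fit_histogram are hypothetical and not from the paper) fits the empirical least squares histogram with coefficients (6) over a cubic partition of width s: every sample is mapped to its cell index according to (8), and the labels are averaged per non-empty cell.

```python
# A minimal sketch (our naming) of the empirical least squares histogram rule
# on a cubic partition of X = [-1, 1]^d with width s and offset x_dagger,
# cf. (6) and (8).
import numpy as np

def fit_histogram(X, y, s, offset=None):
    """Return a dict mapping each non-empty cell index k_j to the label mean c_j^*."""
    d = X.shape[1]
    offset = np.zeros(d) if offset is None else np.asarray(offset)
    keys = [tuple(k) for k in np.floor((X - offset) / s).astype(int)]  # cell of each x_i
    cells = {}
    for key, yi in zip(keys, y):
        cells.setdefault(key, []).append(yi)
    return {key: float(np.mean(vals)) for key, vals in cells.items()}

def predict_histogram(cells, X, s, offset=None, empty_value=0.0):
    """Evaluate h_{D,A}; on empty cells the value is arbitrary in Y (here: empty_value)."""
    d = X.shape[1]
    offset = np.zeros(d) if offset is None else np.asarray(offset)
    keys = [tuple(k) for k in np.floor((X - offset) / s).astype(int)]
    return np.array([cells.get(key, empty_value) for key in keys])

# Toy usage on X = [-1, 1]^2.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.clip(X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(200), -1, 1)
cells = fit_histogram(X, y, s=0.25)
y_hat = predict_histogram(cells, X, s=0.25)
```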

2.2 Interpolating predictors and inflated histograms

In this section we construct particular interpolating empirical risk minimizers for a broad class of losses.

Definition 2.1

(Interpolating Predictor) We say that an \(f:X\rightarrow Y\) interpolates D, if

$$\begin{aligned} {\mathcal{R}_{L,D}(f)} = {\mathcal{R}^*_{L,D}}:= \inf _{\tilde{f}:X\rightarrow \mathbb {R}} {\mathcal{R}_{L,D}(\tilde{f})}\,, \end{aligned}$$

where we emphasize that the infimum is taken over all \(\mathbb {R}\)-valued functions, while f is required to be Y-valued.

Clearly, an \(f:X\rightarrow Y\) interpolates D if and only if

$$\begin{aligned} \sum _{k: x_k =x_i^*} L(x_k,y_k, f(x_i^*)) = \inf _{c\in \mathbb {R}} \sum _{k: x_k =x_i^*} L(x_k, y_k, c)\, , \qquad \qquad i=1,\dots ,m, \end{aligned}$$
(9)

where \(x_1^*,\dots , x_m^*\) are the elements of \(D_X:= \{x_i: i=1,\dots ,n\}\).

It is easy to check that for the least squares loss L and all data sets D there exists an \(f_D^*\) interpolating D. Moreover, we have \({\mathcal{R}^*_{L,D}} > 0\) if and only if D contains contradicting samples, i.e. \(x_i = x_k\) but \(y_i \ne y_k\). Finally, if \({\mathcal{R}^*_{L,D}} = 0\), then any interpolating \(f_D^*\) needs to satisfy \(f_D^*(x_i) = y_i\) for all \(i=1,\dots ,n\).

Definition 2.2

(Interpolatable Loss) We say that L is interpolatable for D if there exists an \(f:X\rightarrow Y\) that interpolates D, i.e. \({\mathcal{R}_{L,D}(f)} = {\mathcal{R}^*_{L,D}}\).

Note that (9) in particular ensures that the infimum over \(\mathbb {R}\) on the right is attained at some \(c^*_i\in Y\). Many common losses, including the least squares, the hinge, and the classification loss, are interpolatable for all D, and for these three losses we have \({\mathcal{R}^*_{L,D}} > 0\) if and only if D contains contradicting samples, i.e. \(x_i = x_k\) but \(y_i \ne y_k\) for some i, k. Moreover, for the least squares loss, \(c^*_i\) can be easily computed by averaging over all labels \(y_k\) that belong to some sample \(x_k\) with \(x_k = x_i^*\).

Let us now describe more precisely the inflated versions of \(\mathcal{H}_s\). For \(r,s>0\) and \(m\ge 0\) we want to consider functions

$$\begin{aligned} f = h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty } \end{aligned}$$
(10)

with \(h\in \mathcal{H}_s,\, b_i \in 2Y,\, x_i^* \in X,\) and \(t \in [0,r]\), where \(B_\infty := [-1,1]^d\). In other words, for \(m\ge 1\), such an f changes a classical histogram \(h \in \mathcal{H}_s\) on at most m small neighborhoods of some arbitrary points \(x_1^*,\dots ,x_m^*\) in X. Such changes are useful for finding interpolating predictors. In general, however, these small neighborhoods \(x_i^* + t B_\infty\) may intersect each other and may overlap more than one cell \(A_j\) of the considered partition \(\mathcal{A}\) with \(h \in {\mathcal {H}}_{\mathcal {A}}\). To avoid undesired boundary effects we restrict the class of all admissible cubic partitions \({\mathcal {A}}\) of X associated with h. An additional technical difficulty arises in particular when constructing interpolating predictors, since the set of points \(\{x_1^*,..., x_m^*\}\subset X\) naturally consists of the random input samples. As a consequence, the admissible cubic partitions become data-dependent. As a next step, we introduce the notion of a partitioning rule. To this end, we write

$$\begin{aligned} \textrm{Pot}_m(X):= \bigl \{A\subset X: |A| = m \bigr \} \end{aligned}$$

for the set of all subsets of X having cardinality m. Moreover, we denote the set of all finite partitions of X by \({\mathcal {P}}(X)\).

Definition 2.3

Given an integer \(m\ge 1\), an m-sample partitioning rule for X is a map \(\pi _m: \textrm{Pot}_m(X)\rightarrow {\mathcal {P}}(X)\), i.e. a map that associates to every subset \(\{x_1^*,..., x_m^*\}\subset X\) of cardinality m a finite partition \(\mathcal{A}\). Additionally, we will call an m-sample partitioning rule that assigns to any such \(\{x_1^*,..., x_m^*\}\in \textrm{Pot}_m(X)\) a cubic partition with fixed width \(s \in (0,1]\) an m-sample cubic partitioning rule and write \(\pi _{m,s}\).

Next we explain in more detail which particular partitions are considered as admissible.

Definition 2.4

(Proper Alignment) Let \({\mathcal {A}}\) be a cubic partition of X with width \(s\in (0,1]\), \({\mathcal {B}}\) be the partition of \(\mathbb {R}^d\) that defines \(\mathcal{A}\), and \(r\in (0,s)\). We say that \({\mathcal {A}}\) is properly aligned to the set of points \(\{x_1^*,..., x_m^*\}\in \textrm{Pot}_m(X)\) with parameter r, if for all \(i,k=1,\dots ,m\) we have

$$\begin{aligned} x_i^* + r B_\infty&\subset B(x_i^*)\, , \end{aligned}$$
(11)
$$\begin{aligned} \bigl (x_i^* + r B_\infty \bigr ) \cap \bigl (x_k^* + r B_\infty \bigr )&= \emptyset \qquad \qquad \text{ whenever } i\ne k\text{, } \end{aligned}$$
(12)

where \(B(x_i^*)\) is the unique cell of \({\mathcal {B}}\) that contains \(x_i^*\).

Clearly, if \({\mathcal {A}}\) is properly aligned with parameter \(r> 0\), then it is also properly aligned for any parameter \(t \in [0, r]\) for the same set of points \(\{x_j^*\}_{j=1}^m\) in \(\textrm{Pot}_m(X)\). Moreover, any cubic partition \({\mathcal {A}}\) of X with width \(s>0\) is properly aligned with the parameter \(r=0\) for any set of points \(\{x_j^*\}_{j=1}^m\) in \(\textrm{Pot}_m(X)\).

In what follows, we establish the existence of cubic partitions \({\mathcal {A}}\) that are properly aligned to a given set of points with a sufficiently small parameter \(r>0\). In other words, we construct a special m-sample cubic partitioning rule \(\pi _{m,s}\). Henceforth, we call any such rule \(\pi _{m,s}\) an m-sample properly aligned cubic partitioning rule. To this end, let \(D_X:= \{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) be a set of points and note that (12) holds for all \(r>0\) satisfying

$$\begin{aligned} r<\frac{1}{2}\min _{i,k:i\ne k}\Vert x_i^*-x_k^* \Vert _\infty . \end{aligned}$$

Clearly, a brute-force algorithm finds such an r in \(\mathcal{O}(dm^2)\)-time. However, a smarter approach is to first sort the first coordinates \(x_{1,1}^*,\dots , x_{m,1}^*\) and to determine the smallest positive distance \(r_1\) between two consecutive, non-identical ordered coordinates. This approach is then repeated for the remaining \(d-1\) coordinates, so at the end we have \(r_1,\dots ,r_d>0\). Then

$$\begin{aligned} r^*:= r^*_{D_X}:= \frac{1}{3} \min \{r_1,\dots ,r_d\} \end{aligned}$$
(13)

satisfies (12) and the resulting algorithm runs in \(\mathcal{O}(d\cdot m \log m)\) time. Our next result shows that we can also ensure (11) by jiggling the cubic partitions. Being rather technical, the proof is deferred to Appendix B.
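The following sketch (our own naming) implements this coordinate-wise computation of \(r^*\) from (13); sorting each coordinate gives the claimed \(\mathcal{O}(d\cdot m\log m)\) running time.

```python
# A minimal sketch (names ours) of the computation of r^* in (13): sort each
# coordinate and take the smallest positive gap between consecutive values.
import numpy as np

def smallest_positive_gaps(X_star):
    """Return (r_1, ..., r_d) for the distinct points in X_star (shape (m, d)).
    If all points share a coordinate, that r_l is set to +inf by convention
    (our assumption), so it does not constrain the minimum in (13)."""
    gaps = []
    for col in X_star.T:                      # one coordinate at a time
        diffs = np.diff(np.sort(col))
        positive = diffs[diffs > 0]
        gaps.append(positive.min() if positive.size else np.inf)
    return np.array(gaps)

def r_star(X_star):
    """r^* = (1/3) * min{r_1, ..., r_d}, which in particular satisfies (12)."""
    return float(smallest_positive_gaps(X_star).min() / 3.0)
```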

Theorem 2.5

(Existence of Properly Aligned Cubic Partitioning Rule) For all \(d\ge 1\), \(s\in (0,1]\), and \(m\ge 1\) there exists an m-sample cubic partitioning rule \(\pi _{m,s}\) with \(|{{\,\textrm{Im}\,}}(\pi _{m,s})|\le (m+1)^d\) that assigns to each set of points \(D_X:=\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) a cubic partition \(\mathcal{A}\) that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r:= r_{D_X}:= \min \{r^*,\frac{s}{3\,m+3}\}\), where \(r^*= r^*_{D_X}\) is defined in (13).

The construction of an m-sample cubic partitioning rule \(\pi _{m,s}\) basically relies on the representation (8) of cubic partitions \({\mathcal {B}}\) of \(\mathbb {R}^d\). In fact, the proof of Theorem 2.5 shows that there exists a finite set \(x_1^\dagger ,..., x^\dagger _K \in \mathbb {R}^d\) of candidate offsets, with \(K=(m+1)^d\). While at first glance this number seems to be prohibitively large for an efficient search, it turns out that the proof of Theorem 2.5 actually provides a simple \(\mathcal{O}(d\cdot m)\)-time algorithm for identifying, coordinate-wise, the offset \(x_\ell ^\dagger\) that leads to \(\pi _{m,s}(\{x_1^*,..., x_m^*\})\).
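To make the offset search concrete, the following hedged sketch describes one possible coordinate-wise selection among \(m+1\) candidate offsets per coordinate that is consistent with the statement of Theorem 2.5; the actual construction in Appendix B may differ in its details. For readability, the candidate check below is brute force; computing the single blocked candidate of each point directly would yield the \(\mathcal{O}(d\cdot m)\) bound mentioned above.

```python
# A hedged sketch (our construction, not taken from the paper) of an offset
# search consistent with Theorem 2.5. For each coordinate we pick an offset
# from the m+1 candidates {k*s/(m+1)}: since 2r <= 2s/(3m+3) is smaller than
# the candidate spacing s/(m+1), every point can rule out at most one
# candidate, so by pigeonhole a good candidate always exists.
import numpy as np

def aligned_offset(X_star, s, r):
    """Return a coordinate-wise offset x_dagger such that every point of X_star
    has L_inf-distance at least r to the boundary of its cell, cf. (11)."""
    m, d = X_star.shape
    step = s / (m + 1)
    offset = np.zeros(d)
    for l in range(d):
        blocked = np.zeros(m + 1, dtype=bool)
        for x in X_star[:, l]:
            for k in range(m + 1):      # mark candidates putting x too close to a boundary
                frac = (x - k * step) % s
                if frac < r or frac >= s - r:
                    blocked[k] = True
        offset[l] = np.flatnonzero(~blocked)[0] * step
    return offset
```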

Being now well prepared, we introduce the class of inflated histograms.

Definition 2.6

Let \(s \in (0,1]\) and \(m\ge 1\). Then a function \(f:X\rightarrow Y\) is called an m-inflated histogram of width s, if there exist a subset \(\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) and a cubic partition \(\mathcal{A}\) of width s that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r\in [0,s)\) such that

$$\begin{aligned} f = h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty }\, , \end{aligned}$$

where \(h\in {\mathcal {H}}_{\mathcal {A}}\), \(t\in [0,r]\), and \(b_i\in 2Y\) for all \(i=1,\dots ,m\). We denote the set of all m-inflated histograms of width s by \({\mathcal {F}}_{s, m}\). Moreover, for \(k\ge 1\) we write

$$\begin{aligned} \mathcal{F}^*_{s,k} := \mathcal{F}_{s,1}\cup \dots \cup \mathcal{F}_{s,k}\, . \end{aligned}$$

Note that the condition \(t\le r < s\) ensures that the representation \(f= h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty }\) of any \(f\in {\mathcal {F}}_{s, m}\) is unique. In addition, given an \(f\in \mathcal{F}^*_{s,k}\), the number m of inflation points \(\{x_1^*,..., x_m^*\}\) is uniquely determined, too, and hence so is the representation of f. For a depiction of an inflated histogram for regression (with and without proper alignment) we refer to Fig. 1.

Fig. 1

Left. Depiction of an inflated histogram for regression for a cubic partition \({\mathcal {A}}=(A_j)_{j \in J}\) that is not properly aligned to the data (black crosses). The predictions \(c_i^*\) and \(c_j^*\) on the associated cells \(A_i\) and \(A_j\) are calculated according to (6), i.e. by a local average. Mispredicted samples are corrected according to (14) on a \(tB_\infty\)-neighborhood for some small \(t > 0\). Note that one sample is too close to the cell boundary, i.e. (11) is violated. Right. An inflated histogram that is properly aligned to the same data set. Note that (11) ensures that boundary effects as for the left HR do not take place. For inflated histograms these effects seem to be a negligible technical nuisance. For their DNN counterparts considered in Sect. 3, however, such effects may significantly complicate the constructions of interpolating predictors, see Fig. 2 (Color figure online)

So far we have formalized the notion of interpolation and defined an appropriate inflated hypotheses class for modified histograms. The m-inflated histograms in Definition 2.6 can attain any values in Y that arise from classical histograms or from changes of classical histograms on a discrete set (note that this implicitly restricts the choices of the \(b_i\)). This is a quite general definition and m-inflated histograms do not need to be interpolating. In our next result we go a step further by providing a sufficient condition for the existence of interpolating predictors in \({\mathcal {F}}_{s, m}\). The idea is to give a condition on the \(b_i\) that ensures that an inflated histogram is interpolating. This condition depends, of course, on the \(c_j\).

Proposition 2.7

Let L be a loss that is interpolatable for \(D=((x_1,y_1),\dots ,(x_n,y_n))\) and let \(x_1^*,\dots , x_m^*\) be as in (9). Moreover, for \(s\in (0,1]\) and \(r>0\) we fix an \(f^*\in {\mathcal {F}}_{s, m}\) with representation as given in Definition 2.6. For \(i=1,\dots ,m\) let \(j_i\) be the index such that \(x_i^* \in A_{j_i}\). Then \(f^*\) interpolates D, if for all \(i=1,\dots ,m\) we have

$$\begin{aligned} b_i = - c_{j_i} + \arg \min _{c\in Y} \sum _{k: x_k = x_i^*} L(x_k, y_k, c) \, . \end{aligned}$$
(14)

Proof of Proposition 2.7

By our assumptions we have

$$\begin{aligned} c_i^*:= b_i + c_{j_i} \in \arg \min _{c\in Y} \sum _{k: x_k = x_i^*} L(x_k, y_k, c) = \arg \min _{c\in \mathbb {R}} \sum _{k: x_k = x_i^*} L(x_k, y_k, c) \,, \end{aligned}$$

where the last equality is a consequence of the fact that there is an \(f:X\rightarrow Y\) satisfying (9). Moreover, since (11) and (12) hold, we find \(f^*(x_i^*) = h(x_i^*) + b_i = c_{j_i} + b_i = c_i^*\), and therefore \(f^*\) interpolates D by (9).\(\square\)

Note that for all \(c_{j_i} \in Y\) the value \(b_i\) given by (14) satisfies \(b_i \in 2Y\) and we have \(b_i=0\) if \(c_{j_i}\) is contained in the \(\arg \min\) in (14). Consequently, defining \(b_i\) by (14) always gives an interpolating \(f^*\in {\mathcal {F}}_{s, m}\). Moreover, (14) shows that an interpolating \(f^*\in {\mathcal {F}}_{s, m}\) can have an arbitrary histogram part \(h\in \mathcal{H}_{\mathcal{A}}\), that is, the behavior of \(f^*\) outside the small \(tB_\infty\)-neighborhoods around the samples of D can be arbitrary. In other words, as soon as we have found a properly aligned cubic partition \(\mathcal{A}\) in the sense of \({\mathcal {F}}_{s,m}\), we can pick an arbitrary histogram \(h\in \mathcal{H}_{\mathcal{A}}\) and compute the \(b_i\)’s by (14). Intuitively, if the chosen \(tB_\infty\)-neighborhoods are sufficiently small, then the prediction capabilities of the resulting interpolating predictor are (mostly) determined by the chosen histogram part \(h\in \mathcal{H}_{\mathcal{A}}\). Based on this observation, we can now construct different interpolating \(f^*_D\in {\mathcal {F}}_{s, m}\) that have particularly good and bad learning behaviors.

Example 2.8

(Good interpolating histogram rule) Let L be the least squares loss, \(s\in (0,1]\) be a cell width, \(\rho \ge 0\) be an inflation parameter, and \(D=((x_1,y_1),\dots ,(x_n,y_n))\) be a data set. By \(D_X=\{x_1^*,...,x_m^*\}\) we denote the set of all covariates \(x_j \in X\) with \((x_j, y_j)\) belonging to the data set. For \(m=|D_X|\), Theorem 2.5 ensures the existence of a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\) with width \(s\in (0,1]\), being properly aligned to \(D_X\) with the data-dependent parameter r. Based on this data-dependent cubic partition \({\mathcal {A}}_D\) we fix an empirical histogram for regression

$$\begin{aligned} h_{D,\mathcal{A}_D}^+:= \sum _{j\in J} c_j^+ \varvec{1}_{A_j} \in \mathcal{H}_{s} \end{aligned}$$
(15)

with coefficients \((c_j^+)_{j \in J}\) precisely given in (6). Applying now Proposition 2.7 gives us an \(f_{D,s,\rho }^+\in {\mathcal {F}}_{s, m} \subset \mathcal{F}^*_{s,n}\), which interpolates D and has the representation

$$\begin{aligned} f_{D,s,\rho }^+:= h_{D,\mathcal{A}_D}^+ + \sum _{i=1}^m b^+_i \varvec{1}_{x_i^* + t B_\infty } , \end{aligned}$$

where the \(b^+_1,\dots ,b_m^+\) are calculated according to the rule (14), and \(t:= \min \{r, \rho \}\) is again data-dependent. We call the map \(D\mapsto f_{D,s,\rho }^+\) a good interpolating histogram rule.
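Combining the previous sketches, a possible implementation of the good interpolating histogram rule \(D\mapsto f_{D,s,\rho }^+\) could look as follows. The hypothetical helpers fit_histogram, predict_histogram, r_star, and aligned_offset from the earlier sketches are reused; this is an illustration under our own naming, not the construction used in the proofs.

```python
# A minimal sketch of the good interpolating histogram rule from Example 2.8.
import numpy as np

def fit_good_interpolating_hr(X, y, s, rho):
    X_star, inverse = np.unique(X, axis=0, return_inverse=True)    # distinct covariates x_i^*
    inverse = np.asarray(inverse).reshape(-1)                      # guard against numpy version differences
    m = X_star.shape[0]
    r = min(r_star(X_star), s / (3 * m + 3))                       # parameter r from Theorem 2.5
    offset = aligned_offset(X_star, s, r)                          # properly aligned partition A_D
    t = min(r, rho)                                                # data-dependent inflation radius
    cells = fit_histogram(X, y, s, offset)                         # c_j^+ as in (6)
    c_at_points = predict_histogram(cells, X_star, s, offset)      # c_{j_i}^+ for each x_i^*
    targets = np.array([y[inverse == i].mean() for i in range(m)]) # argmin in (14) for least squares
    b = targets - c_at_points                                      # b_i^+ according to (14)
    return cells, offset, X_star, b, t

def predict_good_interpolating_hr(model, X_new, s):
    cells, offset, X_star, b, t = model
    h = predict_histogram(cells, X_new, s, offset)
    for x_i, b_i in zip(X_star, b):                                # corrections on x_i^* + t*B_inf
        inside = np.all(np.abs(X_new - x_i) <= t, axis=1)
        h = h + b_i * inside
    return h
```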

Example 2.9

(Bad interpolating histogram rule) Let L be the least squares loss, \(s\in (0,1]\) be a cell width, \(\rho \ge 0\) be an inflation parameter, and \(D=((x_1,y_1),\dots ,(x_n,y_n))\) be a data set. Consider again a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\) with width \(s\in (0,1]\) that is properly aligned to \(D_X\) with parameter r and fix an empirical histogram \(h_{D,\mathcal{A}_D}^+ \in \mathcal{H}_{s}\) as in (15). Setting \(t:= \min \{r, \rho \}\), we define a predictor \(f_{D,s,\rho }^-\in {\mathcal {F}}_{s, m}\) by

$$\begin{aligned} f_{D,s,\rho }^-:= h_{D,\mathcal{A}_D}^- + \sum _{i=1}^m b^-_i \varvec{1}_{x_i^* + t B_\infty } , \end{aligned}$$

with \(\mathcal{H}_{\mathcal{A}}\)-part \(h_{D,\mathcal{A}_D}^-:= - h_{D,\mathcal{A}_D}^+\). The \(b^-_1,\dots ,b_m^-\) are calculated according to (14) and satisfy

$$\begin{aligned} b_i^- = b_i^+ + 2c_{j_i}^+ , \end{aligned}$$

for all \(i=1,...,m\), where \(j_i\) denotes the index such that \(x_i^* \in A_{j_i}\). By writing

$$\begin{aligned} D_X^{+t} := \bigcup _{i=1}^m \bigl ( x_i^* + t B_\infty \bigr ) \end{aligned}$$
(16)

we easily see that the definition of \(f_{D,s,\rho }^-\) gives \(f_{D,s,\rho }^-\in {\mathcal {F}}_{s, m} \subset \mathcal{F}^*_{s,n}\) and

$$\begin{aligned} f_{D,s,\rho }^-(x) = {\left\{ \begin{array}{ll} f_{D,s,\rho }^+(x) & \text{ if } x\in D_X^{+t}\\ -f_{D,s,\rho }^+(x) & \text{ if } x\not \in D_X^{+t}\, , \end{array}\right. } \end{aligned}$$
(17)

while Proposition 2.7 ensures that \(f_{D,s,\rho }^-\) interpolates D. We call the map \(D\mapsto f_{D,s,\rho }^-\) a bad interpolating histogram rule and remark that t is, like for good interpolating histogram rules, data-dependent.

2.3 A generic oracle inequality for empirical risk minimization

The main purpose of this section is to present a general variance-improved oracle inequality that bounds the excess risk of empirical risk minimizers for a broad class of loss functions; this result is of independent interest. In Sect. 2.4, we apply this result to the special case of the least squares loss and to histogram rules that choose their cubic partitions in a certain, data-dependent way. In particular, we give an optimized uniform bound that crucially relies on an explicit capacity bound, expressed in terms of covering numbers, see Definition E.1. This is a necessary step for establishing the learning properties of histograms based on data-dependent cubic partitions. The proof of this result is provided in Appendix E.1.

Theorem 2.10

Let \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) be a locally Lipschitz continuous loss, \(\mathcal{F}\subset \mathcal{L}_{\infty }(X)\) be a closed, separable set satisfying \(\Vert f \Vert _\infty \le M\) for a suitable constant \(M>0\) and all \(f\in \mathcal{F}\), and P be a distribution on \(X\times Y\) that has a Bayes decision function \(f_{L,P}^{*}\) with \({\mathcal{R}_{L,P}({f_{L,P}^*})}< \infty\). Assume that there exist constants \(B>0\), \(\vartheta \in [0,1]\), and \(V\ge B^{2-\vartheta }\) such that for all measurable \(f: X \rightarrow [-M,M]\) we have

$$\begin{aligned} \Vert L\circ f- L\circ {f_{L,P}^*} \Vert _\infty\le & B\,, \end{aligned}$$
(18)
$$\begin{aligned} \mathbb {E}_{P} \bigl ( L\circ f - L\circ {f_{L,P}^*}\bigr )^{2}\le & V \cdot \bigl (\mathbb {E}_{P} (L\circ f - L\circ {f_{L,P}^*}) \bigr )^{\vartheta }\,. \end{aligned}$$
(19)

Then, for all measurable empirical risk minimization algorithms \(D\mapsto f_D\), all \(n\ge 1\), \(\tau >0\), and all \(\varepsilon >0\) we have

$$\begin{aligned} {\mathcal{R}_{L,P}(f_D)} - {\mathcal{R}_{L,P}^{*}}\le & 4\bigl ( {\mathcal{R}_{L,P,\mathcal{F}}^*}-{\mathcal{R}_{L,P}^{*}}\bigr ) + 5\, |L|_{M,1} \cdot \varepsilon \\ & \quad + 2 \biggl (\frac{16 V\bigl (\tau +1+ \ln \mathcal{N}(\mathcal{F},\Vert \cdot \Vert _\infty ,\varepsilon ) \bigr ) }{n}\biggr )^{\frac{1}{2-\vartheta }} \end{aligned}$$

with probability \(P^n\) not less than \(1- e^{-\tau }\). Here, \({\mathcal{R}_{L,P,\mathcal{F}}^*}:= \inf _{f \in \mathcal{F}}{\mathcal{R}_{L,P}(f)}\) denotes the smallest risk within \(\mathcal{F}\), \(|L|_{M,1}\) denotes the Lipschitz constant of the restriction of L to \(X\times Y\times [-M,M]\), and \(\mathcal{N}(\mathcal{F},\Vert \cdot \Vert _\infty ,\varepsilon )\) denotes the \(\varepsilon\)-covering number of \(\mathcal{F}\).

Note that variance-improved oracle inequalities generally provide refined bounds on the estimation error part of the excess risk under the stricter assumptions (18) and (19). This leads to faster rates of convergence compared to a basic statistical analysis, see e.g. Steinwart and Christmann (2008), Chapters 6 and 7. Our variance-improved oracle inequality improves on the one in Steinwart and Christmann (2008), Theorem 7.2, for empirical risk minimizers: we go beyond finite function classes and bound the capacity in terms of covering numbers.
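For instance, for the least squares loss with \(Y=[-1,1]\) and \(M=1\), which is the setting of Sect. 2.4, the identity \(L\circ f - L\circ {f_{L,P}^*} = ({f_{L,P}^*}-f)(2y-f-{f_{L,P}^*})\) together with (3) shows that (18) and (19) are satisfied, for example, with

$$\begin{aligned} B = 8\, , \qquad \vartheta = 1\, , \qquad V = 16\, , \end{aligned}$$

so that the last term of the oracle inequality reduces to \(512\bigl (\tau +1+ \ln \mathcal{N}(\mathcal{F},\Vert \cdot \Vert _\infty ,\varepsilon ) \bigr )/n\), i.e. a fast \(\mathcal{O}(1/n)\) estimation error term.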

2.4 Main results for least squares loss

Our main results below show that the descriptions good and bad interpolating histogram rule from Examples 2.8 and 2.9, respectively, are indeed justified, provided the inflation parameter is chosen appropriately. Here we recall that good learning algorithms can be described by a small excess risk, or equivalently, a small distance to the Bayes decision function \({f_{L,P}^*}\), see (3). To describe bad learning behavior, we denote the point spectrum of \(P_X\) by

$$\begin{aligned} \Delta :=\{ x \in X\;: \; P_X(\{x\}) > 0 \} , \end{aligned}$$
(20)

see Hoffman-Jorgensen (2017). One easily verifies that \(\Delta\) is at most countable, since \(P_X\) is finite. Moreover, for an arbitrary but fixed version \({f_{L,P}^*}\) of the Bayes decision function, we write

$$\begin{aligned} {f_{L,P}^\dagger }:= \varvec{1}_\Delta {f_{L,P}^*}- \varvec{1}_{X\setminus \Delta }{f_{L,P}^*}\qquad \text{ and } \qquad {\mathcal {R}}^\dagger _{L,P} := {\mathcal{R}_{L,P}({f_{L,P}^\dagger })}\, , \end{aligned}$$

where we note that \({\mathcal {R}}^\dagger _{L,P}\) does, of course, not depend on the choice of \({f_{L,P}^*}\). Moreover, note that for \(x\in \Delta\) the value \({f_{L,P}^*}(x)\) is also independent of the choice of \({f_{L,P}^*}\) and we have \(f^\dagger _{L,P} (x) = {f_{L,P}^*}(x)\). In contrast, for \(x\in X{\setminus } \Delta\) with \({f_{L,P}^*}(x) \ne 0\) we have \(f^\dagger _{L,P}(x) \ne {f_{L,P}^*}(x)\). In fact, a quick calculation using (3) shows

$$\begin{aligned} {\mathcal {R}}^\dagger _{L,P} - {\mathcal{R}_{L,P}^{*}} = \Vert {f_{L,P}^\dagger }- {f_{L,P}^*} \Vert _{{L_{2}(P_X)}}^2 = 4 \Vert \varvec{1}_{X\setminus \Delta }{f_{L,P}^*} \Vert _{{L_{2}(P_X)}}^2\, , \end{aligned}$$
(21)

and consequently we have \({\mathcal {R}}^\dagger _{L,P} - {\mathcal{R}_{L,P}^{*}}>0\) whenever \(P_X(\Delta ) < 1\) and \({f_{L,P}^*}\) does not almost surely vanish on \(X\setminus \Delta\). It seems fair to say that the overwhelming majority of “interesting” P fall into this category. Finally, note that in general we do not have an equality of the form (3), when we replace \({\mathcal{R}_{L,P}^{*}}\) and \({f_{L,P}^*}\) by \({\mathcal {R}}^\dagger _{L,P}\) and \({f_{L,P}^\dagger }\). However, for \(y,t,t'\in Y=[-1,1]\) we have \(|L(y,t) - L(y,t')| \le 4 |t-t'|\), and consequently we find

$$\begin{aligned} \bigl |{\mathcal{R}_{L,P}(f)} - {\mathcal {R}}^\dagger _{L,P}\bigr | \le 4\, \Vert f - {f_{L,P}^\dagger } \Vert _{{L_{2}(P_X)}} \end{aligned}$$
(22)

for all \(f:X\rightarrow Y\). For this reason, we will investigate the bad interpolating histogram rule only with respect to its \(L_2\)-distance to \({f_{L,P}^\dagger }\).

Before we state our main result of this section we need to introduce one more assumption that will be required for parts of our results.

Assumption 2.11

There exists a non-decreasing continuous map \(\varphi : \mathbb {R}_+ \rightarrow \mathbb {R}_+\) with \(\varphi (0)=0\) such that for any \(t \ge 0\) and \(x \in X\) one has \(P_X(x + tB_\infty ) \le \varphi ( t)\).

Note that this assumption implies \(P_X(\{x\})=0\) for any \(x \in X\). Moreover, it is satisfied for the uniform distribution \(P_X\), if we consider \(\varphi (t):= 2^d t^d\), and a simple argument shows that modulo the constant appearing in \(\varphi\) the same is true if \(P_X\) only has a bounded Lebesgue density. The latter is, however, not necessary. Indeed, for \(X=[-1,1]\) and \(0<\beta < 1\) it is easy to construct unbounded Lebesgue densities that satisfy Assumption 2.11 for \(\varphi\) of the form \(\varphi (t) = ct^\beta\), and higher dimensional analogs are also easy to construct. Moreover, in higher dimensions Assumption 2.11 also applies to various distributions living on sufficiently smooth low-dimensional manifolds.
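To give one concrete example of the latter (our addition), consider on \(X=[-1,1]\) the unbounded Lebesgue density

$$\begin{aligned} p(x) = \frac{\beta }{2}\, |x|^{\beta -1}\, , \qquad x\in [-1,1]\setminus \{0\}\, , \quad 0<\beta <1\, . \end{aligned}$$

Since an interval of length 2t carries the largest mass when it is centered at the singularity, we obtain \(P_X(x + tB_\infty ) \le \int _{-t}^{t} p(u)\, du = t^\beta\) for all \(x\in X\) and \(t\in [0,1]\); as \(t^\beta \ge 1\) for \(t\ge 1\), Assumption 2.11 therefore holds with \(\varphi (t) = t^\beta\).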

With these preparations we can now establish the following theorem, which shows that for the inflation parameter \(\rho =0\) (see Examples 2.8, 2.9) the good interpolating histogram rule is universally consistent, while the bad interpolating histogram rule fails to be consistent in a stark sense. It further shows consistency, respectively non-consistency, for \(\rho =\rho _n>0\) with \(\rho _n\rightarrow 0\).

Theorem 2.12

Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto f_{D,s,\rho }^+\) denote the good interpolating histogram rule from Example 2.8. Similarly, let \(D \mapsto f_{D,s,\rho }^-\) denote the bad interpolating histogram rule from Example 2.9. Assume that \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\) as well as \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\).

(i)

    (Non)-consistency for \(\rho _n = 0\). We have in probability for \(|D|\rightarrow \infty\)

    $$\begin{aligned} \Vert f_{D,s_n,0}^+- {f_{L,P}^*} \Vert _{{L_{2}(P_X)}}&\rightarrow 0 \,, \end{aligned}$$
    (23)
    $$\begin{aligned} \Vert f_{D,s_n,0}^-- {f_{L,P}^\dagger } \Vert _{{L_{2}(P_X)}}&\rightarrow 0 \, . \end{aligned}$$
    (24)
(ii)

    (Non)-consistency for \(\rho _n >0\). Let \((\rho _n)_{n \in \mathbb {N}}\) be a non-negative sequence with \(\rho _n \rightarrow 0\) as \(n \rightarrow \infty\). Then for all distributions P that satisfy Assumption 2.11 for a function \(\varphi\) with \(n\varphi (\rho _n) \rightarrow 0\) for \(n\rightarrow \infty\), we have

    $$\begin{aligned} ||f_{D,s_n, \rho _n}^+- {f_{L,P}^*}||_{L_2(P_X)} \rightarrow 0 \,, \end{aligned}$$
    (25)
    $$\begin{aligned} ||f_{D,s_n,\rho _n}^-- {f_{L,P}^\dagger }||_{L_2(P_X)} \rightarrow 0 \,, \end{aligned}$$
    (26)

    in probability for \(|D|\rightarrow \infty\).

The proof of Theorem 2.12 is provided in Appendix C.2. Our second main result, whose proof is provided in Appendix C.3, refines the above theorem and establishes learning rates for the good and bad interpolating histogram rules, provided the width \(s_n\) and the inflation parameter \(\rho _n\) decrease sufficiently fast as \(n \rightarrow \infty\).

Theorem 2.13

Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto f_{D,s,\rho }^+\) denote the good interpolating histogram rule from Example 2.8. Similarly, let \(D \mapsto f_{D,s,\rho }^-\) denote the bad interpolating histogram rule from Example 2.9. Suppose that \({f_{L,P}^*}\) is \(\alpha\)-Hölder continuous with \(\alpha \in (0,1]\) and that P satisfies Assumption 2.11 for some function \(\varphi\). Assume further that \((s_n)_{n \in \mathbb {N}}\) is a sequence with

$$\begin{aligned} s_n = n^{-\gamma }, \quad \gamma = \frac{1}{2\alpha + d} \end{aligned}$$

and that \((\rho _n)_{n\ge 1}\) is a non-negative sequence with \(n \varphi (\rho _n) \le \ln (n) n^{-2/3}\) for all \(n\ge 1\). Then there exists a constant \(c_{d,\alpha }>0\) only depending on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\), such that for all \(n\ge 1\) the good interpolating histogram rule satisfies

$$\begin{aligned} ||f_{D,s_n, \rho _n}^+- {f_{L,P}^*}||_{L_2(P_X)}&\le c_{\alpha ,d} \sqrt{\ln (n)} \left( \frac{1}{n} \right) ^{\alpha \gamma } \;, \end{aligned}$$
(27)

with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Furthermore, for all \(n\ge 1\), the bad interpolating histogram rule satisfies

$$\begin{aligned} ||f_{D,s_n,\rho _n}^-- {f_{L,P}^\dagger }||_{L_2(P_X)}&\le c_{\alpha ,d} \sqrt{\ln (n)} \left( \frac{1}{n} \right) ^{\alpha \gamma }\;, \end{aligned}$$
(28)

with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\).

For a discussion of our results, we refer to Sect. 4.

3 Approximation of histograms with ReLU networks

The goal of this section is to build neural networks of suitable depth and width that mimic the learning properties of inflated histogram rules. To be more precise, we aim to construct a particular class of inflated networks that contains good and bad interpolating predictors, similar to the good and bad interpolating histogram rules from Example 2.8 and Example 2.9, respectively.

We begin by describing in more detail the specific networks that we will consider. Given an activation function \(\sigma : \mathbb {R}\rightarrow \mathbb {R}\) and \(b \in \mathbb {R}^p\) we define the shifted activation function \(\sigma _b: \mathbb {R}^p \rightarrow \mathbb {R}^p\) as

$$\begin{aligned} \sigma _b (y):= ( \sigma (y_1+b_1),..., \sigma (y_p+b_p) )^T \; \end{aligned}$$
(29)

where \(y_j\), \(j=1,...,p\) denote the components of \(y \in \mathbb {R}^p\). A hidden layer with activation \(\sigma\), of width \(p \in \mathbb {N}\) and with input dimension \(q \in \mathbb {N}\) is a function \(H_\sigma :\mathbb {R}^q \rightarrow \mathbb {R}^p\) of the form

$$\begin{aligned} H_\sigma (x) := (\sigma _b \circ A )( x) \;, \quad x\in \mathbb {R}^q, \end{aligned}$$
(30)

where A is a \(p \times q\) weight matrix and \(b \in \mathbb {R}^{p}\) is a shift vector or bias. Clearly, each pair (A, b) describes a layer, but in general, a layer, if viewed as a function, can be described by more than one such pair. The class of networks we consider is given in the following definition.

Definition 3.1

Given an activation function \(\sigma : \mathbb {R}\rightarrow \mathbb {R}\) and an integer \(\tilde{L}\ge 1\), a neural network with architecture \(p \in \mathbb {N}^{\tilde{L}+1}\) is a function \(f: \mathbb {R}^{p_0} \rightarrow \mathbb {R}^{p_{\tilde{L}}}\), having a representation of the form

$$\begin{aligned} f(x)&= H_{\text{ id } , \tilde{L}}\circ H_{\sigma ,\tilde{L}-1}\circ \dots \circ H_{\sigma , 1}(x) \;, \quad x \in \mathbb {R}^{p_0} \;, \end{aligned}$$
(31)

where \(H_{\sigma , l}: \mathbb {R}^{p_{l-1}} \rightarrow \mathbb {R}^{p_l}\) is a hidden layer of width \(p_l \in \mathbb {N}\) and input dimension \(p_{l-1} \in \mathbb {N}\), \(l=1,...,\tilde{L}-1\). Here, the last layer \(H_{\text{ id }, \tilde{L}}: \mathbb {R}^{p_{\tilde{L}-1}} \rightarrow \mathbb {R}^{p_{\tilde{L}}}\) is associated to the identity \(\text{ id }: \mathbb {R}\rightarrow \mathbb {R}\).

A network architecture is therefore described by an activation function \(\sigma\) and a width vector \(p = (p_0,...,p_{\tilde{L}}) \in \mathbb {N}^{\tilde{L}+1}\). The positive integer \(\tilde{L}\) is the number of layers, \(\tilde{L}-1\) is the number of hidden layers or the depth. Here, \(p_0\) is the input dimension and \(p_{\tilde{L}}\) is the output dimension. In the sequel, we confine ourselves to the ReLU-activation function \(|\cdot |_+: \mathbb {R}\rightarrow [0,\infty )\) defined by

$$\begin{aligned} |t|_+:=\max \{0,t\} , \quad t \in \mathbb {R}. \end{aligned}$$

Moreover, we consider networks with fixed input dimension \(p_0=d\) and output dimension \(p_{\tilde{L}}=1\), that is,

$$\begin{aligned} H_{\text{ id }, \tilde{L}}(x) = \langle a,x \rangle + b , \quad x\in \mathbb {R}^{p_{\tilde{L}-1}} . \end{aligned}$$

Thus, we may parameterize the (inner) architecture by the width vector \((p_1,...,p_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\) of the hidden layers only. In the following, we denote the set of all such neural networks by \({\mathcal {A}}_{p_1,...,p_{\tilde{L}-1}}\).
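As a concrete illustration of Definition 3.1, the following minimal numpy sketch (our own; not taken from the paper) evaluates such a ReLU network: each hidden layer applies an affine map followed by the shifted ReLU activation as in (29)-(30), and the last layer is affine.

```python
# A minimal sketch (our notation) of the network class of Definition 3.1:
# ReLU hidden layers x -> relu(A_l x + b_l) followed by an affine output layer.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

class ReLUNetwork:
    """Fully connected network with given hidden widths and a 1-dimensional output."""
    def __init__(self, weights, biases):
        # weights[l] has shape (p_{l+1}, p_l); the last pair is the identity layer.
        self.weights, self.biases = weights, biases

    def __call__(self, x):
        z = np.asarray(x, dtype=float)
        for A, b in zip(self.weights[:-1], self.biases[:-1]):
            z = relu(A @ z + b)                 # hidden layer, cf. (29)-(30)
        A, b = self.weights[-1], self.biases[-1]
        return A @ z + b                        # identity output layer

# Toy usage: a network in A_{3,2} with input dimension d = 2.
rng = np.random.default_rng(2)
W = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
B = [rng.standard_normal(3), rng.standard_normal(2), np.zeros(1)]
net = ReLUNetwork(W, B)
value = net(np.array([0.3, -0.7]))
```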

3.1 \(\varepsilon\)-approximate inflated histograms

Motivated by the representation (4) for histograms, the first step of our construction approximates the indicator function of a multi-dimensional interval by a small part of a possibly large DNN. This will be our main building block. We emphasize that the ReLU activation function is particularly suited for this approximation and it thus plays a key role in our entire construction.

For the formulation of the corresponding result we fix some notation. For \(z_1, z_2 \in {\mathbb {R}^d}\) we write \(z_1\le z_2\) if each coordinate satisfies \(z_{1,i}\le z_{2,i}\), \(i=1,\dots ,d\). We define \(z_1 < z_2\) analogously. In addition, if \(z_1\le z_2\), then the multi-dimensional interval is \([z_1, z_2]:= \{ z\in {\mathbb {R}^d}: z_1\le z\le z_2 \}\), and we similarly define \((z_1, z_2)\) if \(z_1 < z_2\). Finally, for \(s\in \mathbb {R}\), we let \(z_1 + s:= (z_{1,1}+s,\dots ,z_{1,d}+s)\).

Definition 3.2

Let \(A\subset X\), \(z_1, z_2\in {\mathbb {R}^d}\) with \(z_{1} < z_{2}\) and \(\varepsilon >0\) with \(\varepsilon < \frac{1}{2}\cdot \min \{z_{2,i}-z_{1,i}: i=1,\dots ,d \}\). Then a network \(\varvec{1}_A^{(\varepsilon )} \in {\mathcal {A}}_{2d,1}\) is called an \(\varepsilon\)-Approximation of the indicator function \(\varvec{1}_A: X \rightarrow [0,1]\) if

$$\begin{aligned} \{ \varvec{1}_A^{(\varepsilon )} = \varvec{1}_{A}\} = [z_{1}+\varepsilon , z_{2}-\varepsilon ] \cup \bigl ( X\setminus A \bigr ) , \end{aligned}$$

and if

$$\begin{aligned} \{\varvec{1}_A^{(\varepsilon )} >1 \} = \emptyset , \quad \{\varvec{1}_A^{(\varepsilon )} <0 \} = \emptyset . \end{aligned}$$

The next lemma ensures the existence of such approximations. The full construction is elementary calculus and is provided in Appendix D.2, in particular in Lemma D.3. Lemma D.5 then provides the desired properties.

Lemma 3.3

[Existence of \(\varepsilon\)-Approximations] Let \(z_1, z_2\in {\mathbb {R}^d}\) and \(\varepsilon >0\) as in Definition 3.2. Then for all \(A\subset X\) with \([z_{1}+\varepsilon , z_{2}-\varepsilon ] \subset A \subset [z_{1}, z_{2}]\) there exists an \(\varepsilon\)-Approximation \(\varvec{1}_A^{(\varepsilon )}\) of \(\varvec{1}_A\).
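The following hedged sketch shows one construction of such an \(\varepsilon\)-approximation that is consistent with Definition 3.2 and realizable in \({\mathcal {A}}_{2d,1}\); the actual construction in Appendix D.2 may differ in its details. The first hidden layer uses two ReLU units per coordinate to measure how far x exits the shrunken cube \([z_1+\varepsilon , z_2-\varepsilon ]\), and the single neuron of the second hidden layer clips the result to [0, 1].

```python
# A hedged sketch (our construction; Appendix D.2 may differ in details) of an
# epsilon-approximation of 1_{[z1, z2]} realizable in A_{2d,1}.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def indicator_approx(x, z1, z2, eps):
    """Value in [0,1]; equals 1 on [z1+eps, z2-eps] and 0 outside [z1, z2]."""
    x, z1, z2 = map(np.asarray, (x, z1, z2))
    # first hidden layer: 2d ReLU units, each depending on a single coordinate
    excess = relu((z1 + eps) - x) + relu(x - (z2 - eps))
    # second hidden layer: one ReLU unit whose 2d incoming weights all equal -1/eps
    return relu(1.0 - excess.sum() / eps)

# Toy check in d = 1 (the setting of Fig. 2, left):
z1, z2, eps = np.array([0.1]), np.array([0.6]), 0.1
assert indicator_approx(np.array([0.35]), z1, z2, eps) == 1.0   # inside the shrunken interval
assert indicator_approx(np.array([0.05]), z1, z2, eps) == 0.0   # outside [z1, z2]
```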

Fig. 2

Left. Approximation \(\varvec{1}_A^{(\varepsilon )}\) (orange) of the indicator function \(\varvec{1}_A\) for \(A =[0.1, 0.6]\) (blue) according to Lemma 3.3 for \(\varepsilon = 0.1\) on \(X=[0,1]\). The construction of \(\varvec{1}_A^{(\varepsilon )}\) ensures that \(\varvec{1}_A^{(\varepsilon )}\) coincides with \(\varvec{1}_A\) modulo a small set that is controlled by \(\varepsilon >0\). Right. A DNN (blue) for regression that approximates the histogram \(\varvec{1}_{[0,0.5)} + 0.8 \cdot \varvec{1}_{[0.5,1)}\) and a DNN (orange) that additionally tries to interpolate two samples \(x_1 = 0.2\) and \(x_2= 0.575\) (located at the two vertical dotted lines) with \(y_i = -0.5\). The label \(y_1\) is correctly interpolated since the alignment condition (11) is satisfied for \(x_1\) with \(t=0.15\) and \(\varepsilon =\delta = t/3 = 0.05\) as in Example 3.6. In contrast, \(y_2\) is not correctly interpolated since condition (11) is violated for this t and hence \(\varepsilon\) and \(\delta\) are too large (Color figure online)

Figure 2 illustrates \(\varvec{1}_A^{(\varepsilon )}\) for \(d=1\). Moreover, the proof of Lemma D.3 shows that out of the \(2d^2\) weight parameters of the first layer, only 2d are non-zero. In addition, the 2d weight parameters of the neuron in the second layer are all identical. In order to approximate inflated histograms we need to know how to combine several functions of the form provided by Lemma 3.3 into a single neural network. An appealing feature of our DNNs is that the concatenation of layer structures is very easy.

Lemma 3.4

If \(c \in \mathbb {R}\), \(p, p' \in \mathbb {N}^2\), and \(g \in {\mathcal {A}}_{p}\), \(g' \in {\mathcal {A}}_{p'}\), then \(cg \in {\mathcal {A}}_{p}\) and \(g + g' \in {\mathcal {A}}_{p+p'}\), where the sum \(p+p'\) is taken componentwise.

Lemma 3.4 describes some properties of neural networks with respect to scaling and addition. It tells us that the class of neural networks is closed under scalar multiplication and addition, with the width of the resulting networks adjusted appropriately. The proof is based on elementary linear algebra. For an extended version of this result, see Lemma D.2. In particular, our constructed DNNs have a particularly sparse structure and the number of required neurons behaves in a very controlled and natural fashion.

With these insights, we are now able to find a representation similar to (4). To this end, we choose a cubic partition \({\mathcal {A}}=(A_j)_{j \in J}\) of X with width \(s>0\) and define for \(\varepsilon \in (0, \frac{s}{3}]\)

$$\begin{aligned} {\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}} := \biggl \{ \sum _{j \in J} c_j \; \varvec{1}_{A_j}^{(\varepsilon )} \; : \; c_j \in Y \biggr \}\, , \end{aligned}$$

where \(\varvec{1}_{A_j}^{(\varepsilon )}:= (\varvec{1}_{B_j}^{(\varepsilon )})_{|A_j}\) is the restriction of \(\varvec{1}_{B_j}^{(\varepsilon )}\) to \(A_j\) and \(\varvec{1}_{B_j}^{(\varepsilon )}\) is an \(\varepsilon\)-approximation of \(\varvec{1}_{B_j}\) of Lemma 3.3. Here, \(B_j\) is the cell with \(A_j = B_j\cap X\), see the text around (8). We call any function in \({\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}}\) an \(\varepsilon\)-approximate histogram.

Our considerations above show that we have \({\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}} \subset {\mathcal {A}}_{p_1,p_2}\) with \(p_1 = 2d|J|\) and \(p_2 = |J|\). Thus, any \(\varepsilon\)-approximate histogram can be represented by a neural network with 2 hidden layers. Inflated versions are now straightforward.

Definition 3.5

Let \(s \in (0,1]\), \(m\ge 1\), and \(\varepsilon \in (0, s/3]\). Then a function \(f^{(\varepsilon )}: X \rightarrow Y\) is called an \(\varepsilon\)-approximated m-inflated histogram of width s if there exist a subset \(\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) and a cubic partition \(\mathcal{A}\) of width s that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r\in [0,s)\) such that

$$\begin{aligned} f^{(\varepsilon )} = h^{(\varepsilon )} + \sum _{i=1}^m b_i \varvec{1}^{(\delta )}_{x_i^* + t B_\infty } , \end{aligned}$$

where \(h^{(\varepsilon )} \in {\mathcal {H}}_{\mathcal {A}}^{(\varepsilon )}\), \(t \in (0,r]\), \(\delta \in (0, t/3]\), \(b_i \in 2Y\) and where \(\varvec{1}^{(\delta )}_{x_i^* + t B_\infty }\) is a \(\delta\)-approximation of \(\varvec{1}_{x_i^* + t B_\infty }\) for all \(i=1,...,m\). We denote the set of all \(\varepsilon\)-approximated m-inflated histograms of width s by \({\mathcal {F}}^{(\varepsilon )}_{s, m}\).

A short calculation shows that \({\mathcal {F}}^{(\varepsilon )}_{s, m} \subset {\mathcal {A}}_{p_1,p_2}\) with \(p_1 = 2d(m+|J|)\), \(p_2 = m+|J|\) and \(|J| \le (2/s)^d\). With these preparations, we can now introduce good and bad interpolating DNNs.

Example 3.6

(Good and bad interpolating DNN) Let L be the least squares loss, \(s \in (0,1]\) be a cell width and let \(\rho > 0\) be an inflation parameter. For a data set \(D=((x_1,y_1),\dots ,(x_n,y_n))\) we consider again a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\), with \(m=|D_X|\), being properly aligned to \(D_X\) with parameter r. Set \(t:=\min \{r, \rho \}\). According to Example 2.8, a good interpolating HR is given by

$$\begin{aligned} f_{D,s,\rho }^+:= \sum _{j\in J} c_j^+ \varvec{1}_{A_j} + \sum _{i=1}^m b^+_i \varvec{1}_{x_i^* + t B_\infty } , \end{aligned}$$

where the \((c_j^+)_{j \in J}\) are given in (6) and \(b^+_1,\dots ,b_m^+\) are from (14). For \(\varepsilon := \delta := t/3\) we then define the good interpolating DNN by

$$\begin{aligned} g_{D,s,\rho }^+= \sum _{j\in J} c_j^+ \varvec{1}^{(\varepsilon )}_{A_j} + \sum _{i=1}^m b^+_i \varvec{1}^{(\delta )}_{x_i^* + t B_\infty } \,. \end{aligned}$$

Clearly, we have \(g_{D,s,\rho }^+\in {\mathcal {F}}^{(\varepsilon )}_{s, m}\). We call the map \(D\mapsto g_{D,s,\rho }^+\) a good interpolating DNN and it is easy to see that this network indeed interpolates D. Finally, the bad interpolating DNN \(g_{D,s,\rho }^-\) is defined analogously using the bad interpolating HR from Example 2.9, instead.
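The following sketch (our own naming, reusing the hypothetical indicator_approx and the model returned by the good interpolating histogram sketch of Sect. 2) evaluates the good interpolating DNN \(g_{D,s,\rho }^+\) of Example 3.6 by replacing every hard indicator with its \(\varepsilon\)-, respectively \(\delta\)-approximation for \(\varepsilon =\delta = t/3\). It requires \(t>0\), i.e. \(\rho >0\), and for simplicity only the non-empty cells are summed; on empty cells the histogram value may be taken to be zero.

```python
# A minimal sketch of the good interpolating DNN g^+_{D,s,rho} of Example 3.6,
# built from epsilon-/delta-approximate indicators (our illustration only).
import numpy as np

def predict_good_interpolating_dnn(model, X_new, s):
    cells, offset, X_star, b, t = model             # as returned by fit_good_interpolating_hr
    eps = delta = t / 3.0                           # requires t > 0, i.e. rho > 0
    out = np.zeros(X_new.shape[0])
    for key, c_j in cells.items():                  # epsilon-approximate histogram part
        z1 = offset + s * np.asarray(key)
        out += np.array([c_j * indicator_approx(x, z1, z1 + s, eps) for x in X_new])
    for x_i, b_i in zip(X_star, b):                 # delta-approximate inflation bumps
        out += np.array([b_i * indicator_approx(x, x_i - t, x_i + t, delta) for x in X_new])
    return out
```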

Similarly to our inflated histograms from the previous section, the next theorem shows that the good interpolating DNN is consistent while the bad interpolating DNN fails to be. The proof of this result is given in Appendix D.3.

Theorem 3.7

[(Non)-consistency] Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto g_{D,s,\rho }^+\) denote the good interpolating DNN from Example 3.6. Similarly, let \(D \mapsto g_{D,s,\rho }^-\) denote the bad interpolating DNN from Example 3.6. Assume that \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\), \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\) as well as \(s_n > 2n^{-1/d}\). Additionally, let \((\rho _n)_{n \in \mathbb {N}}\) be a non-negative sequence with \(\rho _n \le 2n^{-1/d}\). Then \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\). Moreover, for all distributions P that satisfy Assumption 2.11 for a function \(\varphi\) with \(\rho _n^{-d} \varphi ( \rho _n ) \rightarrow 0\) for \(n\rightarrow \infty\), we have

$$\begin{aligned} ||g_{D,s_n, \rho _n}^+- {f_{L,P}^*}||_{L_2(P_X)} \rightarrow 0 \,, \end{aligned}$$
(32)
$$\begin{aligned} ||g_{D,s_n, \rho _n}^-- {f_{L,P}^\dagger }||_{L_2(P_X)} \rightarrow 0 \,, \end{aligned}$$
(33)

in probability for \(|D|\rightarrow \infty\).

The above result can further be refined to establishing rates of convergence if the width \(s_n\) and the inflation parameter \(\rho _n\) converge to zero sufficiently fast as \(n \rightarrow \infty\). The proof is provided in Appendix D.4.

Theorem 3.8

[Learning Rates] Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto g_{D,s,\rho }^+\) denote the good interpolating DNN from Example 3.6. Similarly, let \(D \mapsto g_{D,s,\rho }^-\) denote the bad interpolating DNN from Example 3.6. Suppose that \({f_{L,P}^*}\) is \(\alpha\)-Hölder continuous with \(\alpha \in (0,1]\) and that P satisfies Assumption 2.11 for some function \(\varphi\). Assume further that \((s_n)_{n \in \mathbb {N}}\) is a sequence with

$$\begin{aligned} s_n = n^{-\gamma }, \quad \gamma = \frac{1}{2\alpha + d} \end{aligned}$$

and that \((\rho _n)_{n\ge 1}\) is a non-negative sequence with \(\rho _n \le 2n^{-1/d}\) and \(\rho _n^{-d} \varphi (\rho _n) \le \ln (n) n^{-2/3}\) for all \(n\ge 1\). Then there exists a constant \(c_{\alpha ,d}>0\), only depending on d, \(\alpha\), and the Hölder constant \(|f^*_{L,P}|_\alpha\), such that for all \(n\ge 2\) the good interpolating DNN satisfies

$$\begin{aligned} ||g_{D,s_n, \rho _n}^+- {f_{L,P}^*}||_{L_2(P_X)}&\le c_{\alpha ,d} \sqrt{\ln (n)} \left( \frac{1}{n} \right) ^{\alpha \gamma } \;, \end{aligned}$$
(34)

with \(P^n\)-probability not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Furthermore, for all \(n\ge 2\), the bad interpolating DNN satisfies

$$\begin{aligned} ||g_{D,s_n, \rho _n}^-- {f_{L,P}^\dagger }||_{L_2(P_X)}&\le c_{\alpha ,d} \sqrt{\ln (n)} \left( \frac{1}{n} \right) ^{\alpha \gamma }\;, \end{aligned}$$
(35)

with \(P^n\)-probability not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Finally, there exists a natural number \(n_{d, \alpha } > 0\) such that for any \(n \ge n_{d, \alpha }\) we have \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\).

Note that the rates of convergence in (34) and (35) remain true if we consider a sequence \(s_n\) with \(c^{-1} n^{-\gamma } \le s_n \le cn^{-\gamma }\) for some constant c independent of n. In fact, the only reason why we have formulated Theorem 3.8 with \(s_n = n^{-\gamma }\) is to avoid yet another constant appearing in the statements. Moreover, if we choose \(s_n:= 2 a \lfloor n^{\gamma }\rfloor ^{-1}\) with \(a:= 3^{1/d}/(3^{1/d}-2)\), then we have \(|J| \le (2 s_n^{-1} + 2)^d \le (a^{-1}n^{1/d} + 2)^d \le n\) for all \(n\ge 3\). Consequently, for \(m:= n\), we can choose \(n_{d, \alpha }:= 3\), and hence we have \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\) for all \(n\ge 3\), while (34) and (35) hold true modulo a change in the constant \(c_{\alpha ,d}\).
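For orientation, instantiating Theorem 3.8 in the Lipschitz case \(\alpha = 1\) and \(d = 1\) gives \(\gamma = 1/3\), so that (34) reads

$$\begin{aligned} ||g_{D,s_n, \rho _n}^+- {f_{L,P}^*}||_{L_2(P_X)} \le c_{1,1} \sqrt{\ln (n)} \, n^{-1/3} \end{aligned}$$

with \(P^n\)-probability not less than \(1 - 2n^{2} e^{-n^{1/3}}\); that is, the minimax rate \(n^{-\alpha /(2\alpha +d)}\) for \(\alpha\)-Hölder regression functions is attained up to the logarithmic factor.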

4 Discussion and summary of results

In this section we summarize our findings and put them into a broader context.

4.1 Inflated histograms

To put the results from Sect. 2 into context, let us first recall that even for a fixed hypotheses class, ERM is, in general, not a single algorithm but a collection of algorithms. In fact, this ambiguity appears as soon as the ERM optimization problem does not have a unique solution for certain data sets, and as Lemma A.1 shows, this non-uniqueness may occur even for strictly convex loss functions such as the least squares loss. Now, the standard techniques of statistical learning theory are capable of showing that for sufficiently small hypotheses classes, all versions of ERM enjoy good statistical guarantees. In other words, the non-uniqueness of ERM does not affect its learning capabilities as long as the hypotheses class is sufficiently small. In addition, it is folklore that some large hypotheses classes may contain heavily overfitting ERM solutions, leading to the usual conclusion that such hypotheses classes should be avoided.

In contrast to this common wisdom, however, Theorem 2.12 demonstrates that for large hypotheses classes, the situation may be substantially more complicated: First, it shows that there exist empirical risk minimizers whose predictors converge to a function \({f_{L,P}^\dagger }\), see (24), that in almost all interesting cases is far off the target regression function, see (21), confirming that the overfitting issue is indeed present for the chosen hypotheses classes. Moreover, this strong overfitting may actually take place with fast convergence, see (28). Despite this negative result, however, we can also find empirical risk minimizers that enjoy a good learning behavior in terms of consistency (23) and almost optimal learning rates (27). In other words, both the expected overfitting and standard learning guarantees may be realized by suitable versions of empirical risk minimizers over these hypotheses classes. In fact, these two different behaviors are just extreme examples, and a variety of intermediate behaviors are possible, too: Indeed, since the training error can be controlled solely by the corrections on the inflating parts, the behavior of the histogram part h can be chosen arbitrarily. For our theorems above, we have chosen particular good and bad h-parts, respectively, but of course, a variety of other choices leading to intermediate behavior are possible as well. As a consequence, the ERM property of an algorithm working with a large hypotheses class is, in general, no longer sufficient to describe its learning behavior. Instead, additional assumptions are required to determine its learning behavior. In this respect we also note that for our inflated hypotheses classes, other learning algorithms that do not (approximately) minimize the empirical risk may also enjoy good learning properties. Indeed, by setting the inflating parts to zero, we recover standard histograms, which in general do not have close-to-zero training error, but for which the guarantees of our good interpolating predictors also hold true.

Of course, the chosen hypotheses classes may, to some extent, appear artificial. Nonetheless, in Sect. 3 they are key to showing that, for sufficiently large DNN architectures, exactly the same phenomena occur for some of their global minima.

4.2 Neural networks

To fully appreciate Theorems 3.7 and 3.8 as well as their underlying construction, let us discuss their various consequences.

Training. The good interpolating DNN predictors \(g_{D,s_n, \rho _n}^+\) show that it is actually possible to train sufficiently large, over-parameterized DNNs such that they become consistent and enjoy optimal learning rates up to a logarithmic factor, without adapting the network size to the particular smoothness of the target function. In fact, it suffices to consider DNNs with two hidden layers, with 4dn neurons in the first and 2n neurons in the second hidden layer. In other words, Theorems 3.7 and 3.8 already apply to moderately over-parameterized DNNs and, by the particular properties of the ReLU activation function, also to all larger network architectures. In addition, when using architectures of minimal size, training, that is, constructing \(g_{D,s_n, \rho _n}^+\), can be done in \(\mathcal{O}(d^2\cdot n^2)\) time if the NNs are implemented as fully connected networks. Moreover, the constructed NNs have a particularly sparse structure, and exploiting this can reduce the training time to \(\mathcal{O}(d \cdot n\cdot \log n)\). While we present statistically sound end-to-end proofs of consistency and optimal rates for NNs, we also need to admit that our training algorithm is mainly of theoretical interest and useless for practical purposes.
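To illustrate the order of this construction cost, the following is a minimal Python sketch that builds the underlying good interpolating rule as a plain function rather than as an explicit ReLU network. As stand-ins for the quantities defined in (6) and (14), it assumes that \(c_j^+\) is the mean label of the occupied cell \(A_j\) and that \(b_i^+ = y_i - c_{j(i)}^+\), which makes the rule interpolate D; the alignment of the partition, the centers \(x_i^*\), and the inflation parameter \(\rho\) are simplified away. Fitting requires a single pass over the data, while the naive \(\mathcal{O}(n)\) spike lookup per prediction could be replaced by a spatial index.

```python
import numpy as np
from collections import defaultdict

def fit_good_interpolant(X, y, s, t):
    """Sketch of a good interpolating rule: cubic histogram plus local spikes.

    Assumptions (stand-ins for (6) and (14)): the cell value c_j^+ is the mean
    label of cell A_j, and the spike height is b_i^+ = y_i - c_{j(i)}^+, so the
    rule interpolates the data. Partition alignment, the centers x_i^*, and the
    ReLU realization of the indicators are omitted.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)

    def cell_of(z):
        return tuple(np.floor(z / s).astype(int))

    cells = defaultdict(list)
    for i, x in enumerate(X):            # single pass: assign samples to cells
        cells[cell_of(x)].append(i)
    c = {j: y[idx].mean() for j, idx in cells.items()}   # histogram values c_j^+

    def predict(z):
        z = np.asarray(z, dtype=float)
        val = c.get(cell_of(z), 0.0)     # histogram part
        for i, x in enumerate(X):        # spike corrections on x_i + t*B_infty
            if np.max(np.abs(z - x)) <= t:
                val += y[i] - c[cell_of(x)]
                break                    # spike supports assumed disjoint (t <= r)
        return float(val)

    return predict

# Usage: the fitted rule interpolates the training points.
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=20)
f = fit_good_interpolant(X, y, s=0.25, t=1e-3)
assert all(abs(f(x) - yi) < 1e-9 for x, yi in zip(X, y))
```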

Optimization Landscape. Theorems 3.7 and 3.8 also have consequences for DNNs trained by variants of stochastic gradient descent (SGD) if the resulting predictor is interpolating. Indeed, these theorems show that ending in a global minimum may result either in a very good learning behavior or in an extremely overfitting, bad behavior. In fact, all the observations made for histograms at the end of Sect. 2 apply to DNNs, too. In particular, since for \(n\ge n_{d,\alpha }\) the \({\mathcal {A}}_{4dn, 2n}\)-networks can \(\varepsilon\)-approximate all functions in \(\mathcal{F}_{s,n}^*\) for all \(\varepsilon \ge 0\) and all \(s\in [n^{-1/d}, 1]\), we can, for example, find, for each polynomial learning rate slower than \(n^{-\alpha \gamma }\), an interpolating learning method \(D\mapsto f_D\) with \(f_D\in {\mathcal {A}}_{4dn, 2n}\) that learns with this rate. Similarly, we can find interpolating \(f_D\in {\mathcal {A}}_{4dn, 2n}\) with various degrees of bad learning behavior. In summary, the optimization landscape induced by \({\mathcal {A}}_{4dn, 2n}\) contains a wide variety of global minima whose learning properties range somewhat continuously from essentially optimal to extremely poor. Consequently, an optimization guarantee for (S)GD, that is, a guarantee that (S)GD finds a global minimum in the optimization landscape, is useless for learning guarantees unless more information about the particular nature of the minimum found is provided. Moreover, it becomes clear that considering (S)GD without taking the initialization of the weights and biases into account is a meaningless endeavor: For example, constructing \(g_{D,s_n, \rho _n}^{\pm }\) can be viewed as a very particular form of initialization for which (S)GD will not change the parameters anymore. More generally, if the parameters are initialized randomly in the attraction basin of \(g_{D,s_n, \rho _n}^{\pm }\), then GD converges to \(g_{D,s_n, \rho _n}^{\pm }\), and therefore the behavior of GD is completely determined by the initialization. In this respect, note that so far there is no statistically sound way to distinguish between good and bad interpolating DNNs on the basis of the training set alone, and hence the only way to identify good interpolating DNNs obtained by SGD is to use a validation set (that SGD may find bad local minima is shown in Liu et al. (2020)). Now, for the good interpolating DNNs of Theorem 3.7 it is actually possible to construct a finite set of candidates such that the one with the best validation error achieves the optimal learning rates without knowing \(\alpha\). For DNNs trained by SGD, however, we do not have this luxury anymore. Indeed, while we can still identify the best predicting DNN from a finite set of SGD-learned interpolating DNNs, we no longer have any theoretical understanding of whether there is any useful candidate among them, or whether they all behave like a bad \(g_{D,s_n, \rho _n}^-\).

For both consistency and learning with essentially optimal rates it is by no means necessary to find a global minimum, or even a local minimum, in the optimization landscape. For example, the positive learning rates (27) also hold for ordinary cubic histograms with widths \(s_n:= n^{-\gamma }\), and the latter can, of course, also be approximated by \({\mathcal {A}}_{4dn, 2n}\). Repeating the proof of Theorem 3.8, it is easy to verify that these approximations also enjoy the good learning rates (34). Moreover, these approximations \(f_D\) are almost never global minima; more precisely, \(f_D\) is not a global minimum as soon as there exists a cubic cell A containing two samples \(x_i\) and \(x_j\) with different labels, i.e. \(y_i\ne y_j\). In fact, in this case, \(f_D\) is not even a local minimum. To see this, assume without loss of generality that \(x_i\) is one of the samples in A with \(y_i \ne f_D(x_i)\). Considering \(f_{D,\lambda }:= f_D + \lambda b_i^+ \varvec{1}_{x_i+tB_\infty }^{(t/3)}\) for all \(\lambda \in [0,1]\) and \(t:= \min \{r,\rho \}\), we then see that there is a continuous path in the parameter space of \({\mathcal {A}}_{4dn, 2n}\) that corresponds to the \(\Vert \cdot \Vert _\infty\)-continuous path \(\lambda \mapsto f_{D,\lambda }\) in the set of functions \({\mathcal {A}}_{4dn, 2n}\), for which we have \({\mathcal{R}_{L,D}(f_{D,\lambda })} < {\mathcal{R}_{L,D}(f_D)}\) for all \(\lambda \in (0,1]\). In other words, \(f_D\) is not a local minimum. In this respect we note that this phenomenon also occurs, to some extent, in under-parameterized DNNs, at least for \(d=1\). Indeed, if we consider \(m:= 1\) and \(s_n:= n^{-\gamma }\), then \(f_D, f_{D,\lambda }\in {\mathcal {A}}_{4dn^{\gamma d}, 2 n^{\gamma d}}\) for all sufficiently large n. Now, the functions in \({\mathcal {A}}_{4dn^{\gamma d}, 2 n^{\gamma d}}\) have \(\mathcal{O}(d^2 n^{2\gamma d})\) many parameters, and for \(2\gamma d = \frac{2d}{2\alpha +d}< 1\), that is \(\alpha > d/2 = 1/2\), we thus have strictly fewer than \(\mathcal{O}(\sqrt{n})\) neurons and \(\mathcal{O}(n)\) parameters, while all the observations made so far still hold.
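The strict decrease of the empirical risk along the path \(\lambda \mapsto f_{D,\lambda }\) can be made explicit. Assuming, as a stand-in for (14), that \(b_i^+ = y_i - f_D(x_i)\) and that no other sample of \(D_X\) lies in the support of the bump \(\varvec{1}_{x_i+tB_\infty }^{(t/3)}\), only the i-th residual changes, so that for the least squares loss

$$\begin{aligned} {\mathcal{R}_{L,D}(f_{D,\lambda })} - {\mathcal{R}_{L,D}(f_D)} = \frac{1}{n} \bigl ( (1-\lambda )^2 - 1 \bigr ) \bigl ( y_i - f_D(x_i) \bigr )^2 = \frac{\lambda (\lambda - 2)}{n} \bigl ( y_i - f_D(x_i) \bigr )^2 < 0 \end{aligned}$$

for all \(\lambda \in (0,1]\), since \(y_i \ne f_D(x_i)\).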

Finally, we want to mention that a number of recent works analyze concrete efficient GD-type algorithms (Ji et al., 2021; Ji & Telgarsky, 2019; Song et al., 2021; Chen et al., 2021; Kuzborskij & Szepesvari, 2021; Kohler & Krzyzak, 2019; Nguyen & Mücke, 2024; Braun et al., 2024) and SGD-type algorithms (Rolland et al., 2021; Deng et al., 2022; Li & Liang, 2018; Kalimeris et al., 2019; Allen-Zhu et al., 2019; Cao et al., 2024), with a focus on particular algorithmic properties and network architectures (e.g. early stopping and the required degree of over-parameterization) rather than on ERM. Also, the effect of regularization is investigated in e.g. Hu et al. (2021); Wei et al. (2019). Our work differs from the perspective taken in these works in that we aim at a unified theoretical investigation of the qualitatively different learning properties of interpolating ReLU-DNNs.