Abstract
A common strategy to train deep neural networks (DNNs) is to use very large architectures and to train them until they (almost) achieve zero training error. Empirically observed good generalization performance on test data, even in the presence of substantial label noise, corroborates such a procedure. On the other hand, in statistical learning theory it is known that over-fitting models may lead to poor generalization properties, as can occur, e.g., in empirical risk minimization (ERM) over overly large hypotheses classes. Inspired by this contradictory behavior, so-called interpolation methods have recently received much attention, leading, e.g., to consistent and optimally learning local averaging schemes with zero training error. We extend this analysis to ERM-like methods for least squares regression and show that for certain large hypotheses classes of so-called inflated histograms, some interpolating empirical risk minimizers enjoy very good statistical guarantees while others fail in the worst sense. Moreover, we show that the same phenomenon occurs for DNNs with zero training error and sufficiently large architectures.
1 Introduction
During the last few decades statistical learning theory (SLT) has developed powerful techniques to analyze many variants of (regularized) empirical risk minimizers, see e.g. Devroye et al. (1996); Vapnik (1998); van de Geer (2000); Györfi et al. (2002); Steinwart and Christmann (2008); Tsybakov (2009); Shalev-Shwartz and Ben-David (2014). The resulting learning guarantees, which include finite sample bounds, oracle inequalities, learning rates, adaptivity, and consistency, assume in most cases that the effective hypotheses space of the considered method is sufficiently small in terms of some notion of capacity such as VC-dimension, fat-shattering dimension, Rademacher complexities, covering numbers, or eigenvalues.
Most training algorithms for DNNs also optimize a (regularized) empirical error term over a hypotheses space, namely the class of functions that can be represented by the architecture of the considered DNN, see Goodfellow et al. (2016), Part II. However, unlike for many classical empirical risk minimizers, the hypotheses space is parametrized in a rather complicated manner. Consequently, the optimization problem is, in general, harder to solve. A common way to address this in practice is to use very large DNNs, since despite their size, training them is often easier, see e.g. Salakhutdinov (2017), Ma et al. (2018) and the references therein. Now, for sufficiently large DNNs it has recently been observed that common training algorithms can achieve zero training error on randomly or even arbitrarily labeled training sets, see Zhang et al. (2016). Because of this ability, their effective hypotheses space can no longer have a sufficiently small capacity in the sense of classical SLT, so that the usual techniques for analyzing learning algorithms are no longer suitable, see e.g. the discussion in Zhang et al. (2016), Belkin et al. (2018), Nagarajan and Kolter (2019), Zhou et al. (2020), Zhang et al. (2021). In fact, SLT provides well-known examples of large hypotheses spaces for which zero training error is possible but a simple empirical risk minimizer fails to learn. This phenomenon is known as over-fitting, and common wisdom suggests that successful learning algorithms need to avoid over-fitting, see e.g. Györfi et al. (2002), pp. 21–22. The empirical evidence mentioned above thus stands in stark contrast to this credo of SLT.
This somewhat paradoxical behavior has recently sparked interest, leading to deeper theoretical investigations of the so-called double/multiple-descent phenomenon for different model settings. More specifically, Belkin et al. (2020) analyzed linear regression with random feature selection and investigated the random Fourier feature model. This model has also been analyzed by Mei and Montanari (2019). For linear regression, where model complexity is measured in terms of the number of parameters, the authors in Bartlett et al. (2020), Tsigler and Bartlett (2020) show that over-parameterization is even essential for benign over-fitting. However, these results are highly distribution dependent and require a specific covariance structure and (sub-) Gaussian data. For more details we refer also to Belkin et al. (2018); Chen et al. (2020); Liang et al. (2020); Neyshabur et al. (2019); Allen-Zhu et al. (2019). Another line of research (Belkin et al., 2019) shows for classical learning methods, namely the Nadaraya-Watson estimator with certain singular kernels, that interpolating the training data can achieve optimal rates for problems of nonparametric regression and prediction with the square loss.
Nonparametric regression with DNNs has been analyzed by various authors, see e.g. McCaffrey and Gallant (1994); Kohler and Krzyżak (2005); Kohler and Langer (2021); Suzuki (2018); Yarotsky (2018) and references therein. In particular, we highlight Schmidt-Hieber (2020), where it is shown that sparsely connected ReLU-DNNs achieve the minmax rates of convergence up to log-factors under a general assumption on the regression function. In Kohler and Langer (2021) the authors show that sparsity is not necessary: optimal rates of convergence have also been established for fully connected feedforward neural networks with ReLU activation. Here, an important observation is that DNNs are able to circumvent the curse of dimensionality (Bauer & Kohler, 2019). More structured input data are investigated in Kohler et al. (2023).
Beyond this empirical evidence, there are thus also theoretical results showing that interpolating the data and achieving good learning performance are simultaneously possible. So far, however, the considered interpolating learning methods neither implement an empirical risk minimization (ERM) scheme nor closely resemble the learning mechanisms of DNNs. In this paper, we take a step towards closing this gap.
First, we explicitly construct, for data sets of size n, large classes of hypotheses \({\mathcal {H}}_n\) for which we show that some interpolating least squares ERM algorithms over \({\mathcal {H}}_n\) enjoy very good statistical guarantees, while other interpolating least squares ERM algorithms over \({\mathcal {H}}_n\) fail in a strong sense. To be more precise, we observe the following phenomena: There exists a universally consistent empirical risk minimizer and there exists an empirical risk minimizer whose predictors converge to the negative regression function for most distributions. In particular, the latter empirical risk minimizer is not consistent for most distributions, and even worse, the obtained risks are usually far off the best possible risk. We further construct modifications that enjoy minmax optimal rates of convergence up to some log factor under standard assumptions. In addition, there are also ERM algorithms that exhibit an intermediate behavior between these two extreme cases, with arbitrarily slow convergence.
The finding that an interpolating estimator is not necessarily benign is a known fact. For instance, Zhou et al. (2020), Yang et al. (2021), Bartlett and Long (2021), Koehler et al. (2021) show that the uniform bound fails to give the precise evaluation of the risk for minimum norm interpolators in over-parameterized linear models, and there are estimators that provably give worse errors than the minimum norm interpolator. In this paper, we analyze a different setting and different class of estimators. To put our results in perspective, we note that classical SLT shows that for sufficiently small hypotheses classes, all versions of empirical risk minimizers enjoy good statistical guarantees. In contrast, our results demonstrate that this is no longer true for large hypotheses classes. For such hypotheses spaces, the description empirical risk minimizer is thus not sufficient to identify well-behaving learning algorithms. Instead, the class of algorithms described by ERM over such hypotheses spaces may encompass learning algorithms with extremely distinct learning behavior.
Second, we show that exactly the same phenomena occur for interpolating ReLU-DNNs of at least two hidden layers with widths growing linearly in both input dimension d and sample size n. We present DNN training algorithms that produce interpolating predictors and that enjoy consistency and optimal rates, at least up to some log factor. In addition, this training can be done in \({\mathcal {O}}(d^2\cdot n^2)\)-time if the DNNs are implemented as fully connected networks. Since the constructed predictors have a particularly sparse structure, the training time can actually be reduced to \({\mathcal {O}}(d\cdot n \cdot \log n)\) by implementing the DNNs as loosely connected networks. Moreover, we show that there are other efficient and feasible training algorithms for exactly the same architectures that fail in the worst possible sense, and like in the ERM case, there are also a variety of training algorithms performing in between these two extreme cases.
The rest of the paper is organized as follows: In Sect. 2 we first recall classical histograms as ERMs and then extend them to the class of inflated histograms. We provide specific examples of interpolating predictors from that class, and in our main theorems we derive consistency results and learning rates. In Sect. 3 we then explain how inflated histograms can be approximated by ReLU networks with analogous learning properties. We discuss our results in Sect. 4.
All our proofs are deferred to the Appendices A, B, C, and D. Finally, in the supplementary material E we derive general uniform bounds for histograms based on data-dependent partitions. This result is needed for proving our main results and is of independent interest.
2 The histogram rule revisited
In this section, we reconsider the histogram rule within the framework of regression. Specifically, in Sect. 2.1, we recall the classical histogram rule and demonstrate how to modify it to obtain a predictor that can interpolate the given data. In Sect. 2.2, we construct specific interpolating empirical risk minimizers for a broad class of loss functions. The core idea is to begin with classical histogram rules and then expand their hypothesis spaces, allowing us to find interpolating empirical risk minimizers within these enlarged spaces. Section 2.3 presents a generic oracle inequality, while Sect. 2.4 focuses on learning rates for the least squares loss.
To begin with, let us introduce some necessary notations. Throughout this work, we consider \(X:=[-1,1]^d\) and \(Y=[-1,1]\) if not specified otherwise. Moreover, \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) denotes the loss function. If not specified otherwise, we restrict ourselves to the least squares loss \(L(x,y,f(x))=(y-f(x))^2\). Given a dataset \(D:= ((x_1, y_1),...,(x_n, y_n)) \in (X\times Y)^n\) drawn i.i.d. from an unknown distribution P on \(X \times Y\), the aim of supervised learning is to build a function \(f_D: X \rightarrow \mathbb {R}\) based on D such that its risk
$$\begin{aligned} {\mathcal {R}}_{L,P}(f_D) := \int _{X\times Y} L\bigl (x,y,f_D(x)\bigr ) \, \textrm{d}P(x,y) \end{aligned}$$
is close to the smallest possible risk
$$\begin{aligned} {\mathcal {R}}^*_{L,P} := \inf \bigl \{ {\mathcal {R}}_{L,P}(f) \, : \, f:X\rightarrow \mathbb {R}\text { measurable} \bigr \} \, . \end{aligned}$$
In the following, \({\mathcal{R}^*_{L,P}}\) is called the Bayes risk and an \({f_{L,P}^*}: X \rightarrow \mathbb {R}\) satisfying \({\mathcal{R}_{L,P}(f^*_{L,P})} = {\mathcal{R}^*_{L,P}}\) is called a Bayes decision function. Recall that for the least squares loss, \({f_{L,P}^*}\) equals the conditional mean function, i.e. \({f_{L,P}^*}(x) = \mathbb {E}_P(Y|x)\) for \(P_X\)-almost all \(x\in X\), where \(P_X\) denotes the marginal distribution of P on X. In general, estimators \(f_D\) having small excess risk
$$\begin{aligned} {\mathcal {R}}_{L,P}(f_D) - {\mathcal {R}}^*_{L,P} = \Vert f_D - f^*_{L,P}\Vert _{L_2(P_X)}^2 \, , \end{aligned}$$(3)
where \(\Vert \cdot \Vert _{L_2(P_X)}\) denotes the usual \(L_2\)-norm with respect to \(P_X\), are considered as good in classical statistical learning theory.
Now, to describe the class of learning algorithms we are interested in, we need the empirical risk of an \(f:X\rightarrow \mathbb {R}\), i.e.
$$\begin{aligned} {\mathcal {R}}_{L,D}(f) := \frac{1}{n}\sum _{i=1}^n L\bigl (x_i,y_i,f(x_i)\bigr ) \, . \end{aligned}$$
Recall that an empirical risk minimizer over some set \(\mathcal{F}\) of functions \(f:X\rightarrow \mathbb {R}\) chooses, for every data set D, an \(f_D \in {\mathcal {F}}\) that satisfies
$$\begin{aligned} {\mathcal {R}}_{L,D}(f_D) = \inf _{f\in {\mathcal {F}}} {\mathcal {R}}_{L,D}(f) \, . \end{aligned}$$
Note that the definition of empirical risk minimizers implicitly requires that the infimum on the right hand side is attained, namely by \(f_D\). In general, however, \(f_D\) does not need to be unique. It is well-known that if we have a suitably increasing sequence of hypotheses classes \({\mathcal {F}}_n\) with controlled capacity, then every empirical risk minimizer \(D\mapsto f_D\) that ensures \(f_D \in {\mathcal {F}}_n\) for all data sets D of length n learns in the sense of e.g. universal consistency, and under additional assumptions it may also enjoy minmax optimal learning rates, see e.g. Devroye et al. (1996), van de Geer (2000), Györfi et al. (2002); Steinwart and Christmann (2008).
2.1 Classical histograms
Particularly simple empirical risk minimizers are histogram rules (HRs). To recall the latter, we fix a finite partition \(\mathcal{A} = (A_j)_{j\in J}\) of X and for \(x\in X\) we write A(x) for the unique cell \(A_j\) with \(x\in A_j\). Moreover, we define
$$\begin{aligned} {\mathcal {H}}_{\mathcal {A}} := \Bigl \{ \sum _{j\in J} c_j \varvec{1}_{A_j} \, : \, c_j \in Y \text { for all } j\in J \Bigr \} \, , \end{aligned}$$(4)
where \(\varvec{1}_{A_j}\) denotes the indicator function of the cell \(A_j\). Now, given a data set D and a loss L an \(\mathcal{A}\)-histogram is an \(h_{D, \mathcal{A}} = \sum _{j=1}^m c_j^*\varvec{1}_{A_j} \in {\mathcal {H}}_{\mathcal{A}}\) that satisfies
for all so-called non-empty cells \(A_j\), that is, cells \(A_j\) with \(N_j:=|\{i: x_i \in A_j\}| >0\). Clearly, \(D\mapsto h_{D, \mathcal{A}}\) is an empirical risk minimizer. Moreover, note that in general \(h_{D, \mathcal{A}}\) is not uniquely determined, since \(c_j^*\in Y\) can take arbitrary values for empty cells \(A_j\). In particular, there is more than one empirical risk minimizer over \({\mathcal {H}}_{\mathcal{A}}\) as soon as \(m,n\ge 2\).
Before we proceed, let us consider the specific example of the least squares loss in more detail. Here, a simple calculation shows, see Lemma A.1, that for all non-empty cells \(A_j\), the coefficient \(c_j^*\) in (5) is uniquely determined by
$$\begin{aligned} c_j^* = \frac{1}{N_j} \sum _{i: x_i \in A_j} y_i \, . \end{aligned}$$(6)
In the following, we call every resulting \(D\mapsto h_{D, \mathcal{A}}\) with
an empirical HR for regression with respect to the least-squares loss L. For later use we also introduce an infinite sample version of a classical histogram
for all cells \(A_j\) with \(P_X(A_j)>0\). Similarly to empirical histograms one has
We are mostly interested in HRs on \(X=[-1,1]^d\) whose underlying partition essentially consists of cubes with a fixed width. To rigorously deal with boundary effects, we first say that a partition \((B_j)_{j\ge 1}\) of \(\mathbb {R}^d\) is a cubic partition of width \(s>0\), if each cell \(B_j\) is a translated version of \([0,s)^d\), i.e. there is an \(x^\dagger \in \mathbb {R}^d\), called offset, such that for all \(j\ge 1\) there exists a \(k_j:= ( k_{j,1},\dots , k_{j,d})\in \mathbb {Z}^d\) with
$$\begin{aligned} B_j = x^\dagger + s\cdot k_j + [0,s)^d \, . \end{aligned}$$(8)
Now, a partition \(\mathcal{A} = (A_j)_{j\in J}\) of \(X=[-1,1]^d\) is called a cubic partition of width \(s>0\), if there is a cubic partition \({\mathcal {B}}=(B_j)_{j\ge 1}\) of \(\mathbb {R}^d\) with width \(s>0\) such that \(J= \{j\ge 1: B_j \cap X \ne \emptyset \}\) and \(A_j = B_j\cap X\) for all \(j\in J\). If \(s\in (0,1]\), then, up to reordering, this \((B_j)_{j\ge 1}\) is uniquely determined by \(\mathcal{A}\).
If the hypotheses space (4) is based on a cubic partition of \(X=[-1,1]^d\) with width \(s>0\), then the resulting HRs are well understood. For example, universal consistency and learning rates have been established, see e.g. Devroye et al. (1996); Györfi et al. (2002). In general, these results only require a suitable choice for the widths \(s=s_n\) for \(n\rightarrow \infty\) but no specific choice of the cubic partition of width s. For this reason we write \(\mathcal{H}_s:= \bigcup \mathcal{H}_{\mathcal{A}}\), where the union runs over all cubic partitions \(\mathcal{A}\) of X with fixed width \(s\in (0,1]\).
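For illustration, the following minimal Python sketch (our own, not part of the formal analysis; the function name fit_cubic_histogram, the zero offset, and the zero prediction on empty cells are illustrative choices) fits such an empirical HR for the least squares loss over a cubic partition of width s.

```python
import numpy as np

def fit_cubic_histogram(X, y, s, offset=0.0):
    """Least squares histogram over a cubic partition of [-1, 1]^d of width s.

    Each cell is indexed by the integer vector floor((x - offset) / s); on every
    non-empty cell the prediction is the average of the labels falling into it,
    i.e. the cell-wise least squares minimizer."""
    keys = np.floor((np.asarray(X, float) - offset) / s).astype(int)
    cells = {}
    for k, target in zip(map(tuple, keys), y):
        cells.setdefault(k, []).append(target)
    coeffs = {k: float(np.mean(v)) for k, v in cells.items()}

    def predict(X_new):
        new_keys = np.floor((np.asarray(X_new, float) - offset) / s).astype(int)
        # empty cells: predict 0 (any value in Y = [-1, 1] yields an ERM)
        return np.array([coeffs.get(tuple(k), 0.0) for k in new_keys])

    return predict

# toy usage: d = 2, n = 200, clipped noisy regression function
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.clip(0.5 * X[:, 0] + 0.1 * rng.normal(size=200), -1.0, 1.0)
h_D = fit_cubic_histogram(X, y, s=0.25)
print(h_D(X[:5]), y[:5])
```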
2.2 Interpolating predictors and inflated histograms
In this section we construct particular interpolating empirical risk minimizers for a broad class of losses.
Definition 2.1
(Interpolating Predictor) We say that an \(f:X\rightarrow Y\) interpolates D, if
where we emphasize that the infimum is taken over all \(\mathbb {R}\)-valued functions, while f is required to be Y-valued.
Clearly, an \(f:X\rightarrow Y\) interpolates D if and only if
where \(x_1^*,\dots , x_m^*\) are the elements of \(D_X:= \{x_i: i=1,\dots ,n\}\).
It is easy to check that for the least squares loss L and all data sets D there exists an \(f_D^*\) interpolating D. Moreover, we have \({\mathcal{R}^*_{L,D}} > 0\) if and only if D contains contradicting samples, i.e. \(x_i = x_k\) but \(y_i \ne y_k\). Finally, if \({\mathcal{R}^*_{L,D}} = 0\), then any interpolating \(f_D^*\) needs to satisfy \(f_D^*(x_i) = y_i\) for all \(i=1,\dots ,n\).
Definition 2.2
(Interpolatable Loss) We say that L is interpolatable for D if there exists an \(f:X\rightarrow Y\) that interpolates D, i.e. \({\mathcal{R}_{L,D}(f)} = {\mathcal{R}^*_{L,D}}\).
Note that (9) in particular ensures that the infimum over \(\mathbb {R}\) on the right is attained at some \(c^*_i\in Y\). Many common losses, including the least squares, the hinge, and the classification loss, are interpolatable for all D, and for these losses we have \({\mathcal{R}^*_{L,D}} > 0\) if and only if D contains contradicting samples, i.e. \(x_i = x_k\) but \(y_i \ne y_k\). Moreover, for the least squares loss, \(c^*_i\) can be easily computed by averaging over all labels \(y_k\) that belong to some sample \(x_k\) with \(x_k = x_i^*\).
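As a small illustration of this last remark (the helper name is ours), the values \(c^*_i\) for the least squares loss are obtained by averaging the labels of coinciding covariates:

```python
import numpy as np

def interpolation_targets(X, y):
    """For the least squares loss, the optimal value c_i^* at a distinct covariate
    x_i^* is the average of all labels y_k whose sample satisfies x_k = x_i^*."""
    groups = {}
    for x, target in zip(map(tuple, np.asarray(X, float)), y):
        groups.setdefault(x, []).append(target)
    xstar = np.array(list(groups.keys()))
    cstar = np.array([np.mean(v) for v in groups.values()])
    return xstar, cstar   # distinct points x_1^*, ..., x_m^* and targets c_1^*, ..., c_m^*

# contradicting samples: x = 0.5 carries the labels 1 and 0, hence c^* = 0.5 there
X = np.array([[0.5], [0.5], [-0.2]])
y = np.array([1.0, 0.0, -0.3])
print(interpolation_targets(X, y))
```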
Let us now describe more precisely the inflated versions of \(\mathcal{H}_s\). For \(r,s>0\) and \(m\ge 0\) we want to consider functions of the form
$$\begin{aligned} f = h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty } \end{aligned}$$
with \(h\in \mathcal{H}_s,\, b_i \in 2Y,\, x_i^* \in X,\) and \(t \in [0,r]\), where \(B_\infty := [-1,1]^d\). In other words, for \(m\ge 1\), such an f changes a classical histogram \(h \in \mathcal{H}_s\) on at most m small neighborhoods of some arbitrary points \(x_1^*,\dots ,x_m^*\) in X. Such changes are useful for finding interpolating predictors. In general, however, these small neighborhoods \(x_i^* + t B_\infty\) may intersect and may be contained in more than one cell \(A_j\) of the considered partition \(\mathcal{A}\) with \(h \in {\mathcal {H}}_{\mathcal {A}}\). To avoid undesired boundary effects we restrict the class of all admissible cubic partitions \({\mathcal {A}}\) of X associated with h. An additional technical difficulty arises in particular when constructing interpolating predictors, since the set of points \(\{x_1^*,..., x_m^*\}\subset X\) is then naturally given by the random input variables. As a consequence, the admissible cubic partitions become data-dependent. As a next step, we introduce the notion of a partitioning rule. To this end, we write
$$\begin{aligned} \textrm{Pot}_m(X) := \bigl \{ A \subset X \, : \, |A| = m \bigr \} \end{aligned}$$
for the set of all subsets of X having cardinality m. Moreover, we denote the set of all finite partitions of X by \({\mathcal {P}}(X)\).
Definition 2.3
Given an integer \(m\ge 1\), an m-sample partitioning rule for X is a map \(\pi _m: \textrm{Pot}_m(X)\rightarrow {\mathcal {P}}(X)\), i.e. a map that associates to every subset \(\{x_1^*,..., x_m^*\}\subset X\) of cardinality m a finite partition \(\mathcal{A}\). Additionally, we will call an m-sample partitioning rule that assigns to any such \(\{x_1^*,..., x_m^*\}\in \textrm{Pot}_m(X)\) a cubic partition with fixed width \(s \in (0,1]\) an m-sample cubic partitioning rule and write \(\pi _{m,s}\).
Next we explain in more detail which particular partitions are considered as admissible.
Definition 2.4
(Proper Alignment) Let \({\mathcal {A}}\) be a cubic partition of X with width \(s\in (0,1]\), \({\mathcal {B}}\) be the partition of \(\mathbb {R}^d\) that defines \(\mathcal{A}\), and \(r\in (0,s)\). We say that \({\mathcal {A}}\) is properly aligned to the set of points \(\{x_1^*,..., x_m^*\}\in \textrm{Pot}_m(X)\) with parameter r, if for all \(i,k=1,\dots ,m\) we have
where \(B(x_i^*)\) is the unique cell of \({\mathcal {B}}\) that contains \(x_i^*\).
Clearly, if \({\mathcal {A}}\) is properly aligned with parameter \(r> 0\), then it is also properly aligned for any parameter \(t \in [0, r]\) for the same set of points \(\{x_j^*\}_{j=1}^m\) in \(\textrm{Pot}_m(X)\). Moreover, any cubic partition \({\mathcal {A}}\) of X with width \(s>0\) is properly aligned with the parameter \(r=0\) for any set of points \(\{x_j^*\}_{j=1}^m\) in \(\textrm{Pot}_m(X)\).
In what follows, we establish the existence of cubic partitions \({\mathcal {A}}\) that are properly aligned to a given set of points with parameter \(r>0\) being sufficiently small. In other words, we construct a special m-sample cubic partitioning rule \(\pi _{m,s}\). We call henceforth any such rule \(\pi _{m,s}\) an m-sample properly aligned cubic partitioning rule. To this end, let \(D_X:= \{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) be a set of points and note that (12) holds for all \(r>0\) satisfying
Clearly, a brute-force algorithm finds such an r in \(\mathcal{O}(dm^2)\)-time. However, a smarter approach is to first sort the first coordinates \(x_{1,1}^*,\dots , x_{m,1}^*\) and to determine the smallest positive distance \(r_1\) between two consecutive, non-identical ordered coordinates. This approach is then repeated for the remaining \(d-1\) coordinates, so that at the end we have \(r_1,\dots ,r_d>0\). Then
satisfies (12) and the resulting algorithm requires \(\mathcal{O}(d\cdot m \log m)\) time. Our next result shows that we can also ensure (11) by jiggling the cubic partitions. Being rather technical, the proof is deferred to Appendix B.
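For illustration (our own sketch; the exact constant used for \(r^*\) in (13) is part of the formal construction), the sorting step reads as follows. Since two distinct points differ by at least \(\min _\ell r_\ell\) in the sup-norm, every radius below \(\min _\ell r_\ell /2\) keeps their r-neighborhoods disjoint.

```python
import numpy as np

def separation_radii(Xstar):
    """Coordinate-wise smallest positive gaps r_1, ..., r_d of the distinct points
    x_1^*, ..., x_m^*, computed by sorting each coordinate (O(d * m * log m))."""
    radii = []
    for coord in np.asarray(Xstar, float).T:
        gaps = np.diff(np.sort(coord))
        gaps = gaps[gaps > 0]
        radii.append(gaps.min() if gaps.size else np.inf)
    return np.array(radii)

Xstar = np.array([[0.1, -0.4], [0.1, 0.2], [0.7, 0.2]])
r = separation_radii(Xstar)   # here r_1 = r_2 = 0.6
print(r, r.min() / 3)         # any radius below min_l r_l / 2 works; we take a third
```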
Theorem 2.5
(Existence of Properly Aligned Cubic Partitioning Rule) For all \(d\ge 1\), \(s\in (0,1]\), and \(m\ge 1\) there exists an m-sample cubic partitioning rule \(\pi _{m,s}\) with \(|{{\,\textrm{Im}\,}}(\pi _{m,s})|\le (m+1)^d\) that assigns to each set of points \(D_X:=\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) a cubic partition \(\mathcal{A}\) that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r:= r_{D_X}:= \min \{r^*,\frac{s}{3\,m+3}\}\), where \(r^*= r^*_{D_X}\) is defined in (13).
The construction of an m-sample cubic partitioning rule \(\pi _{m,s}\) basically relies on the representation (8) of cubic partitions \({\mathcal {B}}\) of \(\mathbb {R}^d\). In fact, the proof of Theorem 2.5 shows that there exists a finite set \(x_1^\dagger ,..., x^\dagger _K \in \mathbb {R}^d\) of candidate offsets, with \(K=(m+1)^d\). While at first glance this number seems to be prohibitively large for an efficient search, it turns out that the proof of Theorem 2.5 actually provides a simple \(\mathcal{O}(d\cdot m)\)-time algorithm that identifies, coordinate-wise, the offset \(x_\ell ^\dagger\) leading to \(\pi _{m,s}(\{x_1^*,..., x_m^*\})\).
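The exact offset search is part of the proof in Appendix B; the following sketch (our own, with the hypothetical name aligned_offset and the simplifying assumption of equally spaced candidate offsets \(k\cdot s/(m+1)\)) only illustrates the pigeonhole idea: since \(2r < s/(m+1)\), each sample can block at most one candidate per coordinate, so an unblocked offset always exists and is found in \(\mathcal{O}(d\cdot m)\) time.

```python
import numpy as np

def aligned_offset(Xstar, s, r):
    """Coordinate-wise pigeonhole search for an offset such that every point of
    Xstar keeps distance > r from all grid lines of the width-s cubic partition
    with that offset.  Per coordinate there are m + 1 candidates spaced
    s / (m + 1) apart, and each point blocks at most one of them."""
    Xstar = np.asarray(Xstar, float)
    m, d = Xstar.shape
    spacing = s / (m + 1)
    assert 2 * r < spacing, "r too large for the pigeonhole argument"
    offset = np.zeros(d)
    for l in range(d):
        blocked = np.zeros(m + 1, dtype=bool)
        for f in np.mod(Xstar[:, l], s):
            k_near = int(round(f / spacing))
            for k in (k_near - 1, k_near, k_near + 1):
                k %= m + 1
                circ = min(abs(f - k * spacing), s - abs(f - k * spacing))
                if circ <= r:           # candidate k puts a grid line r-close to f
                    blocked[k] = True
        offset[l] = spacing * int(np.flatnonzero(~blocked)[0])
    return offset

Xstar = np.array([[0.05, -0.3], [0.41, 0.9], [-0.77, 0.12]])
s, m = 0.5, Xstar.shape[0]
print(aligned_offset(Xstar, s, r=s / (3 * m + 3)))
```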
Being now well prepared, we introduce the class of inflated histograms.
Definition 2.6
Let \(s \in (0,1]\) and \(m\ge 1\). Then a function \(f:X\rightarrow Y\) is called an m-inflated histogram of width s, if there exist a subset \(\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) and a cubic partition \(\mathcal{A}\) of width s that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r\in [0,s)\) such that
$$\begin{aligned} f = h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty } \, , \end{aligned}$$
where \(h\in {\mathcal {H}}_{\mathcal {A}}\), \(t\in [0,r]\), and \(b_i\in 2Y\) for all \(i=1,\dots ,m\). We denote the set of all m-inflated histograms of width s by \({\mathcal {F}}_{s, m}\). Moreover, for \(k\ge 1\) we write
Note that the condition \(t\le r < s\) ensures that the representation \(f= h + \sum _{i=1}^m b_i \varvec{1}_{x_i^* + t B_\infty }\) of any \(f\in {\mathcal {F}}_{s, k}\) is unique. In addition, given an \(f\in \mathcal{F}^*_{s,k}\), the number m of inflation points \(\{x_1^*,..., x_m^*\}\) is uniquely determined, too, and hence so is the representation of f. For a depiction of an inflated histogram for regression (with and without proper alignment) we refer to Fig. 1.
Fig. 1 Left. Depiction of an inflated histogram for regression for a cubic partition \({\mathcal {A}}=(A_j)_{j \in J}\) that is not properly aligned to the data (black crosses). The predictions \(c_i^*\) and \(c_j^*\) on the associated cells \(A_i\) and \(A_j\) are calculated according to (6), i.e. by a local average. Mispredicted samples are corrected according to (14) on a \(tB_\infty\)-neighborhood for some small \(t > 0\). Note that one sample is too close to the cell boundary, i.e. (11) is violated. Right. An inflated histogram that is properly aligned to the same data set. Note that (11) ensures that boundary effects as for the left HR do not take place. For inflated histograms these effects seem to be a negligible technical nuisance. For their DNN counterparts considered in Sect. 3, however, such effects may significantly complicate the constructions of interpolating predictors, see Fig. 2 (Color figure online)
So far we have formalized the notion of interpolation and defined an appropriate inflated hypotheses class for modified histograms. The m-inflated histograms in Definition 2.6 can attain any values in Y that arise from classical histograms or from changes of classical histograms on a discrete set (note that this implicitly restricts the choices of the \(b_i\)). This is a quite general definition and m-inflated histograms do not need to be interpolating. In our next result we go a step further by providing a sufficient condition for the existence of interpolating predictors in \({\mathcal {F}}_{s, m}\). The idea is to give a condition on the \(b_i\) that ensures that an inflated histogram is interpolating. This depends, of course, on the \(c_j\).
Proposition 2.7
Let L be a loss that is interpolatable for \(D=((x_1,y_1),\dots ,(x_n,y_n))\) and let \(x_1^*,\dots , x_m^*\) be as in (9). Moreover, for \(s\in (0,1]\) and \(r>0\) we fix an \(f^*\in {\mathcal {F}}_{s, m}\) with representation as given in Definition 2.6. For \(i=1,\dots ,m\) let \(j_i\) be the index such that \(x_i^* \in A_{j_i}\). Then \(f^*\) interpolates D, if for all \(i=1,\dots ,m\) we have
Proof of Proposition 2.7
By our assumptions we have
where the last equality is a consequence of the fact that there is an \(f:X\rightarrow Y\) satisfying (9). Moreover, since (11) and (12) hold, we find \(f^*(x_i^*) = h(x_i^*) + b_i = c_{j_i} + b_i = c_i^*\), and therefore \(f^*\) interpolates D by (9).\(\square\)
Note that for all \(c_{j_i} \in Y\) the value \(b_i\) given by (14) satisfies \(b_i \in 2Y\) and we have \(b_i=0\) if \(c_{j_i}\) is contained in the \(\arg \min\) in (14). Consequently, defining \(b_i\) by (14) always gives an interpolating \(f^*\in {\mathcal {F}}_{s, m}\). Moreover, (14) shows that an interpolating \(f^*\in {\mathcal {F}}_{s, m}\) can have an arbitrary histogram part \(h\in \mathcal{H}_{\mathcal{A}}\), that is, the behavior of \(f^*\) outside the small \(tB_\infty\)-neighborhoods around the samples of D can be arbitrary. In other words, as soon as we have found a properly aligned cubic partition \(\mathcal{A}\) in the sense of \({\mathcal {F}}_{s,m}\), we can pick an arbitrary histogram \(h\in \mathcal{H}_{\mathcal{A}}\) and compute the \(b_i\)’s by (14). Intuitively, if the chosen \(tB_\infty\)-neighborhoods are sufficiently small, then the prediction capabilities of the resulting interpolating predictor are (mostly) determined by the chosen histogram part \(h\in \mathcal{H}_{\mathcal{A}}\). Based on this observation, we can now construct different, interpolating \(f^*_D\in {\mathcal {F}}_{s, m}\) that have particularly good and bad learning behaviors.
Example 2.8
(Good interpolating histogram rule) Let L be the least squares loss, \(s\in (0,1]\) be a cell width, \(\rho \ge 0\) be an inflation parameter, and \(D=((x_1,y_1),\dots ,(x_n,y_n))\) be a data set. By \(D_X=\{x_1^*,...,x_m^*\}\) we denote the set of all covariates \(x_j \in X\) with \((x_j, y_j)\) belonging to the data set. For \(m=|D_X|\), Theorem 2.5 ensures the existence of a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\) with width \(s\in (0,1]\) that is properly aligned to \(D_X\) with the data-dependent parameter r. Based on this data-dependent cubic partition \({\mathcal {A}}_D\) we fix an empirical histogram for regression
$$\begin{aligned} h_{D,\mathcal{A}_D}^+ := \sum _{j\in J} c_j^+ \varvec{1}_{A_j} \end{aligned}$$(15)
with coefficients \((c_j^+)_{j \in J}\) precisely given in (6). Applying now Proposition 2.7 gives us an \(f_{D,s,\rho }^+\in {\mathcal {F}}_{s, m} \subset \mathcal{F}^*_{s,n}\), which interpolates D and has the representation
$$\begin{aligned} f_{D,s,\rho }^+ = h_{D,\mathcal{A}_D}^+ + \sum _{i=1}^m b_i^+ \varvec{1}_{x_i^* + t B_\infty } \, , \end{aligned}$$
where the \(b^+_1,\dots ,b_m^+\) are calculated according to the rule (14), and \(t:= \min \{r, \rho \}\) is again data-dependent. We call the map \(D\mapsto f_{D,s,\rho }^+\) a good interpolating histogram rule.
Example 2.9
(Bad interpolating histogram rule) Let L be the least squares loss, \(s\in (0,1]\) be a cell width, \(\rho \ge 0\) be an inflation parameter, and \(D=((x_1,y_1),\dots ,(x_n,y_n))\) be a data set. Consider again a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\) with width \(s\in (0,1]\) that is properly aligned to \(D_X\) with parameter r, and fix an empirical histogram \(h_{D,\mathcal{A}_D}^+ \in \mathcal{H}_{s}\) as in (15). Setting \(t:= \min \{r, \rho \}\), we define a predictor \(f_{D,s,\rho }^-\in {\mathcal {F}}_{s, m}\) by
$$\begin{aligned} f_{D,s,\rho }^- := h_{D,\mathcal{A}_D}^- + \sum _{i=1}^m b_i^- \varvec{1}_{x_i^* + t B_\infty } \end{aligned}$$
with \(\mathcal{H}_{\mathcal{A}}\)-part \(h_{D,\mathcal{A}_D}^-:= - h_{D,\mathcal{A}_D}^+\). The \(b^-_1,\dots ,b_m^-\) are calculated according to (14) and satisfy
$$\begin{aligned} b_i^- = c_i^* + c_{j_i}^+ \end{aligned}$$
for any \(i=1,...,m\) and where \(j_i\) denotes the index such that \(x_i^* \in A_{j_i}\). By writing
we easily see that the definition of \(f_{D,s,\rho }^-\) gives \(f_{D,s,\rho }^-\in {\mathcal {F}}_{s, m} \subset \mathcal{F}^*_{s,n}\) and
while Proposition 2.7 ensures that \(f_{D,s,\rho }^-\) interpolates D. We call the map \(D\mapsto f_{D,s,\rho }^-\) a bad interpolating histogram rule and remark that t is, like for good interpolating histogram rules, data-dependent.
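To make Examples 2.8 and 2.9 concrete, the following self-contained Python sketch (our own illustration; for brevity it uses the zero partition offset instead of the properly aligned, data-dependent offset of Theorem 2.5) implements both rules. The only difference between the good and the bad rule is the sign of the histogram part; the corrections \(b_i\) are then computed as in (14).

```python
import numpy as np

def fit_interpolating_histogram(X, y, s, t, sign=+1):
    """Good (sign=+1) or bad (sign=-1) interpolating histogram rule: a cell-wise
    average histogram (negated for the bad rule) plus corrections b_i on tiny
    t-neighborhoods of the samples so that the data are interpolated."""
    X, y = np.asarray(X, float), np.asarray(y, float)

    # histogram part h: cell-wise label averages (the coefficients c_j^+)
    cells = {}
    for k, target in zip(map(tuple, np.floor(X / s).astype(int)), y):
        cells.setdefault(k, []).append(target)
    coeffs = {k: sign * float(np.mean(v)) for k, v in cells.items()}

    def h(Z):
        keys = np.floor(np.asarray(Z, float) / s).astype(int)
        return np.array([coeffs.get(tuple(k), 0.0) for k in keys])

    # interpolation targets c_i^* at the distinct covariates and corrections b_i
    groups = {}
    for x, target in zip(map(tuple, X), y):
        groups.setdefault(x, []).append(target)
    xstar = np.array(list(groups.keys()))
    cstar = np.array([np.mean(v) for v in groups.values()])
    b = cstar - h(xstar)                       # b_i = c_i^* - c_{j_i}, cf. (14)

    def predict(Z):
        Z = np.asarray(Z, float)
        # add b_i on the closed box x_i^* + t B_infinity around each sample
        hit = np.all(np.abs(Z[:, None, :] - xstar[None, :, :]) <= t, axis=2)
        return h(Z) + hit.astype(float) @ b

    return predict

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.clip(X[:, 0] ** 2 - 0.5 + 0.1 * rng.normal(size=100), -1, 1)
good = fit_interpolating_histogram(X, y, s=0.25, t=1e-6, sign=+1)
bad = fit_interpolating_histogram(X, y, s=0.25, t=1e-6, sign=-1)
print(np.max(np.abs(good(X) - y)), np.max(np.abs(bad(X) - y)))  # both are 0
```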
2.3 A generic oracle inequality for empirical risk minimization
The main purpose of this section is to present a general variance-improved oracle inequality that bounds the excess risk of empirical risk minimizers for a broad class of loss functions and is of independent interest. In Sect. 2.4, we apply this result to the special case of the least squares loss and to histogram rules that choose their cubic partitions in a certain data-dependent way. In particular, we give an optimized uniform bound that crucially relies on an explicit capacity bound, expressed in terms of covering numbers, see Definition E.1. This is a necessary step for establishing the learning properties of histograms based on data-dependent cubic partitions. The proof of this result is provided in Appendix E.1.
Theorem 2.10
Let \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) be a locally Lipschitz continuous loss, \(\mathcal{F}\subset \mathcal{L}_{\infty }(X)\) be a closed, separable set satisfying \(\Vert f \Vert _\infty \le M\) for a suitable constant \(M>0\) and all \(f\in \mathcal{F}\), and P be a distribution on \(X\times Y\) that has a Bayes decision function \(f_{L,P}^{*}\) with \({\mathcal{R}_{L,P}({f_{L,P}^*})}< \infty\). Assume that there exist constants \(B>0\), \(\vartheta \in [0,1]\), and \(V\ge B^{2-\vartheta }\) such that for all measurable \(f: X \rightarrow [-M,M]\) we have
Then, for all measurable empirical risk minimization algorithms \(D\mapsto f_D\), all \(n\ge 1\), \(\tau >0\), and all \(\varepsilon >0\) we have
with probability \(P^n\) not less than \(1- e^{-\tau }\). Here, \(\mathcal{N}(\mathcal{F},\Vert \cdot \Vert _\infty ,\varepsilon )\) denotes the \(\varepsilon\)-covering number of \(\mathcal{F}\).
Note that variance-improved oracle inequalities generally provide refined bounds on the estimation error part of the excess risk under the stricter assumptions (18) and (19). This leads to faster rates of convergence compared to a basic statistical analysis, see e.g. (Steinwart and Christmann (2008), Chapters 6 and 7). Our variance-improved oracle inequality improves over the one in (Steinwart and Christmann (2008), Theorem 7.2) for empirical risk minimizers: we go beyond finite function classes and bound the capacity in terms of covering numbers.
2.4 Main results for least squares loss
Our main results below show that the descriptions good and bad interpolating histogram rule from Examples 2.8 and 2.9, respectively, are indeed justified, provided the inflation parameter is chosen appropriately. Here we recall that good learning algorithms can be described by a small excess risk, or equivalently, a small distance to the Bayes decision function \({f_{L,P}^*}\), see (3). To describe bad learning behavior, we denote the point spectrum of \(P_X\) by
$$\begin{aligned} \Delta := \bigl \{ x\in X \, : \, P_X(\{x\}) > 0 \bigr \} \, , \end{aligned}$$
see Hoffman-Jorgensen (2017). One easily verifies that \(\Delta\) is at most countable, since \(P_X\) is finite. Moreover, for an arbitrary but fixed version \({f_{L,P}^*}\) of the Bayes decision function, we write
$$\begin{aligned} f^\dagger _{L,P} := {f_{L,P}^*}\, \varvec{1}_{\Delta } - {f_{L,P}^*}\, \varvec{1}_{X\setminus \Delta } \qquad \text {and} \qquad {\mathcal {R}}^\dagger _{L,P} := {\mathcal {R}}_{L,P}\bigl (f^\dagger _{L,P}\bigr ) \, , \end{aligned}$$
where we note that \({\mathcal {R}}^\dagger _{L,P}\) does, of course, not depend on the choice of \({f_{L,P}^*}\). Moreover, note that for \(x\in \Delta\) the value \({f_{L,P}^*}(x)\) is also independent of the choice of \({f_{L,P}^*}\) and it holds \(f^\dagger _{L,P} (x) = {f_{L,P}^*}(x)\). In contrast, for \(x\in X{\setminus } \Delta\) with \({f_{L,P}^*}(x) \ne 0\) we have \(f^\dagger _{L,P}(x) \ne {f_{L,P}^*}(x)\). In fact, a quick calculation using (3) shows
$$\begin{aligned} {\mathcal {R}}^\dagger _{L,P} - {\mathcal {R}}^*_{L,P} = \bigl \Vert f^\dagger _{L,P} - {f_{L,P}^*}\bigr \Vert _{L_2(P_X)}^2 = 4 \int _{X\setminus \Delta } \bigl ({f_{L,P}^*}\bigr )^2 \, \textrm{d}P_X \, , \end{aligned}$$(21)
and consequently we have \({\mathcal {R}}^\dagger _{L,P} - {\mathcal{R}_{L,P}^{*}}>0\) whenever \(P_X(\Delta ) < 1\) and \({f_{L,P}^*}\) does not almost surely vanish on \(X\setminus \Delta\). It seems fair to say that the overwhelming majority of “interesting” P fall into this category. Finally, note that in general we do not have an equality of the form (3), when we replace \({\mathcal{R}_{L,P}^{*}}\) and \({f_{L,P}^*}\) by \({\mathcal {R}}^\dagger _{L,P}\) and \({f_{L,P}^\dagger }\). However, for \(y,t,t'\in Y=[-1,1]\) we have \(|L(y,t) - L(y,t')| \le 4 |t-t'|\), and consequently we find
for all \(f:X\rightarrow Y\). For this reason, we will investigate the bad interpolating histogram rule only with respect to its \(L_2\)-distance to \({f_{L,P}^\dagger }\).
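For completeness, the elementary computation behind this bound, which only uses the least squares loss and \(|y|,|t|,|t'|\le 1\), is
$$\begin{aligned} \bigl |L(y,t) - L(y,t')\bigr | = \bigl |(y-t)^2 - (y-t')^2\bigr | = |t-t'|\,\bigl |2y - t - t'\bigr | \le \bigl (2|y| + |t| + |t'|\bigr )\,|t-t'| \le 4\,|t-t'| \, , \end{aligned}$$
so that \(|{\mathcal {R}}_{L,P}(f) - {\mathcal {R}}_{L,P}(g)| \le 4\Vert f-g\Vert _{L_1(P_X)} \le 4\Vert f-g\Vert _{L_2(P_X)}\) holds for all \(f,g:X\rightarrow Y\).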
Before we state our main result of this section we need to introduce one more assumption that will be required for parts of our results.
Assumption 2.11
There exists a non-decreasing continuous map \(\varphi : \mathbb {R}_+ \rightarrow \mathbb {R}_+\) with \(\varphi (0)=0\) such that for any \(t \ge 0\) and \(x \in X\) one has \(P_X(x + tB_\infty ) \le \varphi ( t)\).
Note that this assumption implies \(P_X(\{x\})=0\) for any \(x \in X\). Moreover, it is satisfied for the uniform distribution \(P_X\) on X, if we consider \(\varphi (t):= 2^d t^d\), and a simple argument shows that, modulo the constant appearing in \(\varphi\), the same is true if \(P_X\) only has a bounded Lebesgue density. The latter is, however, not necessary. Indeed, for \(X=[-1,1]\) and \(0<\beta < 1\) it is easy to construct unbounded Lebesgue densities that satisfy Assumption 2.11 for \(\varphi\) of the form \(\varphi (t) = ct^\beta\), and higher-dimensional analogs are also easy to construct. Moreover, in higher dimensions Assumption 2.11 also applies to various distributions living on sufficiently smooth low-dimensional manifolds.
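Spelled out for the uniform distribution on \(X=[-1,1]^d\), whose Lebesgue density is \(2^{-d}\) on X, the computation behind this example is simply
$$\begin{aligned} P_X\bigl (x + tB_\infty \bigr ) = 2^{-d}\,\lambda ^d\bigl ((x+tB_\infty )\cap X\bigr ) \le 2^{-d}\,(2t)^d = t^d \le 2^d t^d = \varphi (t) \, , \end{aligned}$$
where \(\lambda ^d\) denotes the Lebesgue measure; replacing \(2^{-d}\) by an arbitrary bound on the density only changes the constant in \(\varphi\).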
With these preparations we can now establish the following theorem that shows that for an inflation parameter \(\rho =0\) (see Examples 2.8, 2.9) the good interpolating histogram rule is universally consistent while the bad interpolating histogram rule fails to be consistent in a stark sense. It further shows consistency, respectively non-consistency for \(\rho =\rho _n>0\) with \(\rho _n\rightarrow 0\).
Theorem 2.12
Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto f_{D,s,\rho }^+\) denote the good interpolating histogram rule from Example 2.8. Similarly, let \(D \mapsto f_{D,s,\rho }^-\) denote the bad interpolating histogram rule from Example 2.9. Assume that \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\) as well as \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\).
(i) (Non)-consistency for \(\rho _n = 0\). We have in probability for \(|D|\rightarrow \infty\)
$$\begin{aligned} \Vert f_{D,s_n,0}^+- {f_{L,P}^*} \Vert _{{L_{2}(P_X)}}&\rightarrow 0 \,, \end{aligned}$$(23)
$$\begin{aligned} \Vert f_{D,s_n,0}^-- {f_{L,P}^\dagger } \Vert _{{L_{2}(P_X)}}&\rightarrow 0 \, . \end{aligned}$$(24)
(ii) (Non)-consistency for \(\rho _n >0\). Let \((\rho _n)_{n \in \mathbb {N}}\) be a non-negative sequence with \(\rho _n \rightarrow 0\) as \(n \rightarrow \infty\). Then for all distributions P that satisfy Assumption 2.11 for a function \(\varphi\) with \(n\varphi (\rho _n) \rightarrow 0\) for \(n\rightarrow \infty\), we have
$$\begin{aligned} ||f_{D,s_n, \rho _n}^+- {f_{L,P}^*}||_{L_2(P_X)} \rightarrow 0 \,, \end{aligned}$$(25)
$$\begin{aligned} ||f_{D,s_n,\rho _n}^-- {f_{L,P}^\dagger }||_{L_2(P_X)} \rightarrow 0 \,, \end{aligned}$$(26)
in probability for \(|D|\rightarrow \infty\).
The proof of Theorem 2.12 is provided in Appendix C.2. Our second main result, whose proof is provided in Appendix C.3, refines the above theorem and establishes learning rates for the good and bad interpolating histogram rules, provided the width \(s_n\) and the inflation parameter \(\rho _n\) decrease sufficiently fast as \(n \rightarrow \infty\).
Theorem 2.13
Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto f_{D,s,\rho }^+\) denote the good interpolating histogram rule from Example 2.8. Similarly, let \(D \mapsto f_{D,s,\rho }^-\) denote the bad interpolating histogram rule from Example 2.9. Suppose that \({f_{L,P}^*}\) is \(\alpha\)-Hölder continuous with \(\alpha \in (0,1]\) and that P satisfies Assumption 2.11 for some function \(\varphi\). Assume further that \((s_n)_{n \in \mathbb {N}}\) is a sequence with
$$\begin{aligned} s_n = n^{-\gamma } \, , \qquad \gamma := \frac{1}{2\alpha +d} \, , \end{aligned}$$
and that \((\rho _n)_{n\ge 1}\) is a non-negative sequence with \(n \varphi (\rho _n) \le \ln (n) n^{-2/3}\) for all \(n\ge 1\). Then there exists a constant \(c_{d,\alpha }>0\) only depending on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\), such that for all \(n\ge 1\) the good interpolating histogram rule satisfies
with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Furthermore, for all \(n\ge 1\), the bad interpolating histogram rule satisfies
with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\).
For a discussion of our results, we refer to Sect. 4.
3 Approximation of histograms with ReLU networks
The goal of this section is to build neural networks of suitable depth and width that mimic the learning properties of inflated histogram rules. To be more precise, we aim to construct a particular class of inflated networks that contains good and bad interpolating predictors, similar to the good and bad interpolating histogram rules from Example 2.8 and Example 2.9, respectively.
We begin with describing in more detail the specific networks that we will consider. Given an activation function \(\sigma : \mathbb {R}\rightarrow \mathbb {R}\) and \(b \in \mathbb {R}^p\) we define the shifted activation function \(\sigma _b: \mathbb {R}^p \rightarrow \mathbb {R}^p\) as
$$\begin{aligned} \sigma _b(y) := \bigl ( \sigma (y_1 + b_1), \dots , \sigma (y_p + b_p) \bigr ) \, , \end{aligned}$$
where \(y_j\), \(j=1,...,p\), denote the components of \(y \in \mathbb {R}^p\). A hidden layer with activation \(\sigma\), of width \(p \in \mathbb {N}\) and with input dimension \(q \in \mathbb {N}\) is a function \(H_\sigma :\mathbb {R}^q \rightarrow \mathbb {R}^p\) of the form
$$\begin{aligned} H_\sigma (x) := \sigma _b(Ax) \, , \qquad x\in \mathbb {R}^q \, , \end{aligned}$$
where A is a \(p \times q\) weight matrix and \(b \in \mathbb {R}^{p}\) is a shift vector or bias. Clearly, each pair (A, b) describes a layer, but in general, a layer, if viewed as a function, can be described by more than one such pair. The class of networks we consider is given in the following definition.
Definition 3.1
Given an activation function \(\sigma : \mathbb {R}\rightarrow \mathbb {R}\) and an integer \(\tilde{L}\ge 1\), a neural network with architecture \(p \in \mathbb {N}^{\tilde{L}+1}\) is a function \(f: \mathbb {R}^{p_0} \rightarrow \mathbb {R}^{p_{\tilde{L}}}\), having a representation of the form
$$\begin{aligned} f = H_{\text {id}, \tilde{L}} \circ H_{\sigma , \tilde{L}-1} \circ \cdots \circ H_{\sigma , 1} \, , \end{aligned}$$
where \(H_{\sigma , l}: \mathbb {R}^{p_{l-1}} \rightarrow \mathbb {R}^{p_l}\) is a hidden layer of width \(p_l \in \mathbb {N}\) and input dimension \(p_{l-1} \in \mathbb {N}\), \(l=1,...,\tilde{L}-1\). Here, the last layer \(H_{\text{ id }, \tilde{L}}: \mathbb {R}^{p_{\tilde{L}-1}} \rightarrow \mathbb {R}^{p_{\tilde{L}}}\) is associated to the identity \(\text{ id }: \mathbb {R}\rightarrow \mathbb {R}\).
A network architecture is therefore described by an activation function \(\sigma\) and a width vector \(p = (p_0,...,p_{\tilde{L}}) \in \mathbb {N}^{\tilde{L}+1}\). The positive integer \(\tilde{L}\) is the number of layers, \(\tilde{L}-1\) is the number of hidden layers or the depth. Here, \(p_0\) is the input dimension and \(p_{\tilde{L}}\) is the output dimension. In the sequel, we confine ourselves to the ReLU-activation function \(|\cdot |_+: \mathbb {R}\rightarrow [0,\infty )\) defined by
$$\begin{aligned} |t|_+ := \max \{0, t\} \, , \qquad t\in \mathbb {R} \, . \end{aligned}$$
Moreover, we consider networks with fixed input dimension \(p_0=d\) and output dimension \(p_{\tilde{L}}=1\), that is,
Thus, we may parameterize the (inner) architecture by the width vector \((p_1,...,p_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\) of the hidden layers only. In the following, we denote the set of all such neural networks by \({\mathcal {A}}_{p_1,...,p_{\tilde{L}-1}}\).
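For readers who prefer code, the following minimal numpy sketch (our own; the helper names are illustrative) evaluates such a network: every hidden layer applies \(x\mapsto \sigma _b(Ax)\) with the ReLU activation, and the final layer uses the identity.

```python
import numpy as np

def relu(z):
    """ReLU activation |z|_+ = max(z, 0), applied componentwise."""
    return np.maximum(z, 0.0)

def network(x, layers):
    """Evaluate a network in the sense of Definition 3.1: 'layers' is a list of
    (A, b) pairs; each hidden layer maps z to relu(A z + b), and the final pair
    is the identity (output) layer z -> A z + b."""
    z = np.asarray(x, float)
    for A, b in layers[:-1]:
        z = relu(A @ z + b)
    A, b = layers[-1]
    return A @ z + b

# a member of A_{3, 2}: input dimension d = 2, hidden widths 3 and 2, output 1
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(3, 2)), rng.normal(size=3)),   # first hidden layer
    (rng.normal(size=(2, 3)), rng.normal(size=2)),   # second hidden layer
    (rng.normal(size=(1, 2)), np.zeros(1)),          # identity output layer
]
print(network(np.array([0.3, -0.7]), layers))
```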
3.1 \(\varepsilon\)-approximate inflated histograms
Motivated by the representation (4) for histograms, the first step of our construction approximates the indicator function of a multi-dimensional interval by a small part of a possibly large DNN. This will be our main building block. We emphasize that the ReLU activation function is particularly suited for this approximation and it thus plays a key role in our entire construction.
For the formulation of the corresponding result we fix some notation. For \(z_1, z_2 \in {\mathbb {R}^d}\) we write \(z_1\le z_2\) if each coordinate satisfies \(z_{1,i}\le z_{2,i}\), \(i=1,\dots ,d\). We define \(z_1 < z_2\) analogously. In addition, if \(z_1\le z_2\), then the multi-dimensional interval is \([z_1, z_2]:= \{ z\in {\mathbb {R}^d}: z_1\le z\le z_2 \}\), and we similarly define \((z_1, z_2)\) if \(z_1 < z_2\). Finally, for \(s\in \mathbb {R}\), we let \(z_1 + s:= (z_{1,1}+s,\dots ,z_{1,d}+s)\).
Definition 3.2
Let \(A\subset X\), \(z_1, z_2\in {\mathbb {R}^d}\) with \(z_{1} < z_{2}\) and \(\varepsilon >0\) with \(\varepsilon < \frac{1}{2}\cdot \min \{z_{2,i}-z_{1,i}: i=1,\dots ,d \}\). Then a network \(\varvec{1}_A^{(\varepsilon )} \in {\mathcal {A}}_{2d,1}\) is called an \(\varepsilon\)-Approximation of the indicator function \(\varvec{1}_A: X \rightarrow [0,1]\) if
and if
The next lemma ensures the existence of such approximations. The full construction is elementary calculus and is provided in Appendix D.2, in particular in Lemma D.3. Lemma D.5 provides then the desired properties.
Lemma 3.3
[Existence of \(\varepsilon\)-Approximations] Let \(z_1, z_2\in {\mathbb {R}^d}\) and \(\varepsilon >0\) as in Definition 3.2. Then for all \(A\subset X\) with \([z_{1}+\varepsilon , z_{2}-\varepsilon ] \subset A \subset [z_{1}, z_{2}]\) there exists an \(\varepsilon\)-Approximation \(\varvec{1}_A^{(\varepsilon )}\) of \(\varvec{1}_A\).
Fig. 2 Left. Approximation \(\varvec{1}_A^{(\varepsilon )}\) (orange) of the indicator function \(\varvec{1}_A\) for \(A =[0.1, 0.6]\) (blue) according to Lemma 3.3 for \(\varepsilon = 0.1\) on \(X=[0,1]\). The construction of \(\varvec{1}_A^{(\varepsilon )}\) ensures that \(\varvec{1}_A^{(\varepsilon )}\) coincides with \(\varvec{1}_A\) modulo a small set that is controlled by \(\varepsilon >0\). Right. A DNN (blue) for regression that approximates the histogram \(\varvec{1}_{[0,0.5)} + 0.8 \cdot \varvec{1}_{[0.5,1)}\) and a DNN (orange) that additionally tries to interpolate two samples \(x_1 = 0.2\) and \(x_2= 0.575\) (located at the two vertical dotted lines) with \(y_i = -0.5\). The label \(y_1\) is correctly interpolated since the alignment condition (11) is satisfied for \(x_1\) with \(t=0.15\) and \(\varepsilon =\delta = t/3 = 0.05\) as in Example 3.6. In contrast, \(y_2\) is not correctly interpolated since condition (11) is violated for this t and hence \(\varepsilon\) and \(\delta\) are too large (Color figure online)
Figure 2 illustrates \(\varvec{1}_A^{(\varepsilon )}\) for \(d=1\). Moreover, the proof of Lemma D.3 shows that out of the \(2d^2\) weight parameters of the first layer, only 2d are non-zero. In addition, the 2d weight parameters of the neuron in the second layer are all identical. In order to approximate inflated histograms we need to know how to combine several functions of the form provided by Lemma 3.3 into a single neural network. An appealing feature of our DNNs is that the concatenation of layer structures is very easy.
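The exact construction is given in Appendix D.2; the following sketch (our own, not necessarily identical to the one of Lemma D.3) realizes one function with the properties just described: the first hidden layer has \(2d\) ReLU neurons, each depending on a single coordinate, and the single second-layer neuron has \(2d\) identical incoming weights \(-1/\varepsilon\) and bias 1.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def indicator_approx(x, z1, z2, eps):
    """Two-hidden-layer ReLU block g with 1_{[z1+eps, z2-eps]} <= g <= 1_{[z1, z2]}.
    First hidden layer: 2d neurons, each depending on one coordinate only;
    second hidden layer: one neuron whose 2d incoming weights all equal -1/eps."""
    x, z1, z2 = (np.asarray(v, float) for v in (x, z1, z2))
    below = relu((z1 + eps) - x)      # > 0 as soon as x_i < z_{1,i} + eps
    above = relu(x - (z2 - eps))      # > 0 as soon as x_i > z_{2,i} - eps
    return relu(1.0 - (below.sum() + above.sum()) / eps)

# d = 1, A = [0.1, 0.6], eps = 0.1 as in Fig. 2 (left): value 1 on [0.2, 0.5],
# value 0 outside [0.1, 0.6], and linear ramps in between
for x in (0.0, 0.1, 0.15, 0.3, 0.55, 0.6, 0.8):
    print(x, indicator_approx([x], [0.1], [0.6], 0.1))
```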
Lemma 3.4
If \(c \in \mathbb {R}\), \(p, p' \in \mathbb {N}^2\), and \(g \in {\mathcal {A}}_{p}\), \(g' \in {\mathcal {A}}_{p'}\), then \(cg \in {\mathcal {A}}_{p}\) and \(g + g' \in {\mathcal {A}}_{p+p'}.\)
Lemma 3.4 describes some properties of neural networks with respect to scaling and addition. It tells us that the class of neural networks is closed under scalar multiplication and addition, with the width of the resulting networks adjusted appropriately. The proof is based on elementary linear algebra. For an extended version of this result, see Lemma D.2. In particular, our constructed DNNs have a particularly sparse structure and the number of required neurons behaves in a very controlled and natural fashion.
With these insights, we are now able to find a representation similar to (4). To this end, we choose a cubic partition \({\mathcal {A}}=(A_j)_{j \in J}\) of X with width \(s>0\) and define for \(\varepsilon \in (0, \frac{s}{3}]\)
$$\begin{aligned} {\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}} := \Bigl \{ \sum _{j\in J} c_j \varvec{1}_{A_j}^{(\varepsilon )} \, : \, c_j \in Y \text { for all } j\in J \Bigr \} \, , \end{aligned}$$
where \(\varvec{1}_{A_j}^{(\varepsilon )}:= (\varvec{1}_{B_j}^{(\varepsilon )})_{|A_j}\) is the restriction of \(\varvec{1}_{B_j}^{(\varepsilon )}\) to \(A_j\) and \(\varvec{1}_{B_j}^{(\varepsilon )}\) is an \(\varepsilon\)-approximation of \(\varvec{1}_{B_j}\) of Lemma 3.3. Here, \(B_j\) is the cell with \(A_j = B_j\cap X\), see the text around (8). We call any function in \({\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}}\) an \(\varepsilon\)-approximate histogram.
Our considerations above show that we have \({\mathcal {H}}^{(\varepsilon )}_{\mathcal{A}} \subset {\mathcal {A}}_{p_1,p_2}\) with \(p_1 = 2d|J|\) and \(p_2 = |J|\). Thus, any \(\varepsilon\)-approximate histogram can be represented by a neural network with 2 hidden layers. Inflated versions are now straightforward.
Definition 3.5
Let \(s \in (0,1]\), \(m\ge 1\), and \(\varepsilon \in (0, s/3]\). Then a function \(f^{\varepsilon }: X \rightarrow Y\) is called an \(\varepsilon\)-approximated m-inflated histogram of width s if there exist a subset \(\{x_1^*,..., x_m^*\} \in \textrm{Pot}_m(X)\) and a cubic partition \(\mathcal{A}\) of width s that is properly aligned to \(\{x_1^*,..., x_m^*\}\) with parameter \(r\in [0,s)\) such that
$$\begin{aligned} f^{\varepsilon } = h^{(\varepsilon )} + \sum _{i=1}^m b_i \varvec{1}^{(\delta )}_{x_i^* + t B_\infty } \, , \end{aligned}$$
where \(h^{(\varepsilon )} \in {\mathcal {H}}_{\mathcal {A}}^{(\varepsilon )}\), \(t \in (0,r]\), \(\delta \in (0, t/3]\), \(b_i \in 2Y\) and where \(\varvec{1}^{(\delta )}_{x_i^* + t B_\infty }\) is a \(\delta\)-approximation of \(\varvec{1}_{x_i^* + t B_\infty }\) for all \(i=1,...,m\). We denote the set of all \(\varepsilon\)-approximated m-inflated histograms of width s by \({\mathcal {F}}^{(\varepsilon )}_{s, m}\).
A short calculation shows that \({\mathcal {F}}^{(\varepsilon )}_{s, m} \subset {\mathcal {A}}_{p_1,p_2}\) with \(p_1 = 2d(m+|J|)\), \(p_2 = m+|J|\) and \(|J| \le (2/s)^d\). With these preparations, we can now introduce good and bad interpolating DNNs.
Example 3.6
(Good and bad interpolating DNN) Let L be the least squares loss, \(s \in (0,1]\) be a cell width and let \(\rho > 0\) be an inflation parameter. For a data set \(D=((x_1,y_1),\dots ,(x_n,y_n))\) we consider again a cubic partition \({\mathcal {A}}_D=\pi _{m,s}(D_X)\), with \(m=|D_X|\), being properly aligned to \(D_X\) with parameter r. Set \(t:=\min \{r, \rho \}\). According to Example 2.8, a good interpolating HR is given by
$$\begin{aligned} f_{D,s,\rho }^+ = \sum _{j\in J} c_j^+ \varvec{1}_{A_j} + \sum _{i=1}^m b_i^+ \varvec{1}_{x_i^* + t B_\infty } \, , \end{aligned}$$
where the \((c_j^+)_{j \in J}\) are given in (6) and \(b^+_1,\dots ,b_m^+\) are from (14). For \(\varepsilon := \delta := t/3\) we then define the good interpolating DNN by
$$\begin{aligned} g_{D,s,\rho }^+ := \sum _{j\in J} c_j^+ \varvec{1}_{A_j}^{(\varepsilon )} + \sum _{i=1}^m b_i^+ \varvec{1}^{(\delta )}_{x_i^* + t B_\infty } \, . \end{aligned}$$
Clearly, we have \(g_{D,s,\rho }^+\in {\mathcal {F}}^{(\varepsilon )}_{s, m}\). We call the map \(D\mapsto g_{D,s,\rho }^+\) a good interpolating DNN and it is easy to see that this network indeed interpolates D. Finally, the bad interpolating DNN \(g_{D,s,\rho }^-\) is defined analogously using the bad interpolating HR from Example 2.9, instead.
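Putting the pieces together, a purely functional Python sketch of Example 3.6 (our own simplification: zero partition offset instead of the properly aligned one, and distinct covariates so that \(c_i^*=y_i\)) reads as follows; sign=-1 yields the bad interpolating DNN.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def box_bump(x, z1, z2, eps):
    # ReLU block with 1_{[z1+eps, z2-eps]} <= value <= 1_{[z1, z2]}
    below, above = relu((z1 + eps) - x), relu(x - (z2 - eps))
    return relu(1.0 - (below.sum() + above.sum()) / eps)

def fit_interpolating_dnn(X, y, s, t, sign=+1):
    """Good (sign=+1) or bad (sign=-1) interpolating DNN: eps-approximate cell
    indicators weighted by the (possibly negated) histogram coefficients, plus
    delta-approximate bumps of height b_i on the t-boxes around the samples,
    with eps = delta = t / 3."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    eps = delta = t / 3.0
    keys = np.floor(X / s).astype(int)
    cells = {}
    for k, target in zip(map(tuple, keys), y):
        cells.setdefault(k, []).append(target)
    coeffs = {k: sign * float(np.mean(v)) for k, v in cells.items()}
    # corrections b_i = y_i - c_{j_i}; for continuous data no sample is
    # eps-close to a cell boundary with probability one, so the cell bump
    # equals 1 at every sample point
    b = np.array([y[i] - coeffs[tuple(keys[i])] for i in range(len(y))])

    def predict(x):
        x = np.asarray(x, float)
        k = tuple(np.floor(x / s).astype(int))
        z1 = np.array(k) * s
        out = coeffs.get(k, 0.0) * box_bump(x, z1, z1 + s, eps)
        for i in range(len(y)):
            out += b[i] * box_bump(x, X[i] - t, X[i] + t, delta)
        return out

    return predict

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.clip(np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50), -1, 1)
g_plus = fit_interpolating_dnn(X, y, s=0.25, t=1e-6)
print(max(abs(g_plus(X[i]) - y[i]) for i in range(50)))  # ~ 0: the data are interpolated
```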
Similarly to our inflated histograms from the previous section, the next theorem shows that the good interpolating DNN is consistent while the bad interpolating DNN fails to be. The proof of this result is given in Appendix D.3.
Theorem 3.7
[(Non)-consistency] Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto g_{D,s,\rho }^+\) denote the good interpolating DNN from Example 3.6. Similarly, let \(D \mapsto g_{D,s,\rho }^-\) denote the bad interpolating DNN from Example 3.6. Assume that \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\), \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\) as well as \(s_n > 2n^{-1/d}\). Additionally, let \((\rho _n)_{n \in \mathbb {N}}\) be a non-negative sequence with \(\rho _n \le 2n^{-1/d}\). Then \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\). Moreover, for all distributions P that satisfy Assumption 2.11 for a function \(\varphi\) with \(\rho _n^{-d} \varphi ( \rho _n ) \rightarrow 0\) for \(n\rightarrow \infty\), we have
in probability for \(|D|\rightarrow \infty\).
The above result can further be refined to establish rates of convergence if the width \(s_n\) and the inflation parameter \(\rho _n\) converge to zero sufficiently fast as \(n \rightarrow \infty\). The proof is provided in Appendix D.4.
Theorem 3.8
[Learning Rates] Let L be the least-squares loss and let \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\). Let \(D \mapsto g_{D,s,\rho }^+\) denote the good interpolating DNN from Example 3.6. Similarly, let \(D \mapsto g_{D,s,\rho }^-\) denote the bad interpolating DNN from Example 3.6. Suppose that \({f_{L,P}^*}\) is \(\alpha\)-Hölder continuous with \(\alpha \in (0,1]\) and that P satisfies Assumption 2.11 for some function \(\varphi\). Assume further that \((s_n)_{n \in \mathbb {N}}\) is a sequence with
$$\begin{aligned} s_n = n^{-\gamma } \, , \qquad \gamma := \frac{1}{2\alpha +d} \, , \end{aligned}$$
and that \((\rho _n)_{n\ge 1}\) is a non-negative sequence with \(\rho _n \le 2n^{-1/d}\) and \(\rho _n^{-d} \varphi (\rho _n) \le \ln (n) n^{-2/3}\) for all \(n\ge 1\). Then there exists a constant \(c_{d,\alpha }>0\) only depending on d, \(\alpha\), and the Hölder constant \(|f^*_{L,P}|_\alpha\), such that for all \(n\ge 2\) the good interpolating DNN satisfies
with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Furthermore, for all \(n\ge 2\), the bad interpolating DNN satisfies
with probability \(P^n\) not less than \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\). Finally, there exists a natural number \(n_{d, \alpha } > 0\) such that for any \(n \ge n_{d, \alpha }\) we have \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\).
Note that the rates of convergence in (34) and (35) remain true if we consider a sequence \(s_n\) with \(c^{-1} n^{-\gamma } \le s_n \le cn^{-\gamma }\) for some constant c independent of n. In fact, the only reason why we have formulated Theorem 3.8 with \(s_n = n^{-\gamma }\) is to avoid another constant appearing in the statements. Moreover, if we choose \(s_n:= 2 a \lfloor n^{\gamma }\rfloor ^{-1}\) with \(a:= 3^{1/d}/(3^{1/d}-2)\), then we have \(|J| \le (2 s_n^{-1} + 2)^d \le (a^{-1}n^{1/d} + 2)^d \le n\) for all \(n\ge 3\). Consequently, for \(m:= n\), we can choose \(n_{d, \alpha }:= 3\), and hence we have \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\) for all \(n\ge 3\) while (34) and (35) hold true modulo a change in the constant \(c_{\alpha ,d}\).
4 Discussion and summary of results
In this section we summarize our findings and put them into a broader context.
4.1 Inflated histograms
To set the results from Sect. 2 in context, let us first recall that even for a fixed hypotheses class, ERM is, in general, not a single algorithm, but a collection of algorithms. In fact, this ambiguity appears as soon as the ERM optimization problem does not have a unique solution for certain data sets, and as Lemma A.1 shows, this non-uniqueness may even occur for strictly convex loss functions such as the least squares loss. Now, the standard techniques of statistical learning theory are capable of showing that for sufficiently small hypotheses classes, all versions of ERM enjoy good statistical guarantees. In other words, the non-uniqueness of ERM does not affect its learning capabilities as long as the hypotheses class is sufficiently small. In addition, it is folklore that in some large hypotheses classes, there may be heavily overfitting ERM solutions, leading to the usual conclusion that such hypotheses classes should be avoided.
In contrast to this common wisdom, however, Theorem 2.12 demonstrates that for large hypotheses classes, the situation may be substantially more complicated: First, it shows that there exist empirical risk minimizers whose predictors converge to a function \({f_{L,P}^\dagger }\), see (24), that in almost all interesting cases is far off the target regression function, see (21), confirming that the overfitting issue is indeed present for the chosen hypotheses classes. Moreover, this strong overfitting may actually take place with fast convergence, see (28). Despite this negative result, however, we can also find empirical risk minimizers that enjoy a good learning behavior in terms of consistency (23) and almost optimal learning rates (27). In other words, both the expected overfitting and standard learning guarantees may be realized by suitable versions of empirical risk minimizers over these hypotheses classes. In fact, these two different behaviors are just extreme examples, and a variety of intermediate behaviors are possible, too: Indeed, as the training error can be solely controlled by the corrections on the inflating parts, the behavior of the histogram part h can be chosen arbitrarily. For our theorems above, we have chosen a particularly good and a particularly bad h-part, respectively, but of course, a variety of other choices leading to intermediate behavior are also possible. As a consequence, the ERM property of an algorithm working with a large hypotheses class is, in general, no longer a sufficient notion for describing its learning behavior. Instead, additional assumptions are required to determine its learning behavior. In this respect we also note that for our inflated hypotheses classes, other learning algorithms that do not (approximately) minimize the empirical risk may also enjoy good learning properties. Indeed, by setting the inflating parts to zero, we recover standard histograms, which in general do not have close-to-zero training error, but for which the guarantees of our good interpolating predictors also hold true.
Of course, the chosen hypotheses classes may, to some extent, appear artificial. Nonetheless, in Sect. 3 they are key for showing that for sufficiently large DNN architectures exactly the same phenomena occur for some of their global minima.
4.2 Neural networks
To fully appreciate Theorems 3.7 and 3.8 as well as their underlying construction let us discuss its various consequences.
Training. The good interpolating DNN predictors \(g_{D,s_n, \rho _n}^+\) show that it is actually possible to train sufficiently large, over-parameterized DNNs such that they become consistent and enjoy optimal learning rates up to a logarithmic factor without adapting the network size to the particular smoothness of the target function. In fact, it suffices to consider DNNs with two hidden layers and 4dn, respectively 2n, neurons in the first, respectively second, hidden layer. In other words, Theorems 3.7 and 3.8 already apply to moderately over-parameterized DNNs and, by the particular properties of the ReLU-activation function, also to all larger network architectures. In addition, when using architectures of minimal size, training, that is, constructing \(g_{D,s_n, \rho _n}^+\), can be done in \(\mathcal{O}(d^2\cdot n^2)\)-time if the NNs are implemented as fully connected networks. Moreover, the constructed NNs have a particularly sparse structure and exploiting this can actually reduce the training time to \(\mathcal{O}(d \cdot n\cdot \log n)\). While we present statistically sound end-to-end proofs of consistency and optimal rates for NNs, we also need to admit that our training algorithm is mostly interesting from a theoretical point of view, but useless for practical purposes.
Optimization Landscape. Theorems 3.7 and 3.8 also have consequences for DNNs trained by variants of stochastic gradient descent (SGD) if the resulting predictor is interpolating. Indeed, these theorems show that ending in a global minimum may result in either a very good learning behavior or an extremely poor, overfitting behavior. In fact, all the observations made for histograms at the end of Sect. 2 apply to DNNs, too. In particular, since for \(n\ge n_{d,\alpha }\) the \({\mathcal {A}}_{4dn, 2n}\)-networks can \(\varepsilon\)-approximate all functions in \(\mathcal{F}_{s,n}^*\) for all \(\varepsilon \ge 0\) and all \(s\in [n^{-1/d}, 1]\), we can, for example, find, for each polynomial learning rate slower than \(n^{-\alpha \gamma }\), an interpolating learning method \(D\mapsto f_D\) with \(f_D\in {\mathcal {A}}_{4dn, 2n}\) that learns with this rate. Similarly, we can find interpolating \(f_D\in {\mathcal {A}}_{4dn, 2n}\) with various degrees of bad learning behavior. In summary, the optimization landscape induced by \({\mathcal {A}}_{4dn, 2n}\) contains a wide variety of global minima whose learning properties range somewhat continuously from essentially optimal to extremely poor. Consequently, an optimization guarantee for (S)GD, that is, a guarantee that (S)GD finds a global minimum in the optimization landscape, is useless for learning guarantees unless more information about the particular nature of the minimum found is provided. Moreover, it becomes clear that considering (S)GD without taking the initialization of the weights and biases into account is a meaningless endeavor: For example, constructing \(g_{D,s_n, \rho _n}^{\pm }\) can be viewed as a very particular form of initialization for which (S)GD will not change the parameters anymore. More generally, when the parameters are initialized randomly in the attraction basin of \(g_{D,s_n, \rho _n}^{\pm }\), GD will converge to \(g_{D,s_n, \rho _n}^{\pm }\), and therefore the behavior of GD is completely determined by the initialization. In this respect note that so far there is no statistically sound way to distinguish between good and bad interpolating DNNs on the basis of the training set alone, and hence the only way to identify good interpolating DNNs obtained by SGD is to use a validation set (that SGD can reach bad global minima is shown in Liu et al. (2020)). Now, for the good interpolating DNNs of Theorem 3.7 it is actually possible to construct a finite set of candidates such that the one with the best validation error achieves the optimal learning rates without knowing \(\alpha\). For DNNs trained by SGD, however, we do not have this luxury anymore. Indeed, while we can still identify the best predicting DNN from a finite set of SGD-learned interpolating DNNs, we no longer have any theoretical understanding of whether there is any useful candidate among them, or whether they all behave like a bad \(g_{D,s_n, \rho _n}^-\).
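The validation-based selection just mentioned can be sketched generically as follows; the helper names are hypothetical, the candidates are plain histograms indexed by a width parameter for brevity, and the selection step itself is agnostic to whether the candidates interpolate (the explicit candidate set of Theorem 3.7 is not reproduced here).

```python
import numpy as np

def hist_fit(xt, yt, s):
    """Plain histogram on a cubic partition of [-1, 1] with width s."""
    edges = np.linspace(-1.0, 1.0, int(round(2.0 / s)) + 1)
    cells = np.clip(np.digitize(xt, edges) - 1, 0, len(edges) - 2)
    means = np.array([yt[cells == j].mean() if np.any(cells == j) else 0.0
                      for j in range(len(edges) - 1)])
    return lambda x: means[np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)]

def select_by_validation(x, y, widths, fit, val_fraction=0.2, seed=0):
    """Fit one candidate per width on a training split and return the one with
    the smallest empirical L2 error on the held-out validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = max(1, int(val_fraction * len(x)))
    val, train = idx[:n_val], idx[n_val:]
    candidates = [fit(x[train], y[train], s) for s in widths]
    errors = [np.mean((g(x[val]) - y[val]) ** 2) for g in candidates]
    return candidates[int(np.argmin(errors))], min(errors)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 500)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=500)
f_best, err = select_by_validation(x, y, widths=[0.5, 0.25, 0.1, 0.05], fit=hist_fit)
print("validation error of the selected width:", err)
```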
For both consistency and learning with essentially optimal rates it is by no means necessary to find a global minimum, or at least a local minimum, in the optimization landscape. For example, the positive learning rates (27) also hold for ordinary cubic histograms with widths \(s_n:= n^{-\gamma }\), and the latter can, of course, also be approximated by \({\mathcal {A}}_{4dn, 2n}\). Repeating the proof of Theorem 3.8 it is easy to verify that these approximations also enjoy the good learning rates (34). Moreover, these approximations \(f_D\) are almost never global minima, or more precisely, \(f_D\) is not a global minimum as soon as there exists a cubic cell A containing two samples \(x_i\) and \(x_j\) with different labels, i.e. \(y_i\ne y_j\). In fact, in this case, \(f_D\) is not even a local minimum. To see this, assume without loss of generality that \(x_i\) is one of the samples in A with \(y_i \ne f_D(x_i)\). Considering \(f_{D,\lambda }:= f_D + \lambda b_i^+ \varvec{1}_{x_i+tB_\infty }^{(t/3)}\) for all \(\lambda \in [0,1]\) and \(t:= \min \{r,\rho \}\) we then see that there is a continuous path in the parameter space of \({\mathcal {A}}_{4dn, 2n}\) that corresponds to the \(\Vert \cdot \Vert _\infty\)-continuous path \(\lambda \mapsto f_{D,\lambda }\) in the set of functions \({\mathcal {A}}_{4dn, 2n}\) for which we have \({\mathcal{R}_{L,D}(f_{D,\lambda })} < {\mathcal{R}_{L,D}(f_D)}\) for all \(\lambda \in (0,1]\). In other words, \(f_D\) is not a local minimum. In this respect we note that this phenomenon also occurs to some extent in under-parameterized DNNs, at least for \(d=1\). Indeed, if we consider \(m:= 1\) and \(s_n:= n^{-\gamma }\), then \(f_D, f_{D,\lambda }\in {\mathcal {A}}_{4dn^{\gamma d}, 2 n^{\gamma d}}\) for all sufficiently large n. Now, the functions in \({\mathcal {A}}_{4dn^{\gamma d}, 2 n^{\gamma d}}\) have \(\mathcal{O}(d^2 n^{2\gamma d})\) many parameters and for \(2\gamma d = \frac{2d}{2\alpha +d}< 1\), that is \(\alpha > d/2 = 1/2\), we then see that we have strictly fewer than \(\mathcal{O}(\sqrt{n})\) neurons with \(\mathcal{O}(n)\) parameters, while all the observations made so far still hold.
Finally, we want to mention that a number of recent works analyze concrete efficient GD-type algorithms (Ji et al., 2021; Ji & Telgarsky, 2019; Song et al., 2021; Chen et al., 2021; Kuzborskij & Szepesvari, 2021; Kohler & Krzyzak, 2019; Nguyen & Mücke, 2024; Braun et al., 2024) and SGD-type algorithms (Rolland et al., 2021; Deng et al., 2022; Li & Liang, 2018; Kalimeris et al., 2019; Allen-Zhu et al., 2019; Cao et al., 2024), with a focus on the particular algorithmic properties and network architecture (e.g. early stopping and the required degree of overparameterization) rather than ERM. Also, the effect of regularization is investigated in e.g. Hu et al. (2021); Wei et al. (2019). Our work differs from the perspective taken in these works in that we aim to provide, all at once, a theoretical investigation of the qualitatively different learning properties of interpolating ReLU-DNNs.
Data availability
Not applicable.
Code availability
Not applicable.
Notes
1. Note that this gives \(A(x_i^*) = B(x_i^*)\cap X\).
2. By end-to-end we mean the explicit construction of an efficient, feasible, and implementable training algorithm and the rigorous statistical analysis of this very particular algorithm under minimal assumptions.
3. This is justified since \(\varepsilon _n = \rho _n/2 \le n^{-1/d} < s_n/2\).
References
Allen-Zhu, Z., Li, Y. & Liang, Y. (2019). Learning and generalization in overparameterized neural networks, going beyond two layers. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 6158–6169.
Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR.
Bartlett, P. L., & Long, P. M. (2021). Failures of model-dependent generalization bounds for least-norm interpolation. Journal of Machine Learning Research, 22(204), 1–5.
Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48), 30063–30070.
Bauer, H. (2001). Measure and Integration Theory. Berlin: De Gruyter.
Bauer, B., & Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression. The Annals of Statistics, 47(4), 2261–2285.
Belkin, M., Hsu, D. J., & Mitra, P. (2018). Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pp. 2300–2311. Curran Associates, Inc.
Belkin, M., Hsu, D., & Xu, J. (2020). Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4), 1167–1180.
Belkin, M., Rakhlin, A., & Tsybakov, A. B. (2019). Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1611–1619. PMLR.
Braun, A., Kohler, M., Langer, S., & Walk, H. (2024). Convergence rates for shallow neural networks learned by gradient descent. Bernoulli, 30(1), 475–502.
Cao, D., Guo, Z.-C., & Shi, L. (2024). Stochastic gradient descent for two-layer neural networks. Preprint at arXiv:2407.07670.
Chen, L., Min, Y., Belkin, M., & Karbasi, A. (2020). Multiple descent: Design your own generalization curve. Preprint at arXiv:2008.01036.
Chen, Z., Cao, Y., Zou, D., & Gu, Q. (2021). How much over-parameterization is sufficient to learn deep relu networks? In International Conference on Learning Representations (ICLR).
Deng, Y., Kamani, M. M., & Mahdavi, M. (2022). Local SGD optimizes overparameterized neural networks in polynomial time. In International Conference on Artificial Intelligence and Statistics, pp. 6840–6861. PMLR.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
Hoffmann-Jørgensen, J. (2017). Probability with a View Towards Statistics (Vol. I). Routledge.
Hu, T., Wang, W., Lin, C., & Cheng, G. (2021). Regularization matters: A nonparametric perspective on overparametrized neural network. In International Conference on Artificial Intelligence and Statistics, pp. 829–837. PMLR.
Ji, Z., Li, J., & Telgarsky, M. (2021). Early-stopped neural networks are consistent. Advances in Neural Information Processing Systems, 34, 1805–1817.
Ji, Z., & Telgarsky, M. (2019). Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In International Conference on Learning Representations (ICLR).
Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., & Zhang, H. (2019). SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32.
Koehler, F., Zhou, L., Sutherland, D. J., & Srebro, N. (2021). Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting. Advances in Neural Information Processing Systems, 34, 20657–20668.
Kohler, M., & Krzyżak, A. (2005). Adaptive regression estimation with multilayer feedforward neural networks. Nonparametric Statistics, 17(8), 891–913.
Kohler, M., & Krzyzak, A. (2019). Over-parametrized deep neural networks do not generalize well. Preprint at arXiv:1912.03925.
Kohler, M., & Langer, S. (2021). On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4), 2231–2249.
Kohler, M., Langer, S., & Reif, U. (2023). Estimation of a regression function on a manifold by fully connected deep neural networks. Journal of Statistical Planning and Inference, 222, 160–181.
Kuzborskij, I., & Szepesvári, C. (2021). Nonparametric regression with shallow overparameterized neural networks trained by gd with early stopping. In Conference on Learning Theory, pp. 2853–2890. PMLR.
Li, Y., & Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31.
Liang, T., Rakhlin, A., & Zhai, X. (2020). On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In Conference on Learning Theory, pp. 2683–2711. PMLR.
Liu, S., Papailiopoulos, D., & Achlioptas, D. (2020). Bad global minima exist and SGD can reach them. Advances in Neural Information Processing Systems, 33, 8543–8552.
Ma, S., Bassily, R., & Belkin, M. (2018). The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3325–3334.
McCaffrey, D. F., & Gallant, A. R. (1994). Convergence rates for single hidden layer feedforward networks. Neural Networks, 7(1), 147–158.
Mei, S., & Montanari, A. (2019). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75, 667.
Nagarajan, V., & Kolter, J. Z. (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems 32.
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., & Srebro, N. (2019). Towards understanding the role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations (ICLR).
Nguyen, M., & Mücke, N. (2024). How many neurons do we need? a refined analysis for shallow networks trained with gradient descent. Journal of Statistical Planning and Inference, 233, 106169.
Rolland, P., Ramezani-Kebrya, A., Song, C. H., Latorre, F., & Cevher, V. (2021). Linear convergence of SGD on overparametrized shallow neural networks.
Salakhutdinov, R. (2017). Deep learning tutorial at the Simons Institute. https://simons.berkeley.edu/talks/ruslan-salakhutdinov-01-26-2017-1.
Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4), 1875–1897.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Song, C., Ramezani-Kebrya, A., Pethick, T., Eftekhari, A., & Cevher, V. (2021). Subquadratic overparameterization for shallow neural networks. Advances in Neural Information Processing Systems, 34, 11247–11259.
Steinwart, I., & Christmann, A. (2008). Support Vector Machines. Springer.
Suzuki, T. (2018). Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: Optimal rate and curse of dimensionality. In International Conference on Learning Representations.
Tsigler, A., & Bartlett, P. L. (2020). Benign overfitting in ridge regression. Preprint at arXiv:2009.14286.
Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.
van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge University Press.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.
Wei, C., Lee, J. D., Liu, Q., & Ma, T. (2019). Regularization matters: Generalization and optimization of neural nets vs their induced kernel. Advances in Neural Information Processing Systems, 32.
Yang, Z., Bai, Y., & Mei, S. (2021). Exact gap between generalization error and uniform convergence in random feature models. In International Conference on Machine Learning, pp. 11704–11715. PMLR.
Yarotsky, D. (2018). Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pp. 639–649. PMLR.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. Technical report, arXiv:1611.03530.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.
Zhou, L., Sutherland, D. J., & Srebro, N. (2020). On uniform convergence and low-norm interpolation learning. Advances in Neural Information Processing Systems, 33.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
All authors whose names appear on the submission 1) made substantial contributions to the conception or design of the work; 2) drafted the work or revised it critically for important intellectual content; 3) approved the version to be published; and 4) agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Conflict of interest
All authors whose names appear on the submission declare to have no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Hendrik Blockeel.
Appendices
A Characterization of empirical risk minimizers
In this section we provide a full characterization of empirical risk minimizers, which we use several times when proving our main results.
Lemma A.1
(Characterization of ERMs) Let Y be convex, \(A\subseteq X\) be non-empty, \(\mathcal{A} = (A_j)_{j\in J}\) be a finite partition of A, and
Moreover, let \(D=((x_1,y_1),\dots ,(x_n,y_n)) \in (X\times Y)^n\) be a data set and let \(L_A(x, y, t)=\varvec{1}_A(x)L(x,y,t)\), with L being the least squares loss. Furthermore, denote the number of samples whose covariates fall into cell \(A_j\) by \(N_j\), that is \(N_j:=|\{i: x_i \in A_j\}|\). Then, for every \(f^*\in \mathcal{H}_{\mathcal{A}}\) with representation \(f^* = \sum _{j\in J} c_j \varvec{1}_{A_j}\), the following statements are equivalent:
(i) The function \(f^*\) is an empirical risk minimizer, that is
$$\begin{aligned} {\mathcal{R}_{L_A,D}(f^*)} = \min _{f\in \mathcal{H}_{\mathcal{A}}} {\mathcal{R}_{L_A,D}(f)}\,. \end{aligned}$$
(ii) For all \(j\in J\) satisfying \(N_j \ne 0\) we have
$$\begin{aligned} c_j = \frac{1}{N_j}\sum _{i: x_i\in A_j} y_i \;. \end{aligned}$$(36)
Proof of Lemma A.1
We first note that for an \(f^*\in \mathcal{H}_{\mathcal{A}}\) with representation \(f^*= \sum _{j\in J} c_j \varvec{1}_{A_j}\) we have
Consequently, \(f^*\) is an empirical risk minimizer if and only if \(c_j\) minimizes \(\sum _{i: x_i\in A_j} L(x_i,y_i, \cdot )\) for all \(j\in J\). Now, if \(N_j = 0\), the sum is empty, and hence there is nothing to consider. For \(j\in J\) with \(N_j \ne 0\) we observe that
which is minimized for \(c_j\) given by (36). \(\square\)
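As a numerical illustration of Lemma A.1 (a minimal numpy sketch with hypothetical names), the cell-wise label means attain the smallest empirical least-squares risk among functions that are constant on the cells of a fixed partition; perturbing the coefficients can only increase the risk.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = np.sign(x) + 0.3 * rng.normal(size=200)

edges = np.linspace(-1, 1, 9)                            # fixed partition of A = [-1, 1]
cells = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)

# ERM over piecewise-constant functions: cell-wise means on occupied cells, cf. (36)
c_star = np.array([y[cells == j].mean() if np.any(cells == j) else 0.0
                   for j in range(len(edges) - 1)])
risk_star = np.mean((y - c_star[cells]) ** 2)

# any perturbation of the coefficients leaves the empirical risk at least as large
for _ in range(1000):
    c = c_star + 0.1 * rng.normal(size=c_star.shape)
    assert np.mean((y - c[cells]) ** 2) >= risk_star
print("minimal empirical risk of the cell-wise means:", risk_star)
```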
B Existence of properly aligned cubic partitioning rule
In this section we prove the existence of a properly aligned cubic partitioning rule.
Proof of Theorem 2.5
Recall that cubic partitions \(\mathcal{B}\) of \(\mathbb {R}^d\) have a representation of the form (8). Now, to construct \(\pi _{m,s}\) we will consider a finite set of candidate offsets \(x_1^\dagger , \dots , x_K^\dagger \in \mathbb {R}^d\). For the construction of these offsets we write \(\delta := s/(m+1)\) and for \(j \in \{0,\dots ,m\}\) we further define
Now, our candidate offsets \(x_1^\dagger , \dots , x_K^\dagger \in \mathbb {R}^d\) are exactly those vectors whose coordinates are taken from \(z_{0}^\dagger , \dots , z_{m}^\dagger\). Clearly, this gives \(K=(m+1)^d\). Now let \(\{x_1^*,\dots , x_m^*\}\in \textrm{Pot}_m([-1,1]^d)\). In the following, we will identify the offset \(x_\ell ^\dagger\) that leads to \(\pi _{m,s}(\{x_1^*,\dots , x_m^*\})\) coordinate-wise. We begin by determining its first coordinate \(x_{\ell ,1}^\dagger\). To this end, we define
Our first goal is to show that \(I_0,\dots ,I_m\) are a partition of \(\mathbb {R}\). To this end, we fix an \(x\in \mathbb {R}\). Then there exists a unique \(k\in \mathbb {Z}\) with \(k s \le x < (k +1) s\). Moreover, for \(y:= x- k s\in [0,s)\), there exists a unique \(j\in \{0,\dots ,m\}\) with \(j\delta \le y < (j+1)\delta\). Consequently, we have found \(x\in [k s + j\delta , \, k s +(j+1)\delta )\). This shows \(\mathbb {R}\subset I_0\cup \dots \cup I_m\), and the converse inclusion is trivial. Let us now fix some \(j,j'\in \{0,\dots ,m\}\) and assume that there is an \(x\in I_j \cap I_{j'}\). Then there exist \(k,k'\in \mathbb {Z}\) such that
Since \((j+1)\delta \le s\) and \((j'+1)\delta \le s\), we conclude that \(k s \le x < (k +1) s\) and \(k' s \le x < (k' +1) s\). As observed above this implies \(k = k'\). Now consider \(y:= x-k s\in [0,s)\). Then (37) implies
and again we have seen above that this implies \(j=j'\). This shows \(I_j \cap I_{j'} =\emptyset\) for all \(j\ne j'\).
Let us now denote the first coordinate of \(x_i^*\) by \(x_{i,1}^*\). Then \(D_{X,1}^*:= \{x_{i,1}^*: i=1,\dots ,m\}\) satisfies \(|D_{X,1}^*|\le m\) and since we have \(m+1\) cells \(I_j\), we conclude that there exists a \(j_1^*\in \{0,\dots ,m\}\) with \(D_{X,1}^* \cap I_{j_1^*} = \emptyset\). We define
Next we repeat this construction for the remaining \(d-1\) coordinates, so that we finally obtain \(x_\ell ^\dagger := (z_{j_1^*}^\dagger , \dots , z_{j_d^*}^\dagger )\in \mathbb {R}^d\) for indices \(j_1^*,\dots ,j_d^*\in \{0,\dots ,m\}\) found by the above reasoning.
It remains to show that (11) holds for the cubic partition (8) with offset \(x_\ell ^\dagger\) and all \(t>0\) with \(t\le \frac{s}{3\,m+3} = \delta /3\). To this end, we fix an \(x_i^*\). Then its cell \(B(x_i^*)\) is described by a unique \(k:= (k_1,\dots ,k_d)\in \mathbb {Z}^d\), namely
Let us now consider the first coordinate \(x_{i,1}^*\). By construction we know that \(x_{i,1}^* \not \in I_{j_1^*}\) and
Now, \(x_{i,1}^* \not \in I_{j_1^*}\) implies
Since the right hand side of (38) excludes the case \(x_{i,1}^* \ge (k_1+1) s +(j_1^*+1)\delta\), we hence find
This shows \(x_{i,1}^* + r < x_{\ell ,1}^\dagger + (k_1 +1) s\) for all \(r\in [-t,t]\). To show that \(x_{i,1}^* + r > x_{\ell ,1}^\dagger + k_1 s\) holds for all \(r\in [-t,t]\) we first observe that \(x_{i,1}^* \not \in I_{j_1^*}\) also implies
Now, the left hand side of (38) excludes the case \(x_{i,1}^* < k_1 s + j_1^*\delta\). Consequently, we have
and this yields \(x_{i,1}^* + r > x_{\ell ,1}^\dagger + k_1 s\) for all \(r\in [-t,t]\). Finally, by repeating these considerations for the remaining \(d-1\) coordinates, we conclude that \(x_i^* + t B_\infty \subset B(x_i^*)\). \(\square\)
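The coordinate-wise pigeonhole argument of this proof translates into a short routine. The sketch below (hypothetical names) selects, for each coordinate, a slot of width \(\delta = s/(m+1)\) modulo s that contains no sample coordinate and places the grid line there; the precise shifted grid points \(z_j^\dagger\) of the proof are not reproduced, and the midpoint of the empty slot is used instead, which already yields a margin of \(\delta /2 \ge \delta /3\) to all cell boundaries.

```python
import numpy as np

def aligned_offset(samples, s):
    """Choose a partition offset such that every sample stays away from all
    cell boundaries of the shifted cubic partition of width s."""
    m, d = samples.shape
    delta = s / (m + 1)
    offset = np.empty(d)
    for coord in range(d):
        frac = np.mod(samples[:, coord], s)              # positions within one period of length s
        occupied = set((frac // delta).astype(int))      # at most m of the m+1 slots are hit
        j_star = next(j for j in range(m + 1) if j not in occupied)
        offset[coord] = (j_star + 0.5) * delta           # grid line placed inside the empty slot
    return offset

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(20, 3))
s = 0.3
off = aligned_offset(X, s)
r = np.mod(X - off, s)                                   # position relative to the shifted grid
dist = np.minimum(r, s - r)                              # distance to the nearest cell boundary
print("smallest distance to a boundary:", dist.min())
print("required margin s/(3m+3)      :", s / (3 * (len(X) + 1)))
```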
C Learning properties of inflated histograms
In this section we provide the proofs of the results for the good and bad interpolating histogram rules from Sect. 2.2. To this end, let us introduce some more notation. For a measurable set A and a loss \(L: X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) we introduce the loss \(L_A: X\times Y \times \mathbb {R}\rightarrow [0,\infty )\) by
Obviously, for any measurable function \(f:X \rightarrow \mathbb {R}\) it holds
Moreover, by linearity, for every measurable set \(B\subset A\), the risk then decomposes as
The next result shows that the Bayes risk also enjoys a similar decomposition.
Lemma C.1
Let \(A,B\subset X\) be non-empty, disjoint, and measurable with \(A\cup B = X\). Then we have
Proof of Lemma C.1
Basically, this is a consequence of the presence of the indicator functions \(\varvec{1}_A, \varvec{1}_B\) in the definition of \(L_A, L_B\), see (39). More precisely, there is a sequence of functions \(f_n^A\) with \(\{f_n^A \ne 0\} \subset A\) such that
as \(n \rightarrow \infty\), and similarly for A replaced by B. Thus, for \(f_n:= f_n^A + f_n^B\), one has
Since the converse inequality is trivial, this proves the lemma. \(\square\)
C.1 Preparatory lemmata
The next lemma provides a bound on the difference of the risks of two measurable functions.
Lemma C.2
Let \(Y=[-1,1]\) and let \(f_1, f_2: X \rightarrow Y\) be measurable functions. For \(A\subset X\) measurable and non-empty we define \(L_A(x,y,t)=\varvec{1}_A(x)L(x,y, t)\) with L being the least squares loss. Then the following two inequalities hold:
Proof of Lemma C.2
We begin by proving the first inequality. To this end, we note that the definition of L yields
Now observe that \(y, f_i(x) \in [-1,1]\) implies \((y-f_i(x))^2 \le 4\). Moreover, we also have \((y-f_i(x))^2\ge 0\), and hence we conclude that
Combining these considerations we find
The second inequality can be shown similarly. Namely, we have
where we again used \(f_i(x) \in [-1,1]\). \(\square\)
Lemma C.3
Let \(h: X \rightarrow \mathbb {R}\) be measurable, \(A \subseteq X\), and L be the least-squares loss. Then we have the identity
Proof of Lemma C.3
Given \(x \in X\) and using
we obtain for the difference of inner risks
Thus, since
we arrive at
i.e., we have shown the assertion. \(\square\)
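For the reader's convenience we note that, assuming \({f_{L,P}^*}(x) = \mathbb {E}(Y\mid x)\), the identity in question is presumably the standard bias identity for the least squares loss, which can be obtained as follows:
$$\begin{aligned} \mathbb {E}\bigl ((Y-h(x))^2 \mid x\bigr ) - \mathbb {E}\bigl ((Y-{f_{L,P}^*}(x))^2 \mid x\bigr ) = h(x)^2 - {f_{L,P}^*}(x)^2 - 2\,\mathbb {E}(Y\mid x)\bigl (h(x) - {f_{L,P}^*}(x)\bigr ) = \bigl (h(x) - {f_{L,P}^*}(x)\bigr )^2 , \end{aligned}$$
and integrating over A with respect to \(P_X\) then gives
$$\begin{aligned} {\mathcal {R}}_{L_A,P}(h) - {\mathcal {R}}_{L_A,P}({f_{L,P}^*}) = \int _A \bigl (h - {f_{L,P}^*}\bigr )^2 \, \textrm{d}P_X \,. \end{aligned}$$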
With these preparations we can now present the following key lemma that shows that it suffices to understand the behavior of the good and bad interpolating histogram rules on \(\Delta\) and the behavior of \(h_{D,\mathcal{A}_D}^+\).
Lemma C.4
Let L be the least squares loss, P be a distribution on \(X\times Y\) with point spectrum \(\Delta\), see (20), and \(D \in (X\times Y)^n\) be a data set. Then for all \(s\in (0,1]\) and all \(\rho \ge 0\) the good interpolating histogram rule satisfies
where \(D_X^{+t}\) is defined by (16). Moreover, for all \(s\in (0,1]\) and all \(\rho \ge 0\) the bad interpolating histogram rule satisfies
Proof of Lemma C.4
To simplify notation, we write \(A:= D_X^{+t}{\setminus } \Delta\) and \(B:= X {\setminus }(D_X^{+t}\cup \Delta )\). Note that this yields the partition \(X = \Delta \cup A \cup B\). In addition, we have \(f_{D,s,\rho }^+(x) = h_{D,\mathcal{A}_D}^+(x)\) for all \(x\in X{\setminus } D_X^{+t}\). Using this in combination with \(B\subset X\setminus D_X^{+t}\) as well as the risk decomposition formula (41) and Lemma C.1 we then find
Moreover, Lemma C.2 applied to \(f_1:= f_{D,s,\rho }^+\) and \(f_2:= {f_{L,P}^*}\) implies
In addition, we have
where again we used (41) and Lemma C.1. Combining these estimates we then obtain the assertion for the good interpolating ERM.
To prove the inequality for the bad interpolating histogram rule, we consider the decomposition
where in the first integral we used \({f_{L,P}^\dagger }(x) = {f_{L,P}^*}(x)\) for all \(x\in \Delta\). Now, \(f_{D,s,\rho }^-(x)\in [-1,1]\) and \({f_{L,P}^\dagger }(x) \in [-1,1]\) for all \(x\in X\) gives
Moreover, by (17) we find \(f_{D,s,\rho }^-(x) = -f_{D,s,\rho }^+(x) = - h_{D,\mathcal{A}_D}^+(x)\) for all \(x\in X{\setminus } D_X^{+t}\) and thus also for all \(x\in B\). In addition, \(B\subset X{\setminus } \Delta\) shows \({f_{L,P}^\dagger }(x) = -{f_{L,P}^*}(x)\) for all \(x\in B\). Together, these considerations give \(f_{D,s,\rho }^-(x) - {f_{L,P}^\dagger }(x) = -h_{D,\mathcal{A}_D}^+(x) + {f_{L,P}^*}(x)\) for all \(x\in B\), and consequently we obtain
Combining these considerations finishes the proof. \(\square\)
C.2 Proof of Theorem 2.12
Throughout this section we assume that the general assumptions of Theorem 2.12 are satisfied. In particular, \(D \in (X \times Y)^n\) is an i.i.d. sample of size \(n \ge 1\) and \(D_X:=\{x_1^*,..., x_{m_n}^*\}\in \textrm{Pot}_m(X)\) is the set of input observations. Moreover, \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\) as well as \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\).
C.2.1 The good interpolating histogram rule
We begin by introducing the basic strategy of our proof. To this end, consider the good interpolating histogram rule from Example 2.8 with representation
In view of (3) it suffices to consider the excess risk of \(f_{D,s,\rho }^+\). Now observe that in the case i), i.e. for \(\rho =0\), we have \(t=0\) and thus \(D_X^{+t} = D_X\). Since \(P_X(D_X\setminus \Delta )= 0\) by the definition of \(\Delta\), we then find by Lemma C.4 that
where
Moreover, in the case ii), i.e. for \(\rho =\rho _n>0\) the distribution P satisfies Assumption 2.11, which ensures \(\Delta = \emptyset\). The latter implies \({\mathcal {R}}_2(f_{D,s_n,0}^+) = 0\), and therefore we find by Lemma C.4 that
Moreover, by Assumption 2.11 and \(t\le \rho = \rho _n\) we obtain
and consequently, it suffices to bound \({\mathcal {R}}_1 ( h_{D,\mathcal{A}_D}^+)\). Therefore, the rest of this subsection is devoted to bounding \({\mathcal {R}}_1 ( h_{D,\mathcal{A}_D}^+)\) and \({\mathcal {R}}_2(f_{D,s_n,0}^+)\) individually.
Bounding \({\mathcal {R}}_1 (h_{D,\mathcal{A}_D}^+)\). Thanks to Proposition E.4, we already know that
in probability for \(n\rightarrow \infty\).
Bounding \({\mathcal {R}}_2(f_{D,s_n,0}^+)\). If \(\Delta = \emptyset\) we obviously have \({\mathcal {R}}_2(f_{D,s_n,0}^+) = 0\), and hence we assume \(\Delta \ne \emptyset\) in the following. In this case, \(\Delta\) can be at most countable, and therefore we fix an at most countable enumeration \((\tilde{x}_j)_{j\in J}\) of \(\Delta\), i.e.
Let us further fix an \(\epsilon >0\) and a finite subset \(\Delta _0 \subset \Delta\) such that \(P_X(\Delta \setminus \Delta _0) \le \epsilon\). With the help of (41) and Lemma C.1 we then observe that
Since \(Y=[-1,1]\) is bounded the second difference can be bounded by
Our next step is to bound the first difference in (44). To this end, we write
\(C_j:= \{\tilde{x}_j\}\) for \(j\in J_0\), and \({\mathcal {C}}:= (C_j)_{j\in J_0}\). Then \({\mathcal {C}}\) is a finite partition of \(\Delta _0\), and we set
Since all \(C_j\) are singletons, every measurable function \(f:X\rightarrow Y\) satisfies \(\varvec{1}_{\Delta _0} f \in {\mathcal {F}}_{\mathcal {C}}\). We thus conclude that \(f_D:= \varvec{1}_{\Delta _0}f_{D,s_n,0}^+\in {\mathcal {F}}_{\mathcal {C}}\), too. Moreover, by (40) we know
Our next goal is to show that \(f_D\) minimizes the empirical risk over \({\mathcal {F}}_{\mathcal {C}}\) with respect to \(L_{\Delta _0}\). To this end, we fix a \(j \in J_0\) for which we have \(N_j:=|\{i: x_i \in C_j\}| > 0\). Since \(f_{D,s_n,0}^+\) interpolates D by construction, Proposition 2.7 then gives
Thus, Lemma A.1 shows that \(f_D\) is indeed an empirical risk minimizer with respect to \(L_{\Delta _0}\) and \({\mathcal {F}}_{\mathcal {C}}\).
Our next goal is to apply Theorem 2.10, which holds for all ERM with respect to \(L_{\Delta _0}\) and \({\mathcal {F}}_{\mathcal {C}}\), to our specific ERM \(f_D\). To this end, we first observe, as in the proof of Corollary E.2, that since L is the least squares loss, the assumptions (18) and (19) of Theorem 2.10 are satisfied for \(L_{\Delta _0}\) with \(\vartheta = 1\), \(B=4\), and \(V=16\). Moreover, our assumption \(Y= [-1,1]\) ensures that \(L_{\Delta _0}\) is locally Lipschitz continuous with \(|L_{\Delta _0}|_{1,1} \le 4\). In addition, we have
Applying Theorem 2.10 and optimizing the resulting oracle inequality with respect to \(\varepsilon\) like at the end of the proof of Corollary E.2, we then see that, for all \(n\ge 1\) and \(\tau >0\),
holds with probability \(P^n\) not less than \(1- e^{-\tau }\). Now, to bound the approximation error term, we note that
and hence we easily find
Setting \(\tau := \ln (n)\) we conclude that
holds with probability \(P^n\) not less than \(1- 1/n\). For later use note that this oracle inequality actually holds for all ERMs with respect to \(L_{\Delta _0}\) and \({\mathcal {F}}_{\mathcal {C}}\), since so does Theorem 2.10 and we have not used any property of our specific ERM \(f_D\) to derive (49). Finally, combining this with (44), (45), (47), and the obvious \({\mathcal{R}_{L_{\Delta _0},P}(f_{D,s_n,0}^+)} - {\mathcal{R}_{L_{\Delta _0},P}^{*}}\ge 0\), we conclude that
in probability for \(n\rightarrow \infty\).
C.2.2 The bad interpolating histogram rule
In this subsection we consider the bad interpolating histogram rule from Example 2.9 with representation
Now observe that in the case i) of Theorem 2.12, i.e. for \(\rho =0\), we have \(t=0\) and thus \(D_X^{+t} = D_X\). Since \(P_X(D_X\setminus \Delta )= 0\) by the definition of \(\Delta\), we then see by Lemma C.4 and Proposition E.4 that it suffices to show that
in probability for \(n\rightarrow \infty\). To this end, we fix an \(\epsilon >0\) and a finite \(\Delta _0\subset \Delta\) with \(P_X(\Delta {\setminus } \Delta _0 ) \le \epsilon\). Then we note that the decomposition (44) and the estimate (45) for \(f_{D,s_n,0}^+\) also holds for \(f_{D,s_n,0}^-\). Consequently, it suffices to bound the term
To this end, recall that \(f_{D,s_n,0}^+\) and \(f_{D,s_n,0}^-\) are both interpolating predictors, and hence we have
for all samples \((x_i, y_i)\) of D, and thus in particular for all samples \((x_i, y_i)\) of \(D_0\) with \(x_i \in \Delta\). Let us define \(f_D:= \varvec{1}_{\Delta _0}f_{D,s_n,0}^-\). Combining (51) with (48) we see that \(f_D\) is an empirical risk minimizer over the hypotheses set \({\mathcal {F}}_{\mathcal {C}}\) defined in (46) with respect to \(L_{\Delta _0}\). Since (49) has been shown for all ERMs with respect to \(L_{\Delta _0}\) and \({\mathcal {F}}_{\mathcal {C}}\) we thus find
in probability for \(n\rightarrow \infty\). This finishes the proof in the case i) of Theorem 2.12. Moreover, in the case ii), i.e. for \(\rho =\rho _n>0\) the distribution P satisfies Assumption 2.11, which ensures \(\Delta = \emptyset\). In combination with Lemma C.4 the latter implies
Now, the first term has already been bounded in (43) and the excess risk of \(h_{D,\mathcal{A}_D}^+\) can again be bounded by Proposition E.4.
C.3 Proof of Theorem 2.13 (Learning Rates)
In the following we suppose that all assumptions of Theorem 2.13 are satisfied.
Let us first prove the assertions for the good interpolating histogram rule. To this end, we first recall that Assumption 2.11 implies \(\Delta = \emptyset\). By (3) and Lemma C.4 we then obtain
Now, (43) shows
Moreover, by Theorem 2.5 we know that \(|{{\,\textrm{Im}\,}}(\pi _{m,s})|\le (m+1)^d \le 2^d n^d\) for all \(m\le n\). Consequently, applying Proposition E.5 with \(c=2^d\) and \(\beta := d\) we find
with probability \(P^n\) at least \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\), where \(c_{d,\alpha }>0\) is a constant only depending on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\). Combining this with (52) we then obtain (27).
Finally, inequality (27) for the bad interpolating histogram rule follows analogously, since in this case Lemma C.4 shows
D Learning properties of approximating neural networks
D.1 Auxiliary Results on Functions that can be represented by DNNs
In this section we present some results on algebraic properties of the set of functions that can be represented by DNNs. We particularly focus on the network sizes required to perform algebraic transformations of such functions.
To this end, recall that throughout this work we solely consider the ReLU-activation function \(\sigma := |\cdot |_+\) and its shifted extensions (29). Given an input dimension d, a depth \(\tilde{L}\ge 2\), and a width vector \((p_1,\dots ,p_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\), a function \(f\in {\mathcal {A}}_{p_1,\dots ,p_{\tilde{L}-1}}\) is then of the form (31), i.e.
where each layer \(H_l\), \(l=1,\dots , \tilde{L}\), is of the form (30), where we drop the index for the activation to ease notation. Specifically, each layer can be represented by a \(p_l \times p_{l-1}\) weight matrix \(A^{(l)}\) with \(p_0:= d\) and \(p_{\tilde{L}}:= 1\) and a shift vector \(b^{(l)}\in \mathbb {R}^{p_l}\), and the last layer \(H_{\tilde{L}}\) has the identity as an activation function. In the following, we thus say that the network f is represented by \((\mathfrak A, \mathfrak B)\), where \(\mathfrak A:= (A^{(1)}, \dots , A^{(\tilde{L})})\) and \(\mathfrak B:= (b^{(1)}, \dots , b^{(\tilde{L})})\). For later use we emphasize that \(p_{\tilde{L}} = 1\) implies \(b^{(\tilde{L})}\in \mathbb {R}\). Moreover note that each pair \((\mathfrak A, \mathfrak B)\) determines a neural network, but in general, a neural network, if viewed as a function, can be described by more than one such pair.
Now, our first lemma describes the changes in the representation when manipulating a single neural network.
Lemma D.1
Let \(d\ge 1\), \(\tilde{L}\ge 2\), and \(p:=(p_1,\dots ,p_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\). Moreover, let \(f\in \mathcal{A}_{p}\) be a neural network with representation \(\mathfrak A:= (A^{(1)}, \dots , A^{(\tilde{L})})\) and \(\mathfrak B:= (b^{(1)}, \dots , b^{(\tilde{L})})\). Then the following statements hold true:
(i) For all \(\alpha \in \mathbb {R}\) and \(c \in \mathbb {R}\) we have \(\alpha f + c \in \mathcal{A}_{p}\) with representation
$$\begin{aligned} \bigl (A^{(1)}, \dots , A^{(\tilde{L}-1)}, \alpha A^{(\tilde{L})}\bigr ) \qquad \qquad \text{ and } \qquad \qquad \bigl (b^{(1)}, \dots , b^{(\tilde{L}-1)},\alpha b^{(\tilde{L})}+c\bigr ) \,. \end{aligned}$$
(ii) We have \(|f|_+\in \mathcal{A}_{p,1}\) with representation
$$\begin{aligned} \bigl (A^{(1)}, \dots , A^{(\tilde{L})}, 1\bigr ) \qquad \qquad \text{ and } \qquad \qquad \bigl (b^{(1)}, \dots , b^{(\tilde{L})}, 0\bigr ) . \end{aligned}$$
Proof of Lemma D.1
(i) This immediately follows from the representation (31) and the fact that \(H_{\tilde{L}}\) does not have an activation function.
(ii) Let \(\tilde{H}_1, \dots , \tilde{H}_{\tilde{L}+1}\) be the layers of the neural network \(\tilde{f}\) given by the new representation. Then we have \(H_l = \tilde{H}_l\) for all \(l=1,\dots , \tilde{L}-1\) as well as \(\tilde{H}_{\tilde{L}} = |H_{\tilde{L}}|_+\) and \(\tilde{H}_{\tilde{L}+1} = {{\,\textrm{id}\,}}_\mathbb {R}\). Applying the representation (31) for f and \(\tilde{f}\) then gives the assertion.
\(\square\)
Our next lemma describes a possible representation of the sum of two nets with the same depth \(\tilde{L}\).
Lemma D.2
Let \(d\ge 1\), \(\tilde{L}\ge 2\), and \(\dot{p}:=(\dot{p}_1,\dots ,\dot{p}_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\) and \(\ddot{p}:=(\ddot{p}_1,\dots ,\ddot{p}_{\tilde{L}-1}) \in \mathbb {N}^{\tilde{L}-1}\) be two width vectors. Then for all \(\dot{f}\in \mathcal{A}_{\dot{p}}\) and \(\ddot{f}\in \mathcal{A}_{\ddot{p}}\) we have
In addition, if \((\dot{\mathfrak A},\dot{\mathfrak B})\) and \((\ddot{\mathfrak A},\ddot{\mathfrak B})\) are representations of \(\dot{f}\) and \(\ddot{f}\), then \(\dot{f} + \ddot{f}\) has the representation \(\mathfrak A:= (A^{(1)}, \dots , A^{(\tilde{L})})\) and \(\mathfrak B:= (b^{(1)}, \dots , b^{(\tilde{L})})\) defined by
as well as
for all \(l=2,\dots , \tilde{L}-1\) and
Proof of Lemma D.2
Let \(\dot{H}_{1}, \dots , \dot{H}_{\tilde{L}}\) be the layers of \(\dot{f}\) and \(\ddot{H}_{1}, \dots , \ddot{H}_{\tilde{L}}\) be the layers of \(\ddot{f}\). For \(l =1,\dots ,\tilde{L}\), we further introduce the concatenation of layers
Moreover, for \(l =1,\dots ,\tilde{L}\), let \(H_l\) be the layer given by \(A^{(l)}\) and \(b^{(l)}\) and \(W_l:= H_{l}\circ \dots \circ H_1\). Since the last layers of \(\dot{f}\) and \(\ddot{f}\) do not have an activation function, we then find
for all \(x\in \mathbb {R}^d\). Similarly, for all \(l=2,\dots ,\tilde{L}-1\) and all \(x\in \mathbb {R}^d\) we have
Finally, for the first layer and all \(x\in \mathbb {R}^d\) we obtain
Combining these results gives \(W_l =(\dot{W}_{l}, \ddot{W}_{l} )^T\) for all \(l=1,\dots ,\tilde{L}\), i.e. we have found the assertion. \(\square\)
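A minimal numpy sketch of this block-diagonal construction (hypothetical helper names; depth and widths chosen arbitrarily) confirms that the stacked representation computes the sum of the two networks:

```python
import numpy as np

def forward(weights, biases, x):
    """Evaluate a ReLU network; the last layer has no activation function."""
    a = x
    for A, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(A @ a + b, 0.0)
    return weights[-1] @ a + biases[-1]

rng = np.random.default_rng(3)
d = 4                                                    # input dimension

def random_net(hidden_widths):
    dims = [d] + hidden_widths + [1]
    return ([rng.normal(size=(q, p)) for p, q in zip(dims[:-1], dims[1:])],
            [rng.normal(size=q) for q in dims[1:]])

(Wd, bd), (Wdd, bdd) = random_net([5, 6]), random_net([7, 2])

# representation of the sum: stack the first layers, use block-diagonal weights
# in the remaining hidden layers, concatenate the output weights, add the output biases
W = [np.vstack([Wd[0], Wdd[0]])]
b = [np.concatenate([bd[0], bdd[0]])]
for l in range(1, len(Wd) - 1):
    W.append(np.block([[Wd[l], np.zeros((Wd[l].shape[0], Wdd[l].shape[1]))],
                       [np.zeros((Wdd[l].shape[0], Wd[l].shape[1])), Wdd[l]]]))
    b.append(np.concatenate([bd[l], bdd[l]]))
W.append(np.hstack([Wd[-1], Wdd[-1]]))
b.append(bd[-1] + bdd[-1])

x = rng.normal(size=d)
assert np.allclose(forward(W, b, x), forward(Wd, bd, x) + forward(Wdd, bdd, x))
print("stacked network equals the sum of the two networks")
```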
D.2 Approximating step functions by DNNs
In this section we collect the main pieces to approximate histograms with DNNs. The first lemma, which is a longer and more detailed version of Lemma 3.3, shows how to approximate an indicator function on a multidimensional interval by a small ReLU-DNN with two hidden layers.
Lemma D.3
Let \(d\ge 1\) and let \(z_1 = (z_{1,1},\dots ,z_{1,d})\in {\mathbb {R}^d}\) and \(z_2 = (z_{2,1},\dots ,z_{2,d})\in {\mathbb {R}^d}\) be two vectors with \(z_{1} < z_{2}\). Moreover, let \(\varepsilon >0\) satisfy
and define
where \(I_d\) denotes the d-dimensional identity matrix, and \(A^{(3)}, b^{(2)}, b^{(3)}\in \mathbb {R}\). Then the neural network \(f_\varepsilon :\mathbb {R}^d\rightarrow \mathbb {R}\) given by the representation \(\mathfrak A:= (A^{(1)}, A^{(2)}, A^{(3)})\) and \(\mathfrak B:= (b^{(1)}, b^{(2)}, b^{(3)})\) satisfies \(f_\varepsilon \in {\mathcal {A}}_{2d,1}\) and
Proof of Lemma D.3
Let \(H_1, H_2, H_3\) be the layers of \(f_\varepsilon\). Then we have \(H_3 = {{\,\textrm{id}\,}}_\mathbb {R}\) and if \(h_1^{(1)}, \dots , h_d^{(1)}, h_1^{(2)},\dots , h_d^{(2)}\) denote the 2d component functions of \(H_1\), that is
we thus find
for all \(x\in \mathbb {R}^d\). Therefore, we first investigate the functions \(1- h_i^{(1)} - h_i^{(2)}\). To this end, let us fix an \(i\in \{1,\dots ,d\}\) and an \(x=(x_1,\dots ,x_d)\in {\mathbb {R}^d}\). Then we obviously have
and
Since \(z_{1,i}+\varepsilon < z_{2,i}-\varepsilon\), we consequently find
In particular, we have
Combining our initial equation (57) with (59) and (60) yields
i.e., we have shown (55). Next we will verify (54). To this end, we first note that (57) gives
Our next intermediate goal is to show
To this end, we assume the converse, i.e. there is an \(x\in {\mathbb {R}^d}\) and an \(i_0\in \{1,\dots ,d\}\) with
Without loss of generality we may assume that \(i_0 = d\). Then combining both inequalities we find
and this shows that there is also an \(i\in \{1,\dots ,d-1\}\) with \(1 - h_i^{(1)}(x) - h_i^{(2)}(x) > 1\). This contradicts (60), and hence we have shown (62). Now combining (61) with (62) and (58) we obtain
i.e. we have found (54). Finally, the equation \(\{ f_\varepsilon > 1\} = \emptyset\) immediately follows from combining (57) and (60), and \(\{ f_\varepsilon < 0\} = \emptyset\) is a direct consequence of (57). \(\square\)
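Since the explicit weight matrices are not displayed above, the following sketch uses the standard ramp construction, which is assumed to coincide with the one of Lemma D.3; it realizes \(f_\varepsilon\) with two hidden layers (2d and 1 ReLU neurons) in numpy and checks the properties (54)–(56) numerically (all names are hypothetical):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def box_indicator_net(z1, z2, eps):
    """ReLU net that is 1 on [z1+eps, z2-eps], 0 outside (z1, z2), and in [0, 1]."""
    z1, z2 = np.asarray(z1, float), np.asarray(z2, float)
    d = z1.size

    def f(x):
        x = np.atleast_2d(x)
        h1 = relu((z1 + eps - x) / eps)                  # first hidden layer, units 1..d
        h2 = relu((x - z2 + eps) / eps)                  # first hidden layer, units d+1..2d
        s = np.sum(1.0 - h1 - h2, axis=1)                # per-coordinate ramps
        return relu(s - (d - 1))                         # second hidden layer + identity output

    return f

d, eps = 2, 0.05
z1, z2 = -0.5 * np.ones(d), 0.5 * np.ones(d)
f = box_indicator_net(z1, z2, eps)

x = np.random.default_rng(4).uniform(-1, 1, size=(100000, d))
vals = f(x)
inner = np.all((z1 + eps - x <= 0) & (x - z2 + eps <= 0), axis=1)
outer = np.any((x <= z1) | (x >= z2), axis=1)
assert np.all(np.abs(vals[inner] - 1.0) < 1e-12)         # f = 1 on the shrunken cube, cf. (55)
assert np.all(np.abs(vals[outer]) < 1e-12)               # f = 0 outside (z1, z2),     cf. (54)
assert np.all((vals > -1e-12) & (vals < 1.0 + 1e-12))    # 0 <= f <= 1,                cf. (56)
print("all three properties verified on", len(x), "random points")
```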
Our next goal is to describe how well the function \(f_\varepsilon\) found in Lemma D.3 approximates the indicator function \(\varvec{1}_{[z_1,z_2]}\). To this end, we first recall a well-known estimate on \(\Vert \cdot \Vert _\infty\)-covering numbers of cuboids in the following lemma. We include its proof for the sake of completeness.
Lemma D.4
Let \(s_1,\dots ,s_d >0\), \(s_{\textrm{min}}:= \min \{s_1,\dots ,s_d\}\), \(z\in {\mathbb {R}^d}\), and
Then for all \(\varepsilon \in (0,s_{\textrm{min}}]\) we have
Proof of Lemma D.4
Let us fix an \(i\in \{1,\dots ,d\}\). Since \(\varepsilon \le s_i\), we then need at most \(\lceil \frac{s_i}{2\varepsilon }\rceil\) closed intervals of length \(2\varepsilon\) to cover the interval \([z_i, z_i + s_i]\). From this it is easy to conclude that
and hence we have shown the assertion. \(\square\)
Now, the next lemma provides the announced description of the approximation error.
Lemma D.5
Let \(z_1, z_2\in [-1,1]^d\) and let \(\varepsilon >0\) be as in Lemma D.3. Moreover, let \(A\subset [-1,1]^d\) be a subset satisfying \((z_1,z_2) \subset A\subset [z_1,z_2]\). Then the neural network \(f_\varepsilon \in \mathcal{A}_{2d,1}\) constructed in Lemma D.3 satisfies
Moreover, if A is a cube of side length \(s>0\), that is \(z_{2,i}-z_{1,i} = s\) for all \(i=1,\dots ,d\), and we have a distribution \(P_X\) on \([-1,1]^d\) that satisfies Assumption 2.11 for some \(\varphi : \mathbb {R}_+ \rightarrow \mathbb {R}_+\), then we further have
Proof of Lemma D.5
By (55) and (54) we find the inclusions \(\{ f_\varepsilon = 1\} = [z_1 + \varepsilon , z_2 - \varepsilon ] \subset A\) and \(\{ f_\varepsilon > 0 \} \subset (z_1, z_2 ) \subset A\). Using \(\{ f_\varepsilon < 0\} = \emptyset\), which is known by (56), we then obtain
i.e. we have shown (63). Now, to establish (64), we first note that (63) together with \(A\subset [z_1,z_2]\) implies
To further bound \([z_1,z_2] \setminus (z_1 + \varepsilon , z_2 - \varepsilon )\) we define
Then we have \([z_1,z_2] {\setminus } (z_1 + \varepsilon , z_2 - \varepsilon ) \subset S_1^-\cup \dots \cup S_d^- \cup S_1^+\cup \dots \cup S_d^+\), and hence we obtain
Now observe that since A is a cube with side length s, the sets \(S_i^-\) and \(S_i^+\) are cuboids with side lengths \(s_1,\dots ,s_d\), where \(s_i = \varepsilon\) and \(s_j = s\) for all \(j\ne i\). Applying Lemma D.4 then shows
and combining with Assumption 2.11 we obtain
Inserting this estimate into (65) yields (64). \(\square\)
As a second step in our construction presented in Subsection 3.1 we combine Lemmas D.1 and D.2 with Lemma D.3 to approximate step-functions on cubic partitions by ReLU-DNNs with two hidden layers.
Proposition D.6
Let \(A_1,\dots ,A_k\) be mutually disjoint subsets of \(X:= [-1,1]^d\) such that for each \(i\in \{1,\dots ,k\}\) there exist \(z_i^-, z_i^+\in X\) with \(z_i^-< z_i^+\) and \((z_i^-, z_i^+) \subset A_i \subset [z_i^-, z_i^+]\). Moreover, let \(z_{i,j}^\pm\) be the j-th coordinate of \(z_i^\pm\). Then for all \(g:X\rightarrow \mathbb {R}\) of the form
with \(\alpha _i \in \mathbb {R}\), all \(\varepsilon >0\) satisfying
and all \(m_1 \ge 2d k\) and \(m_2\ge k\), there exists a neural network \(f_\varepsilon \in \mathcal{A}_{m_1,m_2}\) such that
and \(\Vert f_\varepsilon \Vert _\infty = \max \{|\alpha _1|,\dots ,|\alpha _k|\}\). In addition, if \(A_1,\dots ,A_k\) are cubes of side length \(s>0\), i.e. \(z_i^+-z_i^- = (s,\dots , s)\in \mathbb {R}^d\) for all \(i=1,\dots ,k\), and \(P_X\) is a distribution on \([-1,1]^d\) that satisfies Assumption 2.11 for some \(\varphi : \mathbb {R}_+ \rightarrow \mathbb {R}_+\), then we further have
Proof of Proposition D.6
Since \(\mathcal{A}_{2dk, k}\subset \mathcal{A}_{m_1,m_2}\) it suffices to find an \(f_\varepsilon\) with the desired properties in \(\mathcal{A}_{2dk, k}\). By assumption and Lemma D.3, we find, for all \(\varepsilon >0\) and \(i=1,\dots , k\), a neural network \(f^{(\varepsilon )}_i \in {\mathcal {A}}_{2d,1}\), and Lemma D.5 shows that
Moreover, for any \(\alpha _i \in \mathbb {R}\), Lemma D.1 ensures \(\alpha _i f^{(\varepsilon )}_i \in {\mathcal {A}}_{2d,1}\) with
Now, applying Lemma D.2 shows that
belongs to \({\mathcal {A}}_{2kd,k}\), and since we have
for all \(i\ne l\), our previous considerations give us
Finally, the identity \(\Vert f_\varepsilon \Vert _\infty = \max \{|\alpha _1|,\dots ,|\alpha _k|\}\) follows from (67) and \(\Vert f_i^{(\varepsilon )} \Vert _\infty = |\alpha _i|\) for all \(i=1,\dots ,k\) and the bound on \(P_X \bigl ( \{ f_\varepsilon \ne \varvec{1}_{A} \} \bigr )\) is a direct consequence of (64). \(\square\)
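Combining the previous two sketches, a step function on a partition can be approximated by a weighted sum of box-indicator networks; by Lemma D.2 the summands can then be realized in parallel inside a single network with \(2dk\) and k hidden neurons. A toy check in one dimension (hypothetical names; the ramp construction is the same assumption as before):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def box_indicator(z1, z2, eps, x):
    """Ramp approximation of the indicator of [z1, z2] (as in the sketch above)."""
    h1 = relu((z1 + eps - x) / eps)
    h2 = relu((x - z2 + eps) / eps)
    return relu(np.sum(1.0 - h1 - h2, axis=1) - (x.shape[1] - 1))

def step_function_net(cells, alphas, eps, x):
    """Weighted sum of box-indicator networks approximating sum_i alpha_i 1_{A_i}."""
    return sum(a * box_indicator(z1, z2, eps, x) for (z1, z2), a in zip(cells, alphas))

s, eps = 0.5, 1e-3
edges = np.linspace(-1.0, 1.0, 5)                        # four cells of width s = 0.5
cells = [(np.array([lo]), np.array([hi])) for lo, hi in zip(edges[:-1], edges[1:])]
alphas = np.array([0.3, -1.0, 0.7, 0.1])

x = np.random.default_rng(5).uniform(-1, 1, size=(50000, 1))
g = alphas[np.clip(np.digitize(x[:, 0], edges) - 1, 0, len(cells) - 1)]  # the step function
f = step_function_net(cells, alphas, eps, x)
print("fraction of points where the net differs from the step function:",
      np.mean(np.abs(f - g) > 1e-9))
```

The printed fraction is of the order of the total mass of the thin boundary strips, in line with the bound on \(P_X(\{f_\varepsilon \ne g\})\) in Proposition D.6.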
D.3 Proof of Main Theorem 3.7
Throughout this section we assume that the general assumptions of Theorem 3.7 are satisfied. In particular, \(D \in (X \times Y)^n\) is an i.i.d. sample of size \(n \ge 1\) and \(D_X:=\{x_1^*,..., x_{m_n}^*\}\in \textrm{Pot}_m(X)\) is the set of input observations. Moreover, \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\), \(s_n^d > 2^d/n\) and \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\). In addition, we let \((\rho _n)_{n \in \mathbb {N}}\) be a non-negative sequence with \(\rho ^d_n \le 2^d/n\) and \(\rho _n^{-d} \varphi ( \rho _n ) \rightarrow 0\) for \(n\rightarrow \infty\). Finally, let \((\varepsilon _n)_{n \in \mathbb {N}}\) and \((\delta _n)_{n \in \mathbb {N}}\) be positive sequences with \(\varepsilon _n = \delta _n = \rho _n/2\).
We first show our claim for the good interpolating DNN from Example 3.6, which has the representation
with \(t=\min \{r, \rho _n\}\) and associated \({\mathcal {H}}_{{\mathcal {A}}}^{(\epsilon )}\)-part
We split the excess risk into three different terms
Convergence of the first term follows from Lemma C.2 and by exploiting Assumption 2.11. We obtain
Hence, by our assumption on \(\rho _n\) we may conclude
in probability for \(|D|\rightarrow \infty\).
For bounding the second term in (68) we recall that \(|J|\le \left( \frac{2}{s_n}\right) ^d\). Lemma C.2 and Proposition D.6 yield (see Note 3)
Hence, our assumption on \(\rho _n\) ensures
in probability for \(|D|\rightarrow \infty\). Finally, convergence of the last term in (68) is easily derived with the help of Proposition E.4 and we conclude that
in probability for \(|D|\rightarrow \infty\).
We now turn to considering the bad interpolating DNN from Example 3.6. Since we have \(g_{D,s_n, \rho _n}^-(x) \in [-1,1]\) and \(f_{L,P}^\dagger (x) \in [-1,1]\) for all \(x \in X\), Assumption 2.11 gives
Moreover, for all \(x \in X{\setminus } D_X^{+t}\) we have \(g_{D,s_n, \rho _n}^-(x) = -h_{D, {\mathcal {A}}_D}^{+, (\varepsilon _n)}\) and \(f_{L,P}^\dagger (x)= -f^*_{L,P}(x)\). Hence
Combining both considerations with the first part of the proof then shows that, in probability for \(|D|\rightarrow \infty\),
Finally, since \(|J| \le (\frac{2}{s_n})^d \le n\), Proposition D.6 shows that \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\).
D.4 Proof of Main Theorem 3.8
Let all assumptions of Theorem 3.8 be satisfied. Moreover, we let \((\varepsilon _n)_{n \in \mathbb {N}}\) and \((\delta _n)_{n \in \mathbb {N}}\) be positive sequences with \(\varepsilon _n = \delta _n = \rho _n/2\). We prove the result for the good interpolating DNN by reconsidering (68). Indeed, by our assumption we have \(\rho _n^{-d} \varphi (\rho _n) \le \ln (n) n^{-2/3}\), and thus (69) leads to
Moreover, (70) gives
Finally, (53) shows with probability \(P^n\) at least \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\)
where \(c_{d,\alpha }>0\) is a constant only depending on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\). Collecting the above considerations shows the first part of the theorem.
Now, coming to the bad interpolating DNN, we derive from (71)
Moreover, combining the results from (70) and (72) gives with probability \(P^n\) at least \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\)
where \(c'_{d,\alpha }= 4\cdot d \cdot 6^{d} + c_{d,\alpha }\). Thus, with probability \(P^n\) at least \(1- 2^dn^{1+d} e^{-n^{d \gamma }}\)
where \(c''_{d,\alpha } = 4 + c'_{d,\alpha }\).
Finally, we have \(|J| \le (\frac{2}{s_n})^d = 2^d n^{\frac{d}{2\alpha +d }}\le n\), provided \(n \ge n_{d, \alpha }\), for some \(n_{d, \alpha } \in \mathbb {N}\), depending on d and \(\alpha\). Proposition D.6 shows then that \(g_{D,s_n, \rho _n}^{\pm }\in {\mathcal {A}}_{4dn, 2n}\).
E Uniform bounds for histograms based on data-dependent partitions
E.1 A generic oracle inequality for empirical risk minimization
If not stated otherwise, we assume throughout this subsection that X is an arbitrary non-empty set that is equipped with some \(\sigma\)-algebra. We write \({\mathcal {L}}_\infty\) for the corresponding set of all bounded, measurable functions \(f:X\rightarrow \mathbb {R}\). Moreover, \(Y\subset \mathbb {R}\) is assumed to be measurable. Following (Steinwart and Christmann (2008), Definition 2.18) we say that a measurable loss \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) is locally Lipschitz continuous if for all \(a\ge 0\) there exists a constant \(c_{a}\ge 0\) such that
Moreover, for \(a\ge 0\), the smallest such constant \(c_{a}\) is denoted by \(|L|_{a,1}\).
In addition, we need the notion of covering numbers, which is recalled in the following definition.
Definition E.1
Let (T, d) be a metric space and \(\varepsilon >0\). We call \(S\subset T\) an \(\varepsilon\)-net of T if for all \(t\in T\) there exists an \(s\in S\) with \(d(s,t)\le \varepsilon\). Moreover, the \(\varepsilon\)-covering number of T is defined by
where \(\inf \emptyset := \infty\) and \(B_{d}(s,\varepsilon ):=\{t\in T: d(t,s)\le \varepsilon \}\) denotes the closed ball with center \(s\in T\) and radius \(\varepsilon\).
Moreover, if (T, d) is a subspace of a normed space \((E,\Vert \cdot \Vert )\) and the metric is given by \(d(x,x') = \Vert x-x' \Vert\), \(x,x'\in T\), we write \(\mathcal{N}(T,\Vert \cdot \Vert ,\varepsilon ):= \mathcal{N}(T,d,\varepsilon )\).
Finally, we need to fix some notation related to empirical risk minimization. To this end, we fix a loss function \(L:X\times Y\times \mathbb {R}\rightarrow [0,\infty )\) and an \(\mathcal{F}\subset \mathcal{L}_{\infty }(X)\). Given a distribution P on \(X\times Y\), we denote the smallest possible risk attained by functions in \(\mathcal{F}\) by \({\mathcal{R}_{L,P,\mathcal{F}}^*}\), that is
Finally, following (Steinwart and Christmann (2008), Definition 6.2), we say that an ERM method \(D\mapsto f_D\) with respect to L and \(\mathcal{F}\) is measurable, if for all \(n\ge 1\) the map
is measurable with respect to the universal completion of the product \(\sigma\)-algebra of the product space \((X\times Y)^n \times X\). Recall from (Steinwart and Christmann (2008), Lemma 6.17) that for closed, separable \(\mathcal{F} \subset \mathcal{L}_{\infty }(X)\) for which there exists an ERM, there also exists a measurable ERM. Moreover, in this case the map
is also measurable with respect to the universal completion of the product \(\sigma\)-algebra of \((X\times Y)^n\), see (Steinwart and Christmann (2008), Lemma 6.3). In the following, we thus assume that \((X\times Y)^n\) is equipped with this universal completion. Finally, for a loss L and a function \(f \in {\mathcal {F}}\), we denote by \(L \circ f: X \times Y \rightarrow [0, \infty )\) the map \((x,y) \mapsto L(x,y, f(x))\).
With the help of these notions we can now prove the generic oracle inequality for empirical risk minimizers.
Proof of Theorem 2.10
We first note that (18) ensures \({\mathcal{R}_{L,P}(f_D)} - {\mathcal{R}_{L,P}^{*}}\le B\) and since we have additionally assumed \(V\ge B^{2-\vartheta }\), we see that it suffices to consider sample sizes \(n\ge 16\tau\).
Given an \(f\in \mathcal{F}\), we define \(h_f:=L\circ f-L\circ {f_{L,P}^*}\). Let us now fix an \({f_0}\in \mathcal{F}\) and a data set \(D\in (X\times Y)^n\). Since \(f_D\) is an empirical risk minimizer, we have \({\mathcal{R}_{L,D}(f_D)}\le {\mathcal{R}_{L,D}({f_0})}\), and hence we find \(\mathbb {E}_D h_{f_D} \le \mathbb {E}_D h_{f_0}\). As a consequence, we obtain
To bound the first difference in (74) we first observe that for \(f,f'\in \mathcal{F}\), \(x\in X\), and \(y\in Y\) the local Lipschitz continuity of L gives
and thus we have \(\Vert h_f-h_{f'} \Vert _\infty \le |L|_{M,1} \cdot \Vert f-f' \Vert _\infty\) for all \(f,f'\in \mathcal{F}\). Now, let \(\mathcal{C}\subset \mathcal{F}\) be a minimal \(\varepsilon\)-net of \(\mathcal{F}\) with respect to \(\Vert \cdot \Vert _\infty\). For a data set \(D\in (X\times Y)^n\) there then exists an \(f\in \mathcal{C}\) such that \(\Vert f-f_D \Vert _\infty \le \varepsilon\), and hence \(\Vert h_{f_D} - h_f \Vert _\infty \le |L|_{M,1} \cdot \varepsilon\). This yields
For \(f\in \mathcal{C}\) and \(r>0\) we now define the function
It is easy to see that both \(\mathbb {E}_P g_{f,r}=0\) and \(\Vert g_{f,r} \Vert _\infty \le 2Br^{-1}\) hold. In addition, in the case \(\vartheta >0\) and \(b:= \mathbb {E}_P h_f\ne 0\), setting \(q:= \frac{2}{2-\vartheta }\), \(q':= \frac{2}{\vartheta }\), and \(a:= r\) in the second inequality of (Steinwart and Christmann (2008), Lemma 7.1) shows
Furthermore, in the case \(\vartheta >0\) and \(\mathbb {E}_P h_f=0\), the variance bound (19) gives \(\mathbb {E}_P h_f^2=0\), and hence we have \(\mathbb {E}_P g_{f,r}^2 \le V{r^{\vartheta -2}}\). Finally, in the case \(\vartheta = 0\), we have \(\mathbb {E}_P g_{f,r}^2 \le \mathbb {E}_P h_f^2 \, r^{-2} \le V{r^{\vartheta -2}}\). In summary, we have thus found
in all cases. By applying Bernstein’s inequality in the form of (Steinwart and Christmann (2008), Theorem 6.12) in combination with a union bound we thus find
for all \(r>0\). Let us now pick a data set \(D\in (X\times Y)^n\) that satisfies the above inequality, that is
For an \(f\in \mathcal{C}\) with \(\Vert f-f_D \Vert _\infty \le \varepsilon\), Inequality (75) together with the definition of \(g_{f,r}\) then gives
Our next goal is to estimate the second difference (74), that is \(\mathbb {E}_D h_{f_0}- \mathbb {E}_P h_{f_0}\). Let us first consider the case \(\vartheta >0\). Here, we have both \(\Vert h_{f_0}- \mathbb {E}_P h_{f_0} \Vert _\infty \le 2B\) and
Furthermore, setting \(q:= \frac{2}{2-\vartheta }\), \(q':= \frac{2}{\vartheta }\), \(a:=\bigl (\frac{2^{1-\vartheta }\vartheta ^\vartheta V\tau }{n} \bigr )^{1/2}\), and \(b:= \bigl (\frac{2\mathbb {E}_P h_{f_0}}{\vartheta }\bigr )^{\vartheta /2}\) in (Steinwart and Christmann (2008), Lemma 7.1) yields
By another application of Bernstein’s inequality we consequently find that
holds with probability \(P^n\) not less than \(1-e^{-\tau }\). Finally, in the case \(\vartheta =0\), Hoeffding’s inequality in combination with \(\Vert h_{f_0} \Vert _\infty \le B\le \sqrt{V}\) also yields (78).
To finish the proof, we now combine (74), (76), (77), and (78). As a result we see that
holds with probability \(P^n\) not less than \(1-(1+|\mathcal{C}|)e^{-\tau }\). In the following, we fix a data set D, for which this inequality holds. Defining
a simple calculation then shows both
Moreover, \(V\ge B^{2-\vartheta }\) together with \(n\ge 16\tau\) gives
Finally, we have
Inserting these estimates in our inequality on \(\mathbb {E}_Ph_{f_D}\) gives
and by elementary transformations we thus conclude that
Now the assertion follows by a simple algebraic transformation of \(\tau\) and taking the infimum over all \({f_0}\in \mathcal{F}\). \(\square\)
If we have an upper bound on the covering numbers occurring in Theorem 2.10, then we can optimize the right hand side of its oracle inequality with respect to \(\varepsilon\). The following corollary executes this idea for the least squares loss and histogram rules that choose their cubic partitions in a certain, data-dependent way.
Corollary E.2
Let \(Y = [-M,M]\) and let L be the least squares loss. For \(K<\infty\) and \(A<\infty\) let \({\mathcal {A}}_1, \dots ,{\mathcal {A}}_K\) be finite partitions of X satisfying \(|{\mathcal {A}}_i|\le A\) for all \(i=1,\dots ,K\). Moreover, let \(D \mapsto h_{D, \mathcal{A}_D }\) be an algorithm that first chooses a partition \(\mathcal{A}_D\) from \({\mathcal {A}}_1, \dots , {\mathcal {A}}_K\) and then computes the corresponding \({\mathcal {A}}_D\)-histogram. Then, for all \(n\ge 1\) and \(\tau >0\), we have
with probability \(P^n\) not less than \(1- Ke^{-\tau }\).
Proof of Corollary E.2
Since L is the least squares loss, the assumptions (18) and (19) of Theorem 2.10 are satisfied with \(\vartheta = 1\), \(B=4\,M^2\), and \(V=16\,M^2\). Moreover, our assumption \(Y\subset [-M,M]\) ensures that L is locally Lipschitz continuous with \(|L|_{M,1} \le 4\,M\).
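For completeness, these constants can be verified directly (our own computation; the precise statements of conditions (18) and (19) are given in the main text): for \(y,t,t'\in [-M,M]\) the least squares loss satisfies
\[ |L(y,t)-L(y,t')| = \bigl |(t'-t)(2y-t-t')\bigr | \le 4M\, |t-t'| \qquad \text {and}\qquad L(y,t)=(y-t)^2\le 4M^2 , \]
which yields \(|L|_{M,1}\le 4M\) and a supremum bound of size \(4M^2\). Moreover, since the considered histograms take values in \([-M,M]\), the excess loss \(h_f:= L\circ f - L\circ {f_{L,P}^*}\) factorizes as \(h_f(x,y)=\bigl ({f_{L,P}^*}(x)-f(x)\bigr )\bigl (2y-f(x)-{f_{L,P}^*}(x)\bigr )\), and together with \(\mathbb {E}_P h_f=\Vert f-{f_{L,P}^*} \Vert _{L_2(P_X)}^2\) this gives \(\mathbb {E}_P h_f^2\le 16M^2\, \mathbb {E}_P h_f\), that is, \(\vartheta =1\) and \(V=16M^2\).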
Now, for a fixed \(i\in \{1,\dots ,K\}\) we recall that the histogram rule \(D\mapsto h_{D,\mathcal{A}_i}\) is an empirical risk minimizer over the hypotheses class
where \({\mathcal {A}}_i=(A_j)_{j \in J}\). Moreover, for any \(\varepsilon >0\), the \(\varepsilon\)-covering number of \({\mathcal {H}}_{{\mathcal {A}}_i}\) satisfies
For \(n\ge 1\), \(\tau >0\), and \(\varepsilon >0\) Theorem 2.10 thus gives
with probability \(P^n\) not less than \(1- e^{-\tau }\).
Next we optimize this bound over \(\varepsilon >0\). To this end, we consider the strongly convex function
where \(\alpha := 20\,M\), \(\beta := \frac{512AM^2}{n}\), and \(\gamma := 2\,M\). Then a simple calculation shows that h has a minimum at \(\varepsilon ^*:= \frac{\beta }{\alpha }\), giving
Inserting this estimate in our above oracle inequality obtained from Theorem 2.10, using the fact that
and finally applying a simple union bound then gives the assertion. \(\square\)
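To make the algorithm of Corollary E.2 concrete, the following Python sketch implements a least squares histogram on a cubic partition of \(X=[-1,1]^d\) with width s, predicting the cell-wise mean of the responses and 0 on empty cells. The width selection by a hold-out split is only one possible data-dependent choice among finitely many candidate partitions; all function names and the tie-breaking at cell boundaries are our own illustrative conventions and not taken from the paper.

```python
import numpy as np

def cell_index(X, s):
    """Map points in [-1, 1]^d to the integer index of their cubic cell of width s."""
    idx = np.floor((X + 1.0) / s).astype(int)
    n_cells = int(np.ceil(2.0 / s))
    # Points on the right boundary fall into the last cell.
    return np.clip(idx, 0, n_cells - 1)

def fit_histogram(X, y, s):
    """A_D-histogram for the cubic partition of width s: cell-wise mean of y."""
    keys = [tuple(k) for k in cell_index(X, s)]
    sums, counts = {}, {}
    for k, yi in zip(keys, y):
        sums[k] = sums.get(k, 0.0) + yi
        counts[k] = counts.get(k, 0) + 1
    means = {k: sums[k] / counts[k] for k in sums}

    def predict(X_new):
        # Empty cells receive the prediction 0, as for the infinite sample histogram.
        return np.array([means.get(tuple(k), 0.0) for k in cell_index(X_new, s)])

    return predict

def select_width(X, y, widths=(1.0, 0.5, 0.25, 0.125)):
    """Data-dependent choice of the partition from finitely many candidates
    via a simple hold-out split (one admissible choice in Corollary E.2)."""
    m = len(y) // 2
    best_s, best_err = widths[0], np.inf
    for s in widths:
        err = np.mean((y[m:] - fit_histogram(X[:m], y[:m], s)(X[m:])) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s

# Illustrative usage on synthetic data with X = [-1, 1]^2 and Y = [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = np.clip(np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=500), -1.0, 1.0)
h_D = fit_histogram(X, y, select_width(X, y))  # the chosen A_D-histogram
```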
E.2 Learning properties of histograms
The first lemma describes how well the infinite sample histogram rules defined in (7) can approximate the least squares Bayes risk.
Lemma E.3
(Approximation Error) Let L be the least squares loss, \(X:= [-1,1]^d\), \(Y= [-1,1]\), and P be a distribution on \(X\times Y\). Then, for all \(\varepsilon > 0\), there exists an \(s_\varepsilon >0\) such that for any cubic partition \(\mathcal{A}\) of X with width \(s \in (0, s_\varepsilon ]\) one has
Moreover, if \({f_{L,P}^*}\) is \(\alpha\)-Hölder continuous for some \(\alpha \in (0,1]\), then for all \(s\in (0,1]\) and all cubic partitions \(\mathcal{A}\) of X with width s we have
Proof of Lemma E.3
For the proof of the first assertion we fix an \(\varepsilon >0\). Then recall that there exists a continuous function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) with compact support such that
see e.g. (Bauer (2001), Theorem 29.14 and Lemma 26.2). Moreover, since \(\Vert {f_{L,P}^*} \Vert _\infty \le 1\), we can assume without loss of generality that \(\Vert f \Vert _\infty \le 1\). Now, since f is continuous and has compact support, f is uniformly continuous, and hence there exists a \(\delta \in (0,1]\) such that for all \(x,x'\in X\) with \(\Vert x-x' \Vert _\infty \le \delta\) we have
We define \(s_\varepsilon := \delta\). Now, we fix a cubic partition \(\mathcal{A} = (A_j)_{j\in J}\) of X with width \(s\in (0,s_\varepsilon ]\). For \(x\in X\) with \(P_X(A(x)) > 0\) we then have
For such x we then define
For the remaining \(x\in X\) we simply set \(\bar{f}(x):= 0\). With these preparations we then have
Clearly, (80) shows that the third term is bounded by \(\varepsilon\). Let us now consider the second term. Here we first note that for an \(x\in X\) with \(P_X(A(x)) > 0\) we have
where in the last step we used (81). Consequently, we obtain
In other words, the second term is bounded by \(\varepsilon\), too. Let us finally consider the first term. Here we have
Consequently, the first term is bounded by \(\varepsilon\), too, and hence we conclude by (83) that the excess risk satisfies
A simple variable transformation then yields the first assertion.
To show the second assertion we first note that for all \(x,x'\in X\) we have
For \(s\in (0,1]\), \(\varepsilon := |{f_{L,P}^*}|_\alpha \cdot s^\alpha\), and \(x,x'\in X\) with \(\Vert x-x' \Vert _\infty \le s\) we thus find
Now consider \(f: = {f_{L,P}^*}\) and fix an arbitrary cubic partition \(\mathcal{A}\) of X with width s. Then \(\bar{f}\) defined by (82) is given by \(\bar{f} = h_{P,{\mathcal {A}}}\). Moreover, we have
where in the last step we used (84) and (85). \(\square\)
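For orientation, we record the quantitative content of the second assertion as we read it (our own summary; Hölder continuity is measured here with respect to \(\Vert \cdot \Vert _\infty\), and a different norm on X may introduce an additional d-dependent factor): on every cell A with \(P_X(A)>0\) the infinite sample histogram \(h_{P,{\mathcal {A}}}\) is the \(P_X\)-average of \({f_{L,P}^*}\) over A, so that \(|h_{P,{\mathcal {A}}}(x)-{f_{L,P}^*}(x)|\le |{f_{L,P}^*}|_\alpha \, s^\alpha\) for \(P_X\)-almost all \(x\in X\), and hence
\[ {\mathcal {R}}_{L,P}(h_{P,{\mathcal {A}}})-{\mathcal {R}}_{L,P}^* = \Vert h_{P,{\mathcal {A}}}-{f_{L,P}^*} \Vert _{L_2(P_X)}^2 \le |{f_{L,P}^*}|_\alpha ^2\, s^{2\alpha } , \]
where \({\mathcal {R}}_{L,P}\) denotes the least squares risk and \({\mathcal {R}}_{L,P}^*\) the corresponding Bayes risk.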
Based on the previous results we can now establish universal consistency of the empirical histogram rule \(D \mapsto h_{D,{\mathcal {A}}_D}\) for regression based on a cubic data-dependent partition \({\mathcal {A}}_D\) from \({\mathcal {P}}(X)\).
Proposition E.4
(Universal Consistency) Let L be the least squares loss, \(X:= [-1,1]^d\), \(Y= [-1,1]\), P be a distribution on \(X\times Y\), and \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\) drawn from P with \(|D_X|=m_n\). Suppose that \((s_n)_{n \in \mathbb {N}}\) is a sequence with \(s_n \rightarrow 0\) as well as \(\frac{\ln (n s_n^d)}{n s_n^d}\rightarrow 0\) as \(n \rightarrow \infty\). Assume further that \(\pi _{m_n, s_n}\) is an \(m_n\)-sample cubic partitioning rule of width \(s_n \in (0,1]\), satisfying \(|{{\,\textrm{Im}\,}}(\pi _{m_n, s_n})| \le c n^\beta\), for some \(c<\infty\) and some \(\beta >0\) that are independent of n. Denoting \({\mathcal {A}}_D:=\pi _{m_n, s_n}(D_X)\), we have
in probability as \(n\rightarrow \infty\).
Proof of Proposition E.4
Note that for any \(\varepsilon >0\) and for any \({\mathcal {A}}\in {{\,\textrm{Im}\,}}(\pi _{m_n, s_n})\), the \(\varepsilon\)-covering number of \({\mathcal {H}}_{\mathcal {A}}\) satisfies
with \(|{\mathcal {A}}|\le (2/s_n)^d\). Let us write \(\mathcal{P}_n:= {{\,\textrm{Im}\,}}(\pi _{1, s_n}) \cup \dots \cup {{\,\textrm{Im}\,}}(\pi _{n, s_n})\). Applying Corollary E.2 with \(A:= (2/s_n)^d\) and \(K:= |\mathcal{P}_n| \le c n^{1+\beta }\) gives, for all \(\tau \ge 1\) and \(n\ge 1\), with probability \(P^n\) at least \(1-2c n^{1+\beta } e^{-\tau }\) that
Now, for all \(\varepsilon >0\), Lemma E.3 guarantees the existence of an \(s_\varepsilon >0\) such that for any cubic partition \({\mathcal {A}}\) of width \(s_n \in (0, s_\varepsilon ]\) we have
Since we assumed \(s_n \rightarrow 0\), we conclude that the latter inequality holds for all sufficiently large n. Combining both bounds, we find that for all sufficiently large n, with probability \(P^n\) at least \(1-2c n^{1+\beta } e^{-\tau }\), it holds that
where \(c_d=1024 \cdot 2^d\). Finally, choosing \(\tau = (\beta + 2) \ln (n)\), the result follows by recalling that \(\frac{\ln (n s_n^d)}{n s_n^d} \rightarrow 0\) by assumption. \(\square\)
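For concreteness, here is a simple admissible width sequence (our own example, not taken from the paper): \(s_n:= n^{-1/(2d)}\). Then \(s_n\rightarrow 0\) and \(n s_n^d = \sqrt{n}\rightarrow \infty\), so that
\[ \frac{\ln (n s_n^d)}{n s_n^d} = \frac{\ln \sqrt{n}}{\sqrt{n}} = \frac{\ln n}{2\sqrt{n}} \rightarrow 0 , \]
and the assumptions of Proposition E.4 on \((s_n)_{n\in \mathbb {N}}\) are satisfied.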
We now come to the second main contribution of this section, namely the derivation of learning rates for the empirical histogram rule \(D \mapsto h_{D,{\mathcal {A}}_D}\) for regression based on a cubic data-dependent partition \({\mathcal {A}}_D\) from \({\mathcal {P}}(X)\).
Proposition E.5
(Learning Rates) Let L be the least squares loss, \(X:= [-1,1]^d\), \(Y= [-1,1]\), P be a distribution on \(X\times Y\), and \(D \in (X \times Y)^n\) be an i.i.d. sample of size \(n \ge 1\) drawn from P with \(|D_X|=m_n\). Assume the Bayes decision function \(f^*_{L,P}\) is \(\alpha\)-Hölder continuous for some \(\alpha \in (0,1]\). Suppose further that \((s_n)_{n \in \mathbb {N}}\) is a sequence satisfying
Assume further that \(\pi _{m_n, s_n}\) is an \(m_n\)-sample cubic partitioning rule of width \(s_n \in (0,1]\), satisfying \(|{{\,\textrm{Im}\,}}(\pi _{m_n, s_n})| \le c n^\beta\), for some \(c<\infty\) and some \(\beta >0\) that are independent of n. Denoting \({\mathcal {A}}_D:=\pi _{m_n, s_n}(D_X)\), the excess risk then satisfies for all \(n\ge 1\) the inequality
with probability \(P^n\) at least \(1- cn^{1+\beta } e^{-n^{d \gamma }}\), where \(c_{d,\alpha }>0\) is a constant depending only on d, \(\alpha\), and \(|f^*_{L,P}|_\alpha\).
Proof of Proposition E.5
Since the Bayes decision function \(f^*_{L,P}\) is \(\alpha\)-Hölder continuous by assumption, Lemma E.3 gives us for all \(n\ge 1\) that
Repeating the proof of Proposition E.4 by replacing (87) with (88) shows that for all \(\tau \ge 1\) and \(n\ge 1\) we have
with probability \(P^n\) not less than \(1-2cn^{1+\beta } e^{-\tau }\). Using the definition of \(s_n\) and setting \(\tau _n := ns_n^{2\alpha } = n^{\frac{d}{2\alpha + d}}\) then gives the assertion. \(\square\)
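To make the resulting rate explicit (a remark of ours based on the identity \(n s_n^{2\alpha }=n^{\frac{d}{2\alpha +d}}\) used in the proof): this identity corresponds to the width choice \(s_n=n^{-\frac{1}{2\alpha +d}}\), for which the approximation and estimation terms are balanced in the sense that
\[ s_n^{2\alpha } = n^{-\frac{2\alpha }{2\alpha +d}} \qquad \text {and}\qquad \frac{1}{n s_n^d} = n^{-\frac{2\alpha }{2\alpha +d}} . \]
Consequently, up to constants and possibly logarithmic factors, the excess risk bound of Proposition E.5 is of the order \(n^{-\frac{2\alpha }{2\alpha +d}}\), the classical rate for regression with an \(\alpha\)-Hölder continuous Bayes decision function.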