Unified regularity measures for sample-wise learning and generalization

Zhang, Chi; Yuan, Meng; Ma, Xiaoning; Liu, Yu; Lu, Haoang; Wang, Le; Su, Yuanqi; Liu, Yuehu

doi:10.1007/s44267-024-00069-4


Research
Open access
Published: 31 December 2024

Volume 2, article number 38, (2024)

Visual Intelligence

Abstract

Fundamental machine learning theory shows that different samples contribute unequally to both the learning and testing processes. Recent studies on deep neural networks (DNNs) suggest that such sample differences are rooted in the distribution of intrinsic pattern information, namely sample regularity. Motivated by recent discoveries in network memorization and generalization, we propose a pair of sample regularity measures with a formulation-consistent representation for both processes. Specifically, the cumulative binary training/generalizing loss (CBTL/CBGL), the cumulative number of correct classifications of the training/test sample within the training phase, is proposed to quantify the stability in the memorization-generalization process, while forgetting/mal-generalizing events (ForEvents/MgEvents), i.e., the misclassification of previously learned or generalized samples, are utilized to represent the uncertainty of sample regularity with respect to optimization dynamics. The effectiveness and robustness of the proposed approaches for mini-batch stochastic gradient descent (SGD) optimization are validated through sample-wise analyses. Further training/test sample selection applications show that the proposed measures, which share the unified computing procedure, could benefit both tasks.


1 Introduction

Although deep learning has been widely applied and has made significant progress, the generalization of deep neural networks is still affected by the unexplainable black-box structure of algorithms [1, 2], biased datasets [3–5], noisy labeling [6, 7], and various evaluation metrics [8, 9]. In particular, an over-parameterized network, which has far more parameters than training samples, is predicted by classical learning theory to overfit severely. However, it generalizes remarkably well [10]. Zhang et al. [11, 12] further reported that a deep neural network trained with the ordinary stochastic gradient descent (SGD) [13] method could easily fit random labels. In this case, measures from statistical learning theory fail to explain why deep neural networks generalize well from the training set to new data. For example, Rademacher complexity, which measures the capacity to fit random noise, will be close to 1, resulting in loose and invalid generalization error bounds.

Recently, some researchers have attempted to develop a new generalization theory adapted to the over-parameterized characteristics from the view of network memorization and generalization. Zhang et al. [11] argued that neural networks not only capture the remaining signals in correctly-labeled training data, but also forcibly fit the noisy parts by memorizing them. In other words, the ability to extract patterns while memorizing exceptional samples contributes substantially to network generalization. Feldman [14] further noted that the memorization of rare and atypical samples is the source of the unexpectedly strong generalization performance [15] achieved by deep neural networks. Both findings imply differences in the distribution of patterns contained in samples, as well as the powerful memory of neural networks. A fundamental question then arises: how can the intrinsic pattern in different samples be represented to distinguish the different states of being regular, i.e., sample regularities?

The quantification of the intrinsic sample patterns is crucial when dealing with the training set bias, i.e., the inconsistency between the distributions of the training set and the test set. Moreover, existing studies have shown that different samples contain different amounts of information, hence contributing differently to model training [16].

To address this issue, measuring samples and training on them according to their information gains becomes a reasonable solution, which has been adopted by many studies to address specific tasks [17–23]. In OHEM [17], the dynamic learning difficulty of an example is represented by its current loss, and the network is trained by a simply-modified SGD method. Instead, Pang et al. [18] proposed measuring the difficulty on the basis of the intersection over union (IoU) distribution; consequently, IoU-balanced sampling is utilized to discover hard examples for detection tasks. Wang et al. [19] quantified the impact of samples on model training via the influence function [20]. Li et al. [21] noted that the essential effect of the disharmonies between samples could be summarized in terms of the gradient, and then proposed a gradient harmonizing mechanism (GHM) to hedge the disharmonies. An IoU hierarchical local rank strategy is proposed in Ref. [22] as a quantitative way to evaluate sample differences. Zhang et al. [23] proposed a robust learning algorithm, named DualGraph, to capture structural relationships among labels with graph neural networks at two different levels, namely instance-level and distribution-level relations. However, the methods mentioned above are all task-specific, which naturally leads us to a more general question: do universal measures exist for the quantification of the intrinsic sample patterns?

Most recently, a new line of research has emerged that empirically represents the sample distribution on the basis of the memorization-generalization mechanism. Toneva et al. [24] reported that sample forgetting exists not only in the training process of a new learning task (i.e., catastrophic forgetting [25, 26]), but also in different training stages on the same dataset. They therefore proposed a forgetting-event-based sample representation (abbreviated as forgetting events (ForEvents) in this manuscript), which has since been used as a measure of sample difficulty for robust reasoning in natural language processing [27]. Nguyen et al. [28] further used a captioning model to visually explain the catastrophic forgetting phenomenon, especially when the sample was forgotten or changed.

In contrast, Jiang et al. [29] reported that cumulative binary training loss (CBTL) could be used as a lightweight proxy for consistency score of samples. Inspired by the above works, we attempt to explore how to represent the intrinsic sample patterns through the memorization-generalization mechanism, empirically demonstrating the exposed regularity of the underlying data patterns. However, we find that either the unidimensional sample representation with CBTL or the number of ForEvents alone can be ambiguous when measuring intrinsic sample patterns. For example, samples with the same CBTL may have different visual difficulties because the number of ForEvents is different, as shown in Fig. 1. The phenomenon is caused mainly by the difference in the way they are calculated, although both are statistics on the sample-wise learning frequency. The forgetting-event-based sample representation addresses the observation of the dynamic events during the learning process, whereas the CBTL reflects the cumulative number of successes in learning a given sample [30].

Therefore, we present unified regularity measures with a two-dimensional representation for revealing intrinsic sample patterns, taking advantage of both short-term dynamic signals and long-term stable information. Specifically, the distributions of training samples and test samples are represented in the spaces of CBTL-ForEvents and cumulative binary generalizing loss (CBGL)-mal-generalizing events (MgEvents), respectively. The proposed unified regularity measures have been sufficiently verified by numerous empirical investigations, including training randomness, the relationship between memorization and generalization, and robustness. Moreover, applications in training and testing acceleration demonstrate its promising effects on network training, hard sample selection, and visual algorithms testing.

The major contributions of this paper are three-fold:

1) A two-dimensional representation involving CBTL and ForEvents is proposed to measure the regularity of training samples.
2) Likewise, a two-dimensional representation combining the newly defined CBGL and MgEvents is proposed to measure the regularity of test samples.
3) Our experimental findings suggest that samples with higher regularity seem to contribute little to both training and testing tasks, which in turn validates the effectiveness of the proposed measures.

2 Preliminary work

The general form [31] of the learning system can be described as follows.

Consider a training dataset $T=\left \lbrace \left ( x_{1},y_{1}\right ),\ldots ,\left ( x_{N},y_{N} \right ) \right \rbrace $, where $\left ( x_{i},y_{i}\right )$ ($i=1,2,\ldots ,N$) denotes an input observation-label sample pair. The learning system is trained with the given training dataset to obtain a model, denoted as a conditional probability distribution $\hat{P}\left (Y|X\right )$ or a decision function $Y=\hat{f}\left (X\right )$, which describes the mapping between the input and output random variables. The optimal model is generally obtained by minimizing the empirical risk $R_{\mathrm{emp}}\left (\hat{f}\right )=(\sum _{i=1}^{N}L(y_{i}, \hat{f}(x_{i})))/N$, where $L(\cdot , \cdot )$ is the loss.
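As a minimal illustration of the empirical risk above, the following sketch averages a per-sample loss over a toy dataset. The 0-1 loss and the toy labels are assumptions chosen for illustration only; the source leaves $L(\cdot , \cdot )$ unspecified.

```python
import numpy as np


def empirical_risk(y_true, y_pred, loss):
    """Average the per-sample loss L(y_i, f(x_i)) over the N training pairs."""
    losses = [loss(y, p) for y, p in zip(y_true, y_pred)]
    return float(np.mean(losses))


def zero_one_loss(y, y_hat):
    # Hypothetical choice of L: 1 for a misclassification, 0 otherwise.
    return 0.0 if y == y_hat else 1.0


# Toy labels and predictions (not taken from the paper).
y_true = [0, 1, 2, 1]
y_pred = [0, 2, 2, 1]
risk = empirical_risk(y_true, y_pred, zero_one_loss)  # one error in four samples
```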

2.1 Forgetting events and cumulative binary training loss

Continuous learning in the real world requires that intelligent systems can learn on successive tasks without performance degradation on the preceding training tasks, just as humans do. Researchers have found that this problem setting poses a great challenge for connectionism-based neural networks [32–34]. This phenomenon, known as “catastrophic forgetting”, is manifested mainly by the tendency of neural networks to forget all the acquired knowledge of the previous task quickly and brutally after the current task is added. Even deep neural networks, which have achieved great success in recent years, are not able to overcome this shortcoming [25], inevitably increasing the doubts about the research on general artificial intelligence. This phenomenon is caused primarily by the fact that when the goal changes, the learned weights of the network that are adapted to the previous task also adapt to the requirements of the new one, preventing the network from generalizing well to the previous task.

Inspired by this, Toneva et al. [24] argued that a single learning task optimization based on mini-batch stochastic gradient descent could be considered a process similar to continuous learning. In this process, each mini-batch of training data can be considered a small task that is sequentially handed over to the deep neural network. This leads to the following definition of sample forgetting events [24].

Forgetting events (ForEvents)

During the mini-batch sample learning process, a forgetting event occurs when a training sample that has been acquired (i.e., correctly classified) at time $t$ is misclassified at a subsequent time $t'$ ($t'>t$).

In Ref. [24], ForEvents during the training process are explored, on the basis of which training data are classified into forgettable and unforgettable samples. Furthermore, the feature atypicality and visual illegibility of the forgettable samples are verified in their experiments. However, both easily learned regular samples and hard-to-learn exceptional samples have few ForEvents, resulting in symmetric statistics of ForEvents, as illustrated in Fig. 2. Consequently, these two kinds of samples cannot be distinguished by ForEvents alone.

Following another paradigm, Jiang et al. [29] quantified the regularity of samples by measuring the consistency of the sample with the overall distribution of all samples through a lightweight proxy cumulative binary training loss, defined as follows.

Cumulative binary training loss (CBTL)

During the mini-batch sample learning process, the cumulative number of correct classifications of the training sample up to time t.

However, CBTL alone cannot distinguish the pattern differences in samples. For example, Fig. 1 shows that samples in the CIFAR-10 [35] training set are not equally difficult to distinguish even though CBTL is the same. Specifically, sample pairs in each column share the same CBTL, with fewer ForEvents in the top row. The images in the top row are more regular and easier to recognize than those in the bottom row.

Intuitively, the occurrence of a forgetting event implies that the sample crossed in the wrong direction as the decision boundary changes, demonstrating that the direction of the loss incurred by the sample at this moment is not consistent with the overall loss of the training set. Therefore, statistics of ForEvents can be considered a measure of the uncertainty for a particular sample when learning a decision boundary. On the other hand, reflecting the cumulative number of successes in learning a given sample, CBTL is considered a long-term stability measure of the learning difficulty of one certain sample.

Inspired by these preliminary investigations, we propose to represent sample regularity in a unified two-dimensional space formed by measures of short-term learning uncertainty and long-term stability, i.e., the CBTL-ForEvents space for training samples and the newly introduced CBGL-MgEvents space for test samples, as shown in Fig. 2. Properties of this representation are explored, and the insight of training and testing acceleration through dataset reduction is subsequently validated.

Owing to the high computational cost of performing statistics after each training mini-batch, we choose to calculate the regularity measures after each epoch, as detailed in Sect. 3.1.

The predicted label for the training example $x_{i}$ obtained after $t$ epochs of optimization is denoted as $\hat{y}_{i}^{t}=\hat{f}^{t}(x_{i})$. Let $acc^{t}_{i}=1_{\hat{y}_{i}^{t}=y_{i}}$ be a binary variable indicating whether the sample is correctly classified at epoch $t$. The above definitions of ForEvents and CBTL are then formalized as follows.

Forgetting events

Let $for_{i}^{t}=1_{acc^{t-1}_{i}=1,acc^{t}_{i}=0}$, i.e., a forgetting event is counted at epoch $t$ when the prediction is correct at epoch $t-1$ and incorrect at epoch $t$. The number of ForEvents of a training sample $(x_{i},y_{i})$ up to epoch $t$ is defined as

$$ F^{t}_{i}=\sum _{n=1}^{t}for_{i}^{n}. \tag{1} $$

Cumulative binary training loss

For training sample $(x_{i},y_{i})$, CBTL at epoch t is defined as

$$ L^{t}_{i}=\sum _{n=1}^{t}acc_{i}^{n}. \tag{2} $$
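Equations (1) and (2) can be sketched in a few lines of NumPy, assuming the per-epoch correctness indicators $acc^{t}_{i}$ have been stored as a binary matrix with one row per epoch. The function name `regularity_measures`, the convention $acc^{0}_{i}=0$ (no forgetting event can occur at the first epoch), and the toy matrix are assumptions for illustration, not details stated in the source.

```python
import numpy as np


def regularity_measures(acc):
    """Compute cumulative ForEvents (Eq. 1) and CBTL (Eq. 2).

    acc: (T, N) binary array; acc[t - 1, i] = 1 if sample i is correctly
         classified after epoch t.
    Returns (F, L), both of shape (T, N), holding the running counts.
    """
    acc = np.asarray(acc, dtype=int)
    # Assumed convention: acc^0_i = 0, so epoch 1 cannot be a forgetting event.
    prev = np.vstack([np.zeros((1, acc.shape[1]), dtype=int), acc[:-1]])
    forget = (prev == 1) & (acc == 0)          # for_i^t
    F = np.cumsum(forget.astype(int), axis=0)  # Eq. (1)
    L = np.cumsum(acc, axis=0)                 # Eq. (2)
    return F, L


# Toy history over 5 epochs for 2 samples (columns).
acc = np.array([
    [1, 0],
    [0, 0],
    [1, 1],
    [0, 1],
    [1, 1],
])
F, L = regularity_measures(acc)
```

In this toy history both samples end with the same CBTL (3), but the first has two ForEvents and the second none, which mirrors the ambiguity of using CBTL alone.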

2.2 Mal-generalizing events and cumulative binary generalizing loss

Likewise, for sample-wise dynamics of generalization on the test set during sequential learning of mini-batch data, the definitions of MgEvents and CBGL are given as follows.

Mal-generalizing events (MgEvents)

During the mini-batch sample learning process, a mal-generalizing event occurs when a test sample that is correctly classified at epoch $t$ is misclassified at a subsequent epoch $t'$ ($t'>t$).

Cumulative binary generalizing loss (CBGL)

During the mini-batch sample learning process, the cumulative number of correct classifications of test samples up to epoch t.

The formal description of the above definitions is analogous to Eqs. (1) and (2), the only difference being that they are applied to the samples in the test set.
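Because the training and test measures share one computing procedure, a single running tracker can serve both splits. The sketch below is a hypothetical streaming variant: the class name `RegularityTracker`, its attribute names, and the assumption that no sample counts as learned before the first epoch are all illustrative choices rather than the authors' implementation.

```python
import numpy as np


class RegularityTracker:
    """Running counts of correct classifications and correct-to-wrong flips.

    Instantiated on training samples it yields CBTL / ForEvents; instantiated
    on test samples it yields CBGL / MgEvents.
    """

    def __init__(self, num_samples):
        # Assumption: no sample counts as learned before the first epoch.
        self.prev_acc = np.zeros(num_samples, dtype=bool)
        self.cumulative = np.zeros(num_samples, dtype=int)  # CBTL or CBGL
        self.events = np.zeros(num_samples, dtype=int)      # ForEvents or MgEvents

    def update(self, predictions, labels):
        """Call once per epoch with the predictions for every tracked sample."""
        acc = np.asarray(predictions) == np.asarray(labels)
        self.events += self.prev_acc & ~acc  # correct last epoch, wrong now
        self.cumulative += acc
        self.prev_acc = acc


# Toy usage on two test samples over three epochs.
labels = [0, 1]
tracker = RegularityTracker(num_samples=2)
for preds in ([0, 0], [1, 1], [0, 1]):
    tracker.update(preds, labels)
```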

3 Experimental verification and analysis

3.1 Experimental setup

As described in Ref. [24], checking whether a forgetting event occurs for all training samples after each mini-batch is time-consuming and computationally expensive. Therefore, they only update the statistics for the samples in the current mini-batch after each mini-batch. The calculation of MgEvents faces the same dilemma, i.e., it is not feasible to evaluate the generalization status of all test samples after each mini-batch. Considering the limited impact of a single mini-batch on model performance, this paper adopts a different strategy, i.e., updating the inference states of all training and test samples after each epoch. To validate this strategy, its result is compared with that of Toneva et al. [24]. We trained the ResNet-110 [36] network on the CIFAR-10 training set. The average errors between the two strategies for CBTL and the number of ForEvents are 1.2614 and 0.6762, respectively. The Pearson correlation coefficients between the two strategies for the vectors of CBTL and the number of ForEvents over the 50,000 samples in the CIFAR-10 training set are 0.9896 and 0.9835, respectively, both indicating strong correlation. Therefore, the strategy of making one inference and updating the states after each epoch is used as a lightweight proxy. The number of model inferences required is reduced to approximately 1/390 of the ideal case when the number of training samples is 50,000, the number of test samples is 10,000, and the batch size is 128. Note that this approximation is likely to aggravate the randomness of model generalization.
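The factor of roughly 1/390 can be checked with a short calculation. The sketch assumes that the "ideal case" means re-evaluating every training and test sample after every mini-batch, and that the last partial mini-batch of an epoch is kept; both are interpretations rather than statements from the source.

```python
import math

n_train, n_test, batch_size = 50_000, 10_000, 128

# Mini-batches per epoch, keeping the final partial batch (assumption).
batches_per_epoch = math.ceil(n_train / batch_size)

# Ideal case (assumed): all samples are re-evaluated after every mini-batch.
ideal_inferences = batches_per_epoch * (n_train + n_test)

# Per-epoch proxy: all samples are evaluated once per epoch.
proxy_inferences = n_train + n_test

ratio = proxy_inferences / ideal_inferences
print(batches_per_epoch, ratio)
```

Under these assumptions there are 391 mini-batches per epoch, so the proxy needs 1/391 of the inferences, in line with the stated figure of approximately 1/390.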

This paper explores learning statistics on the basis of the training process of ResNet-110 on the CIFAR-10 dataset. The model is trained for a total of 200 epochs, and its average performance is close to the highest accuracy of the architecture on the CIFAR-10 dataset, i.e., 93.53%. In particular, the initial learning rate is 0.1, which decreases to 0.01 at the 81st epoch and 0.001 at the 122nd epoch. The same network is trained 10 times under the same hyper-parameter settings, and the mean of the resulting statistics is taken to reduce the effect of randomness on the empirical analysis.
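The step schedule described above can be written as a small function. This is a sketch that assumes epochs are counted from 1 and that each new rate takes effect at the start of the named epoch; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch):
    """Step learning-rate schedule; `epoch` is 1-indexed (assumption)."""
    if epoch < 81:
        return 0.1
    if epoch < 122:
        return 0.01
    return 0.001


# Rates at the boundaries of the 200-epoch run.
schedule = {e: learning_rate(e) for e in (1, 80, 81, 121, 122, 200)}
```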

3.2 Representation of sample distribution

The sample distribution is represented from the perspective of the memorization-generalization mechanism of neural networks. Inspired by Refs. [24, 29], we empirically propose a two-dimensional sample regularity representation in the CBTL-ForEvents and CBGL-MgEvents spaces. The properties of these representations are explored in this section.

3.2.1 Unidimensional sample distribution representation

First, the histograms of the sample distribution in a single dimension are illustrated in Fig. 3. The distributions of all measures are long-tailed and are dominated by easy samples that have high CBTL/CBGL scores and a small number of ForEvents/MgEvents. The distributions of the training and test samples are very similar, but the test samples are not involved in the training process, resulting in a longer tail. Likewise, the histograms of ForEvents and MgEvents share a similar distribution, although the local shape may vary, e.g., different peaks around the 15th bin.

3.2.2 Two-dimensional sample distribution representation

Furthermore, the regularity of training and test samples is depicted in the CBTL-ForEvents and CBGL-MgEvents spaces, as shown in Fig. 2 (a) and 2 (b), respectively. The samples are symmetrically distributed in our defined two-dimensional space with respect to ForEvents/MgEvents, whereas the distributions of ForEvents/MgEvents under the same CBTL/CBGL are widely disparate, as shown in Fig. 4. These results also support our claim that a single-dimensional measure cannot distinguish the intrinsic patterns of the samples well.

As displayed in Fig. 2 (a) and 2 (b), the density of samples in the lower right corner is greater, indicating that the easy samples dominate both the training and test sets, which is consistent with the previous findings. Moreover, these two sets share similar symmetric sample distributions, but the symmetry axes of the training samples are closer to the right and the distribution is more compact than the test samples. This is consistent with the theoretical intuition that the empirical error tends to be smaller than the test error.

3.2.3 Visual verification

We visualized several representative samples at different positions of the distribution, as shown in Fig. 2 (b). The visual patterns do vary from one sample to another.

Objects in the bottom right samples tend to be highly recognizable and regular, e.g., the brown horse on the green grass and the green frog on the gray rock. Visual ambiguity between the background and foreground objects may appear in the samples from the middle part, which increases the difficulty of identification. The objects with unconventional features in the bottom left samples are very easy to misclassify, e.g., the cat under green light looks like a frog, and the triangular boat looks like a tree.

3.3 Measure property investigation

3.3.1 Training randomness

Owing to the randomness of the optimization process and the uncertainty of model inference, the results of repetitive experiments may vary even with the same hyper-parameter settings for the same network architecture, so the stability of the measurement results needs to be explored. We record the statistics of 10 repetitive experiments.

As demonstrated in the first row of Fig. 5, the occurrences of MgEvents and the values of CBGL vary across repetitive experiments. For test sample #1, over 10 training sessions, the epochs at which MgEvents occur range over $[0, 90]$, the number of occurrences varies within $[0, 4]$, and the values of CBGL vary within $[195, 200]$. The ForEvents and CBTL show similar dynamics, as shown in Fig. 5 (c) and Fig. 5 (d). This means that CBTL/CBGL and ForEvents/MgEvents should be computed over multiple repetitions of the experiment.

Furthermore, to intuitively quantify the effect of training randomness on the sample distribution representation, we calculate the Pearson correlation coefficients between the distributions from different training sessions. The correlation matrix is obtained on the basis of the coefficients between the sample density vectors in each representation, as depicted in Fig. 6. The sample density vectors are obtained by normalizing the vectors acquired via the density calculation method, as illustrated in Fig. 2.

The average Pearson correlation coefficient of the 10 training sessions is 0.8634 for the training samples and 0.8733 for the test samples. This shows that the proposed measures are stable at a certain level for repetitive training sessions.
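A correlation matrix of this kind can be sketched with `numpy.corrcoef`, treating each training session's sample density vector as one row. Averaging the off-diagonal entries is an assumption about how the reported mean coefficient is obtained, and the function name and toy vectors are illustrative only.

```python
import numpy as np


def mean_pairwise_pearson(density_vectors):
    """Pearson correlation between runs.

    density_vectors: (R, N) array, one density vector per training run.
    Returns the (R, R) correlation matrix and the mean of its
    off-diagonal entries (assumed averaging scheme).
    """
    corr = np.corrcoef(np.asarray(density_vectors, dtype=float))
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return corr, float(off_diagonal.mean())


# Toy density vectors for three runs over four samples.
vectors = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with the first run
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with the first run
]
corr, mean_corr = mean_pairwise_pearson(vectors)
```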

3.3.2 Relationship between measures for training and test samples

Owing to the similarities in definitions and the computing processes, here we discuss the similarities between ForEvents and MgEvents to understand the relationship between memorization and generalization through dynamic statistics of the network’s classification performance on training and test samples.

As displayed in Fig. 7, the histogram of ForEvents is similar to that of MgEvents for model #1 and model #4, with Pearson correlation coefficients of 0.9985 and 0.9984, respectively. Similarly, the histogram of CBTL is similar to that of CBGL in the empirical statistics of each model. According to Ref. [35], which describes the construction of the CIFAR-10 dataset, the training and test sets are independently and identically distributed. This means that ideally, the similarity between the distributions of ForEvents and MgEvents of the model can be used to measure the similarity between the training and test sets. In addition, the red arrows indicate the distribution variability, which to some extent reflects the difference between model learning and generalizing, i.e., the generalization error.

Furthermore, we investigate the synchronization between ForEvents and MgEvents, where the synchronization of the particular test sample with a training sample means that the epoch when its mal-generalizing event occurs during the training process corresponds exactly to the epoch when the forgetting event of that training sample occurs. Test samples that generalize successfully or unsuccessfully in all training epochs do not have synchronization with any training sample because no MgEvents occur. Thus, we are first concerned with synchronization in different generalization cases. As shown in Fig. 8, the number of synchronized training samples varies with the number of MgEvents for the test samples. Specifically, the higher the number of MgEvents, the greater the corresponding number of synchronized training samples. For example, for sample #90 (with 2 MgEvents), the number of synchronized training samples is 13,266, whereas sample #5695 (with 15 MgEvents) has approximately twice as many synchronized training samples during a single training process.

Considering the aforementioned randomness in the training and generalizing process, Fig. 8 also shows that the number of synchronized training samples for a particular test sample decreases significantly, or even drops to 0, when extended to 10 training runs. This indicates that most of the ForEvents of synchronized training samples merely happen to occur at the same epoch as the MgEvents of the target test sample, even though such synchronization occurs hundreds of times. This confirms that the neural network does not depend on specific training samples for the extraction and generalization of patterns embedded in the training data. Does the generalization of specific irregular samples then depend on memorizing certain training samples? We extend to 20 training runs for samples #521, #5695, and #889 to observe the distributions of their synchronized training samples, as demonstrated in Fig. 9.
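The synchronization check can be sketched as follows. Two interpretations are assumed here and are not confirmed by the source: a training sample counts as synchronized with a test sample if it has a forgetting event at one or more epochs where the test sample has a mal-generalizing event, and "extending to several trainings" means keeping only the training samples that are synchronized in every run. All function names and toy matrices are hypothetical.

```python
import numpy as np


def event_matrix(acc):
    """(T, N) correctness history -> (T, N) bool, True on correct-to-wrong flips."""
    acc = np.asarray(acc, dtype=bool)
    prev = np.vstack([np.zeros((1, acc.shape[1]), dtype=bool), acc[:-1]])
    return prev & ~acc


def synchronized_training_samples(train_acc, test_acc_one):
    """Indices of training samples sharing an event epoch with one test sample."""
    for_events = event_matrix(train_acc)                                      # (T, N_train)
    mg_events = event_matrix(np.asarray(test_acc_one).reshape(-1, 1))[:, 0]  # (T,)
    shared = for_events[mg_events]            # rows at the MgEvent epochs
    return set(np.flatnonzero(shared.any(axis=0)).tolist())


def synchronized_across_runs(runs):
    """Training samples synchronized in every run (assumed intersection rule)."""
    sets = [synchronized_training_samples(tr, te) for tr, te in runs]
    return set.intersection(*sets) if sets else set()


# Toy data: 4 epochs (rows), 3 training samples (columns), one test sample.
run1_train = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1]]
run1_test = [1, 1, 0, 0]            # MgEvent at the third epoch
run2_train = [[1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 1, 0]]
run2_test = [1, 0, 1, 1]            # MgEvent at the second epoch

single_run = synchronized_training_samples(run1_train, run1_test)
all_runs = synchronized_across_runs([(run1_train, run1_test),
                                     (run2_train, run2_test)])
```

In the toy data the synchronized set shrinks from two training samples in a single run to one across both runs, illustrating how coincidental matches are filtered out as more runs are considered.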

To further verify the effect of these synchronized training samples on the generalization of the corresponding test samples, we train a support vector machine (SVM) as a binary classifier to determine whether a particular test sample belongs to a certain class. The input is the 4096-dimensional features extracted from the training samples via the VGG-19 [37] network pre-trained on ImageNet [38]. Two experimental settings are used: 1) N random training samples containing the 46 synchronized samples (50% of the training samples belong to the same category as the selected test sample, whereas the others are from different categories); 2) N random training samples not containing these synchronized samples. The results are displayed in Table 1, where the classifier trained with synchronized samples generalizes better to its corresponding test sample for sample #889, which has low regularity. For example, when the synchronized samples account for 46% of the training samples, the classification accuracy of the classifier trained with synchronized samples for sample #889 is approximately 14% higher than that of the classifier trained without synchronized samples. The advantage is still approximately 3% when the proportion of synchronized samples is reduced to 10%. We conjecture that these synchronized training samples tend to play the role of “support vectors”. That is, stochastic gradient descent tends to converge implicitly to the solution that maximizes the differentiation of the dataset [39], and the synchronized training samples are more relevant to the learning of the corresponding test samples’ classification boundaries than the other samples.

Table 1 Investigations of the support of 46 synchronized training samples on specific test samples. N is the number of test samples


3.3.3 Robustness

The generalization of neural networks varies across different optimizers and architectures, so the training statistics depend on the given network architecture trained with the given optimizer. It is thus important to explore the effects of these control variables on the proposed measures, the robustness of which needs to be further investigated.

Different architectures

Our statistics need to be collected during training, so the deeper or more complex the network, the greater the time cost of collecting them. Therefore, we explore how this distribution representation is affected by different network architectures, with the aim of determining whether the sample distribution can be reliably represented by simpler architectures at a lower time cost. We repeatedly train each of the four network architectures, ResNet-20 [36], ResNet-32 [36], ResNet-110 [36], and DenseNet [39], 10 times and take the average statistics for analysis. As shown in Fig. 10, the sample distribution does not change significantly when the number of layers is reduced from 110 to 32 or 20, or when the more densely-connected DenseNet is used. This conclusion is confirmed quantitatively by the correlation of the distributions and statistics. As displayed in Table 2, the Pearson correlation coefficients are all above 0.88, indicating very strong correlations. This shows that our proposed measure is robust to the network architecture and can be transferred between different architectures. Since ResNet-32 takes only 80 min while DenseNet takes 1367 min with the same computational power (a single GeForce RTX 2080Ti GPU), this opens up the prospect of computing sample distribution representation statistics with simpler architectures. Thus, the similarity of two-dimensional sample distribution representations across architectures allows them to be computed quickly on proxy networks to reduce time and computational cost, e.g., estimating sample distribution statistics for DenseNet via ResNet-32 can save up to 21.5 h of training time and reduce the parameter size by 7.03 MB.

Table 2 Pearson correlation coefficient of regularity measures between two networks with different architectures. Cumulative binary training loss (CBTL) and forgetting events (ForEvents) are for training samples, and cumulative binary generalizing loss (CBGL) and mal-generalizing events (MgEvents) are for test samples


Different optimizers

Most deep learning algorithms involve various forms of optimization, generally to minimize the loss function for parameter solving. Therefore, we explore the effects of different optimizers on the representation of the sample distribution. In addition to the common mini-batch SGD method, AdaGrad [40] and AdaMax [41] are also chosen in this comparative experiment.

As illustrated in Fig. 11, the sample distribution representations under different optimizers differ significantly. The Pearson correlation coefficients of SGD-AdaGrad, SGD-AdaMax, and AdaGrad-AdaMax are 0.8914, 0.6877 and 0.6962 for the training samples and 0.8877, 0.6628 and 0.7062 for the test samples, respectively. This confirms that different optimizers have a great impact on our proposed two-dimensional representation. Intrinsically, the optimization algorithm determines the direction of optimization at each step of training, which leads to changes in the learned parameters, resulting in differences in the learning process and thus affecting the representations of the sample distribution.

4 Application

4.1 Training acceleration

The distribution of the CIFAR-10 training set is extremely uneven in the proposed representation space. The samples in the lower right corner are relatively simple and have little impact on the performance of the learned model. However, they are densely distributed and dominate the training set. As a result, many training samples share similar patterns. We attempt to achieve an efficient training process with smaller sample sets by eliminating the high-density redundant samples.

We first compute the density values of each sample in our two-dimensional representation space according to Fig. 2. Afterward, the high-density training samples are removed by sorting their density values in descending order to investigate the effect of the proportion of removed samples on the generalization performance.

Different radius values

Since the computed density value depends on the radius r, we first explore the effect of r on training performance, as shown in Fig. 12 (a). We observe that r has a significant effect on the training performance and that the overall performance is best for $r=1$. Therefore, we use the density values calculated with $r = 1$ in the subsequent experiments.
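The density-based removal can be sketched as follows. The exact density definition is given with Fig. 2 and is not reproduced in this section, so the sketch assumes that the density of a sample is the number of other samples within Euclidean distance r in the two-dimensional representation space; the function names are hypothetical.

```python
import numpy as np

def radius_density(points, r=1.0):
    """ASSUMED density: number of other samples within distance r.

    Brute-force pairwise distances, suitable only for small sets.
    """
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist <= r).sum(axis=1) - 1  # exclude the sample itself

def keep_after_pruning(points, remove_fraction, r=1.0):
    """Indices kept after removing the highest-density samples first."""
    density = radius_density(points, r)
    order = np.argsort(-density, kind="stable")  # densest samples first
    n_remove = int(round(remove_fraction * len(order)))
    return np.sort(order[n_remove:])

# Toy (CBTL, ForEvents) coordinates: three identical easy samples
# and three isolated ones.
points = [[200, 0], [200, 0], [200, 0], [199, 1], [100, 10], [50, 20]]
print(radius_density(points, r=1.0))
print(keep_after_pruning(points, remove_fraction=0.5))
```

For a set of CIFAR-10 size, the full pairwise distance matrix would be too large to hold in memory, so a spatial index or a two-dimensional histogram would be a more practical choice.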

Training acceleration

We follow the strategies of removing samples according to the metric of CBTL proposed in Ref. [29] and ForEvents proposed in Ref. [24] as a comparison with our method, as shown in Fig. 12 (b).

Figure 12 (b) shows that the performance decreases rapidly when samples are removed randomly. However, when samples are removed according to one of the selection strategies, there is no significant decline in the generalization performance, and the test accuracy remains above 0.91 even after 60% of the training samples are removed. In addition, the strategy proposed in this paper slightly outperforms the CBTL and ForEvents strategies. This not only validates, to some extent, the insight behind dataset reduction but also contributes to accelerating training with a smaller set at lower time and computational cost.

4.2 Testing acceleration

Unlike training, which is a pattern extraction process where sample complexity can be reduced by removing samples with close pattern representations, the goal of model testing is often to maximize discrimination of testees’ generalization level using minimal sample complexity. This requires reducing the number of samples involved in evaluation while ensuring adequate discrimination for the generalization performance of the testees. This means that more attention needs to be paid to hard samples in practice, which is often achieved via difficulty-based sampling. The proposed CBGL-MgEvents representation can be used as a measure of the difficulty level, which is obtained by dividing the defined two-dimensional space into sectors of uniform angle on the basis of the sample distribution.

As shown by the red dashed lines in Fig. 2 (b), we divide the samples uniformly by difficulty using rays emitted at uniform angular intervals from the center $(100,0)$, which is the midpoint of the horizontal coordinate range. The smaller the angle is, the finer the division of difficulty. (Note that the inconsistent coordinate ratio of the figure is for better illustration.)

We first use the original 10,000 test samples to evaluate 17 algorithms including deep neural networks and traditional machine learning methods. Then, based on the balanced difficulty sampling method described above, the representation is divided into 4 bins and 10 bins at 45 degrees and 18 degrees, respectively. We randomly pick samples from each bin to obtain a smaller dataset containing 400 test samples. Afterward, the 17 algorithms are evaluated on the reduced set once again, as illustrated in Table 3.
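A minimal sketch of this angular binning and per-bin sampling is given below. It assumes that the horizontal axis is CBGL, the vertical axis is MgEvents, and the angle is measured from the left half axis around the center $(100,0)$; the treatment of samples on the bin boundaries and of bins holding fewer samples than requested are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def angle_bins(cbgl, mg_events, width_deg, center_x=100.0):
    """Assign each sample to an angular bin around (center_x, 0)."""
    cbgl = np.asarray(cbgl, dtype=float)
    mg = np.asarray(mg_events, dtype=float)
    # 0 degrees is the left half axis, 180 degrees the right half axis.
    angle = np.degrees(np.arctan2(mg, center_x - cbgl))
    n_bins = int(round(180.0 / width_deg))
    bins = np.minimum((angle // width_deg).astype(int), n_bins - 1)
    return bins, n_bins

def balanced_sample(bins, n_bins, per_bin, seed=0):
    """Randomly pick up to per_bin sample indices from every bin."""
    rng = np.random.default_rng(seed)
    chosen = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size == 0:
            continue  # ASSUMPTION: empty bins are skipped
        take = min(per_bin, idx.size)
        chosen.append(rng.choice(idx, size=take, replace=False))
    if not chosen:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(chosen))

# Toy coordinates, one sample per 45-degree sector.
cbgl = np.array([50.0, 90.0, 110.0, 150.0])
mg = np.array([10.0, 50.0, 50.0, 10.0])
bins4, n4 = angle_bins(cbgl, mg, width_deg=45.0)
selected = balanced_sample(bins4, n4, per_bin=1)
print(bins4, selected)
```

With 4 bins and a target of 400 samples, `per_bin` would be 100; with 10 bins it would be 40.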

Table 3 Testing accuracy of different algorithms on selected test samples


It is clear that the performance of all the listed algorithms, regardless of whether they are deep learning methods or traditional machine learning methods, decreases significantly when they are tested on difficulty-balanced samples. This is mainly because the small proportion of hard samples is overwhelmed by the large number of easy samples in the original test set, whereas difficulty-balanced sampling increases the proportion of hard samples. Moreover, the algorithms perform worse when tested on samples selected from 10 bins than when tested on samples selected from 4 bins.

To measure the discrimination of the sampled dataset for algorithm performance, we investigate the correlations between the original testing results and the results on 10 bin/4 bin difficulty-balanced sampling. The Spearman correlation coefficients are 0.9871 and 0.9877, respectively, indicating extremely strong correlations and sufficient discrimination for the algorithms.
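The ranking agreement described above can be illustrated with a small Spearman correlation sketch. The accuracy values below are hypothetical and are not taken from Table 3; ties are handled with average ranks.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical accuracies of five algorithms on the full and reduced sets.
acc_full = [0.93, 0.91, 0.85, 0.60, 0.40]
acc_reduced = [0.70, 0.72, 0.55, 0.35, 0.20]
rho = spearman_rho(acc_full, acc_reduced)
print(f"Spearman rho = {rho:.4f}")
```

A coefficient close to 1 indicates that the reduced test set ranks the algorithms in nearly the same order as the full test set, even though the absolute accuracies are lower.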

Furthermore, we explore the optimal number of samples in each bin for difficulty-balanced sampling, i.e., the number that most efficiently maintains performance discrimination. The representation space is divided into 12 bins, similar to Fig. 2 (b). The angle $0^{\circ}$, i.e., the left half axis, is the first bin, followed by $(0^{\circ},18^{\circ})$ for the second bin, $[18^{\circ}, 36^{\circ})$ for the third bin, … , $[162^{\circ}, 180^{\circ})$ for the 11th bin, and $180^{\circ}$ for the last bin. From each bin except the first, we select n samples; since the first bin contains only 3 samples, all of them are selected in the experiments.

Moreover, we explore the effect of n on the testing results, where $n\leq 49$ because the smallest of the latter 11 bins contains 49 samples. To measure the discrimination for algorithm performance on the test set, i.e., the stability of the algorithm ranking, we calculate the Spearman correlation coefficient between the original testing results and the results on the $11n+3$ selected samples, as well as the mean average precision of the algorithm ranking. The results are plotted in Fig. 13. We find that 30 samples per bin, i.e., a total of $11\times 30+3=333$ samples, is appropriate for a reduced CIFAR-10 test set. Therefore, we construct a small test set, CIFAR-10-333, to accelerate the testing of algorithm performance.

Experiments show that uniform regularity sampling for the test set can significantly reduce the sample complexity while maintaining discrimination for the algorithm performance.

5 Discussion

The awareness of sample-wise regularity is crucial to pattern learning for visual recognition.

Numerous theoretical and applied studies note that samples differ in informational density and hence contribute differently to model training. Ren et al. [52] reported that, when training visual object detection models, training samples that were neither positive nor negative were not helpful for network training. In Ref. [52], the positive samples are defined as the anchors with the highest IoU overlap with a ground-truth box, or the anchors with an IoU overlap higher than 0.7 with any ground-truth box, whereas the negative samples are those whose IoU is lower than 0.3 for all ground-truth boxes. Zhang et al. [53] reported that easy samples were of little help in training the detector, so only the top 70% could be considered in the loss function calculation. Lin et al. [54] further showed that easy negative samples made up the bulk of the loss and could overwhelm training, leading to degenerate models. Kishida and Nakayama [3] revealed through experiments that easy training samples, which were consistently classified correctly in the early training phase, were visually similar to each other, while hard samples were visually diverse. Furthermore, the hard training samples tended to contribute more to network generalization than the easy samples. On the basis of their experimental results, biases in the dataset and in SGD are considered the reasons why convolutional neural networks have consistent easy and hard samples.
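The anchor labeling rule quoted from Ref. [52] can be sketched as follows. This is an illustrative reading of the rule as stated above rather than the reference implementation; the box format `[x1, y1, x2, y2]`, the label codes, and the function names are assumptions.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors and ground-truth boxes [x1, y1, x2, y2]."""
    a = np.asarray(anchors, dtype=float)[:, None, :]
    g = np.asarray(gt_boxes, dtype=float)[None, :, :]
    iw = np.clip(np.minimum(a[..., 2], g[..., 2]) - np.maximum(a[..., 0], g[..., 0]), 0, None)
    ih = np.clip(np.minimum(a[..., 3], g[..., 3]) - np.maximum(a[..., 1], g[..., 1]), 0, None)
    inter = iw * ih
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (positive), 0 (negative) or -1 (neither) for each anchor."""
    iou = iou_matrix(anchors, gt_boxes)
    max_iou = iou.max(axis=1)
    labels = np.full(len(max_iou), -1)
    labels[max_iou < neg_thr] = 0
    labels[max_iou > pos_thr] = 1
    # The anchor with the highest IoU for each ground-truth box is positive
    # (ASSUMPTION: only if that IoU is non-zero).
    best = iou.argmax(axis=0)
    labels[best[iou.max(axis=0) > 0]] = 1
    return labels

gt = [[0, 0, 10, 10], [50, 50, 60, 60]]
anchors = [[0, 0, 10, 10], [0, 0, 10, 5], [20, 20, 30, 30],
           [0, 0, 10, 8], [50, 50, 60, 56]]
print(label_anchors(anchors, gt))
```

Anchors labeled -1 fall between the two thresholds and correspond to the samples that are neither positive nor negative.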

In problem instance-dependent cases, the most direct measure of sample-wise regularity is derived from empirical loss, but related studies present two diametrically opposite ideas.

According to training set bias, the most straightforward way to measure sample regularity is to calculate the distance between its predicted value and the ground truth in the loss function. When thresholds are set, the distance evolves into sample difficulty classification, which is applied to improve the generalization performance of learning models, such as the training-loss-based sample weight assignment methods [55, 56]. These methods have two distinct ideas as follows.

1) Label noise can be eliminated by selecting training samples with smaller loss, because they are likely to be clean samples.

2) Class imbalance can be approached by selecting training samples with higher loss, because they are more likely to belong to a minority sample category.

From the perspective of classical computational learning theory, this simple representation of sample regularity is problematic in two ways.

1) The goal of the learning model’s task is to achieve good generalization performance on the validation set or even the test set by learning on a specified training set. Simply calculating the task difficulty of the training samples does not reflect the difficulty of generalization on independent and identically distributed (i.i.d.) validation sets, or even on differently distributed test sets. Intuitively, the relationship between the difficulty or importance of each training sample and the loss on the validation set can be inferred by using the “leave-one-out” method, that is, retraining after removing a single training sample to evaluate its impact on the loss on the validation or test set. However, this requires retraining the model once for every training sample, which is prohibitively expensive for large datasets. Koh and Liang [20] therefore used the influence function, a classical tool in statistics, to estimate the impact of training samples on network generalization without retraining.

2) Training/validation/test samples that make the empirical loss greater do not necessarily carry higher value, because these samples may be outliers, and forcibly fitting or memorizing them may degrade the generalization performance. However, empirical analysis of deep neural network generalization shows that the effective capacity of neural networks is sufficient for memorizing the entire dataset [11], so that training can converge even when the sample labels are corrupted. This implies that the forced memorization of outliers has only a limited influence on the generalization performance of deep neural networks. As a result, Feldman [14] pointed out that, owing to the long-tailed nature of the real-world image data distribution, the memorization of rare and atypical samples was the main source of the unreasonable and unexpected generalization performance [15] of deep neural networks.
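The leave-one-out procedure mentioned in item 1) can be sketched on a toy problem. To keep the example self-contained, a closed-form ridge regression stands in for the deep network, which is a substitution made purely for illustration; the data and function names are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def val_loss(w, X_val, y_val):
    """Mean squared error on the validation set."""
    return float(np.mean((X_val @ w - y_val) ** 2))

def leave_one_out_influence(X, y, X_val, y_val, lam=1e-3):
    """Change in validation loss caused by removing each training sample."""
    base = val_loss(fit_ridge(X, y, lam), X_val, y_val)
    influence = np.empty(len(X))
    for i in range(len(X)):  # one full retraining per training sample
        mask = np.arange(len(X)) != i
        w_i = fit_ridge(X[mask], y[mask], lam)
        influence[i] = val_loss(w_i, X_val, y_val) - base
    return influence

# Clean linear data with one corrupted label at index 20.
t = np.linspace(-1.0, 1.0, 21)
X = np.column_stack([np.ones_like(t), t])
y = 1.0 + 2.0 * t
y[20] += 5.0
t_val = np.linspace(-0.95, 0.95, 20)
X_val = np.column_stack([np.ones_like(t_val), t_val])
y_val = 1.0 + 2.0 * t_val

influence = leave_one_out_influence(X, y, X_val, y_val)
print("most harmful training sample:", int(np.argmin(influence)))
```

A strongly negative value means that removing the sample lowers the validation loss. The loop makes the cost explicit: a training set of n samples requires n retrainings, which is what motivates approximations such as the influence function.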

Given that challenges remain in terms of theoretical understanding, exploring interesting and poorly understood behaviour, such as sample regularity, through empirical studies where experiments are designed systematically is a promising method and current trend. These findings may have an impact on visual pattern understanding.

Recent studies have explored the generalization mechanism of DNNs via empirical analysis with an emphasis on the representation and computation of sample difficulty on the basis of memorization-generalization mechanisms. According to Ref. [24], sample forgetting occurs not only when a new task is learned but also at various phases of training on the same dataset. The authors thus refer to the event in which an individual sample transitions from being classified correctly to incorrectly over the course of learning as a “forgetting event”. Their experiments confirm that the samples with more ForEvents during training tend to be hard samples, whereas the samples forgotten most frequently are likely to be noisy samples.

Furthermore, Jiang et al. [29] used the memorization-generalization continuum to reveal that deep neural networks could not only remember rare and atypical samples but also effectively generalize to data that shared the same pattern or intrinsic regularity. The consistency score is defined as the expected accuracy for a held-out instance when the model is trained on a training set of fixed size sampled from the data distribution. In this way, a relationship between sample memorization and generalization is established. Empirical evidence suggests that cumulative binary training loss is a lightweight and effective approximation of the consistency score. In subsequent work, Feldman and Zhang [57] approximated the effect of each training sample on the accuracy of each test sample, as well as the probability of memorization occurrence of the training samples, thus confirming the apparent benefit of memorization to generalization on several standard test datasets.
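Both statistics discussed above can be computed from the per-epoch classification record of each sample. The following minimal sketch assumes that the record is stored as a binary matrix with one row per epoch and one column per sample; the variable names are hypothetical.

```python
import numpy as np

def regularity_measures(correct):
    """Compute (cumulative correct count, forgetting events) per sample.

    correct: array of shape (n_epochs, n_samples); entry 1 means the
    sample was classified correctly after that epoch, 0 otherwise.
    """
    correct = np.asarray(correct, dtype=int)
    # Cumulative number of correct classifications over training.
    cumulative_correct = correct.sum(axis=0)
    # A forgetting event is a transition from correct to incorrect
    # between two consecutive epochs.
    events = ((correct[:-1] == 1) & (correct[1:] == 0)).sum(axis=0)
    return cumulative_correct, events

# Three toy samples tracked over five epochs.
history = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
])
cbtl, for_events = regularity_measures(history)
print(cbtl, for_events)
```

Applying the same function to the record of test samples yields CBGL and MgEvents, since the two pairs of measures share one computing procedure.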

In cases where measures for evaluating the rarity and atypicality of the training/test samples are important, we explore how to represent the distribution of intrinsic patterns in samples through memorization-generalization theory, empirically demonstrating the exposed regularity of the underlying patterns in the data. Specifically, we combine long-term stable information and short-term dynamic signals collected during training to propose unified regularity measures with two-dimensional representations, which are experimentally confirmed to be beneficial for training and testing acceleration.

Notably, the proposed measure of sample regularity (in problem instance-dependent cases) can serve as a probe to discover the SGD dynamics in the long-tailed setting, and thus contribute a simple yet effective resampling method [58].

6 Conclusion

In this paper, we propose to measure the sample-wise regularity that a given sample exhibits when it is learned or generalized. We show that for i.i.d. training and test sets, the sample distributions are similar under the statistics of similar definitions, i.e., CBTL/CBGL and ForEvents/MgEvents. Inspired by this result, we propose a pair of two-dimensional representations for measuring sample regularity in network learning and generalization. In the property investigation, we find that the proposed measures seem to be quite stable with respect to the different characteristics of the training and testing phases. Further applications in training and test acceleration show that samples with higher regularity seem to contribute little to both processes, which in turn validates the effectiveness of the proposed measures. However, the main limitation is the lack of fundamental theory, since a theoretical description of sample-wise regularity remains challenging. As the full mathematical characterization of the memorization-generalization mechanism develops, a thorough understanding of sample regularity may provide supporting evidence to justify our work.

Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

CBTL: cumulative binary training loss
CBGL: cumulative binary generalizing loss
ForEvents: forgetting events
MgEvents: mal-generalizing events

References

Lipton, Z. C. (2018). The mythos of model interpretability. Communications of the ACM, 61(10), 36–43.
Ribeiro, M.T., Singh, S., & Guestrin, C. (2016). Model-agnostic interpretability of machine learning. arXiv preprint. arXiv:1606.05386.
Kishida, I., & Nakayama, H. (2019). Empirical study of easy and hard examples in CNN training. In T. Gedeon, K.W. Wong, & M. Lee (Eds.), Proceedings of the 26th international conference on neural information processing (pp. 179–188). Cham: Springer.
Ren, M., Zeng, W., Yang, B., & Urtasun, R. (2018). Learning to reweight examples for robust deep learning. In J. G. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (pp. 4331–4340). Stroudsburg: International Machine Learning Society.
Yuan, H., Shi, Y., Xu, N., Yang, X., Geng, X., & Rui, Y. (2023). Learning from biased soft labels. In A. Oh, T. Naumann, A. Globerson, et al. (Eds.), Proceedings of the 37th international conference on neural information processing systems (pp. 59566–59584). Red Hook: Curran Associates.
Sun, C., Shrivastava, A., Singh, S., & Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision (pp. 843–852). Piscataway: IEEE.
Shu, J., Yuan, X., & Meng, D. (2023). CMW-Net: an adaptive robust algorithm for sample selection and label correction. National Science Review, 10(6), nwad084.
Kawaguchi, K., Bengio, Y., Verma, V., & Kaelbling, L.P. (2018). Generalization in machine learning via analytical learning theory. arXiv preprint. arXiv:1802.07426.
Zhang, B., Chen, J., Xu, Y., Zhang, H., Yang, X., & Geng, X. (2024). Auto-encoding score distribution regression for action quality assessment. Neural Computing & Applications, 36(2), 929–942.
Pérez, G. V., Camargo, C. Q., & Louis, A. A. (2019). Deep learning generalizes because the parameter-function map is biased towards simple functions. In Proceedings of the 7th international conference on learning representations (pp. 1–35). Retrieved December 1, 2024, from https://openreview.net/forum?id=rye4g3AqFm.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In Proceedings of the 5th international conference on learning representations (pp. 1–15). Retrieved December 1, 2024, from https://openreview.net/forum?id=Sy8gdB9xx.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
Feldman, V. (2020). Does learning require memorization? A short tale about a long tail. In K. Makarychev, Y. Makarychev, M. Tulsiani, et al. (Eds.), Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing (pp. 954–959). New York: ACM.
Sejnowski, T. J. (2020). The unreasonable effectiveness of deep learning in artificial intelligence. Proceedings of the National Academy of Sciences, 117(48), 30033–30038.
Katharopoulos, A., & Fleuret, F. (2018). Not all samples are created equal: deep learning with importance sampling. In J. G. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (pp. 2530–2539). Stroudsburg: International Machine Learning Society.
Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 761–769). Piscataway: IEEE.
Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., & Lin, D. (2019). Libra R-CNN: towards balanced learning for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 821–830). Piscataway: IEEE.
Wang, T., Huan, J., & Li, B. (2018). Data dropout: optimizing training data for convolutional neural networks. In L.H. Tsoukalas, É. Grégoire, & M. Alamaniotis (Eds.), Proceedings of the IEEE 30th international conference on tools with artificial intelligence (pp. 39–46). Piscataway: IEEE.
Koh, P.W., & Liang, P. (2017). Understanding black-box predictions via influence functions. In D. Precup & Y.W. Teh (Eds.), Proceedings of the 34th international conference on machine learning (pp. 1885–1894). Stroudsburg: International Machine Learning Society.
Li, B., Liu, Y., & Wang, X. (2019). Gradient harmonized single-stage detector. In Proceedings of the 33rd AAAI conference on artificial intelligence (pp. 8577–8584). Palo Alto: AAAI Press.
Cao, Y., Chen, K., Loy, C. C., & Lin, D. (2020). Prime sample attention in object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11583–11591). Piscataway: IEEE.
Zhang, H., Xing, X., & Liu, L. (2021). Dualgraph: a graph-based method for reasoning about label noise. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9654–9663). Piscataway: IEEE.
Toneva, M., Sordoni, A., Tachet des Combes, R., Trischler, A., Bengio, Y., & Gordon, G. J. (2019). An empirical study of example forgetting during deep neural network learning. In Proceedings of the 7th international conference on learning representations (pp. 1–18). Retrieved December 1, 2024, from https://openreview.net/forum?id=BJlxm30cKm.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America, 114(13), 3521–3526.
Ritter, H., Botev, A., & Barber, D. (2018). Online structured Laplace approximations for overcoming catastrophic forgetting. In S. Bengio, H. Wallach, H. Larochelle, et al. (Eds.), Proceedings of the 32nd international conference on neural information processing systems (pp. 3742–3752). Red Hook: Curran Associates.
Yaghoobzadeh, Y., Tachet des Combes, R., Hazen, T. J., & Sordoni, A. (2019). Robust natural language inference models with example forgetting. arXiv preprint. arXiv:1911.03861.
Nguyen, G., Chen, S., Do, T., Jun, T.J., Choi, H.-J., & Kim, D. (2020). Dissecting catastrophic forgetting in continual learning by deep visualization. arXiv preprint. arXiv:2001.01578.
Jiang, Z., Zhang, C., Talwar, K., & Mozer, M. C. (2021). Characterizing structural regularities of labeled data in overparameterized models. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning (pp. 5034–5044). Stroudsburg: International Machine Learning Society.
Baldock, R., Maennel, H., & Neyshabur, B. (2021). Deep learning through the lens of example difficulty. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, et al. (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 10876–10889). Red Hook: Curran Associates.
Li, H. (2012). Statistical learning methods. Beijing: Tsinghua University Press.
French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4), 128–135.
McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: the sequential learning problem. The Psychology of Learning and Motivation, 24, 109–165.
Ratcliff, R. (1990). Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2), 285.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical Report, University of Toronto.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). Piscataway: IEEE.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint. arXiv:1409.1556.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Li, F.-F. (2009). ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 248–255). Piscataway: IEEE.
Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708). Piscataway: IEEE.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2121–2159.
Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint. arXiv:1412.6980.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint. arXiv:1312.4400.
Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. In R. C. Wilson, E. R. Hancock, & W. A. P. Smith (Eds.), Proceedings of the British machine vision conference (pp. 1–12). Swansea: BMVA Press.
Tan, M., & Le, Q. (2019). Efficientnet: rethinking model scaling for convolutional neural networks. In K. Chaudhuri & R. Salakhutdinov (Eds.), Proceedings of the 36th international conference on machine learning (pp. 6105–6114). Stroudsburg: International Machine Learning Society.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Quinlan, J. R. (2014). C4.5: programs for machine learning. San Francisco: Morgan Kaufmann.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B, Statistical Methodology, 20(2), 215–232.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, et al. (Eds.), Proceedings of the 29th international conference on neural information processing systems (pp. 91–99). Red Hook: Curran Associates.
Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499–1503.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988). Piscataway: IEEE.
Chang, H.-S., Learned-Miller, E. G., & McCallum, A. (2017). Active bias: training more accurate neural networks by emphasizing high variance samples. In I. Guyon, U. von Luxburg, S. Bengio, et al. (Eds.), Proceedings of the 31st international conference on neural information processing systems. Red Hook: Curran Associates.
Jiang, L., Zhou, Z., Leung, T., Li, L.-J., & Li, F.-F. (2018). Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In J. G. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (pp. 2309–2318). Stroudsburg: International Machine Learning Society.
Feldman, V., & Zhang, C. (2020). What neural networks memorize and why: discovering the long tail via influence estimation. In H. Larochelle, M. Ranzato, R. Hadsell, et al. (Eds.), Proceedings of the 34th international conference on neural information processing systems (pp. 1–11). Red Hook: Curran Associates.
Zhang, C., Hu, B., Liuzhang, Y., Wang, L., Liu, L., & Liu, Y. (2022). Switching: understanding the class-reversed sampling in tail sample memorization. Machine Learning, 111(3), 1073–1101.
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: the kitti dataset. The International Journal of Robotics Research, 32(11), 1231–1237.
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). Detectron2. Retrieved December 1, 2024, from https://github.com/facebookresearch/detectron2.
Lin, T.-Y., Dollár, P., Girshick, R. B., He, K., Hariharan, B., & Belongie, S. J. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 936–944). Piscataway: IEEE.


Funding

This work was supported by the National Natural Science Foundation of China (No. 62202370).

Author information

Authors and Affiliations

Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, 710049, China
Chi Zhang, Meng Yuan, Yu Liu, Le Wang & Yuehu Liu
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an, 710049, China
Chi Zhang, Meng Yuan, Yu Liu, Le Wang & Yuehu Liu
National Engineering Research Center for Visual Information and Applications, Xi’an Jiaotong University, Xi’an, 710049, China
Chi Zhang, Meng Yuan, Yu Liu, Le Wang & Yuehu Liu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China
Xiaoning Ma, Haoang Lu & Yuanqi Su

Authors

Chi Zhang
Meng Yuan
Xiaoning Ma
Yu Liu
Haoang Lu
Le Wang
Yuanqi Su
Yuehu Liu

Contributions

CZ designed the concept and the method of the study. CZ, MY, XM, and YL jointly designed the experiments and analyzed the experimental results. YL and MY performed the experiments. LW, HL, YS and YL helped with the analysis and constructive discussions. CZ, MY, and XM prepared the manuscript, and all authors provided feedback during revision of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Chi Zhang.

Ethics declarations

Competing interests

The authors declare that they have no conflict of interest/competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

We apply the proposed two-dimensional representation method to the visual detection task in traffic scenes on the KITTI dataset [59]. In this section, the difficulty level of the samples is measured in the two-dimensional regularity space by the sample degree, discovering corner traffic scenes, exploring their boundary semantics and revealing information such as congestion, obstruction, construction, intersections, etc. This can be used to guide the synthesis of corner traffic scenes in future work.

A.1 Experimental setup

Our experiments are based on the Detectron2 platform [60]. We train the target detection network Faster R-CNN [52] on the KITTI dataset [59], in which ResNet-101 [36] and FPN [61] are used as backbone networks for feature extraction and fusion. The training is conducted for a total of 200 epochs, with a batch size of 16, an initial learning rate of 0.05, and a decay rate of 0.1 in the 81st and 122nd epochs.
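The stated learning-rate schedule can be written as a simple step function. This sketch is independent of Detectron2 and assumes 1-indexed epochs with each decay taking effect from the listed epoch onward; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.05, gamma=0.1, milestones=(81, 122)):
    """Step schedule: multiply the rate by gamma at each milestone epoch.

    ASSUMPTION: epochs are 1-indexed and the decay applies from the
    milestone epoch onward.
    """
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** n_decays

for epoch in (1, 80, 81, 121, 122, 200):
    print(epoch, learning_rate(epoch))
```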

Unlike the classification task, the target detection task outputs the category and location of all instances on the image. This section converts the prediction state after each epoch of training into a binary variable by calculating the F1 metric of the prediction results of all instances on each image, which is 1 if $F1>0.5$ and 0 otherwise.

The 2D detection dataset of KITTI has 7481 labeled RGB images with 9 categories: Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc and DontCare, in which DontCare indicates that the region is unlabeled. To avoid the effect of too many false positives during the evaluation process, the prediction results of the Misc region and the DontCare region are ignored.

First, the predictions for each image are sorted by confidence and matched with the ground-truth instances of the corresponding categories. Matching starts from the highest-confidence prediction, and once an instance has been matched, it is excluded from subsequent matching. The numbers of true positives (TP), false positives (FP) and false negatives (FN) are then computed from the matched pairs of predictions and ground-truth instances. We follow the KITTI evaluation criteria, under which a matched pair counts as a true positive if its IoU is greater than 0.7 for vehicles or greater than 0.5 for pedestrians. The number of false positives is the total number of predictions minus the number of true positives, and the number of false negatives is the total number of ground-truth instances minus the number of true positives. Finally, the Precision, Recall and F1 score are calculated as follows:

$$ \begin{aligned} &{\mathrm{Precision =\frac{TP}{TP+FP}}}, \\ &{\mathrm{Recall =\frac{TP}{TP+FN}}}, \\ &{ F1 = \frac{2\times \mathrm{{Precision\times Recall}}}{\mathrm{{Precision+Recall}}}.} \end{aligned} $$

(3)
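The matching procedure and Eq. (3) can be sketched as follows. This is a minimal illustration, not the authors' Detectron2 code: the box format `(x1, y1, x2, y2)`, the function names, the set of classes treated as vehicles, and the convention that F1 is 0 when there are no true positives are all assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)

# Assumption: which categories use the 0.7 "vehicle" threshold.
VEHICLES = {"Car", "Van", "Truck", "Tram"}


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def image_f1(preds: List[Tuple[str, float, Box]],
             gts: List[Tuple[str, Box]]) -> float:
    """F1 of one image. preds: (category, confidence, box); gts: (category, box)."""
    matched = set()  # indices of ground-truth instances already matched
    tp = 0
    # Greedy matching, starting from the highest-confidence prediction.
    for cat, _, box in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = None, 0.0
        for j, (gt_cat, gt_box) in enumerate(gts):
            if j in matched or gt_cat != cat:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        threshold = 0.7 if cat in VEHICLES else 0.5
        if best_j is not None and best_iou > threshold:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions that are not true positives
    fn = len(gts) - tp    # ground-truth instances that were never matched
    if tp == 0:
        return 0.0        # assumed convention for the undefined case
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def binary_state(preds, gts) -> int:
    """Binary prediction state of an image: 1 if F1 > 0.5, else 0."""
    return 1 if image_f1(preds, gts) > 0.5 else 0
```

Predictions in the Misc and DontCare regions would need to be filtered out before calling `image_f1`, as described above.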

As described in Sect. 3, we update the inference states of all training samples after each epoch to obtain the number of ForEvents and CBTL.
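Given the per-epoch binary states of one image, the two measures could be accumulated as in the sketch below. It assumes that CBTL is the count of epochs in which the state is 1 and that a forgetting event is a transition from state 1 to state 0 between consecutive epochs; the function name and the list-based interface are hypothetical.

```python
from typing import List, Tuple


def cbtl_and_forevents(states: List[int]) -> Tuple[int, int]:
    """states[t] is the binary prediction state of one sample after epoch t.

    Returns (CBTL, number of ForEvents) under the assumptions stated above.
    """
    # CBTL: cumulative number of epochs in which the sample was predicted correctly.
    cbtl = sum(states)
    # ForEvents: a previously correct sample becomes incorrect at the next epoch.
    forgetting = sum(
        1 for prev, cur in zip(states, states[1:]) if prev == 1 and cur == 0
    )
    return cbtl, forgetting
```

For example, the state sequence `[0, 1, 1, 0, 1, 0, 1, 1]` contains five correct epochs and two transitions from 1 to 0.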

A.2 Representation of traffic scenes and discovery of boundary samples

The histograms of the sample distribution along each single dimension are shown in Fig. 14. The distribution of each measure is again long-tailed and dominated by easy samples, which have high CBTL scores and few ForEvents. Fig. 15(a) shows the two-dimensional regularity distribution on the KITTI training set. Two samples lie far from all the others and are marked as outliers in Fig. 15(a). Their original RGB images, together with the predictions and ground-truth labels, are visualized in Fig. 16. On the left side of both images there is a narrow annotation, which is evidently a labeling error, and many predictions fall in its vicinity.

Because the annotation is so narrow, the IoUs of many similarly located predictions are not high, so non-maximum suppression is ineffective. Many predictions are therefore localized to the same incorrectly labeled target, which produces a large number of false positives and a poor evaluation of the sample. Consequently, the proposed sample regularity representation also helps to detect incorrectly or roughly labeled outliers.

As shown in Fig. 15(a), a quadratic curve is fitted to the sample distribution; its function is $y=-0.00176x^{2}+0.312x+7.21$ and its axis of symmetry is $x=88.64$. Following the method in Sect. 4.2, we divide the samples into difficulty levels by uniform angles about the center $(88.64,0)$. The sample degree is the angle between the line segment from $(88.64,0)$ to the sample point and the line segment from the origin to $(88.64,0)$. Excluding the two outliers mentioned above, the distribution of sample degrees is shown in Fig. 15(b). It is still long-tailed: simple samples with degrees approaching 180 dominate, and the scattered samples with smaller degrees are the boundary samples.

We sort the samples in ascending order of sample degree and take the 5% of samples with the smallest degrees for observation, from which we extract their boundary semantics. The boundary semantics fall roughly into four types: congestion scenes (such as crowds or traffic), construction scenes, intersection scenes, and scenes with severe occlusion, such as vehicles parked in sequence along the roadside. Some selected samples are shown in Fig. 17.
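As a rough check of the geometry, the axis of symmetry of a quadratic $y=ax^{2}+bx+c$ is $x=-b/(2a)$, and the sample degree is the angle at the center between the ray toward the origin and the ray toward the sample point. The sketch below assumes `(x, y)` are a sample's coordinates in the two-dimensional regularity space of Fig. 15(a); the function names are hypothetical.

```python
import math


def axis_of_symmetry(a: float, b: float) -> float:
    """Axis of symmetry x = -b / (2a) of the quadratic y = a*x^2 + b*x + c."""
    return -b / (2.0 * a)


def sample_degree(x: float, y: float, cx: float = 88.64) -> float:
    """Angle in degrees at the center (cx, 0) between the segment to the origin
    and the segment to the sample point (x, y)."""
    ux, uy = -cx, 0.0      # vector from the center to the origin
    vx, vy = x - cx, y     # vector from the center to the sample point
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    # atan2(|cross|, dot) gives the unsigned angle between the vectors in [0, 180].
    return math.degrees(math.atan2(abs(cross), dot))
```

With the fitted coefficients, `axis_of_symmetry(-0.00176, 0.312)` evaluates to about 88.64. A point on the horizontal axis to the right of the center has a degree of 180, and a point directly above the center has a degree of 90.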
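Selecting the boundary candidates amounts to a sort followed by a cut, as in the sketch below. The dictionary interface, the function name, and the rounding rule (truncate, but keep at least one sample) are assumptions for illustration.

```python
from typing import Dict, List


def select_boundary_samples(degrees: Dict[str, float],
                            fraction: float = 0.05) -> List[str]:
    """Return the ids of the `fraction` of samples with the smallest degrees."""
    # Sort sample ids by ascending sample degree.
    ranked = sorted(degrees, key=degrees.get)
    # Assumed rounding rule: truncate, but always keep at least one sample.
    k = max(1, int(fraction * len(ranked)))
    return ranked[:k]
```

The returned samples would then be inspected manually to extract their boundary semantics.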

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, C., Yuan, M., Ma, X. et al. Unified regularity measures for sample-wise learning and generalization. Vis. Intell. 2, 38 (2024). https://doi.org/10.1007/s44267-024-00069-4

Received: 21 October 2023
Revised: 10 December 2024
Accepted: 12 December 2024
Published: 31 December 2024
Version of record: 31 December 2024
DOI: https://doi.org/10.1007/s44267-024-00069-4
