1 Introduction

Although deep learning has been widely applied and has made significant progress, the generalization of deep neural networks is still affected by the unexplainable black-box structure of algorithms [1, 2], biased datasets [3–5], noisy labeling [6, 7], and various evaluation metrics [8, 9]. In particular, an over-parameterized network, which has far more parameters than training samples, is predicted by classical learning theory to overfit severely. However, it generalizes remarkably well [10]. Zhang et al. [11, 12] further reported that a deep neural network trained with the ordinary stochastic gradient descent (SGD) [13] method could easily fit random labels. In this case, measures from statistical learning theory fail to explain why deep neural networks generalize well from the training set to new data. For example, Rademacher complexity, which measures the capacity to fit random noise, will be close to 1, resulting in loose and uninformative generalization error bounds.

Recently, some researchers have attempted to develop a new generalization theory adapted to over-parameterization from the viewpoint of network memorization and generalization. Zhang et al. [11] argued that neural networks not only captured the remaining signals in correctly-labeled training data, but also forcibly fitted the noisy parts by memorizing them. In other words, the ability to extract patterns while memorizing exceptional samples contributes substantially to network generalization. Feldman [14] further noted that the memorization of rare and atypical samples was the source of the unexpectedly strong generalization performance [15] achieved by deep neural networks. Both imply differences in the distribution of patterns contained in samples and the powerful memory of neural networks. A fundamental question then arises: how can the intrinsic pattern in different samples be represented to distinguish the different states of being regular, i.e., sample regularities?

The quantification of the intrinsic sample patterns is crucial when dealing with the training set bias, i.e., the inconsistency between the distributions of the training set and the test set. Moreover, existing studies have shown that different samples contain different amounts of information, hence contributing differently to model training [16].

To address this issue, measuring samples and training on them according to information gains becomes a reasonable solution, which has been adopted by many studies to address specific tasks [17–23]. In OHEM [17], the dynamic learning difficulty of an example is represented by its current loss, and the model is trained with a slightly modified SGD method. Instead, Pang et al. [18] proposed measuring difficulty on the basis of the intersection over union (IoU) distribution. Consequently, IoU-balanced sampling is utilized to discover hard examples for detection tasks. Wang et al. [19] quantified the impact of samples on model training via the influence function [20]. Li et al. [21] noted that the essential effect of the disharmonies between samples could be summarized in terms of the gradient, and then proposed a novel gradient harmonizing mechanism (GHM) to hedge the disharmonies. A novel IoU hierarchical local rank strategy is proposed in Ref. [22] as a quantitative way to evaluate sample differences. Zhang et al. [23] proposed a robust learning algorithm, named DualGraph, to capture structural relationships among labels with graph neural networks at two different levels, namely the instance-level and distribution-level relations. However, the methods mentioned above are all task-specific, which naturally leads us to a more general question: do universal measures exist for the quantification of the intrinsic sample patterns?

Most recently, a new line of research has been opened, which empirically represents the sample distribution on the basis of the memorization-generalization mechanism. Toneva et al. [24] reported that sample forgetting existed not only in the training process of a new learning task (i.e., catastrophic forgetting [25, 26]), but also in different training stages on the same dataset. Therefore, they proposed a forgetting-event-based sample representation (abbreviated as forgetting events (ForEvents) in this manuscript), which has since served as a measure of sample difficulty for robust reasoning in natural language processing [27]. Nguyen et al. [28] further used a captioning model to visually explain the catastrophic forgetting phenomenon, especially when the sample was forgotten or changed.

In contrast, Jiang et al. [29] reported that cumulative binary training loss (CBTL) could be used as a lightweight proxy for the consistency score of samples. Inspired by the above works, we attempt to explore how to represent the intrinsic sample patterns through the memorization-generalization mechanism, empirically demonstrating the exposed regularity of the underlying data patterns. However, we find that a unidimensional sample representation, whether CBTL or the number of ForEvents alone, can be ambiguous when measuring intrinsic sample patterns. For example, samples with the same CBTL may have different visual difficulties because their numbers of ForEvents differ, as shown in Fig. 1. This phenomenon is caused mainly by the difference in the way the two measures are calculated, although both are statistics on the sample-wise learning frequency. The forgetting-event-based sample representation captures dynamic events during the learning process, whereas the CBTL reflects the cumulative number of successes in learning a given sample [30].

Figure 1

An intuitive illustration that samples of the same regularity level (represented by a certain unidimensional measure) can have different visual complexity levels when the number of forgetting events (ForEvents) varies. Specifically, each column has the same cumulative binary training loss (CBTL), with fewer ForEvents in the top row than in the bottom row. As a result, the samples in the top row appear to be more regular and recognizable

Therefore, we present unified regularity measures with a two-dimensional representation for revealing intrinsic sample patterns, taking advantage of both short-term dynamic signals and long-term stable information. Specifically, the distributions of training samples and test samples are represented in the spaces of CBTL-ForEvents and cumulative binary generalizing loss (CBGL)-mal-generalizing events (MgEvents), respectively. The proposed unified regularity measures have been verified by numerous empirical investigations, including training randomness, the relationship between memorization and generalization, and robustness. Moreover, applications in training and testing acceleration demonstrate their promising effects on network training, hard sample selection, and visual algorithm testing.

The major contributions of this paper are three-fold:

  1) A two-dimensional representation involving CBTL and ForEvents is proposed to measure the regularity of training samples.

  2) Likewise, a two-dimensional representation combining the newly defined CBGL and MgEvents is proposed to measure the regularity of test samples.

  3) Our experimental findings suggest that samples with higher regularity seem to contribute little to both training and testing tasks, which in turn validates the effectiveness of the proposed measures.

2 Preliminary work

The general form [31] of the learning system can be described as follows.

Consider a training dataset \(T=\left \lbrace \left ( x_{1},y_{1}\right ),\ldots ,\left ( x_{N},y_{N} \right ) \right \rbrace \), where \(\left ( x_{i},y_{i}\right )\) (\(i=1,2,\ldots ,N\)) denotes an input observation-label sample pair. The learning system is trained with the given training dataset to obtain a model, denoted as a conditional probability distribution \(\hat{P}\left (Y|X\right )\) or a decision function \(Y=\hat{f}\left (X\right )\), to describe the mapping between input and output random variables. The optimal model is generally trained via the strategy of minimizing the empirical risk \(R_{\mathrm{emp}}\left (\hat{f}\right )=(\sum _{i=1}^{N}L(y_{i}, \hat{f}(x_{i})))/N\), where \(L(\cdot , \cdot )\) is the loss.
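As a minimal sketch of the empirical risk defined above, the following snippet averages a loss over a small dataset. The threshold classifier, the 0-1 loss, and the four samples are hypothetical and serve only to illustrate the formula.

```python
from typing import Callable, List, Tuple


def empirical_risk(
    samples: List[Tuple[float, int]],
    model: Callable[[float], int],
    loss: Callable[[int, int], float],
) -> float:
    """Average loss of `model` over the (x_i, y_i) pairs in `samples`."""
    return sum(loss(y, model(x)) for x, y in samples) / len(samples)


def zero_one_loss(y_true: int, y_pred: int) -> float:
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return 0.0 if y_true == y_pred else 1.0


def threshold_model(x: float) -> int:
    """Hypothetical decision function: predict class 1 when x >= 0.5."""
    return 1 if x >= 0.5 else 0


# Hypothetical training set T with N = 4 samples
T = [(0.2, 0), (0.8, 1), (0.6, 0), (0.4, 1)]
risk = empirical_risk(T, threshold_model, zero_one_loss)
print(risk)  # 0.5 (two of the four samples are misclassified)
```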

2.1 Forgetting events and cumulative binary training loss

Continuous learning in the real world requires that intelligent systems can learn on successive tasks without performance degradation on the preceding training tasks, just as humans do. Researchers have found that this problem setting poses a great challenge for connectionism-based neural networks [32–34]. This phenomenon, known as “catastrophic forgetting”, is manifested mainly by the tendency of neural networks to forget all the acquired knowledge of the previous task quickly and abruptly after the current task is added. Even deep neural networks, which have achieved great success in recent years, are not able to overcome this shortcoming [25], which inevitably raises doubts about research on general artificial intelligence. This phenomenon is caused primarily by the fact that when the goal changes, the learned weights that were adapted to the previous task are re-adapted to the requirements of the new one, preventing the network from generalizing well to the previous task.

Inspired by this, Toneva et al. [24] argued that a single learning task optimization based on mini-batch stochastic gradient descent could be considered a process similar to continuous learning. In this process, each mini-batch of training data can be considered a small task that is sequentially handed over to the deep neural network. This leads to the following definition of sample forgetting events [24].

Forgetting events (ForEvents)

During the mini-batch sample learning process, the acquired (i.e., correctly classified) training samples at time t are misclassified at subsequent time \(t'\) (\(t'>t\)).

In Ref. [24], ForEvents during the training process are explored, on the basis of which training data are classified into forgettable and unforgettable samples. Furthermore, the feature atypicality and visual illegibility of the forgettable samples are verified in their experiments. However, both easily learned regular samples and hard-to-learn exceptional samples have few ForEvents, resulting in symmetric statistics of ForEvents, as illustrated in Fig. 2. Consequently, ForEvents alone cannot distinguish these two kinds of samples.

Figure 2

Regularity of CIFAR-10 [35] samples on the (a) training set, (b) test set, and (c) visualization. (a) and (b) illustrate the training and test set distributions in the two-dimensional regularity spaces, i.e., CBTL and ForEvents for training samples, cumulative binary generalizing loss (CBGL) and mal-generalizing events (MgEvents) for test samples, respectively. Each scatter point represents a training or a test sample, whose color reflects the intensity of the sample density as the color code shown in the right bar. (c) shows the visual changes of the samples with respect to the variation of the proposed regularity measures. The selected training and test samples are shown at the top and bottom, respectively, and are sorted by the two-dimensional regularity along the horizontal and vertical axes. Specifically, for the top part of (c), the horizontal and vertical axes are represented by CBTL and ForEvents, while the bottom part is represented by CBGL and MgEvents. The higher the CBTL/CBGL is, the easier it is to visually detect the sample. Similarly, the greater the number of ForEvents/MgEvents is, the greater the degree of uncertainty in learning or generalizing the samples is. The number in the upper left corner of each image represents its category: 0 - airplane, 1 - automobile, 2 - bird, 3 - cat, 4 - deer, 5 - dog, 6 - frog, 7 - horse, 8 - ship, and 9 - truck

Following another paradigm, Jiang et al. [29] quantified the regularity of samples by measuring the consistency of the sample with the overall distribution of all samples through a lightweight proxy cumulative binary training loss, defined as follows.

Cumulative binary training loss (CBTL)

During the mini-batch sample learning process, the cumulative number of correct classifications of the training sample up to time t.

However, CBTL alone cannot distinguish the pattern differences in samples. For example, Fig. 1 shows that samples in the CIFAR-10 [35] training set are not equally difficult to distinguish even though CBTL is the same. Specifically, sample pairs in each column share the same CBTL, with fewer ForEvents in the top row. The images in the top row are more regular and easier to recognize than those in the bottom row.

Intuitively, the occurrence of a forgetting event implies that the sample crossed in the wrong direction as the decision boundary changes, demonstrating that the direction of the loss incurred by the sample at this moment is not consistent with the overall loss of the training set. Therefore, statistics of ForEvents can be considered a measure of the uncertainty for a particular sample when learning a decision boundary. On the other hand, reflecting the cumulative number of successes in learning a given sample, CBTL is considered a long-term stability measure of the learning difficulty of one certain sample.

Inspired by these preliminary investigations, we propose to represent sample regularity in a unified two-dimensional space formed by measures of short-term learning uncertainty and long-term stability, i.e., the CBTL-ForEvents space for training samples and the newly presented CBGL-MgEvents space for test samples, as shown in Fig. 2. Properties of this representation are explored, and the idea of accelerating training and testing through dataset reduction is subsequently validated.

Owing to the high computational cost of performing statistics after each training mini-batch, we choose to calculate the regularity measures after each epoch, as detailed in Sect. 3.1.

The predicted label for the training example \(x_{i}\) obtained after t epochs of optimization is denoted as \(\hat{y}_{i}^{t}=\hat{f}^{t}(x_{i})\). Let \(acc^{t}_{i}=1_{\hat{y}_{i}^{t}=y_{i}}\) be a binary variable indicating whether the sample is correctly classified at epoch t. The above definitions of ForEvents and CBTL are then formalized as follows.

Forgetting events

Let \(for_{i}^{t}=1_{acc^{t-1}_{i}=1,acc^{t}_{i}=0}\); that is, a forgetting event is counted at epoch t when the prediction is correct at epoch \(t-1\) and incorrect at epoch t. The number of ForEvents of a training sample \((x_{i},y_{i})\) up to epoch t is defined as

$$ F^{t}_{i}=\sum _{n=1}^{t}for_{i}^{n}. $$
(1)

Cumulative binary training loss

For training sample \((x_{i},y_{i})\), CBTL at epoch t is defined as

$$ L^{t}_{i}=\sum _{n=1}^{t}acc_{i}^{n}. $$
(2)
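Eqs. (1) and (2) can be sketched in a few lines of code. The snippet below is only an illustration under stated assumptions: the function name and the two correctness sequences are hypothetical, and the sample is assumed to count as not yet learned before the first epoch (i.e., \(acc^{0}_{i}=0\)), a boundary case that the definitions above leave open.

```python
from typing import List, Tuple


def regularity_measures(acc: List[int]) -> Tuple[int, int]:
    """Return (number of ForEvents, CBTL) for one sample.

    acc[n-1] is 1 if the sample is classified correctly after epoch n, else 0.
    """
    forgetting_events = 0
    cbtl = 0
    prev = 0  # assumption: the sample counts as "not learned" before epoch 1
    for a in acc:
        if prev == 1 and a == 0:  # correct at epoch n-1, incorrect at epoch n
            forgetting_events += 1
        cbtl += a  # Eq. (2): cumulative number of correct classifications
        prev = a
    return forgetting_events, cbtl


# Two hypothetical samples with the same CBTL but different ForEvents
stable = [0, 0, 1, 1, 1, 1]
unstable = [1, 0, 1, 0, 1, 1]
print(regularity_measures(stable))    # (0, 4)
print(regularity_measures(unstable))  # (2, 4)
```

The two sequences share the same CBTL but differ in the number of ForEvents, which is the kind of ambiguity the two-dimensional representation is meant to resolve. The same function applies unchanged to test samples, in which case the outputs correspond to MgEvents and CBGL.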

2.2 Mal-generalizing events and cumulative binary generalizing loss

Likewise, for sample-wise dynamics of generalization on the test set during sequential learning of mini-batch data, the definitions of MgEvents and CBGL are given as follows.

Mal-generalizing events (MgEvents)

During the mini-batch sample learning process, the test samples can be correctly classified at epoch t but misclassified at subsequent epoch \(t'\) (\(t'>t\)).

Cumulative binary generalizing loss (CBGL)

During the mini-batch sample learning process, the cumulative number of correct classifications of test samples up to epoch t.

The formal descriptions of these definitions mirror Eqs. (1) and (2); the only difference is that they are applied to the samples in the test set.

3 Experimental verification and analysis

3.1 Experimental setup

As described in Ref. [24], checking whether a forgetting event occurs for all training samples after each mini-batch is time-consuming and computationally expensive. Therefore, they only update the statistics of the samples in the current mini-batch after each mini-batch. The calculation of MgEvents faces the same dilemma, i.e., it is not feasible to calculate the generalization status of all test samples after each mini-batch. Considering that training on a single mini-batch has a limited impact on model performance, this paper adopts a different strategy, i.e., updating the inference states of all training and test samples after each epoch. To validate this strategy, its results are compared with those of Toneva et al. [24]. We trained the ResNet-110 [36] network on the CIFAR-10 training set. The average errors between the two strategies for CBTL and the number of ForEvents are 1.2614 and 0.6762, respectively. The Pearson correlation coefficients between the two strategies, computed over the vectors of CBTL and the number of ForEvents for the 50,000 samples in the CIFAR-10 training set, are 0.9896 and 0.9835, respectively, both of which indicate strong correlation. Therefore, the strategy of making one inference and updating the states after each epoch is used as a lightweight proxy. The number of model inferences required is reduced to approximately 1/390 of the ideal case when the number of training samples is 50,000, the number of test samples is 10,000, and the batch size is 128. Note that this approximation is likely to aggravate the randomness of model generalization.
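The reduction factor quoted above can be checked with a short calculation. This is a sketch under one assumption: the "ideal case" is taken to mean a full inference pass over all training and test samples after every mini-batch, with the last partial mini-batch counted as a full step.

```python
import math

num_train = 50_000
num_test = 10_000  # appears in both strategies, so it cancels in the ratio
batch_size = 128

# Mini-batch updates per epoch (assumption: the partial last batch is counted)
batches_per_epoch = math.ceil(num_train / batch_size)

# Sample-level forward passes per epoch under each strategy
ideal = batches_per_epoch * (num_train + num_test)  # full pass after every mini-batch
per_epoch = num_train + num_test                    # one full pass per epoch

ratio = per_epoch / ideal
print(batches_per_epoch)  # 391
print(ratio)              # about 0.00256, i.e., roughly 1/390
```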

This paper explores learning statistics on the basis of the training process of ResNet-110 on the CIFAR-10 dataset. The model is trained for a total of 200 epochs, and its average training performance is close to the highest accuracy of the architecture on the CIFAR-10 dataset, i.e., 93.53%. In particular, the initial learning rate is 0.1, which decreases to 0.01 at the 81st epoch and 0.001 at the 122nd epoch. The same network is trained 10 times under the same hyper-parameter settings, and the mean values of the statistics are taken to reduce the effect of the randomness of model inference on the empirical analysis.
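The step schedule described above can be written as a small function. This is a sketch, assuming 1-indexed epochs and that each new rate takes effect at the start of the stated epoch; the function name is hypothetical.

```python
def learning_rate(epoch: int) -> float:
    """Step learning-rate schedule; `epoch` is 1-indexed (assumed convention)."""
    if epoch >= 122:
        return 0.001
    if epoch >= 81:
        return 0.01
    return 0.1


# Learning rates around the two decay points of the 200-epoch run
for e in (1, 80, 81, 121, 122, 200):
    print(e, learning_rate(e))
```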

3.2 Representation of sample distribution

The sample distribution is represented from the perspective of the memorization-generalization mechanism of neural networks. Inspired by Refs. [24, 29], we empirically propose a two-dimensional sample regularity representation in the CBTL-ForEvents and CBGL-MgEvents spaces. The properties of such representations are explored in this section.

3.2.1 Unidimensional sample distribution representation

First, the histograms of the sample distribution in a single dimension are illustrated in Fig. 3. The distributions of all measures are long-tailed and are dominated by easy samples that have high CBTL/CBGL scores and a small number of ForEvents/MgEvents. The distributions of the training and test samples are very similar, but because the test samples are not involved in the training process, their distribution has a longer tail. Likewise, the histograms of ForEvents and MgEvents also share a similar distribution, although the local shape may vary, e.g., different peaks around the 15th bin.

Figure 3

Histograms of the dataset distribution in terms of the unidimensional regularity measure, in which the horizontal axis indicates the sample grouping and the vertical axis indicates the proportion of examples in the case of a single dimension. All the distributions are long-tailed. The CBTL and CBGL histograms have similar distributions for the training and test samples, respectively. Similarly, the histograms of ForEvents and MgEvents have similar distributions, although the local shapes may differ, e.g., having different peaks around the 15th bin

3.2.2 Two-dimensional sample distribution representation

Furthermore, the regularity of training and test samples is depicted in the CBTL-ForEvents and CBGL-MgEvents spaces, as shown in Fig. 2 (a) and 2 (b), respectively. The samples are symmetrically distributed in our defined two-dimensional space with respect to ForEvents/MgEvents, whereas the distributions of ForEvents/MgEvents under the same binary training/testing loss are widely disparate, as shown in Fig. 4. These results also support our claim that a unidimensional measure cannot distinguish the intrinsic patterns of the samples well.

Figure 4

The variation of the number of ForEvents/MgEvents with the CBTL/CBGL accumulated on (a) training samples and (b) test samples. The line shows the mean value and the transparent band represents the range
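The mean and range plotted in Fig. 4 amount to grouping samples by their cumulative loss and summarizing the event counts within each group. A minimal sketch follows; the function name and the six samples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def events_by_loss(
    cum_loss: List[int], events: List[int]
) -> Dict[int, Tuple[float, int, int]]:
    """Map each CBTL (or CBGL) value to (mean, min, max) of the event counts."""
    groups = defaultdict(list)
    for c, e in zip(cum_loss, events):
        groups[c].append(e)
    return {c: (sum(v) / len(v), min(v), max(v)) for c, v in sorted(groups.items())}


# Hypothetical per-sample statistics after 200 epochs
cbtl = [200, 200, 195, 195, 195, 120]
for_events = [0, 0, 1, 3, 2, 14]
summary = events_by_loss(cbtl, for_events)
print(summary[195])  # (2.0, 1, 3)
```

A wide (min, max) band at a fixed CBTL value is what indicates that the cumulative loss alone does not determine the number of events.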

As displayed in Fig. 2 (a) and 2 (b), the density of samples in the lower right corner is greater, indicating that the easy samples dominate both the training and test sets, which is consistent with the previous findings. Moreover, these two sets share similar symmetric sample distributions, but the symmetry axes of the training samples are closer to the right and the distribution is more compact than the test samples. This is consistent with the theoretical intuition that the empirical error tends to be smaller than the test error.

3.2.3 Visual verification

We visualized several representative samples at different positions of the distribution, as shown in Fig. 2 (c). The visual patterns do vary from one sample to another.

Objects in the bottom right samples tend to be highly recognizable and regular, e.g., the brown horse on the green grass and the green frog on the gray rock. Visual ambiguity between the background and foreground objects may appear in the samples from the middle part, which increases the difficulty of identification. The objects with unconventional features in the bottom left samples are very easy to misclassify, e.g., the cat under green light looks like a frog, and the triangular boat looks like a tree.

3.3 Measure property investigation

3.3.1 Training randomness

Owing to the randomness of the optimization process and the uncertainty of model inference, the results of repetitive experiments may vary even with the same hyper-parameter settings for the same network architecture, so the stability of the measurement results needs to be explored. We record the statistics of 10 repetitive experiments.

As demonstrated in the first row of Fig. 5, the occurrences of MgEvents and the values of CBGL vary across repetitive experiments. For test sample #1, over 10 training sessions, the epochs at which MgEvents occur fall within \([0, 90]\), the number of occurrences varies within \([0, 4]\), and the values of CBGL vary within \([195, 200]\). The ForEvents and CBTL show similar dynamics, as shown in Fig. 5 (c) and Fig. 5 (d). This means that CBTL/CBGL and ForEvents/MgEvents should be computed over multiple repetitions of the experiment.

Figure 5

Dynamic stability investigation of the proposed regularity measures over 10 repetitive training sessions. Left: test samples #1 and #3. Right: training samples #7 and #44701 (the number after # represents the ID of the sample in the CIFAR-10 dataset). The horizontal axis represents the training number, and the vertical axis represents the epoch number in the training process. The blue circles in the top row of the figures mark the epoch in which the mal-generalization/forgetting event occurs. The red lines in the bottom row describe the variations in CBTL/CBGL during each session

Furthermore, to quantify the effect of training randomness on the sample distribution representation, we calculate the Pearson correlation coefficients between the distributions from different training sessions. The correlation matrix is obtained on the basis of the coefficients between the sample density vectors in each representation, as depicted in Fig. 6. The sample density vectors are obtained by normalizing the vectors acquired via the density calculation method, as illustrated in Fig. 2.

Figure 6

Correlation of the sample distribution for 10 training sessions, in which the x-axis and y-axis both represent the number of samples. The left panel shows the training samples and the right panel shows the test samples. The figure shows a \(10\times 10\) grid, representing the pairwise correlation of the sample distributions obtained from 10 training sessions. The color shade of each cell represents the value of the Pearson correlation coefficient, as indicated by the color bar on the right side of the figure

The average Pearson correlation coefficient of the 10 training sessions is 0.8634 for the training samples and 0.8733 for the test samples. This shows that the proposed measures are stable at a certain level for repetitive training sessions.
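The session-to-session comparison above can be sketched as follows. The density calculation method is not specified in the text, so the snippet assumes a normalized two-dimensional histogram on a fixed grid; the grid size, the function names, and the three tiny "sessions" are all hypothetical.

```python
import math
from typing import List, Sequence


def density_vector(
    xs: Sequence[float], ys: Sequence[float],
    x_bins: int, y_bins: int, x_max: float, y_max: float,
) -> List[float]:
    """Flattened, normalized 2D histogram of (x, y) points (assumed density method)."""
    counts = [0] * (x_bins * y_bins)
    for x, y in zip(xs, ys):
        i = min(int(x / x_max * x_bins), x_bins - 1)  # clamp x == x_max into last bin
        j = min(int(y / y_max * y_bins), y_bins - 1)
        counts[i * y_bins + j] += 1
    return [c / len(xs) for c in counts]


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation coefficient (undefined if either vector is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def correlation_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Pairwise Pearson coefficients between all density vectors."""
    return [[pearson(u, v) for v in vectors] for u in vectors]


# Hypothetical (CBTL, ForEvents) statistics of four samples in three sessions
sessions = [
    ([200, 199, 150, 100], [0, 1, 5, 9]),
    ([200, 198, 140, 110], [0, 2, 6, 8]),
    ([199, 200, 160, 90], [1, 0, 4, 9]),
]
vectors = [density_vector(x, y, 4, 2, 200, 10) for x, y in sessions]
matrix = correlation_matrix(vectors)

# Average over distinct session pairs, excluding the trivial diagonal
k = len(matrix)
mean_corr = sum(matrix[a][b] for a in range(k) for b in range(k) if a != b) / (k * (k - 1))
print(mean_corr)
```

Whether the reported averages include the diagonal entries is not stated in the text; the sketch excludes them.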

3.3.2 Relationship between measures for training and test samples

Owing to the similarities in definitions and the computing processes, here we discuss the similarities between ForEvents and MgEvents to understand the relationship between memorization and generalization through dynamic statistics of the network’s classification performance on training and test samples.

As displayed in Fig. 7, the histogram of ForEvents is similar to that of MgEvents for model #1 and model #4, with Pearson correlation coefficients of 0.9985 and 0.9984, respectively. Similarly, the histogram of CBTL is similar to that of CBGL in the empirical statistics of each model. According to the construction of the CIFAR-10 dataset [35], the training and test sets are independently and identically distributed. This means that ideally, the similarity between the distributions of ForEvents and MgEvents of the model can be used to measure the similarity between the training and test sets. In addition, the red arrows indicate the distribution variability, which to some extent reflects the difference between model learning and generalizing, i.e., generalization error.

Figure 7

Distributions of sample regularity in terms of unidimensional measures. The vertical axis of each figure represents the number of samples. Top: regularity distribution measured by ForEvents/MgEvents, and each of the corresponding x-axis represents the grouping. Bottom: regularity distribution measured by CBTL/CBGL, and each of the corresponding x-axis represents the grouping. #1 and #4 indicate the model obtained from the 1st and 4th iterations of the ResNet-110 architecture [36], respectively. The red arrows indicate the variability of the distribution

Furthermore, we investigate the synchronization between ForEvents and MgEvents, where the synchronization of the particular test sample with a training sample means that the epoch when its mal-generalizing event occurs during the training process corresponds exactly to the epoch when the forgetting event of that training sample occurs. Test samples that generalize successfully or unsuccessfully in all training epochs do not have synchronization with any training sample because no MgEvents occur. Thus, we are first concerned with synchronization in different generalization cases. As shown in Fig. 8, the number of synchronized training samples varies with the number of MgEvents for the test samples. Specifically, the higher the number of MgEvents, the greater the corresponding number of synchronized training samples. For example, for sample #90 (with 2 MgEvents), the number of synchronized training samples is 13,266, whereas sample #5695 (with 15 MgEvents) has approximately twice as many synchronized training samples during a single training process.

Figure 8

Variation in the number of synchronized training samples on specific test samples. Specifically, the greater the number of MgEvents, the greater the number of synchronized training samples
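The synchronization criterion can be sketched as a set intersection over event epochs. This is one possible reading: a training sample is treated as synchronized with a test sample if at least one of its ForEvents falls in the same epoch as one of the test sample's MgEvents. The function names, sample IDs, and correctness sequences are hypothetical.

```python
from typing import Dict, List, Set


def event_epochs(acc: List[int]) -> Set[int]:
    """1-indexed epochs at which a correct prediction turns incorrect.

    acc[n-1] is 1 if the sample is classified correctly after epoch n, else 0.
    """
    return {t + 1 for t in range(1, len(acc)) if acc[t - 1] == 1 and acc[t] == 0}


def synchronized_training_samples(
    test_acc: List[int], train_accs: Dict[int, List[int]]
) -> List[int]:
    """IDs of training samples sharing at least one event epoch with the test sample."""
    mg_epochs = event_epochs(test_acc)
    if not mg_epochs:  # no MgEvents, hence no synchronization
        return []
    return sorted(i for i, acc in train_accs.items() if mg_epochs & event_epochs(acc))


# Hypothetical test sample with MgEvents at epochs 2 and 5
test_acc = [1, 0, 1, 1, 0, 1]
train_accs = {
    7: [1, 0, 1, 1, 1, 1],  # ForEvent at epoch 2 -> synchronized
    8: [0, 1, 1, 0, 1, 1],  # ForEvent at epoch 4 -> not synchronized
    9: [1, 1, 1, 1, 0, 0],  # ForEvent at epoch 5 -> synchronized
}
print(synchronized_training_samples(test_acc, train_accs))  # [7, 9]
```

Under this reading, a test sample with more MgEvents has more epochs at which a match can occur, which is in line with the trend in Fig. 8.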

Considering the aforementioned randomness in the training and generalizing process, Fig. 8 also shows that the number of synchronized training samples for a particular test sample decreases significantly, or even drops to 0, when extended to 10 training runs. This indicates that most of the ForEvents of synchronized training samples just happen to occur at the same epoch as the MgEvents of the target test sample, although this synchronization occurs hundreds of times. This confirms that the neural network does not depend on specific training samples for the extraction and generalization of patterns embedded in the training data. Then does the generalization of specific irregular samples depend on memorizing certain training samples? We extend to 20 training runs for samples #521, #5695, and #889 to observe the distributions of their synchronized training samples, as demonstrated in Fig. 9.

Figure 9

Variation in the number of synchronized training samples on specific test samples when the training process is repeated 20 times. The green boxes indicate that the training samples belong to the same category as the test samples, and the red boxes indicate that they do not belong to the same category

To further verify the effect of these synchronized training samples on the generalization of the corresponding test samples, we train a binary classifier via a support vector machine (SVM) to determine whether a particular test sample belongs to a certain class. The input is 4096-dimensional features extracted from the training samples via the VGG-19 [37] network pre-trained on ImageNet [38]. The experimental setup is divided into two types: 1) random N training samples containing 46 synchronized samples (50% of the training samples belong to the same category as the selected test sample, whereas the others are from different categories); 2) random N training samples not containing these synchronized samples. The results are displayed in Table 1, where the classifier trained with synchronized samples can better generalize to its corresponding test sample for sample #889 with low regularity. For example, when the synchronized samples account for 46% of the training samples, the classification accuracy of the classifier trained with synchronized samples for sample #889 is approximately 14% higher than that of the classifier trained without synchronized samples. The advantage is still approximately 3% when the proportion of synchronized samples is reduced to 10%. We conjecture that these synchronized training samples tend to play the role of “support vectors”. That is, stochastic gradient descent tends to converge implicitly to the solution that maximizes the differentiation of the dataset [39], and the synchronized training samples are more relevant to the learning of the corresponding test samples’ classification boundaries than the other samples.

Table 1 Investigations of the support of 46 synchronized training samples on specific test samples. N is the number of test samples

3.3.3 Robustness

The generalization of neural networks varies across different optimizers and architectures, so the training statistics depend on the given network architecture trained with the given optimizer. It is thus important to explore the effects of these control variables on the proposed measures, the robustness of which needs to be further investigated.

Different architectures

Our statistics need to be collected during training, so the deeper or more complex the network is, the greater the time cost of collecting them. To explore the possibility of reliably representing the sample distribution with simpler architectures at a lower time cost, we examine whether the statistics are robust to different network architectures. We repeatedly train each of the four network architectures, ResNet-20 [36], ResNet-32 [36], ResNet-110 [36], and DenseNet [39], 10 times and take the average statistics for analysis. As shown in Fig. 10, the sample distribution does not change significantly when the number of layers is reduced from 110 to 32 or 20, or when the architecture is changed to the more densely-connected DenseNet. This conclusion is confirmed quantitatively by the correlation of the distributions and statistics. As displayed in Table 2, the Pearson correlation coefficients are all above 0.88, indicating very strong correlations. This indicates that our proposed measure is robust to the network architecture and can be transferred between different architectures. Since ResNet-32 takes only 80 min while DenseNet takes 1367 min with the same computational power (a single GeForce RTX 2080Ti GPU), this opens up the prospect of computing sample distribution representation statistics with simpler architectures. Thus, the similarity of two-dimensional sample distribution representations across architectures allows them to be computed quickly with proxy networks to reduce time and computational cost, e.g., estimating sample distribution statistics for DenseNet via ResNet-32 can save up to 21.5 h of training time (1367 min − 80 min = 1287 min) and reduce the parameter size by 7.03 MB.

Figure 10

Two-dimensional sample regularity distributions for different network architectures on CIFAR-10. Top: CBGL-MgEvents. The horizontal axis represents the value of CBGL. Bottom: CBTL-ForEvents. The horizontal axis represents the value of CBTL

Table 2 Pearson correlation coefficient of regularity measures between two networks with different architectures. Cumulative binary training loss (CBTL) and forgetting events (ForEvents) are for training samples, and cumulative binary generalizing loss (CBGL) and mal-generalizing events (MgEvents) are for test samples

Different optimizers

Most deep learning algorithms involve various forms of optimization, generally to minimize the loss function for parameter solving. Therefore, we explore the effects of different optimizers on the representation of the sample distribution. In addition to the common mini-batch SGD method, AdaGrad [40] and AdaMax [41] are also chosen in this comparative experiment.

As illustrated in Fig. 11, the sample distribution representations under different optimizers differ significantly. The Pearson correlation coefficients of SGD-AdaGrad, SGD-AdaMax, and AdaGrad-AdaMax are 0.8914, 0.6877, and 0.6962 for the training samples and 0.8877, 0.6628, and 0.7062 for the test samples, respectively. This confirms that the choice of optimizer has a great impact on our proposed two-dimensional representation. Intrinsically, the optimization algorithm determines the update direction at each training step, which changes the learned parameters and hence the learning process, and thus affects the representations of the sample distribution.

Figure 11

The two-dimensional sample regularity changes when different optimizers are applied to ResNet-110 [36]. Top: CBGL-MgEvents. The horizontal axis represents the value of CBGL. Bottom: CBTL-ForEvents. The horizontal axis represents the value of CBTL. From left to right, each column shows the representation results for the SGD [13], AdaGrad [40], and AdaMax [41] optimizers, respectively

4 Application

4.1 Training acceleration

The distribution of the CIFAR-10 training set is extremely uneven in the proposed representation space. The samples in the lower right corner are relatively simple and have little impact on the performance of the learned model. However, they are densely distributed and dominate the training set. As a result, many training samples share similar patterns. We attempt to achieve an efficient training process with smaller sample sets by eliminating the high-density redundant samples.

We first compute the density value of each sample in our two-dimensional representation space according to Fig. 2. We then sort the training samples by density in descending order and remove the highest-density ones, varying the proportion removed to investigate its effect on the generalization performance.
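The removal procedure can be sketched as follows. The density definition itself is given in Fig. 2 and is not reproduced here, so this sketch assumes that the density of a sample is the number of other samples lying within Euclidean distance r of it in the two-dimensional space. The function names `radius_density` and `remove_densest` are hypothetical.

```python
import numpy as np

def radius_density(points, r):
    """For each 2-D point, count the other points within distance r.

    Assumption: density is a neighbor count within radius r. The pairwise
    distance matrix needs O(n^2) memory, so a spatial index would be
    preferable for a full training set.
    """
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist <= r).sum(axis=1) - 1  # subtract 1 to exclude the point itself

def remove_densest(points, r, fraction):
    """Return the sorted indices kept after dropping the densest `fraction`."""
    density = radius_density(points, r)
    n_remove = int(round(fraction * len(density)))
    order = np.argsort(-density, kind="stable")  # descending density
    return np.sort(order[n_remove:])

# Toy example: three clustered points and one isolated point.
toy_points = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [10.0, 10.0]]
toy_density = radius_density(toy_points, r=1.0)
kept = remove_densest(toy_points, r=1.0, fraction=0.5)
```

In the toy example, the three clustered points each have density 2 and the isolated point has density 0, so removing half of the samples discards two of the clustered points and keeps the isolated one.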

Different radius values

Since the computed density value depends on the radius r, we first explore the effect of r on training performance, as shown in Fig. 12 (a). We observe that r has a significant effect on the training performance and that the overall performance is best for \(r=1\). Therefore, we use the density values calculated with \(r = 1\) in the subsequent experiments.

Figure 12

Performance variation in terms of the proportion of the training set removed. We calculate the density of each sample within radius r in the two-dimensional representation space to eliminate samples from the training set. The horizontal axis represents the percentage of training set samples removed. Different regularity measures are used to remove the redundant training samples: (a) the proposed measure with different density radii, and (b) the proposed measure compared with different strategies

Training acceleration

We follow the strategies of removing samples according to the metric of CBTL proposed in Ref. [29] and ForEvents proposed in Ref. [24] as a comparison with our method, as shown in Fig. 12 (b).

Figure 12 (b) illustrates that the performance decreases rapidly when samples are removed randomly. However, when samples are removed according to one of the regularity-based strategies, there is no significant decline in the generalization performance, and the test accuracy remains above 0.91 even after 60% of the training samples are removed. In addition, the strategy proposed in this paper slightly outperforms the CBTL and ForEvents strategies. This not only validates, to some extent, the idea of dataset reduction but also helps accelerate training with a smaller set at a lower time and computational cost.

4.2 Testing acceleration

Unlike training, which is a pattern extraction process where sample complexity can be reduced by removing samples with close pattern representations, the goal of model testing is often to maximize the discrimination of the testees’ generalization level with minimal sample complexity. This requires reducing the number of samples involved in evaluation while ensuring adequate discrimination of the testees’ generalization performance. In practice, more attention therefore needs to be paid to hard samples, which is often achieved via difficulty-based sampling. The proposed CBGL-MgEvents representation can be used as a measure of the difficulty level, obtained by dividing the defined two-dimensional space at uniform angular intervals on the basis of the sample distribution.

As shown by the red dashed lines in Fig. 2 (b), we divide the samples uniformly by difficulty using rays emitted at uniform angular intervals from the center \((100,0)\), which is the midpoint of the horizontal coordinate range. The smaller the angle is, the finer the division of difficulty. (Note that the inconsistent coordinate ratio of the figure is for better illustration.)

We first use the original 10,000 test samples to evaluate 17 algorithms including deep neural networks and traditional machine learning methods. Then, based on the balanced difficulty sampling method described above, the representation is divided into 4 bins and 10 bins at 45 degrees and 18 degrees, respectively. We randomly pick samples from each bin to obtain a smaller dataset containing 400 test samples. Afterward, the 17 algorithms are evaluated on the reduced set once again, as illustrated in Table 3.
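The difficulty-balanced sampling above can be sketched as follows. Several details are assumptions made for illustration only: the angle is measured from the left half-axis around the center \((100,0)\), an equal number of samples is drawn from every bin (e.g., 100 per bin for 4 bins, or 40 per bin for 10 bins, to reach 400 samples), and the function names `angular_bins` and `balanced_sample` are hypothetical.

```python
import numpy as np

def angular_bins(x, y, center_x=100.0, bin_deg=45.0):
    """Assign each point (x, y) with y >= 0 to a uniform angular bin.

    Assumption: angles are measured around (center_x, 0) starting from the
    left half-axis, so 0 degrees is the left half-axis and 180 degrees is
    the right half-axis.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.degrees(np.arctan2(y, center_x - x))
    n_bins = int(round(180.0 / bin_deg))
    idx = np.floor(theta / bin_deg).astype(int)
    return np.clip(idx, 0, n_bins - 1)  # fold theta == 180 into the last bin

def balanced_sample(bin_idx, per_bin, seed=0):
    """Randomly pick up to `per_bin` sample indices from every non-empty bin."""
    bin_idx = np.asarray(bin_idx)
    rng = np.random.default_rng(seed)
    chosen = []
    for b in np.unique(bin_idx):
        members = np.flatnonzero(bin_idx == b)
        take = min(per_bin, len(members))
        chosen.append(rng.choice(members, size=take, replace=False))
    return np.sort(np.concatenate(chosen))

# Toy example with four points, one per 45-degree bin.
toy_bins = angular_bins([0, 90, 110, 200], [0, 30, 30, 0], bin_deg=45.0)
toy_pick = balanced_sample([0, 0, 0, 1, 1, 2], per_bin=2)
```

With `bin_deg=18.0` the same function yields the 10-bin division. Bins with fewer than `per_bin` members contribute all of their samples.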

Table 3 Testing accuracy of different algorithms on selected test samples

It is clear that the performance of all the listed algorithms, regardless of whether they are deep learning methods or traditional machine learning methods, decreases significantly when they are tested on difficulty-balanced samples. This is mainly because the small proportion of hard samples is overwhelmed by the large number of easy samples in the original test set, whereas difficulty-balanced sampling increases the proportion of hard samples. Moreover, the algorithms perform worse when tested on samples selected with 10 bins than when tested on samples selected with 4 bins.

To measure the discrimination of the sampled dataset for algorithm performance, we investigate the correlations between the original testing results and the results on 10 bin/4 bin difficulty-balanced sampling. The Spearman correlation coefficients are 0.9871 and 0.9877, respectively, indicating extremely strong correlations and sufficient discrimination for the algorithms.
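To make the rank-based comparison concrete, the following minimal sketch computes the Spearman correlation as the Pearson correlation of average ranks. The accuracy values in `full_acc` and `reduced_acc` are hypothetical and are not taken from Table 3.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_correlation(a, b):
    """Spearman rank correlation between two equal-length score lists."""
    ra = average_ranks(a)
    rb = average_ranks(b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Hypothetical accuracies of five algorithms on the full and reduced sets.
full_acc = [0.95, 0.93, 0.90, 0.60, 0.40]
reduced_acc = [0.70, 0.72, 0.60, 0.30, 0.20]
rho = spearman_correlation(full_acc, reduced_acc)
```

In this toy case only the top two algorithms swap places, giving a coefficient of 0.9. A coefficient close to 1 means that the reduced set ranks the algorithms almost exactly as the full set does, even though the absolute accuracies differ.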

Furthermore, we explore the optimal number of samples per bin for difficulty-balanced sampling, i.e., the number that most efficiently maintains performance discrimination. The representation space is divided into 12 bins, similar to Fig. 2 (b). The angle \(0^{\circ}\), i.e., the left half axis, is the first bin, followed by \((0^{\circ},18^{\circ})\) for the second bin, \([18^{\circ}, 36^{\circ})\) for the third bin, … , \([162^{\circ}, 180^{\circ})\) for the 11th bin, and \(180^{\circ}\) for the last bin. From each of the latter 11 bins, we select n samples. Since there are only 3 samples in the first bin, they are all selected in the experiments.

Moreover, we explore the effect of n, where \(n\leq 49\), on the testing results, since the lowest number of samples in the latter 11 bins is 49. To measure the discrimination of the algorithm performance on the test set, i.e., the ranking stability of the algorithm performance evaluation, we calculate the Spearman correlation coefficient between the original testing results and the results on \(11n+3\) samples, as well as the mean average precision of the algorithm ranking. The results are plotted as a line graph in Fig. 13. We can see that 30 samples per bin, i.e., a total of \(11\times 30+3=333\) samples, is appropriate for a reduced CIFAR-10 test set. Therefore, we construct a small test set, CIFAR-10-333, to accelerate the testing of algorithm performance.

Figure 13

Variation in the test performance (mean average precision, MAP) and the Spearman correlation with the original results in terms of the bin size

Experiments show that uniform regularity sampling for the test set can significantly reduce the sample complexity while maintaining discrimination for the algorithm performance.

5 Discussion

The awareness of sample-wise regularity is crucial to pattern learning for visual recognition.

Numerous theoretical and applied studies note that different samples have different informational densities and hence contribute differently to model training. Ren et al. [52] reported that when training visual object detection models, training samples that were neither positive nor negative samples were not helpful for network training. In Ref. [52], the positive samples are defined as the anchors with the highest IoU overlap with a ground-truth box, or the anchors that have an IoU overlap higher than 0.7 with any ground-truth box, whereas the negative samples are those whose IoU ratio is lower than 0.3 for all ground-truth boxes. Zhang et al. [53] showed that easy samples were of little help in training the detector, so only the top 70% could be considered in the loss function calculation. Lin et al. [54] further showed that easy negative samples made up the bulk of the loss and could overwhelm training, leading to degenerate models. Kishida and Nakayama [3] revealed through experiments that easy training samples that could be consistently classified correctly in the early training phase were visually similar to each other, while hard samples were visually diverse. Furthermore, the hard training samples tend to contribute more to network generalization than easy samples. On the basis of these experimental results, biases in the dataset and in SGD are considered the reasons why convolutional neural networks have consistent easy and hard samples.
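The anchor labeling rule summarized above for Ref. [52] can be illustrated with a short sketch. It assumes that a matrix of IoU values between anchors and ground-truth boxes has already been computed. The function name `label_anchors` and the label encoding (1 for positive, 0 for negative, -1 for anchors that are neither) are hypothetical choices for illustration.

```python
import numpy as np

def label_anchors(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors from an IoU matrix of shape (num_anchors, num_gt_boxes).

    Returns 1 for positive anchors, 0 for negative anchors, and -1 for
    anchors that are neither and therefore do not contribute to training.
    """
    iou = np.asarray(iou, dtype=float)
    labels = np.full(iou.shape[0], -1, dtype=int)
    best_per_anchor = iou.max(axis=1)
    # Negative: IoU below the lower threshold for all ground-truth boxes.
    labels[best_per_anchor < neg_thresh] = 0
    # Positive: IoU above the upper threshold with any ground-truth box.
    labels[best_per_anchor > pos_thresh] = 1
    # Positive: the anchor with the highest IoU for each ground-truth box.
    labels[iou.argmax(axis=0)] = 1
    return labels

# Toy IoU matrix: 4 anchors, 2 ground-truth boxes.
toy_iou = [[0.8, 0.1],
           [0.2, 0.5],
           [0.1, 0.2],
           [0.4, 0.4]]
toy_labels = label_anchors(toy_iou)
```

In the toy matrix, the first anchor exceeds 0.7, the second is the best match for the second box, the third is below 0.3 everywhere, and the fourth falls in between and is labeled -1.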

In problem instance-dependent cases, the most direct measure of sample-wise regularity is derived from empirical loss, but related studies present two diametrically opposite ideas.

According to training set bias, the most straightforward way to measure sample regularity is to calculate the distance between its predicted value and the ground truth in the loss function. When thresholds are set, the distance evolves into sample difficulty classification, which is applied to improve the generalization performance of learning models, such as the training-loss-based sample weight assignment methods [55, 56]. These methods have two distinct ideas as follows.

1) Label noise can be eliminated by selecting training samples with smaller loss, because they are likely to be clean samples.

2) Class imbalance can be approached by selecting training samples with higher loss, because they are more likely to belong to a minority sample category.

From the perspective of classical computational learning theory, this simple representation of sample regularity is problematic in two ways.

1) The goal of the learning model’s task is to achieve good generalization performance on the validation set or even the test set by learning on a specified training set. Simply calculating the task difficulty of the training samples does not reflect the difficulty of generalization on independent and identically distributed (i.i.d.) validation sets, or even on differently distributed test sets. Intuitively, the relationship between the difficulty or importance of each training sample and the loss of the validation set can be inferred by using the “leave-one-out” method, that is, retraining after removing a single training sample to evaluate its impact on the loss of the validation or test set, but the cost of this procedure grows prohibitively with the number of samples, since a full retraining is required for each one. Koh and Liang [20] therefore used the influence function, a classical tool in statistics, to estimate the impact of training samples on network generalization with only one inference step.

2) Training/validation/test samples that make the empirical loss greater are not necessarily more valuable, because these samples may be outliers, and forcibly fitting or memorizing them may degrade the generalization performance. However, empirical analysis of deep neural network generalization shows that the effective capacity of neural networks is sufficient for memorizing the entire dataset [11], so that training can converge effectively even when the sample labels are disturbed. This implies that the forced memorization of outliers has a limited influence on the generalization performance of deep neural networks. As a result, Feldman [14] pointed out that, owing to the long-tailed nature of the real-world image data distribution, the memorization of rare and atypical samples was the main source of the unreasonable and unexpected generalization performance [15] of deep neural networks.

Given that challenges remain in terms of theoretical understanding, exploring interesting and poorly understood behaviour, such as sample regularity, through empirical studies where experiments are designed systematically is a promising method and current trend. These findings may have an impact on visual pattern understanding.

Recent studies have explored the generalization mechanism of DNNs via empirical analysis, with an emphasis on the representation and computation of sample difficulty on the basis of memorization-generalization mechanisms. According to Ref. [24], sample forgetting occurs not only when a new task is learned but also at various phases of training on the same dataset. The authors thus refer to the event where an individual sample transitions from being classified correctly to incorrectly over the course of learning as a “forgetting event”. Their experiments confirm that the samples with more ForEvents in the training process tend to be hard samples, whereas the samples with the most frequent ForEvents are likely noisy samples.
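Following the definition quoted above, the number of forgetting events per sample can be counted from a record of per-epoch classification correctness. The sketch below assumes such a record is available as a boolean array of shape (epochs, samples). The name `count_forgetting_events` and the toy history are hypothetical.

```python
import numpy as np

def count_forgetting_events(correct):
    """Count correct-to-incorrect transitions for each sample.

    `correct` has shape (epochs, samples), and correct[t, i] is True when
    sample i is classified correctly at epoch t.
    """
    correct = np.asarray(correct, dtype=bool)
    # A forgetting event: correct at epoch t and incorrect at epoch t + 1.
    forgot = correct[:-1] & ~correct[1:]
    return forgot.sum(axis=0)

# Toy history: 4 epochs, 3 samples.
toy_history = [[1, 0, 1],
               [0, 0, 1],
               [1, 1, 1],
               [0, 1, 1]]
toy_events = count_forgetting_events(toy_history)
```

In the toy history, the first sample is forgotten twice, while the second sample (learned once and retained) and the third sample (always correct) have no forgetting events.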

Furthermore, Jiang et al. [29] used the memorization-generalization continuum to reveal that deep neural networks could not only remember rare and atypical samples but also effectively generalize to data that shared the same pattern or intrinsic regularity. The consistency score is defined as the expected accuracy for a held-out instance on a training set selected from a fixed size of data distribution. Consequently, the relationship between sample memorization and generalization is constructed. Empirical evidence suggests that cumulative binary training loss is a lightweight and effective approximation of the consistency score. In subsequent work, Feldman and Zhang [57] approximated the effect of each training sample on the accuracy of each test sample, as well as the probability of memorization occurrence of the training samples, thus confirming the apparent benefit of memorization to generalization on several standard test datasets.

In cases where measures of the rarity and atypicality of the training/test samples are important, we explore how to represent the distribution of intrinsic patterns in samples through memorization-generalization theory, empirically demonstrating the regularity of the underlying patterns in the data. Specifically, we combine long-term stable information and short-term dynamic signals during training to propose unified regularity measures with two-dimensional representations, which are experimentally confirmed to be beneficial for training and testing acceleration.

Notably, the proposed measure of sample regularity (in problem instance-dependent cases) can serve as a probe to discover the SGD dynamics in the long-tailed setting, and thus contribute a simple yet effective resampling method [58].

6 Conclusion

In this paper, we propose to measure the sample-wise regularity that a given sample exhibits when it is learned or generalized. We show that for i.i.d. training and test sets, the sample distributions are similar under statistics with similar definitions, i.e., CBTL/CBGL and ForEvents/MgEvents. Inspired by this result, we propose a pair of two-dimensional representations for measuring sample regularity in network learning and generalization. In the property investigation, we find that the proposed measures seem to be quite stable with respect to the different characteristics of the training and testing phases. Further applications in training and testing acceleration show that samples with higher regularity seem to contribute little to both processes, which in turn validates the effectiveness of the proposed measures. However, a fundamental theory is still lacking, since the theoretical description of sample-wise regularity remains challenging. As a full mathematical characterization of the memorization-generalization mechanism is developed, a thorough understanding of sample regularity may provide supporting evidence to justify our work.