Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning (Paper Notes)

Abstract

“multi-view”

when data has a structure we refer to as “multi-view”, then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model by training a single model to match the output of the ensemble instead of the true label
In short: when the data has this "multi-view" structure, an ensemble of independently trained neural networks provably improves test accuracy, and that gain can be distilled into a single model by training it to match the ensemble's output instead of the true labels.

How individual neural networks learn

The network will quickly pick up one of the features v ∈ {v1, v2} for the first label, and one of the features v′ ∈ {v3, v4} for the second label. So, 90% of the training examples, consisting of all the multi-view data and half of the single-view data (those with feature v or v′), are classified correctly.
All multi-view data and half of the single-view data are classified correctly, so training accuracy is 100% but test accuracy is only 90%.
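To make the setup concrete, here is a minimal toy sketch of "multi-view" vs. "single-view" data under my reading of the construction (the dimension, noise level, and 80/20 split are hypothetical choices, not the paper's exact parameters): class 0 owns features v1, v2 and class 1 owns v3, v4; a multi-view example contains both of its class's features, a single-view example only one of them.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20                       # ambient dimension (hypothetical)
V = np.eye(d)[:4]            # v1, v2 for class 0; v3, v4 for class 1

def make_example(label, multi_view):
    """One toy example: its class's feature direction(s) plus small noise."""
    feats = V[:2] if label == 0 else V[2:]
    if multi_view:
        x = feats.sum(axis=0)            # both features present
    else:
        x = feats[rng.integers(2)]       # only one of the two features
    return x + 0.1 * rng.standard_normal(d), label

# e.g. 80% multi-view / 20% single-view: a model that learns only one feature
# per class then classifies 80% + 10% = 90% of the examples correctly.
data = [make_example(int(rng.integers(2)), rng.random() < 0.8) for _ in range(1000)]
```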

How ensemble improves test accuracy

Depending on the randomness of initialization, each individual network will pick up one of the features v1 or v2, each w.p. 50%. Hence, as long as we ensemble Õ(1) many independently trained models, w.h.p. their ensemble will pick up both features {v1, v2} and both features {v3, v4}
Each individual network picks up v1 or v2 with 50% probability each; ensembling enough independently trained models therefore covers all of the features with high probability.
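A quick back-of-the-envelope version of this argument (my own arithmetic; the constants hidden in the Õ notation are ignored): each model misses a given feature, say v2, with probability 1/2, so K independently trained models all miss it with probability (1/2)^K. By a union bound over the four features, the ensemble misses at least one feature with probability at most 4 · (1/2)^K, which is already below 0.5% for K = 10 models.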

How knowledge distillation works

Hence, by training the individual model to match the output of the ensemble, the individual model is forced to learn both features v3, v4, even though it has already perfectly classified the training data. This is the “dark knowledge” hidden in the output of the ensemble model.
Matching the ensemble's output forces the individual model to learn the remaining features (here v3, v4), even though it already fits the training data perfectly.

models trained from knowledge distillation should have learned most of the features, and further computing their ensemble does not give much performance boost
Distilled models have already learned most of the features, so ensembling them again gives little further boost.
This is why the individual models that get ensembled must be trained on the hard (true) labels.

Intro

Ensemble

By simply averaging the output of merely a few (like 3 or 10) independently trained neural networks of the same architecture, using the same training method over the same training data, it can significantly boost the prediction accuracy over the test set compared to individual models. The only difference is the randomness used to initialize these neural networks and/or the randomness during training.
The only differences between the independently trained networks are the random initialization and/or the randomness during training; the architecture, training method, and training data are all identical.
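A minimal sketch of this output-averaging, assuming a list of already-trained PyTorch classifiers `models` and an input batch `x` (names and framework are my choice, not the paper's code):

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, x):
    """Average the softmax outputs of independently trained models."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)   # averaged soft prediction, shape (batch, k)
```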

Knowledge distillation

that is, simply train a single model to match the output of the ensemble (such as "90% cat + 10% car", also known as soft labels) as opposed to the true data labels, over the same training data.
Knowledge distillation transfers the ensemble's superior test performance to a single model, which then outperforms a single model trained directly on the original hard labels.
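One standard way to implement this matching (the usual Hinton-style objective; the paper may use a different variant) is to minimize the KL divergence between the student's temperature-softened output and the teacher's, optionally mixed with the ordinary hard-label loss. Here `teacher_logits` could be, for example, the averaged logits of the ensemble:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """KL to the teacher's softened distribution + (1 - alpha) * hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```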

Empirical Thoughts

Training average does not work

If one directly trains to learn an average of individual neural networks initialized by different seeds, the performance is much worse than ensemble.
That is, jointly training a single model to output the average of differently-seeded networks performs much worse than a true ensemble.
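A hedged sketch of what "directly training the average" means under my reading of the setup: the K differently-initialized copies are trained jointly through their averaged output (one loss, one backward pass), instead of being trained independently and only averaged at test time as a true ensemble.

```python
import torch
from torch import nn

class AveragedNets(nn.Module):
    """K differently-initialized copies whose averaged output is trained as one model."""
    def __init__(self, make_model, k=3):
        super().__init__()
        self.nets = nn.ModuleList([make_model() for _ in range(k)])

    def forward(self, x):
        # Gradients flow into all copies through the shared average.
        return torch.stack([net(x) for net in self.nets]).mean(dim=0)
```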

Knowledge distillation works

The superior performance of ensemble in deep learning can be distilled into a single model.
In other words, knowledge distillation works.

Self-distillation works

Even distilling a single model into another single model of the same size gives a performance boost.
That is, distilling one model into another of the same size already helps.
The main idea is that self-distillation is performing “implicit ensemble + knowledge distillation”
Proposed view: self-distillation = implicit ensemble + knowledge distillation.
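A compact, self-contained sketch of the two-stage self-distillation procedure on toy data (the architecture, hyperparameters, and pure-KL student loss are my hypothetical choices, not the paper's setup):

```python
import torch
import torch.nn.functional as F
from torch import nn

def fit(model, X, y, teacher=None, epochs=200, T=4.0):
    """Stage 1 (teacher=None): train on hard labels.
    Stage 2: train a fresh same-size model to match the teacher's soft outputs."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        logits = model(X)
        if teacher is None:
            loss = F.cross_entropy(logits, y)
        else:
            with torch.no_grad():
                t_probs = F.softmax(teacher(X) / T, dim=-1)
            loss = F.kl_div(F.log_softmax(logits / T, dim=-1), t_probs,
                            reduction="batchmean") * T * T
        opt.zero_grad(); loss.backward(); opt.step()
    return model

X = torch.randn(512, 20)                 # toy data standing in for a real dataset
y = (X[:, 0] > 0).long()
make_model = lambda: nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

teacher = fit(make_model(), X, y)                    # ordinary training
student = fit(make_model(), X, y, teacher=teacher)   # self-distillation
```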

Empirical Results

Knowledge distillation does not work for random feature mappings

Knowledge distillation does not work for random feature mappings, and ensemble in deep learning is very different from ensemble of random feature mappings.
NTK (neural tangent kernel): a mathematical framework for analyzing neural networks.
Random feature mappings: features produced by a fixed random transformation (e.g., a randomly initialized network), on top of which only a linear classifier is trained.

Conclusion 1

It may be more accurate to study ensemble / knowledge distillation in deep learning as a feature learning process, instead of a feature selection process (where the features are prescribed and only their linear combinations are trained).
Ensemble / knowledge distillation in deep learning is better viewed as a feature learning process rather than a feature selection process
(feature selection: the features are prescribed in advance and only their linear combinations are trained).

Soft-label

For a k-class classification problem, the output of a model g(x) is usually k-dimensional, and represents a soft-max probability distribution over the k target classes. This is known as the soft label.
In a k-class classification problem, the model output is typically k-dimensional and represents a softmax probability distribution over the k target classes.
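For example, a made-up 3-class case in the spirit of the "90% cat + 10% car" soft label from the intro:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.2, 0.0, -1.0])    # hypothetical scores for (cat, car, dog)
soft_label = F.softmax(logits, dim=-1)      # roughly [0.87, 0.10, 0.04]
```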

Ensemble needs multi-view

Special structure in the data (such as the "multi-view" structure we shall introduce) is needed for an ensemble of neural networks to work.
In other words, ensembling only helps when the data has such structure.
Ensemble in DL might not improve test accuracy when inputs are Gaussian-like: Empirically, ensemble does not improve test accuracy in deep learning, in certain scenarios when the distribution of the input data is Gaussian or even mixture of Gaussians.
When the input distribution is Gaussian (or a mixture of Gaussians), ensembling may not improve test accuracy.

Conclusion 2

The input distribution is more structured than standard Gaussian and there is no label noise.
The input distribution needs to be more structured than a standard Gaussian, and there should be no label noise.

The individual neural networks all are well-trained, in the sense that the training accuracy in the end is 100%, and there is nearly no variance in the test accuracy for individual models. (So training never fails.)
The individual networks must all be well-trained: 100% training accuracy, and nearly no variance in individual test accuracy.

Label noise is not the explanation

The variance due to label noise or the non-convex landscape of training, in the independently trained models, may not be connected to the superior performance of ensemble in deep learning.
The variance due to label noise or the non-convex training landscape in independently trained models may not be what drives the superior performance of ensembles in deep learning.
