YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information目标检测论文精读(逐段解析)
论文地址:https://arxiv.org/abs/2402.13616
2024
由中国台湾的中央研究院和台北科技大学等机构联合开发(作者团队也是YOLOv4、YOLOv7的作者)
Abstract
Today’s deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore a fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information will be lost. This paper will delve into the important issues of data loss when data is transmitted through deep networks, namely information bottleneck and reversible functions. We proposed the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture – Generalized Efficient Layer Aggregation Network (GELAN), based on gradient path planning is designed. GELAN’s architecture confirms that PGI has gained superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO dataset based object detection. The results show that GELAN only uses conventional convolution operators to achieve better parameter utilization than the state-of-the-art methods developed based on depth-wise convolution. PGI can be used for a variety of models from lightweight to large. It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained using large datasets; the comparison results are shown in Figure 1. The source codes are at: https://github.com/WongKinYiu/yolov9.
【翻译】当今的深度学习方法专注于如何设计最合适的目标函数,使模型的预测结果能够最接近真实标签。同时,还必须设计一个合适的架构,能够促进获取足够的信息进行预测。现有方法忽略了一个事实:当输入数据经过逐层特征提取和空间变换时,会丢失大量信息。本文将深入探讨数据在深度网络中传输时的重要问题,即信息瓶颈和可逆函数。我们提出了可编程梯度信息(PGI)的概念,以应对深度网络为实现多个目标所需的各种变化。PGI可以为目标任务提供完整的输入信息来计算目标函数,从而可以获得可靠的梯度信息来更新网络权重。此外,还设计了一种基于梯度路径规划的新型轻量级网络架构——广义高效层聚合网络(GELAN)。GELAN的架构证实了PGI在轻量级模型上取得了优异的结果。我们在基于MS COCO数据集的目标检测上验证了所提出的GELAN和PGI。结果表明,GELAN仅使用传统卷积算子就实现了比基于深度卷积开发的最先进方法更好的参数利用率。PGI可用于从轻量级到大型的各种模型。它可以用来获取完整信息,使得从零开始训练的模型能够取得比使用大型数据集预训练的最先进模型更好的结果,比较结果如图1所示。源代码位于:https://github.com/WongKinYiu/yolov9 。
【解析】这段摘要揭示了深度学习的一个根本性问题:深度神经网络在层层传递信息时会发生信息丢失,就像漏斗一样,越往深层走,原始信息丢失得越严重。想象一下,当你把一张图片输入到神经网络中,每经过一层处理,图片的一些关键细节就会消失,到了深层网络时,可能已经丢失了完成目标任务所必需的重要特征。这种现象被称为"信息瓶颈"问题。传统深度学习方法主要专注于优化损失函数和网络架构设计,却忽视了这个信息逐层丢失的根本问题,导致梯度信息不可靠,影响模型收敛和性能。YOLOv9提出了两个突破性创新来解决这个问题:
第一个是PGI(可编程梯度信息),这是一个完整的辅助监督框架,包含三个核心组件:主分支(负责推理)、辅助可逆分支(生成可靠梯度)和多级辅助信息(控制多层语义学习)。PGI的核心思想是通过辅助可逆分支为主网络提供完整的原始信息来计算目标函数,从而获得可靠的梯度信息来更新网络参数。与传统深度监督不同,PGI通过可逆架构避免了多路径特征融合可能导致的语义信息损失,并且在推理阶段可以完全移除辅助分支,不增加任何计算成本。更重要的是,PGI不仅适用于深层网络,还能让轻量级模型也受益于辅助监督机制。
第二个是GELAN(广义高效层聚合网络),这是结合了CSPNet和ELAN两种架构优势的新型网络设计。GELAN将原本只能使用卷积层堆叠的ELAN扩展为可以使用任意计算块的通用架构,同时考虑了参数数量、计算复杂度、准确性和推理速度的平衡。用户可以根据不同的推理设备自由选择合适的计算块。实验表明,GELAN仅使用传统卷积就实现了比基于深度可分离卷积的先进方法更高的参数利用率,在轻量化、速度和精度方面都表现出色。
这两项技术的协同作用使得YOLOv9在MS COCO数据集的目标检测任务上全面超越了之前的实时检测器。特别值得注意的是,即使采用从零开始训练的策略,YOLOv9仍能超越那些使用大型数据集预训练的模型,这充分证明了PGI在保持完整信息方面的有效性,是深度学习领域的重要技术突破。
1. Introduction
Deep learning-based models have demonstrated far better performance than past artificial intelligence systems in various fields, such as computer vision, language processing, and speech recognition. In recent years, researchers in the field of deep learning have mainly focused on how to develop more powerful system architectures and learning methods, such as CNNs [21-23, 42, 55, 71, 72], Transformers [8, 9, 40, 41, 60, 69, 70], Perceivers [26, 32, 52, 56, 81], and Mambas [17, 38, 80]. In addition, some researchers have tried to develop more general objective functions, such as loss functions [5, 45, 46, 50, 77, 78], label assignment [10, 12, 33, 67, 79] and auxiliary supervision [18, 20, 24, 28, 29, 51, 54, 68, 76]. The above studies all try to precisely find the mapping between input and target tasks. However, most past approaches have ignored that input data may have a non-negligible amount of information loss during the feedforward process. This loss of information can lead to biased gradient flows, which are subsequently used to update the model. The above problems can cause deep networks to establish incorrect associations between targets and inputs, causing the trained model to produce incorrect predictions.
【翻译】基于深度学习的模型在各个领域都表现出了远超过去人工智能系统的性能,如计算机视觉、语言处理和语音识别。近年来,深度学习领域的研究人员主要专注于如何开发更强大的系统架构和学习方法,如CNNs [21-23, 42, 55, 71, 72]、Transformers [8, 9, 40, 41, 60, 69, 70]、Perceivers [26, 26, 32, 52, 56, 81, 81]和Mambas [17, 38, 80]。此外,一些研究人员试图开发更通用的目标函数,如损失函数[5, 45, 46, 50, 77, 78]、标签分配[10, 12, 33, 67, 79]和辅助监督[18, 20, 24, 28, 29, 51, 54, 68, 76]。上述研究都试图精确地找到输入和目标任务之间的映射关系。然而,大多数过去的方法忽略了输入数据在前向传播过程中可能会有不可忽略的信息丢失。这种信息丢失可能导致有偏的梯度流,这些梯度随后被用来更新模型。上述问题可能导致深度网络在目标和输入之间建立错误的关联,使得训练的模型产生错误的预测。
【解析】从最初的卷积神经网络CNN,到后来革命性的Transformer,再到新兴的Perceiver和Mamba架构,每一种都代表着对神经网络设计思路的创新突破。同时,研究者们也在优化训练过程本身,通过改进损失函数来让模型更好地学习,通过优化标签分配策略来提高训练效率,通过辅助监督机制来帮助模型学习更丰富的特征表示。这些努力的核心目标都是一致的:建立从输入数据到期望输出之间最准确、最有效的映射关系。但这里有一个问题被忽视了。当数据在神经网络中逐层传递时,每一层的计算都会不可避免地丢失一些原始信息。这种信息丢失不是偶然的,而是神经网络前向传播过程的固有特性。更严重的是,这种信息丢失会直接影响到反向传播过程中梯度的计算。当梯度基于不完整的信息计算时,它们就变得不可靠,可能指向错误的优化方向。这就像用一张模糊不清的地图来导航一样,最终很可能走错路。结果就是,尽管我们设计了精巧的网络架构和训练策略,模型仍然可能学习到输入和输出之间的错误关联模式,导致在真实应用中出现预测错误。这个问题在复杂任务和轻量级模型中表现得尤为突出,因为信息容量的限制使得信息丢失的影响被放大。
Figure 1. Comparisons of the real-time object detectors on MS COCO dataset. The GELAN and PGI-based object detection method surpassed all previous train-from-scratch methods in terms of object detection performance. In terms of accuracy, the new method outperforms RT DETR [43] pre-trained with a large dataset, and it also outperforms the depth-wise convolution-based design YOLO MS [7] in terms of parameter utilization.
【翻译】图1. MS COCO数据集上实时目标检测器的比较。基于GELAN和PGI的目标检测方法在目标检测性能方面超越了所有之前的从零开始训练的方法。在准确性方面,新方法超过了使用大型数据集预训练的RT DETR [43],并且在参数利用率方面也超过了基于深度卷积设计的YOLO MS [7]。
【解析】这个对比实验结果揭示了YOLOv9的两个突破。首先是在从零训练策略上的优势,这里的"从零开始训练"指的是不使用任何预训练权重,完全依靠目标数据集从随机初始化开始训练。传统观念认为,从零训练往往不如使用大型数据集预训练的模型效果好,但YOLOv9打破了这个常规,证明了通过PGI机制保持信息完整性,即使从零开始训练也能达到甚至超越预训练模型的性能。这说明PGI有效解决了深度网络训练过程中的信息丢失问题,让模型能够从有限的训练数据中学习到更准确的特征表示。其次是在参数效率上的优势,YOLO MS采用的是深度可分离卷积,这种设计理论上能用更少的参数达到相似的表达能力,但实际效果往往不如标准卷积。YOLOv9的GELAN架构仅使用传统卷积操作,却在参数利用率上超越了这些专门为轻量化设计的方法,这表明GELAN的设计思路更加合理,能够在保持模型简洁性的同时最大化每个参数的作用。与RT DETR的比较则更具说服力,因为RT DETR是基于Transformer架构的先进检测器,使用了大规模预训练,代表了当前技术的最高水平,YOLOv9能够超越它说明了新方法的技术先进性。
Figure 2. Visualization results of random initial weight output feature maps for different network architectures: (a) input image, (b) PlainNet, (c) ResNet, (d) CSPNet, and (e) proposed GELAN. From the figure, we can see that in different architectures, the information provided to the objective function to calculate the loss is lost to varying degrees, and our architecture can retain the most complete information and provide the most reliable gradient information for calculating the objective function.
【翻译】图2. 不同网络架构的随机初始权重输出特征图的可视化结果:(a) 输入图像,(b) PlainNet,(c) ResNet,(d) CSPNet,(e) 提出的GELAN。从图中可以看出,在不同的架构中,提供给目标函数计算损失的信息丢失程度不同,我们的架构能够保留最完整的信息,并为计算目标函数提供最可靠的梯度信息。
【解析】这个可视化实验揭示了一个问题:神经网络架构设计对信息保持能力的巨大影响。实验设置是使用随机初始化的权重(即未经训练的网络)对同一张输入图像进行前向传播,然后观察不同深度层的特征图输出。实验设计的巧妙之处在于,由于权重是随机的,网络还没有学习到任何有用的特征表示,因此特征图的质量直接反映了网络架构本身对原始信息的保持能力。从结果可以看出,最基础的PlainNet(普通的全连接或卷积层堆叠)表现最差,特征图几乎完全失去了原始图像的结构信息,这说明简单的层叠结构在信息传递过程中存在严重的信息瓶颈问题。ResNet通过残差连接机制有了明显改善,因为跳跃连接允许原始信息直接传递到更深的层次,绕过了中间层可能造成的信息丢失。CSPNet进一步优化了这个过程,通过将特征分为两个分支处理,一个分支保持原始信息,另一个分支进行特征变换,最后再融合,这种设计更好地平衡了信息保持和特征学习。而GELAN作为本文提出的新架构,在特征图中显示出了最好的信息保持效果,这意味着即使在随机权重状态下,GELAN也能最大程度地保留输入图像的结构和细节信息。这种信息保持能力直接关系到梯度计算的可靠性,因为只有当网络能够保持足够的原始信息时,反向传播过程中计算得到的梯度才能准确反映损失函数对参数的敏感性,从而指导网络朝着正确的方向进行优化。这个实验为后续的PGI机制设计提供了理论基础和实证支持。
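为了更直观地理解这一"随机权重下的信息保持"实验,下面给出一段极简的 PyTorch 草图(非论文官方代码,网络结构、通道数与相关性度量方式均为本文为演示所做的假设):用随机初始化的权重前向传播,并粗略度量不同深度的特征与输入的相关程度。

```python
# 示意性草图(非论文官方代码,结构与度量方式均为演示假设):
# 用随机初始化的权重做前向传播,比较“纯堆叠卷积”(PlainNet 风格)与
# “带残差连接”(ResNet 风格)在不同深度上保留输入信息的程度。
import torch
import torch.nn as nn

def conv_block(c: int) -> nn.Module:
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())

def retained_correlation(x: torch.Tensor, feat: torch.Tensor) -> float:
    """用输入与特征图(按通道取均值后)的相关系数粗略度量原始信息的保留程度。"""
    a = x.mean(dim=1).flatten()
    b = feat.mean(dim=1).flatten()
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()

@torch.no_grad()
def probe(residual: bool, depths=(10, 50, 100), channels=16):
    torch.manual_seed(0)
    x = torch.randn(1, channels, 64, 64)   # 用随机张量代替真实图像即可观察趋势
    layers = [conv_block(channels) for _ in range(max(depths))]
    feat = x
    for i, layer in enumerate(layers, start=1):
        feat = feat + layer(feat) if residual else layer(feat)   # 残差连接显式保留上一层信息
        if i in depths:
            tag = "residual" if residual else "plain"
            print(f"{tag:8s} layer {i:3d}: corr={retained_correlation(x, feat):.3f}")

probe(residual=False)   # 纯堆叠:深层特征与输入的相关性迅速衰减
probe(residual=True)    # 残差:衰减明显更慢,对应图2中不同架构的差异
```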
In deep networks, the phenomenon of input data losing information during the feedforward process is commonly known as information bottleneck [59], and its schematic diagram is as shown in Figure 2. At present, the main methods that can alleviate this phenomenon are as follows: (1) The use of reversible architectures [3, 16, 19]: this method mainly uses repeated input data and maintains the information of the input data in an explicit way; (2) The use of masked modeling [1, 6, 9, 27, 71, 73]: it mainly uses reconstruction loss and adopts an implicit way to maximize the extracted features and retain the input information; and (3) Introduction of the deep supervision concept [28, 51, 54, 68]: it uses shallow features that have not lost too much important information to pre-establish a mapping from features to targets to ensure that important information can be transferred to deeper layers. However, the above methods have different drawbacks in the training process and inference process. For example, a reversible architecture requires additional layers to combine repeatedly fed input data, which will significantly increase the inference cost. In addition, since the path from the input data layer to the output layer cannot be too deep, this limitation will make it difficult to model high-order semantic information during the training process. As for masked modeling, its reconstruction loss sometimes conflicts with the target loss. In addition, most mask mechanisms also produce incorrect associations with data. For the deep supervision mechanism, it will produce error accumulation, and if the shallow supervision loses information during the training process, the subsequent layers will not be able to retrieve the required information. The above phenomenon will be more significant on difficult tasks and small models.
【翻译】在深度网络中,输入数据在前向传播过程中丢失信息的现象通常被称为信息瓶颈[59],其示意图如图2所示。目前,能够缓解这种现象的主要方法如下:(1) 使用可逆架构[3, 16, 19]:这种方法主要使用重复的输入数据,以显式方式维护输入数据的信息;(2) 使用掩码建模[1, 6, 9, 27, 71, 73]:它主要使用重建损失,采用隐式方式最大化提取的特征并保留输入信息;(3) 引入深度监督概念[28, 51, 54, 68]:它使用没有丢失太多重要信息的浅层特征,预先建立从特征到目标的映射,以确保重要信息能够传递到更深的层。然而,上述方法在训练过程和推理过程中都有不同的缺点。例如,可逆架构需要额外的层来组合重复输入的数据,这将显著增加推理成本。此外,由于从输入数据层到输出层不能有过深的路径,这种限制将使得在训练过程中难以建模高阶语义信息。至于掩码建模,其重建损失有时会与目标损失冲突。此外,大多数掩码机制也会产生与数据的错误关联。对于深度监督机制,它会产生错误累积,如果浅层监督在训练过程中丢失信息,后续层将无法检索所需的信息。上述现象在困难任务和小模型上会更加显著。
【解析】这段话系统地分析了当前解决信息瓶颈问题的三种主流方法及其局限性。信息瓶颈是深度学习中的一个基本问题,指的是信息在神经网络中逐层传递时会发生不可避免的损失,就像水流通过越来越窄的管道一样。第一种解决方案是可逆架构,这种方法的核心思想是通过数学上的可逆变换来保证信息的完整性。在可逆网络中,每一层的操作都设计成可以精确逆转的,这样理论上可以从输出完全恢复输入信息。但这种设计需要额外的计算资源来维护这种可逆性,特别是需要存储中间激活值用于反向计算,这大大增加了内存和计算开销。更重要的是,为了保持可逆性,网络的深度受到限制,因为过深的可逆路径会变得不稳定,这限制了模型学习复杂语义表示的能力。第二种方案掩码建模借鉴了自编码器的思想,通过随机遮挡输入的一部分,然后训练模型重建完整输入。这种方法强迫网络学习输入数据的内在结构和依赖关系,从而隐式地保留重要信息。但问题在于,重建损失(要求模型能够恢复原始输入)和目标任务损失(要求模型能够准确预测标签)往往存在冲突,模型需要在这两个目标之间找平衡,可能导致两个任务都无法达到最优。而且,掩码策略如果设计不当,可能会让模型学习到数据中的虚假关联模式。第三种方案深度监督是最直观的解决思路,在网络的中间层添加辅助的损失函数,利用还没有严重丢失信息的浅层特征直接进行监督学习。这样可以确保重要的梯度信息能够传递到网络的深层。但深度监督面临错误累积的问题,如果浅层的监督信号本身就不准确,这种错误会在网络中逐层传播和放大。特别是在轻量级模型中,由于参数容量有限,这些方法的缺陷会被进一步放大,使得性能提升效果不明显甚至出现负面影响。
To address the above-mentioned issues, we propose a new concept, which is programmable gradient information (PGI). The concept is to generate reliable gradients through auxiliary reversible branch, so that the deep features can still maintain key characteristics for executing target task. The design of auxiliary reversible branch can avoid the semantic loss that may be caused by a traditional deep supervision process that integrates multi-path features. In other words, we are programming gradient information propagation at different semantic levels, and thereby achieving the best training results. The reversible architecture of PGI is built on auxiliary branch, so there is no additional cost. Since PGI can freely select loss function suitable for the target task, it also overcomes the problems encountered by mask modeling. The proposed PGI mechanism can be applied to deep neural networks of various sizes and is more general than the deep supervision mechanism, which is only suitable for very deep neural networks.
【翻译】为了解决上述问题,我们提出了一个新概念,即可编程梯度信息(PGI)。这个概念是通过辅助可逆分支生成可靠的梯度,使深层特征仍能保持执行目标任务的关键特性。辅助可逆分支的设计可以避免传统深度监督过程中整合多路径特征可能导致的语义损失。换句话说,我们正在对不同语义层次的梯度信息传播进行编程,从而实现最佳的训练结果。PGI的可逆架构建立在辅助分支上,因此没有额外的成本。由于PGI可以自由选择适合目标任务的损失函数,它也克服了掩码建模遇到的问题。所提出的PGI机制可以应用于各种规模的深度神经网络,比仅适用于非常深的神经网络的深度监督机制更通用。
【解析】PGI是YOLOv9的核心创新,它巧妙地结合了前面提到的三种方法的优点,同时避免了它们的缺陷。PGI的设计哲学是"编程梯度信息",可以精确控制在网络的不同层次传递什么样的梯度信息。具体来说,PGI包含一个主分支和一个辅助可逆分支,主分支负责正常的前向推理,而辅助可逆分支负责维护完整的原始信息用于梯度计算。这种设计的关键优势在于,辅助分支只在训练阶段存在,推理时可以完全移除,因此不会增加推理成本。与传统深度监督不同,PGI的辅助分支采用可逆架构设计,这确保了从输入到辅助监督点的信息传递是无损的,避免了多路径特征融合可能带来的语义混乱。传统深度监督往往将来自不同层的特征简单地融合在一起,这种融合过程本身可能导致语义信息的丢失或混淆,而PGI通过可逆分支保持了信息的完整性和一致性。更重要的是,PGI允许针对不同的任务灵活选择最合适的损失函数,这解决了掩码建模中重建损失与任务损失冲突的问题。PGI的通用性也是其重要优势,它不仅适用于深层网络,也能让轻量级网络受益,这扩展了辅助监督技术的应用范围。
In this paper, we also designed generalized ELAN (GELAN) based on ELAN [65]; the design of GELAN simultaneously takes into account the number of parameters, computational complexity, accuracy and inference speed. This design allows users to arbitrarily choose appropriate computational blocks for different inference devices. We combined the proposed PGI and GELAN, and then designed a new generation of YOLO series object detection system, which we call YOLOv9. We used the MS COCO dataset to conduct experiments, and the experimental results verified that our proposed YOLOv9 achieved the top performance in all comparisons.
【翻译】在本文中,我们还基于ELAN [65]设计了广义ELAN (GELAN),GELAN的设计同时考虑了参数数量、计算复杂度、准确性和推理速度。这种设计允许用户为不同的推理设备任意选择合适的计算块。我们将提出的PGI和GELAN结合起来,然后设计了新一代YOLO系列目标检测系统,我们称之为YOLOv9。我们使用MS COCO数据集进行实验,实验结果验证了我们提出的YOLOv9在所有比较中都取得了顶级性能。
【解析】GELAN是对原始ELAN架构的重要改进和泛化。原始的ELAN(Efficient Layer Aggregation Network)是一种高效的特征聚合网络设计,它通过特定的层连接模式来平衡计算效率和特征表达能力。GELAN的四个优化目标——参数数量、计算复杂度、准确性和推理速度——构成了一个多目标优化问题,需要在这些相互制约的因素之间找到最佳平衡点。当PGI和GELAN这两个创新技术结合起来时,就形成了YOLOv9的技术基础。PGI解决了训练过程中的信息丢失问题,确保模型能够学习到更准确的特征表示,而GELAN则提供了一个高效灵活的网络架构框架,使得这些准确的特征能够被有效地处理和利用。
We summarize the contributions of this paper as follows:
- We theoretically analyzed the existing deep neural network architecture from the perspective of reversible function, and through this process we successfully explained many phenomena that were difficult to explain in the past. We also designed PGI and auxiliary reversible branch based on this analysis and achieved excellent results.
- The PGI we designed solves the problem that deep supervision can only be used for extremely deep neural network architectures, and therefore allows new lightweight architectures to be truly applied in daily life.
- The GELAN we designed only uses conventional convolution to achieve higher parameter utilization than depth-wise convolution designs based on the most advanced technology, while showing great advantages of being light, fast, and accurate.
- Combining the proposed PGI and GELAN, the object detection performance of the YOLOv9 on MS COCO dataset greatly surpasses the existing real-time object detectors in all aspects.
【翻译】我们总结本文的贡献如下:
- 我们从可逆函数的角度对现有深度神经网络架构进行了理论分析,通过这个过程我们成功解释了许多过去难以解释的现象。我们还基于这个分析设计了PGI和辅助可逆分支,并取得了优异的结果。
- 我们设计的PGI解决了深度监督只能用于极深神经网络架构的问题,因此使新的轻量级架构能够真正应用于日常生活中。
- 我们设计的GELAN仅使用传统卷积就实现了比基于最先进技术的深度卷积设计更高的参数使用率,同时在轻量化、快速和准确方面显示出巨大优势。
- 结合提出的PGI和GELAN,YOLOv9在MS COCO数据集上的目标检测性能在各个方面都大大超越了现有的实时目标检测器。
【解析】第一个贡献在于理论突破,作者从可逆函数这个数学角度重新审视了深度神经网络的工作机制。可逆函数是数学中一个基本概念,指的是存在反函数的函数,也就是说变换过程是完全可逆的,不会丢失任何信息。将这个概念引入深度学习领域,为理解网络中的信息流动提供了全新的理论框架。过去很多深度学习中的经验性现象,比如为什么某些架构效果更好、为什么深度增加有时候反而性能下降等,都可以通过信息丢失和可逆性的角度得到合理解释。基于这个理论基础,PGI机制的设计就有了坚实的数学支撑,而不是简单的经验性改进。第二个贡献解决了传统深度监督的适用性局限问题。虽然传统深度监督可以在一定程度上缓解信息丢失问题,但存在两个关键限制:首先,它只在极深的网络中才能发挥作用,对于轻量级和中等深度的网络反而可能降低性能;其次,传统深度监督的多路径特征融合过程本身会导致语义信息损失和错误累积。PGI的创新之处在于,它通过可逆分支架构避免了特征融合导致的语义损失,并通过多级辅助信息机制解决了错误累积问题,使得辅助监督技术能够有效应用于各种规模的网络,包括轻量级模型。这样,即使是资源受限的移动设备和嵌入式系统也能受益于先进的辅助监督机制。第三个贡献展现了工程优化的价值,深度可分离卷积是近年来轻量化网络设计的主流技术,它通过将标准卷积分解为深度卷积和逐点卷积来减少参数量和计算量。然而,GELAN仅使用最基础的标准卷积操作,就在参数效率上超越了这些专门为轻量化设计的复杂结构,这说明合理的架构设计比复杂的操作符更重要,证明了回归基础、注重架构本质的设计思路可能比追求操作符创新更有效。第四个贡献证明了技术的实际价值,在MS COCO评估基准上取得了全面的性能提升,不仅在精度上超越了现有方法,在速度和资源消耗方面也表现出色。
2. Related work
2.1. Real-time Object Detectors
The current mainstream real-time object detectors are the YOLO series [2, 7, 13-15, 25, 30, 31, 47-49, 61-63, 74, 75], and most of these models use CSPNet [64] or ELAN [65] and their variants as the main computing units. In terms of feature integration, improved PAN [37] or FPN [35] is often used as a tool, and then an improved YOLOv3 head [49] or FCOS head [57, 58] is used as the prediction head. Recently some real-time object detectors, such as RT DETR [43], which puts its foundation on DETR [4], have also been proposed. However, since it is extremely difficult for DETR series object detectors to be applied to new domains without a corresponding domain pre-trained model, the most widely used real-time object detector at present is still the YOLO series. This paper chooses YOLOv7 [63], which has been proven effective in a variety of computer vision tasks and various scenarios, as a base to develop the proposed method. We use GELAN to improve the architecture and the training process with the proposed PGI. The above novel approach makes the proposed YOLOv9 the top real-time object detector of the new generation.
【翻译】当前主流的实时目标检测器是YOLO系列[2, 7, 13-15, 25, 30, 31, 47-49, 61-63, 74, 75],这些模型大多使用CSPNet[64]或ELAN[65]及其变体作为主要计算单元。在特征集成方面,通常使用改进的PAN[37]或FPN[35]作为工具,然后使用改进的YOLOv3头[49]或FCOS头[57, 58]作为预测头。最近也提出了一些实时目标检测器,如基于DETR[4]的RT DETR[43]。然而,由于DETR系列目标检测器在没有相应领域预训练模型的情况下极难应用于新领域,目前使用最广泛的实时目标检测器仍然是YOLO系列。本文选择在各种计算机视觉任务和各种场景中都被证明有效的YOLOv7[63]作为开发所提出方法的基础。我们使用GELAN改进架构,并使用所提出的PGI改进训练过程。上述新颖方法使所提出的YOLOv9成为新一代顶级实时目标检测器。
【解析】现有的YOLO系列检测器在架构设计上有着相对统一的模式:骨干网络通常采用CSPNet或ELAN这样的高效特征提取模块,这些模块能够在保持计算效率的同时提供强大的特征表达能力。CSPNet通过跨阶段部分连接的设计减少计算冗余,而ELAN则通过高效的层聚合网络实现特征的有效整合。在特征融合层面,PAN和FPN是两种主流的多尺度特征融合策略,PAN通过自下而上和自上而下的路径聚合增强不同尺度特征的信息流通,而FPN则专注于构建具有丰富语义信息的特征金字塔。检测头的设计也相对成熟,YOLOv3头采用anchor-based的检测方式,而FCOS头则是anchor-free的代表,两种方式各有优势。虽然基于Transformer的RT DETR在某些场景下表现优异,但其对预训练模型的强依赖性限制了其在新领域的应用能力,特别是在缺乏大规模标注数据的垂直领域。DETR系列模型需要在大规模数据集上进行充分的预训练才能获得良好的特征表示能力,这种对预训练的依赖使其在资源受限或数据稀缺的场景下难以发挥作用。相比之下,YOLO系列检测器具有更好的泛化能力和更低的部署门槛,这也是其在工业界广泛应用的重要原因。这里同时也表明YOLOv9选择YOLOv7作为基础框架。
2.2. Reversible Architectures
The operation unit of reversible architectures [3, 16, 19] must maintain the characteristics of reversible conversion, so it can be ensured that the output feature map of each layer of operation unit can retain complete original information. Before, RevCol [3] generalizes traditional reversible unit to multiple levels, and in doing so can expand the semantic levels expressed by different layer units. Through a literature review of various neural network architectures, we found that there are many high-performing architectures with varying degree of reversible properties. For example, Res2Net module [11] combines different input partitions with the next partition in a hierarchical manner, and concatenates all converted partitions before passing them backwards. CBNet [34, 39] re-introduces the original input data through composite backbone to obtain complete original information, and obtains different levels of multilevel reversible information through various composition methods. These network architectures generally have excellent parameter utilization, but the extra composite layers cause slow inference speeds. DynamicDet [36] combines CBNet [34] and the high-efficiency real-time object detector YOLOv7 [63] to achieve a very good trade-off among speed, number of parameters, and accuracy. This paper introduces the DynamicDet architecture as the basis for designing reversible branches. In addition, reversible information is further introduced into the proposed PGI. The proposed new architecture does not require additional connections during the inference process, so it can fully retain the advantages of speed, parameter amount, and accuracy.
【翻译】可逆架构[3, 16, 19]的操作单元必须保持可逆转换的特性,这样可以确保每层操作单元的输出特征图能够保留完整的原始信息。之前,RevCol[3]将传统的可逆单元推广到多个层次,通过这种方式可以扩展不同层单元表达的语义层次。通过对各种神经网络架构的文献综述,我们发现有许多具有不同程度可逆属性的高性能架构。例如,Res2Net模块[11]以分层方式将不同的输入分区与下一个分区结合,并在向后传递之前连接所有转换的分区。CBNet[34, 39]通过复合骨干网络重新引入原始输入数据以获得完整的原始信息,并通过各种组合方法获得不同层次的多级可逆信息。这些网络架构通常具有优秀的参数利用率,但额外的复合层导致推理速度较慢。DynamicDet[36]结合了CBNet[34]和高效实时目标检测器YOLOv7[63],在速度、参数数量和准确性之间实现了很好的权衡。本文引入DynamicDet架构作为设计可逆分支的基础。此外,可逆信息进一步引入到所提出的PGI中。所提出的新架构在推理过程中不需要额外的连接,因此可以完全保留速度、参数量和准确性的优势。
【解析】可逆架构的核心设计原则是保证信息的完全可恢复性,这就像在数据传递的每一步都保留完整的"备份",确保没有任何关键信息在网络的深层传播过程中丢失。RevCol的创新在于将单层的可逆操作扩展到多层次结构,这样不仅保持了信息的完整性,还增强了网络表达不同抽象层次语义信息的能力。多层次可逆设计的意义在于,浅层可以保留详细的局部特征信息,而深层可以保留高级的语义信息,这种分层的信息保持策略比单纯的全局信息保持更加精细和有效。文中提到的几种具有可逆特性的架构都体现了不同的设计思路。Res2Net通过分区处理的方式实现部分可逆性,它将输入特征分成多个部分,每个部分经过不同的变换路径,最后再重新组合,这种设计在保持信息多样性的同时也保留了原始信息的可追溯性。CBNet则采用了更直接的方法,通过复合骨干网络将原始输入数据在多个层次重复引入,确保网络在任何深度都能访问到完整的原始信息。这种设计特别适合于需要精细特征定位的任务,比如目标检测中的小物体检测。然而,这些传统可逆架构都面临一个共同的问题:为了保持信息的完整性,它们需要引入额外的计算路径和存储开销,这直接导致了推理速度的下降。DynamicDet的出现代表了可逆架构设计的一个重要转折点,它巧妙地将CBNet的信息保持能力与YOLOv7的高效性结合起来,在保证检测精度的同时显著提升了推理速度。YOLOv9基于DynamicDet的经验,设计了更加优雅的解决方案:通过PGI机制,将可逆信息的处理集中在训练阶段,而在推理阶段完全移除这些额外的计算开销,解决了可逆架构长期以来面临的效率与性能之间的矛盾,使得可逆信息处理技术真正具备了实际部署的可行性。
2.3. Auxiliary Supervision
Deep supervision [28, 54, 68] is the most common auxiliary supervision method, which performs training by inserting additional prediction layers in the middle layers. Especially the application of multi-layer decoders introduced in transformer-based methods is the most common one. Another common auxiliary supervision method is to utilize the relevant meta information to guide the feature maps produced by the intermediate layers and make them have the properties required by the target tasks [18, 20, 24, 29, 76]. Examples of this type include using segmentation loss or depth loss to enhance the accuracy of object detectors. Recently, there are many reports in the literature [53, 67, 82] that use different label assignment methods to generate different auxiliary supervision mechanisms to speed up the convergence speed of the model and improve the robustness at the same time. However, the auxiliary supervision mechanism is usually only applicable to large models, so when it is applied to lightweight models, it is easy to cause an under-parameterization phenomenon, which makes the performance worse. The PGI we proposed designed a way to reprogram multi-level semantic information, and this design allows lightweight models to also benefit from the auxiliary supervision mechanism.
【翻译】深度监督[28, 54, 68]是最常见的辅助监督方法,它通过在中间层插入额外的预测层来进行训练。特别是基于transformer方法中引入的多层解码器的应用是最常见的。另一种常见的辅助监督方法是利用相关的元信息来指导中间层产生的特征图,使其具备目标任务所需的属性[18, 20, 24, 29, 76]。这类方法的例子包括使用分割损失或深度损失来增强目标检测器的准确性。最近,文献中有许多报告[53, 67, 82]使用不同的标签分配方法来生成不同的辅助监督机制,以加速模型的收敛速度并同时提高鲁棒性。然而,辅助监督机制通常只适用于大型模型,所以当它应用于轻量级模型时,容易引起欠参数化现象,使性能变差。我们提出的PGI设计了一种重新编程多级语义信息的方法,这种设计使轻量级模型也能从辅助监督机制中受益。
【解析】辅助监督技术的核心思想是在深度网络的中间层添加额外的监督信号,就像在登山过程中设置多个检查点一样,确保模型在学习过程中不会偏离正确的方向。传统深度监督的实现方式相对直接,在网络的某些中间层后面添加分类器或回归器,让这些中间层的特征直接参与损失计算。这种做法的好处是能够为深层网络提供额外的梯度信号,缓解梯度消失问题,同时也能让中间层学习到更有意义的特征表示。在基于Transformer的方法中,多层解码器的设计就是深度监督的典型应用,每一层解码器都会产生预测结果,这些中间预测结果都会参与到最终的损失计算中,形成了分层次的监督机制。除了直接的预测监督,利用元信息进行辅助监督是另一个重要分支。这里的元信息指的是与主任务相关但又有所不同的监督信号,比如在目标检测任务中引入语义分割的监督信号。分割任务要求模型对每个像素进行精确分类,这种像素级的监督能够帮助特征提取器学习到更细粒度的空间信息,从而提升目标检测的定位精度。近年来,标签分配策略的优化成为了辅助监督的新方向。传统的标签分配往往比较固定,比如根据IoU阈值来决定正负样本。但现代方法会设计更加智能的分配策略,比如考虑样本的难易程度、特征质量等因素来动态调整标签分配,这种动态分配机制本身就构成了一种隐式的辅助监督。不同的分配策略会产生不同的监督信号,模型需要同时满足这些不同的约束条件,这种多约束学习能够提高模型的鲁棒性和泛化能力。但是,辅助监督技术面临的很大挑战是参数分配问题。大型模型由于参数容量充足,可以同时处理主任务和辅助任务的学习需求,不同任务之间的参数竞争不会太激烈。但在轻量级模型中,参数资源本来就非常有限,如果强行引入辅助监督,可能会导致模型把有限的参数过多地分配给辅助任务,反而影响主任务的性能。这就是所谓的欠参数化现象,模型的参数容量不足以支撑多任务学习的需求。PGI机制的创新之处在于,它不是简单地在轻量级模型上强加额外的监督任务,而是通过重新设计信息流的方式,让辅助监督能够以更加高效的方式发挥作用。通过可逆分支的设计,PGI能够在不显著增加参数量的情况下,为轻量级模型提供高质量的梯度信息,这样就解决了参数容量和监督需求之间的矛盾。
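为了说明传统深度监督的基本形态,下面给出一个极简的 PyTorch 草图(假设的玩具分类网络与 0.4 的辅助损失权重均为本文为演示所做的假设,并非任何论文的实现):在中间层插入辅助预测头,训练时把辅助损失加权并入总损失,推理时只使用主输出。

```python
# 示意性草图(假设的玩具分类网络,非任何论文的实现):
# 在中间层插入辅助预测头,训练时把辅助损失加权并入总损失,推理时只用主输出。
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeplySupervisedNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.head = nn.Linear(64, num_classes)        # 主预测头
        self.aux_head = nn.Linear(32, num_classes)    # 辅助预测头,只在训练时使用

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        logits = self.head(f2.mean(dim=(2, 3)))       # 全局平均池化后分类
        if self.training:
            aux_logits = self.aux_head(f1.mean(dim=(2, 3)))   # 浅层特征直接参与监督
            return logits, aux_logits
        return logits

model = DeeplySupervisedNet()
x, y = torch.randn(4, 3, 64, 64), torch.randint(0, 10, (4,))
logits, aux_logits = model(x)
loss = F.cross_entropy(logits, y) + 0.4 * F.cross_entropy(aux_logits, y)  # 0.4 为假设的辅助权重
loss.backward()
```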
3. Problem Statement
Usually, people attribute the difficulty of deep neural network convergence to factors such as gradient vanishing or gradient saturation, and these phenomena do exist in traditional deep neural networks. However, modern deep neural networks have already fundamentally solved the above problem by designing various normalization and activation functions. Nevertheless, deep neural networks still have the problem of slow convergence or poor convergence results.
【翻译】通常,人们将深度神经网络收敛困难问题归因于梯度消失或梯度饱和等因素,这些现象确实存在于传统深度神经网络中。然而,现代深度神经网络已经通过设计各种归一化和激活函数从根本上解决了上述问题。尽管如此,深度神经网络仍然存在收敛缓慢或收敛结果不佳的问题。
In this paper, we explore the nature of the above issue further. Through in-depth analysis of information bottleneck, we deduced that the root cause of this problem is that the initial gradient originally coming from a very deep network has lost a lot of information needed to achieve the goal soon after it is transmitted. In order to confirm this inference, we feedforward deep networks of different architectures with initial weights, and then visualize and illustrate them in Figure 2. Obviously, PlainNet has lost a lot of important information required for object detection in deep layers. As for the proportion of important information that ResNet, CSPNet, and GELAN can retain, it is indeed positively related to the accuracy that can be obtained after training. We further design reversible network-based methods to solve the causes of the above problems. In this section we shall elaborate our analysis of the information bottleneck principle and reversible functions.
【翻译】在本文中,我们进一步探索了上述问题的本质。通过对信息瓶颈的深入分析,我们推断这个问题的根本原因是来自很深网络的初始梯度在传输后不久就失去了实现目标所需的大量信息。为了证实这一推断,我们使用初始权重对不同架构的深度网络进行前向传播,然后在图2中进行可视化和说明。显然,PlainNet在深层中失去了目标检测所需的大量重要信息。至于ResNet、CSPNet和GELAN能够保留的重要信息比例,确实与训练后能够获得的准确性呈正相关。我们进一步设计基于可逆网络的方法来解决上述问题的原因。在本节中,我们将详细阐述我们对信息瓶颈原理和可逆函数的分析。
【解析】这里点出了深度神经网络训练困难的真正根源。传统观点认为梯度消失和梯度饱和是主要问题,这确实在早期的深度网络中是一个严重障碍。比如在没有批归一化的网络中,梯度在反向传播过程中会逐层衰减,导致深层网络难以训练。现代深度学习通过批归一化、残差连接、ReLU激活函数等技术手段已经很好地解决了这些技术层面的问题。然而,即使解决了梯度传播问题,深度网络仍然面临收敛缓慢的困扰,这说明问题的根源更加深层。作者通过信息论的角度重新审视这个问题,提出了一个新的解释:问题不在于梯度是否能够传播,而在于传播的梯度是否包含足够的信息来指导学习。在深度网络中,信息在逐层传递的过程中会不断丢失,当网络很深时,最终的输出可能已经丢失了太多关于原始输入的关键信息。基于这样不完整信息计算出的梯度自然是不可靠的,即使这些梯度能够成功传播到网络的各个层,也无法提供有效的学习指导。为了验证这个理论,作者设计了一个巧妙的实验:使用随机初始化的权重对不同架构的网络进行前向传播,然后可视化中间层的特征图。这个实验的设计很有洞察力,因为在随机初始化的状态下,网络还没有进行任何学习,此时的特征图完全反映了网络架构本身对信息的保持能力。实验结果清楚地显示了不同架构在信息保持方面的差异:PlainNet(普通的全连接网络)在深层几乎完全丢失了有用信息,而ResNet、CSPNet等具有跳跃连接的架构能够保持更多信息,且信息保持能力与最终的训练精度呈现明显的正相关关系。这个发现为设计更好的网络架构提供了理论指导:我们需要的不仅仅是能够传播梯度的网络,更需要能够保持信息完整性的网络。基于这个认识,作者提出了可逆网络的解决方案,这为后续的PGI设计奠定了坚实的理论基础。
3.1. Information Bottleneck Principle
According to the information bottleneck principle, we know that data $X$ may cause information loss when going through transformation, as shown in Eq. 1 below:

$$I(X,X) \geq I(X, f_{\theta}(X)) \geq I(X, g_{\phi}(f_{\theta}(X))), \tag{1}$$

where $I$ indicates mutual information, $f$ and $g$ are transformation functions, and $\theta$ and $\phi$ are parameters of $f$ and $g$, respectively.
【翻译】根据信息瓶颈原理,我们知道数据 $X$ 在经过变换时可能会导致信息丢失,如下面的公式1所示。其中 $I$ 表示互信息,$f$ 和 $g$ 是变换函数,$\theta$ 和 $\phi$ 分别是 $f$ 和 $g$ 的参数。
【解析】这段公式刻画了信息在网络传播过程中的不可逆损失。互信息 $I(X,Y)$ 衡量的是两个随机变量之间的相关程度,它表示知道一个变量能够减少对另一个变量不确定性的程度。在这个语境下,$I(X,X)$ 表示原始数据自身包含的全部信息量,这是一个理论上的上限。当数据经过神经网络的第一个变换函数 $f_{\theta}(\cdot)$ 后,变成了 $f_{\theta}(X)$,此时原始数据 $X$ 与变换后数据之间的互信息 $I(X,f_{\theta}(X))$ 通常会小于 $I(X,X)$,这意味着变换过程中丢失了一些信息。这种信息损失是不可避免的,因为神经网络的每一层都是对输入数据的一种压缩和抽象过程。公式中的不等号链条 $I(X,X)\geq I(X,f_{\theta}(X))\geq I(X,g_{\phi}(f_{\theta}(X)))$ 表明,随着网络层数的增加,原始信息的保留程度会逐步下降。这种信息损失的累积效应是深度神经网络训练困难的核心原因之一,特别是当网络变得非常深时,最终层的输出可能已经丢失了太多与目标任务相关的关键信息,导致梯度计算变得不可靠。
In deep neural networks, $f_{\theta}(\cdot)$ and $g_{\phi}(\cdot)$ respectively represent the operations of two consecutive layers in a deep neural network. From Eq. 1, we can predict that as the number of network layers becomes deeper, the original data will be more likely to be lost. However, the parameters of the deep neural network are based on the output of the network as well as the given target, and then update the network after generating new gradients by calculating the loss function. As one can imagine, the output of a deeper neural network is less able to retain complete information about the prediction target. This will make it possible to use incomplete information during network training, resulting in unreliable gradients and poor convergence.
【翻译】在深度神经网络中,$f_{\theta}(\cdot)$ 和 $g_{\phi}(\cdot)$ 分别表示深度神经网络中两个连续层的操作。从公式1可以预测,随着网络层数变得更深,原始数据更容易丢失。然而,深度神经网络的参数是基于网络的输出以及给定的目标,然后通过计算损失函数生成新的梯度后更新网络。可以想象,更深的神经网络的输出较难保留关于预测目标的完整信息。这将使得在网络训练期间可能使用不完整的信息,导致不可靠的梯度和糟糕的收敛。
【解析】这段话分析了深度神经网络训练中的一个悖论。在深度网络中,每一层都可以看作是一个信息处理单元,它接收上一层的输出作为输入,然后通过参数变换产生新的特征表示。理想情况下,我们希望每一层都能提取出对最终任务有用的特征,同时保留足够的信息供后续层使用。但现实中,每一层的变换都会不可避免地丢失一些信息,这种损失在网络加深时会呈累积效应。问题的关键在于,神经网络的学习过程是通过反向传播算法实现的,即从网络的最终输出开始,根据损失函数计算梯度,然后逐层向前传播这些梯度来更新参数。如果网络的最终输出已经丢失了太多关于目标任务的关键信息,那么基于这个输出计算出的损失函数就无法准确反映真实的优化方向。这就像在一个信息传递链条中,如果最后收到的信息已经严重失真,那么基于这个失真信息做出的决策和调整就会偏离正确方向。这种信息不完整性导致的梯度不可靠性,会使得网络训练变得困难,收敛速度变慢,甚至可能陷入局部最优解。
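下面用一个离散随机变量上的数值小例(NumPy 实现,变量取值与有损映射 f 均为本文假设)验证这条不等式链的第一步,即确定性的有损变换只会减少与原始数据的互信息:

```python
# 数值小例(NumPy 实现,变量取值与映射 f 均为假设):在离散随机变量上验证
# 数据处理不等式 I(X;X) >= I(X;f(X)),即确定性的有损变换只会减少互信息。
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """joint[i, j] = P(X=i, Z=j),返回 I(X;Z),单位为 bit。"""
    px = joint.sum(axis=1, keepdims=True)
    pz = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ pz)[mask])).sum())

px = np.full(4, 0.25)            # X 在 {0,1,2,3} 上均匀分布
f = np.array([0, 0, 1, 1])       # f 把 4 个取值压缩成 2 个,信息必然丢失

joint_xx = np.diag(px)           # (X, X) 的联合分布
joint_xf = np.zeros((4, 2))
for i in range(4):
    joint_xf[i, f[i]] = px[i]    # (X, f(X)) 的联合分布

print("I(X;X)    =", mutual_information(joint_xx))   # 2.0 bit
print("I(X;f(X)) =", mutual_information(joint_xf))   # 1.0 bit,信息少了一半
```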
One way to solve the above problem is to directly increase the size of the model. When we use a large number of parameters to construct a model, it is more capable of performing a more complete transformation of the data. The above approach allows even if information is lost during the data feedforward process, there is still a chance to retain enough information to perform the mapping to the target. The above phenomenon explains why the width is more important than the depth in most modern models. However, the above conclusion cannot fundamentally solve the problem of unreliable gradients in very deep neural network. Below, we will introduce how to use reversible functions to solve problems and conduct relative analysis.
【翻译】解决上述问题的一种方法是直接增加模型的大小。当我们使用大量参数来构建模型时,它更能够对数据执行更完整的变换。上述方法允许即使在数据前向传播过程中信息丢失,仍然有机会保留足够的信息来执行到目标的映射。上述现象解释了为什么在大多数现代模型中宽度比深度更重要。然而,上述结论无法从根本上解决极深神经网络中不可靠梯度的问题。下面,我们将介绍如何使用可逆函数来解决问题并进行相关分析。
【解析】增加模型参数量的方法本质上是通过提升网络的表达能力来补偿信息损失。当网络拥有更多参数时,每一层都有更大的容量来编码和保存输入信息,即使某些信息在变换过程中丢失,网络仍然可能在其他地方保留了足够的相关信息。这就像增加一个容器的容量,即使有一些泄漏,仍然能够容纳足够的内容。这种思路的成功可以从现代深度学习的发展历程中得到验证,比如从AlexNet到VGG再到更宽的ResNet变体,我们看到网络宽度的增加往往比简单增加深度更有效。宽度的重要性体现在:更宽的网络意味着每一层有更多的神经元,能够学习更丰富的特征表示,同时也提供了更多的信息传递通道,降低了关键信息完全丢失的风险。然而,这种暴力扩大参数量的方法存在明显的局限性。首先,它会大幅增加计算和存储成本,使模型在实际部署中变得不可行。其次,更重要的是,这种方法并没有从根本上解决信息传递的机制问题,只是通过冗余来缓解问题的表现。当网络变得极深时,即使是参数量很大的网络,信息损失的累积效应仍然会导致梯度不可靠的问题。因此,需要从信息传递的机制层面寻找更本质的解决方案,这就引出了可逆函数的概念,它试图从数学原理上保证信息传递过程的无损性。
3.2. Reversible Functions
When a function $r$ has an inverse transformation function $v$, we call this function a reversible function, as shown in Eq. 2.

$$X = v_{\zeta}(r_{\psi}(X)), \tag{2}$$

where $\psi$ and $\zeta$ are parameters of $r$ and $v$, respectively. Data $X$ is converted by a reversible function without losing information, as shown in Eq. 3.

$$I(X,X) = I(X, r_{\psi}(X)) = I(X, v_{\zeta}(r_{\psi}(X))). \tag{3}$$
When the network’s transformation function is composed of reversible functions, more reliable gradients can be obtained to update the model. Almost all of today’s popular deep learning methods are architectures that conform to the reversible property, such as Eq. 4 .
【翻译】当一个函数 $r$ 具有逆变换函数 $v$ 时,我们称这个函数为可逆函数,如公式2所示。其中 $\psi$ 和 $\zeta$ 分别是 $r$ 和 $v$ 的参数。数据 $X$ 通过可逆函数转换而不会丢失信息,如公式3所示。当网络的变换函数由可逆函数组成时,可以获得更可靠的梯度来更新模型。当今几乎所有流行的深度学习方法都是符合可逆性质的架构,如公式4所示。
【解析】可逆函数的本质是建立一种双向映射关系,确保信息在变换过程中能够完全保持。在数学层面,公式2展示了可逆性的定义:如果我们先对输入 $X$ 应用函数 $r_{\psi}(\cdot)$,然后再应用其逆函数 $v_{\zeta}(\cdot)$,最终能够完全恢复原始输入 $X$。这种可逆性的关键在于参数 $\psi$ 和 $\zeta$ 必须精确匹配,使得两个函数能够互为逆变换。公式3从信息论的角度量化了可逆性的价值:互信息 $I(X,X)$ 表示原始数据自身包含的全部信息,而 $I(X,r_{\psi}(X))$ 表示原始数据与变换后数据之间的信息关联度。在完美的可逆变换中,这三个互信息值应该完全相等,意味着没有任何信息在变换过程中丢失。这种信息完整性对深度学习至关重要,因为完整的信息能够产生高质量的梯度信号。当损失函数基于完整信息计算梯度时,这些梯度更能准确反映参数调整的正确方向,从而加速模型收敛并提高最终性能。
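可逆变换的一种经典实现方式是 RevNet 风格的加性耦合块。下面的 PyTorch 草图仅用于演示公式2、3所描述的性质(并非 YOLOv9 的实际模块):可逆性由耦合结构本身保证,与子网络 F、G 的具体形式无关,因此可以从输出精确恢复输入。

```python
# 示意性草图(RevNet 风格的加性耦合块,演示公式2、3的性质,非 YOLOv9 的实际模块):
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        # F、G 可以是任意子网络,这里用两层卷积作占位
        self.F = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(c, c, 3, padding=1))
        self.G = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x1, x2):           # 对应 r_psi
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):           # 对应 v_zeta:逐步减去对应项即可精确恢复输入
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

block = ReversibleBlock(8)
x1, x2 = torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32)
with torch.no_grad():
    y1, y2 = block(x1, x2)
    rx1, rx2 = block.inverse(y1, y2)
print(torch.allclose(x1, rx1, atol=1e-5), torch.allclose(x2, rx2, atol=1e-5))  # True True
```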
$$X^{l+1} = X^{l} + f_{\theta}^{l+1}(X^{l}), \tag{4}$$
where $l$ indicates the $l$-th layer of a PreAct ResNet and $f$ is the transformation function of the $l$-th layer. PreAct ResNet [22] repeatedly passes the original data $X$ to subsequent layers in an explicit way. Although such a design can make a deep neural network with more than a thousand layers converge very well, it destroys an important reason why we need deep neural networks. That is, for difficult problems, it is difficult for us to directly find simple mapping functions to map data to targets. This also explains why PreAct ResNet performs worse than ResNet [21] when the number of layers is small.
【翻译】其中 $l$ 表示PreAct ResNet的第 $l$ 层,$f$ 是第 $l$ 层的变换函数。PreAct ResNet以显式的方式重复地将原始数据 $X$ 传递到后续层。虽然这种设计可以使超过一千层的深度神经网络收敛得很好,但它破坏了我们需要深度神经网络的一个重要原因。也就是说,对于困难问题,我们很难直接找到简单的映射函数将数据映射到目标。这也解释了为什么PreAct ResNet在层数较少时表现比ResNet差。
【解析】这里涉及到残差网络设计的思考。PreAct ResNet通过恒等映射的方式确实解决了极深网络的训练问题,但这种设计在某种程度上违背了深度学习的本质理念。深度神经网络之所以有效,是因为复杂的非线性问题往往无法通过单一的简单函数来解决,需要通过多层的组合变换来逐步抽象和提炼特征。每一层都应该承担一部分特征变换的责任,通过层层递进的方式将原始输入转化为对目标任务有用的表示。然而,PreAct ResNet的显式恒等连接使得原始数据可以几乎不经过任何变换就直接传递到深层,这种设计虽然保证了信息的完整传递,但也使得中间层的学习变得不够充分。当网络层数较少时,这种问题尤为明显,因为浅层网络本身的变换能力就有限,如果再通过恒等连接绕过部分变换过程,那么网络就更难学习到复杂的特征表示。这就像建造一座桥梁,如果我们为了防止结构失稳而添加了太多的支撑结构,虽然桥梁不会倒塌,但可能会失去其原本应有的跨越能力。
In addition, we tried to use masked modeling that allowed the transformer model to achieve significant breakthroughs. We use approximation methods, such as Eq. 5, to try to find the inverse transformation $v$ of $r$, so that the transformed features can retain enough information using sparse features. The form of Eq. 5 is as follows:
【翻译】此外,我们尝试使用掩码建模,这使得transformer模型取得了重大突破。我们使用近似方法,如公式5,试图找到 $r$ 的逆变换 $v$,使得变换后的特征能够使用稀疏特征保留足够的信息。公式5的形式如下:
【解析】这段话引入了掩码建模的概念,这是近年来自然语言处理和计算机视觉领域的重要突破。掩码建模的核心思想是通过故意隐藏部分输入信息,然后训练模型去预测这些被遮蔽的部分,从而学习到更鲁棒和泛化的特征表示。在transformer架构中,这种方法被证明极其有效,比如BERT通过掩码语言建模学习文本表示,MAE通过掩码图像建模学习视觉表示。作者在这里提到使用近似方法来寻找逆变换,实际上是在探索如何在稀疏信息的条件下重构完整信息。这种方法的数学基础在于,如果我们能够找到一个良好的逆变换函数,那么即使原始特征经过了有损压缩(通过掩码实现),我们仍然能够恢复出足够的信息来完成目标任务。
$$X = v_{\zeta}(r_{\psi}(X) \cdot M), \tag{5}$$
where $M$ is a dynamic binary mask. Other methods that are commonly used to perform the above tasks are diffusion models and variational autoencoders, and they both have the function of finding the inverse function. However, when we apply the above approach to a lightweight model, there will be defects because the lightweight model will be under-parameterized to a large amount of raw data. Because of the above reason, important information $I(Y,X)$ that maps data $X$ to target $Y$ will also face the same problem. For this issue, we will explore it using the concept of information bottleneck [59]. The formula for information bottleneck is as follows:
【翻译】其中 $M$ 是一个动态二值掩码。其他常用于执行上述任务的方法是扩散模型和变分自编码器,它们都具有寻找逆函数的功能。然而,当我们将上述方法应用于轻量级模型时,会出现缺陷,因为轻量级模型对于大量原始数据来说是参数不足的。由于上述原因,将数据 $X$ 映射到目标 $Y$ 的重要信息 $I(Y,X)$ 也会面临同样的问题。对于这个问题,我们将使用信息瓶颈的概念来探索它。信息瓶颈的公式如下:
【解析】这里引入了动态掩码的概念,这种掩码不是静态固定的,而是可以根据输入内容或训练阶段动态调整的。公式5展示了一个重要的数学关系:原始数据通过变换函数 $r_{\psi}$ 处理后,再与动态掩码 $M$ 相乘得到稀疏表示,然后通过逆变换 $v_{\zeta}$ 试图重构原始输入。这种设计的巧妙之处在于,它结合了信息压缩和信息重构两个过程,既能够学习紧凑的特征表示,又能够验证这些特征是否包含了足够的重要信息。作者提到了扩散模型和变分自编码器这两种流行的生成模型,它们的共同特点是都具备编码-解码的能力,能够学习数据的潜在表示并重构原始输入。然而,当这些先进的方法应用到轻量级模型时,就会遇到参数不足的根本性问题。轻量级模型的参数量有限,面对大规模的原始数据时,很容易出现欠参数化的情况,导致模型无法充分学习数据中的复杂模式。更关键的是,即使整体信息损失不大,但如果丢失的恰好是从输入到目标的关键映射信息 $I(Y,X)$,那么对任务性能的影响将是灾难性的。
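按照公式5的思路,下面给出掩码建模的一个极简 PyTorch 草图(编码器、解码器结构与掩码比例均为本文为演示所做的假设,非论文实现):对特征乘以动态二值掩码 $M$,再用近似逆变换从稀疏特征重建输入,重建损失迫使被保留的特征携带足够的输入信息。

```python
# 示意性草图(编码器、解码器结构与掩码比例均为演示假设,非论文实现):
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 32, 3, padding=1))          # 对应 r_psi
decoder = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 3, 3, padding=1))           # 对应 v_zeta(近似逆变换)

x = torch.randn(4, 3, 32, 32)
feat = encoder(x)
mask = (torch.rand_like(feat[:, :1]) > 0.5).float()   # 动态二值掩码 M,每步随机生成
recon = decoder(feat * mask)                          # v(r(X) · M)
loss = F.mse_loss(recon, x)                           # 重建损失,隐式要求特征保留输入信息
loss.backward()
```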
$$I(X,X) \geq I(Y,X) \geq I(Y, f_{\theta}(X)) \geq \dots \geq I(Y, \hat{Y}). \tag{6}$$
Generally speaking, $I(Y,X)$ will only occupy a very small part of $I(X,X)$. However, it is critical to the target mission. Therefore, even if the amount of information lost in the feedforward stage is not significant, as long as $I(Y,X)$ is covered, the training effect will be greatly affected. The lightweight model itself is in an under-parameterized state, so it is easy to lose a lot of important information in the feedforward stage. Therefore, our goal for the lightweight model is how to accurately filter $I(Y,X)$ from $I(X,X)$. As for fully preserving the information of $X$, that is difficult to achieve. Based on the above analysis, we hope to propose a new deep neural network training method that can not only generate reliable gradients to update the model, but also be suitable for shallow and lightweight neural networks.
【翻译】一般来说,$I(Y,X)$ 只会占 $I(X,X)$ 的很小一部分。然而,它对目标任务至关重要。因此,即使在前向传播阶段丢失的信息量不显著,只要 $I(Y,X)$ 被覆盖,训练效果就会受到很大影响。轻量级模型本身处于参数不足的状态,因此很容易在前向传播阶段丢失大量重要信息。因此,我们对轻量级模型的目标是如何从 $I(X,X)$ 中准确过滤出 $I(Y,X)$。至于完全保存 $X$ 的信息,这很难实现。基于上述分析,我们希望提出一种新的深度神经网络训练方法,既能生成可靠的梯度来更新模型,又适用于浅层和轻量级神经网络。
【解析】信息瓶颈公式展示了一个递减的互信息链条,从原始数据的完整信息开始,通过网络的逐层变换,信息量逐步减少,最终到达预测输出。这个公式表示出了深度网络中信息传递的本质规律。在实际应用中,原始输入包含的总信息量通常是巨大的,但其中与目标任务真正相关的信息 $I(Y,X)$ 往往只占很小的比例。这就像在一个巨大的图书馆中寻找特定的信息,虽然图书馆包含了海量的知识,但与当前问题相关的可能只是其中的一小部分。问题的关键在于,这一小部分关键信息虽然在整体中占比很小,但对任务成功却是决定性的。如果在网络的前向传播过程中,这些关键信息被意外丢失或损坏,那么即使其他大部分信息都被完好保留,模型的性能也会严重下降。轻量级模型面临的挑战更加严峻,因为参数量的限制使得它们没有足够的容量来同时保存大量冗余信息和关键信息,因此更容易在信息筛选过程中误删重要内容。这种情况下,完全保存原始信息既不现实也不必要,真正需要的是一种智能的信息过滤机制,能够准确识别并保留那些对目标任务至关重要的信息成分。
Figure 3. PGI and related network architectures and methods. (a) Path Aggregation Network (PAN) [37], (b) Reversible Columns (RevCol) [3], (c) conventional deep supervision, and (d) our proposed Programmable Gradient Information (PGI). PGI is mainly composed of three components: (1) main branch: architecture used for inference, (2) auxiliary reversible branch: generate reliable gradients to supply main branch for backward transmission, and (3) multi-level auxiliary information: control main branch learning plannable multi-level of semantic information.
【翻译】图3. PGI和相关的网络架构和方法。(a) 路径聚合网络 (PAN) [37],(b) 可逆列 (RevCol) [3],(c) 传统深度监督,以及 (d) 我们提出的可编程梯度信息 (PGI)。PGI主要由三个组件组成:(1) 主分支:用于推理的架构,(2) 辅助可逆分支:生成可靠的梯度为主分支提供反向传播,以及 (3) 多级辅助信息:控制主分支学习可规划的多级语义信息。
【解析】图展示了PGI方法与现有技术的对比关系。PAN作为目标检测中的经典特征融合方法,通过自底向上和自顶向下的路径来整合不同尺度的特征信息。RevCol则代表了可逆架构的典型实现,通过可逆变换来保证信息的完整传递。传统深度监督通过在网络的中间层添加额外的损失函数来加强梯度信号。而PGI的创新在于它巧妙地结合了这些方法的优点同时避免了各自的缺点。PGI的三组件设计体现了一种模块化的思维:主分支专注于高效推理,不承担额外的计算负担;辅助可逆分支专门负责生成高质量的梯度信号,解决深度网络中的信息瓶颈问题;多级辅助信息则负责协调不同层级的学习目标,防止各层过度专门化导致的信息割裂。这种设计使得PGI既能享受可逆架构带来的梯度质量提升,又不会在推理时承担额外的计算成本,同时还能通过多级辅助信息实现更精细的训练控制。
4. Methodology
4.1. Programmable Gradient Information
In order to solve the aforementioned problems, we propose a new auxiliary supervision framework called Programmable Gradient Information (PGI), as shown in Figure 3 (d). PGI mainly includes three components, namely (1) main branch, (2) auxiliary reversible branch, and (3) multi-level auxiliary information. From Figure 3 (d) we see that the inference process of PGI only uses main branch and therefore does not require any additional inference cost. As for the other two components, they are used to solve or slow down several important issues in deep learning methods. Among them, auxiliary reversible branch is designed to deal with the problems caused by the deepening of neural networks. Network deepening will cause information bottleneck, which will make the loss function unable to generate reliable gradients. As for multi-level auxiliary information, it is designed to handle the error accumulation problem caused by deep supervision, especially for the architecture and lightweight model of multiple prediction branch. Next, we will introduce these two components step by step.
【翻译】为了解决上述问题,我们提出了一个新的辅助监督框架,称为可编程梯度信息(PGI),如图3(d)所示。PGI主要包括三个组件,即(1)主分支、(2)辅助可逆分支和(3)多级辅助信息。从图3(d)我们可以看到,PGI的推理过程只使用主分支,因此不需要任何额外的推理成本。至于其他两个组件,它们用于解决或减缓深度学习方法中的几个重要问题。其中,辅助可逆分支旨在处理神经网络加深引起的问题。网络加深会导致信息瓶颈,这会使损失函数无法生成可靠的梯度。至于多级辅助信息,它旨在处理由深度监督引起的错误累积问题,特别是对于多预测分支的架构和轻量级模型。接下来,我们将逐步介绍这两个组件。
【解析】PGI是在不增加推理成本的前提下解决深度神经网络训练中的根本性问题。主分支承担了所有的实际推理工作,这保证了模型在部署时的效率不会受到影响。而另外两个辅助组件则专门针对深度学习中的两大挑战:信息传递质量和监督信号质量。信息瓶颈问题源于深度网络中信息在层间传递时的逐步损失,当网络变得很深时,原始输入的信息可能会在传递过程中严重退化,导致后续层接收到的信息不足以生成有效的梯度信号,这直接影响了网络的训练效果。多级辅助信息组件则针对深度监督中的另一个问题:当使用多个预测头进行监督时,不同层级的特征可能会过度专门化,导致信息分布不均和错误累积。PGI通过这种模块化的设计,既保证了推理效率,又系统性地改善了训练过程中的梯度质量和信息流动,特别适合轻量级模型这种参数受限的场景。
4.1.1 Auxiliary Reversible Branch
In PGI, we propose auxiliary reversible branch to generate reliable gradients and update network parameters. By providing information that maps from data to targets, the loss function can provide guidance and avoid the possibility of finding false correlations from incomplete feedforward features that are less relevant to the target. We propose the maintenance of complete information by introducing reversible architecture, but adding main branch to reversible architecture will consume a lot of inference costs. We analyzed the architecture of Figure 3 (b) and found that when additional connections from deep to shallow layers are added, the inference time will increase by 20%. When we repeatedly add the input data to the high-resolution computing layer of the network (yellow box), the inference time even exceeds twice the time.
【翻译】在PGI中,我们提出了辅助可逆分支来生成可靠的梯度并更新网络参数。通过提供从数据到目标的映射信息,损失函数可以提供指导,避免从与目标关联度较低的不完整前向传播特征中发现虚假相关性的可能性。我们提出通过引入可逆架构来维护完整信息,但将主分支添加到可逆架构中会消耗大量推理成本。我们分析了图3(b)的架构,发现当添加从深层到浅层的额外连接时,推理时间会增加20%。当我们重复将输入数据添加到网络的高分辨率计算层(黄色框)时,推理时间甚至超过了两倍。
【解析】在深度神经网络中,信息在前向传播过程中往往会出现丢失或降质,导致梯度计算变得不可靠,进而影响模型训练效果。辅助可逆分支的核心思想是构建一个能够完全保留原始信息的分支,使得即便在信息传递过程中出现损失,系统仍能通过可逆变换重构出原始信息,从而为损失函数提供准确的梯度信号。这种设计的理论基础在于,当损失函数能够接收到完整的从输入到目标的映射信息时,它就能够更好地识别真正有效的特征模式,而不会被那些由于信息不完整而产生的虚假相关性所误导。然而,直接将主分支与可逆架构结合会带来显著的计算开销问题,实验数据显示这种直接连接方式会使推理时间增加20%,而在某些复杂配置下甚至会增加一倍以上,这对于实际应用来说是不可接受的。
Since our goal is to use reversible architecture to obtain reliable gradients, “reversible” is not the only necessary condition in the inference stage. In view of this, we regard reversible branch as an expansion of deep supervision branch, and then design auxiliary reversible branch, as shown in Figure 3 (d). As for the main branch deep features that would have lost important information due to information bottleneck, they will be able to receive reliable gradient information from the auxiliary reversible branch. These gradient information will drive parameter learning to assist in extracting correct and important information, and the above actions can enable the main branch to obtain features that are more effective for the target task. Moreover, the reversible architecture performs worse on shallow networks than on general networks because complex tasks require conversion in deeper networks. Our proposed method does not force the main branch to retain complete original information but updates it by generating useful gradient through the auxiliary supervision mechanism. The advantage of this design is that the proposed method can also be applied to shallower networks.
【翻译】由于我们的目标是使用可逆架构来获得可靠的梯度,"可逆性"并不是推理阶段的唯一必要条件。鉴于此,我们将可逆分支视为深度监督分支的扩展,然后设计了辅助可逆分支,如图3(d)所示。至于主分支中由于信息瓶颈而丢失重要信息的深层特征,它们将能够从辅助可逆分支接收可靠的梯度信息。这些梯度信息将驱动参数学习,帮助提取正确和重要的信息,上述行为能够使主分支获得对目标任务更有效的特征。此外,可逆架构在浅层网络上的表现比在一般网络上更差,因为复杂任务需要在更深的网络中进行转换。我们提出的方法不强制主分支保留完整的原始信息,而是通过辅助监督机制生成有用的梯度来更新它。这种设计的优势在于所提出的方法也可以应用于较浅的网络。
【解析】传统观念认为可逆性是获得可靠梯度的核心要求,但作者指出这种理解过于绝对化。在实际应用中,特别是在推理阶段,完全的可逆性往往不是必需的,关键在于如何在训练过程中利用可逆性来改善梯度质量。基于这一认识,作者提出了一种架构设计:将可逆分支重新定位为深度监督的增强版本,而不是主架构的必要组成部分。主分支在接收到这些来自辅助分支的可靠梯度信号后,能够更准确地调整其参数,从而学习到更有价值的特征表示。这种间接的信息传递方式解决了传统可逆架构的一个重要局限性:它们在浅层网络中表现不佳。这是因为复杂的特征变换通常需要深层网络来实现,而浅层网络的表达能力有限。通过辅助监督机制,主分支不再需要承担完整信息保存的重任,而是专注于学习任务相关的有效特征,这使得该方法能够成功应用于各种深度的网络架构。
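下面按正文的描述把 PGI 的"训练与推理分离"组织成一个示意性 PyTorch 草图(模块划分、辅助信息注入方式与损失权重均为本文假设,非官方实现):辅助分支只在训练时前向并产生辅助损失,推理时只走主分支,因此不增加任何推理成本。

```python
# 示意性草图(模块划分、辅助信息注入方式与损失权重均为本文假设,非官方实现):
import torch
import torch.nn as nn

class PGIStyleModel(nn.Module):
    def __init__(self, backbone, main_head, aux_branch, aux_head):
        super().__init__()
        self.backbone, self.main_head = backbone, main_head
        self.aux_branch, self.aux_head = aux_branch, aux_head   # 训练专用,推理时可整体删除

    def forward(self, x):
        feat = self.backbone(x)
        main_out = self.main_head(feat)
        if self.training:
            # 辅助分支重新注入输入信息,使辅助损失基于更完整的信息计算梯度
            aux_out = self.aux_head(self.aux_branch(x) + feat)
            return main_out, aux_out
        return main_out                                         # 推理只用主分支

def train_step(model, criterion, optimizer, x, target, aux_weight=0.25):
    model.train()
    main_out, aux_out = model(x)
    # 主损失 + 加权辅助损失:辅助分支携带的可靠梯度会流回主干
    loss = criterion(main_out, target) + aux_weight * criterion(aux_out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = PGIStyleModel(backbone=nn.Conv2d(3, 16, 3, padding=1),
                      main_head=nn.Conv2d(16, 4, 1),
                      aux_branch=nn.Conv2d(3, 16, 3, padding=1),
                      aux_head=nn.Conv2d(16, 4, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(2, 3, 32, 32), torch.randn(2, 4, 32, 32)
print(train_step(model, nn.MSELoss(), opt, x, y))
```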
Figure 4. The architecture of GELAN: (a) CSPNet [64], (b) ELAN [65], and (c) proposed GELAN. We imitate CSPNet and extend ELAN into GELAN that can support any computational blocks.
【翻译】图4. GELAN的架构:(a) CSPNet [64],(b) ELAN [65],(c) 提出的GELAN。我们模仿CSPNet并将ELAN扩展为GELAN,使其能够支持任何计算块。
Finally, since auxiliary reversible branch can be removed during the inference phase, the inference capabilities of the original network can be retained. We can also choose any reversible architectures in PGI to play the role of auxiliary reversible branch.
【翻译】最后,由于辅助可逆分支可以在推理阶段被移除,原始网络的推理能力可以得到保留。我们也可以在PGI中选择任何可逆架构来扮演辅助可逆分支的角色。
【解析】强调了PGI架构设计的一个优势:训练和推理的分离性。一旦训练完成,这个辅助分支可以被完全移除而不影响模型的推理性能。
4.1.2 Multi-level Auxiliary Information
In this section we will discuss how multi-level auxiliary information works. The deep supervision architecture including multiple prediction branches is shown in Figure 3 (c). For object detection, different feature pyramids can be used to perform different tasks; for example, together they can detect objects of different sizes. Therefore, after connecting to the deep supervision branch, the shallow features will be guided to learn the features required for small object detection, and at this time the system will regard the positions of objects of other sizes as the background. However, the above deed will cause the deep feature pyramids to lose a lot of information needed to predict the target object. Regarding this issue, we believe that each feature pyramid needs to receive information about all target objects so that subsequent main branch can retain complete information to learn predictions for various targets.
【翻译】在本节中,我们将讨论多级辅助信息是如何工作的。包含多个预测分支的深度监督架构如图3©所示。对于目标检测,不同的特征金字塔可以用来执行不同的任务,例如它们可以一起检测不同大小的目标。因此,在连接到深度监督分支后,浅层特征将被引导学习小目标检测所需的特征,此时系统会将其他大小目标的位置视为背景。然而,上述行为会导致深层特征金字塔丢失预测目标对象所需的大量信息。针对这个问题,我们认为每个特征金字塔都需要接收关于所有目标对象的信息,这样后续的主分支就能保留完整的信息来学习对各种目标的预测。
The concept of multi-level auxiliary information is to insert an integration network between the feature pyramid hierarchy layers of auxiliary supervision and the main branch, and then uses it to combine returned gradients from different prediction heads, as shown in Figure 3 (d). Multi-level auxiliary information is then to aggregate the gradient information containing all target objects, and pass it to the main branch and then update parameters. At this time, the characteristics of the main branch’s feature pyramid hierarchy will not be dominated by some specific object’s information. As a result, our method can alleviate the broken information problem in deep supervision. In addition, any integrated network can be used in multi-level auxiliary information. Therefore, we can plan the required semantic levels to guide the learning of network architectures of different sizes.
【翻译】多级辅助信息的概念是在辅助监督的特征金字塔层级和主分支之间插入一个集成网络,然后使用它来组合来自不同预测头的返回梯度,如图3(d)所示。多级辅助信息的作用是聚合包含所有目标对象的梯度信息,并将其传递给主分支然后更新参数。此时,主分支特征金字塔层级的特征将不会被某些特定对象的信息所主导。因此,我们的方法可以缓解深度监督中的信息破坏问题。此外,任何集成网络都可以用于多级辅助信息中。因此,我们可以规划所需的语义级别来指导不同规模网络架构的学习。
【解析】这段话介绍了多级辅助信息机制的工作原理。在传统的深度监督中,不同层级的特征金字塔往往会被训练去专门处理特定尺寸的目标,这就导致了信息的分化和丢失。比如浅层特征可能只关注小目标的检测,而忽略了大目标的信息,这种专门化虽然在某种程度上提高了特定任务的性能,但却破坏了特征的完整性。多级辅助信息机制的创新在于引入了一个集成网络作为"信息汇聚器",它位于辅助监督分支和主分支之间,专门负责收集和整合来自不同预测头的梯度信息。这些梯度信息包含了对所有尺寸目标的学习信号,通过集成网络的处理,这些多样化的梯度信息被有效地融合在一起,然后统一传递给主分支进行参数更新。这样做的好处是主分支的每一层特征都能接收到关于所有目标的综合信息,而不会出现某一层只专注于特定目标而忽略其他目标的偏向性问题。这种设计的灵活性还体现在集成网络的选择上,研究者可以根据具体需求选择不同的网络结构来实现信息集成,从而为不同规模的网络架构提供定制化的语义级别规划,最终实现更加均衡和全面的特征学习。
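下面给出"集成网络"思路的一个示意性草图(层级数、通道数与融合方式均为本文为演示所做的假设,非论文实现):把所有金字塔层级的特征融合后再分发回每个层级,使每个辅助头回传的梯度都携带全部尺度目标的信息。

```python
# 示意性草图(层级数、通道数与融合方式均为演示假设,非论文实现):
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntegrationNetwork(nn.Module):
    """把全部层级特征对齐到各层级分辨率后融合,再分发回每个层级。"""
    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        self.fuse = nn.ModuleList(nn.Conv2d(channels * num_levels, channels, 1)
                                  for _ in range(num_levels))

    def forward(self, feats):
        fused = []
        for i, f in enumerate(feats):
            # 把其他层级的特征插值到当前层级的分辨率后拼接融合
            aligned = [F.interpolate(g, size=f.shape[-2:], mode="nearest") for g in feats]
            fused.append(self.fuse[i](torch.cat(aligned, dim=1)))
        return fused

feats = [torch.randn(1, 32, s, s) for s in (64, 32, 16)]   # 假设的三层特征金字塔
integration = IntegrationNetwork(channels=32, num_levels=3)
fused = integration(feats)
print([tuple(f.shape) for f in fused])   # 每个层级的输出都融合了全部层级的信息
```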
4.2. Generalized ELAN
In this section we describe the proposed new network architecture – GELAN. By combining two neural network architectures, CSPNet [64] and ELAN [65], which are designed with gradient path planning, we designed the generalized efficient layer aggregation network (GELAN) that takes into account lightweight design, inference speed, and accuracy. Its overall architecture is shown in Figure 4. We generalized the capability of ELAN [65], which originally only used stacking of convolutional layers, to a new architecture that can use any computational blocks.
【翻译】在本节中,我们描述了所提出的新网络架构——GELAN。通过结合两种采用梯度路径规划设计的神经网络架构CSPNet [64]和ELAN [65],我们设计了广义高效层聚合网络(GELAN),该网络兼顾了轻量化、推理速度和准确性。其整体架构如图4所示。我们将ELAN [65]的能力进行了泛化,原本它只使用卷积层的堆叠,现在扩展为可以使用任何计算块的新架构。
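下面按图4的思路给出 GELAN 块的一个示意性 PyTorch 草图(通道切分比例与聚合细节为本文假设,非官方实现):把 ELAN 中固定的卷积层堆叠泛化为任意可替换的计算块 block_cls,并沿用 CSPNet 先切分、逐级变换、最后拼接聚合的模式。

```python
# 示意性草图(通道切分比例与聚合细节为本文假设,非官方实现):
import torch
import torch.nn as nn

class GELANBlock(nn.Module):
    def __init__(self, c_in: int, c_out: int, block_cls, num_blocks: int = 2):
        super().__init__()
        c_mid = c_out // 2
        self.split = nn.Conv2d(c_in, c_mid * 2, 1)               # 切分为两条路径
        # 任意计算块串联,每一级的输出都会参与最终聚合(ELAN 式层聚合)
        self.blocks = nn.ModuleList(block_cls(c_mid) for _ in range(num_blocks))
        self.merge = nn.Conv2d(c_mid * (2 + num_blocks), c_out, 1)

    def forward(self, x):
        y1, y2 = self.split(x).chunk(2, dim=1)
        outs = [y1, y2]
        for block in self.blocks:
            outs.append(block(outs[-1]))                         # 逐级变换并保留每级输出
        return self.merge(torch.cat(outs, dim=1))

def conv_block(c: int) -> nn.Module:       # 可替换成 Res 块、Dark 块或 CSP 块
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU())

gelan = GELANBlock(64, 64, block_cls=conv_block, num_blocks=2)
print(gelan(torch.randn(1, 64, 32, 32)).shape)                   # torch.Size([1, 64, 32, 32])
```

这种把计算块作为参数传入的写法,正对应正文所说的"用户可以为不同推理设备任意选择合适的计算块"。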
5. Experiments
5.1. Experimental Setup
We verify the proposed method with the MS COCO dataset. All experimental setups follow YOLOv7 AF [63], while the dataset is the MS COCO 2017 splitting. All models we mentioned are trained using the train-from-scratch strategy, and the total number of training epochs is 500. In setting the learning rate, we use linear warm-up in the first three epochs, and the subsequent epochs set the corresponding decay manner according to the model scale. As for the last 15 epochs, we turn mosaic data augmentation off. For more settings, please refer to the Appendix.
【翻译】我们使用MS COCO数据集验证了所提出的方法。所有实验设置都遵循YOLOv7 AF [63],数据集采用MS COCO 2017的划分方式。我们提到的所有模型都采用从头训练的策略,总训练轮数为500个epochs。在学习率设置方面,我们在前三个epochs使用线性预热,后续epochs根据模型规模设置相应的衰减方式。至于最后15个epochs,我们关闭了马赛克数据增强。更多设置请参考附录。
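下面把上述训练日程整理成一个示意性草图(预热轮数、总轮数与关闭 mosaic 的时点为文中给定,学习率衰减函数此处用余弦形式作占位,论文实际按模型规模选择衰减方式):

```python
# 示意性草图(预热 3 个 epoch、共 500 个 epoch、最后 15 个 epoch 关闭 mosaic 为文中给定;
# 衰减函数此处用余弦形式作占位,论文实际按模型规模选择对应的衰减方式):
import math
import torch

def lr_lambda(epoch: int, total_epochs: int = 500, warmup: int = 3) -> float:
    if epoch < warmup:
        return (epoch + 1) / warmup         # 前三个 epoch 线性预热
    progress = (epoch - warmup) / (total_epochs - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))   # 占位用的余弦衰减

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

for epoch in range(500):
    use_mosaic = epoch < 500 - 15           # 最后 15 个 epoch 关闭 mosaic 数据增强
    # ...此处省略一个 epoch 的训练;数据管道根据 use_mosaic 决定是否启用增强
    opt.step()                              # 占位:实际应在每个 batch 上调用
    sched.step()
```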
5.2. Implementation Details
We built general and extended versions of YOLOv9 based on YOLOv7 [63] and Dynamic YOLOv7 [36] respectively. In the design of the network architecture, we replaced ELAN [65] with GELAN using CSPNet blocks [64] with planned RepConv [63] as computational blocks. We also simplified the downsampling module and optimized the anchor-free prediction head. As for the auxiliary loss part of PGI, we completely follow YOLOv7's auxiliary head setting. Please see the Appendix for more details.
【翻译】我们分别基于YOLOv7 [63]和Dynamic YOLOv7 [36]构建了YOLOv9的通用版本和扩展版本。在网络架构设计中,我们使用CSPNet块[64]和规划的RepConv [63]作为计算块,将ELAN [65]替换为GELAN。我们还简化了下采样模块并优化了无锚点预测头。至于PGI的辅助损失部分,我们完全遵循YOLOv7的辅助头设置。更多详细信息请参阅附录。
Table 1. Comparison of state-of-the-art real-time object detectors.
【翻译】表1. 最先进实时目标检测器的比较。
5.3. Comparison with state-of-the-arts
Table 1 lists comparison of our proposed YOLOv9 with other train-from-scratch real-time object detectors. Overall, the best performing methods among existing methods are YOLO MS-S [7] for lightweight models, YOLO MS [7] for medium models, YOLOv7 AF [63] for general models, and YOLOv8-X [15] for large models. Compared with lightweight and medium model YOLO MS [7], YOLOv9 has about 10% less parameters and 5∼15% less calculations, but still has a 0.4∼0.6% improvement in AP. Compared with YOLOv7 AF, YOLOv9-C has 42% less parameters and 22% less calculations, but achieves the same AP (53%). Compared with YOLOv8-X, YOLOv9-E has 16% less parameters, 27% less calculations, and has significant improvement of 1.7% AP. The above comparison results show that our proposed YOLOv9 has significantly improved in all aspects compared with existing methods.
【翻译】表1列出了我们提出的YOLOv9与其他从头训练的实时目标检测器的比较。总体而言,现有方法中表现最佳的方法是:轻量级模型的YOLO MS-S [7],中等模型的YOLO MS [7],通用模型的YOLOv7 AF [63],以及大型模型的YOLOv8-X [15]。与轻量级和中等模型YOLO MS [7]相比,YOLOv9的参数约减少 10 % 10\% 10%,计算量减少 5 ∼ 15 % 5\sim15\% 5∼15%,但仍有0.4∼0.6%的AP改进。与YOLOv7 AF相比,YOLOv9-C的参数减少42%,计算量减少 22 % 22\% 22%,但达到了相同的AP( 53 % 53\% 53%)。与YOLOv8-X相比,YOLOv9-E的参数减少 16 % 16\% 16%,计算量减少 27 % 27\% 27%,并且AP显著提高了 1.7 % 1.7\% 1.7%。上述比较结果表明,我们提出的YOLOv9与现有方法相比在各个方面都有显著改进。
On the other hand, we also include ImageNet pretrained models in the comparison, and the results are shown in Figure 5. We compare them based on the parameters and the amount of computation respectively. In terms of the number of parameters, the best performing large model is RT DETR [43]. From Figure 5, we can see that YOLOv9 using conventional convolution is even better than YOLO MS using depth-wise convolution in parameter utilization. As for the parameter utilization of large models, it also greatly surpasses RT DETR using ImageNet pretrained model. Even better is that in the deep model, YOLOv9 shows the huge advantages of using PGI. By accurately retaining and extracting the information needed to map the data to the target, our method requires only 66% of the parameters while maintaining the accuracy as RT DETR-X.
【翻译】另一方面,我们也在比较中包含了ImageNet预训练模型,结果如图5所示。我们分别基于参数数量和计算量进行比较。在参数数量方面,表现最佳的大型模型是RT DETR [43]。从图5中我们可以看到,使用传统卷积的YOLOv9在参数利用率方面甚至比使用深度可分离卷积的YOLO MS更好。至于大型模型的参数利用率,它也大大超越了使用ImageNet预训练模型的RT DETR。更好的是,在深度模型中,YOLOv9展现了使用PGI的巨大优势。通过准确保留和提取将数据映射到目标所需的信息,我们的方法仅需要66%的参数就能保持与RT DETR-X相同的准确性。
Figure 5. Comparison of state-of-the-art real-time object detectors. The methods participating in the comparison all use ImageNet as pre-trained weights, including RT DETR [43], RTMDet [44], and PP-YOLOE [74], etc. The YOLOv9 that uses the train-from-scratch method clearly surpasses the performance of other methods.
【翻译】图5. 最先进实时目标检测器的比较。参与比较的方法都使用ImageNet作为预训练权重,包括RT DETR [43]、RTMDet [44]和PP-YOLOE [74]等。使用从头训练方法的YOLOv9明显超越了其他方法的性能。
As for the amount of computation, the best existing models from the smallest to the largest are YOLO MS [7], PP YOLOE [74], and RT DETR [43]. From Figure 5, we can see that YOLOv9 is far superior to the train-from-scratch methods in terms of computational complexity. In addition, if compared with those based on depth-wise convolution and ImageNet-based pretrained models, YOLOv9 is also very competitive.
【翻译】至于计算量,从最小到最大的最佳现有模型分别是YOLO MS [7]、PP YOLOE [74]和RT DETR [43]。从图5中我们可以看到,YOLOv9在计算复杂度方面远优于从头训练的方法。此外,如果与基于深度可分离卷积和基于ImageNet预训练模型的方法相比,YOLOv9也非常具有竞争力。
5.4. Ablation Studies
5.4.1 Generalized ELAN
For GELAN, we first do ablation studies for computational blocks. We used Res blocks [21], Dark blocks [49], and CSP blocks [64] to conduct experiments, respectively. Table 2 shows that after replacing convolutional layers in ELAN with different computational blocks, the system can maintain good performance. Users are indeed free to replace computational blocks and use them on their respective inference devices. Among different computational block replacements, CSP blocks perform particularly well. They not only reduce the amount of parameters and computation, but also improve AP by 0.7%. Therefore, we choose CSP-ELAN as the component unit of GELAN in YOLOv9.
【翻译】对于GELAN,我们首先对计算块进行消融研究。我们分别使用了Res块[21]、Dark块[49]和CSP块[64]进行实验。表2显示,在用不同的计算块替换ELAN中的卷积层后,系统能够保持良好的性能。用户确实可以自由地替换计算块并在各自的推理设备上使用它们。在不同的计算块替换中,CSP块表现特别好。它们不仅减少了参数量和计算量,还将AP提高了0.7%。因此,我们选择CSP-ELAN作为YOLOv9中GELAN的组件单元。
【解析】作者测试了三种不同的计算块:Res块(来自ResNet的残差块)、Dark块(来自Darknet的块)和CSP块(跨阶段部分网络块)。最终选择CSPELAN(CSP版本的ELAN)作为GELAN的基础组件,在性能和效率之间找到最佳平衡点。
Table 2. Ablation study on various computational blocks.
【翻译】表2. 各种计算块的消融研究。
Next, we conduct ELAN block-depth and CSP block-depth experiments on GELAN of different sizes, and display the results in Table 3. We can see that when the depth of ELAN is increased from 1 to 2, the accuracy is significantly improved. But when the depth is greater than or equal to 2, no matter it is improving the ELAN depth or the CSP depth, the number of parameters, the amount of computation, and the accuracy will always show a linear relationship. This means GELAN is not sensitive to the depth. In other words, users can arbitrarily combine the components in GELAN to design the network architecture, and have a model with stable performance without special design. In Table 3, for YOLOv9-{S, M, C}, we set the pairing of the ELAN depth and the CSP depth to {{2, 3}, {2, 1}, {2, 1}}.
【翻译】接下来,我们对不同规模的GELAN进行了ELAN块深度和CSP块深度实验,并将结果显示在表3中。我们可以看到,当ELAN的深度从1增加到2时,准确性显著提高。但当深度大于或等于2时,无论是提高ELAN深度还是CSP深度,参数数量、计算量和准确性都始终呈现线性关系。这意味着GELAN对深度不敏感。换句话说,用户可以任意组合GELAN中的组件来设计网络架构,并拥有一个性能稳定的模型,无需特殊设计。在表3中,对于YOLOv9-{S, M, C},我们将ELAN深度和CSP深度的配对分别设置为{2, 3}、{2, 1}、{2, 1}。
【解析】作者通过改变ELAN和CSP块的深度来研究GELAN的性能表现。实验发现,从深度1到深度2时有显著的性能提升,但之后继续增加深度时,性能、参数量和计算量呈现线性关系,说明GELAN架构具有良好的可扩展性和稳定性。
Table 3. Ablation study on ELAN and CSP depth.
【翻译】表3. ELAN和CSP深度的消融研究。
5.4.2 Programmable Gradient Information
In terms of PGI, we performed ablation studies on auxiliary reversible branch and multi-level auxiliary information on the backbone and neck, respectively. We designed auxiliary reversible branch ICN to use DHLC [34] linkage to obtain multi-level reversible information. As for multi-level auxiliary information, we use FPN and PAN for ablation studies and the role of PFH is equivalent to the traditional deep supervision. The results of all experiments are listed in Table 4. From Table 4, we can see that PFH is only effective in deep models, while our proposed PGI can improve accuracy under different combinations. Especially when using ICN, we get stable and better results. We also tried to apply the lead-head guided assignment proposed in YOLOv7 [63] to the PGI's auxiliary supervision, and achieved much better performance.
【翻译】在PGI方面,我们分别对主干网络和颈部的辅助可逆分支和多级辅助信息进行了消融研究。我们设计了辅助可逆分支ICN来使用DHLC [34]连接获得多级可逆信息。至于多级辅助信息,我们使用FPN和PAN进行消融研究,而PFH的作用等同于传统的深度监督。所有实验的结果都列在表4中。从表4中,我们可以看到PFH只在深度模型中有效,而我们提出的PGI可以在不同组合下提高准确性。特别是当使用ICN时,我们得到了稳定且更好的结果。我们还尝试将YOLOv7 [63]中提出的引导头引导分配应用于PGI的辅助监督,并取得了更好的性能。
Table 4. Ablation study on PGI of backbone and neck.
【翻译】表4. 主干网络和颈部PGI的消融研究。
We further implemented the concepts of PGI and deep supervision on models of various sizes and compared the results, these results are shown in Table 5 . As analyzed at the beginning, introduction of deep supervision will cause a loss of accuracy for shallow models. As for general models, introducing deep supervision will cause unstable performance, and the design concept of deep supervision can only bring gains in extremely deep models. The proposed PGI can effectively handle problems such as information bottleneck and information broken, and can comprehensively improve the accuracy of models of different sizes. The concept of PGI brings two valuable contributions. The first one is to make the auxiliary supervision method applicable to shallow models, while the second one is to make the deep model training process obtain more reliable gradients. These gradients enable deep models to use more accurate information to establish correct correlations between data and targets.
Table 5. Ablation study on PGI.
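For contrast with PGI, traditional deep supervision, which the paragraph above reports as helpful only for extremely deep models, simply attaches auxiliary heads to intermediate stages and sums their losses with the final one. A schematic PyTorch sketch follows (the stage sizes, heads, and the 0.3 weighting are invented for illustration):

```python
import torch
import torch.nn as nn

# Classic deep supervision: auxiliary heads tap intermediate stages and
# their losses are summed with the final loss. In shallow networks the
# intermediate features sit too close to the input for this to help,
# matching the instability reported above.
stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU()),
    nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU()),
])
aux_heads = nn.ModuleList(                      # one head per supervised stage
    nn.Conv2d(c, 10, 1) for c in (16, 32, 64)
)

x = torch.randn(2, 3, 32, 32)
target = torch.randint(0, 10, (2,))

feat, losses = x, []
for stage, head in zip(stages, aux_heads):
    feat = stage(feat)
    logits = head(feat).mean(dim=(2, 3))        # global-average-pooled logits
    losses.append(nn.functional.cross_entropy(logits, target))

# weight the intermediate losses lower than the final one
loss = losses[-1] + 0.3 * sum(losses[:-1])
loss.backward()
print([round(l.item(), 3) for l in losses])
```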
Finally, we show in Table 6 the results of gradually adding components, from the baseline YOLOv7 to YOLOv9-E. The GELAN and PGI we proposed bring all-round improvements to the model.
Table 6. Ablation study on GELAN and PGI.
5.5. Visualization
This section explores the information bottleneck issue and visualizes it. In addition, we also visualize how the proposed PGI uses reliable gradients to find the correct correlations between data and targets. In Figure 6 we show the visualization results of feature maps obtained by feedforward with random initial weights under different architectures. We can see that as the number of layers increases, the original information of all architectures gradually decreases. For example, at the 50th layer of PlainNet, it is difficult to see the location of objects, and all distinguishable features are lost at the 100th layer. As for ResNet, although the position of the object can still be seen at the 50th layer, the boundary information has been lost. When the depth reaches the 100th layer, the whole image becomes blurry. Both CSPNet and the proposed GELAN perform very well, and both can maintain features that support clear identification of objects up to the 200th layer. Among the comparisons, GELAN gives more stable results and clearer boundary information.
【Analysis】By running the forward pass with randomly initialized weights, the authors exclude the influence of training and observe purely how the architecture itself transmits information. PlainNet, the most basic structure with no special design, suffers the most severe information loss. ResNet alleviates the problem with residual connections, but features still blur at very large depths. CSPNet's cross-stage partial connections preserve the flow of information to a certain extent, while GELAN, the architecture proposed by the authors, performs best at retaining information: even at a depth of 200 layers it keeps fairly clear features, showing that its design effectively addresses the information bottleneck in deep networks.
Figure 6. Feature maps (visualization results) output by random initial weights of PlainNet, ResNet, CSPNet, and GELAN at different depths. After 100 layers, ResNet begins to produce feedforward output that is enough to obfuscate object information. Our proposed GELAN can still retain quite complete information up to the 150th layer, and is still sufficiently discriminative up to the 200th layer.
【Analysis】The figure spells out the concrete experimental outcome. 100 layers is a dividing line: at this depth, ResNet starts to noticeably obfuscate object information, meaning the network can no longer clearly distinguish the objects in the image. GELAN's advantage is that it maintains the completeness and discriminability of information at much greater depth. At layer 150 it still retains "quite complete information", so the main features of the original image remain recognizable; at layer 200 it is still "sufficiently discriminative", so the network can still tell object categories apart. This capability is crucial for deep models: greater depth usually means stronger representational power, but if information degrades severely along the way, adding depth lowers performance instead. GELAN keeps the network deep while avoiding the information bottleneck, enabling better performance.
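The visualization protocol above is easy to reproduce in spirit: randomly initialize a network, run one forward pass on an image, and render the channel-mean activation at several depths. Below is a minimal sketch for the plain convolutional stack (the PlainNet case); swapping residual or CSP-style blocks into the same loop would produce the other columns of Figure 6. The layer widths, tap depths, and file name are arbitrary choices for the example.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

# A plain 200-layer conv stack with random (untrained) weights.
stem = nn.Conv2d(3, 16, 3, padding=1)
layers = nn.ModuleList(
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.SiLU())
    for _ in range(200)
)

img = torch.rand(1, 3, 128, 128)        # stand-in for a real input image
snapshots, taps = {}, (50, 100, 150, 200)

with torch.no_grad():
    feat = stem(img)
    for depth, layer in enumerate(layers, start=1):
        feat = layer(feat)
        if depth in taps:                # channel-mean map at the tapped depth
            snapshots[depth] = feat.mean(dim=1)[0].numpy()

fig, axes = plt.subplots(1, len(taps), figsize=(12, 3))
for ax, depth in zip(axes, taps):
    ax.imshow(snapshots[depth], cmap="viridis")
    ax.set_title(f"layer {depth}")
    ax.axis("off")
plt.savefig("plainnet_feature_maps.png")
```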
Figure 7. PAN feature maps (visualization results) of GELAN and YOLOv9 (GELAN + PGI) after one epoch of bias warm-up. GELAN originally had some divergence, but after adding PGI's reversible branch, it is more capable of focusing on the target object.
【Analysis】The figure describes a controlled comparison. "Bias warm-up" is a common technique in deep learning training: a small learning rate is used in the early phase to adjust the parameters gradually and avoid violent oscillations at the start of training. PAN effectively fuses features across different scales. The "divergence" here refers to instability during learning, manifested as attention scattering to regions that should not matter, or imprecise judgments of object boundaries. PGI's reversible branch supplies additional gradient information that helps the network focus on the truly important target regions, improving the precision and stability of feature learning.
Figure 7 is used to show whether PGI can provide more reliable gradients during the training process, so that the parameters used for updating can effectively capture the relationship between the input data and the target. Figure 7 shows the visualization results of the feature maps of GELAN and YOLOv9 (GELAN + PGI) during PAN bias warm-up. From the comparison of Figure 7 (b) and (c), we can clearly see that PGI accurately and concisely captures the area containing objects. As for GELAN without PGI, we found that it diverged when detecting object boundaries and also produced unexpected responses in some background areas. This experiment confirms that PGI can indeed provide better gradients to update parameters, enabling the feedforward stage of the main branch to retain more important features.
【Analysis】The core role of PGI is to provide "more reliable gradients"; here, "reliable" means the gradient information more accurately reflects the true relationship between the input data and the target labels, without being disturbed by noise or irrelevant information. Through its reversible-branch mechanism, PGI supplies an additional supervision signal that helps the main network learn correct feature representations, so that more information useful for the final task is retained during the forward pass.
6. Conclusions
In this paper, we propose to use PGI to solve the information bottleneck problem and the problem that the deep supervision mechanism is not suitable for lightweight neural networks. We designed GELAN, a highly efficient and lightweight neural network. In terms of object detection, GELAN has strong and stable performance at different computational-block and depth settings, and it can indeed be widely extended into models suitable for various inference devices. For the above two issues, the introduction of PGI allows both lightweight and deep models to achieve significant improvements in accuracy. YOLOv9, designed by combining PGI and GELAN, shows strong competitiveness. Its design allows the deep model to reduce the number of parameters by 49% and the amount of computation by 43% compared with YOLOv8, while still achieving a 0.6% AP improvement on the MS COCO dataset.
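The parameter comparison in the conclusion is easy to sanity-check for any pair of models. The snippet below shows the generic counting recipe with placeholder models, since constructing the real YOLOv8 and YOLOv9 networks depends on their respective repositories:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Placeholder models; in practice, build YOLOv8 / YOLOv9 from their repos.
model_a = nn.Sequential(nn.Conv2d(3, 64, 3), nn.Conv2d(64, 64, 3))
model_b = nn.Sequential(nn.Conv2d(3, 32, 3), nn.Conv2d(32, 32, 3))

pa, pb = count_parameters(model_a), count_parameters(model_b)
print(f"reduction: {100 * (pa - pb) / pa:.1f}%")  # analogous to the 49% figure
```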