YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications论文精读(逐段解析)
论文地址:https://arxiv.org/abs/2209.02976
美团发布
2022
Abstract
For years, YOLO series have been de facto industry-level standard for efficient object detection. The YOLO community has prospered overwhelmingly to enrich its use in a multitude of hardware platforms and abundant scenarios. In this technical report, we strive to push its limits to the next level, stepping forward with an unwavering mindset for industry application. Considering the diverse requirements for speed and accuracy in the real environment, we extensively examine the up-to-date object detection advancements either from industry or academy. Specifically, we heavily assimilate ideas from recent network design, training strategies, testing techniques, quantization and optimization methods. On top of this, we integrate our thoughts and practice to build a suite of deployment-ready networks at various scales to accommodate diversified use cases. With the generous permission of YOLO authors, we name it YOLOv6. We also express our warm welcome to users and contributors for further enhancement. For a glimpse of performance, our YOLOv6-N hits 35.9% AP on COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S strikes 43.5% AP at 495 FPS, outperforming other mainstream detectors at the same scale (YOLOv5-S, YOLOX-S and PPYOLOE-S). Our quantized version of YOLOv6-S even brings a new state-of-the-art 43.3% AP at 869 FPS. Furthermore, YOLOv6-M/L also achieves better accuracy performance (i.e., 49.5%/52.3%) than other detectors with the similar inference speed. We carefully conducted experiments to validate the effectiveness of each component. Our code is made available at https://github.com/meituan/YOLOv6.
【翻译】多年来,YOLO系列一直是高效目标检测的事实行业标准。YOLO社区蓬勃发展,极大地丰富了它在多种硬件平台和大量场景中的应用。在这份技术报告中,我们努力将其推向更高层次,以坚定不移的工业应用思维向前迈进。考虑到现实环境中对速度和精度的多样化需求,我们广泛考查了来自工业界或学术界的最新目标检测进展。具体而言,我们大量吸收了近期网络设计、训练策略、测试技术、量化和优化方法的思想。在此基础上,我们整合自己的思考和实践,构建了一套不同规模的部署就绪网络来适应多样化的使用案例。在YOLO作者的慷慨许可下,我们将其命名为YOLOv6。我们也热烈欢迎用户和贡献者进一步完善。从性能一瞥来看,我们的YOLOv6-N在NVIDIA Tesla T4 GPU上达到了COCO数据集35.9%的AP,吞吐量为1234 FPS。YOLOv6-S在495 FPS下达到43.5%的AP,超越了同等规模的其他主流检测器(YOLOv5-S、YOLOX-S和PPYOLOE-S)。我们的YOLOv6-S量化版本甚至在869 FPS下达到了新的最先进43.3%的AP。此外,YOLOv6-M/L也在相似推理速度下实现了更好的精度性能(即49.5%/52.3%)。我们仔细进行了实验来验证每个组件的有效性。我们的代码可在https://github.com/meituan/YOLOv6获得。
【解析】美团团队在YOLO系列算法的基础上进行了全面的改进和优化。从多个维度入手:网络架构设计让模型更高效,训练策略让模型学得更好,测试技术让评估更准确,量化方法让模型在实际部署时更快。最终构建出的YOLOv6在不同规模上都有对应的版本,可以适应从轻量级到重量级的各种应用场景。
Figure 1: Comparison of state-of-the-art efficient object detectors. Both latency and throughput (at a batch size of 32) are given for a handy reference. All models are tested with TensorRT 7 except that the quantized model is with TensorRT 8.
【翻译】图1:最先进的高效目标检测器的比较。延迟和吞吐量(批量大小为32)都给出了便于参考。除了量化模型使用TensorRT 8外,所有模型都使用TensorRT 7进行测试。
1. Introduction
YOLO series have been the most popular detection frameworks in industrial applications, for its excellent balance between speed and accuracy. Pioneering works of YOLO series are YOLOv1-3 [ 32 – 34 ], which blaze a new trail of one-stage detectors along with the later substantial improvements. YOLOv4 [ 1 ] reorganized the detection framework into several separate parts (backbone, neck and head), and verified bag-of-freebies and bag-of-specials at the time to design a framework suitable for training on a single GPU. At present, YOLOv5 [ 10 ], YOLOX [ 7 ], PPYOLOE [ 44 ] and YOLOv7 [ 42 ] are all the competing candidates for efficient detectors to deploy. Models at different sizes are commonly obtained through scaling techniques.
【翻译】YOLO系列因其在速度和精度之间的卓越平衡,一直是工业应用中最受欢迎的检测框架。YOLO系列的开创性工作是YOLOv1-3,它们开辟了单阶段检测器的新道路,随后的大幅改进也随之而来。YOLOv4将检测框架重新组织为几个独立的部分(主干网络、颈部和头部),并在当时验证了免费的技巧和特殊的技巧,以设计一个适合在单个GPU上训练的框架。目前,YOLOv5、YOLOX、PPYOLOE和YOLOv7都是部署高效检测器的竞争候选者。不同大小的模型通常通过缩放技术获得。
【解析】YOLO系列算法之所以在工业界如此受欢迎,核心在于它找到了检测速度和准确性的最佳平衡点。早期的YOLOv1到v3建立了单阶段检测的基础框架,与传统的两阶段检测器(如R-CNN系列)相比,YOLO直接在一次前向传播中完成目标检测,大大提升了速度。到了YOLOv4时代,算法架构变得更加模块化,分为三个核心组件:backbone负责特征提取,neck负责特征融合,head负责最终预测。这种模块化设计让算法更容易理解和改进。YOLOv4还系统性地整理了当时可用的训练技巧,分为"bag-of-freebies"(不增加推理成本的技巧)和"bag-of-specials"(轻微增加推理成本但显著提升精度的技巧)。现在的YOLO家族已经非常庞大,不同的研究团队都在推出自己的版本,而模型的大小通常通过简单的缩放技术来实现,比如调整网络的宽度、深度或分辨率。YOLOv6其实是以面向工业场景应用需求为出发点和重点而研发的。
In this report, we empirically observed several important factors that motivate us to refurnish the YOLO framework: (1) Reparameterization from RepVGG [ 3 ] is a superior technique that is not yet well exploited in detection. We also notice that simple model scaling for RepVGG blocks becomes impractical, for which we consider that the elegant consistency of the network design between small and large networks is unnecessary. The plain single-path architecture is a better choice for small networks, but for larger models, the exponential growth of the parameters and the computation cost of the single-path architecture makes it infeasible; (2) Quantization of reparameterization-based detectors also requires meticulous treatment, otherwise it would be intractable to deal with performance degradation due to its heterogeneous configuration during training and inference. (3) Previous works [ 7 , 10 , 42 , 44 ] tend to pay less attention to deployment, whose latencies are commonly compared on high-cost machines like V100. There is a hardware gap when it comes to real serving environment. Typically, low-power GPUs like Tesla T4 are less costly and provide rather good inference performance. (4) Advanced domain-specific strategies like label assignment and loss function design need further verifications considering the architectural variance; (5) For deployment, we can tolerate the adjustments of the training strategy that improve the accuracy performance but not increase inference costs, such as knowledge distillation.
【翻译】在本报告中,我们通过经验观察到几个重要因素,促使我们重新设计YOLO框架:(1) 来自RepVGG的重参数化是一种优越的技术,在检测中尚未得到充分利用。我们还注意到,RepVGG块的简单模型缩放变得不切实际,为此我们认为小型和大型网络之间网络设计的优雅一致性是不必要的。简单的单路径架构对小型网络是更好的选择,但对于更大的模型,单路径架构的参数和计算成本呈指数增长,使其变得不可行;(2) 基于重参数化的检测器的量化也需要细致的处理,否则由于训练和推理期间的异构配置,性能下降将难以处理。(3) 以前的工作往往较少关注部署,其延迟通常在V100等高成本机器上进行比较。当涉及到真实服务环境时,存在硬件差距。通常,像Tesla T4这样的低功耗GPU成本较低,并提供相当好的推理性能。(4) 考虑到架构差异,标签分配和损失函数设计等先进的领域特定策略需要进一步验证;(5) 对于部署,我们可以容忍提高精度性能但不增加推理成本的训练策略调整,如知识蒸馏。
【解析】作者团队在重新设计YOLO框架时发现了五个关键问题。首先是重参数化技术的应用。RepVGG提出了一种巧妙的设计:训练时使用多分支结构来增强表达能力,推理时将这些分支合并成单一路径来提升速度。但这种技术在目标检测领域应用有限,而且简单的模型缩放在RepVGG上效果不佳。第二个问题是量化的挑战。重参数化模型在训练和推理时结构不同,这给量化带来了额外的复杂性,需要特别的处理方法。第三个问题触及了学术研究与工业应用的差距。很多研究都在V100这样的高端GPU上测试性能,但实际部署时更多使用T4这样的中端GPU,硬件差异会影响实际表现。第四个问题是算法组件的适配性。不同的网络架构可能需要不同的标签分配策略和损失函数,需要重新验证这些组件的有效性。最后一个问题关注部署友好性,即寻找那些能提升训练时精度但不增加推理成本的技术,知识蒸馏就是这样的典型例子。
With the aforementioned observations in mind, we bring the birth of YOLOv6, which accomplishes so far the best trade-off in terms of accuracy and speed. We show the comparison of YOLOv6 with other peers at a similar scale in Fig. 1 . To boost inference speed without much performance degradation, we examined the cutting-edge quantization methods, including post-training quantization (PTQ) and quantization-aware training (QAT), and accommodate them in YOLOv6 to achieve the goal of deployment-ready networks.
【翻译】考虑到上述观察结果,我们带来了YOLOv6,它在精度和速度方面实现了迄今为止最好的权衡。我们在图1中展示了YOLOv6与类似规模的其他同类算法的比较。为了在不大幅降低性能的情况下提升推理速度,我们研究了最先进的量化方法,包括训练后量化(PTQ)和量化感知训练(QAT),并将它们融入YOLOv6中,以实现部署就绪网络的目标。
【解析】基于前面提到的五个核心问题,作者团队开发了YOLOv6。这个新版本的主要目标是在保持高精度的同时最大化推理速度,特别关注实际部署场景。为了进一步提升速度,团队深入研究了量化技术。量化是将模型从浮点数精度(如FP32)降低到更低精度(如INT8)的过程,可以显著减少模型大小和计算量。PTQ是在训练完成后直接对模型进行量化,简单快速但可能损失精度;QAT则在训练过程中就考虑量化的影响,虽然训练成本更高但精度损失更小。通过结合这两种量化技术,YOLOv6能够在保持高精度的同时实现更快的推理速度。
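下面用纯Python给出对称线性量化(PTQ中最核心的一步)的极简示意:按权重最大绝对值确定缩放因子,把FP32数值映射到INT8再反量化。这只是原理演示,并非YOLOv6/TensorRT的实际实现,函数名与数值均为假设:

```python
def quantize_int8(weights):
    """对称线性量化:按最大绝对值确定缩放因子,FP32 -> INT8(示意)"""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """反量化:INT8 -> FP32 近似值"""
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# 对称量化下,量化误差不超过半个量化步长
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, recovered))
```

PTQ直接对训练好的权重做上述映射,无需重新训练;QAT则在训练中模拟这一"量化-反量化"过程,让权重分布提前适应量化误差,因此精度损失更小但训练成本更高。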
We summarize the main aspects of YOLOv6 as follows:
• We refashion a line of networks of different sizes tailored for industrial applications in diverse scenarios. The architectures at different scales vary to achieve the best speed and accuracy trade-off, where small models feature a plain single-path backbone and large models are built on efficient multi-branch blocks.
• We imbue YOLOv6 with a self-distillation strategy, performed both on the classification task and the regression task. Meanwhile, we dynamically adjust the knowledge from the teacher and labels to help the student model learn knowledge more efficiently during all training phases.
• We broadly verify the advanced detection techniques for label assignment, loss function and data augmentation techniques and adopt them selectively to further boost the performance.
• We reform the quantization scheme for detection with the help of RepOptimizer [ 2 ] and channel-wise distillation [ 36 ], which leads to an ever-fast and accurate detector with 43.3% COCO AP and a throughput of 869 FPS at a batch size of 32.
【翻译】我们总结YOLOv6的主要方面如下:
• 我们重新设计了一系列针对多样化场景中工业应用的不同尺寸网络。不同规模的架构各有差异,以实现最佳的速度和精度权衡,其中小型模型采用简单的单路径主干网络,大型模型基于高效的多分支块构建。
• 我们为YOLOv6注入了自蒸馏策略,在分类任务和回归任务上都进行了实施。同时,我们动态调整来自教师模型和标签的知识,帮助学生模型在所有训练阶段更有效地学习知识。
• 我们广泛验证了标签分配、损失函数和数据增强技术的先进检测技术,并有选择地采用它们来进一步提升性能。
• 我们借助RepOptimizer和通道级蒸馏改革了检测的量化方案,从而产生了一个速度更快、更准确的检测器,在批量大小为32时达到43.3%的COCO AP和869 FPS的吞吐量。
【解析】YOLOv6的核心创新可以概括为四个方面。首先是网络架构的差异化设计。与传统方法不同,YOLOv6不是简单地通过缩放系数来产生不同大小的模型,而是为不同规模的模型设计了不同的架构。小模型使用单路径设计来最大化推理效率,大模型则使用多分支设计来提升表达能力。其次是自蒸馏技术的创新应用。传统的知识蒸馏需要额外的教师模型,而自蒸馏让模型自己充当教师,通过动态调整不同信息源的权重来提升学习效果。第三是系统性的技术验证。团队没有盲目采用所有新技术,而是通过大量实验验证哪些技术真正有效,然后有选择地集成。最后是量化技术的深度优化。通过RepOptimizer和通道级蒸馏,解决了重参数化模型量化困难的问题,最终实现了令人印象深刻的性能指标。
2. Method
The renovated design of YOLOv6 consists of the following components: network design, label assignment, loss function, data augmentation, industry-handy improvements, and quantization and deployment:
【翻译】YOLOv6的全新设计包含以下组件:网络设计、标签分配、损失函数、数据增强、工业友好改进和量化部署:
【解析】YOLOv6的整体架构可以分为六个核心模块。网络设计是基础,决定了模型的表达能力和计算效率;标签分配负责在训练时为样本分配正负标签;损失函数指导模型的学习方向;数据增强提升模型的泛化能力;工业友好改进包含了一些实用的训练技巧;量化部署则专门针对实际应用场景进行优化。这六个模块相互配合,共同构成了YOLOv6的完整技术栈。
• Network Design: Backbone: Compared with other mainstream architectures, we find that RepVGG [ 3 ] backbones are equipped with more feature representation power in small networks at a similar inference speed, whereas it can hardly be scaled to obtain larger models due to the explosive growth of the parameters and computational costs. In this regard, we take RepBlock [ 3 ] as the building block of our small networks. For large models, we revise a more efficient CSP [ 43 ] block, named CSPStackRep Block. Neck: The neck of YOLOv6 adopts PAN topology [ 24 ] following YOLOv4 and YOLOv5. We enhance the neck with RepBlocks or CSPStackRep Blocks to have Rep-PAN. Head: We simplify the decoupled head to make it more efficient, called Efficient Decoupled Head.
【翻译】网络设计:主干网络:与其他主流架构相比,我们发现RepVGG主干网络在相似推理速度下为小型网络提供了更强的特征表示能力,但由于参数和计算成本的爆炸性增长,它很难扩展到更大的模型。在这方面,我们采用RepBlock作为小型网络的构建块。对于大型模型,我们修改了一个更高效的CSP块,命名为CSPStackRep Block。颈部:YOLOv6的颈部沿用YOLOv4和YOLOv5的PAN拓扑结构。我们用RepBlocks或CSPStackRep Blocks增强颈部,形成RepPAN。头部:我们简化了解耦头部以使其更高效,称为高效解耦头部。
【解析】网络设计采用了分层优化的策略。在主干网络设计上,团队发现RepVGG在小模型上表现优异,但扩展到大模型时会遇到计算瓶颈。这是因为RepVGG的单路径设计在小规模时能充分利用硬件并行性,但参数量随模型增大而急剧膨胀。因此YOLOv6采用了差异化策略:小模型使用RepBlock来最大化效率,大模型则使用改进的CSPStackRep Block来平衡性能和计算成本。颈部网络继承了YOLO系列成熟的PAN结构,这种自顶向下和自底向上的特征融合方式已经被证明非常有效。头部设计则追求简化,通过减少不必要的复杂性来提升推理速度。
• Label Assignment: We evaluate the recent progress of label assignment strategies [ 5 , 7 , 18 , 48 , 51 ] on YOLOv6 through numerous experiments, and the results indicate that TAL [ 5 ] is more effective and training-friendly.
【翻译】标签分配:我们通过大量实验评估了YOLOv6上标签分配策略的最新进展,结果表明TAL更有效且训练友好。
【解析】标签分配是目标检测训练中的关键环节,决定了哪些预测框被认为是正样本(应该检测到目标)或负样本(背景)。传统方法通常基于IoU阈值进行简单划分,但这种方法忽略了分类和回归任务之间的差异。TAL(Task Alignment Learning)提出了一种更智能的分配策略,它考虑了分类置信度和定位质量的联合优化,使得模型在训练时能够更好地平衡这两个子任务,从而提升整体性能并使训练过程更加稳定。
• Loss Function: The loss functions of the mainstream anchor-free object detectors contain classification loss, box regression loss and object loss. For each loss, we systematically experiment it with all available techniques and finally select VariFocal Loss [ 50 ] as our classification loss and SIoU [ 8 ]/GIoU [ 35 ] Loss as our regression loss.
【翻译】损失函数:主流无锚目标检测器的损失函数包含分类损失、框回归损失和目标损失。对于每种损失,我们系统地用所有可用技术进行实验,最终选择VariFocal Loss作为我们的分类损失,SIoU/GIoU Loss作为我们的回归损失。
【解析】损失函数的选择直接影响模型的学习效果。在无锚检测器中,分类损失负责判断每个位置是否存在目标以及目标的类别,回归损失负责预测目标框的精确位置和大小,目标损失则用于平衡正负样本。VariFocal Loss是Focal Loss的改进版本,它对正负样本采用不对称的处理方式,能够更好地处理样本不平衡问题。SIoU和GIoU都是IoU损失的改进版本,它们考虑了预测框和真实框之间更丰富的几何关系,能够提供更精确的定位监督信号。
Figure 2: The YOLOv6 framework (N and S are shown). Note for M/L, RepBlocks is replaced with CSPStackRep.
【翻译】图2:YOLOv6框架(显示了N和S版本)。注意对于M/L版本,RepBlocks被替换为CSPStackRep。
• Industry-handy improvements: We introduce additional common practice and tricks to improve the performance including self-distillation and more training epochs. For self-distillation, both classification and box regression are respectively supervised by the teacher model. The distillation of box regression is made possible thanks to DFL [ 20 ]. In addition, the proportion of information from the soft and hard labels is dynamically declined via cosine decay, which helps the student selectively acquire knowledge at different phases during the training process. In addition, we encounter the problem of the impaired performance without adding extra gray borders at evaluation, for which we provide some remedies.
【翻译】工业友好改进:我们引入了额外的常见实践和技巧来提高性能,包括自蒸馏和更多训练周期。对于自蒸馏,分类和框回归都分别由教师模型监督。框回归的蒸馏得益于DFL成为可能。此外,来自软标签和硬标签的信息比例通过余弦衰减动态下降,这有助于学生在训练过程的不同阶段有选择地获取知识。此外,我们遇到了在评估时不添加额外灰色边界会导致性能受损的问题,为此我们提供了一些补救措施。
【解析】工业友好改进主要包含了几个实用的训练技巧。自蒸馏是知识蒸馏的一种特殊形式,教师与学生采用相同的网络结构,由预训练好的模型自身充当教师来监督训练,无需引入额外的大型教师模型。DFL(Distribution Focal Loss)的引入使得回归任务也能进行知识蒸馏,这在传统方法中是比较困难的。动态调整软硬标签的比例是一个巧妙的设计:训练初期教师的软标签更容易学习,监督更多来自软标签;随着训练进行,软标签的权重按余弦规律衰减,监督逐渐转向硬标签(真实标签),这样能够在不同训练阶段提供最合适的监督信号。灰色边界问题是一个实际部署中经常遇到的工程问题,涉及到图像预处理和后处理的细节处理。此外,DFL是YOLO系列中的一个重要创新,它将传统的回归任务转换为分类任务来处理。具体来说,DFL不是直接预测边界框的坐标值,而是预测坐标值的概率分布。例如,对于一个边界框的左边界,传统方法直接预测一个数值,而DFL将可能的数值范围离散化为多个区间,然后预测每个区间的概率。这种设计的优势在于:1)提供了更丰富的监督信号,因为模型不仅要预测正确的值,还要学习值的不确定性;2)使得回归任务能够享受分类任务中成熟的技术,如Focal Loss的思想;3)为知识蒸馏提供了可能,因为概率分布比单一数值包含更多信息,更适合作为"软标签"进行蒸馏。DFL的引入显著提升了边界框回归的精度,特别是在处理模糊边界和小目标时表现出色。
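上文提到的软标签权重余弦衰减可以用一个简单的调度函数示意。注意distill_weight的权重上下限与总轮数均为假设值,并非论文原始超参:

```python
import math

def distill_weight(epoch, total_epochs, w_start=1.0, w_end=0.0):
    """软标签(教师)损失的权重按余弦从 w_start 衰减到 w_end(上下限为假设值)"""
    cos = (1 + math.cos(math.pi * epoch / total_epochs)) / 2  # 从 1 平滑降到 0
    return w_end + (w_start - w_end) * cos

# 总损失的一种常见组合方式:L = L_hard + w(epoch) * L_soft
weights = [distill_weight(e, 300) for e in (0, 150, 300)]
```

这样训练初期学生主要向教师的软标签学习,末期则几乎完全由真实标签监督,权重变化平滑无跳变。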
• Quantization and deployment: To cure the performance degradation in quantizing reparameterization-based models, we train YOLOv6 with RepOptimizer [ 2 ] to obtain PTQ-friendly weights. We further adopt QAT with channel-wise distillation [ 36 ] and graph optimization to pursue extreme performance. Our quantized YOLOv6-S hits a new state of the art with 42.3% AP and a throughput of 869 FPS (batch size = 32).
【翻译】量化和部署:为了解决重参数化模型量化时的性能下降问题,我们使用RepOptimizer训练YOLOv6以获得PTQ友好的权重。我们进一步采用QAT与通道级蒸馏和图优化来追求极致性能。我们的量化YOLOv6-S达到了新的最先进水平,AP为42.3%,吞吐量为869 FPS(批量大小为32)。
【解析】量化和部署是将模型从研究阶段转向实际应用的关键步骤。重参数化模型的量化面临特殊挑战,因为训练时的多分支结构和推理时的单分支结构之间存在差异。RepOptimizer是专门为解决这个问题设计的优化器,它能够产生对量化更友好的权重分布。PTQ(Post-Training Quantization)是训练后量化,速度快但精度损失可能较大;QAT(Quantization-Aware Training)是量化感知训练,虽然训练成本更高但能够显著减少精度损失。通道级蒸馏进一步减少了量化带来的精度损失,而图优化则从计算图层面进行优化。最终实现的869 FPS吞吐量在保持高精度的同时达到了实用级别的推理速度。
2.1. Network Design
A one-stage object detector is generally composed of the following parts: a backbone, a neck and a head. The backbone mainly determines the feature representation ability, meanwhile, its design has a critical influence on the inference efficiency since it carries a large portion of computation cost. The neck is used to aggregate the low-level physical features with high-level semantic features, and then build up pyramid feature maps at all levels. The head consists of several convolutional layers, and it predicts final detection results according to multi-level features assembled by the neck. It can be categorized as anchor-based and anchor-free, or rather parameter-coupled head and parameter-decoupled head from the structure’s perspective.
【翻译】单阶段目标检测器通常由以下部分组成:主干网络、颈部网络和头部网络。主干网络主要决定特征表示能力,同时,由于它承载了大部分计算成本,其设计对推理效率有关键影响。颈部网络用于聚合低级物理特征和高级语义特征,然后构建所有层级的金字塔特征图。头部网络由几个卷积层组成,根据颈部网络组装的多级特征预测最终检测结果。从结构角度来看,它可以分类为基于锚点和无锚点,或者参数耦合头部和参数解耦头部。
【解析】目标检测器的三个核心组件各司其职,形成了一个完整的检测流水线。主干网络是整个系统的基础,它从原始图像中提取特征。这就像人眼看物体时先要有基本的视觉感知能力,主干网络的作用就是赋予机器这种"视觉感知"。由于主干网络通常包含大量参数和计算量,它的设计直接影响模型的速度。颈部网络扮演着特征融合器的角色,它将主干网络不同阶段提取的特征进行巧妙组合。低级特征包含丰富的细节信息(如边缘、纹理),高级特征包含抽象的语义信息(如"这是一只猫")。颈部网络通过金字塔特征图的构建,让模型能够同时利用这两种互补的信息,从而更好地检测不同尺度的目标。头部网络是最终的决策者,它接收融合后的特征,输出具体的检测结果,包括目标的位置、大小和类别。在设计理念上,有锚点和无锚点的区别主要在于是否预先定义候选区域,而参数耦合与解耦的区别在于分类和定位任务是否共享参数。
In YOLOv6, based on the principle of hardware-friendly network design [ 3 ], we propose two scaled re-parameterizable backbones and necks to accommodate models at different sizes, as well as an efficient decoupled head with the hybrid-channel strategy. The overall architecture of YOLOv6 is shown in Fig. 2.
【翻译】在YOLOv6中,基于硬件友好网络设计的原则,我们提出了两种可缩放的重参数化主干网络和颈部网络来适应不同大小的模型,以及一个采用混合通道策略的高效解耦头部。YOLOv6的整体架构如图2所示。
【解析】硬件友好设计意味着不仅要追求理论上的性能指标,更要考虑在真实硬件环境中的运行效率。这包括内存访问模式、并行计算能力、缓存利用率等多个维度。可缩放的重参数化设计是一个巧妙的解决方案:它允许在训练时使用复杂的多分支结构来增强表达能力,在推理时转换为简单的单分支结构来提升速度。这种设计既不牺牲性能,又满足了部署需求。两种不同的主干和颈部设计体现了"因地制宜"的思想:小模型追求极致效率,大模型追求更强能力。混合通道策略的解耦头部则在保持分类和定位任务独立性的同时,通过巧妙的通道设计减少了计算开销,实现了精度和速度的平衡。
2.1.1 Backbone
As mentioned above, the design of the backbone network has a great impact on the effectiveness and efficiency of the detection model. Previously, it has been shown that multi-branch networks [ 13 , 14 , 38 , 39 ] can often achieve better classification performance than single-path ones [ 15 , 37 ], but often it comes with the reduction of the parallelism and results in an increase of inference latency. On the contrary, plain single-path networks like VGG [ 37 ] take the advantages of high parallelism and less memory footprint, leading to higher inference efficiency. Lately in RepVGG [ 3 ], a structural re-parameterization method is proposed to decouple the training-time multi-branch topology with an inference-time plain architecture to achieve a better speed-accuracy trade-off.
【翻译】如上所述,主干网络的设计对检测模型的有效性和效率有很大影响。此前已经表明,多分支网络通常比单路径网络能够实现更好的分类性能,但这往往伴随着并行性的降低,导致推理延迟的增加。相反,像VGG这样的简单单路径网络具有高并行性和较少内存占用的优势,从而带来更高的推理效率。最近在RepVGG中,提出了一种结构重参数化方法,将训练时的多分支拓扑与推理时的简单架构解耦,以实现更好的速度-精度权衡。
【解析】在深度学习发展的历程中,研究者们发现了一个有趣的现象:复杂的多分支网络结构虽然能够提供更强的特征表示能力,但却牺牲了计算效率。这是因为多分支结构需要更多的内存操作和分支跳转,这些操作在现代GPU和CPU上并不能得到很好的优化,从而降低了硬件的利用率。而VGG这样的简单网络结构虽然看起来"笨拙",但它的线性结构非常适合现代硬件的并行计算特性,能够充分利用硬件资源。RepVGG的出现提供了一个突破性的解决方案:在训练阶段使用复杂的多分支结构来获得更好的梯度流和特征表示,然后在推理阶段将这些分支融合成简单的卷积层,这样既保证了训练时的性能,又确保了推理时的效率。
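重参数化能够无损合并的根本原因是卷积对卷积核的线性性:3×3分支、1×1分支与恒等分支的输出之和,等价于把三个核叠加成一个3×3核后做一次卷积。下面用单通道、stride=1、padding=1的纯Python示意(省略了真实RepVGG中BN折叠的步骤,conv2d等函数为演示用实现):

```python
def conv2d(x, k):
    """单通道 3x3 卷积,stride=1,零填充 padding=1(朴素实现,仅作演示)"""
    h, w = len(x), len(x[0])
    pad = [[0.0] * (w + 2)] + [[0.0] + row + [0.0] for row in x] + [[0.0] * (w + 2)]
    return [[sum(pad[i + a][j + b] * k[a][b] for a in range(3) for b in range(3))
             for j in range(w)] for i in range(h)]

def fuse(k3, k1_center, identity=True):
    """把 1x1 分支和恒等分支吸收进 3x3 核:二者 pad 成 3x3 后只有中心位置非零"""
    fused = [row[:] for row in k3]
    fused[1][1] += k1_center
    if identity:
        fused[1][1] += 1.0  # 恒等分支等价于中心为 1 的 3x3 核
    return fused

x = [[1.0, 2.0], [3.0, 4.0]]
k3 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
k1 = 0.25
c3 = conv2d(x, k3)
# 训练时的三分支输出:conv3x3(x) + conv1x1(x) + x
multi = [[c3[i][j] + k1 * x[i][j] + x[i][j] for j in range(2)] for i in range(2)]
# 推理时等价的单分支输出:一次 3x3 卷积
single = conv2d(x, fuse(k3, k1))
assert all(abs(multi[i][j] - single[i][j]) < 1e-9 for i in range(2) for j in range(2))
```

合并后推理图中只剩一连串3×3卷积,这正是EfficientRep能充分吃满硬件算力的原因。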
Inspired by the above works, we design an efficient re-parameterizable backbone denoted as EfficientRep. For small models, the main component of the backbone is RepBlock during the training phase, as shown in Fig. 3 (a). And each RepBlock is converted to stacks of 3×3 convolutional layers (denoted as RepConv) with ReLU activation functions during the inference phase, as shown in Fig. 3 (b). Typically a 3×3 convolution is highly optimized on mainstream GPUs and CPUs and it enjoys higher computational density. Consequently, EfficientRep Backbone sufficiently utilizes the computing power of the hardware, resulting in a significant decrease in inference latency while enhancing the representation ability in the meantime.
【翻译】受上述工作启发,我们设计了一个高效的重参数化主干网络,称为EfficientRep。对于小模型,主干网络的主要组件在训练阶段是RepBlock,如图3(a)所示。每个RepBlock在推理阶段转换为带有ReLU激活函数的3×3卷积层堆栈(称为RepConv),如图3(b)所示。通常3×3卷积在主流GPU和CPU上得到了高度优化,具有更高的计算密度。因此,EfficientRep主干网络充分利用了硬件的计算能力,在增强表示能力的同时显著降低了推理延迟。
【解析】3×3卷积之所以成为深度学习的黄金标准,不仅仅是因为它在理论上的优秀性质,更重要的是它在硬件层面得到了极致的优化。现代GPU的CUDA核心、CPU的SIMD指令集,甚至专门的AI芯片都对3×3卷积操作进行了深度优化。这种优化体现在多个层面:内存访问模式更加规律,可以更好地利用缓存;计算模式更适合并行处理;数据复用率更高,减少了内存带宽的压力。EfficientRep通过重参数化技术,确保最终的推理结构完全由这些高效的3×3卷积组成,从而最大化了硬件利用率。这种设计思路反映了深度学习从"算法导向"向"工程导向"的转变,即不仅要追求理论上的先进性,更要考虑实际部署时的工程效率。
However, we notice that with the model capacity further expanded, the computation cost and the number of parameters in the single-path plain network grow exponentially. To achieve a better trade-off between the computation burden and accuracy, we revise a CSPStackRep Block to build the backbone of medium and large networks. As shown in Fig. 3 (c), CSPStackRep Block is composed of three 1×1 convolution layers and a stack of sub-blocks consisting of two RepVGG blocks [ 3 ] or RepConv (at training or inference respectively) with a residual connection. Besides, a cross stage partial (CSP) connection is adopted to boost performance without excessive computation cost. Compared with CSPRepResStage [ 45 ], it comes with a more succinct outlook and considers the balance between accuracy and speed.
【翻译】然而,我们注意到随着模型容量的进一步扩展,单路径简单网络的计算成本和参数数量呈指数增长。为了在计算负担和精度之间实现更好的权衡,我们修改了CSPStackRep Block来构建中型和大型网络的主干网络。如图3(c)所示,CSPStackRep Block由三个1×1卷积层和一个子块堆栈组成,子块由两个RepVGG块(在训练或推理时分别为RepConv)和残差连接组成。此外,采用了跨阶段部分(CSP)连接来在不过度增加计算成本的情况下提升性能。与CSPRepResStage相比,它具有更简洁的外观,并考虑了精度和速度之间的平衡。
【解析】模型容量与计算复杂度之间并非线性关系。当模型规模增大时,简单的线性扩展往往会导致计算资源的急剧膨胀,这种指数级增长很快就会超出硬件的承受能力。CSPStackRep Block的设计巧妙地解决了这个问题。其中1×1卷积被称为"瓶颈层",它的作用是降低特征图的通道维度,从而减少后续计算的复杂度。这种"先压缩再处理再扩展"的设计模式已经在ResNet、MobileNet等经典架构中得到了验证。CSP连接的引入更是一个关键创新,它将输入特征分为两部分:一部分直接传递到输出,另一部分经过复杂的处理后再与第一部分融合。这种设计既保留了梯度的直接传播路径,又允许网络学习更复杂的特征变换,同时由于只有部分特征需要经过复杂处理,计算成本得到了有效控制。残差连接则进一步增强了网络的训练稳定性和表达能力。
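CSP连接"一半直通、一半处理再拼接"的思路可以用下面的toy代码示意(真实网络中划分与融合由1×1卷积完成,此处仅演示通道划分带来的计算节省逻辑):

```python
def csp_block(channels, transform):
    """CSP 连接:通道一分为二,一半直通,一半经复杂变换,最后拼接"""
    half = len(channels) // 2
    shortcut, deep = channels[:half], channels[half:]
    processed = [transform(c) for c in deep]  # 只有一半通道承担主要计算
    return shortcut + processed

features = [1, 2, 3, 4]
out = csp_block(features, lambda c: c * 10)
assert out == [1, 2, 30, 40]  # 前一半原样直通,后一半经过变换
```

直通的那一半同时充当了梯度的"高速公路",这也是CSP结构训练稳定的原因之一。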
Figure 3: (a) RepBlock is composed of a stack of RepVGG blocks with ReLU activations at training. (b) During inference time, RepVGG block is converted to RepConv. (c) CSPStackRep Block comprises three 1×1 convolutional layers and a stack of sub-blocks of double RepConvs following the ReLU activations with a residual connection.
【翻译】图3:(a) RepBlock在训练时由带有ReLU激活的RepVGG块堆栈组成。(b) 在推理时,RepVGG块转换为RepConv。(c) CSPStackRep Block包含三个1×1卷积层和一个子块堆栈,子块由双RepConv组成,后跟ReLU激活和残差连接。
2.1.2 Neck
In practice, the feature integration at multiple scales has been proved to be a critical and effective part of object detection [ 9 , 21 , 24 , 40 ]. We adopt the modified PAN topology [ 24 ] from YOLOv4 [ 1 ] and YOLOv5 [ 10 ] as the base of our detection neck. In addition, we replace the CSPBlock used in YOLOv5 with RepBlock (for small models) or CSPStackRep Block (for large models) and adjust the width and depth accordingly. The neck of YOLOv6 is denoted as Rep-PAN.
【翻译】在实践中,多尺度特征融合已被证明是目标检测中关键且有效的部分。我们采用了来自YOLOv4和YOLOv5的改进PAN拓扑结构作为我们检测颈部网络的基础。此外,我们将YOLOv5中使用的CSPBlock替换为RepBlock(用于小模型)或CSPStackRep Block(用于大模型),并相应地调整宽度和深度。YOLOv6的颈部网络被称为Rep-PAN。
【解析】PAN(Path Aggregation Network)是常用的特征融合架构,它不仅实现了从深层到浅层的特征传递(自顶向下),还增加了从浅层到深层的信息流动(自底向上),形成了一个双向的信息交流通道。这种设计让网络能够在不同层级之间充分交换信息,使得每个层级的特征都能获得来自其他层级的补充信息。YOLOv6在继承这一优秀设计的基础上,根据自身的重参数化策略进行了定制化改进。将原有的CSPBlock替换为RepBlock或CSPStackRep Block,这样做的好处是保持了特征融合的有效性,同时也享受到了重参数化技术带来的推理加速优势。Rep-PAN这个命名其实就体现了这种设计思路:既保留了PAN的特征融合能力,又融入了Rep系列的高效推理特性。
2.1.3 Head
Efficient decoupled head The detection head of YOLOv5 is a coupled head with parameters shared between the classification and localization branches, while its counterparts in FCOS [ 41 ] and YOLOX [ 7 ] decouple the two branches, and additional two 3×3 convolutional layers are introduced in each branch to boost the performance.
【翻译】高效解耦头部:YOLOv5的检测头部是一个耦合头部,分类和定位分支之间共享参数,而FCOS和YOLOX中的对应组件将两个分支解耦,并在每个分支中引入额外的两个3×3卷积层来提升性能。
【解析】在目标检测任务中,网络需要同时完成两个不同的子任务:分类(判断目标是什么)和定位(确定目标在哪里)。这两个任务虽然相关但本质上是不同的。分类任务更关注语义信息,需要对目标的类别特征敏感;而定位任务更关注几何信息,需要对目标的边界和位置敏感。耦合头部意味着这两个任务共享相同的参数和特征表示,这虽然减少了参数量,但可能限制了各自任务的优化空间。解耦头部的设计思路是让每个任务拥有独立的参数和处理通道,这样可以让网络为不同任务学习到更专门化的特征表示。FCOS和YOLOX通过增加额外的卷积层进一步增强了各分支的表达能力,但这也带来了计算开销的增加。
In YOLOv6, we adopt a hybrid-channel strategy to build a more efficient decoupled head. Specifically, we reduce the number of the middle 3×3 convolutional layers to only one. The width of the head is jointly scaled by the width multiplier for the backbone and the neck. These modifications further reduce computation costs to achieve a lower inference latency.
【翻译】在YOLOv6中,我们采用混合通道策略来构建更高效的解耦头部。具体而言,我们将中间3×3卷积层的数量减少到只有一个。头部的宽度由主干网络和颈部网络的宽度乘数共同缩放。这些修改进一步降低了计算成本,实现了更低的推理延迟。
【解析】混合通道策略是YOLOv6在解耦头部设计上的创新点。传统的解耦头部为了提升性能往往会增加网络深度,但这种做法在小模型上可能得不偿失,因为额外的计算开销可能超过性能提升带来的收益。YOLOv6的解决方案是在保持解耦优势的同时,通过减少卷积层数量来控制计算复杂度。只使用一个中间卷积层是一个精心的平衡点:它既保证了特征的充分处理,又避免了过度的计算开销。宽度的联合缩放策略确保了整个网络的参数配置是协调一致的。
Anchor-free Anchor-free detectors stand out because of their better generalization ability and simplicity in decoding prediction results. The time cost of its post-processing is substantially reduced. There are two types of anchor-free detectors: anchor point-based [ 7 , 41 ] and keypoint-based [ 16 , 46 , 53 ]. In YOLOv6, we adopt the anchor point-based paradigm, whose box regression branch actually predicts the distance from the anchor point to the four sides of the bounding boxes.
【翻译】无锚点:无锚点检测器因其更好的泛化能力和预测结果解码的简单性而脱颖而出。其后处理的时间成本大幅降低。无锚点检测器有两种类型:基于锚点的和基于关键点的。在YOLOv6中,我们采用基于锚点的范式,其边界框回归分支实际上预测从锚点到边界框四个边的距离。
【解析】无锚点检测的出现是目标检测领域的一个重要进步。传统的基于锚点的方法需要预先定义大量的候选框(锚点),这些锚点的设计需要大量的先验知识和手工调优,不同的数据集可能需要不同的锚点配置,这限制了模型的泛化能力。而且锚点匹配过程涉及复杂的IoU计算和筛选,增加了后处理的复杂度。无锚点方法直接在特征图的每个位置预测目标的存在性和属性,避免了锚点设计和匹配的复杂性。基于锚点的无锚点方法虽然名字看起来矛盾,但这里的"锚点"更像是特征图上的参考位置,而不是传统意义上需要精心设计的候选框。这种方法通过预测从参考点到目标边界的距离来直接回归目标框,既保持了预测的直观性,又避免了传统锚点的复杂性。这种设计使得模型更容易适应不同尺度和形状的目标,提升了检测的灵活性和效率。
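锚点范式的解码过程非常直接:由锚点坐标与预测的四向距离(l, t, r, b)即可还原边界框。下面是一个示意实现(距离按特征图stride缩放是常见做法,此处作为假设参数):

```python
def decode_ltrb(px, py, l, t, r, b, stride=1.0):
    """把锚点 (px, py) 处预测的四向距离 (l, t, r, b) 解码为 (x1, y1, x2, y2)"""
    return (px - l * stride, py - t * stride, px + r * stride, py + b * stride)

box = decode_ltrb(100.0, 80.0, l=10.0, t=20.0, r=30.0, b=5.0)
assert box == (90.0, 60.0, 130.0, 85.0)
```

解码只需四次加减法,不涉及锚框的宽高先验和匹配逻辑,这正是无锚方法后处理开销小的直观体现。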
2.2. Label Assignment
Label assignment is responsible for assigning labels to predefined anchors during the training stage. Previous work has proposed various label assignment strategies ranging from simple IoU-based strategy and inside ground-truth method [ 41 ] to other more complex schemes [ 5 , 7 , 18 , 48 , 51 ].
【翻译】标签分配负责在训练阶段为预定义的锚点分配标签。之前的工作提出了各种标签分配策略,从简单的基于IoU的策略和内部真实值方法到其他更复杂的方案。
【解析】在目标检测的训练过程中,网络需要知道哪些预测框应该被认为是"正样本"(包含目标的框),哪些应该被认为是"负样本"(不包含目标或包含背景的框)。这个过程就是标签分配。早期的方法相对简单,比如计算预测框与真实框的IoU,如果IoU超过某个阈值就认为是正样本,否则就是负样本。或者判断锚点是否落在真实目标框内部。但是这些简单方法往往不能很好地处理复杂情况,比如目标重叠、尺度变化等问题,因此研究者们开发了更加精细和智能的标签分配策略。
SimOTA OTA [ 6 ] considers the label assignment in object detection as an optimal transmission problem. It defines positive/negative training samples for each ground-truth object from a global perspective. SimOTA [ 7 ] is a simplified version of OTA [ 6 ], which reduces additional hyperparameters and maintains the performance. SimOTA was utilized as the label assignment method in the early version of YOLOv6. However, in practice, we find that introducing SimOTA will slow down the training process. And it is not rare to fall into unstable training. Therefore, we desire a replacement for SimOTA.
【翻译】SimOTA:OTA将目标检测中的标签分配视为最优传输问题。它从全局角度为每个真实目标定义正负训练样本。SimOTA是OTA的简化版本,减少了额外的超参数并保持了性能。SimOTA在YOLOv6的早期版本中被用作标签分配方法。然而,在实践中,我们发现引入SimOTA会减慢训练过程。而且训练不稳定的情况并不罕见。因此,我们希望找到SimOTA的替代方案。
【解析】最优传输理论是数学中的一个分支,它研究如何以最小代价将一堆"货物"从一个分布转移到另一个分布。OTA巧妙地将这个理论应用到目标检测中:将真实目标看作"供应方",将所有可能的预测位置看作"需求方",然后寻找一个最优的分配方案,使得每个真实目标都能以最小的"传输成本"分配到最合适的预测位置。这种全局优化的思路比传统的局部判断方法更加科学,因为它考虑了所有目标和所有预测位置之间的关系。SimOTA在保持这种优势的同时简化了计算过程,减少了需要调节的参数。但是在实际应用中,这种复杂的优化过程增加了计算开销,而且全局优化有时会导致训练过程中的振荡,特别是在训练初期当网络预测还不够准确时,全局优化可能会产生不稳定的梯度更新。
Task alignment learning Task Alignment Learning (TAL) was first proposed in TOOD [ 5 ], in which a unified metric of classification score and predicted box quality is designed. The IoU is replaced by this metric to assign object labels. To a certain extent, the problem of the misalignment of tasks (classification and box regression) is alleviated.
【翻译】任务对齐学习:任务对齐学习(TAL)最初在TOOD中提出,其中设计了一个统一的分类得分和预测框质量度量。用这个度量替代IoU来分配目标标签。在一定程度上,缓解了任务不对齐(分类和边界框回归)的问题。
【解析】在传统的目标检测中存在一个根本性的矛盾:分类任务和定位任务使用的是不同的评价标准。分类任务关心的是"这是什么类别",通常用分类得分来衡量;而定位任务关心的是"位置是否准确",通常用IoU来衡量。这就造成了一个问题:一个分类得分很高的预测框,其定位可能很不准确;反之,一个定位很准确的框,其分类得分可能不高。TAL的核心思想是设计一个新的度量标准,它同时考虑分类的准确性和定位的准确性,将这两个原本独立的任务统一起来。这样做的好处是让网络在训练时能够更好地平衡这两个任务,避免出现"顾此失彼"的情况。当使用这个统一度量来进行标签分配时,那些既分类准确又定位准确的预测框会被优先选为正样本,这更符合我们对"好的检测结果"的直观理解。
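TOOD中这一统一度量的形式是 t = s^α · u^β,其中s为分类得分、u为预测框与真值的IoU,α、β为超参。下面的示意代码演示了它如何压低"分类高但定位差"的预测(具体取值仅作演示):

```python
def alignment_metric(cls_score, iou, alpha=1.0, beta=6.0):
    """TOOD/TAL 的任务对齐度量 t = s^alpha * u^beta(alpha、beta 为超参)"""
    return (cls_score ** alpha) * (iou ** beta)

t_imbalanced = alignment_metric(0.9, 0.5)  # 分类得分高,但定位差
t_balanced = alignment_metric(0.7, 0.8)    # 两项表现均衡
assert t_balanced > t_imbalanced  # 均衡的预测获得更高的对齐度量
```

标签分配时按t排序选取正样本,于是只有"既分得准又框得准"的预测才会被当作高质量正样本。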
The other main contribution of TOOD is about the task-aligned head (T-head). T-head stacks convolutional layers to build interactive features, on top of which the Task-Aligned Predictor (TAP) is used. PP-YOLOE [ 45 ] improved T-head by replacing the layer attention in T-head with the lightweight ESE attention, forming ET-head. However, we find that the ET-head will deteriorate the inference speed in our models and it comes with no accuracy gain. Therefore, we retain the design of our Efficient decoupled head.
【翻译】TOOD的另一个主要贡献是关于任务对齐头部(T-head)。T-head堆叠卷积层来构建交互特征,在此基础上使用任务对齐预测器(TAP)。PP-YOLOE通过用轻量级ESE注意力替换T-head中的层注意力来改进T-head,形成了ET-head。然而,我们发现ET-head会降低我们模型的推理速度,并且没有带来精度提升。因此,我们保留了高效解耦头部的设计。
【解析】T-head的设计理念是让分类和定位两个任务能够更好地相互协作。通过堆叠卷积层,网络能够学习到更复杂的特征交互模式,让分类分支和定位分支之间能够共享有用的信息。任务对齐预测器进一步强化了这种协作关系。PP-YOLOE在此基础上引入了ESE注意力机制,试图让网络能够自动学习哪些特征对于不同任务更重要。但是YOLOv6的实验结果表明,这种复杂的设计在他们的模型架构中并没有带来预期的收益。事实上,并不是所有理论上先进的技术都能在所有场景下带来实际的性能提升。有时候,简单有效的设计反而能够取得更好的效果。YOLOv6选择保留自己设计的高效解耦头部。
Furthermore, we observed that TAL could bring more performance improvement than SimOTA and stabilize the training. Therefore, we adopt TAL as our default label assignment strategy in YOLOv6.
【翻译】此外,我们观察到TAL比SimOTA能带来更多的性能改进并稳定训练。因此,我们采用TAL作为YOLOv6中的默认标签分配策略。
【解析】经过详细的对比实验,YOLOv6团队发现TAL在多个方面都优于SimOTA。首先是性能提升更明显,这意味着使用TAL的模型能够达到更高的检测精度。其次是训练稳定性更好,这对于实际应用来说非常重要,因为稳定的训练过程意味着更可预测的结果和更少的调参工作。TAL之所以能够取得这样的效果,主要是因为它的设计更加直接和有针对性:通过统一的度量标准来解决任务对齐问题,避免了SimOTA中复杂的全局优化可能带来的不稳定性。
2.3. Loss Functions
Object detection contains two sub-tasks: classification and localization, corresponding to two loss functions: classification loss and box regression loss. For each sub-task, there are various loss functions presented in recent years. In this section, we will introduce these loss functions and describe how we select the best ones for YOLOv6.
【翻译】目标检测包含两个子任务:分类和定位,对应两个损失函数:分类损失和边界框回归损失。对于每个子任务,近年来提出了各种损失函数。在本节中,我们将介绍这些损失函数,并描述我们如何为YOLOv6选择最佳的损失函数。
【解析】目标检测本质上是一个多任务学习问题。分类任务要求网络能够准确识别图像中目标的类别,比如区分猫、狗、汽车等;定位任务则要求网络精确预测目标在图像中的位置和大小,通常用边界框来表示。这两个任务虽然紧密相关,但各自的优化目标不同。分类任务关注的是语义理解,需要网络学习到不同类别之间的判别性特征;定位任务关注的是几何精度,需要网络学习到目标的空间分布和形状信息。因此,需要设计不同的损失函数来指导这两个任务的学习。分类损失通常基于交叉熵或其变种,用来衡量预测类别概率分布与真实标签之间的差异;边界框回归损失则用来衡量预测边界框与真实边界框之间的几何差异。选择合适的损失函数对于模型性能至关重要,不同的损失函数会引导网络学习不同的特征表示,从而影响最终的检测效果。
2.3.1 Classification Loss
Improving the performance of the classifier is a crucial part of optimizing detectors. Focal Loss [ 22 ] modified the traditional cross-entropy loss to solve the problems of class imbalance either between positive and negative examples, or hard and easy samples. To tackle the inconsistent usage of the quality estimation and classification between training and inference, Quality Focal Loss (QFL) [ 20 ] further extended Focal Loss with a joint representation of the classification score and the localization quality for the supervision in classification. Whereas VariFocal Loss (VFL) [ 50 ] is rooted from Focal Loss [ 22 ], but it treats the positive and negative samples asymmetrically. By considering positive and negative samples at different degrees of importance, it balances learning signals from both samples. Poly Loss [ 17 ] decomposes the commonly used classification loss into a series of weighted polynomial bases. It tunes polynomial coefficients on different tasks and datasets, which is proved better than Cross-entropy Loss and Focal Loss through experiments.
We assess all these advanced classification losses on YOLOv6 to finally adopt VFL [ 50 ].
【翻译】提高分类器的性能是优化检测器的关键部分。Focal Loss修改了传统的交叉熵损失,以解决正负样本之间或困难样本与简单样本之间的类别不平衡问题。为了解决训练和推理之间质量估计与分类使用不一致的问题,Quality Focal Loss (QFL)进一步扩展了Focal Loss,在分类监督中联合表示分类得分和定位质量。而VariFocal Loss (VFL)源自Focal Loss,但它对正负样本进行非对称处理。通过考虑正负样本的不同重要程度,它平衡了来自两种样本的学习信号。Poly Loss将常用的分类损失分解为一系列加权多项式基础。它在不同任务和数据集上调整多项式系数,实验证明它比交叉熵损失和Focal Loss效果更好。我们在YOLOv6上评估了所有这些先进的分类损失,最终采用了VFL。
【解析】分类损失函数的选择直接影响检测器能否准确识别目标类别。传统的交叉熵损失虽然简单有效,但在目标检测场景中面临一个严重问题:样本不平衡。在一张图像中,背景区域(负样本)往往远多于目标区域(正样本),而且大部分负样本都是容易分类的,这会导致网络训练时被大量简单的负样本主导,难以学好真正重要的正样本和困难样本。Focal Loss通过引入调制因子,降低了简单样本的损失权重,让网络更专注于学习困难样本,这种重新加权的策略显著改善了类别不平衡问题。QFL在此基础上进一步创新,它认识到分类任务和定位任务之间存在不一致性:训练时我们分别优化分类得分和定位质量,但测试时却用分类得分来排序检测结果。QFL通过将定位质量直接融入分类标签中,让网络学习一个既反映分类准确性又反映定位准确性的统一得分。VFL则从另一个角度思考问题:正样本和负样本本质上是不同的,应该用不同的策略来处理。对于正样本,VFL使用目标质量得分作为软标签;对于负样本,仍然使用硬标签。这种非对称设计让网络能够更细致地学习正样本的质量差异,同时保持对负样本的明确抑制。Poly Loss则是从数学角度重新审视损失函数设计,它将传统损失函数表示为多项式的线性组合,通过调整不同项的系数来适应具体的任务和数据特性,这种灵活性使其能够在多种场景下取得更好的效果。YOLOv6经过全面对比实验后选择了VFL,说明VFL在它们的模型架构和数据集上能够取得最佳的精度提升效果。
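VFL的非对称处理可以用单样本的示意实现来说明:正样本(质量得分q>0)以q作为软标签并用q加权,不做降权;负样本(q=0)沿用Focal式的p^γ降权(α、γ取VFL论文常用量级,此处仅作原理演示):

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """VariFocal Loss 的单样本示意。p: 预测得分;q: 目标质量得分(负样本为 0)"""
    eps = 1e-12
    if q > 0:  # 正样本:以 q 为软标签并用 q 加权,不做降权
        return -q * (q * math.log(p + eps) + (1 - q) * math.log(1 - p + eps))
    # 负样本:Focal 式非对称降权,抑制海量简单背景样本
    return -alpha * (p ** gamma) * math.log(1 - p + eps)

# 低置信度的简单负样本贡献的损失远小于高置信度的假阳性
assert varifocal_loss(0.05, 0.0) < varifocal_loss(0.9, 0.0)
```

正样本不降权,是因为高质量正样本本就稀缺,VFL希望保留它们完整的学习信号。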
2.3.2 Box Regression Loss
Box regression loss provides significant learning signals for localizing bounding boxes precisely. L1 loss is the original box regression loss in early works. Progressively, a variety of well-designed box regression losses have sprung up, such as IoU-series loss [ 8 , 11 , 35 , 47 , 52 ] and probability loss [ 20 ].
【翻译】边界框回归损失为精确定位边界框提供了重要的学习信号。L1损失是早期工作中的原始边界框回归损失。随着时间推移,出现了各种精心设计的边界框回归损失,如IoU系列损失和概率损失。
【解析】边界框回归损失作用是指导网络学习如何准确预测目标的位置和大小。在深度学习的早期阶段,研究者们使用简单的L1损失来优化边界框的坐标预测,这种方法直接计算预测坐标与真实坐标之间的绝对值差异。然而,L1损失存在一个根本性问题:它将边界框的四个坐标(通常是左上角和右下角的x、y坐标,或者是中心点坐标加上宽高)作为独立的数值来处理,完全忽略了这四个坐标之间的几何关系。这种处理方式并不符合边界框作为一个几何实体的本质特性。随着对目标检测理解的深入,研究者们意识到需要设计更加符合边界框几何特性的损失函数,这就催生了IoU系列损失和概率损失等更先进的方法。
IoU-series Loss IoU loss [ 47 ] regresses the four bounds of a predicted box as a whole unit. It has been proved to be effective because of its consistency with the evaluation metric. There are many variants of IoU, such as GIoU [ 35 ], DIoU [ 52 ], CIoU [ 52 ], α-IoU [ 11 ] and SIoU [ 8 ], etc., forming relevant loss functions. We experiment with GIoU, CIoU and SIoU in this work. And SIoU is applied to YOLOv6-N and YOLOv6-T, while others use GIoU.
【翻译】IoU系列损失:IoU损失将预测框的四个边界作为一个整体单元进行回归。它被证明是有效的,因为它与评估指标的一致性。IoU有许多变体,如GIoU、DIoU、CIoU、α-IoU和SIoU等,形成了相关的损失函数。我们在这项工作中实验了GIoU、CIoU和SIoU。SIoU应用于YOLOv6-N和YOLOv6-T,而其他版本使用GIoU。
【解析】IoU损失的革命性在于它将边界框视为一个不可分割的几何整体,而不是四个独立的数值。这种方法的核心优势是与目标检测的评估标准完全一致——我们在评估检测性能时使用的就是IoU指标,所以用IoU来指导训练过程是非常自然和合理的。传统的L1或L2损失可能会出现这样的情况:预测框在某些坐标上的误差很小,但整体的重叠度却很低;而IoU损失能够直接优化重叠度,确保训练目标与评估目标的一致性。随着研究的深入,IoU损失的各种改进版本应运而生。GIoU(Generalized IoU)解决了当两个框完全不重叠时IoU为零导致梯度消失的问题;DIoU(Distance IoU)在IoU的基础上加入了中心点距离的考虑,让优化过程更加稳定;CIoU(Complete IoU)进一步考虑了长宽比的匹配;α-IoU通过引入可调参数来平衡不同的优化目标;SIoU(SCYLLA IoU)则从角度信息的角度来改进IoU计算。YOLOv6根据不同模型规模的特点选择了不同的IoU变体:对于较小的模型(N和T版本),使用计算相对复杂但效果更好的SIoU;对于其他版本,则使用计算效率和效果平衡较好的GIoU。
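下面用一个极简的纯Python示例演示IoU与GIoU的计算,说明两框完全不重叠时IoU恒为零、而GIoU仍能给出负值信号从而提供梯度方向(示意代码,非YOLOv6官方实现):

```python
def iou_and_giou(a, b):
    """a, b: 边界框 (x1, y1, x2, y2)。返回 (IoU, GIoU)。"""
    # 交集
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # 并集
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # 最小外接框 C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    # GIoU = IoU - |C \ (A∪B)| / |C|
    giou = iou - (area_c - union) / area_c
    return iou, giou
```

两框不相交时IoU为0,无法区分"离得近"和"离得远";GIoU通过外接框惩罚项把这种差异变成可优化的信号。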
Probability Loss Distribution Focal Loss (DFL) [ 20 ] simplifies the underlying continuous distribution of box locations as a discretized probability distribution. It considers ambiguity and uncertainty in data without introducing any other strong priors, which is helpful to improve the box localization accuracy especially when the boundaries of the ground-truth boxes are blurred. Upon DFL, DFLv2 [ 19 ] develops a lightweight sub-network to leverage the close correlation between distribution statistics and the real localization quality, which further boosts the detection performance. However, DFL usually outputs 17× more regression values than general box regression, leading to a substantial overhead. The extra computation cost significantly hinders the training of small models. Whilst DFLv2 further increases the computation burden because of the extra sub-network. In our experiments, DFLv2 brings similar performance gain to DFL on our models. Consequently, we only adopt DFL in YOLOv6-M/L. Experimental details can be found in Section 3.3.3 .
【翻译】概率损失:分布焦点损失(DFL)将边界框位置的底层连续分布简化为离散概率分布。它考虑了数据中的模糊性和不确定性,而不引入任何其他强先验,这有助于提高边界框定位精度,特别是当真实框的边界模糊时。在DFL基础上,DFLv2开发了一个轻量级子网络来利用分布统计和真实定位质量之间的密切关联,进一步提升了检测性能。然而,DFL通常输出比一般边界框回归多17×的回归值,导致了大量的计算开销。额外的计算成本显著阻碍了小模型的训练。而DFLv2由于额外的子网络进一步增加了计算负担。在我们的实验中,DFLv2在我们的模型上带来的性能提升与DFL相似。因此,我们只在YOLOv6-M/L中采用DFL。实验细节可在第3.3.3节中找到。
【解析】分布焦点损失代表了边界框回归的一个重要范式转变。传统方法将边界框坐标看作确定性的数值预测问题,但现实中目标的边界往往存在模糊性和不确定性。比如在图像中,一个目标的边缘可能因为光照、阴影、或者目标本身的材质特性而显得模糊不清,此时要求网络给出一个绝对精确的边界框坐标是不现实的。DFL通过将每个坐标位置建模为概率分布来解决这个问题:网络不再预测一个确定的坐标值,而是预测该坐标在不同位置的概率。这种概率表示能够自然地编码预测的不确定性,当网络对某个位置很确定时,概率分布会比较尖锐;当网络不确定时,概率分布会比较平缓。这种方法的数学优雅性在于它没有引入人为的先验假设,而是让网络自己学习不确定性的表示。DFLv2在此基础上更进一步,它发现分布的统计特性(如方差、偏度等)与实际的定位质量存在强相关性,因此设计了一个小型网络来学习这种关联,从而更准确地评估定位质量。然而,这种概率建模的代价是计算复杂度的大幅增加:原本只需要预测4个坐标值,现在需要为每个坐标预测一个概率分布,通常需要17个数值来离散化表示,这意味着计算量增加了17倍。对于资源受限的小模型来说,这种计算开销是难以承受的,因此YOLOv6只在计算资源相对充足的中大型模型中使用DFL技术。
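下面的示意代码演示DFL的核心思想:把每个坐标离散为若干个bin上的概率分布,解码时对分布取期望,训练时监督目标值两侧相邻的两个bin(这里取17个bin,对应正文中"17×"的说法;纯Python草图,并非官方实现):

```python
import math

def _softmax(logits):
    """数值稳定的softmax: 把一个坐标的logits变成bin上的概率分布"""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def dfl_expectation(logits):
    """解码: 对离散分布取期望, 得到连续坐标值(以bin为单位)"""
    probs = _softmax(logits)
    return sum(i * p for i, p in enumerate(probs))

def dfl_loss(logits, target):
    """DFL监督连续目标y两侧的相邻整数bin y_l, y_r = y_l + 1:
    loss = -((y_r - y) * log P(y_l) + (y - y_l) * log P(y_r))"""
    probs = _softmax(logits)
    yl = int(math.floor(target))
    yr = yl + 1
    return -((yr - target) * math.log(probs[yl])
             + (target - yl) * math.log(probs[yr]))
```

分布越尖锐,期望越接近峰值位置、损失越小;分布平缓则自然编码了预测的不确定性。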
2.3.3 Object Loss
Object loss was first proposed in FCOS [ 41 ] to reduce the score of low-quality bounding boxes so that they can be filtered out in post-processing. It was also used in YOLOX [ 7 ] to accelerate convergence and improve network accuracy. As an anchor-free framework like FCOS and YOLOX, we have tried adding object loss to YOLOv6. Unfortunately, it doesn’t bring many positive effects. Details are given in Section 3 .
【翻译】目标损失(Object loss)最初在FCOS中被提出,用于降低低质量边界框的得分,以便在后处理中将其过滤掉。它也被用于YOLOX中以加速收敛并提高网络精度。作为像FCOS和YOLOX一样的无锚框架,我们尝试将目标损失引入YOLOv6。不幸的是,它没有带来很多积极效果。详细信息在第3节中给出。
【解析】目标损失是为了解决目标检测中一个关键问题而设计的:如何区分高质量和低质量的检测结果。在传统的目标检测方法中,网络会产生大量的候选检测框,但其中很多质量很差——要么定位不准确,要么包含的目标不完整,要么根本就是误检。这些低质量的检测框如果不被有效识别和抑制,就会影响最终的检测性能。FCOS通过引入目标损失来训练网络学习一个额外的"目标性"分数,这个分数反映了当前位置是否真的包含一个完整、清晰的目标。具有高目标性分数的位置更可能产生高质量的检测框,而低分数的位置则倾向于产生低质量或错误的检测。在后处理阶段,可以利用这个目标性分数来过滤掉那些质量差的检测结果,从而提高整体的检测精度。YOLOX进一步证明了目标损失不仅能提高检测质量,还能加速训练收敛,这是因为额外的监督信号帮助网络更快地学会区分有效和无效的特征表示。然而,YOLOv6的实验结果表明,并不是所有在其他模型上有效的技术都能在新的架构中发挥同样的作用。目标损失在YOLOv6中效果不佳可能有多种原因:首先,YOLOv6的网络架构和训练策略可能已经通过其他方式有效地解决了检测质量问题,使得额外的目标损失变得冗余;其次,不同的检测框架对于特征表示和损失函数的敏感性不同,YOLOv6可能需要不同的质量评估机制;最后,目标损失的引入增加了训练的复杂性和计算开销,在YOLOv6追求效率优化的设计理念下,这种代价可能超过了它带来的收益。
2.4. 工业友好的改进
The following tricks come ready to use in real practice. They are not intended for a fair comparison but steadily produce performance gain without much tedious effort.
【翻译】以下技巧在实际应用中可以直接使用。它们并不是为了公平比较而设计的,但能够稳定地产生性能提升,而无需太多繁琐的工作。
2.4.1 更多训练轮数
Empirical results have shown that detectors have a progressing performance with more training time. We extended the training duration from 300 epochs to 400 epochs to reach a better convergence.
【翻译】经验结果表明,检测器的性能随着训练时间的增加而不断提升。我们将训练持续时间从300个epochs延长到400个epochs,以达到更好的收敛效果。
2.4.2 自蒸馏
To further improve the model accuracy while not introducing much additional computation cost, we apply the classical knowledge distillation technique minimizing the KL-divergence between the prediction of the teacher and the student. We limit the teacher to be the student itself but pretrained, hence we call it self-distillation. Note that the KL-divergence is generally utilized to measure the difference between data distributions. However, there are two sub-tasks in object detection, in which only the classification task can directly utilize knowledge distillation based on KL-divergence. Thanks to DFL loss [ 20 ], we can perform it on box regression as well. The knowledge distillation loss can then be formulated as:
【翻译】为了进一步提高模型精度而不引入太多额外的计算成本,我们应用经典的知识蒸馏技术,最小化教师网络和学生网络预测之间的KL散度。我们将教师网络限制为学生网络本身但是经过预训练的版本,因此我们称之为自蒸馏。注意KL散度通常用于测量数据分布之间的差异。然而,在目标检测中有两个子任务,其中只有分类任务可以直接利用基于KL散度的知识蒸馏。感谢DFL损失,我们也可以在边界框回归上执行知识蒸馏。知识蒸馏损失可以表述为:
【解析】知识蒸馏是深度学习中一种非常重要的模型优化技术,它的核心思想是让一个较小或较快的"学生"网络学习一个更大更强的"教师"网络的知识。传统的知识蒸馏需要两个不同的网络:一个复杂的教师网络和一个简单的学生网络。但在YOLOv6中,研究者们采用了一种更加巧妙的方法——自蒸馏,即让网络向它自己的预训练版本学习。这种方法的好处是不需要设计和训练额外的教师网络,同时还能获得知识蒸馏带来的性能提升。KL散度作为衡量两个概率分布差异的数学工具,在知识蒸馏中扮演着关键角色。它能够量化教师网络的"软标签"和学生网络预测之间的差异,软标签包含了比硬标签(0或1)更丰富的信息。在目标检测任务中,存在分类和定位两个核心子任务。传统的知识蒸馏主要应用于分类任务,因为分类输出天然就是概率分布的形式,可以直接计算KL散度。但是边界框回归任务的输出是坐标值,不是概率分布,因此无法直接应用KL散度。这里DFL损失的引入解决了这个问题:DFL将边界框的坐标预测转换为概率分布的形式,这样就使得在回归任务上也能够应用知识蒸馏成为可能。通过将分类和回归两个任务的蒸馏损失结合起来,YOLOv6能够在不增加推理计算量的情况下,有效提升模型的整体检测精度。
$$L_{KD}=KL(p_{t}^{cls}\,\|\,p_{s}^{cls})+KL(p_{t}^{reg}\,\|\,p_{s}^{reg}),$$
where $p_{t}^{cls}$ and $p_{s}^{cls}$ are class predictions of the teacher model and the student model respectively, and accordingly $p_{t}^{reg}$ and $p_{s}^{reg}$ are box regression predictions. The overall loss function is now formulated as:
【翻译】其中$p_{t}^{cls}$和$p_{s}^{cls}$分别是教师模型和学生模型的分类预测,相应地,$p_{t}^{reg}$和$p_{s}^{reg}$是边界框回归预测。整体损失函数现在表述为:
$$L_{total}=L_{det}+\alpha L_{KD},$$
where $L_{det}$ is the detection loss computed with predictions and labels. The hyperparameter $\alpha$ is introduced to balance two losses. In the early stage of training, the soft labels from the teacher are easier to learn. As the training continues, the performance of the student will match the teacher so that the hard labels will help students more. Upon this, we apply cosine weight decay to $\alpha$ to dynamically adjust the information from hard labels and soft ones from the teacher. We conducted detailed experiments to verify the effect of self-distillation on YOLOv6, which will be discussed in Section 3 .
【翻译】其中$L_{det}$是用预测和标签计算的检测损失。引入超参数$\alpha$来平衡两个损失。在训练的早期阶段,来自教师的软标签更容易学习。随着训练的继续,学生的性能将匹配教师,因此硬标签将对学生更有帮助。基于此,我们对$\alpha$应用余弦权重衰减,以动态调整来自硬标签和教师软标签的信息。我们进行了详细的实验来验证自蒸馏对YOLOv6的效果,这将在第3节中讨论。
【解析】公式中,$p_{t}^{cls}$和$p_{s}^{cls}$分别代表教师网络和学生网络在分类任务上的输出概率分布,而$p_{t}^{reg}$和$p_{s}^{reg}$则是两个网络在边界框回归任务上的预测分布。这种设计确保了知识蒸馏能够同时作用于目标检测的两个核心任务。总损失函数$L_{total}=L_{det}+\alpha L_{KD}$体现了一个非常重要的平衡原则:$L_{det}$是传统的检测损失,它确保网络能够正确学习真实标签(硬标签)的信息;而$L_{KD}$是知识蒸馏损失,它让学生网络学习教师网络的预测分布(软标签)。超参数$\alpha$的作用至关重要,它控制着这两种不同类型监督信号的相对重要性。$\alpha$动态调整策略:在训练初期,学生网络的能力还很弱,此时教师网络的软标签包含了丰富的知识,比如类别之间的相似性关系、预测的不确定性等,这些信息比硬标签更容易被初学的学生网络吸收。随着训练的进行,学生网络逐渐成熟,其性能开始接近教师网络,此时软标签的指导作用逐渐减弱,而硬标签的精确指导变得更加重要。余弦权重衰减策略巧妙地模拟了这一过程:通过让$\alpha$按照余弦函数的形式逐渐减小,实现了从"以软标签为主"到"以硬标签为主"的平滑过渡。这种动态平衡不仅理论上合理,而且在实践中能够充分发挥知识蒸馏的优势,避免了固定权重可能导致的训练不稳定或收敛不充分的问题。
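上面的总损失与$\alpha$余弦衰减可以写成如下简化草图(纯Python示意:教师/学生输出这里用离散概率向量代替,余弦衰减的具体公式是假设的常见写法,并非论文官方实现):

```python
import math

def kl_div(p, q, eps=1e-12):
    """离散分布间的KL散度 KL(p || q)"""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cosine_alpha(epoch, total_epochs):
    """蒸馏权重alpha的余弦衰减: 训练初期软标签为主, 后期硬标签为主"""
    return 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

def total_loss(l_det, teacher_cls, student_cls, teacher_reg, student_reg,
               epoch, total_epochs):
    """L_total = L_det + alpha * (KL(cls) + KL(reg)), alpha随训练进程衰减"""
    l_kd = kl_div(teacher_cls, student_cls) + kl_div(teacher_reg, student_reg)
    return l_det + cosine_alpha(epoch, total_epochs) * l_kd
```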
2.4.3 图像的灰色边框
We notice that a half-stride gray border is put around each image when evaluating the model performance in the implementations of YOLOv5 [ 10 ] and YOLOv7 [ 42 ]. Although no useful information is added, it helps in detecting the objects near the edge of the image. This trick also applies in YOLOv6.
【翻译】我们注意到在YOLOv5和YOLOv7的实现中,在评估模型性能时会在每张图像周围放置半步长的灰色边框。虽然没有添加有用的信息,但它有助于检测图像边缘附近的目标。这个技巧也适用于YOLOv6。
【解析】这个看似简单的"灰色边框"技巧实际上解决了目标检测中一个重要的边界效应问题。在目标检测任务中,当目标位于图像边缘时,网络往往难以准确检测,因为这些目标的特征信息可能被截断。添加灰色边框本质上是一种边界填充策略,它为图像边缘的目标提供了额外的上下文空间。虽然灰色像素本身不包含任何语义信息,但它们创造了一个缓冲区域,让网络的感受野能够完整地覆盖边缘目标。"半步长"这个设置非常巧妙——它确保了填充的大小与网络的下采样步长相匹配,这样可以保持特征图在空间维度上的对齐性,避免因为填充导致的特征错位问题。
However, the extra gray pixels evidently reduce the inference speed. Without the gray border, the performance of YOLOv6 deteriorates, which is also the case in [ 10 , 42 ]. We postulate that the problem is related to the gray borders padding in Mosaic augmentation [ 1 , 10 ]. Experiments on turning mosaic augmentations off during last epochs [ 7 ] (aka. fade strategy) are conducted for verification. In this regard, we change the area of gray border and resize the image with gray borders directly to the target image size. Combining these two strategies, our models can maintain or even boost the performance without the degradation of inference speed.
【翻译】然而,额外的灰色像素明显降低了推理速度。如果没有灰色边框,YOLOv6的性能会下降,这在YOLOv5和YOLOv7中也是如此。我们推测这个问题与Mosaic数据增强中的灰色边框填充有关。为了验证,我们进行了在最后几个epoch关闭mosaic增强(也称为淡化策略)的实验。在这方面,我们改变了灰色边框的区域,并将带有灰色边框的图像直接调整到目标图像尺寸。结合这两种策略,我们的模型可以在不降低推理速度的情况下保持甚至提升性能。
【解析】灰色边框虽然提升了检测精度,但增加了图像的像素数量,直接导致计算量增加和推理速度下降。作者们敏锐地意识到问题的根源可能在于训练过程中使用的Mosaic数据增强技术。Mosaic增强会将四张不同的图像拼接成一张新图像,在拼接过程中需要用灰色像素填充空白区域。如果模型在训练时习惯了这种带有灰色填充的图像模式,那么在推理时移除灰色边框就会导致输入分布的不匹配,从而影响性能。淡化策略:在训练的最后几个epoch逐渐减少或完全停止Mosaic增强,让模型适应没有额外填充的原始图像分布。这样既保留了Mosaic增强在训练前期带来的泛化能力提升,又避免了推理时的分布不匹配问题。同时,通过调整灰色边框的区域大小并直接将图像缩放到目标尺寸,进一步优化了这一过程,最终实现了精度和速度的双重优化。
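"在图像四周补灰边"的操作可以用纯Python示意如下(灰度值114是YOLOv5系letterbox实现中常用的填充值,此处属于假设;论文的做法即把输入限制为634×634再补3像素灰边,634 + 2×3 = 640):

```python
def pad_gray_border(img, border=3, value=114):
    """img: 嵌套列表表示的 H x W x 3 图像; 在四周补 border 像素宽的灰边"""
    h, w = len(img), len(img[0])
    gray = [value, value, value]
    full_row = [gray] * (w + 2 * border)
    out = [list(full_row) for _ in range(border)]        # 上侧灰边
    for row in img:
        out.append([gray] * border + row + [gray] * border)  # 左右灰边
    out += [list(full_row) for _ in range(border)]       # 下侧灰边
    return out
```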
2.5. 量化和部署
For industrial deployment, it has been common practice to adopt quantization to further speed up runtime without much performance compromise. Post-training quantization (PTQ) directly quantizes the model with only a small calibration set. Whereas quantization-aware training (QAT) further improves the performance with the access to the training set, which is typically used jointly with distillation. However, due to the heavy use of re-parameterization blocks in YOLOv6, previous PTQ techniques fail to produce high performance, while it is hard to incorporate QAT when it comes to matching fake quantizers during training and inference. We here demonstrate the pitfalls and our cures during deployment.
【翻译】在工业部署中,采用量化来进一步加速运行时间而不会显著损害性能已成为常见做法。训练后量化(PTQ)仅使用小型校准集直接对模型进行量化。而量化感知训练(QAT)通过访问训练集进一步提高性能,通常与蒸馏联合使用。然而,由于YOLOv6大量使用重参数化块,之前的PTQ技术无法产生高性能,而且在匹配训练和推理期间的伪量化器时很难结合QAT。我们在此展示部署过程中的陷阱和我们的解决方案。
【解析】量化的核心原理是将原本用32位浮点数表示的权重和激活值转换为8位整数或更低精度的表示,这样可以显著减少内存占用和计算复杂度,同时利用现代硬件对整数运算的优化支持。训练后量化是一种相对简单的量化方法,它在模型训练完成后直接对已有的浮点模型进行转换。这种方法只需要一个小型的校准数据集来统计激活值的分布范围,然后据此确定量化参数。PTQ的优势在于实施简单、不需要重新训练,但缺点是可能会导致较大的精度损失,特别是对于复杂的网络结构。量化感知训练则是一种更加精细的方法,它在训练过程中就考虑量化的影响。QAT通过在前向传播中插入伪量化操作来模拟真实的量化过程,让网络在训练时就适应量化带来的精度损失。这种方法通常能获得更好的量化效果,但需要完整的训练数据集和更长的训练时间。重参数化块的引入为YOLOv6带来了性能提升,但也给量化带来了新的挑战,结构上的不一致性使得传统的量化方法难以处理:PTQ方法无法很好地处理结构转换过程中的数值变化,而QAT方法在匹配训练时的伪量化器和推理时的真实量化器方面也面临困难。这种不匹配可能导致量化后的模型性能大幅下降,因此需要专门的解决方案。
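为说明重参数化为什么导致训练/推理结构不一致,下面用一维卷积做一个RepVGG式分支融合的类比:训练时是"3-tap卷积 + 1-tap卷积 + identity"三条并联分支,部署前可以把它们合并为单个3-tap卷积核,输出完全等价(仅为原理草图,与YOLOv6实际代码无关):

```python
def conv1d(x, kernel):
    """零填充的'same'一维互相关, kernel长度为奇数"""
    k = len(kernel)
    pad = k // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * xp[i + j] for j in range(k)) for i in range(len(x))]

def fuse_rep_branches(w3, w1, identity=True):
    """把并联的3-tap卷积、1-tap卷积和identity合并为单个3-tap核"""
    fused = list(w3)
    fused[1] += w1          # 1x1分支等价于只有中心tap的3-tap核
    if identity:
        fused[1] += 1.0     # identity分支等价于中心tap为1的核
    return fused
```

正是这种"训练多分支、推理单分支"的结构转换,使得权重/激活分布在转换前后发生变化,进而让普通PTQ难以处理。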
2.5.1 重参数化优化器
RepOptimizer [ 2 ] proposes gradient re-parameterization at each optimization step. This technique also well solves the quantization problem of reparameterization-based models. We hence reconstruct the re-parameterization blocks of YOLOv6 in this fashion and train it with RepOptimizer to obtain PTQ-friendly weights. The distribution of feature map is largely narrowed (e.g. Fig. 4 , more in B.1 ), which greatly benefits the quantization process, see Sec 3.5.1 for results.
【翻译】RepOptimizer [2] 提出在每个优化步骤中进行梯度重参数化。这项技术也很好地解决了基于重参数化模型的量化问题。因此,我们以这种方式重构YOLOv6的重参数化块,并使用RepOptimizer进行训练,以获得对PTQ友好的权重。特征图的分布大大缩小(如图4所示,更多内容见B.1),这极大地有利于量化过程,结果见第3.5.1节。
【解析】RepOptimizer是一个专门针对重参数化网络设计的优化器,它的核心创新在于在训练过程中对梯度进行重参数化处理。传统的优化器直接使用计算得到的梯度来更新参数,但RepOptimizer会在每个优化步骤中重新组织和调整梯度的分布方式。这种方法特别适合处理重参数化网络的训练,因为重参数化结构在训练和推理时具有不同的拓扑结构,直接训练可能导致参数分布不够理想。RepOptimizer通过梯度重参数化,能够使得训练得到的权重在结构转换后仍然保持良好的数值特性。对于量化而言,权重和激活值的分布范围是影响量化精度的关键因素。如果数值分布过于分散,量化时就需要用较大的量化范围来覆盖所有可能的值,这会降低量化精度。相反,如果分布比较集中,就可以用较小的量化范围获得更好的精度。RepOptimizer训练得到的模型具有更窄的特征图分布,这意味着激活值更加集中在某个数值范围内,从而使得8位量化能够更精确地表示这些数值,显著减少量化误差。这种"PTQ友好"的特性使得模型可以直接进行训练后量化,而无需复杂的量化感知训练过程,大大简化了模型部署的复杂度。
Figure 4: Improved activation distribution of YOLOv6-S trained with RepOptimizer.
【翻译】图4:使用RepOptimizer训练的YOLOv6-S的改进激活分布。
2.5.2 敏感性分析
We further improve the PTQ performance by partially converting quantization-sensitive operations into float computation. To obtain the sensitivity distribution, several metrics are commonly used, mean-square error (MSE), signal-noise ratio (SNR) and cosine similarity. Typically for comparison, one can pick the output feature map (after the activation of a certain layer) to calculate these metrics with and without quantization. As an alternative, it is also viable to compute validation AP by switching quantization on and off for the certain layer [ 29 ].
【翻译】我们通过将量化敏感操作部分转换为浮点计算来进一步改善PTQ性能。为了获得敏感性分布,通常使用几种指标:均方误差(MSE)、信噪比(SNR)和余弦相似度。通常为了比较,可以选择输出特征图(某层激活后)来计算这些指标在有量化和无量化情况下的差异。作为替代方案,也可以通过对特定层开启和关闭量化来计算验证AP。
【解析】这段描述了一种精细化的量化优化策略——敏感性分析。虽然整体量化可以显著提升推理速度,但并非所有层对量化都有相同的容忍度。某些关键层的量化可能会导致显著的精度损失,而另一些层即使量化也不会对最终结果产生太大影响。敏感性分析的核心思想是识别出这些"量化敏感层",然后对它们采用特殊处理策略。具体的分析方法包括:均方误差衡量量化前后特征图的数值差异程度,如果MSE很大,说明该层对量化很敏感;信噪比从信号处理角度评估量化引入的噪声对原始信号的影响程度;余弦相似度则从向量空间角度衡量量化前后特征向量的方向一致性。这些指标从不同角度刻画了量化对网络中间表示的影响。除了这些数学指标,直接使用下游任务的性能指标(如验证集上的AP)来评估敏感性也是一种有效方法,这种方法更直观地反映了量化对最终检测性能的影响。通过这种混合精度的策略,既保留了量化带来的速度优势,又避免了关键层量化导致的精度大幅下降,实现了精度和速度的更好平衡。
We compute all these metrics on the YOLOv6-S model trained with RepOptimizer and pick the top-6 sensitive layers to run in float. The full chart of sensitivity analysis can be found in B.2 .
【翻译】我们在使用RepOptimizer训练的YOLOv6-S模型上计算所有这些指标,并选择前6个敏感层以浮点精度运行。完整的敏感性分析图表可在B.2中找到。
【解析】作者采用了系统性的敏感性分析方法来优化YOLOv6的量化策略。他们首先对使用RepOptimizer训练的YOLOv6-S模型进行全面的敏感性测试,通过前面提到的多种指标来评估每一层对量化的敏感程度。然后根据分析结果,识别出对量化最敏感的前6层,这些层在推理时保持32位浮点精度运算,而其他层则使用8位量化。这种选择性的混合精度策略是量化优化中的常见做法,它基于一个重要的观察:网络中不同层的重要性和敏感性是不同的。通常来说,网络的早期层(负责提取底层特征)和最后几层(负责最终预测)往往对量化更加敏感,而中间的特征提取层相对更能容忍量化误差。选择前6个最敏感的层进行浮点运算,这个数量是经过实验验证的最优平衡点——既能保持较高的检测精度,又不会因为过多的浮点运算而显著影响推理速度。
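敏感性分析的流程可以写成如下草图:对每层激活做对称int8伪量化(quantize再dequantize),计算量化前后的MSE与余弦相似度,按MSE排序挑出最敏感的前k层保留浮点(fake_quantize 的具体量化方式是假设的常见做法,非论文官方实现;论文取k=6):

```python
import math

def fake_quantize(xs, num_bits=8):
    """对称的per-tensor int8伪量化: 量化后立即反量化, 用于模拟量化误差"""
    max_abs = max(abs(x) for x in xs) or 1.0
    scale = max_abs / (2 ** (num_bits - 1) - 1)
    return [round(x / scale) * scale for x in xs]

def quant_metrics(xs):
    """返回量化前后特征的 (MSE, 余弦相似度)"""
    qs = fake_quantize(xs)
    mse = sum((a - b) ** 2 for a, b in zip(xs, qs)) / len(xs)
    dot = sum(a * b for a, b in zip(xs, qs))
    na = math.sqrt(sum(a * a for a in xs))
    nb = math.sqrt(sum(b * b for b in qs))
    cos = dot / (na * nb)
    return mse, cos

def top_k_sensitive(layer_activations, k=6):
    """按量化MSE从大到小排序, 前k层视为量化敏感层, 保留浮点运算"""
    scored = [(name, quant_metrics(act)[0]) for name, act in layer_activations.items()]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [name for name, _ in scored[:k]]
```

注意"wide"层因为一个离群激活值撑大了量化范围,导致其余小数值的量化误差急剧增大——这正对应图4所示"分布越窄越利于量化"的观察。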
2.5.3 带有逐通道蒸馏的量化感知训练
In case PTQ is insufficient, we propose to involve quantization-aware training (QAT) to boost quantization performance. To resolve the problem of the inconsistency of fake quantizers during training and inference, it is necessary to build QAT upon the RepOptimizer. Besides, channel-wise distillation [ 36 ] (later as CW Distill) is adapted within the YOLOv6 framework, shown in Fig. 5 . This is also a self-distillation approach where the teacher network is the student itself in FP32-precision. See experiments in Sec 3.5.1 .
【翻译】在PTQ不充分的情况下,我们提议引入量化感知训练(QAT)来提升量化性能。为了解决训练和推理期间伪量化器不一致的问题,有必要在RepOptimizer的基础上构建QAT。此外,逐通道蒸馏(后称为CW Distill)被适配到YOLOv6框架中,如图5所示。这也是一种自蒸馏方法,其中教师网络就是学生网络本身的FP32精度版本。实验结果见第3.5.1节。
【解析】当训练后量化的效果不能满足精度要求时,量化感知训练成为了一个更强大的替代方案。但是对于YOLOv6这样使用重参数化结构的网络,QAT面临着一个独特的挑战:训练时的网络结构(包含多分支)和推理时的结构(合并为单分支)是不同的,这导致伪量化器的配置在训练和推理阶段存在不匹配的问题。RepOptimizer的引入恰好解决了这个问题,它通过特殊的梯度处理方式,使得重参数化网络在QAT训练过程中能够获得更好的数值稳定性和一致性。逐通道蒸馏是一种知识蒸馏技术的变体,传统的知识蒸馏通常在输出层面进行,而逐通道蒸馏则在特征图的通道维度上进行更细粒度的知识传递。在这个框架中,作者采用了自蒸馏的策略,即让同一个网络的FP32版本作为教师来指导其量化版本的训练。这种方法的巧妙之处在于避免了额外教师网络的训练成本,同时确保了知识来源的一致性。FP32版本的网络拥有完整的数值精度,能够提供最准确的特征表示,而量化版本则通过模仿这些特征来学习如何在低精度下保持性能。
Figure 5: Schematic of YOLOv6 channel-wise distillation in QAT.
【翻译】图5:YOLOv6在QAT中的逐通道蒸馏示意图。
3. Experiments
3.1. Implementation Details
We use the same optimizer and the learning schedule as YOLOv5 [ 10 ], i.e. stochastic gradient descent (SGD) with momentum and cosine decay on learning rate. Warm-up, grouped weight decay strategy and the exponential moving average (EMA) are also utilized. We adopt two strong data augmentations (Mosaic [ 1 , 10 ] and Mixup [ 49 ]) following [ 1 , 7 , 10 ]. A complete list of hyperparameter settings can be found in our released code. We train our models on the COCO 2017 [ 23 ] training set, and the accuracy is evaluated on the COCO 2017 validation set. All our models are trained on 8 NVIDIA A100 GPUs, and the speed performance is measured on an NVIDIA Tesla T4 GPU with TensorRT version 7.2 unless otherwise stated. And the speed performance measured with other TensorRT versions or on other devices is demonstrated in Appendix A.
【翻译】我们使用与YOLOv5 [10]相同的优化器和学习调度,即带动量的随机梯度下降(SGD)和学习率的余弦衰减。还利用了热身、分组权重衰减策略和指数移动平均(EMA)。我们采用了两种强大的数据增强(Mosaic [1, 10]和Mixup [49]),遵循[1, 7, 10]的做法。完整的超参数设置列表可以在我们发布的代码中找到。我们在COCO 2017 [23]训练集上训练模型,准确性在COCO 2017验证集上评估。我们所有的模型都在8个NVIDIA A100 GPU上训练,除非另有说明,速度性能在带有TensorRT版本7.2的NVIDIA Tesla T4 GPU上测量。在其他TensorRT版本或其他设备上测量的速度性能在附录A中展示。
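正文提到的"warm-up + 学习率余弦衰减"调度可以写成如下草图(base_lr 与 final_lr_ratio 是示意用的假设超参数,并非论文的实际配置):

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr=0.01, final_lr_ratio=0.01):
    """线性warm-up后接余弦衰减的学习率调度(YOLOv5风格的常见写法)"""
    if step < warmup_steps:
        # warm-up阶段: 学习率从0线性升到base_lr
        return base_lr * step / warmup_steps
    # 余弦衰减阶段: 从base_lr平滑降到 base_lr * final_lr_ratio
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    cos = 0.5 * (1 + math.cos(math.pi * t))
    return base_lr * (final_lr_ratio + (1 - final_lr_ratio) * cos)
```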
3.2. 比较
Considering that the goal of this work is to build networks for industrial applications, we primarily focus on the speed performance of all models after deployment, including throughput (FPS at a batch size of 1 or 32) and the GPU latency, rather than FLOPs or the number of parameters. We compare YOLOv6 with other state-of-the-art detectors of YOLO series, including YOLOv5 [ 10 ], YOLOX [ 7 ], PPYOLOE [ 45 ] and YOLOv7 [ 42 ]. Note that we test the speed performance of all official models with FP16-precision on the same Tesla T4 GPU with TensorRT [ 28 ]. The performance of YOLOv7-Tiny is re-evaluated according to their open-sourced code and weights at the input size of 416 and 640. Results are shown in Table 1 and Fig. 1 . Compared with YOLOv5-N/YOLOv7-Tiny (input size = 416), our YOLOv6-N has significantly advanced by 7.9%/2.6% respectively. It also comes with the best speed performance in terms of both throughput and latency. Compared with YOLOX-S/PPYOLOE-S, YOLOv6-S can improve AP by 3.0%/0.4% with higher speed. We compare YOLOv5-S and YOLOv7-Tiny (input size = 640) with YOLOv6-T, our method is 2.9% more accurate and 73/25 FPS faster with a batch size of 1. YOLOv6-M outperforms YOLOv5-M by 4.2% higher AP with a similar speed, and it achieves 2.7%/0.6% higher AP than YOLOX-M/PPYOLOE-M at a higher speed. Besides, it is more accurate and faster than YOLOv5-L. YOLOv6-L is 2.8%/1.1% more accurate than YOLOX-L/PPYOLOE-L under the same latency constraint. We additionally provide a faster version of YOLOv6-L by replacing SiLU with ReLU (denoted as YOLOv6-L-ReLU). It achieves 51.7% AP with a latency of 8.8 ms, outperforming YOLOX-L/PPYOLOE-L/YOLOv7 in both accuracy and speed.
【翻译】考虑到这项工作的目标是构建用于工业应用的网络,我们主要关注所有模型部署后的速度性能,包括吞吐量(批次大小为1或32时的FPS)和GPU延迟,而不是FLOPs或参数数量。我们将YOLOv6与YOLO系列的其他最先进检测器进行比较,包括YOLOv5 [10]、YOLOX [7]、PPYOLOE [45]和YOLOv7 [42]。注意,我们在同一个Tesla T4 GPU上使用TensorRT [28]以FP16精度测试所有官方模型的速度性能。YOLOv7-Tiny的性能根据其开源代码和权重在输入尺寸为416和640时重新评估。结果如表1和图1所示。与YOLOv5-N/YOLOv7-Tiny(输入尺寸=416)相比,我们的YOLOv6-N分别显著提升了7.9%/2.6%。它在吞吐量和延迟方面都具有最佳的速度性能。与YOLOX-S/PPYOLOE-S相比,YOLOv6-S能够在更高速度下将AP提高3.0%/0.4%。我们将YOLOv5-S和YOLOv7-Tiny(输入尺寸=640)与YOLOv6-T进行比较,我们的方法准确度高出2.9%,在批次大小为1时速度快73/25 FPS。YOLOv6-M在相似速度下比YOLOv5-M的AP高出4.2%,并且在更高速度下比YOLOX-M/PPYOLOE-M的AP高出2.7%/0.6%。此外,它比YOLOv5-L更准确且更快。在相同延迟约束下,YOLOv6-L比YOLOX-L/PPYOLOE-L准确度高出2.8%/1.1%。我们还通过将SiLU替换为ReLU提供了YOLOv6-L的更快版本(记为YOLOv6-L-ReLU)。它实现了51.7%的AP,延迟为8.8ms,在准确性和速度方面都优于YOLOX-L/PPYOLOE-L/YOLOv7。
3.3. Ablation Study
3.3.1 Network
Backbone and neck We explore the influence of single-path structure and multi-branch structure on backbones and necks, as well as the channel coefficient (denoted as CC) of CSPStackRep Block. All models described in this part adopt TAL as the label assignment strategy, VFL as the classification loss, and GIoU with DFL as the regression loss. Results are shown in Table 2 . We find that the optimal network structure for models at different sizes should come up with different solutions.
【翻译】骨干网和颈部 我们探索了单路径结构和多分支结构对骨干网和颈部的影响,以及CSPStackRep Block的通道系数(记为CC)。本部分描述的所有模型都采用TAL作为标签分配策略,VFL作为分类损失,GIoU结合DFL作为回归损失。结果如表2所示。我们发现不同尺寸的模型应该采用不同的最优网络结构解决方案。
For YOLOv6-N, the single-path structure outperforms the multi-branch structure in terms of both accuracy and speed. Although the single-path structure has more FLOPs and parameters than the multi-branch structure, it could run faster due to a relatively lower memory footprint and a higher degree of parallelism. For YOLOv6-S, the two block styles bring similar performance. When it comes to larger models, multi-branch structure achieves better performance in accuracy and speed. And we finally select multi-branch with a channel coefficient of 2/3 for YOLOv6-M and 1/2 for YOLOv6-L.
【翻译】对于YOLOv6-N,单路径结构在准确性和速度方面都优于多分支结构。尽管单路径结构比多分支结构有更多的FLOPs和参数,但由于相对较低的内存占用和更高的并行度,它可以运行得更快。对于YOLOv6-S,两种块样式带来相似的性能。当涉及到更大的模型时,多分支结构在准确性和速度方面取得更好的性能。我们最终选择多分支结构,YOLOv6-M的通道系数为2/3,YOLOv6-L的通道系数为1/2。
Furthermore, we study the influence of width and depth of the neck on YOLOv6-L. Results in Table 3 show that the slender neck performs 0.2% better than the wide-shallow neck with the similar speed.
【翻译】此外,我们研究了颈部的宽度和深度对YOLOv6-L的影响。表3中的结果显示,纤细颈部在相似速度下比宽浅颈部性能好0.2%。
Combinations of convolutional layers and activation functions YOLO series adopted a wide range of activation functions, ReLU [ 27 ], LReLU [ 25 ], Swish [ 31 ], SiLU [ 4 ], Mish [ 26 ] and so on. Among these activation functions, SiLU is the most used. Generally speaking, SiLU performs with better accuracy and does not cause too much extra computation cost. However, when it comes to industrial applications, especially for deploying models with TensorRT [ 28 ] acceleration, ReLU has a greater speed advantage because of its fusion into convolution.
【翻译】卷积层和激活函数的组合 YOLO系列采用了广泛的激活函数,如ReLU [27]、LReLU [25]、Swish [31]、SiLU [4]、Mish [26]等。在这些激活函数中,SiLU是最常用的。一般来说,SiLU具有更好的准确性,并且不会造成太多额外的计算成本。然而,当涉及到工业应用时,特别是在使用TensorRT [28]加速部署模型时,ReLU由于其与卷积的融合而具有更大的速度优势。
Table 1: Comparisons with other YOLO-series detectors on COCO 2017 val. FPS and latency are measured in FP16-precision on a Tesla T4 in the same environment with TensorRT. All our models are trained for 300 epochs without pre-training or any external data. Both the accuracy and the speed performance of our models are evaluated with the input resolution of 640×640. ‘‡’ represents that the proposed self-distillation method is utilized. ‘∗’ represents the re-evaluated result of the released model through the official code.
【翻译】表1:在COCO 2017 val上与其他YOLO系列检测器的比较。FPS和延迟在相同环境下使用TensorRT在Tesla T4上以FP16精度测量。我们所有的模型都在没有预训练或任何外部数据的情况下训练300个epoch。我们模型的准确性和速度性能都在640×640的输入分辨率下评估。'‡'表示使用了提出的自蒸馏方法。'∗'表示通过官方代码对发布模型重新评估的结果。
Table 2: Ablation study on backbones and necks. YOLOv6- L here is equipped with ReLU.
【翻译】表2:骨干网和颈部的消融研究。这里的YOLOv6-L配备了ReLU。
Moreover, we further verify the effectiveness of combinations of RepConv/ordinary convolution (denoted as Conv) and ReLU/SiLU/LReLU in networks of different sizes to achieve a better trade-off. As shown in Table 4 , Conv with SiLU performs the best in accuracy while the combination of RepConv and ReLU achieves a better trade-off. We suggest users adopt RepConv with ReLU in latency-sensitive applications. We choose to use RepConv/ReLU combination in YOLOv6-N/T/S/M for higher inference speed and use the Conv/SiLU combination in the large model YOLOv6-L to speed up training and improve performance.
【翻译】此外,我们进一步验证了在不同尺寸的网络中RepConv/普通卷积(记为Conv)和ReLU/SiLU/LReLU组合的有效性,以实现更好的权衡。如表4所示,Conv结合SiLU在准确性方面表现最佳,而RepConv和ReLU的组合实现了更好的权衡。我们建议用户在延迟敏感的应用中采用RepConv结合ReLU。我们选择在YOLOv6-N/T/S/M中使用RepConv/ReLU组合以获得更高的推理速度,在大型模型YOLOv6-L中使用Conv/SiLU组合来加速训练并提高性能。
Table 3: Ablation study on the neck settings of YOLOv6-L. SiLU is selected as the activation function.
【翻译】表3:YOLOv6-L颈部设置的消融研究。选择SiLU作为激活函数。
Table 4: Ablation study on combinations of different types of convolutional layers (denoted as Conv.) and activation layers (denoted as Act.).
【翻译】表4:不同类型卷积层(记为Conv.)和激活层(记为Act.)组合的消融研究。
Miscellaneous design We also conduct a series of ablation on other network parts mentioned in Section 2.1 based on YOLOv6-N. We choose YOLOv5-N as the baseline and add other components incrementally. Results are shown in Table 5 . Firstly, with decoupled head (denoted as DH), our model is 1.4% more accurate with 5% increase in time cost. Secondly, we verify that the anchor-free paradigm is 51% faster than the anchor-based one for its 3× less predefined anchors, which results in less dimensionality of the output. Further, the unified modification of the backbone (EfficientRep Backbone) and the neck (Rep-PAN neck), denoted as EB+RN, brings 3.6% AP improvements, and runs 21% faster. Finally, the optimized decoupled head (hybrid channels, HC) brings 0.2% AP and 6.8% FPS improvements in accuracy and speed respectively.
【翻译】其他设计 我们还基于YOLOv6-N对第2.1节中提到的其他网络部分进行了一系列消融实验。我们选择YOLOv5-N作为基线,并逐步添加其他组件。结果如表5所示。首先,使用解耦头(记为DH),我们的模型准确性提高1.4%,时间成本增加5%。其次,我们验证了无锚范式比基于锚的范式快51%,因为其预定义锚点减少了3倍,从而导致输出维度更少。进一步,骨干网(EfficientRep Backbone)和颈部(Rep-PAN neck)的统一修改,记为EB+RN,带来3.6%的AP改进,运行速度快21%。最后,优化的解耦头(混合通道,HC)在准确性和速度方面分别带来0.2%的AP和6.8%的FPS改进。
3.3.2 Label Assignment
In Table 6 , we analyze the effectiveness of mainstream label assignment strategies. Experiments are conducted on YOLOv6-N. As expected, we observe that SimOTA and TAL are the best two strategies. Compared with the ATSS, SimOTA can increase AP by 2.0%, and TAL brings 0.5% higher AP than SimOTA. Considering the stable training and better accuracy performance of TAL, we adopt TAL as our label assignment strategy.
【翻译】在表6中,我们分析了主流标签分配策略的有效性。实验在YOLOv6-N上进行。正如预期的那样,我们观察到SimOTA和TAL是最好的两种策略。与ATSS相比,SimOTA可以将AP提高2.0%,TAL比SimOTA带来0.5%更高的AP。考虑到TAL的稳定训练和更好的准确性性能,我们采用TAL作为我们的标签分配策略。
Table 5: Ablation study on all network designs in an incremental way. FPS is tested with FP16-precision and batch size = 32 on Tesla T4 GPUs.
【翻译】表5:以递增方式对所有网络设计的消融研究。FPS在Tesla T4 GPU上以FP16精度和批次大小=32进行测试。
Table 6: Comparisons of label assignment methods.
【翻译】表6:标签分配方法的比较。
Table 7: Comparisons of label assignment methods in warm-up stage.
【翻译】表7:预热阶段标签分配方法的比较。
In addition, the implementation of TOOD [ 5 ] adopts ATSS [ 51 ] as the warm-up label assignment strategy during the early training epochs. We also retain the warm-up strategy and further make some explorations on it. Details are shown in Table 7 , and we can find that without warm-up or warmed up by other strategies (i.e., SimOTA) it can also achieve the similar performance.
【翻译】此外,TOOD [5]的实现在早期训练epoch期间采用ATSS [51]作为预热标签分配策略。我们也保留了预热策略并进一步对其进行了一些探索。详细信息如表7所示,我们可以发现,无论是不使用预热还是通过其他策略(即SimOTA)进行预热,都可以实现相似的性能。
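TAL/TOOD的标签分配依赖一个对齐度量 t = s^α · u^β,其中s是分类得分、u是预测框与真实框的IoU(α=1、β=6是TOOD论文的默认值,属于本文之外的假设)。下面给出一个极简示意,说明该度量如何同时奖励"分类准"和"定位准"的候选:

```python
def tal_alignment(cls_score, iou, alpha=1.0, beta=6.0):
    """TOOD风格的对齐度量 t = s**alpha * u**beta"""
    return (cls_score ** alpha) * (iou ** beta)

def pick_topk(candidates, k=2):
    """对一个真实框, 按对齐度量从高到低挑选top-k候选作为正样本(示意)"""
    return sorted(candidates, key=lambda c: tal_alignment(*c), reverse=True)[:k]
```

由于β较大,定位质量(IoU)在度量中占主导:分类自信但定位松散的候选会被排到后面,这正是"任务对齐"的含义。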
3.3.3 Loss functions
In the object detection framework, the loss function is composed of a classification loss, a box regression loss and an optional object loss, which can be formulated as follows:
【翻译】在目标检测框架中,损失函数由分类损失、边界框回归损失和可选的目标损失组成,可以表述如下:
$$L_{det}=L_{cls}+\lambda L_{reg}+\mu L_{obj},$$
where $L_{cls}$, $L_{reg}$ and $L_{obj}$ are classification loss, regression loss and object loss. $\lambda$ and $\mu$ are hyperparameters.
【翻译】其中$L_{cls}$、$L_{reg}$和$L_{obj}$分别是分类损失、回归损失和目标损失。$\lambda$和$\mu$是超参数。
Table 8: Ablation study on classification loss functions.
【翻译】表8:分类损失函数的消融研究。
In this subsection, we evaluate each loss function on YOLOv6. Unless otherwise specified, the baselines for YOLOv6-N, YOLOv6-S and YOLOv6-M are 35.0%, 42.9% and 48.0%, trained with TAL, Focal Loss and GIoU Loss.
【翻译】在本小节中,我们在YOLOv6上评估每个损失函数。除非另有说明,YOLOv6-N、YOLOv6-S和YOLOv6-M的基线分别为35.0%、42.9%和48.0%,使用TAL、Focal Loss和GIoU Loss进行训练。
Classification Loss We experiment with Focal Loss [ 22 ], Poly loss [ 17 ], QFL [ 20 ] and VFL [ 50 ] on YOLOv6-N/S/M. As can be seen in Table 8 , VFL brings 0.2%/0.3%/0.1% AP improvements on YOLOv6-N/S/M respectively compared with Focal Loss. We choose VFL as the classification loss function.
【翻译】分类损失 我们在YOLOv6-N/S/M上实验了Focal Loss [22]、Poly loss [17]、QFL [20]和VFL [50]。如表8所示,与Focal Loss相比,VFL在YOLOv6-N/S/M上分别带来0.2%/0.3%/0.1%的AP改进。我们选择VFL作为分类损失函数。
Regression Loss IoU-series and probability loss functions are both experimented with on YOLOv6-N/S/M.
【翻译】回归损失 在YOLOv6-N/S/M上同时实验了IoU系列和概率损失函数。
The latest IoU-series losses are utilized in YOLOv6-N/S/M. Experiment results in Table 9 show that SIoU Loss outperforms others for YOLOv6-N and YOLOv6-T, while CIoU Loss performs better on YOLOv6-M.
【翻译】在YOLOv6-N/S/M中使用了最新的IoU系列损失。表9中的实验结果显示,SIoU Loss在YOLOv6-N和YOLOv6-T上优于其他损失,而CIoU Loss在YOLOv6-M上表现更好。
For probability losses, as listed in Table 10 , introducing DFL can obtain 0.2%/0.1%/0.2% performance gain for YOLOv6-N/S/M respectively. However, the inference speed is greatly affected for small models. Therefore, DFL is only introduced in YOLOv6-M/L.
【翻译】对于概率损失,如表10所列,引入DFL可以为YOLOv6-N/S/M分别获得0.2%/0.1%/0.2%的性能提升。然而,小模型的推理速度受到很大影响。因此,DFL只在YOLOv6-M/L中引入。
Object Loss Object loss is also experimented with YOLOv6, as shown in Table 11 . From Table 11 , we can see that object loss has negative effects on YOLOv6-N/S/M networks, where the maximum decrease is 1.1% AP on YOLOv6-N. The negative gain may come from the conflict between the object branch and the other two branches in TAL. Specifically, in the training stage, IoU between predicted boxes and ground-truth ones, as well as classification scores are used to jointly build a metric as the criteria to assign labels. However, the introduced object branch extends the number of tasks to be aligned from two to three, which obviously increases the difficulty. Based on the experimental results and this analysis, the object loss is then discarded in YOLOv6.
【翻译】目标损失 目标损失也在YOLOv6上进行了实验,如表11所示。从表11可以看出,目标损失对YOLOv6-N/S/M网络有负面影响,其中YOLOv6-N上的最大下降为1.1% AP。负面收益可能来自于TAL中目标分支与其他两个分支之间的冲突。具体来说,在训练阶段,预测框与真实框之间的IoU以及分类得分被联合用来构建一个度量作为分配标签的准则。然而,引入的目标分支将需要对齐的任务数量从两个扩展到三个,这显然增加了难度。基于实验结果和这一分析,目标损失在YOLOv6中被舍弃。
Table 9: Ablation study on IoU-series box regression loss functions. The classification loss is VFL [ 50 ].
【翻译】表9:IoU系列边界框回归损失函数的消融研究。分类损失为VFL [50]。
Table 10: Ablation study on probability loss functions.
【翻译】表10:概率损失函数的消融研究。
Table 11: Effectiveness of object loss.
【翻译】表11:目标损失的有效性。
3.4. 工业友好型改进
More training epochs In practice, training for more epochs is a simple and effective way to further increase the accuracy. Results of our small models trained for 300 and 400 epochs are shown in Table 12 . We observe that training for longer epochs substantially boosts AP by 0.4%, 0.6%, 0.5% for YOLOv6-N, T, S respectively. Considering the acceptable cost and the produced gain, it suggests that training for 400 epochs is a better convergence scheme for YOLOv6. Self-distillation We conducted detailed experiments to verify the proposed self-distillation method on YOLOv6-L. As can be seen in Table 13 , applying the self-distillation only on the classification branch can bring 0.4% AP improvement. Furthermore, we simply perform the self-distillation on the box regression task to have 0.3% AP increase. The introduction of weight decay boosts the model by 0.6% AP.
【翻译】更多训练轮次 在实践中,更多的训练轮次是进一步提高准确率的简单有效方法。我们的小模型训练300和400轮次的结果如表12所示。我们观察到,更长的训练轮次显著提升了YOLOv6-N、T、S的AP,分别提升了0.4%、0.6%、0.5%。考虑到可接受的成本和产生的收益,这表明训练400轮次是YOLOv6更好的收敛方案。自蒸馏:我们进行了详细实验来验证在YOLOv6-L上提出的自蒸馏方法。如表13所示,仅在分类分支上应用自蒸馏可以带来0.4%的AP改进。此外,我们在边界框回归任务上简单地执行自蒸馏,获得了0.3%的AP增长。引入权重衰减使模型提升了0.6%的AP。
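The way the self-distillation term combines with the detection loss can be sketched as follows. The cosine decay schedule and the starting weight here are assumptions for illustration; the report only establishes that the distillation weight is decayed during training so that hard labels dominate in the later stage.

```python
import math

def distill_weight(epoch: int, max_epochs: int,
                   w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Cosine-decayed weight for the self-distillation loss: the teacher's
    soft labels dominate early, the ground-truth labels dominate late."""
    cos = (1.0 + math.cos(math.pi * epoch / max_epochs)) / 2.0
    return w_end + (w_start - w_end) * cos

def total_loss(det_loss: float, kd_cls_loss: float, kd_box_loss: float,
               epoch: int, max_epochs: int) -> float:
    """Detection loss plus weighted KD terms on the classification branch
    and the (DFL-based) box-regression branch."""
    w = distill_weight(epoch, max_epochs)
    return det_loss + w * (kd_cls_loss + kd_box_loss)
```

At epoch 0 the distillation terms carry full weight; by the final epoch they vanish and only the detection loss remains.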
Table 12: Experiments of more training epochs on small models.
【翻译】表12:小模型上更多训练轮次的实验。
Table 13: Ablation study on the self-distillation.
【翻译】表13:自蒸馏的消融研究。
Gray border of images In Section 2.4.3 , we introduce a strategy to solve the problem of performance degradation without extra gray borders. Experimental results are shown in Table 14 . In these experiments, YOLOv6-N and YOLOv6-S are trained for 400 epochs and YOLOv6-M for 300 epochs. It can be observed that, without Mosaic fading, the accuracy of YOLOv6-N/S/M drops by 0.4%/0.5%/0.7% when the gray border is removed. With Mosaic fading, however, the degradation shrinks to 0.2%/0.5%/0.5%, which shows that, on the one hand, the problem of performance degradation is mitigated, and on the other hand, the accuracy of the small models (YOLOv6-N/S) improves whether or not we pad gray borders. Moreover, we limit the input images to 634×634 and add a 3-pixel-wide gray border around the edges (more results can be found in Appendix C ). With this strategy, the final image size is the expected 640×640. The results in Table 14 indicate that the final performance of YOLOv6-N/S/M is even 0.2%/0.3%/0.1% more accurate, with the final image size reduced from 672 to 640.
【翻译】图像的灰色边界 在第2.4.3节中,我们介绍了一种解决在没有额外灰色边界的情况下性能下降问题的策略。实验结果如表14所示。在这些实验中,YOLOv6-N和YOLOv6-S训练400轮次,YOLOv6-M训练300轮次。可以观察到,在移除灰色边界时,如果没有Mosaic衰减,YOLOv6-N/S/M的准确率分别降低了0.4%/0.5%/0.7%。然而,当采用Mosaic衰减时,性能下降变为0.2%/0.5%/0.5%,由此我们发现,一方面,性能下降的问题得到了缓解。另一方面,无论是否填充灰色边界,小模型(YOLOv6-N/S)的准确率都得到了改善。此外,我们将输入图像限制为634×634,并在边缘周围添加3像素宽的灰色边界(更多结果可在附录C中找到)。通过这种策略,最终图像的尺寸是预期的640×640。表14中的结果表明,当最终图像尺寸从672减少到640时,YOLOv6-N/S/M的最终性能甚至更准确,分别提升了0.2%/0.3%/0.1%。
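The 634→640 padding step above can be sketched with a few lines of NumPy. The gray value 114 is the convention used across YOLO codebases, an assumption here since this section does not state the exact pixel value.

```python
import numpy as np

def pad_gray_border(img: np.ndarray, border: int = 3,
                    gray: int = 114) -> np.ndarray:
    """Pad a HxWxC image with a uniform gray border on all four sides.
    The gray value 114 is the usual YOLO convention (an assumption)."""
    return np.pad(img, ((border, border), (border, border), (0, 0)),
                  mode="constant", constant_values=gray)

img = np.zeros((634, 634, 3), dtype=np.uint8)  # network input limited to 634x634
padded = pad_gray_border(img)                  # 3 px on each side -> 640x640
assert padded.shape == (640, 640, 3)
```

This keeps the effective final resolution at 640×640 while avoiding the larger 672×672 images that naive border padding would produce.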
Table 14: Experimental results about the strategies for solving the problem of the performance degradation without extra gray border.
【翻译】表14:关于解决在没有额外灰色边界情况下性能下降问题的策略的实验结果。
3.5. 量化结果
We take YOLOv6-S as an example to validate our quantization method. The following experiments are conducted on both releases. The baseline model is trained for 300 epochs.
【翻译】我们以YOLOv6-S为例来验证我们的量化方法。以下实验在两个版本上进行。基线模型训练300轮次。
3.5.1 PTQ(训练后量化)
The average performance is substantially improved when the model is trained with RepOptimizer, see Table 15 . Training with RepOptimizer is in general equally fast and yields nearly identical full-precision accuracy.
【翻译】当模型使用RepOptimizer训练时,平均性能显著改善,见表15。RepOptimizer总体上更快且几乎相同。
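For reference, post-training quantization in its simplest form derives a scale from calibration statistics and maps float values onto the int8 grid without any retraining. The symmetric per-tensor min-max scheme below is a generic sketch, not YOLOv6's exact TensorRT-based recipe; the point of RepOptimizer training, as discussed in the report, is to produce quantization-friendly weight distributions so that such a per-tensor scale loses less accuracy.

```python
import numpy as np

def ptq_quantize(w: np.ndarray, num_bits: int = 8):
    """Symmetric per-tensor min-max PTQ: derive a scale from the tensor's
    dynamic range and round values to the integer grid."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = float(np.abs(w).max()) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to float for comparison with the original."""
    return q.astype(np.float32) * scale

w = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)
q, s = ptq_quantize(w)
w_hat = dequantize(q, s)
assert np.max(np.abs(w - w_hat)) < s  # error within one quantization step
```

A single outlier weight inflates the scale and coarsens the grid for every other value, which is why narrower weight distributions quantize better.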
Table 15: PTQ performance of YOLOv6s trained with RepOptimizer.
【翻译】表15:使用RepOptimizer训练的YOLOv6s的PTQ性能。
3.5.2 QAT(量化感知训练)
For v1.0, we apply fake quantizers to non-sensitive layers obtained from Section 2.5.2 to perform quantization-aware training and call it partial QAT. We compare the result with full QAT in Table 16 . Partial QAT leads to better accuracy with a slightly reduced throughput.
【翻译】对于v1.0版本,我们将伪量化器应用于从第2.5.2节获得的非敏感层,以执行量化感知训练,称为部分QAT。我们在表16中将结果与完整QAT进行比较。部分QAT在吞吐量略有降低的情况下获得了更好的准确率。
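A fake quantizer simulates int8 rounding in the forward pass while keeping tensors in floating point, so the network learns to tolerate the quantization error; gradients are typically passed through unchanged (straight-through estimator). Partial QAT then simply means inserting such nodes only at the non-sensitive layers. Below is a framework-free sketch; real implementations (e.g. PyTorch's FakeQuantize modules) also calibrate or learn the scale.

```python
import numpy as np

def fake_quantize(x: np.ndarray, scale: float, num_bits: int = 8) -> np.ndarray:
    """Quantize-dequantize: the output stays float but only takes values on
    the int8 grid, exposing quantization error during training."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

x = np.array([-0.3, 0.1, 0.5], dtype=np.float32)
y = fake_quantize(x, scale=1.0 / 127)
# y approximates x on the int8 grid; error is at most half a step
assert np.max(np.abs(x - y)) <= 0.5 / 127 + 1e-7
```

Under partial QAT, sensitive layers skip this node entirely and keep running in float, trading a little throughput for accuracy, as Table 16 shows.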
Table 16: QAT performance of YOLOv6-S (v1.0) under different settings.
【翻译】表16:YOLOv6-S (v1.0)在不同设置下的QAT性能。
Due to the removal of quantization-sensitive layers in the v2.0 release, we directly apply full QAT on YOLOv6-S trained with RepOptimizer. We eliminate the inserted quantizers through graph optimization to obtain higher accuracy and faster speed. In Table 17 , we compare against the distillation-based quantization results from PaddleSlim [ 30 ]. Note that our quantized version of YOLOv6-S is the fastest and the most accurate; see also Fig. 1 .
【翻译】由于在v2.0版本中移除了量化敏感层,我们直接在使用RepOptimizer训练的YOLOv6-S上使用完整QAT。我们通过图优化消除插入的量化器,以获得更高的准确率和更快的速度。我们在表17中比较了来自PaddleSlim [30]的基于蒸馏的量化结果。注意我们的YOLOv6-S量化版本是最快和最准确的,另见图1。
Table 17: QAT performance of YOLOv6-S (v2.0) compared with other quantized detectors. '∗': based on v1.0 release. '†': we tested with TensorRT 8 on Tesla T4 with batch sizes of 1 and 32.
【翻译】表17:YOLOv6-S (v2.0)与其他量化检测器的QAT性能比较。‘∗’:基于v1.0版本。‘†’:我们在Tesla T4上使用TensorRT 8进行测试,批次大小为1和32。
4. Conclusion
In a nutshell, with persistent industrial requirements in mind, we present the current form of YOLOv6, carefully examining all the up-to-date advancements in the components of object detectors while instilling our own thoughts and practices. The result surpasses other available real-time detectors in both accuracy and speed. For the convenience of industrial deployment, we also supply a customized quantization method for YOLOv6, yielding a very fast detector out of the box. We sincerely thank the academic and industrial communities for their brilliant ideas and endeavors. In the future, we will continue expanding this project to meet higher standards and more demanding scenarios.
【翻译】简而言之,考虑到持续的工业需求,我们提出了YOLOv6的当前形式,仔细检查了迄今为止目标检测器组件的所有进展,同时融入了我们的思考和实践。结果在准确率和速度方面都超越了其他可用的实时检测器。为了便于工业部署,我们还为YOLOv6提供了定制的量化方法,开箱即用地呈现了一个超快速的检测器。我们真诚感谢学术界和工业界的杰出想法和努力。未来,我们将继续扩展这个项目,以满足更高的标准和更苛刻的场景。