
Figure 1. A simple example of a high-resolution network. There are four stages. The 1st stage consists of high-resolution convolutions.
The 2nd (3rd, 4th) stage repeats two-resolution (three-resolution, four-resolution) blocks. Details are given in Section 3.
resolution representations. In semantic segmentation, the
proposed approach achieves state-of-the-art results on PAS-
CAL Context, Cityscapes, and LIP with similar model sizes
and lower computation complexity. In facial landmark de-
tection, our approach achieves overall best results on four
standard datasets: AFLW, COFW, 300W, and WFLW.
In addition, we construct a multi-level representation
from the high-resolution representation, and apply it to the
Faster R-CNN object detection framework and its extended
frameworks, Mask R-CNN [38] and Cascade R-CNN [9].
The results show that our method yields substantial detection performance improvements, with particularly dramatic gains for small objects. With single-scale training and testing, the proposed approach achieves better COCO object detection results than existing single-model methods.
2. Related Work
Strong high-resolution representations play an essential
role in pixel and region labeling problems, e.g., seman-
tic segmentation, human pose estimation, facial landmark
detection, and object detection. We review representation
learning techniques developed mainly in the semantic seg-
mentation, facial landmark detection [92, 50, 69, 104, 123,
94, 119] and object detection areas¹, from low-resolution
representation learning, high-resolution representation re-
covering, to high-resolution representation maintaining.
Learning low-resolution representations. The fully-
convolutional network (FCN) approaches [67, 87] com-
pute low-resolution representations by removing the fully-connected layers in a classification network, and estimate their coarse segmentation confidence maps. The estimated segmentation maps are improved by combining the
fine segmentation score maps estimated from intermediate
low-level medium-resolution representations [67], or iter-
ating the processes [50]. Similar techniques have also been
applied to edge detection, e.g., holistic edge detection [106].
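The FCN idea above can be sketched numerically: a stride-32 backbone leaves a coarse feature grid, a 1x1 convolution plays the role of the removed fully-connected classifier, and the resulting confidence maps are upsampled back to the input grid. The following is a minimal NumPy sketch under these assumptions, not code from [67, 87]; all sizes and names are illustrative.

```python
import numpy as np

# Hedged sketch: a backbone with total stride 32 produces a low-resolution
# feature map; a 1x1 convolution (a per-pixel linear map) over it yields
# coarse per-class confidence maps, then nearest-neighbour upsampling
# brings them back to the input resolution.
rng = np.random.default_rng(0)
H = W = 64                       # toy input image size
stride, channels, classes = 32, 8, 5

feat = rng.standard_normal((channels, H // stride, W // stride))
w_1x1 = rng.standard_normal((classes, channels))    # 1x1 conv weights

scores = np.einsum('oc,chw->ohw', w_1x1, feat)      # coarse confidence maps
assert scores.shape == (classes, 2, 2)              # 1/32 of the input per axis

seg = scores.repeat(stride, axis=1).repeat(stride, axis=2)  # upsample
assert seg.shape == (classes, H, W)
```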
The fully convolutional network is extended to a dilated version by replacing a few (typically two) strided convolutions and the associated convolutions with dilated convolutions, leading to medium-resolution representations [126, 13, 115, 12, 57]. The representations are further
augmented to multi-scale contextual representations [126, 13, 15] through feature pyramids for segmenting objects at multiple scales.
¹The techniques developed for human pose estimation are reviewed in [91].
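The convolution arithmetic behind the dilated variant can be checked directly: removing the last two stride-2 stages keeps the feature map at 1/8 resolution instead of 1/32, and dilating the affected 3x3 kernels preserves their spatial span. A small sketch under these assumptions (the stride pattern is the typical one, not taken from any specific network in [126, 13, 115, 12, 57]):

```python
# Hedged sketch of convolution arithmetic, not code from the cited works.

def output_size(n, strides):
    """Spatial size after a chain of convolutions with the given strides."""
    for s in strides:
        n = n // s
    return n

def effective_kernel(k, dilation):
    """Span covered by a k x k kernel with the given dilation rate."""
    return dilation * (k - 1) + 1

n = 512
assert output_size(n, [2, 2, 2, 2, 2]) == 16   # stride-32 classifier: 1/32
assert output_size(n, [2, 2, 2, 1, 1]) == 64   # dilated variant:      1/8

# Dilated 3x3 kernels (rates 2 and 4) keep the receptive-field growth
# that the removed stride-2 stages would have provided.
assert effective_kernel(3, 2) == 5
assert effective_kernel(3, 4) == 9
```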
Recovering high-resolution representations. An upsam-
ple subnetwork, like a decoder, is adopted to gradually
recover the high-resolution representations from the low-
resolution representations outputted by the downsample
process. The upsample subnetwork could be a symmetric version of the downsample subnetwork, with skip connections over some mirrored layers to transform the pooling indices, e.g., SegNet [2] and DeconvNet [74],
or copying the feature maps, e.g., U-Net [83] and Hour-
glass [72, 111, 7, 22, 6], encoder-decoder [77], FPN [62],
and so on. The full-resolution residual network [78] intro-
duces an extra full-resolution stream that carries informa-
tion at the full image resolution, to replace the skip connec-
tions, and each unit in the downsample and upsample sub-
networks receives information from and sends information
to the full-resolution stream.
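One decoder step of the U-Net flavor of this recovery scheme can be sketched as follows: the low-resolution representation is upsampled and concatenated with the same-resolution encoder feature map copied over the skip connection. A minimal NumPy sketch, with illustrative shapes not taken from [83] (SegNet-style variants instead reuse pooling indices rather than concatenating features):

```python
import numpy as np

# Hedged sketch of a U-Net-style decoder step: upsample the
# low-resolution features, then concatenate the encoder features of the
# same resolution (the "copied" skip connection) along channels.
rng = np.random.default_rng(0)
enc_feat = rng.standard_normal((16, 32, 32))    # encoder output at 1/2 resolution
bottleneck = rng.standard_normal((32, 16, 16))  # lowest-resolution representation

up = bottleneck.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbour upsample
assert up.shape == (32, 32, 32)

decoded = np.concatenate([up, enc_feat], axis=0)     # skip connection
assert decoded.shape == (48, 32, 32)
```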
The asymmetric upsample process is also widely stud-
ied. RefineNet [60] improves the combination of upsam-
pled representations and the representations of the same
resolution copied from the downsample process. Other
works include: light upsample process [5]; light down-
sample and heavy upsample processes [97], recombinator
networks [40]; improving skip connections with more or
complicated convolutional units [76, 125, 42], as well as
sending information from low-resolution skip connections
to high-resolution skip connections [133] or exchanging information between them [36]; studying the details of the upsample process [100]; combining multi-scale pyramid representations [16, 105]; and stacking multiple DeconvNets/U-Nets/Hourglasses [31, 101] with dense connections [93].
Maintaining high-resolution representations. High-
resolution representations are maintained through the whole
process, typically by a network that is formed by connecting
multi-resolution (from high-resolution to low-resolution)
parallel convolutions with repeated information exchange
across parallel convolutions. Representative works include
GridNet [30], convolutional neural fabrics [86], interlinked
CNNs [132], and the recently developed high-resolution network (HRNet) [91], which is of particular interest here.
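One exchange step across two parallel streams can be sketched in the spirit of these networks: the high-resolution stream is kept throughout, and at each exchange every stream fuses in a resampled copy of the other. A toy NumPy sketch under simplifying assumptions, not the actual code of [30, 86, 132, 91]:

```python
import numpy as np

# Hedged sketch of one information-exchange step between two parallel
# streams: high -> low via 2x2 average pooling, low -> high via
# nearest-neighbour upsampling, fused by addition.
rng = np.random.default_rng(0)
high = rng.standard_normal((16, 32, 32))  # high-resolution stream (kept throughout)
low = rng.standard_normal((16, 16, 16))   # parallel 1/2-resolution stream

c, h, w = high.shape
high_down = high.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))  # 2x2 avg pool
low_up = low.repeat(2, axis=1).repeat(2, axis=2)                     # upsample

# In the actual networks the streams have different widths and 1x1
# convolutions match the channel counts; equal widths keep the sketch short.
new_high = high + low_up
new_low = low + high_down
assert new_high.shape == high.shape and new_low.shape == low.shape
```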
The two early works, convolutional neural fabrics [86]