
Figure 1. A simple example of a high-resolution network. There are four stages. The 1st stage consists of high-resolution convolutions.
The 2nd (3rd, 4th) stage repeats two-resolution (three-resolution, four-resolution) blocks. Details are given in Section 3.
resolution representations. In semantic segmentation, the
proposed approach achieves state-of-the-art results on PAS-
CAL Context, Cityscapes, and LIP with similar model sizes
and lower computation complexity. In facial landmark de-
tection, our approach achieves overall best results on four
standard datasets: AFLW, COFW, 300W, and WFLW.
In addition, we construct a multi-level representation
from the high-resolution representation, and apply it to the
Faster R-CNN object detection framework and its extended
frameworks, Mask R-CNN [38] and Cascade R-CNN [9].
The results show that our method yields substantial detection performance improvements, with particularly dramatic gains for small objects. With single-scale training and testing, the proposed approach achieves better COCO object detection results than existing single-model methods.
2. Related Work
Strong high-resolution representations play an essential
role in pixel and region labeling problems, e.g., seman-
tic segmentation, human pose estimation, facial landmark
detection, and object detection. We review representation
learning techniques developed mainly in the semantic seg-
mentation, facial landmark detection [92, 50, 69, 104, 123,
94, 119] and object detection areas¹, from low-resolution
representation learning, high-resolution representation re-
covering, to high-resolution representation maintaining.
Learning low-resolution representations. The fully-
convolutional network (FCN) approaches [67, 87] com-
pute low-resolution representations by removing the fully-connected layers in a classification network, and estimate their coarse segmentation confidence maps. The estimated segmentation maps are improved by combining the
fine segmentation score maps estimated from intermediate
low-level medium-resolution representations [67], or iter-
ating the processes [50]. Similar techniques have also been
applied to edge detection, e.g., holistic edge detection [106].
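The FCN idea above can be sketched numerically: a stride-32 backbone leaves a coarse feature grid, a 1x1 convolution plays the role of the removed fully-connected classifier, and the resulting confidence maps are upsampled back to the input grid. The following is a minimal NumPy sketch under these assumptions, not code from [67, 87]; all sizes and names are illustrative.

```python
import numpy as np

# Hedged sketch: a backbone with total stride 32 produces a low-resolution
# feature map; a 1x1 convolution (a per-pixel linear map) over it yields
# coarse per-class confidence maps, then nearest-neighbour upsampling
# brings them back to the input resolution.
rng = np.random.default_rng(0)
H = W = 64                       # toy input image size
stride, channels, classes = 32, 8, 5

feat = rng.standard_normal((channels, H // stride, W // stride))
w_1x1 = rng.standard_normal((classes, channels))    # 1x1 conv weights

scores = np.einsum('oc,chw->ohw', w_1x1, feat)      # coarse confidence maps
assert scores.shape == (classes, 2, 2)              # 1/32 of the input per axis

seg = scores.repeat(stride, axis=1).repeat(stride, axis=2)  # upsample
assert seg.shape == (classes, H, W)
```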
The fully convolutional network is extended to a dilated version by replacing a few (typically two) strided convolutions and the associated convolutions with dilated convolutions, leading to medium-resolution representations [126, 13, 115, 12, 57]. The representations are further
augmented to multi-scale contextual representations [126, 13, 15] through feature pyramids for segmenting objects at multiple scales.
¹The techniques developed for human pose estimation are reviewed in [91].
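The convolution arithmetic behind the dilated variant can be checked directly: removing the last two stride-2 stages keeps the feature map at 1/8 resolution instead of 1/32, and dilating the affected 3x3 kernels preserves their spatial span. A small sketch under these assumptions (the stride pattern is the typical one, not taken from any specific network in [126, 13, 115, 12, 57]):

```python
# Hedged sketch of convolution arithmetic, not code from the cited works.

def output_size(n, strides):
    """Spatial size after a chain of convolutions with the given strides."""
    for s in strides:
        n = n // s
    return n

def effective_kernel(k, dilation):
    """Span covered by a k x k kernel with the given dilation rate."""
    return dilation * (k - 1) + 1

n = 512
assert output_size(n, [2, 2, 2, 2, 2]) == 16   # stride-32 classifier: 1/32
assert output_size(n, [2, 2, 2, 1, 1]) == 64   # dilated variant:      1/8

# Dilated 3x3 kernels (rates 2 and 4) keep the receptive-field growth
# that the removed stride-2 stages would have provided.
assert effective_kernel(3, 2) == 5
assert effective_kernel(3, 4) == 9
```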
Recovering high-resolution representations. An upsam-
ple subnetwork, like a decoder, is adopted to gradually
recover the high-resolution representations from the low-
resolution representations outputted by the downsample
process. The upsample subnetwork could be a symmetric version of the downsample subnetwork, with skip connections over some mirrored layers to transform the pooling indices, e.g., SegNet [2] and DeconvNet [74],
or copying the feature maps, e.g., U-Net [83] and Hour-
glass [72, 111, 7, 22, 6], encoder-decoder [77], FPN [62],
and so on. The full-resolution residual network [78] intro-
duces an extra full-resolution stream that carries informa-
tion at the full image resolution, to replace the skip connec-
tions, and each unit in the downsample and upsample sub-
networks receives information from and sends information
to the full-resolution stream.
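One decoder step of the U-Net flavor of this recovery scheme can be sketched as follows: the low-resolution representation is upsampled and concatenated with the same-resolution encoder feature map copied over the skip connection. A minimal NumPy sketch, with illustrative shapes not taken from [83] (SegNet-style variants instead reuse pooling indices rather than concatenating features):

```python
import numpy as np

# Hedged sketch of a U-Net-style decoder step: upsample the
# low-resolution features, then concatenate the encoder features of the
# same resolution (the "copied" skip connection) along channels.
rng = np.random.default_rng(0)
enc_feat = rng.standard_normal((16, 32, 32))    # encoder output at 1/2 resolution
bottleneck = rng.standard_normal((32, 16, 16))  # lowest-resolution representation

up = bottleneck.repeat(2, axis=1).repeat(2, axis=2)  # nearest-neighbour upsample
assert up.shape == (32, 32, 32)

decoded = np.concatenate([up, enc_feat], axis=0)     # skip connection
assert decoded.shape == (48, 32, 32)
```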
The asymmetric upsample process is also widely stud-
ied. RefineNet [60] improves the combination of upsam-
pled representations and the representations of the same
resolution copied from the downsample process. Other
works include: light upsample process [5]; light down-
sample and heavy upsample processes [97], recombinator
networks [40]; improving skip connections with more or
complicated convolutional units [76, 125, 42], as well as
sending information from low-resolution skip connections
to high-resolution skip connections [133] or exchanging information between them [36]; studying the details of the upsample process [100]; combining multi-scale pyramid representations [16, 105]; and stacking multiple DeconvNets/U-Nets/Hourglasses [31, 101] with dense connections [93].
Maintaining high-resolution representations. High-
resolution representations are maintained through the whole
process, typically by a network that is formed by connecting
multi-resolution (from high-resolution to low-resolution)
parallel convolutions with repeated information exchange
across parallel convolutions. Representative works include
GridNet [30], convolutional neural fabrics [86], interlinked
CNNs [132], and the recently developed high-resolution network (HRNet) [91], which is of particular interest here.
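One exchange step across two parallel streams can be sketched in the spirit of these networks: the high-resolution stream is kept throughout, and at each exchange every stream fuses in a resampled copy of the other. A toy NumPy sketch under simplifying assumptions, not the actual code of [30, 86, 132, 91]:

```python
import numpy as np

# Hedged sketch of one information-exchange step between two parallel
# streams: high -> low via 2x2 average pooling, low -> high via
# nearest-neighbour upsampling, fused by addition.
rng = np.random.default_rng(0)
high = rng.standard_normal((16, 32, 32))  # high-resolution stream (kept throughout)
low = rng.standard_normal((16, 16, 16))   # parallel 1/2-resolution stream

c, h, w = high.shape
high_down = high.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))  # 2x2 avg pool
low_up = low.repeat(2, axis=1).repeat(2, axis=2)                     # upsample

# In the actual networks the streams have different widths and 1x1
# convolutions match the channel counts; equal widths keep the sketch short.
new_high = high + low_up
new_low = low + high_down
assert new_high.shape == high.shape and new_low.shape == low.shape
```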
The two early works, convolutional neural fabrics [86]