0% found this document useful (0 votes)
61 views21 pages

MTP: Advancing Remote Sensing Foundation Model Via Multi-Task Pretraining

mtp

Uploaded by

pjnbe6
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views21 pages

MTP: Advancing Remote Sensing Foundation Model Via Multi-Task Pretraining

mtp

Uploaded by

pjnbe6
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.

8, AUGUST 2021 1

MTP: Advancing Remote Sensing Foundation


Model via Multi-Task Pretraining
Di Wang, Member, IEEE, Jing Zhang, Senior Member, IEEE, Minqiang Xu, Lin Liu, Dongsheng Wang,
Erzhong Gao, Chengxi Han, Student Member, IEEE, Haonan Guo, Student Member, IEEE, Bo Du, Senior
Member, IEEE, Dacheng Tao, Fellow, IEEE and Liangpei Zhang, Fellow, IEEE

Abstract—Foundation models have reshaped the landscape of Utilizing its inherent capability to automatically learn and
arXiv:2403.13430v1 [cs.CV] 20 Mar 2024

Remote Sensing (RS) by enhancing various image interpretation extract deep features from objects, deep learning methods have
tasks. Pretraining is an active research topic, encompassing su- found widespread application in the RS domain, particularly
pervised and self-supervised learning methods to initialize model
weights effectively. However, transferring the pretrained models for tasks such as scene classification, land use and land cover
to downstream tasks may encounter task discrepancy due to classification, and ship detection. Typically, ImageNet pre-
their formulation of pretraining as image classification or object trained weights are employed in training deep networks for RS
discrimination tasks. In this study, we explore the Multi-Task tasks due to their extensive representational ability. However,
Pretraining (MTP) paradigm for RS foundation models to ad- these weights are derived from pretraining models on natural
dress this issue. Using a shared encoder and task-specific decoder
architecture, we conduct multi-task supervised pretraining on the images, leading to domain gaps between natural images and
SAMRS dataset, encompassing semantic segmentation, instance RS images. For instance, RS images are captured from a
segmentation, and rotated object detection. MTP supports both bird’s-eye view, lack the vibrant colors of natural images, and
convolutional neural networks and vision transformer foundation possess lower spatial resolution. These disparities may impede
models with over 300 million parameters. The pretrained models the model’s finetuning performance [4], [5]. Moreover, relying
are finetuned on various RS downstream tasks, such as scene
classification, horizontal and rotated object detection, seman- solely on limited task-specific data for training restricts the
tic segmentation, and change detection. Extensive experiments model size and generalization capability of current RS deep
across 14 datasets demonstrate the superiority of our models over models due to the notorious overfitting issue.
existing ones of similar size and their competitive performance To tackle these challenges, the development of RS vision
compared to larger state-of-the-art models, thus validating the foundation models is imperative, which should excel in ex-
effectiveness of MTP. The codes and pretrained models will be
released at https://github.com/ViTAE-Transformer/MTP. tracting representative RS features. However, the RS domain
has long grappled with a scarcity of adequately large annotated
Index Terms—Remote sensing, Foundation model, Multi-task datasets, impeding related investigations. Until recently, the
pretraining, Scene classification, Semantic segmentation, Object
detection, Change detection. most expansive RS scene labeling datasets were fMoW [6]
and BigEarthNet [7], boasting 132,716 and 590,326 unique
scene instances [8], respectively — yet still falling short of
I. I NTRODUCTION benchmarks set by natural image datasets like ImageNet-
1K [9]. Long et al. [8] addressed this gap by introducing
R Emote sensing (RS) image is one of the most important
data resources for recording ground surfaces and land
objects. Precisely understanding RS images is beneficial to
MillionAID, a large-scale RS scene labeling dataset with a
closed sample capacity of 100,0848 compared to ImageNet-
many applications, including urban planning [1], environmen- 1K, igniting interest in supervised RS pretraining [5], [10].
tal survey [2], disaster assessment [3], etc. These studies show the feasibility of pretraining RS foundation
models on large-scale RS datasets. Nonetheless, supervised
D. Wang and B. Du are with the School of Computer Science, Wuhan Uni- pretraining of RS foundation models may not be the most
versity, Wuhan 430072, China, also with the Institute of Artificial Intelligence,
Wuhan University, Wuhan 430072, China, also with the National Engineering preferable choice due to the expertise and substantial time and
Research Center for Multimedia Software, Wuhan University, Wuhan 430072, labor costs associated with labeling RS images.
China, and also with the Hubei Key Laboratory of Multimedia and Network Constructing large-scale RS annotation datasets is chal-
Communication Engineering, Wuhan University, Wuhan 430072, China (e-
mail: [email protected]; [email protected]). (Corresponding author: lenging due to the high complexity and cost of labeling.
Minqiang Xu, Jing Zhang, Bo Du and Liangpei Zhang.) Despite this challenge, the advancement of earth observation
J. Zhang is with the School of Computer Science, Faculty of Engineering, technologies grants easy access to a vast amount of unlabeled
The University of Sydney, Australia (e-mail: [email protected]).
M. Xu, L. Liu, D. Wang and E. Gao are with the iFlytek Co., Ltd and RS images. Efficiently leveraging these unlabeled RS images
also with the National Engineering Research Center of Speech and Language is crucial for developing robust RS foundation models. In the
Information Processing, Hefei 230088, China (e-mail: [email protected]; realm of deep learning, unsupervised pretraining has emerged
[email protected]; [email protected]; [email protected]).
C. Han, H. Guo and L. Zhang are with the State Key Laboratory of as a promising approach for learning effective knowledge from
Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan massive unlabeled data [14]–[17]. Typically, unsupervised
University, Wuhan 430079, China (e-mail: [email protected]; hao- pretraining employs self-supervised learning (SSL) to learn
[email protected]; [email protected]).
D. Tao is with the School of Computer Science and Engineering, Nanyang effective feature representation. SSL encompasses two primary
Technological University, Singapore (e-mail: [email protected]). techniques: contrastive-based [18]–[20] and generative-based
2 JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021

markable performance across various RS tasks, a persistent


Natural RS Different
(a) images
Backbone
images
Backbone Backbone
RS tasks challenge remains: the task discrepancy between pretraining
and finetuning, which often dictates the effectiveness of mi-
grating pretrained models to downstream tasks. Research has
highlighted the impact of representation granularity mismatch
Natural/RS RS Semantic
(b) Backbone Backbone
images images
Backbone
segmentation between pretraining and finetuning tasks [5]. For instance,
Seghead Seghead
models pretrained on scene-level classification tasks perform
favorably when finetuned on similar tasks but falter on pixel-
level segmentation tasks. To address this issue, recent work
Natural/RS RS Different
(c) images
Backbone
images
Backbone Backbone
RS tasks [13] has explored the segmentation pretraining paradigm, as
Multi-Task head shown in Figure 1(b), yielding improved finetuning results.
This suggests that enhancing model representation capability
Fig. 1. Comparison of various pretraining methods. (a) [11], [12] sequentially
pretrains a foundational model on both natural and RS images. (b) [13]
through additional pretraining, particularly on tasks demanding
employs a two-stage pretraining strategy to initialize task-specific decoders finer representation granularity, such as pixel-level segmenta-
(e.g., segmentation) using existing foundational models pretrained on either tion, could be beneficial. Motivated by these findings, we ask:
natural or RS images, preserving the decoder during subsequent finetuning. We
extend (b) by incorporating multi-task decoders to enhance the representation
can we significantly enhance RS foundation models’ repre-
capacity of the foundational model, facilitating easy transferability across sentation ability through additional pretraining incorporating
diverse tasks during finetuning, as depicted in (c). multiple tasks with diverse representation granularity? To
this end, we investigate the Multi-Task Pretraining (MTP)
paradigm to bridge the gap between upstream and downstream
learning [21]–[23]. Contrastive learning aims to bring similar tasks and obtain more powerful RS foundation models, as
samples closer while maximizing distances between dissimilar shown in Figure 1(c). Importantly, MTP is designed to be
samples through the object discrimination pretext task. When applied to any existing pretraining models, irrespective of
applied to the RS domain, data characteristics like geographic whether trained on RS or natural images.
coordinates [24]–[26] and temporal information [27]–[29] are Implementing MTP to bridge upstream-downstream task
usually leveraged in formulating the pretext task. However, discrepancy necessitates the utilization of a similar or the same
designing these pretext tasks and gathering requisite data pretraining task as the downstream one, such as segmentation
can be inefficient, especially for training large-scale mod- pretraining (SEP) for RS segmentation tasks [13]. There-
els. Generative-based learning, exemplified by masked image fore, to cover the common task types in typical downstream
modeling (MIM), circumvents this challenge by enhancing applications, MTP tasks should encompass dense prediction
network representation through reconstructing masked regions. tasks like object detection and semantic segmentation. Hence,
Many RS studies leverage MIM initialization for its efficiency MTP requires a pretraining dataset with labels for these tasks,
[30]–[37]. Recent approaches have attempted to combine ideally, each sample encompassing all task labels. However,
contrastive-based and generative-based learning techniques to existing RS datasets often lack annotations for segmentation
pretrain more powerful models [38]–[40]. and rotated object detection. Fortunately, recent work [13]
However, existing research usually resorts to a single data introduces SAMRS, a large-scale segmentation dataset derived
source. For instance, [5], [30] utilize RGB aerial images from from existing RS rotated object detection datasets via the
MillionAID, while [31], [34] utilize Sentinel-2 multispectral Segment Anything Model (SAM) [47]. SAMRS provides both
images. Despite recent advancements in RS multimodal foun- detection and segmentation labels, facilitating MTP across RS
dation models [41]–[44], which are beginning to incorporate semantic segmentation, instance segmentation, and rotated ob-
more diverse imagery such as SAR, they still remain within ject detection tasks. Utilizing SAMRS, we demonstrate MTP’s
the realm of in-domain data, namely pretraining with RS data. efficacy in enhancing RS foundation models, including both
However, restricting pretraining solely to RS images may limit convolutional neural networks (CNN) and vision transformer
model capabilities since understanding RS objects requires foundation models with over 300 million parameters.
specialized knowledge [12]. Can RS foundation models benefit The main contributions of this paper are three-fold:
from incorporating information from other data sources? [5] 1) We address the discrepancy between upstream pretrain-
suggests that traditional ImageNet pretraining aids in learning ing and downstream finetuning tasks by introducing a
universal representations, whereas pretraining on RS data is stage-wise multi-task pretraining approach to enhance
particularly beneficial for recognizing RS-related categories. the RS foundation model.
To address this, [45] develop a teacher-student framework that 2) We utilize MTP to pretrain representative CNN and
integrates ImageNet supervised pretraining and RS unsuper- vision transformer foundation models with over 300M
vised pretraining simultaneously, while [46] employs repre- parameters on the SAMRS dataset, encompassing se-
sentations from ImageNet to enhance the learning process of mantic segmentation, instance segmentation, and rotated
MIM for improving RS foundation models. Additionally, [12] object detection tasks in a unified framework.
and [11] sequentially pretrain models on natural images and 3) Extensive experiments demonstrate that MTP signif-
RS images using contrastive SSL or MAE [23], respectively, icantly advances the representation capability of RS
as illustrated in Figure 1(a). foundation models, delivering remarkable performance
While previous RS foundation models have shown re- across various RS downstream tasks such as scene
WANG et al.: ADVANCING RS FOUNDATION MODEL VIA MTP 3

classification, semantic segmentation, object detection, to conduct multiple rounds of pretraining. Gururangan et al.
and change detection. [60] demonstrated that unsupervised pretraining on in-domain
The remainder of this paper is organized as follows. Sec- or task-specific data enhances model performance in natural
tion II introduces the existing works related to supervised, language processing (NLP) tasks. Building on this insight,
multi-stage, and multi-task RS pretraining. Section III presents Zhang et al. [11] devised a sequential pretraining approach,
the details of MTP, where the used SAMRS dataset and vision initially on ImageNet followed by the target RS dataset,
foundation models are also briefly introduced. Experimental employing MIM for pretraining. Similarly, [12] proposed a
results and corresponding analyses are depicted in Section IV. strategy inspired by human-like learning, first performing
Finally, Section V concludes this paper. contrastive SSL on natural images, then freezing shallow layer
weights and conducting SSL on an RS dataset. Contrary to
II. R ELATED W ORK [60], Dery et al. [61] introduced stronger end-task-aware train-
A. Supervised Pretraining for RS Foundation Model ing for NLP tasks by integrating auxiliary data and end-task
objectives into the learning process. Similarly, [13] introduced
Before the rise of SSL-based RS foundation models, re- additional segmentation pretraining using common segmenters
searchers have already delved into pretraining deep models (e.g., UperNet [62] and Mask2Former [63]) and the SAMRS
using labeled RS datasets. Tong et al. [48] pretrained an dataset, enhancing model accuracy in RS segmentation tasks.
ImageNet-pretrained ResNet-50 [49] using images from the Notably, our objective diverges from [13] in applying stage-
GID dataset [48] to derive pseudo-labels for precise land-cover wise pretraining. While [13] retains the segmentor after seg-
classification on high-resolution RS images. Recognizing the mentation pretraining to enhance segmentation performance,
challenge of labeling large-scale RS images, others sought we aim to enhance the representation capability of RS foun-
alternatives to RS annotation datasets. For instance, Li et dation models via stage-wise pretraining, preserving only the
al. [50] utilized the global land cover product Globeland30 backbone network after pretraining to facilitate transfer to
[51] as supervision for RS representation learning. They diverse RS downstream tasks.
adopted a mean-teacher framework to mitigate random noise
stemming from inconsistencies in imaging time and resolution C. Multi-Task Pretraining for RS Foundation Model
between RS images and geographical products. Moreover,
Applying multi-task learning to enhance the RS foundation
they incorporated additional geographical supervisions, such
model is an intuitive idea. Li et al. [64] introduced multi-
as change degree and spatial aggregation, to regularize the
task SSL representation learning, combining image inpainting,
pretraining process [52]. Long et al. [10] subsequently demon-
transform prediction, and contrast learning to boost semantic
strated the effectiveness of various CNN models (including
segmentation performance in RS images. However, it was
AlexNet [53], VGG-16 [54], GoogleNet [55], ResNet-101
limited to finetuning a pretrained model solely on semantic
[49], and DenseNet-121/169 [56]) pretrained from scratch
segmentation tasks, constrained by model size and pretraining
on the MillionAID dataset. Their models outperformed tra-
dataset capacity. The aspiration to consolidate multiple tasks
ditional ImageNet pretrained models in scene classification
into a single model has been a longstanding pursuit [15], [17],
tasks, indicating the potential of leveraging large-scale RS
[42], [58], [65]–[74], aligning with the original goals of the
datasets for pretraining. Later, Wang et al. [5] pretrained
foundation model exploration. Bastani et al. [59] devised a
typical CNN models and vision transformer models, including
multi-task model by integrating Swin-Base [57] with seven
Swin-T [57] and ViTAEv2 [58], all randomly initialized, on the
heads from existing networks (e.g., Faster-RCNN [75] and
MillionAID. They conducted a comprehensive empirical study
UNet [76]), facilitating training on the multi-task annotated
comparing finetuning performance using different pretraining
Satlas dataset. However, their approach lacked incorporation of
strategies (MillionAID vs. ImageNet) across four types of RS
typical RS rotated object tasks, focusing solely on transferring
downstream tasks: scene recognition, semantic segmentation,
the model to RS classification datasets. Inspired by these
rotated object detection, and change detection. Their results
pioneering efforts, this paper presents multi-task pretraining
demonstrated the superiority of vision transformer models over
of RS foundation models with over 300M parameters, encom-
CNNs on RS scenes and validated the feasibility of construct-
passing semantic segmentation, instance segmentation, and
ing RS foundation models via supervised pretraining on large-
rotated object detection tasks using the SAMRS dataset. After
scale RS datasets. Bastani et al. [59] introduced the larger
pretraining, the backbone network is further finetuned on
Satlas dataset for RS supervised pretraining. Very recently,
various RS downstream tasks.
SAMRS [13] introduced supervised semantic segmentation
pretraining to enhance model performance on the segmentation III. M ULTI -TASK P RETRAINING
task. Inspired by [13], this paper revisits the supervised We utilize semantic segmentation, instance segmentation,
learning approach by integrating it with existing pretraining and rotated object detection annotations from the SAMRS
strategies, such as ImageNet pretraining, and exploring multi- dataset for Multi-Task Pretraining (MTP). Advanced CNN and
task pretraining to construct distinct RS foundation models. vision transformer models serve as the backbone networks
to thoroughly investigate MTP. This section begins with an
B. Multi-Stage Pretraining for RS Foundation Model overview of the SAMRS dataset, followed by a brief intro-
Given the domain gap between RS images and natural duction to the selected models. Subsequently, we present the
images or between various RS modalities, it is reasonable MTP framework and implementation details.
4 JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021

Multi-Task Pretraining Multi-Task Label


RS object
Feature pyramid Feed forward Label transformation
Natural/RS Back propagation detection
images 1/32x dataset
Rotated 𝓛𝒓𝒐𝒅 R-Det. boxes
object
1/16x detector

𝓛𝒊𝒏𝒔
1/8x Instance Ins. boxes Ins. masks
segmentor

Pretrained model
1/4x 𝓛𝒔𝒆𝒎
ViT-B + RVSA
ViT-L + RVSA Semantic
Semseg. labels
SAMRS
segmentor
InternImage-XL

Transferring to Various RS Tasks

Horizontal Rotated
Different Scene Semantic Change
Backbone
task head
object object
classification segmentation detection
detection detection

Fig. 2. The overall pipeline of MTP. Inside MTP, the feature pyramid from the backbone network is fed into multiple decoders for various tasks, including
rotated object detection, instance segmentation, and semantic segmentation. These tasks are supervised by diverse labels in the SAMRS dataset. Following
MTP, the pretrained model is transferred to different RS tasks for finetuning.

A. SAMRS Dataset 1) RVSA: This model is specially designed for RS images.


SAMRS (Segment Anything Model annotated Remote Considering the various orientations of RS objects caused by
Sensing Segmentation dataset) [13] is a large-scale RS seg- the bird’s-eye view, this model extends the varied-size window
mentation database, comprising 105,090 images and 1,668,241 attention in [82] by additionally introducing a learnable angle
instances from three datasets: SOTA, SIOR, and FAST. These factor, offering windows that can freely zoom, translate, and
datasets are derived from existing large-scale RS object de- rotate. RVSA is used to replace the original full attention in
tection datasets, namely DOTA-V2.0 [77], DIOR [78], and original vision transformers. To achieve a trade-off between
FAIR1M-2.0 [79], through transforming the bounding box accuracy and efficiency, following [83], only the full attention
annotations using the SAM [47]. SAMRS inherits the cate- in 1/4 depth layer is preserved. In the original paper [30],
gories directly from the original detection datasets, resulting RVSA is separately used on ViT [84] and ViTAE [71], whereas
in a capacity exceeding that of most existing RS segmentation ViTAE is a CNN-Transformer hybrid model. In this paper,
datasets by more than tenfold (e.g., ISPRS Potsdam1 and we employ the ViT-based version to investigate the impact of
LoveDA [80]). The image sizes for the three sets are 1,024 MTP on a plain vision transformer. In addition, the RVSA
× 1,024, 800 × 800, and 600 × 600, respectively. Despite model in the original paper is limited to the base version of
being primarily intended for large-scale pretraining exploration vision transformers, i.e., ViT-B + RVSA. To pretrain larger
rather than benchmarking due to its automatically generated models, we further apply RVSA to ViT-Large, obtaining ViT-L
labels, SAMRS naturally supports instance segmentation and + RVSA. Their detailed configurations are presented in Table I.
object detection. This versatility extends its utility to investi-
2) InternImage: This model integrates the strengths of
gating large-scale multi-task pretraining.
recent vision transformers and large kernels into CNNs via
dynamic sparse kernels, combining long-range context cap-
B. Backbone Network
ture, adaptive spatial information aggregation, and efficient
In this research, we adopt RVSA [30] and InternImage [81] computation. It extends deformable convolution [85], [86]
as the representative vision transformer-based and CNN-based with depth-wise and multi-head mechanisms and incorporates
foundation models. modern transformer designs such as layer normalization [87],
1 https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem- feed-forward networks [88], and GELU activation [89]. We
labelpotsdam.aspx evaluate its performance on diverse RS downstream tasks,
WANG et al.: ADVANCING RS FOUNDATION MODEL VIA MTP 5

TABLE I TABLE II
D ETAILED CONFIGURATIONS OF DIFFERENT RVSA MODELS . T HE TRAINING COSTS OF IMPLEMENTING MTP USING DIFFERENT
MODELS .
Backbone ViT-B + RVSA ViT-L + RVSA
Depth 12 24 Backbone #Param.(M) #GPU Time (days)
Embedding Dim 768 1024 ViT-B + RVSA 86 16 3.0
Head 12 16 ViT-L + RVSA 305 32 6.3
Full attention Index [3, 6, 9, 12] [6, 12, 18, 24] InternImage-XL 335 32 6.3
Feature Pyramid Index [4, 6, 8, 12] [8, 12, 16, 24]

as the dataloader, model structure, loss function, and metric


showcasing its potential beyond its initial design for natural calculator, into a unified pipeline, to realize the MTP.
images. Furthermore, this choice facilitates investigating the The pretraining is conducted on NVIDIA V100 GPUs. All
impact of MTP on CNN-based models. Here, we employ the models are trained for 80K iterations using the AdamW opti-
XL version to match the model size of ViT-L + RVSA. mizer [92]. The base learning rates of RVSA and InternImage
are 0.00006 and 0.00002, respectively, with a weight decay of
C. Implementation of Multi-Task Pretraining 0.05. We adopt an iteration-wise cosine annealing scheduler to
We examine MTP using three models: ViT-B + RVSA, adjust the learning rate. The layer decay rates of RVSA models
ViT-L + RVSA, and InternImage-XL. As the original RVSA and InternImage are 0.9 and 0.94, following original papers
research [30] focuses solely on the base version, we inde- [30], [81]. For ViT-B + RVSA, the batch size and input image
pendently pretrain ViT-L on MillionAID similar to ViT-B + size are set to 48 and 224, which are doubled for training larger
RVSA. These pretrained weights will be publicly accessible. models. Table II lists the training costs of implementing MTP
Figure 2 shows the overall pipeline of MTP. Technically, we using different models.
further train the pretrained model on the SAMRS dataset,
encompassing various annotations such as semantic segmenta- IV. E XPERIMENTS
tion, instance segmentation, and rotated object detection tasks In this section, we thoroughly evaluate MTP’s performance
concurrently. We employ well-established classical networks, by finetuning pretrained models across four classical RS tasks:
including UperNet [62], Mask-RCNN [90], and Oriented- scene classification, object detection, semantic segmentation,
RCNN [91], as segmentors or detectors. These networks utilize and change detection. We also investigate the characteris-
feature pyramids and are supervised with different labels. To tics of MTP-based RS foundation models, examining the
illustrate this process, we depict the label transformation when relationships between adopted datasets, hyperparameters, and
generating SAMRS. Initially, rotated detection boxes (R-Det. finetuning performances, measuring accuracy variations with
boxes) are transformed into binary masks using SAM, serving reduced training samples, and visualizing the predicted results.
as instance-level mask annotations. Subsequently, the mini-
mum circumscribed horizontal rectangle of the binary mask A. Scene Classification
is derived as instance-level box annotations, with categories
We first evaluate the pretrained models on the scene classifi-
inherited from rotated boxes. These instance-level annotations
cation task. It does not need any extra decoder and can reflect
are utilized for instance segmentation. Semantic segmentation
the overall representation capability of the pretrained model.
labels are then obtained by assigning rotated box categories to
1) Dataset: We adopt two classical datasets: EuroSAT [93]
the masks. The losses stemming from these labels are Lrod ,
and RESISC-45 [94] for scene classification.
Lins , and Lsem , employed for the respective tasks. Notably,
the instance segmentation loss comprises two components: the 1) EuroSAT: This dataset is captured by Sentinel-2 from
box annotation loss Lins b and the binary mask loss Lins m . Europe for land use and land cover classification. It has
The overall loss for MTP is: 10 classes, a total of 27,000 images with a resolution
of 64 × 64. We adopt the public train/val split [95] by
L = Lrod + Lins b + Lins m + Lsem . (1) following [29], [31].
2) RESISC-45: This is a commonly-used dataset. It con-
Since SAMRS contains three sets, we have
tains 31,500 images in a size of 256 × 256 across 45
3
X categories, where each category possesses 700 samples.
L= Lirod + Liins b + Liins m + Lisem , (2) Following [5], [30], [32], [42], [45], we randomly select
i=1
20% of the data for training and 80% of the data for
where i indexes the three sub-sets: SOTA, SIOR, and FAST. testing.
The other settings of the loss follow the original papers [62], 2) Implementation Details: In the implementation, all mod-
[90], [91]. In practice, we implement the overall framework els are trained with a batch size of 64. The training epochs for
based on MMSegmentation2 , MMDetection3 , and MMRotate4 . EuroSAT and RESISC-45 are set to 100 and 200, respectively.
However, all these packages only support a single task. So The AdamW optimizer is used, where the base learning rate
we integrate the key components from these packages, such for RVSA and InterImage are 0.00006 and 0.00002, respec-
2 https://github.com/open-mmlab/mmsegmentation tively, with a weight decay of 0.05. In the first 5 epochs, we
3 https://github.com/open-mmlab/mmdetection adopt a linear warming-up strategy, where the initial learning
4 https://github.com/open-mmlab/mmrotate rate is set to 0.000001. Then, the learning rate is controlled
6 JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021

TABLE III TABLE IV


T HE OA(%) OF DIFFERENT MODEL PRETRAINING STRATEGIES ON T HE OA (%) OF FINETUNING DIFFERENT PRETRAINED MODELS ON
E URO SAT. E URO SAT AND RESISC-45 DATASETS . H ERE IMP MEANS PRETRAINING
ON I MAGE N ET-22K. BOLD INDICATES THE BETTER ACCURACY OF
Model MAE MTP OA PRETRAINED MODELS WITH OR WITHOUT MTP. * REPRESENTS THE FIRST
ViT-B + RVSA ✔ 98.54 THREE METHODS AMONG ALL COMPARISON METHODS , WHERE THE
ViT-B + RVSA ✔ 98.24 FIRST, SECOND , AND THIRD PLACES ARE EMPHASIZED BY BLACK , RED
ViT-B + RVSA ✔ ✔ 98.76 AND GREEN COLORS , RESPECTIVELY.

Method Model EuroSAT RESISC-45


GASSL [27] ResNet-18 89.51 -
by the cosine annealing scheduler. The layer decay rates are SeCo [29] ResNet-18 93.14 -
0.9 and 0.94 for RVSA and InternImage models, respectively. SatMAE [31] ViT-L 95.74 94.10
SwiMDiff [97] ResNet-18 96.10 -
For classification, a global pooling layer and a linear head GASSL [27] ResNet-50 96.38 93.06
are used after the backbone network. To avoid overfitting, we GeRSP [45] ResNet-50 - 92.74
adopt multiple data augmentations, including random resized SeCo [29] ResNet-50 97.34 92.91
CACo [28] ResNet-50 97.77 91.94
cropping, random flipping, RandAugment [96], and random TOV [12] ResNet-50 - 93.79
erasing. Since the original image size of EuroSAT is too small, RSP [5] ViTAEv2-S - 95.60
before feeding into the network, we resize the image to 224 RingMo [32] Swin-B - 95.67
SatLas [59] Swin-B - 94.70
× 224. The overall accuracy (OA) is used as the evaluation CMID [38] Swin-B - 95.53
criterion. All experiments are implemented by MMPretrain5 . GFM [46] Swin-B - 94.64
3) Ablation Study of Stage-wise Pretraining: As aforemen- CSPT [11] ViT-L - 95.62
Usat [98] ViT-L 98.37
tioned, MTP is implemented based on existing pretraining Scale-MAE [36] ViT-L 98.59 95.04
models since it tries to address the task-level discrepancy. CtxMIM [37] Swin-B 98.69 -
So an interesting question naturally arises: what about con- SatMAE++ [99] ViT-L 99.04 -
SpectralGPT+ [34] ViT-B 99.21* -
ducting MTP from scratch? To this end, we experiment by SkySense [42] Swin-L - 95.92*
exploring different pretraining strategies using ViT-B + RVSA SkySense [42] Swin-H - 96.32*
on EuroSAT, and the results are shown in Table III. It can MAE ViT-B + RVSA 98.54 95.49
MAE + MTP ViT-B + RVSA 98.76 95.57
be seen that, without using pretrained weights, MTP cannot MAE ViT-L + RVSA 98.56 95.46
achieve the ideal performance and even performs worse than MAE + MTP ViT-L + RVSA 98.78 95.88
MAE pretraining. These results demonstrate the importance of IMP InternImage-XL 99.30* 95.82
IMP + MTP InternImage-XL 99.24* 96.27*
performing stage-wise pretraining.
4) Finetuning Results and Analyses: Table IV shows the
finetuning results. It can be seen that MTP can improve exist- 1) Dataset: We use two public datasets Xview [100] and
ing foundation models on scene classification tasks, especially DIOR [78] for evaluation. Here are the details:
for the RVSA series. It helps the model achieve state-of- 1) Xview: This dataset is from the DIUx xView 2018 De-
the-art performances compared to other pretraining models tection Challenge [100]. It collects Worldview-3 satellite
that have comparable sizes. With the help of MTP, on the imagery beyond 1,400 km2 in a ground resolution
RESISC-45 dataset, InterImage-XL surpasses Swin-L-based of 0.3m, involving 60 classes over 1 million object
SkySense [42], which is pretrained on a tremendously large instances. Due to only the 846 images (beyond 2,000 ×
dataset that has more than 20 million multimodal RS image 2,000 pixels) in the training set are available, following
triplets involving RGB high-resolution images and multi- [27], [37], we randomly select 700 images as the training
temporal multispectral and SAR sequences. MTP boosts the set and 146 images for testing.
performance of InterImage-XL close to the Swin-H-based 2) DIOR: This dataset consists of 23,463 images with
SkySense (96.27 v.s. 96.32), which has more parameters. We resolutions ranging from 0.5 to 30m, including 192,472
also notice the accuracy of IMP-InterImage-XL is decreased instances. The images have been clipped to 800 × 800
marginally in EuroSAT after MTP. We will investigate this for the convenience of model training and testing. It
phenomenon later. Nevertheless, the obtained model still out- involves 20 common object categories. The training set,
performs SpectralGPT+ , which is pretrained with 1 million validation set, and testing set contain 5862, 5863, and
multispectral images, where each sample can be regarded as 11738 samples, respectively. In this paper, we jointly use
containing multiple groups of tri-spectral images, similar to the training set and the validation set to finetune models
RGB channels. and conduct the evaluation on the testing set.
2) Implementation Details: For Xview, we train a Reti-
naNet [101] by following [27], [37] with the pretrained model
B. Horizontal Object Detection
for 12 epochs, with a batch size of 8. While Faster-RCNN [75]
After completing the scene-level task of recognition, we fo- is adopted when finetuning on DIOR with the same settings
cus on the object-level tasks, i.e., horizontal and rotated object except for a batch size of 4. We also apply a linear warming-
detection. Here, we first consider the horizontal detection task. up strategy with an initial learning rate of 0.000001 at the
beginning of 500 iterations. We keep the same layer decay
5 https://github.com/open-mmlab/mmpretrain rates as the scene classification task. The basic learning rate,
WANG et al.: ADVANCING RS FOUNDATION MODEL VIA MTP 7

TABLE V natural scene object detection because of special overhead


T HE AP50 (%) OF FINETUNING DIFFERENT PRETRAINED MODELS WITH views. This is also one of the motivations to implement MTP
R ETINA N ET ON X VIEW AND DIOR DATASETS . T HE “S UP. L EA . W IN1K”
MEANS SUPERVISED LEARNING WITH I MAGE N ET-1K. R ANDOM I NIT. using SAMRS.
MEANS THE BACKBONE NETWORK IS RANDOMLY INITIALIZED . 1) Dataset: We adopt the four most commonly used
Method Backbone Xview DIOR datasets for this task: DIOR-R [103], FAIR1M-2.0 [79],
Random Init. ResNet-50 10.8 - DOTA-V1.0 [102] and DOTA-V2.0 [77].
Sup. Lea. w IN1K ResNet-50 14.4 -
Sup. Lea. w IN1K Swin-B 16.3 - 1) DIOR-R: This is the extended oriented bounding box
GASSL [27] ResNet-50 17.7 67.40 version of DIOR [78]. It has 23,463 images and 192,158
SeCo [29] ResNet-50 17.2 -
CACO [28] ResNet-50 17.2 66.91
instances over 20 classes. Each image in this dataset has
CtxMIM [37] Swin-B 18.8* - been cropped into 800 × 800. Following [30], [35], [42],
TOV [12] ResNet-50 - 70.16 we merge the original training and validation sets for
SATMAE [31] ViT-L - 70.89
CSPT [11] ViT-L - 71.70
training, while the testing set is used for evaluation.
GeRSP [45] ResNet-50 - 72.20 2) FAIR1M-2.0: This is a large-scale RS benchmark
GFM [46] Swin-B - 72.84 dataset, including more than 40,000 images and 1
Scale-MAE [36] ViT-L - 73.81
SatLas [59] Swin-B - 74.10
million instances for fine-grained object detection. It
CMID [38] Swin-B - 75.11 collects samples with resolutions ranging from 0.3 to
RingMo [32] Swin-B - 75.90 0.8m and image sizes ranging from 1,000 × 1,000 to
SkySense [42] Swin-H - 78.73*
10,000 × 10,000 from various sensors and platforms. It
MAE ViT-B + RVSA 14.6 75.80
MAE + MTP ViT-B + RVSA 16.4 79.40* contains 37 subcategories belonging to 5 classes: Ship,
MAE ViT-L + RVSA 15.0 78.30 Vehicle, Airplane, Court, and Road. In this paper, we use
MAE + MTP ViT-L + RVSA 19.4* 81.10* the more challenging version of 2.0, which additionally
IMP InternImage-XL 17.0 77.10
IMP + MTP InternImage-XL 18.2* 78.00 incorporates the 2021 Gaofen Challenge dataset. The
training and validation sets are together adopted for
training.
optimizer, and scheduler are the same as [30]. Before input into 3) DOTA-V1.0: This is the most popular dataset for RS
the network, the large images are uniformly clipped to 416 rotated object detection. It comprises 2,806 images
× 416 pixels. The data augmentation only includes random spanning from 800 × 800 to 4,000 × 4,000 ×, where
flipping with a probability of 0.5. We use MMDetection to 188,282 instances from 15 typical categories are pre-
implement the finetuning, where the AP50 is used as the sented. We adopt classical train/test split, that is, the
evaluation metric for the comparison of different models. original training and validation sets are together for
3) Finetuning Results and Analyses: The experimental re- training, while the original testing set is used for evalu-
sults are shown in Table V. We can find that the MTP enhances ation.
the performance of all pretrained models, especially for ViT-L 4) DOTA-V2.0: The is the enhanced version of DOTA
+ RVSA. On Xview, the performance of MAE pretrained ViT- V1.0. By additionally collecting larger images, adding
L + RVSA is not as good as InterImage-XL, even worse than new categories, and annotating tiny instances, it finally
the smaller ResNet-50-based models. After utilizing MTP, the contains 11,268 images, 1,793,658 instances, and 18
performance of ViT-L + RVSA has been greatly improved. It categories. We use the combination of training and
outperforms CtxMIM [37] and achieves the best. On DIOR, validation sets for training, while the test-dev set is used
with the help of MTP, ViT-B + RVSA has outperformed all for evaluation.
existing methods, including the recently distinguished method
2) Implementation Details: Since the large-size image is
SkySense [42] that employs a larger model. In addition, MTP
not suitable for training, we first perform data cropping. For
also greatly enhances ViT-L + RVSA, setting a new state-of-
DOTA-V2.0, we adopt single-scale training and testing by
the-art. Here, we emphasize that despite the pretraining dataset
following [104], where the images are cropped to patches in
SAMRS includes the samples of DIOR [78]. To avoid unfair
size of 1,024 × 1,024 with an overlap of 200. For DOTA-V1.0
comparison, following [13], the images of the testing set in
and FAIR1M-2.0, we implement the multiscale training and
DIOR have not been used for MTP. This rule also applies
testing, i.e., the original images are scaled with three ratios:
to other datasets that form the SAMRS, involving DOTA-
(0.5, 1.0, 1.5). Then, the DOTA-V1.0 images are cropped to
V1.0 [102], DOTA-V2.0 [77], DIOR-R [103] and FAIR1M-
1,024 × 1,024 patches but with an overlap of 500, while
2.0 [79]. It should also be noted that the RVSA model is
FAIR1M-2.0 images adopt a patch size of 800 and an overlap
initially proposed by considering the diverse orientations of RS
of 400. The batch sizes are set to 4, 16, 4, and 4 for the
objects, which are related to the rotated object detection task.
DIOR-R, FAIR1M, DOTA-V1.0, and DOTA-V2.0 datasets,
Nevertheless, the models after MTP demonstrate an excellent
respectively. The other settings during training are the same as
capability in detecting horizontal boxes.
horizontal object detection. We adopt the Oriented-RCNN net-
work implemented in MMRotate. During training, input data
C. Rotated Object Detection is augmented by random flipping and random rotation. The
We then investigate the impact of MTP on the rotated object mean average precision (mAP) is adopted as the evaluation
detection task, which is a typical RS task distinguished from metric.
8 JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021

TABLE VI
T HE M AP (%) OF FINETUNING DIFFERENT PRETRAINED MODELS ON THE DIOR-R, FAIR1M-2.0, DOTA-V1.0, AND DOTA-V2.0 DATASETS . MS
INDICATES WHETHER THE ACCURACY ON DOTA-V1.0 IS OBTAINED FROM THE MULTI - SCALE TRAINING AND TESTING . †: T HE FEATURE PYRAMID IS
FORMED BY UPSAMPLING AND DOWNSAMPLING THE LAST LAYER FEATURE OF THE BACKBONE NETWORK BY FOLLOWING THE STRATEGY OF V I TD ET
[83].

Method Backbone MS DIOR-R FAIR1M-2.0 DOTA-V1.0 DOTA-V2.0


RetinaNet-OBB [101] ResNet-50 - 57.55 - - 46.68
Faster RCNN-OBB [75] ResNet-50 ✘ 59.54 - 69.36 47.31
FCOS-OBB [105] ResNet-50 - - - - 48.51
ATSS-OBB [106] ResNet-50 ✘ - - 72.29 49.57
SCRDet [107] ResNet-101 ✘ - - 72.61 -
Gilding Vertex [108] ResNet-50 ✘ 60.06 - 75.02 -
ROI Transformer [109] ResNet-50 ✔ 63.87 - 74.61 52.81
CACo [28] ResNet-50 - 64.10 47.83 - -
RingMo [32] Swin-B - - 46.21 - -
R3 Det [110] ResNet-152 ✔ - - 76.47 -
SASM [111] ResNeXt-101 ✔ - - 79.17 44.53
AO2-DETR [112] ResNet-50 ✔ - - 79.22 -
S2 ANet [113] ResNet-50 ✔ - - 79.42 49.86
ReDet [114] ReResNet-50 ✔ - 80.10
R3 Det-KLD [115] ResNet-50 ✔ - - 80.17 47.26
R3 Det-GWD [116] ResNet-152 ✔ - - 80.23 -
R3 Det-DEA [117] ReResNet-50 ✔ - - 80.37 -
AOPG [103] ResNet-50 ✔ 64.41 - 80.66 -
DODet [118] ResNet-50 ✔ 65.10 - 80.62 -
PP-YOLOE-R [119] CRN-x [120] ✔ 80.73
GASSL [27] ResNet-50 - 65.65 48.15 - -
SatMAE [31] ViT-L - 65.66 46.55 - -
TOV [12] ResNet-50 - 66.33 49.62 - -
Oriented RepPoints [121] ResNet-50 ✘ 66.71 - 75.97 48.95
GGHL [122] DarkNet-53 [123] ✘ 66.48 - 76.95 57.17
CMID [38] Swin-B ✘ 66.37 50.58 77.36 -
RSP [5] ViTAEv2-S ✘ - - 77.72 -
Scale-MAE [36] ViT-L - 66.47 48.31 - -
SatLas [59] Swin-B - 67.59 46.19 - -
GFM [46] Swin-B - 67.67 49.69 - -
Oriented RCNN [91] ResNet-50 ✔ - - 80.87 53.28
R3 Det-KFIoU [124] ResNet-152 ✔ - - 81.03 -
RTMDet-R [125] RTMDet-R-l ✔ - - 81.33 -
DCFL [104] ReResNet-101 [114] ✘ 71.03 - - 57.66
SMLFR [126] ConvNeXt-L [127] ✘ 72.33 - 79.33 -
ARC [128] ARC-R50 ✔ - - 81.77* -
LSKNet [129] LSKNet-S ✔ - - 81.85* -
STD [130] ViT-B ✔ - - 81.66 -
STD [130] HiViT-B [131] ✔ - - 82.24* -
BillionFM [35] ViT-G12X4 - 73.62* - - 58.69*
SkySense [42] Swin-H - 74.27* 54.57* - -
RVSA [30] † ViT-B + RVSA ✔ 70.67 - 81.01 -
MAE ViT-B + RVSA ✔ 68.06 51.56 80.83 55.22
MAE + MTP ViT-B + RVSA ✔ 71.29 51.92 80.67 56.08
MAE ViT-L + RVSA ✔ 70.54 53.20* 81.43 58.96*
MAE + MTP ViT-L + RVSA ✔ 74.54* 53.00* 81.66 58.41*
IMP InternImage-XL ✔ 71.14 50.67 80.24 54.85
IMP + MTP InternImage-XL ✔ 72.17 50.93 80.77 55.13

3) Finetuning Results and Analyses: Table VI shows the capability as SkySense, although it has over 600M parameters
finetuning results. Except for DIOR-R, we find the MTP pre- and utilizes 20 million images for pretraining. We also notice
trained models cannot always demonstrate obvious advantages the performances of our models still have gaps compared with
compared to their counterparts. Since the volumes of FAIR1M- the current advanced method STD [130] on DOTA-V1.0. It
2.0, DOTA-V1.0, and DOTA-V2.0 are much larger than DIOR- may be attributed to the adopted classical detector Oriented-
R, we speculate that after long-time finetuning, the benefit of RCNN [91], which limits the detection performance.
MTP becomes diminished. We will further explore this issue
in later sections. Nevertheless, owing to the excellent structure, D. Semantic Segmentation
RVSA-L outperforms the ViT-G-based foundation model [35] We further consider finetuning the pretrained models on the
with over 1 billion parameters on DOTA-V2.0. Compared to finer pixel-level tasks, e.g., the semantic segmentation task. It
the powerful SkySense model [42], our models achieve better is one of the most important RS applications for the extraction
performance on the DIOR-R. While on FAIR1M-2.0, except and recognition of RS objects and land covers.
SkySense, our models surpass all other methods by a large 1) Dataset: We separately take into account both single-
margin. Generally, our models have comparable representation class geospatial target extraction and multi-class surface el-
WANG et al.: ADVANCING RS FOUNDATION MODEL VIA MTP 9

ement perception through two RS semantic segmentation TABLE VII


datasets: SpaceNetv1 [132] and LoveDA [80]. Here are their T HE M IOU (%) OF FINETUNING DIFFERENT PRETRAINED MODELS WITH
U PER N ET ON THE S PACE N ETV 1 AND L OVE DA DATASETS .
details:
1) SpaceNetv1: This dataset is provided by SpaceNet Chal- Method Backbone SpaceNetv1 LoveDA
PSANet [133] ResNet-50 75.61 -
lenge [132] for extracting building footprints. It is made SeCo [29] ResNet-50 77.09 43.63
up of the DigitalGlobe WorldView-2 satellite imagery GASSL [27] ResNet-50 78.51 48.76
with a ground sample distance of 0.5m photoed during SatMAE [31] ViT-L 78.07 -
CACo [28] ResNet-50 77.94 48.89
2011-2014 over Rio de Janeiro. It covers about 2,544 PSPNet [134] ResNet-50 - 48.31
km2 , including 382,534 building instances. Since only DeeplabV3+ [135] ResNet-50 - 48.31
the 6,940 images in the original training set are available, FarSeg [136] ResNet-50 - 48.15
FactSeg [137] ResNet-50 - 48.94
following [31], [37], we randomly split these images into TOV [12] ResNet-50 - 49.70
two parts, where the first part containing 5,000 images HRNet [138] HRNet-W32 - 49.79
being used as the training set, and another part will be GeRSP [45] ResNet-50 - 50.56
DCFAM [139] Swin-T - 50.60
used for testing. UNetFormer [140] ResNet-18 - 52.40
2) LoveDA: This is a challenging dataset involving both RSSFormer [141] RSS-B - 52.43
urban and rural scenes. It collects 0.3m spaceborne UperNet [62] ViTAE-B + RVSA [30] - 52.44
Hi-ResNet [142] Hi-ResNet - 52.50
imagery from Google Earth, where the images were RSP [5] ViTAEv2-S - 53.02
obtained in July 2016, covering 536.15km2 of Nanjing, SMLFR [126] ConvNext-L - 53.03
Changzhou, and Wuhan. It has 5,987 images in size of LSKNet [129] LSKNet-S - 54.00
CtxMIM [37] Swin-B 79.47 -
1,024 × 1,024, involving seven types of common land AerialFormer [143] Swin-B - 54.10
covers. We merge the official training and validation sets BillionFM [35] ViT-G12X4 - 54.40*
for training and conduct evaluation using the official MAE ViT-B + RVSA 79.56* 51.95
MAE + MTP ViT-B + RVSA 79.63* 52.39
testing set. MAE ViT-L + RVSA 79.69* 53.72
2) Implementation Details: Except that the models are MAE + MTP ViT-L + RVSA 79.54 54.17*
trained with 80K iterations with a batch size of 8, and the IMP InternImage-XL 79.08 53.93*
IMP + MTP InternImage-XL 79.16 54.17*
warming up stage in the parameter scheduler lasts 1,500
iterations, most of the optimization settings are similar to the
scene classification section. We use the UperNet [62] as the 1) Dataset: We conduct the finetuning on the datasets of
segmentation framework, where the input image sizes during different scales: Onera Satellite Change Detection Dataset
training are 384 × 384 and 512 × 512 for SpaceNetv1 and (OSCD) [186], Wuhan University Building Change Detection
LoveDA, respectively, through random scaling and cropping. Dataset (WHU) [187], the Learning, Vision, and Remote Sens-
We also adopt random flipping for data augmentation. All ing Change Detection Dataset (LEVIR) [188], and the Season-
experiments are implemented by MMSegmentation, where the Varying Change Detection Dataset (SVCD) [189], which is
mean value of the intersection over union (mIOU) is adopted also called “CDD”.
as the evaluation metric.
3) Finetuning Results and Analyses: The results presented 1) OSCD: This is a small-scale dataset. It contains 24 pairs
in Table VII demonstrate that MTP is also useful for enhancing of Sentinel-2 multispectral images involving all bands
the models’ performance on semantic segmentation tasks. and in an average size of 600 × 600. These images are
Compared to SpaceNetv1, the improvements on the classical obtained during 2015-2018 to record urban changes. We
land cover classification dataset: LoveDA, are even more follow the same train/val split as [186], where training
significant. As a result, on this dataset, our models surpass all and validation sets include 14 and 10 pairs, respectively.
previous methods except the BillionFM [35], which utilizes a 2) WHU: This dataset is used for detecting building
model with over 1 billion parameters. On the SpaceNetv1, changes in a single view. It contains two large-scale
our models set new state-of-the-art accuracy. Nonetheless, images with a ground resolution of 0.3m and in size of
probably due to overfitting, the results of SpaceNetv1 also 32,507 × 15,354. They are collected in 2012 and 2016,
indicate that the performances on simple extraction tasks containing 12,796 and 16,077 instances, respectively.
do not improve as increasing model capacity. We have also Since there is no official data split, the 70%, 10%,
noticed the performance of ViT-L + RVSA on SpaceNetv1 and 20% patches of the cropped images are randomly
is decreased when adopting MTP. We will conduct further selected as training, validation, and testing sets as sug-
exploration in later sections. gested by [182].
3) LEVIR: This dataset contains 637 pairs of 1,024 ×
1,024 images with a spatial resolution of 0.5m. These
E. Change Detection images are acquired between 2002 and 2018 from 20
Finally, we pay attention to the change detection task, which different regions in Texas, USA. It contains 31,333
can be regarded as a special type of segmentation by extracting change instances. We adopt the official split, where
the changed area between the RS images taken at different training, validation, and testing sets contain 445, 64, and
times in the same location. Here, we mainly consider the most 128 pairs, respectively.
representative bi-temporal change detection. 4) SVCD/CDD: This dataset focuses on seasonal varia-
10 JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021

TABLE VIII
T HE F1 SCORE (%) OF FINETUNING DIFFERENT PRETRAINED MODELS WITH UN ET ON THE OSCD, WHU, LEVIR, AND SVCD/CDD DATASETS .

Method Backbone OSCD WHU LEVIR SVCD/CDD


GASSL [27] ResNet-50 46.26 - - -
SeCo [29] ResNet-50 47.67 - 90.14 -
FC-EF [144] - 48.89 - 62.32 77.11
SwiMDiff [97] ResNet-18 49.60 - - -
CACo [28] ResNet-50 52.11 - - -
SatMAE [31] ViT-L 52.76 - - -
SNUNet [145] - - 83.49 88.16 96.20
BIT [146] ResNet-18 - 83.98 89.31 -
SRCDNet [147] ResNet-18 - 87.40 - 92.94
CLNet [148] - - - 90.00 92.10
HANet [149] - - - 90.28 -
RECM [150] ViT-S - - 90.39 -
ChangeFormer [151] MiT-B2 [152] - - 90.40 -
AERNet [153] ResNet-34 - - 90.78 -
ESCNet [154] - - - - 93.54
DSAMNet [155] ResNet-18 - - - 93.69
GCD-DDPM [156] - - 92.54 90.96 94.93
CDContrast [157] - - - - 95.11
DDPM-CD [158] UNet [76] - 92.65 90.91 95.62
DeepCL [159] EfficientNet-b0 [160] - - 91.11 -
DMNet [161] ResNet-50 - - - 95.93
ChangeStar [162] ResNeXt-101 [163] - - 91.25 -
RSP [5] ViTAEv2-S [58] - - 90.93 96.81
SAAN [164] ResNet-18 - - 91.41 97.03
SiamixFormer [165] MiT-B5 [152] - - 91.58 97.13
TransUNetCD [166] ResNet-50 - 93.59 91.11 97.17
RDPNet [167] - - - 90.10 97.20
SDACD [168] - - - - 97.34
Siam-NestedUNet [169] UNet++ [170] - - 91.50 -
Changen [171] MiT-B1 [152] - - 91.50 -
HCGMNet [172] VGG-16 - - 91.77 -
CEECNet [173] - - - 91.83 -
RingMo [32] Swin-B - - 91.86 -
CGNet [174] VGG-16 - - 92.01 -
TTP [175] SAM [47] - - 92.10 -
Changer [176] ResNeSt-101 [177] - - 92.33 -
WNet [178] ResNet-18 + DAT [179] - 91.25 90.67 97.56
SpectralGPT+ [34] ViT-B 54.29 - - -
C2FNet [180] VGG-16 - - 91.83 -
MATTER [24] ResNet-34 59.37* - - -
GFM [46] Swin-B 59.82* - - -
FMCD [181] EfficientNet-b4 [160] - 94.48 - -
SGSLN/256 [182] - - 94.67 91.93 96.24
P2V-CD [183] - - 92.38 91.94 98.42*
ChangeCLIP [184] CLIP [17] - 94.82* 92.01 97.89
BAN [185] InternImage-XL [81] - - 91.94 -
BAN [185] ViT-L [81] - - 91.96 -
BAN [185] ChangeFormer [151] - - 92.30 -
SkySense [42] Swin-H 60.06* - 92.58* -
MAE ViT-B + RVSA 50.28 93.77 92.21 97.80
MAE + MTP ViT-B + RVSA 53.36 94.32 92.22 97.87
MAE ViT-L + RVSA 54.04 94.07 92.52 97.78
MAE + MTP ViT-L + RVSA 55.92 94.75 92.67* 97.98
IMP InternImage-XL 51.61 95.33* 92.46 98.37*
IMP + MTP InternImage-XL 55.61 95.59* 92.54* 98.33*

tions. It initially contains 11 pairs of images obtained separately have 5334, 762, and 1524 images for training,
from Google Earth in different seasons, with spatial validation, and testing, after cropping the image to patches in
resolutions ranging from 0.03 to 1m. It now has been size of 256 × 256 without overlaps. A similar operation is con-
cropped to 16,000 pairs of patches in size of 256 × 256 ducted for LEVIR, generating training, validation, and testing
by [190]. The 10,000/3,000/3,000 pairs are separately sets containing 7120, 1024, and 2048 samples, respectively.
used as training, validation, and testing sets. The training epochs on OSCD, WHU, LEVIR, and CDD are
separately set to 100, 200, 150, and 200. The batch size of all
2) Implementation Details: Following [29], [42], we crop datasets is uniformly set to 32. We adopt the same optimization
the OSCD images to 96 × 96 patches with no overlapping, strategy as the scene classification task. To fully leverage the
obtaining 827/385 pairs for training/testing. However, the feature pyramid produced by foundation models, we adopt
training is difficult to converge due to the extremely small a UNet [76] to process the differences between different
input size, thus we rescale the image to 224 × 224 before temporal features. The training is implemented through Open-
inputting it into the network. For the WHU dataset, we
WANG et al.: ADVANCING RS FOUNDATION MODEL VIA MTP 11

TABLE IX
D ETAILED HYPERPARAMETER SETTINGS IN FINETUNING PRETRAINED MODELS ON DIFFERENT DATASETS . “✔” AND “✘” INDICATE WHETHER THE MTP
IS USEFUL FOR IMPROVING PERFORMANCE COMPARED TO THE SETTING WITHOUT MTP.

Scene Classification Horizontal Detection Rotated Object Detection


Dataset EuroSAT RESISC-45 Xview DIOR DIOR-R FAIR1M-2.0 DOTA-V1.0 DOTA-2.0
Training Image Number (NT rIm ) 16,200 6,300 20,084 11,725 11,725 288,428 133,883 31,273
Training Epoch Number (NT rEp ) 100 200 12 12 12 12 12 40
Total Sample Number (NT oSa ) 1,620,000 1,260,000 241,008 140,700 140,700 3,461,136 1,606,596 1,250,920
Batch Size (SB ) 64 64 8 4 4 16 4 4
Total Iteration Number (NT oIt ) 25,312 19,688 30,126 35,175 35,175 216,321 401,649 312,730
Training Image Size (ST rIm ) 224 224 416 800 800 800 1,024 1,024
Class Number (NC ) 10 45 60 20 20 37 15 18
Average Pixel per Class (APC ) 36,288,000 6,272,000 167,0988 5,628,000 5,628,000 74,835,373 109,676,954 71,163,449
Average Iteration per Class (AIC ) 2,531 438 502 1759 1,759 5,847 26,777 17,374
ViT-B + RVSA ✔ ✔ ✔ ✔ ✔ ✔ ✘ ✔
ViT-L + RVSA ✔ ✔ ✔ ✔ ✔ ✘ ✔ ✘
InternImage-XL ✘ ✔ ✔ ✔ ✔ ✔ ✔ ✔
Semantic Segmentation Bi-temporal Change Detection
Dataset SpaceNetv1 LoveDA OSCD WHU LEVIR SVCD/CDD
Training Image Number (NT rIm ) 5,000 4,191 827 5,334 7,120 10,000
Training Epoch Number (NT rEp ) 128 153 100 200 150 200
Total Sample Number (NT oSa ) 640,000 640,000 82,700 106,6800 1,068,000 2,000,000
Batch Size (SB ) 8 8 32 32 32 32
Total Iteration Number (NT oIt ) 80,000 80,000 2,584 33,338 33,375 62,500
Training Image Size (ST rIm ) 384 512 224 256 256 256
Class Number (NC ) 2 7 2 2 2 2
Average Pixel per Class (APC ) 122,880,000 46,811,429 9,262,400 136,550,400 136,704,000 256,000,000
Average Iteration per Class (AIC ) 40,000 11,429 1,292 16,669 16,688 31,250
ViT-B + RVSA ✔ ✔ ✔ ✔ ✔ ✔
ViT-L + RVSA ✘ ✔ ✔ ✔ ✔ ✔
InternImage-XL ✔ ✔ ✔ ✔ ✔ ✘

The training is implemented through Open-CD (https://github.com/likyoo/open-cd), where the data augmentation includes random rotation, random flipping, random temporal exchange, and color jittering that randomly adjusts the brightness, contrast, hue, and saturation of the images. The F1 score of the changed class is adopted as the evaluation metric.
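For reference, the F1 score of the changed class can be computed from pixel-level counts as in the short NumPy sketch below; whether the counts are aggregated per image or over the whole test set depends on the evaluation protocol, which we do not restate here.

    import numpy as np

    def changed_class_f1(pred, target, eps=1e-12):
        """Binary F1 of the 'changed' class (label 1) over all pixels."""
        pred, target = np.asarray(pred).ravel(), np.asarray(target).ravel()
        tp = np.sum((pred == 1) & (target == 1))
        fp = np.sum((pred == 1) & (target == 0))
        fn = np.sum((pred == 0) & (target == 1))
        precision = tp / (tp + fp + eps)
        recall = tp / (tp + fn + eps)
        return 2 * precision * recall / (precision + recall + eps)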
3) Finetuning Results and Analyses: To comprehensively assess the finetuning performance of the pretrained models, we compare them against existing advanced change detection methods, as shown in Table VIII. It should be noted that, since the original WHU dataset does not provide an official train/test split, various split strategies are adopted in different methods. Therefore, on this dataset, we only list the accuracy of the methods that employ the same settings as ours or train with more images. It can be seen that MTP effectively improves the performance of the pretrained models on these datasets. In particular, our models perform well on the three large-scale datasets: WHU, LEVIR, and SVCD/CDD. Even when adopting a simple UNet [76] and the base version of the RVSA model, the finetuning performance is already competitive and surpasses many advanced approaches. When utilizing larger models, the performance can be further boosted. Finally, our models achieve the best accuracy on the WHU and LEVIR datasets, outperforming almost all existing methods, including the recent SkySense [42], which builds a larger change detection network with over 600M parameters, ChangeCLIP [184], which uses CLIP [17] to obtain additional knowledge from language modalities, and the newly proposed adapter BAN [185], which exploits the abilities of existing foundation models and change detection approaches. Different from these large-scale scenes, on the small-scale OSCD dataset, although MTP is still useful, the performance of our models shows relatively large gaps compared to current works. These results suggest that further exploration is necessary to enhance the finetuning performance on datasets with small volumes and input sizes.

F. Further Investigations and Analyses

Besides evaluating the performance of the pretrained models, we conduct further investigations to obtain deeper insights into the characteristics of MTP, including the influence factors of MTP, finetuning with fewer samples, and the reuse of pretrained decoder parameters.

1) Influence Factors of Multi-Task Pretraining: Up to now, to comprehensively assess the impact of MTP, we have finetuned three types of foundation models on five RS downstream tasks, involving a total of fourteen datasets. From the finetuning results (Tables IV-VIII) we find that MTP improves these foundation models in most cases. However, there are still some datasets on which MTP does not perform as well as expected, i.e., not all accuracies of the three models are increased. To figure out the reason, we explore the influence factors related to the performance of MTP, as shown in Table IX. Intuitively, we suppose MTP may be affected by the characteristics of the finetuning datasets and consider a series of variables, including "Training Image Number" (N_TrIm), "Training Epoch Number" (N_TrEp), "Batch Size" (S_B), and "Training Image Size" (S_TrIm). The "Training Image Number" is the number of images used for training on each dataset. For example, the N_TrIm of DIOR is 11,725 since the original training and validation sets are used together for training. The "Training Image Size" represents the image size after data augmentation and preprocessing. Theoretically, we have

N_ToIt = (N_TrIm · N_TrEp) / S_B,   (3)
where N_ToIt means the number of training iterations for model parameter updating under the mini-batch optimization strategy. In Table IX, we observe that as N_ToIt increases, there is a tendency for MTP to have a negative impact on the finetuning performance for a given task. However, this trend is not universally applicable, as evidenced by the varying results among pretrained models on the segmentation tasks, which all share the same N_ToIt. Additionally, we account for dataset difficulty by considering the number of classes N_C and use the "Average Iteration per Class" (AI_C) to represent each dataset as

AI_C = N_ToIt / N_C.   (4)

Surprisingly, Table IX reveals a notable trend: a relatively large AI_C corresponds to a negative impact of MTP for the same task. This suggests that, over extended finetuning periods, MTP models lose their advantage compared to conventionally pretrained models. We propose a bold conjecture regarding this internal mechanism: the benefits of MTP diminish gradually due to excessive network optimization. This discovery prompts a reconsideration of the trade-off between longer training times for accuracy gains and the benefits of pretraining when finetuning models. However, determining the critical point of AI_C remains challenging due to limited experimentation, necessitating further investigation. It is important to note that this phenomenon differs from overfitting, as our models continue to outperform existing methods at this stage.

In addition to training duration, we consider dataset capacity, introducing the index "Average Pixel per Class" (AP_C), which can be formulated as

AP_C = (N_ToSa · S_TrIm) / N_C,   (5)

where N_ToSa = N_TrIm · N_TrEp denotes the quantity of images processed during training. Consequently, AP_C approximately reflects the data volume encountered during finetuning. Table IX reveals that AP_C exhibits trends similar to those of AI_C, yet the correlation between AP_C and MTP performance is less discernible than for AI_C, possibly due to the presence of redundant pixels in RS images.
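To make the relation between the quantities in Table IX and Eqs. (3)-(5) concrete, the small Python helper below (our own illustration; the function name is not from the paper) reproduces the DIOR row; the tabulated values appear to be rounded to the nearest integer.

    def schedule_stats(n_tr_im, n_tr_ep, batch_size, image_size, n_classes):
        n_to_sa = n_tr_im * n_tr_ep              # total samples seen during finetuning
        n_to_it = n_to_sa / batch_size           # Eq. (3): total parameter updates
        ai_c = n_to_it / n_classes               # Eq. (4): average iterations per class
        ap_c = n_to_sa * image_size / n_classes  # Eq. (5): average pixels per class
        return round(n_to_it), round(ai_c), round(ap_c)

    # DIOR: 11,725 training images, 12 epochs, batch size 4, image size 800, 20 classes
    print(schedule_stats(11_725, 12, 4, 800, 20))  # -> (35175, 1759, 5628000)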
2) Fewer Sample Finetuning: The efficacy of SEP has been demonstrated in scenarios with limited samples [13]. Since MTP is an extension of SEP, it is reasonable to anticipate that MTP could excel in analogous contexts. Moreover, as noted earlier, MTP primarily addresses the discrepancy between upstream pretraining and downstream finetuning tasks. This encourages us to consider that fewer downstream training samples might better showcase MTP's efficacy in facilitating efficient transfer from pretrained models. To explore this, we finetune InternImage-XL on EuroSAT and ViT-L + RVSA on SpaceNetv1, respectively, while progressively reducing the training samples. The results are depicted in Fig. 3. Initially, MTP's performance is slightly inferior to its counterparts when the training sample proportion is 100%, as illustrated in Tables IV and VII. However, as the training samples decrease, the performance curves converge until the training sample proportion reaches 10%, at which point MTP's impact is minimal. Subsequent reductions in training samples lead to decreased accuracies across all models, yet the distances between the curves progressively widen. This trend suggests that the benefits of MTP are beginning to emerge, becoming increasingly significant. These findings validate our hypotheses, underscoring the benefit of MTP for finetuning foundation models on limited training samples.

Fig. 3. The finetuning accuracy of different pretrained models with varying training sample sizes. (a) InternImage-XL on EuroSAT. (b) ViT-L + RVSA on SpaceNetv1.
downstream training samples might better showcase MTP’s
efficacy in facilitating efficient transfer from pretraining mod- 3) Decoder Parameter Reusing: MTP utilizes task-specific
els. To explore this, we finetune InterImage-XL on EuroSAT decoders for segmentation and detection tasks. Hence, reusing
and ViT-L + RVSA on SpaceNetv1, respectively, progressively these decoder weights during finetuning seems a naive choice.
reducing training samples. The results are depicted in Figure However, only semantic segmentation and rotated detection
3. Initially, MTP’s performance is slightly inferior to its decoders are eligible for reuse, as per the segmentor or
counterparts when the training sample proportion is 100%, detector used in existing methods. We conduct experiments
as illustrated in Tables IV and VII. However, as training accordingly. Initially, during finetuning, aside from the back-
samples decrease, the performance curves converge until the bone network, we initialize the corresponding decoders with
training sample proportion is 10%, at which point MTP’s pretrained weights. Employing ViT-B + RVSA, the results
impact is minimal. Subsequent reductions in training sam- are presented in Table X. Across the six datasets, decoder
ples lead to decreased accuracies across all models, yet the parameter reusing (DPR) proves beneficial in only three sce-
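The difference between the two settings in Table X amounts to which parameter groups are copied from the MTP checkpoint before finetuning. A minimal PyTorch sketch is given below; the checkpoint layout and the "backbone."/"decode_head." key prefixes are assumptions that depend on the actual segmentor or detector definition.

    import torch

    def load_mtp_checkpoint(model, ckpt_path, reuse_decoder=False):
        """Without DPR, only backbone weights are reused; with DPR, the
        task-specific decoder is also initialized from the MTP checkpoint."""
        state = torch.load(ckpt_path, map_location="cpu")
        state = state.get("state_dict", state)
        prefixes = ("backbone.",) if not reuse_decoder else ("backbone.", "decode_head.")
        filtered = {k: v for k, v in state.items() if k.startswith(prefixes)}
        missing, unexpected = model.load_state_dict(filtered, strict=False)
        return missing, unexpected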
Fig. 4. Visualization of the horizontal object detection predictions of the MAE + MTP pretrained ViT-L + RVSA. The images of the first and second rows are from the Xview and DIOR testing sets, respectively.

Fig. 5. Visualization of the rotated object detection predictions of the MAE + MTP pretrained ViT-L + RVSA. The images in the four rows are from the testing sets of DIOR-R, FAIR1M-2.0, DOTA-V1.0, and DOTA-V2.0, respectively.

Fig. 6. Visualization of the semantic segmentation predictions of the MAE + MTP pretrained ViT-L + RVSA. The samples of the first and second rows are from the SpaceNetv1 and LoveDA testing sets, respectively.

Fig. 7. Visualization of the bi-temporal change detection predictions of the MAE + MTP pretrained ViT-L + RVSA. The samples in the four rows are from the testing sets of OSCD, WHU, LEVIR, and SVCD/CDD, respectively. (a)(b)(e)(f) depict bi-temporal images of different samples, with (c) and (g) representing the corresponding ground truth labels. Our prediction maps are shown in (d) and (h).
G. Visualization

To further show the efficacy of MTP in enhancing RS foundation models, we present the predictions of the MAE + MTP pretrained ViT-L + RVSA across the detection, segmentation, and change detection tasks in Figs. 4-7. For detection, we demonstrate results across diverse scenes using horizontal or rotated bounding boxes. For segmentation, we display the original images alongside the segmentation maps, highlighting building extraction masks in red. For change detection, we provide the bi-temporal images, ground truths, and predicted change maps. Our model accurately detects RS objects, extracts buildings, classifies land cover categories, and characterizes changes across diverse types. In summary, MTP enables the construction of an RS foundation model with over 300 million parameters, which achieves superior representation capability for various downstream tasks.

V. CONCLUSION

In this paper, we introduce the multi-task pretraining (MTP) approach for building RS foundation models. MTP utilizes a shared encoder and task-specific decoder architecture to effectively pretrain convolutional neural networks and vision transformer backbones on three tasks: semantic segmentation, instance segmentation, and rotated object detection in a unified supervised learning framework. We evaluate MTP by examining the finetuning accuracy of these pretrained models on 14 datasets covering various downstream RS tasks. Our results demonstrate the competitive performance of these models compared to existing methods, even larger ones. Further experiments indicate that MTP excels in low-data finetuning scenarios but may offer diminishing returns with prolonged finetuning on large-scale datasets. We hope this research encourages further exploration of RS foundation models, especially in resource-constrained settings. Additionally, we anticipate the widespread application of these models across diverse fields of RS image interpretation due to their strong representation capabilities.

ACKNOWLEDGEMENT

The numerical calculations in this paper are partly supported by the Dawning Information Industry Co., Ltd.

REFERENCES

[1] Z. Zhu, Y. Zhou, K. C. Seto, E. C. Stokes, C. Deng, S. T. Pickett, and H. Taubenböck, “Understanding an urbanizing planet: Strategic directions for remote sensing,” Remote Sensing of Environment, vol. 228, pp. 164–182, 2019.
[2] Q. Yuan, H. Shen, T. Li, Z. Li, S. Li, Y. Jiang, H. Xu, W. Tan, Q. Yang, J. Wang, J. Gao, and L. Zhang, “Deep learning in environmental remote sensing: Achievements and challenges,” Remote Sensing of Environment, vol. 241, p. 111716, 2020.
[3] F. Dell’Acqua and P. Gamba, “Remote sensing and earthquake damage [27] K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell,
assessment: Experiences, limits, and perspectives,” Proceedings of the and S. Ermon, “Geography-aware self-supervised learning,” in ICCV,
IEEE, vol. 100, no. 10, pp. 2876–2890, 2012. pp. 10181–10190, October 2021.
[4] J. Kang, R. Fernandez-Beltran, P. Duan, S. Liu, and A. J. Plaza, “Deep [28] U. Mall, B. Hariharan, and K. Bala, “Change-aware sampling and
unsupervised embedding for remotely sensed images based on spatially contrastive learning for satellite images,” in CVPR, pp. 5261–5270,
augmented momentum contrast,” IEEE Transactions on Geoscience June 2023.
and Remote Sensing, vol. 59, pp. 2598–2610, Mar. 2021. [29] O. Mañas, A. Lacoste, X. Giro-i Nieto, D. Vazquez, and P. Rodriguez,
[5] D. Wang, J. Zhang, B. Du, G.-S. Xia, and D. Tao, “An empirical study “Seasonal contrast: Unsupervised pre-training from uncurated remote
of remote sensing pretraining,” IEEE Transactions on Geoscience and sensing data,” in ICCV, pp. 9414–9423, 2021.
Remote Sensing, vol. 61, pp. 1–20, 2023. [30] D. Wang, Q. Zhang, Y. Xu, J. Zhang, B. Du, D. Tao, and L. Zhang,
[6] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional map “Advancing plain vision transformer toward remote sensing founda-
of the world,” in CVPR, pp. 6172–6180, 2018. tion model,” IEEE Transactions on Geoscience and Remote Sensing,
[7] G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “Bigearthnet: A vol. 61, pp. 1–15, 2023.
large-scale benchmark archive for remote sensing image understand- [31] Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke,
ing,” in IGARSS, pp. 5901–5904, IEEE, 2019. D. Lobell, and S. Ermon, “Satmae: Pre-training transformers for
[8] Y. Long, G.-S. Xia, S. Li, W. Yang, M. Y. Yang, X. X. Zhu, temporal and multi-spectral satellite imagery,” in NeurIPS, vol. 35,
L. Zhang, and D. Li, “On creating benchmark dataset for aerial image pp. 197–211, 2022.
interpretation: Reviews, guidances and million-aid,” IEEE Journal of [32] X. Sun, P. Wang, W. Lu, Z. Zhu, X. Lu, Q. He, J. Li, X. Rong, Z. Yang,
Selected Topics in Applied Earth Observations and Remote Sensing, H. Chang, Q. He, G. Yang, R. Wang, J. Lu, and K. Fu, “Ringmo: A
vol. 14, pp. 4205–4230, 2021. remote sensing foundation model with masked image modeling,” IEEE
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Ima- Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–22,
geNet: A large-scale hierarchical image database,” in CVPR, pp. 248– 2023.
255, 2009. [33] F. Yao, W. Lu, H. Yang, L. Xu, C. Liu, L. Hu, H. Yu, N. Liu,
[10] Y. Long, G.-S. Xia, L. Zhang, G. Cheng, and D. Li, “Aerial scene C. Deng, D. Tang, C. Chen, J. Yu, X. Sun, and K. Fu, “RingMo-
parsing: From tile-level scene classification to pixel-wise semantic Sense: Remote sensing foundation model for spatiotemporal prediction
labeling,” arXiv preprint arXiv:2201.01953, 2022. via spatiotemporal evolution disentangling,” IEEE Transactions on
[11] T. Zhang, P. Gao, H. Dong, Y. Zhuang, G. Wang, W. Zhang, and Geoscience and Remote Sensing, vol. 61, pp. 1–21, 2023.
H. Chen, “Consecutive Pre-Training: A knowledge transfer learning [34] D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li,
strategy with relevant unlabeled data for remote sensing domain,” X. Jia, A. Plaza, et al., “Spectralgpt: Spectral foundation model,” arXiv
Remote Sensing, vol. 14, no. 22, 2022. preprint arXiv:2311.07113, 2023.
[12] C. Tao, J. Qi, G. Zhang, Q. Zhu, W. Lu, and H. Li, “TOV: The original [35] K. Cha, J. Seo, and T. Lee, “A billion-scale foundation model for
vision model for optical remote sensing image understanding via self- remote sensing images,” arXiv preprint arXiv:2304.05215, 2023.
supervised learning,” IEEE Journal of Selected Topics in Applied Earth [36] C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp,
Observations and Remote Sensing, vol. 16, pp. 4916–4930, 2023. K. Keutzer, S. Candido, M. Uyttendaele, and T. Darrell, “Scale-
[13] D. Wang, J. Zhang, B. Du, M. Xu, L. Liu, D. Tao, and L. Zhang, MAE: A scale-aware masked autoencoder for multiscale geospatial
“SAMRS: Scaling-up remote sensing segmentation dataset with seg- representation learning,” in ICCV, pp. 4088–4099, October 2023.
ment anything model,” in NeurIPS Track on Datasets and Benchmarks, [37] M. Zhang, Q. Liu, and Y. Wang, “Ctxmim: Context-enhanced masked
2023. image modeling for remote sensing image understanding,” arXiv
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- preprint arXiv:2310.00022, 2023.
training of deep bidirectional transformers for language understanding,” [38] D. Muhtar, X. Zhang, P. Xiao, Z. Li, and F. Gu, “CMID: A unified self-
in NAACL, pp. 4171–4186, June 2019. supervised learning framework for remote sensing image understand-
[15] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, ing,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language pp. 1–17, 2023.
models are few-shot learners,” NeurIPS, vol. 33, pp. 1877–1901, 2020. [39] M. Tang, A. Cozma, K. Georgiou, and H. Qi, “Cross-Scale MAE: A
[16] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and tale of multiscale exploitation in remote sensing,” NeurIPS, vol. 36,
A. Joulin, “Emerging properties in self-supervised vision transformers,” 2024.
in ICCV, pp. 9650–9660, October 2021. [40] A. Fuller, K. Millard, and J. Green, “CROMA: Remote sensing
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, representations with contrastive radar-optical masked autoencoders,”
G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable NeurIPS, vol. 36, 2024.
visual models from natural language supervision,” in ICML, pp. 8748– [41] Y. Wang, H. H. Hernández, C. M. Albrecht, and X. X. Zhu, “Feature
8763, PMLR, 2021. guided masked autoencoder for self-supervised learning in remote
[18] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple frame- sensing,” arXiv preprint arXiv:2310.18653, 2023.
work for contrastive learning of visual representations,” in ICML, [42] X. Guo, J. Lao, B. Dang, Y. Zhang, L. Yu, L. Ru, L. Zhong, Z. Huang,
pp. 1597–1607, PMLR, 2020. K. Wu, D. Hu, et al., “Skysense: A multi-modal remote sensing
[19] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast foundation model towards universal interpretation for earth observation
for unsupervised visual representation learning,” in CVPR, June 2020. imagery,” arXiv preprint arXiv:2312.10115, 2023.
[20] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, [43] Y. Wang, C. M. Albrecht, N. A. A. Braham, C. Liu, Z. Xiong, and
C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., “Boot- X. X. Zhu, “DeCUR: decoupling common & unique representations for
strap your own latent-a new approach to self-supervised learning,” multimodal self-supervision,” arXiv preprint arXiv:2309.05300, 2023.
NeurIPS, vol. 33, pp. 21271–21284, 2020. [44] Y. Feng, P. Wang, W. Diao, Q. He, H. Hu, H. Bi, X. Sun, and K. Fu,
[21] H. Bao, L. Dong, S. Piao, and F. Wei, “BEiT: BERT pre-training of “A self-supervised cross-modal remote sensing foundation model with
image transformers,” in ICLR, 2022. multi-domain representation and cross-domain fusion,” in IGARSS,
[22] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and pp. 2239–2242, 2023.
H. Hu, “SimMIM: A simple framework for masked image modeling,” [45] Z. Huang, M. Zhang, Y. Gong, Q. Liu, and Y. Wang, “Generic
in CVPR, pp. 9653–9663, June 2022. knowledge boosted pretraining for remote sensing images,” IEEE
[23] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–13,
autoencoders are scalable vision learners,” in CVPR, pp. 16000–16009, 2024.
June 2022. [46] M. Mendieta, B. Han, X. Shi, Y. Zhu, and C. Chen, “Towards geospatial
[24] P. Akiva, M. Purri, and M. Leotta, “Self-supervised material and texture foundation models via continual pretraining,” in ICCV, pp. 16806–
representation learning for remote sensing tasks,” in CVPR, pp. 8203– 16816, October 2023.
8215, June 2022. [47] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson,
[25] G. Mai, N. Lao, Y. He, J. Song, and S. Ermon, “CSP: Self-supervised T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick,
contrastive spatial pre-training for geospatial-visual representations,” in “Segment anything,” in ICCV, pp. 4015–4026, October 2023.
ICML, PMLR, 2023. [48] X.-Y. Tong, G.-S. Xia, Q. Lu, H. Shen, S. Li, S. You, and L. Zhang,
[26] V. V. Cepeda, G. K. Nayak, and M. Shah, “GeoCLIP: Clip-inspired “Land-cover classification with high-resolution remote sensing images
alignment between locations and images for effective worldwide geo- using transferable deep models,” Remote Sensing of Environment,
localization,” in NeurIPS, 2023. vol. 237, p. 111322, 2020.
16 JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021

[49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image [75] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
recognition,” in CVPR, pp. 770–778, 2016. time object detection with region proposal networks,” IEEE Transac-
[50] W. Li, K. Chen, H. Chen, and Z. Shi, “Geographical knowledge-driven tions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–
representation learning for remote sensing images,” IEEE Transactions 1149, June 2017.
on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022. [76] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional net-
[51] C. Jun, Y. Ban, and S. Li, “Open access to earth land-cover map,” works for biomedical image segmentation,” in MICCAI, pp. 234–241,
Nature, vol. 514, no. 7523, pp. 434–434, 2014. Springer, 2015.
[52] W. Li, K. Chen, and Z. Shi, “Geographical supervision correction [77] J. Ding, N. Xue, G.-S. Xia, X. Bai, W. Yang, M. Y. Yang, S. Belongie,
for remote sensing representation learning,” IEEE Transactions on J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Object detection in
Geoscience and Remote Sensing, vol. 60, pp. 1–20, 2022. aerial images: A large-scale benchmark and challenges,” IEEE Trans-
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification actions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11,
with deep convolutional neural networks,” in NeurIPS, vol. 25, 2012. pp. 7778–7796, 2022.
[54] K. Simonyan and A. Zisserman, “Very deep convolutional networks [78] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in
for large-scale image recognition,” in ICLR, May 2015. optical remote sensing images: A survey and a new benchmark,” ISPRS
journal of photogrammetry and remote sensing, vol. 159, pp. 296–307,
[55] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
2020.
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
[79] X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li,
convolutions,” in CVPR, pp. 1–9, 2015.
Y. Feng, T. Xu, et al., “FAIR1M: A benchmark dataset for fine-grained
[56] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely object recognition in high-resolution remote sensing imagery,” ISPRS
connected convolutional networks,” in CVPR, pp. 4700–4708, 2017. Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–
[57] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and 130, 2022.
B. Guo, “Swin transformer: Hierarchical vision transformer using [80] J. Wang, Z. Zheng, X. Lu, and Y. Zhong, “LoveDA: A remote sensing
shifted windows,” in ICCV, pp. 10012–10022, 2021. land-cover dataset for domain adaptive semantic segmentation,” in
[58] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “ViTAEv2: Vision transformer NeurIPS Track on Datasets and Benchmarks, 2021.
advanced by exploring inductive bias for image recognition and be- [81] W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu, L. Lu,
yond,” International Journal of Computer Vision, pp. 1–22, 2023. H. Li, et al., “Internimage: Exploring large-scale vision foundation
[59] F. Bastani, P. Wolters, R. Gupta, J. Ferdinando, and A. Kembhavi, models with deformable convolutions,” in CVPR, pp. 14408–14419,
“Satlaspretrain: A large-scale dataset for remote sensing image under- 2023.
standing,” in ICCV, pp. 16772–16782, October 2023. [82] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “VSA: Learning varied-
[60] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, size window attention in vision transformers,” in ECCV, pp. 466–483,
D. Downey, and N. A. Smith, “Don’t stop pretraining: Adapt language Springer, 2022.
models to domains and tasks,” in ACL, pp. 8342–8360, July 2020. [83] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision
[61] L. M. Dery, P. Michel, A. Talwalkar, and G. Neubig, “Should we be transformer backbones for object detection,” in ECCV, pp. 280–296,
pre-training? an argument for end-task aware training as an alternative,” Springer, 2022.
in ICLR, 2022. [84] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
[62] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
parsing for scene understanding,” in ECCV, pp. 418–434, 2018. J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
[63] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, Transformers for image recognition at scale,” ICLR, 2021.
“Masked-attention mask transformer for universal image segmenta- [85] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei,
tion,” in CVPR, pp. 1290–1299, June 2022. “Deformable convolutional networks,” in CVPR, pp. 764–773, 2017.
[64] W. Li, H. Chen, and Z. Shi, “Semantic segmentation of remote sensing [86] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More
images with self-supervised multitask representation learning,” IEEE deformable, better results,” in CVPR, June 2019.
Journal of Selected Topics in Applied Earth Observations and Remote [87] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv
Sensing, vol. 14, pp. 6438–6450, 2021. e-prints, p. arXiv:1607.06450, July 2016.
[65] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, [88] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
X. Huang, B. Li, C. Li, et al., “Florence: A new foundation model for Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in
computer vision,” arXiv preprint arXiv:2111.11432, 2021. NeurIPS, pp. 5998–6008, 2017.
[66] C. Wu, J. Liang, L. Ji, F. Yang, Y. Fang, D. Jiang, and N. Duan, [89] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),”
“Nüwa: Visual synthesis pre-training for neural visual world creation,” arXiv preprint arXiv:1606.08415, 2016.
in ECCV, pp. 720–736, Springer, 2022. [90] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV,
[67] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pp. 2980–2988, 2017.
pre-training for unified vision-language understanding and generation,” [91] X. Xie, G. Cheng, J. Wang, X. Yao, and J. Han, “Oriented r-cnn for
in ICML, pp. 12888–12900, PMLR, 2022. object detection,” in ICCV, pp. 3520–3529, October 2021.
[92] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,”
[68] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang,
in ICLR, 2019.
L. Yuan, L. Zhang, J.-N. Hwang, et al., “Grounded language-image
[93] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel
pre-training,” in CVPR, pp. 10965–10975, 2022.
dataset and deep learning benchmark for land use and land cover
[69] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, classification,” IEEE Journal of Selected Topics in Applied Earth
“CoCa: Contrastive captioners are image-text foundation models,” Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
Transactions on Machine Learning Research, 2022. [94] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classi-
[70] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, fication: Benchmark and state of the art,” Proceedings of the IEEE,
O. K. Mohammed, S. Singhal, S. Som, and F. Wei, “Image as a foreign vol. 105, no. 10, pp. 1865–1883, 2017.
language: Beit pretraining for vision and vision-language tasks,” in [95] M. Neumann, A. S. Pinto, X. Zhai, and N. Houlsby, “In-
CVPR, pp. 19175–19186, June 2023. domain representation learning for remote sensing,” arXiv preprint
[71] Y. Xu, Q. Zhang, J. Zhang, and D. Tao, “ViTAE: Vision transformer arXiv:1911.06721, 2019.
advanced by exploring intrinsic inductive bias,” NeurIPS, vol. 34, 2021. [96] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical
[72] Q. Zhang, J. Zhang, Y. Xu, and D. Tao, “Vision transformer with automated data augmentation with a reduced search space,” in CVPRW,
quadrangle attention,” IEEE Transactions on Pattern Analysis and June 2020.
Machine Intelligence, 2024. [97] J. Tian, J. Lei, J. Zhang, W. Xie, and Y. Li, “SwiMDiff: Scene-
[73] Y. Hu, J. Yuan, C. Wen, X. Lu, and X. Li, “RSGPT: A re- wide matching contrastive learning with diffusion constraint for remote
mote sensing vision language model and benchmark,” arXiv preprint sensing image,” arXiv preprint arXiv:2401.05093, 2024.
arXiv:2307.15266, 2023. [98] J. Irvin, L. Tao, J. Zhou, Y. Ma, L. Nashold, B. Liu, and A. Y.
[74] C. Wu, B. Du, and L. Zhang, “Fully convolutional change detec- Ng, “Usat: A unified self-supervised encoder for multi-sensor satellite
tion framework with generative adversarial network for unsupervised, imagery,” arXiv preprint arXiv:2312.02199, 2023.
weakly supervised and regional supervised change detection,” IEEE [99] M. Noman, M. Naseer, H. Cholakkal, R. M. Anwar, S. Khan, and F. S.
Transactions on Pattern Analysis and Machine Intelligence, vol. 45, Khan, “Rethinking transformers pre-training for multi-spectral satellite
no. 8, pp. 9774–9788, 2023. imagery,” arXiv preprint arXiv:2403.05419, 2024.
WANG et al.: ADVANCING RS FOUNDATION MODEL VIA MTP 17

[100] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, [125] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, and
Y. Bulatov, and B. McCord, “xview: Objects in context in overhead K. Chen, “RTMDet: An empirical study of designing real-time object
imagery,” arXiv preprint arXiv:1802.07856, 2018. detectors,” arXiv preprint arXiv:2212.07784, 2022.
[101] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for [126] Z. Dong, Y. Gu, and T. Liu, “Generative convnet foundation model with
dense object detection,” IEEE Transactions on Pattern Analysis and sparse modeling and low-frequency reconstruction for remote sensing
Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020. image interpretation,” IEEE Transactions on Geoscience and Remote
[102] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, Sensing, vol. 62, pp. 1–16, 2024.
M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object [127] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie,
detection in aerial images,” in CVPR, June 2018. “A convnet for the 2020s,” in CVPR, pp. 11976–11986, 2022.
[103] G. Cheng, J. Wang, K. Li, X. Xie, C. Lang, Y. Yao, and J. Han, [128] Y. Pu, Y. Wang, Z. Xia, Y. Han, Y. Wang, W. Gan, Z. Wang,
“Anchor-free oriented proposal generator for object detection,” IEEE S. Song, and G. Huang, “Adaptive rotated convolution for rotated object
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, detection,” in ICCV, pp. 6589–6600, October 2023.
2022. [129] Y. Li, Q. Hou, Z. Zheng, M.-M. Cheng, J. Yang, and X. Li, “Large
[104] C. Xu, J. Ding, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, selective kernel network for remote sensing object detection,” in ICCV,
“Dynamic coarse-to-fine learning for oriented tiny object detection,” pp. 16794–16805, October 2023.
in CVPR, pp. 7318–7328, 2023. [130] H. Yu, Y. Tian, Q. Ye, and Y. Liu, “Spatial transform decoupling for
[105] A.-F. O. Detector, “FCOS: A simple and strong anchor-free object oriented object detection,” arXiv preprint arXiv:2308.10561, 2023.
detector,” IEEE Transactions on Pattern Analysis and Machine Intelli- [131] X. Zhang, Y. Tian, L. Xie, W. Huang, Q. Dai, Q. Ye, and Q. Tian,
gence, vol. 44, no. 4, 2022. “HiViT: A simpler and more efficient design of hierarchical vision
[106] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap transformer,” in ICLR, 2023.
between anchor-based and anchor-free detection via adaptive training [132] A. Van Etten, D. Lindenbaum, and T. M. Bacastow, “Spacenet:
sample selection,” in CVPR, June 2020. A remote sensing dataset and challenge series,” arXiv preprint
[107] X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and arXiv:1807.01232, 2018.
K. Fu, “SCRDet: Towards more robust detection for small, cluttered [133] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia,
and rotated objects,” in ICCV, pp. 8231–8240, 2019. “PSANet: Point-wise spatial attention network for scene parsing,” in
[108] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, ECCV, 2018.
“Gliding vertex on the horizontal bounding box for multi-oriented [134] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
object detection,” IEEE Transactions on Pattern Analysis and Machine network,” in CVPR, pp. 6230–6239, 2017.
Intelligence, vol. 43, no. 4, pp. 1452–1459, 2020. [135] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam,
[109] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning RoI “Encoder-decoder with atrous separable convolution for semantic im-
transformer for oriented object detection in aerial images,” in CVPR, age segmentation,” in ECCV, pp. 801–818, 2018.
pp. 2844–2853, 2019. [136] Z. Zheng, Y. Zhong, J. Wang, and A. Ma, “Foreground-aware relation
[110] X. Yang, J. Yan, Z. Feng, and T. He, “R3Det: Refined single-stage network for geospatial object segmentation in high spatial resolution
detector with feature refinement for rotating object,” AAAI, vol. 35, remote sensing imagery,” in CVPR, pp. 4095–4104, 2020.
pp. 3163–3171, May 2021. [137] A. Ma, J. Wang, Y. Zhong, and Z. Zheng, “FactSeg: Foreground
[111] L. Hou, K. Lu, J. Xue, and Y. Li, “Shape-adaptive selection and activation-driven small object semantic segmentation in large-scale re-
measurement for oriented object detection,” in AAAI, vol. 36, pp. 923– mote sensing imagery,” IEEE Transactions on Geoscience and Remote
932, 2022. Sensing, vol. 60, pp. 1–16, 2022.
[112] L. Dai, H. Liu, H. Tang, Z. Wu, and P. Song, “AO2-DETR: Arbitrary- [138] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang,
oriented object detection transformer,” IEEE Transactions on Circuits W. Liu, and J. Wang, “High-resolution representations for labeling
and Systems for Video Technology, 2022. pixels and regions,” arXiv preprint arXiv:1904.04514, 2019.
[139] L. Wang, R. Li, C. Duan, C. Zhang, X. Meng, and S. Fang, “A novel
[113] J. Han, J. Ding, J. Li, and G.-S. Xia, “Align deep features for ori-
transformer based semantic segmentation scheme for fine-resolution
ented object detection,” IEEE Transactions on Geoscience and Remote
remote sensing images,” IEEE Geoscience and Remote Sensing Letters,
Sensing, vol. 60, p. 3062048, Jan. 2022.
vol. 19, pp. 1–5, 2022.
[114] J. Han, J. Ding, N. Xue, and G.-S. Xia, “ReDet: A rotation-equivariant
[140] L. Wang, R. Li, C. Zhang, S. Fang, C. Duan, X. Meng, and P. M.
detector for aerial object detection,” in CVPR, pp. 2786–2795, June
Atkinson, “UNetFormer: A unet-like transformer for efficient semantic
2021.
segmentation of remote sensing urban scene imagery,” ISPRS Journal
[115] X. Yang, X. Yang, J. Yang, Q. Ming, W. Wang, Q. Tian, and J. Yan, of Photogrammetry and Remote Sensing, vol. 190, pp. 196–214, 2022.
“Learning high-precision bounding box for rotated object detection via
[141] R. Xu, C. Wang, J. Zhang, S. Xu, W. Meng, and X. Zhang, “RSS-
Kullback-Leibler divergence,” in NeurIPS, 2021.
Former: Foreground saliency enhancement for remote sensing land-
[116] X. Yang, J. Yan, M. Qi, W. Wang, Z. Xiaopeng, and T. Qi, “Rethinking cover segmentation,” IEEE Transactions on Image Processing, vol. 32,
rotated object detection with gaussian wasserstein distance loss,” in pp. 1052–1064, 2023.
ICML, 2021. [142] Y. Chen, P. Fang, J. Yu, X. Zhong, X. Zhang, and T. Li, “Hi-resnet:
[117] D. Liang, Q. Geng, Z. Wei, D. A. Vorontsov, E. L. Kim, M. Wei, A high-resolution remote sensing network for semantic segmentation,”
and H. Zhou, “Anchor retouching via model interaction for robust arXiv preprint arXiv:2305.12691, 2023.
object detection in aerial images,” IEEE Transactions on Geoscience [143] K. Yamazaki, T. Hanyu, M. Tran, A. Garcia, A. Tran, R. McCann,
and Remote Sensing, vol. 60, pp. 1–13, 2022. H. Liao, C. Rainwater, M. Adkins, A. Molthan, et al., “AerialFormer:
[118] G. Cheng, Y. Yao, S. Li, K. Li, X. Xie, J. Wang, X. Yao, and J. Han, Multi-resolution transformer for aerial image segmentation,” arXiv
“Dual-aligned oriented detector,” IEEE Transactions on Geoscience preprint arXiv:2306.06842, 2023.
and Remote Sensing, vol. 60, pp. 1–11, 2022. [144] R. Caye Daudt, B. Le Saux, and A. Boulch, “Fully convolutional
[119] X. Wang, G. Wang, Q. Dang, Y. Liu, X. Hu, and D. Yu, “PP-YOLOE- siamese networks for change detection,” in ICIP, pp. 4063–4067, 2018.
R: An efficient anchor-free rotated object detector,” arXiv preprint [145] S. Fang, K. Li, J. Shao, and Z. Li, “SNUNet-CD: A densely connected
arXiv:2211.02386, 2022. siamese network for change detection of vhr images,” IEEE Geoscience
[120] S. Xu, X. Wang, W. Lv, Q. Chang, C. Cui, K. Deng, G. Wang, Q. Dang, and Remote Sensing Letters, vol. 19, p. 3056416, Jan. 2022.
S. Wei, Y. Du, et al., “PP-YOLOE: An evolved version of yolo,” arXiv [146] H. Chen, Z. Qi, and Z. Shi, “Remote sensing image change detection
preprint arXiv:2203.16250, 2022. with transformers,” IEEE Transactions on Geoscience and Remote
[121] K. H. Wentong Li, Yijie Chen and J. Zhu, “Oriented reppoints for Sensing, vol. 60, p. 3095166, Jan. 2022.
aerial object detection,” in CVPR, 2022. [147] M. Liu, Q. Shi, A. Marinoni, D. He, X. Liu, and L. Zhang, “Super-
[122] Z. Huang, W. Li, X.-G. Xia, and R. Tao, “A general gaussian heatmap Resolution-Based change detection network with stacked attention
label assignment for arbitrary-oriented object detection,” IEEE Trans- module for images with different resolutions,” IEEE Transactions on
actions on Image Processing, vol. 31, pp. 1895–1910, 2022. Geoscience and Remote Sensing, vol. 60, p. 3091758, Jan. 2022.
[123] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” [148] Z. Zheng, Y. Wan, Y. Zhang, S. Xiang, D. Peng, and B. Zhang,
arXiv preprint arXiv:1804.02767, 2018. “CLNet: Cross-layer convolutional neural network for change detection
[124] X. Yang, Y. Zhou, G. Zhang, J. Yang, W. Wang, J. Yan, X. Zhang, and in optical remote sensing imagery,” ISPRS Journal of Photogrammetry
Q. Tian, “The KFIou loss for rotated object detection,” in ICLR, 2023. and Remote Sensing, vol. 175, pp. 247–267, May 2021.
18 JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021

[149] C. Han, C. Wu, H. Guo, M. Hu, and H. Chen, “Hanet: A hierarchical [172] C. Han, C. Wu, and B. Du, “HCGMNet: A hierarchical change guiding
attention network for change detection with bitemporal very-high- map network for change detection,” in IGARSS, pp. 5511–5514, 2023.
resolution remote sensing images,” IEEE Journal of Selected Topics in [173] F. I. Diakogiannis, F. Waldner, and P. Caccetta, “Looking for change?
Applied Earth Observations and Remote Sensing, vol. 16, pp. 3867– roll the dice and demand attention,” Remote Sensing, vol. 13, no. 18,
3878, 2023. 2021.
[150] Y. Zhang, Y. Zhao, Y. Dong, and B. Du, “Self-supervised pretraining [174] C. Han, C. Wu, H. Guo, M. Hu, J. Li, and H. Chen, “Change guiding
via multimodality images with transformer for change detection,” IEEE network: Incorporating change prior to guide change detection in
Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–11, remote sensing imagery,” IEEE Journal of Selected Topics in Applied
2023. Earth Observations and Remote Sensing, vol. 16, pp. 8395–8407, 2023.
[151] W. G. C. Bandara and V. M. Patel, “A transformer-based siamese [175] K. Chen, C. Liu, W. Li, Z. Liu, H. Chen, H. Zhang, Z. Zou, and
network for change detection,” arXiv preprint arXiv:2201.01293, 2022. Z. Shi, “Time travelling pixels: Bitemporal features integration with
[152] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, foundation model for remote sensing image change detection,” 2023.
“Segformer: Simple and efficient design for semantic segmentation [176] S. Fang, K. Li, and Z. Li, “Changer: Feature interaction is what you
with transformers,” in NeurIPS, vol. 34, pp. 12077–12090, 2021. need for change detection,” IEEE Transactions on Geoscience and
[153] J. Zhang, Z. Shao, Q. Ding, X. Huang, Y. Wang, X. Zhou, and D. Li, Remote Sensing, vol. 61, pp. 1–11, 2023.
“AERNet: An attention-guided edge refinement network and a dataset [177] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He,
for remote sensing building change detection,” IEEE Transactions on J. Mueller, R. Manmatha, M. Li, and A. Smola, “ResNeSt: Split-
Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023. attention networks,” in CVPRW, pp. 2736–2746, June 2022.
[154] H. Zhang, M. Lin, G. Yang, and L. Zhang, “ESCNet: An end-to-end [178] X. Tang, T. Zhang, J. Ma, X. Zhang, F. Liu, and L. Jiao, “WNet: W-
superpixel-enhanced change detection network for very-high-resolution shaped hierarchical network for remote-sensing image change detec-
remote sensing images,” IEEE Transactions on Neural Networks and tion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61,
Learning Systems, vol. 34, no. 1, pp. 28–42, 2023. pp. 1–14, 2023.
[155] Q. Shi, M. Liu, S. Li, X. Liu, F. Wang, and L. Zhang, “A deeply [179] Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang, “Vision transformer
supervised attention metric-based network and an open aerial image with deformable attention,” in CVPR, pp. 4794–4803, 2022.
dataset for remote sensing change detection,” IEEE Transactions on [180] C. Han, C. Wu, M. Hu, J. Li, and H. Chen, “C2F-SemiCD: A coarse-
Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022. to-fine semi-supervised change detection method based on consistency
[156] Y. Wen, X. Ma, X. Zhang, and M. Pun, “GCD-DDPM: A generative regularization in high-resolution remote-sensing images,” IEEE Trans-
change detection model based on difference-feature guided ddpm,” actions on Geoscience and Remote Sensing, pp. 1–1, 2024.
arXiv: 2306.03424, 2023. [181] C. Zhao, Y. Tang, S. Feng, Y. Fan, W. Li, R. Tao, and L. Zhang, “High-
[157] J. Wang, Y. Zhong, and L. Zhang, “Change detection based on super- resolution remote sensing bitemporal image change detection based
vised contrastive learning for high-resolution remote sensing imagery,” on feature interaction and multitask learning,” IEEE Transactions on
IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1– Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023.
16, 2023. [182] S. Zhao, X. Zhang, P. Xiao, and G. He, “Exchanging dual-
[158] W. G. C. Bandara, N. G. Nair, and V. M. Patel, “Ddpm-cd: Remote encoder–decoder: A new strategy for change detection with semantic
sensing change detection using denoising diffusion probabilistic mod- guidance and spatial localization,” IEEE Transactions on Geoscience
els,” arXiv preprint arXiv:2206.11892, 2022. and Remote Sensing, vol. 61, pp. 1–16, 2023.
[159] H. Guo, B. Du, C. Wu, C. Han, and L. Zhang, “Deepcl: Deep change [183] M. Lin, G. Yang, and H. Zhang, “Transition is a process: Pair-to-video
feature learning on remote sensing images in the metric space,” arXiv change detection networks for very high resolution remote sensing
preprint arXiv:2307.12208, 2023. images,” IEEE Transactions on Image Processing, vol. 32, pp. 57–71,
[160] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convo- 2023.
lutional neural networks,” in ICML, pp. 6105–6114, 2019. [184] S. Dong, L. Wang, B. Du, and X. Meng, “ChangeCLIP: Remote sensing
[161] X. Li, L. Yan, Y. Zhang, and H. Zeng, “ESR-DMNet: Enhanced change detection with multimodal vision-language representation learn-
super-resolution-based dual-path metric change detection network for ing,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 208,
remote sensing images with different resolutions,” IEEE Transactions pp. 53–69, 2024.
on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024. [185] K. Li, X. Cao, and D. Meng, “A new learning paradigm for foun-
[162] Z. Zheng, A. Ma, L. Zhang, and Y. Zhong, “Change is everywhere: dation model-based remote sensing change detection,” arXiv preprint
Single-temporal supervised object change detection in remote sensing arXiv:2312.01163, 2023.
imagery,” in ICCV, pp. 15173–15182, 2021. [186] R. C. Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Urban change
[163] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual detection for multispectral earth observation using convolutional neural
transformations for deep neural networks,” in CVPR, pp. 1492–1500, networks,” in IGARSS, pp. 2115–2118, 2018.
2017. [187] S. Ji, S. Wei, and M. Lu, “Fully convolutional networks for multisource
[164] H. Guo, X. Su, C. Wu, B. Du, and L. Zhang, “Saan: Similarity-aware building extraction from an open aerial and satellite imagery data set,”
attention flow network for change detection with vhr remote sensing IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 1,
images,” arXiv preprint arXiv:2308.14570, 2023. pp. 574–586, 2019.
[165] A. Mohammadian and F. Ghaderi, “Siamixformer: a fully-transformer [188] H. Chen and Z. Shi, “A spatial-temporal attention-based method and
siamese network with temporal fusion for accurate building detection a new dataset for remote sensing image change detection,” Remote
and change detection in bi-temporal remote sensing images,” Inter- Sensing, vol. 12, no. 10, 2020.
national Journal of Remote Sensing, vol. 44, no. 12, pp. 3660–3678, [189] M. Lebedev, Y. V. Vizilter, O. Vygolov, V. A. Knyaz, and A. Y. Rubis,
2023. “Change detection in remote sensing images using conditional adver-
[166] Q. Li, R. Zhong, X. Du, and Y. Du, “Transunetcd: A hybrid transformer sarial networks,” The International Archives of the Photogrammetry,
network for change detection in optical remote-sensing images,” IEEE Remote Sensing and Spatial Information Sciences, vol. 42, pp. 565–
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–19, 571, 2018.
2022. [190] S. Ji, S. Wei, and M. Lu, “Fully convolutional networks for multisource
[167] H. Chen, F. Pu, R. Yang, R. Tang, and X. Xu, “RDP-Net: Region building extraction from an open aerial and satellite imagery data set,”
detail preserving network for change detection,” IEEE Transactions on IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 1,
Geoscience and Remote Sensing, vol. 60, pp. 1–10, 2022. pp. 574–586, 2019.
[168] J. Liu, W. Xuan, Y. Gan, Y. Zhan, J. Liu, and B. Du, “An end-to-
end supervised domain adaptation framework for cross-domain change
detection,” Pattern Recognition, vol. 132, p. 108960, 2022.
[169] K. Li, Z. Li, and S. Fang, “Siamese nestedunet networks for change detection of high resolution satellite image,” in CCRIS, pp. 42–48, 2021.
[170] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Transactions on Medical Imaging, 2019.
[171] Z. Zheng, S. Tian, A. Ma, L. Zhang, and Y. Zhong, “Scalable multi-temporal remote sensing change data generation via simulating stochastic change process,” in ICCV, pp. 21818–21827, 2023.

APPENDIX

We present the detailed finetuning accuracies of the three models, i.e., ViT-B + RVSA, ViT-L + RVSA, and InternImage-XL, on the DIOR, DIOR-R, FAIR1M-2.0, DOTA-V1.0, DOTA-V2.0, and LoveDA datasets in Tables XI-XVI.
TABLE XI
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE DIOR DATASET.

Category | ViT-B + RVSA w/o MTP | ViT-B + RVSA w MTP | ViT-L + RVSA w/o MTP | ViT-L + RVSA w MTP | InternImage-XL w/o MTP | InternImage-XL w MTP
airplane 68.2 87.5 76.5 93.8 65.0 69.0
airport 91.2 92.1 92.6 91.6 91.3 92.8
baseballfield 79.9 87.3 83.2 87.7 75.6 81.3
basketballcourt 88.0 89.4 90.7 92.1 89.3 90.1
bridge 53.6 58.0 58.8 64.6 59.3 59.1
chimney 82.1 83.7 84.1 85.9 84.9 84.8
expressway-service-area 90.6 92.9 92.3 94.3 92.8 93.9
expressway-toll-station 76.2 80.5 79.5 84.5 84.4 83.6
dam 78.2 82.0 79.3 81.4 80.2 82.0
golffield 84.9 88.1 85.7 87.3 86.2 83.9
groundtrackfield 83.9 85.6 85.3 86.7 85.9 87.4
harbor 56.8 62.4 60.4 64.5 62.2 63.3
overpass 67.4 69.8 70.6 72.0 68.8 68.6
ship 74.4 75.4 75.4 76.3 73.6 73.6
stadium 82.8 85.8 84.3 85.6 83.7 85.5
storagetank 61.4 62.5 65.3 62.6 59.6 57.7
tenniscourt 89.4 91.2 91.3 92.5 87.6 90.3
trainstation 76.1 80.3 76.6 79.7 77.7 77.7
vehicle 45.2 47.4 48.2 47.6 45.0 46.2
windmill 85.0 86.5 85.1 90.2 89.3 89.3
mAP 75.8 79.4 78.3 81.1 77.1 78.0

TABLE XII
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE DIOR-R DATASET.

Category | ViT-B + RVSA w/o MTP | ViT-B + RVSA w MTP | ViT-L + RVSA w/o MTP | ViT-L + RVSA w MTP | InternImage-XL w/o MTP | InternImage-XL w MTP
airplane 72.1 89.6 81.2 90.7 72.0 72.3
airport 51.1 52.6 51.9 63.4 61.5 63.6
baseballfield 80.8 81.2 81.1 90.0 80.6 80.9
basketballcourt 81.3 87.8 90.1 90.1 90.0 90.1
bridge 44.9 48.6 48.1 56.4 53.5 54.7
chimney 72.7 77.2 78.2 81.5 81.5 81.5
expressway-service-area 87.5 89.1 88.4 89.4 89.9 89.7
expressway-toll-station 69.3 71.6 74.7 80.1 79.5 79.6
dam 35.5 43.3 39.8 39.9 43.0 45.9
golffield 78.4 79.0 79.4 79.3 80.0 79.3
groundtrackfield 81.9 84.2 84.3 85.1 85.2 85.3
harbor 43.3 51.3 46.4 56.0 54.8 55.2
overpass 60.1 60.9 60.5 67.2 64.2 65.8
ship 81.2 81.2 81.2 81.1 81.3 81.3
stadium 81.6 83.7 83.4 78.9 78.2 79.2
storagetank 70.5 71.2 71.4 71.4 62.7 62.8
tenniscourt 89.2 90.2 90.0 90.4 81.5 90.1
trainstation 65.6 66.6 65.2 73.9 66.7 67.5
vehicle 49.3 50.5 51.0 51.8 50.5 51.1
windmill 65.1 66.0 64.6 74.3 66.1 67.2
mAP 68.1 71.3 70.5 74.5 71.1 72.2
TABLE XIII
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE FAIR1M-2.0 DATASET.

Category | ViT-B + RVSA w/o MTP | ViT-B + RVSA w MTP | ViT-L + RVSA w/o MTP | ViT-L + RVSA w MTP | InternImage-XL w/o MTP | InternImage-XL w MTP
Boeing737 47.17 44.78 48.91 43.21 41.36 47.48
Boeing747 93.31 93.73 95.00 94.95 95.76 95.67
Boeing777 41.48 45.01 36.86 36.20 50.87 44.57
Boeing787 60.01 57.03 64.70 61.42 63.66 66.78
C919 42.43 44.27 33.94 50.14 51.95 54.52
A220 53.64 52.23 56.51 52.70 54.10 58.17
A321 71.56 72.12 72.98 72.68 69.31 70.76
A330 61.97 64.22 60.78 64.01 68.53 64.66
A350 75.76 73.14 75.94 70.90 80.19 78.89
ARJ21 20.36 20.93 19.66 23.14 22.67 22.78
Passenger Ship 20.12 21.29 20.38 19.80 15.76 15.55
Motorboat 71.17 71.73 72.72 73.07 64.86 66.28
Fishing Boat 35.42 36.30 36.30 33.94 28.38 31.34
Tugboat 31.12 32.66 36.00 32.52 27.33 25.38
Engineering Ship 22.75 25.90 29.02 28.63 20.16 21.00
Liquid Cargo Ship 48.73 49.30 52.80 49.31 46.91 46.32
Dry Cargo Ship 53.07 53.12 53.87 51.09 49.18 49.13
Warship 38.17 40.88 45.66 43.44 33.45 38.04
Small Car 76.77 76.98 77.65 77.23 72.77 72.92
Bus 45.60 42.16 51.73 51.73 47.59 46.79
Cargo Truck 59.64 59.87 61.56 60.53 56.15 56.32
Dump Truck 61.73 61.85 63.08 60.95 57.69 57.48
Van 77.08 77.33 78.22 77.74 73.00 73.08
Trailer 19.74 19.34 23.63 22.48 14.48 17.55
Tractor 1.33 1.79 2.21 1.83 0.71 1.23
Excavator 23.38 25.03 27.58 29.12 21.49 18.68
Truck Tractor 50.00 49.83 48.71 52.15 50.62 48.44
Basketball Court 64.50 63.35 63.46 64.40 60.69 60.46
Tennis Court 91.45 90.71 92.09 91.53 89.07 90.22
Football Field 66.23 67.21 70.72 70.75 68.44 68.98
Baseball Field 91.81 91.52 92.27 91.80 87.93 88.78
Intersection 63.65 64.73 64.94 66.22 64.30 63.28
Roundabout 26.13 25.43 28.36 31.00 30.71 27.07
Bridge 45.91 49.38 50.66 51.32 42.81 43.33
mAP 51.56 51.92 53.20 53.00 50.67 50.94

TABLE XIV
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE DOTA-V1.0 DATASET.

Category | ViT-B + RVSA w/o MTP | ViT-B + RVSA w MTP | ViT-L + RVSA w/o MTP | ViT-L + RVSA w MTP | InternImage-XL w/o MTP | InternImage-XL w MTP
plane 88.42 88.91 88.52 88.33 88.91 88.96
baseball-diamond 85.03 84.07 85.36 86.59 86.78 85.93
bridge 60.86 60.76 61.55 63.38 59.93 60.62
ground-track-field 82.39 82.93 81.25 83.49 81.05 81.26
small-vehicle 80.70 80.41 80.69 81.06 80.80 80.08
large-vehicle 85.76 86.16 86.44 86.48 85.06 84.55
ship 88.58 88.64 88.51 88.53 88.38 87.96
tennis-court 90.88 90.87 90.87 90.87 90.81 90.86
basketball-court 86.61 86.21 85.80 86.26 86.27 86.37
storage-tank 86.88 86.76 86.84 85.80 86.19 86.50
soccer-ball-field 63.79 63.91 69.81 67.17 69.64 68.93
roundabout 72.52 71.00 72.51 71.84 71.50 73.11
harbor 78.52 79.04 84.82 84.94 79.21 78.76
swimming-pool 80.53 82.17 79.99 81.93 81.50 80.98
helicopter 81.00 78.34 78.51 78.31 67.57 76.63
mAP 80.83 80.68 81.43 81.66 80.24 80.77
TABLE XV
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE DOTA-V2.0 DATASET.

Category | ViT-B + RVSA w/o MTP | ViT-B + RVSA w MTP | ViT-L + RVSA w/o MTP | ViT-L + RVSA w MTP | InternImage-XL w/o MTP | InternImage-XL w MTP
plane 77.86 78.30 79.16 78.57 78.52 70.98
baseball-diamond 48.99 52.58 48.16 45.54 50.96 50.82
bridge 46.55 48.42 47.97 49.72 43.21 43.60
ground-track-field 63.10 59.05 61.57 56.42 59.77 59.25
small-vehicle 43.55 43.65 43.74 43.78 43.58 43.55
large-vehicle 56.85 57.15 61.14 62.26 56.11 56.50
ship 61.09 61.08 68.60 68.76 61.38 61.41
tennis-court 76.90 77.83 78.45 74.89 77.61 78.23
basketball-court 54.57 56.32 61.97 64.17 58.20 61.41
storage-tank 58.55 59.23 59.62 58.70 58.55 51.30
soccer-ball-field 36.37 36.93 45.98 43.70 48.88 42.89
roundabout 50.39 51.26 54.89 49.46 50.17 50.60
harbor 56.34 56.97 62.03 63.32 57.86 56.84
swimming-pool 63.89 63.05 64.34 64.59 58.43 58.31
helicopter 65.43 66.40 70.23 72.71 58.16 59.55
container-crane 39.19 44.24 46.94 50.91 40.58 49.21
airport 79.90 87.57 87.64 87.64 77.39 84.73
helipad 14.43 9.33 18.92 16.27 7.91 13.09
mAP 55.22 56.08 58.96 58.41 54.85 55.13

TABLE XVI
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE LOVEDA DATASET.

Category | ViT-B + RVSA w/o MTP | ViT-B + RVSA w MTP | ViT-L + RVSA w/o MTP | ViT-L + RVSA w MTP | InternImage-XL w/o MTP | InternImage-XL w MTP
background 45.91 45.92 47.26 47.14 46.63 46.80
building 57.93 59.40 59.27 62.69 61.98 62.60
Road 56.08 56.15 59.54 58.00 58.25 58.96
water 79.72 80.66 81.45 81.43 82.14 82.25
barren 16.49 16.56 17.59 19.27 18.11 17.49
forest 46.03 46.38 47.39 46.82 47.99 47.63
agriculture 61.48 61.67 63.55 63.80 62.40 63.44
mIOU 51.95 52.39 53.72 54.17 53.93 54.17
