MTP: Advancing Remote Sensing Foundation Model Via Multi-Task Pretraining

Abstract—Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks. Pretraining is an active research topic, encompassing supervised and self-supervised learning methods to initialize model weights effectively. However, transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks. In this study, we explore the Multi-Task Pretraining (MTP) paradigm for RS foundation models to address this issue. Using a shared encoder and task-specific decoder architecture, we conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection. MTP supports both convolutional neural networks and vision transformer foundation models with over 300 million parameters. The pretrained models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection. Extensive experiments across 14 datasets demonstrate the superiority of our models over existing ones of similar size and their competitive performance compared to larger state-of-the-art models, thus validating the effectiveness of MTP. The codes and pretrained models will be released at https://github.com/ViTAE-Transformer/MTP.

Index Terms—Remote sensing, Foundation model, Multi-task pretraining, Scene classification, Semantic segmentation, Object detection, Change detection.

D. Wang and B. Du are with the School of Computer Science, Wuhan University, Wuhan 430072, China, also with the Institute of Artificial Intelligence, Wuhan University, Wuhan 430072, China, also with the National Engineering Research Center for Multimedia Software, Wuhan University, Wuhan 430072, China, and also with the Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China (e-mail: [email protected]; [email protected]). (Corresponding author: Minqiang Xu, Jing Zhang, Bo Du and Liangpei Zhang.)

J. Zhang is with the School of Computer Science, Faculty of Engineering, The University of Sydney, Australia (e-mail: [email protected]).

M. Xu, L. Liu, D. Wang and E. Gao are with the iFlytek Co., Ltd and also with the National Engineering Research Center of Speech and Language Information Processing, Hefei 230088, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

C. Han, H. Guo and L. Zhang are with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: [email protected]; [email protected]; [email protected]).

D. Tao is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]).

I. INTRODUCTION

Remote sensing (RS) imagery is one of the most important data resources for recording ground surfaces and land objects. Precisely understanding RS images is beneficial to many applications, including urban planning [1], environmental survey [2], disaster assessment [3], etc.

Utilizing their inherent capability to automatically learn and extract deep features from objects, deep learning methods have found widespread application in the RS domain, particularly for tasks such as scene classification, land use and land cover classification, and ship detection. Typically, ImageNet pretrained weights are employed in training deep networks for RS tasks due to their extensive representational ability. However, these weights are derived from pretraining models on natural images, leading to domain gaps between natural images and RS images. For instance, RS images are captured from a bird's-eye view, lack the vibrant colors of natural images, and possess lower spatial resolution. These disparities may impede the model's finetuning performance [4], [5]. Moreover, relying solely on limited task-specific data for training restricts the model size and generalization capability of current RS deep models due to the notorious overfitting issue.

To tackle these challenges, the development of RS vision foundation models, which should excel in extracting representative RS features, is imperative. However, the RS domain has long grappled with a scarcity of adequately large annotated datasets, impeding related investigations. Until recently, the most expansive RS scene labeling datasets were fMoW [6] and BigEarthNet [7], boasting 132,716 and 590,326 unique scene instances [8], respectively — yet still falling short of benchmarks set by natural image datasets like ImageNet-1K [9]. Long et al. [8] addressed this gap by introducing MillionAID, a large-scale RS scene labeling dataset whose sample capacity of 1,000,848 is comparable to ImageNet-1K, igniting interest in supervised RS pretraining [5], [10]. These studies show the feasibility of pretraining RS foundation models on large-scale RS datasets. Nonetheless, supervised pretraining of RS foundation models may not be the most preferable choice due to the expertise and substantial time and labor costs associated with labeling RS images.

Constructing large-scale RS annotation datasets is challenging due to the high complexity and cost of labeling. Despite this challenge, the advancement of earth observation technologies grants easy access to a vast amount of unlabeled RS images. Efficiently leveraging these unlabeled RS images is crucial for developing robust RS foundation models. In the realm of deep learning, unsupervised pretraining has emerged as a promising approach for learning effective knowledge from massive unlabeled data [14]–[17]. Typically, unsupervised pretraining employs self-supervised learning (SSL) to learn effective feature representations. SSL encompasses two primary techniques: contrastive-based [18]–[20] and generative-based
classification, semantic segmentation, object detection, and change detection.

The remainder of this paper is organized as follows. Section II introduces the existing works related to supervised, multi-stage, and multi-task RS pretraining. Section III presents the details of MTP, where the used SAMRS dataset and vision foundation models are also briefly introduced. Experimental results and corresponding analyses are depicted in Section IV. Finally, Section V concludes this paper.

II. RELATED WORK

A. Supervised Pretraining for RS Foundation Model

Before the rise of SSL-based RS foundation models, researchers had already delved into pretraining deep models using labeled RS datasets. Tong et al. [48] pretrained an ImageNet-pretrained ResNet-50 [49] using images from the GID dataset [48] to derive pseudo-labels for precise land-cover classification on high-resolution RS images. Recognizing the challenge of labeling large-scale RS images, others sought alternatives to RS annotation datasets. For instance, Li et al. [50] utilized the global land cover product Globeland30 [51] as supervision for RS representation learning. They adopted a mean-teacher framework to mitigate random noise stemming from inconsistencies in imaging time and resolution between RS images and geographical products. Moreover, they incorporated additional geographical supervisions, such as change degree and spatial aggregation, to regularize the pretraining process [52]. Long et al. [10] subsequently demonstrated the effectiveness of various CNN models (including AlexNet [53], VGG-16 [54], GoogleNet [55], ResNet-101 [49], and DenseNet-121/169 [56]) pretrained from scratch on the MillionAID dataset. Their models outperformed traditional ImageNet pretrained models in scene classification tasks, indicating the potential of leveraging large-scale RS datasets for pretraining. Later, Wang et al. [5] pretrained typical CNN models and vision transformer models, including Swin-T [57] and ViTAEv2 [58], all randomly initialized, on the MillionAID. They conducted a comprehensive empirical study comparing finetuning performance using different pretraining strategies (MillionAID vs. ImageNet) across four types of RS downstream tasks: scene recognition, semantic segmentation, rotated object detection, and change detection. Their results demonstrated the superiority of vision transformer models over CNNs on RS scenes and validated the feasibility of constructing RS foundation models via supervised pretraining on large-scale RS datasets. Bastani et al. [59] introduced the larger Satlas dataset for RS supervised pretraining. Very recently, SAMRS [13] introduced supervised semantic segmentation pretraining to enhance model performance on the segmentation task. Inspired by [13], this paper revisits the supervised learning approach by integrating it with existing pretraining strategies, such as ImageNet pretraining, and exploring multi-task pretraining to construct distinct RS foundation models.

B. Multi-Stage Pretraining for RS Foundation Model

Given the domain gap between RS images and natural images or between various RS modalities, it is reasonable to conduct multiple rounds of pretraining. Gururangan et al. [60] demonstrated that unsupervised pretraining on in-domain or task-specific data enhances model performance in natural language processing (NLP) tasks. Building on this insight, Zhang et al. [11] devised a sequential pretraining approach, initially on ImageNet and then on the target RS dataset, employing MIM for pretraining. Similarly, [12] proposed a strategy inspired by human-like learning, first performing contrastive SSL on natural images, then freezing the shallow layer weights and conducting SSL on an RS dataset. Contrary to [60], Dery et al. [61] introduced stronger end-task-aware training for NLP tasks by integrating auxiliary data and end-task objectives into the learning process. Similarly, [13] introduced additional segmentation pretraining using common segmenters (e.g., UperNet [62] and Mask2Former [63]) and the SAMRS dataset, enhancing model accuracy in RS segmentation tasks. Notably, our objective diverges from [13] in applying stage-wise pretraining. While [13] retains the segmentor after segmentation pretraining to enhance segmentation performance, we aim to enhance the representation capability of RS foundation models via stage-wise pretraining, preserving only the backbone network after pretraining to facilitate transfer to diverse RS downstream tasks.

C. Multi-Task Pretraining for RS Foundation Model

Applying multi-task learning to enhance the RS foundation model is an intuitive idea. Li et al. [64] introduced multi-task SSL representation learning, combining image inpainting, transform prediction, and contrast learning to boost semantic segmentation performance in RS images. However, it was limited to finetuning a pretrained model solely on semantic segmentation tasks, constrained by model size and pretraining dataset capacity. The aspiration to consolidate multiple tasks into a single model has been a longstanding pursuit [15], [17], [42], [58], [65]–[74], aligning with the original goals of the foundation model exploration. Bastani et al. [59] devised a multi-task model by integrating Swin-Base [57] with seven heads from existing networks (e.g., Faster-RCNN [75] and UNet [76]), facilitating training on the multi-task annotated Satlas dataset. However, their approach lacked incorporation of typical RS rotated object tasks, focusing solely on transferring the model to RS classification datasets. Inspired by these pioneering efforts, this paper presents multi-task pretraining of RS foundation models with over 300M parameters, encompassing semantic segmentation, instance segmentation, and rotated object detection tasks using the SAMRS dataset. After pretraining, the backbone network is further finetuned on various RS downstream tasks.

III. MULTI-TASK PRETRAINING

We utilize semantic segmentation, instance segmentation, and rotated object detection annotations from the SAMRS dataset for Multi-Task Pretraining (MTP). Advanced CNN and vision transformer models serve as the backbone networks to thoroughly investigate MTP. This section begins with an overview of the SAMRS dataset, followed by a brief introduction to the selected models. Subsequently, we present the MTP framework and implementation details.
Fig. 2. The overall pipeline of MTP. Inside MTP, the feature pyramid from the backbone network is fed into multiple decoders for various tasks, including
rotated object detection, instance segmentation, and semantic segmentation. These tasks are supervised by diverse labels in the SAMRS dataset. Following
MTP, the pretrained model is transferred to different RS tasks for finetuning.
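To make the pipeline in Fig. 2 concrete, the following is a minimal PyTorch-style sketch of one MTP optimization step with a shared backbone and task-specific heads. The backbone and head modules are toy stand-ins (the actual RVSA/InternImage backbones and the segmentation/detection decoders are not reproduced), and only the semantic term of the summed multi-task loss is instantiated; treat it as an illustration of the shared-encoder, multi-decoder idea rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPModel(nn.Module):
    """Shared encoder with one task-specific decoder (head) per pretraining task."""
    def __init__(self, backbone, heads):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict(heads)

    def forward(self, images):
        feats = self.backbone(images)  # shared representation for all tasks
        return {name: head(feats) for name, head in self.heads.items()}

# Toy stand-ins for the real backbone (RVSA / InternImage) and decoders.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
model = MTPModel(backbone, {
    "semseg": nn.Conv2d(16, 5, 1),   # per-pixel class logits
    "inseg":  nn.Conv2d(16, 5, 1),   # placeholder for the instance branch
    "rotdet": nn.Conv2d(16, 6, 1),   # placeholder for the rotated-box branch
})
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)

images = torch.randn(2, 3, 64, 64)             # stand-in for a SAMRS mini-batch
sem_labels = torch.randint(0, 5, (2, 64, 64))  # semantic segmentation labels

outputs = model(images)
# Overall objective: L = L_sem + L_ins + L_det. Only the semantic term is
# computed here; the instance and rotated-detection losses of the real
# decoders would be added to the same sum before backpropagation.
loss = F.cross_entropy(outputs["semseg"], sem_labels)
loss.backward()
optimizer.step()
```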
TABLE I
DETAILED CONFIGURATIONS OF DIFFERENT RVSA MODELS.

Backbone               | ViT-B + RVSA   | ViT-L + RVSA
Depth                  | 12             | 24
Embedding Dim          | 768            | 1024
Head                   | 12             | 16
Full attention Index   | [3, 6, 9, 12]  | [6, 12, 18, 24]
Feature Pyramid Index  | [4, 6, 8, 12]  | [8, 12, 16, 24]

TABLE II
THE TRAINING COSTS OF IMPLEMENTING MTP USING DIFFERENT MODELS.

Backbone        | #Param. (M) | #GPU | Time (days)
ViT-B + RVSA    | 86          | 16   | 3.0
ViT-L + RVSA    | 305         | 32   | 6.3
InternImage-XL  | 335         | 32   | 6.3
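As a reading aid for Table I, the snippet below encodes the listed configurations and derives, block by block, whether full (global) attention or windowed RVSA attention would be applied; the dictionary layout and the helper function are illustrative assumptions, not the released RVSA code.

```python
# Table I encoded as plain configuration dictionaries (1-based block indices).
RVSA_CONFIGS = {
    "vit_b_rvsa": dict(depth=12, embed_dim=768, num_heads=12,
                       full_attn_idx=[3, 6, 9, 12], pyramid_idx=[4, 6, 8, 12]),
    "vit_l_rvsa": dict(depth=24, embed_dim=1024, num_heads=16,
                       full_attn_idx=[6, 12, 18, 24], pyramid_idx=[8, 12, 16, 24]),
}

def attention_plan(cfg):
    """Per-block attention type: full attention at the listed indices, RVSA windowed attention elsewhere."""
    return ["full" if i in cfg["full_attn_idx"] else "rvsa_window"
            for i in range(1, cfg["depth"] + 1)]

print(attention_plan(RVSA_CONFIGS["vit_b_rvsa"]))
# Features for the detector/segmentor pyramid would be tapped at cfg["pyramid_idx"].
```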
TABLE VI
THE mAP (%) OF FINETUNING DIFFERENT PRETRAINED MODELS ON THE DIOR-R, FAIR1M-2.0, DOTA-V1.0, AND DOTA-V2.0 DATASETS. MS INDICATES WHETHER THE ACCURACY ON DOTA-V1.0 IS OBTAINED FROM THE MULTI-SCALE TRAINING AND TESTING. †: THE FEATURE PYRAMID IS FORMED BY UPSAMPLING AND DOWNSAMPLING THE LAST LAYER FEATURE OF THE BACKBONE NETWORK BY FOLLOWING THE STRATEGY OF ViTDet [83].
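The † note refers to building the detection feature pyramid from the single last-layer feature map rather than from intermediate stages. Below is a minimal sketch of that idea using plain interpolation for brevity; the actual ViTDet-style construction [83] uses strided convolution and deconvolution layers instead of bilinear resizing.

```python
import torch
import torch.nn.functional as F

def simple_feature_pyramid(feat, scale_factors=(4.0, 2.0, 1.0, 0.5)):
    """Derive multi-scale maps (strides 4/8/16/32) from one stride-16 feature map."""
    return [feat if s == 1.0 else
            F.interpolate(feat, scale_factor=s, mode="bilinear", align_corners=False)
            for s in scale_factors]

feat = torch.randn(1, 768, 32, 32)  # e.g., ViT-B tokens reshaped to a 2-D map
print([p.shape for p in simple_feature_pyramid(feat)])
```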
3) Finetuning Results and Analyses: Table VI shows the finetuning results. Except for DIOR-R, we find that the MTP pretrained models cannot always demonstrate obvious advantages over their counterparts. Since the volumes of FAIR1M-2.0, DOTA-V1.0, and DOTA-V2.0 are much larger than that of DIOR-R, we speculate that after long finetuning, the benefit of MTP becomes diminished. We will further explore this issue in later sections. Nevertheless, owing to its excellent structure, RVSA-L outperforms the ViT-G-based foundation model [35] with over 1 billion parameters on DOTA-V2.0. Compared to the powerful SkySense model [42], our models achieve better performance on DIOR-R. On FAIR1M-2.0, our models surpass all other methods except SkySense by a large margin. Generally, our models have representation capability comparable to SkySense, although it has over 600M parameters and utilizes 20 million images for pretraining. We also notice that the performance of our models still has gaps compared with the current advanced method STD [130] on DOTA-V1.0. This may be attributed to the adopted classical detector Oriented-RCNN [91], which limits the detection performance.

D. Semantic Segmentation

We further consider finetuning the pretrained models on finer pixel-level tasks, e.g., the semantic segmentation task. It is one of the most important RS applications for the extraction and recognition of RS objects and land covers.

1) Dataset: We separately take into account both single-class geospatial target extraction and multi-class surface el-
TABLE VIII
THE F1 SCORE (%) OF FINETUNING DIFFERENT PRETRAINED MODELS WITH UNet ON THE OSCD, WHU, LEVIR, AND SVCD/CDD DATASETS.
tions. It initially contains 11 pairs of images obtained from Google Earth in different seasons, with spatial resolutions ranging from 0.03 to 1 m. It has now been cropped into 16,000 pairs of patches of size 256 × 256 by [190]. The 10,000/3,000/3,000 pairs are separately used as the training, validation, and testing sets.

2) Implementation Details: Following [29], [42], we crop the OSCD images into 96 × 96 patches without overlap, obtaining 827/385 pairs for training/testing. However, the training is difficult to converge due to the extremely small input size; thus, we rescale the images to 224 × 224 before inputting them into the network. For the WHU dataset, we separately have 5334, 762, and 1524 images for training, validation, and testing, after cropping the images into patches of size 256 × 256 without overlap. A similar operation is conducted for LEVIR, generating training, validation, and testing sets containing 7120, 1024, and 2048 samples, respectively. The training epochs on OSCD, WHU, LEVIR, and CDD are set to 100, 200, 150, and 200, respectively. The batch size of all datasets is uniformly set to 32. We adopt the same optimization strategy as in the scene classification task. To fully leverage the feature pyramid produced by the foundation models, we adopt a UNet [76] to process the differences between the temporal features.
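To illustrate the bi-temporal setup described above, here is a rough sketch of decoding feature differences in a UNet-like coarse-to-fine manner. The module name, channel sizes, and absolute-difference fusion are assumptions for illustration only and do not reproduce the exact Open-CD configuration used in the experiments.

```python
import torch
import torch.nn as nn

class BitemporalDiffHead(nn.Module):
    """Sketch: subtract multi-scale features of the two dates and decode
    them UNet-style (coarse-to-fine upsampling with skip fusion)."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.fuse = nn.ModuleList(
            nn.Conv2d(c_low + c_high, c_low, 3, padding=1)
            for c_low, c_high in zip(channels[:-1], channels[1:])
        )
        self.classify = nn.Conv2d(channels[0], 2, 1)  # change / no-change logits

    def forward(self, feats_t1, feats_t2):
        # absolute difference at every pyramid level
        diffs = [torch.abs(a - b) for a, b in zip(feats_t1, feats_t2)]
        x = diffs[-1]
        for fuse, skip in zip(reversed(self.fuse), reversed(diffs[:-1])):
            x = nn.functional.interpolate(x, size=skip.shape[-2:],
                                          mode="bilinear", align_corners=False)
            x = fuse(torch.cat([skip, x], dim=1))
        return self.classify(x)

# toy 3-level pyramids from the two acquisition dates
f1 = [torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32), torch.randn(1, 256, 16, 16)]
f2 = [torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32), torch.randn(1, 256, 16, 16)]
print(BitemporalDiffHead()(f1, f2).shape)  # -> torch.Size([1, 2, 64, 64])
```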
TABLE IX
DETAILED HYPERPARAMETER SETTINGS IN FINETUNING PRETRAINED MODELS ON DIFFERENT DATASETS. "✔" AND "✘" INDICATE WHETHER THE MTP IS USEFUL FOR IMPROVING PERFORMANCE COMPARED TO THE SETTING WITHOUT MTP.
The training is implemented through Open-CD (https://github.com/likyoo/open-cd), where the data augmentation includes random rotation, random flipping, random temporal exchange, and color jitter that randomly adjusts the brightness, contrast, hue, and saturation of images. The F1 score of the changed class is adopted as the evaluation metric.

3) Finetuning Results and Analyses: To comprehensively assess the finetuning performance of the pretrained models, we conduct a comparison by collecting existing advanced change detection methods, as shown in Table VIII. It should be noted that, since the original WHU dataset does not provide an official train/test split, various split strategies are adopted in different methods. Therefore, on this dataset, we only list the accuracy of the methods that employ the same settings as ours or train with more images. It can be seen that MTP effectively improves the performance of the pretrained models on these datasets. In particular, our models perform well on the three large-scale datasets: WHU, LEVIR, and SVCD/CDD. Even when adopting a simple UNet [76] and the base version of the RVSA model, the finetuning performance is competitive and surpasses many advanced approaches. When utilizing larger models, the performance can be further boosted. Finally, they achieve the best accuracy on the WHU and LEVIR datasets, outperforming almost all existing methods, including the recent SkySense [42], which builds a larger change detection network with over 600M parameters, ChangeCLIP [184], which uses CLIP [17] to obtain additional knowledge from language modalities, and the newly proposed adapter BAN [185], which exploits the abilities of existing foundation models and change detection approaches. Different from the large-scale scenes, on the small-scale dataset OSCD, although MTP is still useful, the performance of our models shows relatively large gaps compared to current works. These results suggest that further exploration is necessary to enhance model finetuning performance on datasets with small volumes and input sizes.

F. Further Investigations and Analyses

Besides evaluating the performance of the pretrained models, we conduct further investigations to obtain deeper insights into the characteristics of MTP, including the influence factors of MTP, finetuning with fewer samples, and parameter reusing of decoders.

1) Influence Factors of Multi-Task Pretraining: Up to now, to comprehensively assess the impact of MTP, we have finetuned three types of foundation models on five RS downstream tasks, involving a total of fourteen datasets. From the finetuning results (Tables IV-VIII), we find that MTP improves these foundation models in most cases. However, there are still some datasets on which MTP does not perform as well as expected, i.e., not all accuracies of the three models are increased. To figure out the reason, we explore the influence factors related to the performance of MTP, as shown in Table IX. Intuitively, we suppose MTP may be affected by the characteristics of the finetuning datasets and consider a series of variables, including "Training Image Number" ($N_{TrIm}$), "Training Epoch Number" ($N_{TrEp}$), "Batch Size" ($S_B$), and "Training Image Size" ($S_{TrIm}$). The "Training Image Number" is, for each dataset, the number of images used for training. For example, the $N_{TrIm}$ of DIOR is 11,725 since the original training and validation sets are used together for training. "Training Image Size" represents the image size after data augmentation and preprocessing. Theoretically, we have

$$N_{ToIt} = \frac{N_{TrIm} \cdot N_{TrEp}}{S_B}, \qquad (3)$$
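As a quick numeric illustration of Eq. (3), the snippet below plugs in the DIOR training-image count quoted above; the epoch count and batch size are placeholder values, not the settings listed in Table IX.

```python
def total_iterations(n_train_images, n_epochs, batch_size):
    # Eq. (3): N_ToIt = (N_TrIm * N_TrEp) / S_B
    return n_train_images * n_epochs / batch_size

# 11,725 DIOR training images (from the text); 12 epochs and batch size 4
# are illustrative placeholders.
print(total_iterations(11725, 12, 4))  # -> 35175.0 optimization steps
```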
While MTP represents an extension of SEP, it is reasonable to anticipate that MTP could excel in analogous contexts. Moreover, as noted earlier, MTP primarily addresses the discrepancy between upstream pretraining and downstream finetuning tasks. This encourages us to consider that fewer downstream training samples might better showcase MTP's efficacy in facilitating efficient transfer from pretrained models. To explore this, we finetune InternImage-XL on EuroSAT and ViT-L + RVSA on SpaceNetv1, respectively, progressively reducing the training samples. The results are depicted in Figure 3. Initially, MTP's performance is slightly inferior to its counterparts when the training sample proportion is 100%, as illustrated in Tables IV and VII. However, as training samples decrease, the performance curves converge until the training sample proportion is 10%, at which point MTP's impact is minimal. Subsequent reductions in training samples lead to decreased accuracies across all models, yet the

3) Decoder Parameter Reusing: MTP utilizes task-specific decoders for the segmentation and detection tasks. Hence, reusing these decoder weights during finetuning seems a natural choice. However, only the semantic segmentation and rotated detection decoders are eligible for reuse, depending on the segmentor or detector used in existing methods. We conduct experiments accordingly. During finetuning, aside from the backbone network, we initialize the corresponding decoders with the pretrained weights. Employing ViT-B + RVSA, the results are presented in Table X. Across the six datasets, decoder parameter reusing (DPR) proves beneficial in only three scenarios.

TABLE X
THE RESULTS OF FINETUNING ViT-B + RVSA WITH AND WITHOUT DECODER PARAMETER REUSING (DPR).

Model    | SpaceNetv1 | LoveDA     | DIOR-R
w/o DPR  | 79.63      | 52.39      | 71.29
w DPR    | 79.54      | 51.83      | 71.94

Model    | FAIR1M-2.0 | DOTA-V1.0  | DOTA-V2.0
w/o DPR  | 51.92      | 80.67      | 55.22
w DPR    | 52.19      | 80.54      | 55.78
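To make the DPR setup concrete, the sketch below selects which pretrained parameters are carried into finetuning, keeping either the backbone alone or the backbone plus a reusable decoder. The checkpoint layout and the "backbone."/"decode_head." key prefixes are illustrative assumptions rather than the actual MTP checkpoint format.

```python
import torch

# Stand-in for torch.load("mtp_checkpoint.pth"); keys follow a hypothetical naming scheme.
mtp_state = {
    "backbone.blocks.0.attn.qkv.weight": torch.zeros(3, 3),
    "decode_head.conv_seg.weight": torch.zeros(5, 3, 1, 1),
    "roi_head.bbox_head.fc_cls.weight": torch.zeros(10, 3),
}

def select(state, prefixes):
    """Keep only parameters whose names start with one of the given prefixes."""
    return {k: v for k, v in state.items() if k.startswith(tuple(prefixes))}

backbone_only = select(mtp_state, ["backbone."])                  # default transfer
with_decoder = select(mtp_state, ["backbone.", "decode_head."])   # DPR variant
print(sorted(with_decoder))
# Either dict would then be loaded into the downstream segmentor/detector
# with strict=False before finetuning.
```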
Fig. 4. Visualization of the horizontal object detection predictions of MAE + MTP pretrained ViT-L + RVSA. The images of the first and the second rows
are from Xview and DIOR testing sets, respectively.
Fig. 5. Visualization of the rotated object detection predictions of MAE + MTP pretrained ViT-L + RVSA. The images in four rows are from the testing
sets of DIOR-R, FAIR1M-2.0, DOTA-V1.0 and DOTA-V2.0, respectively.
Fig. 6. Visualization of the semantic segmentation predictions of MAE + MTP pretrained ViT-L + RVSA. The samples of the first and the second rows are
from SpaceNetv1 and LoveDA testing sets, respectively.
Fig. 7. Visualization of the bi-temporal change detection predictions of MAE + MTP pretrained ViT-L + RVSA. The samples in four rows are from the
testing sets of OSCD, WHU, LEVIR and SVCD/CDD, respectively. (a)(b)(e)(f) depict bi-temporal images of different samples, with (c) and (g) representing
corresponding ground truth labels. Our prediction maps are shown at (d) and (h).
Notably, on the segmentation tasks, the DPR models fail to outperform the MTP models without DPR. Consequently, we conclude that after MTP, reusing the pretrained decoder parameters in finetuning is unnecessary. Typically, decoders encode task-specific information. However, given that the SAMRS dataset used for pretraining involves annotations generated by SAM [47], they inevitably contain errors, jeopardizing the quality of the pretrained decoders.

G. Visualization

To further show the efficacy of MTP in enhancing RS foundation models, we present the predictions of the MAE + MTP pretrained ViT-L + RVSA across detection, segmentation, and change detection tasks in Figures 4-7. For detection, we demonstrate results across diverse scenes using horizontal or rotated bounding boxes. For segmentation, we display the original images alongside the segmentation maps, highlighting building extraction masks in red. For change detection, we provide the bi-temporal images, ground truths, and predicted change maps. Our model accurately detects RS objects, extracts buildings, classifies land cover categories, and characterizes changes across diverse types. In summary, MTP enables the construction of an RS foundation model with over 300M parameters, which achieves superior representation capability for various downstream tasks.

V. CONCLUSION

In this paper, we introduce the multi-task pretraining (MTP) approach for building RS foundation models. MTP utilizes a shared encoder and task-specific decoder architecture to effectively pretrain convolutional neural networks and vision transformer backbones on three tasks: semantic segmentation, instance segmentation, and rotated object detection, in a unified supervised learning framework. We evaluate MTP by examining the finetuning accuracy of these pretrained models on 14 datasets covering various downstream RS tasks. Our results demonstrate the competitive performance of these models compared to existing methods, even larger ones. Further experiments indicate that MTP excels in low-data finetuning scenarios but may offer diminishing returns with prolonged finetuning on large-scale datasets. We hope this research encourages further exploration of RS foundation models, especially in resource-constrained settings. Additionally, we anticipate the widespread application of these models across diverse fields of RS image interpretation due to their strong representation capabilities.

ACKNOWLEDGEMENT

The numerical calculations in this paper are partly supported by the Dawning Information Industry Co., Ltd.

REFERENCES

[1] Z. Zhu, Y. Zhou, K. C. Seto, E. C. Stokes, C. Deng, S. T. Pickett, and H. Taubenböck, "Understanding an urbanizing planet: Strategic directions for remote sensing," Remote Sensing of Environment, vol. 228, pp. 164–182, 2019.
[2] Q. Yuan, H. Shen, T. Li, Z. Li, S. Li, Y. Jiang, H. Xu, W. Tan, Q. Yang, J. Wang, J. Gao, and L. Zhang, "Deep learning in environmental remote sensing: Achievements and challenges," Remote Sensing of Environment, vol. 241, p. 111716, 2020.
[3] F. Dell’Acqua and P. Gamba, “Remote sensing and earthquake damage [27] K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell,
assessment: Experiences, limits, and perspectives,” Proceedings of the and S. Ermon, “Geography-aware self-supervised learning,” in ICCV,
IEEE, vol. 100, no. 10, pp. 2876–2890, 2012. pp. 10181–10190, October 2021.
[4] J. Kang, R. Fernandez-Beltran, P. Duan, S. Liu, and A. J. Plaza, “Deep [28] U. Mall, B. Hariharan, and K. Bala, “Change-aware sampling and
unsupervised embedding for remotely sensed images based on spatially contrastive learning for satellite images,” in CVPR, pp. 5261–5270,
augmented momentum contrast,” IEEE Transactions on Geoscience June 2023.
and Remote Sensing, vol. 59, pp. 2598–2610, Mar. 2021. [29] O. Mañas, A. Lacoste, X. Giro-i Nieto, D. Vazquez, and P. Rodriguez,
[5] D. Wang, J. Zhang, B. Du, G.-S. Xia, and D. Tao, “An empirical study “Seasonal contrast: Unsupervised pre-training from uncurated remote
of remote sensing pretraining,” IEEE Transactions on Geoscience and sensing data,” in ICCV, pp. 9414–9423, 2021.
Remote Sensing, vol. 61, pp. 1–20, 2023. [30] D. Wang, Q. Zhang, Y. Xu, J. Zhang, B. Du, D. Tao, and L. Zhang,
[6] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional map “Advancing plain vision transformer toward remote sensing founda-
of the world,” in CVPR, pp. 6172–6180, 2018. tion model,” IEEE Transactions on Geoscience and Remote Sensing,
[7] G. Sumbul, M. Charfuelan, B. Demir, and V. Markl, “Bigearthnet: A vol. 61, pp. 1–15, 2023.
large-scale benchmark archive for remote sensing image understand- [31] Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke,
ing,” in IGARSS, pp. 5901–5904, IEEE, 2019. D. Lobell, and S. Ermon, “Satmae: Pre-training transformers for
[8] Y. Long, G.-S. Xia, S. Li, W. Yang, M. Y. Yang, X. X. Zhu, temporal and multi-spectral satellite imagery,” in NeurIPS, vol. 35,
L. Zhang, and D. Li, “On creating benchmark dataset for aerial image pp. 197–211, 2022.
interpretation: Reviews, guidances and million-aid,” IEEE Journal of [32] X. Sun, P. Wang, W. Lu, Z. Zhu, X. Lu, Q. He, J. Li, X. Rong, Z. Yang,
Selected Topics in Applied Earth Observations and Remote Sensing, H. Chang, Q. He, G. Yang, R. Wang, J. Lu, and K. Fu, “Ringmo: A
vol. 14, pp. 4205–4230, 2021. remote sensing foundation model with masked image modeling,” IEEE
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Ima- Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–22,
geNet: A large-scale hierarchical image database,” in CVPR, pp. 248– 2023.
255, 2009. [33] F. Yao, W. Lu, H. Yang, L. Xu, C. Liu, L. Hu, H. Yu, N. Liu,
[10] Y. Long, G.-S. Xia, L. Zhang, G. Cheng, and D. Li, “Aerial scene C. Deng, D. Tang, C. Chen, J. Yu, X. Sun, and K. Fu, “RingMo-
parsing: From tile-level scene classification to pixel-wise semantic Sense: Remote sensing foundation model for spatiotemporal prediction
labeling,” arXiv preprint arXiv:2201.01953, 2022. via spatiotemporal evolution disentangling,” IEEE Transactions on
[11] T. Zhang, P. Gao, H. Dong, Y. Zhuang, G. Wang, W. Zhang, and Geoscience and Remote Sensing, vol. 61, pp. 1–21, 2023.
H. Chen, “Consecutive Pre-Training: A knowledge transfer learning [34] D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li,
strategy with relevant unlabeled data for remote sensing domain,” X. Jia, A. Plaza, et al., “Spectralgpt: Spectral foundation model,” arXiv
Remote Sensing, vol. 14, no. 22, 2022. preprint arXiv:2311.07113, 2023.
[12] C. Tao, J. Qi, G. Zhang, Q. Zhu, W. Lu, and H. Li, “TOV: The original [35] K. Cha, J. Seo, and T. Lee, “A billion-scale foundation model for
vision model for optical remote sensing image understanding via self- remote sensing images,” arXiv preprint arXiv:2304.05215, 2023.
supervised learning,” IEEE Journal of Selected Topics in Applied Earth [36] C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp,
Observations and Remote Sensing, vol. 16, pp. 4916–4930, 2023. K. Keutzer, S. Candido, M. Uyttendaele, and T. Darrell, “Scale-
[13] D. Wang, J. Zhang, B. Du, M. Xu, L. Liu, D. Tao, and L. Zhang, MAE: A scale-aware masked autoencoder for multiscale geospatial
“SAMRS: Scaling-up remote sensing segmentation dataset with seg- representation learning,” in ICCV, pp. 4088–4099, October 2023.
ment anything model,” in NeurIPS Track on Datasets and Benchmarks, [37] M. Zhang, Q. Liu, and Y. Wang, “Ctxmim: Context-enhanced masked
2023. image modeling for remote sensing image understanding,” arXiv
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- preprint arXiv:2310.00022, 2023.
training of deep bidirectional transformers for language understanding,” [38] D. Muhtar, X. Zhang, P. Xiao, Z. Li, and F. Gu, “CMID: A unified self-
in NAACL, pp. 4171–4186, June 2019. supervised learning framework for remote sensing image understand-
[15] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, ing,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language pp. 1–17, 2023.
models are few-shot learners,” NeurIPS, vol. 33, pp. 1877–1901, 2020. [39] M. Tang, A. Cozma, K. Georgiou, and H. Qi, “Cross-Scale MAE: A
[16] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and tale of multiscale exploitation in remote sensing,” NeurIPS, vol. 36,
A. Joulin, “Emerging properties in self-supervised vision transformers,” 2024.
in ICCV, pp. 9650–9660, October 2021. [40] A. Fuller, K. Millard, and J. Green, “CROMA: Remote sensing
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, representations with contrastive radar-optical masked autoencoders,”
G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable NeurIPS, vol. 36, 2024.
visual models from natural language supervision,” in ICML, pp. 8748– [41] Y. Wang, H. H. Hernández, C. M. Albrecht, and X. X. Zhu, “Feature
8763, PMLR, 2021. guided masked autoencoder for self-supervised learning in remote
[18] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple frame- sensing,” arXiv preprint arXiv:2310.18653, 2023.
work for contrastive learning of visual representations,” in ICML, [42] X. Guo, J. Lao, B. Dang, Y. Zhang, L. Yu, L. Ru, L. Zhong, Z. Huang,
pp. 1597–1607, PMLR, 2020. K. Wu, D. Hu, et al., “Skysense: A multi-modal remote sensing
[19] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast foundation model towards universal interpretation for earth observation
for unsupervised visual representation learning,” in CVPR, June 2020. imagery,” arXiv preprint arXiv:2312.10115, 2023.
[20] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, [43] Y. Wang, C. M. Albrecht, N. A. A. Braham, C. Liu, Z. Xiong, and
C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., “Boot- X. X. Zhu, “DeCUR: decoupling common & unique representations for
strap your own latent-a new approach to self-supervised learning,” multimodal self-supervision,” arXiv preprint arXiv:2309.05300, 2023.
NeurIPS, vol. 33, pp. 21271–21284, 2020. [44] Y. Feng, P. Wang, W. Diao, Q. He, H. Hu, H. Bi, X. Sun, and K. Fu,
[21] H. Bao, L. Dong, S. Piao, and F. Wei, “BEiT: BERT pre-training of “A self-supervised cross-modal remote sensing foundation model with
image transformers,” in ICLR, 2022. multi-domain representation and cross-domain fusion,” in IGARSS,
[22] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and pp. 2239–2242, 2023.
H. Hu, “SimMIM: A simple framework for masked image modeling,” [45] Z. Huang, M. Zhang, Y. Gong, Q. Liu, and Y. Wang, “Generic
in CVPR, pp. 9653–9663, June 2022. knowledge boosted pretraining for remote sensing images,” IEEE
[23] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–13,
autoencoders are scalable vision learners,” in CVPR, pp. 16000–16009, 2024.
June 2022. [46] M. Mendieta, B. Han, X. Shi, Y. Zhu, and C. Chen, “Towards geospatial
[24] P. Akiva, M. Purri, and M. Leotta, “Self-supervised material and texture foundation models via continual pretraining,” in ICCV, pp. 16806–
representation learning for remote sensing tasks,” in CVPR, pp. 8203– 16816, October 2023.
8215, June 2022. [47] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson,
[25] G. Mai, N. Lao, Y. He, J. Song, and S. Ermon, “CSP: Self-supervised T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollar, and R. Girshick,
contrastive spatial pre-training for geospatial-visual representations,” in “Segment anything,” in ICCV, pp. 4015–4026, October 2023.
ICML, PMLR, 2023. [48] X.-Y. Tong, G.-S. Xia, Q. Lu, H. Shen, S. Li, S. You, and L. Zhang,
[26] V. V. Cepeda, G. K. Nayak, and M. Shah, “GeoCLIP: Clip-inspired “Land-cover classification with high-resolution remote sensing images
alignment between locations and images for effective worldwide geo- using transferable deep models,” Remote Sensing of Environment,
localization,” in NeurIPS, 2023. vol. 237, p. 111322, 2020.
[49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image [75] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
recognition,” in CVPR, pp. 770–778, 2016. time object detection with region proposal networks,” IEEE Transac-
[50] W. Li, K. Chen, H. Chen, and Z. Shi, “Geographical knowledge-driven tions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–
representation learning for remote sensing images,” IEEE Transactions 1149, June 2017.
on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022. [76] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional net-
[51] C. Jun, Y. Ban, and S. Li, “Open access to earth land-cover map,” works for biomedical image segmentation,” in MICCAI, pp. 234–241,
Nature, vol. 514, no. 7523, pp. 434–434, 2014. Springer, 2015.
[52] W. Li, K. Chen, and Z. Shi, “Geographical supervision correction [77] J. Ding, N. Xue, G.-S. Xia, X. Bai, W. Yang, M. Y. Yang, S. Belongie,
for remote sensing representation learning,” IEEE Transactions on J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Object detection in
Geoscience and Remote Sensing, vol. 60, pp. 1–20, 2022. aerial images: A large-scale benchmark and challenges,” IEEE Trans-
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification actions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11,
with deep convolutional neural networks,” in NeurIPS, vol. 25, 2012. pp. 7778–7796, 2022.
[54] K. Simonyan and A. Zisserman, “Very deep convolutional networks [78] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in
for large-scale image recognition,” in ICLR, May 2015. optical remote sensing images: A survey and a new benchmark,” ISPRS
journal of photogrammetry and remote sensing, vol. 159, pp. 296–307,
[55] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
2020.
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
[79] X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li,
convolutions,” in CVPR, pp. 1–9, 2015.
Y. Feng, T. Xu, et al., “FAIR1M: A benchmark dataset for fine-grained
[56] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely object recognition in high-resolution remote sensing imagery,” ISPRS
connected convolutional networks,” in CVPR, pp. 4700–4708, 2017. Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–
[57] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and 130, 2022.
B. Guo, “Swin transformer: Hierarchical vision transformer using [80] J. Wang, Z. Zheng, X. Lu, and Y. Zhong, “LoveDA: A remote sensing
shifted windows,” in ICCV, pp. 10012–10022, 2021. land-cover dataset for domain adaptive semantic segmentation,” in
[58] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “ViTAEv2: Vision transformer NeurIPS Track on Datasets and Benchmarks, 2021.
advanced by exploring inductive bias for image recognition and be- [81] W. Wang, J. Dai, Z. Chen, Z. Huang, Z. Li, X. Zhu, X. Hu, T. Lu, L. Lu,
yond,” International Journal of Computer Vision, pp. 1–22, 2023. H. Li, et al., “Internimage: Exploring large-scale vision foundation
[59] F. Bastani, P. Wolters, R. Gupta, J. Ferdinando, and A. Kembhavi, models with deformable convolutions,” in CVPR, pp. 14408–14419,
“Satlaspretrain: A large-scale dataset for remote sensing image under- 2023.
standing,” in ICCV, pp. 16772–16782, October 2023. [82] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “VSA: Learning varied-
[60] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, size window attention in vision transformers,” in ECCV, pp. 466–483,
D. Downey, and N. A. Smith, “Don’t stop pretraining: Adapt language Springer, 2022.
models to domains and tasks,” in ACL, pp. 8342–8360, July 2020. [83] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision
[61] L. M. Dery, P. Michel, A. Talwalkar, and G. Neubig, “Should we be transformer backbones for object detection,” in ECCV, pp. 280–296,
pre-training? an argument for end-task aware training as an alternative,” Springer, 2022.
in ICLR, 2022. [84] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
[62] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
parsing for scene understanding,” in ECCV, pp. 418–434, 2018. J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
[63] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, Transformers for image recognition at scale,” ICLR, 2021.
“Masked-attention mask transformer for universal image segmenta- [85] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei,
tion,” in CVPR, pp. 1290–1299, June 2022. “Deformable convolutional networks,” in CVPR, pp. 764–773, 2017.
[64] W. Li, H. Chen, and Z. Shi, “Semantic segmentation of remote sensing [86] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More
images with self-supervised multitask representation learning,” IEEE deformable, better results,” in CVPR, June 2019.
Journal of Selected Topics in Applied Earth Observations and Remote [87] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv
Sensing, vol. 14, pp. 6438–6450, 2021. e-prints, p. arXiv:1607.06450, July 2016.
[65] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, [88] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
X. Huang, B. Li, C. Li, et al., “Florence: A new foundation model for Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in
computer vision,” arXiv preprint arXiv:2111.11432, 2021. NeurIPS, pp. 5998–6008, 2017.
[66] C. Wu, J. Liang, L. Ji, F. Yang, Y. Fang, D. Jiang, and N. Duan, [89] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),”
“Nüwa: Visual synthesis pre-training for neural visual world creation,” arXiv preprint arXiv:1606.08415, 2016.
in ECCV, pp. 720–736, Springer, 2022. [90] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV,
[67] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pp. 2980–2988, 2017.
pre-training for unified vision-language understanding and generation,” [91] X. Xie, G. Cheng, J. Wang, X. Yao, and J. Han, “Oriented r-cnn for
in ICML, pp. 12888–12900, PMLR, 2022. object detection,” in ICCV, pp. 3520–3529, October 2021.
[92] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,”
[68] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang,
in ICLR, 2019.
L. Yuan, L. Zhang, J.-N. Hwang, et al., “Grounded language-image
[93] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel
pre-training,” in CVPR, pp. 10965–10975, 2022.
dataset and deep learning benchmark for land use and land cover
[69] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, classification,” IEEE Journal of Selected Topics in Applied Earth
“CoCa: Contrastive captioners are image-text foundation models,” Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019.
Transactions on Machine Learning Research, 2022. [94] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classi-
[70] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, fication: Benchmark and state of the art,” Proceedings of the IEEE,
O. K. Mohammed, S. Singhal, S. Som, and F. Wei, “Image as a foreign vol. 105, no. 10, pp. 1865–1883, 2017.
language: Beit pretraining for vision and vision-language tasks,” in [95] M. Neumann, A. S. Pinto, X. Zhai, and N. Houlsby, “In-
CVPR, pp. 19175–19186, June 2023. domain representation learning for remote sensing,” arXiv preprint
[71] Y. Xu, Q. Zhang, J. Zhang, and D. Tao, “ViTAE: Vision transformer arXiv:1911.06721, 2019.
advanced by exploring intrinsic inductive bias,” NeurIPS, vol. 34, 2021. [96] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical
[72] Q. Zhang, J. Zhang, Y. Xu, and D. Tao, “Vision transformer with automated data augmentation with a reduced search space,” in CVPRW,
quadrangle attention,” IEEE Transactions on Pattern Analysis and June 2020.
Machine Intelligence, 2024. [97] J. Tian, J. Lei, J. Zhang, W. Xie, and Y. Li, “SwiMDiff: Scene-
[73] Y. Hu, J. Yuan, C. Wen, X. Lu, and X. Li, “RSGPT: A re- wide matching contrastive learning with diffusion constraint for remote
mote sensing vision language model and benchmark,” arXiv preprint sensing image,” arXiv preprint arXiv:2401.05093, 2024.
arXiv:2307.15266, 2023. [98] J. Irvin, L. Tao, J. Zhou, Y. Ma, L. Nashold, B. Liu, and A. Y.
[74] C. Wu, B. Du, and L. Zhang, “Fully convolutional change detec- Ng, “Usat: A unified self-supervised encoder for multi-sensor satellite
tion framework with generative adversarial network for unsupervised, imagery,” arXiv preprint arXiv:2312.02199, 2023.
weakly supervised and regional supervised change detection,” IEEE [99] M. Noman, M. Naseer, H. Cholakkal, R. M. Anwar, S. Khan, and F. S.
Transactions on Pattern Analysis and Machine Intelligence, vol. 45, Khan, “Rethinking transformers pre-training for multi-spectral satellite
no. 8, pp. 9774–9788, 2023. imagery,” arXiv preprint arXiv:2403.05419, 2024.
[100] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, [125] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, and
Y. Bulatov, and B. McCord, “xview: Objects in context in overhead K. Chen, “RTMDet: An empirical study of designing real-time object
imagery,” arXiv preprint arXiv:1802.07856, 2018. detectors,” arXiv preprint arXiv:2212.07784, 2022.
[101] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for [126] Z. Dong, Y. Gu, and T. Liu, “Generative convnet foundation model with
dense object detection,” IEEE Transactions on Pattern Analysis and sparse modeling and low-frequency reconstruction for remote sensing
Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020. image interpretation,” IEEE Transactions on Geoscience and Remote
[102] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, Sensing, vol. 62, pp. 1–16, 2024.
M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object [127] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie,
detection in aerial images,” in CVPR, June 2018. “A convnet for the 2020s,” in CVPR, pp. 11976–11986, 2022.
[103] G. Cheng, J. Wang, K. Li, X. Xie, C. Lang, Y. Yao, and J. Han, [128] Y. Pu, Y. Wang, Z. Xia, Y. Han, Y. Wang, W. Gan, Z. Wang,
“Anchor-free oriented proposal generator for object detection,” IEEE S. Song, and G. Huang, “Adaptive rotated convolution for rotated object
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–11, detection,” in ICCV, pp. 6589–6600, October 2023.
2022. [129] Y. Li, Q. Hou, Z. Zheng, M.-M. Cheng, J. Yang, and X. Li, “Large
[104] C. Xu, J. Ding, J. Wang, W. Yang, H. Yu, L. Yu, and G.-S. Xia, selective kernel network for remote sensing object detection,” in ICCV,
“Dynamic coarse-to-fine learning for oriented tiny object detection,” pp. 16794–16805, October 2023.
in CVPR, pp. 7318–7328, 2023. [130] H. Yu, Y. Tian, Q. Ye, and Y. Liu, “Spatial transform decoupling for
[105] A.-F. O. Detector, “FCOS: A simple and strong anchor-free object oriented object detection,” arXiv preprint arXiv:2308.10561, 2023.
detector,” IEEE Transactions on Pattern Analysis and Machine Intelli- [131] X. Zhang, Y. Tian, L. Xie, W. Huang, Q. Dai, Q. Ye, and Q. Tian,
gence, vol. 44, no. 4, 2022. “HiViT: A simpler and more efficient design of hierarchical vision
[106] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap transformer,” in ICLR, 2023.
between anchor-based and anchor-free detection via adaptive training [132] A. Van Etten, D. Lindenbaum, and T. M. Bacastow, “Spacenet:
sample selection,” in CVPR, June 2020. A remote sensing dataset and challenge series,” arXiv preprint
[107] X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and arXiv:1807.01232, 2018.
K. Fu, “SCRDet: Towards more robust detection for small, cluttered [133] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia,
and rotated objects,” in ICCV, pp. 8231–8240, 2019. “PSANet: Point-wise spatial attention network for scene parsing,” in
[108] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G.-S. Xia, and X. Bai, ECCV, 2018.
“Gliding vertex on the horizontal bounding box for multi-oriented [134] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
object detection,” IEEE Transactions on Pattern Analysis and Machine network,” in CVPR, pp. 6230–6239, 2017.
Intelligence, vol. 43, no. 4, pp. 1452–1459, 2020. [135] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam,
[109] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, “Learning RoI “Encoder-decoder with atrous separable convolution for semantic im-
transformer for oriented object detection in aerial images,” in CVPR, age segmentation,” in ECCV, pp. 801–818, 2018.
pp. 2844–2853, 2019. [136] Z. Zheng, Y. Zhong, J. Wang, and A. Ma, “Foreground-aware relation
[110] X. Yang, J. Yan, Z. Feng, and T. He, “R3Det: Refined single-stage network for geospatial object segmentation in high spatial resolution
detector with feature refinement for rotating object,” AAAI, vol. 35, remote sensing imagery,” in CVPR, pp. 4095–4104, 2020.
pp. 3163–3171, May 2021. [137] A. Ma, J. Wang, Y. Zhong, and Z. Zheng, “FactSeg: Foreground
[111] L. Hou, K. Lu, J. Xue, and Y. Li, “Shape-adaptive selection and activation-driven small object semantic segmentation in large-scale re-
measurement for oriented object detection,” in AAAI, vol. 36, pp. 923– mote sensing imagery,” IEEE Transactions on Geoscience and Remote
932, 2022. Sensing, vol. 60, pp. 1–16, 2022.
[112] L. Dai, H. Liu, H. Tang, Z. Wu, and P. Song, “AO2-DETR: Arbitrary- [138] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang,
oriented object detection transformer,” IEEE Transactions on Circuits W. Liu, and J. Wang, “High-resolution representations for labeling
and Systems for Video Technology, 2022. pixels and regions,” arXiv preprint arXiv:1904.04514, 2019.
[139] L. Wang, R. Li, C. Duan, C. Zhang, X. Meng, and S. Fang, “A novel
[113] J. Han, J. Ding, J. Li, and G.-S. Xia, “Align deep features for ori-
transformer based semantic segmentation scheme for fine-resolution
ented object detection,” IEEE Transactions on Geoscience and Remote
remote sensing images,” IEEE Geoscience and Remote Sensing Letters,
Sensing, vol. 60, p. 3062048, Jan. 2022.
vol. 19, pp. 1–5, 2022.
[114] J. Han, J. Ding, N. Xue, and G.-S. Xia, “ReDet: A rotation-equivariant
[140] L. Wang, R. Li, C. Zhang, S. Fang, C. Duan, X. Meng, and P. M.
detector for aerial object detection,” in CVPR, pp. 2786–2795, June
Atkinson, “UNetFormer: A unet-like transformer for efficient semantic
2021.
segmentation of remote sensing urban scene imagery,” ISPRS Journal
[115] X. Yang, X. Yang, J. Yang, Q. Ming, W. Wang, Q. Tian, and J. Yan, of Photogrammetry and Remote Sensing, vol. 190, pp. 196–214, 2022.
“Learning high-precision bounding box for rotated object detection via
[141] R. Xu, C. Wang, J. Zhang, S. Xu, W. Meng, and X. Zhang, “RSS-
Kullback-Leibler divergence,” in NeurIPS, 2021.
Former: Foreground saliency enhancement for remote sensing land-
[116] X. Yang, J. Yan, M. Qi, W. Wang, Z. Xiaopeng, and T. Qi, “Rethinking cover segmentation,” IEEE Transactions on Image Processing, vol. 32,
rotated object detection with gaussian wasserstein distance loss,” in pp. 1052–1064, 2023.
ICML, 2021. [142] Y. Chen, P. Fang, J. Yu, X. Zhong, X. Zhang, and T. Li, “Hi-resnet:
[117] D. Liang, Q. Geng, Z. Wei, D. A. Vorontsov, E. L. Kim, M. Wei, A high-resolution remote sensing network for semantic segmentation,”
and H. Zhou, “Anchor retouching via model interaction for robust arXiv preprint arXiv:2305.12691, 2023.
object detection in aerial images,” IEEE Transactions on Geoscience [143] K. Yamazaki, T. Hanyu, M. Tran, A. Garcia, A. Tran, R. McCann,
and Remote Sensing, vol. 60, pp. 1–13, 2022. H. Liao, C. Rainwater, M. Adkins, A. Molthan, et al., “AerialFormer:
[118] G. Cheng, Y. Yao, S. Li, K. Li, X. Xie, J. Wang, X. Yao, and J. Han, Multi-resolution transformer for aerial image segmentation,” arXiv
“Dual-aligned oriented detector,” IEEE Transactions on Geoscience preprint arXiv:2306.06842, 2023.
and Remote Sensing, vol. 60, pp. 1–11, 2022. [144] R. Caye Daudt, B. Le Saux, and A. Boulch, “Fully convolutional
[119] X. Wang, G. Wang, Q. Dang, Y. Liu, X. Hu, and D. Yu, “PP-YOLOE- siamese networks for change detection,” in ICIP, pp. 4063–4067, 2018.
R: An efficient anchor-free rotated object detector,” arXiv preprint [145] S. Fang, K. Li, J. Shao, and Z. Li, “SNUNet-CD: A densely connected
arXiv:2211.02386, 2022. siamese network for change detection of vhr images,” IEEE Geoscience
[120] S. Xu, X. Wang, W. Lv, Q. Chang, C. Cui, K. Deng, G. Wang, Q. Dang, and Remote Sensing Letters, vol. 19, p. 3056416, Jan. 2022.
S. Wei, Y. Du, et al., “PP-YOLOE: An evolved version of yolo,” arXiv [146] H. Chen, Z. Qi, and Z. Shi, “Remote sensing image change detection
preprint arXiv:2203.16250, 2022. with transformers,” IEEE Transactions on Geoscience and Remote
[121] K. H. Wentong Li, Yijie Chen and J. Zhu, “Oriented reppoints for Sensing, vol. 60, p. 3095166, Jan. 2022.
aerial object detection,” in CVPR, 2022. [147] M. Liu, Q. Shi, A. Marinoni, D. He, X. Liu, and L. Zhang, “Super-
[122] Z. Huang, W. Li, X.-G. Xia, and R. Tao, “A general gaussian heatmap Resolution-Based change detection network with stacked attention
label assignment for arbitrary-oriented object detection,” IEEE Trans- module for images with different resolutions,” IEEE Transactions on
actions on Image Processing, vol. 31, pp. 1895–1910, 2022. Geoscience and Remote Sensing, vol. 60, p. 3091758, Jan. 2022.
[123] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” [148] Z. Zheng, Y. Wan, Y. Zhang, S. Xiang, D. Peng, and B. Zhang,
arXiv preprint arXiv:1804.02767, 2018. “CLNet: Cross-layer convolutional neural network for change detection
[124] X. Yang, Y. Zhou, G. Zhang, J. Yang, W. Wang, J. Yan, X. Zhang, and in optical remote sensing imagery,” ISPRS Journal of Photogrammetry
Q. Tian, “The KFIou loss for rotated object detection,” in ICLR, 2023. and Remote Sensing, vol. 175, pp. 247–267, May 2021.
[149] C. Han, C. Wu, H. Guo, M. Hu, and H. Chen, “Hanet: A hierarchical [172] C. Han, C. Wu, and B. Du, “HCGMNet: A hierarchical change guiding
attention network for change detection with bitemporal very-high- map network for change detection,” in IGARSS, pp. 5511–5514, 2023.
resolution remote sensing images,” IEEE Journal of Selected Topics in [173] F. I. Diakogiannis, F. Waldner, and P. Caccetta, “Looking for change?
Applied Earth Observations and Remote Sensing, vol. 16, pp. 3867– roll the dice and demand attention,” Remote Sensing, vol. 13, no. 18,
3878, 2023. 2021.
[150] Y. Zhang, Y. Zhao, Y. Dong, and B. Du, “Self-supervised pretraining [174] C. Han, C. Wu, H. Guo, M. Hu, J. Li, and H. Chen, “Change guiding
via multimodality images with transformer for change detection,” IEEE network: Incorporating change prior to guide change detection in
Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–11, remote sensing imagery,” IEEE Journal of Selected Topics in Applied
2023. Earth Observations and Remote Sensing, vol. 16, pp. 8395–8407, 2023.
[151] W. G. C. Bandara and V. M. Patel, “A transformer-based siamese [175] K. Chen, C. Liu, W. Li, Z. Liu, H. Chen, H. Zhang, Z. Zou, and
network for change detection,” arXiv preprint arXiv:2201.01293, 2022. Z. Shi, “Time travelling pixels: Bitemporal features integration with
[152] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, foundation model for remote sensing image change detection,” 2023.
“Segformer: Simple and efficient design for semantic segmentation [176] S. Fang, K. Li, and Z. Li, “Changer: Feature interaction is what you
with transformers,” in NeurIPS, vol. 34, pp. 12077–12090, 2021. need for change detection,” IEEE Transactions on Geoscience and
[153] J. Zhang, Z. Shao, Q. Ding, X. Huang, Y. Wang, X. Zhou, and D. Li, Remote Sensing, vol. 61, pp. 1–11, 2023.
“AERNet: An attention-guided edge refinement network and a dataset [177] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, Y. Sun, T. He,
for remote sensing building change detection,” IEEE Transactions on J. Mueller, R. Manmatha, M. Li, and A. Smola, “ResNeSt: Split-
Geoscience and Remote Sensing, vol. 61, pp. 1–16, 2023. attention networks,” in CVPRW, pp. 2736–2746, June 2022.
[154] H. Zhang, M. Lin, G. Yang, and L. Zhang, “ESCNet: An end-to-end [178] X. Tang, T. Zhang, J. Ma, X. Zhang, F. Liu, and L. Jiao, “WNet: W-
superpixel-enhanced change detection network for very-high-resolution shaped hierarchical network for remote-sensing image change detec-
remote sensing images,” IEEE Transactions on Neural Networks and tion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61,
Learning Systems, vol. 34, no. 1, pp. 28–42, 2023. pp. 1–14, 2023.
[155] Q. Shi, M. Liu, S. Li, X. Liu, F. Wang, and L. Zhang, “A deeply [179] Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang, “Vision transformer
supervised attention metric-based network and an open aerial image with deformable attention,” in CVPR, pp. 4794–4803, 2022.
dataset for remote sensing change detection,” IEEE Transactions on [180] C. Han, C. Wu, M. Hu, J. Li, and H. Chen, “C2F-SemiCD: A coarse-
Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022. to-fine semi-supervised change detection method based on consistency
[156] Y. Wen, X. Ma, X. Zhang, and M. Pun, “GCD-DDPM: A generative regularization in high-resolution remote-sensing images,” IEEE Trans-
change detection model based on difference-feature guided ddpm,” actions on Geoscience and Remote Sensing, pp. 1–1, 2024.
arXiv: 2306.03424, 2023. [181] C. Zhao, Y. Tang, S. Feng, Y. Fan, W. Li, R. Tao, and L. Zhang, “High-
[157] J. Wang, Y. Zhong, and L. Zhang, “Change detection based on super- resolution remote sensing bitemporal image change detection based
vised contrastive learning for high-resolution remote sensing imagery,” on feature interaction and multitask learning,” IEEE Transactions on
IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1– Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023.
16, 2023. [182] S. Zhao, X. Zhang, P. Xiao, and G. He, “Exchanging dual-
[158] W. G. C. Bandara, N. G. Nair, and V. M. Patel, “Ddpm-cd: Remote encoder–decoder: A new strategy for change detection with semantic
sensing change detection using denoising diffusion probabilistic mod- guidance and spatial localization,” IEEE Transactions on Geoscience
els,” arXiv preprint arXiv:2206.11892, 2022. and Remote Sensing, vol. 61, pp. 1–16, 2023.
[159] H. Guo, B. Du, C. Wu, C. Han, and L. Zhang, “Deepcl: Deep change [183] M. Lin, G. Yang, and H. Zhang, “Transition is a process: Pair-to-video
feature learning on remote sensing images in the metric space,” arXiv change detection networks for very high resolution remote sensing
preprint arXiv:2307.12208, 2023. images,” IEEE Transactions on Image Processing, vol. 32, pp. 57–71,
[160] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convo- 2023.
lutional neural networks,” in ICML, pp. 6105–6114, 2019. [184] S. Dong, L. Wang, B. Du, and X. Meng, “ChangeCLIP: Remote sensing
[161] X. Li, L. Yan, Y. Zhang, and H. Zeng, “ESR-DMNet: Enhanced change detection with multimodal vision-language representation learn-
super-resolution-based dual-path metric change detection network for ing,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 208,
remote sensing images with different resolutions,” IEEE Transactions pp. 53–69, 2024.
on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024. [185] K. Li, X. Cao, and D. Meng, “A new learning paradigm for foun-
[162] Z. Zheng, A. Ma, L. Zhang, and Y. Zhong, “Change is everywhere: dation model-based remote sensing change detection,” arXiv preprint
Single-temporal supervised object change detection in remote sensing arXiv:2312.01163, 2023.
imagery,” in ICCV, pp. 15173–15182, 2021. [186] R. C. Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Urban change
[163] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual detection for multispectral earth observation using convolutional neural
transformations for deep neural networks,” in CVPR, pp. 1492–1500, networks,” in IGARSS, pp. 2115–2118, 2018.
2017. [187] S. Ji, S. Wei, and M. Lu, “Fully convolutional networks for multisource
[164] H. Guo, X. Su, C. Wu, B. Du, and L. Zhang, “Saan: Similarity-aware building extraction from an open aerial and satellite imagery data set,”
attention flow network for change detection with vhr remote sensing IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 1,
images,” arXiv preprint arXiv:2308.14570, 2023. pp. 574–586, 2019.
[165] A. Mohammadian and F. Ghaderi, “Siamixformer: a fully-transformer [188] H. Chen and Z. Shi, “A spatial-temporal attention-based method and
siamese network with temporal fusion for accurate building detection a new dataset for remote sensing image change detection,” Remote
and change detection in bi-temporal remote sensing images,” Inter- Sensing, vol. 12, no. 10, 2020.
national Journal of Remote Sensing, vol. 44, no. 12, pp. 3660–3678, [189] M. Lebedev, Y. V. Vizilter, O. Vygolov, V. A. Knyaz, and A. Y. Rubis,
2023. “Change detection in remote sensing images using conditional adver-
[166] Q. Li, R. Zhong, X. Du, and Y. Du, “Transunetcd: A hybrid transformer sarial networks,” The International Archives of the Photogrammetry,
network for change detection in optical remote-sensing images,” IEEE Remote Sensing and Spatial Information Sciences, vol. 42, pp. 565–
Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–19, 571, 2018.
2022. [190] S. Ji, S. Wei, and M. Lu, “Fully convolutional networks for multisource
[167] H. Chen, F. Pu, R. Yang, R. Tang, and X. Xu, “RDP-Net: Region building extraction from an open aerial and satellite imagery data set,”
detail preserving network for change detection,” IEEE Transactions on IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 1,
Geoscience and Remote Sensing, vol. 60, pp. 1–10, 2022. pp. 574–586, 2019.
[168] J. Liu, W. Xuan, Y. Gan, Y. Zhan, J. Liu, and B. Du, “An end-to-
end supervised domain adaptation framework for cross-domain change
APPENDIX
We present detailed finetuning accuracies of the three models, i.e., ViT-B + RVSA, ViT-L + RVSA, and InternImage-XL, on the DIOR, DIOR-R, FAIR1M-2.0, DOTA-V1.0, DOTA-V2.0, and LoveDA datasets in Tables XI-XVI.
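For reference, the summary row of each table (mAP for the detection benchmarks, mIOU for LoveDA) is the unweighted mean of the per-class scores listed above it. The short Python sketch below illustrates this aggregation using the Table XVI figures for ViT-B + RVSA without MTP; the variable and function names are illustrative and are not taken from the released code.

    # Minimal sketch: reproduce a summary row as the unweighted mean of the
    # per-class scores. Values are copied from Table XVI (ViT-B + RVSA, w/o MTP);
    # the helper name mean_metric is illustrative, not part of the released code.
    per_class_iou = {
        "background": 45.91, "building": 57.93, "road": 56.08, "water": 79.72,
        "barren": 16.49, "forest": 46.03, "agriculture": 61.48,
    }

    def mean_metric(per_class):
        # mAP (detection) and mIOU (segmentation) are class-averaged metrics.
        return sum(per_class.values()) / len(per_class)

    print(round(mean_metric(per_class_iou), 2))  # 51.95, matching the mIOU row of Table XVI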
TABLE XI
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE DIOR DATASET.
Category    ViT-B + RVSA (w/o MTP)    ViT-B + RVSA (w/ MTP)    ViT-L + RVSA (w/o MTP)    ViT-L + RVSA (w/ MTP)    InternImage-XL (w/o MTP)    InternImage-XL (w/ MTP)
airplane 68.2 87.5 76.5 93.8 65.0 69.0
airport 91.2 92.1 92.6 91.6 91.3 92.8
baseballfield 79.9 87.3 83.2 87.7 75.6 81.3
basketballcourt 88.0 89.4 90.7 92.1 89.3 90.1
bridge 53.6 58.0 58.8 64.6 59.3 59.1
chimney 82.1 83.7 84.1 85.9 84.9 84.8
expressway-service-area 90.6 92.9 92.3 94.3 92.8 93.9
expressway-toll-station 76.2 80.5 79.5 84.5 84.4 83.6
dam 78.2 82.0 79.3 81.4 80.2 82.0
golffield 84.9 88.1 85.7 87.3 86.2 83.9
groundtrackfield 83.9 85.6 85.3 86.7 85.9 87.4
harbor 56.8 62.4 60.4 64.5 62.2 63.3
overpass 67.4 69.8 70.6 72.0 68.8 68.6
ship 74.4 75.4 75.4 76.3 73.6 73.6
stadium 82.8 85.8 84.3 85.6 83.7 85.5
storagetank 61.4 62.5 65.3 62.6 59.6 57.7
tenniscourt 89.4 91.2 91.3 92.5 87.6 90.3
trainstation 76.1 80.3 76.6 79.7 77.7 77.7
vehicle 45.2 47.4 48.2 47.6 45.0 46.2
windmill 85.0 86.5 85.1 90.2 89.3 89.3
mAP 75.8 79.4 78.3 81.1 77.1 78.0
TABLE XII
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE DIOR-R DATASET.
Category    ViT-B + RVSA (w/o MTP)    ViT-B + RVSA (w/ MTP)    ViT-L + RVSA (w/o MTP)    ViT-L + RVSA (w/ MTP)    InternImage-XL (w/o MTP)    InternImage-XL (w/ MTP)
airplane 72.1 89.6 81.2 90.7 72.0 72.3
airport 51.1 52.6 51.9 63.4 61.5 63.6
baseballfield 80.8 81.2 81.1 90.0 80.6 80.9
basketballcourt 81.3 87.8 90.1 90.1 90.0 90.1
bridge 44.9 48.6 48.1 56.4 53.5 54.7
chimney 72.7 77.2 78.2 81.5 81.5 81.5
expressway-service-area 87.5 89.1 88.4 89.4 89.9 89.7
expressway-toll-station 69.3 71.6 74.7 80.1 79.5 79.6
dam 35.5 43.3 39.8 39.9 43.0 45.9
golffield 78.4 79.0 79.4 79.3 80.0 79.3
groundtrackfield 81.9 84.2 84.3 85.1 85.2 85.3
harbor 43.3 51.3 46.4 56.0 54.8 55.2
overpass 60.1 60.9 60.5 67.2 64.2 65.8
ship 81.2 81.2 81.2 81.1 81.3 81.3
stadium 81.6 83.7 83.4 78.9 78.2 79.2
storagetank 70.5 71.2 71.4 71.4 62.7 62.8
tenniscourt 89.2 90.2 90.0 90.4 81.5 90.1
trainstation 65.6 66.6 65.2 73.9 66.7 67.5
vehicle 49.3 50.5 51.0 51.8 50.5 51.1
windmill 65.1 66.0 64.6 74.3 66.1 67.2
mAP 68.1 71.3 70.5 74.5 71.1 72.2
TABLE XIII
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE FAIR1M-2.0 DATASET.
Category    ViT-B + RVSA (w/o MTP)    ViT-B + RVSA (w/ MTP)    ViT-L + RVSA (w/o MTP)    ViT-L + RVSA (w/ MTP)    InternImage-XL (w/o MTP)    InternImage-XL (w/ MTP)
Boeing737 47.17 44.78 48.91 43.21 41.36 47.48
Boeing747 93.31 93.73 95.00 94.95 95.76 95.67
Boeing777 41.48 45.01 36.86 36.20 50.87 44.57
Boeing787 60.01 57.03 64.70 61.42 63.66 66.78
C919 42.43 44.27 33.94 50.14 51.95 54.52
A220 53.64 52.23 56.51 52.70 54.10 58.17
A321 71.56 72.12 72.98 72.68 69.31 70.76
A330 61.97 64.22 60.78 64.01 68.53 64.66
A350 75.76 73.14 75.94 70.90 80.19 78.89
ARJ21 20.36 20.93 19.66 23.14 22.67 22.78
Passenger Ship 20.12 21.29 20.38 19.80 15.76 15.55
Motorboat 71.17 71.73 72.72 73.07 64.86 66.28
Fishing Boat 35.42 36.30 36.30 33.94 28.38 31.34
Tugboat 31.12 32.66 36.00 32.52 27.33 25.38
Engineering Ship 22.75 25.90 29.02 28.63 20.16 21.00
Liquid Cargo Ship 48.73 49.30 52.80 49.31 46.91 46.32
Dry Cargo Ship 53.07 53.12 53.87 51.09 49.18 49.13
Warship 38.17 40.88 45.66 43.44 33.45 38.04
Small Car 76.77 76.98 77.65 77.23 72.77 72.92
Bus 45.60 42.16 51.73 51.73 47.59 46.79
Cargo Truck 59.64 59.87 61.56 60.53 56.15 56.32
Dump Truck 61.73 61.85 63.08 60.95 57.69 57.48
Van 77.08 77.33 78.22 77.74 73.00 73.08
Trailer 19.74 19.34 23.63 22.48 14.48 17.55
Tractor 1.33 1.79 2.21 1.83 0.71 1.23
Excavator 23.38 25.03 27.58 29.12 21.49 18.68
Truck Tractor 50.00 49.83 48.71 52.15 50.62 48.44
Basketball Court 64.50 63.35 63.46 64.40 60.69 60.46
Tennis Court 91.45 90.71 92.09 91.53 89.07 90.22
Football Field 66.23 67.21 70.72 70.75 68.44 68.98
Baseball Field 91.81 91.52 92.27 91.80 87.93 88.78
Intersection 63.65 64.73 64.94 66.22 64.30 63.28
Roundabout 26.13 25.43 28.36 31.00 30.71 27.07
Bridge 45.91 49.38 50.66 51.32 42.81 43.33
mAP 51.56 51.92 53.20 53.00 50.67 50.94
TABLE XIV
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE DOTA-V1.0 DATASET.
Category    ViT-B + RVSA (w/o MTP)    ViT-B + RVSA (w/ MTP)    ViT-L + RVSA (w/o MTP)    ViT-L + RVSA (w/ MTP)    InternImage-XL (w/o MTP)    InternImage-XL (w/ MTP)
plane 88.42 88.91 88.52 88.33 88.91 88.96
baseball-diamond 85.03 84.07 85.36 86.59 86.78 85.93
bridge 60.86 60.76 61.55 63.38 59.93 60.62
ground-track-field 82.39 82.93 81.25 83.49 81.05 81.26
small-vehicle 80.70 80.41 80.69 81.06 80.80 80.08
large-vehicle 85.76 86.16 86.44 86.48 85.06 84.55
ship 88.58 88.64 88.51 88.53 88.38 87.96
tennis-court 90.88 90.87 90.87 90.87 90.81 90.86
basketball-court 86.61 86.21 85.80 86.26 86.27 86.37
storage-tank 86.88 86.76 86.84 85.80 86.19 86.50
soccer-ball-field 63.79 63.91 69.81 67.17 69.64 68.93
roundabout 72.52 71.00 72.51 71.84 71.50 73.11
harbor 78.52 79.04 84.82 84.94 79.21 78.76
swimming-pool 80.53 82.17 79.99 81.93 81.50 80.98
helicopter 81.00 78.34 78.51 78.31 67.57 76.63
mAP 80.83 80.68 81.43 81.66 80.24 80.77
TABLE XV
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE DOTA-V2.0 DATASET.
Category    ViT-B + RVSA (w/o MTP)    ViT-B + RVSA (w/ MTP)    ViT-L + RVSA (w/o MTP)    ViT-L + RVSA (w/ MTP)    InternImage-XL (w/o MTP)    InternImage-XL (w/ MTP)
plane 77.86 78.30 79.16 78.57 78.52 70.98
baseball-diamond 48.99 52.58 48.16 45.54 50.96 50.82
bridge 46.55 48.42 47.97 49.72 43.21 43.60
ground-track-field 63.10 59.05 61.57 56.42 59.77 59.25
small-vehicle 43.55 43.65 43.74 43.78 43.58 43.55
large-vehicle 56.85 57.15 61.14 62.26 56.11 56.50
ship 61.09 61.08 68.60 68.76 61.38 61.41
tennis-court 76.90 77.83 78.45 74.89 77.61 78.23
basketball-court 54.57 56.32 61.97 64.17 58.20 61.41
storage-tank 58.55 59.23 59.62 58.70 58.55 51.30
soccer-ball-field 36.37 36.93 45.98 43.70 48.88 42.89
roundabout 50.39 51.26 54.89 49.46 50.17 50.60
harbor 56.34 56.97 62.03 63.32 57.86 56.84
swimming-pool 63.89 63.05 64.34 64.59 58.43 58.31
helicopter 65.43 66.40 70.23 72.71 58.16 59.55
container-crane 39.19 44.24 46.94 50.91 40.58 49.21
airport 79.90 87.57 87.64 87.64 77.39 84.73
helipad 14.43 9.33 18.92 16.27 7.91 13.09
mAP 55.22 56.08 58.96 58.41 54.85 55.13
TABLE XVI
DETAILED ACCURACIES OF DIFFERENT MODELS ON THE LOVEDA DATASET.
Category    ViT-B + RVSA (w/o MTP)    ViT-B + RVSA (w/ MTP)    ViT-L + RVSA (w/o MTP)    ViT-L + RVSA (w/ MTP)    InternImage-XL (w/o MTP)    InternImage-XL (w/ MTP)
background 45.91 45.92 47.26 47.14 46.63 46.80
building 57.93 59.40 59.27 62.69 61.98 62.60
road 56.08 56.15 59.54 58.00 58.25 58.96
water 79.72 80.66 81.45 81.43 82.14 82.25
barren 16.49 16.56 17.59 19.27 18.11 17.49
forest 46.03 46.38 47.39 46.82 47.99 47.63
agriculture 61.48 61.67 63.55 63.80 62.40 63.44
mIOU 51.95 52.39 53.72 54.17 53.93 54.17