
most memory-inefficient operations are the frequent tensor
reshaping and element-wise functions in multi-head self-
attention (MHSA). We observe that through an appropri-
ate adjustment of the ratio between MHSA and FFN (feed-
forward network) layers, the memory access time can be re-
duced significantly without compromising the performance.
Moreover, we find that some attention heads tend to learn
similar linear projections, resulting in redundancy in atten-
tion maps. The analysis shows that explicitly decomposing
the computation of each head by feeding them with diverse
features can mitigate this issue while improving computa-
tion efficiency. In addition, the parameter allocation in dif-
ferent modules is often overlooked by existing lightweight
models, as they mainly follow the configurations in stan-
dard transformer models [44,69]. To improve parameter ef-
ficiency, we use structured pruning [45] to identify the most
important network components, and summarize empirical
guidance of parameter reallocation for model acceleration.
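As a toy illustration of the importance analysis behind this reallocation, the sketch below scores the output channels of a weight matrix by their L2 norm and keeps only the strongest ones. Note this is a simplified magnitude criterion, not the structured pruning method of [45]; the `prune_channels` helper and the keep ratio are hypothetical, for illustration only.

```python
import numpy as np

def prune_channels(w, keep_ratio=0.5):
    """Toy structured pruning: score each output channel of a weight
    matrix by its L2 norm and keep only the highest-scoring ones."""
    scores = np.linalg.norm(w, axis=1)             # one score per output channel
    k = max(1, int(round(keep_ratio * w.shape[0])))
    keep = np.sort(np.argsort(scores)[::-1][:k])   # indices of kept channels, in order
    return w[keep], keep

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))                   # 8 output channels, 16 inputs
w_pruned, kept = prune_channels(w, keep_ratio=0.25)
print(w_pruned.shape)                              # (2, 16)
```

In practice the surviving channel counts per module, rather than the pruned weights themselves, provide the guidance for reallocating width across the network.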
Based upon the analysis and findings, we propose a new
family of memory-efficient transformer models named EfficientViT.
Specifically, we design a new block with a sandwich
layout to build up the model. The sandwich layout
block applies a single memory-bound MHSA layer between
FFN layers. It reduces the time cost caused by memory-
bound operations in MHSA, and applies more FFN layers
to allow communication between different channels, which
is more memory efficient. Then, we propose a new cascaded
group attention (CGA) module to improve computation ef-
ficiency. The core idea is to enhance the diversity of the fea-
tures fed into the attention heads. In contrast to prior self-
attention using the same feature for all heads, CGA feeds
each head with different input splits and cascades the out-
put features across heads. This module not only reduces the
computation redundancy in multi-head attention, but also
elevates model capacity by increasing network depth. Last
but not least, we redistribute parameters through expanding
the channel width of critical network components such as
value projections, while shrinking the ones with lower im-
portance like hidden dimensions in FFNs. This reallocation
finally promotes model parameter efficiency.
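To make the cascade concrete, here is a minimal NumPy sketch of the cascaded-attention idea: each head attends over its own channel split of the input, the previous head's output is added to the next head's input, and the per-head outputs are concatenated. The random projection matrices stand in for learned weights, and details of the full CGA module (projection and normalization choices) are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, num_heads=4, seed=0):
    """Toy cascaded group attention: each head sees a distinct channel
    split, plus the previous head's output (the cascade)."""
    n, d = x.shape                                # tokens x channels
    dh = d // num_heads                           # per-head channel split
    rng = np.random.default_rng(seed)
    outs, carry = [], np.zeros((n, dh))
    for h in range(num_heads):
        xh = x[:, h * dh:(h + 1) * dh] + carry    # cascade previous head's output
        # placeholder per-head projections (random, for illustration)
        Wq, Wk, Wv = (rng.standard_normal((dh, dh)) / np.sqrt(dh) for _ in range(3))
        q, k, v = xh @ Wq, xh @ Wk, xh @ Wv
        attn = softmax(q @ k.T / np.sqrt(dh))     # single-head attention map
        carry = attn @ v
        outs.append(carry)
    return np.concatenate(outs, axis=1)           # back to (n, d)

y = cascaded_group_attention(np.ones((8, 16)))
print(y.shape)                                    # (8, 16)
```

Because head h only sees a d/num_heads-wide split, its attention map is computed on different features than the other heads', which is what reduces the redundancy noted above.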
Experiments demonstrate that our models achieve clear
improvements over existing efficient CNN and ViT models
in terms of both speed and accuracy, as shown in Fig. 1.
For instance, our EfficientViT-M5 gets 77.1% top-1 accu-
racy on ImageNet with throughput of 10,621 images/s on an
Nvidia V100 GPU and 56.8 images/s on an Intel Xeon E5-
2690 v4 CPU @ 2.60GHz, outperforming MobileNetV3-
Large [26] by 1.9% in accuracy, 40.4% in GPU inference
speed, and 45.2% in CPU speed. Moreover, EfficientViT-
M2 gets 70.8% accuracy, surpassing MobileViT-XXS [50]
by 1.8%, while running 5.8×/3.7× faster on the GPU/CPU,
and 7.4× faster when converted to ONNX [3] format. When
deployed on a mobile chipset, i.e., the Apple A13 Bionic chip
in an iPhone 11, the EfficientViT-M2 model runs 2.3× faster
than MobileViT-XXS [50] using CoreML [1].
Figure 2. Runtime profiling on two standard vision transformers,
Swin-T and DeiT-T. Red text denotes memory-bound operations,
i.e., operations whose time is mainly determined by memory
accesses, while the time spent on computation is much smaller.
In summary, the contributions of this work are two-fold:
• We present a systematic analysis on the factors that
affect the inference speed of vision transformers, de-
riving a set of guidelines for efficient model design.
• We design a new family of vision transformer models,
which strike a good trade-off between efficiency and
accuracy. The models also demonstrate good transfer
ability on a variety of downstream tasks.
2. Going Faster with Vision Transformers
In this section, we explore how to improve the efficiency
of vision transformers from three perspectives: memory ac-
cess, computation redundancy, and parameter usage. We
seek to identify the underlying speed bottlenecks through
empirical studies, and summarize useful design guidelines.
2.1. Memory Efficiency
Memory access overhead is a critical factor affecting
model speed [15, 28, 31, 65]. Many operators in transformers
[71], such as frequent reshaping, element-wise addition,
and normalization, are memory-inefficient, requiring time-
consuming access across different memory units, as shown
in Fig. 2. Although some methods have been proposed to ad-
dress this issue by simplifying the computation of standard
softmax self-attention, e.g., sparse attention [34, 57, 61, 75]
and low-rank approximation [11,51,74], they often come at
the cost of accuracy degradation and limited acceleration.
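The memory-bound vs. compute-bound distinction can be probed with a simple timing experiment: a transpose-plus-copy moves many bytes while doing no arithmetic, whereas a dense projection performs far more FLOPs per byte accessed. The sketch below is illustrative only; the tensor sizes are arbitrary and absolute timings depend on the hardware and BLAS backend.

```python
import time
import numpy as np

def bench(fn, repeats=20):
    """Median wall-clock time of fn() over several runs."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

x = np.random.rand(32, 197, 192).astype(np.float32)   # ViT-like activation
w = np.random.rand(192, 192).astype(np.float32)

# memory-bound: transpose + copy touches every element, zero FLOPs
t_reshape = bench(lambda: np.ascontiguousarray(x.transpose(0, 2, 1)))
# compute-bound: dense projection has high arithmetic intensity
t_matmul = bench(lambda: x @ w)
print(f"transpose+copy: {t_reshape * 1e3:.3f} ms, matmul: {t_matmul * 1e3:.3f} ms")
```

Profilers report such transpose/reshape time as memory access rather than computation, which is how the memory-bound operations in Fig. 2 are identified.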
In this work, we instead reduce the memory access cost by
cutting down memory-inefficient layers. Recent studies reveal
that memory-inefficient operations are mainly located in
MHSA rather than FFN layers [31, 33]. However, most ex-
isting ViTs [18, 44, 69] use an equal number of these
two layers, which may not achieve optimal efficiency.
We thereby explore the optimal allocation of MHSA and
FFN layers in small models with fast inference. Specifi-
cally, we scale down Swin-T [44] and DeiT-T [69] to several
small subnetworks with 1.25× and 1.5× higher inference
throughput, and compare the performance of subnetworks
with different proportions of MHSA layers. As shown in
Fig. 3, subnetworks with 20%-40% MHSA layers tend to
get better accuracy. Such ratios are much smaller than the