
most memory-inefficient operations are the frequent tensor
reshaping and element-wise functions in multi-head self-
attention (MHSA). We observe that through an appropri-
ate adjustment of the ratio between MHSA and FFN (feed-
forward network) layers, the memory access time can be re-
duced significantly without compromising the performance.
Moreover, we find that some attention heads tend to learn
similar linear projections, resulting in redundancy in atten-
tion maps. The analysis shows that explicitly decomposing
the computation of each head by feeding them with diverse
features can mitigate this issue while improving computa-
tion efficiency. In addition, the parameter allocation in dif-
ferent modules is often overlooked by existing lightweight
models, as they mainly follow the configurations in stan-
dard transformer models [44,69]. To improve parameter ef-
ficiency, we use structured pruning [45] to identify the most
important network components, and summarize empirical
guidance of parameter reallocation for model acceleration.
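As a toy illustration of the importance analysis behind this reallocation, the sketch below scores the output channels of a weight matrix by their L2 norm and keeps only the strongest ones. Note this is a simplified magnitude criterion, not the structured pruning method of [45]; the `prune_channels` helper and the keep ratio are hypothetical, for illustration only.

```python
import numpy as np

def prune_channels(w, keep_ratio=0.5):
    """Toy structured pruning: score each output channel of a weight
    matrix by its L2 norm and keep only the highest-scoring ones."""
    scores = np.linalg.norm(w, axis=1)             # one score per output channel
    k = max(1, int(round(keep_ratio * w.shape[0])))
    keep = np.sort(np.argsort(scores)[::-1][:k])   # indices of kept channels, in order
    return w[keep], keep

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))                   # 8 output channels, 16 inputs
w_pruned, kept = prune_channels(w, keep_ratio=0.25)
print(w_pruned.shape)                              # (2, 16)
```

In practice the surviving channel counts per module, rather than the pruned weights themselves, provide the guidance for reallocating width across the network.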
Based upon the analysis and findings, we propose a new
family of memory-efficient transformer models named EfficientViT.
Specifically, we design a new block with a sandwich
layout to build up the model. The sandwich layout
block applies a single memory-bound MHSA layer between
FFN layers. It reduces the time cost caused by memory-
bound operations in MHSA, and applies more FFN layers
to allow communication between different channels, which
is more memory efficient. Then, we propose a new cascaded
group attention (CGA) module to improve computation ef-
ficiency. The core idea is to enhance the diversity of the fea-
tures fed into the attention heads. In contrast to prior self-
attention using the same feature for all heads, CGA feeds
each head with different input splits and cascades the out-
put features across heads. This module not only reduces the
computation redundancy in multi-head attention, but also
elevates model capacity by increasing network depth. Last
but not least, we redistribute parameters through expanding
the channel width of critical network components such as
value projections, while shrinking the ones with lower im-
portance like hidden dimensions in FFNs. This reallocation
finally promotes model parameter efficiency.
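To make the cascade concrete, here is a minimal NumPy sketch of the cascaded-attention idea: each head attends over its own channel split of the input, the previous head's output is added to the next head's input, and the per-head outputs are concatenated. The random projection matrices stand in for learned weights, and details of the full CGA module (projection and normalization choices) are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, num_heads=4, seed=0):
    """Toy cascaded group attention: each head sees a distinct channel
    split, plus the previous head's output (the cascade)."""
    n, d = x.shape                                # tokens x channels
    dh = d // num_heads                           # per-head channel split
    rng = np.random.default_rng(seed)
    outs, carry = [], np.zeros((n, dh))
    for h in range(num_heads):
        xh = x[:, h * dh:(h + 1) * dh] + carry    # cascade previous head's output
        # placeholder per-head projections (random, for illustration)
        Wq, Wk, Wv = (rng.standard_normal((dh, dh)) / np.sqrt(dh) for _ in range(3))
        q, k, v = xh @ Wq, xh @ Wk, xh @ Wv
        attn = softmax(q @ k.T / np.sqrt(dh))     # single-head attention map
        carry = attn @ v
        outs.append(carry)
    return np.concatenate(outs, axis=1)           # back to (n, d)

y = cascaded_group_attention(np.ones((8, 16)))
print(y.shape)                                    # (8, 16)
```

Because head h only sees a d/num_heads-wide split, its attention map is computed on different features than the other heads', which is what reduces the redundancy noted above.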
Experiments demonstrate that our models achieve clear
improvements over existing efficient CNN and ViT models
in terms of both speed and accuracy, as shown in Fig. 1.
For instance, our EfficientViT-M5 gets 77.1% top-1 accu-
racy on ImageNet with throughput of 10,621 images/s on an
Nvidia V100 GPU and 56.8 images/s on an Intel Xeon E5-
2690 v4 CPU @ 2.60GHz, outperforming MobileNetV3-
Large [26] by 1.9% in accuracy, 40.4% in GPU inference
speed, and 45.2% in CPU speed. Moreover, EfficientViT-
M2 gets 70.8% accuracy, surpassing MobileViT-XXS [50]
by 1.8%, while running 5.8×/3.7× faster on the GPU/CPU,
and 7.4× faster when converted to ONNX [3] format. When
deployed on a mobile chipset, i.e., the Apple A13 Bionic chip
in an iPhone 11, the EfficientViT-M2 model runs 2.3× faster
than MobileViT-XXS [50] using CoreML [1].
Figure 2. Runtime profiling on two standard vision transformers,
Swin-T and DeiT-T. Red text denotes memory-bound operations,
i.e., operations whose time is mainly determined by memory
accesses, while the time spent on computation is much smaller.
In summary, the contributions of this work are two-fold:
• We present a systematic analysis on the factors that
affect the inference speed of vision transformers, de-
riving a set of guidelines for efficient model design.
• We design a new family of vision transformer models,
which strike a good trade-off between efficiency and
accuracy. The models also demonstrate good transfer
ability on a variety of downstream tasks.
2. Going Faster with Vision Transformers
In this section, we explore how to improve the efficiency
of vision transformers from three perspectives: memory ac-
cess, computation redundancy, and parameter usage. We
seek to identify the underlying speed bottlenecks through
empirical studies, and summarize useful design guidelines.
2.1. Memory Efficiency
Memory access overhead is a critical factor affecting
model speed [15, 28, 31, 65]. Many operators in transformers
[71], such as frequent reshaping, element-wise addition,
and normalization, are memory-inefficient, requiring time-
consuming access across different memory units, as shown
in Fig. 2. Although some methods have been proposed to ad-
dress this issue by simplifying the computation of standard
softmax self-attention, e.g., sparse attention [34, 57, 61, 75]
and low-rank approximation [11,51,74], they often come at
the cost of accuracy degradation and limited acceleration.
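The memory-bound vs. compute-bound distinction can be probed with a simple timing experiment: a transpose-plus-copy moves many bytes while doing no arithmetic, whereas a dense projection performs far more FLOPs per byte accessed. The sketch below is illustrative only; the tensor sizes are arbitrary and absolute timings depend on the hardware and BLAS backend.

```python
import time
import numpy as np

def bench(fn, repeats=20):
    """Median wall-clock time of fn() over several runs."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

x = np.random.rand(32, 197, 192).astype(np.float32)   # ViT-like activation
w = np.random.rand(192, 192).astype(np.float32)

# memory-bound: transpose + copy touches every element, zero FLOPs
t_reshape = bench(lambda: np.ascontiguousarray(x.transpose(0, 2, 1)))
# compute-bound: dense projection has high arithmetic intensity
t_matmul = bench(lambda: x @ w)
print(f"transpose+copy: {t_reshape * 1e3:.3f} ms, matmul: {t_matmul * 1e3:.3f} ms")
```

Profilers report such transpose/reshape time as memory access rather than computation, which is how the memory-bound operations in Fig. 2 are identified.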
In this work, we instead reduce the memory access cost by
cutting down memory-inefficient layers. Recent studies reveal
that memory-inefficient operations are mainly located in
MHSA rather than FFN layers [31, 33]. However, most ex-
isting ViTs [18, 44, 69] use an equal number of these
two layers, which may not achieve optimal efficiency.
We thereby explore the optimal allocation of MHSA and
FFN layers in small models with fast inference. Specifi-
cally, we scale down Swin-T [44] and DeiT-T [69] to several
small subnetworks with 1.25× and 1.5× higher inference
throughput, and compare the performance of subnetworks
with different proportions of MHSA layers. As shown in
Fig. 3, subnetworks with 20%-40% MHSA layers tend to
get better accuracy. Such ratios are much smaller than the