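Summary: This paper proposes a new method, AndroAnalyzer, for Android malware detection. The method focuses on sensitive behavior chains and embeds abstract syntax tree (AST) code semantics, aiming to reduce computational and storage overhead while improving detection performance. Specifically, AndroAnalyzer uses function call graphs (FCGs) to represent the macroscopic behavior of an application and structured code semantics to represent the microscopic behavior of functions. It also proposes a sensitive function call graph (SFCG) generation algorithm to narrow the analysis scope to sensitive function calls, and an AST vectorization algorithm (AST2Vec) to capture structured code semantics. Experimental results show that AndroAnalyzer achieves F1-scores of 99.21% and 98.45% on binary and multiclass classification tasks, respectively, and exhibits good generalization. Intended audience: researchers and developers interested in mobile security, malware detection, and graph neural networks. Use cases and goals: scenarios that require efficient and accurate Android malware detection and family identification; the main goal is to protect users' personal property and privacy from malware threats. Other notes: AndroAnalyzer not only improves detection efficiency but also, to some extent, strengthens the model's resistance to structural attacks. Future research will focus on exploring graph explainers in order to better understand and explain what underlies the model's decisions…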
9216 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 19, 2024
Sensitive Behavioral Chain-Focused Android
Malware Detection Fused With AST Semantics
Jiacheng Gong, Graduate Student Member, IEEE, Weina Niu, Senior Member, IEEE,
Song Li, Member, IEEE, Mingxue Zhang, and Xiaosong Zhang
Abstract— The proliferation of Android malware poses a
substantial security threat to mobile devices. Efficient and
accurate malware detection and malware family identification
are therefore crucial for safeguarding users' personal property
and privacy. Graph-based approaches have demonstrated
remarkable detection performance among intelligent Android
malware detection methods, owing to the robust representation
capabilities of graphs and the rich semantic information they
carry. The function call graph (FCG) is the most widely
used graph in intelligent Android malware detection. However,
existing FCG-based malware detection methods face challenges,
such as the enormous computational and storage costs of
modeling large graphs. Additionally, their neglect of code
semantics makes them susceptible to structural attacks. In this
paper, we propose AndroAnalyzer, which embeds abstract syntax
tree (AST) code semantics while focusing on sensitive behavior
chains. It leverages FCGs to represent the macroscopic behavior
of the application, and employs structured code semantics to
represent the microscopic behavior of functions. Furthermore,
we propose the sensitive function call graph (SFCG) generation
algorithm to narrow down the analysis scope to sensitive
function calls, and the AST vectorization algorithm (AST2Vec)
to capture structured code semantics. Experimental results
demonstrate that the proposed SFCG generation algorithm
noticeably reduces graph size while ensuring robust detection
performance. AndroAnalyzer outperforms the baseline methods in
binary and multiclass classification tasks, achieving F1-scores
of 99.21% and 98.45%, respectively. Moreover, AndroAnalyzer
(trained with samples from 2010-2018) exhibits good
generalization capabilities in detecting samples from 2019-2022.
Index Terms— Android malware detection, function call graph,
abstract syntax tree, code semantic embedding, graph neural
networks.
Received 23 October 2023; revised 24 April 2024; accepted 20 September
2024. Date of publication 26 September 2024; date of current version
7 October 2024. This work was supported in part by the National Science
Foundation of China under Grant 62372086, in part by Chongqing Natural
Science Foundation Innovation and Development Joint Foundation under
Grant CSTB2023NSCQ-LZX0003, and in part by Sichuan Natural Science
Foundation under Grant 24ZNSFSC0038. The associate editor coordinating
the review of this article and approving it for publication was Dr. Aaron
Visaggio. (Corresponding author: Weina Niu.)
Jiacheng Gong is with the School of Computer Science and Engineering,
University of Electronic Science and Technology of China, Chengdu 611731,
Weina Niu and Xiaosong Zhang are with the Institute for Advanced Study,
University of Electronic Science and Technology of China, Shenzhen 518110,
China, and also with the School of Computer Science and Engineering,
University of Electronic Science and Technology of China, Chengdu 611731,
Song Li and Mingxue Zhang are with the State Key Laboratory of
Blockchain and Data Security, Zhejiang University, Hangzhou 310058, China
Digital Object Identifier 10.1109/TIFS.2024.3468891
I. INTRODUCTION
The development of the Internet of Things (IoT) has led to
the continuous implementation of the digital living concept
and the widespread adoption of mobile devices. Concurrently,
Android OS, the dominant operating system for mobile
devices, is facing significant challenges posed by massive
amounts of Android malware. According to reports by Kaspersky [1],
in the first quarter of 2023, 307,529 malicious installation
packages were detected. These malware strains often infiltrate
user devices covertly, aiming to steal sensitive information or
gain control over the device, posing a severe threat to users'
financial assets and privacy.
In response to the security threats posed by Android mal-
ware, and considering the substantial costs associated with
manual software analysis, numerous intelligent malware detec-
tion methods have been proposed [2], [3], [4], [5], [6], [7],
[8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19],
[20]. These methods are designed to identify a vast number
of malicious software instances efficiently.
In recent years, graph-based Android malware detection
methods have garnered notable attention [2], [3], [4], [5],
[6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. This is
because graphs can capture complex relationships between
different components of malware, providing a multi-level
representation of malware and enabling the analysis
of associations between different malware instances. Among
the available graphs, function call graphs (FCGs), as a structural
representation of software, capture the call relationships
between functions and allow the discovery of potential
malicious behavior patterns. FCGs are therefore the most widely
used graphs in intelligent Android malware detection. Aiming to
model application behavior while keeping modeling costs low, we
focus on FCGs rather than feature-relation graphs or more detailed
graphs such as control flow graphs. However, existing FCG-based
methods exhibit the following issues:
(1) Fine-grained modeling and large-scale analysis. Exist-
ing FCG-based malware detection methods can be broadly
categorized into three types.
• The first category of methods generates feature vectors
based on the usage or frequency of the API calls in the
FCGs. These vectors are subsequently utilized to perform
malware detection [5]. These methods heavily rely on
the expertise of the designer and can be influenced by
subjective factors.
1556-6021 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
• The second category of methods utilizes FCGs as input
to graph neural networks (GNNs) and employs
structural or frequency features of nodes for classification
[9], [10], [11]. However, the FCGs can be quite
large, leading to considerable computational and stor-
age overhead for subsequent AI model training. For
example, consider a sample with an MD5 hash of
88ddf2594600f4b570478fa92a7050a0; its APK file size
is 30.18MB, and the extracted Dex files have a size of
52.42MB. The FCG generated by Androguard [21] con-
tains 356,687 nodes and 1,700,696 edges. In mainstream
app markets, many popular apps have sizes exceeding
50MB, with game apps often exceeding 1GB. The com-
putational and storage costs of analyzing and learning
from the complete FCG using AI models can be substan-
tial in such cases.
• The third category of methods, such as those proposed
in [4], [8], and [22], utilizes features related to sensitive
API calls within the FCGs as application representations.
These methods overlook the contextual information of
function calls, specifically the information within the call
chains of sensitive APIs.
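To make the scale of the example sample concrete, the following back-of-the-envelope sketch uses only the node and edge counts quoted above. The 4-byte value size and the two storage layouts (dense adjacency matrix versus edge list) are illustrative assumptions, not figures from the paper.

```python
# Node and edge counts quoted in the text for the example sample.
NUM_NODES = 356_687
NUM_EDGES = 1_700_696

# Assumption (not from the paper): every matrix entry or node index takes 4 bytes.
BYTES_PER_VALUE = 4


def dense_adjacency_bytes(num_nodes: int) -> int:
    """Bytes needed for a dense num_nodes x num_nodes adjacency matrix."""
    return num_nodes * num_nodes * BYTES_PER_VALUE


def edge_list_bytes(num_edges: int) -> int:
    """Bytes needed for an edge list storing (source, target) index pairs."""
    return 2 * num_edges * BYTES_PER_VALUE


avg_out_degree = NUM_EDGES / NUM_NODES
dense_gib = dense_adjacency_bytes(NUM_NODES) / 2**30
sparse_mib = edge_list_bytes(NUM_EDGES) / 2**20

print(f"average out-degree: {avg_out_degree:.2f}")
print(f"dense adjacency:    {dense_gib:.1f} GiB")
print(f"edge list:          {sparse_mib:.1f} MiB")
```

Under these assumptions the graph is very sparse, so the structure itself is cheap to store as an edge list. The cost that motivates shrinking the graph is more likely to come from per-node features and message passing over hundreds of thousands of nodes, which this estimate does not cover.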
(2) Ignoring structured code semantics. In existing
graph-based detection methods, API semantics extraction can
be broadly categorized into the following types.
• The first category of work [4], [5], [11] uses only
node-related statistical or structural features, ignoring
code semantics. This makes the classification results
susceptible to graph structural attacks [23].
• The second category of work uses one-hot encoding
to represent API usage patterns. However, the encoding
vector is highly sparse and does not contain function code
semantics information.
• The third category of work [7], [9] employs natural
language processing techniques to learn the code semantics.
However, source code and binary files are more
structured and logically organized than natural language.
Abstract syntax trees (ASTs), control flow graphs, and
data flow graphs are more suitable for representing
structured code semantics.
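The limitation of one-hot encoding noted in the second category can be seen in a toy sketch. The four-entry API vocabulary below is hypothetical and chosen only for illustration; real vocabularies contain thousands of APIs, which makes the vectors far sparser.

```python
from typing import List


def one_hot(api: str, vocabulary: List[str]) -> List[int]:
    """Return a vector with a single 1 at the position of `api`."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(api)] = 1
    return vec


def dot(a: List[int], b: List[int]) -> int:
    """Plain dot product, used here as a crude similarity measure."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical vocabulary (illustration only).
vocab = ["getDeviceId", "getSubscriberId", "sendTextMessage", "openConnection"]

device_id = one_hot("getDeviceId", vocab)
subscriber_id = one_hot("getSubscriberId", vocab)
send_sms = one_hot("sendTextMessage", vocab)

# Two identifier-reading APIs look exactly as unrelated as an identifier
# API and an SMS API: every pair of distinct one-hot vectors is orthogonal.
print(dot(device_id, subscriber_id), dot(device_id, send_sms))
```

Because every pair of distinct APIs has zero similarity, the encoding records which API is used but carries nothing about what the function's code does.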
To address the abovementioned issues, we propose an
Android malware detection method called AndroAnalyzer,
which focuses on sensitive behavior chains to reduce the
computational and storage overhead. It also incorporates
structured code semantics in the chains to be resistant to
structural attacks. Specifically, a sensitive behavior chain is a
chain of function calls that is closely associated with sensitive
behaviors. All the sensitive behavior chains are merged to construct
the sensitive function call graph (SFCG). AndroAnalyzer
utilizes the SFCG to characterize the macroscopic behavior
of the application, providing a view of the execution flow of
malicious behaviors. Next, AndroAnalyzer extracts structured
code semantics via ASTs to represent the microscopic behavior of
each function, delving into the code logic within the functions.
Furthermore, by combining the AST code semantics features
generated by the proposed AST2Vec algorithm with API semantics
features and structural information features obtained from
social network analysis, we obtain an SFCG with fused node
features. Finally, this graph is input into a GNN with graph
self-attention pooling for learning, resulting in
an intelligent classifier for Android malware detection.
In summary, the major contributions of this work include:
• We proposed an effective method for representing the
behavior of Android applications. It uses FCGs to rep-
resent the macroscopic behavior of applications and
structured code semantics to represent the microscopic
behavior of functions. This approach strikes a balance
between modeling granularity and storage cost. Additionally,
we designed an SFCG generation algorithm to reduce
the graph size and focus on sensitive behavior chains
that are related to malicious behaviors. This effectively
reduces the computational overhead when analyzing complex
APK files.
• We proposed a structured code semantics extrac-
tion algorithm called AST2Vec, based on ASTs. This
algorithm effectively extracts structured code semantics
from smali code, providing comprehensive behavioral
information for Android malware detection. Furthermore,
the classification model exhibits improved generalization
and robustness in binary and multi-class classification by
integrating API semantics features and structural features
obtained from social network analysis.
• We conducted extensive performance evaluation experiments
on two datasets constructed from CICMalDroid
[24] and AndroZoo [25]. The experimental results
demonstrate that AndroAnalyzer outperforms the baseline
methods in binary and multi-class classification tasks.
Furthermore, AndroAnalyzer (trained with samples from 2010-2018)
exhibits good generalization ability in detecting
samples from 2019-2022.
The remaining sections are organized as follows. Section II
reviews graph-based malware detection work. Section III
introduces the design of AndroAnalyzer, and Section IV
presents the evaluation results. Section V provides a brief
discussion, and Section VI offers concluding remarks and
future research directions.
II. RELATED WORK
A. Detection Methods Based on Graph Analysis
This category of methods models static or dynamic features
of Android applications using graphs. Subsequently, it employs
graph matching or graph feature extraction in conjunction with
machine learning techniques to perform detection.
In 2019, Arora et al. proposed PermPair [2], which models
the usage patterns of permission pairs (pairing of two dan-
gerous permissions used simultaneously to perform malicious
or benign behavior) in applications using a permission graph.
During detection, the method calculates benign and malicious
weight scores based on an app’s usage of permission pairs
and compares these scores to detect malware. Similarly,
Fan et al. introduced GefDroid [3] in 2019, which extracts API
usage patterns by analyzing the structural features of sensitive
APIs within subgraphs corresponding to code classes in apps.
This approach analyzes graph similarity between applications
and performs unsupervised clustering of malicious app fami-
lies, incorporating community detection techniques. Wu et al.
also employed social network analysis to analyze FCGs and
introduced MalScan [4]. This method analyzes the centrality
of sensitive API calls in FCGs to generate feature vectors.
It relies on the centrality distributions of sensitive API calls
for classification. MaMaDroid [5] models API call sequences
(or abstracted sequences at class or package levels) in function
call graphs as Markov chains for detection and classification.
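The Markov-chain modeling of call sequences mentioned above can be sketched as estimating transition probabilities between abstracted calls. This is a generic illustration of the technique, not MaMaDroid's actual pipeline; the package-level labels and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(
        sequences: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate first-order Markov transition probabilities from sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        # Count every consecutive (source, destination) pair.
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1

    probs: Dict[str, Dict[str, float]] = {}
    for src, dst_counts in counts.items():
        total = sum(dst_counts.values())
        # Normalize each row so outgoing probabilities sum to 1.
        probs[src] = {dst: n / total for dst, n in dst_counts.items()}
    return probs


# Toy call sequences abstracted to package level (hypothetical data).
sequences = [
    ["java.lang", "android.telephony", "java.net"],
    ["java.lang", "java.net"],
    ["java.lang", "android.telephony"],
]
probs = transition_probabilities(sequences)
print(probs["java.lang"])
```

Flattening the resulting transition matrix gives a fixed-length feature vector per app, which is the kind of representation a conventional classifier can consume.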
In 2020, Surendran et al. introduced Gsdroid [6], which
represents the behavior of applications using system call
graphs. They normalize the call frequency of system calls as
their proposed graph signals and combine these graph signal
features with machine learning techniques for classification.
Also in 2020, Niu et al. [7] extracted API call sequences from
FCGs, followed by further extraction of opcodes associated
with API calls. Finally, they employed LSTM to train and
learn from opcode-level call sequences.
In 2021, Wu et al. introduced HomDroid [8]. It begins
with community detection on FCGs and employs homogeneity
analysis to identify the most suspicious subgraphs. It then
generates features from the sensitive APIs’ occurrences, quan-
tities, and proportions in the subgraphs. Machine learning is
subsequently applied to learn from the feature vectors.
These methods often break the feature correlations during
the feature vectorization. Thus, we employ graphs and GNNs
to represent and learn these features respectively.
B. Behavioral Analysis Detection Methods Based on GNNs
This category of methods models behaviors of applications
using graphs. They utilize GNNs or graph embedding to learn
the topological and node features of graphs, and they typically
belong to the task of graph classification.
In 2021, Xu et al. [9] proposed an Android malware detec-
tion method based on FCG embedding. They used Word2Vec
to vectorize opcodes and combined it with the SIF network
for function embedding. This embedding was used as node
features in the FCG to generate graph embeddings through
Struct2Vec, ultimately leading to malware detection based on
these embeddings. Similarly, Cai et al. [10] used API call
sequences as a corpus to obtain function embeddings using
natural language processing techniques. They fed the FCGs
with these embeddings to GNNs to perform classification.
In 2022, Yumlembam et al. [11] utilized GNNs to generate
API graph embeddings based on centrality measures. They
combined these embeddings with permissions and intents for
malware detection.
In 2023, Wu et al. introduced DeepCatra [12], which tracks
the call traces of key APIs in FCGs. It considers relationships
such as intent and ICC (Inter-Component Communication) and
connects edges accordingly. DeepCatra employs a Bi-LSTM
to learn call traces and utilizes GNNs to learn abstract flow
graphs, combining information for detection. In the same
year, Wu et al. [13] presented another approach in which
they encoded opcodes in functions using one-hot encoding.
They calculated node importance based on centrality and
weighted APIs based on their protection levels corresponding
to permissions. The construction of the graph treated the FCG
as an undirected graph. They used a breadth-first algorithm
to create sensitive function subgraphs of sensitive APIs and
their neighbors within a two-hop distance. They combined
API features with graph structure and employed GNNs for
malware detection. Shi et al. proposed SFCGDroid [14], where
they used API call sequences as a corpus to obtain function
semantics using the Skip-gram method. They also incorporated
social network triple information of sensitive APIs as function
node features and combined them with FCG structures for
malware detection. To address graph structural attacks, a
notable challenge for graph-based Android malware detection
methods, Li et al. introduced RGDroid [15].
This method initially generates embeddings of API entities
based on an API relationship graph derived from official
Android documentation. These embeddings are used as node
features in FCGs. Additionally, RGDroid employs community
detection to partition FCGs into functional subgraphs, reducing
redundant edge connections and mitigating the impact of graph
structural attacks. Finally, it uses Graph Neural Networks to
learn and detect function call subgraphs.
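The two-hop subgraph construction described for [13] (breadth-first expansion around sensitive APIs on an undirected view of the FCG) can be sketched as follows. The function name `k_hop_nodes`, the edge-list input, and the toy graph are assumptions for illustration and do not reproduce the implementation of [13].

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def k_hop_nodes(edges: List[Tuple[str, str]],
                seeds: Iterable[str],
                k: int = 2) -> Set[str]:
    """Return all nodes within k hops of any seed, ignoring edge direction."""
    # Build an undirected adjacency map from the directed call edges.
    adjacency: Dict[str, Set[str]] = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Multi-source breadth-first search that stops expanding at depth k.
    distance = {seed: 0 for seed in seeds if seed in adjacency}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# Toy call chain: a -> b -> c -> sendTextMessage -> d -> e -> f
edges = [("a", "b"), ("b", "c"), ("c", "sendTextMessage"),
         ("sendTextMessage", "d"), ("d", "e"), ("e", "f")]
nearby = k_hop_nodes(edges, {"sendTextMessage"}, k=2)
print(sorted(nearby))
```

A fixed hop radius bounds the subgraph size, but it also cuts off any part of a call chain that lies more than two calls away from the sensitive API.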
Most of these methods face the issues mentioned in
Section I. Therefore, we integrate SFCGs and AST code
semantics for malware detection.
C. Association Analysis Detection Methods Based on GNNs
This category of methods models the relationships between
applications using graphs. They employ GNNs or graph
embedding to learn the topological (relationship) features and
node (application) features, making this category suitable for
node classification tasks.
In 2021, Gao et al. introduced GDroid [16], which trans-
forms the problem of malware detection into a graph node
classification task. It constructs edges between applications
(APPs) and APIs based on the call relationships among APIs
and the patterns of API usage. This maps APPs and APIs
to a large heterogeneous graph and employs Graph Convo-
lutional Neural Networks (GCNs) to detect malware. Hei et
al. presented HAWK [17] in the same year. This method
builds a heterogeneous graph by considering more entities
such as APIs, permissions, permission types, classes, inter-
faces, and shared object (so) files. It utilizes a heterogeneous
graph attention network to learn relationships under different
meta-paths for the final detection. Fan et al. developed a
method [18] that constructs a heterogeneous graph using
entities like applications, app markets, publishing companies,
app names, app signatures, and developers. It also incorporates
information from different versions of the heterogeneous graph
and performs learning and detection based on spatiotemporal
heterogeneous graph information of applications.
In 2023, Huang et al. introduced WHGDroid [19], which
also builds a heterogeneous graph using multiple entities and
learns relationships through meta-paths. Additionally,
WHGDroid incorporates features to mitigate the impact of malware
evolution and computes weights based on entity importance.
These methods emphasize inter-app relationships. In contrast,
we focus on analyzing individual app behaviors to perform
malware detection.
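For contrast with the per-app behavioral view adopted here, the inter-app relationship modeling used by this category can be sketched as an App-API bipartite graph in which two apps become neighbors when they share an API (an App-API-App meta-path). The data and the function name `app_api_app_neighbors` are hypothetical; the cited systems use richer entity types and learned weights.

```python
from typing import Dict, Set


def app_api_app_neighbors(
        app_to_apis: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Link apps that share at least one API (App-API-App meta-path)."""
    # Invert the mapping: for every API, which apps use it?
    api_to_apps: Dict[str, Set[str]] = {}
    for app, apis in app_to_apis.items():
        for api in apis:
            api_to_apps.setdefault(api, set()).add(app)

    # Every pair of apps sharing an API becomes a pair of neighbors.
    neighbors: Dict[str, Set[str]] = {app: set() for app in app_to_apis}
    for apps in api_to_apps.values():
        for app in apps:
            neighbors[app] |= apps - {app}
    return neighbors


# Hypothetical API usage of three apps.
usage = {
    "app1": {"getDeviceId", "sendTextMessage"},
    "app2": {"sendTextMessage"},
    "app3": {"openConnection"},
}
neighbors = app_api_app_neighbors(usage)
print(neighbors)
```

In such a graph each app is a node to be classified, so detection depends on the surrounding corpus of apps rather than on the internal call structure of a single app.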