9216 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 19, 2024

Sensitive Behavioral Chain-Focused Android

Malware Detection Fused With AST Semantics

Jiacheng Gong, Graduate Student Member, IEEE, Weina Niu, Senior Member, IEEE,

Song Li, Member, IEEE, Mingxue Zhang, and Xiaosong Zhang

Abstract— The proliferation of Android malware poses a

substantial security threat to mobile devices. Thus, achieving

efficient and accurate malware detection and malware family

identification is crucial for safeguarding users’ individual property

and privacy. Graph-based approaches have demonstrated

remarkable detection performance in the realm of intelligent

Android malware detection methods. This is attributed to the

robust representation capabilities of graphs and the rich semantic

information. The function call graph (FCG) is the most widely

used graph in intelligent Android malware detection. However,

existing FCG-based malware detection methods face challenges,

such as the enormous computational and storage costs of modeling

large graphs. Additionally, their neglect of code semantics

also makes them susceptible to structural attacks. In this paper,

we proposed AndroAnalyzer, which embeds abstract syntax tree

(AST) code semantics while focusing on sensitive behavior chains.

It leverages FCGs to represent the macroscopic behavior of the

application, and employs structured code semantics to represent

the microscopic behavior of functions. Furthermore, we proposed

the sensitive function call graph (SFCG) generation algorithm to

narrow down the analysis scope to sensitive function calls, and

the AST vectorization algorithm (AST2Vec) to capture struc-

tured code semantics. Experimental results demonstrate that the

proposed SFCG generation algorithm noticeably reduces graph

size while ensuring robust detection performance. AndroAnalyzer

outperforms the baseline methods in binary and multiclass

classification tasks, achieving F1-scores of 99.21% and 98.45%

respectively. Moreover, AndroAnalyzer (trained with samples of

2010-2018) exhibits good generalization capabilities in detecting

samples of 2019-2022.

Index Terms— Android malware detection, function call graph,

abstract syntax tree, code semantic embedding, graph neural

networks.

Received 23 October 2023; revised 24 April 2024; accepted 20 September

2024. Date of publication 26 September 2024; date of current version

7 October 2024. This work was supported in part by the National Science

Foundation of China under Grant 62372086, in part by Chongqing Natural

Science Foundation Innovation and Development Joint Foundation under

Grant CSTB2023NSCQ-LZX0003, and in part by Sichuan Natural Science

Foundation under Grant 24ZNSFSC0038. The associate editor coordinating

the review of this article and approving it for publication was Dr. Aaron

Visaggio. (Corresponding author: Weina Niu.)

Jiacheng Gong is with the School of Computer Science and Engineering,

University of Electronic Science and Technology of China, Chengdu 611731,

China (e-mail: [email protected]).

Weina Niu and Xiaosong Zhang are with the Institute for Advanced Study,

University of Electronic Science and Technology of China, Shenzhen 518110,

China, and also with the School of Computer Science and Engineering,

University of Electronic Science and Technology of China, Chengdu 611731,

China (e-mail: [email protected]; [email protected]).

Song Li and Mingxue Zhang are with the State Key Laboratory of

Blockchain and Data Security, Zhejiang University, Hangzhou 310058, China

(e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TIFS.2024.3468891

I. INTRODUCTION

The development of the Internet of Things (IoT) has led to

continuous implementation of the digital living concept,

and widespread adoption of mobile devices. Concurrently,

Android OS, the dominant operating system for mobile

devices, is facing remarkable challenges posed by massive

Android malware. According to reports by Kaspersky [1],

in the first quarter of 2023, 307,529 malicious installation

packages were detected. These malware strains often infiltrate

user devices covertly, aiming to steal sensitive information or

gain control over the device, posing a severe threat to users’

financial assets and privacy.

In response to the security threats posed by Android mal-

ware, and considering the substantial costs associated with

manual software analysis, numerous intelligent malware detec-

tion methods have been proposed [2], [3], [4], [5], [6], [7],

[8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19],

[20]. These methods are designed to identify a vast number

of malicious software instances efﬁciently.

In recent years, graph-based Android malware detection

methods have garnered notable attention [2], [3], [4], [5],

[6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. This is

because the graphs can capture complex relationships between

different components of malware, providing a multi-level

information representation of malware. It enables the analysis

of associations between different malware instances. Among

these methods, function call graphs (FCGs), as a structural

representation of software, can capture the call relationships

between functions, and allow the discovery of potential malicious

behavior patterns. Therefore, FCGs are most widely used

in intelligent Android malware detection. Aiming to model

application behavior and optimize modeling costs, we focus

on FCGs over graphs about feature relations or more detailed

graphs, such as control ﬂow graphs. However, in existing FCG-

based methods, the following issues have been identiﬁed:

(1) Fine-grained modeling and large-scale analysis. Exist-

ing FCG-based malware detection methods can be broadly

categorized into three types.

• The first category of methods generates feature vectors

based on the usage or frequency of the API calls in the

FCGs. These vectors are subsequently utilized to perform

malware detection [5]. These methods heavily rely on

the expertise of the designer and can be inﬂuenced by

subjective factors.

• The second category of methods utilizes FCGs as input

to graph neural networks (GNNs) and employs

structural or frequency features of nodes for classification

[9], [10], [11]. However, the FCGs can be quite

large, leading to considerable computational and stor-

age overhead for subsequent AI model training. For

example, consider a sample with an MD5 hash of

88ddf2594600f4b570478fa92a7050a0; its APK ﬁle size

is 30.18MB, and the extracted Dex ﬁles have a size of

52.42MB. The FCG generated by Androguard [21] con-

tains 356,687 nodes and 1,700,696 edges. In mainstream

app markets, many popular apps have sizes exceeding

50MB, with game apps often exceeding 1GB. The com-

putational and storage costs of analyzing and learning

from the complete FCG using AI models can be substan-

tial in such cases.

• The third category of methods, such as those proposed

in [4], [8], and [22], utilizes features related to sensitive

API calls within the FCGs as application representations.

These methods overlook the contextual information of

function calls, speciﬁcally the information within the call

chains of sensitive APIs.
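
To make the scale of the FCG example above concrete, the following back-of-the-envelope sketch derives the average out-degree and edge density from the node and edge counts reported in the text. The dense adjacency matrix with one byte per entry is an assumption chosen only for illustration; it does not describe how Androguard or any particular GNN library stores graphs, and the function name `fcg_scale` is hypothetical.

```python
# Node and edge counts reported in the text for the example FCG.
NUM_NODES = 356_687
NUM_EDGES = 1_700_696


def fcg_scale(num_nodes: int, num_edges: int) -> dict:
    """Rough size indicators for a directed call graph."""
    possible_edges = num_nodes * (num_nodes - 1)  # directed, no self-loops
    return {
        "avg_out_degree": num_edges / num_nodes,
        "density": num_edges / possible_edges,
        # Assumption: dense adjacency matrix with one byte per entry.
        "dense_adjacency_gib": num_nodes * num_nodes / 2**30,
    }


stats = fcg_scale(NUM_NODES, NUM_EDGES)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

The graph is very sparse (fewer than five outgoing calls per function on average), yet its absolute size remains large, which is the cost the second category of methods has to bear.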

(2) Ignoring structured code semantics. In existing

graph-based detection methods, API semantics extraction can

be broadly categorized into the following types.

• The first category of work [4], [5], and [11] uses only

node-related statistical or structural features, ignoring

code semantics. This makes their classification results

susceptible to graph structural attacks [23].

• The second category of work uses One-hot encoding

to represent API usage patterns. However, the encoding

vector is highly sparse and does not contain function code

semantics information.

• The third category of work [7], [9] employs natural

language processing techniques to learn the code semantics.

However, source code and binary files are more

structured and logically organized than natural language.

Abstract syntax trees (ASTs), control ﬂow graphs, and

data ﬂow graphs are more suitable for representing struc-

tured code semantics.
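
The difference between a token-level view and a structure-level view of code can be illustrated with a small sketch. It uses Python's built-in `ast` module on Python snippets purely as a stand-in: the paper works on smali code, and the helpers `token_bag`, `label`, and `ast_paths` are hypothetical names that are not part of the paper's AST2Vec algorithm.

```python
import ast
from collections import Counter

# Two snippets with the same tokens but different control structure.
SNIPPET_A = "if ok:\n    send(data)\nlog(data)\n"
SNIPPET_B = "if ok:\n    log(data)\nsend(data)\n"


def token_bag(source: str) -> Counter:
    """Order-free bag of tokens: a crude token-level view of the code."""
    cleaned = source.replace("(", " ").replace(")", " ").replace(":", " ")
    return Counter(cleaned.split())


def label(node: ast.AST) -> str:
    """Node type name, extended with the callee name for simple calls."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return f"Call:{node.func.id}"
    return type(node).__name__


def ast_paths(source: str) -> Counter:
    """Root-to-node label paths of the AST: a simple structure-level view."""
    paths = Counter()

    def visit(node: ast.AST, prefix: tuple) -> None:
        path = prefix + (label(node),)
        paths[path] += 1
        for child in ast.iter_child_nodes(node):
            visit(child, path)

    visit(ast.parse(source), ())
    return paths


same_tokens = token_bag(SNIPPET_A) == token_bag(SNIPPET_B)
same_structure = ast_paths(SNIPPET_A) == ast_paths(SNIPPET_B)
print(same_tokens, same_structure)
```

The two snippets are indistinguishable as bags of tokens, while their AST paths differ because `send` is guarded by the condition in one snippet and unconditional in the other.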

To address the abovementioned issues, we proposed an

Android malware detection method called AndroAnalyzer,

which focuses on sensitive behavior chains to reduce the

computational and storage overhead. It also incorporates structured

code semantics in the chains to be resistant to structural

attacks. Speciﬁcally, the sensitive behavior chain is a chain of

function calls that are closely associated with sensitive behav-

iors. All the sensitive behavior chains are merged to construct

the sensitive function call graph (SFCG). AndroAnalyzer

utilizes the SFCG to characterize the macroscopic behavior

of the application, providing a view of the execution ﬂow of

malicious behaviors. Next, AndroAnalyzer extracts structured

code semantics via ASTs to represent the microscopic behavior of

each function, delving into the code logic within the functions.

Furthermore, by incorporating AST code semantics features

generated by the proposed AST2Vec algorithm and combining

them with API semantics features and structural information

features obtained from social network analysis, we obtain an

SFCG with fused node features. Finally, this graph is input into

a GNN with graph self-attention pooling for learning, resulting

in an intelligent classiﬁer for Android malware detection.
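
The idea of merging sensitive behavior chains into one graph can be sketched as follows. This is only one plausible reading of the description above, not the SFCG generation algorithm of Section III: it assumes that a chain is any call path ending at a sensitive API, and it keeps every function from which a sensitive API is reachable. The function name `build_sfcg` and the example method names are hypothetical.

```python
from collections import defaultdict, deque


def build_sfcg(call_edges, sensitive_apis):
    """Keep only the calls that lie on some chain ending at a sensitive API."""
    callers = defaultdict(set)  # callee -> set of direct callers
    for caller, callee in call_edges:
        callers[callee].add(caller)

    # Backward breadth-first search from every sensitive API.
    keep = set(sensitive_apis)
    queue = deque(sensitive_apis)
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller not in keep:
                keep.add(caller)
                queue.append(caller)

    # An edge survives when its callee is kept (its caller is then kept too).
    kept_edges = {(u, v) for u, v in call_edges if v in keep}
    return keep, kept_edges


# Toy call graph; all names are illustrative assumptions.
toy_edges = [
    ("onCreate", "initUi"),
    ("initUi", "setContentView"),
    ("onCreate", "collect"),
    ("collect", "getDeviceId"),
    ("collect", "upload"),
    ("upload", "HttpURLConnection.connect"),
]
toy_sensitive = {"getDeviceId", "HttpURLConnection.connect"}

sfcg_nodes, sfcg_edges = build_sfcg(toy_edges, toy_sensitive)
print(sorted(sfcg_nodes))
print(sorted(sfcg_edges))
```

In this toy graph the user-interface branch (`initUi`, `setContentView`) is pruned, while the chain from `onCreate` to both sensitive APIs is preserved together with its calling context.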

In summary, the major contributions of this work include:

• We proposed an effective method for representing the

behavior of Android applications. It uses FCGs to rep-

resent the macroscopic behavior of applications and

structured code semantics to represent the microscopic

behavior of functions. This approach strikes a balance

between modeling granularity and storage cost. Additionally,

we designed an SFCG generation algorithm to reduce

the graph size and focus on sensitive behavior chains

that are related to malicious behaviors. This effectively

reduces the computational overhead when analyzing complex

APK files.

• We proposed a structured code semantics extrac-

tion algorithm called AST2Vec, based on ASTs. This

algorithm effectively extracts structured code semantics

from smali code, providing comprehensive behavioral

information for Android malware detection. Furthermore,

the classiﬁcation model exhibits improved generalization

and robustness in binary and multi-class classiﬁcation by

integrating API semantics features and structural features

obtained from social network analysis.

• We conducted extensive performance evaluation exper-

iments on two datasets constructed from CICMalDroid

[24] and AndroZoo [25]. The experimental results

demonstrate that AndroAnalyzer outperforms the baseline

methods in binary and multi-class classiﬁcation tasks.

Furthermore, it (trained with samples in 2010-2018)

exhibits good generalization ability in the detection of

samples in 2019-2022.

The remaining sections are organized as follows. Section II

provides an overview of graph-based malicious software

detection works. In Section III, we introduce the design of

AndroAnalyzer, and in Section IV, we present the evaluation

results. Finally, we provide a brief discussion in Section V and offer concluding

remarks and future research directions in Section VI.

II. RELATED WORK

A. Detection Methods Based on Graph Analysis

This category of methods models static or dynamic features

of Android applications using graphs. Subsequently, it employs

graph matching or graph feature extraction in conjunction with

machine learning techniques to perform detection.

In 2019, Arora et al. proposed PermPair [2], which models

the usage patterns of permission pairs (pairing of two dan-

gerous permissions used simultaneously to perform malicious

or benign behavior) in applications using a permission graph.

During detection, the method calculates benign and malicious

weight scores based on an app’s usage of permission pairs

and compares these scores to detect malware. Similarly,

Fan et al. introduced GefDroid [3] in 2019, which extracts API

usage patterns by analyzing the structural features of sensitive

APIs within subgraphs corresponding to code classes in apps.

This approach analyzes graph similarity between applications

and performs unsupervised clustering of malicious app fami-

lies, incorporating community detection techniques. Wu et al.

also employed social network analysis to analyze FCGs and

introduced MalScan [4]. This method analyzes the centrality

of sensitive API calls in FCGs to generate feature vectors.

It relies on the centrality distributions of sensitive API calls

for classiﬁcation. MaMaDroid [5] models API call sequences

(or abstracted sequences at class or package levels) in function

call graphs as Markov chains for detection and classiﬁcation.
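
As a simplified illustration of modeling call sequences as a Markov chain, the sketch below estimates first-order transition probabilities between abstracted states and flattens them into a feature vector. It assumes the calls have already been abstracted to package-level states; the function names and the example package sequences are hypothetical and do not reproduce MaMaDroid's actual pipeline.

```python
from collections import Counter, defaultdict


def transition_probabilities(call_sequences):
    """Estimate first-order Markov transition probabilities between states."""
    counts = defaultdict(Counter)
    for sequence in call_sequences:
        for src, dst in zip(sequence, sequence[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        probs[src] = {dst: n / total for dst, n in outgoing.items()}
    return probs


def feature_vector(probs, states):
    """Flatten the transition matrix row by row over a fixed state order."""
    return [probs.get(s, {}).get(t, 0.0) for s in states for t in states]


# Illustrative package-level sequences (assumed, not taken from the paper).
sequences = [
    ["android.app", "java.io", "android.telephony"],
    ["android.app", "android.telephony"],
]
markov_probs = transition_probabilities(sequences)
markov_states = sorted({s for seq in sequences for s in seq})
markov_vector = feature_vector(markov_probs, markov_states)
print(markov_states)
print(markov_vector)
```

Each row of the flattened matrix describes how likely one package is to be followed by another, so applications with similar call behavior yield similar vectors regardless of the absolute number of calls.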

In 2020, Surendran et al. introduced Gsdroid [6], which

represents the behavior of applications using system call

graphs. They normalize the call frequency of system calls as

their proposed graph signals and combine these graph signal

features with machine learning techniques for classiﬁcation.

Also in 2020, Niu et al. [7] extracted API call sequences from

FCGs, followed by further extraction of opcodes associated

with API calls. Finally, they employed LSTM to train and

learn from opcode-level call sequences.

In 2021, Wu et al. introduced HomDroid [8]. It begins

with community detection on FCGs and employs homogeneity

analysis to identify the most suspicious subgraphs. It then

generates features from the sensitive APIs’ occurrences, quan-

tities, and proportions in the subgraphs. Machine learning is

subsequently applied to learn from the feature vectors.

These methods often break the feature correlations during

the feature vectorization. Thus, we employ graphs and GNNs

to represent and learn these features respectively.

B. Behavioral Analysis Detection Methods Based on GNNs

This category of methods models behaviors of applications

using graphs. They utilize GNNs or graph embedding to learn

the topological and node features of graphs, and they typically

belong to the task of graph classiﬁcation.

In 2021, Xu et al. [9] proposed an Android malware detec-

tion method based on FCG embedding. They used Word2Vec

to vectorize opcodes and combined it with the SIF network

for function embedding. This embedding was used as node

features in the FCG to generate graph embeddings through

Struct2Vec, ultimately leading to malware detection based on

these embeddings. Similarly, Cai et al. [10] used API call

sequences as a corpus to obtain function embeddings using

natural language processing techniques. They fed the FCGs

with these embeddings to GNNs to perform classiﬁcation.

In 2022, Yumlembam et al. [11] utilized GNNs to generate

API graph embeddings based on centrality measures. They

combined these embeddings with permissions and intents for

malware detection.

In 2023, Wu et al. introduced DeepCatra [12], which tracks

the call traces of key APIs in FCGs. It considers relationships

such as intent and ICC (Inter-Component Communication) and

connects edges accordingly. DeepCatra employs a Bi-LSTM

to learn call traces and utilizes GNNs to learn abstract ﬂow

graphs, combining information for detection. In the same

year, Wu et al. [13] presented another approach in which

they encoded opcodes in functions using one-hot encoding.

They calculated node importance based on centrality and

weighted APIs based on their protection levels corresponding

to permissions. The construction of the graph treated the FCG

as an undirected graph. They used a breadth-ﬁrst algorithm

to create sensitive function subgraphs of sensitive APIs and

their neighbors within a two-hop distance. They combined

API features with graph structure and employed GNNs for

malware detection. Shi et al. proposed SFCGDroid [14], where

they used API call sequences as a corpus to obtain function

semantics using the Skip-gram method. They also incorporated

social network triple information of sensitive APIs as function

node features and combined them with FCG structures for

malware detection. Addressing the remarkable challenge in

graph-based Android malware detection methods known as

graph structural attacks, Li et al. introduced RGDroid [15].

This method initially generates embeddings of API entities

based on an API relationship graph derived from ofﬁcial

Android documentation. These embeddings are used as node

features in FCGs. Additionally, RGDroid employs community

detection to partition FCGs into functional subgraphs, reducing

redundant edge connections and mitigating the impact of graph

structural attacks. Finally, it uses Graph Neural Networks to

learn and detect function call subgraphs.
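
The two-hop sensitive subgraph construction described for [13] can be sketched with a multi-source breadth-first search over the FCG treated as undirected. This is a minimal illustration under that reading of the text; the function name `k_hop_subgraph` and the toy node names are hypothetical, and details such as node weighting are omitted.

```python
from collections import defaultdict, deque


def k_hop_subgraph(call_edges, seeds, k=2):
    """Nodes and edges within k hops of any seed, ignoring edge direction."""
    neighbors = defaultdict(set)
    for u, v in call_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Multi-source BFS: every seed starts at distance 0.
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    kept_edges = {(u, v) for u, v in call_edges if u in distance and v in distance}
    return set(distance), kept_edges


# Toy chain a -> b -> c -> s -> d -> e -> f, with s as the sensitive API.
chain_edges = [("a", "b"), ("b", "c"), ("c", "s"), ("s", "d"), ("d", "e"), ("e", "f")]
hop_nodes, hop_edges = k_hop_subgraph(chain_edges, {"s"}, k=2)
print(sorted(hop_nodes))
print(sorted(hop_edges))
```

With `k=2`, the functions `a` and `f` lie three hops from the sensitive API and are dropped, which shows how a fixed hop limit bounds the subgraph size at the cost of truncating longer call chains.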

Most of these methods face the issues mentioned in

Section I. Therefore, we integrate SFCGs and AST code

semantics for malware detection.

C. Association Analysis Detection Methods Based on GNNs

This category of methods models the relationships between

applications using graphs. They employ GNNs or graph

embedding to learn the topological (relationship) features and

node (application) features, making this category suitable for

node classiﬁcation tasks.

In 2021, Gao et al. introduced GDroid [16], which trans-

forms the problem of malware detection into a graph node

classiﬁcation task. It constructs edges between applications

(APPs) and APIs based on the call relationships among APIs

and the patterns of API usage. This maps APPs and APIs

to a large heterogeneous graph and employs Graph Convo-

lutional Neural Networks (GCNs) to detect malware. Hei et

al. presented HAWK [17] in the same year. This method

builds a heterogeneous graph by considering more entities

such as APIs, permissions, permission types, classes, inter-

faces, and shared object (so) ﬁles. It utilizes a heterogeneous

graph attention network to learn relationships under different

meta-paths for the ﬁnal detection. Fan et al. developed a

method [18] that constructs a heterogeneous graph using

entities like applications, app markets, publishing companies,

app names, app signatures, and developers. It also incorporates

information from different versions of the heterogeneous graph

and performs learning and detection based on spatiotemporal

heterogeneous graph information of applications.

In 2023, Huang et al. introduced WHGDroid [19], which

also builds a heterogeneous graph using multiple entities and

learns relationships through meta-paths. Additionally, WHGDroid

incorporates features to mitigate the impact of malware

evolution and computes weights based on entity importance.

These methods emphasize inter-app relationships. However,

we focus on analyzing individual app behaviors to perform

malware detection.
