Andro-AutoPsy: Similarity-Based Android Malware and Benign Application Dataset

Citation Author(s): Jae-wook Jang (Korea University), Hyunjae Kang (Deloitte Anjin LLC), Jiyoung Woo (Korea University), Aziz Mohaisen (State University of New York at Buffalo), Huy Kang Kim (Korea University)
Submitted by: Saehoon Oh
Last updated: Fri, 11/21/2025 - 07:02
DOI: 10.21227/xdvc-1038
Data Format: *.apk, *.csv, *.txt, *.json
Research Article Link: https://www.sciencedirect.com/science/article/pii/S1742287615000717

Keywords: Similarity matching, Profiling, Android malware, Malware Classification, Certificate

Abstract

The Andro-AutoPsy dataset consists of Android malware and benign application samples collected to support research on similarity-based malware detection. Each application is analyzed using behavior-centric and creator-centric features, including certificate serial numbers, malicious API usage, permission likelihood ratios, system command execution, intent manipulation, and file integrity anomalies.

The dataset was created for the Andro-AutoPsy system, which classifies applications by measuring the similarity of extracted behavioral footprints and identifying malware creator signatures. This approach enables not only malware detection but also grouping of malicious applications into behaviorally similar subfamilies. The system demonstrates strong applicability to zero-day malware detection and provides an alternative to traditional signature-based detection.

Instructions:

Dataset Components

The archive includes:

- Compressed APK samples: 9,990 malicious applications drawn from 30 known malware families, and 109,193 benign applications
- Extracted feature files, including:
  - Certificate serial numbers (creator-centric profiling)
  - Malicious API sequences
  - Critical permission sets
  - System command usage
  - Forged file indicators
  - Intent filter metadata
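
As a rough illustration of what a forged file indicator could capture, the sketch below flags a file whose extension disagrees with its leading magic bytes. The signature table and the is_forged helper are hypothetical names chosen for this example; they are not part of the dataset or of the Andro-AutoPsy tooling, and the table covers only a few well-known formats.

```python
from pathlib import Path

# Hypothetical signature table (an assumption for this sketch):
# file extension -> leading byte sequences accepted for that format.
MAGIC_BY_EXTENSION = {
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".gif": (b"GIF87a", b"GIF89a"),
    ".zip": (b"PK\x03\x04",),
    ".apk": (b"PK\x03\x04",),  # an APK is a ZIP container
    ".dex": (b"dex\n",),
}


def is_forged(name: str, head: bytes) -> bool:
    """Return True when a known extension does not match the leading bytes.

    `name` is the file name inside the package and `head` holds its first
    few bytes. Extensions missing from the table yield False (no verdict).
    """
    expected = MAGIC_BY_EXTENSION.get(Path(name).suffix.lower())
    if expected is None:
        return False
    # bytes.startswith accepts a tuple of candidate prefixes.
    return not head.startswith(expected)


# A DEX payload disguised as an image would be flagged:
print(is_forged("icon.png", b"dex\n035\x00"))
```

Returning no verdict for unknown extensions keeps the sketch conservative; a real pipeline would need a much larger signature table.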

Malware Detection Features

The following features are used for similarity computation and detection:

- Certificate-based indicators
- API sequence alignment (Needleman–Wunsch)
- Permission likelihood ratio
- System command set (e.g., chmod, mount, su, reboot)
- Forged file checking (extension vs. magic number)
- SMS interception patterns (e.g., abortBroadcast)
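
To make the API sequence alignment feature more concrete, here is a minimal sketch of Needleman–Wunsch global alignment scoring over two sequences of API names. The scoring parameters (match, mismatch, gap) and the example API names are assumptions chosen for illustration; the values and the normalization used by the Andro-AutoPsy system are not specified here.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    Uses the standard dynamic-programming recurrence with a linear gap
    penalty, keeping only the previous row to save memory.
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]  # column 0: prefix of `a` against empty `b`
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] + gap        # gap inserted in `b`
            left = curr[j - 1] + gap  # gap inserted in `a`
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Illustrative (hypothetical) API call sequences from two applications.
seq_a = ["getDeviceId", "openConnection", "sendTextMessage"]
seq_b = ["getDeviceId", "sendTextMessage"]
print(needleman_wunsch_score(seq_a, seq_b))
```

A higher score means the two sequences share more calls in the same order. Comparing scores across sequences of different lengths would need some normalization, which this sketch leaves out.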

Acknowledgments

This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2015-H8501-15-1003) supervised by the IITP (Institute for Information & communications Technology Promotion). This work was also supported by the ICT R&D Program of MSIP/IITP [14-912-06-002, The Development of Script-based Cyber Attack Protection Technology].