Andro-AutoPsy: Similarity-Based Android Malware and Benign Application Dataset
- Citation Author(s):
- Jae-wook Jang (Korea University), Hyunjae Kang (Deloitte Anjin LLC), Jiyoung Woo (Korea University), Aziz Mohaisen (State University of New York at Buffalo)
- Submitted by:
- Saehoon Oh
- DOI:
- 10.21227/xdvc-1038
Abstract
The Andro-AutoPsy dataset consists of Android malware and benign application samples collected to support research on similarity-based malware detection. Each application is analyzed using behavior-centric and creator-centric features, including certificate serial numbers, malicious API usage, permission likelihood ratios, system command execution, intent manipulation, and file integrity anomalies.
The dataset was created for the Andro-AutoPsy system, which classifies applications by measuring the similarity of extracted behavioral footprints and identifying malware creator signatures. This approach enables not only malware detection but also grouping of malicious applications into behaviorally similar subfamilies. The system demonstrates strong applicability to zero-day malware detection and provides an alternative to traditional signature-based detection.
Instructions:
Dataset Components
The archive includes:
- Compressed APK samples (malware and benign apps). The dataset contains 9,990 malicious and 109,193 benign applications; the malware samples belong to 30 known families.
- Extracted feature files, including:
- Certificate serial numbers (creator-centric profiling)
- Malicious API sequences
- Critical permission sets
- System command usage
- Forged file indicators
- Intent filter metadata
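The forged file indicator listed above is described on this page as a comparison of a file's extension against its magic number. The following is a minimal Python sketch of that idea, not the authors' implementation. The `MAGIC_NUMBERS` table and the `is_forged` helper are hypothetical names introduced for illustration, and the table covers only a handful of well-known formats.

```python
from pathlib import Path
from typing import Dict

# Leading bytes ("magic numbers") of a few common formats.
# Illustrative and deliberately incomplete; not taken from the dataset.
MAGIC_NUMBERS: Dict[str, bytes] = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".jpg": b"\xff\xd8\xff",
    ".gif": b"GIF8",
    ".zip": b"PK\x03\x04",
    ".apk": b"PK\x03\x04",   # an APK is a ZIP archive
    ".dex": b"dex\n",
    ".so": b"\x7fELF",
}


def is_forged(filename: str, header: bytes) -> bool:
    """Return True when the extension claims a known format but the
    file's leading bytes do not match that format's magic number.

    Files with an extension missing from the table are not flagged,
    because this sketch has nothing to compare them against.
    """
    expected = MAGIC_NUMBERS.get(Path(filename).suffix.lower())
    if expected is None:
        return False
    return not header.startswith(expected)


if __name__ == "__main__":
    # A native executable disguised as an image is flagged.
    print(is_forged("icon.png", b"\x7fELF\x02\x01\x01"))
    # A genuine PNG header is not.
    print(is_forged("icon.png", b"\x89PNG\r\n\x1a\n\x00\x00"))
```

In practice the `header` argument would be the first few bytes read from each file inside an unpacked APK; how the dataset's feature files encode the result is not specified here.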
Malware Detection Features
The following features are used for similarity computation and detection:
- Certificate-based indicators
- API sequence alignment (Needleman–Wunsch)
- Permission likelihood ratio
- System command set (e.g., chmod, mount, su, reboot)
- Forged file checking (extension vs magic number)
- SMS interception patterns (e.g., abortBroadcast)
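The API sequence alignment feature above names the Needleman–Wunsch algorithm. The sketch below shows how a global alignment score between two API call sequences could be computed and turned into a similarity value. It is an illustration under stated assumptions: the scoring values (match +1, mismatch -1, gap -1), the normalization by the longer sequence length, the function names, and the sample call sequences are all assumptions and are not taken from the dataset or the Andro-AutoPsy system.

```python
from typing import Sequence


def needleman_wunsch(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 1,
    mismatch: int = -1,
    gap: int = -1,
) -> int:
    """Return the optimal global alignment score of two sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per element.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: each cell takes the best of match/mismatch or a gap.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Map the alignment score to [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, needleman_wunsch(a, b) / longest)


if __name__ == "__main__":
    # Hypothetical API call sequences for two applications.
    app_a = ["getDeviceId", "sendTextMessage", "abortBroadcast"]
    app_b = ["getDeviceId", "abortBroadcast"]
    print(needleman_wunsch(app_a, app_b), similarity(app_a, app_b))
```

With these default scores, identical sequences receive a similarity of 1.0, and each inserted or substituted call lowers the score. A real system would likely tune the scoring scheme and threshold against labeled samples.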
Acknowledgments
This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2015-H8501-15-1003) supervised by the IITP (Institute for Information & communications Technology Promotion). This work was also supported by the ICT R&D Program of MSIP/IITP [14-912-06-002, The Development of Script-based Cyber Attack Protection Technology].