Machine Learning for Malware
Classification and Clustering
Phil Roth, Data Scientist
1
• PhD in particle astrophysics
• Switched to making images from radar data
• Switched to solving security problems with data
Phil Roth
Data Scientist
2
Outline
• Malware Detection
• Boosted Decision Trees
• Malware Features
• Evaluating Performance
• Bringing a Human into the Loop
3
The Problem: Antivirus
The security industry has declared antivirus dead, but
there is no widely accepted replacement.
Machine Learning can be that replacement.
4
The Problem: Antivirus
• Antivirus uses signatures, heuristics, and hand-crafted rules
that do not scale well
• Using polymorphism and obfuscation, malware authors can
circumvent rule-based detection techniques
5
The Solution: Machine Learning
Machine Learning uses statistical techniques to learn
patterns from large datasets
6
Two Steps:
• Feature Extraction
• Boundary Learning
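
The two steps can be sketched in a few lines of Python. This is only an illustration: the two toy features (fraction of printable bytes, fraction of zero bytes), the nearest-centroid rule, and the tiny labeled corpus are assumptions made for the example, not the features or model used in the talk.

```python
def extract_features(data):
    """Step 1: turn raw bytes into a fixed-length numeric vector (toy features)."""
    n = max(len(data), 1)
    printable = sum(1 for b in data if 32 <= b < 127) / n
    zeros = data.count(0) / n
    return [printable, zeros]


def learn_boundary(vectors, labels):
    """Step 2: learn a decision boundary; here, a nearest-centroid rule."""
    def centroid(cls):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def predict(v):
        d0 = sum((a - b) ** 2 for a, b in zip(v, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(v, c1))
        return 1 if d1 < d0 else 0

    return predict


# Hypothetical toy corpus: label 1 stands for "malicious", 0 for "benign".
samples = [
    b"hello world, plain text",
    b"readme: see docs",
    b"\x00\x00\x90\x90\xff\xfe\x00",
    b"\x00\xeb\xfe\x00\x00\xcc",
]
labels = [0, 0, 1, 1]

predict = learn_boundary([extract_features(s) for s in samples], labels)
print(predict(extract_features(b"just some notes")))
```

Keeping the two steps separate means the feature extractor can change without touching the learner, and the reverse.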
Machine Learning Advantages
• Automation
• Deep Insights
• Scalability
• Generalization
7
Machine Learning Challenges
• Requires labels
• Requires large data sets
• The security field has very low tolerance for errors
8
Boosted Decision Trees
Basically, it’s a game of 20 questions
Source: https://en.wikipedia.org/wiki/Decision_tree_learning
A tree showing survival of passengers
on the Titanic ("sibsp" is the number
of spouses or siblings aboard). The
figures under the leaves show the
probability of survival and the
percentage of observations in the
leaf.
9
Boosted Decision Trees
• The trees are built by choosing “questions” that
maximize the discrimination between two classes
• The model is called “boosted” because misclassified
samples are given higher weight in future tree building
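
Both bullets can be illustrated with a small AdaBoost-style sketch over one-question trees ("stumps"). The talk does not say which boosting algorithm or library was used, so AdaBoost, the function names, and the six-point toy dataset are assumptions chosen for the example.

```python
import math


def best_stump(X, y, w):
    """Pick the question (feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] > t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best


def fit_boosted_stumps(X, y, rounds=3):
    """AdaBoost over depth-1 trees; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n  # start with equal sample weights
    model = []
    for _ in range(rounds):
        err, j, t, polarity = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, t, polarity))
        # Misclassified samples are up-weighted, correct ones down-weighted.
        w = [wi * math.exp(-alpha * yi * (polarity if row[j] > t else -polarity))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Weighted vote of all stumps."""
    score = sum(a * (p if row[j] > t else -p) for a, j, t, p in model)
    return 1 if score > 0 else -1


# Hypothetical 1-D toy data that no single threshold question can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

model = fit_boosted_stumps(X, y, rounds=3)
print([predict(model, row) for row in X])
```

On this toy set a single stump cannot reproduce the labels, while three boosted stumps can, because each round concentrates on the points the previous rounds got wrong.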
10
Why Boosted Decision Trees?
Proven results in security and physics
References:
https://www.kaggle.com/c/malware-classification/
http://arxiv.org/pdf/1511.04317.pdf
http://jmlr.org/proceedings/papers/v42/chen14.pdf
11
Malware Features
The extracted features determine your
model’s performance, but there is a tradeoff:
Complicated vs. Explainable
12
Complicated Features
Byte frequency and byte
entropy features form a
binary fingerprint that informs
the model
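
A minimal sketch of such features, assuming a normalized 256-bin byte histogram and Shannon entropy measured in bits per byte. The talk does not spell out its exact feature definitions, so the function names and the 256-byte window size here are illustrative choices.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Normalized frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]


def byte_entropy(data):
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    return -sum(p * math.log2(p) for p in byte_histogram(data) if p > 0)


def windowed_entropy(data, window=256):
    """Entropy of consecutive fixed-size chunks, a rough profile of the file."""
    return [byte_entropy(data[i:i + window]) for i in range(0, len(data), window)]


# Hypothetical inputs: constant padding versus every byte value once.
print(byte_entropy(b"\x00" * 100))
print(byte_entropy(bytes(range(256))))
```

Low entropy suggests padding or plain text, while values near 8 bits per byte are typical of compressed or encrypted content, which is one reason packed samples stand out in these features.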
13
Explainable Features
Lists of capabilities don’t greatly help the model classify a
sample, but they can provide more insight to an analyst.
This sample can:
• Record keystrokes
• Send/receive network traffic
• Modify registry
14
Evaluating Performance
We must be careful not to learn from “future” information:
[Figure: two timelines marking Train Data, Test Data, and Model Train Times, annotated "Patterns learned here..." and "...should not inform classifications here"]
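
One way to respect this constraint is to split the data by time instead of at random. The sketch below assumes each record carries a first-seen timestamp; the record layout, the function name, and the dates are hypothetical.

```python
from datetime import datetime


def temporal_split(samples, train_time):
    """Split (first_seen, features, label) records at the model train time."""
    train = [s for s in samples if s[0] < train_time]
    test = [s for s in samples if s[0] >= train_time]
    return train, test


# Hypothetical records: (first_seen, feature_vector, label).
samples = [
    (datetime(2015, 1, 10), [0.1, 0.9], 0),
    (datetime(2015, 3, 2), [0.8, 0.2], 1),
    (datetime(2015, 6, 21), [0.7, 0.1], 1),
    (datetime(2015, 9, 5), [0.2, 0.8], 0),
]

train, test = temporal_split(samples, train_time=datetime(2015, 5, 1))
print(len(train), len(test))
```

A random split would let samples first seen after the train time leak into training, which tends to overstate how well the model handles malware it has not yet encountered.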
15
Bringing Humans into the Loop
Amazon built an entire tool (Mechanical Turk) to cheaply
generate labels from human intuition:
Are these products related?
16
Bringing Humans into the Loop
Our labels are more expensive to obtain, so choosing
which samples to label is even more important.
Is this binary malicious?
Active Learning can help!
17
Bringing Humans into the Loop
When new data arrives, Active Learning tells analysts
which labels would be most helpful.
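
A common way to decide which labels would help most is uncertainty sampling: send analysts the samples whose model score is closest to the decision threshold. The talk does not say which active learning strategy is used, so this is a sketch of one option; the sample ids, the scores, and the 0.5 threshold are assumptions.

```python
def most_uncertain(scores, budget, threshold=0.5):
    """Return the ids of the samples whose score is closest to the threshold."""
    ranked = sorted(scores, key=lambda sid: abs(scores[sid] - threshold))
    return ranked[:budget]


# Hypothetical model outputs: probability that each new sample is malicious.
scores = {"a": 0.99, "b": 0.52, "c": 0.10, "d": 0.45}

print(most_uncertain(scores, budget=2))
```

Samples the model already scores near 0 or 1 are unlikely to change the boundary much, so the analyst's limited labeling budget goes to the ambiguous ones.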
18
Integration
• Our malware classifier model has been integrated into
our stealthy sensor and Hunt Platform
• Ask the other friendly Endgamers here for a demo!
19
Thanks!
proth@endgame.com
@mrphilroth
20


Editor's Notes

  • #16 Dive right into train versus test data.