EFFECTIVE FLOWGRAPH-
BASED MALWARE VARIANT
DETECTION
Silvio Cesare, Ph.D. Candidate, Deakin University
http://www.foocodechu.com
silvio.cesare@gmail.com
WHO AM I AND WHERE DID THIS TALK
COME FROM?
   Ph.D. Student at Deakin University.

   Research interests include:
     Automated vulnerability discovery.
     Software similarity and classification.
     Malware detection.


   This presentation is based on my malware
    research.
OUTLINE
1.   Introduction (you might already know this)

2.   New approaches to flowgraph-based
     classification

3.   Evaluation

4.   Other things we use our system on.

5.   Conclusion
INTRODUCTION
This is to make sure everyone is up to speed.
If you’ve been to my presentations before, you
might have already seen it.
INTRODUCTION
   Malware a significant problem.
   Static detection of malware a dominant real-time
    technique.
   Detecting unknown variants from known
    samples very useful.

   Example variants:
     Klez.a, Klez.b, Klez.c, Klez.d, ...
     Roron.ao, Roron.b, Roron.d, Roron.e, Roron.f, ...
SIGNATURES AND BIRTHMARKS
   A birthmark is an invariant property in related
    samples.

   Birthmark comparison should allow inexact
    matching.
LIMITATIONS OF EXISTING
BIRTHMARKS
   Byte-level content can change in every variant.

   Birthmark comparison is often exact matching
    only.

   Inefficient for inexact database searching.

   Unable to detect unknown variants of known
    samples.

   Program structure a better birthmark.
THE SOFTWARE SIMILARITY
PROBLEM
THE SOFTWARE SIMILARITY
SEARCH
 Need a dissimilarity or distance metric.
 “Metric” property allows efficient database
  search.

 [Figure: similarity search in a metric space. Labels:
  p, q, r, d(p,q), Query, Malware, Query Benign,
  Query Malicious.]
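
The reason the metric property helps a database search can be sketched in a few lines. The example below is an illustration only, not the Malwise index: it assumes edit distance as the metric and a single hypothetical pivot string, and uses the triangle inequality to skip candidates that cannot lie within the search radius. The names `range_search`, `pivot`, and `radius` are assumptions made for this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def range_search(query, database, pivot, radius, dist=edit_distance):
    """Return (matches, distance_calls) for samples within `radius` of `query`."""
    # A real index would precompute these pivot distances once, offline.
    pivot_dist = {p: dist(p, pivot) for p in database}
    q_to_pivot = dist(query, pivot)
    matches, calls = [], 0
    for p in database:
        # Triangle inequality: d(q, p) >= |d(q, pivot) - d(p, pivot)|.
        if abs(q_to_pivot - pivot_dist[p]) > radius:
            continue  # p cannot be within the radius, so skip the costly call
        calls += 1
        if dist(query, p) <= radius:
            matches.append(p)
    return matches, calls


# Hypothetical database of control-flow strings and a near-variant query.
database = ["W|IEH}R", "W|}R", "IEHR", "SR", "R"]
matches, calls = range_search("W|IE}R", database, pivot="R", radius=1)
```

In this toy run only one of the five candidates needs a full distance computation against the query; the other four are ruled out by their stored pivot distances alone.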
EXISTING APPROACHES: A CALL GRAPH
BIRTHMARK
   Inter-procedural control flow.
AN OPTIMAL DISSIMILARITY METRIC
FOR GRAPHS
   Graph edit distance.

   Number of operations to transform one graph to
    another.

   Exact computation is NP-hard.

   Suboptimal solutions possible in cubic time.
OUR APPROACH: A SET OF CONTROL
FLOW GRAPHS BIRTHMARK
 Intra-procedural control flow.
 Many procedures.
TRANSFORMING GRAPH DISSIMILARITY
TO A STRING DISSIMILARITY PROBLEM
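 Decompile control flow graphs to strings.
 Compare strings using ‘string metrics’.

 [Figure: a control flow graph with nodes L_0, L_1, L_2,
  L_3, L_4, L_5, L_6, L_7 and edges labelled "true", shown
  beside its decompiled form and the resulting string.]

 Decompiled procedure:

   proc (){
   L_0:
     while (v1 || v2) {
   L_1:
       if (v3) {
   L_2:
       } else {
   L_4:
       }
   L_5:
     }
   L_7:
     return ;
   }

 Resulting string: W|IEH}R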
NEW APPROACHES TO
FLOWGRAPH-BASED
CLASSIFICATION
TRANSFORMING A SET OF STRINGS
PROBLEM INTO A STRING PROBLEM
   Decompiled CFGs give us a set of strings.

   Order and concatenate strings.

   Delimit substrings with ‘Z’.

   Order based on metrics.

     Number of instructions in procedure.
     Number of basic blocks.
     etc

   [Figure: the strings at each step.]

     Set of strings    Ordered      Delimited with ‘Z’
     R                 W|IEH}R      W|IEH}RZ
     W|IEH}R           R            RZ
     W|}R              W|}R         W|}RZ
     IEHR              IEHR         IEHRZ
     SR                SR           SRZ
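
The ordering and concatenation step might look like the sketch below. The instruction and basic-block counts are made-up values chosen only for illustration, and sorting largest-first with the string as a tie-breaker is an assumption; the slides say only that ordering is based on such metrics. `Procedure` and `program_signature` are hypothetical names.

```python
from typing import NamedTuple


class Procedure(NamedTuple):
    cfg_string: str     # decompiled control-flow string, e.g. "W|IEH}R"
    instructions: int   # hypothetical instruction count
    basic_blocks: int   # hypothetical basic-block count


def program_signature(procs, delimiter="Z"):
    """Order procedure strings by size metrics, then join them with a delimiter."""
    ordered = sorted(
        procs,
        # Assumed order: largest procedures first; the string breaks ties
        # so the result does not depend on the input order.
        key=lambda p: (-p.instructions, -p.basic_blocks, p.cfg_string),
    )
    return "".join(p.cfg_string + delimiter for p in ordered)


procs = [
    Procedure("R", 40, 1),
    Procedure("W|IEH}R", 55, 8),
    Procedure("W|}R", 30, 4),
    Procedure("IEHR", 22, 4),
    Procedure("SR", 9, 2),
]
signature = program_signature(procs)
```

A deterministic order matters because two variants with the same procedures should yield the same concatenated string regardless of the order in which the procedures were discovered.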
WHAT WE TRIED (AND ENDED UP NOT
USING)
   String metrics:

       Edit distance:  ed(“hello”, “ggello”) = 2

       Normalized Compression Distance, where C(x) is the
       compressed length of x:

         $$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$$

       Sequence alignment:

           A K TKT K
         | | | | |
           ATKTT T K

   All databases indexed using metric trees.
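
Two of the string metrics above can be sketched with the standard library. This is a minimal illustration, not the code used in Malwise: the choice of `zlib` as the compressor C is an assumption, and any real compressor only approximates the ideal NCD.

```python
import zlib


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def compressed_size(data: bytes) -> int:
    """C(x): length of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance, following the formula on the slide."""
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

With a real compressor the NCD of a string against itself is small but usually not exactly zero, so in practice it behaves as an approximate metric.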
SEQUENCE ALIGNMENT WITH
BLAST
   A heuristic genome sequence search tool.

   Local sequence alignment.

   Hugely popular in bioinformatics.

   So.. transform our strings into genome
    sequences.

   Then, do a genome search.
GENOME SEQUENCE EXTRACTION
   [Figure: the control flow graph and decompiled procedure
    from the earlier slide, with its string W|IEH}R
    re-encoded as a genome sequence such as ACGTRYKMACGTRYKM.]

   A = Adenine
   C = Cytosine
   G = Guanine
   T = Thymine
   ...
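
The slides do not give the actual encoding, so the following is a purely hypothetical scheme meant only to show the idea of re-expressing a control-flow string in a nucleotide alphabet that a genome search tool accepts. It assigns each symbol a fixed two-letter codeword over A, C, G, T; the sequence shown on the slide also uses letters such as R, Y, K, and M, so the real mapping evidently differs.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def build_codebook(alphabet: str, width: int = 2) -> dict:
    """Give each symbol a distinct fixed-width nucleotide codeword (hypothetical)."""
    codewords = ["".join(c) for c in product(NUCLEOTIDES, repeat=width)]
    symbols = sorted(set(alphabet))
    if len(symbols) > len(codewords):
        raise ValueError("alphabet too large for this codeword width")
    return dict(zip(symbols, codewords))


def to_nucleotides(cfg_string: str, codebook: dict) -> str:
    """Translate a control-flow string symbol by symbol."""
    return "".join(codebook[ch] for ch in cfg_string)


# Assumed symbol alphabet, taken from the strings shown on earlier slides.
codebook = build_codebook("W|IEH}RSZ")
sequence = to_nucleotides("W|IEH}R", codebook)
```

Fixed-width codewords keep the translation reversible, so an alignment found by the search tool can be mapped back to positions in the original string.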
WHY DIDN’T WE USE THOSE
APPROACHES?
   Not optimally effective.

   Too slow.

   Best speed was using NCD.
A DISSIMILARITY METRIC FOR SETS OF
STRINGS (WHAT WE ENDED UP USING)
   Find a mapping between strings to minimize the
    sum of distances.

   [Figure: two sets of strings with mapped pairs joined
    by edges. Labels: p, q, d=ed(p,q).]

     Set 1: BR, BW|{B}BR, BI{B}BR, BSSR, BSR, BSSSR
     Set 2: BR, BW|{B}BR, BSSR
COMBINATORIAL OPTIMISATION: THE
ASSIGNMENT PROBLEM
   Finding a minimum cost mapping is known as
    the “assignment problem”.

   Optimal solutions exist in cubic time.

   “Greedy” heuristic solutions are faster.

   The resulting distance has the properties of a metric.
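
The set-of-strings distance can be sketched as follows, using the two string sets from the previous slide. Several things here are assumptions rather than the Malwise implementation: the smaller set is padded with empty strings so every string has a partner, the optimal solver is a brute-force search over permutations (a cubic-time method such as the Hungarian algorithm would replace it in practice), and the greedy variant simply pairs the closest remaining strings.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def _pad(xs, ys):
    """Pad the shorter list with empty strings so both have equal size."""
    n = max(len(xs), len(ys))
    return list(xs) + [""] * (n - len(xs)), list(ys) + [""] * (n - len(ys))


def optimal_assignment_distance(xs, ys):
    """Minimum total edit distance over all one-to-one mappings.

    Brute force, so only suitable for very small sets.
    """
    xs, ys = _pad(xs, ys)
    return min(
        sum(edit_distance(x, y) for x, y in zip(xs, perm))
        for perm in permutations(ys)
    )


def greedy_assignment_distance(xs, ys):
    """Pair the closest remaining strings first; fast but not always optimal."""
    xs, ys = _pad(xs, ys)
    pairs = sorted(
        (edit_distance(x, y), i, j)
        for i, x in enumerate(xs)
        for j, y in enumerate(ys)
    )
    used_x, used_y, total = set(), set(), 0
    for cost, i, j in pairs:
        if i not in used_x and j not in used_y:
            used_x.add(i)
            used_y.add(j)
            total += cost
    return total


set1 = ["BR", "BW|{B}BR", "BI{B}BR", "BSSR", "BSR", "BSSSR"]
set2 = ["BR", "BW|{B}BR", "BSSR"]
optimal = optimal_assignment_distance(set1, set2)
greedy = greedy_assignment_distance(set1, set2)
```

For these two sets the three shared strings pair up at zero cost and the three unmatched strings are paid for in full, so both solvers return the same total. In general the greedy result is an upper bound on the optimal one.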
EVALUATION
IMPLEMENTATION
   The Malwise system is 100,000 lines of C++ code.

   The modules for this work are under 3,000 lines of code.

   Unpacks malware using an application level
    emulator (Ruxcon 2010)

   Pre-filtering stage to quickly cull non matching
    variants (Ruxcon 2011)
EVALUATION - EFFECTIVENESS
   Calculated similarity between Roron malware
    variants.

   Compared results to Ruxcon 2010 work.

   In tables, highlighted cells indicate a positive
    match.

   The more matches the more effective it is.
EVALUATION - EFFECTIVENESS
  Exact Matching (Ruxcon 2010)

       ao     b      d      e      g      k      m      q      a
  ao   -      0.44   0.28   0.27   0.28   0.55   0.44   0.44   0.47
  b    0.44   -      0.27   0.27   0.27   0.51   1.00   1.00   0.58
  d    0.28   0.27   -      0.48   0.56   0.27   0.27   0.27   0.27
  e    0.27   0.27   0.48   -      0.59   0.27   0.27   0.27   0.27
  g    0.28   0.27   0.56   0.59   -      0.27   0.27   0.27   0.27
  k    0.55   0.51   0.27   0.27   0.27   -      0.51   0.51   0.75
  m    0.44   1.00   0.27   0.27   0.27   0.51   -      1.00   0.58
  q    0.44   1.00   0.27   0.27   0.27   0.51   1.00   -      0.58
  a    0.47   0.58   0.27   0.27   0.27   0.75   0.58   0.58   -

  Heuristic Approximate Matching (Ruxcon 2010)

       ao     b      d      e      g      k      m      q      a
  ao   -      0.70   0.28   0.28   0.27   0.75   0.70   0.70   0.75
  b    0.74   -      0.31   0.34   0.33   0.82   1.00   1.00   0.87
  d    0.28   0.29   -      0.50   0.74   0.29   0.29   0.29   0.29
  e    0.31   0.34   0.50   -      0.64   0.32   0.34   0.34   0.33
  g    0.27   0.33   0.74   0.64   -      0.29   0.33   0.33   0.30
  k    0.75   0.82   0.29   0.30   0.29   -      0.82   0.82   0.96
  m    0.74   1.00   0.31   0.34   0.33   0.82   -      1.00   0.87
  q    0.74   1.00   0.31   0.34   0.33   0.82   1.00   -      0.87
  a    0.75   0.87   0.30   0.31   0.30   0.96   0.87   0.87   -

  Assignment problem

       ao     b      d      e      g      k      m      q      a
  ao   -      0.86   0.49   0.54   0.50   0.87   0.86   0.86   0.86
  b    0.87   -      0.57   0.63   0.62   0.96   1.00   1.00   0.96
  d    0.61   0.64   -      0.85   0.91   0.64   0.64   0.64   0.64
  e    0.64   0.69   0.85   -      0.90   0.68   0.69   0.69   0.68
  g    0.62   0.68   0.91   0.91   -      0.68   0.68   0.68   0.68
  k    0.88   0.96   0.58   0.62   0.61   -      0.96   0.96   0.99
  m    0.87   1.00   0.57   0.63   0.62   0.96   -      1.00   0.96
  q    0.87   1.00   0.57   0.63   0.62   0.96   1.00   -      0.96
  a    0.87   0.96   0.58   0.62   0.61   0.99   0.96   0.96   -
EVALUATION – FALSE POSITIVES
   Database of 10,000 malware.

   Scanned 1,601 benign binaries.

   7 false positives. Less than 1%.

   Very small binaries have small signatures and
    cause weak matching.
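
For reference, the "less than 1%" figure follows directly from the two counts on this slide:

```latex
\[
\text{false positive rate} = \frac{7}{1601} \approx 0.0044 = 0.44\%
\]
```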
EVALUATION - EFFICIENCY
   Median benign and malware processing time is
    0.06s and 0.84s.
   % Samples    Benign Time(s)    Malware Time(s)
   10           0.02              0.16
   20           0.02              0.28
   30           0.03              0.30
   40           0.03              0.36
   50           0.06              0.84
   60           0.09              0.94
   70           0.13              0.97
   80           0.25              1.03
   90           0.56              1.31
   100          8.06              585.16
BUT THAT’S NOT ALL WE
USE THE MALWISE ENGINE
FOR..
SIMSEER – A SOFTWARE SIMILARITY
WEB SERVICE
   An online service to identify similarity between
    programs
   Based on Malwise.
   Renders an evolutionary tree to show program
    relationships.
   Free to use!
   http://www.foocodechu.com/?q=simseer-a-software-similarit
SIMSEER - DEMO
   http://www.youtube.com/watch?v=ymo7DKlKCH4
BUGWISE
   Automatically detect bugs and vulnerabilities in
    Linux executable binaries.
   Uses static program analysis from Malwise.
     Decompilation
     Data Flow Analysis 

   Free to use!
   http://www.foocodechu.com/?q=bugwise-a-bug-detection-we
BUGWISE – SGID GAMES XONIX BUG IN
DEBIAN LINUX
  memset(score_rec[i].login, 0, 11);
  strncpy(score_rec[i].login, pw->pw_name, 10);
  memset(score_rec[i].full, 0, 65);
  strncpy(score_rec[i].full, fullname, 64);
  score_rec[i].tstamp = time(NULL);
  free(fullname);                      /* first free */

  if((high = freopen(PATH_HIGHSCORE, "w",high)) == NULL) {
      fprintf(stderr, "xonix: cannot reopen high score file\n");
      free(fullname);                  /* freed again on this path: double free */
      gameover_pending = 0;
      return;
  }
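
The kind of check that flags the code above can be illustrated with a toy path scanner. This is not how Bugwise works internally, which the slides describe only as decompilation plus data flow analysis on binaries; the event format and the name `find_double_frees` are assumptions made for this sketch, and it handles a single straight-line path only.

```python
def find_double_frees(path):
    """Scan one execution path of (operation, variable) events for double frees."""
    freed = set()
    findings = []
    for index, (op, var) in enumerate(path):
        if op == "free":
            if var in freed:
                findings.append((index, var))  # freed earlier on this path
            freed.add(var)
        elif op == "assign":
            # A fresh assignment makes the pointer valid again.
            freed.discard(var)
    return findings


# The error path through the xonix snippet, written as hypothetical events.
error_path = [
    ("use", "fullname"),    # strncpy(score_rec[i].full, fullname, 64)
    ("free", "fullname"),   # free(fullname)
    ("assign", "high"),     # high = freopen(...), which returned NULL
    ("free", "fullname"),   # free(fullname) inside the error branch
]
findings = find_double_frees(error_path)
```

A real analysis would have to consider every path through the control flow graph and pointer aliasing, which is where the data flow machinery comes in.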
PUBLICATIONS
 Book published by Springer.
 http://www.springer.com/computer/security+and+cryptology/book/978-1-4471-2908-0
CONCLUSION
   Malwise effectively identifies malware variants.

   Runs in real-time in expected case.

   Large functional code base and years of
    development time.

   Happy to talk to vendors.

   http://www.FooCodeChu.com
