A GRAPH-BASED METHOD FOR
CROSS-ENTITY THREAT DETECTION
Herman Kwong, Ping Yan
Salesforce
Account Takeover
Detection is Key
• Basic features
• Known signature
• Usage anomaly
Each has its weaknesses
CrossLinks
• Unexpected common features, across unrelated user
accounts / environments
• Features:
IP, location, time zone,
user agent, browser fingerprint,
user action sequence, ...
Why Graph(X)?
• Why Graph?
– Classical pairwise entity-relationship measurements require O(N²) computations
– Computational complexity is dramatically reduced by localizing computations
– Highly extensible solution with a multigraph
• Why GraphX?
– Spark ecosystem
– Scalability and performance
– Advanced Graph algorithms
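The point about localizing computations can be illustrated with a small sketch. It assumes access records are (entity, feature value) pairs and, instead of comparing every pair of entities, groups entities by feature value and only pairs up entities inside each group. The function name `linked_pairs` and the sample records are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations


def linked_pairs(records):
    """Return pairs of entities that share at least one feature value.

    records: iterable of (entity, feature_value) tuples (assumed layout).
    Work is proportional to the size of each feature group, not to N^2.
    """
    by_feature = defaultdict(set)
    for entity, value in records:
        by_feature[value].add(entity)

    pairs = set()
    for entities in by_feature.values():
        # Only entities that share this feature value are compared.
        for a, b in combinations(sorted(entities), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical sample data.
records = [
    ("org_a", "ip_1"),
    ("org_b", "ip_1"),
    ("org_c", "ip_2"),
    ("org_d", "ip_3"),
    ("org_a", "ip_3"),
]
pairs = linked_pairs(records)
```

Here only `org_a`/`org_b` and `org_a`/`org_d` are ever compared; `org_c` shares nothing and is never touched by the pairing step.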
Graph-theoretical Techniques
• Graph analysis is of high interest in many social network contexts
– Proximity-based approaches
– Personalized PageRank: closeness of each node to the restart nodes
– SimRank: similarity of contextual structures
• Bridge-node anomaly [Akoglu et al 2015]
– Publication networks: authors from different research communities
– Financial trading networks: cross-sector traders
– Customer-product networks: cross-border products
– Network intrusion detection: cut-vertices indicating nodes accessing multiple
communities that they do not belong to
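As a rough illustration of the cut-vertex idea, the sketch below finds articulation points in a small undirected graph with a standard depth-first search (Tarjan's low-link method). The graph, the node names and the function `articulation_points` are hypothetical; the talk does not say which algorithm, if any, was used for this.

```python
def articulation_points(adj):
    """Return the cut-vertices of a simple undirected graph.

    adj: dict mapping node -> set of neighbor nodes (assumed layout).
    """
    disc, low, result = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # The subtree under v cannot bypass u: u is a cut-vertex.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
        # A DFS root is a cut-vertex only if it has several DFS children.
        if parent is None and children > 1:
            result.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return result


# Two hypothetical communities {a, b, c} and {x, y, z}; node "m" touches both.
adj = {
    "a": {"b", "c", "m"},
    "b": {"a", "c", "m"},
    "c": {"a", "b"},
    "m": {"a", "b", "x", "y"},
    "x": {"m", "y", "z"},
    "y": {"m", "x", "z"},
    "z": {"x", "y"},
}
bridges = articulation_points(adj)
```

Only `m` is reported, since removing it is the only way to disconnect the two communities.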
Bipartite Graph
Application access data directly forms a bipartite graph, where an edge
represents V1 accessing V2.
[Figure: V1 (an endpoint coming from 73.228.152.xxx) connected by an edge to V2 (Salesforce)]
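A minimal sketch of this formulation, assuming the access log is a list of (endpoint, application) records. The first record mirrors the slide; the remaining records and the helper `build_bipartite` are hypothetical.

```python
from collections import defaultdict

# (endpoint, accessed application) records; only the first is from the slide.
access_log = [
    ("73.228.152.xxx", "Salesforce"),
    ("73.228.152.xxx", "Heroku"),       # hypothetical
    ("198.51.100.7", "Salesforce"),     # hypothetical
    ("73.228.152.xxx", "Salesforce"),   # repeated access, same edge
]


def build_bipartite(log):
    """Map each V1 node (endpoint) to the set of V2 nodes it accessed."""
    v1_to_v2 = defaultdict(set)
    for endpoint, app in log:
        v1_to_v2[endpoint].add(app)
    return dict(v1_to_v2)


graph = build_bipartite(access_log)
edges = sorted((e, a) for e, apps in graph.items() for a in apps)
v1_nodes = set(graph)
v2_nodes = {a for apps in graph.values() for a in apps}
```

Every edge runs from `v1_nodes` to `v2_nodes` and the two node sets do not overlap, which is what makes the graph bipartite.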
Multigraph Formulation
We can also formulate the relationship of application access data as a
multigraph where an edge between two entities represents some features
that the two entities have in common.
[Figure: two entities, Salesforce and Heroku (V1 and V2), joined by parallel edges labeled IP={73.228.152.xxx} and UserAgent={Mozilla/5.0 (Macintosh; Intel Mac OS X)}]
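The multigraph formulation can be sketched as follows, assuming each entity carries a set of observed values per feature. The Salesforce and Heroku values mirror the slide; the third entity, its values and the function `multigraph_edges` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X)"

# entity -> feature name -> set of observed values (assumed layout).
entity_features = {
    "Salesforce": {"IP": {"73.228.152.xxx"}, "UserAgent": {UA}},
    "Heroku": {"IP": {"73.228.152.xxx"}, "UserAgent": {UA}},
    "OtherApp": {"IP": {"203.0.113.9"}, "UserAgent": {"curl/8.0"}},  # hypothetical
}


def multigraph_edges(features_by_entity):
    """Emit one edge per (feature, value) shared by two entities."""
    index = defaultdict(set)
    for entity, features in features_by_entity.items():
        for name, values in features.items():
            for value in values:
                index[(name, value)].add(entity)

    edges = []
    for (name, value), entities in sorted(index.items()):
        for a, b in combinations(sorted(entities), 2):
            edges.append((a, b, name, value))
    return edges


edges = multigraph_edges(entity_features)
```

Salesforce and Heroku end up connected by two parallel edges, one per shared feature, which is why a multigraph rather than a simple graph is needed.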
Anomaly Detection by Graph Change Detection
• Our objective is to quickly discover changes in the access graph
over time
• Unexpected new cross-entity connections are of particular interest in
security detection problems
• A naïve detector and a community-based algorithm were proposed
for access anomaly detection with a graph
Naïve Detector
REFERENCE GRAPH (TRAINING) – RG
DETECTION GRAPH (TESTING) – DG
MERGE OF RG & DG – RGDG
ANOMALY GRAPH (DEGREE INFO) – AG
CONNECTIVITY GRAPH (ENV-TO-ENV) – CG
Detection:
// outDegRG: number of neighbors a test node has in the reference graph
// outDegRGDG: number of neighbors the same node has in the merge of the
// reference data and the test data
// We want the difference between the two degree properties:
// outDegRGDG - outDegRG
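The degree comparison above can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' GraphX code: edges are (source, destination) tuples, the toy edges are made up so that the resulting degree pairs are (1, 1) and (1, 3), and the function names are hypothetical.

```python
from collections import defaultdict


def out_neighbors(edges):
    """Map each source node to its set of distinct neighbors."""
    nbrs = defaultdict(set)
    for src, dst in edges:
        nbrs[src].add(dst)
    return nbrs


def naive_detector(rg_edges, dg_edges):
    """Return {test node: (outDegRG, outDegRGDG)} as a tiny anomaly graph."""
    rg = out_neighbors(rg_edges)
    rgdg = out_neighbors(set(rg_edges) | set(dg_edges))  # merge of RG and DG
    test_nodes = {src for src, _ in dg_edges}
    return {n: (len(rg.get(n, ())), len(rgdg[n])) for n in test_nodes}


def new_link_counts(anomaly_graph):
    """Keep nodes whose degree grew: outDegRGDG - outDegRG > 0."""
    return {n: b - a for n, (a, b) in anomaly_graph.items() if b - a > 0}


# Hypothetical toy graphs.
rg_edges = [("V1", "V2"), ("V3", "V4")]
dg_edges = [("V1", "V2"), ("V3", "V2"), ("V3", "V5")]

ag = naive_detector(rg_edges, dg_edges)
flagged = new_link_counts(ag)
```

In this toy run `V1` keeps the degree pair (1, 1) and is not flagged, while `V3` moves to (1, 3) and is flagged with two new connections.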
Naïve Detector
[Figure: four panels over nodes V1 to V5: Reference Graph, Detection Graph, Merge (RG, DG), and Anomaly Graph with Degree Info; two nodes of the anomaly graph carry the degree pairs (1, 1) and (1, 3). Legend: IP={73.228.152.xxx} (endpoint feature), Salesforce (accessed entity).]
Edges in red: edges only in the detection graph, not in the reference graph.
Edges in blue: edges in both the detection graph and the reference graph.
2nd-Order Connectivity
[Figure: the bipartite graph is self-joined, X.JOIN(X), to produce the connectivity graph. Legend: IP={73.228.152.xxx} (endpoint feature), Salesforce (accessed entity).]
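A small sketch of the self-join, assuming the bipartite graph is a list of (feature value, environment) edges. Joining the edge list with itself on the feature value links every two environments that were reached from the same value. The sample data and the function `self_join` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def self_join(access_edges):
    """Join the bipartite edge list with itself on the feature value.

    access_edges: iterable of (feature_value, env) tuples (assumed layout).
    Returns {(env_a, env_b): set of shared feature values}.
    """
    envs_by_value = defaultdict(set)
    for value, env in access_edges:
        envs_by_value[value].add(env)

    connectivity = defaultdict(set)
    for value, envs in envs_by_value.items():
        for a, b in combinations(sorted(envs), 2):
            connectivity[(a, b)].add(value)
    return dict(connectivity)


# Hypothetical bipartite edges.
bipartite = [
    ("ip_1", "env_a"),
    ("ip_1", "env_b"),
    ("ip_2", "env_b"),
    ("ip_2", "env_c"),
    ("ip_3", "env_d"),
]
cg = self_join(bipartite)
```

`env_d` shares no feature value with any other environment, so it has no edge in the connectivity graph.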
2nd-Order Anomaly Graph
[Figure: endpoint features linked to accessed entities grouped by community; the example nodes have cross-cluster outDegree 1 and cross-cluster outDegree 2.]
2nd-Order Anomaly Detector
Step 1: Self-join RG on the feature of interest (e.g., IP) to get the
env-to-env connectivity graph (CG).
Step 2: Build the Anomaly Graph as in the Naïve Detector algorithm
(1st-order anomalies).
Step 3*: Collapse each cluster of nodes into a single node on the
Anomaly Graph.
Step 4: Run the naïve algorithm on the collapsed graph to get the updated
node degrees and identify 2nd-order anomalies.
*: Clusters are approximated with ConnectedComponents.
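Steps 3 and 4 can be sketched as follows, taking the connectivity-graph edges from Step 1 as given. Clusters are approximated by connected components (computed here with a small union-find rather than GraphX), each environment is replaced by its cluster label, and degrees are recounted per cluster. All names and sample data are hypothetical; an environment missing from the connectivity graph is treated as its own cluster, which is an assumption of this sketch.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Label each node with the smallest node name in its component."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return {n: find(n) for n in nodes}


def cross_cluster_degrees(rg_edges, dg_edges, cluster_of):
    """Return {endpoint: (clusters reached in RG, clusters reached in RG+DG)}."""
    def clusters(edges):
        seen = defaultdict(set)
        for endpoint, env in edges:
            seen[endpoint].add(cluster_of.get(env, env))
        return seen

    ref = clusters(rg_edges)
    merged = clusters(set(rg_edges) | set(dg_edges))
    return {ep: (len(ref.get(ep, ())), len(merged[ep])) for ep, _ in dg_edges}


# Hypothetical connectivity graph (Step 1 output) and access edges.
envs = ["env_a", "env_b", "env_c", "env_d"]
cg_edges = [("env_a", "env_b"), ("env_c", "env_d")]
rg_edges = [("ip_1", "env_a"), ("ip_2", "env_c")]
dg_edges = [("ip_1", "env_b"), ("ip_2", "env_a")]

cluster_of = connected_components(envs, cg_edges)
degrees = cross_cluster_degrees(rg_edges, dg_edges, cluster_of)
second_order = {ep for ep, (a, b) in degrees.items() if b > a}
```

In this toy run `ip_1` reaches a new environment that lies in a cluster it already touched, so it would be a 1st-order anomaly only; `ip_2` crosses into a different cluster and is the one flagged as a 2nd-order anomaly.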
Experiments
• Reference Graph (RG) - Number of vertices: 2,222,613
• RG - Number of edges: 2,156,104
• Connectivity Graph (CG) - Number of vertices: 4,682
• CG - Number of edges: 8,534
• CG - Number of ConnectedComponents: 1,146
• Number of 1st-order anomalies: ~700
• Number of 2nd-order anomalies: ~200
• Computing time: ~5 minutes on a MacBook Air (1.7 GHz Intel Core i7, 8 GB memory)
http://g.recordit.co/TBqvNbqonf.gif
Toolkit for Interactive Analysis
Lightning
Opportunities
• GraphDB for real-time indexing and query
• Probabilistic edges to support complex semantics
• Clustering on probabilistic graph for community detection
References
[Akoglu et al 2015] Akoglu, Leman, Hanghang Tong, and Danai Koutra. "Graph based
anomaly detection and description: a survey." Data Mining and Knowledge Discovery 29.3
(2015): 626-688.
[Ding et al 2012] Ding, Qi, et al. "Intrusion as (anti) social communication:
characterization and detection." Proceedings of the 18th ACM SIGKDD international
conference on Knowledge discovery and data mining. ACM, 2012.
THANK YOU.
hkwong@salesforce.com
pyan@salesforce.com (@pingpingya)
