David Pryce PhD, Wandera
Detecting Mobile Malware
with Apache Spark
#SAISDev9
Summary
• The problem: Mobile-first malware detection
• The data and features
• The Machine Learning (ML) model
• Why Apache Spark?
• Making it production ready
• Data Science @ Wandera
The power of enterprise mobility
• Devices are prone to security threats
• Concerns around appropriate usage
• Data usage costs are opaque and spiraling
• Potentially exposing sensitive data
• Seamless internal communication
• Added flexibility to working hours
• Access to more apps and productivity tools
• E-mail and other services available anywhere
Happy hunting ground for attackers
“Mobile threats can no longer be ignored”
- AUGUST 2017 - GARTNER MARKET GUIDE TO MOBILE THREAT DEFENSE
• 100%: mobile malware growth in 2016
• 435%: high-severity threats (CVSS) growth in 2016
• 80% of organizations experienced a mobile phishing attack
• 38% of hackers bypass endpoint defense using social engineering
Secure Mobile Gateway
ON-DEVICE DETECTION
IN-NETWORK PROTECTION
The Rise of Mobile Malware
Credit: GData 2017
Our objectives: Identify and Classify
MALWARE TYPES: SMS, Ransomware, Spyware, Banker Trojan, Rooting, Adware
Why is this a novel problem?
• Mobile malware is on the rise
• Signature-based detection is no longer scalable or effective
• We needed a solution that could
  ◦ work across both known and unknown threats;
  ◦ effectively protect our customers; and
  ◦ enable threat research to quickly identify new outbreaks
• First solution = signatures and lists
• Our solution = machine learning!
The data …
Good and bad apps
• Source 1: official app stores
• Source 2: seen on our devices
• Source 3: seen by our gateway
+ 3rd-party threat intelligence
• External input verified for labels (supervised learning)
Currently storing: ~2 million labelled apps
… the features …
Baidu 2016
… how we extract them
Direct metadata extraction
• Total unique fields for all apps ~ 500,000
• A typical app ~ 10+ fields
• SPARSE VECTOR
Solution:
• Hashing function (vector to indices)
• Allows for fast retrieval
• With a big enough map (2^20 buckets) to keep clashes rare
• DENSE VECTOR
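
The hashing step above can be sketched in a few lines of Python. This is an illustration only, not the code used in the talk: the field names are hypothetical, MD5 stands in for whatever hash function was actually used, and only the 2^20 map size comes from the slide. Spark's `HashingTF` transformer implements the same idea, though the slides do not say whether it was used.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # map size quoted on the slide


def bucket_index(field: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a metadata field name to a stable index in [0, num_buckets)."""
    # MD5 is an assumption made for this sketch; any stable hash would do.
    digest = hashlib.md5(field.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(fields, num_buckets: int = NUM_BUCKETS) -> dict:
    """Return {index: count} for one app, storing only the non-zero entries."""
    vector = {}
    for field in fields:
        idx = bucket_index(field, num_buckets)
        # Two fields that clash on an index simply add up in the same slot.
        vector[idx] = vector.get(idx, 0.0) + 1.0
    return vector


# Hypothetical metadata fields for one app (not taken from the talk).
app_fields = ["permission:SEND_SMS", "permission:INTERNET", "sdk:example_ads"]
features = hash_features(app_fields)
```

No dictionary of the ~500,000 known fields is needed: a field never seen before still maps to a valid index, which is what makes lookup fast at scoring time.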
#SAISDev9
… and how the Machine Learns
• Selected model = Logistic Regression
  ◦ Models tried: Logistic Regression, SVM, Decision Tree
• K-fold cross validation to select best parameters
• Accuracy: 0.96 
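
As a rough illustration of selecting a parameter by K-fold cross validation, the sketch below trains a small logistic regression in plain Python and picks the regularisation strength with the best mean held-out accuracy. Everything in it is an assumption made for illustration: the synthetic two-feature data, the candidate values, the learning rate and the fold scheme. It does not reproduce the 0.96 figure. In Spark, `CrossValidator` and `LogisticRegression` from `pyspark.ml` cover the same workflow.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(model, x) -> float:
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logreg(rows, labels, reg, epochs=200, lr=0.5):
    """Batch gradient descent on log-loss with an L2 penalty of strength reg."""
    weights, bias, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            err = predict((weights, bias), x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * (g / n + reg * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(model, rows, labels) -> float:
    hits = sum(int((predict(model, x) >= 0.5) == (y == 1)) for x, y in zip(rows, labels))
    return hits / len(rows)


def k_fold_accuracy(rows, labels, reg, k=5) -> float:
    """Mean held-out accuracy; fold f holds out the rows whose index % k == f."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(rows)) if i % k != fold]
        test = [i for i in range(len(rows)) if i % k == fold]
        model = train_logreg([rows[i] for i in train], [labels[i] for i in train], reg)
        scores.append(accuracy(model, [rows[i] for i in test], [labels[i] for i in test]))
    return sum(scores) / k


# Synthetic toy data (an assumption): two noisy clusters, label 1 = "bad app".
rng = random.Random(0)
rows, labels = [], []
for i in range(100):
    y = i % 2
    centre = 1.0 if y == 1 else -1.0
    rows.append([centre + rng.gauss(0, 0.5), centre + rng.gauss(0, 0.5)])
    labels.append(y)

candidates = [0.0, 0.01, 0.1, 1.0]  # hypothetical regularisation grid
cv_scores = {reg: k_fold_accuracy(rows, labels, reg) for reg in candidates}
best_reg = max(cv_scores, key=cv_scores.get)
```

Each candidate is scored only on rows it was not trained on, so `best_reg` is chosen for how well it generalises rather than how well it fits the training data.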

Why Apache Spark?
• Model persistence: PMML paradigm already integrated
• Truly big data: millions of data points, millions of fields
• Ease of use: fast, easy and iterative. From EDA to app in days. Scala and Python API.
• Deployment and scale: from local to cluster is easy!
Wandera 2018
Production Ready?
P.M.M.L
• Predictive Model Markup Language
• Industry standard
• Pro: language-agnostic, REST API, good algorithm coverage
• Con: large file size
Production Ready?
• Saving to PMML (ML vs MLlib / DF vs RDD)
• DataFrame API - doesn’t have PMML functionality (yet)
• Hacked PMML to get probabilities for predictions
• Size of model ~ 20 MB (compressed)
• Overall time to train: less than 2 hours on a big enough cluster
Live Scoring
1. User installs new app
2. Features are extracted and the app is scored
3. If score > 0.9: INVESTIGATE / NOTIFY
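
The scoring rule on this slide can be sketched as follows, assuming the score is the logistic-regression probability computed from hashed features. The weights, bias, feature indices and the "NO ACTION" outcome are hypothetical placeholders; only the 0.9 threshold and the INVESTIGATE / NOTIFY action come from the slide.

```python
import math

ALERT_THRESHOLD = 0.9  # threshold shown on the slide


def malware_probability(features: dict, weights: dict, bias: float) -> float:
    """Logistic-regression score for a sparse {index: value} feature vector."""
    z = bias + sum(weights.get(idx, 0.0) * value for idx, value in features.items())
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def triage(score: float, threshold: float = ALERT_THRESHOLD) -> str:
    """Apply the rule from the slide; the non-alert label is an assumption."""
    return "INVESTIGATE / NOTIFY" if score > threshold else "NO ACTION"


# Hypothetical model parameters and apps, for illustration only.
weights = {3: 2.5, 17: 1.5, 42: -1.0}
bias = -1.0
suspicious_app = {3: 1.0, 17: 1.0}  # z = -1.0 + 2.5 + 1.5 = 3.0
benign_app = {42: 1.0}              # z = -1.0 - 1.0 = -2.0

suspicious_action = triage(malware_probability(suspicious_app, weights, bias))
benign_action = triage(malware_probability(benign_app, weights, bias))
```

Because the comparison is strict, a score of exactly 0.9 does not raise an alert in this sketch.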
Data Science @ Wandera
• Cross-disciplinary team of scientists, analysts & developers
• Focus on solving real-world problems in a real-time, distributed network
• Global team with presence in the USA, the UK (London) and the Czech Republic
= Innovative Research + Scalable Architecture + Efficient Feature Delivery
Thanks for listening
Appendix 1: model testing results
Wandera 2018



