David Pryce, Wandera
Detecting Mobile Malware
with Apache Spark
#DSSAIS12
Summary
• The problem: Mobile-first malware detection
• The data and features
• The Machine Learning (ML) model
• Why Apache Spark?
• Making it production ready
• Data Science @ Wandera
The power of enterprise mobility
• Devices are prone to security threats
• Concerns around appropriate usage
• Data usage costs are opaque and spiraling
• Potentially exposing sensitive data
• Seamless internal communication
• Added flexibility to working hours
• Access to more apps and productivity tools
• E-mail and other services available anywhere
Happy hunting ground for attackers
“Mobile threats can no longer be ignored”
- AUGUST 2017 - GARTNER MARKET GUIDE TO MOBILE THREAT DEFENSE
• 100%: mobile malware growth in 2016
• 435%: high severity threats (CVSS) growth in 2016
• 80% of organizations experienced a mobile phishing attack
• 38% of hackers bypass endpoint defense using social engineering
Introducing the Secure Mobile Gateway
ON-DEVICE DETECTION
IN-NETWORK PROTECTION
The rise of mobile malware
Credit: GData 2017
Our objectives: Identify and Classify
MALWARE TYPES
SMS Ransomware Spyware Banker Trojan Rooting Adware
Why is this a novel problem?
• Mobile malware is on the rise
• Signature-based detection is no longer scalable or effective
• We needed a solution that could
◦ work across both known and unknown threats;
◦ effectively protect our customers; and
◦ enable threat research to quickly identify new outbreaks
• First solution = signatures and lists
• Our solution = machine learning!
The data…
Good and bad apps
• Source 1: official app stores
• Source 2: seen on our devices
• Source 3: seen by our gateway
+
3rd-party threat intelligence
External input verified for labels (supervised learning)
Currently storing: ~2 million labelled apps
… and the features
Baidu 2016
Feature extraction
Direct metadata extraction
• Total unique fields for all apps ~ 500,000
• A typical app ~ 10+ fields
• SPARSE VECTOR
Solution:
• Hashing function (vector to indices)
• Allows for fast retrieval
• With big enough map (2^20) to avoid clashes
• DENSE VECTOR
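The hashing step above can be sketched in plain Python. This is an illustrative sketch, not the talk's implementation: the field names are made up, MD5 stands in for whatever hash function was actually used, and only the map size (2^20) comes from the slide. The result is kept as an index-to-count map, since a typical app touches only a handful of the 2^20 slots.

```python
import hashlib

NUM_BUCKETS = 2 ** 20  # map size quoted on the slide


def bucket(field, num_buckets=NUM_BUCKETS):
    """Map a metadata field name to a stable index in [0, num_buckets)."""
    # MD5 is an assumption chosen for determinism, not for security.
    digest = hashlib.md5(field.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(fields, num_buckets=NUM_BUCKETS):
    """Return {index: count} for the metadata fields present in one app."""
    vec = {}
    for field in fields:
        i = bucket(field, num_buckets)
        # Two fields that collide simply add up in the same slot.
        vec[i] = vec.get(i, 0.0) + 1.0
    return vec


# Hypothetical metadata fields for a single app.
app_fields = [
    "permission:SEND_SMS",
    "permission:INTERNET",
    "activity:MainActivity",
]
vec = hash_features(app_fields)
```

No dictionary of the ~500,000 known fields has to be built or shipped: any field, including one never seen in training, maps straight to an index. Spark ML ships a `HashingTF` transformer that applies the same idea at scale; the slides do not say whether it was used here.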
The Machine Learning model
• Selected model = Logistic Regression
◦ Models tried = (LogReg, SVM, Decision Tree)
• K-fold cross validation to select best parameters
• Accuracy: 0.96
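K-fold cross validation for parameter selection can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic, the logistic regression is a tiny gradient-descent version rather than Spark's, and the tuned parameter (L2 regularisation strength) is a guess, since the talk does not say which parameters were searched.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(rows, labels, reg, epochs=200, lr=0.5):
    """Fit weights (bias stored last) by batch gradient descent with L2 penalty."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1]) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        for j in range(d):
            w[j] -= lr * (grad[j] / n + reg * w[j])
        w[-1] -= lr * grad[-1] / n  # the bias is not regularised
    return w


def accuracy(w, rows, labels):
    hits = 0
    for x, y in zip(rows, labels):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
        hits += int((p >= 0.5) == (y == 1))
    return hits / len(rows)


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(rows, labels, reg, k=5):
    """Mean held-out accuracy over k folds for one candidate value of reg."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        w = train_logreg([rows[i] for i in train], [labels[i] for i in train], reg)
        scores.append(
            accuracy(w, [rows[i] for i in held_out], [labels[i] for i in held_out])
        )
    return sum(scores) / k


# Synthetic two-feature data: only the first feature carries signal.
rng = random.Random(42)
rows, labels = [], []
for _ in range(100):
    y = rng.randint(0, 1)
    rows.append([rng.gauss(2.0 * y - 1.0, 1.0), rng.gauss(0.0, 1.0)])
    labels.append(y)

candidates = [0.0, 0.01, 0.1, 1.0]
cv_scores = {reg: cross_validate(rows, labels, reg) for reg in candidates}
best_reg = max(cv_scores, key=cv_scores.get)
```

Every candidate is scored on the same folds, so differences in `cv_scores` come from the parameter rather than from the split.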

Why Apache Spark?
• Model persistence: PMML paradigm already integrated
• Truly big data: millions of data points, millions of fields
• Ease of use: fast, easy and iterative. From EDA to app in days. Scala and Python API.
• Deployment and scale: from local to cluster is easy!
Wandera 2018
Production ready?
PMML
• Predictive Model Markup Language
• Industry standard
• Pro: language agnostic, REST API, good algorithm coverage
• Con: large file size
Production ready?
• Saving to PMML (ML vs MLlib / DF vs RDD)
• DataFrame API doesn’t have PMML functionality (yet)
• Hacked PMML to get probabilities for predictions
• Size of model ~ 20 MB (compressed)
• Overall time to train: less than 2 hours on a big enough cluster
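The talk does not say how probabilities were recovered from the PMML model. One possible approach, sketched below under stated assumptions, is to read the intercept and coefficients from the PMML `RegressionTable` and apply the logistic function directly. The PMML fragment is hand-written and trimmed for illustration (a real file also carries a header, data dictionary and mining schema), and the field names and coefficient values are hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Hand-written, trimmed PMML fragment; names and numbers are hypothetical.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <RegressionTable intercept="-1.0" targetCategory="1">
      <NumericPredictor name="field_3" coefficient="2.5"/>
      <NumericPredictor name="field_17" coefficient="1.5"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="0"/>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_2"}

root = ET.fromstring(PMML_DOC)
# Pick the regression table for the positive ("malware") class.
table = root.find(".//p:RegressionTable[@targetCategory='1']", NS)
intercept = float(table.get("intercept"))
coefs = {
    node.get("name"): float(node.get("coefficient"))
    for node in table.findall("p:NumericPredictor", NS)
}


def probability(features):
    """Logistic probability of the positive class for {field_name: value}."""
    z = intercept + sum(c * features.get(name, 0.0) for name, c in coefs.items())
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


p = probability({"field_3": 1.0, "field_17": 1.0})
```

Here the linear score is -1.0 + 2.5 + 1.5 = 3.0, so `p` is about 0.95.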
Live scoring
1. User installs new app
2. Extracts features & scores app
3. If score > 0.9: INVESTIGATE / NOTIFY
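The three steps above can be sketched as a scoring function plus a threshold check. Only the 0.9 threshold and the INVESTIGATE / NOTIFY action come from the slide; the weights, the feature indices and the "NO ACTION" label for low scores are hypothetical.

```python
import math

THRESHOLD = 0.9  # from the slide


def score_app(features, weights, bias):
    """Logistic regression score for hashed features {index: value}."""
    z = bias + sum(v * weights.get(i, 0.0) for i, v in features.items())
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def decide(score, threshold=THRESHOLD):
    """Flag the app only when the score is strictly above the threshold."""
    # "NO ACTION" is a placeholder; the slide names only the flagged branch.
    return "INVESTIGATE / NOTIFY" if score > threshold else "NO ACTION"


# Hypothetical model parameters and two hypothetical apps.
weights = {3: 2.5, 17: 1.5}
bias = -1.0

suspicious = score_app({3: 1.0, 17: 1.0}, weights, bias)  # z = 3.0
benign = score_app({17: 1.0}, weights, bias)              # z = 0.5
```

With these numbers `suspicious` is about 0.95 and is flagged, while `benign` is about 0.62 and is not.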
Data Science @ Wandera
• Cross-disciplinary team of scientists, analysts & developers
• Focus on solving real-world problems in a real-time, distributed network
• Global team with presence in the USA, the UK (London) and the Czech Republic
= Innovative Research + Scalable Architecture + Efficient Feature Delivery
Thanks for listening
Appendix 1: model testing results
Wandera 2018
