Lookout on Scaling Security to 100 Million Devices
Richard Ney, Principal Engineer
Over 30 years of experience, predominantly with event pipelines and data
retrieval.
He currently works as a platform architect and principal developer at Lookout Inc.
on the Ingestion Pipeline and Query Services team, working on the next
scale of data ingestion.
■ Provides security scanning for mobile devices for Enterprise and
Consumer markets
■ Founded in 2004 when the original founders discovered a
Bluetooth vulnerability in Nokia phones
■ Showed the need for mobile security with a demonstration
at the 2005 Academy Awards, downloading information from
celebrity phones while 1.5 miles away from the venue
■ Enterprise customers can apply corporate policies
to devices registered in their enterprise
■ To apply these policies, Lookout ingests data about device
configuration and the applications installed on devices
■ Functions as a proxy for all mobile devices in the Lookout fleet
■ Device telemetry is sent at various intervals for these categories
● Software
● Hardware
● Client
● Filesystem
● Configuration
● Binary Manifest
● Risky Configuration
● Personal Content Protection (safe browsing)
● Device Settings
● Device Permissions
● Activation Status
■ Easy to set up and maintain
■ Scaling is easy
■ Cost Effective
■ Simple to handle the Unexpected
■ Some of the components are “single region” (EMR)
■ As the system grows the costs increase significantly (DynamoDB)
■ Limits on Primary Key (PK) and Sort Key (SK) for DynamoDB - Not
designed for time series data
A highly scalable, fault-tolerant streaming framework that processes messages (for
example, device telemetry messages), persists them in a scalable, fault-tolerant
store, and supports operational queries.
Key Requirements:
■ Infrastructure should scale to support 100M devices
■ Cost effective ingestion, storage and querying at this scale
■ Low Latency, High Availability at scale (up/down)
■ Failure handling (no loss of data)
■ Ease of deployment and management
■ A NoSQL database that implements almost all the features of Apache Cassandra
■ Written in C++14 instead of Java to increase performance
■ Uses a shared-nothing approach and the Seastar framework to shard requests by
core - http://seastar.io/
■ Scylla’s close-to-the-hardware design significantly reduces the number of instances
needed
■ Scales out horizontally and is fault tolerant like Apache Cassandra, but delivers 10X the
throughput and consistent, low single-digit millisecond latencies
■ Supports tunable job prioritization to sustain extremely high read and write
throughput (a problem that Cassandra has not solved yet). Throughput is especially high
on instances with NVMe volumes (compared to EBS or other non-NVMe volumes).
■ The amount of storage available for data depends on the compaction strategy
selected.
● Leveled compaction - half of the storage is needed for compaction - not
recommended
● Size-tiered compaction - half of the storage is needed for compaction
● Time-window compaction - depends on the number of tables and record size -
normally around half is needed for compaction
● Incremental compaction (Enterprise Edition) - up to 85% of the storage can be used
for data, so storage needs must be planned carefully
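As a rough illustration of the headroom figures above, the sketch below estimates how much raw disk can actually hold data under each strategy. The fractions are the ones quoted in the list; the node count and per-node capacity are illustrative assumptions rather than figures from the talk, and replication is ignored.

```python
# Fraction of raw disk usable for data under each compaction strategy,
# as quoted in the list above (the remainder is headroom for compaction).
USABLE_FRACTION = {
    "leveled": 0.50,
    "size_tiered": 0.50,
    "time_window": 0.50,   # "normally around half"; varies with tables and record size
    "incremental": 0.85,   # Enterprise Edition only
}


def usable_storage_tb(nodes: int, raw_tb_per_node: float, strategy: str) -> float:
    """Return the cluster capacity (TB) that can hold data, before replication."""
    return nodes * raw_tb_per_node * USABLE_FRACTION[strategy]


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes with 3.8 TB of local NVMe each.
    for name in USABLE_FRACTION:
        print(f"{name:12s} {usable_storage_tb(4, 3.8, name):.2f} TB")
```

Dividing the result by the replication factor gives the unique data the cluster can hold.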
■ May not be a good choice if storage requirements are very large relative to
transaction volume, because compute is wasted on nodes added only for storage.
■ Note that this assumes you do not plan to use low-cost EBS volumes with much reduced
throughput.
■ No FedRAMP-certified version of Scylla Cloud is available today, which requires deploying a
self-managed cluster.
■ No autoscaling support: nodes have to be provisioned and data rebalanced through
scripts/UI.
■ Not suitable for ad-hoc queries or table-scan type queries, and does not support joins.
■ Worker instances are stateless and coordinate
with each other via internal Kafka topics.
■ Kafka Connect automatically detects failures and
rebalances work over the remaining processes.
■ Suitable for streaming data to and from Kafka, but
not for complex operations such as aggregations and
windowing that frameworks like Apache Spark
or Apache Flink support.
■ The maximum number of tasks is limited to the
number of partitions.
■ Exposes a REST API to create, modify and monitor
connectors and tasks.
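The last two points can be sketched together: a connector is created by POSTing a JSON definition to a Connect worker's `/connectors` endpoint, and `tasks.max` is capped at the topic's partition count. This is a minimal sketch, not the configuration used in the talk; the connector name, connector class, topic name and worker URL are hypothetical placeholders, and any connector-specific settings would have to come from that connector's documentation.

```python
import json
import urllib.request


def build_connector_request(name, connector_class, topics, partitions,
                            requested_tasks, extra=None):
    """Build the JSON body for POST /connectors on a Kafka Connect worker."""
    config = {
        "connector.class": connector_class,
        "topics": ",".join(topics),
        # A sink task consumes whole partitions, so tasks beyond the
        # partition count would sit idle.
        "tasks.max": str(min(requested_tasks, partitions)),
    }
    config.update(extra or {})  # connector-specific keys, if any
    return {"name": name, "config": config}


def create_connector(base_url, body):
    """POST the definition to a Connect worker and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url}/connectors",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical names throughout; nothing is sent unless create_connector is called.
    body = build_connector_request(
        "device-settings-sink", "com.example.TelemetrySinkConnector",
        ["device_settings"], partitions=150, requested_tasks=200)
    print(json.dumps(body, indent=2))
    # create_connector("http://localhost:8083", body)
```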
■ Kafka
● 6 Kafka Brokers - R5.xlarge
● 6 Zookeepers - M5.large
● 3 Schema Registries - M5.large
● 6 Kafka Connect Workers - C5.xlarge
● 1 Control Center - M5.2xlarge
● Split over 3 AZs
● Partitions per topic: Loaded Libraries - 120, Device Settings - 150, other topics - 60
■ ScyllaDB
● 4 ScyllaDB instances - I3.4xlarge
● Split over 2 AZs
■ Load
● 12 different device telemetry types emulated
● Messages sent in Apache Avro format
● 14 instances generating load - C5d.4xlarge
■ The default partitioner that comes with Kafka (<murmur2 hash> mod <# partitions>) does
not shard efficiently as the number of partitions grows (approx. 50% of the
partitions were idle).
■ We replaced it with a murmur3 hash fed through a consistent hashing
algorithm (jump hash) to get an even distribution across all partitions (we used Google’s
Guava library). - “A Fast, Minimal Memory, Consistent Hash Algorithm” -
https://arxiv.org/pdf/1406.2294.pdf
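To make the partitioning idea concrete, here is a minimal sketch of jump consistent hash as described in the cited paper, written in plain Python rather than with Guava. It is not the partitioner from the talk: the 64-bit key hash uses BLAKE2b from the standard library as a stand-in for murmur3, and the function names are illustrative.

```python
import hashlib

MASK64 = (1 << 64) - 1


def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: map a 64-bit key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, as in the paper's reference code.
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def key_to_u64(message_key: bytes) -> int:
    # Stand-in for murmur3: any well-mixed 64-bit hash works for this sketch.
    digest = hashlib.blake2b(message_key, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def partition_for(message_key: bytes, num_partitions: int) -> int:
    """Pick a partition for a message key."""
    return jump_hash(key_to_u64(message_key), num_partitions)


if __name__ == "__main__":
    counts = [0] * 150
    for i in range(20_000):
        counts[partition_for(f"device-{i}".encode(), 150)] += 1
    print("min/max messages per partition:", min(counts), max(counts))
```

A useful property of jump hash is that growing from n to n + 1 partitions only moves keys into the new partition; every other key stays where it was.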
■ We emulated approximately 38
million devices generating a total
of 109,668 messages/second, or
394 million messages/hr.
■ On average, a device generated
253 messages/day.
■ We did not expect querying to have
much impact, so it was not included
in the load.
■ The load test ran for 96 hours.
| Telemetry Type | # Device Telemetry emulated/second | # Device Telemetry emulated/day | Avg size in Bytes/Telemetry |
|---|---|---|---|
| Celldata | 760 | 1.72 | 83 |
| Client | 760 | 1.72 | 166 |
| Configuration | 13908 | 31.62 | 396 |
| Device Change | 2280 | 6.91 | 218 |
| Device Permissions | 1520 | 3.45 | 74 |
| Device Settings | 45600 | 103.68 | 75 |
| Hardware | 760 | 1.72 | 254 |
| Loaded Libraries | 38000 | 86.40 | 219 |
| Risk Configuration | 1900 | 5.18 | 261 |
| Software | 760 | 1.72 | 375 |
| Binary | 1520 | 3.45 | 219 |
| File System | 1900 | 5.18 | 219 |
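The headline load figures can be cross-checked against the per-type rows. The sketch below copies the table values and sums them; reading the per-day column as messages per device per day is an assumption, and the byte rate counts only the average payload sizes listed, with no Kafka or Avro framing overhead.

```python
# (messages/second, messages per day, average bytes) per telemetry type,
# copied from the table above. The per-day column is assumed to be per device.
TELEMETRY = {
    "Celldata": (760, 1.72, 83),
    "Client": (760, 1.72, 166),
    "Configuration": (13908, 31.62, 396),
    "Device Change": (2280, 6.91, 218),
    "Device Permissions": (1520, 3.45, 74),
    "Device Settings": (45600, 103.68, 75),
    "Hardware": (760, 1.72, 254),
    "Loaded Libraries": (38000, 86.40, 219),
    "Risk Configuration": (1900, 5.18, 261),
    "Software": (760, 1.72, 375),
    "Binary": (1520, 3.45, 219),
    "File System": (1900, 5.18, 219),
}


def load_totals(table):
    """Return (msgs/second, msgs/hour, msgs/device/day, payload bytes/second)."""
    per_second = sum(rate for rate, _, _ in table.values())
    per_hour = per_second * 3600
    per_device_day = sum(per_day for _, per_day, _ in table.values())
    bytes_per_second = sum(rate * size for rate, _, size in table.values())
    return per_second, per_hour, per_device_day, bytes_per_second


if __name__ == "__main__":
    per_second, per_hour, per_device_day, bytes_per_second = load_totals(TELEMETRY)
    print(f"{per_second:,} messages/second")
    print(f"{per_hour:,} messages/hour")
    print(f"{per_device_day:.2f} messages/device/day")
    print(f"{bytes_per_second / 1e6:.1f} MB/s of payload")
```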
■ Message latency averaged in the
milliseconds unless the
system was overtaxed.
■ Repairs added load and were
generally taxing on the system
(CPU at 100%), but the cluster
continued to function.
■ Latency increased when Kafka
Connect tasks failed (while repairs
were running on ScyllaDB).
■ The ScyllaDB cluster was running near
capacity (CPU between 75% and 90%).
■ Overall, the results were very
positive.
■ Kafka Connect provided a quick and easy way to add new ingestion pipelines.
■ DataMountaineer’s Kafka Connect connector for Cassandra was easier to
implement than the Confluent connector.
■ ScyllaDB CPU shot up during repairs and timeouts occurred. Scylla’s ability to reserve
capacity for maintenance tasks ensured the repairs completed, something not available in
Cassandra.
■ As the complexity of the data ingestion increased, the solution leaned more towards
a custom Kafka → Scylla worker cluster, for debugging and maintenance
reasons.
■ The cost benefits over the current architecture increased significantly as our volume
increased.
■ The cost comparison does not include:
● Query load and associated costs.
● DynamoDB Streams and its equivalent on Scylla, and associated costs.
| # Devices | DynamoDB On Demand ($ Cost/Mo) | DynamoDB Provisioned ($ Cost/Mo) | Scylla ($ Cost/Mo) |
|---|---|---|---|
| 38,000,000 | $304,400.00 | $55,610.00 | $14,564.24 |
| 100,000,000 | $801,052.63 | $146,342.11 | $38,303.95 |

Scylla: +20% Engineer cost (Maintenance)
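To put the cost figures above in relative terms, the sketch below computes how many times more each DynamoDB option costs than Scylla at the same device count, and the monthly cost per device. The dollar amounts are copied from the comparison; the dictionary keys and function names are illustrative, and the "+20% Engineer cost" note is not applied.

```python
# Monthly cost in USD, copied from the cost comparison above.
MONTHLY_COST = {
    38_000_000: {
        "dynamodb_on_demand": 304_400.00,
        "dynamodb_provisioned": 55_610.00,
        "scylla": 14_564.24,
    },
    100_000_000: {
        "dynamodb_on_demand": 801_052.63,
        "dynamodb_provisioned": 146_342.11,
        "scylla": 38_303.95,
    },
}


def cost_ratio(devices: int, option: str) -> float:
    """How many times more the option costs than Scylla at this device count."""
    row = MONTHLY_COST[devices]
    return row[option] / row["scylla"]


def cost_per_device(devices: int, option: str) -> float:
    """Monthly cost per device in USD."""
    return MONTHLY_COST[devices][option] / devices


if __name__ == "__main__":
    for devices in MONTHLY_COST:
        for option in ("dynamodb_on_demand", "dynamodb_provisioned"):
            print(f"{devices:>11,} {option:22s} "
                  f"{cost_ratio(devices, option):5.1f}x Scylla, "
                  f"${cost_per_device(devices, option):.5f}/device/month")
```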
Richard Ney
richard.ney@lookout.com
@rney_home
