1 © Hortonworks Inc. 2011–2018. All rights reserved
Typosquatting Detection
Apache Metron
Casey Stella, Principal Software Engineer & VP, Apache Metron
Mike Miklavcic, Staff Software Engineer & PMC, Apache Metron
Apache Metron: A Cybersecurity Analytics Solution
• Metron provides a scalable advanced security analytics framework to offer a centralized
tool for security monitoring and analysis
• Ultimately, this means that we provide a solution to ingest, enrich and detect anomalies in
disparate data sources
• Metron was initiated at Cisco in 2014 as OpenSOC
• Metron was submitted to the Apache Incubator in December 2015
• Metron graduated to a top-level project in April 2017
Apache Metron: The Stack
• We aggressively use the Hadoop stack for both batch and streaming processing of
data.
• We use
• Apache ZooKeeper for distributed configuration management
• Apache HBase for random access key/value data
• Apache Storm as a stream processing framework
• HDFS for long-term storage of processed data
• Apache Solr and Elasticsearch for low latency querying
• We’ve built a UI to display alerts to security analysts and let the analysts manage
them.
Enrichment by any means necessary
• There are a couple of different ways to enrich, each with its own semantics
Changing           | Scope       | Retrieval Semantics | Ingestion Semantics | Metron Solution
Static/Slow-moving | Event       | Key/Value Lookup    | Batch               | HBase Enrichment
Static/Slow-moving | Event       | Complex             | Batch               | Summarizer
Dynamic            | Event       | Key/Value Lookup    | Streaming           | Streaming HBase Enrichment
Static/Slow-moving | Event       | Complex             | Batch               | Model as a Service
Dynamic            | Multi-event | Complex             | Streaming           | Profiler
Typosquatting
• Typosquatting is a form of attack that uses typos in URLs to trick unsuspecting users into
clicking on links
• Typical types of typos targeted are
• Homoglyphs (l -> 1)
• Subdomains (amazon -> am.az.on)
• Kerning errors (m -> rn)
• This is an extremely common component of phishing and spearphishing attacks.
• Users are sent emails with URLs to click on that are hosted on typosquatted domains
• The users’ credentials and personal information are then harvested
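The three typo families listed above can be sketched in a few lines of Python. This is only an illustration: the substitution tables and function names are assumptions made for the example, not the rules used by any real tool, and the subdomain helper inserts a single dot rather than the multiple dots shown in the amazon example.

```python
# Illustrative substitution tables (assumed for this sketch, far from exhaustive).
HOMOGLYPHS = {"l": ["1"], "o": ["0"], "i": ["1"]}
KERNING = {"m": ["rn"], "w": ["vv"], "d": ["cl"]}


def substitutions(name, table):
    """Replace one character at a time with each look-alike from the table."""
    out = set()
    for i, ch in enumerate(name):
        for repl in table.get(ch, []):
            out.add(name[:i] + repl + name[i + 1:])
    return out


def subdomain_splits(name):
    """Insert a single dot between two characters, e.g. amazon -> am.azon."""
    return {name[:i] + "." + name[i:] for i in range(1, len(name))}


def typo_variants(name):
    """Union of the three typo families for a domain name without its TLD."""
    variants = substitutions(name, HOMOGLYPHS)
    variants |= substitutions(name, KERNING)
    variants |= subdomain_splits(name)
    variants.discard(name)
    return variants


if __name__ == "__main__":
    for variant in sorted(typo_variants("gmail")):
        print(variant)
```

Each helper changes exactly one position, so the number of variants grows with the length of the name and the size of the tables.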
Typosquatting – Approaches
• Typical approaches involve
• generating the set of typosquatted domains from known good domains
• checking incoming traffic against that set of typosquatted domains
• Generally the challenges around potential solutions break down into
• Scale – Even for modest numbers of domains, the number of records can grow quite large. The
Alexa top 10k domains have on the order of 3 million potential typosquatted domains.
• Recency – Strategies involving just reference sets of known good domains can fail to spot targeted
phishing attacks from advanced actors
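A minimal sketch of the generate-then-check approach, assuming a single made-up typo family (dropping one character) and hypothetical function names. It also illustrates both challenges: the figures above work out to on the order of 300 candidates per known good domain, and a domain missing from the known good list produces no candidates at all.

```python
def deletions(name):
    """One simple typo family assumed for illustration: drop a single character."""
    return {name[:i] + name[i + 1:] for i in range(len(name))}


def build_typosquat_set(known_good):
    """Generate candidates from every known good name, excluding the good names."""
    good = set(known_good)
    candidates = set()
    for name in good:
        candidates |= deletions(name)
    return candidates - good


def is_potential_typosquat(domain, candidates):
    """Checking incoming traffic is a plain set-membership test."""
    return domain in candidates


known_good = ["google", "amazon", "github"]
candidates = build_typosquat_set(known_good)

# Scale: 3 million candidates over 10k domains is roughly 300 per domain.
candidates_per_domain = 3_000_000 / 10_000

if __name__ == "__main__":
    print(is_potential_typosquat("gogle", candidates))
    print(is_potential_typosquat("google", candidates))
    print(candidates_per_domain)
```

A name such as a company-internal domain that is not in `known_good` never contributes candidates, which is the recency gap described above.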
Typosquatting – Approaches in Metron
• Generally, this kind of problem is an enrichment in Metron
• We ingest DNS logs
• For each DNS log, we want to tag the domain visited as potentially typosquatted or not by
consulting an enrichment
• There are two distinct ways to handle this style of problem in Metron
• We could treat this as an HBase Enrichment and use Stellar to look up the domain in
HBase
• is_typosquatted := ENRICHMENT_EXISTS('typosquatted_domains', ip_dst_addr, 'enrichments', 't')
Typosquatting – HBase Enrichment
• We could treat this as an HBase Enrichment and use Stellar to look up the domain in
HBase
• Use dnstwist to generate typosquatted domains from regular domains
• Ingest via our enrichment loader into HBase
• is_typosquatted := ENRICHMENT_EXISTS('typosquatted_domains', ip_dst_addr, 'enrichments', 't')
• This has a couple of issues
• Relies upon a pass over the data locally with dnstwist
• Depending on how often we add new domains, the set of typosquatted domains can grow large
• We have to keep this table up-to-date with known good domains to catch the long tail
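The key/value lookup pattern can be illustrated with a plain Python dict standing in for the HBase table. Everything here is an assumption made for the sketch: the key layout, the `domain` field name, and the helper names are hypothetical and do not reflect how Metron stores or queries enrichments.

```python
# A dict stands in for the HBase table; keys mimic (enrichment type, indicator).
# This layout is an assumption for illustration, not Metron's storage format.
table = {}


def load_enrichments(table, enrichment_type, indicators):
    """Stand-in for the batch enrichment loader: one row per indicator."""
    for indicator in indicators:
        table[(enrichment_type, indicator)] = {"source": "generated"}


def enrichment_exists(table, enrichment_type, indicator):
    """Stand-in for the per-message key/value existence check."""
    return (enrichment_type, indicator) in table


def enrich(message, table):
    """Return a copy of a DNS message tagged with is_typosquatted."""
    tagged = dict(message)
    tagged["is_typosquatted"] = enrichment_exists(
        table, "typosquatted_domains", message["domain"]
    )
    return tagged


load_enrichments(table, "typosquatted_domains", ["gogle.com", "amazn.com"])

if __name__ == "__main__":
    print(enrich({"domain": "gogle.com"}, table))
    print(enrich({"domain": "google.com"}, table))
```

The table holds one row per generated domain, so it has to be regenerated and reloaded whenever the list of known good domains changes.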
Typosquatting – Summarizer
• Note that this boils down to a set inclusion check; Bloom filters are made for this
• Are we willing to accept false positives?
• Typosquatting is insufficient as a signal to stand on its own, so the impact of a false positive is
limited
• We can control the false positive rate
• The Summarizer is a utility that we have created that
• Passes over the data either locally or via a MapReduce job
• Applies a transformation (in Stellar) in the map phase
• BLOOM_ADD(filter, DOMAIN_TYPOSQUAT(domain))
• Reduces in the reduce phase
• BLOOM_MERGE(filters)
• Writes the summarized object to HDFS
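The map and reduce phases described above can be sketched with a small Bloom filter in Python. This is not Metron's implementation: the class, the SHA-256 based hashing, and the `drop_one_char` stand-in for DOMAIN_TYPOSQUAT are assumptions for the example. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, which is how the false positive rate p is controlled.

```python
import hashlib
import math
from functools import reduce


class BloomFilter:
    """Minimal Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    @classmethod
    def for_capacity(cls, expected_items, false_positive_rate):
        """Choose bit count m and hash count k for n items at target rate p."""
        m = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        k = max(1, round(m / expected_items * math.log(2)))
        return cls(m, k)

    def _positions(self, item):
        # Derive k bit positions by salting a SHA-256 hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        """Union of two filters built with identical parameters (bitwise OR)."""
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share size and hash count")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def drop_one_char(name):
    """Hypothetical stand-in for DOMAIN_TYPOSQUAT: single-character deletions."""
    return {name[:i] + name[i + 1:] for i in range(len(name))} - {name}


def map_partition(domains, expected_items, false_positive_rate):
    """Map phase: add every typosquat variant of each domain to a local filter."""
    local = BloomFilter.for_capacity(expected_items, false_positive_rate)
    for domain in domains:
        for variant in drop_one_char(domain):
            local.add(variant)
    return local


# Reduce phase: merge the per-partition filters into one summary object.
partitions = [["google", "amazon"], ["github"]]
filters = [map_partition(part, 1000, 0.01) for part in partitions]
summary = reduce(BloomFilter.merge, filters)
```

Merging only works because every partition builds its filter with the same size and hash count, so the bitwise OR of two filters equals the filter that would have been built from the combined input.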
Typosquatting – Summarized Enrichment
• We can now use the Bloom filter that we output from the Summarizer to tag a domain
as typosquatted
• is_typosquatted := BLOOM_EXISTS(OBJECT_GET('/path/to/filter'), ip_dst_addr)
• Furthermore, we can use the DOMAIN_TYPOSQUAT Stellar function with the Profiler to
fill in the gaps
• Capture Bloom filters of typosquatted variants for domains that have been visited more than k
times in the time range of your choice
• Determine if a domain is a typosquatted domain by consulting the reference Bloom filter created
by the Summarizer and the Bloom filter from the Profiler
• In the future, we’re adding a Stream Summary sketch to make this top-k more scalable
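The combination of the batch reference filter and the streaming profile can be sketched as follows. Plain Python sets stand in for the two Bloom filters, and `drop_one_char`, `profile_window`, and the sample domains are hypothetical names and data chosen for the example, not Metron APIs.

```python
from collections import Counter


def drop_one_char(name):
    """Hypothetical stand-in for DOMAIN_TYPOSQUAT: single-character deletions."""
    return {name[:i] + name[i + 1:] for i in range(len(name))} - {name}


def profile_window(visited_domains, k):
    """Typosquat candidates for domains seen more than k times in one window."""
    counts = Counter(visited_domains)
    frequent = {domain for domain, seen in counts.items() if seen > k}
    candidates = set()
    for domain in frequent:
        candidates |= drop_one_char(domain)
    # Frequently visited domains are treated as legitimate, so never flag them.
    return candidates - frequent


def is_typosquatted(domain, reference, profiled):
    """Flag a domain if either the batch reference or the window profile has it."""
    return domain in reference or domain in profiled


# Reference set as the Summarizer might produce it from a known good list.
reference = drop_one_char("google")

# One window of traffic: an internal domain is popular, google is seen twice.
window = ["intranetcorp"] * 5 + ["google"] * 2
profiled = profile_window(window, k=3)

if __name__ == "__main__":
    print(is_typosquatted("intranetcrp", reference, profiled))
    print(is_typosquatted("gogle", reference, profiled))
    print(is_typosquatted("intranetcorp", reference, profiled))
```

The profile covers domains that are popular locally but absent from the reference list, which is the gap the static list alone leaves open.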
DEMO
Questions?
Thank you
Come visit us at the Hortonworks Booth and attend the Cybersecurity Birds of a Feather on Wednesday!

Scalable and adaptable typosquatting detection in Apache Metron


Editor's Notes

  • #2: TALK TRACK Hortonworks Powers the Future of Data: data-in-motion, data-at-rest, and Modern Data Applications. [NEXT SLIDE]
  • #5: Result Screenshot field cloud demo environment