Real time analysis
and visualization
ANUBISNETWORKS LABS
PTCORESEC
1
Agenda
 Who are we?
 AnubisNetworks Stream
 Stream Information Processing
 Adding Valuable Information to Stream Events
2
Who are we?
 Tiago Martins
 AnubisNetworks
 @Gank_101
3
 João Gouveia
 AnubisNetworks
 @jgouv
 Tiago Henriques
 Centralway
 @Balgan
Anubis StreamForce
 Events (lots and lots of events)
 Events are “volatile” by nature
 They exist only if someone is listening
 Remember?:
“If a tree falls in a forest and no one is
around to hear it, does it make a
sound?”
4
Anubis StreamForce
 Enter security Big Data
“a brave new world”
5
[Diagram: the three Vs of Big Data (Volume, Variety, Velocity) as overlapping circles, with "We are here" marking the intersection]
Anubis StreamForce
 Problems (and ambitions) to tackle
 The huge amount and variety of data to process
 Mechanisms to share data across multiple systems,
organizations, teams, companies..
 Common API for dealing with all this (both from a
producer and a consumer perspective)
6
Anubis StreamForce
 Enter the security events CEP - StreamForce
High performance, scalable, Complex Event
Processor (CEP) – 1 node (commodity hw) = 50k
evt/second
Uses streaming technology
Follows a publish / subscribe model
7
Anubis StreamForce
 Data format
Events are published in JSON format
Events are consumed in JSON format
8
Anubis StreamForce
 Yes, we love JSON
9
Anubis StreamForce 10
Sharing Models
[Diagram, slides 11–12: feed sources (Sinkholes, Data-theft Trojans, IP Reputation, Passive DNS, Traps / Honeypots, Twitter, MFE, OpenSource / MailSpike community) feed the Complex Event Processing engine, which publishes Real Time Feeds to Dashboards]
Anubis CyberFeed 13
 Feed galore!
Sinkhole data, traps, IP reputation, etc.
 Bespoke feeds (create your own view)
 Measure, group, correlate, de-duplicate ..
High volume (usually ~6,000 events per second), with more data being added frequently
[Diagram, slide 14: the same feed architecture as before, highlighting event navigation from the Dashboard]
Anubis CyberFeed 15
 Apps (demo time)
Stream Information Processing
 Collecting events from the Stream.
 Generating reports.
 Real time visualization.
16
Challenge
 ~6k events/s and at peak over 10k events/s.
 Let's focus on the trojans feed (banktrojan).
 Peaks @ ~4k events/s
{"_origin":"banktrojan","env":{"server_name":"anam0rph.su","remote_ad
dr":"46.247.141.66","path_info":"/in.php","request_method":"POST","http
_user_agent":"Mozilla/4.0"},"data":"upqchCg4slzHEexq0JyNLlaDqX40G
sCoA3Out1Ah3HaVsQj45YCqGKylXf2Pv81M9JX0","seen":1379956636,"tr
ojanfamily":"Zeus","_provider":"lab","hostn":"lab14","_ts":1379956641}
17
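To make this concrete, here is a minimal sketch of a feed consumer in Node.js. It assumes the stream endpoint (the BruCON endpoint listed at the end of the deck) emits newline-delimited JSON over a long-lived HTTP response; reconnection and backpressure handling are omitted.

```javascript
// Minimal sketch of a CyberFeed consumer (assumption: newline-delimited
// JSON over a long-lived HTTP response).
const http = require('http');

const FEED_URL = 'http://brucon.cyberfeed.net:8080/stream?key=brucon2013';

http.get(FEED_URL, (res) => {
  let buffer = '';
  res.setEncoding('utf8');
  res.on('data', (chunk) => {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep any trailing partial line for the next chunk
    for (const line of lines) {
      if (!line.trim()) continue;
      try {
        const ev = JSON.parse(line);
        if (ev._origin === 'banktrojan') {
          console.log(ev.trojanfamily, ev.env.remote_addr);
        }
      } catch (e) { /* skip malformed lines */ }
    }
  });
});
```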
Challenge 18
Challenge 19
Challenge
 Let's use the Stream to help
 Group by machine and trojan
 From peak ~4k/s to peak ~1k/s
 Filter fields.
 Geo location
 We end up with
{"env":{"remote_addr":"207.215.48.83"},"trojanfamily":"W32Expiro","_geo_env_remote_addr
":{"country_code":"US","country_name":"United States","city":"Los
Angeles","latitude":34.0067,"longitude":-118.3455,"asn":7132,"asn_name":"AS for SBIS-AS"}}
20
Challenge
 How to process and store these events?
21
Technologies 22
 Applications
 NodeJS
 Server-side JavaScript platform.
 V8 JavaScript engine.
 http://nodejs.org/
Why?
 Great for prototyping.
 Fast and scalable.
 Modules for (almost) everything.
Technologies 23
 Databases
 MongoDB
 NoSQL Database.
 Stores JSON-style documents.
 GridFS
 http://www.mongodb.org/
Why?
 JSON from the
Stream, JSON in the
database.
 Fast and scalable.
 Redis
 Key-value storage.
 In-memory dataset.
 http://redis.io/
Why?
 Faster than MongoDB for certain operations, like keeping track of the number of infected machines.
 Very fast and scalable.
Data Collection 24
[Diagram: the Stream feeds the Collector, which distributes events to Workers; Workers write to MongoDB and Redis (storage), and the Processor aggregates the real-time information]
 Applications
 Collector
 Worker
 Processor
 Databases
 MongoDB
 Redis
Data Collection 25
[Diagram: same pipeline as the previous slide]
 Events come from the Stream.
 Collector distributes events to Workers.
 Workers persist event information.
 Processor aggregates information and stores it for statistical and historical analysis.
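The slides don't state how the Collector actually hands events to Workers, so the round-robin below (and the worker.js script name) is an assumption; a sketch using forked child processes:

```javascript
// Sketch of the Collector fan-out over forked Worker processes.
const { fork } = require('child_process');

const NUM_WORKERS = 3;
const workers = [];
for (let i = 0; i < NUM_WORKERS; i++) {
  workers.push(fork('./worker.js')); // hypothetical Worker script
}

let next = 0;
function dispatch(event) {
  // Round-robin. Hashing on event.env.remote_addr instead would pin each
  // machine to one Worker and avoid concurrent updates to the same document.
  workers[next].send(event);
  next = (next + 1) % workers.length;
}
```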
Data Collection 26
[Diagram: same pipeline as the previous slides]
 MongoDB
 Real time information of infected machines.
 Historical aggregated information.
 Redis
 Real time counters of infected machines.
Data Collection - Collector 27
Collector
 Old data is periodically removed, e.g. machines that don't produce events for more than 24 hours.
 Sends events to Workers.
 Decrements counters of removed information.
 Sends warnings
 Country / ASN is no longer infected.
 Botnet X decreased Y % of its size.
Data Collection - Worker 28
Worker
 Creates new entries for unseen machines.
 Adds information about new trojans / domains.
 Updates the last time the machine was seen.
 Processes events and updates the Redis counters accordingly.
 Needs to check MongoDB to determine if:
 New entry – all counters incremented
 Existing entry – increment only the counters related to that trojan
 Sends warnings
 Botnet X increased Y % in size.
 New infections seen on Country / ASN.
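A sketch of the Worker's new-versus-existing logic described above, assuming the official mongodb and redis npm clients, events already reduced to the shape shown on slide 20, and the trojan / trojan:country / country counter keys from the Redis slide; the collection name is illustrative.

```javascript
// Sketch only: error handling and batching omitted.
async function handleEvent(db, redis, ev) {
  const machines = db.collection('machines');
  const ip = ev.env.remote_addr;
  const trojan = ev.trojanfamily;
  const cc = ev._geo_env_remote_addr.country_code;

  const existing = await machines.findOne({ ip });
  if (!existing) {
    // New entry: create the machine document.
    await machines.insertOne({
      ip, trojan: [trojan], created: new Date(), last: new Date()
    });
    await redis.incr(cc); // per-country counter: first infection for this IP
  } else {
    // Existing entry: record the (possibly new) trojan, refresh "last".
    await machines.updateOne(
      { ip },
      { $addToSet: { trojan }, $set: { last: new Date() } }
    );
  }
  if (!existing || !existing.trojan.includes(trojan)) {
    // Increment only the counters related to this trojan.
    await redis.incr(trojan);            // per-trojan counter
    await redis.incr(`${trojan}:${cc}`); // trojan:country counter
  }
}
```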
Data Collection - Processor
Processor
29
 Processor retrieves real time counters from Redis.
 Information is processed by:
 Botnet;
 ASN;
 Country;
 Botnet/Country;
 Botnet/ASN/Country;
 Total.
 Persisting information to MongoDB creates a historic
database of counters that can be queried and
analyzed.
Data Collection - MongoDB
 Collection for active machines in the last 24h
{
  "city" : "Philippine",
  "country" : "PH",
  "region" : "N/A",
  "geo" : { "lat" : 16.4499, "lng" : 120.5499 },
  "created" : ISODate("2013-09-21T00:19:12.227Z"),
  "domains" : [
    { "domain" : "hzmksreiuojy.nl",
      "trojan" : "zeus",
      "last" : ISODate("2013-09-21T09:42:56.799Z"),
      "created" : ISODate("2013-09-21T00:19:12.227Z") }
  ],
  "host" : "112.202.37.72.pldt.net",
  "ip" : "112.202.37.72",
  "ip_numeric" : 1892296008,
  "asn" : "Philippine Long Distance Telephone Company",
  "asn_code" : 9299,
  "last" : ISODate("2013-09-21T09:42:56.799Z"),
  "trojan" : [ "zeus" ]
}
30
Data Collection - MongoDB
 Collection for aggregated information (the historic counters database)
{
  "_id" : ObjectId("519c0abac1172e813c004ac3"),
  "0" : 744,
  "1" : 745,
  "3" : 748,
  "4" : 748,
  "5" : 746,
  "6" : 745,
  ...
  "10" : 745,
  "11" : 742,
  "12" : 746,
  "13" : 750,
  "14" : 753,
  ...
  "metadata" : {
    "country" : "CH",
    "date" : "2013-05-22T00:00:00+0000",
    "trojan" : "conficker_b",
    "type" : "daily"
  }
}
31
Entries for each hour are preallocated when the document is created.
If we don't preallocate, MongoDB keeps extending the documents, adding
thousands of entries every hour, and it becomes very slow.
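A sketch of that preallocation, assuming one document per (trojan, country, day) as in the example above; updating a preallocated hour key is then an in-place $set rather than a document grow.

```javascript
// Build a daily counters document with all 24 hour slots preallocated to 0.
function newDailyDoc(trojan, country, dayIso) {
  const doc = {
    metadata: { country, trojan, type: 'daily', date: dayIso }
  };
  for (let h = 0; h < 24; h++) doc[String(h)] = 0; // preallocate "0".."23"
  return doc;
}

// Later, writing an hour's counter is an in-place update, e.g.:
//   db.collection('counters').updateOne({ 'metadata.country': 'CH', ... },
//                                       { $set: { '14': 753 } })
```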
Data Collection - MongoDB
 Collection for 24 hours
 4 MongoDB shard instances
 >3 million infected machines
 ~2 GB of data
 ~558 bytes per document.
 Indexes by
 ip – helps inserts and updates.
 ip_numeric – enables queries by CIDRs.
 last – faster removes for expired machines.
 host – Hmm, is there any .gov? 
 country, family, asn – speeds up MongoDB queries and also allows faster custom queries.
 Collection for aggregated information
 Data for 119 days (25 May to 11 July)
 >18 million entries
 ~6.5 GB of data
 ~366 bytes per object
 ~56 MB per day
 Indexes by
 metadata.country
 metadata.trojan
 metadata.date
 metadata.asn
 metadata.type, metadata.country, metadata.date, met....... (all)
32
Data Collection - Redis
 Counters by Trojan / Country
"cutwailbt:RO": "1256",
"rbot:LA": "3",
"tdss:NP": "114",
"unknown4adapt:IR": "100",
"unknownaff:EE": "0",
"cutwail:CM": "20",
"unknownhrat3:NZ": "56",
"cutwailbt:PR": "191",
"shylock:NO": "1",
"unknownpws:BO": "3",
"unknowndgaxx:CY": "77",
"fbhijack:GH": "22",
"pushbot:IE": "2",
"carufax:US": "424“
 Counters by Trojan
"unknownwindcrat": "18",
"tdss": "79530",
"unknownsu2": "2735",
"unknowndga9": "15",
"unknowndga3": "17",
"ircbot": "19874",
"jshijack": "35570",
"adware": "294341",
"zeus": "1032890",
"jadtre": "40557",
"w32almanahe": "13435",
"festi": "1412",
"qakbot": "19907",
"cutwailbt": "38308“
 Counters by Country
"BY": "11158",
"NA": "314",
"BW": "326",
"AS": "35",
"AG": "94",
"GG": "43",
"ID": "142648",
"MQ": "194",
"IQ": "16142",
"TH": "105429",
"MY": "35410",
"MA": "15278",
"BG": "15086",
"PL": "27384”
33
Data Collection - Redis
 Redis performance on our machine
 SET: 473036.88 requests per second
 GET: 456412.59 requests per second
 INCR: 461787.12 requests per second
 Time to get real time data
 Getting all the data from Families/ASN/Counters to the NodeJS application, ready to be processed, takes around half a second
 >120,000 entries in… (very fast..)
 Our current usage is
 ~3% CPU (of a 2.0 GHz core)
 ~480 MB of RAM
34
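A sketch of pulling all the counters into Node in bulk, assuming the node-redis v4 client; SCAN keeps the server responsive where a bare KEYS over >120,000 entries would block it (the deck doesn't say which the original code used).

```javascript
const { createClient } = require('redis');

// Dump every counter key/value into a plain object.
async function dumpCounters() {
  const client = createClient();
  await client.connect();
  const counters = {};
  for await (const key of client.scanIterator({ MATCH: '*', COUNT: 1000 })) {
    counters[key] = Number(await client.get(key));
  }
  await client.quit();
  return counters;
}
```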
Data Collection - API
 But! There is one more application..
 How to easily retrieve stored data
 MongoDB's REST API is a bit limited.
 NodeJS HTTP + MongoDB + Redis
 Redis
 http://<host>/counters_countries
 ...
 MongoDB
 http://<host>/family_country
 ...
 Custom MongoDB queries
 http://<host>/ips?f.ip_numeric=95.68.149.0/22
 http://<host>/ips?f.country=PT
 http://<host>/ips?f.host=bgovb
35
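The ip_numeric field is what turns the CIDR endpoint above into a plain range query. A sketch of the conversion (the helper names are ours):

```javascript
// Dotted-quad IPv4 address -> 32-bit integer.
function ipToNumeric(ip) {
  return ip.split('.').reduce((n, octet) => n * 256 + Number(octet), 0);
}

// CIDR -> MongoDB range condition on ip_numeric.
function cidrToRange(cidr) {
  const [base, bits] = cidr.split('/');
  const start = ipToNumeric(base);
  const size = 2 ** (32 - Number(bits)); // addresses covered by the prefix
  return { $gte: start, $lt: start + size };
}

// e.g. db.collection('machines')
//        .find({ ip_numeric: cidrToRange('95.68.149.0/22') })
```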
Data Collection - Limitations
 Grouping information by machine and trojan doesn't allow us to study the real number of events per machine.
 That number can be useful to get an idea of the botnet operations or how many machines are behind a single IP (everyone is behind a router).
 Slow MongoDB impacts everything
 The Worker application needs to tolerate a slow MongoDB and discard some information as a last resort.
 Beware of slow disks! Data persistence occurs every 60 seconds (default) and can take too much time, having a real impact on performance..
 >10s to persist is usually very bad; something is wrong with the hard drives..
36
Data Collection - Evolution
 Warnings
 Which warnings to send? When? Thresholds?
 Aggregate data by week, month, year.
 Aggregate information in shorter intervals.
 Data Mining algorithms applied to all the collected information.
 Apply same principles to other feeds of the Stream.
 Spam
 Twitter
 Etc..
37
Reports
 What's happening in country X?
 What about network 192.168.0.1/24?
 Can you send me the report of Y every day at 7 am?
 Ohh!! Remember the report I asked for last week?
 Can I get a report for ASN AnubisNetwork?
38
Reports 39
Server
 HTTP API
 Schedule
 Get
 Edit
 Delete
 List schedules
 List reports
Generator
 Check MongoDB for work.
 Generate a CSV report or store the JSON document for later querying.
 Send an email with a link to the files when the report is ready.
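A sketch of the Generator loop, assuming the mongodb driver's GridFSBucket for file storage; the schedules/machines collection names and the CSV shape are illustrative.

```javascript
const { GridFSBucket } = require('mongodb');

async function runDueReports(db) {
  // Check MongoDB for scheduled work that is due.
  const due = await db.collection('schedules')
    .find({ active: true, run_at: { $lte: new Date() } })
    .toArray();

  for (const work of due) {
    // Illustrative country-scoped query, as in the "Portugal Trojans" example.
    const rows = await db.collection('machines')
      .find({ country: work.country }).toArray();
    const csv = rows
      .map((m) => `${m.ip},${(m.trojan || []).join(';')}`)
      .join('\n');

    // Store the generated file in GridFS.
    const bucket = new GridFSBucket(db);
    bucket.openUploadStream(`report-${work._id}.csv`).end(Buffer.from(csv));
    // ...then record the new report _id on the schedule and e-mail a link.
  }
}
```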
Reports – MongoDB CSVs
 Scheduled Report
{
  "__v" : 0,
  "_id" : ObjectId("51d64e6d5e8fd0d145000008"),
  "active" : true,
  "asn_code" : "",
  "country" : "PT",
  "desc" : "Portugal Trojans",
  "emails" : "",
  "range" : "",
  "repeat" : true,
  "reports" : [
    ObjectId("51d64e7037571bd24500000d"),
    ObjectId("51d741e8bcb161366600000c"),
    ObjectId("51d89367bcb161366600005f"),
    ObjectId("51d9e4f9bcb16136660000ca"),
    ObjectId("51db3678c3a15fc577000038"),
    ObjectId("51dc87e216eea97c20000007"),
    ObjectId("51ddd964a89164643b000001")
  ],
  "run_at" : ISODate("2013-07-11T22:00:00Z"),
  "scheduled_date" : ISODate("2013-07-05T04:41:17.067Z")
}
 Report
{
  "__v" : 0,
  "_id" : ObjectId("51d89367bcb161366600005f"),
  "date" : ISODate("2013-07-06T22:00:07.015Z"),
  "files" : [
    ObjectId("51d89368bcb1613666000060")
  ],
  "work" : ObjectId("51d64e6d5e8fd0d145000008")
}
 Files
 Each report has an array of files that represents the report.
 Each file is stored in GridFS.
40
Reports – MongoDB JSONs
 Scheduled Report
{
  "__v" : 0,
  "_id" : ObjectId("51d64e6d5e8fd0d145000008"),
  "active" : true,
  "asn_code" : "",
  "country" : "PT",
  "desc" : "Portugal Trojans",
  "emails" : "",
  "range" : "",
  "repeat" : true,
  "snapshots" : [
    ObjectId("521f761c0a45c3b00b000001"),
    ObjectId("521fb0848275044d420d392f"),
    ObjectId("52207c2f7c53a8494f010afa"),
    ObjectId("5221c9df4910ba3874000001"),
    ObjectId("522275724910ba3874001f66"),
    ObjectId("5223c6f24910ba3874003b7a"),
    ObjectId("522518734910ba3874005763")
  ],
  "run_at" : ISODate("2013-07-11T22:00:00Z"),
  "scheduled_date" : ISODate("2013-07-05T04:41:17.067Z")
}
 Snapshot
{
  "_id" : ObjectId("51d89367bcb161366600005f"),
  "date" : ISODate("2013-07-06T22:00:07.015Z"),
  "work" : ObjectId("521f761c0a45c3b00b000001"),
  "count" : 123
}
 Results
{
  "machine" : {
    "trojan" : [ "conficker_b" ],
    "ip" : "2.80.2.53",
    "host" : "Bl19-1-13.dsl.telepac.pt"
  },
  …
  "metadata" : {
    "work" : ObjectId("521f837647b8d3ba7d000001"),
    "snaptshot" : ObjectId("521f837aa669d0b87d000001"),
    "date" : ISODate("2013-08-29T00:00:00Z")
  }
}
41
Reports – Evolution
 Other report formats.
 Charts?
 Other types of reports (not only botnets).
 Need to evolve the Collector first.
42
Globe
 How to visualize real time events from the stream?
 Where are the botnets located?
 Who's the most infected?
 How many infections?
43
Globe – Stream
 origin = banktrojan
 Modules
 Group
 trojanfamily
 _geo_env_remote_addr.country_name
 grouptime=5000
 Geo
 Filter fields
 trojanfamily
 Geolocation
 _geo_env_remote_addr.l*
 KPI
 trojanfamily
 _geo_env_remote_addr.country_name
 kpilimit = 10
44
[Diagram: Stream → NodeJS → Browser]
 Request botnets from stream
Globe – NodeJS 45
[Diagram: Stream → NodeJS → Browser]
 NodeJS
 HTTP
 Get JSON from the Stream.
 Socket.IO
 Multiple protocol support (to bypass some proxies and handle old browsers).
 Redis
 Get the real time number of infected machines.
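A sketch of the relay, assuming socket.io and reusing the newline-delimited JSON consumer pattern from earlier; the 'infection' event name is ours.

```javascript
const http = require('http');
const { Server } = require('socket.io');

const io = new Server(3000, { cors: { origin: '*' } }); // browsers connect here

http.get('http://brucon.cyberfeed.net:8080/stream?key=brucon2013', (res) => {
  let buf = '';
  res.setEncoding('utf8');
  res.on('data', (chunk) => {
    buf += chunk;
    const lines = buf.split('\n');
    buf = lines.pop(); // keep the trailing partial line
    for (const line of lines) {
      try {
        const ev = JSON.parse(line);
        // Fan each grouped, geo-tagged event out to every connected globe.
        io.emit('infection', {
          trojan: ev.trojanfamily,
          geo: ev._geo_env_remote_addr
        });
      } catch (e) { /* ignore partial or malformed lines */ }
    }
  });
});
```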
Globe – Browser 46
[Diagram: Stream → NodeJS → Browser]
 Browser
 Socket.IO Client
 Real time apps.
 Websockets and other types of transport.
 WebGL
 ThreeJS
 Tween
 jQuery
 WebWorkers
 Run in the background.
 Where to place the red dots?
 Calculations from geolocation to 3D point go here.
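The geolocation-to-3D-point step is the usual spherical-to-Cartesian conversion; a sketch for a globe of radius r (axis conventions vary between WebGL scenes):

```javascript
// Map latitude/longitude (degrees) to a point on a sphere of radius r.
function latLngToVec3(lat, lng, r) {
  const phi = (90 - lat) * Math.PI / 180;    // polar angle from the +Y pole
  const theta = (lng + 180) * Math.PI / 180; // azimuth around the Y axis
  return {
    x: -r * Math.sin(phi) * Math.cos(theta),
    y:  r * Math.cos(phi),
    z:  r * Math.sin(phi) * Math.sin(theta)
  };
}

// e.g. latLngToVec3(34.0067, -118.3455, 200) places a red dot on Los Angeles.
```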
Globe – Evolution
 Some kind of HUD to get better interaction and notifications.
 Request actions by clicking in the globe.
 Generate report of infected in that area.
 Request operations in a specific area.
 Real time warnings
 New Infections
 Other types of warnings...
47
Adding Valuable Information to
Stream Events
 How to distribute workload to other machines?
 Adding value to the information we already have.
48
Minions
 Typically the operations that would add value are expensive in terms of resources
 CPU
 Bandwidth
 A master-slave approach that distributes work among distributed slaves we call Minions.
49
[Diagram: a Master dispatching work to four Minions]
Minions 50
 Master receives work from Requesters and stores the work in MongoDB.
 Minions request work.
 Requesters receive real time information on the work from the Master, or they can ask for work information at a later time.
[Diagram: Requesters submit work to the Master, which stores it in MongoDB; distributed Minions (e.g. DNS, Scan) pull and process the work]
Minions
 Master has an API that allows custom Requesters to ask for work and monitor the work.
 Minions have a modular architecture
 Easily create a custom module.
 Information received from the Minions can then be processed by the Requesters and
 Sent to the Stream
 Saved on the database
 Update existing database
51
[Diagram: a Minion with pluggable modules such as DNS, Scanning, and Data Mining]
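A sketch of what such a module might look like, assuming a convention where each module exports a name and a run(task) function; the task shape is illustrative, and only the DNS resolution is a real Node API.

```javascript
// A Minion module for the "DNS" work type.
const dns = require('dns').promises;

module.exports = {
  name: 'dns',
  async run(task) {
    // task = { domain: '...' } as handed out by the Master (assumed shape).
    const addresses = await dns.resolve4(task.domain);
    return { domain: task.domain, addresses };
  }
};
```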
Extras...
 So what else could we possibly do using the Stream?
 Distributed Portscanning
 Distributed DNS Resolutions
 Transmit images
 Transmit videos
 Realtime tools
 Data agnostic. Throw stuff at it and it will deal with it.
52
Extras... 53
[Repeat of the previous slide, with FOCUS callouts highlighting items in the list]
Portscanning
 Portscanning done right…
 It's not only about your portscanner being able to throw 1 billion packets per second.
 Location = reliability of scans.
 A distributed system for portscanning is much better. But it's not just about having it distributed; it's about optimizing what it scans.
54
Portscanning 55
Portscanning 56
Portscanning 57
Portscanning

Hosts up per target range, by scanning location:

| Target range | Australia (intervolve) | China (ChinaVPShosting) | Russia (NQHost) | USA (Ramnode) | Portugal (Zon PT) |
|---|---|---|---|---|---|
| 41.63.160.0/19 (Angola) | 0 hosts up | 0 hosts up | 0 hosts up | 0 hosts up | 3 hosts up (sometimes) |
| 5.1.96.0/21 (China) | 10 hosts up | 70 hosts up | 40 hosts up | 10 hosts up | 40 hosts up |
| 41.78.72.0/22 (Somalia) | 0 hosts up | 0 hosts up | 0 hosts up | 0 hosts up | 33 hosts up |
| 92.102.229.0/24 (Russia) | 20 hosts up | 100 hosts up | 2 hosts up | 2 hosts up | 150 hosts up |

58
Portscanning problems...
 Doing portscanning correctly brings along certain problems.
 If you are not HD Moore or Dan Kaminsky, resource-wise you are gonna have a bad time
59
Portscanning problems... 60
[Repeat of the previous slide]
Portscanning problems...
 Doing portscanning correctly brings along certain problems.
 If you are not HD Moore or Dan Kaminsky, resource-wise you are gonna have a bad time
 You need lots of minions in different parts of the world
 Doesn't actually require an amazing CPU or RAM if you do it correctly.
 Storing all that data...
 Querying that data...
Is it possible to have a cheap, distributed portscanning
system?
61
Portscanning problems... 62
Minion
Portscanning 63
Data…. 64
Data 65
Internet status... 66
Internet status... 67
If we're doing it... Anyone else can.
Evil side?
68
Anubis StreamForce
 Have cool ideas? Contact us
 Access for Brucon participants:
API Endpoint: http://brucon.cyberfeed.net:8080/stream?key=brucon2013
 Web UI Dashboard maker: http://brucon.cyberfeed.net:8080/webgui
69
Lol
 Last minute testing
70
Questions? 71
Editor's Notes
  • #5: Internet scale. Devices, systems, firewalls, IDS..
  • #7: Internet scale. Devices, systems, firewalls, IDS..
  • #8: Internet scale. Devices, systems, firewalls, IDS..
  • #9: Internet scale. Devices, systems, firewalls, IDS..
  • #10: Internet scale. Devices, systems, firewalls, IDS..
  • #11: Internet scale. Devices, systems, firewalls, IDS..
  • #14: Internet scale. Devices, systems, firewalls, IDS..
  • #16: Internet scale. Devices, systems, firewalls, IDS..
  • #17: Hi, I’m going to present the next section of the presentation. So, how can we collect events from the Stream? What information can we gather from those events? How can we access those events in real time?
  • #18: The challenge here is the large number of events per second. In total we currently have over 6,000 events per second; 4,000 of these events are from a single feed called banktrojans, which is basically formed by infected machines. This is what an event from those machines looks like.
  • #19: So, basically this is what we see..
  • #20: And this is what we want. We want to know where our targets are, where to look.
  • #21: Infected machines are usually noisy and they tend to produce a big number of events. We can use the Stream to help us: the group module groups the events that occur within 4 minutes of each other and originate from the same machine and trojan, so we can go from 4,000 to 1,000 events per second. Basically we receive an event for a machine and trojan, and the next events will not be received because they are considered duplicates. Then we have the filter module to filter the fields we need; for example, we only care about the IP address, ASN, trojan, C&C domain and geolocation of the machine. How do we process and store these 1,000 events per second?
  • #22: (Same note as #21.)
  • #23: First, some technical information about the technologies we use. For application development, we use NodeJS, a server-side JavaScript platform built on top of the V8 engine. It’s fast, scalable and has modules for almost everything. For data storage, MongoDB is a NoSQL database that is fast and scalable. It can also store JSON-style documents and files in GridFS. And then we have Redis, key-value storage that is very fast and also scalable.
  • #24: (Same note as #23.)
  • #25: This is an overview of the Data Collection. We built 3 applications: Collector, Worker, Processor. We have the events coming from the Stream to the Collector. The Collector then distributes the workload to Workers that process and store the information in MongoDB and Redis. The Processor will then gather information from MongoDB and store it for statistical and historical analysis.
  • #26: Events come from the Stream to the Collector. The Collector then distributes the workload to Workers that process and store the information in MongoDB and Redis. The Processor will then gather information from Redis and store it in MongoDB for statistical and historical analysis.
  • #27: (Same note as #26.)
  • #28: So the Collector talks to these 3 components. It maintains the information in MongoDB, removing information about machines that don’t produce events for more than 24 hours. It decrements counters in Redis, and while maintaining this information it is possible to send warnings. Workers receive events from the Collector and can run on any machine with a connection to the Collector and database..
  • #29: The Worker processes and stores the event in MongoDB, creating new entries or updating information about new trojans in existing entries. It also updates the last time we saw an event for that machine. While updating MongoDB, the Worker also needs to maintain the Redis counter information, incrementing the values for new entries or updating counters for a new trojan on a seen machine. While performing this task it can also work out whether there is a warning to be sent.
  • #30: The last component is the Processor. It retrieves real time counters from Redis, processes them and stores them in MongoDB aggregated by Botnet, ASN, Country, etc. This information can then be analysed and queried.
  • #31: Let’s now check the databases. The MongoDB collection that stores information on machines active in the last 24 hours looks like this. It’s a JSON document with information about geolocation, IP address, trojans, last time seen, etc. There is also a numerical representation of the IP address that helps to query for specific network ranges.
  • #32: The aggregated information collection holds documents with this format. The metadata field holds information about the specific document, its type and the origin of the information; in this case it’s country and trojan. It has an entry per hour with the number of infections. These entries need to be preallocated with zeros, so every day a new document is created for a specific metadata with all the hours at 0. If we don’t do this there will be a lot of document extends in MongoDB and it will become very slow.
  • #33: Some more numbers for these collections. The 24-hour collection is sharded across 4 MongoDB instances, and in July it held information on over 3 million infected machines while taking only 2 GB of disk. The aggregated information collected over 119 days had over 18 million entries and occupied around 6.5 GB, roughly 56 MB per day. These were the indexes we created. We need to be very careful with them, because they speed up reads but slow down writes. We want fast writes on the 24-hour collection, so we keep its indexes optimized: only the IP index is built in the foreground, and all the others are built in the background. For the aggregated collection we don't need to be as careful, and can add whatever indexes make our queries faster.
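  Roughly, in mongo-shell terms (the indexed fields are illustrative; the slide has the real list):

    // Sketch: foreground index on IP, everything else in the background,
    // to keep writes on the hot 24-hour collection fast.
    db.machines.createIndex({ ip: 1 })                               // foreground
    db.machines.createIndex({ lastSeen: 1 }, { background: true })
    db.machines.createIndex({ "geo.country_code": 1 }, { background: true })
    db.aggregates.createIndex({ "metadata.type": 1, day: 1 })        // read-heavy, index freely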
  • #34: Let's look at the Redis information. The counters are keys built by concatenating strings with colons, as in the example on the slide.
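  The concrete key format is on the slide; a plausible shape, purely as an illustration:

    // Hypothetical counter keys, built by joining fields with colons:
    //   botnet:<trojanfamily>:<country>   e.g.  botnet:Zeus:US
    //   botnet:<trojanfamily>:<asn>       e.g.  botnet:Zeus:7132
    const key = ['botnet', evt.trojanfamily, geo.country_code].join(':');
    await counters.incr(key);   // inside the Worker's async event handler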
  • #35: Redis is very fast: we can retrieve all the information from the largest of these counter sets in around half a second, and inserting data is also very fast while using very few machine resources.
  • #36: We also needed on-demand access to all this information, so we created an API that allows retrieving and querying information from both Redis and MongoDB.
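  A thin sketch of such an API using Express (routes and names are hypothetical; machines and counters are the connected clients from the earlier sketches):

    const express = require('express');
    const app = express();

    // Read a raw Redis counter by key.
    app.get('/counters/:key', async (req, res) => {
      res.json({ key: req.params.key, value: await counters.get(req.params.key) });
    });

    // Query recently active machines for a country from MongoDB.
    app.get('/machines/:country', async (req, res) => {
      res.json(await machines.find({ 'geo.country_code': req.params.country })
                             .limit(100).toArray());
    });

    app.listen(8080);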
  • #37: There are a couple of limitations to this approach. By grouping events to reduce the events-per-second rate, we discard information that could be studied to better understand what is behind those machines; for example, the number of events from a machine with a specific botnet could indicate how many machines sit on that network (everyone has a router nowadays). MongoDB can also impact everything: it is fast, but it needs to be used carefully. We need 3 MongoDB shards to keep performance at acceptable levels; if we start receiving 2 or 3 times the events we currently have, the Workers won't be able to persist all that information in time and will eventually have to discard some of it. The alternative to discarding is adding more shards. You also need to constantly monitor your hard drives: if their performance decreases, bad things happen, since Mongo won't be able to persist the information in time and will slow everything down.
  • #38: How can we evolve this solution? We could send more warnings with the information we have, but when, and with what thresholds? We only aggregate information by hour and day; what about weeks, months, and years, or shorter intervals? We could also apply data-mining algorithms to extract more information from the data we already collect. And, of course, apply these principles to other feeds such as Spam or Twitter.
  • #39: So how do we extract information about a specific network or country? And what about what happened last week?
  • #40: Of course we used NodeJS and built two applications: one serves as an API to access and request reports, and the other checks the database for requests, generates the reports, and stores them. Reports are saved in CSV or JSON format for later querying; they are also announced by email, which includes a URL to download the files.
  • #41: The collections that hold the CSV reports look like this. There is a scheduled-work collection keeping a record of the report being generated and the reports already generated. Each report keeps an array of the files generated and saved in MongoDB's storage for files, called GridFS.
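  A sketch of saving a generated CSV into GridFS and linking it to the report document (collection and field names are assumptions):

    // Sketch: store a generated CSV report in GridFS, then record the file id.
    const fs = require('fs');
    const { GridFSBucket, ObjectId } = require('mongodb');

    async function saveReport(db, reportId, path) {
      const bucket = new GridFSBucket(db);
      const upload = bucket.openUploadStream(path);
      fs.createReadStream(path).pipe(upload);
      await new Promise((ok, err) => upload.on('finish', ok).on('error', err));
      // Append the new file to the report's files array.
      await db.collection('reports').updateOne(
        { _id: new ObjectId(reportId) },
        { $push: { files: upload.id } }
      );
    }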
  • #42: Then we have the JSON reports, which we call snapshots. The main differences are the count field, which holds the number of infected machines in that snapshot, and the results for the snapshot, which include the information about each machine plus the metadata identifying the origin of each entry. We could store an array of results inside the snapshot document itself, but it would be hard to use: it could hold millions of entries and would be effectively useless.
  • #43: How could we evolve the reports? We could store reports in other formats, generate charts with specific information for each report, and start storing other types of reports, not just botnet ones.
  • #44: So, how can we visualize events in real time? Focusing on the botnets again, it would be awesome to see the distribution of botnets across the world, receive warnings, and monitor other information in real time. For that purpose, there is a shiny globe (demo). We can watch in real time as infected machines produce events, and monitor the most-infected countries for a specific trojan, the number of events generated every second, and the total number of infections.
  • #45: This information comes from the Stream, grouped by trojan and country. We don't really want to send ALL the events to the browser, because some browsers would simply crash, so we also filter down to just the geolocation and trojan family. The information about the top infected comes from a KPI module that dynamically calculates the ranking on the stream.
  • #46: Between the Stream and the browser sits a NodeJS application that controls the flow of events, discarding them if too many arrive and relaying the rest to the browser using the socket.io module. We also need to get the total number of infected machines from the Redis counters.
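  A sketch of that relay with a simple per-second budget (the budget value and event names are hypothetical):

    // Sketch: relay stream events to browsers, dropping the excess.
    const io = require('socket.io')(3000);

    let budget = 500;                          // max events forwarded per second
    setInterval(() => { budget = 500; }, 1000);

    function onStreamEvent(evt) {
      if (budget-- <= 0) return;               // discard when browsers can't keep up
      // Forward only what the globe needs: geolocation and trojan family.
      io.emit('infection', { geo: evt._geo_env_remote_addr, trojan: evt.trojanfamily });
    }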
  • #47: At the browser end we use the socket.io client to receive the events, process them using WebWorkers (calculating where to place the dots), and render everything using WebGL.
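  In outline, on the browser side (the worker script and renderer hook are hypothetical names):

    // Sketch (browser): receive events, offload dot placement to a WebWorker.
    const socket = io('http://localhost:3000');
    const worker = new Worker('project.js');   // converts lat/long to globe coordinates

    socket.on('infection', evt => worker.postMessage(evt.geo));
    worker.onmessage = msg => addDotToGlobe(msg.data);  // hand off to the WebGL renderer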
  • #48: We can evolve the globe into a more interactive experience where actions can be performed in real time through the globe itself. We can also show warnings on the globe, for example about new infections.
  • #49: How can we add valuable information to the data we already have?
  • #50: Typically, the operations that add value are expensive: they need CPU and bandwidth. So we needed a master–slave approach that distributes the work among multiple slaves, which we call Minions.
  • #51: Masters receive work from the Requesters and store it in MongoDB. Minions then request work and send the results back to the Master, which forwards updates directly to the Requester and also stores the results in MongoDB.
  • #52: The Master has an API that lets custom Requesters ask for work and monitor the results received from the Minions. The Minion application was built with a modular architecture in mind, so it is very easy to create a custom module. The information a Minion gathers can then be injected into the stream or stored in a database.
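  The module interface isn't shown on the slides; a plausible shape, purely as a sketch, might be:

    // Sketch: the kind of interface a custom Minion module might expose (hypothetical).
    module.exports = {
      name: 'portscan',
      // Receives a work item from the Master; returns the result to send back.
      run: async function (work) {
        const openPorts = await scan(work.target, work.ports);  // module-specific logic
        return { target: work.target, openPorts };
      }
    };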
  • #53: Getting the full picture of an infected machine or network involves many steps: sinkholing the botnet; port-scanning the target, which gives you an idea of whether the machine is connected directly to the internet or sits behind a gateway, whether there are shares available, and how it could possibly have been compromised (MS08-067?); and DNS analysis.
  • #54: We are going to focus on port scanning, DNS resolutions, and real-time demos.
  • #58: It's really cool to have a super-fast scanner in a lab pushing a quadrillion packets per second, but that is the wrong way to do it. The correct way is a slow, geo-distributed scan: scanning Angola from Australia makes around 60% of services time out and look closed, and scanning the USA from Russia (or vice versa) is just as unreliable.
  • #63: Combining a Model B Raspberry Pi with the PwnPi distro and a custom set of scripts turns it into a Minion: a cheap device we can use for distributed scanning, and one we can even ask others to deploy to contribute to our system. In the near future we intend to make this image available to others who want to contribute.