OpenSOC
The Open Security Operations Center for
Analyzing 1.2 Million Network Packets per Second in Real Time

James Sirota
Big Data Architect, Cisco Security Solutions Practice
jsirota@cisco.com

Sheetal Dolas
Principal Architect, Hortonworks
sheetal@hortonworks.com

June 3, 2014
2
Over Next Few Minutes
• Problem Statement & Business Case for OpenSOC
• Solution Architecture and Design
• Best Practices and Lessons Learned
• Q & A
3
Business Case
4
“… fatalism: It's no longer if or when you get hacked, but the assumption is that you've already been hacked, with a focus on minimizing the damage.”
Source: Dark Reading, “Security’s New Reality: Assume The Worst”
5
Breaches Happen in Hours… But Go Undetected for Months or Even Years

Timespan of events by percent of breaches

                                           Seconds  Minutes    Hours     Days    Weeks   Months    Years
Initial Attack to Initial Compromise           10%      75%      12%       2%       0%       1%       1%
Initial Compromise to Data Exfiltration         8%      38%      14%      25%       8%       8%       0%
Initial Compromise to Discovery                 0%       0%       2%      13%      29%      54%       2%
Discovery to Containment/Restoration            0%       1%       9%      32%      38%      17%       4%

• In 60% of breaches, data is stolen within hours (8% in seconds + 38% in minutes + 14% in hours)
• 54% of breaches are not discovered for months

Source: 2013 Data Breach Investigations Report
6
Cisco Global Cloud Index
Source: 2014 Cisco Global Cloud Index
7
Introducing OpenSOC
Intersection of Big Data and Security Analytics

Big Data Platform (Hadoop):
• Multi Petabyte Storage
• Interactive Query
• Real-Time Search
• Scalable Stream Processing
• Unstructured Data
• Data Access Control
• Scalable Compute

OpenSOC:
• Real-Time Alerts
• Anomaly Detection
• Data Correlation
• Rules and Reports
• Predictive Modeling
• UI and Applications
8
OpenSOC Journey
• Sept 2013: First Prototype
• Dec 2013: Hortonworks joins the project
• March 2014: Platform development finished
• April 2014: First beta test at customer site
• May 2014: CR Work off
• Sept 2014: General Availability
9
Solution Architecture &
Design
10
OpenSOC Conceptual Architecture
Sources:
• Raw Network Stream
• Network Metadata Stream
• Netflow
• Syslog
• Raw Application Logs
• Other Streaming Telemetry
Processing:
• Parse + Format
• Enrich (Threat Intelligence Feeds, Enrichment Data)
• Alert
Storage:
• HBase: Raw Packet Store
• Hive: Long-Term Store
• Elastic Search: Real-Time Index
Applications + Analyst Tools:
• Network Packet Mining and PCAP Reconstruction
• Log Mining and Analytics
• Big Data Exploration, Predictive Modeling
11
Key Functional Capabilities
• Raw Network Packet Capture, Store, Traffic Reconstruction
• Telemetry Ingest, Enrichment and Real-Time Rules-Based Alerts
• Real-Time Telemetry Search and Cross-Telemetry Matching
• Automated Reports, Anomaly Detection and Anomaly Alerts
12
The OpenSOC Advantage
• Fully Backed by Cisco and Used Internally for Multiple Customers
• Free, Open Source and Apache Licensed
• Built on Highly Scalable and Proven Platforms (Hadoop, Kafka, Storm)
• Extensible and Pluggable Design
• Flexible Deployment Model (On-Premise or Cloud)
• Centralize Your Processes, People and Data
13
OpenSOC Deployment at Cisco
Hardware footprint (40U)
• 14 Data Nodes (UCS C240 M3)
• 3 Cluster Control Nodes (UCS C220 M3)
• 2 ESX Hypervisor Hosts (UCS C220 M3)
• 1 PCAP Processor (UCS C220 M3 + Napatech NIC)
• 2 SourceFire Threat Alert Processors
• 1 Anue Network Traffic Splitter
• 1 Router
• 1 48-Port 10GE Switch
Software Stack
• HDP 2.1
• Kafka 0.8
• Elastic Search 1.1
• MySQL 5.5
14
OpenSOC - Stitching Things Together
Source Systems:
• Passive Tap, Traffic Replicator
• Telemetry Sources: Syslog, HTTP, File System, Other
Data Collection:
• PCAP
• Flume: Agent A, Agent B, Agent N
Messaging System (Kafka):
• PCAP Topic, DPI Topic, A Topic, B Topic, N Topic
Real Time Processing (Storm):
• PCAP Topology, DPI Topology, A Topology, B Topology, N Topology
Storage:
• Hive: Raw Data (ORC)
• Elastic Search: Index
• HBase: PCAP Table
Access:
• Analytic Tools: R / Python, Power Pivot, Tableau
• Web Services: Search, PCAP Reconstruction
15
OpenSOC - Stitching Things Together
(Same diagram as slide 14, with a "Deeper Look" callout on the PCAP Topology and DPI Topology.)
16
PCAP Topology
Real Time Processing (Storm):
• Kafka Spout, Parser Bolt, HDFS Bolt, HBase Bolt, ES Bolt
Storage:
• Hive: Raw Data (ORC)
• HBase: PCAP Table
• Elastic Search: Index
17
DPI Topology & Telemetry Enrichment
Real Time Processing (Storm):
• Kafka Spout, Parser Bolt, GEO Enrich, Whois Enrich, CIF Enrich, HDFS Bolt, ES Bolt
Storage:
• Hive: Raw Data (ORC)
• HBase: PCAP Table
• Elastic Search: Index
18
Enrichments
Stages: Parser Bolt, GEO Enrich, Who Is Enrich, CIF Enrich
Enrichment sources (each behind a cache):
• Geo Lite Data (MySQL)
• Who Is Data (HBase)
• CIF Data (HBase)

RAW Message:
  {
    "msg_key1": "msg value1",
    "src_ip": "10.20.30.40",
    "dest_ip": "20.30.40.50",
    "domain": "mydomain.com"
  }

Fields added to form the Enriched Message:
  "geo": [ {
    "region": "CA",
    "postalCode": "95134",
    "areaCode": "408",
    "metroCode": "807",
    "longitude": -121.946,
    "latitude": 37.425,
    "locId": 4522,
    "city": "San Jose",
    "country": "US"
  } ],
  "whois": [ {
    "OrgId": "CISCOS",
    "Parent": "NET-144-0-0-0-0",
    "OrgAbuseName": "Cisco Systems Inc",
    "RegDate": "1991-01-171991-01-17",
    "OrgName": "Cisco Systems",
    "Address": "170 West Tasman Drive",
    "NetType": "Direct Assignment"
  } ],
  "cif": "Yes"
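
The enrichment flow on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not OpenSOC code: `GEO_DB`, `WHOIS_DB` and `CIF_DB` are hypothetical in-memory stand-ins for the MySQL and HBase sources, `functools.lru_cache` stands in for the caches, and the choices to geo-enrich `dest_ip` and to emit `"No"` for unflagged domains are assumptions the slide does not state.

```python
from functools import lru_cache

# Hypothetical stand-ins for the Geo Lite (MySQL), Who Is (HBase) and
# CIF (HBase) enrichment sources. Real lookups would query those stores.
GEO_DB = {"20.30.40.50": {"city": "San Jose", "region": "CA", "country": "US"}}
WHOIS_DB = {"mydomain.com": {"OrgId": "CISCOS", "OrgName": "Cisco Systems"}}
CIF_DB = {"mydomain.com"}  # domains flagged by threat intelligence


@lru_cache(maxsize=100_000)
def geo_lookup(ip):
    """Cached geo lookup; returns None when the IP is unknown."""
    return GEO_DB.get(ip)


@lru_cache(maxsize=100_000)
def whois_lookup(domain):
    """Cached whois lookup; returns None when the domain is unknown."""
    return WHOIS_DB.get(domain)


def enrich(message):
    """Return a copy of the parsed message with geo, whois and cif added."""
    enriched = dict(message)
    geo = geo_lookup(message.get("dest_ip"))  # assumption: enrich dest_ip
    if geo is not None:
        enriched["geo"] = [geo]
    whois = whois_lookup(message.get("domain"))
    if whois is not None:
        enriched["whois"] = [whois]
    # Assumption: "No" marks a domain that is absent from the CIF data.
    enriched["cif"] = "Yes" if message.get("domain") in CIF_DB else "No"
    return enriched
```

Caching matters here because the same IPs and domains recur across many tuples, so most lookups never reach the backing store.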
19
Applications: Telemetry Matching and DPI
Step 1: Search
Step 2: Match
Step 3: Analyze
Step 4: Build PCAP
20
Integration with Analytics Tools
Dashboards Reports
21
Best Practices
and
Lessons Learned
22
Journey Towards Highly
Scalable Application
23
Kafka Tuning
24
This is where we began
25
Some code optimizations and increased
parallelism
26
Kafka Tuning
• Kafka is disk I/O heavy
• Kafka 0.8+ supports replication and JBOD
  • JBOD gives better performance than RAID
• Parallelism is largely driven by the number of disks and partitions per topic
• Key configuration parameters:
  • num.io.threads: keep it at least equal to the number of disks provided to Kafka
  • num.network.threads: adjust it based on the number of concurrent producers and consumers and the replication factor
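
As a rough sketch of the sizing rule above, the helper below renders broker settings for a JBOD layout with one log directory per disk. The function name, the example paths and the network thread count are illustrative assumptions; `log.dirs`, `num.io.threads` and `num.network.threads` are the actual Kafka broker property names.

```python
def kafka_broker_overrides(data_dirs, network_threads):
    """Render broker settings for a JBOD layout, one log dir per disk.

    num.io.threads is set to the number of disks, which is the floor the
    slide recommends; raise it if profiling shows the disks are idle.
    """
    if not data_dirs:
        raise ValueError("at least one data directory is required")
    settings = {
        "log.dirs": ",".join(data_dirs),
        "num.io.threads": len(data_dirs),
        "num.network.threads": network_threads,
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    # Hypothetical mount points; substitute the disks given to Kafka.
    print(kafka_broker_overrides(["/data/1/kafka", "/data/2/kafka"], 4))
```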
27
After Kafka Tuning
28
Bottleneck Isolation, Resource Profiling,
Load Balancing
29
HBase Tuning
30
This is where we began
31
Row Key Design
• Row key design is critical (gets, scans, or both?)
• Keys with IP addresses in dotted-decimal form
  • The first key character is 1 or 2 for most addresses (every first octet from 100 to 255), so keys distribute poorly
  • Key length varies: minimum 7 characters, maximum 15, with a typical average of 12
  • Subnet range scans become difficult: keys sort as strings, so a scan from 90 to 220 excludes 112
• IP converted to hex (10.20.30.40 => 0a141e28)
  • Gives 16 variations of the first key character
  • Consistently 8-character key
  • Easy to search for subnet ranges
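
The hex conversion above is easy to check with a short Python sketch. The function names are illustrative, not taken from OpenSOC, and the inclusive stop key in `subnet_scan_range` is an assumption about how a scan would be bounded.

```python
import ipaddress


def ip_to_hex_key(ip):
    """Convert a dotted-decimal IPv4 address to a fixed 8-character hex key."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a dotted-decimal IPv4 address: {ip!r}")
    return "".join(f"{o:02x}" for o in octets)


def subnet_scan_range(cidr):
    """Return (first_key, last_key) covering every address in the subnet."""
    net = ipaddress.ip_network(cidr)
    return (
        ip_to_hex_key(str(net.network_address)),
        ip_to_hex_key(str(net.broadcast_address)),
    )


if __name__ == "__main__":
    print(ip_to_hex_key("10.20.30.40"))
    # Dotted-decimal strings sort character by character, so 112.x sorts
    # before 90.x. The hex keys sort in true numeric order.
    ips = ["90.0.0.1", "112.0.0.1", "220.0.0.1"]
    print(sorted(ips))
    print(sorted(ip_to_hex_key(ip) for ip in ips))
    print(subnet_scan_range("10.20.0.0/16"))
```

Because every key has the same width, string order and numeric order agree, which is what makes a subnet a single contiguous scan range.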
32
Experiments with Row Key
33
Region Splits
• Know your data
• Auto-split under high workload can result in hotspots and split storms
• Understand your data and pre-split the regions
• Identify how many regions a RegionServer (RS) can host and still perform optimally, using this formula:
  (RS memory) * (total memstore fraction) / ((memstore size) * (# column families))
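
The formula above can be worked through in Python. The numbers in the example (a 16 GB RegionServer heap, a 0.4 memstore fraction, a 128 MB memstore size) are assumed values for illustration, not figures from the talk. The `presplit_points` helper is also an assumption: it spaces split keys evenly over 8-character hex row keys, which only suits data that is roughly uniform across the key space.

```python
def max_regions_per_regionserver(rs_memory_mb, memstore_fraction,
                                 memstore_size_mb, column_families):
    """Apply the slide's formula and round down to whole regions."""
    if memstore_size_mb <= 0 or column_families <= 0:
        raise ValueError("memstore size and column families must be positive")
    return int(rs_memory_mb * memstore_fraction
               / (memstore_size_mb * column_families))


def presplit_points(num_regions):
    """Evenly spaced split keys for 8-character hex row keys.

    Assumes keys are spread uniformly over the 32-bit space; skewed data
    needs split points derived from the observed key distribution.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    step = 2**32 // num_regions
    return [f"{i * step:08x}" for i in range(1, num_regions)]


if __name__ == "__main__":
    # 16384 MB * 0.4 / (128 MB * 1 column family) = 51.2, so 51 regions.
    print(max_regions_per_regionserver(16384, 0.4, 128, 1))
    print(presplit_points(4))
```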
34
With Region Pre-Splits
35
Know Your Application
• Enable micro-batching (client-side buffer)
• Use smart shuffle/grouping in Storm
• Understand your data and situationally exploit the various WAL options
• Watch for many minor compactions
• For write-heavy workloads, increase hbase.hstore.blockingStoreFiles (we used 200)
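
Micro-batching with a client-side buffer can be sketched as follows. `BufferedWriter` and its `sink` callable are hypothetical names for this illustration; in a real bolt the sink would be whatever call sends a batch of puts to HBase, and the batch size of 1000 is an arbitrary default.

```python
class BufferedWriter:
    """Collect puts client-side and hand them to the sink in batches."""

    def __init__(self, sink, batch_size=1000):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._sink = sink              # callable that receives a list of puts
        self._batch_size = batch_size
        self._buffer = []

    def put(self, row_key, value):
        """Buffer one put; flush automatically once the batch is full."""
        self._buffer.append((row_key, value))
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send any buffered puts as a single batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._sink(batch)


if __name__ == "__main__":
    batches = []
    writer = BufferedWriter(batches.append, batch_size=3)
    for i in range(7):
        writer.put(f"{i:08x}", b"packet")
    writer.flush()  # flush the partial batch, e.g. on a timer or at shutdown
    print([len(batch) for batch in batches])
```

The trade-off is durability: puts still sitting in the buffer are lost if the worker dies, so the final flush and the replay strategy deserve as much attention as the batch size.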
36
And Finally
37
Kafka Spout
38
Kafka Spout
• Parallelism is controlled by the number of partitions per topic
• Set Kafka spout parallelism equal to the number of partitions in the topic
• Other key parameters that drive performance:
  • fetchSizeBytes
  • bufferSizeBytes
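
To see why spout parallelism should match the partition count, the sketch below models a simple round-robin assignment of partitions to spout tasks. This is a simplified model for illustration; the spout's real assignment logic may differ in detail.

```python
def partitions_for_task(task_index, total_tasks, num_partitions):
    """Partitions read by one spout task under round-robin assignment."""
    return list(range(task_index, num_partitions, total_tasks))


def assignment(total_tasks, num_partitions):
    """Map every task index to the partitions it would read."""
    return {task: partitions_for_task(task, total_tasks, num_partitions)
            for task in range(total_tasks)}


if __name__ == "__main__":
    print(assignment(8, 8))    # one partition per task
    print(assignment(4, 8))    # each task reads two partitions
    print(assignment(12, 8))   # tasks 8 to 11 have nothing to read
```

Fewer tasks than partitions makes each task read several partitions serially, while extra tasks sit idle and only consume executor slots.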
39
Mysteriously Missing Data
40
Mysteriously Missing Data: Root Cause
• A bug in the Kafka spout caused it to miss some partitions and lose data
• It is now fixed and available from the Hortonworks repository (http://repo.hortonworks.com/content/repositories/releases/org/apache/storm/storm-Kafka)
41
Storm
42
Storm
• Every small thing counts at scale
• Even simple string operations can slow down throughput when executed on millions of tuples
43
Storm
• Error handling is critical
• Poorly handled errors can lead to topology failure and eventually loss of data (or data duplication)
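
One way to make the error-handling point concrete is to separate errors that can never succeed from errors worth retrying. The sketch below is an assumed policy, not the approach described in the talk: `RecordingCollector` is a hypothetical stand-in for a bolt's output collector, and treating `ValueError` as "malformed input" is an illustrative convention.

```python
class RecordingCollector:
    """Hypothetical stand-in for a bolt's collector; records each outcome."""

    def __init__(self):
        self.acked, self.failed, self.errors = [], [], []

    def ack(self, tup):
        self.acked.append(tup)

    def fail(self, tup):
        self.failed.append(tup)

    def report_error(self, tup, exc):
        self.errors.append((tup, exc))


def safe_execute(tup, process, collector):
    """Run process(tup) without letting an exception kill the worker."""
    try:
        process(tup)
    except ValueError as exc:
        # Malformed input will never succeed: report it and ack, so the
        # tuple is not replayed forever (which would duplicate work).
        collector.report_error(tup, exc)
        collector.ack(tup)
    except Exception:
        # Possibly transient (timeouts, busy stores): fail to request replay.
        collector.fail(tup)
    else:
        collector.ack(tup)
```

Acking unparseable tuples avoids endless replays, while failing transient errors avoids silent data loss. Which exceptions belong in which group is application specific.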
44
Storm
• Tune and scale individual spouts and bolts before performance testing or tuning the entire topology
• Write your own simple data generator spouts and no-op bolts
• Making as many things configurable as possible helps a lot
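
The idea of a data generator spout and a no-op bolt can be sketched outside Storm as a plain benchmark harness. Everything here is illustrative: the generator produces synthetic tuples, `noop` gives the baseline cost of the harness itself, and `parse_stage` is a hypothetical stage under test.

```python
import time


def generate_tuples(count):
    """Synthetic tuple source, playing the role of a data generator spout."""
    for i in range(count):
        yield {"id": i, "src_ip": f"10.0.{(i >> 8) & 255}.{i & 255}"}


def noop(tup):
    """No-op stage: measures the harness overhead alone."""
    return None


def parse_stage(tup):
    """Hypothetical stage under test: a simple string operation per tuple."""
    return tup["src_ip"].split(".")


def measure_throughput(stage, count=100_000):
    """Return (tuples processed, tuples per second) for a single stage."""
    start = time.perf_counter()
    processed = 0
    for tup in generate_tuples(count):
        stage(tup)
        processed += 1
    elapsed = time.perf_counter() - start
    rate = processed / elapsed if elapsed > 0 else float("inf")
    return processed, rate


if __name__ == "__main__":
    for stage in (noop, parse_stage):
        processed, rate = measure_throughput(stage)
        print(f"{stage.__name__}: {processed} tuples, {rate:,.0f} tuples/s")
```

Comparing a stage against the no-op baseline isolates its own cost before it is wired into a full topology.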
45
Lessons Learned
• When it comes to Hadoop… partner up
• Separate the hype from the opportunity
• Start small, then scale up
• Design iteratively
• It doesn’t work unless you have proven it at scale
• Keep an eye on ROI
46
How can you contribute?
• Technology Partner Program: contribute developers to join the Cisco and Hortonworks team
Looking for Community Partners
Cisco + Hortonworks + Community Support for OpenSOC
Thank you!
We are hiring:
jsirota@cisco.com
sheetal@hortonworks.com



Editor's Notes

  • #34, #36: In Storm, group tuples by HBase region so that each HBase bolt receives data mostly for one or two regions, which minimizes RegionServer (RS) round trips. In DoS attack situations, where the actual packets are very small (20-60 bytes) and individual packets are not critical for analysis, skip the WAL.
  • #43, #44, #45: Frequent minor compactions reduce the overall throughput of the system. For write-heavy workloads, reduce the frequency of minor compactions by increasing hbase.hstore.blockingStoreFiles (we used 200).
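
The region-aware grouping mentioned in the notes can be sketched with a sorted list of region split keys. This is an assumed illustration, not the OpenSOC grouping code: the split points, the function names and the modulo mapping from region to bolt task are all hypothetical.

```python
import bisect


def region_index(row_key, split_points):
    """Index of the region holding row_key, given sorted split keys.

    Region i covers keys from split_points[i - 1] (inclusive) up to
    split_points[i] (exclusive), so a key equal to a split point belongs
    to the region that starts at that key.
    """
    return bisect.bisect_right(split_points, row_key)


def task_for_key(row_key, split_points, num_tasks):
    """Route a tuple so that each bolt task sees only a few regions."""
    return region_index(row_key, split_points) % num_tasks


if __name__ == "__main__":
    splits = ["40000000", "80000000", "c0000000"]  # four regions
    for key in ["0a141e28", "40000000", "70000001", "dc000001"]:
        print(key, region_index(key, splits), task_for_key(key, splits, 2))
```

With as many tasks as regions, each HBase bolt writes to exactly one region, which lets a client-side buffer flush to a single RegionServer.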