NetFlow data processing in a large organization
using Hadoop and Vertica, a bit about
architecture and performance hints
June, 2017
Josef Niedermeier and Gabriela Aumayer
Agenda
● Intro
● NetFlow overview
● Logical view of the processing
● Processing implementation
● NetFlow datagram processing
● Performance hints for Hadoop
NetFlow overview
● NetFlow is a network protocol for collecting IP traffic information by routers, layer 3 switches and firewalls that support it.
● The most popular versions are v5 (IPv4 only) and v9 (supports IPv6).
● Traffic information is packed into so-called flows, which are unidirectional. Example flow record:

src IP   dst IP   src port  dst port  Protocol  Packets  Octets  First                    Last                     Flags
x.x.x.x  y.y.y.y  34552     22        6 (TCP)   12       4563    2017-03-31 12:24:01.032  2017-03-31 12:25:34.342  2 (SYN)
● A flow is exported when the router determines that it has finished.
● Flows are exported in batches via UDP or SCTP, collected by collectors, stored in log files covering fixed time periods (about 5 minutes), and processed by an analytical application.
[Diagram: NetFlow (UDP/SCTP) -> NetFlow Collector -> NetFlow Log -> Analytical application]
Logical View of Processing
[Diagram: processing pipeline. Labels: Detail NetFlow Log (x3), Filtering, De-duplication, Flows stitching, Aggregation of Edges by Time Period, Aggregated Edges, Aggregation by Host, Hosts Data, DNS/DHCP, Graph Properties]
Query examples:
● Computers that changed behavior recently?
● Computers that replied to an external scanning computer?
● Exact start of communication between X and Y?
Legend: query example, data, processing component
Processing - implementation
[Diagram: the same pipeline as in the logical view (Detail NetFlow Log, Filtering, De-duplication, Flows stitching, Aggregation of Edges by Time Period, Aggregated Edges, Aggregation by Host, Hosts Data, DNS/DHCP, Graph Properties), split between two platforms: MapReduce framework (Hadoop) and columnar store (Vertica)]
Processing – implementation - integration
[Diagram: integration of the components. Labels:
● Collector (x2), NetFlow Log (x2)
● Hadoop; HDFS – long term storage
● Batch Processing: Filtering, De-duplication, Stitching, Edges Aggregation
● Vertica; Detail queries; Mid Term Storage
● Spark; Batch Processing: Graph Properties, Flow Properties
● Machine Learning: Anomaly Detection, Classification
● REST API (LiftWEB); UI; Work-flow Automation
● SQL; Security Analyst]
NetFlow Datagrams processing
● Data written directly to HDFS
● Avoids unnecessary IO
● Can be done in parallel
● Hadoop SequenceFileInputFormat
● Efficient from storage space point of view
● Binary
● Supports compression
● Efficient serialization/deserialization
● Parsing is done by mappers in parallel
NetFlow Datagrams processing - architecture
[Diagram: Router -> NetFlow (UDP) -> Collector (UDP server, two Writers) -> HDFS -> Mapper/Parser (x3) in Hadoop]

Component      Role
UDP server     Listens on a port, checks each incoming datagram, adds the router IP, and passes it to the ring buffer.
Writer         Takes datagram data from the ring buffer and writes it to the distributed file system (HDFS).
Mapper/parser  Parses each datagram, separates the flows, converts timestamps to durations, and emits each flow for further processing.
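
The mapper/parser role can be sketched in plain Java. This is a minimal illustration, not the authors' mapper: it assumes NetFlow v5 only (24-byte header followed by 48-byte records, as in the published v5 export format), and the names `Flow` and `NetFlowV5Parser` are hypothetical. It separates the flows in one datagram and converts the router-relative First/Last timestamps into an absolute start time and a duration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical flow bean; field names are illustrative, not taken from the slides. */
final class Flow {
    int srcAddr, dstAddr;    // IPv4 addresses as raw 32-bit values
    int srcPort, dstPort;    // 0..65535
    int protocol, tcpFlags;  // e.g. protocol 6 = TCP, flag bit 0x02 = SYN
    long packets, octets;    // unsigned 32-bit counters
    long startMillis;        // absolute start time, epoch milliseconds
    long durationMillis;     // Last - First
}

final class NetFlowV5Parser {
    static final int HEADER_LEN = 24;
    static final int RECORD_LEN = 48;

    /** Splits one NetFlow v5 datagram into its flow records. */
    static List<Flow> parse(byte[] datagram) {
        if (datagram.length < HEADER_LEN) {
            throw new IllegalArgumentException("datagram shorter than v5 header");
        }
        // ByteBuffer is big-endian by default, which matches network byte order.
        ByteBuffer buf = ByteBuffer.wrap(datagram);
        int version = buf.getShort(0) & 0xFFFF;
        int count = buf.getShort(2) & 0xFFFF;
        if (version != 5 || datagram.length < HEADER_LEN + count * RECORD_LEN) {
            throw new IllegalArgumentException("not a complete NetFlow v5 datagram");
        }
        long sysUptime = buf.getInt(4) & 0xFFFFFFFFL;  // ms since router boot
        long unixSecs = buf.getInt(8) & 0xFFFFFFFFL;   // export time, seconds
        long unixNanos = buf.getInt(12) & 0xFFFFFFFFL; // export time, residual ns
        long exportMillis = unixSecs * 1000 + unixNanos / 1_000_000;
        long bootMillis = exportMillis - sysUptime;    // ignores uptime wrap-around

        List<Flow> flows = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int o = HEADER_LEN + i * RECORD_LEN;
            long first = buf.getInt(o + 24) & 0xFFFFFFFFL; // uptime at flow start
            long last = buf.getInt(o + 28) & 0xFFFFFFFFL;  // uptime at last packet
            Flow f = new Flow();
            f.srcAddr = buf.getInt(o);
            f.dstAddr = buf.getInt(o + 4);
            f.packets = buf.getInt(o + 16) & 0xFFFFFFFFL;
            f.octets = buf.getInt(o + 20) & 0xFFFFFFFFL;
            f.srcPort = buf.getShort(o + 32) & 0xFFFF;
            f.dstPort = buf.getShort(o + 34) & 0xFFFF;
            f.tcpFlags = buf.get(o + 37) & 0xFF;
            f.protocol = buf.get(o + 38) & 0xFF;
            f.startMillis = bootMillis + first;
            f.durationMillis = last - first;
            flows.add(f);
        }
        return flows;
    }
}
```

For readability the sketch allocates a new `Flow` per record; a production mapper would more likely refill one reused bean per record to keep GC load small.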
NetFlow input Datagrams processing - Collector
● Java application, reactor pattern
● Datagram data stored as binary arrays in Hadoop SequenceFile format
● Small heap to keep “stop the world” pauses short
● JMX
● Monitoring whether any data is dropped
● Log level setting (normally OFF)
● Configurable block size (see performance hints later)
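
The hand-off between the UDP server and the writers, together with the dropped-data monitoring, might look roughly like the sketch below. It is an assumption-laden illustration, not the collector's code: `DatagramBuffer` is a hypothetical name, and the JDK's array-backed `ArrayBlockingQueue` stands in for whatever ring buffer the collector actually uses.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Bounded hand-off between the UDP receive thread and the writer threads. */
final class DatagramBuffer {
    private final BlockingQueue<byte[]> queue;
    private final AtomicLong dropped = new AtomicLong();

    DatagramBuffer(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the UDP server; never blocks, so the socket keeps being drained. */
    boolean offer(byte[] datagram) {
        boolean accepted = queue.offer(datagram);
        if (!accepted) {
            dropped.incrementAndGet(); // buffer full: count the loss instead of stalling
        }
        return accepted;
    }

    /** Called by a writer; waits up to timeoutMillis and returns null if nothing arrived. */
    byte[] poll(long timeoutMillis) {
        try {
            return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    /** A counter that a JMX MBean could expose to show whether data is being dropped. */
    long droppedCount() {
        return dropped.get();
    }
}
```

The receive path counts a drop rather than blocking, on the assumption that a stalled UDP reader would lose datagrams in the kernel anyway, where they cannot be counted.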
Performance Hints for Hadoop – coding
● Do not produce garbage to keep GC load small
● Minimize creation of new objects
● Reuse key and value beans
● Implement a WritableComparator for the key beans to allow sorting without deserialization
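
The two coding hints can be illustrated without a Hadoop dependency. The sketch below is hypothetical (`EdgeKey` and `RawKeyComparator` are invented names, and the 8-byte source/destination key layout is an assumption). In a real job the comparison logic would sit in a `WritableComparator` subclass that overrides `compare(byte[], int, int, byte[], int, int)`; here the same signature is mirrored as a static method.

```java
/** Reusable key bean: one instance is refilled for every record instead of allocating a new one. */
final class EdgeKey {
    final byte[] bytes = new byte[8]; // serialized form: srcAddr (4 bytes) + dstAddr (4 bytes), big-endian

    EdgeKey set(int srcAddr, int dstAddr) {
        putInt(bytes, 0, srcAddr);
        putInt(bytes, 4, dstAddr);
        return this; // same object every time, so no garbage is produced
    }

    private static void putInt(byte[] b, int off, int v) {
        b[off] = (byte) (v >>> 24);
        b[off + 1] = (byte) (v >>> 16);
        b[off + 2] = (byte) (v >>> 8);
        b[off + 3] = (byte) v;
    }
}

/** Compares serialized keys directly, so sorting needs no deserialization. */
final class RawKeyComparator {
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xFF; // treat bytes as unsigned
            int b = b2[s2 + i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2; // the shorter key sorts first when one is a prefix of the other
    }
}
```

Because the key is written big-endian, unsigned byte-wise comparison gives the same order as comparing the addresses as unsigned integers, which is why no key object has to be rebuilt during the sort.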
Performance Hints – Hadoop - JVM settings for mapper and reducer
● JVM options for mapper and reducer
● "mapred.map.child.java.opts" (named "mapreduce.map.java.opts" in newer Hadoop releases)
● "mapred.reduce.child.java.opts" (named "mapreduce.reduce.java.opts" in newer Hadoop releases)
● Heap size: set -Xms and -Xmx to the same value so the JVM does not try to return pages to the OS
● Huge pages (+10-20% performance, make GC faster)
● Huge pages should be configured in the OS; do not use Transparent Huge Pages
● -XX:+UseLargePages
● Most servers nowadays have a Non-Uniform Memory Architecture (NUMA) (20%+ gain)
● -XX:+UseParallelGC -XX:+UseNUMA
Performance Hints – Hadoop - JVM settings for mapper and reducer - Example
$HADOOP_BIN jar $MR_JAR com.hpe.gcs.nfp2.processing.DedupJob \
  -D mapreduce.map.java.opts="-Xmx800M -Xms800M -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseLargePages" \
  -D mapreduce.reduce.java.opts="-Xmx1G -Xms1G -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseLargePages" \
  -D mapreduce.map.memory.mb=1000 \
  -D mapreduce.reduce.memory.mb=1250
Performance Hints - Hadoop - sorting
● Increase memory for sorting
● The default of 100 MB is too small for bigger blocks (128-256 MB)
● How to detect: Spilled Records > 2 * Map output records
● Solution: increase mapreduce.task.io.sort.mb to several hundred MB *
● Check: Spilled Records == 2 * Map output records
* The optimum value depends on block size and the compression used
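
The detection rule above is simple arithmetic on two job counters. As a small sketch (the class and method names are hypothetical, and the counter values would have to be read from the job's counter output):

```java
/** Applies the slide's rule of thumb to the "Spilled Records" and "Map output records" counters. */
final class SpillCheck {
    /** True when more records were spilled than the 2 * map-output baseline, i.e. extra spill passes happened. */
    static boolean hasExtraSpills(long spilledRecords, long mapOutputRecords) {
        return spilledRecords > 2 * mapOutputRecords;
    }

    /** Spilled records per map output record; 2.0 is the target value after tuning. */
    static double spillRatio(long spilledRecords, long mapOutputRecords) {
        if (mapOutputRecords == 0) {
            return 0.0; // nothing was emitted, so nothing could spill
        }
        return (double) spilledRecords / mapOutputRecords;
    }
}
```

For example, 3,000,000 spilled records against 1,000,000 map output records gives a ratio of 3.0, which suggests the sort buffer is too small; after raising mapreduce.task.io.sort.mb the ratio should settle at 2.0.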
Performance Hints - Hadoop - general
● Install native libraries
● Compression to decrease IO
● Snappy codec has good compression ratio with low CPU utilization
● Output compression:
● mapred.output.compress=true
● mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyC
● Map output compression:
● mapred.compress.map.output=true
● Increase number of reducers to utilize the Hadoop cluster
● Example: mapred.reduce.tasks=64
Does this look interesting to you?
We are hiring:
https://careers.hpe.com/job/galway/security-analytics-data-platform-engineer/3545/4572314
