NetFlow Data processing using Hadoop and Vertica

NetFlow Data processing in a large organization
using Hadoop and Vertica, a bit about
architecture and performance hints
June, 2017
Josef Niedermeier and Gabriela Aumayer

Agenda
● Intro
● NetFlow overview
● Logical view of the processing
● Processing implementation
● NetFlow Datagrams processing
● Performance hints for Hadoop

NetFlow overview
● NetFlow is a network protocol for collecting IP traffic information by routers, layer 3 switches and firewalls that support it.
● The most popular versions are v5 (IPv4 only) and v9 (supports IPv6).
● Traffic information is packed into so-called flows, which are unidirectional:

src IP   dst IP   src port  dst port  Protocol  Packets  Octets  First                    Last                     Flags
x.x.x.x  y.y.y.y  34552     22        6 (TCP)   12       4563    2017-03-31 12:24:01.032  2017-03-31 12:25:34.342  2 (SYN)

NetFlow overview
● A flow is exported when the router determines that it has finished.
● Flows are exported in batches via UDP or SCTP, collected by collectors, stored in log files covering fixed time periods (~5 minutes) and processed by an analytical application.
[Diagram: NetFlow (UDP/SCTP) -> NetFlow Collector -> NetFlow Log -> Analytical application]

Logical View of Processing
[Diagram: logical data flow. Labels on the slide: Filtering; Flows stitching; De-duplication; Aggregation of Edges by Time Period; Aggregation by Host; Graph Properties; Aggregated Edges; Hosts Data; DNS/DHCP; Detail NetFlow Log (three times). Legend: query example, data, processing component.]
Query examples:
● Computers that changed behavior recently?
● Computers that replied to an external scanning computer?
● Exact start of communication between X and Y?
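
One way to picture the "Aggregation of Edges by Time Period" step is to group flows by source host, destination host and time bucket, and to sum packets and octets per group. The Java sketch below works under that assumption only. The class name, the string key and the 5-minute bucket are illustrative and are not taken from the slides or from the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: sums flows into edges (source, destination) per time bucket. */
final class EdgeAggregationSketch {
    /** Assumed bucket length of 5 minutes; the real period is not given in the slides. */
    static final long BUCKET_MILLIS = 5 * 60 * 1000L;

    // Value layout: [0] = flow count, [1] = packets, [2] = octets.
    private final Map<String, long[]> edges = new HashMap<>();

    private static String key(String srcIp, String dstIp, long bucketStart) {
        return srcIp + ">" + dstIp + "@" + bucketStart;
    }

    /** Adds one unidirectional flow to the edge of the bucket containing its start time. */
    void add(String srcIp, String dstIp, long firstMillis, long packets, long octets) {
        long bucketStart = (firstMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
        long[] totals = edges.computeIfAbsent(key(srcIp, dstIp, bucketStart), k -> new long[3]);
        totals[0] += 1;
        totals[1] += packets;
        totals[2] += octets;
    }

    /** Returns a copy of {flows, packets, octets} for an edge, or null if the edge is unknown. */
    long[] totals(String srcIp, String dstIp, long bucketStart) {
        long[] totals = edges.get(key(srcIp, dstIp, bucketStart));
        return totals == null ? null : totals.clone();
    }
}
```

Because flows are unidirectional, the edge from X to Y and the edge from Y to X are kept apart here. In a MapReduce job the same key would typically be the map output key and the summing would happen in the reducer.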

Processing - implementation
[Diagram: the logical view from the previous slide (Filtering, Flows stitching, De-duplication, Aggregation of Edges by Time Period, Aggregation by Host, Graph Properties, Aggregated Edges, Hosts Data, DNS/DHCP, Detail NetFlow Log), with its parts assigned to two platforms: MapReduce framework (Hadoop) and columnar store (Vertica).]

Processing – implementation - integration
[Diagram: integration of the components. Labels on the slide: Vertica; Spark; Hadoop; Collector (twice); NetFlow Log (twice); HDFS (long-term storage); Batch Processing (Filtering, De-duplication, Stitching, Edges Aggregation); Detail queries; Mid Term Storage; Batch Processing (Graph Properties, Flow Properties); REST API (LiftWEB); UI; Work-flow Automation; Machine Learning (Anomaly Detection, Classification); SQL; Security Analyst.]

NetFlow Datagrams processing
● Data written directly to HDFS
● Avoids unnecessary IO
● Can be done in parallel
● Hadoop SequenceFileInputFormat
● Efficient from storage space point of view
● Binary
● Supports compression
● Efficient serialization/deserialization
● Parsing is done by mappers in parallel

NetFlow Datagrams processing - architecture
[Diagram: Router -> NetFlow (UDP) -> Collector (UDP server, two Writers) -> HDFS -> Hadoop (three Mapper/Parser instances)]

Component      Role
UDP server     Listens on a port, checks each incoming datagram, adds the router IP and passes it to the ring buffer.
Writer         Takes datagram data from the ring buffer and writes it to the distributed file system (HDFS).
Mapper/parser  Parses the datagram, separates flows, converts timestamps to durations, emits flows for further processing.
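
The parsing step of the Mapper/parser can be sketched for a NetFlow v5 export datagram, which has a 24-byte header followed by 48-byte flow records. This is a minimal sketch, not the deck's parser: the class and field names are hypothetical, only a subset of the record fields is read, the router IP added by the UDP server is ignored, and the duration is computed as Last minus First (both are router uptime values in milliseconds) without handling counter wrap-around.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of splitting a NetFlow v5 datagram into flows. */
final class NetFlowV5ParserSketch {
    static final int HEADER_BYTES = 24;
    static final int RECORD_BYTES = 48;

    /** Subset of the v5 record fields, with the timestamps reduced to a duration. */
    static final class Flow {
        final String srcIp, dstIp;
        final int srcPort, dstPort, protocol, tcpFlags;
        final long packets, octets, durationMillis;

        Flow(String srcIp, String dstIp, int srcPort, int dstPort, int protocol,
             int tcpFlags, long packets, long octets, long durationMillis) {
            this.srcIp = srcIp; this.dstIp = dstIp;
            this.srcPort = srcPort; this.dstPort = dstPort;
            this.protocol = protocol; this.tcpFlags = tcpFlags;
            this.packets = packets; this.octets = octets;
            this.durationMillis = durationMillis;
        }
    }

    static List<Flow> parse(byte[] datagram) {
        if (datagram.length < HEADER_BYTES) {
            throw new IllegalArgumentException("shorter than a v5 header");
        }
        ByteBuffer buf = ByteBuffer.wrap(datagram); // big-endian by default (network order)
        int version = buf.getShort(0) & 0xFFFF;
        int count = buf.getShort(2) & 0xFFFF;
        if (version != 5) {
            throw new IllegalArgumentException("not NetFlow v5: " + version);
        }
        if (datagram.length < HEADER_BYTES + count * RECORD_BYTES) {
            throw new IllegalArgumentException("truncated datagram");
        }
        List<Flow> flows = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int o = HEADER_BYTES + i * RECORD_BYTES;      // start of record i
            long first = buf.getInt(o + 24) & 0xFFFFFFFFL; // uptime at first packet
            long last = buf.getInt(o + 28) & 0xFFFFFFFFL;  // uptime at last packet
            flows.add(new Flow(
                    ip(buf, o), ip(buf, o + 4),
                    buf.getShort(o + 32) & 0xFFFF, buf.getShort(o + 34) & 0xFFFF,
                    buf.get(o + 38) & 0xFF, buf.get(o + 37) & 0xFF,
                    buf.getInt(o + 16) & 0xFFFFFFFFL, buf.getInt(o + 20) & 0xFFFFFFFFL,
                    last - first));
        }
        return flows;
    }

    /** Formats 4 bytes at the given offset as a dotted IPv4 address. */
    private static String ip(ByteBuffer buf, int offset) {
        return (buf.get(offset) & 0xFF) + "." + (buf.get(offset + 1) & 0xFF) + "."
                + (buf.get(offset + 2) & 0xFF) + "." + (buf.get(offset + 3) & 0xFF);
    }
}
```

In a real mapper the Flow objects would be reused rather than allocated per record, in line with the coding hints later in the deck.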

NetFlow input Datagrams processing - Collector
● Java application, reactor pattern
● Datagram data stored as binary arrays in Hadoop SequenceFile format
● Small heap to keep “stop the world” pauses short
● JMX
● Monitoring whether any data is dropped
● Log level setting (normally OFF)
● Configurable block size (see performance hints later)
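
The hand-over between the UDP server and the writers, together with the drop monitoring, could look roughly like the sketch below. It is an assumption-laden illustration: a bounded ArrayBlockingQueue stands in for the collector's ring buffer, and the class and method names are hypothetical. The real collector's buffer implementation is not described in the slides.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the ring buffer between the UDP server and the writers. */
final class DatagramBufferSketch {
    private final BlockingQueue<byte[]> queue;
    private final AtomicLong dropped = new AtomicLong();

    DatagramBufferSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the UDP server; never blocks, so the socket keeps being drained. */
    boolean publish(byte[] datagram) {
        boolean accepted = queue.offer(datagram);
        if (!accepted) {
            dropped.incrementAndGet(); // buffer full: count the loss instead of stalling
        }
        return accepted;
    }

    /** Called by a writer; returns the oldest datagram or null if none is waiting. */
    byte[] poll() {
        return queue.poll();
    }

    /** A counter like this is what a JMX bean could expose for drop monitoring. */
    long droppedCount() {
        return dropped.get();
    }
}
```

A non-blocking publish trades completeness for liveness: a slow writer leads to counted drops rather than to a stalled receive loop.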

Performance Hints for Hadoop – coding
● Do not produce garbage to keep GC load small
● Minimize creation of new objects
● Reuse key and value beans
● Implement WritableComparator for the Key beans to allow sorting
without deserialization
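
Sorting without deserialization means comparing keys in their serialized byte form. The sketch below shows the idea in plain Java, using the same method shape as Hadoop's RawComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2). The key layout (IPv4 address as a 4-byte integer followed by a 2-byte port, both big-endian) is an assumption made for this example, not the deck's key format.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: ordering serialized keys without rebuilding key objects. */
final class RawKeyCompareSketch {

    /** Serializes an assumed key (IPv4 address, port) into 6 big-endian bytes. */
    static byte[] serialize(int srcIp, int srcPort) {
        return ByteBuffer.allocate(6).putInt(srcIp).putShort((short) srcPort).array();
    }

    /**
     * Compares two serialized keys byte by byte, treating bytes as unsigned.
     * For big-endian unsigned fields this gives the same order as comparing
     * the numeric values, and no object is created during the comparison.
     */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xFF;
            int b = b2[s2 + i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }
}
```

The unsigned masking matters: with signed bytes, port 34552 (0x86F8) would sort before port 22.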

Performance Hints – Hadoop - JVM setting for mapper and reducer
● VM options for mapper and reducer
● "mapred.map.child.java.opts" (newer name: "mapreduce.map.java.opts")
● "mapred.reduce.child.java.opts" (newer name: "mapreduce.reduce.java.opts")
● Heap size: set -Xms and -Xmx to the same value so the JVM does not try to return pages to the OS
● Huge pages (+10-20% performance, makes GC faster)
● Huge pages should be configured in the OS; do not use Transparent Huge Pages
● -XX:+UseLargePages
● Most servers nowadays have a Non-Uniform Memory Architecture, NUMA (20%+)
● -XX:+UseParallelGC -XX:+UseNUMA

Performance Hints – Hadoop - JVM setting for mapper and reducer - Example
$HADOOP_BIN jar $MR_JAR com.hpe.gcs.nfp2.processing.DedupJob \
  -D mapreduce.map.java.opts="-Xmx800M -Xms800M -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseLargePages" \
  -D mapreduce.reduce.java.opts="-Xmx1G -Xms1G -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseLargePages" \
  -D mapreduce.map.memory.mb=1000 \
  -D mapreduce.reduce.memory.mb=1250

Performance Hints - Hadoop - sorting
● Increase memory for sorting
● The default of 100 MB is too small for bigger blocks (128-256 MB)
● How to detect: Spilled Records > 2*Map output records
● Solution: Increase mapreduce.task.io.sort.mb to several hundred MB *
● Check: Spilled Records == 2*Map output records
* The optimum value depends on the block size and the compression used
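
The detection rule above is simple arithmetic on two job counters and can be written down as a small helper. This is only a sketch of the rule as stated on the slide; the class and method names are hypothetical, and reading the counters from a finished job is left out.

```java
/** Hypothetical helper encoding the spill rule from the slide. */
final class SpillCheckSketch {

    /** True when records were spilled more often than the slide's baseline of 2x. */
    static boolean hasExtraSpills(long spilledRecords, long mapOutputRecords) {
        return spilledRecords > 2 * mapOutputRecords;
    }

    /** Ratio of spilled records to map output records; 2.0 is the slide's target. */
    static double spillRatio(long spilledRecords, long mapOutputRecords) {
        return (double) spilledRecords / mapOutputRecords;
    }
}
```

For example, with made-up counter values of 3,000,000 spilled records and 1,000,000 map output records, the helper reports extra spills, which by the rule above suggests raising mapreduce.task.io.sort.mb.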

Performance Hints - Hadoop - general
● Install native libraries
● Compression to decrease IO
● The Snappy codec offers a good compression ratio with low CPU utilization
● Output compression:
● mapred.output.compress=true
● mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyC
● Map output compression:
● mapred.compress.map.output=true
● Increase number of reducers to utilize the Hadoop cluster
● Example: mapred.reduce.tasks=64

Does it look interesting to you?
We are hiring:
https://careers.hpe.com/job/galway/security-analytics-data-platform-engineer/3545/4572314
