2. WHAT IS THIS?
A unified log collector
So is this free?
It is an open-source tool that collects, processes, and unifies log data so that it is easier to use and understand.
3. WHAT ARE LOGS?
Consider a service or application platform such as AWS, Microsoft, Nintendo, Toastmaster, etc.
6. WHAT AND WHY DO WE NEED LOG DATA?
Logs are automatically generated records of events that occur within a system,
application or network.
Logs are collected for several reasons:
• Compliance: checking adherence to regulatory requirements.
• Security: recording events at every step of the application helps protect it from potential breaches.
• Debugging: because logs contain details such as timestamps, user actions, system events, errors, and performance metrics, errors can be analyzed and debugged efficiently.
These records are then stored in formats such as JSON, CSV, or plain text files.
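As an illustration, a structured log record in JSON might look like the following (all field names and values here are hypothetical, not from any particular system):

```json
{
  "timestamp": "2025-01-15T10:32:07Z",
  "level": "ERROR",
  "service": "checkout",
  "user_id": "u-1042",
  "message": "payment gateway timeout",
  "duration_ms": 5012
}
```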
8. CHALLENGES OF COLLECTING AND
CONSUMING DATA?
Data Volume & Storage
• Logs generate massive amounts of data, requiring efficient storage solutions.
• Managing log retention policies to balance historical data and storage costs.
Data Collection Complexity
• Logs come from diverse sources (applications, servers, networks, devices) in different formats.
• Ensuring consistent logging across systems can be challenging.
Real-time Processing
• Analyzing logs in real-time for security or performance monitoring requires high-speed
processing.
• Delays in log aggregation can impact response times to incidents.
9. CHALLENGES OF COLLECTING AND
CONSUMING DATA?
Standardization & Compatibility
• Different systems may log data in varied formats (JSON, XML, plaintext), making integration
complex.
• Standardizing log structures and using centralized logging solutions can help.
Log Noise & Redundancy
• Large logs may include excessive or redundant information, making meaningful insights
harder to extract.
• Filtering and prioritizing relevant logs is essential for efficient analysis.
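One common way to cut log noise is Fluentd's built-in grep filter, which can drop records matching known-noisy patterns before they reach storage. A minimal sketch (the tag `app.**` and the `message` key are hypothetical):

```
# Drop health-check and heartbeat chatter before it is stored
<filter app.**>
  @type grep
  <exclude>
    key message
    pattern /healthcheck|heartbeat/
  </exclude>
</filter>
```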
11. HOW DOES FLUENTD WORK?
It acts as a unified logging layer.
First, it is deployed into the cluster and collects the log data.
Second, it allows developers and analysts to utilize many types of logs as they are generated.
It also mitigates the risk of bad data slowing down and misinforming the organization.
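A minimal sketch of this collection step, assuming Fluentd tails a hypothetical application log file and echoes events to stdout (paths and the tag are illustrative):

```
# Collect: follow the application log file and emit tagged events
<source>
  @type tail
  path /var/log/app/app.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.backend
  <parse>
    @type json
  </parse>
</source>

# Consume: print collected events (stand-in for a real destination)
<match app.**>
  @type stdout
</match>
```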
13. Why Most Log Formats Are Weakly Structured
1. Human-Centric Design
• Logs were originally designed for humans, not machines, so structure wasn’t a priority.
2. Weak Standardization
• Log producers (e.g., web servers, syslog, middleware, sensors) followed inconsistent formatting practices.
3. Parsing Challenges
• Arbitrary text-based logs are difficult for computers to analyze.
• Extracting meaningful data often requires complex regular expressions.
4. Inefficient Data Processing
• Many ad-hoc scripts and one-liners are needed to parse and clean logs.
• Lack of structured formats makes automation harder.
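As a sketch of that parsing burden, Fluentd's regexp parser can pull fields out of weakly structured text. The expression below is a simplified fragment for an access-log-style line, not a complete or authoritative pattern, and the file paths are hypothetical:

```
<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/log/fluentd/nginx.pos
  tag nginx.access
  <parse>
    # Named capture groups become structured fields on the event
    @type regexp
    expression /^(?<remote>[^ ]*) [^ ]* [^ ]* \[(?<time>[^\]]*)\] "(?<method>\S+) (?<path>[^ ]*)/
    time_format %d/%b/%Y:%H:%M:%S %z
  </parse>
</source>
```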
2025?
Logstash > Fluentd > Fluent Bit
17. MAIN ADVANTAGES
Define an interface that all log producers and consumers implement against. This is the first
requirement for the Unified Logging Layer.
Reliability and Scalability
Buffering:
- Uses a file buffer for persistent data
- Each buffer chunk has an ID, enabling idempotent writes
Retrying and error handling:
- When a transaction fails, the buffer retains the data, so no secondary backup is needed
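These reliability features are configured in a match block's buffer section. A hedged sketch, assuming a forward output to a hypothetical aggregator host (all paths, hostnames, and intervals are illustrative):

```
<match app.**>
  @type forward
  <server>
    host aggregator.example.com
    port 24224
  </server>
  <buffer>
    # File buffer survives process restarts
    @type file
    path /var/log/fluentd/buffer/app
    flush_interval 5s
    # Back off and keep retrying on delivery failure
    retry_type exponential_backoff
    retry_max_interval 30s
    retry_forever true
  </buffer>
</match>
```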
19. Breakdown of the Architecture
1. Forwarder Layer
• Logs are collected from various sources, including Kubernetes clusters (on-prem or cloud), cloud VMs (AWS, Azure, Google Cloud), and on-premises servers.
• Each source has Fluent Bit installed, which is responsible for collecting and forwarding logs.
• Fluent Bit uses the forward protocol to send logs to an intermediate component.
2. Aggregator Layer
• Logs from multiple forwarders are load balanced using an IP or load balancer.
• Load balancing can be done using round-robin or weighted load balancing.
• The aggregated logs are processed by Fluent Bit or Fluentd, which act as central log aggregators.
3. Destination Layer
• After processing, logs are forwarded to multiple destinations based on configuration:
• Splunk (for log analysis and monitoring)
• Kafka (for real-time streaming and event processing)
• Elasticsearch (for indexing and searching logs)
• AWS S3 (for log storage and archiving)
• MongoDB (for structured storage of log data)
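A minimal Fluent Bit forwarder configuration for the first layer might look like the following sketch, where the log path and aggregator hostname are hypothetical:

```
# Collect logs at the source
[INPUT]
    Name  tail
    Path  /var/log/app/*.log
    Tag   app.*

# Ship them to the aggregator layer over the forward protocol
[OUTPUT]
    Name   forward
    Match  *
    Host   aggregator.example.com
    Port   24224
```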
21. HOW DOES FLUENTD ROUTE?
Fluentd routes events flexibly through tags; for example, an event's tag can direct it to be stored in Elasticsearch.
Using the filter plugin, we can parse the formatting of the log.
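As a sketch, assuming the fluent-plugin-elasticsearch output is installed and events carry a hypothetical `app.**` tag, a filter can parse the raw log field before the match block routes events to Elasticsearch (hostname is illustrative):

```
# Parse the raw "log" field into structured JSON fields
<filter app.**>
  @type parser
  key_name log
  <parse>
    @type json
  </parse>
</filter>

# Tag-based routing: anything tagged app.* goes to Elasticsearch
<match app.**>
  @type elasticsearch
  host elasticsearch.example.com
  port 9200
  logstash_format true
</match>
```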
22. Re-route Fluentd events in three ways:
1) by tag using the fluent-plugin-route plugin,
2) by label with the out_relabel plugin,
3) by record content with the fluent-plugin-rewrite-tag filter.
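For example, re-routing by record content with fluent-plugin-rewrite-tag-filter might look like this sketch (the `level` key and the tags are hypothetical):

```
<match app.log>
  @type rewrite_tag_filter
  <rule>
    # Records whose "level" field is ERROR get re-emitted
    # under a new tag and matched again downstream
    key     level
    pattern /^ERROR$/
    tag     error.${tag}
  </rule>
</match>
```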
Fluentd’s approach is more declarative whereas Logstash’s method is procedural.
Therefore, programmers trained in procedural programming might find Logstash’s configuration easier to get started with.
On the other hand, Fluentd’s tag-based routing allows complex routing to be expressed clearly.
For example, the following configuration applies different logic to all production and
development events based on tag prefixes.
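The slide's configuration is not reproduced here; a minimal sketch of such tag-prefix routing, with hypothetical destinations, might be:

```
# All production-tagged events go to the production aggregator
<match production.**>
  @type forward
  <server>
    host prod-aggregator.example.com
    port 24224
  </server>
</match>

# Development events just go to stdout for inspection
<match development.**>
  @type stdout
</match>
```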
25. FOLLOWING THE EXAMPLE ARCHITECTURE FROM THE PREVIOUS SLIDE
1. Data Sources (Left Side)
The system collects logs from multiple sources, including:
• Manufacturing (Factories, Robotics, Assembly lines)
• Mobile & Vehicle (Drones, Smartphones, Cars, Trucks)
• Home Electronics (AC, TVs, Washing Machines, Refrigerators)
Each of these sources generates large volumes of logs and telemetry data, which are sent to a log collector server.
2. Log Collector Server (Middle - Fluentd)
• Fluentd is used as the central log collector, aggregating logs from various sources.
• These logs are stored in a high-performance storage system.
• The logs are also structured and formatted into Apache Arrow, which is an efficient columnar in-memory format
optimized for fast processing.
3. GPU-Accelerated Processing (Middle - GPU Server)
• The logs in Apache Arrow format are sent to a DB/GPU server via GPU-Direct SQL over RDMA Network.
• RDMA (Remote Direct Memory Access) enables direct memory transfers between systems with minimal CPU
involvement, improving performance.
• GPU-Direct SQL allows SQL queries (WHERE, JOIN, GROUP BY) to be executed directly on GPUs,
significantly accelerating data processing.
26. 4. Data Utilization (Right Side)
Once the data is processed, it is used for multiple purposes:
• BI Tools (Visualization): Business Intelligence tools consume processed data for dashboards and reports.
• DB Admins / Users: Database administrators and analysts can query logs interactively.
• AI/ML (Anomaly Detection): Machine learning models analyze the logs for detecting anomalies, security
threats, or operational issues.
5. Elasticsearch (Bottom)
• The processed logs and analytical results can be indexed and stored in Elasticsearch, making them searchable
and accessible for advanced analytics.
IN CONCLUSION:
Fluentd collects logs from multiple IoT & industrial sources.
Apache Arrow ensures efficient log processing.
GPU-accelerated SQL speeds up query execution.
RDMA networking enables high-speed data transfers.
BI tools, AI/ML, and Elasticsearch utilize the processed data for visualization, anomaly detection, and search.
27. TAKEAWAYS
• Fluent Bit is used for lightweight log forwarding at the source level.
• Fluentd or Fluent Bit is used for aggregation at the central level.
• Load balancing ensures even distribution of logs to the aggregators.
• Multiple log destinations support different use cases like monitoring, real-time processing, and storage.