Elasticsearch and Fluentd metrics

Big picture

Use the Prometheus monitoring and alerting tool for Fluentd and Elasticsearch metrics to ensure continuous network visibility.

Value

Platform engineering teams rely on logs, such as flow logs and DNS logs, for visibility into their networks. If log collection or storage is disrupted, network visibility can be impacted. Prometheus can monitor log collection and storage metrics so platform engineering teams are alerted to problems before they occur.

Features

This how-to guide uses the following Calico Enterprise features:

  • Fluentd
  • Elasticsearch

Concepts

Component | Description
Prometheus | Monitoring tool that scrapes metrics from instrumented jobs and displays time series data in a visualizer (such as Grafana). For Calico Enterprise, the “jobs” that Prometheus can harvest metrics from are the Elasticsearch and Fluentd components.
Elasticsearch | Stores Calico Enterprise logs.
Fluentd | Sends Calico Enterprise logs to Elasticsearch for storage.

Multi-cluster management users: Elasticsearch metrics are collected only from the management cluster. Because managed clusters do not have their own Elasticsearch clusters, do not monitor Elasticsearch for managed clusters. However, managed clusters do send logs to Elasticsearch through Fluentd, so you should monitor Fluentd for managed clusters.

How to

Create Prometheus alerts for Elasticsearch

The following example creates Prometheus rules that monitor important Elasticsearch metrics and alert when they cross certain thresholds:

Note: The Elasticsearch Prometheus rules are only applicable to standalone and management cluster types, not the managed cluster type.

Note: The ElasticsearchHighMemoryUsage alert threshold is an absolute value in bytes. Adjust it to suit your deployment before applying the rules (a tuned example is shown after the rules below).

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tigera-prometheus-log-storage-monitoring
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: tigera-elasticsearch.rules
    rules:
    - alert: ElasticsearchClusterStatusRed
      expr: elasticsearch_cluster_health_status{color="red"} == 1
      labels:
        severity: Critical
      annotations:
        summary: "Elasticsearch cluster {{$labels.cluster}}'s status is red"
        description: "The Elasticsearch cluster {{$labels.cluster}} is very unhealthy and immediate action must be 
taken. Check the pod logs for the {{$labels.cluster}} Elasticsearch cluster to start the investigation."
    - alert: ElasticsearchClusterStatusYellow
      expr: elasticsearch_cluster_health_status{color="yellow"} == 1
      labels:
        severity: Warning
      annotations:
        summary: "Elasticsearch cluster {{$labels.cluster}} status is yellow"
        description: "The Elasticsearch cluster {{$labels.cluster}} may be unhealthy and could become very unhealthy if 
the issue isn't resolved. Check the pod logs for the {{$labels.cluster}} Elasticsearch cluster to start the 
investigation."
    - alert: ElasticsearchPodCriticallyLowDiskSpace
      expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.10
      labels:
        severity: Critical
      annotations:
        summary: "Elasticsearch pod {{$labels.name}}'s disk space is critically low."
        description: "Elasticsearch pod {{$labels.name}} in Elasticsearch cluster {{$labels.name}} has less then 10% of 
free disk space left. To avoid service disruption review the LogStorage resource limits and curation settings."
    - alert: ElasticsearchPodLowDiskSpace
      expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.25
      labels:
        severity: Warning
      annotations:
        summary: "Elasticsearch pod {{$labels.name}}'s disk space is getting low."
        description: "Elasticsearch pod {{$labels.name}} in Elasticsearch cluster {{$labels.name}} has less then 25% of
free disk space left. To avoid service disruption review the LogStorage resource limits and curation settings."
    - alert: ElasticsearchConsistentlyHighCPUUsage
      expr: avg_over_time(elasticsearch_os_cpu_percent[10m]) > 90
      labels:
        severity: Warning
      annotations:
        summary: "Elasticsearch pod {{$labels.name}}'s CPU usage is consistently high."
        description: "Elasticsearch pod {{$labels.name}} in Elasticsearch cluster {{$labels.cluster}} has been using 
above 90% of it's available CPU for the last 10 minutes. To avoid service disruption review the LogStorage resource
limits."
    - alert: ElasticsearchHighMemoryUsage
      expr: avg_over_time(elasticsearch_jvm_memory_pool_used_bytes[10m]) > 1000000000
      labels:
        severity: Warning
      annotations:
        summary: "Elasticsearch pod {{$labels.name}}'s memory usage is consistently high."
        description: "Elasticsearch pod {{$labels.name}} in Elasticsearch cluster {{$labels.cluster}} has been using 
an average of {{$labels.value}} bytes of memory over the past 10 minutes. To avoid service disruption review the 
LogStorage resource limits."

The alerts created in the example are described in the following table:

Alert | Severity | Requires | Issue/reason
ElasticsearchClusterStatusRed | Critical | Immediate action to avoid service disruption and restore the service. | The Elasticsearch cluster is unhealthy.
ElasticsearchPodCriticallyLowDiskSpace | Critical | | Disk space for an Elasticsearch pod is less than 10% of total available space. The LogStorage resource settings for disk space are not high enough, or logs are not being correctly curated.
ElasticsearchClusterStatusYellow | Non-critical, warning | Immediate investigation to rule out a critical issue. | Early warning of a cluster problem.
ElasticsearchPodLowDiskSpace | Non-critical, warning | | Disk space for an Elasticsearch pod is less than 25% of total available space. The LogStorage resource settings for disk space are not high enough, or logs are not being correctly curated.
ElasticsearchConsistentlyHighCPUUsage | Non-critical, warning | | An Elasticsearch pod is averaging above 90% of its CPU over the last 10 minutes.
ElasticsearchHighMemoryUsage | Non-critical, warning | | An Elasticsearch pod is averaging above the set memory threshold over the last 10 minutes.
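
If any of these alerts fire too readily in your environment, Prometheus rules also accept an optional "for" field, which holds an alert in a pending state until its expression has been true for the given duration. The fragment below is a minimal sketch applied to the yellow-status rule; the 5m duration is an arbitrary example, not a recommended value:

    - alert: ElasticsearchClusterStatusYellow
      expr: elasticsearch_cluster_health_status{color="yellow"} == 1
      # only fire after the cluster has been yellow for a continuous 5 minutes
      for: 5m
      labels:
        severity: Warning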

Create Prometheus alerts for Fluentd

The following example creates a Prometheus rule that monitors an important Fluentd metric and alerts when it crosses a certain threshold:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tigera-prometheus-log-collection-monitoring
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: tigera-log-collection.rules
    rules:
    - alert: FluentdPodConsistentlyLowBufferSpace
      expr: avg_over_time(fluentd_output_status_buffer_available_space_ratio[5m]) < 75
      labels:
        severity: Warning
      annotations:
        summary: "Fluentd pod {{$labels.pod}}'s buffer space is consistently below 75 percent capacity."
        description: "Fluentd pod {{$labels.pod}} has very low buffer space. There may be connection issues between Elasticsearch
and Fluentd or there are too many logs to write out, check the logs for the Fluentd pod."

The alerts created in the example are described as follows:

Alert | Severity | Requires | Issue/reason
FluentdPodConsistentlyLowBufferSpace | Non-critical, warning | Immediate investigation to ensure logs are being gathered correctly. | A Fluentd pod's available buffer space has averaged less than 75% over the last 5 minutes. This could mean Fluentd is having trouble communicating with the Elasticsearch cluster, the Elasticsearch cluster is down, or there are simply too many logs to process.
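
If you also want a more urgent signal before the buffer is exhausted, you can add a second rule to the same group that reuses the same metric with a tighter threshold and a Critical severity. The alert name and the 25 percent threshold below are illustrative assumptions, not product defaults:

    - alert: FluentdPodCriticallyLowBufferSpace
      # same metric as the rule above, but a lower threshold and higher severity
      expr: avg_over_time(fluentd_output_status_buffer_available_space_ratio[5m]) < 25
      labels:
        severity: Critical
      annotations:
        summary: "Fluentd pod {{$labels.pod}}'s buffer space is critically low."
        description: "Fluentd pod {{$labels.pod}} has averaged less than 25 percent available buffer space over the last 5 minutes. Check connectivity between Fluentd and Elasticsearch and review the Fluentd pod logs before the buffer fills and logs are dropped."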