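1. Overview
Cassandra is monitored by exposing a metrics endpoint through jmx_exporter; Prometheus then scrapes that endpoint at a fixed interval.
2. Agent configuration
Apply the same configuration on every node where Cassandra is installed.
2.1. Download the JMX javaagent
Download jmx_prometheus_javaagent-0.16.1.jar from GitHub and upload it to the $CASSANDRA_HOME/lib/ directory on each node of the Cassandra cluster.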
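The download step can be scripted. The following is a minimal sketch, not the only way to do it: the text above only says the jar comes from GitHub, so fetching it from Maven Central is an assumption, and the helper names `jmx_agent_url` and `download_jmx_agent` are hypothetical.

```shell
# Sketch: fetch the jmx_exporter javaagent jar into Cassandra's lib directory.
# Assumptions: the jar is pulled from Maven Central, curl is installed,
# and CASSANDRA_HOME is set on the node.

JMX_AGENT_VERSION="0.16.1"

# Print the Maven Central URL for a given agent version.
jmx_agent_url() {
  local version="$1"
  printf 'https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/%s/jmx_prometheus_javaagent-%s.jar\n' \
    "$version" "$version"
}

# Download the jar into $CASSANDRA_HOME/lib; aborts if CASSANDRA_HOME is unset.
download_jmx_agent() {
  local version="${1:-$JMX_AGENT_VERSION}"
  local dest="${CASSANDRA_HOME:?CASSANDRA_HOME must be set}/lib"
  # -f: fail on HTTP errors, -L: follow redirects, -o: output file
  curl -fL -o "$dest/jmx_prometheus_javaagent-$version.jar" "$(jmx_agent_url "$version")"
}

# Usage (run on every Cassandra node):
#   download_jmx_agent 0.16.1
```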
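2.2. Configure the metrics collection file
The configuration below comes from the Cassandra dashboard, which can be found by searching for ID 5408 at https://grafana.com/grafana/dashboards . On each node of the Cassandra cluster, go to the conf/ directory and add the following configuration file.
vi cassandra-prometheus-jmx.yml
# Copy the following content into the file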
lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames: [
"org.apache.cassandra.metrics:type=ColumnFamily,name=RangeLatency,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=LiveSSTableCount,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=SSTablesPerReadHistogram,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=SpeculativeRetries,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableOnHeapSize,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableSwitchCount,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableLiveDataSize,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableColumnsCount,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=MemtableOffHeapSize,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterFalsePositives,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterFalseRatio,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterDiskSpaceUsed,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=BloomFilterOffHeapMemoryUsed,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=SnapshotsSize,*",
"org.apache.cassandra.metrics:type=ColumnFamily,name=TotalDiskSpaceUsed,*",
"org.apache.cassandra.metrics:type=CQL,name=RegularStatementsExecuted,*",
"org.apache.cassandra.metrics:type=CQL,name=PreparedStatementsExecuted,*",
"org.apache.cassandra.metrics:type=Compaction,name=PendingTasks,*",
"org.apache.cassandra.metrics:type=Compaction,name=CompletedTasks,*",
"org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted,*",
"org.apache.cassandra.metrics:type=Compaction,name=TotalCompactionsCompleted,*",
"org.apache.cassandra.metrics:type=ClientRequest,name=Latency,*",
"org.apache.cassandra.metrics:type=ClientRequest,name=Unavailables,*",
"org.apache.cassandra.metrics:type=ClientRequest,name=Timeouts,*",
"org.apache.cassandra.metrics:type=Storage,name=Exceptions,*",
"org.apache.cassandra.metrics:type=Storage,name=TotalHints,*",
"org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress,*",
"org.apache.cassandra.metrics:type=Storage,name=Load,*",
"org.apache.cassandra.metrics:type=Connection,name=TotalTimeouts,*",
"org.apache.cassandra.metrics:type=ThreadPools,name=CompletedTasks,*",
"org.apache.cassandra.metrics:type=ThreadPools,name=PendingTasks,*",
"org.apache.cassandra.metrics:type=ThreadPools,name=ActiveTasks,*",
"org.apache.cassandra.metrics:type=ThreadPools,name=TotalBlockedTasks,*",
"org.apache.cassandra.metrics:type=ThreadPools,name=CurrentlyBlockedTasks,*",
"org.apache.cassandra.metrics:type=DroppedMessage,name=Dropped,*",
"org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=HitRate,*",
"org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Hits,*",
"org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Requests,*",
"org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Entries,*",
"org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size,*",
#"org.apache.cassandra.metrics:type=Streaming,name=TotalIncomingBytes,*",
#"org.apache.cassandra.metrics:type=Streaming,name=TotalOutgoingBytes,*",
"org.apache.cassandra.metrics:type=Client,name=connectedNativeClients,*",
"org.apache.cassandra.metrics:type=Client,name=connectedThriftClients,*",
"org.apache.cassandra.metrics:type=Table,name=WriteLatency,*",
"org.apache.cassandra.metrics:type=Table,name=ReadLatency,*",
"org.apache.cassandra.net:type=FailureDetector,*",
]
#blacklistObjectNames: ["org.apache.cassandra.metrics:type=ColumnFamily,*"]
rules:
  - pattern: org.apache.cassandra.metrics<type=(Connection|Streaming), scope=(\S*), name=(\S*)><>(Count|Value)
    name: cassandra_$1_$3
    labels:
      address: "$2"
  - pattern: org.apache.cassandra.metrics<type=(ColumnFamily), name=(RangeLatency)><>(Mean)
    name: cassandra_$1_$2_$3
  - pattern: org.apache.cassandra.net<type=(FailureDetector)><>(DownEndpointCount)
    name: cassandra_$1_$2
  - pattern: org.apache.cassandra.metrics<type=(Keyspace), keyspace=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$3_$4
    labels:
      "$1": "$2"
  - pattern: org.apache.cassandra.metrics<type=(Table), keyspace=(\S*), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$4_$5
    labels:
      "keyspace": "$2"
      "table": "$3"
  - pattern: org.apache.cassandra.metrics<type=(ClientRequest), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
    name: cassandra_$1_$3_$4
    labels:
      "type": "$2"
  - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=(\S*)><>(Count|Value)
    name: cassandra_$1_$5
    labels:
      "$1": "$4"
      "$2": "$3"
2.3. Configure cassandra-env.sh
On each Cassandra node, go to the conf/ directory and edit cassandra-env.sh as follows:
# 7070 is the port that exposes metrics to Prometheus
# The jar file name must match the file downloaded in step 2.1
JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/jamm-0.3.0.jar -javaagent:$CASSANDRA_HOME/lib/jmx_prometheus_javaagent-0.16.1.jar=7070:${CASSANDRA_HOME}/conf/cassandra-prometheus-jmx.yml"
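After Cassandra is restarted with this option, one way to confirm that the agent is serving data is to fetch the endpoint on port 7070 and count the series whose names start with `cassandra_`, the prefix produced by the rules in cassandra-prometheus-jmx.yml. This is only a sketch: the helper name `count_cassandra_series` is hypothetical, and it assumes curl is available on the node.

```shell
# Count Cassandra series in Prometheus text-format output read from stdin.
# Comment lines (# HELP / # TYPE) start with '#' and are therefore not counted.
count_cassandra_series() {
  # grep -c exits non-zero when nothing matches; '|| true' keeps the
  # function usable in scripts running under 'set -e'.
  grep -c '^cassandra_' || true
}

# Usage on a Cassandra node (assumes the agent listens on port 7070):
#   curl -s http://localhost:7070/metrics | count_cassandra_series
```

A result of 0 suggests that the agent started but none of the whitelisted MBeans matched a rule, so the YAML file is the first place to look.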
3. Prometheus configuration
3.1. Prometheus configuration
Edit the prometheus.yml file of the Prometheus component and add a scrape job for Cassandra:
# Job name; it must start with cassandra_
  - job_name: 'cassandra_yd01'
    scrape_interval: 60s
    static_configs:
      - targets: ['192.168.0.1:7070','192.168.0.2:7070','192.168.0.3:7070']
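For clusters with more than a few nodes, the targets list can be generated instead of typed by hand. The helper below is a hypothetical sketch; `make_targets` and the host names `node1` and `node2` are placeholders, not part of the original setup.

```shell
# Build a Prometheus 'targets' list from a port and a list of hosts.
# Usage: make_targets PORT HOST [HOST ...]
make_targets() {
  local port="$1"
  shift
  local out="" host
  for host in "$@"; do
    # Append a comma separator only when the list is already non-empty.
    out="${out}${out:+,}'${host}:${port}'"
  done
  printf '[%s]\n' "$out"
}

# Example:
#   make_targets 7070 node1 node2
# prints:
#   ['node1:7070','node2:7070']
```

Before reloading Prometheus, the edited file can be checked with `promtool check config prometheus.yml`; promtool ships with the Prometheus distribution.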
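3.2. Verify Prometheus startup
See the article on Kubernetes configuration:
4. Grafana configuration
4.1. Import the dashboard template
Download the dashboard from https://grafana.com/dashboards/5408 and import it into Grafana, then adapt the resulting dashboard to your own workload. In some cases every panel has to be edited by hand to select the "prometheus" data source.
Note that the JSON of the Cassandra metrics dashboard on Grafana contains some errors that have to be corrected manually.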
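4.2. Alerting metrics
| General alert | Alert rule |
|---|---|
| Memory | Alert when memory usage reaches the threshold (>80%) |
| GC time | Alert when GC time reaches the threshold (>0.5s) |
| GC count | Alert when the number of GCs per second reaches the threshold (>5) |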
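The thresholds in the table could be turned into Prometheus alerting rules along the following lines. This is a sketch under several assumptions that the original does not state: that "memory" means JVM heap, that "GC time" means the average pause per collection, that the JVM metrics `jvm_memory_bytes_used`, `jvm_memory_bytes_max` and `jvm_gc_collection_seconds` exported by the javaagent are used, and that jobs are selected by the `cassandra_` prefix. The alert names, the 5m windows and the helper `write_cassandra_alert_rules` are hypothetical.

```shell
# Write a Prometheus alerting-rules file to the given path.
# All expressions are assumptions; adjust them to the metrics actually
# exposed on port 7070 of your nodes.
write_cassandra_alert_rules() {
  local path="$1"
  cat > "$path" <<'EOF'
groups:
  - name: cassandra_jvm
    rules:
      # Memory: heap usage above 80%
      - alert: CassandraHeapUsageHigh
        expr: jvm_memory_bytes_used{job=~"cassandra_.*",area="heap"} / jvm_memory_bytes_max{job=~"cassandra_.*",area="heap"} > 0.8
        for: 5m
      # GC time: average seconds per collection above 0.5s
      - alert: CassandraGcTimeHigh
        expr: rate(jvm_gc_collection_seconds_sum{job=~"cassandra_.*"}[5m]) / rate(jvm_gc_collection_seconds_count{job=~"cassandra_.*"}[5m]) > 0.5
        for: 5m
      # GC count: more than 5 collections per second
      - alert: CassandraGcRateHigh
        expr: rate(jvm_gc_collection_seconds_count{job=~"cassandra_.*"}[5m]) > 5
        for: 5m
EOF
}

# Usage:
#   write_cassandra_alert_rules cassandra-alerts.yml
```

The generated file can be validated with `promtool check rules cassandra-alerts.yml` and then referenced from the `rule_files` section of prometheus.yml.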