Big Data: Collecting Data with Flume

Author: 大数据同盟会

Last modified: 2022-05-02 16:25:57

License: CC 4.0 BY-SA

Category: Big Data Principles

First published: 2020-08-22 22:51:46

Article link: https://blog.csdn.net/chuan129/article/details/108176080

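This article describes the architecture and use of Flume, a distributed log collection system. It covers the Source, Channel, and Sink components of an agent and walks through examples that configure Flume to collect data from a network port, a directory, and a file, and that implement agent chaining, channel selectors, and automatic failover.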

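Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large volumes of log data.

It can collect source data in many forms, such as files and socket packets, and deliver the collected data to many external storage systems, including HDFS, HBase, Hive, and Kafka.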

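I. Flume Architecture

The central role in a Flume deployment is the agent. Each agent acts as a data courier and contains three components:

Source: the collection end. It connects to the data source and reads data from it.

Channel: the transfer buffer inside the agent. It carries data from the source to the sink.

Sink: the delivery end. It sends the collected data either to the next agent in the chain or to the final storage system.

Inside Flume, data travels wrapped in Events. An event consists of a byte-array body plus a set of string headers.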

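Flume's transaction control mechanism covers two hops:

1. Source to channel: the source puts a batch of events into the channel inside a transaction. If the put fails, the transaction is rolled back and none of the batch enters the channel.
2. Channel to sink: the sink takes a batch of events from the channel inside a transaction and commits only after the batch has been delivered. If delivery fails, the transaction is rolled back and the events stay in the channel for a retry.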
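
The channel-to-sink hop can be illustrated with a toy model in shell. This is only a sketch of the commit/rollback idea, not Flume's implementation: the channel is assumed to be a plain text file with one event per line, and `sink_take` and its arguments are hypothetical names.

```shell
#!/bin/bash
# Toy model of a sink-side transaction (hypothetical helper, not a Flume API).
# Usage: sink_take CHANNEL_FILE BATCH_SIZE DELIVER_CMD [ARGS...]
sink_take() {
  local channel=$1 batch=$2
  shift 2
  local staged
  staged=$(mktemp)

  # Take: stage up to BATCH_SIZE events without removing them from the channel.
  head -n "$batch" "$channel" > "$staged"

  # Deliver: the staged events are fed to the delivery command on stdin.
  if "$@" < "$staged"; then
    # Commit: delivery succeeded, so drop the delivered events from the channel.
    tail -n "+$(( batch + 1 ))" "$channel" > "$channel.tmp"
    mv "$channel.tmp" "$channel"
    rm -f "$staged"
    return 0
  fi

  # Rollback: delivery failed, so the channel file is left untouched for a retry.
  rm -f "$staged"
  return 1
}
```

For example, `sink_take channel.txt 2 false` fails and leaves channel.txt unchanged, while `sink_take channel.txt 2 cat` prints the first two events and removes them from the file.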

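II. Chaining Multiple Flume Agents

Agents can be chained in series: the sink of one agent sends its events to the source of the next agent, as in Case 5 below.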

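III. Installing and Using Flume (if not yet installed)

1. Upload the installation package and extract it.

2. Run a script that simulates log generation: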

while true; do echo 111111111111111111111111_$RANDOM >> access.log; sleep 0.2; done

Case 1: Collecting Data from a Network Port

1. Create netcat-logger.conf

# Name the three components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

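# Describe/configure the source
# netcat receives data from a network port; the agent listens on this machine, hence localhost.
# (A spooldir source would instead watch a directory and collect any file placed in it.)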
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

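# Use a channel which buffers events in memory
# Events move through the channel in batches, one transaction per batch.
# capacity: maximum number of events the channel can hold
# transactionCapacity: maximum number of events per transaction, that is, put by the source or taken by the sink at once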
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
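
The two channel parameters are related: for the memory channel, transactionCapacity must not exceed capacity. The check below is a minimal sketch of that rule, not part of Flume; `check_channel_caps` is a hypothetical helper, and the fallback value of 100 for a missing setting is taken from the memory channel defaults in the Flume user guide.

```shell
#!/bin/bash
# check_channel_caps FILE AGENT CHANNEL  (hypothetical helper, not a Flume tool)
# Succeeds when the channel's transactionCapacity is <= its capacity.
check_channel_caps() {
  local file=$1 agent=$2 channel=$3
  local prefix="${agent}\.channels\.${channel}"
  local cap tx

  # Extract the last value assigned to each property, ignoring surrounding spaces.
  cap=$(sed -n "s/^[[:space:]]*${prefix}\.capacity[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1 | tr -d '[:space:]')
  tx=$(sed -n "s/^[[:space:]]*${prefix}\.transactionCapacity[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1 | tr -d '[:space:]')

  # Assumed defaults when a property is not set in the file.
  cap=${cap:-100}
  tx=${tx:-100}

  [ "$tx" -le "$cap" ]
}
```

With the configuration above, `check_channel_caps conf/netcat-logger.conf a1 c1` succeeds because 100 is not larger than 1000.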

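2. Start the agent

$ bin/flume-ng agent --conf conf --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console

3. Send data. Each line typed into the telnet session becomes one event, which the logger sink prints on the agent's console:

$ telnet localhost 44444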

Case 2: Collecting Data from a Directory

1. Create spool-logger.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

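# Describe/configure the source
# Watch a directory: spoolDir is the directory to watch;
# fileHeader controls whether each event gets a header holding the path of its source file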
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/flumespool
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

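2. Start the agent

bin/flume-ng agent -c ./conf -f ./conf/spool-logger.conf -n a1 -Dflume.root.logger=INFO,console

3. Send data:

Put files into /home/hadoop/flumespool. A file must not be modified after it has been placed in the directory, and file names must not be reused.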

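Case 3: Collecting Data from a File (Method 1)

The exec source is suited to following a file that is appended to in real time. It does not record an offset, so data can be lost, for example when the agent restarts.

1. Create tail-hdfs.conf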

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

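# exec runs a command and reads its output
a1.sources.r1.channels = c1
a1.sources.r1.type = exec
# tail -F follows the file by name, so it picks up the new file after a rotation;
# tail -f follows the open file (its inode), so it keeps tracking the same file even after a rename
a1.sources.r1.command = tail -F /home/hadoop/log/test.log

# Sink (delivery target)
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
# Target directory; Flume substitutes the time escape sequences
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
# File name prefix
a1.sinks.k1.hdfs.filePrefix = events-

# Round the timestamp down to 10 minutes, so the directory changes every 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

# Seconds to wait before rolling the current file
a1.sinks.k1.hdfs.rollInterval = 120
# File size that triggers a roll (bytes)
a1.sinks.k1.hdfs.rollSize = 268435456
# Number of events written to a file before it is rolled
a1.sinks.k1.hdfs.rollCount = 20

# Number of events written to the file before it is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 1000

# Format the directory with the local time instead of the event's timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# File type of the output: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream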

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

2. Start command

bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
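
The effect of hdfs.round, hdfs.roundValue, and hdfs.roundUnit on the target directory can be worked through with a little arithmetic. The functions below are a sketch, not Flume code: `round_down_epoch` and `bucket_dir` are hypothetical names, the rounding is done on epoch seconds in UTC (which matches rounding the minute of the hour when the value divides 60 evenly, as 10 does), and `bucket_dir` assumes GNU date.

```shell
#!/bin/bash
# round_down_epoch EPOCH_SECONDS MINUTES
# Rounds a timestamp down to a multiple of MINUTES, as round=true with roundUnit=minute does.
round_down_epoch() {
  local ts=$1 minutes=$2
  local step=$(( minutes * 60 ))
  echo $(( ts - ts % step ))
}

# bucket_dir EPOCH_SECONDS MINUTES
# Formats the rounded time with the same escapes as hdfs.path above.
# Assumes GNU date (-d @EPOCH); UTC is used so the result is repeatable.
bucket_dir() {
  date -u -d "@$(round_down_epoch "$1" "$2")" '+/flume/events/%y-%m-%d/%H%M/'
}
```

For an event at 1568738583 (2019-09-17 16:43:03 UTC), `round_down_epoch 1568738583 10` prints 1568738400, so every event from 16:40:00 up to 16:49:59 lands in the same 1640 directory.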

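Case 4: Collecting Data from a File (Method 2)

The Taildir source can resume from where it stopped, avoids losing data, and still monitors files in real time. To keep data produced just before 00:00 from being written into the next day's directory, a timestamp interceptor is added on the source: it stamps each event with a timestamp header, and the HDFS sink uses that header instead of its own machine's clock at write time.

The Taildir source watches multiple files that match a pattern (a regular expression on the file name). Because it tracks each file by offset, a file is not collected again even if it is renamed.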
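
Offset tracking is what makes resuming possible. The function below is a minimal sketch of the idea, not the Taildir implementation: `read_new_bytes` is a hypothetical helper that stores a single byte offset in a plain file, whereas the real source keeps a JSON position file with an inode, position, and path per tracked file.

```shell
#!/bin/bash
# read_new_bytes FILE POSFILE  (hypothetical helper)
# Prints only the bytes appended to FILE since the previous call,
# then records the new offset in POSFILE so a restart resumes from there.
read_new_bytes() {
  local file=$1 posfile=$2
  local offset=0 size

  # Load the offset saved by the previous run, if any.
  if [ -f "$posfile" ]; then
    offset=$(cat "$posfile")
  fi

  size=$(wc -c < "$file" | tr -d '[:space:]')

  if [ "$size" -gt "$offset" ]; then
    # tail -c +N starts at byte N (1-based); head limits output to the bytes counted above.
    tail -c "+$(( offset + 1 ))" "$file" | head -c "$(( size - offset ))"
    echo "$size" > "$posfile"
  fi
}
```

Calling it twice in a row on an unchanged file prints nothing the second time, which is the behavior that prevents duplicate collection after a restart.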

1. Create tail-hdfs.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = g1
a1.sources.r1.filegroups.g1 = /logdata/a.*
a1.sources.r1.fileHeader = true

# Add a timestamp interceptor (attached to source r1)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.headerName = timestamp

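# Sink (delivery target)
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
# Target directory; Flume substitutes the time escape sequences
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
# File name prefix
a1.sinks.k1.hdfs.filePrefix = events-

# Round the timestamp down to 10 minutes, so the directory changes every 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

# Seconds to wait before rolling the current file
a1.sinks.k1.hdfs.rollInterval = 120
# File size that triggers a roll (bytes)
a1.sinks.k1.hdfs.rollSize = 268435456
# Number of events written to a file before it is rolled
a1.sinks.k1.hdfs.rollCount = 20

# Number of events written to the file before it is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 1000

# Do not use the local time; format the directory from the timestamp header set by the interceptor
a1.sinks.k1.hdfs.useLocalTimeStamp = false

# File type of the output: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream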

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

2. Start command

bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1

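Case 5: Chaining Flume Agents

Use case: the network is complex, or there are many log servers that each produce little traffic, so their data needs to be aggregated.

Two configuration files are needed, one on each of two machines: one agent acts as the sender and the other as the collector (which writes to Kafka in this example).

1. Write tail-avro-.conf (the sender)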

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = g1
a1.sources.r1.filegroups.g1 = /logdata/a.*
a1.sources.r1.fileHeader = false

a1.channels.c1.type = file

a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = doitedu02
a1.sinks.k1.port = 4444

2. Write avro-fakfa.conf (the collector, on doitedu02)

a1.sources = r1
a1.sinks = k1
a1.channels = c1


a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = doitedu02
a1.sources.r1.port = 4444
a1.sources.r1.batchSize = 100

a1.channels.c1.type = file

a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = doitedu01:9092,doitedu02:9092,doitedu03:9092
a1.sinks.k1.kafka.topic = doitedu17
a1.sinks.k1.kafka.producer.acks = 1

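3. Start avro-fakfa.conf first, then tail-avro-.conf, so that the Avro source is already listening when the upstream Avro sink connects


bin/flume-ng agent -c conf -f conf/avro-fakfa.conf -n a1 -Dflume.root.logger=INFO,console

bin/flume-ng agent -c conf -f conf/tail-avro-.conf -n a1

4. Basic Kafka commands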

## List topics
## (Kafka 3.0 and later removed --zookeeper from kafka-topics.sh; use --bootstrap-server doitedu01:9092 there)
bin/kafka-topics.sh --list --zookeeper doitedu01:2181

## Create a topic
bin/kafka-topics.sh --create --topic topic2 --partitions 2 --replication-factor 2 --zookeeper doitedu01:2181

## Start a console producer to produce data
bin/kafka-console-producer.sh --broker-list doitedu01:9092,doitedu02:9092,doitedu03:9092 --topic topic2
>hello tom

## Start a console consumer to consume data
bin/kafka-console-consumer.sh --bootstrap-server doitedu01:9092,doitedu02:9092,doitedu03:9092 --topic topic2 --from-beginning

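Case 6: Flume Channel Selectors

One source can feed several channels. How the source's events are distributed among those channels is controlled by the selector, which is configured on the source component.

1. Replicating selector

Every event is copied to both channels: one channel feeds Kafka and the other feeds HDFS.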

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.channels = c1 c2
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = g1
a1.sources.r1.filegroups.g1 = /logdata/a.*
a1.sources.r1.fileHeader = false
a1.sources.r1.selector.type = replicating

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.headerName = timestamp

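# transactionCapacity must cover the HDFS sink's batchSize of 1000 (the memory channel default is 100)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c2.type = memory
a1.channels.c2.capacity = 2000
a1.channels.c2.transactionCapacity = 1000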

a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = doitedu01:9092,doitedu02:9092,doitedu03:9092
a1.sinks.k1.kafka.topic = doitedu17
a1.sinks.k1.kafka.producer.acks = 1

a1.sinks.k2.channel = c2
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://doitedu01:8020/flumedata/%Y-%m-%d/%H
a1.sinks.k2.hdfs.filePrefix = doitedu-log-
a1.sinks.k2.hdfs.fileSuffix = .log
a1.sinks.k2.hdfs.rollSize = 268435456
a1.sinks.k2.hdfs.rollInterval = 120
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.batchSize = 1000
a1.sinks.k2.hdfs.fileType = CompressedStream
a1.sinks.k2.hdfs.codeC = snappy
a1.sinks.k2.hdfs.useLocalTimeStamp = false

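2. Multiplexing selector

The data from one source may contain several kinds of records. An interceptor tags each event with its kind, and the multiplexing selector then routes events to different channels by that tag. Here one channel is written to Kafka and the other to HDFS.

2.1 Write the interceptor, package it as a jar, and put the jar in Flume's lib directory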

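package cn.doitedu.yiee.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MultiplexingInterceptor implements Interceptor {
    // Zero-based indexes of the business-flag field and the timestamp field
    private final int flagfield;
    private final int timestampfield;

    public MultiplexingInterceptor(int flagfield, int timestampfield) {
        this.flagfield = flagfield;
        this.timestampfield = timestampfield;
    }

    /**
     * Initialization work after the interceptor instance is constructed
     */
    @Override
    public void initialize() {
    }

    // Log format:
    // u01,ev1,mall,1568738583468
    @Override
    public Event intercept(Event event) {
        // Derive the header values from the event body and the configured field indexes
        String line = new String(event.getBody(), StandardCharsets.UTF_8);
        String[] split = line.split(",");

        // Drop lines with too few fields instead of throwing ArrayIndexOutOfBoundsException
        if (split.length <= Math.max(flagfield, timestampfield)) {
            return null;
        }

        // Extract the business flag and add it to the headers
        event.getHeaders().put("flag", split[flagfield]);
        // Extract the event timestamp and add it to the headers
        event.getHeaders().put("timestamp", split[timestampfield]);

        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        // Keep only the events that were not dropped
        List<Event> kept = new ArrayList<>(list.size());
        for (Event event : list) {
            Event out = intercept(event);
            if (out != null) {
                kept.add(out);
            }
        }
        return kept;
    }

    /**
     * Cleanup work before the interceptor is destroyed
     */
    @Override
    public void close() {
    }

    public static class MultiplexingInterceptorBuilder implements Interceptor.Builder {
        private int flagfield;
        private int timestampfield;

        /**
         * Builds an interceptor instance
         */
        @Override
        public Interceptor build() {
            return new MultiplexingInterceptor(flagfield, timestampfield);
        }

        /**
         * Entry point for reading the configured parameters
         */
        @Override
        public void configure(Context context) {
            // Fall back to index 0 when a parameter is not configured
            flagfield = context.getInteger("flagfield", 0);
            timestampfield = context.getInteger("timestampfield", 0);
        }
    }
}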
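
To see which header values the interceptor would produce for a given log line, the same field extraction can be tried in shell. This is only an illustration of the splitting logic; `field_at` and `headers_for` are hypothetical helpers, and the indexes 2 and 3 are the flagfield and timestampfield values used in the configuration of this case.

```shell
#!/bin/bash
# field_at LINE INDEX
# Prints the comma-separated field at a zero-based index, like split[index] in the Java code.
field_at() {
  local line=$1 index=$2
  # cut counts fields from 1, so shift the zero-based index by one.
  printf '%s\n' "$line" | cut -d, -f "$(( index + 1 ))"
}

# headers_for LINE
# Prints the two headers the interceptor would set, assuming flagfield=2 and timestampfield=3.
headers_for() {
  local line=$1
  echo "flag=$(field_at "$line" 2) timestamp=$(field_at "$line" 3)"
}
```

For the sample line in the code comment, `headers_for "u01,ev1,mall,1568738583468"` prints `flag=mall timestamp=1568738583468`.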

2.2 Script that simulates log generation

while true
do
    if [ $((RANDOM % 2)) -eq 0 ]
    then
        echo "u$RANDOM,e1,waimai,$(date +%s)000" >> a.log
    else
        echo "u$RANDOM,e1,mall,$(date +%s)000" >> a.log
    fi
    sleep 0.2
done

2.3 Write the configuration file

	a1.sources = r1
	a1.channels = c1 c2
	a1.sinks = k1 k2
	
	a1.sources.r1.channels = c1 c2
	a1.sources.r1.type = TAILDIR
	a1.sources.r1.filegroups = g1
	a1.sources.r1.filegroups.g1 = /logdata/a.*
	a1.sources.r1.fileHeader = false
	
	a1.sources.r1.interceptors = i1
	a1.sources.r1.interceptors.i1.type = cn.doitedu.yiee.flume.MultiplexingInterceptor$MultiplexingInterceptorBuilder
	a1.sources.r1.interceptors.i1.flagfield = 2
	a1.sources.r1.interceptors.i1.timestampfield = 3
	
	a1.sources.r1.selector.type = multiplexing
	a1.sources.r1.selector.header = flag
	a1.sources.r1.selector.mapping.mall = c1
	a1.sources.r1.selector.mapping.waimai = c2
	a1.sources.r1.selector.default = c2
	
	
	a1.channels.c1.type = memory
	a1.channels.c1.capacity = 2000
	a1.channels.c1.transactionCapacity = 1000
	
	a1.channels.c2.type = memory
	a1.channels.c2.capacity = 2000
	a1.channels.c2.transactionCapacity = 1000
	
	
	a1.sinks.k1.channel = c1
	a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
	a1.sinks.k1.kafka.bootstrap.servers = doitedu01:9092,doitedu02:9092,doitedu03:9092
	a1.sinks.k1.kafka.topic = mall
	a1.sinks.k1.kafka.producer.acks = 1
	
	
	a1.sinks.k2.channel = c2
	a1.sinks.k2.type = hdfs
	a1.sinks.k2.hdfs.path = hdfs://doitedu01:8020/waimai/%Y-%m-%d/%H
	a1.sinks.k2.hdfs.filePrefix = doitedu-log-
	a1.sinks.k2.hdfs.fileSuffix = .log
	a1.sinks.k2.hdfs.rollSize = 268435456
	a1.sinks.k2.hdfs.rollInterval = 120
	a1.sinks.k2.hdfs.rollCount = 0
	a1.sinks.k2.hdfs.batchSize = 1000
	a1.sinks.k2.hdfs.fileType = DataStream
	a1.sinks.k2.hdfs.useLocalTimeStamp = false
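
The routing rule expressed by the selector.mapping and selector.default lines above boils down to a lookup on the flag header. The function below is a sketch of that lookup only, with a hypothetical name `route_channel`; it does not reproduce how Flume implements the multiplexing selector.

```shell
#!/bin/bash
# route_channel FLAG
# Prints the channel that the mapping in the configuration above assigns to a flag value.
route_channel() {
  case "$1" in
    mall)   echo c1 ;;  # selector.mapping.mall = c1  (Kafka sink)
    waimai) echo c2 ;;  # selector.mapping.waimai = c2 (HDFS sink)
    *)      echo c2 ;;  # selector.default = c2
  esac
}
```

So `route_channel mall` prints c1, `route_channel waimai` prints c2, and any other flag value, or an empty one, also falls through to c2.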

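Case 7: Automatic Failover

When several sinks are attached to one channel, no special configuration is needed by default: each sink takes events from the channel on its own, which behaves roughly like load sharing. For automatic failover, the sinks must be put into a sink group that uses the failover sink processor. Under normal conditions only one sink (the one with the highest priority) does the work, and events are switched to another sink only after that sink fails.

[Figure: failover topology; image not available]
By default, events follow the blue line in the figure. If the machine on the blue line goes down, they follow the green line.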
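
The selection rule of a failover group, stated above, is "the live sink with the highest priority wins". The function below sketches just that rule under simplifying assumptions: `pick_sink` is a hypothetical helper, each sink is described as name:priority:state with state up or down, and the penalty back-off that the real processor applies to failed sinks is left out.

```shell
#!/bin/bash
# pick_sink NAME:PRIORITY:STATE ...
# Prints the name of the sink that is up and has the highest priority (empty if none is up).
pick_sink() {
  local best_name="" best_prio=-1
  local entry name rest prio state

  for entry in "$@"; do
    # Split "name:priority:state" with plain parameter expansion.
    name=${entry%%:*}
    rest=${entry#*:}
    prio=${rest%%:*}
    state=${rest#*:}

    if [ "$state" = "up" ] && [ "$prio" -gt "$best_prio" ]; then
      best_name=$name
      best_prio=$prio
    fi
  done

  echo "$best_name"
}
```

With the priorities used in this case, `pick_sink k1:200:up k2:100:up` prints k1, and `pick_sink k1:200:down k2:100:up` prints k2.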

1. Chained high-availability configuration, level 1

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

a1.sources.r1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = g1
a1.sources.r1.filegroups.g1 = /logdata/a.*
a1.sources.r1.fileHeader = false


a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 1000


a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = doitedu02
a1.sinks.k1.port = 4444


a1.sinks.k2.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = doitedu03
a1.sinks.k2.port = 4444


a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 200
a1.sinkgroups.g1.processor.priority.k2 = 100
a1.sinkgroups.g1.processor.maxpenalty = 5000

2. Chained high-availability configuration, level 2 (node 1)

a1.sources = r1
a1.sinks = k1
a1.channels = c1


a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = doitedu02
a1.sources.r1.port = 4444
a1.sources.r1.batchSize = 100


a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = doitedu01:9092,doitedu02:9092,doitedu03:9092
a1.sinks.k1.kafka.topic = failover
a1.sinks.k1.kafka.producer.acks = 1

3. Chained high-availability configuration, level 2 (node 2)

a1.sources = r1
a1.sinks = k1
a1.channels = c1


a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = doitedu03
a1.sources.r1.port = 4444
a1.sources.r1.batchSize = 100


a1.channels.c1.type = memory
a1.channels.c1.capacity = 2000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = doitedu01:9092,doitedu02:9092,doitedu03:9092
a1.sinks.k1.kafka.topic = failover
a1.sinks.k1.kafka.producer.acks = 1

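IV. Monitoring Flume

To check whether a running Flume agent is healthy and whether its throughput is normal, its metrics can be displayed with Ganglia by adding JVM options to the start command:

-Dflume.monitoring.type=ganglia -Dflume.monitoring.port=34890

Note: according to the Flume user guide, Ganglia reporting takes the Ganglia server addresses as -Dflume.monitoring.hosts (a comma-separated list of host:port), while -Dflume.monitoring.port belongs to the http monitoring type.

Ganglia is a general-purpose cluster operations monitoring system. It installs a "probe" on every machine whose status is to be monitored. The probes collect status information on their machines (CPU load, memory load, disk I/O load, network I/O load, and the status of various applications), send it to a central aggregation point, and a web page presents it graphically.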
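
For a quick look without Ganglia, Flume's http monitoring type serves the same counters as JSON. The helper below is a rough sketch for pulling one value out of that JSON with standard tools; `metric_value` is a hypothetical name, port 34545 and localhost are assumptions, and the pattern assumes compact JSON in which values are quoted strings.

```shell
#!/bin/bash
# metric_value NAME
# Reads Flume's JSON metrics on stdin and prints the first value found for NAME.
metric_value() {
  grep -o "\"$1\":\"[^\"]*\"" | head -n 1 | cut -d '"' -f 4
}

# Example against an agent started with (port is an assumption):
#   -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
# Then query the metrics endpoint:
#   curl -s http://localhost:34545/metrics | metric_value ChannelSize
```

A proper JSON parser is the more robust choice when one is available; the grep pattern is only meant for ad hoc checks.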

V. Monitoring the Flume Process and Restarting It Automatically

#!/bin/bash

export FLUME_HOME=/opt/apps/flume-1.9.0
while true
do
  # Count running agents by Flume's Java main class, so that this script
  # (whose own name or path may contain "flume") is not counted
  pc=$(ps -ef | grep org.apache.flume.node.Application | grep -v "grep" | wc -l)

  if [[ $pc -lt 1 ]]
  then
    echo "detected no flume process.... preparing to launch flume agent...... "
    "${FLUME_HOME}/bin/flume-ng" agent -n a1 -c "${FLUME_HOME}/conf/" -f "${FLUME_HOME}/agentconf/failover.properties" 1>/dev/null 2>&1 &
  else
    echo "detected flume process number is : $pc "
  fi

  sleep 1

done