Passing data between Flume agents over Avro

This post walks through a concrete Flume configuration for log collection: passing data between Flume agents over Avro and finally landing the logs in HDFS. It also raises the timeout problems that appeared once the data volume grew.


For passing data between Flume agents, Avro is the only way I currently know of. I'm hoping someone more experienced can advise whether it would work to send UDP directly to a Flume syslogudp source and then relay it through an agent to HDFS instead. The configuration and startup command are pasted below:

# component names
agent.sources = execSource
agent.channels = memoryChannel
agent.sinks = k1 k2


#source
# exec source: tail the nginx stat log; events flow into memoryChannel
agent.sources.execSource.type = exec
agent.sources.execSource.command = tail -F /opt/logs/nginx/stat.log
agent.sources.execSource.channels = memoryChannel

# interceptors (I don't fully understand these myself):
# "host" adds the agent's host as an event header,
# "static" adds a fixed header, here port=8080
agent.sources.execSource.interceptors = hostInterceptor staticInterceptor
agent.sources.execSource.interceptors.hostInterceptor.type = host
agent.sources.execSource.interceptors.staticInterceptor.type = static
agent.sources.execSource.interceptors.staticInterceptor.key = port
agent.sources.execSource.interceptors.staticInterceptor.value = 8080
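# note: the headers set by these interceptors travel with every event, so a
# downstream sink can reference them in escape sequences. As an illustration
# only (not part of this setup), an HDFS sink path could partition on them:
#   agent.sinks.k1.hdfs.path = hdfs://namenode:9000/nginx/stat/%{host}/%{port}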

#sink
# avro sinks pointing at the receiving agent (destination ip / port)
agent.sinks.k1.type = avro
agent.sinks.k1.channel = memoryChannel
agent.sinks.k1.hostname = 10.10.10.10
agent.sinks.k1.port = 10000
# connection timeout; the default is 20000 ms
agent.sinks.k1.connect-timeout = 200000

agent.sinks.k2.type = avro
agent.sinks.k2.channel = memoryChannel
agent.sinks.k2.hostname = 10.10.10.10
agent.sinks.k2.port = 10001
agent.sinks.k2.connect-timeout = 200000
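# note: besides connect-timeout, the avro sink also has a per-request timeout
# and a batch size (defaults 20000 ms and 100 events, if I remember the docs
# correctly); raising them may be worth a try when the volume grows, e.g.:
#   agent.sinks.k1.request-timeout = 200000
#   agent.sinks.k1.batch-size = 1000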


#sink group
# sink group: load-balance across k1 and k2, selecting sinks round-robin
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = k1 k2
agent.sinkgroups.g1.processor.type = load_balance
agent.sinkgroups.g1.processor.selector = round_robin
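# note: the load-balancing processor can also temporarily back off a failing
# sink instead of retrying it right away; a possible addition (not in the
# original config, so treat as a sketch):
#   agent.sinkgroups.g1.processor.backoff = true
#   agent.sinkgroups.g1.processor.selector.maxTimeOut = 30000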


#channel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100000
agent.channels.memoryChannel.transactionCapacity = 10000
# without the next two settings the channel times out easily under load
agent.channels.memoryChannel.keep-alive = 60
agent.channels.memoryChannel.write-timeout = 20
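If the memory channel keeps backing up as the volume grows, a file channel is one alternative worth considering: it is slower, but it spools events to disk and survives agent restarts. A minimal sketch (the channel name and the checkpoint/data directories below are placeholders, not part of the original setup):

agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /opt/data/flume/checkpoint
agent.channels.fileChannel.dataDirs = /opt/data/flume/data
agent.channels.fileChannel.capacity = 1000000
agent.channels.fileChannel.transactionCapacity = 10000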


The receiving agent's configuration is roughly the same idea, except that the source is an Avro source listening on the corresponding IP and port, and the data is forwarded to HDFS.

agent.sources = collection-nginx-stat
agent.channels = mem-nginx-stat
agent.sinks = k1 k2


# source & sink channel
agent.sources.collection-nginx-stat.channels = mem-nginx-stat
agent.sinks.k1.channel = mem-nginx-stat
agent.sinks.k2.channel = mem-nginx-stat


# source ip port binding
agent.sources.collection-nginx-stat.type = avro
agent.sources.collection-nginx-stat.bind = 10.10.10.10
agent.sources.collection-nginx-stat.port = 10000
agent.sources.collection-nginx-stat.interceptors = host-interceptor
agent.sources.collection-nginx-stat.interceptors.host-interceptor.type = host
agent.sources.collection-nginx-stat.interceptors.host-interceptor.preserveExisting = true
agent.sources.collection-nginx-stat.interceptors.host-interceptor.useIP = true
agent.sources.collection-nginx-stat.interceptors.host-interceptor.hostHeader = host


# channel property
agent.channels.mem-nginx-stat.type = memory
agent.channels.mem-nginx-stat.capacity = 1000000
agent.channels.mem-nginx-stat.transactionCapacity=10000
agent.channels.mem-nginx-stat.keep-alive=60


# sink property
agent.sinks.k1.type = hdfs
agent.sinks.k1.serializer = text
agent.sinks.k1.hdfs.path = hdfs://namenode:9000/nginx/stat/%y%m%d/%H/%{host}
agent.sinks.k1.hdfs.filePrefix = logData.sink1
agent.sinks.k1.hdfs.useLocalTimeStamp = true
agent.sinks.k1.hdfs.rollSize = 128000000
agent.sinks.k1.hdfs.rollInterval = 600
agent.sinks.k1.hdfs.rollCount = 3000000
agent.sinks.k1.hdfs.batchSize = 5000
agent.sinks.k1.hdfs.callTimeout = 300000
agent.sinks.k1.hdfs.writeFormat = Text
agent.sinks.k1.hdfs.fileType = DataStream


agent.sinks.k2.type = hdfs
agent.sinks.k2.serializer = text
agent.sinks.k2.hdfs.path = hdfs://namenode:9000/nginx/stat/%y%m%d/%H/%{host}
agent.sinks.k2.hdfs.filePrefix = logData.sink2
agent.sinks.k2.hdfs.useLocalTimeStamp = true
agent.sinks.k2.hdfs.rollSize = 128000000
agent.sinks.k2.hdfs.rollInterval = 600
agent.sinks.k2.hdfs.rollCount = 3000000
agent.sinks.k2.hdfs.batchSize = 5000
agent.sinks.k2.hdfs.callTimeout = 300000
agent.sinks.k2.hdfs.writeFormat = Text
agent.sinks.k2.hdfs.fileType = DataStream


#sink group
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = k1 k2
agent.sinkgroups.g1.processor.type = failover
agent.sinkgroups.g1.processor.priority.k1 = 5
agent.sinkgroups.g1.processor.priority.k2 = 10
agent.sinkgroups.g1.processor.maxpenalty = 10000
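On the HDFS side, two settings that might be relevant when transfers stall (they are not in the original config, so treat this as a sketch): hdfs.idleTimeout closes files that have not been written to for the given number of seconds, and hdfs.threadsPoolSize controls how many worker threads each HDFS sink uses (the default is 10, as far as I know):

agent.sinks.k1.hdfs.idleTimeout = 300
agent.sinks.k1.hdfs.threadsPoolSize = 20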



And the log4j.properties configuration:

#flume.root.logger=DEBUG,console
flume.root.logger=INFO,LOGFILE
flume.log.dir=/opt/logs/flume-nginx-stat
flume.log.file=flume.log


log4j.logger.org.apache.flume.lifecycle = INFO
log4j.logger.org.jboss = WARN
log4j.logger.org.mortbay = INFO
log4j.logger.org.apache.avro.ipc.NettyTransceiver = WARN
log4j.logger.org.apache.hadoop = INFO


# Define the root logger to the system property "flume.root.logger".
log4j.rootLogger=${flume.root.logger}

# Stock log4j rolling file appender
# Default log rotation configuration
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.LOGFILE.MaxFileSize=100MB
log4j.appender.LOGFILE.MaxBackupIndex=10
log4j.appender.LOGFILE.File=${flume.log.dir}/${flume.log.file}
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.LOGFILE.layout.ConversionPattern=%d{dd MMM yyyy HH:mm:ss,SSS} %-5p [%t] (%C.%M:%L) %x - %m%n

# Warning: If you enable the following appender it will fill up your disk if you don't have a cleanup job!
# This uses the updated rolling file appender from log4j-extras that supports a reliable time-based rolling policy.
# See http://logging.apache.org/log4j/companions/extras/apidocs/org/apache/log4j/rolling/TimeBasedRollingPolicy.html
# Add "DAILY" to flume.root.logger above if you want to use this
log4j.appender.DAILY=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.DAILY.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.DAILY.rollingPolicy.ActiveFileName=${flume.log.dir}/${flume.log.file}
log4j.appender.DAILY.rollingPolicy.FileNamePattern=${flume.log.dir}/${flume.log.file}.%d{yyyy-MM-dd}
log4j.appender.DAILY.layout=org.apache.log4j.PatternLayout
log4j.appender.DAILY.layout.ConversionPattern=%d{dd MMM yyyy HH:mm:ss,SSS} %-5p [%t] (%C.%M:%L) %x - %m%n

# console
# Add "console" to flume.root.logger above if you want to use this
# log4j.appender.console=org.apache.log4j.ConsoleAppender
# log4j.appender.console.target=System.err
# log4j.appender.console.layout=org.apache.log4j.PatternLayout
# log4j.appender.console.layout.ConversionPattern=%d (%t) [%p - %l] %m%n


And flume-env.sh:

# Environment variables can be set here.


JAVA_HOME=/opt/apps/jdk


# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
JAVA_OPTS="-Xms4g -Xmx4g -Dcom.sun.management.jmxremote"


# Note that the Flume conf directory is always included in the classpath.
FLUME_CLASSPATH="/opt/conf/flume-nginx-stat"


Startup command:

nohup /opt/apps/flume/bin/flume-ng agent -n agent --conf /opt/conf/flume-nginx-stat --conf-file /opt/conf/flume-nginx-stat/flume-conf.properties -Dflume.monitoring.type=http -Dflume.monitoring.port=23403 >> /opt/logs/flume-nginx-stat/nohup.out 2>&1 &
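Since the command enables HTTP monitoring, the agent should expose its counters as JSON; if I remember the docs correctly the path is /metrics, so a quick check of channel fill and sink drain counts would look like:

curl http://localhost:23403/metrics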

Now the actual cry for help: recently the log volume has grown, and the transfer keeps timing out partway through, sometimes reporting FAIL and sometimes connect time out. Does anyone know what is going on? I'm thinking of sending the data over UDP instead and having the receiver listen with a syslogudp source, but I don't know how to configure the sending sink.
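For reference, the receiving side of that UDP idea would presumably use the built-in syslogudp source; a minimal sketch, reusing the existing channel name (the port number here is a placeholder):

agent.sources = udp-nginx-stat
agent.sources.udp-nginx-stat.type = syslogudp
agent.sources.udp-nginx-stat.host = 10.10.10.10
agent.sources.udp-nginx-stat.port = 5140
agent.sources.udp-nginx-stat.channels = mem-nginx-stat

The harder part is the sending side: as far as I know, stock Flume does not ship a syslog/UDP sink, so the sender would either have to emit UDP directly from the application (for example nginx's own syslog access_log output) or use a custom sink.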
