Application with id application_xxx doesn't exist in RM: how to fix it

I searched YARN's resourcemanager.log for the application ID from the error. It only appeared in the log written during the run before the error, in this line:

Application application_1749524555394_0003 requests cleared
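The log search above can be wrapped in a small shell function. This is a minimal sketch: the function name is hypothetical, and the log location in the usage line below (`$HADOOP_HOME/logs/yarn-*-resourcemanager-*.log`) is an assumption that varies between installations.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every line in the given RM log files that
# mentions an application ID, prefixed with the file name.
# Exit status follows grep: 0 if found, 1 if not found, 2 on errors.
find_app_in_rm_log() {
  local app_id="$1"
  shift
  # -F matches the ID as a fixed string, -H always prints the file name,
  # and -- stops option parsing in case an argument starts with '-'.
  grep -F -H -- "$app_id" "$@"
}
```

Example (log path assumed): `find_app_in_rm_log application_1749524555394_0003 "$HADOOP_HOME"/logs/yarn-*-resourcemanager-*.log`. An exit status of 1 means the current logs never mention the ID.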

The Flink startup log prints the path of the YARN properties file, so I took a look at that file:

rm /tmp/.yarn-properties-root

Sure enough, the failing application ID was recorded in it. After deleting the file and restarting Flink, the error was gone.
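The check-and-delete step can be scripted. This is a sketch under stated assumptions: the function names are hypothetical, the application ID is assumed to be stored under the key `applicationID` (verify this against your own file), and a non-zero exit status from `yarn application -status` is treated as "the RM no longer knows this application", although other failures such as an unreachable RM also produce one.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the YARN application ID recorded in a Flink
# session properties file. Assumes the ID is stored under the key applicationID.
session_app_id() {
  local props="$1"
  [ -r "$props" ] || return 1
  # Print only the value of the applicationID entry; comment lines such as
  # "#Generated YARN properties file" are skipped. The last entry wins.
  sed -n 's/^applicationID[[:space:]]*=[[:space:]]*//p' "$props" | tail -n 1
}

# Hypothetical helper: delete the properties file when the RM no longer knows
# the recorded application. Assumes `yarn` is on PATH and that
# `yarn application -status` exits non-zero for an unknown ID (it also exits
# non-zero for other failures, e.g. an unreachable RM).
remove_stale_session_file() {
  local props="${1:-/tmp/.yarn-properties-$(id -un)}"
  local app_id
  app_id="$(session_app_id "$props")"
  if [ -z "$app_id" ]; then
    echo "no applicationID found in $props" >&2
    return 0
  fi
  if yarn application -status "$app_id" >/dev/null 2>&1; then
    echo "$app_id is known to the RM; keeping $props"
  else
    echo "$app_id is not known to the RM; removing $props"
    rm -f -- "$props"
  fi
}
```

Example: `remove_stale_session_file /tmp/.yarn-properties-root`, then restart the Flink session.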

The official documentation explains:

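The session mode will create a hidden YARN properties file in /tmp/.yarn-properties-<username>, which will be picked up for cluster discovery by the command line interface when submitting a job.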

YARN | Apache Flink: https://nightlies.apache.org/flink/flink-docs-release-1.17/zh/docs/deployment/resource-providers/yarn/
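Because the CLI discovers the session through this file, it helps to know exactly which file it will read. A sketch, assuming a Flink 1.17-style `flink-conf.yaml` in which the optional key `yarn.properties-file.location` overrides the directory and `/tmp` (the usual `java.io.tmpdir` on Linux) is used otherwise. The function name is hypothetical and the YAML parsing is deliberately naive: flat `key: value` lines only, no quoted values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the path of the hidden YARN properties file the
# Flink CLI would read, given the path to flink-conf.yaml.
yarn_properties_file() {
  local conf="$1" dir=""
  if [ -r "$conf" ]; then
    # Naive flat-YAML lookup: take the value after the key, trim trailing
    # whitespace, keep the last occurrence. Commented-out lines do not match.
    dir="$(sed -n 's/^yarn\.properties-file\.location:[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p' "$conf" | tail -n 1)"
  fi
  # Fall back to /tmp; the file name ends with the current OS user name.
  printf '%s/.yarn-properties-%s\n' "${dir:-/tmp}" "$(id -un)"
}
```

Example: `yarn_properties_file "$FLINK_HOME/conf/flink-conf.yaml"` prints `/tmp/.yarn-properties-root` when run as root without the override.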

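Below is a troubleshooting guide for the Hadoop error `RemoteException: Application with id ... doesn't exist in RM`:

---

### **1. Possible causes**

1. **Invalid application ID**
   - Check that the ID is well-formed (standard format: `application_<cluster timestamp>_<sequence number>`)
   - Check whether the RM knows the ID (without `-appStates`, `-list` only shows applications in the SUBMITTED, ACCEPTED and RUNNING states):
     ```bash
     yarn application -list | grep <application_id>
     ```

2. **Application record evicted**
   The ResourceManager (RM) keeps only a bounded number of completed applications, controlled by `yarn.resourcemanager.max-completed-applications` (in memory) and `yarn.resourcemanager.state-store.max-completed-applications` (in the state store). The limit is a count, not an age. Querying an application that has been evicted triggers this error[^1].

3. **Inconsistent cluster state**
   - **ResourceManager restart**: without RM recovery (`yarn.resourcemanager.recovery.enabled`), the RM loses all application state when it restarts
   - **NodeManager problems**: container state is not synchronized (check `nodemanager.log`)
   - **Submission never completed**: the client failed (timeout or network failure) before the RM accepted the application, so the ID was never recorded

4. **Permission issues**
   - The user has no access (is not the submitter, or ACLs are not configured)
   - Kerberos authentication fails when querying across clusters

5. **Timing**
   An ID queried before its submission has completed may not be known to the RM yet. Once the RM has accepted the submission, the application is listed (in the ACCEPTED state) even while its ApplicationMaster (AM) is still starting.

---

### **2. Troubleshooting steps**

#### **1. Verify the application state**
```bash
# List applications in every state, including finished ones
yarn application -list -appStates ALL | grep <app_id>

# Look for the ID in the RM logs (the log directory depends on the installation)
grep <app_id> /var/log/hadoop-yarn/resourcemanager/*.log
```

#### **2. Check the retention settings**
- If the application is a **historical job**:
  - Check `yarn.resourcemanager.max-completed-applications` and `yarn.resourcemanager.state-store.max-completed-applications` (the latter defaults to the former; the default value depends on the Hadoop version)
  - Check whether the Timeline Server is enabled (`yarn.timeline-service.enabled` in `yarn-site.xml`)

#### **3. Key log entries**
| Log | What to look for |
|------------------|----------------------------------|
| **ResourceManager** | State transitions for the application ID (e.g. `State change from ACCEPTED to RUNNING`) and AM registration |
| **ApplicationMaster** | Registration with the RM, or repeated connection retries |
| **NodeManager** | State transitions of the application's containers (e.g. `transitioned from LOCALIZED to RUNNING`) |

#### **4. Permission check**
```bash
# Run the query as the user who submitted the application
sudo -u <submit_user> yarn application -status <app_id>
```

#### **5. Cluster state check**
```bash
# Check RM/NM service state
yarn rmadmin -getServiceState rm1   # prints active or standby (RM HA only; rm1 is an ID from yarn.resourcemanager.ha.rm-ids)
yarn node -list -all | grep <node_ip>   # node state
```

---

### **3. Solutions**
1. **Wait and retry**
   If the job is still being submitted, wait until the client reports a successful submission before querying again.
2. **Adjust the retention policy**
   Add the following to `yarn-site.xml` (the state-store limit is capped by the in-memory limit, so raise both):
   ```xml
   <property>
     <name>yarn.resourcemanager.max-completed-applications</name>
     <value>20000</value> <!-- completed applications kept in memory -->
   </property>
   <property>
     <name>yarn.resourcemanager.state-store.max-completed-applications</name>
     <value>20000</value> <!-- completed applications kept in the state store -->
   </property>
   ```
3. **Repair the state store**
   - If the state is stored in **ZooKeeper**:
     ```bash
     # /rmstore is the default yarn.resourcemanager.zk-state-store.parent-path
     zkCli.sh -server <zk_host>:2181 ls /rmstore/ZKRMStateRoot
     ```
   - If the state is stored in **HDFS**:
     check the permissions of the directory set by `yarn.resourcemanager.fs.state-store.uri`: `hdfs dfs -ls <state_store_uri>`
4. **Clear stale state**
   Use with caution: this wipes all stored application state. Stop the RM, format the state store, then start the RM again:
   ```bash
   yarn resourcemanager -format-state-store
   ```
5. **Query the REST API directly**
   Bypass the CLI and ask the RM itself:
   ```bash
   curl http://<rm_host>:8088/ws/v1/cluster/apps/<app_id>
   ```

---

### **4. Prevention**
1. **Increase the AM liveness expiry interval**
   ```xml
   <property>
     <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
     <value>3600000</value> <!-- default 600000 (10 minutes), raised to 60 minutes -->
   </property>
   ```
2. **Enable high availability (HA)**
   Configure multiple ResourceManagers to avoid a single point of failure, and enable RM recovery so that application state survives failovers and restarts.

> **Typical case**: on one cluster a full disk made the RM state store fail, so records could not be persisted after the AM registered. Freeing disk space and restarting the RM resolved it[^2].

---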
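The format check from cause 1 can be scripted, and the ID carries one more clue: its first number is the cluster timestamp of the ResourceManager that issued it. A minimal sketch with hypothetical function names; it assumes that the `startedOn` field returned by `http://<rm_host>:8088/ws/v1/cluster/info` equals that cluster timestamp. With RM recovery disabled, an ID whose timestamp does not match the running RM was issued by an earlier RM run and cannot be found, which could explain a stale ID such as the one left in the Flink properties file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: validate a YARN application ID and print its cluster
# timestamp (the issuing RM's start time in ms) and its sequence number.
parse_app_id() {
  local id="$1"
  if [[ "$id" =~ ^application_([0-9]+)_([0-9]+)$ ]]; then
    printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    echo "malformed application ID: $id" >&2
    return 1
  fi
}

# Hypothetical helper: succeed only if the ID carries the given RM start
# timestamp, e.g. the startedOn value reported by /ws/v1/cluster/info.
issued_by_rm_started_at() {
  local id="$1" rm_started_on="$2" parsed
  parsed="$(parse_app_id "$id")" || return 1
  # The first field of the parsed output is the cluster timestamp.
  [ "${parsed%% *}" = "$rm_started_on" ]
}
```

Example: `issued_by_rm_started_at application_1749524555394_0003 <startedOn>`, with `<startedOn>` copied from the cluster info response. A failure here only proves the ID is unknown when RM recovery is disabled; with recovery enabled, recovered applications keep their original timestamps.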