Doris error: there is no scanNode Backend


Background

On March 8, the business development team reported that a Spark Streaming job scanning a Doris table (a query SQL) failed with:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 20, hd012.corp.yodao.com, executor 7): com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: errCode = 2, 
detailMessage = there is no scanNode Backend. [126101: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 14587381: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 213814: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS)]

Source Code Analysis

// The blacklist: backend id -> (countdown, reason)
private static Map<Long, Pair<Integer, String>> blacklistBackends = Maps.newConcurrentMap();

// getHost() is called during task execution; it returns a TNetworkAddress
public static TNetworkAddress getHost(long backendId,
                                      List<TScanRangeLocation> locations,
                                      ImmutableMap<Long, Backend> backends,
                                      Reference<Long> backendIdRef)

// Inside getHost(), the Backend object is looked up by backendId
Backend backend = backends.get(backendId);



// Check whether the backend is available.
// If it is, return its TNetworkAddress.
// If not, iterate over locations looking for a candidate backend:
//   - if the candidate has the same id as the unavailable backend, skip it (continue);
//   - otherwise, if the candidate is available, return that candidate BE's TNetworkAddress;
//   - if it is not available either, move on to the next candidate BE.

if (isAvailable(backend)) {
    backendIdRef.setRef(backendId);
    return new TNetworkAddress(backend.getHost(), backend.getBePort());
}  else {
    for (TScanRangeLocation location : locations) {
        if (location.backend_id == backendId) {
            continue;
        }
        // choose the first alive backend(in analysis stage, the locations are random)
        Backend candidateBackend = backends.get(location.backend_id);
        if (isAvailable(candidateBackend)) {
            backendIdRef.setRef(location.backend_id);
            return new TNetworkAddress(candidateBackend.getHost(), candidateBackend.getBePort());
        }
    }
}

public static boolean isAvailable(Backend backend) {
    return (backend != null && backend.isAlive() && !blacklistBackends.containsKey(backend.getId()));
}


// If no BE can be chosen in the end, throw an exception carrying the reason
// no backend returned
throw new UserException("there is no scanNode Backend. " +
        getBackendErrorMsg(locations.stream().map(l -> l.backend_id).collect(Collectors.toList()),
                backends, locations.size()));


// get the reason why backends can not be chosen.
private static String getBackendErrorMsg(List<Long> backendIds, ImmutableMap<Long, Backend> backends, int limit) {
    List<String> res = Lists.newArrayList();
    for (int i = 0; i < backendIds.size() && i < limit; i++) {
        long beId = backendIds.get(i);
        Backend be = backends.get(beId);
        if (be == null) {
            res.add(beId + ": not exist");
        } else if (!be.isAlive()) {
            res.add(beId + ": not alive");
        } else if (blacklistBackends.containsKey(beId)) {
            Pair<Integer, String> pair = blacklistBackends.get(beId);
            res.add(beId + ": in black list(" + (pair == null ? "unknown" : pair.second) + ")");
        } else {
            res.add(beId + ": unknown");
        }
    }
    return res.toString();
}
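Putting the pieces together, here is a minimal, self-contained sketch (my own simplification, not the Doris source; the class name, maps, and exception type are invented for illustration) of how the selection logic above ends in exactly this incident's error once every replica's BE is blacklisted:

```java
import java.util.*;

public class ScanNodeSelectionSketch {
    static Map<Long, Boolean> aliveBackends = new HashMap<>(); // backendId -> isAlive
    static Map<Long, String> blacklist = new HashMap<>();      // backendId -> reason

    // A backend is usable only if it exists, is alive, and is not blacklisted.
    static boolean isAvailable(Long backendId) {
        Boolean alive = aliveBackends.get(backendId);
        return alive != null && alive && !blacklist.containsKey(backendId);
    }

    // Mirrors getHost(): try the preferred backend, then fall back to the other
    // replica locations; if none is available, fail the way the FE does.
    static long chooseBackend(long preferred, List<Long> locations) {
        if (isAvailable(preferred)) {
            return preferred;
        }
        for (long candidate : locations) {
            if (candidate == preferred) {
                continue;
            }
            if (isAvailable(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("there is no scanNode Backend");
    }

    public static void main(String[] args) {
        aliveBackends.put(126101L, true);
        aliveBackends.put(14587381L, true);
        aliveBackends.put(213814L, true);
        // All three replicas blacklisted -> the failure seen in the Spark job.
        blacklist.put(126101L, "Ocurrs time out with specfied time 10000 MICROSECONDS");
        blacklist.put(14587381L, "Ocurrs time out with specfied time 10000 MICROSECONDS");
        blacklist.put(213814L, "Ocurrs time out with specfied time 10000 MICROSECONDS");
        try {
            chooseBackend(126101L, Arrays.asList(126101L, 14587381L, 213814L));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // there is no scanNode Backend
        }
    }
}
```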


// Where entries are put into blacklistBackends
public static void addToBlacklist(Long backendID, String reason) {
    if (backendID == null) {
        return;
    }

    blacklistBackends.put(backendID, Pair.create(FeConstants.heartbeat_interval_second + 1, reason));
    LOG.warn("add backend {} to black list. reason: {}", backendID, reason);
}
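Note the Integer half of the Pair: an entry is stored with a countdown of heartbeat_interval_second + 1, which implies blacklist membership is meant to be temporary and to decay over heartbeats rather than persist. A hedged, illustrative sketch of such a decay mechanism (my own simplification, not the actual Doris background thread):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BlacklistDecaySketch {
    // backendId -> {remainingTicks}; an int[] so the counter is mutable in place
    static final Map<Long, int[]> blacklist = new ConcurrentHashMap<>();

    static void addToBlacklist(long backendId, int heartbeatIntervalSecond) {
        blacklist.put(backendId, new int[]{heartbeatIntervalSecond + 1});
    }

    // Called once per "tick" (e.g. once per second): decrement every entry and
    // drop those that reach zero, letting the backend be scheduled again.
    static void tick() {
        blacklist.entrySet().removeIf(e -> --e.getValue()[0] <= 0);
    }

    public static void main(String[] args) {
        addToBlacklist(126101L, 5);          // countdown starts at 6
        for (int i = 0; i < 6; i++) tick();  // after 6 ticks the entry expires
        System.out.println(blacklist.containsKey(126101L)); // prints false
    }
}
```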



Root Cause Analysis

Per the task's error message,
detailMessage = there is no scanNode Backend. [126101: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 14587381: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS), 213814: in black list(Ocurrs time out with specfied time 10000 MICROSECONDS)]
the three BE nodes with ids 126101, 14587381, and 213814 were blacklisted, most likely because of "Ocurrs time out with specfied time 10000 MICROSECONDS".
That strongly suggests these three BEs were down on March 8.
Based on point 7 of a community member's earlier write-up of similar experience, the likely cause is that a job or an improper configuration brought the BEs down:

  • a broker load or some other job overloaded the BE service
  • max_broker_concurrency
  • max_bytes_per_broker_scanner
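For reference, both of the settings above are FE-side broker-load limits configured in fe.conf. An illustrative fragment (the values are examples only, not recommendations; tune them to your cluster's capacity):

```
# fe.conf -- example values only
# Max number of concurrent broker scanners per broker load job;
# higher concurrency puts more simultaneous scan pressure on the BEs.
max_broker_concurrency = 10

# Max bytes a single broker scanner may process (3 GB here);
# larger values mean fewer but heavier scan tasks on each BE.
max_bytes_per_broker_scanner = 3221225472
```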


As for the exact failure: the incident happened on March 8, more than 20 days ago. Since then the Doris cluster has been scaled out and its nodes rearranged, so the relevant logs and many backups can no longer be recovered. All we can do is infer from "Ocurrs time out with specfied time 10000 MICROSECONDS" that the BEs probably crashed at the time. Our services all run under supervisord, so they restarted automatically (at that point the Prometheus rules & Alertmanager alerts for unavailable node services were not yet in place).
If the same problem occurs again, this article will be updated.

Mitigations

  • Deployed Prometheus rules & Alertmanager alerts for BE node service unavailability
  • Adjusted the relevant settings in fe.conf
  • Tuned the execution-time configuration of Spark jobs and broker load jobs

There is no substantive fix yet; if the problem recurs we will keep tracking it and add to these mitigations.
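The first mitigation can be sketched concretely. An illustrative Prometheus alerting rule for an unreachable BE (the scrape job name and thresholds are assumptions; adjust them to however your BE exporters are actually scraped):

```yaml
groups:
  - name: doris-be
    rules:
      - alert: DorisBeDown
        # "doris_be" is an assumed scrape job name -- match your prometheus.yml
        expr: up{job="doris_be"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Doris BE {{ $labels.instance }} has been unreachable for 1m"
```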

Author: Geoffrey Turing
