This post covers a recent abnormal compaction issue on a StarRocks cluster (version v3.2.4), which showed up primarily on the BE (Backend) nodes: the BE process crashed and restarted repeatedly, unable to stay online, while BE memory usage spiked abnormally and could not be released.
Our post-incident hypothesis: during peak business hours, with a huge write volume, the system had to handle heavy real-time ingestion alongside periodic batch loads, and these workloads can trigger frequent compaction tasks to keep data well organized and storage optimized. Compaction is the background task that, under continuous ingestion, merges newly written data versions with historical data into up-to-date data files; in this extreme case it hit a bug that left data permanently unmergeable, sending the BE nodes into a crash-restart loop.
During update compaction (Update Compaction), the StarRocks BE node triggered a segmentation fault (SIGSEGV) that crashed the BE process. Concretely:

- SIGSEGV (@0x7f8c49f20950): an out-of-bounds memory access.
- The fault surfaced in BitShufflePageDecoder::next_batch() and SegmentIterator::_read(), pointing to an error during data decoding or segment merging.
- SIGSEGV (@0x7f8c49f20950) received by PID 2986034, PC: __memmove_evex_unaligned_erms. The stack trace involves:
  starrocks::FixedLengthColumnBase::append_numbers()
  starrocks::BitShufflePageDecoder::next_batch()
  starrocks::SegmentIterator::_read()
- The crash occurred in the RowsetMergerImpl::_do_merge_vertically() phase, suggesting a memory or data-consistency problem while merging multi-version data.

The crash-time log:

start time: Thu Feb 13 05:57:45 PM CST 2025
Ignored unknown config: default_rowset_type
Ignored unknown config: JAVA_OPTS
3.2.4-1.39 RELEASE (build e5d7f2e)
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
tracker:process consumption: 248896936
tracker:query_pool consumption: 0
tracker:query_pool/connector_scan consumption: 0
tracker:load consumption: 0
tracker:metadata consumption: 14277210
tracker:tablet_metadata consumption: 2050871
tracker:rowset_metadata consumption: 7837292
tracker:segment_metadata consumption: 836135
tracker:column_metadata consumption: 3552912
tracker:tablet_schema consumption: 433895
tracker:segment_zonemap consumption: 710880
tracker:short_key_index consumption: 0
tracker:column_zonemap_index consumption: 954240
tracker:ordinal_index consumption: 1161632
tracker:bitmap_index consumption: 142000
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 36853432
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 171968
tracker:update consumption: 12
tracker:chunk_allocator consumption: 0
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
*** Aborted at 1739440667 (unix time) try "date -d @1739440667" if you are using GNU date ***
PC: @ 0x7f849059d30d __memmove_evex_unaligned_erms
*** SIGSEGV (@0x7f8b9ebe83d0) received by PID 2981141 (TID 0x7f83cdbb9640) from PID 18446744072077870032; stack trace: ***
@ 0x67c6ea2 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f849416d1d0 (unknown)
@ 0x7f849059d30d __memmove_evex_unaligned_erms
@ 0x2be8466 starrocks::FixedLengthColumnBase<>::append_numbers()
@ 0x346b789 starrocks::NullableColumn::append_numbers()
@ 0x56410d7 starrocks::BitShufflePageDecoder<>::next_batch()
@ 0x569b950 starrocks::ParsedPageV2::read()
@ 0x5667bb8 starrocks::ScalarColumnIterator::next_batch()
@ 0x51ba0b0 starrocks::SegmentIterator::_read()
@ 0x51b37d4 starrocks::SegmentIterator::_do_get_next()
@ 0x51b5a20 starrocks::SegmentIterator::do_get_next()
@ 0x521cb33 starrocks::HeapMergeIterator::fill()
@ 0x521d5e2 starrocks::HeapMergeIterator::do_get_next()
@ 0x5285fb2 starrocks::MergeEntry<>::next()
@ 0x5288fe7 starrocks::RowsetMergerImpl<>::_fill_heap()
@ 0x528964b starrocks::RowsetMergerImpl<>::get_next()
@ 0x528a933 starrocks::RowsetMergerImpl<>::_do_merge_horizontally()
@ 0x528bc23 starrocks::RowsetMergerImpl<>::_do_merge_vertically()
@ 0x528dd3b starrocks::RowsetMergerImpl<>::do_merge()
@ 0x5280ea1 starrocks::compaction_merge_rowsets()
@ 0x515a078 starrocks::TabletUpdates::_do_compaction()
@ 0x515c0fe starrocks::TabletUpdates::compaction_for_size_tiered()
@ 0x515cb7a starrocks::TabletUpdates::compaction()
@ 0x50a317c starrocks::StorageEngine::_perform_update_compaction()
@ 0x5034744 starrocks::StorageEngine::_update_compaction_thread_callback()
@ 0x8c011c0 execute_native_thread_routine
@ 0x7f84941623fb start_thread
@ 0x7f849051be83 __GI___clone
@ 0x0 (unknown)
Meanwhile, the BE INFO log shows update compaction being picked and then failing with a row-count mismatch:

I0213 20:38:19.276448 95354 tablet_manager.cpp:808] Found the best tablet to compact. compaction_type=update tablet_id=17527311, disk=/mnt/disk1/starrocks/storage, highest_score=504
I0213 20:38:19.276468 95354 tablet_updates.cpp:2707] update compaction start tablet:17527311 version:2 score:13551475712 merge levels:5 pick:1/valid:1/all:1 0 #pick_segments:54 #valid_segments:54 #rows:2935658->2935658 bytes:644.31 MB->644.31 MB(estimate)
W0213 20:38:19.297608 95355 rowset_merger.cpp:448] update compaction rows read(2919843) != rows written(1361022)
W0213 20:38:19.297655 95355 rowset_merger.cpp:293] update compaction merge finished. tablet=17527884, disk=/mnt/disk2/starrocks/storage, #key=2 algorithm=VERTICAL_COMPACTION column_group_size=6 chunk_size min:4096 max:4096 input(entry=1 rows=2919843 del=0 actual=2919843 bytes=608.56 MB) output(rows=1361022 chunk=334 bytes=0) duration: 397ms, err=update compaction rows read(2919843) != rows written(1361022)
W0213 20:38:19.355593 95355 storage_engine.cpp:988] failed to perform update compaction. res=Internal error: update compaction rows read(2919843) != rows written(1361022)
/root/jenkins/workspace/emr-olap-starrocks-32-release/emr_starrocks/be/src/storage/rowset_merger.cpp:474 _do_merge_horizontally(tablet, tablet_schema, version, schema, rowsets, writer, cfg, total_input_size, total_rows, total_chunk, stats, mask_buffer.get(), &rowsets_mask_buffer), tablet=17527884.1818706437.2f45675865dc13a6-2a854a0cc4b08495
I0213 20:38:19.356320 95355 tablet_manager.cpp:808] Found the best tablet to compact. compaction_type=update tablet_id=17528086, disk=/mnt/disk2/starrocks/storage, highest_score=516
I0213 20:38:19.356336 95355 tablet_updates.cpp:2707] update compaction start tablet:17528086 version:3 score:13603075072 merge levels:5 pick:1/valid:2/all:2 0 #pick_segments:54 #valid_segments:55 #rows:2576234->2576228 bytes:595.10 MB->595.10 MB(estimate)
Given these symptoms, the troubleshooting directions were:

- Check the max_compaction_concurrency value. If it is set too high, large bulk-load scenarios can run out of memory (OOM).
- Examine the error update compaction rows read(2919843) != rows written(1361022); it pins down the kind of exception hit during the merge (e.g., a file-lock conflict or a version-merge limit being exceeded).
- Review the mem_limit threshold, which may be a contributing factor.
- Run SHOW PROC '/compaction' to check the number of unmerged versions. When a single tablet's version count exceeds the threshold, forced compaction kicks in and can cause a burst of resource consumption.
- Upgrade to a version with the fix (e.g., v3.2.11).
- Set enable_pk_size_tiered_compaction_strategy = false and enable_size_tiered_compaction_strategy = false. In the failing scenario these two strategies can amplify memory pressure during merges or lead to data-consistency problems; disabling them removes their side effects and makes compaction more stable.
- compaction_max_memory_limit=-1 and compaction_max_memory_limit_percent=-1 (the defaults, listed here for reference).

A config sketch for applying these settings follows this list.
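As a minimal sketch, assuming default ports (BE HTTP 8040, FE MySQL 9030) and that the strategy switches are runtime-mutable in your build, the settings above could be applied like this; hosts and paths are placeholders:

# Static settings in be.conf (take effect after a BE restart)
cat >> /path/to/be/conf/be.conf <<'EOF'
enable_pk_size_tiered_compaction_strategy = false
enable_size_tiered_compaction_strategy = false
compaction_max_memory_limit = -1
compaction_max_memory_limit_percent = -1
EOF

# Runtime change via the BE HTTP API, for configs marked mutable
curl -XPOST "http://<be_host>:8040/api/update_config?enable_size_tiered_compaction_strategy=false"

# Check unmerged-version pressure from any FE
mysql -h <fe_host> -P 9030 -uroot -e "SHOW PROC '/compaction';"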
In practice, after the emergency measures and parameter tuning above, we tried restarting the BE process by hand with the sh startup script, but the same error kept coming back. At that point we had to change approach. The logs pointed to a series of abnormal tablets, and the command below told us which table a given tablet belongs to. We therefore tried a metadata-level fix: delete the abnormal tablet; with multiple replicas, the BE can then re-replicate the deleted tablet onto the corresponding node automatically via the replica-balancing mechanism. Note that if every replica of the tablet holds corrupted data, it cannot be recovered this way and the data must be re-loaded. The steps were:

1. Identify the abnormal tablet from the logs, for example:
I0213 20:38:19.356320 95355 tablet_manager.cpp:808] Found the best tablet to compact. compaction_type=update tablet_id=17528086, disk=/mnt/disk2/starrocks/storage, highest_score=516
I0213 20:38:19.356336 95355 tablet_updates.cpp:2707] update compaction start tablet:17528086 version:3 score:13603075072 merge levels:5 pick:1/valid:2/all:2 0 #pick_segments:54 #valid_segments:55 #rows:2576234->2576228 bytes:595.10 MB->595.10 MB(estimate)
2. Find the table the tablet belongs to:

show tablet xxx;

3. Delete the abnormal tablet with meta_tool.sh, running it on the problematic BE node (see the consolidated sketch below):

./meta_tool.sh --operation=delete_meta --root_path=/mnt/diskxxx/starrocks/storage/ --tablet_id=xxx
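Pulling the steps together, a minimal recovery sketch might look like the following; it assumes the BE must be stopped while meta_tool.sh rewrites local metadata and that at least one healthy replica exists elsewhere, and the tablet IDs are merely examples taken from the logs above:

# Stop the affected BE first; meta_tool.sh operates on the local metadata store
/path/to/be/bin/stop_be.sh

# Delete the meta of each abnormal tablet found in the logs
for tablet_id in 17527884 17528086; do
    ./meta_tool.sh --operation=delete_meta \
        --root_path=/mnt/disk2/starrocks/storage \
        --tablet_id=${tablet_id}
done

# Restart the BE; for multi-replica tables the clone/balance mechanism
# re-replicates the deleted tablets onto this node automatically
/path/to/be/bin/start_be.sh --daemon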
Before we talk about memory tuning, it helps to know the core components of a BE node. StarRocks BE (Backend) nodes are implemented mainly in C++, but lean on the Java ecosystem in certain scenarios. The division of labor between the two languages:

The BE node's core functionality is implemented entirely in C++ and handles high-performance storage, query execution, and system-level optimization:

- Storage engine: data writes and compaction (e.g., the RowsetMerger module).
- Query execution: high-performance implementations of operators such as Scan, Aggregation, and Join.
- Memory management: allocators (e.g., ChunkAllocator) manage memory for queries and compaction tasks.

The BE node relies on the Java ecosystem in the following scenarios, usually through JNI (Java Native Interface) or a separate process: Java UDFs, data-lake formats (e.g., Paimon, Iceberg), and external data sources (e.g., JDBC, Hive).
Normally the memory of the C++ modules needs no manual configuration, while the Java module does. As noted above, the JVM process mainly handles UDFs, data-lake formats (Paimon, Iceberg, etc.), and interaction with external data sources (JDBC, Hive, etc.). Once the JVM's memory is reserved, less remains for everything else. So if a StarRocks cluster serves only internal-table data and no external connectors, we strongly recommend leaving more of the BE node's memory to the other components. As a rule of thumb, configure the BE JVM at roughly one quarter of the node's total memory, as in the sketch below.
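For illustration only, a minimal be.conf memory sketch assuming a 128 GB node that mostly serves internal tables (the values are assumptions, not defaults, and how JAVA_OPTS is picked up may differ across versions):

# be.conf memory split on a 128 GB node (illustrative values)
# C++ side: cap the whole BE process, e.g. at the usual 90% of physical memory
mem_limit = 90%
# Java side: JVM used for UDF / data-lake / JDBC access. Keep it small when the
# cluster only serves internal tables; size it toward 1/4 of node memory when
# external connectors are in heavy use (1/4 of 128 GB would be -Xmx32g).
JAVA_OPTS="-Xmx8g -XX:+UseG1GC"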
When a BE node hits a memory anomaly that is not a JVM issue, the node can likewise go down within a short time. Typical symptoms include log messages such as "GlobalUpdate is too busy", "Current memory statistics: process(23991010872)", or "reason=InternalError", along with log lines printed out of chronological order. A sample log:
W0225 21:41:06.904227 2679451 server.cpp:342] UpdateDerivedVars is too busy!
W0225 21:36:12.222838 2679331 scan_operator.cpp:231] set_finishing scan fragment 49efb35d-f37a-11ef-b4b9-00163e006dc0 driver_id 6 _num_running_io_tasks= 1 _submit_task_counter= 36 _morsels_counter= 4
I0225 21:43:10.235177 2679310 internal_service.cpp:493] cancel fragment, fragment_instance_id=50d3010e-f37a-11ef-b4b9-00163e006dbf, reason: InternalError
I0225 21:43:27.689874 2679459 primary_index.cpp:958] primary index released table:18017764 tablet:18018912 memory: 11010027
I0225 21:35:36.689265 2679488 tablet.cpp:1360] start to do tablet meta checkpoint, tablet=19205376.1486930064.c14655fb3cdf5112-46fa917ea98557b4
I0225 21:44:30.973304 2679313 internal_service.cpp:493] cancel fragment, fragment_instance_id=50d3010e-f37a-11ef-b4b9-00163e006db9, reason: InternalError
I0225 21:48:21.110564 2679312 internal_service.cpp:493] cancel fragment, fragment_instance_id=56bf6c73-f37a-11ef-b4b9-00163e006db7, reason: InternalError
I0225 21:51:38.990103 2679505 starlet.cc:103] Empty starmanager address, skip reporting!
W0225 21:52:46.925017 2679451 server.cpp:342] UpdateDerivedVars is too busy!
W0225 21:52:12.017813 2679458 global.cpp:244] GlobalUpdate is too busy!
I0225 21:54:12.607643 2679314 internal_service.cpp:493] cancel fragment, fragment_instance_id=50d3010e-f37a-11ef-b4b9-00163e006db6, reason: InternalError
I0225 21:56:15.020393 2679317 internal_service.cpp:493] cancel fragment, fragment_instance_id=49efb35d-f37a-11ef-b4b9-00163e006dc3, reason: InternalError
I0225 21:58:44.311798 2679315 internal_service.cpp:493] cancel fragment, fragment_instance_id=49efb35d-f37a-11ef-b4b9-00163e006dba, reason: InternalError
W0225 21:59:31.796851 2679310 fragment_context.cpp:173] [Driver] Canceled, query_id=50d3010e-f37a-11ef-b4b9-00163e006db4, instance_id=50d3010e-f37a-11ef-b4b9-00163e006dbf, reason=InternalError
I0225 22:00:23.982470 2679316 internal_service.cpp:493] cancel fragment, fragment_instance_id=50d3010e-f37a-11ef-b4b9-00163e006dbb, reason: InternalError
W0225 21:58:43.943780 2679311 fragment_context.cpp:173] [Driver] Canceled, query_id=50d3010e-f37a-11ef-b4b9-00163e006db4, instance_id=50d3010e-f37a-11ef-b4b9-00163e006dbd, reason=InternalError
W0225 22:01:53.783878 2679451 server.cpp:342] UpdateDerivedVars is too busy!
W0225 22:02:34.233786 2679313 fragment_context.cpp:173] [Driver] Canceled, query_id=50d3010e-f37a-11ef-b4b9-00163e006db4, instance_id=50d3010e-f37a-11ef-b4b9-00163e006db9, reason=InternalError
W0225 22:02:36.837792 2679458 global.cpp:244] GlobalUpdate is too busy!
I0225 22:02:39.804589 2679331 sink_buffer.cpp:215] fragment_instance_id 49efb35d-f37a-11ef-b4b9-00163e006dc0 -> 49efb35d-f37a-11ef-b4b9-00163e006dc0, _num_uncancelled_sinkers 7, _is_finishing false, _num_remaining_eos 16
W0225 22:05:25.711773 2679312 fragment_context.cpp:173] [Driver] Canceled, query_id=56bf6c73-f37a-11ef-b4b9-00163e006db4, instance_id=56bf6c73-f37a-11ef-b4b9-00163e006db7, reason=InternalError
I0225 22:06:27.055780 2679505 starlet.cc:103] Empty starmanager address, skip reporting!
W0225 22:09:26.595907 2679314 fragment_context.cpp:173] [Driver] Canceled, query_id=50d3010e-f37a-11ef-b4b9-00163e006db4, instance_id=50d3010e-f37a-11ef-b4b9-00163e006db6, reason=InternalError
W0225 22:10:27.437798 2679451 server.cpp:342] UpdateDerivedVars is too busy!
I0225 22:09:32.233987 3688911 heartbeat_server.cpp:77] get heartbeat from FE.host:172.18.236.1, port:9020, cluster id:640888435, run_mode:SHARED_NOTHING, counter:90337
W0225 22:12:36.789398 2679317 fragment_context.cpp:173] [Driver] Canceled, query_id=49efb35d-f37a-11ef-b4b9-00163e006db4, instance_id=49efb35d-f37a-11ef-b4b9-00163e006dc3, reason=InternalError
W0225 22:12:38.909652 2679458 global.cpp:244] GlobalUpdate is too busy!
W0225 22:13:42.716780 2679315 fragment_context.cpp:173] [Driver] Canceled, query_id=49efb35d-f37a-11ef-b4b9-00163e006db4, instance_id=49efb35d-f37a-11ef-b4b9-00163e006dba, reason=InternalError
W0225 22:14:04.602185 2679331 pipeline_driver_poller.cpp:84] [Driver] Timeout, query_id=50d3010e-f37a-11ef-b4b9-00163e006db4, instance_id=50d3010e-f37a-11ef-b4b9-00163e006dbb
W0225 22:14:47.607328 2679316 fragment_context.cpp:173] [Driver] Canceled, query_id=50d3010e-f37a-11ef-b4b9-00163e006db4, instance_id=50d3010e-f37a-11ef-b4b9-00163e006dbb, reason=InternalError
W0225 22:14:38.751798 2679310 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
W0225 22:17:26.154855 2679311 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
I0225 22:16:47.614809 2679495 olap_server.cpp:886] try to clear expired replication snapshots!
I0225 22:19:18.245793 2679505 starlet.cc:103] Empty starmanager address, skip reporting!
W0225 22:19:05.562366 2679451 server.cpp:342] UpdateDerivedVars is too busy!
W0225 22:21:19.061115 2679458 global.cpp:244] GlobalUpdate is too busy!
I0225 21:56:40.389847 2679459 primary_index.cpp:958] primary index released table:10041 tablet:10049 memory: 18213846
W0225 22:24:28.349790 2679317 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
W0225 22:24:22.146806 2679331 scan_operator.cpp:231] set_finishing scan fragment 50d3010e-f37a-11ef-b4b9-00163e006dbb driver_id 1 _num_running_io_tasks= 4 _submit_task_counter= 31 _morsels_counter= 4
I0225 22:26:38.758642 2679310 internal_service.cpp:493] cancel fragment, fragment_instance_id=49efb35d-f37a-11ef-b4b9-00163e006dc0, reason: InternalError
W0225 22:22:13.280774 2679177 segment_replicate_executor.cpp:181] Failed to send rpc to SyncChannnel [host: 172.18.236.12, port: 8060, load_id: 0648a763-b4f6-b0f0-47ef-9bff2becdeb2, tablet_id: 4300366, txn_id: 15271292] err=Internal error: no associated load channel 0648a763-b4f6-b0f0-47ef-9bff2becdeb2
I0225 22:27:00.728781 2679311 internal_service.cpp:493] cancel fragment, fragment_instance_id=5542b779-f37a-11ef-b4b9-00163e006dbc, reason: InternalError
W0225 22:28:13.696820 2679451 server.cpp:342] UpdateDerivedVars is too busy!
W0225 22:28:51.655735 2679452 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_tablet_writer_cancel' overflows
W0225 22:29:59.658911 2679458 global.cpp:244] GlobalUpdate is too busy!
W0225 22:30:07.901304 2679312 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
W0225 22:31:00.606828 2679316 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
I0225 22:31:41.665771 2679505 starlet.cc:103] Empty starmanager address, skip reporting!
I0225 22:32:37.658082 2679419 load_path_mgr.cpp:188] Going to remove path. path=/mnt/disk1/starrocks/storage/error_log/error_log_5925228df1e711ef_b4b900163e006db5
W0225 22:32:45.768781 2679313 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
I0225 22:34:34.714396 2679317 internal_service.cpp:493] cancel fragment, fragment_instance_id=49efb35d-f37a-11ef-b4b9-00163e006db6, reason: InternalError
W0225 22:35:51.570765 2679455 sink_buffer.cpp:362] transmit chunk rpc failed [dest_instance_id=49efb35d-f37a-11ef-b4b9-00163e006dbb] [dest=172.18.236.12:8060] detail:brpc failed, error=RPC call is timed out, error_text=[E1008]Reached timeout=180000ms @172.18.236.12:8060
W0225 22:36:42.268931 2679451 server.cpp:342] UpdateDerivedVars is too busy!
W0225 22:36:12.060503 2679314 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
W0225 22:30:49.870075 2679176 segment_replicate_executor.cpp:181] Failed to send rpc to SyncChannnel [host: 172.18.236.12, port: 8060, load_id: 0648a763-b4f6-b0f0-47ef-9bff2becdeb2, tablet_id: 4300372, txn_id: 15271292] err=Internal error: no associated load channel 0648a763-b4f6-b0f0-47ef-9bff2becdeb2
W0225 22:36:47.580717 2679454 sink_buffer.cpp:362] transmit chunk rpc failed [dest_instance_id=49efb35d-f37a-11ef-b4b9-00163e006dba] [dest=172.18.236.11:8060] detail:brpc failed, error=RPC call is timed out, error_text=[E1008]Reached timeout=180000ms @172.18.236.11:8060
W0225 22:36:41.571928 2679315 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
W0225 22:38:15.974128 2679310 recorder.h:254] Input=2147483647 to `rpc_server_8060_doris_pbackend_service_cancel_plan_fragment' overflows
W0225 22:39:21.610781 2679311 fragment_context.cpp:173] [Driver] Canceled, query_id=5542b779-f37a-11ef-b4b9-00163e006db4, instance_id=5542b779-f37a-11ef-b4b9-00163e006dbc, reason=InternalError
I0225 22:39:50.483726 2679312 internal_service.cpp:493] cancel fragment, fragment_instance_id=566390cd-f37a-11ef-b4b9-00163e006db5, reason: InternalError
W0225 22:38:50.098785 2679458 global.cpp:244] GlobalUpdate is too busy!
W0225 22:36:01.373281 2679177 segment_replicate_executor.cpp:320] Failed to sync segment SyncChannnel [host: 172.18.236.12, port: 8060, load_id: 0648a763-b4f6-b0f0-47ef-9bff2becdeb2, tablet_id: 4300366, txn_id: 15271292] err Internal error: no associated load channel 0648a763-b4f6-b0f0-47ef-9bff2becdeb2
/root/jenkins/workspace/emr-olap-starrocks-32-release/emr_starrocks/be/src/storage/segment_replicate_executor.cpp:131 _wait_response(replicate_tablet_infos, failed_tablet_infos)
I0225 22:40:33.373755 2679419 load_path_mgr.cpp:191] Remove path success. path=/mnt/disk1/starrocks/storage/error_log/error_log_5925228df1e711ef_b4b900163e006db5
I0225 22:39:18.729777 2679331 sink_buffer.cpp:215] fragment_instance_id 50d3010e-f37a-11ef-b4b9-00163e006dbb -> 50d3010e-f37a-11ef-b4b9-00163e006dbb, _num_uncancelled_sinkers 7, _is_finishing false, _num_remaining_eos 16
This StarRocks BE update-compaction incident was caused by a defect in the merge logic of v3.2.4, essentially an out-of-bounds memory access while merging data segments. A combination of upgrading the version, adjusting the compaction strategy, optimizing the memory configuration, and, where necessary, repairing metadata resolves the problem effectively.