SAM 2: Segment Anything in Images and Videos, a Close Reading (Paragraph-by-Paragraph Analysis)

License: CC 4.0 BY-SA

SAM 2：图像和视频中的万物分割

Paper: https://arxiv.org/abs/2408.00714

Meta FAIR

2024

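[Paper summary] SAM 2 (Segment Anything Model 2) is a video segmentation foundation model that Meta AI built on top of the original SAM. Its core contribution is extending promptable segmentation from static images to video. The main points are:
1. Unified architecture: an image is treated as a single-frame video, so one set of model parameters handles both images and videos and there is no need to maintain two separate systems.
2. Streaming memory: a memory bank stores object features from earlier frames together with information from user interactions, and a memory attention module fuses this information across frames so the model can follow how the object changes over time.
3. Promptable Visual Segmentation (PVS) task: clicks, boxes, or masks can be given on any frame of the video for interactive segmentation, with 3× fewer interactions than prior approaches.
4. Data engine: a human-in-the-loop annotation process in which SAM 2 assists annotators. It produced the SA-V dataset with 35.5M masks and made annotation 8.4× faster.
5. Real-time processing: a streaming, frame-by-frame architecture built on a Hiera image encoder reaches 43.8 FPS on an A100 GPU, and on image segmentation the model is 6× faster than SAM.
6. Multi-scale feature fusion: skip connections pass high-resolution features directly to the decoder, preserving spatial detail while temporal information is fused.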

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using $3\times$ fewer interactions than prior approaches. In image segmentation, our model is more accurate and $6\times$ faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, dataset, as well as code for model training and our demo.

【翻译】我们提出了Segment Anything Model 2 (SAM 2)，这是一个用于解决图像和视频中可提示视觉分割的基础模型。我们构建了一个数据引擎，通过用户交互来改进模型和数据，以收集迄今为止最大的视频分割数据集。我们的模型是一个简单的transformer架构，具有流式内存用于实时视频处理。在我们的数据上训练的SAM 2在各种任务中提供了强劲的性能。在视频分割中，我们观察到更高的准确性，比先前方法减少了 $3\times$ 的交互次数。在图像分割中，我们的模型比Segment Anything Model (SAM)更准确且快 $6\times$ 。我们相信我们的数据、模型和见解将成为视频分割和相关感知任务的重要里程碑。我们正在发布我们的主要模型、数据集，以及用于模型训练和演示的代码。

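[Analysis] SAM 2 is a major upgrade of the original SAM, and its main advance is extending segmentation from static images to video. "Promptable visual segmentation" means the user specifies the target through simple interactions such as clicks, boxes, or masks, and the model produces the segmentation. The data engine is central: it follows a human-in-the-loop strategy in which the model proposes segmentations, annotators correct them, and the corrections are used to train a better model, forming a positive feedback loop that improves both annotation efficiency and quality. Architecturally, SAM 2 is a transformer with a streaming memory that stores information from earlier frames, which lets the model keep tracking the target across a video. On the results, needing $3\times$ fewer interactions suggests the model infers user intent more reliably and needs fewer rounds of correction, while the $6\times$ speedup on image segmentation shows that inference efficiency improved without sacrificing accuracy, which matters for practical use.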

Demo: https://sam2.metademolab.com

Code: https://github.com/facebookresearch/sam2

Website: https://ai.meta.com/sam2

1 Introduction

Segment Anything (SA) introduced a foundation model for promptable segmentation in images ( Kirillov et al. , 2023 ). However an image is only a static snapshot of the real world in which visual segments can exhibit complex motion, and with the rapid growth of multimedia content, a significant portion is now recorded with a temporal dimension, particularly in video data. Many important applications in AR/VR, robotics, autonomous vehicles, and video editing require temporal localization beyond image-level segmentation. We believe a universal visual segmentation system should be applicable to both images and videos.

【翻译】Segment Anything (SA)引入了一个用于图像中可提示分割的基础模型（Kirillov et al., 2023）。然而，图像只是现实世界的静态快照，其中视觉片段可能表现出复杂的运动，随着多媒体内容的快速增长，相当大的一部分现在都记录有时间维度，特别是在视频数据中。许多重要的应用，如AR/VR、机器人、自动驾驶汽车和视频编辑，都需要超越图像级分割的时间定位。我们相信一个通用的视觉分割系统应该既适用于图像也适用于视频。

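[Analysis] This paragraph motivates the move from static image segmentation to video segmentation. The original SAM handles only single images, but objects and scenes in the real world change over time. In a video the same object can move, deform, or become occluded, and this temporal information is beyond what image segmentation can handle. A truly general visual segmentation system has to handle it, and that is the core problem SAM 2 sets out to solve.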

Segmentation in video aims to determine the spatio-temporal extent of entities, which presents unique challenges beyond those in images. Entities can undergo significant changes in appearance due to motion, deformation, occlusion, lighting changes, and other factors. Videos often have lower quality than images due to camera motion, blur, and lower resolution. Further, efficient processing of a large number of frames is a key challenge. While SA successfully addresses segmentation in images, existing video segmentation models and datasets fall short in providing a comparable capability to “segment anything in videos”.

【翻译】视频中的分割旨在确定实体的时空范围，这带来了超越图像分割的独特挑战。实体可能由于运动、变形、遮挡、光照变化和其他因素而在外观上发生显著变化。由于相机运动、模糊和较低分辨率，视频通常比图像质量更低。此外，高效处理大量帧是一个关键挑战。虽然SA成功解决了图像中的分割问题，但现有的视频分割模型和数据集在提供可比的"视频中分割任何物体"能力方面存在不足。

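[Analysis] Video segmentation is harder than image segmentation in several ways. Determining the spatio-temporal extent means knowing the object's position and shape in every frame and also keeping them consistent along the time axis. Appearance change is the core difficulty: motion changes the apparent shape, deformation changes the geometry, occlusion makes parts disappear and reappear, and lighting changes alter color and texture, so the same object can look very different from frame to frame. Video quality makes things worse: global blur from camera motion, local blur from fast-moving objects, and the lower resolution and lost detail that come with compression all make precise segmentation harder. Efficiency is another constraint. A video contains many consecutive frames, and processing each one independently is prohibitively expensive, so algorithms need to exploit temporal correlation. Existing methods typically handle only specific object categories or require a lot of user interaction, so they fall short of the generality and ease of use that SAM achieved for images.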

We introduce the Segment Anything Model 2 (SAM 2), a unified model for video and image segmentation (we consider an image as a single-frame video). Our work includes a task, model, and dataset (see Fig. 1 ).

【翻译】我们引入了Segment Anything Model 2 (SAM 2)，这是一个用于视频和图像分割的统一模型（我们将图像视为单帧视频）。我们的工作包括任务、模型和数据集（见图1）。

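[Analysis] SAM 2 unifies the two settings by treating an image as a single-frame video. The same parameters and inference pipeline then serve both images and videos, which avoids maintaining two systems. On an image the memory is empty and the model behaves like the original SAM. On a video the memory keeps the model tracking the object across frames. The unified design simplifies the architecture and lets training on image and video data benefit each other, which improves generalization.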

We focus on the Promptable Visual Segmentation (PVS) task that generalizes image segmentation to the video domain. The task takes as input points, boxes, or masks on any frame of the video to define a segment of interest for which the spatio-temporal mask (i.e., a ‘masklet’) is to be predicted. Once a masklet is predicted, it can be iteratively refined by providing prompts in additional frames.

【翻译】我们专注于可提示视觉分割（PVS）任务，该任务将图像分割推广到视频领域。该任务以视频任意帧上的点、框或掩码作为输入，用于定义感兴趣的片段，需要预测其时空掩码（即"masklet"）。一旦预测出masklet，就可以通过在其他帧中提供提示来进行迭代细化。

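[Analysis] The PVS task is a natural extension of image segmentation to video. Image segmentation deals with spatial information at a single instant, while video segmentation must also maintain spatio-temporal continuity. A spatio-temporal mask, or masklet, can be viewed as a three-dimensional structure: two dimensions give spatial position (x and y) and the third gives time (the frame index), so it fully describes how an object's position and shape evolve through the video. The key change is that the user may provide input on any frame, rather than only on the first frame or on specific frames as in conventional methods. This flexibility reduces user effort, because the user can pick the frame where the object is clearest and easiest to annotate. Iterative refinement lets the user correct the model's initial prediction step by step, adding prompts on key frames to fix errors or improve accuracy. This matches real workflows: the user does not have to give a perfect annotation up front and can reach a satisfactory result over several rounds of interaction.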
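To make the "three-dimensional" view of a masklet concrete, here is a minimal sketch. The `Masklet` layout (a dictionary from frame index to a binary mask) and the helper `masklet_area_per_frame` are assumptions made for illustration, not a format defined by the paper.

```python
from typing import Dict, List

Mask = List[List[int]]     # H x W binary mask for one frame
Masklet = Dict[int, Mask]  # frame index -> mask (hypothetical layout)

def masklet_area_per_frame(masklet: Masklet) -> Dict[int, int]:
    """Count foreground pixels on each frame; 0 means the object is absent."""
    return {t: sum(sum(row) for row in mask) for t, mask in sorted(masklet.items())}

# A toy 3-frame, 2x3 masklet: the object moves right, then is fully occluded.
masklet: Masklet = {
    0: [[1, 1, 0], [1, 1, 0]],
    1: [[0, 1, 1], [0, 1, 1]],
    2: [[0, 0, 0], [0, 0, 0]],
}
areas = masklet_area_per_frame(masklet)
```

An all-zero mask on a frame is how this sketch represents a frame on which the object is not visible.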

Our model (§ 4 ) produces segmentation masks of the object of interest, in single images and across video frames. SAM 2 is equipped with a memory that stores information about the object and previous interactions, which allows it to generate masklet predictions throughout the video, and also effectively correct these based on the stored memory context of the object from previously observed frames. Our streaming architecture is a natural generalization of SAM to the video domain, processing video frames one at a time, equipped with a memory attention module to attend to the previous memories of the target object. When applied to images, the memory is empty and the model behaves like SAM.

【翻译】我们的模型（第4节）为感兴趣的物体生成分割掩码，既可用于单张图像也可用于视频帧序列。SAM 2配备了一个存储物体信息和先前交互的内存，这使其能够在整个视频中生成masklet预测，并且能够基于从先前观察帧中存储的物体内存上下文来有效地进行修正。我们的流式架构是SAM向视频领域的自然推广，逐帧处理视频帧，配备了内存注意力模块来关注目标物体的先前记忆。当应用于图像时，内存为空，模型表现得像SAM一样。

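[Analysis] The key technical addition in SAM 2 is its memory, which sets it apart from earlier approaches. Conceptually, the memory holds two kinds of information: features of the target object and a record of the user's interactions. The object features are visual representations of the target in different frames, implicitly capturing cues such as color, texture, and shape, and they are updated as the video progresses. The interaction record covers the prompts given on individual frames (click positions, boxes, masks), which are essential for understanding user intent and keeping the segmentation consistent. The streaming design reflects practical needs: video arrives in temporal order, and processing it frame by frame allows real-time or near-real-time segmentation. The memory attention module is a transformer-based attention mechanism that selectively attends to the memories most relevant to the current frame. Rather than simply concatenating features, it uses learned weights to fuse past information with the current frame dynamically, which yields more accurate predictions. On a single image the memory is empty, so the model behaves like SAM and retains strong image segmentation performance.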

Figure 1 We introduce the Segment Anything Model 2 (SAM 2), towards solving the promptable visual segmentation task (a) with our foundation model (b), trained on our large-scale SA-V dataset collected through our data engine (c). SAM 2 is capable of interactively segmenting regions through prompts (clicks, boxes, or masks) on one or multiple video frames by utilizing a streaming memory that stores previous prompts and predictions.

【翻译】图1 我们介绍了Segment Anything Model 2 (SAM 2)，旨在通过我们的基础模型（b）解决可提示视觉分割任务（a），该模型在通过我们的数据引擎（c）收集的大规模SA-V数据集上进行训练。SAM 2能够通过利用存储先前提示和预测的流式内存，在一个或多个视频帧上通过提示（点击、框或掩码）交互式地分割区域。

We employ a data engine (§ 5 ) to generate training data by using our model in the loop with annotators to interactively annotate new and challenging data. Different from most existing video segmentation datasets, our data engine is not restricted to objects of specific categories, but instead targeted to provide training data for segmenting any object with a valid boundary, including parts and subparts. Compared to existing model-assisted approaches, our data engine with SAM 2 in the loop is 8.4 $\times$ faster at comparable quality. Our final Segment Anything Video (SA-V) dataset (§ 5.2 ) consists of 35.5M masks across 50.9K videos, 53 $\times$ more masks than any existing video segmentation dataset. SA-V is challenging with small objects and parts that get occluded and re-appear throughout the video. Our SA-V dataset is geographically diverse, and a fairness evaluation of SAM 2 indicates minimal performance discrepancy in video segmentation based on perceived gender, and little variance among the three perceived age groups we evaluated.

【翻译】我们采用一个数据引擎（第5节）通过在循环中使用我们的模型与标注者一起交互式地标注新的和具有挑战性的数据来生成训练数据。与大多数现有的视频分割数据集不同，我们的数据引擎不限于特定类别的物体，而是旨在为分割任何具有有效边界的物体提供训练数据，包括部分和子部分。与现有的模型辅助方法相比，我们使用SAM 2的数据引擎在相当质量下快 $8.4\times$ 。我们最终的Segment Anything Video (SA-V)数据集（第5.2节）包含50.9K个视频中的35.5M个掩码，掩码数量是任何现有视频分割数据集的 $53\times$ 。SA-V具有挑战性，包含在整个视频中被遮挡并重新出现的小物体和部分。我们的SA-V数据集在地理上具有多样性，SAM 2的公平性评估表明基于感知性别的视频分割性能差异很小，在我们评估的三个感知年龄组之间变化很小。

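[Analysis] The data engine is one of the most important parts of the SAM 2 project. In conventional data collection, annotators do all the labeling on their own, which is slow and makes quality hard to guarantee. The data engine instead uses a human-in-the-loop strategy that combines the model's predictions with human judgment. The model first produces a preliminary segmentation of the video, annotators correct and complete it, and the corrected data is used to train the model further, forming a loop of continuous improvement. This raises annotation efficiency considerably, because the model's predictions give annotators a good starting point and reduce the work of annotating from scratch. Unlike datasets that focus on specific object categories, SA-V aims at "segment anything": it contains whole objects as well as many parts and subparts, for example a person's fingers, a vehicle's wheels, or a building's windows. This fine-grained annotation lets the model learn richer segmentation knowledge.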

Our experiments (§ 6 ) show that SAM 2 delivers a step-change in the video segmentation experience. SAM 2 can produce better segmentation accuracy while using 3 $\times$ fewer interactions than prior approaches. Further, SAM 2 outperforms prior work in established video object segmentation benchmarks, under multiple evaluation settings, and delivers better performance compared to SAM on image segmentation benchmarks, while being $6\times$ faster. SAM 2 is shown to be effective across a variety of video and image distributions as observed through numerous zero-shot benchmarks including 17 for video segmentation and 37 for single-image segmentation.

【翻译】我们的实验（第6节）显示SAM 2在视频分割体验上带来了根本性改变。SAM 2能够产生更好的分割精度，同时比先前方法减少 $3\times$ 的交互次数。此外，SAM 2在既定的视频物体分割基准测试中，在多种评估设置下都优于先前工作，并且在图像分割基准测试中比SAM表现更好，同时速度快 $6\times$ 。通过大量的零样本基准测试可以观察到，SAM 2在各种视频和图像分布上都表现有效，包括17个视频分割基准和37个单图像分割基准。

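[Analysis] The strong zero-shot results indicate that the model has learned broadly applicable segmentation ability rather than knowledge specialized to particular datasets.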

We are releasing our work under permissive open licences, including the SA-V dataset (CC by 4.0), the SAM 2 model checkpoints, training code (Apache 2.0), and code for our interactive online demo (Apache 2.0).

【翻译】我们在宽松的开放许可证下发布我们的工作，包括SA-V数据集（CC by 4.0）、SAM 2模型检查点、训练代码（Apache 2.0）以及交互式在线演示代码（Apache 2.0）。

2 Related work

Image segmentation. Segment Anything ( Kirillov et al. , 2023 ) introduces a promptable image segmentation task where the goal is to output a valid segmentation mask given an input prompt such as a bounding box or a point that refers to the object of interest. SAM trained on the SA-1B dataset allows for zero-shot segmentation which enabled its adoption to a wide range of applications. Recent work has extended SAM, e.g., by introducing a High-Quality output token to train on fine-grained masks ( Ke et al. , 2024 ), or improve SAM’s efficiency ( Xiong et al. , 2023 ; Zhang et al. , 2023a ; Zhao et al. , 2023 ). More broadly, SAM is used in a wide range of applications, including medical imaging ( Ma et al. , 2024 ; Deng et al. , 2023 ; Mazurowski et al. , 2023 ; Wu et al. , 2023a ), remote sensing ( Chen et al. , 2024 ; Ren et al. , 2024 ), motion segmentation ( Xie et al. , 2024 ), and camouflaged object detection ( Tang et al. , 2023 ).

【翻译】图像分割。Segment Anything（Kirillov等，2023）引入了一个可提示图像分割任务，其目标是在给定输入提示（如边界框或指向感兴趣物体的点）的情况下输出有效的分割掩码。在SA-1B数据集上训练的SAM允许零样本分割，这使其能够被广泛应用于各种应用中。最近的工作扩展了SAM，例如，通过引入高质量输出标记来训练精细掩码（Ke等，2024），或提高SAM的效率（Xiong等，2023；Zhang等，2023a；Zhao等，2023）。更广泛地说，SAM被用于各种应用，包括医学成像（Ma等，2024；Deng等，2023；Mazurowski等，2023；Wu等，2023a）、遥感（Chen等，2024；Ren等，2024）、运动分割（Xie等，2024）和伪装物体检测（Tang等，2023）。

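[Analysis] SAM's breakthrough was the idea of promptable segmentation: simple interactive input is enough to obtain a high-quality mask. The model can segment objects it never saw during training without retraining for specific categories. This generalization comes from training on the large-scale SA-1B dataset and from a carefully designed architecture. SA-1B contains over one billion masks covering a very wide range of objects and scenes, which gives the model a basis for learning general segmentation knowledge. Follow-up work on SAM has concentrated on two directions, accuracy and efficiency. For accuracy, the High-Quality output token lets the model produce finer boundaries, especially for complex shapes and thin structures. For efficiency, techniques such as model compression, quantization, and distillation reduce compute and memory cost, with the aim of running SAM-style models on mobile and edge devices.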

Interactive Video Object Segmentation (iVOS). Interactive video object segmentation has emerged as a crucial task to efficiently obtain object segmentations in videos (masklets) with user guidance, often in the form of scribbles, clicks, or bounding boxes. A few early approaches ( Wang et al. , 2005 ; Bai & Sapiro , 2007 ; Fan et al. , 2015 ) deploy graph-based optimization to guide the segmentation annotation process. More recent approaches ( Heo et al. , 2020 ; Cheng et al. , 2021b ; Delatolas et al. , 2024 ) often adopt a modular design, converting user inputs into a mask representation on a single frame and then propagating it to other frames.

【翻译】交互式视频物体分割（iVOS）。交互式视频物体分割已成为在用户指导下高效获得视频中物体分割（masklets）的关键任务，通常以涂鸦、点击或边界框的形式进行。一些早期方法（Wang等，2005；Bai & Sapiro，2007；Fan等，2015）部署基于图的优化来指导分割标注过程。更近期的方法（Heo等，2020；Cheng等，2021b；Delatolas等，2024）通常采用模块化设计，将用户输入转换为单帧上的掩码表示，然后将其传播到其他帧。

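[Analysis] Each interaction type has trade-offs. Scribbles let the user mark an object precisely but require a lot of drawing for complex shapes. Clicks are the simplest, since the user only clicks inside the object or on its boundary, but they can be ambiguous. Bounding boxes give the object's rough position and size but limit accuracy for non-rectangular objects. Early graph-based methods cast video segmentation as energy minimization on a graph, building pixel-level or superpixel-level graphs to represent spatial and temporal relations. They can optimize the segmentation globally, but their computational cost is high and they struggle with long videos. More recent methods adopt a modular design that splits the task into two largely independent subproblems: single-frame segmentation, which turns user input into an initial mask, and temporal propagation, which extends and updates that mask over time. The advantage is that each module can be optimized or replaced independently, which makes the system more flexible and easier to maintain.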

Click-based input is easier to collect ( Homayounfar et al. , 2021 ) for interactive video segmentation. Recent works have used a combination of SAM on images with video trackers based on masks ( Cheng et al. , 2023b ; Yang et al. , 2023 ; Cheng et al. , 2023c ) or points ( Rajič et al. , 2023 ). However, these approaches have limitations: the tracker may not work for all objects, SAM may not perform well on video frames, and there is no mechanism to interactively refine a model’s mistakes, other than re-annotating using SAM in each frame and restarting the tracking from there.

【翻译】基于点击的输入对于交互式视频分割更容易收集（Homayounfar等，2021）。最近的工作使用了SAM在图像上与基于掩码（Cheng等，2023b；Yang等，2023；Cheng等，2023c）或点（Rajič等，2023）的视频跟踪器的组合。然而，这些方法有局限性：跟踪器可能不适用于所有物体，SAM在视频帧上可能表现不佳，并且除了在每一帧中使用SAM重新标注并从那里重新开始跟踪之外，没有机制来交互式地修正模型的错误。

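[Analysis] From a data-collection standpoint, clicks are much faster than drawing precise boundaries or masks, which matters when building large video segmentation datasets. From a user-experience standpoint, clicking is the most intuitive interaction and requires no special skill. Combining SAM with a video tracker is the intuitive solution: SAM's strong image segmentation handles a single frame, and a tracking algorithm maintains continuity over time. Mask-based trackers use the SAM mask as a template of the target and search later frames for the best-matching region. Point-based trackers treat the clicked positions as keypoints and follow their trajectories with techniques such as optical flow or feature matching. The fundamental problem with these combinations is that the two components are not deeply integrated. Trackers tend to fail under fast motion, heavy occlusion, or abrupt lighting changes, which are exactly the hard cases video segmentation most needs to handle. SAM's weaker performance on video frames stems mainly from a distribution gap: it was trained on high-quality still images, while video frames often show motion blur, compression artifacts, and lower resolution. The biggest weakness is the lack of a correction mechanism: when the system makes a mistake, the user has to re-annotate the object with SAM on the failing frame and restart tracking from there, which hurts interaction efficiency badly. An ideal interactive system supports incremental correction, where extra input on any frame fixes a local error without disturbing correct results on other frames.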

Our work shares a similar goal to these works to segment objects across videos interactively, and we build a strong unified model that directly takes prompts for interactive video segmentation, along with a large and diverse dataset in pursuit of solving this goal.

【翻译】我们的工作与这些工作有着相似的目标，即在视频中交互式地分割物体，我们构建了一个强大的统一模型，该模型直接接受用于交互式视频分割的提示，并配有一个大型且多样化的数据集来追求解决这一目标。

Video Object Segmentation (VOS). The VOS task begins with an object mask as input in the first frame, which must be accurately tracked throughout the video ( Pont-Tuset et al. , 2017 ). The task is referred to as “semi-supervised VOS” since the input mask can be seen as supervision signal of the object to be tracked. Modern VOS approaches can achieve high accuracy while operating in real-time.

【翻译】视频物体分割（VOS）。VOS任务从第一帧中的物体掩码作为输入开始，该掩码必须在整个视频中被准确跟踪（Pont-Tuset等，2017）。该任务被称为"半监督VOS"，因为输入掩码可以被视为待跟踪物体的监督信号。现代VOS方法可以在实时运行的同时实现高精度。

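[Analysis] The core challenge of semi-supervised VOS is maintaining accurate and consistent segmentation over the whole video from limited supervision, namely the first-frame mask alone. The setting reflects a practical constraint: users are usually willing to provide detailed annotation only at the start of a video and expect the system to handle the rest automatically. The first-frame mask carries the object's full shape, texture, and context, which serve as an important prior for later frames. Modern VOS methods typically store and update the object's visual representation in a feature memory, use attention to establish correspondences between frames, and rely on lightweight network designs to balance accuracy and speed.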

Early deep learning based approaches have often used online fine-tuning on the first video frame ( Caelles et al. , 2016 ; Perazzi et al. , 2016 ; Yoon et al. , 2017 ; Maninis et al. , 2017 ; Hu et al. , 2018a ; Bhat et al. , 2020 ; Robinson et al. , 2020 ) or on all frames ( Voigtlaender & Leibe , 2017 ) to adapt the model to the target object. Faster inference has been achieved with offline-trained models, conditioned either only on the first frame ( Hu et al. , 2018b ; Chen et al. , 2018 ), or also integrating the previous frame ( Oh et al. , 2018 ; Yang et al. , 2018 , 2019 ; Xu et al. , 2018a ). More recent works have leveraged a memory mechanism ( Oh et al. , 2019 ; Seong et al. , 2020 ; Liang et al. , 2020 ; Cheng et al. , 2021a , 2022 ) to create more persistent representations.

【翻译】早期基于深度学习的方法通常使用在第一个视频帧上的在线微调（Caelles等，2016；Perazzi等，2016；Yoon等，2017；Maninis等，2017；Hu等，2018a；Bhat等，2020；Robinson等，2020）或在所有帧上的在线微调（Voigtlaender & Leibe，2017）来使模型适应目标物体。通过仅基于第一帧（Hu等，2018b；Chen等，2018）或同时整合前一帧（Oh等，2018；Yang等，2018，2019；Xu等，2018a）进行条件化的离线训练模型实现了更快的推理。更近期的工作利用了内存机制（Oh等，2019；Seong等，2020；Liang等，2020；Cheng等，2021a，2022）来创建更持久的表示。

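[Analysis] Online fine-tuning adjusts model parameters at test time for the specific target object. Fine-tuning on the first frame uses the initial mask as supervision and updates the parameters with a few gradient steps so the model adapts to the object in the current video. This quickly captures the object's distinctive features, but it is computationally expensive and can overfit. Fine-tuning on all frames extends the idea by using temporal information to keep improving the model, at even higher cost, which makes real-time use difficult. Offline-trained models marked a turning point for VOS: they learn general tracking and segmentation ability from large datasets and are applied at inference without further optimization. Methods conditioned on the first frame encode the initial mask as a feature representation and use it as a reference template for later frames. Methods that also integrate the previous frame account for temporal continuity, analyzing changes between adjacent frames to improve stability and accuracy. Memory mechanisms are the more recent direction: they maintain a long-term representation of the object, remembering its appearance in the most recent frames and also how it has changed over the video's history.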

Semi-supervised VOS can be seen as a special case of our Promptable Visual Segmentation (PVS) task, with only a mask prompt in the first video frame. Notably, annotating the required high-quality object mask in the first frame in VOS is practically challenging and time-consuming for inference.

【翻译】半监督VOS可以被视为我们的可提示视觉分割（PVS）任务的一个特例，仅在第一个视频帧中有掩码提示。值得注意的是，在VOS中标注第一帧所需的高质量物体掩码在实际应用中对于推理来说是具有挑战性和耗时的。

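[Analysis] Semi-supervised VOS requires a complete, accurate object mask on the first frame, which is often hard to provide in practice. The user has to trace the object boundary precisely with a dedicated annotation tool, which demands skill and takes time, especially for complex shapes. PVS lowers this cost by accepting several kinds of prompts (clicks, boxes, rough masks). More importantly, PVS allows prompts on any frame, which is especially useful for long videos or when the object first appears in the middle of the video.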

Video segmentation datasets. Many datasets have been proposed to support the VOS task. Early VOS datasets ( Prest et al. , 2012 ; Li et al. , 2013 ; Ochs et al. , 2014 ; Fan et al. , 2015 ), such as DAVIS ( Pont-Tuset et al. , 2017 ; Caelles et al. , 2019 ), include high-quality annotations but their size limits deep-learning based approaches. YouTube-VOS ( Xu et al. , 2018b ) is the first large-scale dataset for VOS. As algorithms became better and benchmark performance started to saturate, researchers have looked at increasing the difficulty of the VOS task by specifically focusing on challenging scenarios such as Long-term tracking ( Lukezic et al. , 2018 ), Referring segmentation ( Khoreva et al. , 2018 ; Wu et al. , 2023b ), or addressing challenging objects such as those that are fast-moving ( Qi et al. , 2022 ), camouflaged ( Cheng et al. , 2021c ), or have thin structures ( Li et al. , 2022 ).

【翻译】视频分割数据集。已经提出了许多数据集来支持VOS任务。早期的VOS数据集（Prest等，2012；Li等，2013；Ochs等，2014；Fan等，2015），如DAVIS（Pont-Tuset等，2017；Caelles等，2019），包含高质量的标注，但其规模限制了基于深度学习的方法。YouTube-VOS（Xu等，2018b）是第一个VOS的大规模数据集。随着算法变得更好，基准性能开始饱和，研究人员开始通过专门关注具有挑战性的场景来增加VOS任务的难度，如长期跟踪（Lukezic等，2018）、指称分割（Khoreva等，2018；Wu等，2023b），或处理具有挑战性的物体，如快速移动的（Qi等，2022）、伪装的（Cheng等，2021c）或具有细长结构的（Li等，2022）。

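[Analysis] Datasets built around particular object types reflect challenges seen in practice: fast-moving objects cause motion blur and lagging predictions, camouflaged objects are hard to segment because they resemble the background, and thin structures demand high spatial resolution from the algorithm.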

We find that current video segmentation datasets lack sufficient coverage to achieve the capability of “segmenting anything in videos”. Their annotations typically cover entire objects (not parts) and datasets are often centered around specific object classes, such as people, vehicles, and animals. In comparison to these datasets, our released SA-V dataset not only focuses on whole objects but also extensively covers object parts and contains over an order of magnitude more masks.

【翻译】我们发现当前的视频分割数据集缺乏足够的覆盖范围来实现"在视频中分割任何物体"的能力。它们的标注通常覆盖整个物体（而非部分），数据集通常围绕特定的物体类别，如人、车辆和动物。与这些数据集相比，我们发布的SA-V数据集不仅关注整体物体，还广泛覆盖物体部分，并且包含的掩码数量超过一个数量级。

3 Task: promptable visual segmentation

Our PVS task allows providing prompts to the model on any frame of a video. Prompts can be positive/negative clicks, boxes, or masks, either to define an object to segment or to refine a model-predicted one. To provide an interactive experience, upon receiving a prompt on a specific frame, the model should immediately respond with a valid segmentation mask of the object on this frame. After receiving initial prompts (either on the same frame or different frames), the model should propagate these prompts to obtain the masklet of the object across the entire video, localizing the segmentation mask of the target on every video frame. Additional prompts can be provided to the model on any frame to refine the segment throughout the video (example in Fig. 2). For details on the task, see § B.

【翻译】我们的PVS任务允许在视频的任何帧上向模型提供提示。提示可以是正/负点击、边界框或掩码，既可以用来定义要分割的物体，也可以用来细化模型预测的结果。为了提供交互式体验，在特定帧上收到提示后，模型应该立即响应该帧上物体的有效分割掩码。在收到初始提示（无论是在同一帧还是不同帧上）后，模型应该传播这些提示以获得物体在整个视频中的masklet，在每个视频帧上定位目标的分割掩码。可以在任何帧上向模型提供额外的提示来细化整个视频中的分割（如图2中的示例）。有关任务的详细信息，请参见§B。

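[Analysis] PVS accepts interactive prompts at any point in space and time. Unlike conventional video segmentation, it does not restrict the user to the first frame, so the user can intervene and correct at any moment in the sequence. Prompts vary in three respects: modality (clicks, boxes, masks), polarity (positive prompts indicate the target, negative prompts exclude distractors), and function (initial definition versus later refinement). Immediate response is needed for a good user experience: when the user prompts a frame, the system must return the segmentation for that frame right away, which places strict demands on inference efficiency. Spatio-temporal propagation is the other key property: the model has to interpret the prompt along the time axis and extend it to the whole video. This is not a copy-and-paste operation, because the object may move, deform, or become occluded over time. Incremental refinement lets the user fix an error locally without restarting the annotation, which greatly improves usability and efficiency in practice.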
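The prompt types described above can be sketched as a small data structure. This is only an illustration: the `Prompt` class, its field names, and the `validate` helper are hypothetical and are not the interface of the released SAM 2 code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Prompt:
    """One user prompt on one frame (hypothetical container, not the SAM 2 API)."""
    frame_idx: int
    kind: str                                        # "click", "box", or "mask"
    point: Optional[Tuple[int, int]] = None          # (x, y) for clicks
    positive: bool = True                            # positive or negative click
    box: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1)
    mask: Optional[List[List[int]]] = None           # binary H x W mask

def validate(prompt: Prompt, num_frames: int) -> None:
    """Raise ValueError if malformed; any in-range frame index is allowed in PVS."""
    if not 0 <= prompt.frame_idx < num_frames:
        raise ValueError("prompt frame is outside the video")
    payload = {"click": prompt.point, "box": prompt.box, "mask": prompt.mask}
    if prompt.kind not in payload:
        raise ValueError(f"unknown prompt kind: {prompt.kind}")
    if payload[prompt.kind] is None:
        raise ValueError(f"{prompt.kind} prompt is missing its payload")

# A first click on frame 12, then a negative corrective click on frame 40.
prompts = [
    Prompt(frame_idx=12, kind="click", point=(80, 45)),
    Prompt(frame_idx=40, kind="click", point=(10, 10), positive=False),
]
for p in prompts:
    validate(p, num_frames=100)
```

Note that the first prompt is not on frame 0, which is the difference from semi-supervised VOS that the paragraph above emphasizes.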

Figure 2 Interactive segmentation with SAM 2. Step 1 (selection): we prompt SAM 2 in frame 1 to obtain the segment of the target object (the tongue). Green/red dots indicate positive/negative prompts respectively. SAM 2 automatically propagates the segment to the following frames (blue arrows) to form a masklet. If SAM 2 loses the object (after frame 2), we can correct the masklet by providing an additional prompt in a new frame (red arrow). Step 2 (refinement): a single click in frame 3 is sufficient to recover the object and propagate it to obtain the correct masklet. A decoupled SAM + video tracker approach would require several clicks in frame 3 (as in frame 1) to correctly re-annotate the object as the segmentation is restarted from scratch. With SAM 2’s memory, a single click can recover the tongue.

【翻译】图2 SAM 2的交互式分割。步骤1（选择）：我们在第1帧中提示SAM 2以获得目标物体（舌头）的分割。绿色/红色点分别表示正向/负向提示。SAM 2自动将分割传播到后续帧（蓝色箭头）以形成masklet。如果SAM 2丢失了物体（在第2帧之后），我们可以通过在新帧中提供额外提示（红色箭头）来修正masklet。步骤2（细化）：在第3帧中单次点击就足以恢复物体并传播以获得正确的masklet。解耦的"SAM + 视频跟踪器"方法需要在第3帧中进行多次点击（如第1帧中那样）来正确地重新标注物体，因为分割是从头重新开始的。借助SAM 2的内存，单次点击就可以恢复舌头。

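[Analysis] The figure shows SAM 2's advantage over decoupled methods: memory enables recovery from errors. In a decoupled pipeline, the image segmentation model (such as SAM) and the video tracker are independent components. When tracking fails, the user has to run the full annotation process again on the failing frame with the image model, which usually takes several prompts to define the object accurately. SAM 2's unified architecture keeps the object's feature representations and segmentation information from earlier frames in memory. When an error occurs on some frame, a single corrective click is enough for the model to draw on that memory and recover the target. This improves interaction efficiency and keeps the user experience continuous. Memory also helps keep the object's features consistent over time: even under partial occlusion or fast motion, the model can make reasonable predictions by referring to earlier frames.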

SAM 2 (§ 4 ) is applied as a data collection tool to the PVS task for building our SA-V dataset (§ 5 ). We evaluate the model (§ 6 ) by simulating interactive video segmentation scenarios across multiple frames, in the conventional semi-supervised VOS setting where annotations are limited to the first frame, and for image segmentation on the SA benchmarks.

【翻译】SAM 2（§4）被应用为PVS任务的数据收集工具，用于构建我们的SA-V数据集（§5）。我们通过模拟跨多帧的交互式视频分割场景来评估模型（§6），包括传统的半监督VOS设置（其中标注仅限于第一帧）以及在SA基准上的图像分割。

4 Model

SAM 2 (Fig. 3 ) can be seen as a generalization of SAM to the video (and image) domain, taking point, box, and mask prompts on individual frames to define the spatial extent of the object to be segmented spatio-temporally. Spatially, the model behaves similarly to SAM. A promptable and light-weight mask decoder takes an image embedding and prompts (if any) and outputs a segmentation mask for the frame. Prompts can be iteratively added on a frame in order to refine the masks.

【翻译】SAM 2（图3）可以被视为SAM在视频（和图像）领域的泛化，接受在各个帧上的点、框和掩码提示来定义要进行时空分割的物体的空间范围。在空间上，该模型的行为类似于SAM。一个可提示的轻量级掩码解码器接受图像嵌入和提示（如果有的话）并输出该帧的分割掩码。可以在一帧上迭代地添加提示以细化掩码。

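[Analysis] SAM 2 extends SAM from static images to video, and the extension is an architectural change rather than an added feature. Spatially, SAM 2 keeps SAM's core strength of defining the target precisely through several kinds of prompts. Temporally, it conditions each frame on memory so that the segmentation of the same object stays coherent across frames. The lightweight mask decoder is designed with real-time processing in mind, keeping computation low while preserving accuracy. Iterative refinement reflects the interactive design: the user improves the result by adding prompts step by step, and this gradual process makes satisfactory results reachable even in complex scenes.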

The frame embedding used by the SAM 2 decoder is not directly from an image encoder and is instead conditioned on memories of past predictions and prompted frames. It is possible for prompted frames to also come “from the future” relative to the current frame. Memories of frames are created by the memory encoder based on the current prediction and placed in a memory bank for use in subsequent frames. The memory attention operation takes the per-frame embedding from the image encoder and conditions it on the memory bank, before the mask decoder ingests it to form a prediction.

【翻译】SAM 2解码器使用的帧嵌入不是直接来自图像编码器，而是基于过去预测和提示帧的记忆进行条件化的。提示帧也可能相对于当前帧来自"未来"。帧的记忆由记忆编码器基于当前预测创建，并放置在记忆库中供后续帧使用。记忆注意力操作从图像编码器获取每帧嵌入，并在记忆库上对其进行条件化，然后掩码解码器接收它以形成预测。

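[Analysis] The core innovation in SAM 2 is the memory mechanism for handling temporal dependencies in video. A conventional image segmentation model treats each frame independently, whereas SAM 2 maintains past information in a memory bank so that the segmentation of the current frame can draw on predictions from other frames. The memory encoder turns the current prediction into a compact memory representation that carries both the object's visual features and the segmentation itself. The memory bank is a dynamic store that holds memories of past predictions as well as the frames that received user prompts. The notion of prompted frames "from the future" shows that processing does not have to be strictly forward in time: the user can prompt anywhere in the video, and those prompts influence the segmentation of the entire sequence. Memory attention ties the system together, using attention weights to decide which stored information matters most for the current frame.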
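The per-frame data flow described above (encode, condition on memory, decode, write a new memory) can be sketched as a loop. Everything here is a toy stand-in chosen so that the example runs: a "frame" is a list of pixel intensities, an "embedding" is a single float, and the functions `encode_image`, `memory_attention`, `decode_mask`, and `encode_memory` are hypothetical placeholders for the learned modules, not the paper's implementation.

```python
from collections import deque
from typing import Deque, List

def encode_image(frame: List[float]) -> float:
    """Toy unconditioned embedding: the mean intensity of the frame."""
    return sum(frame) / len(frame)

def memory_attention(embedding: float, memories: List[float]) -> float:
    """Toy conditioning: blend the embedding with the mean of stored memories."""
    if not memories:  # empty memory (e.g. a single image): pass through unchanged
        return embedding
    return 0.5 * embedding + 0.5 * (sum(memories) / len(memories))

def decode_mask(conditioned: float, frame: List[float]) -> List[int]:
    """Toy decoder: threshold pixels against the conditioned embedding."""
    return [1 if px >= conditioned else 0 for px in frame]

def encode_memory(embedding: float, mask: List[int]) -> float:
    """Toy memory: unconditioned embedding plus the mask's foreground fraction."""
    return embedding + sum(mask) / len(mask)

def segment_stream(frames: List[List[float]], num_recent: int = 2) -> List[List[int]]:
    bank: Deque[float] = deque(maxlen=num_recent)  # memories of recent frames
    masks = []
    for frame in frames:                           # frames are consumed one at a time
        emb = encode_image(frame)
        cond = memory_attention(emb, list(bank))
        mask = decode_mask(cond, frame)
        bank.append(encode_memory(emb, mask))      # available to subsequent frames
        masks.append(mask)
    return masks

masks = segment_stream([[0, 0, 4, 4], [0, 4, 4, 0]])
```

Only the control flow is meant to carry over: the decoder never sees the raw encoder output directly once memories exist, and each prediction feeds the memory used by later frames.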

We describe individual components and training below and provide more details in Appendix D .

【翻译】我们在下面描述各个组件和训练，并在附录D中提供更多细节。

Image encoder. For real-time processing of arbitrarily long videos, we take a streaming approach, consuming video frames as they become available. The image encoder is only run once for the entire interaction and its role is to provide unconditioned tokens (feature embeddings) representing each frame. We use an MAE ( He et al. , 2022 ) pre-trained Hiera ( Ryali et al. , 2023 ; Bolya et al. , 2023 ) image encoder, which is hierarchical , allowing us to use multiscale features during decoding.

【翻译】图像编码器。为了实时处理任意长度的视频，我们采用流式方法，在视频帧可用时逐帧消费。图像编码器在整个交互过程中只运行一次，其作用是提供表示每帧的无条件标记（特征嵌入）。我们使用MAE（He等，2022）预训练的Hiera（Ryali等，2023；Bolya等，2023）图像编码器，它是分层的，允许我们在解码过程中使用多尺度特征。

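[Analysis] The image encoder is the feature-extraction backbone. Streaming is what makes long videos tractable: the whole video never has to be loaded into memory, so arbitrarily long videos can be processed in real time. "Run once" means that each frame's embedding is computed a single time and then reused for the rest of the interaction, so later prompts and refinements do not pay the encoder cost again. "Unconditioned" means the embeddings do not depend on any external input such as user prompts or memories, so they are a pure representation of the visual content. MAE pre-training, a self-supervised method, gives the encoder a strong visual representation to start from. Hiera's hierarchical structure provides multi-scale features, which matter for objects of different sizes and levels of detail: during decoding, coarse features help with global semantics and fine features keep boundaries precise.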
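One way to read "the image encoder is only run once for the entire interaction" is as a per-frame cache: the unconditioned embedding of a frame is computed on first use and reused for every later prompt on that frame. The sketch below illustrates that reading. `FrameEmbeddingCache` and the toy encoder are hypothetical, and the released code may organize this differently.

```python
from typing import Callable, Dict, List

class FrameEmbeddingCache:
    """Compute each frame's unconditioned embedding at most once (illustrative)."""

    def __init__(self, encoder: Callable[[List[float]], List[float]]):
        self._encoder = encoder
        self._cache: Dict[int, List[float]] = {}
        self.encoder_calls = 0

    def get(self, frame_idx: int, frame: List[float]) -> List[float]:
        if frame_idx not in self._cache:
            self._cache[frame_idx] = self._encoder(frame)
            self.encoder_calls += 1
        return self._cache[frame_idx]

def toy_encoder(frame: List[float]) -> List[float]:
    """Stand-in for the image encoder: normalize intensities to [0, 1]."""
    peak = max(frame) or 1.0
    return [px / peak for px in frame]

cache = FrameEmbeddingCache(toy_encoder)
frame0 = [0.0, 2.0, 4.0]
# Three rounds of prompting on the same frame reuse one encoder pass.
for _ in range(3):
    embedding = cache.get(0, frame0)
```

Because the embedding is unconditioned, it does not depend on the prompts, which is what makes this reuse valid.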

Figure 3 The SAM 2 architecture. For a given frame, the segmentation prediction is conditioned on the current prompt and/or on previously observed memories. Videos are processed in a streaming fashion with frames being consumed one at a time by the image encoder, and cross-attended to memories of the target object from previous frames. The mask decoder, which optionally also takes input prompts, predicts the segmentation mask for that frame. Finally, a memory encoder transforms the prediction and image encoder embeddings (not shown in the figure) for use in future frames.

【翻译】图3 SAM 2架构。对于给定帧，分割预测基于当前提示和/或先前观察到的记忆进行条件化。视频以流式方式处理，帧由图像编码器逐一消费，并与来自先前帧的目标物体记忆进行交叉注意。掩码解码器（可选择性地接受输入提示）预测该帧的分割掩码。最后，记忆编码器转换预测和图像编码器嵌入（图中未显示）供未来帧使用。

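[Analysis] The overall architecture is a complete pipeline for video segmentation. Conditioning is its core: the segmentation of the current frame is not produced in isolation but combines several sources of information, namely the user's prompts on the current frame or other frames and the object memories accumulated from earlier frames. The streaming design lets the system handle live video, since each new frame can be processed as soon as it arrives without waiting for the whole sequence. Cross-attention is the bridge across time, letting the current frame "see" and use information about the target from earlier frames, so the model can still make reasonable predictions when the object is occluded or deformed. The optional prompt input to the mask decoder gives flexibility: the decoder works on its own when there is no new prompt and adjusts when the user provides feedback. The memory encoder accumulates and passes on knowledge, converting the current frame's prediction and visual features into a compact memory that provides context for later frames.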

Memory attention. The role of memory attention is to condition the current frame features on the past frames features and predictions as well as on any new prompts. We stack $L$ transformer blocks, the first one taking the image encoding from the current frame as input. Each block performs self-attention, followed by cross-attention to memories of (prompted/unprompted) frames and object pointers (see below), stored in a memory bank (see below), followed by an MLP. We use vanilla attention operations for self- and cross-attention, allowing us to benefit from recent developments in efficient attention kernels ( Dao , 2023 ).

【翻译】记忆注意力。记忆注意力的作用是基于过去帧的特征和预测以及任何新提示来条件化当前帧特征。我们堆叠 $L$ 个transformer块，第一个块将来自当前帧的图像编码作为输入。每个块执行自注意力，然后对存储在记忆库中的（有提示/无提示）帧记忆和物体指针进行交叉注意力，随后是MLP。我们使用标准的注意力操作进行自注意力和交叉注意力，使我们能够受益于高效注意力核的最新发展（Dao，2023）。

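[Analysis] Memory attention is the component that fuses temporal information in SAM 2, combining the current frame with stored information through attention. Stacking $L$ transformer blocks gives the model enough capacity for complex spatio-temporal relations, with each block refining the features from the previous one. Self-attention first relates features within the current frame, capturing spatial relations and semantic dependencies between regions. Cross-attention is the central step, letting the current frame attend selectively to the relevant parts of memory. The memory bank holds two kinds of frame memories, those of prompted frames and those of unprompted recent frames. Object pointers act as a high-level semantic summary that gives the attention object-level guidance and helps the model focus on features of the target. The MLP after the attention operations applies a nonlinear transformation and adds expressive power. Using vanilla attention is a pragmatic choice: it keeps performance while allowing the model to use efficient existing implementations such as FlashAttention.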
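The cross-attention step can be illustrated with plain scaled dot-product attention, where the queries are current-frame tokens and the keys and values are memory tokens. This is a minimal sketch under simplifying assumptions: it omits the learned query/key/value projections, multiple heads, positional encodings, the self-attention step, and the MLP, and the function names are chosen for this example.

```python
import math
from typing import List

Vector = List[float]

def softmax(xs: Vector) -> Vector:
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Each query (frame token) becomes a weighted average of the memory values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

# One frame token attending to two memory tokens with identical keys:
# the weights are equal, so the output is the mean of the two values.
frame_tokens = [[1.0, 0.0]]
memory_keys = [[0.5, 0.5], [0.5, 0.5]]
memory_values = [[0.0, 0.0], [2.0, 4.0]]
attended = cross_attention(frame_tokens, memory_keys, memory_values)
```

In a full block the attended result would be added back to the frame token through a residual connection and passed through the MLP.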

Prompt encoder and mask decoder. Our prompt encoder is identical to SAM’s and can be prompted by clicks (positive or negative), boxes, or masks to define the extent of the object in a given frame. Sparse prompts are represented by positional encodings summed with learned embeddings for each prompt type, while masks are embedded using convolutions and summed with the frame embedding.

【翻译】提示编码器和掩码解码器。我们的提示编码器与SAM的相同，可以通过点击（正向或负向）、框或掩码来提示，以定义给定帧中物体的范围。稀疏提示通过位置编码与每种提示类型的学习嵌入相加来表示，而掩码则使用卷积嵌入并与帧嵌入相加。

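[Analysis] The prompt encoder converts the user's interactions into numerical representations the model can use, and SAM 2 inherits SAM's design here. Support for several prompt types (clicks, boxes, masks) gives users flexible ways to interact. Sparse prompts (clicks and boxes) and dense prompts (masks) are encoded differently. For sparse prompts, the positional encoding captures where the prompt is, and the learned embedding encodes its type (a positive click means "the target is here", a negative click means "the target is not here"). For mask prompts, convolutions preserve spatial structure, which is important for complex boundaries. Summing the embedded mask with the frame embedding is a simple way to fuse the information, letting the model consider visual content and user intent in the same feature space.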
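The sparse-prompt encoding ("positional encodings summed with learned embeddings for each prompt type") can be sketched as follows. The sinusoidal encoding and the fixed numbers in `TYPE_EMBEDDING` are assumptions for illustration: the real type embeddings are learned, and this is not claimed to be the positional encoding SAM uses.

```python
import math
from typing import Dict, List

# Stand-ins for learned per-type embeddings (values are arbitrary).
TYPE_EMBEDDING: Dict[str, List[float]] = {
    "positive_click": [0.1, 0.2, 0.3, 0.4],
    "negative_click": [-0.1, -0.2, -0.3, -0.4],
}

def positional_encoding(x: float, y: float, width: int, height: int) -> List[float]:
    """Encode a pixel location with sines and cosines of normalized coordinates."""
    u, v = x / width, y / height
    return [math.sin(2 * math.pi * u), math.cos(2 * math.pi * u),
            math.sin(2 * math.pi * v), math.cos(2 * math.pi * v)]

def encode_click(x: float, y: float, width: int, height: int, kind: str) -> List[float]:
    """Sparse prompt token = positional encoding + embedding of the prompt type."""
    pe = positional_encoding(x, y, width, height)
    return [p + t for p, t in zip(pe, TYPE_EMBEDDING[kind])]

pos_token = encode_click(32, 16, 64, 64, "positive_click")
neg_token = encode_click(32, 16, 64, 64, "negative_click")
```

Two clicks at the same location share the positional part and differ only in the type embedding, which is how the decoder can tell "include this" from "exclude this".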

Our decoder design largely follows SAM. We stack “two-way” transformer blocks that update prompt and frame embeddings. As in SAM, for ambiguous prompts (i.e., a single click) where there may be multiple compatible target masks, we predict multiple masks. This design is important to ensure that the model outputs valid masks. In video, where ambiguity can extend across video frames, the model predicts multiple masks on each frame. If no follow-up prompts resolve the ambiguity, the model only propagates the mask with the highest predicted IoU for the current frame.

【翻译】我们的解码器设计很大程度上遵循SAM。我们堆叠"双向"transformer块来更新提示和帧嵌入。与SAM一样，对于可能有多个兼容目标掩码的模糊提示（即单次点击），我们预测多个掩码。这种设计对于确保模型输出有效掩码非常重要。在视频中，模糊性可能延伸到整个视频帧中，模型在每一帧上预测多个掩码。如果没有后续提示来解决模糊性，模型只传播当前帧预测IoU最高的掩码。

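[Analysis] The "two-way" transformer blocks update both the prompt embeddings and the frame embeddings, so information flows in both directions and prompts and visual content can influence each other. Multi-mask prediction addresses the ambiguity inherent in interactive segmentation. A single click may lie on the boundary between objects, or it may correspond to targets at different levels of granularity (a person's head, the whole person, or a larger region containing the person). Predicting several candidate masks gives the user a choice and leaves room for later refinement. In video the ambiguity is harder, because the same ambiguous prompt may correspond to different best masks in different frames. SAM 2 handles this by predicting multiple masks on each frame. The IoU prediction gives the model a way to assess itself: it estimates the quality of each predicted mask and, without further user input, propagates the one most likely to be correct.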
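The selection rule in the last sentence of the paper paragraph (propagate the candidate with the highest predicted IoU) is simple enough to write down. The sketch below also includes a plain IoU function to show what the predicted score is estimating. The names `iou` and `select_mask` and the toy scores are assumptions for this example.

```python
from typing import List, Tuple

Mask = List[List[int]]

def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0

def select_mask(candidates: List[Tuple[Mask, float]]) -> Mask:
    """Pick the candidate whose predicted IoU score is highest."""
    return max(candidates, key=lambda c: c[1])[0]

# Three candidates for one ambiguous click: part, whole object, larger region.
part = [[1, 0], [0, 0]]
whole = [[1, 1], [0, 0]]
region = [[1, 1], [1, 1]]
chosen = select_mask([(part, 0.62), (whole, 0.91), (region, 0.48)])
overlap = iou(part, whole)
```

At inference time the true IoU is unknown, so the model relies on its own predicted score for each candidate.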

Unlike SAM where there is always a valid object to segment given a positive prompt, in the PVS task it is possible for no valid object to exist on some frames (e.g. due to occlusion). To support this new output mode, we add an additional head that predicts whether the object of interest is present on the current frame. Another novelty are skip connections from our hierarchical image encoder (bypassing the memory attention) to incorporate high-resolution embeddings for mask decoding (see § D ).

【翻译】与SAM不同（在SAM中，给定正向提示总是有一个有效的物体可以分割），在PVS任务中，某些帧上可能不存在有效物体（例如由于遮挡）。为了支持这种新的输出模式，我们添加了一个额外的头部来预测目标物体是否存在于当前帧中。另一个新颖之处是来自我们分层图像编码器的跳跃连接（绕过记忆注意力）以结合高分辨率嵌入进行掩码解码（见 § D ）。

【解析】视频分割任务与静态图像分割的根本区别在于目标物体的可见性问题。在静态图像中，用户点击的位置通常确实存在某个可分割的物体，但在视频中，目标物体可能由于遮挡、移出画面或其他因素而暂时消失。这种情况要求模型具备"无物体"检测能力，即能够识别当前帧中确实不存在目标物体的情况。额外预测头的引入是一个重要的架构创新，它让模型能够显式地输出"目标不存在"的判断，而不是强行生成一个可能错误的分割掩码。这种设计提高了系统的可靠性，避免了在物体不可见时产生虚假的分割结果。跳跃连接的设计解决了多尺度信息融合的问题。分层图像编码器在不同层次提取不同分辨率的特征，浅层特征包含丰富的细节信息，深层特征包含高级语义信息。通过绕过记忆注意力模块直接将高分辨率特征传递给解码器，模型能够在进行时序信息融合的同时保持空间细节的精确性。这种设计特别重要，因为记忆注意力操作可能会在信息融合过程中丢失一些细粒度的空间细节，而这些细节对于准确的边界预测是至关重要的。跳跃连接机制确保了模型既能利用时序上下文信息，又能保持高质量的空间分割精度。
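How the extra presence head might gate the mask output can be sketched as follows. The paper only states that an additional head predicts whether the object is present on the current frame; the sigmoid, the `0.5` threshold, the `None` return value, and the function name `gate_mask` are all assumptions made for illustration.

```python
import math
from typing import List, Optional


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_mask(
    mask_logits: List[List[float]],
    presence_logit: float,
    threshold: float = 0.5,  # hypothetical decision threshold
) -> Optional[List[List[bool]]]:
    """Return a binary mask, or None when the object is predicted absent."""
    if sigmoid(presence_logit) < threshold:
        return None  # e.g. the object is occluded on this frame
    return [[value > 0.0 for value in row] for row in mask_logits]


print(gate_mask([[1.0, -1.0]], presence_logit=2.0))
print(gate_mask([[1.0, -1.0]], presence_logit=-3.0))
```

Returning an explicit "absent" result, instead of forcing a mask, is what distinguishes this output mode from SAM's image-only setting.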

Memory encoder. The memory encoder generates a memory by downsampling the output mask using a convolutional module and summing it element-wise with the unconditioned frame embedding from the image-encoder (not shown in Fig. 3 ), followed by light-weight convolutional layers to fuse the information.

【翻译】记忆编码器。记忆编码器通过使用卷积模块对输出掩码进行下采样，并将其与来自图像编码器的无条件帧嵌入（图3中未显示）进行逐元素求和来生成记忆，然后使用轻量级卷积层来融合信息。

【解析】记忆编码器是SAM 2架构中负责将当前帧的分割结果转化为可存储记忆的关键组件。下采样操作将高分辨率的分割掩码压缩到与特征图相匹配的尺寸，这种尺寸对齐是后续信息融合的前提条件。卷积模块在下采样过程中不仅改变空间分辨率，还通过学习到的滤波器提取掩码中的关键结构信息，去除冗余细节而保留重要的形状和位置特征。逐元素求和操作将分割掩码的几何信息与图像编码器提供的视觉特征进行融合，这种加法操作确保了记忆既包含物体的视觉外观特征，也包含其精确的空间位置和形状信息。无条件帧嵌入提供了纯粹的视觉内容表示，不受任何外部提示或历史信息影响，为记忆提供了稳定的基础。轻量级卷积层是效率与性能的平衡考虑，它们具备足够的表达能力来融合多源信息，同时保持较低的计算开销，确保整个记忆生成过程不会成为系统的性能瓶颈。融合后的记忆表示同时包含了语义信息和几何信息，为后续帧的分割提供了丰富而紧凑的上下文。
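The "downsample the mask, then sum element-wise with the frame embedding" step can be illustrated on tiny grids. This is only a sketch under strong simplifications: average pooling stands in for the learned convolutional downsampler, the embedding has a single channel, and the final fusion convolutions are omitted. `downsample` and `encode_memory` are hypothetical names.

```python
from typing import List

Grid = List[List[float]]


def downsample(mask: Grid, stride: int) -> Grid:
    """Average-pool the mask by `stride` (stand-in for the learned conv module)."""
    height, width = len(mask), len(mask[0])
    pooled = []
    for i in range(0, height, stride):
        row = []
        for j in range(0, width, stride):
            block = [
                mask[a][b]
                for a in range(i, min(i + stride, height))
                for b in range(j, min(j + stride, width))
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def encode_memory(mask: Grid, frame_embedding: Grid, stride: int) -> Grid:
    """Downsample the mask and add it element-wise to the frame embedding."""
    small = downsample(mask, stride)
    if len(small) != len(frame_embedding) or len(small[0]) != len(frame_embedding[0]):
        raise ValueError("downsampled mask and frame embedding must share a shape")
    return [
        [m + f for m, f in zip(mask_row, frame_row)]
        for mask_row, frame_row in zip(small, frame_embedding)
    ]


mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
frame_embedding = [[0.5, 0.5], [0.5, 0.5]]
memory = encode_memory(mask, frame_embedding, stride=2)
print(memory)
```

The shape check makes the alignment requirement explicit: the sum is only defined once the mask has been brought to the resolution of the feature map.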

Memory bank. The memory bank retains information about past predictions for the target object in the video by maintaining a FIFO queue of memories of up to $N$ recent frames and stores information from prompts in a FIFO queue of up to $M$ prompted frames. For instance, in the VOS task where the initial mask is the only prompt, the memory bank consistently retains the first frame’s memory along with memories of up to $N$ recent (unprompted) frames. Both sets of memories are stored as spatial feature maps.

【翻译】记忆库。记忆库通过维护最多 $N$ 个最近帧的记忆FIFO队列来保留视频中目标物体过去预测的信息，并在最多 $M$ 个提示帧的FIFO队列中存储来自提示的信息。例如，在初始掩码是唯一提示的VOS任务中，记忆库始终保留第一帧的记忆以及最多 $N$ 个最近（无提示）帧的记忆。两组记忆都存储为空间特征图。

【解析】记忆库是SAM 2实现时序信息管理的核心存储系统，它采用了精心设计的双队列架构来处理不同类型的历史信息。FIFO（先进先出）队列机制确保了记忆库能够在有限的存储空间内保持最相关的历史信息，当新的记忆加入时，最古老的记忆会被自动移除，这种动态更新机制防止了记忆库无限增长导致的计算和存储负担。 $N$ 个最近帧记忆队列专门存储普通的历史预测结果，这些记忆反映了目标物体在时间序列中的连续变化，为模型提供了物体运动轨迹和形变模式的重要信息。 $M$ 个提示帧记忆队列则专门保存包含用户交互信息的关键帧，这些帧通常包含更可靠和准确的分割信息，因为它们融合了用户的明确指导。在视频目标分割任务中，第一帧通常是用户提供初始掩码的起始帧，因此它在整个分割过程中具有特殊的重要性，记忆库会确保这个关键帧的信息始终被保留。空间特征图的存储格式保持了记忆的空间结构信息，这对于理解物体的形状、位置和与周围环境的关系至关重要。这种双队列设计使得模型既能利用最新的时序信息来跟踪物体的当前状态，又能保持对关键交互时刻的长期记忆，实现了短期适应性和长期一致性的有机结合。
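The two-queue behaviour described above maps naturally onto bounded deques. The sketch below is illustrative only: the class name `MemoryBank`, its methods, and the sizes `num_recent=3` and `num_prompted=2` are assumptions, and memories are placeholder strings, not spatial feature maps.

```python
from collections import deque


class MemoryBank:
    """Two FIFO queues: up to N recent frames and up to M prompted frames."""

    def __init__(self, num_recent: int, num_prompted: int):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.recent = deque(maxlen=num_recent)
        self.prompted = deque(maxlen=num_prompted)

    def add(self, frame_idx: int, memory: str, is_prompted: bool) -> None:
        queue = self.prompted if is_prompted else self.recent
        queue.append((frame_idx, memory))

    def frames(self) -> list:
        """Frame indices currently available to memory attention."""
        return [idx for idx, _ in self.prompted] + [idx for idx, _ in self.recent]


# VOS-style usage: the first frame carries the only prompt.
bank = MemoryBank(num_recent=3, num_prompted=2)
bank.add(0, "mem0", is_prompted=True)
for t in range(1, 6):
    bank.add(t, f"mem{t}", is_prompted=False)
print(bank.frames())
```

Because the prompted frame lives in its own queue, it is not evicted as unprompted frames stream in, which matches the VOS example in the paper text.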

In addition to the spatial memory, we store a list of object pointers as lightweight vectors for high-level semantic information of the object to segment, based on mask decoder output tokens of each frame. Our memory attention cross-attends to both spatial memory features and these object pointers.

【翻译】除了空间记忆外，我们还存储一个物体指针列表作为轻量级向量，用于存储要分割物体的高级语义信息，基于每帧的掩码解码器输出标记。我们的记忆注意力对空间记忆特征和这些物体指针都进行交叉注意。

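【Analysis】 Object pointers are a more abstract form of memory that complements the spatial memory. Spatial memory mainly captures geometry and appearance, whereas the paper describes object pointers as lightweight vectors carrying high-level semantic information about the object to segment; the paper does not enumerate which attributes they encode. Because they are small vectors, they add little storage or compute cost. Deriving them from the mask decoder's output tokens is a natural choice: while producing the mask, the decoder has already formed a compact representation of the target object. Collecting one such pointer per frame lets the model accumulate an object-level summary over time. Memory attention then cross-attends to both sources: spatial memory supplies detailed geometric and visual cues, and object pointers supply object-level guidance. Loosely speaking, this lets the model take into account both "where the object is" and "what the object is" when segmenting the current frame.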

We embed temporal position information into the memories of $N$ recent frames, allowing the model to represent short-term object motion, but not into those of prompted frames, because the training signal from prompted frames is sparser and it is more difficult to generalize to the inference setting where prompted frames may come from a very different temporal range than seen during training.

【翻译】我们将时间位置信息嵌入到 $N$ 个最近帧的记忆中，使模型能够表示短期物体运动，但不嵌入到提示帧的记忆中，因为来自提示帧的训练信号更稀疏，且更难泛化到推理设置中，在推理时提示帧可能来自与训练期间看到的非常不同的时间范围。

【解析】时间位置编码设计。对于最近的 $N$ 个帧，模型添加了时间位置信息，这种编码让模型能够理解帧之间的时序关系，从而捕捉物体的运动模式。短期运动信息对于预测下一帧中物体的位置和形状变化至关重要，特别是当物体存在规律性运动或惯性运动时。然而，对于提示帧（用户主动标注的关键帧），系统选择不添加时间位置编码，这背后有深层的训练和泛化考虑。提示帧的训练信号稀疏性是一个关键问题，在训练数据中，用户提示往往集中在视频的某些特定时刻，这些时刻在时间轴上的分布是不均匀的。如果模型学会了依赖提示帧的特定时间位置，那么在实际应用中，当用户在不同时间点提供提示时，模型可能无法正确泛化。更重要的是，在实际推理过程中，用户可能在视频的任意时刻提供提示，这些时刻的时间分布与训练时遇到的模式可能完全不同。通过不为提示帧添加时间位置编码，模型被迫学习时间无关的提示表示，这样无论提示出现在视频的什么位置，模型都能给出一致和可靠的响应。
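The asymmetry between the two memory types can be made concrete with a small sketch. The paper only says that temporal position information is embedded into recent-frame memories and not into prompted-frame memories; representing that information as a relative frame offset, and the function name `tag_memories`, are assumptions made here for illustration.

```python
from typing import List, Optional, Tuple


def tag_memories(
    current_idx: int,
    recent_frames: List[int],
    prompted_frames: List[int],
) -> List[Tuple[int, Optional[int]]]:
    """Pair each memory frame with a temporal offset, or None for prompted frames."""
    # Prompted frames: no temporal position, so they look the same wherever they occur.
    tagged = [(idx, None) for idx in prompted_frames]
    # Recent frames: offset to the current frame (assumed encoding of temporal position).
    tagged += [(idx, current_idx - idx) for idx in recent_frames]
    return tagged


print(tag_memories(current_idx=10, recent_frames=[7, 8, 9], prompted_frames=[0]))
```

In this sketch a prompt given on frame 0 and a prompt given on frame 500 would be presented to the model identically, which is the generalization property the paragraph argues for.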

Training. The model is trained jointly on image and video data. Similar to previous work (Kirillov et al., 2023; Sofiiuk et al., 2022), we simulate interactive prompting of the model. We sample sequences of 8 frames and randomly select up to 2 frames to prompt and probabilistically receive corrective clicks which are sampled using the ground-truth masklet and model predictions during training. The training task is to sequentially (and “interactively”) predict the ground-truth masklet. Initial prompts to the model can be the ground-truth mask with probability 0.5, a positive click sampled from the ground-truth mask with probability 0.25, or a bounding box input with probability 0.25. See § D for more details.

【翻译】训练。模型在图像和视频数据上进行联合训练。与之前的工作（Kirillov et al.，2023；Sofiiuk et al.，2022）类似，我们模拟模型的交互式提示。我们采样8帧序列，随机选择最多2帧进行提示，并概率性地接收纠正点击，这些点击在训练期间使用真实掩码和模型预测进行采样。训练任务是顺序地（和"交互式地"）预测真实掩码。模型的初始提示可以是概率为0.5的真实掩码、概率为0.25的从真实掩码中采样的正向点击，或概率为0.25的边界框输入。详见§D。

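【Analysis】 SAM 2 is trained jointly on static images and videos. The mix lets the model learn basic segmentation from large image datasets while learning temporal modeling and object tracking from video. Simulated interactive prompting is central to training: it reproduces programmatically how a user would interact with the system. Sampling 8-frame sequences balances compute cost against temporal coverage: long enough to learn short-term temporal dependencies, short enough to stay affordable. Prompting at most 2 randomly chosen frames mimics how sparse real prompts are, since users rarely annotate every frame and typically step in on key frames or where the prediction is wrong. Corrective clicks are sampled probabilistically from the difference between the ground-truth masklet and the current prediction: following the usual interactive-segmentation recipe, a negative click is placed in a falsely predicted region and a positive click in a missed region, which teaches the model to use corrective feedback. Sequential prediction requires the model to process frames in temporal order, so that it learns to use past information for the current frame. The varied initial prompts (ground-truth mask with probability 0.5, positive click with 0.25, bounding box with 0.25) expose the model to every supported input type. One plausible reading, not stated in the paper, is that the larger mask probability reflects settings where an accurate initial segmentation is already available, while clicks and boxes correspond to annotating from scratch.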
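The prompt-sampling scheme can be sketched with the probabilities quoted in the paper (0.5 mask, 0.25 click, 0.25 box; 8-frame sequences; at most 2 prompted frames). The function names are hypothetical, and two details are assumptions: the number of prompted frames is drawn uniformly from 1 to 2, and the frames are chosen uniformly at random. The paper defers such details to its § D.

```python
import random
from typing import List


def sample_initial_prompt(rng: random.Random) -> str:
    """Pick the initial prompt type: mask 0.5, click 0.25, box 0.25."""
    r = rng.random()
    if r < 0.5:
        return "mask"
    if r < 0.75:
        return "click"
    return "box"


def sample_prompted_frames(
    rng: random.Random, num_frames: int = 8, max_prompted: int = 2
) -> List[int]:
    """Choose which frames of the sequence receive prompts (assumed uniform)."""
    count = rng.randint(1, max_prompted)
    return sorted(rng.sample(range(num_frames), count))


rng = random.Random(0)
print(sample_initial_prompt(rng), sample_prompted_frames(rng))
```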

5 Data

To develop the capability to “segment anything” in video, we built a data engine to collect a large and diverse video segmentation dataset. We employ an interactive model in the loop setup with human annotators. Similar to Kirillov et al. ( 2023 ), we do not impose semantic constraints on the annotated masklets, and focus on both whole objects (e.g., a person) and parts (e.g., a person’s hat). Our data engine went through three phases, each categorized based on the level of model assistance provided to annotators. Next, we describe each data engine phase and our SA-V dataset.

【翻译】为了开发在视频中"分割任何物体"的能力，我们构建了一个数据引擎来收集大规模且多样化的视频分割数据集。我们采用了人机交互的模型在环设置，配合人工标注者。与Kirillov等人（2023）类似，我们不对标注的masklet施加语义约束，既关注完整物体（如一个人），也关注部分物体（如一个人的帽子）。我们的数据引擎经历了三个阶段，每个阶段都根据提供给标注者的模型辅助程度进行分类。接下来，我们描述每个数据引擎阶段和我们的SA-V数据集。

5.1 Data engine

Phase 1: SAM per frame. The initial phase used the image-based interactive SAM ( Kirillov et al. , 2023 ) to assist human annotation. Annotators are tasked with annotating the mask of a target object in every frame of the video at 6 frames per second (FPS) using SAM, and pixel-precise manual editing tools such as a “brush” and “eraser”. There is no tracking model involved to assist with the temporal propagation of masks to other frames. As this is a per-frame method, and all frames require mask annotation from scratch, the process is slow, with an average annotation time of 37.8 seconds per frame in our experiment. However, this yields high-quality spatial annotations per frame. In this phase, we collected 16K masklets across 1.4K videos. We further use this approach to annotate our SA-V val and test sets to mitigate potential biases of SAM 2 during evaluation.

【翻译】阶段1：逐帧SAM。初始阶段使用基于图像的交互式SAM（Kirillov et al.，2023）来辅助人工标注。标注者的任务是使用SAM和像素精确的手动编辑工具（如"画笔"和"橡皮擦"）以每秒6帧（FPS）的速度标注视频中每一帧目标物体的掩码。没有跟踪模型参与来辅助掩码在其他帧之间的时间传播。由于这是一种逐帧方法，所有帧都需要从头开始进行掩码标注，因此过程缓慢，在我们的实验中平均每帧标注时间为37.8秒。然而，这产生了高质量的每帧空间标注。在这个阶段，我们在1.4K个视频中收集了16K个掩码片段。我们进一步使用这种方法来标注我们的SA-V验证集和测试集，以减轻评估过程中SAM 2的潜在偏差。

Phase 2: SAM + SAM 2 Mask. The second phase added SAM 2 into the loop, where SAM 2 only accepted masks as prompts. We refer to this version as SAM 2 Mask. Annotators used SAM and other tools as in Phase 1 to generate spatial masks in the first frame, and then use SAM 2 Mask to temporally propagate the annotated mask to other frames to get the full spatio-temporal masklets. At any subsequent video frame, annotators can spatially modify the predictions made by SAM 2 Mask by annotating a mask from scratch with SAM, a “brush” and/or “eraser”, and re-propagate with SAM 2 Mask, repeating this process until the masklet is correct. SAM 2 Mask was initially trained on the Phase 1 data and publicly available datasets. During Phase 2, we re-trained and updated SAM 2 Mask in the annotation loop twice using the collected data. In Phase 2, we collected 63.5K masklets. The annotation time went down to 7.4 s/frame, a $\sim5.1\times$ speed up over Phase 1.

【翻译】第二阶段：SAM $\pmb{+}$ SAM 2 掩码。第二阶段将SAM 2加入到循环中，其中SAM 2只接受掩码作为提示。我们将这个版本称为SAM 2 Mask。标注者使用SAM和第一阶段的其他工具在第一帧中生成空间掩码，然后使用SAM 2 Mask将标注的掩码时间上传播到其他帧，以获得完整的时空masklet。在任何后续视频帧中，标注者可以通过使用SAM、"画笔"和/或"橡皮擦"从头开始标注掩码来空间上修改SAM 2 Mask的预测，并使用SAM 2 Mask重新传播，重复此过程直到masklet正确。SAM 2 Mask最初在第一阶段数据和公开可用数据集上进行训练。在第二阶段期间，我们使用收集的数据在标注循环中重新训练和更新SAM 2 Mask两次。在第二阶段，我们收集了63.5K个masklet。标注时间降至7.4秒/帧，比第一阶段提速约 $\sim5.1\times$ 。

Despite an improvement in annotation time, this approach requires annotating masks in intermediate frames from scratch without previous memory. We then advanced to develop the fully-featured SAM 2, capable of both interactive segmentation and mask propagation in a unified model.

【翻译】尽管标注时间有所改善，但这种方法需要在中间帧中从头开始标注掩码，没有先前的记忆。然后我们进一步开发了功能完整的SAM 2，能够在统一模型中进行交互式分割和掩码传播。

Phase 3: SAM 2. In the final phase, we utilize the fully-featured SAM 2, which accepts various types of prompts, including points and masks. SAM 2 benefits from memories of objects across the temporal dimension to generate mask predictions. This means annotators only need to provide occasional refinement clicks to SAM 2 to edit the predicted masklets in intermediate frames, as opposed to annotating from scratch with a spatial SAM which has no such memory context. During Phase 3, we re-trained and updated SAM 2 using the collected annotations five times. With SAM 2 in the loop, the annotation time per frame went down to 4.5 seconds, a $\sim8.4\times$ speed up over Phase 1. In Phase 3, we collected 197.0K masklets.

【翻译】第三阶段：SAM 2。在最后阶段，我们使用功能完整的SAM 2，它接受各种类型的提示，包括点和掩码。SAM 2从跨时间维度的物体记忆中受益，以生成掩码预测。这意味着标注者只需要偶尔向SAM 2提供精化点击来编辑中间帧中的预测masklet，而不需要像没有此类记忆上下文的空间SAM那样从头开始标注。在第三阶段期间，我们使用收集的标注重新训练和更新SAM 2五次。有了SAM 2在循环中，每帧标注时间降至4.5秒，比第一阶段提速 $\sim8.4\times$ 。在第三阶段，我们收集了197.0K个masklet。
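The quoted speed-ups follow directly from the per-frame annotation times reported for the three phases (37.8 s, 7.4 s, 4.5 s). A quick check, using only those figures:

```python
# Average annotation time per frame, as reported for each data engine phase.
seconds_per_frame = {"Phase 1": 37.8, "Phase 2": 7.4, "Phase 3": 4.5}

baseline = seconds_per_frame["Phase 1"]
speedup = {
    phase: round(baseline / seconds, 1)
    for phase, seconds in seconds_per_frame.items()
}
print(speedup)
```

Phase 2 comes out at about 5.1x and Phase 3 at 8.4x relative to Phase 1, matching the text.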

Quality verification. To uphold a high standard for annotation, we introduce a verification step. A separate set of annotators are tasked with verifying the quality of each annotated masklet as “satisfactory” (correctly and consistently tracking the target object across all frames) or “unsatisfactory” (target object is well defined with a clear boundary but the masklet is not correct or consistent). Unsatisfactory masklets were sent back to the annotation pipeline for refinement. Any masklets tracking not well defined objects were rejected entirely.

【翻译】质量验证。为了维持高标准的标注质量，我们引入了验证步骤。一组独立的标注者负责验证每个标注masklet的质量，评定为"满意"（正确且一致地跟踪所有帧中的目标物体）或"不满意"（目标物体定义明确且边界清晰，但masklet不正确或不一致）。不满意的masklet被送回标注流程进行改进。任何跟踪定义不明确物体的masklet都会被完全拒绝。

Table 1 Evolution of data engine phases showing the average annotation time per frame, the average percent of edited frames per masklet, the number of manual clicks per clicked frame, and Mask Alignment to Phase 1 by mask size.

【翻译】表1 数据引擎阶段的演进，显示每帧平均标注时间、每个masklet编辑帧的平均百分比、每个点击帧的手动点击次数，以及按掩码大小与第一阶段的掩码对齐度。

Auto masklet generation. Ensuring diversity in annotation is important to enable the anything capability of our model. As human annotators might typically focus more on salient objects, we augment the annotations with automatically generated masklets (referred to as “Auto”). This serves a dual purpose of increasing the coverage of annotations and helping identify model failure cases. To generate auto masklets, we prompt SAM 2 with a regular grid of points in the first frame and generate candidate masklets. These are then sent to the masklet verification step for filtering. Automatic masklets tagged as “satisfactory” are added to the SA-V dataset. Masklets identified as “unsatisfactory” (i.e., model failure cases) are sampled and presented to annotators to refine with SAM 2 in the loop (Phase 3 of the data engine). These automatic masklets cover large salient central objects but also objects of varying sizes and positions in the background.

【翻译】自动masklet生成。确保标注的多样性对于实现我们模型的"任何物体"能力非常重要。由于人工标注者通常可能更多地关注显著物体，我们用自动生成的masklet（称为"Auto"）来增强标注。这具有增加标注覆盖范围和帮助识别模型失败案例的双重目的。为了生成自动masklet，我们在第一帧中用规律的点网格来提示SAM 2并生成候选masklet。然后将这些发送到masklet验证步骤进行过滤。标记为"满意"的自动masklet被添加到SA-V数据集中。被识别为"不满意"的masklet（即模型失败案例）被采样并呈现给标注者，使用SAM 2在循环中进行改进（数据引擎的第3阶段）。这些自动masklet覆盖了大型显著的中心物体，也包括背景中不同大小和位置的物体。

Analysis. Table 1 shows a comparison of the annotation protocol in each data engine phase through a controlled experiment (details in § E.2.2 ). We compare the average annotation time per frame, the average percentage of manually edited frames per masklet, and the average number of clicks per clicked frame. For quality evaluation, we define the Phase 1 Mask Alignment Score as the percentage of masks whose IoU compared to the corresponding masks in Phase 1 exceeds 0.75. Phase 1 data is chosen as a reference as it has per-frame high quality manual annotations. Phase 3 with SAM 2 in the loop leads to increased efficiency and comparable quality: it is 8.4 $\times$ faster than Phase 1, has the lowest edited frame percentage and clicks per frame, and results in better alignment.

【翻译】分析。表1通过对照实验显示了每个数据引擎阶段标注协议的比较（详见§E.2.2）。我们比较了每帧的平均标注时间、每个masklet手动编辑帧的平均百分比，以及每个点击帧的平均点击次数。对于质量评估，我们将"第1阶段掩码对齐得分"定义为与第1阶段对应掩码相比IoU超过0.75的掩码百分比。选择第1阶段数据作为参考，因为它具有每帧高质量的手动标注。第3阶段将SAM 2纳入循环中，提高了效率并保持了可比的质量：比第1阶段快8.4倍，具有最低的编辑帧百分比和每帧点击次数，并获得更好的对齐效果。
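The Phase 1 Mask Alignment Score defined above (the percentage of masks whose IoU against the corresponding Phase 1 mask exceeds 0.75) can be computed as in the sketch below. Masks are represented as sets of pixel coordinates for simplicity; the function names and the convention that two empty masks have IoU 1.0 are assumptions.

```python
from typing import List, Set, Tuple

Mask = Set[Tuple[int, int]]  # set of (row, col) foreground pixels


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0  # assumed convention for empty masks


def alignment_score(masks: List[Mask], reference: List[Mask], threshold: float = 0.75) -> float:
    """Percentage of masks whose IoU with the reference mask exceeds the threshold."""
    if len(masks) != len(reference) or not masks:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for m, r in zip(masks, reference) if iou(m, r) > threshold)
    return 100.0 * hits / len(masks)


square = {(0, 0), (0, 1), (1, 0), (1, 1)}
half = {(0, 0), (0, 1)}
print(alignment_score([square, half], [square, square]))
```

In the example the first mask matches its reference exactly (IoU 1.0) and the second covers half of it (IoU 0.5), so one of two masks passes the 0.75 threshold.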

In Table 2, we show the performance comparison of SAM 2 trained on the available data at the end of each phase keeping the number of iterations fixed, therefore measuring solely the impact of the additional data. We evaluate on our own SA-V val set and also on 9 zero-shot benchmarks (see § F.1 for details) using the standard $\mathcal{J}\&\mathcal{F}$ accuracy metric (the higher the better) when prompting with 3-clicks on the first frame. We note a consistent improvement after iteratively including the data from each phase, not only on the in-domain SA-V val set, but also on the 9 zero-shot benchmarks.

【翻译】在表2中，我们展示了SAM 2在每个阶段结束时使用可用数据进行训练的性能比较，保持迭代次数固定，因此仅测量额外数据的影响。我们在自己的SA-V验证集和9个零样本基准测试上进行评估（详见§F.1），当在第一帧使用3次点击提示时，使用标准的 $\mathcal{J}\&\mathcal{F}$ 准确性指标（越高越好）。我们注意到，在迭代包含每个阶段的数据后，不仅在域内SA-V验证集上，而且在9个零样本基准测试上都有一致的改善。

Table 2 Segmentation accuracy ( $\mathcal{J} \& \mathcal{F}$ metric) improvement from adding data from each data engine phase. “VOS” is a set of video object segmentation datasets. Details are in §F.

【翻译】表2 通过添加每个数据引擎阶段的数据获得的分割准确性（ $\mathcal{J} \& \mathcal{F}$ 指标）改善。"VOS"是一组视频物体分割数据集。详见§F。

5.2 SA-V dataset

The SA-V dataset collected with our data engine comprises 50.9K videos with 642.6K masklets. In Table 3 we compare the SA-V composition to common VOS datasets across the number of videos, masklets, and masks. Notably, the number of annotated masks is $53\times$ ( $15\times$ without auto) larger than any existing VOS dataset, providing a substantial resource for future work. We are releasing SA-V under a permissive license.

【翻译】使用我们数据引擎收集的SA-V数据集包含50.9K个视频和642.6K个masklet。在表3中，我们将SA-V的组成与常见VOS数据集在视频数量、masklet和掩码方面进行比较。值得注意的是，标注掩码的数量比任何现有VOS数据集大 $53\times$ （不包括自动生成的为 $15\times$ ），为未来工作提供了丰富的资源。我们在宽松许可证下发布SA-V。

Videos. We collected a new set of 50.9K videos captured by crowdworkers. Videos comprise 54% indoor and 46% outdoor scenes with an average duration of 14 seconds. Videos feature “in-the-wild” diverse environments, and cover various everyday scenarios.

【翻译】视频。我们收集了由众包工作者拍摄的50.9K个新视频。视频包括54%的室内场景和46%的户外场景，平均持续时间为14秒。视频具有"野外"多样化环境特征，涵盖各种日常场景。

Masklets. The annotations comprise 190.9K manual masklet annotations and 451.7K automatic masklets collected using our data engine. Example videos with masklets overlaid (manual and automatic) are shown in Fig. 4 . SA-V has $53\times$ ( $15\times$ without auto annotations) more masks than the largest VOS dataset. The disappearance rate ( Ding et al. , 2023 ) in SA-V Manual (the percentage of annotated masklets that disappear in at least one frame and then re-appear) is $42.5\%$ , competitive among existing datasets.

【翻译】Masklet。标注包含使用我们数据引擎收集的190.9K个手动masklet标注和451.7K个自动masklet。图4显示了叠加masklet（手动和自动）的示例视频。SA-V的掩码数量比最大的VOS数据集多 $53\times$ （不包括自动标注为 $15\times$ ）。SA-V Manual中的消失率（Ding等人，2023）（在至少一帧中消失然后重新出现的标注masklet的百分比）为 $42.5\%$ ，在现有数据集中具有竞争力。
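Two figures in this paragraph can be checked or illustrated with a few lines: the manual and automatic masklet counts add up to the dataset total (190.9K + 451.7K = 642.6K), and the disappearance rate is the percentage of masklets that disappear in at least one frame and then re-appear. The sketch below is one way to compute that rate from per-frame visibility flags. The function names are hypothetical, and treating a masklet that is merely absent at the start of the video as not having "disappeared" is an assumption.

```python
from typing import List


def disappears_and_reappears(visible: List[bool]) -> bool:
    """True if the object is visible, then absent for >= 1 frame, then visible again."""
    seen = False
    gone_after_seen = False
    for is_visible in visible:
        if is_visible:
            if gone_after_seen:
                return True
            seen = True
        elif seen:
            gone_after_seen = True
    return False


def disappearance_rate(masklets: List[List[bool]]) -> float:
    """Percentage of masklets that disappear and later re-appear."""
    return 100.0 * sum(disappears_and_reappears(m) for m in masklets) / len(masklets)


masklets = [
    [True, True, False, True],   # occluded once, then back: counts
    [True, True, True, True],    # always visible
    [True, False, False, False], # leaves and never returns
    [False, True, True, True],   # only absent at the start
]
print(disappearance_rate(masklets))

# Consistency check on the reported masklet counts (in thousands).
total_masklets_k = round(190.9 + 451.7, 1)
print(total_masklets_k)
```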

Figure 4 Example videos from the SA-V dataset with masklets overlaid (manual and automatic). Each masklet has a unique color, and each row represents frames from one video, with 1 second between them.

【翻译】图4 SA-V数据集中叠加masklet（手动和自动）的示例视频。每个masklet具有唯一的颜色，每行代表来自一个视频的帧，它们之间间隔1秒。

Table 3 Comparison of our datasets with open source VOS datasets in terms of number of videos, duration, number of masklets, masks, frames, and disappearance rate. SA-V Manual contains only manually annotated labels. SA-V Manual + Auto combines manually annotated labels with automatically generated masklets.

【翻译】表3 我们的数据集与开源VOS数据集在视频数量、持续时间、masklet数量、掩码、帧数和消失率方面的比较。SA-V Manual仅包含手动标注的标签。SA-V Manual + Auto结合了手动标注的标签和自动生成的masklet。

SA-V training, validation and test splits. We split SA-V based on the video authors (and their geographic locations) to ensure minimal overlap of similar objects. To create SA-V val and SA-V test sets, we focus on challenging scenarios in selecting videos, and ask annotators to identify challenging targets that are fast-moving, have complex occlusions by other objects as well as disappearance/re-appearance patterns. These targets were annotated at 6 FPS using the data engine Phase 1 setup in § 5.1 . There are 293 masklets and 155 videos in the SA-V val split, and 278 masklets and 150 videos in the SA-V test split.

【翻译】SA-V训练、验证和测试划分。我们基于视频作者（及其地理位置）对SA-V进行划分，以确保相似物体的最小重叠。为了创建SA-V验证集和SA-V测试集，我们在选择视频时专注于具有挑战性的场景，并要求标注者识别快速移动、被其他物体复杂遮挡以及具有消失/重新出现模式的挑战性目标。这些目标使用§5.1中数据引擎第1阶段设置以6 FPS进行标注。SA-V验证集包含293个masklet和155个视频，SA-V测试集包含278个masklet和150个视频。

Internal dataset. We also used internally available licensed video data to further augment our training set. Our internal dataset is comprised of 62.9K videos and 69.6K masklets annotated in Phase 2 and Phase 3 (see § 5.1 ) for training, and 96 videos and 189 masklets annotated using Phase 1 for testing (Internal-test).

【翻译】内部数据集。我们还使用内部可用的授权视频数据来进一步增强我们的训练集。我们的内部数据集包含62.9K个视频和69.6K个在第2阶段和第3阶段标注的masklet用于训练（见§5.1），以及96个视频和189个使用第1阶段标注的masklet用于测试（Internal-test）。

See Appendix E for more details on the data engine and SA-V dataset, including a fairness evaluation.

【翻译】有关数据引擎和SA-V数据集的更多详细信息，包括公平性评估，请参见附录E。

6 Zero-shot experiments

Here, we compare SAM 2 with previous work on zero-shot video and image tasks. We report the standard $\mathcal{J}\&\mathcal{F}$ metric (Pont-Tuset et al., 2017) for video and mIoU metric for image tasks. Unless otherwise mentioned, the results in this section follow our default setup using Hiera-B+ image encoder with a resolution of 1024 and trained on the full combination of datasets, i.e., SAM 2 (Hiera-B+) in Table 6 (see also § D.2 for details).

【翻译】在这里，我们将SAM 2与之前在零样本视频和图像任务上的工作进行比较。我们报告视频任务的标准 $\mathcal{J}\&\mathcal{F}$ 指标（Pont-Tuset等人，2017）和图像任务的mIoU指标。除非另有说明，本节中的结果遵循我们的默认设置，使用分辨率为1024的Hiera-B+图像编码器，并在数据集的完整组合上进行训练，即表6中的SAM 2 (Hiera-B+)（另见§D.2详细信息）。
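For readers unfamiliar with the video metric, the standard DAVIS-style definition (Pont-Tuset et al., 2017) is summarized below. The symbols $M$ (predicted mask), $G$ (ground-truth mask), and $P$, $R$ (precision and recall of the predicted boundary against the ground-truth boundary) are notation introduced here for illustration, not taken from the SAM 2 paper.

$$
\mathcal{J} = \frac{|M \cap G|}{|M \cup G|}, \qquad
\mathcal{F} = \frac{2PR}{P + R}, \qquad
\mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2}
$$

$\mathcal{J}$ measures region similarity (the Jaccard index, i.e. IoU), $\mathcal{F}$ measures contour accuracy, and the reported score is their mean, averaged over objects and frames.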

6.1 Promptable video segmentation

We first evaluate promptable video segmentation, which involves simulating an interactive setting that resembles the user experience. We have two settings, offline evaluation, where multiple passes are made through a video to select frames to interact with based on the largest model error, and online evaluation, where the frames are annotated in a single forward pass through the video. These evaluations are conducted on 9 densely annotated zero-shot video datasets using $N_{\mathrm{click}}=3$ clicks per frame (see § F.1 for details).

【翻译】我们首先评估可提示的视频分割，这涉及模拟类似用户体验的交互式设置。我们有两种设置：离线评估，通过视频进行多次处理以基于最大模型误差选择要交互的帧；在线评估，通过视频的单次前向传递对帧进行标注。这些评估在9个密集标注的零样本视频数据集上进行，每帧使用 $N_{\mathrm{click}}=3$ 次点击（详见§F.1）。
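In the offline setting, each pass picks the next frame to interact with based on the largest model error. A minimal sketch of that selection step follows; using per-frame IoU against the ground truth as the (inverse) error measure, skipping frames that were already prompted, and the function name `next_frame_to_prompt` are assumptions for illustration.

```python
from typing import List, Optional, Set


def next_frame_to_prompt(per_frame_iou: List[float], already_prompted: Set[int]) -> Optional[int]:
    """Return the index of the not-yet-prompted frame with the lowest IoU."""
    candidates = {
        idx: value
        for idx, value in enumerate(per_frame_iou)
        if idx not in already_prompted
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


per_frame_iou = [0.9, 0.4, 0.7, 0.2]
first = next_frame_to_prompt(per_frame_iou, already_prompted=set())
second = next_frame_to_prompt(per_frame_iou, already_prompted={first})
print(first, second)
```

The online setting differs in that frames are visited once in temporal order, so no such global search over the whole video is available.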

We create two strong baselines, SAM+XMem++ and SAM+Cutie, based on two state-of-the-art models for video object segmentation, XMem++ ( Bekuzarov et al. , 2023 ) and Cutie ( Cheng et al. , 2023a ).

【翻译】我们基于两个最先进的视频物体分割模型XMem++（Bekuzarov等人，2023）和Cutie（Cheng等人，2023a），创建了两个强基准：SAM+XMem++和SAM+Cutie。

Figure 5 Zero-shot accuracy over 9 datasets in interactive offline and online evaluation settings.

【翻译】图5 在交互式离线和在线评估设置下9个数据集的零样本准确性。

We use XMem++ to generate a video segmentation based on mask inputs on one or multiple frames. SAM is used to provide an initial mask or to refine an output (by feeding the current segmentation as a mask prompt to SAM). For the SAM+Cutie baseline, we modify Cutie to allow taking mask inputs on multiple frames.

【翻译】我们使用XMem++基于一个或多个帧上的掩码输入生成视频分割。SAM用于提供初始掩码或优化输出（通过将当前分割作为掩码提示输入到SAM中）。对于SAM+Cutie基准，我们修改Cutie以允许在多个帧上接收掩码输入。

In Fig. 5, we report the average $\mathcal{J}\&\mathcal{F}$ accuracy over $N_{\mathrm{frame}}=1,\dots,8$ interacted frames. SAM 2 outperforms SAM+XMem++ and SAM+Cutie for both offline and online evaluation settings. Across all 9 datasets (see per-dataset results in § F.1), SAM 2 dominates both methods, generating high-quality video segmentation from a few clicks while allowing continued refinement with prompts. Overall, SAM 2 can generate better segmentation accuracy, with $>3\times$ fewer interactions.

【翻译】在图5中，我们报告了在 $N_{\mathrm{frame}}=1,\dots,8$ 个交互帧上的平均 $\mathcal{J}\&\mathcal{F}$ 准确性。SAM 2在离线和在线评估设置中都优于SAM+XMem++和SAM+Cutie。在所有9个数据集中（详见§F.1中的每个数据集结果），SAM 2都优于这两种方法，仅通过少量点击就能生成高质量的视频分割，同时允许通过提示进行持续优化。总体而言，SAM 2能够以减少 $3\times$ 以上的交互次数获得更高的分割准确性。

6.2 Semi-supervised video object segmentation

Table 4 Zero-shot accuracy across 17 video datasets using different prompts. We report average accuracy for each type of prompt (1, 3 or 5 clicks, bounding boxes, or ground-truth masks) in the first video frame ( $^\ddag$ : this case directly uses masks as inputs into XMem++ or Cutie without SAM).

【翻译】表4 使用不同提示在17个视频数据集上的零样本准确性。我们报告在第一个视频帧中每种类型提示（1次、3次或5次点击、边界框或真实标签掩码）的平均准确性（ $^\ddag$ ：这种情况直接将掩码输入到XMem++或Cutie中，不使用SAM）。

We evaluate the semi-supervised video object segmentation (VOS) setting ( Pont-Tuset et al. , 2017 ) with click, box, or mask prompts only on the first frame of the video. When using click prompts, we interactively sample either 1, 3 or 5 clicks on the first video frame.

【翻译】我们评估半监督视频物体分割（VOS）设置（Pont-Tuset等人，2017），仅在视频的第一帧上使用点击、边界框或掩码提示。当使用点击提示时，我们在第一个视频帧上交互式地采样1次、3次或5次点击。

Similar to the interactive setting in § 6.1, we compare to XMem++ and Cutie, using SAM for click and box prompts, and in their default setting when using mask prompts. We report the standard $\mathcal{J}\&\mathcal{F}$ accuracy (Pont-Tuset et al., 2017), except for on VOST (Tokmakov et al., 2022), where we report the $\mathcal{J}$ metric following its protocol. The results are in Table 4. SAM 2 outperforms both methods on the 17 datasets. The results underline that SAM 2 also excels at the conventional, non-interactive VOS task with mask input, for which these other works are specifically designed. Details are in § F.1.3.

【翻译】类似于§6.1中的交互式设置，我们与XMem++和Cutie进行比较，对于点击和边界框提示使用SAM，对于掩码提示使用它们的默认设置。我们报告标准的 $\mathcal{J}\&\mathcal{F}$ 准确性（Pont-Tuset等人，2017），除了在VOST（Tokmakov等人，2022）上，我们按照其协议报告 $\mathcal{J}$ 指标。结果见表4。SAM 2在17个数据集上都优于这两种方法。结果表明，SAM 2在传统的、非交互式的掩码输入VOS任务上也表现出色，而这些其他工作正是专门为此设计的。详细信息见§F.1.3。

6.3 Image segmentation

We evaluate SAM 2 on the Segment Anything task across 37 zero-shot datasets, including 23 datasets previously used by SAM for evaluation. 1-click and 5-click mIoUs are reported in Table 5 and we show the average mIoU by dataset domain and model speed in frames per second (FPS) on a single A100 GPU.

【翻译】我们在37个零样本数据集上评估SAM 2的Segment Anything任务，包括23个之前SAM用于评估的数据集。表5中报告了1次点击和5次点击的mIoU，并展示了按数据集域分类的平均mIoU和在单个A100 GPU上的模型速度（每秒帧数FPS）。

The first column (SA-23 All) shows accuracy on the 23 datasets from SAM. SAM 2 achieves higher accuracy (58.9 mIoU with 1 click) than SAM (58.1 mIoU with 1 click), without using any extra data and while being $6\times$ faster. This can be mainly attributed to the smaller but more effective Hiera image encoder in SAM 2.

【翻译】第一列（SA-23 All）显示了SAM的23个数据集上的准确性。SAM 2在不使用任何额外数据的情况下获得了比SAM更高的准确性（1次点击58.9 mIoU vs 1次点击58.1 mIoU），同时速度快 $6\times$ 。这主要归功于SAM 2中更小但更有效的Hiera图像编码器。

The bottom row shows how training on our SA-1B and video data mix can further improve accuracy to 61.4% on average on the 23 datasets. We also see exceptional gains on the video benchmarks from SA-23 (video datasets are evaluated as images, identical to Kirillov et al. ( 2023 )), and the 14 new video datasets we added. More detailed results including a breakdown by dataset are in § F.3 .

【翻译】底部行显示了在我们的SA-1B和视频数据混合上训练如何进一步将23个数据集上的平均准确性提高到61.4%。我们还在SA-23的视频基准测试中看到了卓越的提升（视频数据集作为图像进行评估，与Kirillov等人（2023）相同），以及我们添加的14个新视频数据集。包括按数据集分解的更详细结果见§F.3。

7 Comparison to state-of-the-art in semi-supervised VOS

Our primary focus is on the general, interactive PVS task, but we also address the specific semi-supervised VOS setting (where the prompt is a ground-truth mask on the first frame), as it is a historically common protocol. We evaluate two versions of SAM 2 with varying image encoder sizes (Hiera-B+/-L) with different speed-vs-accuracy tradeoffs. We measure frames per second (FPS) on a single A100 GPU using a batch-size of one. SAM 2 based on Hiera-B+ and Hiera-L runs at real-time speeds of 43.8 and 30.2 FPS, respectively.

【翻译】我们的主要关注点是通用的交互式PVS任务，但我们也处理特定的半监督VOS设置（其中提示是第一帧上的真实标签掩码），因为这是一个历史上常见的协议。我们评估了两个版本的SAM 2，具有不同的图像编码器尺寸（Hiera-B+/-L），提供不同的速度与准确性权衡。我们在单个A100 GPU上使用批次大小为1来测量每秒帧数（FPS）。基于Hiera-B+和Hiera-L的SAM 2分别以43.8和30.2 FPS的实时速度运行。
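The reported throughput figures translate into a per-frame time budget as follows. This is plain arithmetic on the two FPS numbers quoted above (batch size one, single A100); nothing else is assumed.

```python
# Frames per second reported for the two image encoder sizes.
fps = {"Hiera-B+": 43.8, "Hiera-L": 30.2}

# Average time spent per frame, in milliseconds.
latency_ms = {name: round(1000.0 / value, 1) for name, value in fps.items()}
print(latency_ms)
```

Both variants stay at or below roughly 33 ms per frame, i.e. about 30 frames per second or faster.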

We present a comparison with existing state-of-the-art in Table 6 , reporting accuracy using standard protocols. SAM 2 shows significant improvement over the best existing methods. We observe that using a larger image encoder brings significant accuracy gains across the board.

【翻译】我们在表6中展示了与现有最先进方法的比较，使用标准协议报告准确性。SAM 2相比于现有最佳方法显示出显著改进。我们观察到使用更大的图像编码器在各个方面都带来了显著的准确性提升。

Table 5 Zero-shot accuracy on the Segment Anything (SA) task across 37 datasets. The table shows the average 1- and 5-click mIoU of SAM 2 compared to SAM by domains (image/video). We report the average metrics on the 23 datasets used by SAM (SA-23) and the average across 14 additional zero-shot video datasets (as detailed in § F.3 ).

【翻译】表5 在37个数据集上Segment Anything（SA）任务的零样本准确性。该表显示了SAM 2与SAM按域（图像/视频）比较的平均1次和5次点击mIoU。我们报告了SAM使用的23个数据集（SA-23）上的平均指标以及14个额外零样本视频数据集的平均值（详见§F.3）。

Table 6 VOS comparison to prior work. SAM 2 performs well in accuracy $(\mathcal{J}\&\mathcal{F},\mathcal{G})$ for video segmentation based on first-frame ground-truth mask prompts. SAM 2 performs significantly better on SA-V val/test.

【翻译】表6 与先前工作的VOS比较。SAM 2在基于第一帧真实标签掩码提示的视频分割准确性 $(\mathcal{J}\&\mathcal{F},\mathcal{G})$ 方面表现良好。SAM 2在SA-V val/test上表现显著更好。

We also evaluate existing work on the SA-V val and test sets which measure performance for open-world segments of “any” object class. When comparing on this benchmark, we see that most previous methods peak at around the same accuracy. The best performance on SA-V val and SA-V test for prior work is significantly lower demonstrating the gap to a “segment anything in videos” capability. Finally, we see that SAM 2 also brings notable gains in long-term video object segmentation as observed in the LVOS benchmark result. For data and model ablations, see § A .

【翻译】我们还在SA-V验证集和测试集上评估了现有工作，这些数据集测量对"任何"物体类别的开放世界分割的性能。在这个基准测试上进行比较时，我们看到大多数先前的方法在相似的准确性水平上达到峰值。先前工作在SA-V验证集和SA-V测试集上的最佳性能显著较低，这表明与"在视频中分割任何物体"能力存在差距。最后，我们看到SAM 2在长期视频物体分割方面也带来了显著收益，这在LVOS基准测试结果中得到体现。有关数据和模型消融研究，请参见§A。

8 Conclusion

We present a natural evolution of Segment Anything into the video domain, based on three key aspects: (i) extending the promptable segmentation task to video, (ii) equipping the SAM architecture to use memory when applied to video, and (iii) the diverse SA-V dataset for training and benchmarking video segmentation. We believe SAM 2 marks a significant advancement in visual perception, positioning our contributions as milestones that will propel further research and applications.

【翻译】我们展示了Segment Anything向视频领域的自然演进，基于三个关键方面：(i)将可提示分割任务扩展到视频，(ii)为SAM架构配备在应用于视频时使用记忆的能力，以及(iii)用于训练和基准测试视频分割的多样化SA-V数据集。我们相信SAM 2标志着视觉感知的重大进步，将我们的贡献定位为推动进一步研究和应用的里程碑。