Temporal-Relational CrossTransformers for Few-Shot Action Recognition: Study Notes


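This paper proposes Temporal-Relational CrossTransformers (TRX), a new method for few-shot action recognition. Unlike earlier methods, TRX uses an attention mechanism to build query-specific class prototypes, matching and aggregating over all sub-sequences of the support-set videos rather than using class averages or a single best match. By using ordered tuples of varying numbers of frames, TRX can better match actions at different speeds and temporal offsets, which suits fine-grained classification. TRX achieves state-of-the-art results on Kinetics, SSv2, HMDB51, and UCF101, and detailed ablation studies demonstrate its advantages.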


Temporal-Relational CrossTransformers for Few-Shot Action Recognition

Abstract
Introduction
Related Work
Method
Ablations

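First author: Toby Perrett (homepage)
The author previously worked mainly on LSTMs and meta-learning. The code for this paper was open-sourced quickly (link below), and the author is helpful and answers questions patiently.
GitHub source code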

Abstract

Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared.

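We focus on these two sentences: the first states how the method differs from earlier few-shot learning approaches, and the second states what problem it addresses.

It attends to the relevant sub-sequences of all support-set videos, rather than using class averages or a single best match (as earlier methods did).
Video representations are built from ordered tuples of varying numbers of frames, so action sub-sequences at different speeds and temporal offsets can be compared.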
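To make the two points above concrete, here is a minimal pure-Python sketch, not the authors' implementation. It enumerates ordered frame tuples and builds a query-specific prototype as an attention-weighted sum of support tuple embeddings. The function names (`ordered_tuples`, `tuple_embedding`, `query_specific_prototype`) are hypothetical, and the sketch assumes plain concatenated frame features with scaled dot-product attention, leaving out any learned projections or normalization the real model uses.

```python
import math
from itertools import combinations


def ordered_tuples(num_frames, cardinality):
    """All frame-index tuples of the given size, kept in temporal order."""
    return list(combinations(range(num_frames), cardinality))


def tuple_embedding(frames, idx):
    """Concatenate per-frame features for one tuple.

    Assumption: a stand-in for a learned tuple embedding.
    """
    return [value for i in idx for value in frames[i]]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_specific_prototype(query_vec, support_vecs):
    """Attention-weighted sum of support tuple embeddings for one query tuple.

    Every support tuple contributes, weighted by its similarity to the query,
    instead of taking a class average or only the single best match.
    """
    scale = math.sqrt(len(query_vec))
    weights = softmax([dot(query_vec, s) / scale for s in support_vecs])
    prototype = [
        sum(w * s[d] for w, s in zip(weights, support_vecs))
        for d in range(len(query_vec))
    ]
    return prototype, weights


# Usage: pairs of frames from a 4-frame query clip and a 4-frame support clip.
query_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
support_frames = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0], [0.4, 0.6]]
pairs = ordered_tuples(4, 2)
support_vecs = [tuple_embedding(support_frames, p) for p in pairs]
query_vec = tuple_embedding(query_frames, pairs[0])
prototype, weights = query_specific_prototype(query_vec, support_vecs)
```

Because tuples such as (0, 1) and (0, 3) span different temporal gaps, the same query tuple can align with a faster or slower version of the action in the support set. In a full model, a distance between each query tuple and its prototype would then be aggregated over all query tuples to score each class.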

Introduction

We propose a novel approach to few-shot action recognition, which we term Temporal-Relational CrossTransformers (TRX). A query-specific class prototype is constructed by using an attention mechanism to match each query sub-sequence against all sub-sequences in the support set, and aggregating this evidence. By performing the attention operation over temporally