VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Lin, Kevin Qinghong; Shou, Mike Zheng


arXiv:2503.09402 (cs)

[Submitted on 12 Mar 2025 (v1), last revised 9 Jun 2025 (this version, v2)]

Abstract: Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog features three key innovations: (i) A generative retrieval model, combining a language model's complex reasoning capabilities with contrastive retrieval's ability to flexibly update the narration vocabulary. (ii) A hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary update strategy leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, offering a novel perspective on video understanding. Code is released at this https URL.
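
The hierarchical indexing idea in (ii), finding a specific event by first identifying its broader scenario, can be illustrated with a toy two-stage lookup. This is only a minimal sketch under stated assumptions, not the authors' implementation: the `VOCAB` table, its hand-written 3-dimensional embeddings, and the `cosine` and `retrieve` helpers are all hypothetical, and the sketch covers neither the GPT-2-based generative retrieval nor the narration pair encoding algorithm.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy vocabulary: scenario -> (scenario embedding, {narration: embedding}).
# The numbers are made up for illustration and do not come from the paper.
VOCAB = {
    "kitchen": ([1.0, 0.0, 0.0], {
        "cutting a tomato": [0.9, 0.1, 0.0],
        "washing a pan": [0.8, 0.0, 0.3],
    }),
    "bedroom": ([0.0, 1.0, 0.0], {
        "turning off an alarm": [0.1, 0.9, 0.0],
        "making the bed": [0.0, 0.8, 0.3],
    }),
}


def retrieve(query):
    """Two-stage lookup: pick the closest scenario, then the closest narration inside it."""
    # Stage 1: coarse match against scenario embeddings only.
    scenario = max(VOCAB, key=lambda s: cosine(query, VOCAB[s][0]))
    # Stage 2: fine match restricted to that scenario's narrations.
    narrations = VOCAB[scenario][1]
    narration = max(narrations, key=lambda n: cosine(query, narrations[n]))
    return scenario, narration


if __name__ == "__main__":
    print(retrieve([0.95, 0.05, 0.0]))  # ('kitchen', 'cutting a tomato')
```

Restricting the second stage to one scenario means the query is compared against only that scenario's narrations rather than the whole vocabulary, which is the efficiency argument the abstract makes for a hierarchical vocabulary.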

Comments:	Accepted by CVPR 2025. GitHub: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.09402 [cs.CV]
	(or arXiv:2503.09402v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.09402

Submission history

From: Qinghong Lin
[v1] Wed, 12 Mar 2025 13:53:30 UTC (26,562 KB)
[v2] Mon, 9 Jun 2025 16:24:26 UTC (25,651 KB)
