Abstract
Real-world videos often have complex dynamics; methods for generating
open-domain video descriptions should be sensitive to temporal structure and
allow both input (sequence of frames) and output (sequence of words) of variable
length. To approach this problem, we propose a novel end-to-end
sequence-to-sequence model to generate captions for videos. For this we exploit
recurrent neural networks, specifically LSTMs, which have demonstrated
state-of-the-art performance in image caption generation.
Our LSTM model is trained on video-sentence pairs and learns to associate a
sequence of video frames to a sequence of words in order to generate a
description of the event in the video clip. Our model is naturally able to learn
the temporal structure of the sequence of frames as well as the sequence model
of the generated sentences, i.e., a language model.
We evaluate several variants of our model that exploit different visual features on a
standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
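The architecture described above is a stacked two-layer LSTM that first encodes the sequence of per-frame CNN features and then decodes a sequence of words. The sketch below (written in PyTorch for illustration; it is not the authors' released implementation, and the layer sizes, vocabulary size, and variable names are placeholders) shows one way such an encoder-decoder could be wired up, with zero padding on the word inputs during encoding and on the frame inputs during decoding.

# Minimal S2VT-style sketch (illustrative only, not the paper's original code).
# LSTM1 reads projected frame features; LSTM2 reads LSTM1's outputs
# concatenated with word embeddings and predicts one word per decoding step.
import torch
import torch.nn as nn

class S2VT(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=500, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.frame_fc = nn.Linear(feat_dim, hidden_dim)   # project CNN features
        self.lstm1 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, n_words)
        B, T, _ = frame_feats.shape
        n_words = captions.size(1)
        frames = self.frame_fc(frame_feats)
        # Encoding stage: LSTM1 sees real frames, then zero padding while decoding.
        pad_frames = torch.zeros(B, n_words, frames.size(2), device=frames.device)
        h1, _ = self.lstm1(torch.cat([frames, pad_frames], dim=1))
        # LSTM2 sees zero word inputs while frames are read, then the
        # ground-truth words during training (teacher forcing).
        pad_words = torch.zeros(B, T, self.embed.embedding_dim, device=frames.device)
        words = torch.cat([pad_words, self.embed(captions)], dim=1)
        h2, _ = self.lstm2(torch.cat([words, h1], dim=2))
        # Vocabulary scores at the decoding time steps only.
        return self.out(h2[:, T:, :])

Under this setup, training would minimize a per-word cross-entropy over the decoding steps, and at test time the previously generated word would be fed back in at each step instead of the ground-truth caption.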
PDF
Slides
Poster
Overview
An overview of the S2VT video-to-text architecture.
ICCV 2015 Spotlight Video.
Reference
If you find this useful in your work, please consider citing:
@inproceedings{venugopalan15iccv,
  title     = {Sequence to Sequence -- Video to Text},
  author    = {Venugopalan, Subhashini and Rohrbach, Marcus and Donahue, Jeff and
               Mooney, Raymond and Darrell, Trevor and Saenko, Kate},
  booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  year      = {2015}
}