In this paper, we address the temporal moment localization task, namely, localizing a video moment described by a natural language query in an untrimmed video. This is a general yet challenging vision-language task, since it requires not only the localization of moments, but also the multimodal comprehension of textual-temporal information (e.g., "first" and "leaving") that helps to distinguish the desired moment from the others, especially those with similar visual content. While existing studies treat the given language queries as a single unit, we propose to decompose them into two components: the relevant cue, which aids the desired moment localization, and the irrelevant one, which carries no localization signal. This allows us to flexibly adapt to arbitrary queries in an end-to-end framework. In our proposed model, a language-temporal attention network is utilized to learn the word attention based on the temporal context information in the video. Therefore, our model can automatically select "what words to listen to" for localizing the desired moment. We evaluate the proposed model on two public benchmark datasets: DiDeMo and Charades-STA. The experimental results verify its superiority over several state-of-the-art methods.
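Since the model code itself is not excerpted in this README, below is a minimal, self-contained PyTorch sketch of one way such a language-temporal attention could be wired up: each query word is scored against the temporal context of a candidate moment, and the resulting weights pool the word embeddings so that words irrelevant to that moment are down-weighted. The module name, the feature dimensions, and the additive (tanh-based) scoring are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageTemporalAttention(nn.Module):
    """Hypothetical sketch: attend over query words conditioned on the
    temporal context of a candidate video moment."""
    def __init__(self, word_dim, video_dim, hidden_dim):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, words, video_ctx):
        # words:     (batch, num_words, word_dim)  query word embeddings
        # video_ctx: (batch, video_dim)            pooled feature of a candidate moment
        h = torch.tanh(self.word_proj(words)
                       + self.video_proj(video_ctx).unsqueeze(1))  # (batch, num_words, hidden_dim)
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)        # per-word attention weights
        # Attended query representation: irrelevant words receive low weight
        attended = (alpha.unsqueeze(-1) * words).sum(dim=1)        # (batch, word_dim)
        return attended, alpha

# Example usage (dimensions are assumptions, e.g., 300-d word vectors,
# 512-d moment features):
att = LanguageTemporalAttention(word_dim=300, video_dim=512, hidden_dim=256)
query = torch.randn(2, 10, 300)   # 10 word embeddings per query
moment = torch.randn(2, 512)      # one candidate moment feature per sample
attended, weights = att(query, moment)
print(weights.shape)              # torch.Size([2, 10]): one weight per word
```

The returned weights make the "what words to listen to" selection inspectable: for a given candidate moment, they indicate which query words the model treats as relevant cues.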
- Paper: ACM MM 2018
- Code Download: Baidu Netdisk
- Extraction Code: k9ew
Our method achieves competitive or superior results compared with previous state-of-the-art methods on the DiDeMo and Charades-STA benchmarks.
Copyright (C) 2018 Shandong University
This program is licensed under the GNU General Public License v3.0.
You may obtain a copy of the license at:
https://www.gnu.org/licenses/gpl-3.0.html
Any derivative work based on this program that is distributed to a third party must also be licensed under the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
The copyright of this program is owned by Shandong University.
For commercial projects that require distributing this code as part of a program that cannot be released under the GNU General Public License, please contact mengliu.sdu@gmail.com to obtain a commercial license.
If you find this project useful in your research, please consider citing:
@inproceedings{10.1145/3240508.3240549,
  author    = {Liu, Meng and Wang, Xiang and Nie, Liqiang and Tian, Qi and Chen, Baoquan and Chua, Tat-Seng},
  title     = {Cross-modal Moment Localization in Videos},
  year      = {2018},
  booktitle = {Proceedings of the 26th ACM International Conference on Multimedia},
  pages     = {843--851}
}

