Object-Centric VOL
Unsupervised Open-Vocabulary Object Localization in Videos

ICCV 2023

Ke Fan¹,*, Zechen Bai²,*, Tianjun Xiao², Dominik Zietlow², Max Horn², Zixu Zhao²,
Carl-Johann Simon-Gabriel², Mike Zheng Shou³, Francesco Locatello²,
Bernt Schiele², Thomas Brox², Zheng Zhang²,†, Yanwei Fu¹,†, Tong He²

¹Fudan University, ²Amazon Web Services, ³National University of Singapore


Abstract

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.

Model Pipeline

We propose an unsupervised video object localization method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP.
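The localization step described above can be illustrated with a heavily simplified slot attention loop. This is a minimal sketch, not the authors' implementation: it omits the learned projections, GRU update, and MLP of the full slot attention module, and the function name `slot_attention`, its parameters, and the toy features are assumptions made for illustration. It shows only the core idea that slots compete for input features through a softmax taken over the slot axis.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def slot_attention(features, num_slots, iters=3, seed=0):
    """Simplified slot attention (assumption: no learned weights, GRU, or MLP).

    features: list of feature vectors, e.g. patch embeddings from video frames.
    Returns (slots, attn), where attn[i][k] is the share of feature i claimed
    by slot k, computed from the slots as they were before the final update.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    # Slots start from random noise and are refined iteratively.
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]
    attn = []
    for _ in range(iters):
        # Softmax over slots: each feature distributes itself among the slots.
        attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
        for k in range(num_slots):
            weight = sum(row[k] for row in attn) + 1e-8
            # Each slot becomes the attention-weighted mean of its features.
            slots[k] = [
                sum(attn[i][k] * features[i][d] for i in range(len(features))) / weight
                for d in range(dim)
            ]
    return slots, attn


# Toy input: four 2-D "patch features" (hypothetical values).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
slots, attn = slot_attention(features, num_slots=2)
```

In a video setting, the features would come from patches across several frames, so the attention mass a slot claims over time plays the role of a tube.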

Results

We use slot attention in feature space to localize tubes (second column), assign text to the slot features via a CLIP model that was modified to allow local feature alignment (third column), and finally merge slots that overlap in text space (last column).
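The text assignment and merging steps can be sketched as follows. This is an illustrative simplification under stated assumptions, not the paper's procedure: slot features and text embeddings are assumed to already live in a shared space (in the method this comes from the modified CLIP model), each slot takes the label with the highest cosine similarity, and "overlap in text space" is approximated by merging slots that receive the same label. The names `label_slots` and `merge_by_label` and all toy values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def label_slots(slot_features, text_embeddings):
    """For each slot, pick the name whose embedding is most similar to it."""
    labels = []
    for feat in slot_features:
        best = max(text_embeddings, key=lambda name: cosine(feat, text_embeddings[name]))
        labels.append(best)
    return labels


def merge_by_label(masks, labels):
    """Union the masks (sets of patch indices) of slots sharing a label.

    Assumption: identical labels stand in for "overlap in text space".
    """
    merged = {}
    for mask, label in zip(masks, labels):
        merged.setdefault(label, set()).update(mask)
    return merged


# Toy example with hypothetical 2-D embeddings.
slot_features = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
text_embeddings = {"dog": [1.0, 0.0], "ball": [0.0, 1.0]}
masks = [{0, 1}, {2}, {3, 4}]

labels = label_slots(slot_features, text_embeddings)  # ["dog", "dog", "ball"]
merged = merge_by_label(masks, labels)  # "dog" covers patches 0-2, "ball" covers 3-4
```

Merging on exact label equality is the simplest possible criterion; a similarity threshold between the slots' text-space features would be a natural alternative.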

BibTeX


@InProceedings{Fan_2023_ICCV,
    author    = {Fan, Ke and Bai, Zechen and Xiao, Tianjun and Zietlow, Dominik and Horn, Max and Zhao, Zixu and Simon-Gabriel, Carl-Johann and Shou, Mike Zheng and Locatello, Francesco and Schiele, Bernt and Brox, Thomas and Zhang, Zheng and Fu, Yanwei and He, Tong},
    title     = {Unsupervised Open-Vocabulary Object Localization in Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {13747-13755}
}
