CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue*,
Yuchong Sun*,
Bei Liu,
Jianlong Fu,
Ruihua Song,
Houqiang Li,
Jiebo Luo
January 2023
Abstract
We adapt image-text pre-trained models to video-text pre-training (i.e., post-pretraining). In this work, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism, built on top of CLIP and hence named CLIP-ViP.
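The abstract does not detail the Video Proxy mechanism, but the general idea of a proxy-token design can be illustrated: a small set of learnable proxy tokens is prepended to the flattened per-frame patch tokens, and an attention mask lets proxies attend globally while patch tokens attend only to proxies and to patches of their own frame. The sketch below is a toy, assumption-laden illustration of that pattern (token counts, the masking rule, and all function names are our own, not taken from the paper); a minimal single-head attention in NumPy stands in for CLIP's Vision Transformer.

```python
import numpy as np

def proxy_attention_mask(num_proxy, num_frames, patches_per_frame):
    """Boolean mask (True = may attend). Proxy tokens attend everywhere;
    patch tokens attend to proxies and to patches of their own frame only.
    This masking rule is an illustrative assumption, not the paper's spec."""
    n = num_proxy + num_frames * patches_per_frame
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_proxy, :] = True           # proxies see all tokens
    mask[:, :num_proxy] = True           # every token sees the proxies
    for f in range(num_frames):
        s = num_proxy + f * patches_per_frame
        e = s + patches_per_frame
        mask[s:e, s:e] = True            # intra-frame attention only
    return mask

def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention with a boolean mask."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block disallowed pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

# Toy shapes: 4 proxy tokens, 3 frames, 5 patches per frame, dim 8.
rng = np.random.default_rng(0)
P, T, K, D = 4, 3, 5, 8
proxies = rng.normal(size=(P, D))        # learnable video proxy tokens
patches = rng.normal(size=(T * K, D))    # flattened per-frame patch tokens
tokens = np.concatenate([proxies, patches], axis=0)

mask = proxy_attention_mask(P, T, K)
out = masked_self_attention(tokens, mask)
video_repr = out[:P].mean(axis=0)        # pool proxy outputs as the video feature
```

In a real post-pretraining setup, `video_repr` would be contrastively aligned with a text embedding; here it only demonstrates how proxy tokens can aggregate cross-frame information while keeping per-frame attention cheap.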
Publication
In The Eleventh International Conference on Learning Representations (ICLR 2023)