CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Abstract

We adapt an image-text pre-trained model to video-text pre-training (i.e., post-pretraining). In this work, we propose CLIP-ViP: an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism, built on top of CLIP.
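For intuition, below is a minimal sketch of the video-proxy idea: learnable proxy tokens are prepended to the patch tokens of all frames so that a frame-level encoder such as CLIP's ViT can aggregate video-level information. This is an illustrative assumption-based example, not the released CLIP-ViP implementation; the class and parameter names (VideoProxyPooler, num_proxies) are hypothetical.

# Minimal sketch (assumed, not the authors' code): learnable proxy tokens
# pooled over flattened per-frame patch embeddings.
import torch
import torch.nn as nn


class VideoProxyPooler(nn.Module):
    def __init__(self, dim: int = 512, num_proxies: int = 4, num_heads: int = 8):
        super().__init__()
        # Learnable proxy tokens shared across all videos.
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim) * 0.02)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * patches_per_frame, dim),
        # e.g. per-frame CLIP patch embeddings flattened over time.
        b = frame_tokens.size(0)
        proxies = self.proxies.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([proxies, frame_tokens], dim=1)
        tokens = self.encoder(tokens)
        # Take the first proxy token as the video-level representation.
        return tokens[:, 0]


# Usage: pool 8 frames x 49 patches of 512-d features into one video vector.
video_feat = VideoProxyPooler()(torch.randn(2, 8 * 49, 512))
print(video_feat.shape)  # torch.Size([2, 512])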

Publication
In The Eleventh International Conference on Learning Representations (ICLR 2023)