CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Abstract

We adapt an image-text pre-trained model to video-text pre-training (i.e., post-pretraining). In this work, we propose CLIP-ViP: an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism, built on top of CLIP.
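For intuition, below is a minimal sketch of the video-proxy idea: learnable proxy tokens are prepended to the patch tokens of all frames so that a frame-level encoder such as CLIP's ViT can aggregate video-level information. This is an illustrative assumption-based example, not the released CLIP-ViP implementation; the class and parameter names (VideoProxyPooler, num_proxies) are hypothetical.

# Minimal sketch (assumed, not the authors' code): learnable proxy tokens
# pooled over flattened per-frame patch embeddings.
import torch
import torch.nn as nn


class VideoProxyPooler(nn.Module):
    def __init__(self, dim: int = 512, num_proxies: int = 4, num_heads: int = 8):
        super().__init__()
        # Learnable proxy tokens shared across all videos.
        self.proxies = nn.Parameter(torch.randn(num_proxies, dim) * 0.02)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames * patches_per_frame, dim),
        # e.g. per-frame CLIP patch embeddings flattened over time.
        b = frame_tokens.size(0)
        proxies = self.proxies.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([proxies, frame_tokens], dim=1)
        tokens = self.encoder(tokens)
        # Take the first proxy token as the video-level representation.
        return tokens[:, 0]


# Usage: pool 8 frames x 49 patches of 512-d features into one video vector.
video_feat = VideoProxyPooler()(torch.randn(2, 8 * 49, 512))
print(video_feat.shape)  # torch.Size([2, 512])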

Publication
In The Eleventh International Conference on Learning Representations (ICLR 2023)