Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

Abstract

We collect a large-scale video-language dataset. It is the first high-resolution dataset of this kind, comprising 371.5k hours of 720p video, and the most diversified, covering 15 popular YouTube categories. We also propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks.

Publication
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition