Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Abstract

We introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from HD-VILA-100M. We propose a Multimodal Temporal Contrastive (MTC) loss to learn temporal relations across modalities by encouraging fine-grained alignment between long-form videos and paragraphs. We further propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependencies while reducing the computational cost of the Transformer.
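To give a concrete sense of the fine-grained video-paragraph alignment the MTC loss encourages, the sketch below shows a minimal InfoNCE-style temporal contrastive objective between clip embeddings and their temporally corresponding sentence embeddings. This is an illustrative assumption, not the paper's exact MTC formulation; the function name, tensor shapes, and temperature value are placeholders chosen for the example.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(clip_emb: torch.Tensor,
                              sent_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Illustrative clip-sentence alignment loss (not the exact MTC loss).

    clip_emb: (T, D) embeddings of T consecutive video clips.
    sent_emb: (T, D) embeddings of the T temporally aligned sentences.
    Each clip is pulled toward its own sentence and pushed away from the
    other sentences in the same paragraph, and vice versa.
    """
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    # (T, T) similarity matrix between all clip-sentence pairs.
    logits = clip_emb @ sent_emb.t() / temperature
    targets = torch.arange(clip_emb.size(0), device=clip_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```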

Publication
In the 36th Conference on Neural Information Processing Systems (NeurIPS 2022)