# Video Search with V-JEPA 2 and LanceDB

V-JEPA 2 is a self-supervised video model designed to enhance AI's understanding, prediction, and planning capabilities in real-world environments. The model is pre-trained on over one million hours of internet video using a mask-denoising objective in representation space, and achieves state-of-the-art performance in video understanding and human action anticipation.


Subsequently, an action-conditioned variant, V-JEPA 2-AC, is fine-tuned with a limited amount of robot interaction data, enabling zero-shot robotic planning for tasks like object manipulation. The research also highlights V-JEPA 2's effectiveness when integrated with a large language model for video question-answering tasks, achieving strong results.
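The video-search side of this example boils down to embedding-based retrieval: each video clip is encoded into a vector with V-JEPA 2, the vectors are stored in LanceDB, and a query is answered by nearest-neighbor search over those vectors. The following is a minimal, self-contained sketch of that retrieval step using cosine similarity over mock embeddings; the file names and three-dimensional vectors are illustrative stand-ins for real pooled V-JEPA 2 features, and in the actual pipeline LanceDB would handle the storage and indexing.

```python
import math

# Mock video embeddings standing in for pooled V-JEPA 2 features.
# In the real pipeline these vectors would be produced by the model
# and stored in a LanceDB table; plain Python keeps the sketch
# self-contained. All names and values here are hypothetical.
video_index = {
    "cooking.mp4": [0.9, 0.1, 0.0],
    "soccer.mp4":  [0.1, 0.9, 0.2],
    "concert.mp4": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, k=2):
    """Return the k video names whose embeddings are closest to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# A query embedding close to the "cooking" vector retrieves it first.
results = search([1.0, 0.0, 0.1], video_index)
```

In practice the query vector would itself come from embedding a query clip (or, for text queries, from a model aligned with the video embedding space), and LanceDB's approximate-nearest-neighbor index replaces the exhaustive sort shown here.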