V-JEPA extends the Joint-Embedding Predictive Architecture (JEPA) principle from images to video, training a visual encoder by predicting masked spatio-temporal regions of a video within a learned ...
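To make the masked-prediction idea concrete, here is a minimal NumPy sketch of a JEPA-style objective on video tokens. It is an illustration under stated assumptions, not the V-JEPA implementation: the helper names (`tube_mask`, `masked_latent_loss`), the single-block "tube" mask, the toy linear encoders, the mean-feature predictor, and the L1 loss are all hypothetical choices made for this example. The one point it is meant to show is that the prediction error is measured on encoder features at the masked positions, not on pixels.

```python
import numpy as np

def tube_mask(t, h, w, block_h, block_w, rng):
    """Boolean mask of shape (t, h, w); True marks masked tokens.

    Assumption for this sketch: one spatial block is sampled and repeated
    over every time step, giving a spatio-temporal "tube".
    """
    top = rng.integers(0, h - block_h + 1)    # upper bound is exclusive
    left = rng.integers(0, w - block_w + 1)
    mask = np.zeros((t, h, w), dtype=bool)
    mask[:, top:top + block_h, left:left + block_w] = True
    return mask

def masked_latent_loss(pred, target, mask):
    """Mean absolute error computed over masked tokens only.

    pred, target: (num_tokens, dim) feature arrays; mask: (num_tokens,) bool.
    """
    return float(np.abs(pred[mask] - target[mask]).mean())

rng = np.random.default_rng(0)
t, h, w, dim = 4, 6, 6, 8                     # toy token grid and feature size
tokens = rng.normal(size=(t * h * w, dim))    # stand-in for patch embeddings
mask = tube_mask(t, h, w, block_h=3, block_w=3, rng=rng).reshape(-1)

# Toy linear "encoder" shared by both branches (an assumption of this sketch).
w_enc = rng.normal(size=(dim, dim)) / np.sqrt(dim)

# Target branch: encodes every token; its output is treated as a constant.
target = tokens @ w_enc

# Context branch: masked tokens are zeroed as a crude stand-in for dropping them.
context = np.where(mask[:, None], 0.0, tokens) @ w_enc

# Toy predictor: fill each masked position with the mean visible feature.
pred = context.copy()
pred[mask] = context[~mask].mean(axis=0)

loss = masked_latent_loss(pred, target, mask)
print(f"masked tokens: {int(mask.sum())}, latent L1 loss: {loss:.4f}")
```

In a real training setup the encoders and predictor would be neural networks updated by gradient descent; the sketch only fixes the shapes and the place where the loss is computed.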