Self-supervised Visual Representation Learning with Synthetic Data
Event Description
Abstract: Pretraining vision encoders with self-supervised learning (SSL) leads to stronger representations that excel across diverse downstream tasks. One of the key factors enabling self-supervision is extracting multiple views of the same scene to formulate either: 1) View-invariant pretraining (DINO, SimCLR, iBOT), where the objective is predicting the same representation for different views of the scene; or 2) Cross-view pretraining (cross-view Masked Autoencoders), where the objective is predicting missing parts of one view using other views. For extracting multiple views, view-invariant methods rely on a combination of handcrafted augmentations (random cropping, color jittering, Gaussian blur, etc.) of the same image, whereas cross-view pretraining methods rely on image cropping or video frames. In this work, we present methods to effectively incorporate synthetic views from diffusion models into SSL training.
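A minimal sketch of the view-invariant setup described above, assuming PyTorch and torchvision; the augmentation parameters, the ResNet-18 backbone, and the cosine-similarity objective are illustrative placeholders rather than any specific method's recipe:

```python
# Sketch: two handcrafted-augmented views of one image, trained to map to the
# same representation (in the spirit of SimCLR/DINO-style view invariance).
import torch
import torch.nn.functional as F
from torchvision import models, transforms

# Handcrafted augmentations used to extract two views of the same image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),  # random cropping
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),           # color jittering
    transforms.GaussianBlur(kernel_size=23),               # Gaussian blur
    transforms.ToTensor(),
])

encoder = models.resnet18(weights=None)   # stand-in for a ViT/ResNet backbone
encoder.fc = torch.nn.Identity()

def invariance_loss(image_pil):
    """Encourage the same representation for two augmented views of one image."""
    v1 = augment(image_pil).unsqueeze(0)
    v2 = augment(image_pil).unsqueeze(0)
    z1, z2 = encoder(v1), encoder(v2)
    # Negative cosine similarity: minimized when both views map to the same point.
    return -F.cosine_similarity(z1, z2, dim=-1).mean()
```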
For view-invariant pretraining, we introduce Gen-SIS, a method that leverages the ability of diffusion models to generate interpolated images by interpolating in their conditioning space. We introduce a disentanglement pretext task: disentangling the two source images from an interpolated synthetic image. This disentanglement task, in addition to vanilla single-source generative augmentation for view extraction, improves visual pretraining of various view-invariant methods (DINO, SimCLR, iBOT).
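As a rough illustration of such a disentanglement pretext task (not the exact Gen-SIS formulation), one could predict the embeddings of both source images from the embedding of their diffusion-interpolated image; the module names, embedding dimension, and loss below are hypothetical:

```python
# Hypothetical sketch: given a synthetic image generated by interpolating two
# sources in the diffusion model's conditioning space, predict representations
# of both sources from the encoder's embedding of the interpolated image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangleHead(nn.Module):
    """Maps the embedding of an interpolated image to two predicted source embeddings."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim * 2)
        )

    def forward(self, z_mix: torch.Tensor):
        z_a, z_b = self.mlp(z_mix).chunk(2, dim=-1)
        return z_a, z_b

def disentanglement_loss(encoder, head, img_a, img_b, img_mix):
    """img_mix is the diffusion-interpolated image of sources img_a and img_b."""
    with torch.no_grad():                     # targets from the (frozen/momentum) encoder
        t_a, t_b = encoder(img_a), encoder(img_b)
    z_a, z_b = head(encoder(img_mix))
    # Pull each predicted embedding toward its corresponding source embedding.
    return (-F.cosine_similarity(z_a, t_a, dim=-1).mean()
            - F.cosine_similarity(z_b, t_b, dim=-1).mean())
```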
For cross-view pretraining, we introduce CDG-MAE, a novel cross-view masked autoencoder (MAE) method that uses diverse synthetic views, generated from static images via an image-conditioned diffusion model, to learn dense correspondences. We present a quantitative method to evaluate the local and global consistency of the generated views and thereby choose a suitable diffusion model for cross-view pretraining. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video-based (expensive) and crop-based (limited variation) methods. CDG-MAE substantially narrows the gap to video-based MAE methods on video label propagation tasks while retaining the data advantages of image-only MAEs.
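A simplified sketch of one cross-view MAE training step under these assumptions: an anchor image is heavily masked and its masked patches are reconstructed with help from an unmasked synthetic view of the same scene. The patch size, masking ratio, and tiny transformer modules are placeholders, not the CDG-MAE architecture:

```python
# Illustrative cross-view MAE step: reconstruct masked anchor patches using
# visible anchor patches plus patches from a diffusion-generated synthetic view.
import torch
import torch.nn as nn

def patchify(img, p=16):
    """(B, C, H, W) -> (B, N, p*p*C) non-overlapping patches."""
    B, C, H, W = img.shape
    img = img.reshape(B, C, H // p, p, W // p, p)
    return img.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)

class CrossViewMAE(nn.Module):
    def __init__(self, patch_dim=16 * 16 * 3, dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
        self.head = nn.Linear(dim, patch_dim)

    def forward(self, anchor, synthetic_view):
        tgt = patchify(anchor)            # anchor patches are the reconstruction targets
        ctx = patchify(synthetic_view)    # synthetic view supplies cross-view context
        B, N, D = tgt.shape
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=tgt.device).argsort(dim=1)
        vis_idx, mask_idx = idx[:, :keep], idx[:, keep:]
        vis = torch.gather(tgt, 1, vis_idx.unsqueeze(-1).expand(-1, -1, D))
        # Encode visible anchor patches jointly with all synthetic-view patches.
        enc = self.encoder(torch.cat([self.embed(vis), self.embed(ctx)], dim=1))
        # Decode learned mask tokens against the encoded cross-view context.
        masks = self.mask_token.expand(B, mask_idx.size(1), -1)
        dec = self.decoder(torch.cat([enc, masks], dim=1))[:, -mask_idx.size(1):]
        pred = self.head(dec)
        target = torch.gather(tgt, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        return ((pred - target) ** 2).mean()   # pixel reconstruction loss on masked patches
```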
Speaker: Varun Belagali
Location: NCS 120
Zoom: https://stonybrook.zoom.