Sample, Representation, and Computational Efficiency for Visual Generation

Event Description

Abstract: Visual generation is a fundamental problem in computer vision and graphics, with applications ranging from 3D capture to content creation and image/video synthesis. Despite rapid progress in neural rendering and generative models, efficiency remains a key obstacle in practice: high-quality 3D reconstruction often depends on dense multi-view supervision; scalable 3D synthesis faces heavy optimization, training, and rendering costs; and modern image/video generators incur substantial computation as token grids grow with spatial resolution and temporal length.
This thesis targets efficient visual world modeling by improving sample efficiency in 3D reconstruction, representation efficiency in 3D generation, and computational efficiency in image/video synthesis. First, we improve sample efficiency for neural implicit surface reconstruction under sparse views by integrating multi-view stereo probability volumes as a geometric regularizer, enabling high-quality reconstruction from as few as three input images. Next, for 3D generation, we introduce an explicit 3D representation built from multi-view depth and RGB predictions with 3D Gaussian features; it leverages 2D generative priors while enforcing multi-view consistency through epipolar attention. We then address the computational bottleneck of image and video synthesis with importance-based token merging, which uses importance signals available during generation to preserve critical information while merging redundant tokens. Finally, we propose efficient mixed-resolution diffusion transformers built on cross-resolution phase-aligned attention, which aims to stabilize attention over mixed token grids and support high-fidelity mixed-resolution generation.
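
For readers unfamiliar with token merging, the sketch below (in PyTorch) illustrates the general idea behind importance-based merging: keep the highest-importance tokens and average every remaining token into its most similar kept token, so redundant content is pooled rather than dropped. This is a minimal illustration under our own assumptions, not the speaker's method: the function name importance_token_merge, the (N, D) token layout, the cosine-similarity assignment, and the use of feature norms as a stand-in importance signal are all hypothetical; in the work described above, the importance signal would come from quantities computed during generation itself (e.g., attention weights).

import torch
import torch.nn.functional as F

def importance_token_merge(tokens: torch.Tensor,
                           importance: torch.Tensor,
                           keep: int) -> torch.Tensor:
    # tokens: (N, D) token embeddings; importance: (N,) per-token scores.
    n = tokens.shape[0]
    keep_idx = importance.topk(keep).indices          # tokens to retain
    mask = torch.ones(n, dtype=torch.bool, device=tokens.device)
    mask[keep_idx] = False
    merge_idx = mask.nonzero(as_tuple=True)[0]        # tokens to merge away

    kept = tokens[keep_idx]                           # (keep, D)
    merged_src = tokens[merge_idx]                    # (N - keep, D)

    # Assign each merged-away token to its most similar kept token.
    sim = F.normalize(merged_src, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)                       # (N - keep,)

    # Average the merged tokens into their assigned kept tokens so that
    # redundant content is pooled rather than discarded.
    out = kept.clone()
    counts = torch.ones(keep, device=tokens.device)
    out.index_add_(0, assign, merged_src)
    counts.index_add_(0, assign, torch.ones_like(assign, dtype=counts.dtype))
    return out / counts.unsqueeze(-1)                 # (keep, D) reduced token set

# Example: reduce a 64x64 latent grid (4,096 tokens) to 1,024 tokens, using
# feature norm as a hypothetical importance signal.
tokens = torch.randn(4096, 320)
merged = importance_token_merge(tokens, tokens.norm(dim=-1), keep=1024)  # (1024, 320)

Because self-attention cost grows quadratically with the number of tokens, merging 4,096 tokens down to 1,024 reduces that cost by roughly a factor of 16, which is where the computational savings in image/video synthesis come from.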

Speaker: Haoyu Wu

Location: NCS120
