Multi-Modal Neural Representations for Humans

Event Description

Abstract:

Capturing the spatio-temporal (4D) dynamics of humans has been a long standing research problem in computer vision and graphics. Synthesizing photorealistic human avatars has broad applications, ranging from immersive telepresence in AR/VR and the movie industry, to enriching the education and healthcare systems. Earlier approaches relied on hand-engineered models that use a small amount of data from one or more subjects. With the advent of neural networks, training on large datasets enhanced the output visual quality. Currently, the combination of neural networks with graphics techniques has achieved natural-looking human animation. However, most approaches are identity-specific, trained only on a single identity, and use only one modality.

In this thesis, we address the problem of learning neural representations of humans in a holistic way. Given that the video data in the real world include multiple modalities (audio and video) and multiple identities, we develop multi-modal and multi-identity representations. First, we propose to reconstruct the 4D face geometry of humans by leveraging both audio and video information. In this way, the network produces accurate lip shapes and is robust to cases when either modality is insufficient. Next, we introduce a NeRF-based representation for audio-driven human face animation that achieves high-quality lip synchronization for cinematic content. Since humans communicate with their full body, combining body pose, hand gestures, and facial expressions, we extend our network to capture the full-body human motion for multiple identities simultaneously. In order to better disentangle identity and non-identity specific information, we subsequently study non-linear interactions between latent factors of variation, and propose a specific multiplicative module. In this way, we learn a multi-identity NeRF that robustly animates human faces under novel expressions and achieves a significant decrease in the total training time. Similarly, we propose a multi-identity gaussian splatting representation for human bodies, by constructing a high-order tensor. Assuming a low-rank structure, we learn a tensor decomposition that leads to a significant decrease in the total number of learnable parameters, as well as to a robust animation under novel poses. In the future, we propose to jointly synthesize audio and visual outputs from just text input. Given the recent rise of large language models, coupling text with natural-looking avatars can enhance the overall interaction between a human and an AI system.

Speaker: Aggelina Chatziagapi

Where: NCS, Room 220

Zoom link: https://stonybrook.zoom.us/j/98775312249?pwd=uORNAnSdcssrPZdqOsqaMAF5aLcRD9.1
ID: 98775312249
Passcode: 505777

Date Start

Date End