Stony Brook University researchers have found a way to create your dynamic 3D avatar using video clips from your phone.
Have you ever wished you had a 3D avatar to talk to your friends in a fantasy world? Or that you could teach your kid to interact with elephants in virtual reality? Or that you had a digital double starring in your reels?
Bringing fantastical worlds to life, like the San-Ti’s planet from the novel The Three-Body Problem, may seem far-fetched. But we are closer to creating these realities than we were less than two decades ago, when Google introduced Street View in Maps.
This year, Stony Brook University researchers collaborated with Adobe on a new project, titled ‘Rig3DGS: Creating Controllable Portraits from Casual Monocular Videos.’ Their goal was to use video clips from phones to generate 3D portraits in which an AI model can fully control a person’s expressions, movements, and the camera angle.
Alfredo Rivero, a Ph.D. student and Research Assistant at Stony Brook University, says, “Being able to create 3D human portraits is crucial to create immersive experiences, and yet, it’s still not possible for everyday consumers to realize this technology using basic smartphone cameras.”
Their biggest challenge was to recover a neutral expression and head pose from the video. ShahRukh Athar, another Ph.D. student, adds, “Having these neutral expressions was important, so later we could alter them to generate newer expressions and head profiles.”
This was difficult because deriving a neutral face from a person’s expressions in video clips tended to produce images with facial artifacts, like a distorted lip or a half-open eye. The problem worsens when the video offers only a single point of view.
The team tackled this issue by first using existing computer vision techniques to reconstruct a 3D structure of the person and the scene in the video.
Then, they focused on the face, building up a neutral expression and head pose layer by layer. This also allowed the model to capture fine details of a person’s face, such as hair and glasses, and reproduce them with incredible accuracy.
Once the neutral expression and pose were generated, the team turned to the video frame by frame, so the neutral portrait could imitate what was happening in each frame. This was done with the help of a 3D face model called FLAME.
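In code, that per-frame step can be pictured as warping a canonical (neutral) 3D representation using the expression and head-pose parameters tracked for each frame. The minimal PyTorch sketch below is purely illustrative: `deform_canonical_points`, `deform_net`, and the parameter sizes are assumptions made for the example, not the team’s actual implementation or the FLAME library’s API.

```python
import torch

def deform_canonical_points(canonical_points, expression, pose, deform_net):
    """Warp a canonical (neutral) 3D point set so it imitates one video frame.

    Hypothetical sketch: `expression` and `pose` stand in for FLAME-style
    per-frame codes, and `deform_net` is an assumed small network that
    predicts a per-point 3D offset. Names and shapes are illustrative only.
    """
    n = canonical_points.shape[0]
    # Broadcast this frame's expression/pose codes to every 3D point.
    codes = torch.cat([expression, pose]).expand(n, -1)
    # Predict a 3D offset for each point, conditioned on the frame's codes.
    offsets = deform_net(torch.cat([canonical_points, codes], dim=-1))
    return canonical_points + offsets


# Toy usage with stand-in data for a single video frame.
deform_net = torch.nn.Sequential(
    torch.nn.Linear(3 + 50 + 6, 128), torch.nn.ReLU(), torch.nn.Linear(128, 3)
)
canonical = torch.randn(10_000, 3)  # neutral 3D portrait points (stand-in)
frame_expr = torch.zeros(50)        # FLAME-style expression code for this frame
frame_pose = torch.zeros(6)         # head rotation/translation for this frame
posed = deform_canonical_points(canonical, frame_expr, frame_pose, deform_net)
```

The key idea the sketch illustrates is that the neutral portrait stays fixed, while a small learned deformation, conditioned on each frame’s codes, moves it to match the video.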
“Once this process was complete, and we had converted this static 3D model into an animated 3D model that could capture a person’s facial expressions and movements, we could use the same frame-by-frame technique to tweak things, reanimate, and generate a new video, with new facial expressions, poses, and backgrounds,” says Professor Dimitris Samaras, SUNY Empire Innovation Professor and Director of the Computer Vision Lab at the Department of Computer Science, Stony Brook University.
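Continuing the same hypothetical sketch, reanimation then amounts to feeding the rigged model new expression and pose codes instead of the ones tracked from the original footage; the arrays below are stand-in values, not the team’s data or interface.

```python
# Continuing the sketch above: drive the same rigged model with new codes.
new_expressions = torch.randn(120, 50) * 0.1  # 120 frames of novel expressions
new_poses = torch.zeros(120, 6)               # keep the head still in this example
reanimated = [
    deform_canonical_points(canonical, e, p, deform_net)
    for e, p in zip(new_expressions, new_poses)
]
# Each entry is a posed 3D point set, ready to be rendered into a new video frame.
```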
This project, which was supported in part by the CDC/NIOSH and a gift from Adobe, is not yet complete. The model has one drawback: it requires the person in the video to remain relatively still. The team plans to address this in future work.
“The applications of this model are tremendous,” adds Professor Samaras, “making it possible for people to use simple videos captured from smartphones to generate their avatars. These avatars can exist in infinite realities: speak in virtual meetings, game with friends, act in movies, and teach children, from anywhere in the world, at any time.”
Ankita Nagpal
Communications Assistant