Stony Brook researchers develop a novel approach to speech and facial animation, paving the way for humanlike, emotionally intelligent communication with virtual avatars.
Stony Brook, NY — April 21, 2025 — Researchers at Stony Brook University and Meta AI have collaborated to develop a new AI model that generates photorealistic 4D talking avatars from text. The system, called AV-Flow, produces highly synchronized speech, lip movements, facial expressions, and head motion, marking a significant step forward in natural human-computer interaction.
The rise of large language models like ChatGPT has made it easier to interact with artificial intelligence through natural language. But most of these interactions are still limited to text on a screen or, at best, a disembodied voice. For communication to truly feel human, visual and emotional cues—like tone, expression, and head nods—are just as important as words.
AV-Flow transforms plain text into animated 4D avatars that not only speak but also actively listen and react as a person would. This includes lip-syncing, head movement, and facial expressions that align with what’s being said. The result is a more natural and engaging experience — something much closer to talking with another person than with a chatbot.
At the heart of AV-Flow is a distinctive approach: instead of generating audio and visuals separately, the system produces both in parallel. To achieve this, it uses two connected models that “talk” to each other while generating output. One model focuses on the voice, while the other generates the visual animation, and the connections between them are designed so that the timing and emotion of the voice stay aligned with the facial movements and expressions.
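To make the idea concrete, the toy sketch below shows one way two parallel generators could exchange information while decoding audio and facial motion from the same text features. Every name, layer choice, and dimension here (ParallelTalkingHeads, the GRU towers, the cross-connection layers) is an illustrative assumption for exposition, not the AV-Flow implementation itself.

```python
# Conceptual sketch only -- module names, sizes, and the fusion step are
# illustrative assumptions, not the authors' released implementation.
import torch
import torch.nn as nn

class ParallelTalkingHeads(nn.Module):
    """Two towers generate audio and visual features from the same text,
    exchanging hidden states so timing and emotion stay aligned."""

    def __init__(self, text_dim=512, hidden=512, audio_dim=80, visual_dim=256):
        super().__init__()
        self.audio_tower = nn.GRU(text_dim, hidden, batch_first=True)
        self.visual_tower = nn.GRU(text_dim, hidden, batch_first=True)
        # Cross-connections let each tower condition on the other's state.
        self.audio_from_visual = nn.Linear(hidden, hidden)
        self.visual_from_audio = nn.Linear(hidden, hidden)
        self.to_audio = nn.Linear(hidden, audio_dim)    # e.g., speech frames
        self.to_visual = nn.Linear(hidden, visual_dim)  # e.g., face/head motion codes

    def forward(self, text_feats):
        # text_feats: (batch, time, text_dim) per-frame text features
        a, _ = self.audio_tower(text_feats)
        v, _ = self.visual_tower(text_feats)
        # "Talk" to each other: fuse the parallel streams before decoding,
        # so voice timing/emotion and facial motion are predicted jointly.
        a_fused = a + self.audio_from_visual(v)
        v_fused = v + self.visual_from_audio(a)
        return self.to_audio(a_fused), self.to_visual(v_fused)

# Example: 50 frames of text features -> 50 synchronized audio and face-motion frames
model = ParallelTalkingHeads()
text_feats = torch.randn(1, 50, 512)
audio_frames, visual_frames = model(text_feats)
```

The point the sketch mirrors is that neither stream is generated first and then “dubbed” onto the other; both are decoded together from mutually conditioned states.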
AV-Flow also goes a step further than most systems by handling two-way interactions. Not only can the avatar speak expressively, but it can also appear to listen, nodding and smiling, or showing subtle reactions to what a user says. This makes it ideal for creating always-on virtual agents that feel attentive and emotionally responsive.
During a conversation, the generated avatar reacts, with its gaze or a smile, according to the participant’s expression and tone of voice.
The results are impressive. The generated avatars are highly realistic, with fluid movement and synchronized audio-visual output. Aggelina Chatziagapi, a Ph.D. student in the computer science department at Stony Brook, said, “Unlike many previous systems, AV-Flow does not require a pre-recorded voice. Everything is generated from scratch, based only on the input text.”
Dimitris Samaras, SUNY Empire Innovation Professor at Stony Brook University, said, “AV-Flow is a major step toward more natural communication with AI. Whether it's powering virtual tutors, healthcare assistants, customer service agents, or creative storytelling tools, this technology brings us closer to emotionally aware and visually present AI companions.”
By combining speech, expression, and active listening into a single system, AV-Flow offers a glimpse into the future of human-AI interaction — one where talking to an AI might feel just like talking to a person.
Ankita Nagpal
Communications Assistant