Beyond the Video: Action Anticipation Involving Text

You are driving a car down a city street when you see a child playing with a ball by the side of the road. When the ball suddenly shoots across the road, you slam on the brakes, anticipating that the child may run into the street chasing the ball.

Action anticipation, the task of predicting the next action in a sequence, is an important problem in computer vision. For instance, accurately predicting the actions of other cars in a driving sequence is essential to building safe self-driving vehicles. AI systems for action anticipation are typically trained on video frames, but text, and the language models trained on it, can be another helpful source of knowledge.

In their recent paper “Distilling Knowledge from Language Models for Video-based Action Anticipation,” Professors Minh Hoai Nguyen and Niranjan Balasubramanian of the Stony Brook Department of Computer Science, along with several Computer Science PhD candidates, go beyond the boundaries of video and bring text into their action anticipation research.

“Most of the datasets online now are written in text, so it’s easier to collect than video,” says PhD candidate Vinh Tran.

To bring text into the task, the researchers used a text-based model as the teacher for a video-based student model, an approach known as language-to-vision knowledge distillation.

The text-based model, or the “teacher,” was built on the knowledge in a pre-trained language model. The teacher was trained with both the video frames and the action labels on those frames. In other words, it was trained on both video and text. The “student,” the vision-based model, was then trained under the teacher’s supervision. The researchers found that this knowledge distillation improved performance.
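
For readers curious what teacher-student training can look like in code, here is a minimal Python sketch of a generic knowledge distillation loss. It is not the method from the paper: the function names, the temperature, the weighting `alpha`, and the toy scores are all illustrative assumptions. The idea is that the student is penalized both for missing the true next action and for disagreeing with the teacher's softened predictions.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_scores, teacher_scores, true_action,
                      alpha=0.5, temperature=2.0):
    """Generic distillation loss (an assumed form, not taken from the paper).

    alpha weights the hard-label term; (1 - alpha) weights the teacher term.
    """
    # Hard-label term: cross-entropy between the student and the true next action.
    student_probs = softmax(student_scores)
    hard_term = -math.log(student_probs[true_action])

    # Soft-label term: KL divergence from the teacher's softened predictions
    # to the student's softened predictions.
    teacher_soft = softmax(teacher_scores, temperature)
    student_soft = softmax(student_scores, temperature)
    soft_term = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_soft, student_soft)
        if p > 0
    )

    # Scaling the soft term by temperature squared is a common convention.
    return alpha * hard_term + (1 - alpha) * (temperature ** 2) * soft_term


# Toy example with three hypothetical candidate actions.
teacher = [0.2, 0.5, 3.0]   # hypothetical teacher scores
student = [1.0, 1.0, 1.2]   # hypothetical student scores
loss = distillation_loss(student, teacher, true_action=2)
```

In this sketch, a student whose scores match the teacher's pays no penalty on the teacher term, so the loss shrinks as the student learns to imitate the teacher while still predicting the labeled action.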

“Action anticipation helps many applications, for example, autonomous driving. A car can anticipate that a person in front of it is going to cross the street, and it can detect potential danger,” says Tran. “A second example would be surveillance cameras for the elderly. If your grandparent is about to reach something from the top shelf, you can be alerted to prevent an accident. It’s very useful for our daily lives.”

Advances in AI such as action anticipation matter in everyday life, including for the safety of our communities. Bringing language models into action anticipation training can help these advances come to fruition.

-Sara Giarnieri, Communications Assistant

AI Innovation Institute