You are driving a car down a city street, when you see a child playing with a ball by the side of the road. When the ball suddenly shoots across the road, you slam on the brakes—anticipating that the child may run out in the street chasing the ball.
Action anticipation, the task of predicting the next action in a sequence, is an important problem in computer vision. For instance, accurately predicting the actions of other cars in a driving sequence is essential to building safe self-driving vehicles. Typically, video frames are used to train AI systems for action anticipation, but language models are another type of data that can be helpful here.