An Intelligent Brain in Robots
Event Description
Abstract: Robot control has evolved from optimization-based controllers, which are precise but task-specific, through the learned policies of deep reinforcement learning, to Vision-Language-Action (VLA) models, which leverage pretrained vision-language backbones for language-conditioned manipulation across diverse tasks.
Despite their promise, VLAs exhibit a critical limitation: they function primarily as trajectory learners rather than skill learners. Recent evaluations reveal that VLAs often fail when faced with even minor variations in object initialization or environmental conditions, suggesting they memorize specific trajectories rather than acquiring generalizable manipulation skills. Attempts to address this through 3D spatial representations have shown limited success, indicating that the missing component may be more fundamental than geometric understanding alone.
This work argues that World Models (WMs), internal representations that predict future states given actions, constitute the missing piece for robust VLA systems. We present one completed contribution and two ongoing investigations.
We developed a dual-layer world model for human-robot interaction that anticipates both physical scene evolution and latent human preferences in assistive tasks. Building on this foundation, we present ongoing work probing VLA internal representations to test whether an implicit world model exists, and we propose a WM-VLA integration approach that operates in the native visual domain through embedding prediction and image decoding.
Together, these contributions and investigations establish a foundation for WM-VLA systems, pointing toward robust, generalizable robot policies.
Speaker: Jason Qin
Location: NCS 220