A Holistic Approach to Human Gaze Understanding: Probabilistic Modeling, Geometric Reasoning, and Multimodal Learning
Event Description
Abstract: Human gaze behavior is a fundamental cue for understanding social intent, human-machine interaction, and cognitive processes. This thesis addresses the challenges of gaze target estimation (GTE), also known as gaze following, by developing a holistic understanding of gaze in complex environments.
The first part of this work improves GTE performance by introducing Patch-level Distribution Prediction (PDP). Unlike traditional models that rely on strict pixel-wise regression, PDP models gaze as a distribution over patches, which better accounts for annotation variance and bridges the gap between target location and in/out-of-frame prediction. To address the laborious nature of data labeling, the second part presents GCDR, the first semi-supervised method for gaze following. By prompting large Visual Question Answering (VQA) models to generate initial Grad-CAM heatmaps and refining them with a diffusion model, this method achieves high performance with significantly fewer human annotations. The third part expands the applicability of GTE to multi-camera environments. By introducing the Multi-View Gaze Target (MVGT) dataset, along with two novel frameworks for integrating information between multiple views and predicting the gaze target across views, we explore a new direction that overcomes single-view limitations such as face occlusion and out-of-view targets.
Building on these foundations, the final part of this thesis proposes a new direction toward semantic social gaze understanding using next-generation multimodal Large Language Models (LLMs). Rather than focusing solely on geometric gaze target localization, we aim to enrich gaze prediction with semantic and relational interpretation in complex social scenes. To this end, we will leverage existing gaze-following datasets to derive social gaze supervision, including mutual gaze and shared attention, and obtain aligned language descriptions of scene-level gaze behaviors. The proposed work will enable the model not only to locate gaze targets but also to predict structured social gaze relations among individuals, while generating a concise natural-language summary of the dominant gaze interactions. By integrating spatial gaze estimation, social relation reasoning, and language-based scene understanding within a unified multimodal model, this work takes an important step toward a holistic understanding of human gaze behavior in real-world environments.
Speaker: Qiaomu Miao