Predicting Subjective Attributes in Visual Data - Zijun Wei

Abstract: Recent progress in deep neural networks has revolutionized many computer vision tasks such as image classification, detection and segmentation. However, in addition to excelling in tasks that predict well-defined objective information, human-centered artificial intelligence systems should also be able to model subjective attributes, as defined by human perceptual behavior, that go beyond the pure physical content of visual data. Example subjective tasks are the prediction of spatial or temporal regions that are interesting to humans (e.g., attract attention or are visually pleasing) and the recognition of subjective attributes (e.g., visually elicited sentiments). Better models for these tasks will improve the human-computer interaction experience in various applications. This thesis investigates several approaches to address the challenges in predicting those subjective attributes in visual data over a diverse set of tasks. I first present a novel framework for real-time automatic photo composition. The framework consists of a cost-effective data collection workflow, an efficient model training pipeline and a lightweight module to account for personalized preferences. Then I develop a novel and general algorithm to detect interesting segments in sequential data, which can be naturally applied to video summarization tasks. Furthermore, I propose methods that learn to represent sentiments elicited by images, in an unsupervised manner, using linguistic features extracted from large-scale Web data. To conclude this thesis, I introduce a human-vision-inspired image classification algorithm that also predicts spatial visual attention even though no attention data was used to train it.

Abstract: Pre-trained diffusion and flow matching models have made visual generation remarkably powerful, enabling high-fidelity synthesis of images and videos from natural language prompts. However, their behavior is still largely dictated by the pre-training data distribution and likelihood objective, which do not directly encode downstream desiderata such as fine-grained semantic alignment, controllability, or realism. This gap motivates post-training: starting from a base generator and further optimizing it with additional supervision signals derived from human or reward model preferences. This work presents post-training for visual generative models through two complementary case studies. First, Hummingbird addresses the problem of fine-grained contextual alignment in image-text-to-image generation. We introduce a multimodal context evaluator that scores the consistency between rich contextual descriptions and generated images, capturing fine-grained alignment beyond global CLIP similarity. By directly backpropagating these differentiable rewards through the diffusion sampler, Hummingbird substantially improves semantic faithfulness while preserving high visual quality.
Second, PISCES tackles post-training for text-to-video generation, where alignment is inherently semantic-spatio-temporal. We show that naive VLM-based rewards suffer from distributional mismatch and token-level misalignment, leading to reward hacking and suboptimal optimization. PISCES introduces a bi-objective, Optimal Transport (OT)-aligned reward module: distributional OT using Neural Optimal Transport to align text and video embedding distributions, and discrete, partial OT over a spatio-temporal cost matrix to capture semantic alignment at the token level. These rewards are integrated into both direct backpropagation and GRPO-style optimization to post-train state-of-the-art text-to-video generators. Together, Hummingbird and PISCES provide a unified view of how carefully designed visual reward models, coupled with OT-based representation alignment, can reliably improve the downstream behavior of pre-trained image and video generators.

Speaker: Minh Quan Le

Location: NCS 220

Zoom: https://stonybrook.zoom.us/j/94798224254?pwd=CFraer25qnpORbJ14aAVHRwaSJOjJM.1

Abstract: Capturing the spatio-temporal (4D) dynamics of humans has been a long-standing research problem in computer vision and graphics. Synthesizing photorealistic human avatars has broad applications, ranging from immersive telepresence in AR/VR and the movie industry, to enriching the education and healthcare systems. Earlier approaches relied on hand-engineered models that use a small amount of data from one or more subjects. With the advent of neural networks, training on large datasets enhanced the output visual quality. Currently, the combination of neural networks with graphics techniques has achieved natural-looking human animation. However, most approaches are identity-specific, trained only on a single identity, and use only one modality.

In this dissertation, we address the problem of learning neural representations of humans in a holistic way. Given that the video data in the real world include multiple modalities (e.g., audio and video) and multiple identities, we develop multi-modal and multi-identity representations. First, we propose to reconstruct the 4D face geometry of humans by leveraging both audio and video information. In this way, the network produces accurate lip shapes and is robust to cases where either modality is insufficient. Next, we introduce a NeRF-based representation for audio-driven human face animation that achieves high-quality lip synchronization for cinematic content. Since humans communicate with their full body, combining body pose, hand gestures, and facial expressions, we extend the network to capture full-body human motion for multiple identities simultaneously. To better disentangle identity-specific and non-identity-specific information, we subsequently study non-linear interactions between latent factors of variation, and propose a specific multiplicative module. In this way, we learn a multi-identity NeRF that robustly animates human faces under novel expressions and achieves a significant decrease in the total training time. Similarly, we propose a multi-identity Gaussian splatting representation for human bodies, by constructing a high-order tensor. Assuming a low-rank structure, we learn a tensor decomposition that leads to a significant decrease in the total number of learnable parameters, as well as robust animation under novel poses. Last but not least, we propose to jointly synthesize audio and visual outputs from just text input. Given the recent rise of large language models, coupling text with natural-looking avatars can enhance the overall interaction between a human and an AI system.

Location: NCS 220 or Zoom

University Libraries Presents: The Library AI Club is a welcoming space for students, faculty, and staff to explore AI in a supportive, low-pressure environment. Meeting every two weeks, the club features discussions, collaborative projects, guest speakers, and hands-on experiments. Join us to learn, share ideas, and engage with AI responsibly and creatively. We'd love to see you at an upcoming meeting!

Location: Melville Library, Scholarly Communication Seminar Room

Abstract: Many foundation models for digital pathology have been released recently. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. For this reason, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. Such foundation models are often used as feature extractors and combined with Multiple Instance Learning (MIL) aggregators for downstream tasks. Such aggregation must be efficient and reliable. We will focus on two specific examples of this: (i) HistAug, a fast and efficient generative model for controllable augmentations in the latent space of foundation models to perform data augmentation for MIL, and (ii) CAR-MIL, a method based on counterfactual attention regularisation to improve the reliability of attention maps of MIL methods.

Short-bio: Pierre Marza is a Postdoctoral Researcher at CentraleSupelec in the Biomathematics team of the MICS lab, studying Computer Vision and Deep Learning for Medical Imaging, with a focus on Digital Pathology. Prior to this, he was a PhD student at INSA Lyon, in the LIRIS and CITI labs, advised by Christian Wolf, and co-advised by Laetitia Matignon and Olivier Simonin. He studied Visual Navigation, Embodied AI, and Spatial Reasoning, and more specifically how to learn to represent 3D space, generalize to new environments and master diverse tasks from light supervision.

Location: NCS 220

Zoom: https://stonybrook.zoom.us/j/94798224254?pwd=CFraer25qnpORbJ14aAVHRwaSJOjJM.1

Research challenges in using computer vision in robotics systems

Abstract: The past decade has seen a remarkable increase in the level of performance of computer vision techniques, including through the introduction of effective deep learning techniques. Much of this progress is in the form of rapidly increasing performance on standard, curated datasets. However, translating these results into operational vision systems for robotics applications remains a formidable challenge. This talk will explore some of the fundamental questions at the boundary between computer vision and robotics that need to be addressed. These include introspection/self-awareness of performance, anytime algorithms for computer vision, multi-hypothesis generation, and rapid learning and adaptation. The discussion will be illustrated by examples from autonomous air and ground robots.