Imagine machines that can see beyond human limitations--drones locating hidden survivors, cameras predicting structural failures, or medical devices detecting tumors beneath the skin. Traditional vision systems are constrained by the boundaries of human perception, missing vast information present in light interactions. This talk explores the development of advanced vision systems that capture underutilized dimensions of light, model intricate light-scene interactions, and extract hidden 3D information--around corners, beneath surfaces, and at high speeds. By jointly developing novel imaging hardware, efficient rendering models, and physics-based learning algorithms, we aim to transcend conventional vision capabilities--unlocking critical applications in autonomous navigation, structural monitoring, and non-invasive medical imaging.

Speaker Bio:


Akshat Dave is a Postdoctoral Associate at the MIT Media Lab in the Camera Culture group, working with Prof. Ramesh Raskar. He received his Ph.D. from the Rice University ECE Department in 2023, where he was advised by Prof. Ashok Veeraraghavan. His research lies at the intersection of applied optics, computer graphics, and computer vision, and focuses on developing vision systems that go beyond human perception. His work has been recognized by Rice University's Best Thesis Award, the OSA Best Paper Prize, and fellowships from Texas Instruments and Qualcomm.
Abstract: Visual generation is a fundamental problem in computer vision and graphics, with applications ranging from 3D capture to content creation and image/video synthesis. Despite rapid progress in neural rendering and generative models, efficiency remains a key obstacle in practice: high-quality 3D reconstruction often depends on dense multi-view supervision; scalable 3D synthesis faces heavy optimization, training, and rendering costs; and modern image/video generators incur substantial computation as token grids grow with spatial resolution and temporal length.
This thesis targets efficient visual world modeling by improving sample efficiency in 3D reconstruction, representation efficiency in 3D generation, and computational efficiency in image/video synthesis. First, we improve sample efficiency for neural implicit surface reconstruction under sparse views by integrating multi-view stereo probability volumes as a geometric regularizer, enabling high-quality reconstruction from as few as three input images. Next, we introduce an explicit 3D representation for 3D generation, built from multi-view depth and RGB predictions with 3D Gaussian features, which enables the use of 2D generative priors while enforcing multi-view consistency via epipolar attention. We then address the computational bottleneck of image and video synthesis with importance-based token merging, using importance signals available during generation to preserve critical information while merging redundant tokens. Finally, we propose efficient mixed-resolution diffusion transformers via cross-resolution phase-aligned attention, aiming to improve attention stability under mixed token grids and support high-fidelity mixed-resolution generation.
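
As a rough illustration of the token-merging idea described above, the sketch below keeps the most important tokens and averages each remaining token into its most similar kept token. It is a minimal sketch under stated assumptions, not the thesis method: the function names, the cosine-similarity matching, the plain averaging, and the toy importance scores are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, importance, keep):
    """Keep the `keep` most important tokens and average every other token
    into its most similar kept token (hypothetical merging rule)."""
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept = sorted(order[:keep])  # preserve the original token order
    groups = {i: [tokens[i]] for i in kept}
    for i in order[keep:]:
        target = max(kept, key=lambda k: cosine(tokens[i], tokens[k]))
        groups[target].append(tokens[i])
    merged = []
    for i in kept:
        members = groups[i]
        dim = len(members[0])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged


# Toy example: four 2-D tokens with made-up importance scores.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
importance = [0.9, 0.1, 0.8, 0.2]
merged = merge_tokens(tokens, importance, keep=2)
```

Here the four tokens collapse to two, so any later attention step would operate on half as many tokens while the two high-importance tokens still anchor the merged result.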

Speaker: Haoyu Wu

Location: NCS120
Mind Brain Lecture: Constructing the World of Taste in Your Head

You fork the morsel into your mouth and say yum...chocolate cake. The appreciation of your dessert's taste seems to follow directly, quickly and simply from the placement of the food on your tongue. The truth, however, is far more interesting and complex: your brain actually begins determining whether you will enjoy a bite of food even before the fork approaches your mouth and continues to work the problem well after. Information about your food's color, smell, texture and taste activates multiple parts of your brain, where that information collides with your pre-mouthful beliefs about how it should taste. The coming-together and shuffling of that information around the brain takes time, as networks of neurons work together to help you decide whether the morsel in your mouth is worth swallowing. Referring to work from psychology, biology and computational neuroscience, Professor Katz will de-mystify and reveal the beauty of these complexities of the neuroscience of taste.

Speaker: Donald Katz, Professor of Psychology, Departments of Neuroscience, Psychology, and the Volen National Center for Complex Systems, Brandeis University

Free presentation intended for a general audience. Reception to follow.

https://www.stonybrook.edu/commcms/mind/
Abstract: Many scientific and engineering challenges, such as the design of materials or molecules or the control of experimental systems, rely on the existence of fast predictive models that can evaluate potential designs or control policies. Traditionally this has been accomplished through numerical simulation; more recently, data-driven machine learning methods have been applied. However, both approaches leave gaps: physical modeling can be accurate and extrapolates well to previously-unstudied conditions, but it is often computationally expensive and relies on physics approximations that may not be valid. Machine learning can generalize from massive amounts of real-world or simulation data, but it lacks physical grounding, extrapolates poorly into new regimes, and struggles in settings where large data sets do not exist.
In this talk I explore an intermediate regime, hybrid reduced order models: fast simplified physics approximations where some of the unknown or approximated equations are replaced with data-driven machine learning components. Examples include coarse-grained models where the full macroscopic equations cannot be derived from first-principles microscopic equations, multiscale models with unknown closure terms or sub-grid parameterization schemes, and low-order or latent dynamical systems that learn governing equations on a low-dimensional reduced state space. I discuss how such reduced systems can be identified from very limited data, much less than is often needed in traditional machine learning, and at much lower time-to-solution than traditional numerical modeling. This facilitates not only system design and control but also uncertainty quantification approaches that search the space of possible equations for predictive models that can explain the data. I will focus on an example from materials science concerning the design of self-assembling block copolymer nanomaterials.
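
To make the idea of a hybrid reduced order model concrete, the sketch below keeps a known linear physics term and fits a single unknown closure coefficient from one short trajectory by least squares. It is only a toy illustration under stated assumptions: the scalar system, the cubic closure form, and all names (`K_KNOWN`, `A_TRUE`, `simulate`, `fit_closure`) are hypothetical and are not taken from the speaker's work.

```python
# Toy hybrid reduced-order model: known physics plus one learned closure term.
# The system, the cubic closure form, and every constant are hypothetical.

K_KNOWN = 1.0   # coefficient of the known linear physics term
A_TRUE = 0.5    # "unknown" closure coefficient, used only to generate data


def simulate(x0, dt, steps, closure_coeff):
    """Integrate dx/dt = -K_KNOWN*x - closure_coeff*x**3 with classical RK4."""
    def rhs(x):
        return -K_KNOWN * x - closure_coeff * x ** 3

    xs = [x0]
    x = x0
    for _ in range(steps):
        k1 = rhs(x)
        k2 = rhs(x + 0.5 * dt * k1)
        k3 = rhs(x + 0.5 * dt * k2)
        k4 = rhs(x + dt * k3)
        x = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        xs.append(x)
    return xs


def fit_closure(xs, dt):
    """Least-squares fit of theta in: dx/dt + K_KNOWN*x = -theta * x**3."""
    num = 0.0
    den = 0.0
    for i in range(1, len(xs) - 1):
        dxdt = (xs[i + 1] - xs[i - 1]) / (2.0 * dt)  # central difference
        residual = dxdt + K_KNOWN * xs[i]            # part the known physics misses
        feature = -xs[i] ** 3
        num += residual * feature
        den += feature * feature
    return num / den


# Limited data: a single short trajectory of 21 samples.
DT = 0.01
data = simulate(x0=1.5, dt=DT, steps=20, closure_coeff=A_TRUE)
theta = fit_closure(data, DT)
```

Because the known physics already explains most of the dynamics, only one parameter has to be learned, which is why so few samples suffice in this toy setting.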

Speaker: Dr. Nathan Urban, Applied Mathematics Department, Brookhaven National Laboratory

Location: Laufer 101

Zoom: https://stonybrook.zoom.us/j/96090260834?pwd=mw8QTHbMOw9oeU9hazZeoq8bN4VIfH.1
Meeting ID: 960 9026 0834 Passcode: 374969

Abstract: The remarkable success of large foundational models, such as LLMs and diffusion models, is built on their learning over vast amounts of static data from the Internet. However, human learning and problem-solving are fundamentally interactive processes--humans learn by engaging with their environment, tools, search engines, and feedback loops, iteratively refining their understanding and decisions. This gap between the interactivity of human learning and the static nature of model training raises a critical question: how can we imbue foundational models with the capacity for meaningful interaction?

In this talk, I will explore methods to enhance foundational models by incorporating interaction with the external environment. I will discuss strategies such as leveraging external tools, compilers, and function calls to provide dynamic feedback to foundation models. By drawing inspiration from humans' interactive learning processes, I demonstrate how interaction-driven learning can lead to models that are not only more accurate but also more adaptable to real-world applications.

This work bridges the gap between static training paradigms and the dynamic, iterative nature of human intelligence, paving the way for a new generation of interactive AI systems.

Bio: Wenhu Chen has been an assistant professor in the Computer Science Department at the University of Waterloo and at the Vector Institute since 2022. He obtained the Canada CIFAR AI Chair Award in 2022 and the CIFAR Catalyst Award in 2024. He has worked for Google DeepMind as a part-time research scientist since 2021. Before that, he obtained his PhD from the University of California, Santa Barbara under the supervision of William Wang and Xifeng Yan. His research interests lie in natural language processing, deep learning, and multimodal learning. He aims to design models that handle complex reasoning scenarios such as math problem-solving and structured knowledge grounding. He is also interested in building more powerful multimodal models to bridge different modalities. He received the Area Chair Award at AACL 2023, the Best Paper Honorable Mention at WACV 2021, the Best Paper Finalist at CVPR 2024, and the UCSB CS Outstanding Dissertation Award in 2021.
CSE 656 Seminars in Computer Vision - Wednesdays 11:30am-12:50pm, Room NCS 120

The overall purpose of this seminar is to bring together people with interests in Computer Vision theory and techniques and to examine current research issues. This course is appropriate for people who have already taken a graduate Computer Vision course or who already have research experience in Computer Vision. To enroll in this course, you must either: (1) be in the PhD program or (2) receive permission from the instructors.

Each seminar will consist of multiple short talks (around 10 minutes) by multiple people. Students can register for 1 credit for CSE 656. Registered students must attend and present a minimum of 2 or 3 talks. Everyone else is welcome to attend. Fill in https://forms.gle/pCVXovgfMfQwGqG38 to subscribe to our mailing list for further announcements.

The first meeting will be Wed Jan 29 at 11:30am, room 120 New CS. The meeting will deal with organizational matters, and we will start right away with some presentations. Send David Paredes Merino <dparedesmeri@cs.stonybrook.edu> an email if you are interested but cannot attend the first meeting. Please forward this to people outside the CS department who you think might be interested.
Abstract: Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question whether similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
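
The abstract does not define nv-recall in detail, so the sketch below is only a simplified stand-in for a block-based near-verbatim score, not the authors' metric. It uses Python's standard `difflib.SequenceMatcher` to find verbatim matching blocks of words and reports the fraction of the reference covered by blocks above a length threshold. The function name `block_recall`, the word-level tokenization, and the `min_block` threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def block_recall(reference, generated, min_block=5):
    """Fraction of reference words covered by long verbatim matching blocks.

    Hypothetical stand-in for a near-verbatim recall score: min_block and
    the whitespace tokenization are assumptions, not the paper's settings.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)
    covered = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_block  # ignore short, coincidental overlaps
    )
    return covered / len(ref_words)


reference = "the quick brown fox jumps over the lazy dog near the river bank"
verbatim = "he said the quick brown fox jumps over the lazy dog and left"
paraphrase = "a fast brown fox leaped over a sleepy dog by the river"

score_verbatim = block_recall(reference, verbatim)
score_paraphrase = block_recall(reference, paraphrase)
```

In this toy case the verbatim output reproduces a nine-word run of the 13-word reference, giving a score of 9/13, while the paraphrase shares only short fragments and scores 0. Requiring long blocks is what separates near-verbatim extraction from ordinary topical overlap.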

Speaker: Xinyue

Location: CS2311
Abstract: Humans perceive the world around them by recognizing global patterns and structures such as object parts, branches, their spatial arrangement, and so on. Most deep learning models, however, take a fundamentally local approach. They process images pixel-by-pixel rather than focusing on structures as a whole. While these models indeed perform well on many tasks, the local (pixel-level) versus global (structure-level) disconnect makes them harder to interpret and control.

Topology, in a general sense, is a mathematical language for describing structure. It delineates how different parts of an image relate to one another, capturing both individual structures and their overall layout. Preserving topology enforces structural correctness and, by extension, semantic validity.

In this thesis, we investigate how topological constraints can be used to bridge the gap between local and global understanding. We use topology to inform the design of deep learning models that are explicitly structure-aware. Our thesis focuses on dense prediction tasks, which include image segmentation, uncertainty estimation, and generative modeling. First, we introduce a topological interaction module for semantic segmentation that encodes containment and exclusion constraints directly into the learning process. This preserves anatomical hierarchies and improves multi-class consistency. Next, since segmentation models can never be truly perfect, we address the need for reliable uncertainty estimation to identify error-prone regions. Unlike conventional pixel-wise uncertainty maps, which tend to be noisy and difficult to interpret, we propose reasoning at the level of structural units--branches and connections--which are more visually discernible and actionable. Finally, we leverage topology for generative modeling. We propose a topology-guided diffusion framework that can be controlled using structural attributes like object count and connectivity.
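
As a small illustration of why pixel-level agreement can hide structural errors, the sketch below counts connected components in a binary mask, which is one of the simplest topological attributes (object count). It is a generic breadth-first flood fill under assumed 4-connectivity, not a method from the thesis, and the toy masks are invented for the example.

```python
from collections import deque


def count_components(mask):
    """Count 4-connected foreground components in a binary grid of 0/1 values."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            count += 1  # found an unvisited foreground pixel: new component
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# Toy masks: the prediction differs from the truth by a single pixel,
# yet that pixel breaks one branch into two pieces.
truth = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
prediction = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
]

truth_components = count_components(truth)
prediction_components = count_components(prediction)
```

The prediction matches the truth on 14 of 15 pixels, but its component count is 2 instead of 1, which is the kind of structural error a pixel-wise loss barely penalizes.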

Together, these contributions establish a unified approach to topology-informed, structure-preserving dense prediction models. By integrating topological reasoning with deep networks, this thesis advances models that are not only accurate, but also structurally consistent, interpretable, and controllable. The results from this thesis have been published in ECCV, NeurIPS, and ICLR.

Speaker: Saumya Gupta

Location: New Computer Science (NCS) 120


Zoom: https://stonybrook.zoom.us/j/93643318604?pwd=kv8DagpbayzizivU29UCYItnlzlYRM.1&jst=2
CSE 600 Seminar Series | Fall 2025


Abstract: Vision-language models that see and describe the world are now part of our daily lives, from internet search and accessibility tools to content generation and automatic moderation. However, as these models grow and become more widely used, their limitations have also become increasingly visible. In particular, it has been shown that these models are unable to reliably perform complex tasks that require abstraction and compositional reasoning. For example, they struggle to decompose an image or text into entities, attributes, and relations, and then reason over new combinations of these elements. As a result, we see generated content full of hallucinations, privacy leaks in images, and different types of biases in the model outputs.

In this talk, I will outline a research agenda that aims to build trustworthy vision-language models in the age of generative AI. I will begin with compositional reasoning: how natural language inference can be used to decompose complex instructions and captions into atomic, verifiable statements, improving both evaluation and model behavior on tasks that require multi-step reasoning. I will then discuss how synthetic data and simulated environments can be used to train more reliable models, and how they can also stress-test models beyond standard benchmarks, revealing when models drop attributes, break object relations, or fail under distribution shifts. I will also share recent work on using hallucination correction as a signal to improve video-language alignment, and on privacy-preserving image understanding for blind and low-vision users. I will conclude with possible ways we can systematically probe, debug, and repair these models, turning synthetic perception into something we can trust in real-world deployments.



Speaker: Paola Cascante-Bonilla is a tenure-track Assistant Professor in the Department of Computer Science at Stony Brook University (SUNY). Before that, she was a Postdoctoral Associate at the University of Maryland Institute for Advanced Computer Studies (UMIACS), developing methods and metrics related to trustworthy machine learning. She received her Ph.D. in Computer Science at Rice University in 2024, working on Computer Vision, Natural Language Processing, and Machine Learning. Her research focuses on developing systems that enable compositional reasoning and common-sense inference through vision and language, while tackling issues such as cultural biases, data distribution, explainability, and trustworthy AI. Additionally, Cascante-Bonilla creates simulated environments for embodied agents to learn in a safe, controlled setting, aiming to facilitate effective collaboration and problem-solving for complex tasks by leveraging the implicit knowledge of large-scale pre-trained deep learning models.
Cascante-Bonilla is the recipient of the Ken Kennedy Institute SLB Graduate Fellowship (2022/23) and was selected as a Future Faculty Fellow by Rice's George R. Brown School of Engineering (2023) and as a Rising Star in EECS (2023).
Location: NCS 120

Designed for faculty, staff, presidents, provosts, academic leaders, student affairs professionals, IT specialists, librarians, researchers, administrators, institutional decision-makers, and other higher education stakeholders, the conference highlights practical strategies institutions can implement now while exploring longer-term governance, policy, and ethical considerations. Participants will leave with concrete tools, cross-institutional insights, and collaborative connections that support mission-aligned AI innovation.

Hosted by: AAC&U

Location: Atlanta, GA and Virtual
