Abstract: Pretraining vision encoders with self-supervision (SSL) leads to stronger representations that excel across diverse downstream tasks. One of the key factors enabling self-supervision is extracting multiple views of the same scene to formulate either: 1) View-invariant pretraining (DINO, SimCLR, iBOT), where the objective is predicting the same representation for different views of the scene; or 2) Cross-view pretraining (cross-view Masked Autoencoders), where the objective is predicting missing parts of one view using other views. For extracting multiple views, view-invariant methods rely on a combination of handcrafted augmentations (random cropping, color jittering, Gaussian blur, etc.) of the same image, whereas cross-view pretraining methods rely on image cropping or video frames. In this work, we present methods to effectively incorporate synthetic views from diffusion models into SSL training.
For view-invariant pretraining, we introduce Gen-SIS, a method that leverages the ability of diffusion models to generate interpolated images through interpolation in conditioning space. We introduce a disentanglement pretext task: disentangling two source images from an interpolated synthetic image. This disentanglement task, in addition to vanilla single-source generative augmentation for view extraction, improves visual pretraining of various view-invariant methods (DINO, SimCLR, iBOT).
For cross-view pretraining, we introduce CDG-MAE, a novel cross-view masked autoencoder (MAE) based method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model to learn dense correspondences. We present a quantitative method to evaluate the local and global consistency of the generated views to choose the right diffusion model for cross-view pretraining. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video (expensive) and crop-based (less variation) methods. CDG-MAE substantially narrows the gap to video-based MAE methods on video label propagation tasks while maintaining the data advantages of image-only MAEs.

Speaker: Varun Belagali

Location: NCS 120
Zoom: https://stonybrook.zoom.us/j/93647452432?pwd=hZaX7LXCAD8KPHWYE1Afw2sDI3owpv.1

Abstract: Spectroscopy and imaging are two primary tools for probing material structures. However, the discovery of trends that guide the design of improved materials is often hindered by intertwined physical interactions or significant experimental noise. In this talk, I will present machine learning approaches that address both challenges. The first part focuses on the interpretation of X-ray absorption spectroscopy (XAS). We developed a controlled projection algorithm, RankAAE, which disentangles coupled structural descriptors in complex datasets and reveals analysis rules for inferring new structural information visually from spectra. The second part targets transmission electron microscopy (TEM) imaging of material structures. We developed a machine learning model capable of denoising extremely noisy images, while demonstrating strong out-of-distribution generalization. I will describe the construction of these models and demonstrate their effectiveness through representative scientific case studies.

Bio: Dr. Xiaohui Qu is a Staff Scientist in the Theory and Computation Group at the Center for Functional Nanomaterials (CFN), Brookhaven National Laboratory. His research focuses on developing interpretable machine learning and data analytics methods for materials science, with an emphasis on extracting structural insights from X-ray absorption spectroscopy and transmission electron microscopy. Dr. Qu earned his B.S. in Environmental Engineering and Ph.D. in Environmental Science from Shandong University, China, followed by postdoctoral research in Physics at Nanyang Technological University, Singapore, in Chemistry at Universidade Nova de Lisboa, Portugal, and in Materials at Lawrence Berkeley National Laboratory.

Location: IACS Seminar Room


Event Details & Calendar Link (includes zoom info): https://calendar.stonybrook.edu/site/iacs/event/iacs-seminar-speaker--xiaohui-qu-brookhaven-national-lab/


The New York Scientific Data Summit (NYSDS) is a premier annual conference that brings together researchers and thought leaders from academia, national labs and industry to exchange ideas and foster collaboration focused on data-driven science and technology. Co-hosted by Brookhaven National Laboratory and the Institute for Advanced Computational Science (IACS) at Stony Brook University, NYSDS 2025 will take place on September 11-12, 2025, at the SUNY Global Center in New York City.

NYSDS 2025 will spotlight artificial intelligence (AI), machine learning (ML) and robotics - fields currently at a pivotal point with transformative impacts on science and technology. From accelerating computationally demanding simulations to discerning signals from noisy data, AI/ML has become an integral part of scientific workflows. Despite many advances, challenges remain to ensure that AI/ML applications are reliable, explainable and trustworthy.

Robotics, a growing field that couples AI with physically actuated mechanical bodies, has seen increased interest in areas spanning science, technology and manufacturing. The need for real-time decision-making and control, along with the intricate morphology of robots, makes robotics an intriguing application of AI, advanced computing and optimization.


NYSDS 2025 is open to the public. To be eligible to attend, all participants must register online by August 30, 2025. For questions or assistance with registering, please contact the Summit Coordinator.

Register here.

Abstract: Human gaze behavior is a fundamental cue for understanding social intent, human-machine interaction, and cognitive processes. This thesis addresses the challenges of gaze target estimation (GTE), also known as gaze following, by developing a holistic understanding of gaze in complex environments.

The first part of this work improves GTE performance by introducing Patch-level Distribution Prediction (PDP). Unlike traditional models that rely on strict pixel-wise regression, PDP models gaze as a distribution over patches, which better accounts for annotation variance and bridges the gap between target location and in/out-of-frame prediction. To address the laborious nature of data labeling, the second part presents GCDR, the first semi-supervised method for gaze following. By prompting large Visual Question Answering (VQA) models to generate initial Grad-CAM heatmaps and refining them with a diffusion model, this method achieves high performance with significantly fewer human annotations. The third part expands the applicability of GTE to multi-camera environments. By introducing the Multi-View Gaze Target (MVGT) dataset, along with two novel frameworks for integrating information between multiple views and predicting the gaze target across views, we explore a new direction that overcomes single-view limitations such as face occlusion and out-of-view targets.

Building on these foundations, the final part of this thesis proposes a new direction toward semantic social gaze understanding using next-generation multimodal Large Language Models (LLMs). Rather than focusing solely on geometric gaze target localization, we aim to enrich gaze prediction with semantic and relational interpretation in complex social scenes. To this end, we will leverage existing gaze following datasets to derive social gaze supervision, including mutual gaze and shared attention, and obtain aligned language descriptions of scene-level gaze behaviors. This proposed work will enable the model to not only locate gaze targets but also predict structured social gaze relations among individuals, while also generating a concise natural-language summary describing the dominant gaze interactions. By integrating spatial gaze estimation, social relation reasoning, and language-based scene understanding within a unified multimodal model, this work takes an important step toward a holistic understanding of human gaze behavior in real-world environments.

Speaker: Qiaomu Miao

You are cordially invited to attend the biweekly Brookhaven AI Mixer (BAM). BAM includes one short talk on AI research happening at BNL, followed by an open mixer over coffee and snacks for everyone to network and discuss all things AI. The first half hour will consist of presentations that will be available via Zoom, and the second half hour will be for in-person networking only.

Join us every other Tuesday at noon in CDSD's Training Room (building 725, 2nd floor) to learn about interesting AI methods and applications, engage with potential collaborators, prepare for pending FASST funding calls, and build a community of AI for Science at BNL.

At our Oct 7 Mixer, BNL's newly minted interim director, John Hill, will be present to give opening remarks and kick us off on a new year of impactful scientific AI collaborations.

Abstract: Weather extremes and strong seasonal-to-subseasonal variability pose growing challenges to urban populations, infrastructure, and energy systems. Yet, most cities remain data deserts: routine weather observations are sparse, with stations concentrated at airports rather than within the urban core. This lack of coverage limits our ability to monitor and predict fine-scale urban weather patterns precisely where they matter most. We present a new AI-driven framework for optimal sensor placement and urban weather monitoring. Unlike traditional approaches, our method leverages physics-based simulations together with Bayesian experimental design principles, but does so using a computationally efficient variational inference strategy that makes large-scale optimization tractable. This allows us to guide sensor networks in a way that minimizes information loss while capturing spatiotemporal variability at city scales. Applied to Phoenix, Arizona, our framework outperforms random sensor placement strategies, especially when only a limited number of sensors can be deployed. Importantly, the same AI models that guide sensor placement also function as a real-time nowcasting tool, providing urban weather information over the entire domain, beyond sensor locations. Together, these capabilities offer a scalable pathway to reduce urban data deserts, enhance monitoring of weather extremes, and improve resilience planning for energy, transportation, and public health systems.

Biography: Dr. Katia Lamer is an atmospheric scientist and the Director of the Center for Multiscale Applied Sensing at Brookhaven National Laboratory. Originally from Canada, she earned her B.S. and M.S. in Atmospheric and Oceanic Sciences from McGill University and a Ph.D. in Meteorology from Penn State University. Her research focuses on atmospheric boundary layer processes and remote sensing technologies, with a strong emphasis on data science. At Brookhaven, she is known for her work with the CMAS mobile observatories and facility, which connect fundamental atmospheric science to real-world applications, improving weather prediction, environmental monitoring, and urban climate resilience. Her work has been featured in public outlets such as New Scientist and Wired. Dr. Lamer also serves as an invited member of the World Meteorological Organization's Data Assimilation and Observing Systems Working Group, and the American Meteorological Society's Boundary Layer and Turbulence Committee.

Location: CDS, Bldg. 725, Training Room

Join ZoomGov Meeting: https://bnl.zoomgov.com/j/1604383624?pwd=ffQ5cUPNxTI7nzClKQO6cnsNbhF9Vf.1

Meeting ID: 160 438 3624 | Passcode: 558449

CSE 656 Seminars in Computer Vision - Wednesdays 11:30am-12:50pm, Room NCS 120

The overall purpose of this seminar is to bring together people with interests in Computer Vision theory and techniques and to examine current research issues. This course is appropriate for people who have already taken a Computer Vision graduate course or already have research experience in Computer Vision. To enroll in this course, you must either: (1) be in the PhD program or (2) receive permission from the instructors.

Each seminar will consist of multiple short talks (around 10 minutes each) by multiple people. Students can register for 1 credit for CSE 656. Registered students must attend and present a minimum of 2 or 3 talks. Everyone else is welcome to attend. Fill in https://forms.gle/pCVXovgfMfQwGqG38 to subscribe to our mailing list for further announcements.

Learn how to summarize docs with AI, output a PowerPoint from AI, and create professional visuals

Unlock greater efficiency and impact in your university role with AI productivity tools. This workshop is your introduction to a few ways that I have found to make our daily tasks more efficient. Discover how easily you can create presentations (that output to PowerPoint format), summarize content using AI, and get information from images. These AI tool tips are invaluable resources designed to streamline your work processes. Start working smarter today!

In this session, you will

  1. Summarize docs with AI
  2. Output a PowerPoint from AI
  3. Gather information from visuals

Register here.

Abstract:

What is the nature of linguistic knowledge, and how is it acquired from limited data? In recent years, the program of subregular linguistics has identified formal language classes expressive enough to account for most phenomena in natural language but also sufficiently limited to be efficiently learned from positive data. An advantage to these formal learning algorithms is that they come with mathematically proven guarantees about their performance, and it is easy to reason about how and why they behave the way they do.

In this talk, I discuss the Multi Tier-based 2-Strictly Local Inference Algorithm (MT2SLIA), which provably learns the syntactically relevant class of 2-Factor Multi Tier-based Strictly Local (2FMSTL) tree languages. This algorithm efficiently learns from a polynomially-sized sample of positive data by identifying missing substructures and generalizing these as constraints over tiers in a principled manner.

I will introduce a working prototype implementation of this algorithm and demonstrate its behavior on a curated sample of natural language data to show how it can learn relevant syntactic patterns.
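
For readers unfamiliar with learning from positive data, the general idea can be illustrated with a much simpler relative of the algorithm described above. The sketch below is not MT2SLIA: it is a plain strictly 2-local learner over strings, with no tiers and no trees. It records every adjacent pair of symbols seen in the sample and treats every unseen pair as a forbidden substructure. All names (EDGE, bigrams, learn_sl2, accepts, forbidden) are hypothetical and chosen only for illustration.

```python
from itertools import product

EDGE = "#"  # hypothetical marker for the start and end of a string


def bigrams(word):
    """Return the set of adjacent symbol pairs in word, including edges."""
    padded = [EDGE, *word, EDGE]
    return set(zip(padded, padded[1:]))


def learn_sl2(sample):
    """Collect every bigram attested in a sample of positive examples."""
    attested = set()
    for word in sample:
        attested |= bigrams(word)
    return attested


def accepts(grammar, word):
    """A word is accepted if it contains only attested bigrams."""
    return bigrams(word) <= grammar


def forbidden(grammar, alphabet):
    """List the unattested bigrams, i.e. the inferred constraints."""
    symbols = [EDGE, *alphabet]
    return {pair for pair in product(symbols, repeat=2) if pair not in grammar}


grammar = learn_sl2(["ab", "abab"])
print(accepts(grammar, "ababab"))  # generalizes beyond the sample
print(accepts(grammar, "aab"))     # rejected: the pair ("a", "a") was never seen
print(sorted(forbidden(grammar, "ab")))
```

Because the learner only ever adds attested pairs, its hypothesis grows monotonically with the sample, which is part of what makes this style of learner easy to reason about.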

Bio:

Logan Swanson is a third-year PhD student in the Department of Linguistics at Stony Brook University. He is advised by Dr. Jeffrey Heinz and Dr. Thomas Graf. His interests include learning theory, computational syntax, and language change. His current research focuses on understanding the learning-theoretic elements of natural language by designing, implementing, and testing learning algorithms for linguistically relevant formal language classes.

*Please note: this seminar will be held in person (IACS Seminar Room w/ food provided) and online.

Join Zoom Meeting
https://stonybrook.zoom.us/j/95707958315?pwd=6ITUJ0ffCXjRJb4wpt0KMDTApfSLZ0.1

Meeting ID: 957 0795 8315
Passcode: 920473

Abstract:

Large language models (LLMs) can often suggest colloquial plans given verbal descriptions of tasks, yet they cannot reliably provide executable and verifiable plans given formally specified environments. In this talk, I will discuss a line of work on having LLMs generate accurate and explainable plans in textual simulations. Instead of directly generating the plan or actions, LLMs are prompted to generate Planning Domain Definition Language (PDDL) that specifies the environment (domain file) and the task (problem file), which can then be deterministically solved with an off-the-shelf planner. In a 3-phase study, my collaborators and I first observed that it is possible but very challenging for LLMs to generate long-form code such as PDDL domain and problem files given textual specifications. Next, we devised methodologies for LLMs to iteratively generate and refine problem files while exploring a partially observed, simulated, textual environment. Third, we showed that domain files are even more difficult to generate correctly, even on well-established planning tasks such as BlocksWorld. Finally, I will discuss ongoing efforts to improve this structured-generation ability and promising frontiers to explore.

Bio:
Li Harry Zhang is an assistant professor at Drexel University, focusing on Natural Language Processing (NLP) and artificial intelligence (AI). He obtained his PhD degree from the University of Pennsylvania, where he was advised by Prof. Chris Callison-Burch. Prior to that, he obtained his Bachelor's degree at the University of Michigan, where he was mentored by Prof. Rada Mihalcea and Prof. Dragomir Radev. His current research uses large language models (LLMs) to reason and plan via symbolic and structured representations. He has published more than 20 peer-reviewed papers in NLP and AI conferences, such as ACL, EMNLP, and AACL, that have been cited more than 1,000 times. He also consistently serves as Area Chair, Session Chair, and reviewer at those venues. As a musician, producer, and content creator with over 50,000 subscribers, he is also passionate about research on AI music and creativity.