Time: May 5, 2022, Thursday, 02:00 PM Eastern Time (US and Canada)
Place: New Computer Science (NCS) Room 220, and Zoom

Zoom link: https://stonybrook.zoom.us/j/95948672934?pwd=d3ZDcUJkK3VweFBDVWhIVDhtaFU2Zz09
Meeting ID:  959 4867 2934
Passcode:  082036

Title:  Generative Adversarial Learning using Optimal Transport

Abstract: 

Generative Adversarial Learning (GAL) aims to learn a target distribution in an adversarial manner. A Generative Adversarial Network (GAN) is a concrete implementation of GAL using a discriminator and a generator that play a min-max game. GANs have been used in many machine learning and computer vision applications. However, GANs are known to be hard to train, mainly because a min-max saddle point optimization problem needs to be solved in GAL. In this thesis, I investigate several methods to improve generative adversarial learning using Optimal Transport (OT). 

Previous Wasserstein GANs (WGANs) do not compute the correct Wasserstein distance to train the discriminator. To address this problem, I propose WGAN-TS, which uses the L1 transport cost and computes the correct Wasserstein distance to train the discriminator. To ensure the local convergence of WGANs, I propose WGAN-QC, which adopts the quadratic transport cost. I prove that WGAN-QC not only computes the correct Wasserstein distance but also converges to a local equilibrium point. To compute the Wasserstein distance over the whole dataset, I propose to use Semi-Discrete Optimal Transport (SDOT) to match noise points and the real images during GAN training. To measure the quality of an SDOT map, I use the Maximum Relative Error (MRE) and the L1 distance between the target distribution and the transported distribution obtained by an OT map. I propose statistical methods to estimate the MRE and the L1 distance. I propose an efficient Epoch Gradient Descent algorithm for SDOT (SDOT-EGD). To deal with the 2D special case of GAL, I propose to use OT to learn 2D distributions. In particular, I adopt OT to match persistence diagrams in training a topology-aware GAN and learn density maps in the crowd counting task. Finally, I use OT and the topological maps of the crowd to improve the crowd counting performance and propose a topology-based metric to measure the quality of the crowd density maps.


The International Conference on Learning Representations (ICLR) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence called representation learning, but generally referred to as deep learning.



ICLR is globally renowned for presenting and publishing cutting-edge research on all aspects of deep learning used in the fields of artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, text understanding, gaming, and robotics.

ICLR is one of the fastest growing artificial intelligence conferences in the world. Participants at ICLR span a wide range of backgrounds, from academic and industrial researchers, to entrepreneurs and engineers, to graduate students and postdocs.

The rapidly developing field of deep learning is concerned with questions surrounding how we can best learn meaningful and useful representations of data. ICLR takes a broad view of the field and includes topics such as feature learning, metric learning, compositional modeling, structured prediction, reinforcement learning, and issues regarding large-scale learning and non-convex optimization.

A non-exhaustive list of relevant topics explored at the conference includes:



  • Unsupervised, Semi-supervised, and Supervised Representation Learning
  • Representation Learning for Planning and Reinforcement Learning
  • Metric Learning and Kernel Learning
  • Sparse Coding and Dimensionality Expansion
  • Hierarchical Models

  • Optimization for Representation Learning
  • Learning Representations of Outputs or States
  • Implementation Issues, Parallelization, Software Platforms, Hardware
  • Applications in Vision, Audio, Speech, Natural Language Processing, Robotics, Neuroscience, or Any Other Field


For more information or registration, please visit the official website.
CSE 600 Seminar Series | Fall 2025


Abstract: Vision-language models that see and describe the world are now part of our daily lives, from internet search and accessibility tools to content generation and automatic moderation. However, as these models grow and become more widely used, their limitations have also become increasingly visible. In particular, it has been shown that these models are unable to reliably perform complex tasks that require abstraction and compositional reasoning. For example, they struggle to decompose an image or text into entities, attributes, and relations, and then reason over new combinations of these elements. As a result, we see generated content full of hallucinations, privacy leaks in images, and different types of biases in the model outputs.

In this talk, I will outline a research agenda that aims to build trustworthy vision-language models in the age of generative AI. I will begin with compositional reasoning: how natural language inference can be used to decompose complex instructions and captions into atomic, verifiable statements, improving both evaluation and model behavior on tasks that require multi-step reasoning. I will then discuss how synthetic data and simulated environments can be used to train more reliable models, and how they can also stress-test models beyond standard benchmarks, revealing when models drop attributes, break object relations, or fail under distribution shifts. I will also share recent work on using hallucination correction as a signal to improve video-language alignment, and on privacy-preserving image understanding for blind and low-vision users. I will conclude with possible ways we can systematically probe, debug, and repair these models, turning synthetic perception into something we can trust in real-world deployments.



Speaker: Paola Cascante-Bonilla is a tenure-track Assistant Professor in the Department of Computer Science at Stony Brook University (SUNY). Before that, she was a Postdoctoral Associate at the University of Maryland Institute for Advanced Computer Studies (UMIACS), developing methods and metrics related to trustworthy machine learning. She received her Ph.D. in Computer Science at Rice University in 2024, working on Computer Vision, Natural Language Processing, and Machine Learning.
Her research focuses on developing systems that enable compositional reasoning and common-sense inference through vision and language, while tackling issues such as cultural biases, data distribution, explainability, and trustworthy AI. Additionally, Cascante-Bonilla creates simulated environments for embodied agents to learn in a safe, controlled setting, aiming to facilitate effective collaboration and problem-solving for complex tasks by leveraging the implicit knowledge of large-scale pre-trained deep learning models.
Cascante-Bonilla is the recipient of the Ken Kennedy Institute SLB Graduate Fellowship (2022/23) and was selected as a Future Faculty Fellow by Rice's George R. Brown School of Engineering (2023) and as a Rising Star in EECS (2023).
Location: NCS 120
The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is the premier annual computer vision event comprising the main conference and several co-located workshops and short courses. With its high quality and low cost, it provides an exceptional value for students, academics and industry researchers.



Location: Colorado Convention Center
AI for Conservation: AI and Humans Combating Extinction Together by Daniel I. Rubenstein of Princeton University

ABSTRACT: The state of our planet is not good. We have lost more than 60% of the world's wildlife. Stopping the decline remains a challenge, especially since acquiring appropriate knowledge is expensive, time-consuming, and risky. Visual observations following the fates of a few individuals were the currency of the realm. But GPS technology and now machine learning provide a non-invasive, scalable alternative. Photographs, taken by field scientists, tourists, automated cameras, and incidental photographers, are the most abundant source of data on wildlife today. Wildbook, a tech-for-conservation project coordinated by the non-profit Wild Me, is an autonomous computational system that starts from massive collections of images and, by detecting various species of animals and identifying individuals, combined with sophisticated data management, turns them into high-resolution information databases, enabling scientific inquiry, conservation, and citizen science.

BIO: Dan Rubenstein is the Class of 1877 Professor of Zoology. He is currently Director of Princeton's Environmental Studies Program and is former Chair of Princeton University's Department of Ecology and Evolutionary Biology and Director of Princeton's Program in African Studies. He is a behavioral ecologist who studies how environmental variation and individual differences shape social behavior, social structure, sex roles and the dynamics of populations. He has special interests in all species of wild horses, zebras and asses, and has done field work on them throughout the world identifying rules governing decision-making, the emergence of complex behavioral patterns and how these understandings influence their management and conservation. In Kenya he also works with pastoral communities to develop and assess impacts of various grazing strategies on rangeland quality, wildlife use and livelihoods. He has also developed a scout program for gathering data on Grevy's zebras and created curricular modules for local schools to raise awareness about the plight of this endangered species. He engages people as 'Citizen Scientists' and has recently extended his work to measuring the effects of environmental change, including issues pertaining to the global commons and changes wrought by management and by global warming, on behavior.

Abstract:

In recent years, the landscape of artificial intelligence (AI) has been reshaped by the rapid emergence of Foundation Models (FMs). These versatile models have garnered widespread attention for their remarkable ability to transcend the boundaries of traditional, bespoke AI solutions and to generalize to a large set of downstream tasks. In this presentation we will describe the development of geospatial FMs with earth observation and weather data and discuss initial results of such models. We will also show how such foundation models can be a new and exciting tool for assisting with and accelerating scientific discovery.

Speaker:

Hendrik Hamann
Distinguished Researcher
IBM T.J. Watson Research Center

Abstract: Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.

Speaker: Huajian Zhang

Location: CS2311
Abstract: Sea ice is crucial to Earth's climate, Arctic communities, and ecosystems, yet climate change is driving significant losses, threatening polar stability. Quantifying the long-term impacts of a declining sea ice cover requires tools which improve climate-timescale prediction and bring new understanding of climate interactions. In this talk, I discuss how meeting this challenge requires a multi-disciplinary approach. Climate models, while essential, suffer from systematic biases due to missing or inaccurate physics, leading to uncertainty in future projections. I show how data assimilation (DA) offers a statistical framework for integrating satellite observations with climate models to quantify systematic sea ice model errors. Using convolutional neural networks (CNNs), we can learn these errors based on the model's atmospheric, oceanic, and sea ice conditions--what I term a state-dependent representation of the error. This approach enables real-time corrections to subsequent model simulations, which systematically reduces global sea ice biases. I highlight key successes and challenges in developing this hybrid ML+climate modeling framework, including transfer learning to enhance online generalization of ML models, and new methods for integrating Python-based ML frameworks with Fortran climate model code. Finally, I introduce GPSat, a scalable Gaussian process-based tool for reconstructing complete sea ice fields from sparse satellite altimetry data. Together, the DA+ML framework and GPSat offer future opportunities for improving targeted model physics errors for more robust climate simulation.


IACS Seminar Speaker: William Gregory, Princeton University

Location: IACS Seminar Room
TITLE: Towards a Theory of Encoder/Decoder Architectures by Andrej Risteski of CMU

ABSTRACT: A common choice of architecture in representation learning (i.e., learning a good embedding of the data) is an encoder/decoder architecture, which tries to map a part of the input into a good latent representation (via an encoder), and predict the remaining part of the input (via a decoder). Two common examples are universal machine translation, where one tries to learn to translate between any pair of a set of languages via a common latent language, given paired-up corpora for only a part of the pairs; and contextual encoders, where one tries to predict a part of the image, given the rest of the image.
 
We will give a framework for analyzing the sample complexity of such architectures -- i.e., how many pairs of languages do we need to have paired up corpora for? How many image prediction tasks do we have to solve to get a good representation?