Spring 2026, Wednesdays 2 to 3:20 pm, NCS 220. Zoom link to be announced soon.

The seminar will be jointly taught by Prof. Dimitris Samaras (samaras@cs.stonybrook.edu).

The overall purpose of this seminar is to bring together people with interests in Computer Vision theory and techniques and to examine current research issues. This course is appropriate for people who have already taken a graduate Computer Vision course or who already have research experience in Computer Vision.

To enroll in this course, you must either: (1) be in the Ph.D. program or (2) receive permission from the instructors.

Each seminar will consist of multiple short talks (around 15 minutes) by multiple students. Students can register for 1 credit for CSE656. Registered students must attend and present a minimum of 2 talks. Registered students must attend in person. Up to 3 absences will be excused. Everyone else is welcome to attend.
Abstract: Modern language agents often need to solve tasks requiring long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths due to the LLM forgetting the context. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant context size when solving long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. Leveraging reinforcement learning (RL) and rollout trajectory truncation, we train a MEM1 agent to develop internal states that integrate prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on an augmented multi-hop QA dataset with 16 objectives in each task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon task-solving agents that involve multiple interactions, where both efficiency and performance are optimized.
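
The constant-context idea in the abstract can be illustrated with a small sketch. This is not the MEM1 implementation: the names `AgentState`, `consolidate`, and `run_episode` are hypothetical, and the learned consolidation policy is replaced by a simple truncation that keeps the most recent characters. The sketch only shows how the prompt stays bounded when each turn sees the compact state plus the new observation instead of the full history.

```python
from dataclasses import dataclass


@dataclass
class AgentState:
    """Compact internal state carried between turns (hypothetical structure)."""
    memory: str = ""


def consolidate(state: AgentState, observation: str, max_chars: int = 200) -> AgentState:
    """Stand-in for the learned update (assumption: plain truncation, not RL).

    Merges the prior memory with the new observation and keeps only the
    last `max_chars` characters, so the state never grows past that bound.
    """
    merged = (state.memory + " " + observation).strip()
    return AgentState(memory=merged[-max_chars:])


def run_episode(observations, max_chars: int = 200):
    """Run a multi-turn episode and record the prompt size at each turn."""
    state = AgentState()
    context_sizes = []
    for obs in observations:
        # The prompt holds only the current state and the new observation,
        # never the full list of past turns.
        prompt = state.memory + "\n" + obs
        context_sizes.append(len(prompt))
        state = consolidate(state, obs, max_chars)
    return state, context_sizes


if __name__ == "__main__":
    turns = ["observation %02d " % i + "x" * 24 for i in range(50)]
    final_state, sizes = run_episode(turns)
    # Prompt size stays bounded, while a full-context prompt would grow
    # with the total length of all turns.
    print("largest prompt:", max(sizes))
    print("full-context length:", sum(len(t) for t in turns))
```

With full-context prompting the final prompt would contain all 50 observations; here the largest prompt is capped by `max_chars` plus the length of one observation.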

Speaker: Yiyang Feng

Location: CS2311
The Fourth Arabic Natural Language Processing Conference (ArabicNLP 2026) is organized by the ACL Special Interest Group on Arabic NLP (SIGARAB).
The research focus of ArabicNLP is, naturally, Arabic, a collection of language varieties, from Classical to Modern Standard Arabic (MSA), and including many living and historical Arabic dialects. Arabic poses many challenges for the field of computational linguistics, including rich morphology, orthographic ambiguity as well as the wide variety of understudied dialects.

Location: Budapest, Hungary

Register here.
Face Editing with Machine Learning presented by Zhixin Shu

ABSTRACT: The face is the most informative feature of humans and has been a long-standing research topic in Computer Vision and Graphics. Images of faces are also ubiquitous in photography and social media, and people have devoted significant resources to capturing and editing face images. Face editing can be broadly viewed as the encoding, manipulation, and decoding of some representation of face images. The challenge is that we want to manipulate an image in a controllable way and generate results that are both desirable and as realistic as possible. This thesis explores different Machine Learning-based face-editing approaches. I discuss the role of machine learning in achieving desirable edits by learning both the physical aspects and the statistical manifold of human faces. In my work on eye editing, I discuss the importance of understanding multiple physical elements of a face image, such as shape, illumination, and pose. In a deep-learning-based approach, I introduce image formation domain knowledge to the construction and training of a neural network. This network provides transparent access to the disentangled representations of the aforementioned physical properties. With this network, we can achieve various face editing tasks in the form of representation manipulation. After that, I introduce Deforming Autoencoders, a network that learns to disentangle shape and appearance in an unsupervised manner. This disentanglement is beneficial for learning other factors of variation, such as illumination and facial expression. In an extension of Deforming Autoencoders, we incorporate non-rigid structure-from-motion to learn a 3D morphable model for faces that only requires an image set for training. Finally, I describe an image-to-image network for 3D face reconstruction, which also utilizes structure-from-motion in deep learning.
With real face images in training, this network not only reconstructs 3D faces more accurately than prior art but also has better generalization ability in real-life testing cases.

Abstract:
Deep learning models have achieved remarkable success across a wide range of computer vision tasks, including image classification, semantic segmentation, etc. However, such success highly relies on a large amount of annotated data, which are expensive to obtain. Moreover, their performance often degrades when there exist distribution shifts between training and test data. Domain Adaptation overcomes these issues by transferring knowledge from a label-rich source domain to a related but different target domain. Despite its popularity, domain adaptation is still a challenging task, especially when the data distribution shifts are severe, while the target domain has no or few labeled data.

In this thesis, I develop four efficient domain adaptation approaches to improve model performance on the target domain. Firstly, inspired by the large-scale pretraining of Vision Transformers, I explore Transformer-based domain adaptation for stronger feature representation and design a safe training mechanism to avoid model collapse when the domain gap is large. Secondly, I observe that source models have low confidence on the target data. To address this, I focus on the penultimate activations of target data and propose an adversarial training strategy to enhance model prediction confidence. Thirdly, I study using weak supervision from prior knowledge about the target domain label distribution. A novel Knowledge-guided Unsupervised Domain Adaptation paradigm is devised, and a plug-in module is designed to rectify pseudo labels. Lastly, I step into the task of Active Domain Adaptation, where the labels of a small portion of target data can be queried. I propose a novel active selection criterion based on the local context and devise a progressive augmentation module to better utilize queried target data. The robustness of domain adaptation approaches, in addition to accuracy, is critical yet under-explored. To conclude the thesis, I empirically study set prediction in domain adaptation using the tools of conformal prediction and conformal training.


Location: New Computer Science Bldg., Room 120
Zoom Link: https://stonybrook.zoom.us/j/92736258273?pwd=ipDdh1CTG6dRYmqa3ltUvooei8OfaT.1
Meeting ID: 927 3625 8273
Passcode: 466399
Abstract:

Recent advances in deep learning have significantly enhanced the capabilities of Natural Language Processing (NLP) and Vision-Language Models (VLMs). However, these advancements come with increased vulnerabilities, notably through backdoor attacks that pose severe security threats. This thesis addresses two critical dimensions of Trustworthy AI and Efficient Multimodal Representation Learning: (1) security through analyzing, detecting, and designing backdoor attacks in NLP and VLMs, and (2) efficiency through advanced multimodal representation methods tailored for clinical and medical imaging applications.

In the first dimension, we explore the internal mechanisms exploited by backdoor attacks, identifying the distinctive phenomenon of attention focus drifting in compromised transformer models, where trigger tokens consistently hijack attention. Leveraging these insights, we propose robust detection frameworks, including the attention-based Trojan detector (AttenTD) and a task-agnostic logit-based detection method (TABDet), achieving effective identification of backdoored NLP models across diverse tasks. We further introduce novel backdoor attack methodologies: the Trojan Attention Loss (TAL), enhancing attack efficiency and stealth through direct attention manipulation, and BadCLM, demonstrating critical vulnerabilities in clinical decision-support systems by effectively compromising clinical language models.

Extending our security exploration to multimodal settings, we investigate backdoor attacks on Vision-Language Models (VLMs), particularly in complex image-to-text generation tasks, proposing innovative techniques (TrojVLM, VLOOD) capable of embedding backdoors without direct access to original training data, thus showcasing practical risks in real-world scenarios.

In the second dimension, we address efficiency and interpretability challenges in clinical and pathology applications. We introduce TCP-LLaVA, the first multimodal large language model (MLLM) designed explicitly for Whole Slide Image (WSI) Visual Question Answering (VQA). Utilizing a novel token compression mechanism inspired by transformer-based models, TCP-LLaVA substantially reduces computational resource consumption while maintaining superior VQA performance across multiple tumor subtypes. Additionally, we present a multimodal transformer model integrating structured Electronic Health Records (EHR) with clinical notes, demonstrating enhanced predictive accuracy and interpretability for in-hospital mortality prediction through integrated gradient-based interpretability methods.

Together, these contributions present a comprehensive approach to ensuring AI models are not only secure against malicious manipulation but also efficient and interpretable for critical clinical applications, underscoring the essential need for trustworthy and effective AI systems.

Speaker: Weimin Lyu

Zoom: https://stonybrook.zoom.us/j/2392326575?pwd=SVQ2VkFXTnZZYmJUMXgvTXBuZWM3UT09

Meeting ID: 239 232 6575
Passcode: 436192
Hyperscale Verification in Microsoft Azure talk by Nikolaj Bjorner

Abstract: Cloud providers are increasingly embracing network verification for managing complex datacenter network infrastructure. Microsoft's Azure cloud infrastructure integrates the SecGuru tool, which leverages the Z3 Satisfiability Modulo Theories solver, for checking network access control lists. It also integrates a verifier that uses both custom verification algorithms and Z3 that checks correctness of forwarding tables in Azure data-centers. These tools assure that the network is configured to preserve desired intent over hundreds of thousands of network devices. We describe our experiences building and running SecGuru for network verification in Azure.

Finally, we mention recent advances in Z3, including a distributed version of Z3 that scales with Azure's elastic cloud. It integrates recent advances in lookahead and distributed SAT solving for Z3's engines for SMT. A different recent advance includes integration of DNNs to learn variable branching strategies for high-performance SAT solvers, including MiniSAT, Glucose and Z3's SAT solver.

Bio: Nikolaj Bjorner is a Principal Researcher at Microsoft Research, Redmond, working in the area of Automated Theorem Proving and Software Engineering. His current main line of work is around the state-of-the-art theorem prover Z3, which is used as a foundation of several software engineering tools. Z3 received the 2015 ACM SIGPLAN Software System award, the award for most influential tool paper in the first 20 years of TACAS in 2014, and the test of time award at ETAPS 2018. Together with Leonardo de Moura, he received the CADE 2019 Herbrand award for contributions to SMT and applications. Previously, he developed the Distributed File System Replication (DFSR) and Remote Differential Compression (RDC) protocols, part of Windows Server since 2005, and before that worked on distributed file sharing systems at a startup and on program synthesis and transformation systems at the Kestrel Institute. He received his Master's and PhD degrees in computer science from Stanford University.