Backdoor Attacks and Defenses in Language Models and Vision-Language Models

Event Description

Abstract: Recent studies have highlighted the vulnerability of Natural Language Processing (NLP) and Vision-Language Models (VLMs) to backdoor attacks, posing significant security risks. Understanding these attack strategies is crucial for assessing model robustness and developing effective defenses. This thesis proposal aims to investigate the vulnerability of language and vision-language models, analyze abnormal behaviors in backdoor-attacked models, and develop defense methods to enhance safety of modern machine learning models at deployment.

We investigate the internal mechanisms of backdoored NLP models, identifying a distinct attention focus drifting phenomenon, where trigger tokens hijack attention regardless of the input context. Through comprehensive qualitative and quantitative analysis, we provide insights into the underlying mechanisms that enable backdoor attacks. Building on these insights, we propose detection methods to differentiate backdoored models from clean ones, through inspecting both the attention distribution and the model predictions. To better understand the vulnerability, we develop advanced backdoor attack strategies targeting language models in classification tasks. For BERT variants, we introduce Trojan Attention Loss (TAL), a novel method that directly manipulates attention patterns to enhance backdoor effectiveness, ensuring stealth and robustness. Vision-Language Models have demonstrated strong performance in recent years. Yet their vulnerability is largely underexplored. We investigate advanced backdoor attack strategies on Vision-Language Models, focusing on image-to-text generation tasks. We demonstrate how backdoors can be embedded in complex multimodal tasks while maintaining semantic integrity under poisoned inputs. Additionally, we propose innovative techniques for injecting backdoors without requiring access to the original training data, expanding the feasibility of real-world attacks.

This proposal provides novel insights into the internal mechanisms of backdoored models, propose effective detection strategies, and develop advanced attack techniques that expose critical vulnerabilities. These findings underscore the urgent need for robust security measures to defend against emerging backdoor threats in deep learning models. The results have been published in top venues including ICLR, ECCV, NAACL, EMNLP, etc.

Speaker: Weimin Lyu

Zoom link: https://stonybrook.zoom.us/j/99880605139?pwd=cfWbRG6n9v3GXEa7OqvXa5cOp5eLBv.1
Meeting ID: 998 8060 5139
Passcode: 843302

Date Start

Tue, 03/04/2025 - 14:30

Date End

Tue, 03/04/2025 - 15:30

AI Innovation Institute

Event Description

Date Start

Date End