NLP Reading Group: Scalable Extraction of Training Data from Aligned, Production Language Models
Event Description
Abstract: Large language models are prone to memorizing some of their training data. Memorized (and possibly sensitive) samples can then be extracted at generation time by adversarial or benign users. There is hope that model alignment, a standard training process that tunes a model to harmlessly follow user instructions, would mitigate the risk of extraction. However, we develop two novel attacks that undo a language model's alignment and recover thousands of training examples from popular proprietary aligned models such as OpenAI's ChatGPT. Our work highlights the limitations of existing safeguards to prevent training data leakage in production language models.
Speaker: Pegah Alipoormolabashi
Location: CS2311