Hidden Biases: Ethical Issues in NLP, and What to Do about Them, presented by Dirk Hovy of Bocconi University
ABSTRACT: Through language, we fundamentally express who we are as humans. This property makes text a fantastic resource for research into the complexity of the human mind, from the social sciences to the humanities. However, it is exactly this property that also creates ethical problems. Texts reflect their authors' biases, which statistical models then magnify. This has unintended consequences for our analyses: if our data is not representative of the population as a whole, and if we do not pay attention to the biases it contains, we can easily draw the wrong conclusions and create disadvantages for our users.
In this talk, I will discuss several types of biases that affect NLP models, their sources, and potential countermeasures: (1) biases stemming from data, i.e., selection bias (if our texts do not adequately reflect the population we want to study), label bias (if the labels we use are skewed), and semantic bias (the latent stereotypes encoded in embeddings); (2) biases deriving from the models themselves, i.e., their tendency to amplify any imbalances present in the data; (3) design bias, i.e., the biases arising from our (the researchers') decisions about which topics to analyze, which data sets to use, and what to do with them. For each bias, I will provide examples, discuss the possible ramifications for a wide range of applications, and present ways to address and counteract these biases, ranging from simple labeling considerations to new types of models.
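The "semantic bias" mentioned above can be made concrete with a toy sketch: in an embedding space, a word's stereotyped associations show up as asymmetric similarity to gendered anchor words. The vectors below are invented for illustration (real studies use trained embeddings such as word2vec or GloVe, and more careful tests such as WEAT); this is not the speaker's method, just a minimal demonstration of the idea.

```python
# Toy illustration of semantic bias in embeddings: an occupation word
# sitting closer to one gendered anchor than the other. All vectors
# here are hypothetical, hand-picked so the asymmetry is visible.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional "embeddings" (not from any real model).
emb = {
    "he":       np.array([1.0, 0.1, 0.0, 0.2]),
    "she":      np.array([0.1, 1.0, 0.0, 0.2]),
    "engineer": np.array([0.9, 0.2, 0.3, 0.1]),
    "nurse":    np.array([0.2, 0.9, 0.3, 0.1]),
}

def gender_association(word):
    # Positive => closer to "he"; negative => closer to "she".
    return cosine(emb[word], emb["he"]) - cosine(emb[word], emb["she"])

for w in ("engineer", "nurse"):
    print(w, round(gender_association(w), 3))
```

In this contrived space, "engineer" scores positive (male-associated) and "nurse" negative (female-associated); a model trained on top of such embeddings inherits that skew unless it is measured and counteracted.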
BIO: Dirk Hovy is an Associate Professor of Computer Science in the Department of Marketing at Bocconi University. He received his PhD from the University of Southern California in Los Angeles, where he worked as a research assistant at the Information Sciences Institute.
He works in Natural Language Processing (NLP), a subfield of artificial intelligence, with a research focus on computational social science. His interests include integrating sociolinguistic knowledge into NLP models, using large-scale statistics to model the interaction between people's socio-demographic profiles and their language use, and ethics for data science and algorithmic fairness.