Evaluation, Datasets and Models for Reliable Multi-Step Reasoning

Event Description


Abstract:

Many complex real-world problems are multi-step reasoning tasks. These range from analytic tasks, such as answering questions, to automation tasks, where agents complete tasks on behalf of users. Evaluation, datasets, and models for such tasks can be unreliable for multiple reasons. (i) Datasets often have annotation artifacts and biases that allow models to take reasoning shortcuts. Such shortcuts can let models make effective guesses -- or, in a sense, cheat -- to achieve high performance without any multi-step reasoning. This issue is further exacerbated for complex tasks: as the number of required reasoning steps increases, so do the avenues for bypassing those steps. (ii) Models trained on such datasets learn to solve the task by taking reasoning shortcuts instead of performing proper multi-step reasoning. As a result, these models are not robust (reliable) when evaluated in an out-of-distribution setting. (iii) Lastly, recent work has shown that language models can solve complex multi-step tasks by producing a step-by-step explanation without any training. However, these methods often hallucinate factually incorrect (i.e., unreliable) explanations when posed with knowledge-intensive tasks.

I address these challenges by carefully characterizing the requirements of robust multi-step reasoning and by designing reliable evaluation datasets and training methods that necessitate thorough multi-step reasoning. In DiRe, I first introduce and formalize Disconnected Reasoning, i.e., reasoning that allows models to arrive at the correct answer by bypassing necessary reasoning steps, and use this formalization to measure how much multi-step reasoning a model actually performs on a dataset. In MuSiQue, I build a multi-step reasoning dataset for QA from scratch that avoids cheatability via disconnected reasoning, providing a more reliable evaluation. In TeaBReaC, I develop a synthetically generated multi-step QA pretraining dataset designed to force models to avoid disconnected reasoning and learn reliable multi-step reasoning. In IRCoT, I address the reliability of model-generated multi-step reasoning chains by interleaving the model's step-by-step reasoning with step-by-step retrieval from an external corpus, resulting in more factually correct reasoning. Finally, in AppWorld, I build a multi-step reasoning dataset that requires highly interactive problem-solving in an environment carefully designed to ensure that models need thorough reasoning to succeed.
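To illustrate the interleaving idea behind IRCoT, the short Python sketch below shows one plausible control loop: each newly generated reasoning sentence becomes the query for the next retrieval step, and the loop stops once an answer is stated. The functions retrieve and generate_next_reasoning_step are hypothetical placeholders used only for illustration, not the actual implementation.

    # A minimal, illustrative sketch of IRCoT-style interleaving; the helper
    # functions passed in are assumed placeholders, not the real IRCoT code.
    def interleaved_reasoning(question, retrieve, generate_next_reasoning_step, max_steps=5):
        paragraphs = retrieve(question)  # initial retrieval, queried with the question itself
        reasoning_steps = []
        for _ in range(max_steps):
            # Generate the next chain-of-thought sentence, conditioned on the question,
            # everything retrieved so far, and the reasoning steps produced so far.
            step = generate_next_reasoning_step(question, paragraphs, reasoning_steps)
            reasoning_steps.append(step)
            if "answer is" in step.lower():  # stop once the model states an answer
                break
            # Retrieve again, using the new reasoning sentence as the query,
            # so the next reasoning step is grounded in freshly retrieved evidence.
            paragraphs += retrieve(step)
        return reasoning_steps, paragraphs
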
Speaker: Harsh Trivedi

Location: NCS 220 or Zoom

https://stonybrook.zoom.us/j/99096379762?pwd=zYCJZQVxRuZd9BboscO4nlodCwsKBr.1
