Sue Hyun Park July 12, 2021


Every day we search for answers to questions. Search engines like Google are the standard tool, and recently virtual assistants such as AI speakers have come in handy. These systems implement question answering (QA) mechanisms that process natural language questions and construct answers by querying a collection of natural language documents.

The multi-hop question answering (QA) task is gaining importance because complex questions require connecting information from several texts. An answer is deduced only after capturing multiple relevant facts, each serving as a piece of evidence.

Recent multi-hop QA models are trained for answerability, i.e., to predict the correct answer whenever the answer exists in the given texts. However, this practice of focusing on producing an answer raises a reasoning shortcut problem. Previous works point out that such models exploit disconnected reasoning: they selectively assess and combine pieces of information that are far from real evidence. The predicted answer is reached through bad reasoning, almost by "cheating"!

Let's say we ask a QA model which country got independence when World War II ended. Assume the model has no external knowledge base and the passage being searched is exactly one sentence that contains the answer "Korea". As shown below, even though the passage lacks any information about when WWII ended, the model simply figures out that the answer should be a country name and predicts that "Korea" in the passage is the right answer.

An example of a reasoning shortcut. If the model were truly following the reasoning process, it should have acknowledged that it is unable to answer.

A multi-hop QA model can guess the answer but fail to understand the underlying reasoning process.
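To make the shortcut concrete, here is a minimal, hypothetical sketch of the kind of type-matching heuristic a model can implicitly fall back on: it picks an entity in the passage that matches the expected answer type ("country") without ever checking whether the passage says anything about when WWII ended. The passage, entity list, and type tags below are made up for illustration.

```python
# A deliberately naive "type matching" heuristic, shown only to illustrate
# the reasoning shortcut. The passage, entities, and type tags are hypothetical.
PASSAGE = "Korea gained independence from Japan."
QUESTION = "Which country got independence when World War II ended?"

# Entities a tagger might extract from the passage, with coarse types.
passage_entities = [("Korea", "COUNTRY"), ("Japan", "COUNTRY")]

def shortcut_answer(question, entities):
    """Guess an answer purely from the expected answer type.

    The question asks "which country", so any COUNTRY entity looks like a
    plausible answer -- no evidence about the end of WWII is ever consulted.
    """
    expected_type = "COUNTRY" if "which country" in question.lower() else None
    for text, entity_type in entities:
        if entity_type == expected_type:
            return text  # picks the first matching entity: "Korea"
    return None

print(shortcut_answer(QUESTION, passage_entities))  # -> Korea
```

The heuristic gets the right string for the wrong reason, which is precisely the behavior evidentiality supervision is meant to discourage.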

To this end, we propose to supervise evidentiality by training the QA model to recognize whether its answer is supported by evidence. The model learns to establish the logical link between a given question and the right answer by discovering the influential sentences. This post first explains the multi-hop task setting our QA model will be tested on. Then we introduce our novel method for generating training examples without human annotation and for increasing the robustness of our model.
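In model terms, supervising evidentiality can be pictured as adding a passage-level binary head on top of a standard span-prediction reader, trained to say whether the evidence for the predicted answer is actually present. The sketch below is only illustrative; the module names, pooling choice, and layer sizes are assumptions, not the exact architecture described in our paper.

```python
from torch import nn

class EvidentialityAwareReader(nn.Module):
    """Illustrative reader with an extra evidentiality head (names are hypothetical)."""

    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder                       # any transformer-style encoder
        self.span_head = nn.Linear(hidden_size, 2)   # start/end logits for the answer span
        self.evid_head = nn.Linear(hidden_size, 2)   # is the evidence present? (binary)

    def forward(self, input_ids, attention_mask):
        # Assume the encoder returns token-level hidden states of shape
        # (batch, sequence_length, hidden_size) as its first output.
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        start_logits, end_logits = self.span_head(hidden).split(1, dim=-1)
        # Pool the first-token representation for the passage-level decision.
        evid_logits = self.evid_head(hidden[:, 0])
        return start_logits.squeeze(-1), end_logits.squeeze(-1), evid_logits
```

Training would then combine the usual answer-span loss with a classification loss on the evidentiality head, using the passage types introduced in the next section as positive and negative examples.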

Our Multi-Hop QA Task Description

We follow the distractor setting in HotpotQA, a dataset comprising 112k questions that require finding and reasoning over multiple supporting documents to answer.

Each question has a candidate set of 10 paragraphs:

- 2 gold paragraphs that contain the supporting facts needed to answer the question, and
- 8 distractor paragraphs retrieved from Wikipedia that are similar to the question but do not support the answer.

The task is to aggregate relevant facts from the candidate set and predict a contiguous answer span. For evaluation, the predicted answer span is compared with the ground-truth answer span.
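To give a sense of how that comparison works, the sketch below scores a predicted answer against the gold answer with exact match and token-level F1, the standard answer metrics for HotpotQA. The normalization here is simplified compared with the official evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace (simplified)."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Token-level F1 between the predicted and gold answer spans."""
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Korea", "Korea"))     # 1.0
print(token_f1("South Korea", "Korea"))  # ~0.67
```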

Generating Examples for Supervision

We build four different types of passages to train our QA model for both answerability and evidentiality.

For predicting the correct answer,

- answerable passages that contain the answer serve as positive examples, and
- unanswerable passages that do not contain the answer serve as negative examples.

For detecting a reasoning chain assuming a correct answer exists,

- evidence-positive passages contain both the answer and the evidence that supports it, and
- evidence-negative passages contain the answer but lack the supporting evidence, so a model that answers them correctly must be taking a shortcut. (A rough construction sketch follows this list.)
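Below is a rough sketch of how such passages could be assembled from a HotpotQA-style example. The function name, the assumption that the answer sits in the last gold sentence, and the way unrelated sentences are mixed in are all simplifications for illustration, not our exact construction procedure.

```python
def build_training_passages(gold_sents, other_sents):
    """Assemble the four passage types from one example (simplified illustration).

    gold_sents  : evidence-chain sentences; the answer-bearing sentence is
                  assumed to be the last one
    other_sents : sentences unrelated to the question, e.g. from distractors
    """
    answer_sent = gold_sents[-1]        # the sentence containing the answer
    evidence_only = gold_sents[:-1]     # supporting facts without the answer

    passages = {
        # answerability supervision
        "answerable": gold_sents,                          # answer is present
        "unanswerable": evidence_only + other_sents,       # answer is absent
        # evidentiality supervision (the answer is present in both)
        "evidence_positive": gold_sents,                   # answer + full evidence
        "evidence_negative": [answer_sent] + other_sents,  # answer without evidence
    }
    return {name: " ".join(sents) for name, sents in passages.items()}

# Example with hypothetical sentences:
example = build_training_passages(
    gold_sents=["World War II ended in 1945.", "Korea gained independence in 1945."],
    other_sents=["The Berlin Wall fell in 1989."],
)
print(example["evidence_negative"])
# -> "Korea gained independence in 1945. The Berlin Wall fell in 1989."
```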

Overview of our proposed supervision