Machines can now produce coherent text, some of it virtually indistinguishable from human-authored content. This has significant implications: students, for instance, can use such models to cheat on take-home essay exams. Conversely, the overarching aim of text generation models is to produce text that is as human-like as possible. In both scenarios, the pressing question is: how do we measure the gap between machine-generated text and human-generated text? This paper proposes a metric to evaluate precisely this distinction.

The Tale of Two Kinds of Errors

  1. Type I Error - The machine produces text that a human would be unlikely to write, making it evident that it was not authored by a human.
  2. Type II Error - The machine fails to produce text that a human plausibly would write; whole regions of human-like text are missing from the model's output distribution.

By distinguishing and measuring these two categories of error, we can quantify how far machine-generated text is from human-generated text. MAUVE proposes an information-theoretic measure to do exactly this.

KL Divergence

Let’s say that the distribution of human-generated text is captured by $P$ and the distribution of machine-generated text is captured by $Q$.

The Type I error can be calculated using the KL divergence as follows. Notice that if $Q(x)$ is large and $P(x)$ is small, the corresponding term is large: the machine assigns high probability to text that humans would rarely produce.

$$ KL(Q \| P) = \sum\limits_{x} Q(x) \log \frac{Q(x)}{P(x)} $$

Similarly, the Type II error can be calculated as follows:

$$ KL(P \| Q) = \sum\limits_{x} P(x) \log \frac{P(x)}{Q(x)} $$
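To make the asymmetry concrete, both divergences can be computed directly for small discrete distributions. The toy distributions below are invented purely for illustration; $P$ plays the role of human text and $Q$ of machine text:

```python
import numpy as np

def kl_divergence(a, b):
    """KL(a || b) for discrete distributions; terms with a(x) = 0 contribute nothing."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

# A toy "vocabulary" of four texts: P is the human distribution, Q the machine's.
P = [0.40, 0.40, 0.15, 0.05]
Q = [0.10, 0.10, 0.10, 0.70]  # Q piles mass on text humans rarely produce

type1 = kl_divergence(Q, P)  # KL(Q || P): penalizes un-human-like generations
type2 = kl_divergence(P, Q)  # KL(P || Q): penalizes human text the model misses
```

Because $Q$ concentrates on text that $P$ considers unlikely, both divergences come out positive, and they generally differ: KL divergence is not symmetric.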

In practice, these two types of error arise from various factors. For instance, generation is often capped at a maximum number of tokens, leading to arbitrary truncation of text; such truncated text is plausible under $Q$ but would not occur under $P$. A further difficulty is that either KL divergence blows up to infinity as soon as one distribution assigns zero probability to text the other can produce. To handle a broad spectrum of these scenarios, the paper measures both errors against a mixture reference distribution.

$$ R_{\lambda} = \lambda P + (1 - \lambda) Q, \quad \lambda \in (0, 1) $$

Each value of $\lambda$ gives a different reference distribution, so by sweeping $\lambda$ we can collect many of them. We can then plot a divergence curve using the following coordinates, where $c > 0$ is a scaling constant. The area under this curve is the MAUVE score.

$$ C(P, Q) = \left( e^{-c\,KL(Q \| R_{\lambda})},\; e^{-c\,KL(P \| R_{\lambda})} \right) $$
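Putting the pieces together, here is a minimal sketch of the whole computation for discrete distributions. Note the assumptions: the real method operates on quantized embeddings of actual text samples, and the scaling constant `c = 5.0`, the $\lambda$ grid, and the two limiting anchor points at $(1, 0)$ and $(0, 1)$ are illustrative choices for this sketch, not the paper's exact recipe:

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions; terms with a(x) = 0 contribute nothing."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def divergence_curve(P, Q, c=5.0, n=99):
    """Points (exp(-c*KL(Q||R)), exp(-c*KL(P||R))) as lambda sweeps over (0, 1)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    pts = [(1.0, 0.0)]  # limiting anchor point (an assumption of this sketch)
    for lam in np.linspace(0.01, 0.99, n):
        R = lam * P + (1 - lam) * Q  # mixture reference distribution
        pts.append((np.exp(-c * kl(Q, R)), np.exp(-c * kl(P, R))))
    pts.append((0.0, 1.0))  # limiting anchor point on the other axis
    return np.array(pts)

def mauve_score(P, Q, c=5.0):
    """Area under the divergence curve, via the trapezoid rule."""
    pts = divergence_curve(P, Q, c)
    x, y = pts[:, 0], pts[:, 1]
    return abs(float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)))
```

With this sketch, identical distributions score close to 1, and the score decays toward 0 as $Q$ drifts away from $P$.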

<aside> 🤔 This curve is similar to the false positive rate vs. true positive rate curve that is famous in machine learning. The area under such a curve is called the AUC-ROC: the area under the receiver operating characteristic curve. The authors use a similar intuition here to calculate MAUVE. </aside>