© 2026 Denis Jacob Machado. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Some Models Are Useful, Many Models Are Harmful, and All Models Are Incorrect: A Defense of Parsimony as the Practical Optimality Criterion in Phylogenetics


Abstract

The debate between maximum likelihood (ML) and parsimony (P) as optimality criteria in phylogenetics has persisted for decades, often framed in terms of statistical consistency. Here, we argue that when the assumptions underlying real biological data are taken seriously (model uncertainty, finite data, unique evolutionary histories, and computational constraints), the theoretical advantages of standard ML (i.e., maximum average likelihood, MAL) over parsimony largely dissolve. We articulate a set of premises grounded in the biological realities of systematic analysis, show that the forms of likelihood that converge to parsimony embody philosophically coherent treatments of character evolution, and conclude that parsimony offers the most practical, robust, and epistemologically defensible approach to phylogenetic inference.


1. Premises and Values

We begin by stating clearly the assumptions and values that guide our argument. These are not arbitrary preferences; they emerge from the nature of the biological phenomena that phylogenetics seeks to explain.

Premise 1: All models are incorrect, and many are harmful

George Box's famous aphorism holds that "all models are wrong, but some are useful." The aphorism deserves a sharper edge when applied to phylogenetics. Yes, all models are simplifications. But in a field where the correctness of the model is a prerequisite for the most celebrated statistical properties of ML, an incorrect model is not merely imprecise; it can actively mislead. As Wheeler (2012, p. 279) states plainly, the consistency of ML requires "that the model used for reconstruction is the same as that which generated the tree in the first place. In general, likelihood will be inconsistent otherwise." Chang (1996) showed that something as simple as analyzing under a single model data that were actually generated by two (even on the same tree) destroys consistency. The models we use today are, in Wheeler's words, "clearly simplifications of the myriad forces molding the evolution of creatures in time and space" (Wheeler, 2012, p. 281). And we can never verify that a model is correct for real biological data.

Premise 2: What matters in practice matters, and consistency may not matter in practice

Statistical consistency is the guarantee that an estimator converges to the truth as the data grow without bound, and it is often invoked as the decisive advantage of ML over parsimony. But this guarantee rests on assumptions that are routinely violated in phylogenetic practice. Wheeler (2012, pp. 275–277) systematically catalogues the violations: empirical sequence data are not independent (they share evolutionary history and are drawn from functionally related genomic regions), they are not identically distributed (dynamics vary even within a single locus, as between stem and loop regions of structural RNAs), and the data are finite, because entire genomes set a hard limit on sample size and "it is unclear if these sample sizes are sufficient to ensure asymptotic behavior" (Wheeler, 2012, p. 277). Edwards (1972) argued that asymptotic behavior is irrelevant and that the only operation that matters is the relative ranking of hypotheses on the data at hand. As Wheeler concludes, "the real-world inapplicability of consistency proofs does not impugn likelihood methods in any way; it simply signifies that consistency is not a basis to favor likelihood over other methods (e.g., parsimony)" (Wheeler, 2012, p. 281).
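
For concreteness, the guarantee at issue can be written out (a standard textbook formulation, our gloss rather than a quotation from Wheeler): if $\hat{\tau}_n$ is the tree estimated from $n$ characters and $\tau$ is the true tree, consistency asserts that

$$
\lim_{n \to \infty} \Pr\left(\hat{\tau}_n = \tau\right) = 1,
$$

a limit statement whose proof presupposes exactly the independence, identical distribution, and effectively unbounded sample size that the preceding paragraph shows are violated in practice.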

Premise 3: ML is also inconsistent

The inconsistency of parsimony in the "Felsenstein Zone" is well known (Felsenstein, 1978). Less frequently discussed is that ML suffers from systematic failures of its own. Pol and Siddall (2001) demonstrated that likelihood is also prone to long-branch attraction, "regardless of whether likelihood analysis is performed using an incorrect or correct (generating) model" (Wheeler, 2012, p. 284). Crucially, likelihood further suffers from long-branch repulsion in the "Farris Zone" (Siddall, 1998), also called the anti- or inverse Felsenstein zone, a failure mode "not found in parsimony analyses" according to Wheeler (2012, p. 284). In this zone, parsimony outperforms likelihood: as the probability of change on the short branches approaches zero and the long-edge sequences approach randomization, "with k sites, the probability of parsimony reconstructing the tree correctly approaches $1 - (3/4)^k$ while that for likelihood will be no more than $2/3$" (Wheeler, 2012, p. 284, citing Steel and Penny, 2000). There is no safe harbor: "all methods can, under some conditions, exhibit this behavior" (Wheeler, 2012, p. 285), and "the only way to recognize the phenomenon is by knowing the actual history of a group. Yet, we cannot know this history for real data" (Wheeler, 2012, p. 285).
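
To make the magnitude of that gap concrete, the short sketch below (our illustration; the bound is the Steel and Penny result as cited by Wheeler, not a computation taken from either source) tabulates $1 - (3/4)^k$ against the $2/3$ ceiling for a few values of $k$. Parsimony's probability passes the ceiling at about four sites and approaches certainty within a few dozen.

```python
# Sketch of the Farris Zone limit discussed above (our illustration).
# Parsimony's probability of recovering the true tree approaches 1 - (3/4)^k
# with k sites (Steel and Penny, 2000), while the cited ceiling for likelihood
# in this limit is 2/3.

likelihood_ceiling = 2 / 3

for k in (1, 2, 4, 10, 20, 50):
    p_parsimony = 1 - (3 / 4) ** k
    marker = ">" if p_parsimony > likelihood_ceiling else "<"
    print(f"k = {k:>2} sites: parsimony -> {p_parsimony:.4f} "
          f"{marker} {likelihood_ceiling:.4f} (likelihood ceiling)")
```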

Premise 4: The model can never be known to be correct for real biological data

This point deserves its own emphasis because it is the linchpin of the consistency argument. Every proof of ML consistency assumes model correctness. In practice, we face what Wheeler describes as "evolutionary processes varying over time, space, and taxon, and finite, incomplete data due to inherent limitations on data quantity, extinction, and technical difficulties" (Wheeler, 2012, p. 283). Steel (2011b), examining varieties of no-common-mechanism (NCM) models, found that "relatively slight alterations of conditions had marked effects on results" and concluded:

"This brings into question the robustness of any consistency results to even slight model misspecification and suggests that other statistical considerations (e.g., bias, efficiency) may override consistency issues." (as quoted in Wheeler, 2012, p. 281)

Yang (1997) demonstrated similar fragility for common-mechanism models, "disconcertingly finding 'wrong' models outperforming the right" (Wheeler, 2012, p. 282). If even slight model misspecification can reverse the expected hierarchy of methods, then the theoretical superiority of ML over parsimony rests on ground that is never firm in practice.