<aside> 📜
© 2026 Denis Jacob Machado. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
</aside>
Based on Wheeler (2012), Chapter 11: Optimality Criteria—Likelihood.
In phylogenetics, once we have data (say, DNA sequences or morphological characters) and a candidate tree, we need a way to ask: how good is this tree? That's where optimality criteria come in. Parsimony (Chapter 10) scores trees by minimizing the number of character transformations. Maximum likelihood (ML), on the other hand, scores trees using stochastic models of character change, asking: what is the probability of observing our data given this tree and some model of evolution?
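To make "the probability of the data given the tree and a model" concrete, here is a minimal sketch for the simplest possible case: two aligned sequences joined by a single branch, under the Jukes–Cantor (JC69) model. This is an illustration, not Wheeler's code; the function names are mine, and JC69 is chosen only because its transition probabilities have a simple closed form.

```python
import math

def jc69_prob(same: bool, t: float) -> float:
    """JC69 probability that a site is in the same base after a branch
    of length t (expected substitutions per site), or in one specific
    different base. Closed form: 1/4 + 3/4*e^(-4t/3) vs 1/4 - 1/4*e^(-4t/3)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def site_likelihood(x: str, y: str, t: float) -> float:
    """P(bases x and y at the two tips | branch length t).
    The 1/4 factor is the JC69 stationary frequency of the starting base."""
    return 0.25 * jc69_prob(x == y, t)

def seq_likelihood(seq1: str, seq2: str, t: float) -> float:
    """Likelihood of two aligned sequences, assuming sites evolve independently."""
    L = 1.0
    for a, b in zip(seq1, seq2):
        L *= site_likelihood(a, b, t)
    return L

# compare how well two candidate branch lengths fit 3 matches + 1 mismatch
print(seq_likelihood("ACGT", "ACGA", 0.1))
print(seq_likelihood("ACGT", "ACGA", 1.0))
```

Maximizing this quantity over branch lengths (and, on larger trees, over topologies) is exactly the ML search described above; real implementations use the pruning algorithm rather than this two-taxon shortcut.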
As Wheeler puts it, the likelihood of a tree T given data D is "proportional to the probability of the data given the tree" (Wheeler, 2012, p. 216; following Edwards, 1972):
l(T|D) ∝ pr(D|T)
A systematic ML method then selects the tree T that maximizes pr(D|T). However, computing this probability requires specifying many additional quantities—transformation models, branch lengths, rate distributions—collectively called nuisance parameters (denoted θ). How we deal with these nuisance parameters gives rise to the different "flavors" of likelihood.
Wheeler (2012, pp. 216–218) presents a hierarchical classification of likelihood methods used in systematics (his Figure 11.3). The key idea is that each method differs in how it handles nuisance parameters and ancestral state assignments. Let's walk through them step by step.
If we actually knew the distribution of nuisance parameters, Φ(θ|T), we could integrate them out over the parameter space Θ (Wheeler, 2012, p. 216):
pr(D|T) = ∫_Θ pr(D|T, θ) dΦ(θ|T)
The tree that maximizes this integrated probability is the Maximum Integrated Likelihood (MIL) tree (Steel and Penny, 2000). This is the most theoretically satisfying approach because it accounts for our uncertainty in the nuisance parameters rather than picking a single "best" set.
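The integral above can be approximated numerically. The sketch below treats the single JC69 branch length as the only nuisance parameter and integrates it against an exponential prior Φ, using a simple Riemann sum. Everything here is illustrative: the prior, the rate, and the grid are my assumptions, not anything prescribed by Wheeler or by Steel and Penny.

```python
import math

def jc69_site_lik(same: bool, t: float) -> float:
    """P(one site pattern | branch length t) under JC69,
    with a uniform 1/4 stationary frequency on the starting base."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 * ((0.25 + 0.75 * e) if same else (0.25 - 0.25 * e))

def data_lik(t: float, matches: int, mismatches: int) -> float:
    """p(D|T, t) for a dataset summarized as match/mismatch counts."""
    return jc69_site_lik(True, t) ** matches * jc69_site_lik(False, t) ** mismatches

def integrated_lik(matches: int, mismatches: int, rate: float = 1.0,
                   n: int = 10_000, t_max: float = 20.0) -> float:
    """Approximate ∫ p(D|T, t) dΦ(t|T), with Φ an exponential(rate)
    prior on the branch length, via a midpoint Riemann sum on [0, t_max]."""
    dt = t_max / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt
        prior_density = rate * math.exp(-rate * t)
        total += data_lik(t, matches, mismatches) * prior_density * dt
    return total

print(integrated_lik(3, 1))  # marginal probability of 3 matches + 1 mismatch
```

Computing this value for each candidate tree and picking the largest is the MIL criterion in miniature; note that no single "best" branch length is ever chosen, which is exactly the point.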
Now, if we also have a prior distribution on trees, p(T), we can ask: which tree has the highest posterior probability? Székely and Steel (1999) showed that the method with the highest expected probability of returning the true tree is the one that selects the tree maximizing p(T) × pr(D|T). This is the Maximum A Posteriori (MAP) tree (the Bayesian estimate).
Here's the important connection: when the prior over trees is uniform (flat and uninformative, so every tree is equally probable a priori), the MAP tree is identical to the MIL tree. As Wheeler notes, "the use of non-uniform tree priors (such as empirical or Yule) breaks this identity" (Wheeler, 2012, p. 217).
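The flat-prior identity is easy to see numerically. In the sketch below, the marginal likelihoods pr(D|T) for three candidate trees are hypothetical numbers chosen for illustration; with a uniform prior the MAP choice reduces to maximizing pr(D|T) alone (the MIL tree), while a skewed prior can change the winner.

```python
def map_tree(likelihoods: dict[str, float], priors: dict[str, float]) -> str:
    """Return the tree T maximizing p(T) * pr(D|T)."""
    return max(likelihoods, key=lambda T: priors[T] * likelihoods[T])

# hypothetical marginal likelihoods pr(D|T) for three candidate trees
lik = {"T1": 1.2e-5, "T2": 1.5e-5, "T3": 0.9e-5}

flat = {T: 1 / 3 for T in lik}             # uniform prior on trees
skew = {"T1": 0.7, "T2": 0.2, "T3": 0.1}   # non-uniform (e.g. empirical) prior

print(map_tree(lik, flat))  # "T2": same tree that maximizes pr(D|T) alone
print(map_tree(lik, skew))  # "T1": the prior outweighs the likelihood ranking
```

This is exactly the identity Wheeler describes: flat priors make MAP and MIL coincide, and any informative tree prior (empirical, Yule) can break it.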