Adam, Eve and Population Genetics: A Reply to Dr. Richard Buggs (Part 1)

@RichardBuggs thanks for your last post. I think you homed in on the point of confusion. Thanks for elucidating it.

You are right, that is the key point. I’m glad we are getting a chance to explain it.

That is exactly right. That is what they are doing, with a few bells and whistles. Essentially, this is exactly what MrBayes does (http://mrbayes.sourceforge.net/), except that unlike MrBayes, ArgWeaver can handle recombination. Technically, it is constructing ARGs (ancestral recombination graphs), not phylogenetic trees. ARGs (of the sort ArgWeaver computes) can be represented as sequential trees along the genome. That’s a convenient representation that is easier for most of us to wrap our heads around, but the actual entity being constructed is the ARG.
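To make the “sequential trees” picture concrete, here is a minimal sketch using msprime (one of the coalescence simulators mentioned below). It simulates an ARG with recombination and then walks the local trees along the genome; all parameter values here are arbitrary toy choices, just for illustration.

```python
# Sketch: an ARG with recombination can be viewed as a sequence of
# local trees along the genome. msprime simulates exactly this kind
# of structure; here we just iterate the local trees to show the
# representation that ArgWeaver's output also uses.
import msprime

ts = msprime.sim_ancestry(
    samples=5,                  # 5 diploid samples (toy choice)
    sequence_length=100_000,    # 100 kb region
    recombination_rate=1e-8,    # per bp per generation
    population_size=10_000,     # constant-Ne toy model
    random_seed=42,
)

for tree in ts.trees():
    # Each local tree covers one non-recombining interval of the genome.
    print(f"interval {tree.interval}: root at "
          f"{tree.time(tree.root):.0f} generations")
```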

Except, as you are coming to see, this is not a coalescence simulation at all.

To clarify for observers, there are three types of activities/programs relevant here.

  1. Phylogenetic tree inference. Starting from DNA sequences → find the best-fitting phylogenetic tree (or ARG when using recombination) → assign mutations to legs of the tree (or ARG) → use the number of mutations to determine the length of the legs. (see for example MrBayes)

  2. Coalescence simulation. Starting from a known population history → simulated phylogenetic trees (or ARGs when using recombination) → simulated DNA sequences. (see for example ms, msms, and msprime; a small msprime sketch follows this list)

  3. Demographic history inference. There are many methods, but one common approach builds on #1. Starting from DNA sequences → infer phylogenetic trees / ARGs (task #1) → compute the coalescent rate in time windows in the past → Ne is the reciprocal of the coalescent rate. (see for example psmc and msmc)
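To illustrate the distinction, here is a minimal sketch of activity #2 (coalescence simulation) using msprime: a known population history goes in, and simulated trees and DNA variation come out. The parameter values are toy choices for illustration (I reuse the ArgWeaver mutation rate quoted later in this post).

```python
# Sketch of activity #2 (coalescence simulation) with msprime:
# known population history in, simulated trees and sequences out.
import msprime

# A known (here: constant-size) population history.
demography = msprime.Demography()
demography.add_population(name="pop", initial_size=10_000)

ts = msprime.sim_ancestry(
    samples={"pop": 10},        # 10 diploid samples
    demography=demography,
    sequence_length=10_000,
    recombination_rate=1e-8,
    random_seed=1,
)
# Overlay mutations to get simulated sequence variation.
mts = msprime.sim_mutations(ts, rate=1.25e-8, random_seed=1)
print(f"{mts.num_trees} local trees, {mts.num_sites} variable sites")
```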

It seems that there was some confusion about what ArgWeaver was doing. Some people thought it was doing #2 or #3, but it is actually just doing #1. The confusion arose because it uses a fixed Ne as a parameter, which seemed only to make sense if it was doing #2, and might make its results suspect if it was doing #3. However, ArgWeaver was never designed to do #2 or #3. Instead, it is doing #1.

So what is the Ne for? One of the features of ArgWeaver is that it uses a prior, which is good statistical practice. They use Ne to tune the shape of the prior, but ultimately this does not have a large effect on the trees. It only matters, in the end, when there are small amounts of data. As I have explained several times, the prior they used actually pushed the TMR4A estimate downwards from what the data alone showed.

How This All Gets Confusing…

In defense of the confused, one of the confusing realities of population genetics is that the same quantities can be expressed in several different units. Often they are used interchangeably without clear explanation, and it’s really up to the listener to sort out from context what is going on.

At the core of this is the units we choose to measure the lengths of the legs of a phylogenetic tree. To help explain, let’s go back to a figure from much earlier in the conversation:

https://discourse-cdn-sjc2.com/standard9/uploads/peacefulscience/original/1X/7a137bd8ef95f0a198251ddb8480d0ad6f8ca0d9.jpg

In this figure, the dots are mutations assigned to legs in the tree, the scale bar is in units of time (years in this case), and the leaves of the tree are observed DNA sequences obtained from actual humans. I’ve seen several units of tree length pop up in this conversation and in the literature…

  1. Number of mutations (the dots in the figure)
  2. Years (scale bar in figure)
  3. Generations (argweaver)
  4. Coalescence units (number of mutations / sequence length, or D in my formula)

A critical point is that the mutations are observed in the data, and the number along each leg is used to estimate the time. All of these are just unit conversions, provided we specify the mutation rate, the length of the sequence, and (sometimes) the generation time. So all these units are essentially interconvertible if we know the mutation rate. If we just express lengths as coalescence units or number of mutations, they do not even require specifying a mutation rate; they are a fundamental property of the data itself.

That said, as we have discussed, we do have reasonable estimates of mutation rates. For example, ArgWeaver uses a generation time of 25 years / generation and a mutation rate of 1.25e-8 / bp / generation. This is equivalent to using a mutation rate of 0.5e-9 / bp / year.

Maximum Likelihood Estimation (MLE) of Lengths

One of the easiest ways to estimate a leg length is with maximum likelihood. Let’s imagine we observe 10 mutations in a 10,000 bp (1e4) block. For illustration, we can convert this to all the units we’ve mentioned, using the ArgWeaver defaults (a code sketch follows the list).

  1. 1e-3 coalescent units (or 10 mutations / 1e4 bp).
  2. 2,000,000 years (1e-3 coalescent units / 0.5e-9 mutation rate per year)
  3. 80,000 generations (1e-3 coalescent units / 1.25e-8 mutation rate per generation)
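Here is that arithmetic as a small code sketch, in plain Python, using the ArgWeaver defaults quoted above (the constant names are just mine for clarity):

```python
# Unit conversions for a leg length, using the ArgWeaver defaults
# (25 years/generation, 1.25e-8 mutations/bp/generation).
MU_PER_GEN = 1.25e-8                  # mutations / bp / generation
GEN_TIME = 25.0                       # years / generation
MU_PER_YEAR = MU_PER_GEN / GEN_TIME   # = 0.5e-9 mutations / bp / year

mutations, span_bp = 10, 10_000

coalescent_units = mutations / span_bp         # 1e-3 (mutations per bp)
generations = coalescent_units / MU_PER_GEN    # 80,000 generations
years = coalescent_units / MU_PER_YEAR         # 2,000,000 years

print(coalescent_units, generations, years)
```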

In actual trees, it is a little more complex, because some branch points have multiple legs. In these cases, if we are building an ultrametric tree (where the distance from the root to each leaf is the same), we average the lengths computed from the data in each leg. In this application, the ultrametric constraint makes a lot of sense (because we all agree these alleles are related), and it gives us a way to pool data together to get a higher-confidence estimate that is not sensitive to population-specific variation in mutation rates.
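As a toy sketch of the pooling idea (the leg counts below are made-up numbers, and the tree structure is simplified away entirely), pooling mutation counts and spans across legs that meet at the same node gives one shared, more stable height estimate:

```python
# Toy illustration of pooling data across legs that share a node in an
# ultrametric tree. The numbers are made up; the point is only that
# pooling counts gives a single, more stable height estimate instead
# of several noisy per-leg estimates.
legs = [
    (4, 3_000),   # (mutations observed, bp spanned) on leg 1
    (6, 5_000),   # leg 2
    (5, 4_000),   # leg 3
]

pooled_mutations = sum(m for m, _ in legs)
pooled_bp = sum(bp for _, bp in legs)
height = pooled_mutations / pooled_bp   # shared height, coalescent units
print(f"pooled height: {height:.6f} coalescent units")
```

Pooling the raw counts is just a length-weighted average of the per-leg estimates, which is why it is more stable than any single leg on its own.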

Nonetheless, these units are so trivially interchangeable that they are not consistently used. While coalescence units are the most germane to the data, they are also the most arcane. So it is very common for programs to use different units to display results more understandably. ArgWeaver and msprime, for example, use “generations.”

Maximum A Posteriori (MAP) Length

So MLE is great when we have lots of data, but it is very unstable when there are only small amounts of data.

  1. For example, what if the number of bp we are looking at is really small, let’s say exactly zero? In this case, what is the estimated length? 0 mutations / 0 bp is mathematically undefined. This creates real problems when taking recombination into account, because some trees can end up spanning 0 bp in high-recombination areas.

  2. How about if the number of bp is just 100, and the observed number of mutations is zero? What is the estimated length then? From the data alone we would say zero, but that’s not true. We know it is short, but it’s not zero.

So how do we deal with these problems? One way is to add a weak prior to the length computation. There is a whole lot of math involved in doing this formally (using a beta prior), but I’ll show you a mathematically equivalent solution that uses something called pseudocounts.

With pseudocounts, we preload the estimate with some fake data: pseudo data. If the mutation rate is 0.5e-9 / bp / year and we think this leg should be about 10,000 years long, we can use this to make our fake data. In this case, we will say the fake data is a 100 bp stretch in which we observed 0.0005 mutations (100 bp * 10,000 years * 0.5e-9). Because this is fake data, we can make fractional observations like this. We choose just 100 bp so that the pseudo data is easily overwhelmed by the actual data.

Now, we estimate the length by looking at the data + pseudo data, instead of the data alone. If, for example, we have no data at all, we end up with a length of 10,000 years instead of the nasty undefined 0/0 we get with the MLE. Likewise, consider a real tree over a 2,000 bp region where 3 mutations are observed (a code sketch follows the list below):

  1. We can make an MLE estimate of the length in coalescent units, at 0.0015 (or 3 / 2000), which is equivalent to 3 million years.
  2. We can also make a MAP estimate of its length (using our pseudocounts), at 0.001429 (or 3.0005 / 2100), which is equivalent to about 2.86 million years.
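Here is the same example in code, with the pseudocounts implemented exactly as constructed above. Note how the same function that gives a sensible answer for the 2,000 bp region also gracefully handles the 0 bp corner case:

```python
# MLE vs. pseudocount-MAP estimates of a leg length. The pseudo data
# (100 bp, 0.0005 mutations) encodes a weak prior centered on a
# 10,000-year leg, exactly as constructed in the text.
MU_PER_YEAR = 0.5e-9                            # mutations / bp / year
PSEUDO_BP = 100.0                               # small pseudo span
PSEUDO_MUT = PSEUDO_BP * 10_000 * MU_PER_YEAR   # = 0.0005 mutations

def mle_length(mutations, bp):
    return mutations / bp                       # undefined when bp == 0!

def map_length(mutations, bp):
    return (mutations + PSEUDO_MUT) / (bp + PSEUDO_BP)

mut, bp = 3, 2_000
print(mle_length(mut, bp) / MU_PER_YEAR)   # ~3,000,000 years
print(map_length(mut, bp) / MU_PER_YEAR)   # ~2,860,000 years
print(map_length(0, 0) / MU_PER_YEAR)      # prior wins: 10,000 years
```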

There are a few observations to make about this example.

  1. These numbers can be converted into other units as discussed above.

  2. The MLE estimate and MAP estimate are pretty close. The more data there is, the closer they will be.

  3. Even though our prior was 10,000 years, it’s totally overwhelmed by the data in this case, to give an estimate of millions of years.

  4. Only a few mutations are enough to increase the estimate of the length, which is why individual estimates have very high error (they will scatter both above and below the true value). We really need to see estimates from across the whole genome. Nonetheless, this example is not quite typical (it is just for illustration): it has 3 mutations in a tiny stretch of 2,000 bp, which is a really high number of mutations.

  5. In the end, we want to choose a prior that will have little impact on the final results, but will help us in some of the corner cases where things blow up in the MLE estimate. That is why we use a weak prior (low pseudocounts).

This is just an illustration, designed to be easy to understand without requiring statistical training. It is not precisely how ArgWeaver works, for example, but it is a very close theoretical analogy.

ArgWeaver Works Like MAP

ArgWeaver works very much like a MAP estimate, and our median TMR4A estimate is essentially a MAP estimate of TMR4A. So how does what ArgWeaver does differ from the MAP procedure described above?

  1. ArgWeaver is not making a single MLE or MAP estimate (as described above). Instead, it is sampling ARGs based on their fit to the data (the likelihood) and the prior. This is called Markov chain Monte Carlo (MCMC), and it is closely related to a MAP estimate when a prior is used in the sampling (as it is here). A small sketch of this kind of sampling follows this list.

  2. The ArgWeaver prior is not implemented using pseudocounts; instead, they use an explicit prior distribution. Using a prior distribution (rather than pseudocounts) is the preferred way of doing this, as it is less ad hoc, more flexible, has clear theoretical justification, and clarifies upfront the starting point of the algorithm.

  3. The ArgWeaver prior does not use a fixed time (we used 10,000 years above), but a range of times. This is where the Ne comes in. They use the distribution of times expected from a fixed population of 11,534. I have no idea why they chose such a specific number.

  4. The ArgWeaver prior is on the time of coalescence, not on the length of a single leg in the tree. This is a subtle distinction, because the TMR4A is actually the sum of several legs in the tree. The prior ArgWeaver uses says that we expect (before looking at the data) that TMR4A time (a sum of leg lengths in the tree) to be at about 100 kya. As implemented, it is a weak prior, and it is overwhelmed by the data. Ultimately, the tree lengths computed by ArgWeaver are not strongly influenced by the prior.

  5. Though I have explained this as actions on trees, ArgWeaver is applying this to branch lengths on ARGs (ancestral recombination graphs). This is important because ARGs end up using more information (longer stretches of sequence) to estimate each length than naively estimating phylogenetic branch lengths independently for each tree. The trees we have been using are an alternative representation of an ARG that is less efficient, but easier to use for many purposes (like estimating TMR4A).
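For those who want to see the MCMC idea in miniature, here is a tiny Metropolis sampler for a single branch length. To be clear, this is an illustrative stand-in, not ArgWeaver’s actual machinery: I assume a Poisson likelihood for the mutation count and reuse the weak pseudocount (Gamma) prior from the example above.

```python
# A tiny Metropolis sampler, illustrating prior + likelihood sampling
# (ArgWeaver samples whole ARGs; here we sample just one branch length
# as a simplified stand-in). Model assumptions are mine, not ArgWeaver's:
# Poisson likelihood for the mutation count, weak Gamma prior on length
# matching the 0.0005 pseudo-mutations / 100 pseudo-bp example above.
import math, random

random.seed(0)
mutations, bp = 3, 2_000
A0, B0 = 1.0005, 100.0   # Gamma prior equivalent to the pseudocounts

def log_posterior(theta):  # theta = branch length in coalescent units
    if theta <= 0:
        return -math.inf
    log_lik = mutations * math.log(theta * bp) - theta * bp  # Poisson (up to a constant)
    log_prior = (A0 - 1) * math.log(theta) - B0 * theta      # weak Gamma prior
    return log_lik + log_prior

theta, samples = 1e-3, []
for _ in range(50_000):
    prop = theta * math.exp(random.gauss(0, 0.3))  # log-scale random walk
    accept = (log_posterior(prop) - log_posterior(theta)
              + math.log(prop) - math.log(theta))  # Hastings correction
    if math.log(random.random()) < accept:
        theta = prop
    samples.append(theta)           # (ignoring burn-in, for simplicity)

samples.sort()
median = samples[len(samples) // 2]
print(f"posterior median: {median / 0.5e-9:,.0f} years")
```

Run it and the posterior median lands around 3.5 million years, in the same neighborhood as the MLE and MAP estimates above, because the data overwhelms the weak prior.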

In the end, to ease interpretation, ArgWeaver reports results in “generations,” but it is converting using the equations I have already given. So we can easily convert back and forth between any of these units. Most importantly, at its core, we are just using the fundamental formula:

D = R * T

Mutational distance is the product of mutational rate and time. That is all that is going on here, and that is what enables the conversions. The fact that ArgWeaver makes the surprising decision to use Ne to parameterize its weak prior is just a non-issue. As I have explained, the prior it uses for TMR4A is lower than the TMR4A the data shows, so it is, if anything, pulling the estimate down. Getting rid of it would only increase the estimate (by a small amount). MAP estimates, moreover, are considered vastly superior to MLE estimates, so it makes no sense to doubt this result for using a better statistical technique.

A Prior Is Not an Assumption

It should be clear now why it is just incorrect (despite that footnote in the paper) to call a prior an assumption. It is also incorrect to say that ArgWeaver is “simulating” a large population. All it is doing is using a weak prior on the tree lengths, and that is a good thing to do that makes the results more stable.

As an aside, the language of prior and posterior is chosen intentionally. The terms are defined in relation to taking the data into account. In Bayesian analysis, the prior is updated by the data into the posterior. Then, the posterior becomes the new prior. We can then look at new data, to update it again. So priors, by definition, are not assumptions. They are starting beliefs that are updated and improved as we look at more data. It is just an error to call them assumptions.
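For those who want to see this update cycle in its simplest textbook form, here is a sketch using the standard Beta-Binomial model (a generic illustration of Bayesian updating, not anything specific to ArgWeaver):

```python
# The prior -> posterior -> new prior cycle, in the simplest conjugate
# model: a Beta prior on a rate, updated by observed counts.
a, b = 1.0, 1.0                   # weak starting prior (uniform)

# A first batch of data updates the prior into a posterior...
successes, failures = 3, 7
a, b = a + successes, b + failures

# ...and that posterior becomes the prior for the next batch of data.
successes, failures = 30, 70
a, b = a + successes, b + failures

print(f"posterior mean: {a / (a + b):.3f}")   # belief after all the data
```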


Okay, I know that is a lot, but I figure that some people will find this useful. This is a good illustrative case of the fundamentals of Bayesian analysis. While a rigorous treatment requires a lot of math, this should be enough for most observers to follow what is going on here.
