Hello All,
Going to try to work through some of this in the coming days.
I do not think they assumed constant population size, but I do agree they used the word “assume” imprecisely. What they did was compute an estimate of the trees using a weak prior that, by design, is overwhelmed by the data. This is a standard approach in statistical modeling and is not correctly called an assumption.
This is important because there is no modeling of the population taking place in ArgWeaver; it’s just computing trees. Contrast this with, for example, the ABC method. In the ABC method (e.g. Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach), populations are explicitly modeled, and assuming Ne > 10,000 would make detection of lower Ne impossible.
As I explain here: Heliocentric Certainty Against a Bottleneck of Two? - #10 by swamidass - Peaceful Science
As a prior, this is not an assumption, but a starting belief that is meant to be overridden by the data. The only way the ArgWeaver program uses the population size is in computing this prior. Population size is neither simulated nor modeled in the program, except for placing this weak prior on it. Remember, priors are not assumptions or constraints. That is why the measured TMR4A is driven by the data, not by the prior.
The ArgWeaver output files tell us the strength of the prior versus the data, and it is only about 5%. That means the model output is dominated 95% by the data, not by the prior (as it is designed to be).
The prior distribution for TMR4A is centered at about 100 kya, but we measured the TMR4A at about 420 kya. That means the data are pulling the estimate upwards from the prior, not downwards.
This last point should end any confusion. To draw an analogy, it’s like measuring the weight of widgets with the weak starting belief that the average weight of these widgets is 200 lb. After weighing several of them, and taking the prior into account, we compute an average weight of 420 lb. The fact that we used a prior could be an argument that the real average is greater than 420 lb, but it is not a plausible argument that the true average is less than 420 lb. The prior, in our case, is biasing the results downwards, not upwards.
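The pull of a weak prior can be sketched with a simple weighted-average update. To be clear, this is an illustrative toy calculation, not ArgWeaver’s actual model; the 437 kya “data-only” figure is a hypothetical value chosen so that a 5% prior weight lands near the measured 420 kya:

```python
# Toy sketch of how a weak prior is overwhelmed by data.
# The numbers are hypothetical and only mirror the ~5% prior weight
# described above; this is NOT ArgWeaver's actual computation.

def posterior_mean(prior_mean, data_mean, prior_weight):
    """Posterior mean as a weighted average of the prior and the data."""
    return prior_weight * prior_mean + (1 - prior_weight) * data_mean

prior_mean = 100.0   # kya: the weak prior is centered near 100 kya
data_mean = 437.0    # kya: what the data alone might indicate (hypothetical)

estimate = posterior_mean(prior_mean, data_mean, prior_weight=0.05)
print(round(estimate))  # 420 kya: the prior drags the estimate DOWN, not up
```

Notice the direction of the pull: because the prior sits below the data, removing it entirely would only raise the estimate.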
The paper is imprecise in its use of the word “assume,” but the way it is actually used in the code, it is a weak prior, not an assumption.
That means the TMR4A (and all TMRCAs) is determined primarily by the formula D = T * R, where D is mutational distance, T is time, and R is the mutation rate. This formula is the key determinant of the TMR4A. The prior has only a tiny impact on it, pushing the estimated T lower (not higher) than what the data indicate.
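For intuition, solving that relation for T is a one-line computation. The distance and rate below are illustrative placeholders, not the values used in the actual analysis:

```python
# Minimal sketch of the D = T * R relation: mutational distance D equals
# time T multiplied by mutation rate R, so T = D / R.
# These particular numbers are hypothetical, chosen only for illustration.

def tmrca_years(mutational_distance, mutation_rate_per_year):
    """Estimate time T (in years) from distance D and rate R via T = D / R."""
    return mutational_distance / mutation_rate_per_year

# e.g. a per-base distance of 2.1e-4 with a rate of 5e-10 per base per year
t = tmrca_years(2.1e-4, 5e-10)
print(f"{t:.0f} years")  # 420000 years
```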
Of course, we could try to redo the analysis without a prior, or with a weaker prior. We would not expect much to change, except for the TMR4A estimate to increase.
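The expected effect of weakening the prior can be sketched numerically. Again, these are hypothetical numbers (a prior centered at 100 kya and a data-only estimate of 437 kya) used only to show the direction of the change:

```python
# Sketch: as the prior weight shrinks toward zero, the estimate moves
# toward what the data alone indicate. All numbers are hypothetical.

prior_mean = 100.0   # kya, center of the weak prior
data_mean = 437.0    # kya, hypothetical data-only estimate

for w in (0.05, 0.01, 0.0):
    est = w * prior_mean + (1 - w) * data_mean
    print(f"prior weight {w:.0%}: estimate ~{est:.0f} kya")
```

Each weakening of the prior raises the estimate, which is why removing the prior would not rescue a lower TMR4A.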
Remember, also, as you pointed out…
So we expect high Ne even if there was a bottleneck. This is a pretty important point. Even if the method assumed Ne is high, there is no reason to doubt the TMR4A we compute from the data, because Ne is largely decoupled from a single-generation bottleneck in the distant past.
And I appreciate you bringing the question forward. It has been fun to get to the bottom of this.
More to come when I can.