I thought this post required a more detailed follow up…
This is a great hypothesis, but it is not borne out by the data marshaled. Instead, the data confirm my point that the effect on TMRCA is only small. At the core of this is a misunderstanding of the statistics at play.
Case in point is S4:
Look at the last column, which shows the true ARG branch length. We see that the variance of the estimate increases, but there is no systematic error shifting estimates upward from the true value. This is a critical point. Remember, we are taking the median of the TMR4A, which depends only on where the distribution is centered, not on its spread. That is why we chose the median, not the mean. This graph is very strong evidence that recombination inference errors are NOT increasing the TMR4A, just as I hypothesized.
In fact, even when recombinations are under-identified (bottom row, middle column), ARG length is still measured at about the same value, but with higher variance (bottom row, right column). However, we chose an estimator that does not depend on the variance, so this has no impact on TMR4A.
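This median-versus-mean point can be illustrated with a quick simulation (a sketch with made-up numbers, not the actual ARG output): as the noise on an unbiased estimator grows, the spread of the estimates widens, but the median stays pinned to the true value.

```python
import random
import statistics

random.seed(42)
TRUE_TMR4A = 500.0  # hypothetical true value, in kya

def estimates(noise_sd, n=100_000):
    # Unbiased estimator: the noise is symmetric around zero,
    # so errors upward and downward cancel in aggregate.
    return [TRUE_TMR4A + random.gauss(0, noise_sd) for _ in range(n)]

low_noise = estimates(noise_sd=50)    # accurate recombination inference
high_noise = estimates(noise_sd=150)  # inference errors: 3x the noise

# The spread triples, but both medians stay near the true value.
print(statistics.median(low_noise), statistics.median(high_noise))
```

This is exactly the pattern in the figure: more noise widens the cloud of estimates without moving its center, which is all the median responds to.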
We see the exact same pattern in S7. Even though you write…
The correlation between the true and estimated TMRCA drops a little, from 0.87 to 0.78, but if there is systematic error, it is very low. We cannot tell precisely from this graph, but we can from the prior one. The medians of the two distributions are going to be very close to each other. Taken with the prior figure, this is evidence that under-inference of recombination is not a major source of error. In fact, these data points show that recombination inference mistakes do not change the average/median TMRCA estimates. Also, TMRCA has much higher variance than TMR4A, so TMR4A will be even less susceptible to these types of errors.
In the end, these figures validate pretty clearly my alternate hypothesis, which I formed based on knowledge of how this algorithm works.
Remember, I am a computational biologist, and I build models of biological systems like this in my “day job,” so I am working from a substantial foundation of firsthand experience with how these algorithms work. This is not an appeal to authority (trust or distrust me as you like), but an explanation of why I had some confidence in this in the first place. What we see is what I guessed: there is increased variance in the estimate, but no clear evidence that the TMRCA is increased more than it is decreased.
This hypothesis seems false. We can see from figures S4 and S7 that about 50% of the trees are less parsimonious than the correct tree (higher TMRCA) and about 50% are more parsimonious (lower TMRCA). Remember, we do not expect the trees to be precisely correct. These are just estimates, and we hope (with good reason) that the errors in one direction are largely cancelled by the errors in the other when we aggregate lots of estimates.
Finally, we are aggregating a large number of estimates together to compute the TMR4A across the whole genome. This is important, because by aggregating across 12.5 million trees, we reduce the error. While our estimate in any specific part of the genome might have high error, that error cancels out when we aggregate across the 12.5 million trees. This is a critical point.
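The error cancellation from aggregation can be sketched numerically too (again with made-up numbers; the real analysis uses 12.5 million trees): the genome-wide median is far more accurate than any single per-tree estimate, and its error shrinks as more trees are aggregated.

```python
import random
import statistics

random.seed(0)
TRUE_TMR4A = 500.0   # hypothetical true value, in kya
PER_TREE_SD = 200.0  # any single tree's estimate is very noisy

def genome_wide_median(n_trees):
    # Aggregate many noisy per-tree estimates into one median.
    ests = [TRUE_TMR4A + random.gauss(0, PER_TREE_SD) for _ in range(n_trees)]
    return statistics.median(ests)

# Typical error of the aggregate, measured over repeated trials.
# The error shrinks roughly as 1 / sqrt(n_trees).
err_100 = statistics.mean(abs(genome_wide_median(100) - TRUE_TMR4A) for _ in range(200))
err_10k = statistics.mean(abs(genome_wide_median(10_000) - TRUE_TMR4A) for _ in range(200))
print(err_100, err_10k)
```

With 12.5 million trees rather than 10,000, the cancellation is stronger still.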
The statistics here substantially increase our confidence in these numbers.
Just about any source of error we can identify will push some of the TMRCA estimates up and some of them down. However, because we are looking at the median of all these estimates, the resulting increase in variance does not affect the accuracy much. A great example of this is mutation rates.
Yes, there is variation in mutation rate. We can measure it in different populations, and we can even detect some differences in the past. In humans, however, these variations are all relatively small. Nor are these variations always toward higher mutation rates; they go lower too. So yes, it is likely that mutation rates were slightly higher in particular populations or at points in the past (let’s say within 2-fold per year), but it is also likely they were slightly lower at times. For the most part, this just averages out over long periods when looking at the whole human population. That is not 100% true, but this averaging is why variation in mutation rate is not going to widen our confidence interval on TMR4A by much.
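That averaging claim can be made concrete with a toy calculation (the rates and epoch lengths here are hypothetical, chosen only to illustrate): even if the per-year mutation rate wanders well above and below its long-run average, dating with the average rate recovers the true elapsed time closely.

```python
import random

random.seed(1)
AVG_RATE = 1.25e-8   # assumed long-run average rate (per site per year)
TRUE_YEARS = 500_000

# The rate fluctuates symmetrically around the average in
# 1,000-year epochs, by up to 50% in either direction.
mutational_length = 0.0
for _ in range(TRUE_YEARS // 1000):
    epoch_rate = AVG_RATE * random.uniform(0.5, 1.5)
    mutational_length += epoch_rate * 1000

# Date using only the long-run average rate.
inferred_years = mutational_length / AVG_RATE
print(round(inferred_years))  # close to 500,000: fluctuations cancel
```

A persistent, one-directional shift in rate would bias the date, but symmetric wandering of the kind described above largely cancels out.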
Let’s remember why we are here:
I think it’s fair to say at this point that I did rigorously test the bottleneck hypothesis. Right?
Perhaps there will be improved follow-on analyses that refine my estimates, and I encourage that. However, TMR4A is a feature of the data. It is the length (in units of time, computed as mutational length / mutation rate) of the most parsimonious trees of genome-wide human variation. This is not an artifact of a population genetics modeling effort. Rather, it is a way of computing the time required to produce the amount of variation we see in human genetics.
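To make that definition concrete, here is the arithmetic with hypothetical numbers (these are not my actual estimates):

```python
# Time = mutational length / mutation rate (illustrative values only).
mutational_length = 0.005  # mutations per site (assumed)
mutation_rate = 1e-8       # mutations per site per year (assumed)

tmr4a_years = mutational_length / mutation_rate
print(round(tmr4a_years))  # 500000
```

The only inputs are the mutational lengths read off the trees and the mutation rate; no demographic model enters the computation.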
Also, this analysis is very generous to the bottleneck hypothesis. Though I’m not certain yet (and plan to do the studies to find out), bottlenecks going back as far as 800 kya might be inconsistent with the data we see. There are some large unresolved questions about how a bottleneck affects coalescence-rate signatures before the median TMR4A, and whether they are detectable.
I hope there can be some agreement on these points, as a conclusion to this portion of the conversation would be valuable. It would be great to move on to more interesting data.