So this new data is really helpful. Got some results to show.
Also, I was able to clarify that they are using a generation time of 25 years / generation. That becomes helpful in converting to years.
Okay, some good news: they include enough to reconstruct the distribution in Figure S17. http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004342#s5 This data includes a random sample of 69 neutral regions (dashed line), compared with 69 regions undergoing balancing selection and containing no CpGs (black). The red line is the 56 regions undergoing balancing selection but with shared CpGs. Though it is not the entire genome, the dashed line is going to be a good estimate of the neutral genome-wide distribution.
Distribution of TMRCAs in regions predicted to be under balancing selection. Cumulative distribution functions (CDFs) are shown for the 125 regions identified by Leffler et al.  based on segregating haplotypes shared between humans and chimpanzees (black circles), the subset of 69 loci containing no shared polymorphisms in CpG dinucleotides (black circles) and a collection of 69 putatively neutral regions having the same length distribution. Neutral regions consisted of noncoding regions from which known genes, binding sites, and conserved elements had been removed (see ). Notice the pronounced shift toward larger TMRCAs in the regions predicted to be under balancing selection, and a slightly more pronounced shift for the subset not containing CpGs (which are more likely to have undergone parallel mutations on both lineages). TMRCAs are measured in generations, as in all other figures and tables.
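For the statistically untrained, this is going to be a hard graph to read. It is a CDF, not a PDF (https://en.wikipedia.org/wiki/Cumulative_distribution_function). On a CDF, the height of the curve at a given TMRCA is the fraction of regions with a TMRCA at or below that value.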
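For anyone who wants to see how a value is read off a CDF, here is a minimal Python sketch. The TMRCA values are made-up placeholders, not the numbers behind Figure S17, and the function names are my own. The median is simply the x-value where the cumulative fraction first reaches 0.5.

```python
def empirical_cdf(values):
    """Return (sorted values, cumulative fractions) for an empirical CDF."""
    xs = sorted(values)
    n = len(xs)
    fractions = [(i + 1) / n for i in range(n)]
    return xs, fractions


def read_quantile(values, q):
    """Smallest x whose cumulative fraction reaches q (0 < q <= 1)."""
    xs, fractions = empirical_cdf(values)
    for x, f in zip(xs, fractions):
        if f >= q:
            return x
    return xs[-1]


# Hypothetical TMRCAs in generations (placeholders with a long right tail)
toy_tmrca = [20_000, 30_000, 35_000, 40_000, 60_000, 90_000, 180_000]

toy_median = read_quantile(toy_tmrca, 0.5)
print(toy_median)  # 40000
```

The long right tail in the toy list mimics the positive skew described for the real data: the largest values stretch the top of the curve without moving the point where it crosses 0.5.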
This distribution changes things quite a bit. We do not see multiple modes. We also see that there is a very strong positive skew to the data, and that balancing selection (black and red lines) increases TMRCA estimates quite a bit (no surprise), by as much as 2-fold (the magnitude of that effect is a surprise to me, but in retrospect is not so surprising). This means that estimates of TMRCA that do not take balancing selection into account are going to overestimate the value substantially. CpG sites have a higher mutation rate, but these mutations are more likely to be shared, so TMRCA estimates in these regions can be about 10% lower.
Several factors can conspire to increase or reduce TMRCA. Molecular clocks only work when these factors are not interfering. That is why whole-genome distributions are so important: we can test the effect of different regions. For example, if we wanted, we could start to untangle how identifiable Neanderthal interbreeding biases results upwards, by looking at the results for those regions separately. We can also see the effect of positive selection (which violates the assumptions required for dating). Some regions of the genome also have lower mutation rates (and therefore will overestimate TMRCA).
From this, we want the best estimate of TMRCA in neutral regions of the genome (the dashed line) in a way that reduces these sources of error. This is a fairly important point, as dates can only be reliably inferred in places that are not under selection. These are the only places where a molecular clock is expected to hold. Even then, some regions will still get “lucky” and coalesce much more quickly or much more slowly. So to a first approximation, we want the median of these values. The median has another helpful feature. It should exclude the effect of regions we know for a fact include evidence of interbreeding with Neanderthals and Denisovans in the last 100,000 years or so. Unfortunately, it cannot exclude regions affected by more ancient interbreeding (which could be the entire genome).
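To illustrate why the median is attractive here, a small Python sketch with invented numbers (none of these values come from the paper, and the variable names are mine). Adding one region with a very deep tree barely moves the median, while the mean shifts a lot.

```python
from statistics import mean, median

# Toy TMRCAs in generations (illustrative placeholders, not values from the paper)
neutral_like = [30_000, 40_000, 45_000, 50_000, 55_000, 60_000, 80_000]

# Same regions plus one hypothetical region with a much deeper tree,
# standing in for a locus affected by introgression or selection
with_outlier = neutral_like + [400_000]


def shift(stat, before, after):
    """How far a summary statistic moves when the extra region is added."""
    return stat(after) - stat(before)


median_shift = shift(median, neutral_like, with_outlier)
mean_shift = shift(mean, neutral_like, with_outlier)

print(median_shift)       # 2500.0
print(round(mean_shift))  # 43571
```

This only shows robustness to a minority of unusual regions. If most of the genome were affected, as with very ancient interbreeding, the median would move with it.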
Nonetheless, we can make our estimate. In the regions not under selection, we see a mode for the TMRCA at about 50,000 generations. You can see it yourself by tracing the blue line in the graph:
Multiplying by 25 years / generation, and dividing by four, this gives us a TMR4A of about 300 kya. By the way, I’ve grown more confident in the TMR4A estimate, based on a brush-up on the mathematics of phylogenies and coalescents.
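The arithmetic can be written out as a short Python sketch. The first function just does the unit conversion with the numbers from this post. The second is my own illustration of one place a factor of about four can come from, assuming the standard neutral coalescent (where the expected waiting time while k lineages remain is proportional to 1 / (k choose 2)); the function names are hypothetical and this is not necessarily how the paper's authors reason about it.

```python
from fractions import Fraction

YEARS_PER_GENERATION = 25    # generation time stated in the post
TMRCA_GENERATIONS = 50_000   # value read off the neutral curve in the post


def tmr4a_years(tmrca_generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert a TMRCA in generations to a TMR4A in years (divide-by-four shortcut)."""
    return tmrca_generations * years_per_generation / 4


def coalescent_wait(k):
    """Expected time with k lineages, in units of 2N generations (neutral coalescent)."""
    return Fraction(2, k * (k - 1))


def expected_tmr4a_over_tmrca(n):
    """Ratio of expected TMR4A to expected TMRCA for n sampled lineages (n >= 4)."""
    to_four = sum(coalescent_wait(k) for k in range(5, n + 1))  # n lineages down to 4
    to_one = sum(coalescent_wait(k) for k in range(2, n + 1))   # n lineages down to 1
    return to_four / to_one


print(tmr4a_years(TMRCA_GENERATIONS))          # 312500.0
print(float(expected_tmr4a_over_tmrca(100)))   # about 0.24
```

With more sampled lineages the ratio approaches 1/4, which is consistent with the shortcut used above for neutral regions.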
This is a TMR4A about 1.5 to 2 times greater than y-MRCA and m-MRCA (which are about 150 kya to 200 kya); how do we make sense of this? Remember, Y and mito DNA do not recombine. So each is essentially a lot of data about a single “block” of the genome. To some extent, each is a single sample from this distribution. When we look at a whole genome, however, we are looking at about 100,000 “blocks”. We have less information per block, but there are just so many more of them. With that in mind, the y-MRCA and m-MRCA are entirely consistent with this distribution, but the mode of this distribution is a better estimate.
I should add that TMR4A estimates in autosomal regions outside this range are suspicious. We need an explanation of why we should trust values inconsistent with the y-MRCA and m-MRCA. I’ve given one here for this data, but on that basis I’d be skeptical of estimates that put TMR4A outside this range.
So this does plausibly make the case stronger, pushing the TMR4A back from 150 to 200 kya (the y-MRCA to m-MRCA range) to a non-cherry-picked number of about 300 kya (probably plus or minus 20 thousand years). This number is corrected for bias due to positive selection and recent interbreeding, and is consistent with the y-MRCA and m-MRCA data.
That does support Dennis’s claim if Homo sapiens arose 200 kya and there was no interbreeding. However, there was interbreeding, and we found out this year that Homo sapiens might have arisen earlier than 300 kya. To be clear, the correspondence at 300 kya between the mode TMR4A and the origin of Homo sapiens cannot be interpreted as evidence for a bottleneck at this specific point in time. Rather, this view of the data does not specifically dispute the bottleneck hypothesis. To the point, I think we have much more confidence in heliocentrism than in this data’s ability to demonstrate Dennis’s claim.
Though this has been immensely interesting and informative.
There is, of course, other data. I still point to trans-species variation. Speaking of which…
The only way there can be trans-species variation between us and chimps is for there to be balancing selection (or perhaps inconceivably recent interbreeding). Coalescence times for neutral regions are more than one order of magnitude smaller than divergence (as we have just seen). We just do not expect any trans-species variation in neutral regions of the genome. Trans-species variation in a region is very strong evidence that TMRCAs in that region are not valid.
Regions under balancing selection need to be handled with separate reasoning. For example, the TMRCA / 4 = TMR4A estimate certainly does not apply here. TMR4A can be arbitrarily smaller than TMRCA if the trans-species variation consists of 4 or fewer (coalesced) alleles. The paper being referenced here appears to show only five regions with only 1 to 3 ancestral alleles in each region. That would push the TMRCA back, but not the TMR4A.
To be sure, this is evidence for common descent, but unless we see more than four alleles in a single locus (in an autosomal region), as it appears we do in MHC antigens (the Ayala data), it does not make the case against a bottleneck. The alternative explanation for Ayala’s MHC data is convergent evolution (which is where I’m sure @RichardBuggs will go). I’m much less convinced by this, and it is also testable by comparing with orangutan and gorilla data (though admittedly not yet tested).